
Mapping Collocation Frequency to Source Document
Source: R/collocation_frequency.R
This function provides the frequency of collocations in comments that correspond to the provided source document.
Usage
collocation_frequency(
tbl,
source_row,
text_column,
collocate_length = 5,
fuzzy = FALSE,
n_bands = 50,
threshold = 0.7,
n_gram_width = 4
)
Arguments
- tbl
data frame containing documents, where each row represents a document
- source_row
row containing text to be treated as source
- text_column
string indicating the name of the column containing derivative text
- collocate_length
the length of the collocation. Default is 5
- fuzzy
whether or not to use fuzzy matching in collocation calculations. Default is FALSE
- n_bands
number of bands used in the MinHash algorithm, passed to
zoomerjoin::jaccard_right_join(). Default is 50
- threshold
Jaccard distance threshold for a pair to be considered a match, passed to
zoomerjoin::jaccard_right_join(). Default is 0.7
- n_gram_width
width of n-grams used in the Jaccard distance calculation, passed to
zoomerjoin::jaccard_right_join(). Default is 4
Details
Collocations are sequences of consecutive words present in the source document. For example, the phrase "the blue bird flies" contains one collocation of length 4 ("the blue bird flies"), two collocations of length 3 ("the blue bird" and "blue bird flies"), and three collocations of length 2 ("the blue", "blue bird", and "bird flies").
This function counts how often each source collocation appears in the derivative documents (the "notes"). When fuzzy matching is enabled, indirect matches are also counted, each with a weight of (n*d)/m, where n is the frequency of the fuzzy collocation, d is the Jaccard similarity between the source (transcript) collocation and the note collocation, and m is the number of closest matches for the note collocation.
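The sliding-window idea and the fuzzy weight can be illustrated with a small base R sketch. The helpers `collocations_of_length()` and `fuzzy_weight()` are hypothetical names introduced here for illustration only; they are not part of the package, and the function's actual tokenization may differ (this sketch simply splits on whitespace). The values passed to `fuzzy_weight()` are made up.

```r
# Hypothetical helper (not the package's implementation): list every run of
# `n` consecutive words in a phrase, splitting on whitespace.
collocations_of_length <- function(phrase, n) {
  words <- strsplit(phrase, "\\s+")[[1]]
  if (n > length(words)) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

phrase <- "the blue bird flies"
len4 <- collocations_of_length(phrase, 4)  # one collocation
len3 <- collocations_of_length(phrase, 3)  # two collocations
len2 <- collocations_of_length(phrase, 2)  # three collocations

# Hypothetical helper: the fuzzy-match weight (n * d) / m from the text.
# n = frequency of the fuzzy collocation
# d = Jaccard similarity between the source and note collocation
# m = number of closest matches for the note collocation
fuzzy_weight <- function(n, d, m) (n * d) / m

# Illustrative values only: a note collocation seen 4 times, with
# similarity 0.75 to the source, tied across 2 closest matches.
w <- fuzzy_weight(n = 4, d = 0.75, m = 2)
```

Dividing by m spreads a fuzzy match across all source collocations it is equally close to, so a single note phrase is not counted in full more than once.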
Examples
src_row <- which(notepad_example$ID == "source")
merged_frequency <- collocation_frequency(notepad_example, src_row, "Text")