This package is designed to map a group of derivative texts to the corresponding parent text, based on the frequency with which phrases occur in the derivative texts. The parent text is highlighted corresponding to this frequency, in order to create a ‘heatmap’ of popular phrases found in the derivative texts.

This example is taken from the initial description of a crime used in a study of jury perception of algorithm use and demonstrative evidence. The notepad_example data frame contains an ‘ID’ number corresponding to a study participant, as well as their notes, labelled as ‘Text’. The first six observations are shown below.


# load the library
library(highlightr)
library(knitr)

# View first 6 observations
knitr::kable(head(notepad_example))
| ID | Text |
|---|---|
| 121 | Richard Cole - charged with discharging firearm in business. // felony . NOT GUILTY. |
| 197 | Richard Cole - Def: Willfully discarge firearm in biz - Felony. Pleaded NG |
| 168 | willfully discharging firearm in a business - felony. not guilty |
| 131 | discharged firearm in business, intentionally |
| 77 | In this case, the defendant - Richard Cole - has been charged with willfully discharging a firearm in a place of business. This crime is a felony. |
| 24 | defendant - Richard Cole discharging a firearm in a place of business. pleaded not guilty. |

Additionally, the source document (or study transcript) is included in notepad_example with an ID of ‘source’. The original transcript is shown here:


study_transcript <- notepad_example[notepad_example$ID == "source",]$Text

knitr::kable(study_transcript)
x
In this case, the defendant - Richard Cole - has been charged with willfully discharging a firearm in a place of business. This crime is a felony. Mr. Cole has pleaded not guilty to the charge. You will now read a summary of the case. This summary was prepared by an objective court clerk. It describes select evidence that was presented at trial.


Fuzzy collocation is used to match the tokenized derivative texts to the phrases in the tokenized source text. This function first determines the number of times each collocation of length 5 occurs in the derivative texts (here, the participant notes on the case). Fuzzy (or indirect) matches are then added to the frequency count of the source collocation that is the closest match. These fuzzy matches are weighted by their similarity to the source collocation:

$$\frac{n \cdot d}{m}$$

Here, $n$ is the frequency of the fuzzy collocation, $d$ is the Jaccard similarity between the fuzzy collocation and the source collocation (ranging from 0 to 1, where 1 indicates identical strings), and $m$ is the number of closest matches for the fuzzy collocation.
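The weighting above can be worked through by hand. The following is a minimal base R sketch, not the package's own code: the two phrases, the values of `n` and `m`, and the helper names `jaccard_similarity()` and `fuzzy_weight()` are all hypothetical, and the similarity is computed on sets of words purely for illustration (the package may tokenize differently when it computes the Jaccard similarity).

```r
# Jaccard similarity between two phrases, treating each phrase as a set of words.
# (Assumption: word-level sets; used here only to illustrate the formula.)
jaccard_similarity <- function(a, b) {
  set_a <- unique(strsplit(a, " ", fixed = TRUE)[[1]])
  set_b <- unique(strsplit(b, " ", fixed = TRUE)[[1]])
  length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

# Weighted contribution of a fuzzy collocation: n * d / m
fuzzy_weight <- function(n, d, m) {
  n * d / m
}

# Hypothetical source collocation and a near match from a participant's notes
source_phrase <- "discharging a firearm in a"
fuzzy_phrase  <- "discharging firearm in a business"

d <- jaccard_similarity(source_phrase, fuzzy_phrase)  # 4 shared words / 5 distinct words = 0.8

n <- 3  # hypothetical: the fuzzy collocation appears 3 times in the notes
m <- 2  # hypothetical: it is equally close to 2 source collocations

weighted_count <- fuzzy_weight(n, d, m)  # 3 * 0.8 / 2 = 1.2
```

Under these assumed values, each of the two closest source collocations would have 1.2 added to its frequency count, rather than the full 3.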

The collocation_frequency() function attaches the collocation counts to the full text of the transcript. The collocation frequencies are averaged per word.


# connect collocation frequencies to source document

merged_frequency <- collocation_frequency(
  notepad_example,
  source_row = which(notepad_example$ID == "source"),
  text_column = "Text",
  fuzzy = TRUE
)
#> Warning in join_func(a = a, b = b, by_a = by_a, by_b = by_b, block_by_a = block_by_a, : A pair of records at the threshold (0.7) have only a 95% chance of being compared.
#> Please consider changing `n_bands` and `band_width`.

knitr::kable(head(merged_frequency))
| words | word_num | x_coord | to_merge | word_number | col_1 | col_2 | col_3 | col_4 | col_5 | collocation | Freq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| In | 1 | 1 | in | 1 | 6.956522 | NA | NA | NA | NA | in this case the defendant | 6.956522 |
| this | 2 | 5 | this | 2 | 7.000000 | 6.956522 | NA | NA | NA | this case the defendant richard | 6.978261 |
| case, | 3 | 13 | case | 3 | 7.928571 | 7.000000 | 6.956522 | NA | NA | case the defendant richard cole | 7.295031 |
| the | 4 | 23 | the | 4 | 10.000000 | 7.928571 | 7.000000 | 6.956522 | NA | the defendant richard cole has | 7.971273 |
| defendant | 5 | 29 | defendant | 5 | 10.000000 | 10.000000 | 7.928571 | 7.000000 | 6.956522 | defendant richard cole has been | 8.377019 |
| - | 6 | 47 | | NA | NA | NA | NA | NA | NA | NA | NaN |

The output assigns the frequency of each collocation to each word that occurs in that collocation. For example, the first collocation in the description is “in this case the defendant”, which occurs with a frequency of 6.96. This is the only collocation in which the first word appears, so it is the only collocation value provided for the first word. The second word, “this”, also appears in the next collocation, “this case the defendant richard”, whose frequency is 7, and so on for all words in the description. Collocations are weighted by the number of times they appear in the transcript text.
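The per-word averaging can be checked against the table above. The sketch below uses base R only and copies the `col_1` to `col_3` values of the first three rows, plus an all-`NA` row like the one for the "-" token; it reproduces the `Freq` column shown, but whether the package computes it with `rowMeans()` is an assumption.

```r
# Collocation frequencies attached to each word (values copied from the table above).
# col_4 and col_5 are NA for these rows, so they are left out here.
cols <- data.frame(
  col_1 = c(6.956522, 7.000000, 7.928571, NA),
  col_2 = c(NA,       6.956522, 7.000000, NA),
  col_3 = c(NA,       NA,       6.956522, NA)
)

# Average the available collocation frequencies for each word, ignoring NA.
# A row with no values at all yields NaN, as for the "-" token in the table.
freq <- unname(rowMeans(as.matrix(cols), na.rm = TRUE))

round(freq, 6)
```

The first word keeps its single value, the second is the mean of two collocations, and the third is the mean of three, matching the `Freq` values 6.956522, 6.978261, and 7.295031.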

The combined document is then fed through ggplot to assign gradient colors based on frequency, and the minimum and maximum values are recorded.
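As a rough illustration of what assigning gradient colors by frequency involves, the sketch below maps the first five `Freq` values from the table to colors using base R's `grDevices::colorRamp()`. This is not the package's implementation: scaling to the observed minimum and maximum is an assumption, and the two colors are simply the pair used later in this article.

```r
# Averaged collocation frequencies for the first five words (from the table above)
freq <- c(6.956522, 6.978261, 7.295031, 7.971273, 8.377019)

# Record the minimum and maximum values, as the gradient bar needs both
freq_range <- range(freq)

# Rescale frequencies to [0, 1] (assumption: scaled to the observed range)
scaled <- (freq - freq_range[1]) / diff(freq_range)

# Interpolate between a low color and a high color
ramp <- grDevices::colorRamp(c("#15bf7e", "#fcc7ed"))
rgb_values <- ramp(scaled)  # one row per word; columns are red, green, blue (0-255)

# Convert each row to a hex color string
word_colors <- grDevices::rgb(
  rgb_values[, 1], rgb_values[, 2], rgb_values[, 3],
  maxColorValue = 255
)
```

The lowest frequency receives the first color, the highest receives the second, and the words in between receive interpolated shades.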


# create `ggplot` object of the transcript

freq_plot <- collocation_plot(merged_frequency)

# add html tags to source document

page_highlight <- highlighted_text(freq_plot)

After colors have been assigned, HTML output is created for the highlighted text based on frequency, along with a gradient bar indicating the high and low values. The left side of each word’s gradient shows the previous word’s averaged collocation frequency, while the right side shows the current word’s averaged collocation frequency. This HTML output can be rendered into highlighted text by specifying `r page_highlight` in an R Markdown document outside of a code chunk and knitting to HTML:

[Rendered output: the source transcript with each word highlighted by its averaged collocation frequency, shown with a gradient bar running from 0 to 45.]


Alternatively, the xml2 package can be used to save the output as an HTML file, as shown in the following code:


# load `xml2` library

library(xml2)

# save html output to desired location

xml2::write_html(xml2::read_html(page_highlight), "filename.html")

Fuzzy matching can also be left out, as in the call below, which omits `fuzzy = TRUE`. In this case, the highlighting pattern resembles the one obtained when fuzzy matches are included, but the maximum value reached is smaller. Note also that the colors used in highlighting can be changed with the `colors` argument of the collocation_plot() function.


# connect collocation frequencies to source document

merged_frequency_nonfuzzy <- collocation_frequency(
  notepad_example,
  source_row = which(notepad_example$ID == "source"),
  text_column = "Text"
)

# create a `ggplot` object of the transcript, and change colors of the gradient

freq_plot_nonfuzzy <- collocation_plot(merged_frequency_nonfuzzy, colors=c("#15bf7e", "#fcc7ed"))

# add html tags to source document

page_highlight_nonfuzzy <- highlighted_text(freq_plot_nonfuzzy)

[Rendered output: the source transcript highlighted without fuzzy matching, using the custom colors, shown with a gradient bar running from 0 to 41.]


Additionally, the length of the collocation can be changed. The default collocation length (shown above) is 5 words. Below, this collocation length has been changed to 2 words.

In these shorter collocations, we can see that the collocation containing the name “Richard Cole” is popular, with a frequency of 89.
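To see how the collocation length changes the phrases being counted, the following base R sketch splits a text into overlapping collocations of a given length. The helper `make_collocations()` is hypothetical, and its tokenization (lowercasing and stripping punctuation) is an assumption that only approximates the lowercased, punctuation-free collocations shown in the output earlier; the two notes are shortened from the participant notes above.

```r
# Split a text into overlapping word sequences (collocations) of length n.
# Hypothetical helper; the package's own tokenizer may behave differently.
make_collocations <- function(text, n) {
  cleaned <- tolower(gsub("[[:punct:]]", "", text))
  tokens <- strsplit(cleaned, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) {
    return(character(0))
  }
  vapply(
    seq_len(length(tokens) - n + 1),
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

source_text <- "In this case, the defendant - Richard Cole - has been charged"

# Ten words give six collocations of length 5 and nine of length 2
collocations_5 <- make_collocations(source_text, 5)
collocations_2 <- make_collocations(source_text, 2)

# Count exact matches of one short collocation in two example notes
notes <- c(
  "defendant - Richard Cole discharging a firearm",
  "Richard Cole - charged with discharging firearm"
)
note_collocations <- unlist(lapply(notes, make_collocations, n = 2))
richard_cole_count <- sum(note_collocations == "richard cole")
```

Shorter collocations are easier to match exactly: both notes contain “richard cole”, whereas neither contains any full five-word collocation from this excerpt.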


# connect collocation frequencies to source document

merged_frequency_2col <- collocation_frequency(
  notepad_example,
  source_row = which(notepad_example$ID == "source"),
  text_column = "Text",
  collocate_length = 2
)

# create a `ggplot` object of the transcript

freq_plot_2col <- collocation_plot(merged_frequency_2col)

# add html tags to source document

page_highlight_2col <- highlighted_text(freq_plot_2col)

[Rendered output: the source transcript highlighted using collocations of length 2, shown with a gradient bar running from 0 to 72.]