
In a second example, we compare previous versions of a Wikipedia article (in this case, the article on the highlighter) to see which parts are consistently included. The data include the first 150 versions of the article as well as the latest 150 versions, and were gathered as follows:


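library(rvest)   ## read_html(), html_nodes(), html_attr(), html_text()
library(dplyr)   ## %>%, mutate(), filter()
library(purrr)   ## map(), map_chr()
library(tibble)  ## tibble()
library(stringr) ## str_remove_all(), str_trim()

class_of_interest <- ".mw-content-ltr" ## ids are #id-name, classes are .class-name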

editurl <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&offset=&limit=150"
editclass_of_interest <- ".mw-changeslist-date"

url_list1 <- editurl %>%
  read_html() %>%
  html_nodes(editclass_of_interest) %>%
  map(., list()) %>%
  tibble(node = .) %>%
  mutate(link = map_chr(node, html_attr, "href") %>% paste0("https://en.wikipedia.org", .))

editurl2 <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&dir=prev&limit=150"

url_list2 <- editurl2 %>%
  read_html() %>%
  html_nodes(editclass_of_interest) %>%
  map(., list()) %>%
  tibble(node = .) %>%
  mutate(link = map_chr(node, html_attr, "href") %>% paste0("https://en.wikipedia.org", .))

url_list <- rbind(url_list1, url_list2)

wiki_pages <- data.frame(page_notes = rep(NA, nrow(url_list)))

for (i in seq_len(nrow(url_list))){

  wiki_list <-  url_list$link[i] %>%
    read_html() %>%
    html_node(class_of_interest) %>%
    html_children() %>%
    map(., list()) %>%
    tibble(node = .) %>%
    mutate(type = map_chr(node, html_name)) %>%
    filter(type == "p") %>%
    mutate(text = map_chr(node, html_text)) %>%
    mutate(cleantext = str_remove_all(text, "\\[.*?\\]") %>% str_trim()) %>%
    plyr::summarise(cleantext = paste(cleantext, collapse = "<br> "))

  wiki_pages$page_notes[i] <- wiki_list$cleantext[1]

}
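
The cleaning step inside the loop can be tried on its own. The following is a minimal sketch, assuming two made-up paragraphs (`paragraphs` is a hypothetical stand-in for the scraped `text` column): bracketed citation markers are removed with a non-greedy pattern, whitespace is trimmed, and the paragraphs are collapsed into a single string.

```r
library(stringr)

# Hypothetical paragraph text, standing in for the scraped <p> nodes
paragraphs <- c("A highlighter is a pen.[1] ",
                "It is yellow.[citation needed][2]")

# Remove anything in square brackets (non-greedy), then trim whitespace
cleantext <- str_trim(str_remove_all(paragraphs, "\\[.*?\\]"))

# Collapse the paragraphs into one string per article version
page_note <- paste(cleantext, collapse = "<br> ")
page_note
```

The non-greedy `.*?` matters here: a greedy `.*` would remove everything between the first `[` and the last `]` in a paragraph.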

Note that the Wikipedia version text is placed in a column labelled “page_notes”, as needed for the comment functions in this package. This allows for the comments to be tokenized.

library(highlightr)

toks_comment <- token_comments(highlightr::wiki_pages)

The latest version of the article is the first row in the dataset, and can be used as the “transcript text”, or the base text to which the highlighting is applied. In this case, the column must be named “text”.


transcript_example_rename <- data.frame(text=wiki_pages[1,])

toks_transcript <- token_transcript(transcript_example_rename)
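
The renaming step relies on how R subsets a one-column data frame. A minimal sketch with a made-up data frame (`wiki_pages_demo` and its contents are hypothetical) shows that selecting a row returns a plain character vector, which can then be wrapped in a new data frame with the column name “text”:

```r
# Hypothetical stand-in for the wiki_pages data
wiki_pages_demo <- data.frame(
  page_notes = c("newest version text", "older version text", "oldest version text"),
  stringsAsFactors = FALSE
)

# Subsetting a row of a one-column data frame drops to a character vector
first_row <- wiki_pages_demo[1, ]
last_row  <- wiki_pages_demo[nrow(wiki_pages_demo), ]

# Wrap the vector in a data frame whose column is named "text"
transcript_demo <- data.frame(text = first_row, stringsAsFactors = FALSE)
names(transcript_demo)
```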

The previous versions are then compared to the current version’s collocations using fuzzy matching, which provides a count of the number of times each collocation occurs.


collocation_object <- collocate_comments_fuzzy(toks_transcript, toks_comment)
#> Joining with `by = join_by(unlist.descript_ngrams.)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation.y)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(word_number)`

head(collocation_object)
#> # A tibble: 6 × 8
#>   word_number col_1 col_2 col_3 col_4 col_5 to_merge    collocation             
#>         <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       <chr>                   
#> 1           1 10.2  NA    NA    NA    NA    a           a highlighter also call…
#> 2           2  7.64 10.2  NA    NA    NA    highlighter highlighter also called…
#> 3           3  9.50  7.64 10.2  NA    NA    also        also called a fluoresce…
#> 4           4  7.65  9.50  7.64 10.2  NA    called      called a fluorescent pe…
#> 5           5  8.40  7.65  9.50  7.64 10.2  a           a fluorescent pen is a  
#> 6           6  7.74  8.40  7.65  9.50  7.64 fluorescent fluorescent pen is a ty…
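
To illustrate the idea of counting collocations, here is a simplified sketch in base R. It is not the package's implementation: it assumes five-word collocations (suggested by the `col_1` to `col_5` columns above), uses exact rather than fuzzy matching, and the helper names `tokenize()` and `ngrams()` as well as the example texts are hypothetical.

```r
# Lower-case the text, drop punctuation, and split into words
tokenize <- function(x) {
  words <- strsplit(tolower(gsub("[^[:alnum:][:space:]]", "", x)), "\\s+")[[1]]
  words[nzchar(words)]
}

# Build every run of n consecutive words
ngrams <- function(words, n = 5) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

transcript <- "A highlighter is a type of writing device"
versions <- c(
  "A highlighter is a type of writing device",
  "A highlighter is a type of marker pen",
  "The highlighter is a form of pen"
)

transcript_grams <- ngrams(tokenize(transcript))
version_grams <- lapply(versions, function(v) ngrams(tokenize(v)))

# For each transcript collocation, count exact occurrences across all versions
counts <- vapply(
  transcript_grams,
  function(g) sum(vapply(version_grams, function(vg) sum(vg == g), numeric(1))),
  numeric(1)
)
counts
```

In this toy example the first two collocations occur in two versions and the last two in only one. Fuzzy matching would additionally give partial credit to near matches, which is one way non-integer values like those in the table above could arise.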

These frequencies can be mapped back to the transcript document, and each word is then highlighted according to the average frequency of the collocations in which it appears. The results are shown below. Note that the “labels” argument can be used to add additional labels to the gradient key.


merged_frequency <- transcript_frequency(transcript_example_rename, collocation_object)
#> Joining with `by = join_by(to_merge)`
#> Joining with `by = join_by(text, lines, n_words, words, word_num, word_length,
#> x_coord, to_merge, stanza_freq, word_number)`
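
The per-word averaging can be sketched as follows. This is an illustration under assumptions, not the package's code: it assumes five-word collocations, so that each word belongs to up to five collocations (matching the `col_1` to `col_5` columns printed earlier), and the frequencies in `colloc_freq` are made up.

```r
# Hypothetical frequencies of the collocations starting at words 1 to 4
colloc_freq <- c(10, 8, 9, 7)
n <- 5                                  # assumed collocation length
n_words <- length(colloc_freq) + n - 1  # 8 words in this toy transcript

# Average the frequencies of every collocation that contains each word
word_freq <- vapply(seq_len(n_words), function(w) {
  starts <- max(1, w - n + 1):min(w, length(colloc_freq))
  mean(colloc_freq[starts])
}, numeric(1))
word_freq
```

Words near the start and end of the text belong to fewer collocations, so their averages rest on fewer values, just as the early rows of the table contain `NA` in the later columns.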


freq_plot <- collocation_plot(merged_frequency)

page_highlight <- highlighted_text(freq_plot, labels=c("(fewest articles)", "(most articles)"))
(fewest articles) 0 to 269 (most articles)

highlighter, also called fluorescent pen, is type of writing device used to bring attention to sections of text by marking them with vivid, translucent colour. typical highlighter is fluorescent yellow, with the color coming from pyranine. Different compounds, such as rhodamines (Rhodamine 6GD, Rhodamine B) are used for other colours.

highlighter is felt tip marker filled with transparent fluorescent ink instead of black or opaque ink. The first highlighter was invented by Dr. Frank Honn in 1962 and produced by Carter’s Ink Company, using the trademarked name Hi Liter. Avery Dennison Corporation now owns the brand, having acquired Carter’s in 1975.

Many highlighters come in bright, often fluorescent and vibrant colors. Being fluorescent, highlighter ink glows under black light. The most common color for highlighters is yellow, but they are also found in orange, red, pink, purple, blue, and green varieties. Some yellow highlighters may look greenish in colour to the naked eye. Yellow is the preferred color to use when making photocopy as it will not produce shadow on the copy.

Highlighters are available in multiple forms, including some that have retractable felt tip or an eraser on the end opposite the felt. Other types of highlighters include the trilighter, triangularly shaped pen with different coloured tip at each corner, and ones that are stackable. There are also some forms of highlighters that have wax like quality similar to an oil pastel.

“Dry highlighters” (occasionally called “dry line highlighters”) have an applicator that applies thin strip of highlighter tape (physically similar to audio tape or correction tape) instead of felt tip. Unlike standard highlighters, they are easily erasable. They are different from “dry mark highlighters”, which are sometimes advertised as being useful for highlighting books with thin pages.

“Gel highlighters” contain gel stick rather than felt tip. The gel does not bleed through paper or become dried out in the pen as other highlighters’ inks may, which renders them useless.

“Liquid Highlighters” in range of colours are also available, and because they put more ink on page when highlighting, they make words stand out more than with non liquid types. Also the fact that more highlighting ink is put on the page with liquid highlighters means that the highlighting ink is much more resistive to fading with age.

“Pastel Highlighters” use pastel dyes instead of fluorescent dyes.

Some word processing software can simulate highlighting by using technique similar to reverse video on some terminals. Some forms of syntax highlighting may also be displayed in the style of highlighter pen, with bright or pastel background to the text. Some web browser extensions also enables users to create digital highlights on websites and online PDFs.


The shading reflects how the Wikipedia article has changed over time: yellow marks text that occurs in more versions (such as yellow being the primary highlighter color, and the information about the trilighter), while darker colors mark text that appears in fewer versions (such as the introductory sentence and the reference to correction tape).
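
As a rough illustration of how word frequencies might be turned into a color gradient, the sketch below rescales frequencies to the range 0 to 1 and interpolates between a dark color and yellow using base R's `grDevices`. The endpoint colors and the intermediate frequencies are assumptions for illustration and are not necessarily what the package uses.

```r
# Hypothetical average collocation frequencies for four words
word_freq <- c(0, 67, 135, 269)

# Rescale to [0, 1] so the least frequent word is 0 and the most frequent is 1
scaled <- (word_freq - min(word_freq)) / (max(word_freq) - min(word_freq))

# Interpolate between a dark color (assumed) and yellow
ramp <- grDevices::colorRamp(c("#4B0082", "#FFFF00"))
word_colors <- grDevices::rgb(ramp(scaled), maxColorValue = 255)
word_colors
```

Each word would then be drawn with its own background color, with the least frequent text at the dark end of the key and the most frequent text at the yellow end.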

We could also use the oldest version of the Highlighter article in the dataset as the transcript reference to view which text has been changed:


transcript_example_rename2 <- data.frame(text=wiki_pages[dim(wiki_pages)[1],])

toks_transcript2 <- token_transcript(transcript_example_rename2)

collocation_object2 <- collocate_comments_fuzzy(toks_transcript2, toks_comment)
#> Joining with `by = join_by(unlist.descript_ngrams.)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation.y)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(word_number)`


merged_frequency2 <- transcript_frequency(transcript_example_rename2, collocation_object2)
#> Joining with `by = join_by(to_merge)`
#> Joining with `by = join_by(text, lines, n_words, words, word_num, word_length,
#> x_coord, to_merge, stanza_freq, word_number)`


freq_plot2 <- collocation_plot(merged_frequency2)

page_highlight2 <- highlighted_text(freq_plot2, labels=c("(fewest articles)", "(most articles)"))
(fewest articles) 0 to 504 (most articles)

highlighter is form of marker pen which is used to highlight sections of documents in vivid colour, but not intended to obscure the content beneath the marking. As such, highlighter ink is translucent.

Many highlighters come in bright, often neon colors, such as yellow or pink, but also coming in colors such as blue, green, or purple.


In this case, the beginning of the first sentence (“A highlighter is a form…”) appears in a relatively large number of versions. Note that the gradient counts here are larger than those for the most recent article, with a minimum average collocation of 52 and a maximum average collocation of 496.

Wikipedia Citation: “Highlighter.” Wikipedia, 14 Mar. 2024. Wikipedia, https://en.wikipedia.org/w/index.php?title=Highlighter&oldid=1213690238.