In a second example, we view previous versions of a Wikipedia article (in this case, on the Highlighter), in order to see which parts are consistently included. This data includes the first 150 versions of the article, as well as the latest 150 versions of the article. The data was gathered as follows:
class_of_interest <- ".mw-content-ltr" ## ids are #id-name, classes are .class-name
editurl <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&offset=&limit=150"
editclass_of_interest <- ".mw-changeslist-date"
url_list1 <- editurl %>%
read_html() %>%
html_nodes(editclass_of_interest) %>%
map(., list()) %>%
tibble(node = .) %>%
mutate(link = map_chr(node, html_attr, "href") %>% paste0("https://en.wikipedia.org", .))
editurl2 <- "https://en.wikipedia.org/w/index.php?title=Highlighter&action=history&dir=prev&limit=150"
url_list2 <- editurl2 %>%
read_html() %>%
html_nodes(editclass_of_interest) %>%
map(., list()) %>%
tibble(node = .) %>%
mutate(link = map_chr(node, html_attr, "href") %>% paste0("https://en.wikipedia.org", .))
url_list <- rbind(url_list1, url_list2)
wiki_pages <- data.frame(page_notes = rep(NA, dim(url_list)[1]))
for (i in 1:dim(url_list)[1]){
wiki_list <- url_list$link[i] %>%
read_html() %>%
html_node(class_of_interest) %>%
html_children() %>%
map(., list()) %>%
tibble(node = .) %>%
mutate(type = map_chr(node, html_name)) %>%
filter(type == "p") %>%
mutate(text = map_chr(node, html_text)) %>%
mutate(cleantext = str_remove_all(text, "\\[.*?\\]") %>% str_trim()) %>%
plyr::summarise(cleantext = paste(cleantext, collapse = "<br> "))
wiki_pages$page_notes[i] <- wiki_list$cleantext[1]
}
Note that the Wikipedia version text is placed in a column labelled “page_notes”, as needed for the comment functions in this package. This allows for the comments to be tokenized.
library(highlightr)
toks_comment <- token_comments(highlightr::wiki_pages)
The latest version of the article is the first row in the dataset, and can be used as the “transcript text”, or the base text to which the highlighting is applied. In this case, the column must be named “text”.
transcript_example_rename <- data.frame(text=wiki_pages[1,])
toks_transcript <- token_transcript(transcript_example_rename)
The previous versions are then compared to the current version’s collocations with fuzzy matching in order to provide a count for the amount of times each collocation occurs.
collocation_object <- collocate_comments_fuzzy(toks_transcript, toks_comment)
#> Joining with `by = join_by(unlist.descript_ngrams.)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation.y)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(word_number)`
head(collocation_object)
#> # A tibble: 6 × 8
#> word_number col_1 col_2 col_3 col_4 col_5 to_merge collocation
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 10.2 NA NA NA NA a a highlighter also call…
#> 2 2 7.64 10.2 NA NA NA highlighter highlighter also called…
#> 3 3 9.50 7.64 10.2 NA NA also also called a fluoresce…
#> 4 4 7.65 9.50 7.64 10.2 NA called called a fluorescent pe…
#> 5 5 8.40 7.65 9.50 7.64 10.2 a a fluorescent pen is a
#> 6 6 7.74 8.40 7.65 9.50 7.64 fluorescent fluorescent pen is a ty…
These frequencies can be mapped back to the transcript document, then highlighted as described based on the average collocation frequency that each word appeared in. The results are shown below. Note that the “labels” argument can be used to add additional labels to the gradient key.
merged_frequency <- transcript_frequency(transcript_example_rename, collocation_object)
#> Joining with `by = join_by(to_merge)`
#> Joining with `by = join_by(text, lines, n_words, words, word_num, word_length,
#> x_coord, to_merge, stanza_freq, word_number)`
freq_plot <- collocation_plot(merged_frequency)
page_highlight <- highlighted_text(freq_plot, labels=c("(fewest articles)", "(most articles)"))
This text indicates changes to the Wikipedia article, where yellow indicates more occurrences (such as yellow as the primary highlight color and information regarding the trilighter). Darker colors indicate text that is seen in fewer versions of the article (such as the introductory sentence and the reference to correction tape).
We could also use the oldest version of the Highlighter article in the dataset as the transcript reference to view which text has been changed:
transcript_example_rename2 <- data.frame(text=wiki_pages[dim(wiki_pages)[1],])
toks_transcript2 <- token_transcript(transcript_example_rename2)
collocation_object2 <- collocate_comments_fuzzy(toks_transcript2, toks_comment)
#> Joining with `by = join_by(unlist.descript_ngrams.)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(collocation.y)`
#> Joining with `by = join_by(collocation)`
#> Joining with `by = join_by(word_number)`
merged_frequency2 <- transcript_frequency(transcript_example_rename2, collocation_object2)
#> Joining with `by = join_by(to_merge)`
#> Joining with `by = join_by(text, lines, n_words, words, word_num, word_length,
#> x_coord, to_merge, stanza_freq, word_number)`
freq_plot2 <- collocation_plot(merged_frequency2)
page_highlight2 <- highlighted_text(freq_plot2, labels=c("(fewest articles)", "(most articles)"))
In this case, the beginning of the first sentence (“A highlighter is a form…”) is fairly popular. Note the counts for the gradient here are larger than those of the most recent article - with a minimum average collocation of 52 and a maximum average collocation of 496.
Wikipedia Citation: “Highlighter.” Wikipedia, 14 Mar. 2024. Wikipedia, https://en.wikipedia.org/w/index.php?title=Highlighter&oldid=1213690238.