Cleans notes using the longest common substring method: the longest common substring shared by two note pages is identified and removed if its length exceeds a set threshold.
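To make the idea concrete, here is a minimal base R sketch of finding a longest common substring and removing it from the later page. The helpers `longest_common_substring()` and `remove_shared_text()` are hypothetical names introduced for illustration and are not part of the package. The `min_chars` threshold is a plain character count, which is an assumption: the scale of the threshold argument of `lcsclean()` is not stated here and may well differ.

```r
# Hypothetical helper (not the package's code): longest common substring of
# two strings, found by dynamic programming over character positions.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # prev[j + 1] is the length of the common run ending at x[i - 1] and y[j]
  prev <- integer(length(y) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1] <- prev[j] + 1L
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  paste(x[(best_end - best_len + 1):best_end], collapse = "")
}

# Hypothetical helper: remove the shared text from `current` only when it is
# longer than `min_chars` characters (assumed threshold scale, see above).
remove_shared_text <- function(previous, current, min_chars) {
  shared <- longest_common_substring(previous, current)
  if (nchar(shared) <= min_chars) return(current)
  trimws(sub(shared, "", current, fixed = TRUE))
}

longest_common_substring("The", "The cat")                    # "The"
remove_shared_text("The cat", "The cat ran", min_chars = 3)   # "ran"
```

The sketch compares characters case-sensitively and removes only the first occurrence of the shared text; a real cleaner may choose differently on both points.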
Examples
test_dataset <- data.frame(
  ID = c("1", "1", "2", "2", "1", "3", "3"),
  Notes = c("The", "The cat", "The", "The dog", "The cat ran",
            "the chicken was chased", "The goat chased the chicken"),
  Page = c(1, 2, 1, 2, 3, 1, 2)
)
lcsclean(test_dataset, "Notes", 0.5, "ID", "Page")
#>   ID                       Notes Page                  page_notes
#> 1  1                         The    1                         The
#> 2  1                     The cat    2                         cat
#> 3  2                         The    1                         The
#> 4  2                     The dog    2                         dog
#> 5  1                 The cat ran    3                         ran
#> 6  3      the chicken was chased    1      the chicken was chased
#> 7  3 The goat chased the chicken    2 The goat chased the chicken