
Cleans notes using the longest common substring method: the longest common substring shared by two note pages is identified and removed if it is longer than a set threshold.

Usage

lcsclean(dataset, notes, propor, identifier, pageid)

Arguments

dataset

the dataset containing the notes

notes

the column name for the notes

propor

the minimum proportion of the previous notes that must match for the common substring to be removed

identifier

the column name for the unique identifier

pageid

the column name for the page number

Value

a data frame containing the original columns plus a page_notes column holding the cleaned notes

Examples

test_dataset <- data.frame(ID=c("1","1","2","2","1", "3","3"),
                           Notes=c("The","The cat","The","The dog","The cat ran",
                                   "the chicken was chased", "The goat chased the chicken"),
                           Page=c(1,2,1,2,3,1,2))
lcsclean(test_dataset,"Notes",0.5,"ID","Page")
#>   ID                       Notes Page                  page_notes
#> 1  1                         The    1                         The
#> 2  1                     The cat    2                         cat
#> 3  2                         The    1                         The
#> 4  2                     The dog    2                         dog
#> 5  1                 The cat ran    3                         ran
#> 6  3      the chicken was chased    1      the chicken was chased
#> 7  3 The goat chased the chicken    2 The goat chased the chicken
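
The idea behind the method can be illustrated with a minimal sketch in base R. This is not the package's implementation: the helper names longest_common_substring and remove_if_long_enough are hypothetical, and the sketch assumes the threshold is compared against the length of the common substring divided by the length of the previous note, which may differ from what lcsclean does internally.

```r
# Hypothetical helper (not part of the package): longest common substring
# of two strings, found with a character-level dynamic programme.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  prev <- integer(length(y) + 1)  # match lengths ending at the previous character of a
  best_len <- 0L
  best_end <- 0L                  # position in a where the best match ends
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1] <- prev[j] + 1L
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  substr(a, best_end - best_len + 1L, best_end)
}

# Hypothetical helper: drop the common substring from the current note when
# it covers more than `propor` of the previous note (an assumed definition
# of the threshold), otherwise return the current note unchanged.
remove_if_long_enough <- function(previous, current, propor) {
  common <- longest_common_substring(previous, current)
  if (nchar(common) > 0 && nchar(common) / nchar(previous) > propor) {
    trimws(sub(common, "", current, fixed = TRUE))
  } else {
    current
  }
}

remove_if_long_enough("The", "The cat", 0.5)
#> [1] "cat"
```

The sketch compares a single pair of notes. Applying it across a dataset would additionally require grouping rows by the identifier column and ordering them by page number.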