Longest Common Substring Note Cleaning — lcsclean • seqstrclean

Cleans notes using the longest common substring method: the longest common substring between a note page and the previous notes is identified, and it is removed if it is longer than a set threshold.
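The idea described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name `longest_common_substring`, the strict `>` comparison against the threshold, and the whitespace trimming are all assumptions made for the example.

```r
# Hypothetical helper, not part of seqstrclean: longest common substring
# of two strings, found by dynamic programming over single characters.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # len[i + 1, j + 1] holds the length of the common run ending at x[i] and y[j]
  len <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
        if (len[i + 1, j + 1] > best_len) {
          best_len <- len[i + 1, j + 1]
          best_end <- i
        }
      }
    }
  }
  if (best_len == 0L) return("")
  paste(x[(best_end - best_len + 1):best_end], collapse = "")
}

previous <- "The cat"
current <- "The cat ran"
common <- longest_common_substring(previous, current)

# Assumed rule: remove the shared text when it covers more than 50% of the
# previous page, then trim leftover whitespace.
if (nchar(common) / nchar(previous) > 0.5) {
  current <- trimws(sub(common, "", current, fixed = TRUE))
}
current
```

Here the shared text "The cat" covers the whole previous page, so only "ran" is left. Whether the package compares against the threshold strictly or inclusively is not stated on this page.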

Usage

lcsclean(dataset, notes, propor, identifier, pageid)

Arguments

dataset: the dataset containing the notes
notes: the column name for the notes
propor: minimum proportion of the previous notes that must match for the matching text to be removed
identifier: column name for the unique identifier
pageid: column name for page number

Value

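A data frame. In the example below, the result contains the input columns plus a page_notes column holding the cleaned notes.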

Examples

test_dataset <- data.frame(
  ID = c("1", "1", "2", "2", "1", "3", "3"),
  Notes = c("The", "The cat", "The", "The dog", "The cat ran",
            "the chicken was chased", "The goat chased the chicken"),
  Page = c(1, 2, 1, 2, 3, 1, 2)
)
lcsclean(test_dataset, "Notes", 0.5, "ID", "Page")
#>   ID                       Notes Page                  page_notes
#> 1  1                         The    1                         The
#> 2  1                     The cat    2                         cat
#> 3  2                         The    1                         The
#> 4  2                     The dog    2                         dog
#> 5  1                 The cat ran    3                         ran
#> 6  3      the chicken was chased    1      the chicken was chased
#> 7  3 The goat chased the chicken    2 The goat chased the chicken
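
The output above is consistent with each page being compared against the preceding page of the same ID (row 5 loses "The cat", which appears on page 2 of ID 1). How lcsclean pairs pages internally is not documented here. The base R sketch below shows one way such a pairing could be built, under that assumption. The column name `previous_notes` is hypothetical.

```r
test_dataset <- data.frame(
  ID = c("1", "1", "2", "2", "1", "3", "3"),
  Notes = c("The", "The cat", "The", "The dog", "The cat ran",
            "the chicken was chased", "The goat chased the chicken"),
  Page = c(1, 2, 1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Sort by identifier, then by page number within each identifier.
sorted <- test_dataset[order(test_dataset$ID, test_dataset$Page), ]

# Shift the notes down by one row so each row sees the row before it.
prev <- c(NA, head(sorted$Notes, -1))

# The first page of each identifier has no predecessor.
prev[!duplicated(sorted$ID)] <- NA

sorted$previous_notes <- prev
sorted[, c("ID", "Page", "Notes", "previous_notes")]
```

Each row of `sorted` then carries the text it would be compared against, with NA marking the first page of each ID.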