Applies the longest common substring (LCS) method to the notes flagged as extreme values in a dataset. Use it after applying firstnchar() and extremeid(). The dataset must contain a "page_notes" column holding the cleaned notes produced by firstnchar().
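To make the idea concrete, the sketch below finds the longest common substring of two notes in base R and removes it from the later note when it covers a large enough share of the earlier one. It is an illustration only, not the package's implementation: the helper longest_common_substring() is hypothetical, and computing the proportion against the length of the previous note is an assumption about how a threshold like propor might be applied.

```r
# Hypothetical helper (not part of the package): longest common substring
# of two strings, found with a dynamic-programming table.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # len[i + 1, j + 1] = length of the common substring ending at x[i] and y[j]
  len <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
        if (len[i + 1, j + 1] > best_len) {
          best_len <- len[i + 1, j + 1]
          best_end <- i
        }
      }
    }
  }
  if (best_len == 0L) return("")
  paste(x[(best_end - best_len + 1):best_end], collapse = "")
}

previous <- "The"      # note from an earlier page
current  <- "The dog"  # note flagged for cleaning

shared <- longest_common_substring(previous, current)

# Assumption: the match is measured as a share of the previous note
# and compared against a threshold playing the role of `propor`.
match_prop <- nchar(shared) / nchar(previous)
cleaned <- if (match_prop >= 0.25) {
  trimws(sub(shared, "", current, fixed = TRUE))
} else {
  current
}
cleaned
```

Under these assumptions, the text shared with the earlier note is dropped and only the new text is kept, which is the behaviour the lcs_notes column illustrates in the example output.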

Usage

lcsclean_hybrid(dataset, notes, propor, identifier, pageid, toclean)

Arguments

dataset

the dataset containing the notes

notes

the column name for the notes

propor

minimum proportion of a previous note that must match before the matching text is removed

identifier

column name for the unique identifier

pageid

column name for page number

toclean

column name for the logical (TRUE/FALSE) column that flags which notes to clean

Value

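a data frame containing the input columns plus two added columns, lcs_notes and hybrid_notes (see the example output below)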

Examples

test_dataset <- data.frame(
  ID = c("1", "1", "2", "2", "1", "3", "3"),
  Notes = c("The", "The cat", "The", "The dog", "The cat ran",
            "the chicken was chased", "The goat chased the chicken"),
  Page = c(1, 2, 1, 2, 3, 1, 2),
  cleaning = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  page_notes = c("The", "The cat", "The", "The dog", "The cat ran",
                 "the chicken was chased", "The goat chased the chicken")
)
lcsclean_hybrid(test_dataset, "Notes", 0.25, "ID", "Page", "cleaning")
#>   ID                       Notes Page cleaning                  page_notes
#> 1  1                         The    1    FALSE                         The
#> 2  1                     The cat    2    FALSE                     The cat
#> 3  2                         The    1    FALSE                         The
#> 4  2                     The dog    2     TRUE                     The dog
#> 5  1                 The cat ran    3    FALSE                 The cat ran
#> 6  3      the chicken was chased    1    FALSE      the chicken was chased
#> 7  3 The goat chased the chicken    2     TRUE The goat chased the chicken
#>   lcs_notes           hybrid_notes
#> 1      <NA>                    The
#> 2      <NA>                The cat
#> 3      <NA>                    The
#> 4       dog                    dog
#> 5      <NA>            The cat ran
#> 6      <NA> the chicken was chased
#> 7  The goat               The goat