Longest Common Substring Note Cleaning for Hybrid Method

Applies the longest common substring (LCS) method to the notes flagged as extreme values in a dataset. Use it after applying firstnchar() and extremeid(). The dataset must have a "page_notes" column holding the cleaned notes produced by firstnchar().

Usage

lcsclean_hybrid(dataset, notes, propor, identifier, pageid, toclean)

Arguments

dataset: the dataset containing the notes
notes: the column name for the notes
propor: minimum proportion of the previous notes that must match for the matching text to be removed
identifier: column name for the unique identifier
pageid: column name for page number
toclean: column name for the logical (TRUE/FALSE) column that flags which notes to clean
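
To make the role of propor concrete, here is a minimal sketch of a single longest common substring comparison between a note and the note before it. The helper longest_common_substring() is hypothetical and is not part of this package. The case-sensitive matching, the single removal pass, and the trimming of whitespace are all assumptions made for illustration; lcsclean_hybrid() may repeat the step, break ties differently, or normalise the text first.

```r
# Hypothetical helper (not part of the package): longest common substring
# of two strings, found by dynamic programming. Matching is case-sensitive.
longest_common_substring <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 || length(y) == 0) return("")
  # len[i + 1, j + 1] is the length of the common run ending at x[i] and y[j]
  len <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        len[i + 1, j + 1] <- len[i, j] + 1L
        if (len[i + 1, j + 1] > best_len) {
          best_len <- len[i + 1, j + 1]
          best_end <- i
        }
      }
    }
  }
  if (best_len == 0L) return("")
  paste(x[(best_end - best_len + 1):best_end], collapse = "")
}

previous <- "The"      # note on the earlier page
current  <- "The dog"  # note flagged for cleaning
propor   <- 0.25

shared <- longest_common_substring(current, previous)
# Assumed rule: remove the match only if it covers at least `propor`
# of the previous note.
share <- nchar(shared) / nchar(previous)
cleaned <- if (share >= propor) {
  trimws(sub(shared, "", current, fixed = TRUE))
} else {
  current
}
cleaned
```

Under these assumptions, the flagged note "The dog" shares the whole of the previous note "The", so the match is removed and "dog" remains.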

Value

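A data frame containing the input dataset plus two added columns, lcs_notes and hybrid_notes (see the example output).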

Examples

test_dataset <- data.frame(
  ID = c("1", "1", "2", "2", "1", "3", "3"),
  Notes = c("The", "The cat", "The", "The dog", "The cat ran",
            "the chicken was chased", "The goat chased the chicken"),
  Page = c(1, 2, 1, 2, 3, 1, 2),
  cleaning = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
  page_notes = c("The", "The cat", "The", "The dog", "The cat ran",
                 "the chicken was chased", "The goat chased the chicken")
)
lcsclean_hybrid(test_dataset, "Notes", 0.25, "ID", "Page", "cleaning")
#>   ID                       Notes Page cleaning                  page_notes
#> 1  1                         The    1    FALSE                         The
#> 2  1                     The cat    2    FALSE                     The cat
#> 3  2                         The    1    FALSE                         The
#> 4  2                     The dog    2     TRUE                     The dog
#> 5  1                 The cat ran    3    FALSE                 The cat ran
#> 6  3      the chicken was chased    1    FALSE      the chicken was chased
#> 7  3 The goat chased the chicken    2     TRUE The goat chased the chicken
#>   lcs_notes           hybrid_notes
#> 1      <NA>                    The
#> 2      <NA>                The cat
#> 3      <NA>                    The
#> 4       dog                    dog
#> 5      <NA>            The cat ran
#> 6      <NA> the chicken was chased
#> 7  The goat               The goat