Identify duplicate works based on title similarity
Source: R/fp_identify_duplicate_works.R
Groups potentially duplicate bibliographic records by computing pairwise string distances between work titles and clustering similar items.
Usage
fp_identify_duplicate_works(
  data = NULL,
  string_dist = "lv",
  hclust_method = "single",
  threshold = 0.2
)
Arguments
- data
a data.frame containing at least a title column.
- string_dist
a character string specifying the distance metric used by stringdist::stringdistmatrix(). Defaults to "lv" (Levenshtein distance).
- hclust_method
a character string specifying the hierarchical clustering method used by stats::hclust(). Defaults to "single".
- threshold
a numeric value controlling cluster separation. Lower values produce more fine-grained clusters (stricter matching), while higher values merge more records into the same group. Defaults to 0.2.
Value
The input data.frame with an additional column:
- ref_id
Integer cluster identifier grouping similar titles.
Details
Title similarity is computed after basic text normalization
(lowercasing, punctuation removal, whitespace trimming).
Distances are calculated using stringdist::stringdistmatrix() and
normalized by title length before hierarchical clustering.
This function does not remove duplicates but assigns a cluster identifier that can be used for downstream deduplication or grouping.
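The steps described above can be sketched in a few lines of base R. This is an illustrative approximation, not the package's implementation: sketch_cluster_titles() is a hypothetical helper, utils::adist() stands in for stringdist::stringdistmatrix() so the example runs without extra packages, and dividing by the longer title's length is only one assumed way to normalize by title length. Cutting the tree at height threshold with stats::cutree() is likewise an assumption about how the threshold is applied.

```r
# Hypothetical helper: a sketch of the described pipeline, not the package code.
sketch_cluster_titles <- function(titles, threshold = 0.2,
                                  hclust_method = "single") {
  # Basic normalization: lowercase, strip punctuation, collapse and trim whitespace.
  norm <- tolower(titles)
  norm <- gsub("[[:punct:]]", "", norm)
  norm <- trimws(gsub("\\s+", " ", norm))

  # Pairwise Levenshtein distances (base R stand-in for stringdist).
  d <- utils::adist(norm)

  # Assumed length normalization: divide by the length of the longer title.
  len <- nchar(norm)
  d <- d / pmax(outer(len, len, pmax), 1)

  # Hierarchical clustering, then cut the tree at the threshold height.
  tree <- stats::hclust(stats::as.dist(d), method = hclust_method)
  stats::cutree(tree, h = threshold)
}

ids <- sketch_cluster_titles(c(
  "Deep Learning for NLP",
  "Deep learning for NLP.",
  "Quantum Computing Basics"
))
ids
```

In this sketch the first two titles become identical after normalization and share a cluster identifier, while the third is placed in its own cluster. Raising threshold toward 1 merges progressively less similar titles.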
Examples
if (FALSE) { # \dontrun{
  df <- data.frame(
    title = c(
      "Deep Learning for NLP",
      "Deep learning for natural language processing",
      "Quantum Computing Basics"
    )
  )
  fp_identify_duplicate_works(df)
} # }
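Because the function only labels records, deduplication is a separate step. The sketch below shows one way to use ref_id downstream with base R. The clustered data frame is built by hand to mimic the shape of the documented return value (the input columns plus ref_id); its cluster assignments are hypothetical and were not produced by running the function.

```r
# Hand-built stand-in for the function's output; the ref_id values are hypothetical.
clustered <- data.frame(
  title = c(
    "Deep Learning for NLP",
    "Deep learning for NLP.",
    "Quantum Computing Basics"
  ),
  ref_id = c(1L, 1L, 2L)
)

# Keep the first record in each cluster.
deduplicated <- clustered[!duplicated(clustered$ref_id), ]

# Count records per cluster to find groups that may deserve manual review.
cluster_sizes <- table(clustered$ref_id)
```

Keeping the first record is the simplest policy; in practice one might prefer the record with the most complete metadata in each cluster.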