r - Removing loops in RecordLinkage
I am using the RecordLinkage package in R to deduplicate a dataset. The deduplicated output from the RecordLinkage package has loops in it.
For example:
Table rlinkage

id  name          id2  name2
1   jane johnson  5    jane johnson
5   jane johnson  17   jane johnson
I am trying to make a table that lists each id together with the other id numbers in the same loop of records.
For example:
id1  id2  id3  name
1    5    17   jane johnson

or

name          ids
jane johnson  1,5,17
Is this possible in R? I tried using the sqldf package to join the dataset onto itself multiple times to try to get all of the ids on the same line.
For example:
rlinkage2 <- sqldf('select a.id, a.id2, b.id as id3, b.id2 as id4 from rlinkage a left join rlinkage b on a.id = b.id or a.id = b.id2 or a.id2 = b.id or a.id2 = b.id2')
This creates a messy dataset and does not put all of the ids on the same line unless I join table rlinkage to itself many times. Is there a better way to do this?
1) sqldf. Using sqldf:
Union the two sets of columns and use group_concat:
sqldf("select name, group_concat(distinct id) ids from ( select id, name from rlinkage union select id2 id, name2 name from rlinkage ) group by name")
giving:
          name    ids
1 jane johnson 1,5,17
2) rbind/aggregate. Using plain R:
long <- rbind(rlinkage[1:2], setNames(rlinkage[3:4], names(rlinkage)[1:2]))
aggregate(id ~ name, long, function(x) toString(unique(x)))
giving:
          name       id
1 jane johnson 1, 5, 17
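If the first layout from the question (one id per column) is wanted instead, a minimal sketch using base R's reshape() might look like the following. The long data frame is re-created by hand so the example is self-contained, and the pos column is an assumption introduced only for illustration.

```r
# Long form of the example data: one row per (id, name) pair,
# as produced by stacking the two column pairs of rlinkage
long <- data.frame(
  id   = c(1, 5, 5, 17),
  name = "jane johnson",
  stringsAsFactors = FALSE
)

# Drop repeated (id, name) pairs, e.g. id 5 appears in both rows of rlinkage
long <- unique(long)

# pos (hypothetical helper column): running position of each id within its name
long$pos <- ave(long$id, long$name, FUN = seq_along)

# Spread the ids across columns id1, id2, id3, ... with one row per name
wide <- reshape(long, idvar = "name", timevar = "pos",
                direction = "wide", sep = "")
print(wide)
```

Names with fewer ids than the longest group would get NA in the trailing id columns.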
Note: The data used was:
lines <- "id,name,id2,name2
1,jane johnson,5,jane johnson
5,jane johnson,17,jane johnson"
rlinkage <- read.csv(text = lines, as.is = TRUE)
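Both approaches above group by name, so they rely on every linked record carrying exactly the same name. If that does not hold, one possible alternative is to group the ids by the links themselves. The following is only a sketch in base R: link_groups is a hypothetical helper (not part of RecordLinkage or sqldf), and the simple repeated pass over the links is assumed to be fast enough for modest data sizes.

```r
# Example data with the same shape as rlinkage in the question
rlinkage <- data.frame(
  id    = c(1, 5),
  name  = c("jane johnson", "jane johnson"),
  id2   = c(5, 17),
  name2 = c("jane johnson", "jane johnson"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every id with the smallest id it is linked to,
# directly or through a chain of links
link_groups <- function(from, to) {
  ids <- sort(unique(c(from, to)))
  label <- setNames(ids, ids)        # every id starts in its own group
  repeat {
    old <- label
    for (k in seq_along(from)) {
      a <- as.character(from[k])
      b <- as.character(to[k])
      m <- min(label[[a]], label[[b]])
      label[[a]] <- m                # both ends of a link share one label
      label[[b]] <- m
    }
    if (identical(old, label)) break # stop once a full pass changes nothing
  }
  label
}

grp <- link_groups(rlinkage$id, rlinkage$id2)
members <- data.frame(id = as.numeric(names(grp)), group = unname(grp))

# One row per group of linked ids
result <- aggregate(id ~ group, members, toString)
print(result)
```

Here group is the smallest id in each set of linked records, so a name can be attached afterwards by merging on that id.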