Extracting a specific format substring (ID, code) from a string using R
Suppose I have a data frame composed of tweets harvested from Twitter. I want to extract a substring, the unique ID in my data set, which is contained in each tweet. The IDs are all of the same form: 3-4 uppercase letters, followed by a hyphen, followed by a 6-digit number. Examples are: yld-000123, ylsl-000323, ylp-000135. I only need the ID and can drop everything else in each tweet.
Here are 2 examples of the tweets I'm working with:
st1 <- "elijo entertimer, ylc-000354, como ganador para http://t.co/jcldk8d796 #younglionsco #fantasylions"
st2 <- "elijo #aesetrennomelesubo, ylsl-000169, como ganador para http://t.co/wppm7x5ecn #younglionsco #fantasylions"
tweets <- c(st1, st2)
The result I need is "ylc-000354" "ylsl-000169". The ID is not always between commas.
An approach using gsub:
gsub('.*[^[:alpha:]]([[:alpha:]]+-\\d+).*', '\\1', tweets)
#[1] "ylc-000354" "ylsl-000169"
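The gsub call above works by replacing each whole tweet with the captured group. As an alternative, here is a minimal sketch that extracts the match directly with regexpr and regmatches, using a stricter pattern built from the description in the question (3-4 letters, a hyphen, exactly 6 digits). The name id_pattern and the word-boundary anchors are assumptions added for illustration, not part of the original answer.

```r
# The two example tweets from the question.
tweets <- c(
  "elijo entertimer, ylc-000354, como ganador para http://t.co/jcldk8d796 #younglionsco #fantasylions",
  "elijo #aesetrennomelesubo, ylsl-000169, como ganador para http://t.co/wppm7x5ecn #younglionsco #fantasylions"
)

# Assumed ID shape: 3-4 letters, a hyphen, then exactly 6 digits,
# with word boundaries so the match cannot start or end mid-token.
id_pattern <- "\\b[A-Za-z]{3,4}-[0-9]{6}\\b"

# regexpr() locates the first match in each string;
# regmatches() pulls out the matched text.
ids <- regmatches(tweets, regexpr(id_pattern, tweets, perl = TRUE))
ids
#[1] "ylc-000354" "ylsl-000169"
```

One caveat with this sketch: regmatches drops elements that have no match, so if some tweets might lack an ID the result can be shorter than the input. In that case the gsub approach, or checking grepl(id_pattern, tweets, perl = TRUE) first, keeps the result aligned with the rows of the data frame.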