R - tm package: inspect() returns a character count rather than the content


Whenever I run the inspect() function in the tm R package, I get a character count instead of the content of the documents. This happens regardless of the data source I'm using.

Here is my code:

    library(tm)
    data <- c("one two three", "two three four", "three four five")
    corp <- VCorpus(VectorSource(data))
    inspect(corp)

My output for this example:

    > inspect(corp)
    <<VCorpus>>
    Metadata:  corpus specific: 0, document level (indexed): 0
    Content:  documents: 3

    [[1]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 13

    [[2]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 14

    [[3]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 15

But what I want is:

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    one two three

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    two three four

    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    three four five

Here is another example, using the Ovid text files that ship with the tm package and are referenced at the beginning of "Introduction to the tm Package" by Ingo Feinerer: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Code:

    txt <- system.file("texts", "txt", package = "tm")
    ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                    readerControl = list(language = "lat"))
    inspect(ovid[1:2])

What I want, and what it should output:

    <<VCorpus>>
    Metadata:  corpus specific: 0, document level (indexed): 0
    Content:  documents: 2

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    si quis in hoc artem populo non novit amandi,
    hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
    arte leves currus: arte regendus amor.
    curribus automedon lentisque erat aptus habenis,
    tiphys in haemonia puppe magister erat:
    me venus artificem tenero praefecit amori;
    tiphys et automedon dicar amoris ego.
    ille quidem ferus est et qui mihi saepe repugnet:
    sed puer est, aetas mollis et apta regi.
    phillyrides puerum cithara perfecit achillem,
    atque animos placida contudit arte feros.
    qui totiens socios, totiens exterruit hostes,
    creditur annosum pertimuisse senem.

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    quas hector sensurus erat, poscente magistro
    verberibus iussas praebuit ille manus.
    aeacidae chiron, ego sum praeceptor amoris:
    saevus uterque puer, natus uterque dea.
    sed tamen et tauri cervix oneratur aratro,
    frenaque magnanimi dente teruntur equi;
    et mihi cedet amor, quamvis mea vulneret arcu
    pectora, iactatas excutiatque faces.
    quo me fixit amor, quo me violentius ussit,

What it outputs for me:

    <<VCorpus>>
    Metadata:  corpus specific: 0, document level (indexed): 0
    Content:  documents: 2

    [[1]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 49
    Content:  chars: 48
    Content:  chars: 46
    Content:  chars: 47
    Content:  chars: 0
    Content:  chars: 52
    Content:  chars: 48
    Content:  chars: 46
    Content:  chars: 46
    Content:  chars: 53
    Content:  chars: 0
    Content:  chars: 49
    Content:  chars: 49
    Content:  chars: 50
    Content:  chars: 49
    Content:  chars: 44

    [[2]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 48
    Content:  chars: 47
    Content:  chars: 47
    Content:  chars: 48
    Content:  chars: 46
    Content:  chars: 0
    Content:  chars: 48
    Content:  chars: 49
    Content:  chars: 45
    Content:  chars: 47
    Content:  chars: 45
    Content:  chars: 0
    Content:  chars: 51
    Content:  chars: 42
    Content:  chars: 45
    Content:  chars: 48
    Content:  chars: 44

Version 0.6-1 of the tm package changed the way documents are printed to the screen. It now outputs a compact representation of the document rather than the document text itself.

To obtain the document text, you'll need to apply the as.character() function to the documents in the corpus.

For example, using the Ovid example (shown here using tm version 0.6-2):

    > txt <- system.file("texts", "txt", package = "tm")
    > ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
        readerControl = list(language = "lat"))

The new inspect() function outputs a compact representation of each document:

    > inspect(ovid[1:2])
    <<VCorpus>>
    Metadata:  corpus specific: 0, document level (indexed): 0
    Content:  documents: 2

    [[1]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 676

    [[2]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 700

To obtain the full text of each document, apply the as.character() function to the document you want to examine (note that the output has been truncated):

    > as.character(ovid[[1]])
     [1] "    si quis in hoc artem populo non novit amandi,"
     [2] "         hoc legat et lecto carmine doctus amet."
     [3] "    arte citae veloque rates remoque moventur,"
     [4] "         arte leves currus: arte regendus amor."

To clean up the output display, combine the above with the writeLines() function:

    > writeLines(as.character(ovid[[1]]))
        si quis in hoc artem populo non novit amandi,
             hoc legat et lecto carmine doctus amet.
        arte citae veloque rates remoque moventur,
             arte leves currus: arte regendus amor.

To do this for multiple documents in the corpus, combine the above with the lapply() function (output truncated):

    > lapply(ovid[1:2], as.character)
    $ovid_1.txt
     [1] "    si quis in hoc artem populo non novit amandi,"
     [2] "         hoc legat et lecto carmine doctus amet."
     [3] "    arte citae veloque rates remoque moventur,"
     [4] "         arte leves currus: arte regendus amor."

    $ovid_2.txt
     [1] "    quas hector sensurus erat, poscente magistro"
     [2] "         verberibus iussas praebuit ille manus."
     [3] "    aeacidae chiron, ego sum praeceptor amoris:"
     [4] "         saevus uterque puer, natus uterque dea."
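The same idea applies to a small in-memory corpus like the one in the question. The following is a minimal sketch, assuming a tm version of 0.6 or later; the variable name `texts` is only illustrative, and the three strings are toy data of the kind the question uses.

```r
library(tm)

# Toy data: three one-line documents, as in the question.
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# as.character() returns the text of a single document.
first_text <- as.character(corp[[1]])

# sapply() collects the text of every document into one character vector
# (each document here is a single line, so the result simplifies cleanly).
texts <- sapply(corp, as.character)

# writeLines() prints the text without quotes or index prefixes.
writeLines(texts)
```

Because every document in this corpus holds exactly one line, sapply() simplifies to a plain character vector; for multi-line documents such as the Ovid files, lapply() keeps each document's lines together.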

Finally, to clean up the output and replicate the previous inspect() behavior, try using the l_ply() function in the plyr package as follows (output truncated):

    > library(plyr)
    > l_ply(ovid[1:2], function(doc) {
        print(doc)                    # output the summary of the document
        writeLines("")                # output a blank line between results
        writeLines(as.character(doc)) # output the clean document text
        writeLines("")                # output a blank line between results
      })

    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 676

        si quis in hoc artem populo non novit amandi,
             hoc legat et lecto carmine doctus amet.
        arte citae veloque rates remoque moventur,
             arte leves currus: arte regendus amor.

    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 700

        quas hector sensurus erat, poscente magistro
             verberibus iussas praebuit ille manus.
        aeacidae chiron, ego sum praeceptor amoris:
             saevus uterque puer, natus uterque dea.
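If you would rather not load plyr, a plain loop can do the same job. This is a minimal sketch under the same assumptions (tm 0.6 or later); `show_docs()` is a hypothetical helper written for this answer, not a function provided by tm.

```r
library(tm)

txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))

# Hypothetical helper (not part of tm): for each document, print the
# compact summary followed by the document text.
show_docs <- function(corpus) {
  for (i in seq_len(length(corpus))) {
    doc <- corpus[[i]]
    print(doc)                     # compact summary of the document
    writeLines("")                 # blank line between summary and text
    writeLines(as.character(doc))  # clean document text
    writeLines("")                 # blank line between documents
  }
  invisible(corpus)                # return the corpus without printing it
}

show_docs(ovid[1:2])
```

Returning the corpus invisibly keeps the console free of the extra list output that lapply() would otherwise print.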

Hope this helps!

