R - tm package: inspect() returning char count rather than content
Whenever I run the inspect() function in the tm R package, I get a character count instead of the content of the documents. This happens regardless of the data source I'm using.
Here is my code:

    library(tm)
    data <- c("one two three", "two three four", "three four five")
    corp <- VCorpus(VectorSource(data))
    inspect(corp)
My output:

    > inspect(corp)
    <<VCorpus>>
    Metadata: corpus specific: 0, document level (indexed): 0
    Content: documents: 3

    [[1]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 13

    [[2]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 14

    [[3]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 15
But what I want is:

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    one two three

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    two three four

    [[3]]
    <<PlainTextDocument (metadata: 7)>>
    three four five
Here is an example using the Ovid text files that ship with the tm package by default. They are referenced near the beginning of "Introduction to the tm Package" by Ingo Feinerer: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Code:

    txt <- system.file("texts", "txt", package = "tm")
    ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                    readerControl = list(language = "lat"))
    inspect(ovid[1:2])
What I want, and what it should output:

    <<VCorpus>>
    Metadata: corpus specific: 0, document level (indexed): 0
    Content: documents: 2

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    si quis in hoc artem populo non novit amandi,
    hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
    arte leves currus: arte regendus amor.
    curribus automedon lentisque erat aptus habenis,
    tiphys in haemonia puppe magister erat:
    me venus artificem tenero praefecit amori;
    tiphys et automedon dicar amoris ego.
    ille quidem ferus est et qui mihi saepe repugnet:
    sed puer est, aetas mollis et apta regi.
    phillyrides puerum cithara perfecit achillem,
    atque animos placida contudit arte feros.
    qui totiens socios, totiens exterruit hostes,
    creditur annosum pertimuisse senem.

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    quas hector sensurus erat, poscente magistro
    verberibus iussas praebuit ille manus.
    aeacidae chiron, ego sum praeceptor amoris:
    saevus uterque puer, natus uterque dea.
    sed tamen et tauri cervix oneratur aratro,
    frenaque magnanimi dente teruntur equi;
    et mihi cedet amor, quamvis mea vulneret arcu
    pectora, iactatas excutiatque faces.
    quo me fixit amor, quo me violentius ussit,
What it outputs for me:

    <<VCorpus>>
    Metadata: corpus specific: 0, document level (indexed): 0
    Content: documents: 2

    [[1]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 49
    Content: chars: 48
    Content: chars: 46
    Content: chars: 47
    Content: chars: 0
    Content: chars: 52
    Content: chars: 48
    Content: chars: 46
    Content: chars: 46
    Content: chars: 53
    Content: chars: 0
    Content: chars: 49
    Content: chars: 49
    Content: chars: 50
    Content: chars: 49
    Content: chars: 44

    [[2]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 48
    Content: chars: 47
    Content: chars: 47
    Content: chars: 48
    Content: chars: 46
    Content: chars: 0
    Content: chars: 48
    Content: chars: 49
    Content: chars: 45
    Content: chars: 47
    Content: chars: 45
    Content: chars: 0
    Content: chars: 51
    Content: chars: 42
    Content: chars: 45
    Content: chars: 48
    Content: chars: 44
Version 0.6-1 of the tm package changed the way documents are printed to the screen. It now outputs a compact representation of each document rather than the document text itself.
To obtain the document text, you'll need to apply the as.character() function to the documents in the corpus.
For example, using the Ovid example (shown here using tm version 0.6-2):

    > txt <- system.file("texts", "txt", package = "tm")
    > ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                      readerControl = list(language = "lat"))
The new inspect() function outputs a compact representation of each document:

    > inspect(ovid[1:2])
    <<VCorpus>>
    Metadata: corpus specific: 0, document level (indexed): 0
    Content: documents: 2

    [[1]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 676

    [[2]]
    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 700
To obtain the full text of a document, apply the as.character() function to the document you want to examine (note that the output has been truncated):

    > as.character(ovid[[1]])
    [1] " si quis in hoc artem populo non novit amandi,"
    [2] " hoc legat et lecto carmine doctus amet."
    [3] " arte citae veloque rates remoque moventur,"
    [4] " arte leves currus: arte regendus amor."
To clean up the output display, combine the above with the writeLines() function:

    > writeLines(as.character(ovid[[1]]))
    si quis in hoc artem populo non novit amandi,
    hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
    arte leves currus: arte regendus amor.
To do this for multiple documents in the corpus, combine the above with the lapply() function (output truncated):

    > lapply(ovid[1:2], as.character)
    $ovid_1.txt
    [1] " si quis in hoc artem populo non novit amandi,"
    [2] " hoc legat et lecto carmine doctus amet."
    [3] " arte citae veloque rates remoque moventur,"
    [4] " arte leves currus: arte regendus amor."

    $ovid_2.txt
    [1] " quas hector sensurus erat, poscente magistro"
    [2] " verberibus iussas praebuit ille manus."
    [3] " aeacidae chiron, ego sum praeceptor amoris:"
    [4] " saevus uterque puer, natus uterque dea."
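The same approach should carry over to a corpus built from VectorSource(). The following is a minimal sketch, assuming a toy three-document corpus like the one in the question (the exact strings are illustrative) and tm 0.6 or later. It uses sapply() instead of lapply() so the result is a single character vector rather than a list.

```r
library(tm)

# Toy documents in the style of the question (illustrative wording).
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# as.character() on a single document returns its text.
first_doc <- as.character(corp[[1]])
print(first_doc)

# sapply() collects the text of every document into one character vector;
# unname() drops the document ids that sapply() attaches as names.
all_text <- unname(sapply(corp, as.character))
print(all_text)
```

Because each of these documents is a single line, sapply() can simplify the result to a plain vector. For multi-line documents such as the Ovid files, lapply() is the safer choice.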
Finally, to clean up the output and replicate the previous inspect() behavior, try using the l_ply() function in the plyr package as follows (output truncated):

    > library(plyr)
    > l_ply(ovid[1:2], function(doc) {
          print(doc)                      # output summary of document
          writeLines("")                  # output blank line between results
          writeLines(as.character(doc))   # output clean document text
          writeLines("")                  # output blank line between results
      })

    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 676

    si quis in hoc artem populo non novit amandi,
    hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
    arte leves currus: arte regendus amor.

    <<PlainTextDocument>>
    Metadata: 7
    Content: chars: 700

    quas hector sensurus erat, poscente magistro
    verberibus iussas praebuit ille manus.
    aeacidae chiron, ego sum praeceptor amoris:
    saevus uterque puer, natus uterque dea.
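If you would rather not depend on plyr, a plain for loop should give the same effect. This is only a sketch: inspect_text() is a hypothetical helper name, not a function provided by tm, and it assumes tm 0.6 or later, where as.list() works on a VCorpus.

```r
library(tm)

# Hypothetical helper (not part of tm): for each document, print the
# compact summary followed by the full document text.
inspect_text <- function(corpus) {
  for (doc in as.list(corpus)) {
    print(doc)                      # compact summary of the document
    writeLines("")                  # blank line between summary and text
    writeLines(as.character(doc))   # clean document text
    writeLines("")                  # blank line between documents
  }
  invisible(corpus)                 # return the corpus without printing it
}

txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))
inspect_text(ovid[1:2])
```

Returning the corpus invisibly keeps the helper usable at the end of a longer expression without adding extra console output.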
Hope this helps!