Objective-C - Extract only the text from PDF files with CGPDFScanner


There are a number of questions (some answered, others not) about extracting simple text from PDF files. Stack Overflow has been helpful in pointing out that the Adobe PDF documentation is clear on how to detect objects during parsing: i.e. one should use the 'BT' and 'ET' PDF reference operators to construct the callbacks when using CGPDFScanner.

The Apple documentation shows this callback example:

    static void op_BT (CGPDFScannerRef s, void *info) {
        const char *name;
        if (!CGPDFScannerPopName(s, &name))
            return;
        printf("BT /%s\n", name);
    }

And, among other CGPDFScanner commands, the above callback is set up by first creating:

    myTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);

All good so far, but the Apple documentation doesn't appear to be written for low-to-intermediate programmers like me to understand the next step: beyond identifying the text block (presumably between the BT and ET callbacks?), what are the few steps/lines needed during/in/outside the callback to capture the identified text block into an NSString?

Many thanks.

The first thing you should do is download the PDF reference. These days that's an ISO standard, but you can download the Acrobat SDK (http://www.adobe.com/devnet/acrobat.html), which contains Adobe's copy and will serve you just as well.

Then read chapter 9. It'll teach you that on the one hand you need to understand the text operators (Tj, ', ", TJ), and on the other hand you need to understand fonts and encodings.

The text operators are the operators you can intercept that add "strings" to a PDF document; while the text operators must appear inside BT ... ET blocks, intercepting BT and ET themselves isn't going to do what you think.
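To make that concrete, here is a minimal sketch (not a definitive implementation) of callbacks for the string-showing operators instead of BT. The CGPDF* calls are CoreGraphics functions; `TextBuffer`, `text_buffer_append` and `collect_page_strings` are hypothetical names invented for this example. The CoreGraphics part is guarded so that the buffer helper compiles on any platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that collects the raw string operands. */
typedef struct {
    unsigned char *bytes;
    size_t length;
    size_t capacity;
} TextBuffer;

/* Append n raw bytes to the buffer, growing it as needed. */
static void text_buffer_append(TextBuffer *buf, const unsigned char *src, size_t n) {
    assert(buf != NULL);
    if (src == NULL || n == 0) return;
    if (buf->length + n > buf->capacity) {
        size_t cap = buf->capacity ? buf->capacity : 64;
        while (cap < buf->length + n) cap *= 2;
        unsigned char *grown = realloc(buf->bytes, cap);
        if (grown == NULL) return;          /* out of memory: drop this fragment */
        buf->bytes = grown;
        buf->capacity = cap;
    }
    memcpy(buf->bytes + buf->length, src, n);
    buf->length += n;
}

#ifdef __APPLE__
#include <CoreGraphics/CoreGraphics.h>

/* Tj, ' and ": the string is the topmost operand on the stack. */
void op_show_string(CGPDFScannerRef s, void *info) {
    CGPDFStringRef str;
    if (!CGPDFScannerPopString(s, &str)) return;
    text_buffer_append((TextBuffer *)info,
                       CGPDFStringGetBytePtr(str), CGPDFStringGetLength(str));
}

/* TJ: an array that mixes strings with numeric position adjustments. */
void op_show_array(CGPDFScannerRef s, void *info) {
    CGPDFArrayRef array;
    if (!CGPDFScannerPopArray(s, &array)) return;
    size_t count = CGPDFArrayGetCount(array);
    for (size_t i = 0; i < count; i++) {
        CGPDFStringRef str;
        if (CGPDFArrayGetString(array, i, &str)) {   /* skips the numbers */
            text_buffer_append((TextBuffer *)info,
                               CGPDFStringGetBytePtr(str), CGPDFStringGetLength(str));
        }
    }
}

/* Scan one page and collect every shown string into buf. */
void collect_page_strings(CGPDFPageRef page, TextBuffer *buf) {
    CGPDFOperatorTableRef table = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(table, "Tj", &op_show_string);
    CGPDFOperatorTableSetCallback(table, "'",  &op_show_string);
    CGPDFOperatorTableSetCallback(table, "\"", &op_show_string);
    CGPDFOperatorTableSetCallback(table, "TJ", &op_show_array);

    CGPDFContentStreamRef stream = CGPDFContentStreamCreateWithPage(page);
    CGPDFScannerRef scanner = CGPDFScannerCreate(stream, table, buf);
    CGPDFScannerScan(scanner);

    CGPDFScannerRelease(scanner);
    CGPDFContentStreamRelease(stream);
    CGPDFOperatorTableRelease(table);
}
#endif
```

What ends up in the buffer is raw character codes, not yet readable text. Once those codes have been mapped to Unicode through the font in use, the result could be turned into an NSString, for example with `-[NSString initWithBytes:length:encoding:]`.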

The fonts are important because they define how the bytes used by these operators correspond to actual (Unicode) characters. If you want to derive the meaning of the bytes in the PDF file, you need to know how to use the fonts to derive that meaning.
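As a rough illustration of that mapping step, the sketch below pushes single-byte character codes through a per-font lookup table and emits UTF-8. The `FontMap` type and the `decode_with_font` function are hypothetical; in a real PDF the table would have to be built from the font's /Encoding or /ToUnicode data, and many fonts use multi-byte codes, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-font lookup: one-byte character code -> Unicode code point.
   An entry of 0 means "no mapping known". */
typedef struct {
    uint32_t to_unicode[256];
} FontMap;

/* Encode one code point as UTF-8; returns the number of bytes written (1-4). */
static size_t utf8_encode(uint32_t cp, char *out) {
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Map n character codes through the font into a NUL-terminated UTF-8 string.
   Unmapped codes become U+FFFD. Returns the number of bytes written,
   not counting the terminator. */
static size_t decode_with_font(const FontMap *font, const unsigned char *codes,
                               size_t n, char *out, size_t out_size) {
    assert(font != NULL && out != NULL && out_size > 0);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = font->to_unicode[codes[i]];
        if (cp == 0) cp = 0xFFFD;
        char tmp[4];
        size_t len = utf8_encode(cp, tmp);
        if (used + len >= out_size) break;   /* keep room for the terminator */
        memcpy(out + used, tmp, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With a table where codes 1, 2 and 3 map to 'P', 'D' and 'F', the bytes 01 02 03 decode to "PDF", even though none of those bytes is a printable character on its own.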

Some additional points:

  • Don't assume BT and ET correspond to an actual text block or paragraph as you may know it from an application such as InDesign or Word. One text block may contain the whole page or just a single character (or nothing).

  • There are text state operators that determine how text is going to be shown on the page. There are ways, for example, to draw invisible text; you may or may not wish to extract that type of text. If you don't, you'll need to support enough of the text state operators that you can tell the difference.

Not a small task :)

Update after looking at the sample PDF

Because in the comments the question was refined to indicate text extraction from a specific type of PDF file, let me add a little additional information.

1) Looking at the PDF file you reference, you won't be able to skip the font/encoding problem. The fonts in your sample PDF file are subsetted, which means you don't have "cleartext" in the PDF page description but instead indexes that have to be mapped through the encoding of the fonts used to get meaningful text.

2) Extracting the text is possible, if you look at the following output from pdfToolbox (warning, I'm affiliated rather heavily with this tool):

    <page id="33">
        <words>
            <word txt="senator">
                <parts>
                    <part tlh="28.3481" tlv="868.534" trh="55.4455" trv="868.534" blh="28.3481" blv="859.902" brh="55.4455" brv="859.902"></part>
                </parts>
            </word>
            <word txt="house,">
                <parts>
                    <part tlh="57.5305" tlv="868.534" trh="82.123" trv="868.534" blh="57.5305" blv="859.902" brh="82.123" brv="859.902"></part>
                </parts>
            </word>
            <word txt="85">
                <parts>
                    <part tlh="84.208" tlv="868.534" trh="92.548" trv="868.534" blh="84.208" blv="859.902" brh="92.548" brv="859.902"></part>
                </parts>
            </word>

There are undoubtedly other tools that can give you a similar (or better) result, so extracting the text should be doable.

The big problem is going to be finding the text you're interested in, in the right order. The extraction used here gives you the text of each "word" and its position (bounding box) on the page. When you go through this XML for a table, the challenge is going to be working out which text belongs to which table cell, where rows and columns end, etc.

In a way this problem is harder than the problem of detecting lines of text, because you're dealing with a pretty dense table (and that problem is largely one-dimensional (gathering what is on the same line), while this one is two-dimensional).
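The one-dimensional half of the problem, gathering words that sit on the same line, might be sketched as follows. The `Word` struct and `assign_rows` are hypothetical; the coordinates are assumed to come from the tlh/tlv/brh/brv attributes above, read as left, top, right and bottom, with y growing upwards. The tolerance is a guess that would need tuning per document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical word record built from one <word> element. */
typedef struct {
    const char *text;
    double left, top, right, bottom;
    int row;                       /* filled in by assign_rows */
} Word;

static double mid_y(const Word *w) {
    return (w->top + w->bottom) / 2.0;
}

/* Higher on the page first (y grows upwards in these coordinates). */
static int by_mid_y_desc(const void *a, const void *b) {
    double ya = mid_y((const Word *)a), yb = mid_y((const Word *)b);
    return (ya < yb) - (ya > yb);
}

/* Row by row, then left to right within a row. */
static int by_row_then_left(const void *a, const void *b) {
    const Word *wa = a, *wb = b;
    if (wa->row != wb->row) return wa->row - wb->row;
    return (wa->left > wb->left) - (wa->left < wb->left);
}

/* Group words into rows: a new row starts when a word's vertical midpoint
   lies more than `tolerance` below the first word of the current row.
   Leaves the array in reading order and returns the number of rows. */
static int assign_rows(Word *words, size_t n, double tolerance) {
    assert(words != NULL || n == 0);
    if (n == 0) return 0;
    qsort(words, n, sizeof *words, by_mid_y_desc);
    int row = 0;
    double row_y = mid_y(&words[0]);
    for (size_t i = 0; i < n; i++) {
        double y = mid_y(&words[i]);
        if (row_y - y > tolerance) {
            row++;
            row_y = y;
        }
        words[i].row = row;
    }
    qsort(words, n, sizeof *words, by_row_then_left);
    return row + 1;
}
```

Finding the columns would need a second pass of the same kind over the horizontal positions, and that is where a dense table gets hard: cell boundaries are not in the data and have to be inferred from gaps between the bounding boxes.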

