python - Is there a way to automate specific data extraction from a number of PDF files and add them to an Excel sheet?
I regularly have to go through a list of PDF files, search for specific data, and add it to an Excel sheet for later review. With around 50 PDF files per month, doing this manually is both time-consuming and frustrating.
Can this process be automated on Windows with Python or another scripting language? I would like to put all the PDF files in a folder and run a script that generates an Excel sheet with the data added. The PDF files I work with are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible, and there are plenty of tools available to extract content from a PDF document: text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it only cares about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask for every paragraph, or for the third sentence in the fifth paragraph. You can ask a library for all of the text, or for all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format, because it doesn't have to be the case that you can copy and paste or otherwise extract understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
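To make the "text in a specific location" idea concrete, here is a minimal sketch in plain Python. It assumes an extraction library has already handed you words together with their page coordinates; the `WORDS` sample data, the `(x, y, text)` tuple layout and both helper functions are hypothetical illustrations, not the API of any particular library.

```python
# Each word is (x, y, text): x grows to the right, y grows downward from the page top.
# Hypothetical sample data standing in for whatever your PDF library reports.
WORDS = [
    (50, 100, "Invoice"), (120, 100, "No."), (300, 100, "Amount"),
    (50, 120, "A-1001"), (300, 120, "250.00"),
    (50, 140, "A-1002"), (300, 140, "99.50"),
]

def words_in_box(words, x0, y0, x1, y1):
    """Keep only the words whose position falls inside the given rectangle."""
    return [w for w in words if x0 <= w[0] <= x1 and y0 <= w[1] <= y1]

def group_into_rows(words, y_tolerance=3):
    """Rebuild table rows: words at (nearly) the same height form one row."""
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]

# Only the region below the header line, turned back into rows.
body = group_into_rows(words_in_box(WORDS, 0, 110, 400, 150))
print(body)
```

Note that the rows and columns exist only because the code infers them from coordinates. If the layout shifts, the rectangle and the tolerance have to shift with it.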
So... if you have one type of document and you can test that it predictably behaves the same way with your extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different every time, or the layout on the page is totally different every time, then the answer is probably that you cannot reliably extract the information you want.
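For the predictable case, the folder-to-spreadsheet workflow from the question could be sketched roughly as follows. This is only an outline under several assumptions: the field names and regular expressions in `FIELD_PATTERNS` are made up, `extract_text` relies on the third-party pypdf package, and the output is a CSV file (which Excel opens) rather than a native .xlsx file, which would need an additional library such as openpyxl.

```python
import csv
import re
from pathlib import Path

# Hypothetical fields and patterns; adapt them to the text your PDFs really produce.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No\.?\s*:?\s*(\S+)"),
    "total": re.compile(r"Total\s*:?\s*([\d.,]+)"),
}

def extract_text(pdf_path):
    """Return all text of a PDF using pypdf (third-party: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so the rest runs without it
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def parse_fields(text, patterns=FIELD_PATTERNS):
    """Return field -> value, with None for every field that was not found."""
    row = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row

def build_sheet(folder, out_csv, extractor=extract_text, patterns=FIELD_PATTERNS):
    """Process every *.pdf in folder, write a CSV, return the files needing review."""
    needs_review = []
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", *patterns])
        writer.writeheader()
        for pdf_path in sorted(Path(folder).glob("*.pdf")):
            row = parse_fields(extractor(pdf_path), patterns)
            if None in row.values():
                # The layout did not match expectations; flag for a manual check.
                needs_review.append(pdf_path.name)
            writer.writerow({"file": pdf_path.name, **row})
    return needs_review
```

Calling `build_sheet("pdfs", "report.csv")` would then write one row per file and return the names of files where a field was missing. Flagging those files instead of guessing is deliberate: a file that no longer matches the expected layout is exactly the unreliable case described above.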
As a side note:
There are types of PDF documents that are easier to handle than others, so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc.) are required to be created this way.
Some PDF files are "tagged", which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc., and if the tagging was done in a good way it could make the job of content extraction much easier.