Monday, October 03, 2011

PDF file to text

A friend of mine asked me how to convert a PDF file to text.  There are plenty of tools for this in the Linux world.  Two situations need to be considered.



The first is when the file was `printed' to PDF by some PDF creator, for example M$ Word.  This case is much easier to handle.  If poppler-utils is not installed (check with $ rpm -qa | grep poppler-utils), install it first and then run pdftotext:
# yum install poppler-utils
$ pdftotext file.pdf

This creates file.txt.  Note that the extracted text usually does not keep the formatting of the original document well.
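
When there are many files, the same command can be wrapped in a small loop.  Below is a minimal sketch, not a polished tool: the function name pdf_dir_to_text is made up for this post, and it assumes all the PDFs sit in one directory.  It also passes -layout, the pdftotext option that tries to keep the original physical layout, which may or may not help depending on the document.

```shell
# Convert every PDF in a directory to a .txt file next to it.
# pdf_dir_to_text is a hypothetical helper name, not part of poppler-utils.
pdf_dir_to_text() {
    dir="${1:-.}"                        # default to the current directory
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        # -layout asks pdftotext to maintain the physical layout of the page
        pdftotext -layout "$f" "${f%.pdf}.txt"
    done
}
```

For example, $ pdf_dir_to_text ~/docs would write ~/docs/a.txt for ~/docs/a.pdf, and so on.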

The other situation is more difficult.  If the PDF file was scanned, we need OCR to recognize the text.  When scanning, be sure to use text mode (black and white) at 300dpi.  One way to do it is below:
# yum install pdftk ImageMagick gocr  # if they are not installed
$ pdftk yourfile.pdf burst
$ for i in pg_*.pdf; do convert -density 300 "$i" "$i.pnm"; gocr "$i.pnm" >> text; done
If each scanned image holds two pages side by side, then we also need to cut the images in two.  I will add more scripts if I have time.
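
For the two-pages-per-image case, one possible approach is to let ImageMagick cut each image into a left and a right half before running gocr.  This is only a rough sketch under some assumptions: the function name split_and_ocr and the halves/ directory are made up here, the page images are the pg_*.pnm files produced in the current directory by the loop above, and the two pages are assumed to be of equal width with the left page coming first.

```shell
# Split each page image into left/right halves and OCR them in reading order.
# split_and_ocr and the halves/ directory are hypothetical names.
split_and_ocr() {
    out="${1:-text}"                     # file that collects the OCR output
    mkdir -p halves
    for img in pg_*.pnm; do
        [ -e "$img" ] || continue        # skip if the glob matched nothing
        base="${img%.pnm}"
        # -crop 2x1@ cuts the image into two equal tiles (left = 0, right = 1);
        # +repage resets the canvas offsets left over from cropping
        convert "$img" -crop 2x1@ +repage "halves/${base}_%d.pnm"
        for half in "halves/${base}_0.pnm" "halves/${base}_1.pnm"; do
            gocr "$half" >> "$out"
        done
    done
}
```

Run $ split_and_ocr text in the directory that holds the page images.  The halves go into a separate directory so that running the function again does not pick them up as new pages.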
