One is when the file was "printed" to PDF by a PDF creator, for example Microsoft Word. This case is much easier to handle. If poppler-utils is not installed (check with $ rpm -qa | grep poppler-utils), install it and then run pdftotext:
# yum install poppler-utils
$ pdftotext file.pdf
A file named file.txt will be created. Note that the extracted text usually does not keep the original formatting well.
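As a rough sketch of the step above, the two small shell functions below first check whether a PDF has any extractable text at all, and then run pdftotext with its -layout option, which tries to keep the physical layout of the page. The function names has_text_layer and pdf_to_text are made up for this example, and the "any letter or digit counts as text" test is only a heuristic.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the PDF seems to contain extractable text.
# Heuristic only: any letter or digit in the pdftotext output counts as text.
has_text_layer() {
    # "-" makes pdftotext write to stdout instead of creating a .txt file
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Hypothetical helper: extract text and try to keep the page layout.
# Writes file.txt next to file.pdf.
pdf_to_text() {
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found; install poppler-utils first" >&2
        return 1
    fi
    pdftotext -layout "$1" "${1%.pdf}.txt"
}

# Example use (commented out so that sourcing this file runs nothing):
# if has_text_layer file.pdf; then pdf_to_text file.pdf; else echo "needs OCR"; fi
```

If has_text_layer fails, the file is most likely a scan and needs the OCR route instead.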
The other situation is more difficult. If the PDF file is scanned, we need OCR to recognize the text. Be sure to scan in text mode (black and white) at 300 dpi. One way is shown below:
# yum install pdftk ImageMagick gocr   # if they are not installed
$ pdftk yourfile.pdf burst
$ for i in pg_*.pdf; do convert -density 300 "$i" "$i.pnm"; gocr "$i.pnm" >> text; done
If each scanned image contains two pages side by side, we also need to cut the images in two. I will add more scripts if I have time.
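For scans that hold two pages side by side, one possible way to do the cutting is ImageMagick's tile crop. The function below is only a sketch, not a tested script from this post: the name ocr_two_up_page is made up, it assumes an ImageMagick version that supports the "2x1@" crop geometry, and it assumes the left half should be read before the right half.

```shell
#!/bin/sh
# Hypothetical helper: OCR one page image that holds two pages side by side.
#   $1 = page image (for example pg_0001.png), $2 = text file to append to
ocr_two_up_page() {
    img=$1
    out=$2
    base=${img%.*}
    # Cut the image into two equal tiles: base-0.pnm (left), base-1.pnm (right)
    convert "$img" -crop 2x1@ +repage "$base-%d.pnm" || return 1
    # Assumption: the left half comes first in reading order
    for half in "$base-0.pnm" "$base-1.pnm"; do
        gocr "$half" >> "$out" || return 1
    done
}

# Example use after "pdftk yourfile.pdf burst" (commented out on purpose):
# for p in pg_*.pdf; do
#     convert -density 300 "$p" "${p%.pdf}.png"
#     ocr_two_up_page "${p%.pdf}.png" text
# done
```

For books read right to left, swap the order of the two halves in the inner loop.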