How to scan documents

Revision as of 03:33, 5 December 2010

Scan

I have access to a Brother scanner that has a document feed on it. It scans a multipage doc and puts the output into a PDF file on my server via FTP.

  1. Scan the odd pages, front to back, resulting in a single PDF file.
  2. Scan the even pages, back to front, resulting in a second PDF file.
  3. If you have it, you can use Adobe Acrobat to collate the files into a single PDF.
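For step 3, any tool that can split and concatenate PDFs will do (pdfseparate/pdfunite from the poppler-utils package mentioned below are one free option). The tricky part is getting the page order right. A sketch of computing the interleave order for a duplex scan, assuming the odd pages were scanned front to back and the even pages back to front:

```shell
# Sketch: compute the collation order for a duplex scan.
# Page i of the odd-pages file is document page 2i-1; page j of the
# even-pages file (scanned back to front) is document page 2*(n-j+1),
# so document page 2i comes from even-file page n-i+1.
n=3   # pages in each scanned file (a 6-page document)
order=""
for i in $(seq 1 "$n"); do
  odd_page=$i
  even_page=$((n - i + 1))
  order="${order}${order:+ }odd-$odd_page even-$even_page"
done
echo "$order"    # odd-1 even-3 odd-2 even-2 odd-3 even-1
```

The printed sequence is the order in which to pull pages from the two files when collating.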

I scan at 300 DPI grey scale, with the scanner doing the conversion to PDF and uploading to a server via FTP. 300 DPI is probably overkill for text, but line art looks good. I also note that in the manuals I just scanned I can see ghosts of what's printed on the other side of two-sided pages.

Clean

You can use unpaper to clean the scanned pages (it works on bitmap images, not PDFs directly) before OCR. It can remove the ghosts I mentioned above.
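A hedged sketch of a cleanup pass, assuming unpaper and pdftoppm (from poppler-utils) are installed; since unpaper wants bitmaps, the PDF is burst to PGM pages first. The file names here are examples, not required names:

```shell
# Sketch: burst the scan to grayscale PGM pages, then clean each page.
# Assumes pdftoppm (poppler-utils) and unpaper are installed; skips otherwise.
if command -v pdftoppm >/dev/null 2>&1 && command -v unpaper >/dev/null 2>&1; then
  pdftoppm -gray -r 300 scan.pdf page        # writes page-1.pgm, page-2.pgm, ...
  for p in page-*.pgm; do
    [ -e "$p" ] || continue
    unpaper "$p" "clean-$p"                  # deskew, despeckle, remove bleed-through
  done
  status=ran
else
  status=skipped
fi
echo "unpaper pass: $status"
```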


OCR
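A hedged sketch of this step using tesseract (one of the Ubuntu packages listed below); the page-*.pgm file names are hypothetical, carried over from the cleanup step:

```shell
# Sketch: run tesseract over each cleaned page image, producing page-N.txt.
# Assumes the tesseract command is installed; skips otherwise.
if command -v tesseract >/dev/null 2>&1; then
  for p in page-*.pgm; do
    [ -e "$p" ] || continue
    tesseract "$p" "${p%.pgm}"   # tesseract appends .txt to the output base name
  done
  status=ran
else
  status=skipped
fi
echo "ocr pass: $status"
```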

Compress

I am trying out the DjVu format; it seems like a good way to manage the scanned pages, and compression is very good. See http://djvu.org/

  1. Convert the scanned PDF documents to DjVu documents, one per page: mkdir f && pdf2djvu -i f frontpages.pdf
  2. Optionally perform any additional processing on the individual pages, such as image filtering or contrast enhancement.
  3. Perform OCR on the individual page files so they can be searched separately.
  4. Merge the page files into one DjVu bundle, making sure they end up in the right order. (QA!)
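The merge in step 4 can be done with djvm from DjVuLibre. A hedged sketch; the page file names are hypothetical, and pages are bundled in argument order, so check the ordering:

```shell
# Sketch: bundle single-page DjVu files into one document with djvm -c.
# Assumes djvm (DjVuLibre) is installed; skips otherwise. Pages end up in
# argument order, so sort the names carefully (QA!).
if command -v djvm >/dev/null 2>&1; then
  djvm -c manual.djvu p01.djvu p02.djvu p03.djvu
  status=ran
else
  status=skipped
fi
echo "bundle: $status"
```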

Ubuntu Notes

The relevant packages under Ubuntu are:

  • poppler-utils (PDF utilities)
  • psutils (Postscript utilities)
  • tesseract (OCR software)

Installing the gscan2pdf package pulled in various useful things, such as cuneiform (the OCR program that Canonical blessed as of 10.10) and djvu2pdf.

On the Mac I use the viewer djview-libre, which is also available for Linux and Windows.

See also

How to scan and OCR like a pro with open source tools (http://www.linux.com/archive/feature/138511) from the Linux.com site