How to scan documents: Difference between revisions

From Wildsong
Brian Wilson (talk | contribs)
(11 intermediate revisions by the same user not shown)
Earlier revision

I have a Brother scanner that has a document feed on it.
It scans a multipage doc and puts the output into a PDF file on my server via FTP.

# Scan the odd pages, front to back, resulting in a single PDF file.
# Scan the even pages, back to front, resulting in a second PDF file.
# Convert the 2 PDF documents to 2 PS documents. ''pdftops infile.pdf outfile.ps''
# Split the PS documents into separate files, one page per file
# Optionally perform any additional processing on the individual pages, such as image compression
# For WEB version
## Perform OCR on the individual page files so they can be searched separately
## Convert individual pages into PNG files for viewing
## Put all pages into a book viewer collection
# Merge the page files into one PS document
# Convert the merged document back into PDF document
# Perform OCR on the PDF doc

Notes:
Commands with '2' like 'pdf2ps' come with Ghostscript.
Commands with 'to' like 'pdftops' are from the poppler-utils package.
I am not sure if there are any advantages to using one or the other when there are equivalent commands (for example 'pdf2ps' versus 'pdftops').
The 'psselect' command used in the script below is from the psutils package.

== Supporting scripts ==

=== paginate.pl ===

I split the documents into separate pages with this perl script.

<pre>
#!/usr/bin/perl
#
# Separate a postscript file called file.ps into separate pages named file.NNN.ps
# -r will reverse output page order
#
use strict;
use warnings;

my $pages = 0;
my $rev = 0;

my $fname = shift;
if (defined $fname && $fname eq '-r') {
    $rev = 1;
    $fname = shift;
}
die "Usage: $0 [-r] file.ps\n" unless defined $fname;
print "Processing file $fname\n";
$fname =~ /(.*)\.ps$/ or die "$fname does not end in .ps\n";
my $base = $1;

# Find number of pages in doc
open(my $in, '<', $fname) or die "Cannot open $fname: $!\n";
while (<$in>) {
    if (/^%%Pages:\s*(\d+)/) {
        $pages = $1;
        last;
    }
}
close $in;

my $p = 0;
my $r = $pages;
while ($pages--) {
    $p++;

    # Zero-pad to three digits; this handles docs up to 999 pages
    my $p0 = sprintf('%03d', $rev ? $r : $p);
    my $cmd = "psselect -p$p $fname $base.$p0.ps";
    print "$cmd\n";
    system($cmd);

    $r--;
}
</pre>

Latest revision as of 06:06, 5 December 2010

Scan

I have access to a Brother scanner with a document feeder. It scans a multipage document and puts the output into a PDF file on my server via FTP.

  1. Scan the odd pages, front to back, resulting in a single PDF file.
  2. Scan the even pages, back to front, resulting in a second PDF file.
  3. If you have it, you can use Adobe Acrobat to collate the files into a single PDF.
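Without Acrobat, the collating order itself is easy to work out. The following is a minimal Python sketch of that order only, assuming the pages of each scan are already available as two lists; the function name `collate` is hypothetical, and no PDF library is involved.

```python
def collate(odd_pages, even_pages_reversed):
    """Interleave odd pages (front to back) with even pages scanned back to front."""
    even_pages = list(reversed(even_pages_reversed))
    if len(odd_pages) - len(even_pages) not in (0, 1):
        raise ValueError("expected as many even pages as odd pages, or one fewer")
    merged = []
    for i, odd in enumerate(odd_pages):
        merged.append(odd)
        if i < len(even_pages):
            merged.append(even_pages[i])
    return merged


# Six-page document: the first scan yields pages 1, 3, 5; the second yields 6, 4, 2.
print(collate([1, 3, 5], [6, 4, 2]))  # [1, 2, 3, 4, 5, 6]
```

The second list is reversed first because the even sides come out of the feeder last page first.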

I scan at 300 DPI greyscale, with the scanner converting to PDF and uploading to a server via FTP. 300 DPI is probably overkill for text, but line art looks good. In the manuals I just scanned, I can actually see ghosts of what's printed on the other side of two-sided pages.
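To get a feel for what the resolution costs, here is a rough Python sketch of the uncompressed size of one 8-bit greyscale page at a few resolutions. The US Letter page size (8.5 x 11 inches) is an assumption, as is the helper name `raw_scan_bytes`; the PDF the scanner actually produces is compressed and will be smaller.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# US Letter (8.5 x 11 in) is an assumption; the page size is not stated above.
for dpi in (150, 200, 300):
    size = raw_scan_bytes(8.5, 11, dpi)
    print(f"{dpi} DPI: {size / 1e6:.1f} MB per page before compression")
```

Doubling the resolution quadruples the pixel count, so 300 DPI carries four times the raw data of 150 DPI.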

OCR

Run gscan2pdf and you can perform all the steps needed to go from raw scan to polished PDF doc.

  1. Start it up.
  2. File->Import to load the raw PDF
  3. Tools->Threshold will help eliminate the ghosts.
  4. Tools->Cleanup to run 'unpaper' over the doc. This will clean up the pages and deskew them. The pages get cockeyed when they go through the ADF (automatic document feeder) and this straightens them up.
  5. Tools->OCR, then select an engine: GOCR, Tesseract, and Cuneiform are the choices.
  6. Workaround for the Save bug: save as a session, exit, restart, reload the session file, and save as PDF.
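The threshold step in the list above is what removes the ghosts: faint show-through is lighter than real ink, so anything lighter than a cutoff can be pushed to white. A minimal Python sketch of the idea follows; it is not gscan2pdf's own code, and the cutoff of 160 and the pixel values are made up for illustration.

```python
def threshold(page, cutoff=160):
    """Map each 8-bit grey pixel to black (0) or white (255)."""
    return [[0 if pixel < cutoff else 255 for pixel in row] for row in page]


# One tiny "page": dark text (30), faint show-through (190), paper (240).
page = [
    [240, 30, 190],
    [190, 240, 30],
]
print(threshold(page))  # [[255, 0, 255], [255, 255, 0]]
```

The show-through pixels end up the same white as the paper, while the text stays black.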

In my test, I ran Tesseract over the entire manual and watched the console window. I saw errors on some pages.

Doing the OCR creates a text version of each scanned page, which makes the PDF searchable.
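As an illustration of why per-page text makes a scan searchable, here is a small Python sketch that looks up a word across page texts. The sample strings and the function name `find_pages` are hypothetical; a PDF viewer does the equivalent against the text layer stored with each page.

```python
def find_pages(page_texts, term):
    """Return 1-based page numbers whose OCR text contains term, ignoring case."""
    needle = term.lower()
    return [n for n, text in enumerate(page_texts, start=1) if needle in text.lower()]


# Hypothetical OCR output, one string per scanned page.
pages = [
    "Chapter 1  Installation",
    "Connect the power cable",
    "Chapter 2  Installation checklist",
]
print(find_pages(pages, "installation"))  # [1, 3]
```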

Compress

I am trying out the DjVu format; it seems like a good way to manage the scanned pages. Compression is very good. See http://djvu.org/

DjVu is an alternative to PDF. It uses wavelet compression.
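For a rough idea of why wavelets compress scans well, here is one level of the Haar transform in Python. This is the simplest textbook wavelet, used only as an illustration; it is not the transform DjVu itself uses, and the pixel values are made up.

```python
def haar_step(samples):
    """One level of the Haar transform: pairwise averages, then pairwise differences."""
    if len(samples) % 2:
        raise ValueError("need an even number of samples")
    pairs = list(zip(samples[0::2], samples[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


# A row of pixels: flat paper (240) with a short dark stroke (30) in it.
row = [240, 240, 240, 30, 30, 30, 240, 240]
averages, details = haar_step(row)
print(averages)  # [240.0, 135.0, 30.0, 240.0]
print(details)   # [0.0, 105.0, 0.0, 0.0]
```

Most of the detail values are zero wherever the page is flat, and long runs of zeros are cheap to store.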

Ubuntu Notes

Installing the gscan2pdf package pulled in sundry and various useful things such as unpaper (to clean up scans), cuneiform (the OCR program that Canonical has blessed as of Ubuntu 10.10), and djvu2pdf.

I installed the tesseract OCR program as well.

On the Mac I use the viewer djview-libre which is also available for Linux and Windows.

See also

How to scan and OCR like a pro with open source tools (http://www.linux.com/archive/feature/138511) from the Linux.com site