How to scan documents

Revision as of 00:27, 8 November 2009

I have a Brother scanner with a document feeder. It scans a multipage document and uploads the output as a PDF file to my server via FTP. To scan a double-sided document, I use the following steps.

Scan the odd pages, front to back, resulting in a single PDF file.
Scan the even pages, back to front, resulting in a second PDF file.
Convert the two PDF documents to two PostScript (PS) documents: pdftops infile.pdf outfile.ps
Split the PS documents into separate files, one page per file (see paginate.pl below; use -r on the even-page file, which was scanned in reverse order).
Optionally perform any additional processing on the individual pages, such as image compression.
For the web version:
1. Perform OCR on the individual page files so they can be searched separately.
2. Convert the individual pages into PNG files for viewing.
3. Put all pages into a book viewer collection.
Merge the page files into one PS document, alternating odd and even pages.
Convert the merged document back into a PDF document.
Perform OCR on the PDF document.
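
The merge step needs the single-page files in reading order, alternating between the two scans. The following is a minimal sketch of how that order could be built, assuming the page files were produced by paginate.pl (with -r for the even scan) and follow its file.NNN.ps naming. The sub name merge_order and the base names "odd" and "even" are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the merge order for two sets of single-page files.
# $odd_base / $even_base are the base names of the two scans
# (hypothetical examples: "odd" and "even"); $n_odd / $n_even are
# the number of pages in each scan.
sub merge_order {
    my ($odd_base, $n_odd, $even_base, $n_even) = @_;
    my @files;
    for my $i (1 .. $n_odd) {
        # Odd page first, then the even page from the back of the same sheet
        push @files, sprintf('%s.%03d.ps', $odd_base, $i);
        # Skip the even page if the two scans have different page counts
        push @files, sprintf('%s.%03d.ps', $even_base, $i) if $i <= $n_even;
    }
    return @files;
}

# Example: a three-sheet, double-sided document
print join(' ', merge_order('odd', 3, 'even', 3)), "\n";
```

The printed list can then be passed, in that order, to whichever tool is used to merge the pages.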

Notes: 'pdftops' is from the poppler-utils package. 'pdf2ps' is a wrapper script that ships with Ghostscript; it is not part of psutils, which provides page tools such as 'psselect'. I am not sure whether there is any advantage to using one over the other when equivalent commands exist (for example 'pdf2ps' versus 'pdftops').

Supporting scripts

paginate.pl

I split the documents into separate pages with this Perl script, which calls psselect (from the psutils package) once per page.

#!/usr/bin/perl
#
# Separate a PostScript file called file.ps into separate pages named file.NNN.ps,
# where NNN is the zero-padded page number.
# -r will reverse output page order
#

$pages = 0;
$rev = 0;

$fname = shift;
if ($fname eq '-r') {
    $rev = 1;
    $fname = shift;
}
print "Processing file $fname\n";
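# The input name must end in .ps; the part before it becomes the output base name
$fname =~ /(.*)\.ps$/ || die "Usage: paginate.pl [-r] file.ps\n";
$base = $1;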

# Find number of pages in doc
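open(IN, '<', $fname) || die "Cannot open $fname: $!\n";
while (<IN>){
    # Match digits only, so a "%%Pages: (atend)" header line is skipped
    if (/^%%Pages:\s*(\d+)/) {
	$pages = $1;
	last;    # "break" is not a Perl loop control; "last" exits the loop
    }
}
close IN;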

$p = 0;
$r = $pages;
while ($pages--){
    $p++;

    # Zero-pad the page number to three digits (handles docs up to 999 pages)
    $p0 = sprintf('%03d', ($rev) ? $r : $p);
    $cmd = "psselect -p $p $fname $base.$p0.ps";
    print $cmd;
    system($cmd);
    print "\n";

    $r--;
}
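
The page-count lookup is the part of the script most likely to trip on unusual input, since a PostScript header may say "%%Pages: (atend)" and defer the real number to the trailer. Here is a minimal sketch of that lookup as a standalone sub, assuming the first numeric "%%Pages:" comment in the file is the one wanted. The name page_count is hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the page count from the first numeric "%%Pages:" comment,
# or undef if the file has none. A "%%Pages: (atend)" line does not
# match the digit pattern, so the value in the trailer is used instead.
sub page_count {
    my ($fname) = @_;
    open(my $in, '<', $fname) or die "Cannot open $fname: $!\n";
    my $pages;
    while (my $line = <$in>) {
        if ($line =~ /^%%Pages:\s*(\d+)/) {
            $pages = $1;
            last;    # stop reading at the first numeric match
        }
    }
    close $in;
    return $pages;
}

# Print the count for a file named on the command line, if any
if (@ARGV) {
    my $n = page_count($ARGV[0]);
    print defined $n ? "$n\n" : "no page count found\n";
}
```

A file that embeds other PostScript documents can contain more than one "%%Pages:" comment, in which case this sketch may pick up the wrong one.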

