How to scan documents
From Wildsong
Brian Wilson
Revision as of 00:26, 8 November 2009
I have a Brother scanner with a document feeder. It scans a multipage document and puts the output into a PDF file on my server via FTP. For double-sided originals I use this workflow:
- Scan the odd pages, front to back, resulting in a single PDF file.
- Scan the even pages, back to front, resulting in a second PDF file.
- Convert the two PDF documents to two PostScript (PS) documents: pdftops infile.pdf outfile.ps
- Split the PS documents into separate files, one page per file
- Optionally perform any additional processing on the individual pages, such as image compression
- For the web version:
  - Perform OCR on the individual page files so they can be searched separately
  - Convert individual pages into PNG files for viewing
  - Put all pages into a book viewer collection
- Merge the page files into one PS document
- Convert the merged document back into PDF document
- Perform OCR on the PDF doc
Notes: Commands with '2' in the name, like 'pdf2ps', come with Ghostscript. Commands with 'to', like 'pdftops', are from the poppler-utils package. The 'psselect' command used by paginate.pl is from the psutils package. I am not sure if there are any advantages to using one or the other when there are equivalent commands (for example 'pdf2ps' versus 'pdftops').
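The conversion and merge steps above can be sketched as a couple of shell functions. This is only a sketch, assuming pdftops (poppler-utils), psmerge (psutils, if your version ships it) and ps2pdf (Ghostscript) are installed. The function names, the DRY_RUN switch and the file names in the usage note are made up for illustration.

```shell
# Sketch only: run, pdf_to_ps, merge_to_pdf and DRY_RUN are hypothetical names.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert one scanned PDF to PostScript: foo.pdf becomes foo.ps.
pdf_to_ps() {
    run pdftops "$1" "${1%.pdf}.ps"
}

# Merge page files into NAME.ps, then convert that back to NAME.pdf.
# Usage: merge_to_pdf NAME page1.ps page2.ps ...
merge_to_pdf() {
    out=$1
    shift
    run psmerge "-o$out.ps" "$@"
    run ps2pdf "$out.ps" "$out.pdf"
}
```

For example, with DRY_RUN=1 set, pdf_to_ps odd.pdf prints the pdftops command instead of running it, which is a cheap way to check the file names before converting a long scan.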
Supporting scripts
paginate.pl
I split the documents into separate pages with this Perl script.
<pre>
#!/usr/bin/perl
#
# Separate a postscript file called file.ps into separate pages named file.NNN.ps
# Usage: paginate.pl [-r] file.ps   (-r numbers the output files in reverse order)
#
use strict;
use warnings;

my $pages = 0;
my $rev = 0;
my $fname = shift;
if (defined $fname && $fname eq '-r') {
    $rev = 1;
    $fname = shift;
}
die "Usage: $0 [-r] file.ps\n" unless defined $fname;
print "Processing file $fname\n";
my ($base) = $fname =~ /(.+)\.ps$/
    or die "$fname: name must end in .ps\n";

# Find number of pages in doc. Only a numeric count is accepted, so a
# header line like "%%Pages: (atend)" is skipped.
open(my $in, '<', $fname) or die "Cannot open $fname: $!\n";
while (<$in>) {
    if (/^%%Pages:\s*(\d+)/) {
        $pages = $1;
        last;    # Perl exits a loop with 'last', not 'break'
    }
}
close $in;

my $r = $pages;
for my $p (1 .. $pages) {
    # Zero-pad to three digits; this handles docs up to 999 pages
    my $p0 = sprintf('%03d', $rev ? $r : $p);
    my @cmd = ('psselect', '-p', $p, $fname, "$base.$p0.ps");
    print "@cmd\n";
    system(@cmd);
    $r--;
}
</pre>
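To put the two scans back into reading order, the page files can be interleaved by number. A minimal sketch, assuming the two PostScript files are called odd.ps and even.ps (hypothetical names) and both scans contain the same number of pages. The even scan was fed back to front, so it is split with -r, which numbers its page files in reading order. The interleave function is made up for illustration.

```shell
# Sketch only: interleave is a hypothetical helper, not part of paginate.pl.
#
# interleave ODD_BASE EVEN_BASE COUNT
# Print the page files in reading order, one per line:
#   odd.001.ps even.001.ps odd.002.ps even.002.ps ...
interleave() {
    i=1
    while [ "$i" -le "$3" ]; do
        n=$(printf '%03d' "$i")
        echo "$1.$n.ps"
        echo "$2.$n.ps"
        i=$((i + 1))
    done
}
```

After running paginate.pl odd.ps and paginate.pl -r even.ps, the output of interleave odd even 3 lists six page files in the order they should be merged. If the document has an odd number of pages, the last even file is just the blank back of the final sheet.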