How to scan documents: Difference between revisions

From Wildsong
Brian Wilson (talk | contribs)
Brian Wilson (talk | contribs)
mNo edit summary
Line 1: Line 1:
I have a Brother scanner that has a document feed on it.
It scans a multipage doc and puts the output into a PDF file on my server via FTP.
I am trying out the DjVu format; it seems like a good way to manage the scanned pages. Compression is very good.


# Scan the odd pages, front to back, resulting in a single PDF file.
# Scan the even pages, back to front, resulting in a second PDF file.
# Convert the 2 PDF documents to 2 PS documents. ''pdftops infile.pdf outfile.ps''
# Convert the 2 PDF documents to DJVU documents, 1 per page. ''mkdir f && pdf2djvu -i f frontpages.pdf''
# Split the PS documents into separate files, one page per file
# Optionally perform any additional processing on the individual pages, such as image filtering or contrast enhancement.
# Optionally perform any additional processing on the individual pages, such as image compression
# Perform OCR on the individual page files so they can be searched separately
# For WEB version
# Merge the page files into one DJVU bundle, making sure they get into the right order. (QA!)
## Perform OCR on the individual page files so they can be searched separately
## Convert individual pages into PNG files for viewing
## Put all pages into a book viewer collection
# Merge the page files into one PS document
# Convert the merged document back into PDF document
# Perform OCR on the PDF doc
 
Notes:
Commands with '2' like 'pdf2ps' come with Ghostscript; 'psselect', used in the script below, is from the psutils package.
Commands with 'to' like 'pdftops' are from the poppler-utils package.
I am not sure if there are any advantages to using one or the other when there are equivalent commands (for example 'pdf2ps' versus 'pdftops').
 
== Supporting scripts ==
 
=== paginate.pl ===
 
I split the documents into separate pages with this Perl script.
 
<pre>
#!/usr/bin/perl
#
# Separate a postscript file called file.ps into separate pages named file.N.ps
# -r will reverse output page order
#
 
$pages = 0;
$rev = 0;
 
$fname = shift;
if ($fname eq '-r') {
    $rev = 1;
    $fname = shift;
}
print "Processing file $fname\n";
$fname =~ /(.*)\.ps$/;
$base = $1 || die;
 
# Find number of pages in doc
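open(IN, '<', $fname) || die "Cannot open $fname: $!";
while (<IN>) {
    # Match digits only, so a header line "%%Pages: (atend)" is skipped
    if (/^%%Pages: (\d+)/) {
        $pages = $1;
        last;    # Perl leaves a loop with "last", not "break"
    }
}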
close IN;


$p = 0;
$r = $pages;
while ($pages--){
    $p++;


    # This handles docs up to 999 pages
    # Zero-pad the output page number to three digits
    $p0 = sprintf("%03d", ($rev) ? $r : $p);
    $cmd = "psselect -p $p $fname $base.$p0.ps";
    print $cmd;
    system($cmd);
    print "\n";


    $r--;
}
</pre>

Revision as of 03:02, 8 November 2009

I have a Brother scanner that has a document feed on it. It scans a multipage doc and puts the output into a PDF file on my server via FTP.

I am trying out the DjVu format; it seems like a good way to manage the scanned pages. Compression is very good.

  1. Scan the odd pages, front to back, resulting in a single PDF file.
  2. Scan the even pages, back to front, resulting in a second PDF file.
  3. Convert the 2 PDF documents to DJVU documents, 1 per page: mkdir f && pdf2djvu -i f frontpages.pdf
  4. Optionally perform any additional processing on the individual pages, such as image filtering or contrast enhancement.
  5. Perform OCR on the individual page files so they can be searched separately
  6. Merge the page files into one DJVU bundle, making sure they get into the right order. (QA!)
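
The page order for step 6 can be sketched as a small shell function. This is only a sketch under assumptions: the directory names f and b, the page file pattern pNNNN.djvu, and the function name interleave_pages are placeholders I made up for illustration, so check what your pdf2djvu version actually writes into the directory. It also assumes both scans contain the same number of pages.

```shell
#!/bin/sh
# interleave_pages FRONT_DIR BACK_DIR COUNT
# Print page file names in reading order, one per line.
# Hypothetical layout: each directory holds COUNT files named
# p0001.djvu, p0002.djvu, ... in the order they were scanned.
interleave_pages() {
    front_dir=$1
    back_dir=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        # Odd pages were scanned front to back, so file i is the i-th sheet.
        printf '%s/p%04d.djvu\n' "$front_dir" "$i"
        # Even pages were scanned back to front, so count from the end.
        printf '%s/p%04d.djvu\n' "$back_dir" "$((count - i + 1))"
        i=$((i + 1))
    done
}
```

If the printed list looks right, it can be handed to djvm from DjVuLibre to build the bundle, for example `djvm -c book.djvu $(interleave_pages f b 12)`, and the result checked by eye in a viewer.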

Notes

Packages under Ubuntu are poppler-utils, psutils. Installing the gscan2pdf package pulled in sundry and various useful things such as tesseract and djvu2pdf.

On the Mac I use the viewer djview-libre which is also available for Linux and Windows.