No more paper

From Wildsong
Brian Wilson (talk | contribs)
Latest revision as of 17:15, 17 March 2016

I just decided I am really tired of filing all the bits of paper that collect in piles around here every year around 'tax time'. Also, I don't want to own a file cabinet anymore. It's heavy and bulky.

Capturing documents

First I need to get the docs into the computer. This includes anything paper that I want to file. Receipts, bills, articles torn from magazines, notes to myself.

The computer, aka "the doc server", needs bullet-proof and secure backups, so I need some way to encrypt the docs and ship them off somewhere. Only critical docs need this treatment. I think I can afford slightly less bullet-proof backups for my photo and music collections.

Currently I am using cloud storage for this.
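A minimal sketch of the "encrypt and ship" step, assuming GnuPG and tar are installed. The function names, the archive naming scheme, and the idea of writing into a locally synced cloud folder are illustrative assumptions, not the setup actually used here. gpg will prompt for a passphrase.

```sh
# Name for an archive, e.g. docs-2016-03-17.tar.gz.gpg (defaults to today's date).
archive_name() {
    printf 'docs-%s.tar.gz.gpg\n' "${1:-$(date +%Y-%m-%d)}"
}

# Tar up a directory and encrypt it symmetrically into dest_dir.
# Usage: encrypt_docs /path/to/critical-docs /path/to/synced-folder
encrypt_docs() {
    src="$1"
    dest_dir="$2"
    out="$dest_dir/$(archive_name)"
    tar -czf - -C "$(dirname "$src")" "$(basename "$src")" |
        gpg --symmetric --cipher-algo AES256 --output "$out"
    printf '%s\n' "$out"
}
```

Only the encrypted archive leaves the machine, so the storage provider never sees the plain documents.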

Digital docs

I can already access many personal docs online.

I have to download them in a timely way, though; the providers don't all archive everything forever. Plus I want them all accessible in one system.

I need to program the doc server to grab every relevant personal doc and archive it. This means bank and credit card records, and bills from various entities like the power company and the phone company.

I'd like to stash email in this server, too.

I also own a substantial number of e-books and e-magazines that I'd like indexed.

Paper to image

Flatbed scanner - I have an HP. I need to drag it out of the closet.

ADF scanner 2013-May - a friend gave me a Canon Pixma MX330 which has an "ADF" = automatic document feeder. AWESOME. It probably ran out of ink so the former owner dumped it. Inkjet printers are a scourge. I only need the ADF part right now. (Update: Yes, I bought ink and it works fine as a printer too.)

Digital camera - it's been suggested that you can use a camera and tripod. My eyesight is not that good. It might be useful to consider its ability to capture things other than paper docs. Also I'd like to store photos.

Image to words

Comparing OCR software - conclusion, tesseract is better than gocr.

Setting up tesseract in xsane

Would be great but requires an external script at the moment.

Scanner tips

sudo apt-get install xsane tesseract-ocr

XSANE scanning

Using xsane is tedious: too many things to remember and too many things to hack around. It's fun for experimenting or doing special projects. xsane scans each page to a PNM file, sequentially numbered. To save the scanned images in a single file, select 'Multipage' (Ctrl+M).

  • source = Automatic Document Feed
  • set Number of pages
  • Gray
  • 150 DPI

Command line scanning

Find out where the scanner is hiding:
% scanimage -L
device `hp3900:libusb:005:010' is a Hewlett-Packard Scanjet 3970 flatbed scanner

Get the options for this scanner
% scanimage -d hp3900:libusb:005:010 --help

% scanimage -L
device `pixma:04A91737_22F601' is a CANON Canon PIXMA MX330 multi-function peripheral

Use a 150 dpi resolution. Grayscale, not color. Write the output to a TIFF file. Since there is only one scanner I don't have to specify the device.

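  • 150 dpi, 8.5 x 11 in: 1275 x 1650 pixels, or about 216 x 279 mm. The -x and -y options take millimeters, so the 500 and 640 used in the examples here are larger than the page; most backends appear to clamp them to the maximum scan area.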
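The page-size arithmetic can be checked with a small helper. This is only a sketch; page_dims is a made-up name, and it assumes awk is available (it is on any normal Linux install).

```sh
# Convert a paper size in inches to pixels at a given resolution, and to millimeters.
# Usage: page_dims WIDTH_IN HEIGHT_IN DPI
page_dims() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        # pixels = inches * dots per inch, rounded to the nearest whole pixel
        printf "pixels: %d x %d\n", w * dpi + 0.5, h * dpi + 0.5
        # 1 inch = 25.4 mm
        printf "mm: %.1f x %.1f\n", w * 25.4, h * 25.4
    }'
}
```

For US letter at 150 dpi, "page_dims 8.5 11 150" prints 1275 x 1650 pixels and 215.9 x 279.4 mm.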

Flatbed mode example

scanimage --progress -x 500 -y 640 --source Flatbed --resolution 150 --mode Gray --format=tiff  > page.tif

ADF mode example

scanimage --progress -x 500 -y 640 --source 'Automatic Document Feeder' --resolution 150 --mode Gray --format=tiff  > out.tif

Multipage ADF example: will produce 9 files numbered starting with out10.tif

scanimage --progress -x 500 -y 640 --source 'Automatic Document Feeder' --resolution 150 --mode Gray --format=tiff --batch-start=10 --batch-count=9

Compression for tiff

tiffcp -c zip name.tif name.z.tif
rm name.tif
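After a multipage ADF run there is a whole batch of TIFFs to compress, so a loop helps. A sketch, assuming the files sit in one directory; compress_tiffs and the TIFFCP override variable are illustrative names, not part of libtiff.

```sh
# Compress every .tif in a directory to NAME.z.tif with tiffcp,
# deleting the original only when compression succeeded.
# Usage: compress_tiffs [DIR]
compress_tiffs() {
    dir="${1:-.}"
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue                  # glob matched nothing
        case "$f" in *.z.tif) continue ;; esac   # already compressed
        z="${f%.tif}.z.tif"
        if ${TIFFCP:-tiffcp} -c zip "$f" "$z"; then
            rm -- "$f"
        else
            rm -f -- "$z"                        # drop any partial output
            echo "could not compress $f" >&2
        fi
    done
}
```

Because the original is removed only on success, a failed run leaves the uncompressed scan in place.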

Linux image viewers

View the result; pick an image viewer.

sudo apt-get install gimp paul mirage imview gimageview

Gimp -- works but is on the heavy side
paul -- forget it
mirage -- ok
imview -- does not scale image to fit screen
gimageview -- (run 'gimv') controls not intuitive
gwenview -- the KDE viewer

OCR

"tesseract billname.pnm billname" will generate a text file "billname.txt"

To OCR a PDF, first convert it to an image, for example with ImageMagick:

convert doc.pdf doc.pnm
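To OCR a whole batch of scans instead of one file at a time, something like the following would do. It is a sketch: ocr_pages and the TESSERACT override variable are made-up names, and it assumes the installed tesseract can read TIFF files directly.

```sh
# Run tesseract on every .tif in a directory, writing NAME.txt next to NAME.tif.
# Pages that already have a .txt file are skipped.
# Usage: ocr_pages [DIR]
ocr_pages() {
    dir="${1:-.}"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue            # glob matched nothing
        base="${img%.tif}"
        [ -e "$base.txt" ] && continue       # already OCRed
        ${TESSERACT:-tesseract} "$img" "$base" ||
            echo "OCR failed for $img" >&2
    done
}
```

Skipping pages that already have text makes it safe to rerun after adding new scans.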

Photographic images

(and movies, I suppose)

Thinking along the same lines, I'd also like to store...

Audio data

Let's get rid of the CD collection, too.

Update 12/2008: My music collection is now on my hard drive.

Storing documents

Ideally I want all my documents in a pile, just like I do with paper. Then I want to be able to search through the pile. The difference is that I want to be able to do the searching MUCH MUCH FASTER.
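Before any database or indexer exists, the crudest way to search the pile is plain grep over the OCR text. This is a baseline sketch only, assuming each document has a .txt file beside it; search_pile is a made-up name, and a linear scan like this is not the fast search I am after.

```sh
# List the text files in the pile that mention a keyword (case-insensitive).
# Usage: search_pile KEYWORD [DIR]
search_pile() {
    keyword="$1"
    pile="${2:-.}"
    find "$pile" -type f -name '*.txt' -exec grep -i -l -e "$keyword" {} +
}
```

For example, "search_pile electric ~/docs" lists every OCRed document that mentions "electric".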

So I want at least the metadata stored in a SQL database. The docs themselves can be stored in BLOBs or files. I don't care as long as I can still get at them in 10 or 20 years when I need them. I think files would be best.

See Document File Formats.

Indexing documents

Fine, so the metadata can live in a database. I need to decide what the best way to access the data is.

This journey started when I ran across the announcement for a release of phpMyArchive, so I'd better check it out. This project has some very good ideas about data storage and indexing.

I remember evaluating Swish-E so I will look at it again too. Swish-E indexes text docs.

I also played with MnoGoSearch for a while. It indexes documents and puts the index into a database. For articles it could go like this: Scan article. Attach title to it. OCR it. Then run indexer on it.
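The scan, title, OCR sequence could be strung together roughly like this. It is a sketch under assumptions: file_article, the .title sidecar file, and the SCANIMAGE and TESSERACT override variables are all hypothetical, and the indexer step is left as a comment because I have not settled on a tool.

```sh
# Scan one page from the flatbed, record its title, and OCR it.
# Usage: file_article "Title of the article" /path/to/output-base
file_article() {
    title="$1"
    base="$2"        # output path without extension (hypothetical layout)
    ${SCANIMAGE:-scanimage} --source Flatbed --resolution 150 --mode Gray \
        --format=tiff > "$base.tif" || return 1
    printf '%s\n' "$title" > "$base.title"
    ${TESSERACT:-tesseract} "$base.tif" "$base" || return 1
    # An indexer run (Swish-E, MnoGoSearch, ...) would go here.
}
```

Each article ends up as three files sharing one base name: the image, the title, and the OCR text.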