No more paper: Difference between revisions
Brian Wilson (talk | contribs) mNo edit summary |
Revision as of 08:02, 8 January 2007
I just decided I am really tired of filing all the bits of paper that collect in piles around here every year around 'tax time'. Also, I don't want to own a file cabinet anymore. It's heavy and bulky.
Capturing documents
First I need to get the docs into the computer. This includes anything paper that I want to file. Receipts, bills, articles torn from magazines, notes to myself.
The computer, aka "the doc server", needs bullet-proof and secure backups, so I need some way to encrypt the docs and ship them off somewhere. Only critical docs need this treatment. I think I can afford slightly less bullet-proof backups for my photo and music collections.
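A minimal sketch of the "encrypt and ship off" step, assuming the critical docs live under one directory and that GnuPG is installed. The function names and the choice of a passphrase-encrypted tarball are my assumptions for illustration, not a settled design. The code only builds the gpg command; running it and uploading the result are left out.

```python
import tarfile
from pathlib import Path


def bundle_critical_docs(doc_dir: Path, archive_path: Path) -> Path:
    """Pack every file under doc_dir into one gzip-compressed tar archive.

    archive_path should live outside doc_dir so the archive does not
    try to include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(doc_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to doc_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(doc_dir).as_posix())
    return archive_path


def gpg_encrypt_command(archive_path: Path) -> list[str]:
    """Build (but do not run) a gpg command for passphrase-based encryption."""
    encrypted = archive_path.with_name(archive_path.name + ".gpg")
    return [
        "gpg", "--symmetric", "--cipher-algo", "AES256",
        "--output", str(encrypted), str(archive_path),
    ]
```

To actually encrypt, the command could be passed to `subprocess.run(..., check=True)`, and the resulting `.gpg` file is what would get shipped off-site.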
Digital docs
I can already access many personal docs online.
I have to do this in a timely way, though. The providers don't all archive everything forever. Plus I want all the docs accessible in one system.
I need to program the doc server to grab every relevant personal doc and archive it. This means bank and credit card records, and bills from various entities like the power company and the phone company.
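How each document gets downloaded depends entirely on the bank or utility, so I can't sketch that part. What can be sketched is the archiving half: once a statement has been fetched, file it under its content hash so that grabbing the same statement twice does not create duplicates. The names `archive_download` and `store` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_download(src: Path, store: Path) -> tuple[Path, bool]:
    """Copy src into store under its hash; return (stored path, was_new)."""
    store.mkdir(parents=True, exist_ok=True)
    target = store / (sha256_of(src) + src.suffix.lower())
    if target.exists():
        # Identical content is already archived; nothing to do.
        return target, False
    shutil.copy2(src, target)
    return target, True
```

A scheduled job could call `archive_download` on everything in a downloads folder and only record metadata for the files where `was_new` is true.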
I'd like to stash email in this server, too.
I also own a substantial number of e-books and e-magazines that I'd like indexed.
Paper to image
Flatbed scanner - Yep, I have one of these somewhere. I need to drag it out of the closet. If things go well I will probably want to trade it in for one with a feeder on it.
Digital camera - It's been suggested that you can use a camera and tripod. My eyesight is not that good. It might be useful to consider its ability to capture things other than paper docs. Also I'd like to store photos.
Photographic images
(and movies, I suppose)
Thinking along the same lines, I'd also like to store...
Audio data
Let's get rid of the CD collection, too.
Image to words
OCR via Tesseract. My idea is not to make a perfectly readable copy of the original but just to be able to grab keywords for indexing.
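A rough sketch of the "keywords, not a perfect copy" idea. It assumes the `tesseract` command-line tool is on the PATH and writes its text to standard output when the output base is given as `stdout`. The stopword list and the keyword rule (lowercase words of three or more letters, ranked by frequency) are my own assumptions and would need tuning on real scans.

```python
import re
import subprocess
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "from", "with", "that", "this", "are", "was"}


def ocr_text(image_path: str) -> str:
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def keywords(text: str, limit: int = 10) -> list[str]:
    """Return the most frequent non-stopword words; OCR noise is tolerated."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]
```

Usage would be along the lines of `keywords(ocr_text("scan-0001.png"))`, where the file name is only a placeholder.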
Storing documents
I don't want things to be stored in a hierarchy or a taxonomy or anything ending in the letter 'y'. I want it all in a pile, just like I do with paper. Then I want to be able to search through the pile. The difference is that I want to be able to do the searching MUCH MUCH FASTER.
So I want at least the metadata stored in a SQL database. The docs themselves can be stored in BLOBs or files. I don't care as long as I can still get at them in 10 or 20 years when I need them. I think files would be best. See Document File Formats.
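One way this could look, as a sketch only: the files stay on disk and a small SQLite catalog holds a row of metadata per file. SQLite, the table name `documents`, and the columns are assumptions I picked for illustration; any SQL database would do.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL UNIQUE,               -- where the file lives on disk
    title     TEXT NOT NULL,
    added_on  TEXT NOT NULL DEFAULT (date('now')) -- ISO date, filled in by SQLite
);
"""


def open_catalog(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata catalog."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_document(conn: sqlite3.Connection, path: str, title: str) -> int:
    """Record one file; the file itself is not copied into the database."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO documents (path, title) VALUES (?, ?)", (path, title)
        )
    return cursor.lastrowid


def find_by_title(conn: sqlite3.Connection, fragment: str) -> list[tuple[str, str]]:
    """Return (path, title) rows whose title contains the fragment."""
    rows = conn.execute(
        "SELECT path, title FROM documents WHERE title LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return rows.fetchall()
```

Keeping only paths in the database means the pile of files stays readable with ordinary tools even if the database is lost; the catalog can be rebuilt from the files.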
Indexing documents
Fine, so the metadata can live in a database. I need to decide what the best way to access the data is.
This journey started when I ran across the announcement for a release of phpMyArchive, so I'd better check it out. This project has some very good ideas about data storage and indexing.
I remember evaluating Swish-E so I will look at it again too. Swish-E indexes text docs.
I also played with MnoGoSearch for a while. It indexes documents and puts the index into a database. For articles it could go like this: Scan article. Attach title to it. OCR it. Then run indexer on it.
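The last step of that workflow (run an indexer over the title plus the OCR text, with the index kept in a database) can be sketched as a small inverted index. This is not how MnoGoSearch works internally; it is a stand-in written against SQLite, and the table names `articles` and `postings` are hypothetical.

```python
import re
import sqlite3

WORD = re.compile(r"[a-z0-9]{3,}")


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create the two tables: one row per article, one row per (word, article)."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY, title TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS postings (
            word TEXT NOT NULL,
            article_id INTEGER NOT NULL REFERENCES articles(id),
            PRIMARY KEY (word, article_id)
        );
    """)
    return conn


def index_article(conn: sqlite3.Connection, title: str, ocr_text: str) -> int:
    """Attach the title, then index every distinct word of title + OCR text."""
    words = set(WORD.findall((title + " " + ocr_text).lower()))
    with conn:
        article_id = conn.execute(
            "INSERT INTO articles (title) VALUES (?)", (title,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO postings (word, article_id) VALUES (?, ?)",
            [(word, article_id) for word in words],
        )
    return article_id


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return titles of articles that contain every word in the query."""
    terms = sorted(set(WORD.findall(query.lower())))
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    rows = conn.execute(
        f"""SELECT a.title FROM articles a
            JOIN postings p ON p.article_id = a.id
            WHERE p.word IN ({placeholders})
            GROUP BY a.id
            HAVING COUNT(DISTINCT p.word) = ?
            ORDER BY a.id""",
        (*terms, len(terms)),
    )
    return [title for (title,) in rows]
```

Because the lookup goes through the primary key on `postings`, searching stays fast as the pile grows, which is the whole point of not filing things in a hierarchy.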