Epson Perfection V700, scanning typewritten text


browjo
Nov 17, 2011, 07:41 PM
We have a zillion historical dockets that need entering into a very simple Excel spreadsheet.

All the dockets are 50 to 75 years old. They have some "foxing" but are otherwise very well archived.

We could go on simply typing directly into Excel, but at the current rate it will take four of us at least 20 to 30 years to finish!

We have access to an Epson Perfection V700 Photo flatbed scanner. Currently it is used with Adobe Professional to convert to PDF, but we can access a licensed copy of Omnipage software.

The data is quite straightforward:

- Date commenced (eight digits, typewritten)
- Author (typewritten)
- Subject matter (typewritten)
- Location (typewritten)
- Docket number (four digits, handwritten)
- Date finalised (eight digits, handwritten)
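
Since two of these fields have a fixed shape (eight digits, four digits), whatever comes out of OCR or manual keying can be checked automatically before it reaches the spreadsheet. A minimal sketch in Python follows. The function names are hypothetical, and the day-month-year digit order is an assumption, since the thread does not say how the eight digits are laid out.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_eight_digit_date(text: str, fmt: str = "%d%m%Y") -> Optional[date]:
    """Return a date if `text` holds exactly eight digits forming a valid date.

    `fmt` assumes day-month-year order; change it if the dockets differ.
    Returns None for anything else, so the entry can be flagged for review.
    """
    digits = re.sub(r"\D", "", text)  # drop separators and stray marks
    if len(digits) != 8:
        return None
    try:
        return datetime.strptime(digits, fmt).date()
    except ValueError:  # e.g. 31 February
        return None


def parse_docket_number(text: str) -> Optional[str]:
    """Return the four-digit docket number as a string, or None if malformed."""
    digits = re.sub(r"\D", "", text)
    return digits if len(digits) == 4 else None
```

Anything that comes back as `None` would go onto a short list for a person to re-read against the original docket, which is likely to matter most for the handwritten fields.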


Does anyone have any ideas how we can tweak the scanner to improve accuracy and efficiency, please?

- Is PDF converter software better to use than OCR here?

- Will tweaking the contrast, resolution, etc., and using overlays decrease the error rate?

- Is Omnipage our best option?

John B

cdad
Nov 17, 2011, 08:20 PM
I'm going to step in with this note. Are you sure the paper can withstand the process of being put under light like that? If it can, then you're OK.

A question I have is: at what color depth are you scanning the documents? If they are black print on white paper, then choose a low-bit black/white scan. That way there is no confusion (or less confusion) for your OCR software.
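
To illustrate what a low-bit black/white scan does to a page, here is a rough sketch in Python that reduces grayscale pixel values (0 = black, 255 = white) to 1-bit using Otsu's method to pick the cut-off. This is one common thresholding technique, not necessarily what the Epson driver or any OCR package does internally, and the function names are made up for the example.

```python
from typing import List


def otsu_threshold(pixels: List[List[int]]) -> int:
    """Pick the 0-255 cut-off that best separates dark ink from light paper."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]          # pixels at or below t count as "dark"
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0:
            continue
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor)
        var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels: List[List[int]]) -> List[List[int]]:
    """Return a 1-bit image: 0 for ink, 1 for paper."""
    t = otsu_threshold(pixels)
    return [[0 if value <= t else 1 for value in row] for row in pixels]
```

On a page where the ink is much darker than both the paper and any brownish spots, the spots should end up on the paper side of the cut-off, which is the point of the suggestion: the OCR software sees only letter shapes.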

ScottGem
Nov 18, 2011, 04:52 AM
PDF and OCR are different things. What's not clear is what your purpose is here. Are you trying to catalog these documents or do you really need to convert the whole document?

There are third-party services that can do this for you.

I would NOT use Excel for this. I would use a database program like Access to catalog these files. I would first enter docket #, dates, author and location to create your record. Maybe include a brief description. You could hire temps to do this.
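
As a rough idea of what such a catalog record could look like, here is a sketch using SQLite from Python rather than Access, purely because it is self-contained. The table and column names are assumptions based on the fields listed earlier in the thread, and dates are stored as plain text for simplicity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS dockets (
    id             INTEGER PRIMARY KEY,
    docket_number  TEXT NOT NULL,
    date_commenced TEXT,
    date_finalised TEXT,
    author         TEXT,
    subject        TEXT,
    location       TEXT
);
"""


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_docket(conn, docket_number, date_commenced, author, subject,
               location, date_finalised=None):
    """Insert one docket record; values are passed as parameters, not pasted in."""
    conn.execute(
        "INSERT INTO dockets (docket_number, date_commenced, date_finalised,"
        " author, subject, location) VALUES (?, ?, ?, ?, ?, ?)",
        (docket_number, date_commenced, date_finalised, author, subject, location),
    )
    conn.commit()


def find_by_author(conn, author):
    """Return (docket_number, subject) pairs whose author contains `author`."""
    cur = conn.execute(
        "SELECT docket_number, subject FROM dockets"
        " WHERE author LIKE ? ORDER BY docket_number",
        (f"%{author}%",),
    )
    return cur.fetchall()
```

The advantage over a flat spreadsheet is mainly in searching: a lookup by author, date range, or location is a single query, and the same table can later feed an online index.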

browjo
Nov 18, 2011, 10:33 PM
Thanks Scott & Cal,

The purpose of the exercise is to create an online index of the contents of the many thousands of files ("dockets") we hold.

We certainly have no intent of ever scanning whole documents, given the size of the task... just the items mentioned, namely:

Author, comm date, subject matter etc... so that people who want to access the archives can work out which dockets are likely to be of interest to them.

Yes, we do know that PDF and OCR are different.

Yes, Access might be a better database, but your suggestion, Scott, still leaves us with the enormity of the task (and no, we will not be paying temps or third-party operators for 20 to 30 years to get the work done).

Cal, yes, the dockets are very well archived and can easily cope with the process of scanning. The originals were blue/black typewritten and handwritten on white.

As mentioned, the dockets have some foxing and the base colour has faded to ivory/parchment. But we only need to capture black on white, of course.

Looking forward to your thoughts

John B

cdad
Nov 19, 2011, 06:23 AM
Try scanning a test document or two in 8-bit grayscale (black/white) and see if that helps you pull the lettering out more clearly. It would also save a ton of drive space.

ScottGem
Nov 19, 2011, 08:58 AM
OK, do as Dad suggested and test the scanning. You may be able to create a template that only scans certain parts of the page. But frankly, I think you may find data entry faster.
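
The template idea can be sketched as a set of named rectangles, each given as fractions of the page, that are cut out of every scan and handed to OCR separately. The Python below only does the cropping, on a page represented as rows of pixel values. The zone positions are invented placeholders, since the thread does not describe the docket layout, and would need measuring on a real docket.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) as fractions of page width and height
Zone = Tuple[float, float, float, float]

# Hypothetical layout; measure a real docket to fill these in.
TEMPLATE: Dict[str, Zone] = {
    "date_commenced": (0.05, 0.05, 0.35, 0.10),
    "author":         (0.05, 0.12, 0.60, 0.17),
    "subject":        (0.05, 0.19, 0.95, 0.30),
    "location":       (0.05, 0.32, 0.60, 0.37),
}


def crop_zone(pixels: List[List[int]], zone: Zone) -> List[List[int]]:
    """Cut one rectangular zone out of a page stored as rows of pixels."""
    height, width = len(pixels), len(pixels[0])
    left, top, right, bottom = zone
    x0, x1 = round(left * width), round(right * width)
    y0, y1 = round(top * height), round(bottom * height)
    return [row[x0:x1] for row in pixels[y0:y1]]


def split_page(pixels: List[List[int]],
               template: Dict[str, Zone]) -> Dict[str, List[List[int]]]:
    """Return one cropped image per named field in the template."""
    return {name: crop_zone(pixels, zone) for name, zone in template.items()}
```

This only works if every docket is laid on the scanner glass in the same position and the forms share one layout. The handwritten fields would probably still need keying by hand.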