Talkback for article: 382, June2005

A toolchain for transformation from paper to HTML

Back to: http://cgi.linuxfocus.org/English/June2005/article382.shtml

From: juanjo <juan.j.gil _at_ gmail _dot_ com> [ date: 2005-07-16 ]
Great article, I find it useful :D
I need some pointers for finding docs about OCR; maybe you can help me (I hope so).

I have 25 people scanning pages and another 25 doing data entry from the same pages, BUT because the format differs from one page to another, many errors creep in during data entry :(

Do you know of any free OCR software with scripting capabilities that can search for regions of text in some coordinate system (maybe pixels)? (I was thinking of "parsing" the image for certain keywords, then applying a watermark on top of the matching region to "highlight" it.)

Thanks in advance, and keep up the good work :D
From: Andreas Haunschmidt <haunschmidt(at)gmx.net> [ date: 2005-07-18 ]
nice article!

but I think

cat *.txt > test.txt

will not work as expected if test.txt already exists
(e.g. when you start the script a 2nd time),
because *.txt now matches test.txt as well
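
One possible way around this, sketched under the assumption that the page files live in the current directory and that test.txt is only ever the combined output: delete the stale result before concatenating, so the glob cannot pick it up. The function name combine_pages is made up for illustration and is not part of the article's script.

```shell
# combine_pages: hypothetical helper, not taken from the article.
# Concatenate every .txt file in the current directory into test.txt.
combine_pages() {
    # Remove a leftover test.txt from an earlier run; otherwise *.txt
    # would match it and cat would be asked to read its own output file.
    rm -f test.txt
    # The shell expands the glob before it sets up the redirection,
    # so at this point only the page files are matched.
    cat ./*.txt > test.txt
}
```

Writing the result under a name the glob cannot match (for example a different extension) would avoid the problem as well.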



From: Holger Kiffmeyer <h.kiffmeyer_at_freemail_dot_de> [ date: 2006-01-18 ]
Interesting article about OCR. I have experimented with it quite intensively myself, so I have a few comments:

1.) It is not necessary to use 600 dpi at full color (24 bit) or grayscale (8 bit) to scan articles from newspapers or magazines. Why? Because the original doesn't contain that much information. I used to work in an advertising agency and know a lot about offset printing. The original images used in the DTP workflow should have at least 150 dpi, normally 300 dpi, and sometimes 400 dpi, but only when using high-gloss, expensive paper. In effect, the printed images have about 100 to 200 dpi of REAL resolution at most. So it is sufficient to scan at 200 to 300 dpi, color or grayscale. Some overhead is necessary, as the scanner itself is not perfect, but that's enough. Saves you a lot of time ;)

2.) Moreover, the programs used don't take advantage of these big files; they may even give worse results. I had the best results scanning at 150 dpi greyscale (very fast), then changing the resolution to 300 or 600 dpi and the color depth to b/w (1 bit). You can use mogrify from the ImageMagick package for this purpose. It's important to know that some of the free OCR packages only work with b/w line-art images; they cannot handle greyscale images, or they just don't use the extra information. I don't remember exactly about gocr, but I think it can handle greyscale without making use of it yet. That means it internally turns a 150 dpi greyscale image into a 150 dpi b/w image, which leads to bad results, because 150 dpi b/w is about the quality of a fax.
Hmmm, too difficult to understand? Scan some images at the resolutions I mentioned, print them, and you'll see...

3.) You can save time, too, by using the command-line tools to scan. scanadf and scanimage are both easy to use, and you don't have to click for every single page of the book or magazine you scan. All pages of a book are the same size, right? Once you have started scanimage with the right parameters, it will scan as many times as you told it to, and you only have to turn the pages. Try doing this on Windows ... :) scanadf can even use an automatic document feeder, if your scanner has one.

4.) I had very good results using a digital camera, too. If it is supported by Linux, you can even use the scanimage tool. That's really fast. I built a stand for the camera, so I could scan while reading a magazine, with no need to lay it on top of the scanner. And it captures two pages with one click, or even large-format journals.

5.) Unfortunately, the freely available gocr and other free OCR programs give very bad results in comparison to proprietary OCR software. I ran some tests with a 100,000-word file of different German words (as both I and the developer of gocr are German). I used Ghostscript to render images in a 12 point Times font and compared the OCR result with the original: more than 30% errors. That's far too many for me, sorry, although it may be good enough for certain purposes. I recommend ABBYY FineReader. They have a Linux SDK for companies to integrate into their software. I don't know the prices; ask them (abbyy.com). I heard something about getting FineReader 7 to work with Wine, but I have yet to try this myself. Okay, folks, that's it. Feel free to drop me a line.
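
Points 2.) and 3.) could be combined roughly as follows. This is only a sketch: the function names scan_pages and to_lineart and the RUN variable are made up for illustration, the page%03d.pnm file naming is an assumption, and --resolution and --mode are backend-specific scanimage options whose accepted values depend on your scanner (check "scanimage --help" with the device attached).

```shell
# Hypothetical helpers illustrating points 2.) and 3.); adjust to your scanner.
# Set RUN=echo to print the commands instead of executing them.

# scan_pages COUNT: scan COUNT pages at 150 dpi greyscale, waiting for
# RETURN before each page so you have time to turn to the next one.
scan_pages() {
    # --batch, --batch-count and --batch-prompt are standard scanimage
    # options; --resolution and --mode come from the scanner backend
    # and may be named or valued differently on your device.
    ${RUN:-} scanimage --batch=page%03d.pnm --batch-count="$1" --batch-prompt \
        --resolution 150 --mode Gray
}

# to_lineart FILE...: double the pixel dimensions (150 dpi becomes the
# equivalent of 300 dpi) and reduce each image to 1-bit black and white.
# -format pbm makes mogrify write new .pbm files and keep the originals.
to_lineart() {
    ${RUN:-} mogrify -format pbm -resize 200% -threshold 50% "$@"
}
```

Typical use would be something like "scan_pages 20" followed by "to_lineart page*.pnm", after which the .pbm files go to the OCR program. The 50% threshold is a starting guess and will probably need tuning for yellowed or grey paper.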
From: Peter Jones <pcaalu2601(at)free-fast-email.com> [ date: 2006-06-03 ]
Your site is very useful.
From: Dave <nospam(at)blocker.net> [ date: 2006-09-16 ]
Nothing wrong with HTML, but you can create a nice PDF file from ImageMagick in one blow:

convert *.jpg output.pdf


From: Lew [ date: 2006-09-17 ]
Good article.

One question, though. Why the reboot after you installed the scanner?
As far as I can tell, there would be no reason to reboot.

SANE uses userspace drivers, and even the limited kernel-space driver should be loadable without a reboot (modprobe or insmod would do). No reboot is necessary here.

Adding a user to a specific group doesn't require a reboot, and the new group association is recognized at the next login. All you would have to do here is log off and log on again.
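
A quick way to see whether the running session has already picked up a new group, as a sketch: the helper name session_has_group is made up, and "scanner" in the usage note is only an assumed group name, since the group your distribution uses may differ.

```shell
# session_has_group GROUP: succeed if the *current* login session already
# carries GROUP (hypothetical helper for illustration).
session_has_group() {
    # "id -Gn" without a user name lists the groups of the running
    # process, which are fixed at login time.
    id -Gn | tr ' ' '\n' | grep -qx -- "$1"
}
```

After root has added you to the group (for instance with "usermod -aG scanner yourname"), running "session_has_group scanner || echo 'log out and back in first'" tells you whether the session is still the old one.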


6 talkbacks



