Monday, December 22, 2008

Open source OCR on Mac OS X: tesseract

I am looking for an optical character recognition solution and I've checked out OCRopus, but the early alpha stage it's in makes it very hard to compile. OCRopus lists tesseract as a dependency, so I've compiled tesseract and run it on a couple of scanned pages.
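For anyone who wants to script this step, here is a minimal sketch of calling the tesseract command-line tool from Python. It assumes the `tesseract` binary is already built and on your PATH, and that it is invoked in the usual `tesseract <image> <output base> -l <language>` form. The function names, the `page` output base and the `eng` default are my own placeholders, not anything tesseract requires.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base="page", lang="eng"):
    """Run tesseract on one scanned page and return the recognized text.

    Assumes the tesseract binary is installed and reachable on PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name on its own.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("scan.tif")` (with `scan.tif` standing in for your own scanned page) should write `page.txt` next to where you run it and return its contents. Older tesseract builds are picky about input formats, so an uncompressed TIFF is the safest thing to feed it.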

The results are impressive. Below is the output of running it on a page from a Cisco manual.

Chapter 24 • Mixed-Media Bridging

ending delimiter, which follows the data field) are treated differently depende
ing on the bridge manufacturer Some bridge manufacturers simply ignore the
bits. Others have the bridge set the C bit (to indicate that the frame has been
copied) but not the A bit (which indicates that the destination station recog-
nizes die address). Ln the former case, a Token Ring source node determines
whether the frame it sent has become lost. Proponents of this approach sug~
gest that reliability mechanisms, such as the tracking of lost frames [..]

3 comments:

Adam Logan said...

Tried my hand at this... Uber fail. I don't have much experience with terminal, so this is way out of my capabilities atm. Very cool thing to be able to do though.

diciu said...

I've created a very simple GUI on top of tesseract.

If you want to experiment with tesseract, this should make things easier.

Adam Logan said...

Thanks diciu, I'll give installing Tesseract another go sometime. I am just going to need to meet up with a Linux guru or something and get some tutoring =/.