Optical Character Recognition (OCR) software is nothing new. Ever since the early 2000s, when I first owned a scanner, OCR software has been able to convert a scanned document into a text file. Recently, however, I needed to convert an entire book to text using similar software. The book was about 200 pages of church history. The intent was to publish the book to Wikipedia (with the author’s consent, of course).
My first thought was to contact a university that does this. I earned a degree from Gardner-Webb University in North Carolina, and I knew they used a machine like this for visually impaired students. So, I learned a little about their process.
They would take a book, unbind it and send it through the machine, which would then convert the pages to a text document. At first, I asked if they would be able to do this for me. They said it wasn’t typical to do this for someone outside the university, so I conceded and considered other options.
When I realized I had access to the first step in the process, I knew I was halfway there. Most modern copiers have a “Scan To…” feature. The copier will scan to e-mail, JPG, PDF or another format you specify.
So, I did it. I took the book and cut the hardback cover from it. Then I separated the page sections and trimmed the pages so that they would run through the copier smoothly. Then, I told the copier to scan to PDF and ran the book through the copier. After that, I clipped it all together using one of those black binder clips to keep everything in order.
The PDF document was great: pictures intact and text clearly readable. The only problem was that it wasn’t yet searchable. I still needed to get this PDF to text. That’s where Online OCR comes in.
Online OCR allows you to upload a document to their web service. Then it returns the document in the format you request. I wanted nothing more than plain text for my purposes. My first attempts were troublesome because the file was so large. After I contacted support, they quickly found the source of the problem and sent the text file I needed.
As with any OCR software, the text file isn’t perfect. The images in the document don’t show up either. The next step in the process is to clean up the text document, correct misspellings and get the formatting correct. This will take some time, but it sure beats re-typing an entire book. You can look for the article on Wikipedia soon!
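For anyone facing the same cleanup, some of the mechanical work could be scripted before the manual proofreading starts. What follows is only a minimal sketch in Python, not the process I used: the function name clean_ocr_text is made up for illustration, and it assumes the OCR output is plain text with hard line breaks inside paragraphs and blank lines between them.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy common OCR artifacts in plain text (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split across a line break: "his-\ntory" -> "history".
    # Caveat: this also joins real hyphenated names that happen to break
    # at a line end, so those still need a manual check.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat blank lines as paragraph breaks, then unwrap each paragraph
    # by collapsing every run of whitespace into a single space.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)

    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The church his-\ntory began\nin town.\n\n\nSecond   paragraph."
    print(clean_ocr_text(sample))
```

A script like this only handles line breaks and spacing. Misread letters and lost formatting still have to be corrected by hand or with a spell checker.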