Using Perl to build an Ultra-Simple Document Management System with free Components on Windows

I was looking for an extremely simple way to scan bills and other documents. The solution had to meet the following requirements:

  • It should operate on a single icon click: no scan GUIs, no user dialogues, no additional user interaction.
  • It should store documents in a standard file format that will remain accessible for at least the next 10 years (which I assume PDF will be).
  • It should let me search quickly both in a structured way (by kind of document) and by keyword.
  • It should support OCR to improve keyword search (the recognition rate is secondary).
  • It should make use of existing OS features wherever possible.
  • It should be extremely simple yet flexible.
  • It should use only components that are free of charge (apart from the OS).

So there are two use cases: 1) scanning a document and 2) finding and opening it.

In the end, I wrote a very short Perl script that does the job very nicely.

Limitations:

  • No multipage support
  • Designed as a single-user solution only; no access control etc.
  • Requires a TWAIN scanner

How it meets the criteria:

  • It uses PDF, generated from Perl, as the file format for the scanned image.
  • It stores a keyword in the file name.
  • It can use a separate directory for each kind of document to be scanned.
  • It puts the recognized OCR text into the PDF file as keywords.
  • Once indexing is activated for PDF files, Windows adds them to its search index. Documents can then be found quickly from Windows Explorer by matching either the keyword in the file name or the keywords stored in the PDF.
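
As a rough illustration of the file name and directory idea, here is a minimal sketch. The helper name `target_pdf_path`, the layout `<document root>/<keyword>/<keyword>_<date-time>.pdf` and the path `C:/Scans` are assumptions made for this example; the script itself may name its files differently.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::Spec;

# Hypothetical helper (not taken from the script): builds the target PDF
# path from a category keyword. The directory-per-keyword layout and the
# time stamp format are assumptions for illustration only.
sub target_pdf_path {
    my ($doc_root, $keyword, $epoch) = @_;
    $epoch = time unless defined $epoch;

    # Replace anything that could be awkward in a Windows file name.
    (my $safe = $keyword) =~ s/[^A-Za-z0-9_-]+/_/g;

    # Scan date and time, e.g. 20240131-174500.
    my $stamp = strftime('%Y%m%d-%H%M%S', localtime($epoch));

    return File::Spec->catfile($doc_root, $safe, "${safe}_$stamp.pdf");
}

# Example call with a made-up document root.
print target_pdf_path('C:/Scans', 'Creditcard'), "\n";
```

Putting the keyword into both the directory name and the file name means a search by file name still works if all documents end up in one folder.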

How it works:

  • It calls CmdTwain, a free external command-line scan program, to get the image from the scanner and save it as a temp file.
  • It calls Tesseract, a free external OCR engine, to OCR the image.
  • It converts the BMP to TIFF with the Perl library Imager.
  • It uses PDF::API2 to generate a PDF file containing a) the image and b) the recognized OCR text.
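
The external steps above can be driven from Perl roughly as follows. This is only a sketch under stated assumptions: the helper names `run_step`, `tesseract_cmd` and `read_ocr_text` are made up for this example, the language code `deu` is just an example, and the CmdTwain call is left out because its exact command line lives in section [2] of the script. The Tesseract call uses the usual `tesseract <image> <output base> -l <language>` form, which writes the recognized text to `<output base>.txt`.

```perl
use strict;
use warnings;

# Hypothetical helper: runs one external step and dies if it fails.
sub run_step {
    my ($label, @cmd) = @_;
    system(@cmd) == 0
        or die "$label failed: @cmd (status $?)\n";
    return 1;
}

# Hypothetical helper: builds the Tesseract command line.
# Tesseract writes its result to "<out_base>.txt".
sub tesseract_cmd {
    my ($image, $out_base, $lang) = @_;
    my @cmd = ('tesseract', $image, $out_base);
    push @cmd, '-l', $lang if defined $lang;
    return @cmd;
}

# Hypothetical helper: reads the OCR result and collapses whitespace,
# so the text can be stored as one keyword string.
sub read_ocr_text {
    my ($out_base) = @_;
    open my $fh, '<', "$out_base.txt"
        or die "Cannot read $out_base.txt: $!\n";
    local $/;                      # slurp the whole file
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/\s+/ /g;
    return $text;
}

# Only print the command here; an actual run would be
# run_step('OCR', tesseract_cmd('scan.bmp', 'scan', 'deu'));
print join(' ', tesseract_cmd('scan.bmp', 'scan', 'deu')), "\n";
# prints: tesseract scan.bmp scan -l deu
```

Passing the command to `system` as a list avoids shell quoting problems with paths that contain spaces, which are common on Windows.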

Installation and use:

  1. I assume that you have installed Perl. Make sure that the following libraries are installed: PDF::API2, PDF::API2::Content, PDF::API2::Lite, Imager and PDF::TextBlock.
  2. Create a scan document path and a temp path and put them in section [1] of the script.
  3. Make sure you have a working TWAIN scanner. Download and install the command-line tool CmdTwain Free, published by GSSEziSoft (http://www.gssezisoft.com). It allows you to scan pages without user interaction. Success criterion: you manage to scan pages with the same command as used in section [2] of the script.
  4. Download and install "Tesseract" from Google Code.
    If desired, add your language pack. Play with it a bit: scan documents as 200 dpi BMP files and try to get some recognition results.
    Success criterion: you manage to run Tesseract with the same command as used in section [3] of the script.
  5. The script is called with exactly one parameter, a descriptive keyword. This keyword becomes part of the file name, next to the scan date and time. The keyword acts as a kind of category: in my scenario, I have bills for my credit card, for my current bank account, and cash bills.
  6. For each category, create a desktop shortcut. It links to the Perl script and includes the category name as a parameter, so it might look like this (German Windows installation): http://glas-consulting.com/images/desktop-link.png
    I assume that .pl files are associated with the Perl binary.
  7. If you did everything right, double-clicking this icon should scan the image, save it as a temp file, run the OCR engine, convert the image, create a PDF, store the recognized text as "properties" inside the PDF, and save it under a name containing the keyword.
  8. You now have to activate Windows indexing for PDF files. On my Windows 7 64-bit edition, I had to download and install the Adobe PDF iFilter from http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611
    Besides that, make sure you have added the document folder to the Windows "Indexing Options" so that the Windows indexing service adds the files in that folder to its index. This may require restarting the service and waiting some time after the iFilter is installed.
    Success criterion: open Windows Explorer, go to your scan document folder and use the search box at the upper right. When you enter a keyword from the file name or a word from the OCR'ed text, Explorer should show you the corresponding documents.
    This might look like this: http://glas-consulting.com/images/explorer.search.png
  9. Clicking on the PDF file should open a preview of it on the right; double-clicking should open it.
  10. Voilà:
    Use case 1: Put the document on the scanner and double-click the scanner icon with the desired keyword.
    Use case 2: Open Windows Explorer in the scan folder and use the built-in search function to look for strings from the file name and/or from the recognized text.
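
Steps 2 and 5 can be checked up front before any scanning starts. The following is a minimal sketch of such a check, not the script's actual code: the `%config` keys, the example paths and the helper `check_setup` are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical settings in the spirit of section [1] of the script;
# the key names and paths are assumptions for this example.
my %config = (
    doc_path  => 'C:/Scans',
    temp_path => 'C:/Scans/tmp',
);

# Hypothetical helper: returns a list of setup problems (empty if all is fine).
sub check_setup {
    my ($config, @args) = @_;
    my @problems;

    # The script expects exactly one parameter: the category keyword.
    push @problems, 'expected exactly one keyword parameter'
        unless @args == 1;

    # Both paths must exist before the first scan.
    for my $key (qw(doc_path temp_path)) {
        my $dir = defined $config->{$key} ? $config->{$key} : '';
        push @problems, "$key is not an existing directory: '$dir'"
            unless -d $dir;
    }
    return @problems;
}

# Report problems instead of scanning into a missing folder.
my @problems = check_setup(\%config, @ARGV);
print "Setup problem: $_\n" for @problems;
```

With a check like this, a desktop shortcut that lacks its keyword parameter produces a readable message instead of a PDF with an empty keyword in its name.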

For my personal needs, this solution is absolutely sufficient, and Windows Search is very fast.

 

Here is the source code: