Most OCR text extraction projects involve scanned or photographed images whose text must be extracted and then classified into the fields relevant to the solution. Having implemented such data extractions at Rare Mile, we found that apart from the obvious must-do items, such as selecting an OCR product and building analytics rules, there are a few important things to look out for. Done effectively and at the right time, they go a long way toward improving the customer experience of such projects.
Good reference data:
The accuracy of an OCR solution largely depends on how well the extraction and classification rules are built. These rules help overcome the limitations of the OCR product and of poor-quality images. Good reference data is the foundation for those rules. The larger and more varied the reference data, the better the extraction rules stand the test of time and cope with new input.
Start with a good reference set for building your rules and testing the algorithm.
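As a rough illustration of testing rules against a reference set, the sketch below scores a set of rules field by field. Everything in it is an assumption made for the example: the field names, the regular expressions, the sample OCR text and the helper names `extract_fields` and `field_accuracy` are hypothetical and do not come from any particular OCR product.

```python
import re

# Hypothetical extraction rules: one regular expression per field.
RULES = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
    ),
}

# Hypothetical reference set: raw OCR text paired with the values a human
# verified as correct. The third sample contains a typical OCR misread
# ("I" read as "l") that the current rules do not handle.
REFERENCE_SET = [
    ("Invoice No: A1023\nTotal: $1,250.00",
     {"invoice_number": "A1023", "total": "1,250.00"}),
    ("INVOICE # B77\nTotal 89.50",
     {"invoice_number": "B77", "total": "89.50"}),
    ("lnvoice No: C9\nTotal: 12.00",
     {"invoice_number": "C9", "total": "12.00"}),
]


def extract_fields(text, rules=RULES):
    """Apply every rule to the text; a field with no match is None."""
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def field_accuracy(reference_set, rules=RULES):
    """Return the share of reference samples extracted correctly, per field."""
    correct = {name: 0 for name in rules}
    for text, expected in reference_set:
        extracted = extract_fields(text, rules)
        for name in rules:
            if extracted[name] == expected.get(name):
                correct[name] += 1
    return {name: correct[name] / len(reference_set) for name in rules}


accuracy = field_accuracy(REFERENCE_SET)
for name, score in accuracy.items():
    print(f"{name}: {score:.0%}")
```

With these samples the "total" rule matches every entry, while the "invoice_number" rule misses the entry with the misread letter. A varied reference set is meant to expose exactly that kind of gap.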
Learn from mistakes:
Like any algorithm, a text extraction algorithm gets better with time, but only if it is designed to collect feedback on its own output. Every wrongly extracted field should be recorded together with its correct value, so that we can run analytics on these corrections and improve the initial set of extraction rules.
Feed the corrections back into the rules.
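One possible shape for such a feedback loop is sketched below. It assumes corrections arrive as (field, extracted value, corrected value) triples and keeps them in memory. The class `CorrectionLog` and its methods are hypothetical names for this example, and the character-level analysis is deliberately simple: it only compares values of equal length.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    field_name: str
    extracted: str
    corrected: str


class CorrectionLog:
    """Collects wrongly extracted values together with their corrections."""

    def __init__(self):
        self._corrections = []

    def record(self, field_name, extracted, corrected):
        # Only real mistakes are stored; correct extractions are ignored.
        if extracted != corrected:
            self._corrections.append(
                Correction(field_name, extracted, corrected)
            )

    def errors_per_field(self):
        """Count how often each field was extracted wrongly."""
        return Counter(c.field_name for c in self._corrections)

    def char_confusions(self):
        """Count (wrong char, right char) pairs in equal-length corrections."""
        pairs = Counter()
        for c in self._corrections:
            if len(c.extracted) != len(c.corrected):
                continue  # assumption: skip insertions and deletions
            for wrong, right in zip(c.extracted, c.corrected):
                if wrong != right:
                    pairs[(wrong, right)] += 1
        return pairs


log = CorrectionLog()
log.record("total", "1,25O.00", "1,250.00")      # letter O read for zero
log.record("total", "8g.50", "89.50")            # g read for 9
log.record("invoice_number", "A1O23", "A1023")   # letter O read for zero
log.record("total", "12.00", "12.00")            # correct, not stored

print(log.errors_per_field())
print(log.char_confusions().most_common(1))
```

In this sample the most frequent confusion is the letter O read in place of a zero, which suggests a candidate rule: normalise "O" to "0" in numeric fields before matching.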
Externalize the rules:
Given the importance of extraction rules to the algorithm, it is crucial to plan how those rules will be managed. Keeping the rules separate from the core extraction logic lets us tweak them easily, without a build or deployment, and evaluate the effect on the algorithm’s accuracy. The rules can be externalized to flat files, databases or rule engines. The choice should take into account the complexity of the rules, the audience who will edit them, and the need for security and access control over changes. Flat files are the simplest option and fit when data safety and security are not a concern; rule engines fit when we need access control and want to build complex rules on the fly.
Keep the rules out of the code and keep them easily manageable.
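A minimal sketch of the flat-file option follows, assuming the rules are stored as JSON that maps each field name to a regular expression. The JSON layout, the field names and the helpers `load_rules` and `extract_fields` are illustrative assumptions. A string stands in for the file contents so the example runs on its own.

```python
import json
import re

# Stand-in for the contents of a rules file such as "rules.json"
# (hypothetical name). Backslashes are doubled because JSON requires
# them to be escaped.
RULES_JSON = r"""
{
  "invoice_number": "Invoice\\s*No\\s*:\\s*(\\w+)",
  "total": "Total\\s*:\\s*([\\d,]+\\.\\d{2})"
}
"""


def load_rules(rules_text):
    """Parse the rules text and compile one pattern per field."""
    raw = json.loads(rules_text)
    return {
        name: re.compile(pattern, re.IGNORECASE)
        for name, pattern in raw.items()
    }


def extract_fields(text, rules):
    """Core extraction logic: it knows nothing about specific fields."""
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


rules = load_rules(RULES_JSON)
result = extract_fields("Invoice No: A1023\nTotal: 1,250.00", rules)
print(result)
```

Adding a field or loosening a pattern then means editing the rules text only; `extract_fields` stays untouched and nothing needs to be rebuilt or redeployed.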
Plan your User Acceptance Testing:
For projects that involve OCR text extraction, the User Acceptance Testing phase can prove more challenging than the go-live phase, since testing is not as straightforward as it is for a typical web application with a front end. In most cases, the results of the text extraction are not shown directly on a UI; they feed into another system for further analysis. Not thinking through how the User Acceptance team will test the algorithm can create bottlenecks in taking the solution live. A basic UI tool with the bare minimum of features to verify results and evaluate the solution’s correctness is a good investment that brings relief to your User Acceptance team.
Take the solution smoothly to your User Acceptance Test team.
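What "bare minimum" could look like is sketched below: a function that turns extraction results into a static HTML table a tester can open in a browser and compare against the source images. The input structure (a list of dictionaries with "image" and "fields" keys) and the function name `build_review_page` are assumptions for this example, and a real verification tool would likely also let testers mark each value as right or wrong.

```python
from html import escape


def build_review_page(results):
    """Render extraction results as a plain HTML table, one row per field."""
    rows = []
    for item in results:
        for name, value in item["fields"].items():
            shown = "" if value is None else value  # empty cell if not found
            rows.append(
                "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                    escape(item["image"]), escape(name), escape(shown)
                )
            )
    header = "<tr><th>Image</th><th>Field</th><th>Extracted value</th></tr>"
    return "<table border='1'>" + header + "".join(rows) + "</table>"


# Hypothetical results, as they might come out of the extraction step.
sample_results = [
    {
        "image": "scan_001.png",
        "fields": {"total": "1,250.00", "invoice_number": None},
    },
]

page = build_review_page(sample_results)
print(page)
```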