Summary of GitHub issue #49 (now resolved).
Peter and I were experimenting with the whitelisting functionality of tesseract-ocr, which is described in this StackOverflow post.
Yesterday Peter amended the code to use this functionality, so that our OCR process no longer inserts difficult characters such as $ ` ' , into tip labels.
So we made our own custom whitelist of accepted characters:
On Peter's Mac this config file had to be placed in:
Whereas on my Lubuntu PC the tesseract-ocr config directory is:
I had neglected to replicate Peter's whitelist config file exactly, so my OCR output was composed of digits with no letters at all. For example, the first line below is what I got and the second is what it should have been:
<otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>
<otu id="otu7">Bacillus vireti LMG 21834T (AJ542509)</otu>
This was quickly fixed by correcting my local config file. It was a useful problem to bump into: a good reminder that working across different machines, operating systems, and locations is hard, and that you can't always sync ALL changes via version-controlled repositories.
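A mismatch like this could probably be caught earlier with a small sanity check. The sketch below is only an illustration, not part of our actual pipeline: the function names `letterless_labels` and `config_fingerprint` are hypothetical, and the whitelist value in the usage example is a placeholder rather than our real whitelist. It flags `<otu>` labels that contain no letters, and prints a short hash of a whitelist config so that two machines can compare their local copies.

```python
import hashlib
import re


def letterless_labels(xml_text):
    """Return the text of every <otu> element that contains no letters."""
    labels = re.findall(r"<otu\b[^>]*>(.*?)</otu>", xml_text, flags=re.DOTALL)
    return [label.strip() for label in labels
            if not any(ch.isalpha() for ch in label)]


def config_fingerprint(config_text):
    """Short hash of a config file's text, ignoring blank lines and
    leading/trailing whitespace, for comparing copies across machines."""
    lines = [line.strip() for line in config_text.splitlines() if line.strip()]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest[:12]


if __name__ == "__main__":
    ocr_output = (
        '<otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>\n'
        '<otu id="otu7">Bacillus vireti LMG 21834T (AJ542509)</otu>\n'
    )
    # Only the first label is reported, because it has no letters.
    for label in letterless_labels(ocr_output):
        print("suspicious label:", label)

    # Placeholder config content; the real whitelist is not reproduced here.
    print(config_fingerprint("tessedit_char_whitelist abc123\n"))
```

If both of us ran the fingerprint on our local config and compared the results, a difference would show up before any OCR output had to be inspected by eye.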