Originally published at: http://contentmine.org/2015/08/mining-images-for-identifiers/
Figure images in scholarly articles typically contain a wealth of interesting data. In terms of textual data, identifiers such as GenBank / ENA accession numbers can be relatively common in certain types of figure images. Take for example the below figure reproduced from an Open Access article in IJSEM:
[caption id="attachment_809" align="aligncenter" width="380"] Figure 3, Nakai et al (2014) doi: 10.1099/ijs.0.060798-0 Licence: CC-BY[/caption]
Most tip labels of the phylogeny above specify the exact accession number of the nucleotide sequence data that underlies the result, e.g. AB540021, the accession number for the Oligoflexus tunisiensis 16S rRNA partial sequence. This aids the reproducibility of the research presented.
But at scale, how can we discover, index and/or re-use this identifier information? It's worth noting that for this paper and all others like it in IJSEM, only the accession number of the taxon of interest is printed in the textual content of the paper or its supplementary materials. All the other accession numbers appear only in this image, as pixels, utterly unindexed and undiscoverable by anyone. So if you want to ask "Which papers have used or referred to accession number 'JF181808'?", you really have to delve into OCR technology to unlock this data from the image pixels.
With the ContentMine toolset, I've been using tesseract-ocr to perform OCR on a set of over 4,000 of these phylogenetic tree figures from the journal IJSEM. From this, it's becoming obvious that these accession numbers were never designed with OCR retrieval in mind! Unlike some other identifier systems, these identifiers have no 'check digit' to help validate them, so some OCR errors are much harder to detect and correct than others.
Some examples of OCR error
(A8486128) -- if OCR returns this eight character string between parentheses, we know it is an error of some sort because GenBank/ENA accession numbers have a specific format, which this example does not quite conform to. An error is detected because it's an identifier of eight characters in length but the second character is a number (it should be a letter to conform). It is highly likely that the '8' was originally a 'B' in the source image and thus in this instance automatic error correction can be safely applied, post-OCR.
(FJ71077Z) -- in a similar example, the last character is a letter rather than a number, which does not conform to the prescribed accession number pattern. In this instance the 'Z' must be corrected to a '2' to get the real accession number.
(AB2B8551) -- this example demonstrates the difference in ease between error detection and error correction. The 4th character should be a number, but OCR has incorrectly determined it to be a 'B'. This is easy to detect. However, what is the original number that the 'B' should be? OCR often mistakes a '3', '6' or '8' for a 'B', depending on what font is used. Without further lookup and/or checking, this error cannot be corrected with certainty.
(DQ289039) -- this is an example of difficult error detection and difficult error correction. If the OCR incorrectly returns an '8' in the numerical section of the identifier instead of a '3' or a '6', it is difficult to detect that an error has occurred because the string is still a valid accession number. Furthermore, the most likely alternatives, DQ239039 and DQ269039, are also valid accession numbers, so to detect and correct this error one really needs to cross-reference with another data source, e.g. the taxon name or the strain information that is also given in the diagram.
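The detect-then-correct logic in the examples above can be sketched in a few lines of Python. This is only a rough illustration, not the code used for the real analysis: it assumes the two-letters-plus-six-digits format that all the examples in this post share (other accession formats exist), and the confusion tables contain only the substitutions mentioned above. The function names are made up for this sketch.

```python
import re
from itertools import product

# Assumed format: two uppercase letters then six digits, as in every
# example in this post. Real accession numbers come in other shapes too.
ACCESSION = re.compile(r"[A-Z]{2}[0-9]{6}")
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"

# Illustrative confusion tables, limited to the cases discussed above.
IN_LETTER_SLOT = {"8": ["B"]}                      # digit read where a letter belongs
IN_DIGIT_SLOT = {"Z": ["2"], "B": ["3", "6", "8"]}  # letter read where a digit belongs


def extract_tokens(ocr_text):
    """Pull eight-character alphanumeric strings out of parentheses."""
    return re.findall(r"\(([A-Z0-9]{8})\)", ocr_text)


def candidates(token):
    """List every well-formed accession reachable via the confusion tables."""
    if len(token) != 8:
        return []
    options = []
    for i, ch in enumerate(token):
        allowed, table = (LETTERS, IN_LETTER_SLOT) if i < 2 else (DIGITS, IN_DIGIT_SLOT)
        # Keep a character that fits its slot; otherwise try its substitutes.
        options.append([ch] if ch in allowed else table.get(ch, []))
    return ["".join(chars) for chars in product(*options)]


def classify(token):
    """Return (status, candidates) for one OCR token."""
    if ACCESSION.fullmatch(token):
        return "valid", [token]
    found = candidates(token)
    if len(found) == 1:
        return "corrected", found
    if found:
        return "ambiguous", found
    return "unrecoverable", []
```

With these tables, `classify("A8486128")` yields a single correction (AB486128), while `classify("AB2B8551")` comes back as ambiguous with three candidates. Note that DQ289039 is reported as valid: a format check alone cannot catch that kind of error, which is exactly why cross-referencing is needed.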
Automatic error detection is thus somewhat easier than automatic error correction, and neither is infallible. In processing this accession data from the 4,000+ figure images I must work hard to provide a valid estimate of both the accuracy of my error detection and the accuracy of my error correction routines, for the data to be usable.
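One way such an estimate could be made is to hand-check a sample of figures and compare the routines' output against what a human reads from the image. The sketch below assumes that approach; the record layout, the `evaluate` function and the sample values (including the "truth" column) are invented purely for illustration and are not real results.

```python
def evaluate(records):
    """Score detection and correction against hand-checked truth.

    records: iterable of (ocr_token, flagged, corrected, truth) tuples, where
      flagged   -- True if the detector reported an error
      corrected -- the corrector's output, or None if it made no attempt
      truth     -- the accession number a human read from the image
    """
    records = list(records)
    # Detection is right when "flagged" matches whether the OCR was actually wrong.
    detection_hits = sum(
        flagged == (ocr != truth) for ocr, flagged, _, truth in records
    )
    # Correction is scored only on tokens where a correction was attempted.
    attempts = [(fixed, truth) for _, _, fixed, truth in records if fixed is not None]
    correction_hits = sum(fixed == truth for fixed, truth in attempts)
    return {
        "detection_accuracy": detection_hits / len(records) if records else None,
        "correction_accuracy": correction_hits / len(attempts) if attempts else None,
    }


# Hypothetical hand-checked sample; the truth values are made up for this sketch.
sample = [
    ("A8486128", True, "AB486128", "AB486128"),
    ("FJ71077Z", True, "FJ710772", "FJ710772"),
    ("DQ289039", False, None, "DQ239039"),  # a missed error: the string looks valid
    ("AB540021", False, None, "AB540021"),
]
scores = evaluate(sample)
```

The missed DQ289039 case lowers the detection score without touching the correction score, which is why the two figures are worth reporting separately.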
What can we do with this identifier data? One example...
There's a trend towards increased data citation in the biological sciences at the moment, yet the use and re-use of molecular sequence data in phylogenetic analyses is almost never formally cited. By extracting these accession numbers from figure images and combining them with some simple lookup code to determine authors, we can retrospectively look at whose data people have been using -- that is, who was responsible for originally sequencing the phylogenetically useful sequences (typically 16S rRNA) from these organisms. Using some very rough, first-pass OCR data obtained from phylogenetic tree images from 3864 IJSEM papers, I can generate a tantalizing 'leaderboard' of microbial sequence contributors (most sequences have multiple authors, like academic papers):
Used in 3864 IJSEM papers
It's no surprise to see Carl Woese up there in the top 10, as he was one of the early pioneers of the use of 16S rRNA for bacterial phylogenetics. The others were news to me, as I'm not a specialist in the area. Amazingly, Stackebrandt appears to have contributed to more than one sequence used per paper, on average, across my sample. That isn't too surprising given that there are often 30 or more sequences used in a single phylogenetic analysis, but still, that's an impressive impact on the field!
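For the curious, the tallying step behind a leaderboard like this could look roughly like the sketch below. It assumes the accession numbers have already been extracted per paper and that the author lookup has already been done; both dictionaries, the paper ids and the author names are placeholders, not real lookup results.

```python
from collections import Counter


def leaderboard(paper_accessions, accession_authors):
    """Rank authors by how many sequence uses across papers they contributed to."""
    counts = Counter()
    for accessions in paper_accessions.values():
        for acc in set(accessions):  # count each accession once per paper
            for author in accession_authors.get(acc, ()):
                counts[author] += 1
    return counts.most_common()


# Placeholder inputs: paper ids and author names are hypothetical.
paper_accessions = {
    "paper-1": ["AB540021", "JF181808"],
    "paper-2": ["AB540021", "DQ289039"],
}
accession_authors = {
    "AB540021": ["Author A", "Author B"],
    "JF181808": ["Author A"],
    "DQ289039": ["Author C"],
}

ranking = leaderboard(paper_accessions, accession_authors)
# Average number of an author's sequences used per paper in the sample.
per_paper = {name: n / len(paper_accessions) for name, n in ranking}
```

Because every author of a sequence gets credit for each use, an author's count can comfortably exceed the number of papers in the sample.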
Could this be the start of altmetrics for data-usage? I don't know. But it sure is interesting to see who's behind all this sequencing data that is so commonly re-used by others. Credit where credit is due, even if it's many years after the fact.