Originally published at: http://contentmine.org/2015/09/contentmine-tools-produce-a-supertree/
Using ContentMine tools, the BBSRC-funded PLUTo project (Mounce, Murray-Rust, Wills) has produced what we believe is the first ever machine-compiled supertree built entirely from figure images. To be precise, we accurately extracted phylogenetic relationships from 924 figure images published in the journal IJSEM to create this formal synthesis of knowledge.
Here it is below for your viewing pleasure, rendered as an image in radial format. It's rather big, with 2269 taxa, so it's a challenge just to fit it all on one screen!
So what next? Are we done? No. To give a sense of validity, we need to compare our machine-compiled supertree to trees of similar taxon coverage, such as the NCBI taxonomy tree and the SILVA 'Living Tree Project'. We have already done some work on the former: the Robinson-Foulds distance between our supertree and the NCBI taxonomy tree composed of the same taxa is only 1691. More randomisation work is needed to determine whether this distance differs significantly from the distances obtained for random trees with the same tip labels.
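To make the comparison concrete, here is a minimal Python sketch of a Robinson-Foulds distance and a simple randomisation test. It is an illustration only, not the code used in the PLUTo project. It assumes trees are written as nested tuples of string tip labels, treats them as unrooted, and uses tip-label shuffling as the null model, which is just one of several possible choices. The function names (`rf_distance`, `randomisation_p_value` and so on) are made up for this example.

```python
import random


def leaf_sets(tree, out):
    """Return the set of tips under `tree`, appending every subtree's tip set to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    tips = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(tips)
    return tips


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as the side
    that does not contain a fixed reference tip (the smallest label)."""
    clades = []
    all_tips = leaf_sets(tree, clades)
    ref = min(all_tips)
    result = set()
    for clade in clades:
        side = all_tips - clade if ref in clade else clade
        # Skip trivial splits (a single tip against everything else).
        if 1 < len(side) < len(all_tips) - 1:
            result.add(side)
    return result


def rf_distance(a, b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(a) ^ splits(b))


def relabel(tree, mapping):
    """Copy `tree` with every tip label replaced according to `mapping`."""
    if not isinstance(tree, tuple):
        return mapping[tree]
    return tuple(relabel(child, mapping) for child in tree)


def randomisation_p_value(a, b, n_perm=999, seed=0):
    """Assumed null model: shuffle the tip labels of `a` and count how often
    the shuffled tree is at least as close to `b` as the original tree is."""
    rng = random.Random(seed)
    observed = rf_distance(a, b)
    tips = sorted(leaf_sets(a, []))
    hits = 0
    for _ in range(n_perm):
        shuffled = tips[:]
        rng.shuffle(shuffled)
        if rf_distance(relabel(a, dict(zip(tips, shuffled))), b) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Two toy five-taxon trees that disagree on one grouping.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_a, tree_b))  # 2
```

A small p-value from `randomisation_p_value` would suggest the two trees are closer than label-shuffled trees tend to be. For trees with thousands of tips, an established phylogenetics library would be a more sensible choice than this toy implementation.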
It's also time to start writing it all up for formal publication. I hope this will encourage more scientists to publish machine-readable phylogenetic data rather than just figure images: syntheses like this would be much easier to produce from proper phylogenetic data formats than from the pixel-based raster graphics we used for this supertree. We also hope this provides a compelling example of the research you can do with a content mining philosophy; published academic literature is there not just to be read, but to be mined and re-used too!