Just shown a few examples of these to Vince Smith in the office here at NHM, London. Impressed. Feels great to be doing Open Notebook Science!
Chart Junk & Annotations Result in No Valid Newick Output
Batch: 500A
Original Image ID: 65220-0
One for the blacklist, I think. The output Newick file contains just:
UNKNOWN;
The output NeXML reveals it got some of the taxon labels but didn't get the tree:
UNKNOWN in Newick but still gets some of the tree
Batch: 500A
Original Image ID: 65221-0
The Newick output is:
The retained relationships are correct with respect to each other but I am concerned about the 'UNKNOWN'
100% (13 / 13 taxa)
Batch: 500A
Original Image ID: 65226-0
AFAIK this is essentially perfect on tips, relationships and branch lengths!
16 / 22 taxa retained. Relationships correct
Batch: 500A
Original Image ID: 65229-0
See rendered Newick below. The log file for this one is worth close attention: 4 tip labels are rejected, I believe simply due to incorrect case, e.g. an uppercase first letter in the species name or a lowercase first letter in the genus: "Janibacter Iimosus" & "lntrasporangium calvum" & "Oryzihumus Ieptocrescens". Can we relax the case sensitivity? Levenshtein edit distance could perhaps save the day here.
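To make the edit-distance idea concrete, here is a minimal Python sketch. It is not what AMI does; `accept_label` and its `max_edits` threshold are hypothetical names, and the reference names are the ones I presume the three quoted labels were meant to be.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def accept_label(ocr_label: str, reference: str, max_edits: int = 1) -> bool:
    """Hypothetical acceptance rule: keep a tip label within max_edits of a known name."""
    return levenshtein(ocr_label, reference) <= max_edits


# Rejected labels quoted from the log, paired with the names presumed to be intended.
pairs = [
    ("Janibacter Iimosus", "Janibacter limosus"),
    ("lntrasporangium calvum", "Intrasporangium calvum"),
    ("Oryzihumus Ieptocrescens", "Oryzihumus leptocrescens"),
]
for ocr_label, reference in pairs:
    print(ocr_label, "->", reference, "distance", levenshtein(ocr_label, reference))
```

Each of the three pairs is a single I/l substitution, so a tolerance of one edit would be enough to rescue them without relaxing the case rule everywhere.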
Incorrect introduction of error: duplicated taxon label
Batch: 500A
Original Image ID: 65230-0-001
The rendered Newick is actually better than it looks. A lot of taxa in the original image are sp. and so we want them to be pruned. I am not sure why C. apicola got duplicated, it is only in the image once.
I think this is because it has two close relations, both of which are pruned. The result is a bug.
Whilst I am at #wikisci
I am fuzzy-matching the ~1790 taxon pseudo-names we have OCR'd out that don't exactly match any names in the NCBI taxdump. This will tell us how many characters off we are for each, and give us the ability to correct those incorrect names.
I am using tre-agrep to do this: http://manpages.ubuntu.com/manpages/lucid/man1/tre-agrep.1.html
For transparency the simple script I'm running is below. It's quite slow. Could certainly optimize the names dump somewhat:
# input file name assumed from the label-cleaning folder description
while read -r i ; do
  echo "$i" ;
  tre-agrep -3 -s -B "$i" scinames.dmp | head -1 | tee -a approxmatch.txt ;
done < No-hit-names.txt
Script still chugging away. ~44 seconds per name to fuzzy match over ~1 million lines in NCBI taxdump.
By luck the matching appears to work well (so far) where synonymization has occurred, e.g.:
original name from OCR: Candida lignohabitans
best fuzzy match from taxdump: 5:796027 | Sugiyamaella lignohabitans
The 5 is apparently the distance score of the match between the input and the best-match name in the output.
Below is clearer; there is a distance of just '1':
1:178899 | Carboxydocella thermautotrophica
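For anyone wanting to post-process approxmatch.txt, a rough Python sketch of splitting such a line into its parts follows. It assumes every line has the `cost:taxid | name` layout seen in the two examples above; `Match` and `parse_match` are made-up names.

```python
from typing import NamedTuple


class Match(NamedTuple):
    cost: int     # distance score printed by tre-agrep -s
    tax_id: int   # NCBI taxonomy ID from the modified names dump
    name: str     # best-matching scientific name


def parse_match(line: str) -> Match:
    """Split one 'cost:taxid | name' line; assumes the layout shown in the examples."""
    cost, rest = line.split(":", 1)
    fields = [field.strip() for field in rest.split("|")]
    return Match(int(cost), int(fields[0]), fields[1])


print(parse_match("5:796027 | Sugiyamaella lignohabitans"))
print(parse_match("1:178899 | Carboxydocella thermautotrophica"))
```

Sorting the parsed matches by `cost` would put the likeliest OCR garbles (distance 1 or 2) first and leave the doubtful ones for manual checking.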
Candida / Sugiyamaella is of course also in the Kingdom Fungi, which flags up some further issues with our input data. It's not all just bacteria & archaebacteria.
First cut at supertree displayed in conventional form at https://github.com/ContentMine/ijsem/blob/master/supertree-analysis/strict1.svg . There are many artefacts in this (singleton tips, garbles) but it shows the general structure of the tree.
Ross, can you outline the workflow for creating the next version of the supertree? Part of it includes trimming the singleton tips, for which we can use ami-phylo. I imagine it is something like:
I will commit latest AMI which is capable of creating conventional SVG trees from NWK.
It will be useful if we identify those operations for which we shall need third-party software. There is no need for this to be part of AMI. However, it's worth noting that XML is much better suited than NWK for carrying annotations and history. For example, NeXML can retain deleted tips in the OTUS list and indicate that they have been deleted and why.
I am already well underway with task 1.) I will soon be uploading a csv spreadsheet with a mapping from "our tip labels" -> the NCBI Taxdump labels they should be.
2.) exactly. Simple find & replace operations should suffice to replace garbled tips in my spreadsheet above with the NCBI-corrected name of what they should be. Note this includes correction of both garbles AND synonyms where the name in the image has been OCR'd correctly but taxonomic nomenclatural changes have changed the taxon name associated with that organism/sequence.
3.) Yes, for pruning.
4.) Yes. I could do this in R on the Newick strings, but this would be great for you to tackle if you have time.
5.) Well, yes but for clarity I'd state 5a.) is creating the MRP matrix from Newick string 'source trees' and 5b.) is passing that singular MRP matrix to TNT to compute the shortest length tree possible given the data. a) & b) are rather distinct tasks. b) will be very time consuming once we're happy with the input data matrix. I will run more stringent search methods only when we're happy with the data input. No point until then.
6.) Display, and mathematically compare to the NCBI Taxonomy tree using tree-to-tree distance measures, e.g. Robinson-Foulds. This step provides an actual quantitative measure of how similar our results are to previous hypotheses of the relationships between these taxa. Our supertree certainly won't be identical / perfectly matching, but I hope it has a significantly closer fit than any random tree of the same size & labels.
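To make steps 5a and 6 concrete, here is a small Python sketch, not the pipeline we actually run. It assumes unquoted tip labels, drops branch lengths, uses standard Baum-Ragan coding (1 = in the clade, 0 = in the tree but outside the clade, ? = taxon absent from that source tree), leaves out the all-zero outgroup row that is usually added for rooting, and computes the rooted, clade-based form of Robinson-Foulds. All function names are made up for illustration.

```python
import re


def parse_newick(text):
    """Parse Newick into nested lists of tip labels (branch lengths and node labels dropped)."""
    stack, current, previous = [], [], ""
    for token in re.findall(r"[(),;]|[^(),;]+", text):
        token = token.strip()
        if not token:
            continue
        if token == "(":
            stack.append(current)
            current = []
        elif token == ")":
            finished, current = current, stack.pop()
            current.append(finished)
        elif token not in (",", ";") and previous != ")":
            current.append(token.split(":")[0].strip())
        previous = token
    return current[0]


def clusters(tree):
    """Return (all tips, set of non-trivial clades) for a parsed tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    all_tips = tips(tree)
    return all_tips, {c for c in found if 1 < len(c) < len(all_tips)}


def mrp_matrix(newick_trees):
    """One column per clade per source tree; rows are taxa, values are '1', '0' or '?'."""
    parsed = [clusters(parse_newick(t)) for t in newick_trees]
    taxa = sorted(set().union(*(tips for tips, _ in parsed)))
    matrix = {taxon: "" for taxon in taxa}
    for tips, clades in parsed:
        for clade in sorted(clades, key=sorted):
            for taxon in taxa:
                matrix[taxon] += "1" if taxon in clade else ("0" if taxon in tips else "?")
    return matrix


def rf_distance(newick_a, newick_b):
    """Rooted Robinson-Foulds: number of clades found in one tree but not the other."""
    tips_a, clades_a = clusters(parse_newick(newick_a))
    tips_b, clades_b = clusters(parse_newick(newick_b))
    if tips_a != tips_b:
        raise ValueError("trees must share the same tip labels")
    return len(clades_a ^ clades_b)


print(mrp_matrix(["((A,B),(C,D));", "((A,C),D);"]))
# {'A': '101', 'B': '10?', 'C': '011', 'D': '010'}
print(rf_distance("((A,B),(C,D));", "((A,C),(B,D));"))  # 4
```

The matrix would still need writing out in TNT's input format before step 5b, and for step 6 the supertree and the NCBI tree would first have to be pruned to the same set of tips.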
New folder added to github demonstrating label-cleaning work that is still in progress: https://github.com/ContentMine/ijsem/tree/master/label-cleaning
agrepping.sh is the fuzzy-matching script I'm using to help automate the matching of our tip labels to those in the NCBI taxdump (I have provided the modified version of this that I am working against as scinames.dmp).
No-hit-names.txt is the input list of names/labels which do not exactly match anything in NCBI taxdump. approxmatch.txt is the output from the fuzzy-matching script of the closest matching name in NCBI taxdump (not always correctly matched!).
name-results.csv contains the input unmatched labels and the output best-match-guess labels along with the edit distance and a column of my manual checking comments. If not correctly matched, I add the correct name/label in column F. This manual checking is still ongoing.
The first example in the csv is this:
"Acfinaplanes palleronii" (matches to ->) Actinoplanes palleronii (edit distance ->) 2 (NCBI Tax ID ->) 113570
More complex rows are where the name in the image was OCR'd nearly or 100% correctly but is now a synonym of the current accepted name, e.g.:
"Chryseobaclerium meningaseplicum" (fuzzy-matched to ->) Chryseobacterium hungaricum (edit distance ->) 9 (TaxID ->) 454006 (my statement ->) WRONG (the current accepted name this label should be ->) Elizabethkingia meningoseptica
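As a rough illustration of how the sheet could drive the later find & replace, the Python sketch below turns rows into an "old label -> new label" mapping. The column order (A label, B best match, C edit distance, D TaxID, E manual check, F corrected name) is my reading of the two examples above, the sample rows are typed in by hand rather than copied from name-results.csv, and falling back to the fuzzy match when column F is empty is an assumption that only makes sense once the manual checking is complete.

```python
import csv
import io


def build_corrections(rows):
    """Map each OCR'd label to column F if filled in, otherwise to the fuzzy match."""
    corrections = {}
    for row in rows:
        label, best_match, _distance, _tax_id, _check, corrected = (list(row) + [""] * 6)[:6]
        if label.strip():
            corrections[label.strip()] = corrected.strip() or best_match.strip()
    return corrections


# Illustrative rows only; the manual-check column of the first row is left blank here.
SAMPLE = """Acfinaplanes palleronii,Actinoplanes palleronii,2,113570,,
Chryseobaclerium meningaseplicum,Chryseobacterium hungaricum,9,454006,WRONG,Elizabethkingia meningoseptica
"""

corrections = build_corrections(csv.reader(io.StringIO(SAMPLE)))
for old, new in corrections.items():
    print(old, "->", new)
```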
I should explain the scoring of the manual checking column. It should have only one of four states:
Looks good. The small edit distances come from (a) OCR garbles and (b) typos by the authors. I assume that the larger distances might resolve to a large number of congeners and that these could only be distinguished by lookup against the EGID. The last example can presumably only be resolved by human inspection (or by a re-naming authority) as all parts of the name are different.
Taxize, an rOpenSci package, can handle synonymous names... but only for plants (because the plant nomenclature community is better organized), and it can't do fuzzy-matching AND synonymous names in the same name, e.g. a misspelt synonym. https://ropensci.org/tutorials/taxize_tutorial.html
So I think the only way is manual cleaning, I'm afraid. I'd love to be wrong but there are a number of different issues. It's not a simple problem of just OCR garble.
We have some generic problems which could apply to any project:
- OCR garbles
- author typos (including transpositions)
Both can be tackled with Levenshtein or other fuzzy matching.
Major name changes require a lookup somewhere. If there is an ID this is easy to check. If the names are completely changed and the ID is garbled then a human will have to guess.
For PLUTo, what is the frequency of changed names? In a sense they aren't "errors" but they can't be reconciled. How much work does it involve?
Usually you can throw the non-matching name string into the box at: http://www.ncbi.nlm.nih.gov/taxonomy and it will send you to the current accepted name for the input, IF the input has relatively few garbles. Perhaps there is an API or package that does this somewhere but I haven't found it.
Spent yesterday in Sussex at the Resource Politics conference, talking with Cameron Neylon & Cindy Regalado in an open science session. Work resumes now...
I have just uploaded to github a list of taxon name labels found in Newick trees which need pruning, e.g. 'Elbe river'. There are 35 of them. We must delete/prune all instances of them, even if there are multiple instances. These are not useful to a supertree analysis. The three Candida names are in there because I was unable to confidently match up these names with accepted NCBI taxonomy names. They are fungal anyway and so not central to this analysis.
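For clarity about what I mean by pruning, here is a minimal Python sketch (not ami-phylo, and the function names are made up). It assumes unquoted labels, drops branch lengths, removes every instance of an unwanted tip, and collapses any internal node left with a single child so that no label ends up duplicated or dangling.

```python
import re


def parse_newick(text):
    """Parse Newick into nested lists of tip labels (branch lengths and node labels dropped)."""
    stack, current, previous = [], [], ""
    for token in re.findall(r"[(),;]|[^(),;]+", text):
        token = token.strip()
        if not token:
            continue
        if token == "(":
            stack.append(current)
            current = []
        elif token == ")":
            finished, current = current, stack.pop()
            current.append(finished)
        elif token not in (",", ";") and previous != ")":
            current.append(token.split(":")[0].strip())
        previous = token
    return current[0]


def prune(node, unwanted):
    """Drop tips in `unwanted`; remove emptied nodes and collapse single-child nodes."""
    if isinstance(node, str):
        return None if node in unwanted else node
    kept = [child for child in (prune(c, unwanted) for c in node) if child is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else kept


def to_newick(node):
    """Write a nested-list tree back out as a Newick string without the final ';'."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def prune_newick(text, unwanted):
    """Return the pruned Newick string, or None if no tips survive."""
    pruned = prune(parse_newick(text), set(unwanted))
    return None if pruned is None else to_newick(pruned) + ";"


print(prune_newick("((A,Elbe river),(Elbe river,(C,D)));", ["Elbe river"]))
# (A,(C,D));
```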
80 names left to check / correct. Nearly there. Tomorrow I can create a clean supertree with taxon names that exactly match those that NCBI uses!
All names now cleaned or deleted relative to NCBI taxonomy accepted names. I will go home now. Will do the actual replacing of old/wrong names tomorrow or tonight. Updated names sheet pushed to github