Script still chugging away. ~44 seconds per name to fuzzy match over ~1 million lines in NCBI taxdump.
By luck the matching appears to work well (so far) where synonymization has occurred, e.g.:
original name from OCR: Candida lignohabitans
best fuzzy match from taxdump: 5:796027 | Sugiyamaella lignohabitans
The 5 is apparently the distance score between the input name and the best-match name in the output. The example below is clearer, with a distance of just 1:
1:178899 | Carboxydocella thermautotrophica
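To make the "distance:TaxID | name" output concrete, here is a minimal Python sketch of best-match lookup by plain Levenshtein edit distance. It is not the agrepping.sh script, and the score agrep reports may be computed differently; `TAXDUMP_SAMPLE`, `levenshtein` and `best_match` are hypothetical names, and the three-row list is only a stand-in for the ~1 million line taxdump.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def best_match(query, candidates):
    """Return (distance, tax_id, name) for the candidate name closest to query."""
    return min((levenshtein(query, name), tax_id, name) for tax_id, name in candidates)


# Hypothetical miniature stand-in for the taxdump names file: (TaxID, name) pairs.
TAXDUMP_SAMPLE = [
    (178899, "Carboxydocella thermautotrophica"),
    (796027, "Sugiyamaella lignohabitans"),
    (113570, "Actinoplanes palleronii"),
]

distance, tax_id, name = best_match("Acfinaplanes palleronii", TAXDUMP_SAMPLE)
print(f"{distance}:{tax_id} | {name}")  # prints: 2:113570 | Actinoplanes palleronii
```

A pure-Python loop like this would be far too slow over the full taxdump; it is only meant to show what the score and the output line represent.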
Candida / Sugiyamaella is of course also in the Kingdom Fungi, which flags up some further issues with our input data. It's not all just bacteria & archaebacteria.
First cut at supertree displayed in conventional form at https://github.com/ContentMine/ijsem/blob/master/supertree-analysis/strict1.svg . There are many artefacts in this (singleton tips, garbles) but it shows the general structure of the tree.
Ross, can you outline the workflow for creating the next version of the supertree? Part of it includes trimming the singleton tips, for which we can use ami-phylo. I imagine it is something like:
I will commit latest AMI which is capable of creating conventional SVG trees from NWK.
It will be useful if we identify those operations for which we shall need third-party software. There is no need for this to be part of AMI. However, it's worth noting that XML is much better suited than NWK for carrying annotations and history. For example, NEXML can retain deleted tips in the OTUS list and indicate that they have been deleted and why.
I am already well under way with task 1.) I will soon be uploading a csv spreadsheet with a mapping from "our tip labels" -> the NCBI Taxdump labels they should be.
2.) exactly. Simple find & replace operations should suffice to replace garbled tips in my spreadsheet above with the NCBI-corrected name of what they should be. Note this includes correction of both garbles AND synonyms where the name in the image has been OCR'd correctly but taxonomic nomenclatural changes have changed the taxon name associated with that organism/sequence.
3.) Yes, for pruning.
4.) Yes. I could do this in R on the Newick strings, but this would be great for you to tackle if you have time.
5.) Well, yes but for clarity I'd state 5a.) is creating the MRP matrix from Newick string 'source trees' and 5b.) is passing that singular MRP matrix to TNT to compute the shortest length tree possible given the data. a) & b) are rather distinct tasks. b) will be very time consuming once we're happy with the input data matrix. I will run more stringent search methods only when we're happy with the data input. No point until then.
6.) Display, and mathematically compare to the NCBI Taxonomy tree using tree-to-tree distance measures, e.g. Robinson-Foulds. This step provides an actual quantitative measure of the similarity of our results to previous hypotheses of relationships between these taxa. Our supertree certainly won't be identical / perfectly matching, but I hope it has a significantly closer fit than any random tree of the same size & labels.
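For step 5a above, a minimal Python sketch of standard MRP (matrix representation with parsimony) coding may help show what the matrix contains. It is an illustration only, not the code used in this project: trees are written as hypothetical nested tuples rather than parsed from Newick, the names `leaves`, `clades`, `mrp_matrix` and `EXAMPLE_TREES` are made up, and the all-zero outgroup row usually added before running a parsimony search is left out.

```python
def leaves(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(tree):
    """List the leaf sets of all internal nodes below the root that hold 2+ tips."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        members = leaves(node)
        if not is_root and len(members) > 1:
            found.append(members)
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """One column per clade per source tree: 1 = in clade, 0 = in tree only, ? = absent."""
    all_taxa = sorted(set().union(*(leaves(tree) for tree in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(codes) for taxon, codes in rows.items()}


# Two hypothetical source trees that overlap only partly in their tips.
EXAMPLE_TREES = [(("A", "B"), ("C", "D")), (("A", "C"), "E")]
matrix = mrp_matrix(EXAMPLE_TREES)
for taxon, codes in matrix.items():
    print(taxon, codes)
```

The "?" entries are why overlap between source trees matters: a name that appears in only one tree contributes nothing that links that tree to the others.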
New folder added to github demonstrating label-cleaning work that is still in progress: https://github.com/ContentMine/ijsem/tree/master/label-cleaning
agrepping.sh is the fuzzy-matching script I'm using to help automate the matching of our tip labels to those in the NCBI taxdump (I have provided the modified version of this that I am working against as scinames.dmp).
No-hit-names.txt is the input list of names/labels which do not exactly match anything in NCBI taxdump. approxmatch.txt is the output from the fuzzy-matching script of the closest matching name in NCBI taxdump (not always correctly matched!).
name-results.csv contains the input unmatched labels and the output best-match-guess labels along with the edit distance and a column of my manual checking comments. If not correctly matched, I add the correct name/label in column F. This manual checking is still ongoing.
The first example in the csv is this:
"Acfinaplanes palleronii" (matches to ->) Actinoplanes palleronii (edit distance ->) 2 (NCBI Tax ID ->) 113570
More complex rows are where the name in the image was OCR'd nearly or 100% correctly but is now a synonym of the current accepted name, e.g.:
"Chryseobaclerium meningaseplicum" (fuzzy-matched to ->) Chryseobacterium hungaricum (edit distance ->) 9 (TaxID ->) 454006 (my statement ->) WRONG (the current accepted name this label should be ->) Elizabethkingia meningoseptica
I should explain the scoring of the manual checking column. It should have only one of four states:
Looks good. The small edit distances come from (a) OCR garbles and (b) typos by the authors. I assume that the larger distances might resolve to a large number of congeners and that these could only be distinguished by lookup against the EGID. The last example can presumably only be resolved by human inspection (or by a re-naming authority) as all parts of the name are different.
Taxize, an rOpenSci package, can handle synonymous names... but only for plants (because the plant nomenclature community is better organized), and it can't handle fuzzy matching AND synonymy in the same name, e.g. a misspelt synonym. https://ropensci.org/tutorials/taxize_tutorial.html
So I think the only way is manual cleaning, I'm afraid. I'd love to be wrong but there are a number of different issues. It's not a simple problem of just OCR garble.
We have some generic problems which could apply to any project:
- OCR garbles
- author typos (including transpositions)
Both can be tackled with Levenshtein or other fuzzy matching.
Major name changes require a lookup somewhere. If there is an ID this is easy to check. If the name has completely changed and the ID is garbled then a human will have to guess.
For PLUTo, what is the frequency of changed names? In a sense they aren't "errors" but they can't be reconciled. How much work does it involve?
Usually you can throw the non-matching name string into the box at: http://www.ncbi.nlm.nih.gov/taxonomy and it will send you to the current accepted name for the input, IF the input has relatively few garbles. Perhaps there is an API or package that does this somewhere but I haven't found it.
Spent yesterday in Sussex at the Resource Politics conference, talking with Cameron Neylon & Cindy Regalado in an open science session. Work resumes now...
I have just uploaded to github a list of taxon name labels found in Newick trees which need pruning, e.g. 'Elbe river'. There are 35 of them. We must delete/prune every instance of them, however many times they occur. These are not useful to a supertree analysis. The three Candida names are in there because I was unable to confidently match these names to accepted NCBI taxonomy names. They are fungal anyway and so not central to this analysis.
80 names left to check / correct. Nearly there. Tomorrow I can create a clean supertree with taxon names that exactly match those that NCBI uses!
All names now cleaned or deleted relative to NCBI taxonomy accepted names. I will go home now. Will do the actual replacing of old/wrong names tomorrow or tonight. Updated names sheet pushed to github
You have discovered a number of actual name changes rather than garbles. Is there anything to do ATM?
@petermr If you could work on a method of taking the names listed as DELETE and automatically deleting/pruning these taxa from the source XML / Newick, that would be great.
Otherwise it's just me holding things up
OK - This is where we need the workflow.
The DELETE would be
ijs123.xml = ijs123.nexml.delete(listOfDeletions);
That means deletions from 1000 files (no problem - might take a minute).
Then re-aggregate the nexml into the revised *.nex or *.tre, whatever it's called.
I can, in principle, add it to ami-phylo.
Cleaned names are now swapped into the source trees and uploaded to github. New rankings of name occurrences across the 924 source trees are below. There are 2309 names with >1 occurrence, although that figure includes ones we must delete, e.g. 'Uncultured_bacterium', so the final supertree will include 2269 NCBI-valid species. Even after name cleaning, there are 2376 names that only occur once across all 924 source trees. These 'singletons' must be pruned.
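The occurrence ranking can be reproduced in outline with a few lines of Python. This is a rough sketch, not the script used here: it assumes simple Newick strings with no quoted labels and no internal node labels, it counts the number of trees each name appears in (rather than every repeat within one tree), and `SOURCE_TREES` is a hypothetical two-tree stand-in for the 924 source trees.

```python
import re
from collections import Counter


def tip_labels(newick):
    """Extract tip labels from a simple Newick string (assumes no quoted or internal labels)."""
    without_lengths = re.sub(r":[^,();]*", "", newick)  # drop ':0.1'-style branch lengths
    return [label for label in re.split(r"[(),;\s]+", without_lengths) if label]


def count_tree_occurrences(newick_trees):
    """Count, for each label, how many trees it appears in."""
    counts = Counter()
    for newick in newick_trees:
        counts.update(set(tip_labels(newick)))
    return counts


# Hypothetical stand-in for the real source trees.
SOURCE_TREES = [
    "((Bacillus_subtilis:0.1,Escherichia_coli:0.2):0.05,Uncultured_bacterium:0.3);",
    "(Bacillus_subtilis,(Escherichia_coli,Vibrio_cholerae));",
]

counts = count_tree_occurrences(SOURCE_TREES)
singletons = sorted(name for name, n in counts.items() if n == 1)
shared = sorted(name for name, n in counts.items() if n > 1)
print("singletons:", singletons)
print("shared:", shared)
```

Names such as 'Uncultured_bacterium' would still need removing by hand from the shared list when they occur in more than one tree, since the count alone cannot tell that they are not species.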
Have added 'whitelists' and 'blacklists' to the github of names encountered in trees. The whitelist (cleanlist.txt) is the list of 2269 NCBI-valid species names we want to keep. The blacklist (tips-to-prune.txt) contains the list of 2402 names that we need to prune from trees wherever they occur because either a) they are not specific to a species, e.g. "human_intestinal", "Eutrophic_lake", or b) they are 'singletons' that only occur once across all the input source trees and thus provide no value to the supertree analysis. I will use the whitelist now to create a comparison taxonomy tree from the phyloT webservice: http://phylot.biobyte.de/
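Once a comparison tree exists, the Robinson-Foulds comparison mentioned earlier boils down to counting groupings found in one tree but not the other. Below is a minimal Python sketch of the rooted (clade-based) variant, assuming both trees carry exactly the same tip labels; the nested-tuple trees and the names `clade_set`, `rooted_rf`, `SUPERTREE` and `REFERENCE` are hypothetical, and a real comparison would more likely use an established phylogenetics package.

```python
def clade_set(tree):
    """Return (set of leaf sets of internal nodes below the root, leaf set of the root)."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a tip is a plain string
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    root_members = walk(tree)
    found.discard(root_members)  # the root clade is shared by definition
    return found, root_members


def rooted_rf(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    clades_a, taxa_a = clade_set(tree_a)
    clades_b, taxa_b = clade_set(tree_b)
    if taxa_a != taxa_b:
        raise ValueError("trees must share the same tip labels")
    return len(clades_a ^ clades_b)


# Hypothetical five-tip trees differing in one grouping.
SUPERTREE = ((("A", "B"), "C"), ("D", "E"))
REFERENCE = ((("A", "C"), "B"), ("D", "E"))

print(rooted_rf(SUPERTREE, REFERENCE))  # prints: 2
print(rooted_rf(SUPERTREE, SUPERTREE))  # prints: 0
```

The same function applied to random trees with the same labels would give the baseline that the supertree's score is hoped to beat.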
Using the drop.tip function in R to prune all the taxa that need to be pruned from all the source trees. There is a nice way to loop this through any number of trees with drop.tip but it doesn't work for me. I suspect it's because in many trees the tips I want to drop may make up 100% of the tree.
Trying drop.tip on each individual tree reveals that standard use of drop.tip can only successfully remove tips from 421/924 of the source trees. There are a variety of error messages for the others.
drop.tip error messages encountered:
Error in kids[[parent[i]]] : subscript out of bounds
Others have also encountered this error using the drop.tip function
Error in phy$edge[, 1] <- newNb[phy$edge[, 1]] :
number of items to replace is not a multiple of replacement length
I have not yet found ways around these problems or what is causing them. Still investigating.
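One way to test the suspicion that some trees consist mostly or entirely of blacklisted tips is to check how many tips would survive pruning before relying on the result. The sketch below is Python on hypothetical nested-tuple trees, not R, and it does not call or reproduce drop.tip; the `min_tips=3` threshold is my assumption (a rooted tree with fewer than three tips contains no grouping information), and it is not a diagnosis of the errors above.

```python
def prune(node, unwanted):
    """Return the node with unwanted tips removed, or None if nothing is left."""
    if isinstance(node, str):  # a tip is a plain string
        return None if node in unwanted else node
    pruned_children = (prune(child, unwanted) for child in node)
    kept = [child for child in pruned_children if child is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse an internal node left with a single child
    return tuple(kept)


def count_tips(node):
    """Count the tips below a node; None counts as zero."""
    if node is None:
        return 0
    if isinstance(node, str):
        return 1
    return sum(count_tips(child) for child in node)


def prune_all(trees, unwanted, min_tips=3):
    """Prune every tree; report indices of trees left with too few tips to be useful."""
    kept, skipped = {}, []
    for index, tree in enumerate(trees):
        pruned = prune(tree, unwanted)
        if count_tips(pruned) < min_tips:
            skipped.append(index)
        else:
            kept[index] = pruned
    return kept, skipped


# Hypothetical source trees and blacklist.
TREES = [
    (("A", "B"), ("C", "Elbe_river")),
    ("Elbe_river", "Uncultured_bacterium"),
    (("A", "Elbe_river"), "B"),
]
BLACKLIST = {"Elbe_river", "Uncultured_bacterium"}

kept, skipped = prune_all(TREES, BLACKLIST)
print(kept)     # prints: {0: (('A', 'B'), 'C')}
print(skipped)  # prints: [1, 2]
```

Comparing the skipped indices against the trees that raise errors would show whether fully or almost fully blacklisted trees account for the failures.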
Do you have all the code you need?
I think so. Working on the output supertree from the 24hr analysis now. Log file of analysis:
Consensus tree of the supertree analysis: