I have added a new folder to GitHub demonstrating the label-cleaning work, which is still in progress: https://github.com/ContentMine/ijsem/tree/master/label-cleaning
agrepping.sh is the fuzzy-matching script I'm using to help automate the matching of our tip labels to those in the NCBI taxdump (I have provided the modified version of the taxdump that I am working against as scinames.dmp).
No-hit-names.txt is the input list of names/labels which do not exactly match anything in the NCBI taxdump.
approxmatch.txt is the output from the fuzzy-matching script: the closest matching name in the NCBI taxdump for each input label (not always correctly matched!).
name-results.csv contains the input unmatched labels and the output best-match-guess labels, along with the edit distance and a column of my manual checking comments. If a label is not correctly matched, I add the correct name/label in column F.
This manual checking is still ongoing.
The first example in the csv is this:
"Acfinaplanes palleronii" (matches to ->) Actinoplanes palleronii (edit distance ->) 2 (NCBI Tax ID ->) 113570
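The edit distance in this row can be reproduced with a plain Levenshtein calculation. The sketch below is not what agrepping.sh does internally (agrep uses its own approximate-matching algorithm); it is a minimal Python stand-in, and the three-name `names` list is a hypothetical substitute for scinames.dmp. The function names `levenshtein` and `closest_name` are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def closest_name(label, candidates):
    """Return (best_candidate, distance) for the smallest edit distance."""
    return min(((name, levenshtein(label, name)) for name in candidates),
               key=lambda pair: pair[1])


# Hypothetical stand-in for the names listed in scinames.dmp
names = [
    "Actinoplanes palleronii",
    "Chryseobacterium hungaricum",
    "Elizabethkingia meningoseptica",
]

best, distance = closest_name("Acfinaplanes palleronii", names)
print(best, distance)  # Actinoplanes palleronii 2
```

The two edits are the substitutions f -> t and a -> o in the genus name. Against the full taxdump a linear scan like this would be slow, which is one reason to use a dedicated tool instead.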
More complex rows are those where the name in the image was OCR'd nearly or 100% correctly but is now a synonym of the currently accepted name, e.g.:
"Chryseobaclerium meningaseplicum" (fuzzy-matched to ->) Chryseobacterium hungaricum (edit distance ->) 9 (TaxID ->) 454006 (my statement ->) WRONG (the current accepted name this label should be ->) Elizabethkingia meningoseptica
https://github.com/ContentMine/ijsem/blob/master/label-cleaning/name-results.csv
I should explain the scoring of the manual checking column. It should have only one of four states:
- (blank): the edit distance is low, so the fuzzy-matching probably found the right name by correcting for one or two OCR mistakes.
- DELETE: the label does not represent a specific taxon. It is an environmental sample, e.g. "oral clone", or a "genomospecies", "unidentified", "uncultured" et cetera...
- CORRECT: miraculously, despite the edit distance being >3, I have manually checked the suggested label from the fuzzy-matching process and I believe it to be correct. Some impressively different matches have been made.
- WRONG: the fuzzy-matching process has not suggested the correct name for this label. I have manually checked it, and the correct, currently accepted name is given in the adjacent column (column F).
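Since the checking column should only ever hold one of these four states, a small script can flag rows that break the scheme. This is a rough sketch only: the column headers (`label`, `match`, `distance`, `check`, `corrected_name`) are hypothetical and would need adjusting to the real layout of name-results.csv. The first two sample rows are the examples quoted above; the last two are made up to trigger the checks.

```python
import csv
import io

# The four allowed states of the manual checking column
VALID_STATES = {"", "DELETE", "CORRECT", "WRONG"}


def check_rows(rows):
    """Return (row_number, problem) pairs for rows that break the scheme."""
    problems = []
    for number, row in enumerate(rows, start=1):
        state = row["check"].strip()
        corrected = row["corrected_name"].strip()
        if state not in VALID_STATES:
            problems.append((number, f"unknown state {state!r}"))
        elif state == "WRONG" and not corrected:
            problems.append((number, "WRONG but no corrected name given"))
    return problems


# In-memory sample; column headers are assumptions, not the real CSV header
sample = io.StringIO(
    "label,match,distance,check,corrected_name\n"
    "Acfinaplanes palleronii,Actinoplanes palleronii,2,,\n"
    "Chryseobaclerium meningaseplicum,Chryseobacterium hungaricum,9,WRONG,"
    "Elizabethkingia meningoseptica\n"
    "Made-up label one,Made-up match one,1,MAYBE,\n"
    "Made-up label two,Made-up match two,5,WRONG,\n"
)

rows = list(csv.DictReader(sample))
problems = check_rows(rows)
for number, problem in problems:
    print(f"row {number}: {problem}")
```

To run it against the real file, replace `sample` with `open("name-results.csv", newline="")` and rename the columns to match.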