Discovering knowledge about the Zika virus
We want to discover research about related viruses that may be relevant for researchers working on the Zika virus.
- Basic command line knowledge
- In this example we will employ tools developed by ContentMine, as well as other open-source software. You can use a current ContentMine VirtualBox, or install the tools locally.
Looking for papers that may be of relevance
First we create a project folder which will hold our input data and results, and change our working directory to it.
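For example (the folder name is our own choice, any name works):

```shell
# Create a project folder and change the working directory to it
mkdir -p zikaproject
cd zikaproject
```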
We can use getpapers to bulk-download papers, but what should we look for? We start with a search on UniProt to get the names of other Flaviviruses, the genus to which the Zika virus belongs.
In the taxonomy search we query for ancestor:"Flavivirus (arboviruses group B)" reviewed:no complete:no.
This gives us a list of 185 entries which we download as a tab-separated file to
Flavivirus.tab. From this file we want to extract the third column, which contains the scientific names. We also want to exclude very specific names that contain numbers, such as West Nile virus strain PT6.39. We can extract it manually with Excel or LibreOffice, or use Unix command-line tools. With
cat Flavivirus.tab we read the file line by line.
| is the pipe character, which takes the output of one tool and passes it as input to another (it is like a conveyor belt between two machines).
cut -f3 reads the file as a tab-delimited file and extracts the third column. The third command
grep "[0-9]" -vE excludes any lines that contain numbers.
>> scientificnames.txt then appends the output to the file scientificnames.txt (use > instead if you want to overwrite an existing file). If you want a closer look at what is happening at each step, start with the first command and add the others one by one.
cat Flavivirus.tab | cut -f3
cat Flavivirus.tab | cut -f3 | grep "[0-9]" -vE
cat Flavivirus.tab | cut -f3 | grep "[0-9]" -vE >> scientificnames.txt
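If you want to test the pipeline without first downloading the UniProt export, you can run it on a small made-up sample file (the rows below are illustrative, not real UniProt data):

```shell
# Three tab-separated columns; the third holds the scientific name (made-up rows)
printf 'Entry\tEntry name\tScientific name\nQ1\tA\tZika virus\nQ2\tB\tWest Nile virus strain PT6.39\n' > sample.tab
# Keep column 3 and drop lines containing digits: the header and "Zika virus"
# pass through, the strain name "West Nile virus strain PT6.39" is filtered out
cut -f3 sample.tab | grep "[0-9]" -vE
```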
With cat scientificnames.txt or any text editor we then look into the file to get an idea of what we are going to search for.
wc -l scientificnames.txt counts the number of lines, which is 130 in our example. We also notice the header line ("Scientific name"), which we don't want to query for.
We now run a small
while-loop to have getpapers type in the search queries and download the papers from EuropePMC for us.
tail -n +2 scientificnames.txt reads scientificnames.txt line by line;
-n +2 skips the header ("Scientific name"). The names are then piped to a
while-loop which iterates through the lines. For each line the loop asks getpapers to query EuropePMC for the name in the variable
"$line", download the fulltext.xml where available with -x,
and store the files in the output folder with -o zika. echo "$line"
repeats the name so we can see where we are. It is advisable to first check how many search results to expect by using -n.
tail -n +2 scientificnames.txt | while read line; do
  getpapers -q "$line" -x -o zika;
  echo "$line";
done
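Since the getpapers downloads can take a while, the loop logic itself can be checked safely first by substituting echo for getpapers (the file contents below are a made-up stand-in for scientificnames.txt):

```shell
# Stand-in for scientificnames.txt: a header plus two names
printf 'Scientific name\nZika virus\nDengue virus\n' > demo-names.txt
# tail -n +2 skips the header; the loop body runs once per remaining line
tail -n +2 demo-names.txt | while read line; do
  echo "would query: $line"
done
```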
As an intermediate step we transform the fulltext.xml files into scholarly.html files with norma. This gives us a standardized input format which can be consumed easily by machines.
norma --project zika -i fulltext.xml -o scholarly.html --transform nlm2html
We now run the ami plugins, which will search for occurrences of species names, gene names and sequences.
ami2-gene --project zika -i scholarly.html --g.gene --g.type human
for type in genus binomial genussp; do
  ami2-species --project zika -i scholarly.html --sp.species --sp.type $type;
done
for type in dna rna prot prot3 carb3; do
  ami2-sequence --project zika -i scholarly.html --sq.sequence --sq.type $type;
done
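The same dry-run trick works for the for-loops: replace the ami call with echo to confirm which plugin invocations the loop will run before starting the real analysis:

```shell
# Print the commands the species loop would execute, without running ami2-species
for type in genus binomial genussp; do
  echo "ami2-species --project zika -i scholarly.html --sp.species --sp.type $type"
done
```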