CM Frequently Asked Questions
What is ContentMine's Mission?
To use machines and humans to mine the scientific literature for facts on a massive scale. There are three major activities:
Advocacy. We are active in lobbying the UK and EU parliaments for liberalisation of copyright, especially for Text and Data Mining (TDM). We were members of the EC H2020 FutureTDM project and regularly make presentations to interested parties.
Community. We believe that citizens everywhere, including young people (from about 15 years of age), can take part in mining and its development. We run a Fellowship program open to anyone.
Tools. Tools are essential for promoting TDM and our Open Source tools are specifically developed for widespread use in fact-rich subjects. We have several partnerships and are keen to hear from anyone wanting tool development.
What are the tools?
getpapers (Rik Smith-Unna) queries repositories and then downloads the resulting papers automatically (often 500 per minute). Current repositories include Crossref. See http://github.com/contentmine/getpapers. Node.js
quickscrape (Rik Smith-Unna) uses a list of DOIs or URLs to download (scrape) papers from publisher sites (where the user has permission). Node.js
cephis is a large library of tools for creating structured semantic documents from low-level articles (e.g. PDF or bitmaps). It covers arrays, strings, XML, SVG, HTML, PDF parsing (uses PDFBox), bitmap analysis, diagram analysis (Figures) and Tables. The major data structures are exposed on disk as CProjects, made of CTrees, one for each article. CTrees contain explicit text, metadata, citations, tables, figures, schemes, supplemental data, and the results of processing by norma and AMI:
norma is a declarative approach to transforming between types (PNG, SVG, PDF, HTML and higher types).
AMI provides higher level transformations (chemistry, phylogenetics, species, genes, diagrams) and text search (Bloom filter and regex). It is fundamentally supported by http://wikidata.org, which adds semantics to many thousands of terms held in dictionaries.
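The on-disk layout described above can be explored with a few lines of Python. This is only a minimal sketch, assuming that a CProject is a directory whose immediate subdirectories are CTrees. The file names in `EXPECTED_FILES` and the helper `summarise_cproject` are assumptions made for illustration, not a documented part of the tools, so check what your own CTrees contain.

```python
from pathlib import Path

# Hypothetical file names for illustration; real CTrees may differ.
EXPECTED_FILES = ("fulltext.xml", "fulltext.pdf", "scholarly.html")


def summarise_cproject(cproject_dir):
    """Map each CTree (subdirectory) name to the expected files it contains."""
    summary = {}
    for ctree in sorted(Path(cproject_dir).iterdir()):
        if not ctree.is_dir():
            continue  # skip loose files at the project level
        summary[ctree.name] = [
            name for name in EXPECTED_FILES if (ctree / name).is_file()
        ]
    return summary
```

Calling `summarise_cproject("my_project")` on a directory of your own gives a quick view of which articles have which document types before any further processing.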
Why not just use Google or WebOfScience?
None of the major search engines make their results Open. They usually operate as walled gardens with "snoop and control". Their output is not reproducible and comes with no guarantee of permanence, so it cannot be used for research.
ContentMine is oriented towards readers, so puts the searches in their hands. You can run whatever search you want and aren't dependent on what the corporations want to feed you. It's also free (the academic search engines cost libraries a lot of money). It can also be pointed at NGO output or any files on your own disk.
Where are the results?
The results end up on your disk, so if you value privacy no one else needs to know. For public knowledge we put the results on Zenodo. We use W3C annotation to ensure standards.
The output is modular and is easy to ingest into modern display tools (Jupyter, R/Tidyverse, D3, etc.). That's because all data is semantic (JSON or XML).
How can I create my own search?
Searches use dictionaries, which are lists of search terms linked to Wikidata where possible. It's easy to create a new dictionary by finding an authoritative list of terms (e.g. from a learned society or NGO). You can also create a personal list of the terms related to your research.
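To make the idea of a dictionary concrete, here is a minimal Python sketch, assuming a dictionary is simply a mapping from terms to Wikidata identifiers. The terms, the placeholder identifiers and the function names are invented for illustration, and the matching uses a plain regular expression. It does not reproduce AMI's own dictionary format or its Bloom filter search.

```python
import re

# A toy dictionary: each term maps to a Wikidata identifier.
# These identifiers are placeholders, not real Wikidata items.
DICTIONARY = {
    "Aedes aegypti": "Q-PLACEHOLDER-1",
    "Zika virus": "Q-PLACEHOLDER-2",
    "dengue": "Q-PLACEHOLDER-3",
}


def compile_dictionary(dictionary):
    """Build one case-insensitive regex matching any term as a whole phrase."""
    # Longest terms first so multi-word terms win over shorter overlaps.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def search(text, dictionary):
    """Return (matched text, identifier, character offset) for each hit."""
    lookup = {term.lower(): ident for term, ident in dictionary.items()}
    regex = compile_dictionary(dictionary)
    return [
        (m.group(0), lookup[m.group(0).lower()], m.start())
        for m in regex.finditer(text)
    ]
```

Adding a new search is then a matter of supplying a different mapping, for example one built from a learned society's term list.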
What skills do I (and my friends) need?
Although we may be able to run the software automatically at a later date, early adopters will probably need to:
* download and install software (often consecutively)
* use git and (probably) GitHub
* use the commandline
* use simple tools for navigating directories and searching. (in Unix:
* use a text editor (NOT Word)
* use Web technology (REST, curl, maybe CSS, etc.)
* edit HTML
* understand JSON
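As a feel for what "REST" and "JSON" mean in practice, the following Python sketch builds a query URL and extracts DOIs from a JSON reply. It assumes the public Crossref REST endpoint `https://api.crossref.org/works` and a reply shaped like `{"message": {"items": [...]}}`. The function names are hypothetical, and this is not how getpapers itself is implemented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public Crossref REST API.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5):
    """Return a search URL for a free-text query, limited to `rows` results."""
    return CROSSREF_WORKS + "?" + urlencode({"query": query, "rows": rows})


def extract_dois(response_text):
    """Pull DOIs out of a JSON reply shaped like {"message": {"items": [...]}}."""
    payload = json.loads(response_text)
    return [item["DOI"] for item in payload["message"]["items"] if "DOI" in item]


def fetch_dois(query, rows=5):
    """Perform the request (needs network access) and return the DOIs."""
    with urlopen(build_query_url(query, rows)) as response:
        return extract_dois(response.read().decode("utf-8"))
```

The same URL produced by `build_query_url` can be pasted into a browser or passed to curl, which is a convenient way to inspect the raw JSON.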
What environment do I need?
All code is written to be platform independent, but there are sometimes minor glitches (especially with filenames). For most platforms installation is straightforward.
getpapers and quickscrape
These require Node.js. You can find recent installation instructions on the web. Use nvm to install Node and then npm to install applications.
normami and cephis
- To run you need a Java JRE (8 or later).
- To build you need maven and a JDK (8 or later).
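A quick way to see which prerequisites are already installed is to look for the relevant commands on your PATH. This is a minimal sketch, assuming the usual command names (`node`, `npm`, `java`, `javac`, `mvn`). The `REQUIRED` table and `missing_commands` are illustrative names, and the check does not verify version numbers.

```python
import shutil

# Command names assumed to be on PATH after a standard install.
REQUIRED = {
    "getpapers and quickscrape": ["node", "npm"],
    "running normami and cephis": ["java"],
    "building normami and cephis": ["mvn", "javac"],
}


def missing_commands(required=REQUIRED, which=shutil.which):
    """Return {purpose: [commands not found on PATH]} for unmet requirements."""
    report = {}
    for purpose, commands in required.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            report[purpose] = absent
    return report
```

An empty result from `missing_commands()` means every listed command was found. Anything reported still needs installing before the corresponding tools will run or build.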
What subjects do you cover?
Our main interests are physical science, medicine and bioscience, and we've worked in plant science, systematic reviews and clinical trials, chemistry, crystallography, biological systematics, etc. However, we can help in any subject where there are well-defined terms (e.g. place names, organizations, statutes, etc.). We've engaged with Human Rights judgments and elsewhere in transgender studies. CM works best with formal documents (academic articles, theses, NGO reports, etc.).
What formats of documents can CM process?
In order of decreasing tractability:
- Modern born-digital documents (e.g. post 2000 CE). XML > HTML >> PDF >> scanned bitmaps.
- Scanned documents. Very variable (typography gets worse with earlier dates). We have done C19th science. Careful academic scanning can be very good; casual scanning suffers from spine deformations, stray light, poor contrast, bleeding and misalignment. We can do quite a lot but there will be errors. Also, many characters are homoglyphs (e.g. "l" (el) and "1" (one)).
- handwriting. No! Not even numbers.
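The homoglyph problem in scanned text can be illustrated with a small Python sketch that flags number-like tokens containing letters easily mistaken for digits. Only "l" versus "1" comes from the list above. Treating "I" and "O" as confusable is an added assumption, and `suspicious_numbers` is a hypothetical helper rather than part of the ContentMine tools.

```python
import re

# Letters assumed to be confusable with digits in scanned text.
CONFUSABLE_LETTERS = "lIO"

# A token that looks numeric once confusable letters are allowed.
NUMBER_LIKE = re.compile(r"[0-9" + CONFUSABLE_LETTERS + r".]+")


def suspicious_numbers(text):
    """Return number-like tokens that mix digits with confusable letters."""
    flagged = []
    for token in re.findall(r"[0-9A-Za-z.]+", text):
        if not NUMBER_LIKE.fullmatch(token):
            continue  # contains other letters, so probably a word
        has_digit = any(c.isdigit() for c in token)
        has_letter = any(c in CONFUSABLE_LETTERS for c in token)
        if has_digit and has_letter:
            flagged.append(token)
    return flagged
```

A check like this only points at likely trouble spots. Deciding what the character really is still needs context or a human reader.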
What's the performance?
We have downloaded 500 articles from EuropePMC in 1 minute, converted them to HTML in another 1 min, and searched them in another 1 min. Less than the time to drink a coffee. But large theses can take 30 secs each.
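For rough planning, the figures above can be turned into a back-of-envelope estimate. This sketch assumes three stages (download, conversion, search), each handling about 500 articles per minute, and that time scales linearly with the number of articles. Real runs will vary with network speed and document size, and large theses are much slower, as noted above.

```python
def estimated_minutes(n_articles, articles_per_minute=500, stages=3):
    """Estimate wall-clock minutes, assuming every stage scales linearly."""
    if articles_per_minute <= 0:
        raise ValueError("articles_per_minute must be positive")
    return stages * n_articles / articles_per_minute


print(estimated_minutes(500))    # 3.0
print(estimated_minutes(10000))  # 60.0
```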
Where can I see case studies?
There are several case studies alongside this FAQ at http://discuss.contentmine.org. Examples are:
When was ContentMine launched?
ContentMine was created in 2013 by Peter Murray-Rust (PMR) as a proposal to the Shuttleworth Foundation. PMR was awarded a fellowship in 2014-01 and started in March. The Fellowship ran for 2 years. Jenny Molloy acted as business manager and we decided to create a company in 2016 (https://beta.companieshouse.gov.uk/company/10172863/filing-history).
Who are the current directors and what is the mission?
- Peter Murray-Rust (founder, co-director)
- Jenny Molloy (co-director, 2016)
- Cesar Gomez (co-director, 2017)
The mission is to run as a social venture, so we have added an OpenLock clause which prevents the company from acting against its mission or being bought. In practice, therefore, we act as a non-profit company (i.e. not offering shares for sale), but we are a trading company and are able to build reserves.