Corpus for mining.
The results from an OATD search are not easy for machines to follow automatically. Like so many sites, they are arranged as a series of single references. Here are the first five.
NOTE: There is no indication of licence or permissions on this page. I downloaded these by hand (like all the others). As I said before, I have no complaint against the authors.
Thesis 1
Author: Anastasia Gousseva
Title: _Investigating the Expansion of Angiosperms during the Late Cretaceous using a Modeling Approach_
Licence: None
Copyright: © Copyright by Anastasia Gousseva 2010
Pages: 211
Thesis 2
Author: CORINNE ALEXANDRA FAY
Title: _MID-CRETACEOUS pCO2, CARBON-CYCLING AND THE RISE OF THE FLOWERING PLANTS_
Licence: None
Copyright: None
Pages: 387
Thesis 3
Author: Karina I. Neimanis
Title: AN INVESTIGATION OF ALTERNATIVE OXIDASE PRESENCE, EXPRESSION, AND REGULATION IN THE MOSS PHYSCOMITRELLA PATENS
Licence: None
Copyright: None explicit
Pages: 211
Includes: Follow this and additional works at: http://scholars.wlu.ca/etd
Part of the Bioinformatics Commons, Biology Commons, and the Molecular Biology Commons
Unfortunately these "Commons" now probably belong to #elsevier
Wilfrid Laurier University
Thesis 4
Author: Charles Stuart Piper Foster
Title: Using Phylogenomic Data to Untangle the Patterns and Timescale of Flowering Plant Evolution
Licence: None
Copyright: None explicit
Pages: 249
Thesis 5
Author: Lisa Maria Ebner
Title: Untersuchung an Angiospermen- und Gymnospermenpollenkörnern aus der Potomac-Formation (Unterkreide) in den USA
Licence: None
Copyright: All rights reserved (site)
Pages: 129
NOTE: the text is in German but AMI can still index much of it. We can use Wikipedia later to index the German words.
Files were manually copied to a directory:
└── oatd
├── 2014-06-09_0801231.pdf
├── CORINNE_ALEXANDRA_FAY.pdf
├── Foster_CSP_Thesis.pdf
├── Gousseva_Anastasia_201011_MSc_thesis.pdf
└── OXIDASE_IN_PHYSCOMITRELLA.pdf
(The filenames were edited to remove non-printing and URL-escaped characters).
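I did this clean-up by hand, but it could be scripted. The following is a minimal sketch, not what I actually ran: the function name `clean_filename` is hypothetical, and I am assuming that "non-printing" means Unicode control and format characters and that spaces and path separators should become underscores.

```python
import re
import unicodedata
from urllib.parse import unquote


def clean_filename(name: str) -> str:
    """Decode URL escapes, drop non-printing characters, replace spaces and slashes."""
    # Turn URL escapes such as %20 back into the characters they stand for.
    decoded = unquote(name)
    # Assumption: "non-printing" = Unicode categories starting with "C"
    # (control, format, unassigned, ...), e.g. zero-width spaces.
    printable = "".join(
        ch for ch in decoded if not unicodedata.category(ch).startswith("C")
    )
    # Replace runs of whitespace or path separators with a single underscore.
    return re.sub(r"[\s/\\]+", "_", printable.strip())
```

For example, `clean_filename("CORINNE%20ALEXANDRA%20FAY.pdf")` gives `CORINNE_ALEXANDRA_FAY.pdf`.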
terms
- `oatd` is called a `CProject` in the AMI system.
- The PDFs will all become `CTree`s after this.
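To make the two terms concrete, here is a rough sketch of turning a flat directory of PDFs into one subdirectory per thesis. This is an illustration only, not the AMI tooling itself: the function `make_ctrees` is hypothetical, and the layout (a directory named after the file stem, holding the PDF renamed to `fulltext.pdf`) is my assumption about what a CTree looks like on disk.

```python
from pathlib import Path
import shutil


def make_ctrees(cproject: Path) -> list[Path]:
    """Move each PDF in the project directory into its own subdirectory."""
    ctrees = []
    # Collect the PDFs first (sorted) so that moving files does not disturb the loop.
    for pdf in sorted(cproject.glob("*.pdf")):
        # Assumed layout: <cproject>/<file stem>/fulltext.pdf
        ctree = cproject / pdf.stem
        ctree.mkdir(exist_ok=True)
        shutil.move(str(pdf), str(ctree / "fulltext.pdf"))
        ctrees.append(ctree)
    return ctrees
```

Calling `make_ctrees(Path("oatd"))` on the directory above would, under these assumptions, leave five subdirectories, each holding one `fulltext.pdf`.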
full corpus
I selected 30 records from OATD. Of these:
- 2 sites failed to respond
- 1 link was broken
- 1 was embargoed until 2021
- 1 required me to sign in even though it was CC BY-NC
- 1 was "not available"
That leaves 24 that I'll work with.
So in this sample 20% of the OATD links (6 of 30) did not lead to a thesis.
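The arithmetic behind those figures can be checked in a few lines. The counts are the ones listed above; the dictionary keys and variable names are just labels I chose for this sketch.

```python
# Failure counts taken from the list above; the labels are illustrative.
failures = {
    "site failed to respond": 2,
    "broken link": 1,
    "embargoed": 1,
    "sign-in required": 1,
    "not available": 1,
}

selected = 30                       # records selected from OATD
failed = sum(failures.values())     # records that gave no thesis
usable = selected - failed          # records left to work with
failure_rate = failed / selected    # fraction of links that failed

print(f"{failed} of {selected} failed, {usable} usable, {failure_rate:.0%} failure rate")
```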
Conclusions
It took me about 30 mins to download 30 theses. There were at least 15 different "styles" of repository; most were clunky and gave a clear impression that libraries regard theses with a C20th mindset, not a C21st one. I agree that theses are part of the educational and assessment process, but many are funded by research funders and industry, and IMO the theses themselves are excellently produced. It's a shame that they are not better deployed on the sites.
From now on almost everything is automatic ...