We had a conversation with Mike Taylor about the harvesters of scholarly content being built for The One Repo and how they differ from the scrapers ContentMine uses with quickscrape. Mike is looking for partner organisations who might be willing to sponsor the creation of harvesters, and in the spirit of collaboration and reducing unnecessary duplication of work we wondered whether we could use each other's tools.
Mike responded:
The big differences, from what I saw when I looked at your scrapers, are:
* Scrapers seem to work on one page at a time, while harvesters crawl around sites and extract multiple pages.
* Harvesters have lots of heavy machinery to deal with awkward pages that obscure content with JavaScript, cookies or crazy authentication schemes.
* Scrapers are more neatly positioned as tools that can be integrated into any system, whereas harvesters are wired into a bigger ecosystem.
* Crucially, scrapers are driven by an open-source engine, whereas the harvester engine is encumbered by co-ownership.
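The first point, the single-page scraper model, can be sketched as follows. This is an illustrative example only, not quickscrape's actual implementation: the class names are invented, and the sample page is hypothetical. The idea is that a scraper is handed the HTML of exactly one page and extracts structured metadata from it, with no crawling or link-following.

```python
from html.parser import HTMLParser

class CitationMetaScraper(HTMLParser):
    """Collect Highwire-style <meta name="citation_*"> tags from one page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name", "")
        if name.startswith("citation_"):
            self.metadata[name] = attrs.get("content", "")

def scrape_page(html: str) -> dict:
    """Scrape exactly one page: no crawling, no site traversal."""
    parser = CitationMetaScraper()
    parser.feed(html)
    return parser.metadata

# Hypothetical article landing page:
page = """
<html><head>
  <meta name="citation_title" content="An Example Paper">
  <meta name="citation_doi" content="10.1234/example.5678">
</head><body>...</body></html>
"""
print(scrape_page(page))
```

A harvester, by contrast, would wrap logic like this in a crawler that discovers the pages in the first place and copes with the JavaScript, cookie and authentication obstacles listed above.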
Peter Murray-Rust:
I use "crawl" instead of harvest. It's about discovery and it's 90%+ political and non-technical.
Our scrapers have JavaScript machinery but don't do wget-style bulk retrieval - they get precisely what they are asked for.
I think your harvesters overlap at one end with crawlers and at the other end with scrapers.
We hope the projects can find useful collaborative ground, and we're interested to hear from anyone else building scrapers, harvesters or interfaces for the scholarly literature.