We had a conversation with Mike Taylor about the harvesters of scholarly content being built for The One Repo and how they differ from the scrapers ContentMine are using with quickscrape. Mike is looking for partner organisations who might be willing to sponsor the creation of harvesters. In the spirit of collaboration and of reducing unnecessary duplication of work, we wondered if we could use each other's tools.
The big differences, from what I saw when I looked at your scrapers, are:
* Scrapers seem to work on one page at a time, while harvesters crawl
around sites and extract multiple pages.
* Harvesters have lots of heavy machinery to deal with awkward pages.
* Scrapers are more neatly positioned as tools that can be integrated
into any system, whereas harvesters are wired into a bigger ecosystem.
* Crucially, scrapers are driven by an open-source engine, whereas the
harvester engine is encumbered by co-ownership.
I use "crawl" instead of harvest. It's about discovery and it's 90%+ political and non-technical.
I think your harvesters overlap at one end with crawlers and at the other end with scrapers.
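To make that overlap concrete, here is a minimal Python sketch of how we read the three terms. It is an illustration only, not code from quickscrape or The One Repo: the in-memory `SITE` dictionary stands in for real HTTP fetches, and the function names `scrape`, `crawl` and `harvest` are hypothetical.

```python
import re

# Hypothetical in-memory "site" (URL -> HTML), standing in for real HTTP fetches.
SITE = {
    "/": '<a href="/a1">A1</a> <a href="/a2">A2</a>',
    "/a1": '<title>First article</title> <a href="/a2">next</a>',
    "/a2": '<title>Second article</title> <a href="/">home</a>',
}


def scrape(html):
    """Scraper: extract fields from ONE page that has already been fetched."""
    match = re.search(r"<title>(.*?)</title>", html)
    return {"title": match.group(1)} if match else None


def crawl(start):
    """Crawler: discover URLs by following links; it extracts nothing."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue  # skip pages already visited or outside the site
        seen.append(url)
        queue.extend(re.findall(r'href="(.*?)"', SITE[url]))
    return seen


def harvest(start):
    """Harvester (in this sketch): crawl a site, then scrape each page found."""
    results = {}
    for url in crawl(start):
        record = scrape(SITE[url])
        if record:  # pages with nothing to extract are dropped
            results[url] = record
    return results


if __name__ == "__main__":
    print(harvest("/"))
```

In this reading the scraper is the reusable piece that could be dropped into any system, while the harvester is the scraper plus the discovery step wrapped around it.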
We hope the projects can find useful collaborative ground, and we're interested to hear from anyone else building scrapers, harvesters or interfaces for the scholarly literature.