This project explores the extraction of data from funnel plots in published papers (or preprints). It is sponsored by @chjh and the University of Tilburg.
All data and discussion will be carried out as Open Notebook.
Nine funnel plots were provided for development purposes from a range of publications. See https://github.com/ContentMine/svg2xml/tree/master/src/test/resources/org/xmlcml/svg2xml/funnel
These were created by running norma pdf2svg on PDFs to create SVG pages, snipping the diagrams using Inkscape, and then running (and developing) svg2xml to create higher primitives (currently svg:rect). The next phase is to create plots.
The papers contained a range of SVG primitives, including:
- circles (from 4 Cubic Beziers - seems standard).
- Unicode characters (text)
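A minimal sketch of how a circle drawn as four cubic Beziers might be recognised. It is not the svg2xml implementation; the function name circle_from_path and the tolerance are hypothetical, and it assumes absolute M and C commands only.

```python
import math
import re

# One token is either a command letter or a number (exponents included).
TOKEN = re.compile(r"([A-Za-z])|([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def circle_from_path(d, tol=0.02):
    """Return (cx, cy, r) if d is M followed by 4 absolute cubic Beziers
    that approximate a circle, else None. Hypothetical helper."""
    cmds, nums = [], []
    for cmd, num in TOKEN.findall(d):
        if cmd:
            cmds.append(cmd)
        else:
            nums.append(float(num))
    if cmds and cmds[-1] in ("Z", "z"):
        cmds = cmds[:-1]
    # Expect 1 start point plus 4 segments of 3 points each: 26 numbers.
    if len(cmds) < 2 or cmds[0] != "M" or len(nums) != 26:
        return None
    if any(c != "C" for c in cmds[1:]):
        return None
    pts = [(nums[i], nums[i + 1]) for i in range(0, 26, 2)]
    anchors = pts[0::3]  # on-curve points: start plus 4 segment ends
    cx = sum(p[0] for p in anchors[:4]) / 4
    cy = sum(p[1] for p in anchors[:4]) / 4
    r = sum(math.dist((cx, cy), p) for p in anchors[:4]) / 4
    if r == 0 or math.dist(anchors[0], anchors[4]) > tol * r:
        return None  # degenerate or not closed
    # Curve midpoints (t = 0.5) must also lie on the circle; this
    # rejects squares whose corners happen to be equidistant.
    samples = list(anchors[:4])
    for i in range(4):
        p0, p1, p2, p3 = pts[3 * i:3 * i + 4]
        samples.append(((p0[0] + 3 * p1[0] + 3 * p2[0] + p3[0]) / 8,
                        (p0[1] + 3 * p1[1] + 3 * p2[1] + p3[1]) / 8))
    if any(abs(math.dist((cx, cy), p) - r) > tol * r for p in samples):
        return None
    return cx, cy, r
```

Paths that use relative commands or a different number of segments would need extra handling.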
Conversion was possible in all cases, but with some problems:
- text rendered as scalable (outline) fonts. We may have to use OCR to decode this.
- some legacy characters (including Identity-H). These may also need OCR.
- out of view characters (e.g. Copyright notices at minus values of
- MANY space or null text characters which may need removing. These might be an Inkscape artefact.
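The last two problems could probably be handled before any further analysis. The sketch below is an assumption about how that might look, not existing svg2xml code: clean_text is a hypothetical helper, and it treats any svg:text element whose first x or y coordinate is negative as out of view.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def first_number(value):
    """First number of an SVG coordinate list, or 0.0 if unparsable."""
    try:
        return float(value.replace(",", " ").split()[0])
    except (IndexError, ValueError):
        return 0.0


def clean_text(svg_string):
    """Drop svg:text elements that are blank or at negative coordinates.

    Returns (root_element, number_removed). The negative-coordinate
    rule is an assumption about what "out of view" means.
    """
    root = ET.fromstring(svg_string)
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != f"{{{SVG_NS}}}text":
                continue
            content = "".join(child.itertext())
            blank = not content.strip()
            off_page = (first_number(child.get("x", "0")) < 0
                        or first_number(child.get("y", "0")) < 0)
            if blank or off_page:
                parent.remove(child)
                removed += 1
    return root, removed
```

Counting the removed elements gives a quick check on whether a given diagram is affected by the suspected Inkscape artefact.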
PMR is still interpreting the SVGPath:d syntax but most of it seems to work.
It is too difficult at this stage to implement elliptical arcs (A a syntax), and we probably wouldn't know how to extract useful data from them anyway. They are replaced by svg:line at present.
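One plausible reading of this replacement is to swap each arc for a straight line to the arc's end point. The sketch below does that on the raw path string; arcs_to_lines is a hypothetical name, and it assumes arc arguments are separated by whitespace or commas (the compact form with run-together flags is not handled).

```python
import re

# One token is either a command letter or a number (exponents included).
TOKEN = re.compile(r"([A-Za-z])|([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def arcs_to_lines(d):
    """Rewrite SVG path data so that every A/a arc becomes an L/l line
    ending at the same point. Other commands pass through unchanged."""
    segments = []  # each entry: [command letter, list of number strings]
    for cmd, num in TOKEN.findall(d):
        if cmd:
            segments.append([cmd, []])
        elif segments:
            segments[-1][1].append(num)
    out = []
    for cmd, args in segments:
        if cmd in ("A", "a"):
            # An arc takes 7 arguments; the last two are its end point
            # (absolute for A, relative for a, matching L and l).
            line_cmd = "L" if cmd == "A" else "l"
            for i in range(0, len(args) - 6, 7):
                out.append(f"{line_cmd} {args[i + 5]} {args[i + 6]}")
        else:
            out.append(" ".join([cmd] + args))
    return " ".join(out)
```

For example, arcs_to_lines("M 10 10 A 5 5 0 0 1 20 20 Z") gives "M 10 10 L 20 20 Z". The curvature is lost, but the path stays connected.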
- extract "box" and "median diamond" and "funnel lines"
- rotated text (y-axis)
- extract circles of dots into CSV
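The CSV step might look like the sketch below. It assumes the dots have already been converted to svg:circle elements, which is not yet the case according to the notes above; circles_to_csv and the column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def circles_to_csv(svg_string):
    """Return CSV text with one row (cx, cy, r) per svg:circle,
    in document order. Coordinates are still in SVG pixel units."""
    root = ET.fromstring(svg_string)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["cx", "cy", "r"])
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        writer.writerow([float(circle.get(name, "0"))
                         for name in ("cx", "cy", "r")])
    return buffer.getvalue()
```

The values are page coordinates, so they still need converting to data values using the axes.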
The key things are the tick marks and scales. We know it's a funnel plot, so the semantics are implicit.
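Once tick positions and their labels are known, page coordinates can be mapped to data values. The sketch below assumes a linear axis and fits it by least squares over the ticks; fit_axis and to_data are hypothetical names, and a log-scaled axis would need its tick values transformed first.

```python
def fit_axis(ticks):
    """Fit value = scale * pixel + offset from (pixel, value) tick pairs.

    Assumes a linear axis. Returns (scale, offset)."""
    n = len(ticks)
    if n < 2:
        raise ValueError("need at least two tick marks")
    mean_p = sum(p for p, _ in ticks) / n
    mean_v = sum(v for _, v in ticks) / n
    spread = sum((p - mean_p) ** 2 for p, _ in ticks)
    if spread == 0:
        raise ValueError("tick marks must be at distinct pixel positions")
    scale = sum((p - mean_p) * (v - mean_v) for p, v in ticks) / spread
    return scale, mean_v - scale * mean_p


def to_data(pixel, axis):
    """Convert one pixel coordinate using a (scale, offset) pair."""
    scale, offset = axis
    return scale * pixel + offset


# Made-up tick positions for illustration only.
x_axis = fit_axis([(100, -1.0), (200, 0.0), (300, 1.0)])
y_axis = fit_axis([(50, 0.0), (250, 0.4)])
point = (to_data(250, x_axis), to_data(150, y_axis))
```

Using all ticks rather than just two spreads out small errors in the tick positions, and the same fit handles a y-axis that runs downwards on the page.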