Coordinate driven table cropping
Currently tables have to be manually identified by humans and cropped from the document. The standard way of doing this is through an interactive editor such as Inkscape, where the page (as SVG) is read into the editor, cropped, and output.
In future we expect cropping coordinates to become available, either from tools such as GROBID or from our own heuristics. Coordinates may also be extracted by third parties using other annotation tools. All these cases need a cropping process. This page reports the start of that process, where coordinates are used to crop an SVG page.
**NOTE: this code is currently on a feature development branch** (pmr)
clipping coordinates
Users with their own graphical systems are likely to have different coordinate systems, which differ in at least:
- origin and offsets
- units (pixels, inches, points, cm, etc.)
- direction (y-axis running up or down the page/screen)
The system developed here manages these by requiring a user-supplied mediaBox.
mediaBox
The coordinates used in PDF documents are described in
https://stackoverflow.com/questions/30319228/where-is-the-origin-x-y-of-a-pdf-page.
The coordinate system in Java is different and has its origin at the top left (TL). Java does not have the concept of a page limit, so this is deduced from the page sizes. All PDF documents are converted to Java coordinates by PDFBox and Pdf2Svg.
The "mediaBox" is a PDF term used to describe the page size, and we also use it to describe the Java page.
javaMediaBox (localMediaBox)
The default mediaBox in which all calculations are done is
this.localMediaBox = new Real2Range(new Real2(0,0), new Real2(600, 800));
This defines TL(0,0) and BR(600,800). It can be reset for other page sizes. The units are SCREEN UNITS ("pixels"). The y-coordinate increases down the page, to approximately 800 at the bottom. This probably does NOT correspond to the limit of your screen and scrolling may be required.
userMediaBox
The user must supply their page coordinates to define the page and the mapping.
cropper.setTLBRUserMediaBox(TLcorner, BRcorner);
This should use the same coordinate system as the userCropBox below. If the units, origin and direction are consistent between the userMediaBox and the userCropBox, there should be little scope for errors.
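The consistency requirement above can be checked mechanically before cropping. The following is a minimal sketch, not part of PageCropper: the class `BoxCheck` and its methods are hypothetical names, and boxes are passed as plain `{tlX, tlY, brX, brY}` arrays rather than `Real2` objects so that the example is self-contained. It tests that the crop box lies inside the media box and runs in the same direction on each axis.

```java
// Hypothetical helper, not part of PageCropper.
final class BoxCheck {
    private BoxCheck() {}

    /** True if v lies between a and b, whichever of the two is larger. */
    static boolean between(double v, double a, double b) {
        return v >= Math.min(a, b) && v <= Math.max(a, b);
    }

    /**
     * Boxes are {tlX, tlY, brX, brY} in the user's own coordinate system.
     * Returns true if the crop box lies inside the media box and its
     * TL-to-BR direction agrees with the media box on both axes.
     */
    static boolean cropInsideMedia(double[] media, double[] crop) {
        boolean xInside = between(crop[0], media[0], media[2])
                && between(crop[2], media[0], media[2]);
        boolean yInside = between(crop[1], media[1], media[3])
                && between(crop[3], media[1], media[3]);
        // a mismatch in sign means the two boxes use opposite axis directions
        boolean sameXDirection =
                Math.signum(crop[2] - crop[0]) == Math.signum(media[2] - media[0]);
        boolean sameYDirection =
                Math.signum(crop[3] - crop[1]) == Math.signum(media[3] - media[1]);
        return xInside && yInside && sameXDirection && sameYDirection;
    }

    public static void main(String[] args) {
        double[] media = {0, 800, 600, 0};      // y decreases down the page
        double[] crop = {30, 580, 570, 310};    // same convention as media
        System.out.println(cropInsideMedia(media, crop));     // prints true

        // the same area, but given with y increasing down the page
        double[] flipped = {30, 310, 570, 580};
        System.out.println(cropInsideMedia(media, flipped));  // prints false
    }
}
```

A check of this kind catches the most likely mistake, which is supplying the crop box in a y-down convention while the media box is y-up (or the reverse).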
userCropBox
The user must supply their cropBox coordinates, which are used to crop the page.
cropper.setTLBRUserCropBox(new Real2(30, 580), new Real2(570, 310));
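To show how a user cropBox might relate to the local coordinates, here is a small worked sketch. It assumes a simple linear mapping from the userMediaBox onto the default localMediaBox TL(0,0), BR(600,800); whether PageCropper computes the mapping in exactly this way is an assumption, and the class `UserToLocal` with its methods is hypothetical. The user mediaBox TL(0,800), BR(600,0) is taken from the example later on this page.

```java
// Hypothetical illustration of a linear user-to-local mapping; not PageCropper code.
final class UserToLocal {
    private UserToLocal() {}

    // default local media box from the text: TL(0,0), BR(600,800)
    static final double LOCAL_TLX = 0, LOCAL_TLY = 0, LOCAL_BRX = 600, LOCAL_BRY = 800;

    /** Linear map of one coordinate from the user interval [uTL, uBR] to the local interval [lTL, lBR]. */
    static double map(double u, double uTL, double uBR, double lTL, double lBR) {
        return lTL + (u - uTL) * (lBR - lTL) / (uBR - uTL);
    }

    /** Maps a user point (x, y) into local coordinates, given the user media box corners. */
    static double[] toLocal(double x, double y,
                            double uTLx, double uTLy, double uBRx, double uBRy) {
        return new double[] {
            map(x, uTLx, uBRx, LOCAL_TLX, LOCAL_BRX),
            map(y, uTLy, uBRy, LOCAL_TLY, LOCAL_BRY)
        };
    }

    public static void main(String[] args) {
        // user media box TL(0,800), BR(600,0): y decreases down the page
        double[] tl = toLocal(30, 580, 0, 800, 600, 0);
        double[] br = toLocal(570, 310, 0, 800, 600, 0);
        // prints 30.0,220.0 570.0,490.0
        System.out.println(tl[0] + "," + tl[1] + " " + br[0] + "," + br[1]);
    }
}
```

Under these assumptions the user cropBox (30,580)(570,310) corresponds to TL(30,220), BR(570,490) in local units, with y now increasing down the page. Because the same formula handles offset, scale and sign, it covers differences in origin, units and direction in one step.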
example
typical usage:
// create element for page
SVGElement svgElement = SVGElement.readAndCreateSVG(inputFile);
// make a cropper
PageCropper cropper = new PageCropper();
// set the user page coordinates. Note the Y-coordinate DECREASES down the page
cropper.setTLBRUserMediaBox(new Real2(0, 800), new Real2(600, 0));
// set the area to crop
cropper.setTLBRUserCropBox(new Real2(30, 580), new Real2(570, 310));
// detach (remove) the elements lying outside the crop box
cropper.detachElementsOutsideBox(svgElement);
// write the result as an SVG file
SVGSVG.wrapAndWriteAsSVG(svgElement, new File(new File("target/crop/"), "materials-05-00027-page7.crop2.svg"));
// pass element into TableContentCreator ... as usual
tableContentCreator.createContent(svgElement);
...
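Conceptually, the detach step in the usage above amounts to filtering elements by their bounding boxes. The sketch below illustrates that idea only; it does not use the real SVGElement classes, and `CropSketch`, `Box` and `keepInside` are hypothetical names. It also assumes that an element is kept only when its bounding box lies entirely inside the crop box; how `detachElementsOutsideBox` treats partially overlapping elements is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of cropping by bounding box; not the PageCropper implementation.
final class CropSketch {
    private CropSketch() {}

    /** Axis-aligned box in local coordinates (y increases down the page). */
    static final class Box {
        final double xMin, yMin, xMax, yMax;

        Box(double xMin, double yMin, double xMax, double yMax) {
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }

        /** True if the other box lies entirely within this one. */
        boolean contains(Box other) {
            return other.xMin >= xMin && other.xMax <= xMax
                    && other.yMin >= yMin && other.yMax <= yMax;
        }
    }

    /** Returns the element boxes that lie fully inside the crop box, in their original order. */
    static List<Box> keepInside(Box crop, List<Box> elements) {
        List<Box> kept = new ArrayList<>();
        for (Box element : elements) {
            if (crop.contains(element)) {
                kept.add(element);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Box crop = new Box(30, 220, 570, 490);
        List<Box> elements = new ArrayList<>();
        elements.add(new Box(100, 250, 200, 260)); // inside the crop box
        elements.add(new Box(100, 50, 200, 60));   // above the crop box
        elements.add(new Box(500, 480, 580, 500)); // straddles the bottom-right corner
        System.out.println(keepInside(crop, elements).size()); // prints 1
    }
}
```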
result
The page to be cropped, with the user-supplied cropBox (blue) shown for clarity:
This gives the clipped SVG result:
This result will be fed into TableContentCreator.
commandline
This model can be included in a norma commandline as:
norma -i singlepage.pdf --svg.mediaBox (0,800) (600,0) --svg.crop (30,580)(570,310)
Note that only one extraction on one page can be done at a time, as we do not yet have a format for multiple pages and tables. Later this will be incorporated into the cproject structure.
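For readers scripting around the commandline, corner arguments written as `(x,y)(x,y)` can be turned into numbers with a small parser. This is a sketch under stated assumptions: `CornerParser` is a hypothetical class, it is not how norma itself reads its arguments, and it simply accepts two parenthesised pairs with optional whitespace between them, as both forms appear in the command above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for corner strings such as "(30,580)(570,310)"; not part of norma.
final class CornerParser {
    private CornerParser() {}

    // one "(x,y)" pair; numbers may be negative or have a decimal part
    private static final Pattern POINT = Pattern.compile(
            "\\(\\s*(-?\\d+(?:\\.\\d+)?)\\s*,\\s*(-?\\d+(?:\\.\\d+)?)\\s*\\)");

    /** Returns every (x,y) pair found in the text, in order, as {x, y} arrays. */
    static List<double[]> parsePoints(String text) {
        List<double[]> points = new ArrayList<>();
        Matcher matcher = POINT.matcher(text);
        while (matcher.find()) {
            points.add(new double[] {
                Double.parseDouble(matcher.group(1)),
                Double.parseDouble(matcher.group(2))
            });
        }
        return points;
    }

    /** Parses exactly two pairs, interpreted as the TL and BR corners. */
    static double[][] parseCorners(String text) {
        List<double[]> points = parsePoints(text);
        if (points.size() != 2) {
            throw new IllegalArgumentException(
                    "expected two (x,y) points, found " + points.size() + " in: " + text);
        }
        return new double[][] {points.get(0), points.get(1)};
    }

    public static void main(String[] args) {
        double[][] crop = parseCorners("(30,580)(570,310)");
        // prints 30.0 580.0 570.0 310.0
        System.out.println(crop[0][0] + " " + crop[0][1] + " " + crop[1][0] + " " + crop[1][1]);
    }
}
```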