workflow and program structure
`norma` iterates over all `*.svg` files picked out by the regex, and the `--transform svgtable2html` option submits each of them to the NormaConverter module `org.xmlcml.norma.table.SVGTable2HTMLConverter.convert()`. (NOTE: we are still developing the API contract for these converters, but in this case the input is an SVG file and the return value is an HtmlElement, e.g. `<table>` or `<html>`.)
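The select-then-convert loop described above can be sketched with the standard library alone. This is a minimal illustration, not norma's actual code: the class `SvgBatchSketch`, its methods, and the string-returning stand-in converter are all hypothetical names introduced here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: none of these names come from norma itself.
final class SvgBatchSketch {

    /** True if the file name (not the whole path) matches the pattern. */
    static boolean isSelected(Path file, Pattern pattern) {
        return pattern.matcher(file.getFileName().toString()).matches();
    }

    /** Returns the regular files under root whose names match the regex, sorted by path. */
    static List<Path> selectFiles(Path root, String fileNameRegex) throws IOException {
        Pattern pattern = Pattern.compile(fileNameRegex);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> isSelected(p, pattern))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Applies the converter to every selected file and collects the results in order. */
    static <T> List<T> convertAll(List<Path> files, Function<Path, T> converter) {
        List<T> results = new ArrayList<>();
        for (Path file : files) {
            results.add(converter.apply(file));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("svgtables");
        Files.writeString(root.resolve("table1.svg"), "<svg/>");
        Files.writeString(root.resolve("notes.txt"), "ignored");

        List<Path> svgFiles = selectFiles(root, ".*\\.svg");
        // Stand-in converter: a real one would return an HtmlElement.
        List<String> html = convertAll(svgFiles,
            p -> "<table><!-- from " + p.getFileName() + " --></table>");
        System.out.println(html);
    }
}
```

Passing the converter as a `Function` keeps the file selection independent of what is done with each file, which mirrors the way the `--transform` option chooses the converter.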
converter code
The code is essentially:
public HtmlElement convert() {
    getOrCreateOutputDir();
    tableContentCreator.markupAndOutputTable(inputFile, outputDir); // 1
    textStructurer = tableContentCreator.getTextStructurer();
    tableContentCreator.annotateAreasInSVGChunk(); // 2
    outputHtmlElement = tableContentCreator.createHtmlFromSVG(); // 3
    return outputHtmlElement;
}
This is in 3 parts:
- `markupAndOutputTable`: find the graphical areas in the document which correspond to "boxes". The highest level is "sections" (e.g. THBF: title, header, body, footer).
- `annotateAreasInSVGChunk`: annotate these areas with semantic labels.
- `createHtmlFromSVG`: create HTML from the labelled areas.
markupAndOutputTable
org.xmlcml.svg2xml.table.TableContentCreator.markupAndOutputTable()
public void markupAndOutputTable(File inputFile, File outDir) {
    //..
    annotatedSvgChunk = annotateAreas(inputFile);
    //...
}
This creates an SVG file which looks like:
interpretation
(Note that some colours are formed by the overlay of semi-transparent boxes.) The sections, in order, for those who can't see colours:
- The YELLOW section ("Table 1...") is the title.
- The PINK section ("Brain ...") is the header. The 4 brown regions are column headers. These can be split (in this case into 2 levels).
- The BLUE section is the body. It is overlaid by 5 RED columns and has 3 GREEN subtables.
- The PURPLE section is the footer; it has no structure at present.
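The colour legend above can be written down as a small lookup type. This is only an illustrative sketch: the enum `TableSection` and its methods are hypothetical, and the colours are the displayed ones from the list above, not necessarily the fill values used in the code.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical legend: maps each top-level section to the colour it appears in.
enum TableSection {
    TITLE("yellow"),
    HEADER("pink"),
    BODY("blue"),
    FOOTER("purple");

    private final String colour;

    TableSection(String colour) {
        this.colour = colour;
    }

    String colour() {
        return colour;
    }

    /** Looks up a section by displayed colour, ignoring case. */
    static Optional<TableSection> fromColour(String colour) {
        String wanted = colour.toLowerCase(Locale.ROOT);
        for (TableSection section : values()) {
            if (section.colour.equals(wanted)) {
                return Optional.of(section);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sections are declared in document order: title, header, body, footer.
        for (TableSection section : values()) {
            System.out.println(section + " is drawn in " + section.colour());
        }
    }
}
```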
annotateAreasInSVGChunk
org.xmlcml.svg2xml.table.TableContentCreator.annotateAreasInSVGChunk()
public SVGElement annotateAreasInSVGChunk() {
    SVGElement svgChunk = createMarkedSections(
        new String[] {"yellow", "red", "cyan", "blue"},
        new double[] {0.2, 0.2, 0.2, 0.2}
    );
    //...
    TableTitleSection tableTitle = getOrCreateTableTitleSection();
    svgChunk = tableTitle.createMarkedContent(/* colours */);
    // ... and likewise for header, body, footer
}
This creates `annotatedSvgChunk`.
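To make the colour and number arrays concrete, here is a minimal sketch of drawing one semi-transparent rectangle per section. It assumes the `double` values are opacities, which the source does not state explicitly. `MarkedSectionSketch`, `Box` and the box coordinates are hypothetical; only the SVG `rect` attributes (`x`, `y`, `width`, `height`, `fill`, `opacity`) are standard.

```java
import java.util.Locale;

// Hypothetical sketch: not the svg2xml implementation of createMarkedSections.
final class MarkedSectionSketch {

    /** A bounding box in SVG user units. */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Renders one box as a filled SVG rect with the given opacity. */
    static String rect(Box box, String fill, double opacity) {
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT,
            "<rect x=\"%.1f\" y=\"%.1f\" width=\"%.1f\" height=\"%.1f\" fill=\"%s\" opacity=\"%.1f\"/>",
            box.x, box.y, box.width, box.height, fill, opacity);
    }

    /** Wraps one rect per section in a group; the three arrays must be the same length. */
    static String markSections(Box[] boxes, String[] colours, double[] opacities) {
        if (boxes.length != colours.length || boxes.length != opacities.length) {
            throw new IllegalArgumentException("boxes, colours and opacities must have equal length");
        }
        StringBuilder svg = new StringBuilder("<g>");
        for (int i = 0; i < boxes.length; i++) {
            svg.append(rect(boxes[i], colours[i], opacities[i]));
        }
        return svg.append("</g>").toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for title, header, body and footer boxes.
        Box[] boxes = {
            new Box(0, 0, 200, 20), new Box(0, 20, 200, 30),
            new Box(0, 50, 200, 100), new Box(0, 150, 200, 20)
        };
        System.out.println(markSections(boxes,
            new String[] {"yellow", "red", "cyan", "blue"},
            new double[] {0.2, 0.2, 0.2, 0.2}));
    }
}
```

Because each rectangle is mostly transparent, overlapping boxes blend, which is consistent with the note that some displayed colours come from overlays.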
createHtmlFromSVG
org.xmlcml.svg2xml.table.TableContentCreator.createHtmlFromSVG()
public HtmlHtml createHtmlFromSVG() {
    HtmlHtml html = new HtmlHtml();
    HtmlBody body = new HtmlBody();
    html.appendChild(body);
    HtmlTable table = new HtmlTable();
    table.addAttribute(new Attribute("style", "border: 1px solid black;"));
    body.appendChild(table);
    addCaption(annotatedSvgChunk, table);
    addHeader(annotatedSvgChunk, table);
    addBody(annotatedSvgChunk, table);
    return html;
}
This passes the box-annotated SVG to the Caption, Header and Body modules which attempt to extract rows and columns.
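Once rows and columns have been extracted, assembling the table is mechanical. The following dependency-free sketch shows roughly the output shape, assuming the caption, header cells and body rows are already available as strings. `HtmlTableSketch` and its methods are hypothetical and use plain strings, not the `HtmlTable` classes shown above.

```java
import java.util.List;

// Hypothetical sketch: builds the table markup from already-extracted cell text.
final class HtmlTableSketch {

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one row, wrapping each cell in the given tag ("th" or "td"). */
    static String row(List<String> cells, String tag) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (String cell : cells) {
            sb.append('<').append(tag).append('>')
              .append(escape(cell))
              .append("</").append(tag).append('>');
        }
        return sb.append("</tr>").toString();
    }

    /** Renders caption, header and body in the same order as addCaption, addHeader, addBody. */
    static String table(String caption, List<String> header, List<List<String>> body) {
        StringBuilder sb = new StringBuilder("<table style=\"border: 1px solid black;\">");
        sb.append("<caption>").append(escape(caption)).append("</caption>");
        sb.append("<thead>").append(row(header, "th")).append("</thead>");
        sb.append("<tbody>");
        for (List<String> cells : body) {
            sb.append(row(cells, "td"));
        }
        return sb.append("</tbody></table>").toString();
    }

    public static void main(String[] args) {
        // Made-up cell values for illustration only.
        System.out.println(table("Example caption",
            List.of("Region", "Value"),
            List.of(List.of("A", "1"), List.of("B", "2"))));
    }
}
```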