Terminology and definitions
We must develop a terminology agreed by all parties in the CM-UCL project. (Hopefully this may also form part of a definitive publication later.)
Tables consist of the following:
* characters (in principle any Unicode code points). We attempt to normalize to Unicode where possible.
- non-Unicode characters.
pdf2svg has conversion tables for some of the commonest.
- diacritics. We attempt to normalize these with Unicode rules.
- ligatures. We attempt to normalize these with Unicode rules.
- subscripts and superscripts. We refer to these as suscripts. We do NOT use Unicode subscripts but rather transform the text into a structure ("Phrases") which supports suscripts. This is unique to ContentMine software.
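The normalization of diacritics and ligatures described above can be illustrated with Python's standard `unicodedata` module. This is only a sketch of the general Unicode rules, not the conversion actually performed by pdf2svg or other ContentMine software; the function name `normalize_text` is made up for this example.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Fold ligatures and compose diacritics using standard Unicode rules."""
    # NFKC applies compatibility decomposition (the ligature U+FB01 becomes
    # "f" + "i") followed by canonical composition ("e" + U+0301 becomes U+00E9).
    return unicodedata.normalize("NFKC", text)

ligature = "\ufb01eld"      # "field" written with the single-character fi ligature
decomposed = "cafe\u0301"   # "cafe" followed by a combining acute accent

print(normalize_text(ligature))    # field
print(normalize_text(decomposed))  # café
```

One caveat: NFKC also folds Unicode superscript and subscript characters (for example U+00B2, superscript two) into plain digits, so a pipeline along these lines would probably need to capture suscripts before this step.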
* graphics which structure the table visually. They are generally some or all of:
- Horizontal rulers
- Vertical rulers
- Rectangular cells or supercells. These may have
stroke='none' , and/or
NOTE: images (bitmaps) are out of scope.
* "whitespace" This is a virtual component, created by the absence of text or graphics at a series of neighbouring points.
It is often impossible to deduce precisely what the components are. Reasons include:
- overlapping paths (frequently lines are written twice, overlap at the ends, or are decorated with external borders). Our strategy is that the non-visual microstructure must be ignored: the software aims to simplify and normalize all objects.
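As an illustration of ignoring non-visual microstructure, the sketch below merges horizontal rulers that are drawn twice or that overlap at their ends. The `(y, x_start, x_end)` representation, the function name and the tolerance value are all assumptions made for this example; they are not taken from the ContentMine code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a horizontal ruler: (y, x_start, x_end).
Ruler = Tuple[float, float, float]

def merge_horizontal_rulers(rulers: List[Ruler], tol: float = 0.5) -> List[Ruler]:
    """Merge rulers at (nearly) the same y that overlap or touch in x."""
    # Bucket rulers by y snapped to a grid of size tol. Two rulers that
    # straddle a grid boundary will not be merged; a fuller version would
    # cluster the y values instead.
    buckets: Dict[int, List[Tuple[float, float, float]]] = {}
    for y, a, b in rulers:
        buckets.setdefault(round(y / tol), []).append((min(a, b), max(a, b), y))

    merged: List[Ruler] = []
    for key in sorted(buckets):
        segments = sorted(buckets[key])
        cur_start, cur_end, cur_y = segments[0]
        for x0, x1, _ in segments[1:]:
            if x0 <= cur_end + tol:
                # Duplicate or overlapping segment: extend the current ruler.
                cur_end = max(cur_end, x1)
            else:
                merged.append((cur_y, cur_start, cur_end))
                cur_start, cur_end = x0, x1
        merged.append((cur_y, cur_start, cur_end))
    return merged

rulers = [
    (10.0, 0.0, 100.0),
    (10.0, 0.0, 100.0),    # the same line written twice
    (10.1, 99.8, 150.0),   # overlaps the end of the first line
    (40.0, 0.0, 150.0),
]
print(merge_horizontal_rulers(rulers))
# [(10.0, 0.0, 150.0), (40.0, 0.0, 150.0)]
```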
Styles often convey implicit semantics and CM tries to preserve style info as far as possible. Common styles include:
- large or small font-size
- fill and stroke colours
- dashed and dotted lines
Preserved info will be captured as
- HTML or SVG CSS or hardcode attributes for
opacity or "grey" colour will be translated to
We propose the following anatomy of tables. A "simple" table is given in
A one-sentence (sometimes more) description of the table. It usually starts with "Table ", often in bold.
The header is a list of column headings. Generally all columns (except sometimes the first) MUST have a column heading. The example has 3 column headings.
The header structure may be:
- simple: a single row of phrases, each describing a column.
- split: two or more rows of phrases. The earlier (higher) ones may encompass two or more headings in the row below (example later).
The body holds the main "data" in the table. In simple tables this consists of rows. This table has 6 rows (the header does not count) and 3 columns. Sometimes the first field in a row (the first column) is a row header and provides metadata or an ID for the row.
The data is often conflated into single columns. In the example it could have been semantically clearer to have 3 data columns (percent, number responsive, total neurons).
The footer is a free-format section, often given as "notes" or abbreviations (as in the example, which explains the abbreviations TG, WT and KO). In principle (but not in the CM-UCL project) it would be possible to resolve the abbreviations in the body section.
Superscripts and punctuation often provide implicit hyperlinks between the body and the footer. In this project we shall recreate the superscripts but not specifically resolve the links.
The body of some tables may have indents (especially in column 1), usually to indicate some form of sub-list or sub-table.
example 10.1007_s00213-015-4198-1 table1.png
There are effectively 3 sub-tables ("5-HT1A", "5-HT2A", "5-HTT") which share the same column-names. There are two types of rows:
- header rows ("5-HT1A", etc.). These sub-headers flag the start of a sub-table (some sub-headers may be centre-justified).
- data rows ("MPC" and 4 values).
The data rows are flagged by being indented (in this case by about 2 spaces). However, indentation is not always present and some sub-tables are left-justified.
We intend to flag indents by a leading string (e.g. "~" or "HTAB") to avoid relying on (fragile) whitespace. For double indent this would be repeated.
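A minimal sketch of this flagging, assuming that a cell arrives as a string whose leading spaces encode the indent and that one indent level is about 2 spaces. Both are assumptions for illustration; in practice the indent would more likely be derived from x-coordinates. The names `flag_indent`, `INDENT_FLAG` and `SPACES_PER_INDENT` are hypothetical.

```python
INDENT_FLAG = "~"       # marker suggested in the text ("HTAB" is the other option)
SPACES_PER_INDENT = 2   # assumption: one indent level is about 2 spaces

def flag_indent(cell: str, spaces_per_indent: int = SPACES_PER_INDENT) -> str:
    """Replace leading spaces with one INDENT_FLAG per indent level."""
    stripped = cell.lstrip(" ")
    # Integer division: a partial indent (e.g. 1 space) counts as no indent.
    level = (len(cell) - len(stripped)) // spaces_per_indent
    return INDENT_FLAG * level + stripped

for cell in ["5-HT1A", "  MPC", "    deeper"]:
    print(repr(flag_indent(cell)))
# '5-HT1A'
# '~MPC'
# '~~deeper'
```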
triple indentation 10.1016_j.jadohealth.2012.08.009 table2.png
The table has three levels of indent.
These appear to be three somewhat independent axes:
- Female / male
- Recalled receipt
- Read materials
It may be possible to normalize this into a hypercube.
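One possible first step towards such a normalization is to turn each indented row into a full path from the outermost level down, so that every row carries its position on each axis. The sketch below assumes rows have already been flagged with a leading "~" per indent level. The arrangement of labels in the usage example is invented for illustration and is not transcribed from the table; `indent_paths` is a hypothetical name.

```python
from typing import List, Tuple

def indent_paths(rows: List[str], flag: str = "~") -> List[Tuple[str, ...]]:
    """Turn flagged rows into full paths from the outermost level down."""
    stack: List[str] = []
    paths: List[Tuple[str, ...]] = []
    for row in rows:
        # Count how many times the flag is repeated at the start of the row.
        level = 0
        while row.startswith(flag, level * len(flag)):
            level += 1
        label = row[level * len(flag):]
        # Discard labels at this level or deeper left over from earlier rows.
        del stack[level:]
        stack.append(label)
        paths.append(tuple(stack))
    return paths

# Invented arrangement, for illustration only.
rows = ["Female", "~Recalled receipt", "~~Read materials", "Male", "~Recalled receipt"]
for path in indent_paths(rows):
    print(path)
```

Each resulting tuple could then serve as a compound key, which is the kind of structure a hypercube representation would need.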
When a phrase overruns the end of a (virtual) table cell, it may be wrapped.
example 10.1016_j.amepre.2016.07.024 table3.png
The first line has wrapped ("healthful foods" is part of the phrase), and in this case there is also a clue from the background grey colour. The third sub-header has also wrapped ("removing ... sugary").
Where there is no background it can often be difficult to distinguish between wrapping and sub-lists. We may use heuristics so that lines starting with a capital letter are treated as separate from the preceding line.
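The capital-letter heuristic could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the input is assumed to be the text lines of one column, top to bottom, and the sample lines are invented. Note that under this simple rule a line starting with a digit or punctuation is also treated as a continuation.

```python
from typing import List

def join_wrapped_lines(lines: List[str]) -> List[str]:
    """Join a line to the previous phrase unless it starts with a capital."""
    phrases: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if phrases and not text[0].isupper():
            # Heuristic: a lower-case start suggests a wrapped continuation.
            phrases[-1] = phrases[-1] + " " + text
        else:
            phrases.append(text)
    return phrases

# Invented sample lines, for illustration only.
print(join_wrapped_lines(["Availability of", "healthful foods", "Price"]))
# ['Availability of healthful foods', 'Price']
```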
example 10.1016_j.amepre.2016.08.013 table2.png
Column headers are:
* Col1 "Length of follow-up"
* Col2.1 Crude mortality | Counseled smokers, n(%) (n=5,695)
* Col2.2 Crude mortality | Non-counseled smokers, n(%) (n=8,120)
* Col2.3 Adjusted
* Col2.3 Matched analysis
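The "parent | child" naming used for the split headers above can be sketched as a flattening step. The span representation `(start_col, end_col, text)` and the function name are assumptions for this example; only the first three columns are shown, since those are the ones whose structure is stated unambiguously.

```python
from typing import List, Tuple

# Hypothetical representation of an upper-row heading that spans
# columns start..end (inclusive, 0-based) of the lower row.
Span = Tuple[int, int, str]

def flatten_split_header(upper: List[Span], lower: List[str], sep: str = " | ") -> List[str]:
    """Prefix each lower heading with the upper heading that spans it, if any."""
    names: List[str] = []
    for col, heading in enumerate(lower):
        parent = next((text for start, end, text in upper if start <= col <= end), None)
        names.append(f"{parent}{sep}{heading}" if parent else heading)
    return names

upper = [(1, 2, "Crude mortality")]
lower = [
    "Length of follow-up",
    "Counseled smokers, n(%) (n=5,695)",
    "Non-counseled smokers, n(%) (n=8,120)",
]
for name in flatten_split_header(upper, lower):
    print(name)
# Length of follow-up
# Crude mortality | Counseled smokers, n(%) (n=5,695)
# Crude mortality | Non-counseled smokers, n(%) (n=8,120)
```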
data example 10.1016_j.jadohealth.2012.08.009 table1.png
There is wrapping in column 1 which overlaps with new rows in column 2. It is very difficult to work out what the semantics of these two columns are.
wrapping, indent, subtable, and list policy
All irregular structure (wrapping, lists, indents) will be captured in partial columns, or indent indicators. That allows it to be resolved later.
We intend that wrapping