title: CHARTER: heatmap-based multi-type chart data extraction
authors: Shtok, Joseph; Harary, Sivan; Azulai, Ophir; Goldfarb, Adi Raz; Arbelle, Assaf; Karlinsky, Leonid
date: 2021-11-28

The digital conversion of information stored in documents is a great source of knowledge. In contrast to document text, the conversion of embedded document graphics, such as charts and plots, has been much less explored. We present a method and a system for end-to-end conversion of document charts into a machine-readable tabular data format, which can be easily stored and analyzed in the digital domain. Our approach extracts and analyzes charts along with their graphical elements and supporting structures such as legends, axes, titles, and captions. Our detection system is based on neural networks trained solely on synthetic data, eliminating the limiting factor of data collection. As opposed to previous methods, which detect graphical elements using bounding boxes, our networks feature auxiliary domain-specific heatmap predictions, enabling the precise detection of pie charts, line plots, and scatter plots, which do not fit the rectangular bounding-box presumption. Qualitative and quantitative results show high robustness and precision, improving upon previous works on popular benchmarks.

Charts carry an important part of the information content in many kinds of documents: financial reports, scholarly articles, and presentations, to name a few. Recent interest in automatic document processing and conversion, for tasks such as summarization and question answering, has increased the importance of extracting the underlying tabular data from chart images embedded in the converted documents. Chart analysis methods have evolved substantially in recent years, from human-in-the-loop platforms relying on manual annotations [8, 15], through early data-extraction algorithms [2] and hybrid neural-algorithmic pipelines [7, 13], to end-to-end processing by a neural network [9, 12, 16]. Commonly, a two-stage approach is used: first detecting the chart regions in the documents, and then applying a data-extraction process to the detected charts. While the scope of the detection stage can be quite wide, including many types of charts [7], current tabular-data extraction systems are mostly limited to bar charts [2, 7, 9, 16], with few exceptions. One possible reason is that the standard object detectors employed in recent works cope better with (and enable easy inference from) objects like rectangular bars and text elements, less so with pie segments, while elements like line or area plots defy handling by box proposals. Additionally, matching the chart legend to the corresponding chart elements, which is non-trivial due to the small size of the color/texture chart-component samples and the variability of legend layouts, has remained out of scope for most recent works.

* authors contributed equally to this research.

In this work we present a chart detection and analysis pipeline addressing the mentioned limitations. We train a CNN object detector to produce predictions in the form of heatmaps, in addition to its standard bounding-box proposals, thus enabling localization of graphical elements of arbitrary shape (as visualized in Figure 1).
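To make the heatmap representation concrete, the following is a minimal sketch (not the authors' code) of how point fiducials, such as axis ticks or pie-sector corners, can be encoded as Gaussian heatmap targets for training, in the spirit of CenterNet [17]; the function name and parameter values are illustrative.

```python
import numpy as np

def render_point_heatmap(h, w, points, sigma=2.0):
    """Render a training target with a Gaussian splat at each fiducial point.

    points: iterable of (x, y) pixel coordinates (e.g., axis ticks).
    Overlapping splats are combined with a per-pixel maximum, as is
    customary for CenterNet-style keypoint targets.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap

# Example: encode two x-tick fiducials on a 64x256 strip.
ticks = [(40, 60), (120, 60)]
target = render_point_heatmap(64, 256, ticks)
```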
This allows us to better incorporate the geometrical domain knowledge of chart structure into the detector during training, and to extend the scope of chart conversion to bar, pie, line, and scatter charts, all within the same (single) model. Beyond the improved detection of graphical elements, we introduce algorithms for explicit label matching of bar and pie elements from colored legends, as well as from connector-based labels. Our main contributions are: (1) introducing heatmap prediction for general graphical chart elements; (2) presenting novel algorithms for data extraction from pie and line charts; (3) presenting quantitative results and an ablation study of the proposed methods, showing improvement over existing benchmarks.

Early works addressing automatic chart classification and data extraction [2, 14] used classical computer vision techniques, such as clustering of image patches, the Hough transform, and OCR. More recently, [6, 7, 13] presented hybrid neural-algorithmic pipelines, detecting the graphical objects and then extracting numerical and textual information using OCR, computer vision techniques, and rules; in terms of its general design, our approach belongs to this group of methods. Another line of work [9, 12, 16] proposes end-to-end analysis of charts by a neural network. [16] develops an encoder-decoder architecture with an attention mechanism for direct data extraction from bar charts by an RNN. [9] introduced a neural model for question answering about charts, which in particular allows extracting tabular data through appropriate sets of questions. In [12], a standard object detector is equipped with a relation network to address the connections between the different chart elements; this model is able to produce bar heights and angles of pie segments (for a single pie chart), and to match them against the legend entries. However, this approach requires an individual network design for each type of chart, and the inference process involves affine transformations of the input image for spatial alignment of features. We offer a lighter solution, detecting all the required kinds of graphical elements with a single network; the integration stage, however, which applies domain knowledge to support a wide variety of real-world chart types, designs, and issues, better fits the bill of a module built with logical rules and explicit algorithmics. This has enabled us to achieve robust behaviour on a number of large real-world collections of documents.

In Stage 1 of our system, an object detector is applied to complete document pages to retrieve the regions of charts, titles, legends, and captions¹. In Stage 2, another object detector operates on the detected chart regions, producing both bounding-box proposals and heatmaps for various chart components. It is trained to predict the important encapsulating regions (e.g., the outer and inner boxes around the chart) and some graphical elements (e.g., bars) using bounding boxes, while heatmaps are used to locate fiducials (e.g., axis ticks, pie junctions) and non-rectangular elements (e.g., the circumference and radial lines of a pie chart, line-plot graphs, etc.); at inference time, fiducial locations are recovered as local maxima of the predicted heatmaps. In Stage 3, the chart image and text regions are processed by an OCR system for text extraction. Finally, in Stage 4, the Analysis module aggregates the data produced so far, applying computer vision algorithms and domain knowledge of the different chart types to reproduce the original source tabular data represented by the chart.
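At inference, the point-fiducial heatmaps are decoded into discrete detections. The following is a minimal CenterNet-style decoding sketch; it illustrates the local-maximum reading described above under assumed parameter values, and is not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_peaks(heatmap, threshold=0.3):
    """Decode fiducial locations from a predicted heatmap.

    A pixel counts as a detection if it equals the maximum of its 3x3
    neighbourhood and its score exceeds the threshold.
    Returns a list of (x, y, score) tuples.
    """
    local_max = maximum_filter(heatmap, size=3)
    peaks = (heatmap == local_max) & (heatmap > threshold)
    ys, xs = np.nonzero(peaks)
    return list(zip(xs.tolist(), ys.tolist(), heatmap[ys, xs].tolist()))
```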
The system flowchart is presented in Figure 2, while its details are provided in the following subsections. Our full-page chart-region detector is a Faster-RCNN model employing the FPN detection meta-architecture and an RN50 backbone, implemented using the Detectron2 framework and trained on real data. The object categories of the model are: bar chart, pie chart, line chart, scatter plot, legend, title, caption, and x- and y-labels.

¹ We have observed cases where these elements are shared by a group of charts in a figure, and are therefore not necessarily adjacent to any one of them; in this case, page-level logic is beneficial.

Generation of synthetic chart data: we extended the FigureQA dataset [10] to generate the synthetic data for training the stage 2 detector. The chart-generation variability was increased by: random background colors, random border-line colors and styles, random spaces between the bars, random bar bottom values, allowing uniform bar/slice colors, allowing hidden axes, etc. Moreover, we randomly add texture to the background, bars, and pie slices by pasting tiles from the Describable Textures Dataset [5].

Heatmap-based detector: to analyze the content of the chart image and discover the graphical elements, we use the CenterNet model [17], modified to produce a number of heatmap types in addition to the standard bounding-box predictions. The model builds upon the stacked Hourglass-104 architecture and was trained only on synthetic chart images (Sec. 3.2). The categories supported by box predictions are vertical bar, horizontal bar, pie sector, and bar/pie chart regions (for cascade-style validation of Stage-1 predictions and refinement of the chart location). The categories supported by heatmaps are: the four corners of a bar (as separate categories), x-ticks, y-ticks, the center, circumference, and radial lines of a pie chart, the corners of pie sectors, line-segment knee points and the lines of line charts (which are in fact polygonal lines comprised of these segments), and the dots of scatter plots. The predictions made by the Stage 2 model are later used in the Analysis module of Stage 4 for recovery of the chart's tabular data.

Texts in charts may contain rotated x-axis labels and numbers in exponential notation. Our OCR pipeline is composed of the CRAFT text detection model [4], producing text regions and angles, and the Clova AI recognition model [3], with added support for rotated text and numbers with exponents. The rotated text regions are handled appropriately before being passed to the recognition model. Since the Clova AI model was not originally designed to handle numbers with exponents, we first detect horizontally aligned text boxes where the text begins with the digits 10, and check whether the following digits are superscripts.

The Analysis module builds upon the recognized texts and graphical objects to discover the tabular data generating the chart. There is dedicated logic for each chart type (bar, pie, and line). Regretfully, we omit some of the details for line charts due to space limitations.

Bar charts: we rely on Stage 2 detections of individual bars, filtering them under the assumptions of equal bar width and a common y-level for one of the horizontal edges of all the bars. The bar heights are computed by first recovering the numerical y-axis (if available), via a Hough transform aligning the numerical OCR outputs on a vertical line, followed by interpolating the heights of the horizontal edges (a sketch of this axis fit is given below). Alternatively, we look for height values written inside or above the bars.
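A minimal sketch of the value-axis recovery, assuming a linear (non-logarithmic) axis and already-matched pairs of tick pixel positions and OCR'd values; the Hough-transform alignment step described above is omitted, and the names are illustrative.

```python
import numpy as np

def fit_value_axis(tick_pixel_ys, tick_values):
    """Fit a linear map from image y-coordinates to data values.

    tick_pixel_ys: y pixel coordinates of the detected y-axis ticks.
    tick_values:   the numbers OCR'd next to those ticks.
    """
    a, b = np.polyfit(tick_pixel_ys, tick_values, deg=1)
    return lambda y: a * y + b

# Ticks at pixel rows 300/200/100 labelled 0/50/100 (y grows downward):
to_value = fit_value_axis([300, 200, 100], [0.0, 50.0, 100.0])
bar_top_y, baseline_y = 140, 300
bar_height = to_value(bar_top_y) - to_value(baseline_y)  # 80.0
```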
Bar labels are read below the bars, or are retrieved from a legend by color matching; the treatment of textured legends is left for future work. The axis titles are detected by combining the detections of the Stage-1 and Stage-2 models. Finally, the common chart elements, such as the title and caption (found nearby by the Stage-1 detector), are added. The process was made robust by addressing the numerous special cases and issues encountered in real-world data; this robustness was then validated via results on the three datasets detailed in Section 4.

Pie charts: we use both box proposals and image-size heatmaps, produced by the CenterNet detector [17], for the detection of the pie circle center, the radius, and the radial lines. An initial attempt to use only the sector box proposals proved insufficiently robust (it was difficult to filter proposals by the predicted score, especially for narrow sectors), although the circles were well fitted. Instead, we generate separate heatmaps for the circle circumference, the center point, the radial lines, and the sector corners; the final center and pie circle are then computed by initializing the centers at local maxima of the center-point heatmap (multiple pies are allowed), voting over a range of radii with the circumference-heatmap values (each pixel votes for its distance from the center; see the sketch below), and updating the center locations as the centers of mass of the pixels at the predicted radius. Finally, we detect the radial lines (disc sectors) using the radial-lines heatmap within the established circles. Using the recovered pie geometry, the sector labels are detected, if a legend is present, by matching the legend colors to those of the sectors. Otherwise, we look for lines starting inside the pie disc and leading out (connected components crossing the pie edge), and extract the label using the text piece on the other end as a seed. Finally, if no connectors are present, we match texts located within/outside of the corresponding sectors.
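A minimal sketch of the radius-voting step, assuming a single candidate center and a circumference heatmap with values in [0, 1]; center refinement and multi-pie handling are omitted, and the threshold and radius cap are illustrative.

```python
import numpy as np

def vote_pie_radius(circumference_hm, center, r_max=256, thr=0.3):
    """Estimate the pie radius by weighted voting of circumference pixels.

    Each pixel above `thr` votes for its (rounded) distance to the center,
    weighted by its heatmap value; the radius with the most votes wins.
    """
    cx, cy = center
    ys, xs = np.nonzero(circumference_hm > thr)
    dists = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    bins = np.clip(np.round(dists).astype(int), 0, r_max)
    votes = np.bincount(bins, weights=circumference_hm[ys, xs],
                        minlength=r_max + 1)
    return int(np.argmax(votes))
```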
Line charts: for line charts, we use heatmaps for the graph lines, and heatmaps for the knee points of the piece-wise linear graphs the model was trained on. The individual lines are extracted by color-based clustering of the highlighted region; non-continuous line types are supported by a stitching algorithm building the output line from the available pieces. As color clustering is not always accurate, we use noise filtering and morphological operations to make the inference more robust (a sketch of the clustering step is given below). We extract the numerical x- and y-axes (in the same way as for bar charts) to reconstruct the numeric data as (x, y) pairs (an example is presented in Figure 3).
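A minimal sketch of the color-based line separation, assuming the number of lines is known (e.g., from the legend entries) and ignoring the noise filtering, morphology, and stitching described above; the names and threshold are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_lines_by_color(chart_rgb, line_hm, n_lines, thr=0.3):
    """Separate overlapping line plots by clustering line pixels on color.

    chart_rgb: HxWx3 image; line_hm: HxW heatmap of line-pixel evidence.
    Returns one (xs, ys) pixel-array pair per extracted line.
    """
    ys, xs = np.nonzero(line_hm > thr)
    colors = chart_rgb[ys, xs].astype(np.float32)
    labels = KMeans(n_clusters=n_lines, n_init=10).fit_predict(colors)
    return [(xs[labels == k], ys[labels == k]) for k in range(n_lines)]
```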
In this section we provide a quantitative analysis of the different modules and an ablation study for the heatmap-based data extraction. For the stage 1 detector, we collected and manually annotated 743 document pages, of which 122 were crawled from the web and 521 were taken from PubMed COVID-19-related medical articles. They contain 1260 bar charts, 336 pie charts, 256 line charts, and 213 scatter plots. We trained the stage 1 detector in k-fold fashion, using 5 random 90%/10% train/test splits of the 743 pages. The results, averaged over the 5 splits, are given in Table 1. As reflected by the AP values, the performance of the stage 1 detector is satisfactory for the rest of the system to rely on. The performance of the stage 2 detector on the 169 real-world charts is given in Table 2; it is compared to [12], where a similar evaluation was conducted (on their Annotated dataset).

We evaluate the quality of recovering the tabular data, following the experiment designs of [16] and [12]. The test is conducted on the real-world data of 70 bar charts and 30 pie charts from the ICPR2020 dataset, and is compared against the similar annotated datasets of the cited works (30 charts in [16]; 10 bar charts and 10 pie charts in [12]). The accuracy of the numeric information (bar heights and pie-sector angles) is measured as the portion of detected elements whose numeric value is recovered within relative error $|v_{pred} - v_{gt}|/v_{gt} \le \delta$, as in [7], for $\delta \in [0.01, \dots, 0.25]$. The accuracy of label reconstruction was not considered in [16], whereas in [12] only elements with perfectly reproduced labels are counted as true positives; this is the reason some of the entries in Table 3 are empty (in Table 3, E.L denotes exact label prediction and A.L denotes any label). We measure the accuracy of text-label prediction by the Levenshtein-distance ratio [11], defined as $r(s_1, s_2) = (|s_1| + |s_2| - d_L(s_1, s_2))/(|s_1| + |s_2|)$ for strings $s_1, s_2$, where $d_L$ is the Levenshtein distance. Exactly recovered labels correspond to $r = 1.0$, which we used as the condition for comparison to the results of [12].

We observe that our numerical accuracy for bar charts is lower than that of [16] under the stricter 2% error restriction, but is better under the 5% test; this may follow from the limited 512x512 resolution of the CenterNet input images. On pie charts (Table 4), our system outperforms [12] except at the low accuracy standard $\delta = 0.25$. Our accuracy saturates at $\delta = 0.1$, meaning that we do not have errors higher than 10%. This behaviour may be more suitable for real-world applications, where inaccurate numerical predictions can mislead the user systems.

We explore the impact made by introducing the heatmaps for the detection of pie objects, which can alternatively be detected using sector boxes. In Table 5 we present the accuracy of the predicted sector angles in pie charts, using either only the sector box proposals or the set of heatmaps detailed in Section 3.4. The accuracy is substantially improved (by up to 10%) by the heatmap-based computation of the graphical elements. This table also sheds some light on the distribution of label-reconstruction accuracy among the reasonably accurate (5% relative error) sectors.

We have presented CHARTER, a practical system for converting document charts into tabular data, which can extend the scope of real-world document-conversion tools to include the content of embedded document charts. The proposed use of heatmaps allows more accurate analysis of line, scatter, bar, and pie charts within the same model. Moreover, our model can easily be extended to include additional chart types, which is within the scope of our future work.

References
[1] No Title.
[2] R. A. Al-Zaidy and C. L. Giles. Automatic extraction of data from bar charts. 2015.
[3] J. Baek et al. What is wrong with scene text recognition model comparisons? Dataset and model analysis. 2019.
[4] Y. Baek et al. Character region awareness for text detection. 2019.
[5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. 2014.
[6] M. Cliche et al. Scatteract: Automated extraction of data from scatter plots. 2017.
[7] W. Dai et al. Chart decoder: Generating textual and numeric information from chart images automatically. 2018.
[9] K. Kafle et al. Answering questions about data visualizations using efficient bimodal fusion. 2020.
[10] S. E. Kahou et al. FigureQA: An annotated figure dataset for visual reasoning. 2017.
[11] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. 1966.
[12] X. Liu et al. Data extraction from charts via single deep neural network. arXiv, 2019.
[13] J. Poco and J. Heer. Reverse-engineering visualizations: Recovering visual encodings from chart images. 2017.
[14] M. Savva et al. ReVision: Automated classification, analysis and redesign of chart images. 2011.
[15] W. R. Shadish et al. Using UnGraph to extract data from image files: Verification of reliability and validity. 2009.
[16] F. Zhou et al. Reverse-engineering bar charts using neural networks. 2021.
[17] X. Zhou, D. Wang, and P. Krähenbühl. Objects as points. arXiv, 2019.