
Proceedings of NAACL-HLT 2019, pages 32–39, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao
Alibaba Group

{huqiang.lxj,feiyu.gfy,qz.zhang,huasha.zhao}@alibaba-inc.com

Abstract

Visually rich documents (VRDs) are ubiquitous in daily business and life. Examples are purchase receipts, insurance policy documents, custom declaration forms and so on. In VRDs, visual and layout information is critical for document understanding, and texts in such documents cannot be serialized into a one-dimensional sequence without losing information. Classic information extraction models such as BiLSTM-CRF typically operate on text sequences and do not incorporate visual features. In this paper, we introduce a graph convolution based model to combine textual and visual information presented in VRDs. Graph embeddings are trained to summarize the context of a text segment in the document, and are further combined with text embeddings for entity extraction. Extensive experiments show that our method outperforms BiLSTM-CRF baselines by significant margins on two real-world datasets. Additionally, ablation studies are performed to evaluate the effectiveness of each component of our model.

1 Introduction

Information Extraction (IE) is the process of extracting structured information from unstructured documents. IE is a classic and fundamental Natural Language Processing (NLP) task, and extensive research has been conducted in this area. Traditionally, IE research focuses on extracting entities and relationships from plain text, where information is primarily expressed as natural language text. However, a large amount of information remains untapped in VRDs.

VRDs present information in the form of both text and vision. The semantic structure of a document is determined not only by the text within it but also by visual features such as the layout, tabular structure and font size. Examples of VRDs are purchase receipts, insurance policy documents, custom declaration forms and so on. Figure 1 shows example VRDs and example entities to extract.

Figure 1: Examples of VRDs and example entities to extract. (a) Purchase receipt. (b) Value-added tax invoice.

VRDs can be represented as a graph of text segments (Figure 2), where each text segment is comprised of the position of the segment and the text within it. The position of a text segment is determined by the four coordinates that define its bounding box. There are other potentially useful visual features in VRDs, such as fonts and colors, which are complementary to the position of the text. They are outside the scope of this paper, and we leave them to future work.

The problem we address in this paper is to extract the values of pre-defined entities from VRDs. We propose a graph convolution based method to combine the textual and visual information presented in VRDs. The graph embeddings produced by graph convolution summarize the context of a text segment in the document, and are further combined with text embeddings for entity extraction using a standard BiLSTM-CRF model. The following paragraphs summarize the challenges of the task and the contributions of our work.

1.1 Challenges

IE from VRDs is a challenging task, and the difficulties mainly arise from how to effectively incorporate visual cues from the document and from the scalability of the task.

First, text alone is not adequate to represent the semantic meaning in VRDs, and the context of a text is often expressed through visual cues. For example, there might be multiple dates in a purchase receipt, and it is up to the visual part of the model to distinguish between the Invoice Date, Transaction Date, and Due Date. Another example is the tax amount in the value-added tax invoice shown in Figure 1(b). There are multiple "money" entities in the document, and there is no textual context to determine which one is the tax amount. To extract the tax amount correctly, we have to leverage the (relative) position of the text segment and the visual features of the document in general.

Template matching based algorithms (Chiticariu et al., 2013; Dengel and Klein, 2002; Schuster et al., 2013) utilize visual features of the document to extract entities; however, we argue that they are mostly not scalable for this task in real-world business settings. There are easily thousands of vendors on the market, and the purchase receipt templates differ from vendor to vendor, so thousands of templates need to be created and maintained for this single scenario. Every time a new template comes in, substantial effort is required to update the templates and make sure the new one does not conflict with the rest, and the process is error-prone. Besides, user-uploaded pictures introduce another dimension of variance from the template. An example is shown in Figure 1(b). The value-added tax invoice is a nation-wide tax document, and its layout is fixed. However, pictures taken by users are usually distorted, often blurred and sometimes contain interfering objects. A simple template-based system performs poorly in such a scenario, while sophisticated rules require significant engineering effort for each scenario, which we believe is not scalable.

1.2 Contributions

In this paper, we present a novel method for IE from VRDs. The method first computes graph embeddings for each text segment in the document using graph convolution. The graph embeddings represent the context of the current text segment, where the convolution operation combines both textual and visual features of that context. The graph embeddings are then combined with text embeddings and fed into a standard BiLSTM for information extraction.

Extensive experiments show that our method outperforms BiLSTM-CRF baselines by significant margins on two real-world datasets. Additionally, ablation studies are performed to evaluate the effectiveness of each component of our model. Furthermore, we provide analysis and intuitions on why and how the individual components work in our experiments.

2 Related Works

Our work is inspired by recent research in the areas of graph convolution and information extraction.

2.1 Graph Convolution Network

Neural network architectures such as CNNs and RNNs have demonstrated huge success on many artificial intelligence tasks where the underlying data has a grid-like or sequential structure (Krizhevsky et al., 2012; Kim et al., 2016; Kim, 2014). Recently, there has been a surge of interest in studying neural network structures that operate on graphs (Kipf and Welling, 2016; Hamilton et al., 2017), since much data in the real world is naturally represented as graphs. Many works attempt to generalize convolution to the graph structure. Some use a spectrum-based approach, where the learned model depends on the structure of the graph; as a result, such an approach does not work well on dynamic graph structures. Others define convolution directly on the graph (Velickovic et al., 2017; Hamilton et al., 2017; Xu et al., 2018; Johnson et al., 2018; Duvenaud et al., 2015). We follow the latter approach in our work to model the text segment graph of VRDs.

Different from existing works, this paper introduces explicit edge embeddings into the graph convolution network, which model the relationship between vertices directly. Similar to (Velickovic et al., 2017), we apply self-attention (Vaswani et al., 2017) to define convolution on variable-sized neighborhoods, and the approach is computationally efficient since the operation is parallelizable across node pairs.

2.2 Information Extraction

Recently, significant progress has been made in information extraction from unstructured or semi-structured text. However, most works focus on plain text documents (Peng et al., 2017; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). For information extraction from VRDs, (Palm et al., 2017), which uses a recurrent neural network (RNN) to extract entities of interest from VRDs (invoices), is the closest to our work, but it does not take visual features into account. Besides, some studies (d'Andecy et al., 2018; Medvet et al., 2011; Rusinol et al., 2013) in the area of document understanding deal with a problem similar to ours and explore using visual features to aid text extraction from VRDs; however, the approaches they propose rely on a large amount of heuristic knowledge and human-designed features, and are limited to known templates, which is not scalable in real-world business settings. We also acknowledge the concurrent work of (Katti et al., 2018), which models 2-D documents using convolutional networks. However, there are several key differences: our neural network architecture is graph-based, and our model operates on text segments instead of characters as in (Katti et al., 2018).

Besides, information extraction based on graph structures has been developed most recently. (Peng et al., 2017; Song et al., 2018) present a graph LSTM to capture various dependencies among the input words, and (Wang et al., 2018) designs a novel graph schema to extract entities and relations jointly. However, their models do not directly incorporate visual information.

3 Model Architecture

This section describes our document model and the architecture of our proposed model. Our model first encodes each text segment in the document into a graph embedding, using multiple layers of graph convolution. The embedding represents the information in the text segment given its visual and textual context. By visual context, we refer to the layout of the document and the relative positions of the individual segment to other segments. The textual context is the aggregate of the text information in the document overall; our model learns to assign higher weights to texts from neighboring segments. We then combine the graph embeddings with text embeddings and apply a standard BiLSTM-CRF model for entity extraction.

Figure 2: Document graph. Every node in the graph is fully connected to every other node.

3.1 Document Modeling

We model each document as a graph of text segments (see Figure 2), where each text segment is comprised of the position of the segment and the text within it. The graph is comprised of nodes that represent text segments, and edges that represent visual dependencies, such as relative shapes and distance, between two nodes. Text segments are generated using an in-house Optical Character Recognition (OCR) system.

Mathematically, a document D is a tuple (T, E), where T = {t_1, t_2, ..., t_n}, t_i ∈ T is a set of n text boxes/nodes, R = {r_i1, r_i2, ..., r_ij}, r_ij ∈ R is a set of edges, and E = T × R × T is a set of directed edges of the form (t_i, r_ij, t_j), where t_i, t_j ∈ T and r_ij ∈ R. In our experiments, every node is connected to every other node.
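To make the document model concrete, the following minimal sketch shows one way such a graph could be represented in code. It is an illustration only, not the authors' implementation; the TextSegment fields and the build_document_graph helper are hypothetical names.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass
class TextSegment:
    """A node t_i: the OCR'd text plus its bounding box (x, y, width, height)."""
    text: str
    x: float
    y: float
    w: float
    h: float

def build_document_graph(segments):
    """Return the node list T and the fully connected set of directed edges.

    Each edge is an (i, j) index pair; its embedding r_ij is computed
    later from the two bounding boxes (see Section 3.2).
    """
    edges = [(i, j) for i, j in permutations(range(len(segments)), 2)]
    return segments, edges
```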

3.2 Feature Extraction

For node t_i, we calculate the node embedding t_i using a single-layer Bi-LSTM (Schuster and Paliwal, 1997) to extract features from the text content of the segment.
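As an illustration of this step, here is a minimal PyTorch sketch of a single-layer Bi-LSTM text encoder for one segment. The class name, dimensions, and the choice of taking the final hidden states of both directions as the node embedding are assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Single-layer Bi-LSTM over the tokens of one text segment (Section 3.2)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of the segment's tokens
        x = self.embed(token_ids)              # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(x)             # h_n: (2, batch, hidden_dim)
        # Node embedding t_i: final forward and backward hidden states, concatenated.
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)
```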

The edge embedding between node t_i and node t_j is defined as follows,

r_{ij} = \left[ x_{ij},\; y_{ij},\; \frac{w_i}{h_i},\; \frac{h_j}{h_i},\; \frac{w_j}{h_i} \right],   (1)

where x_ij and y_ij are the horizontal and vertical distances between the two text boxes respectively, and w_i and h_i are the width and height of the corresponding text box. The third, fourth and fifth values of the embedding are the aspect ratio of node t_i, and the relative height and width of node t_j, respectively. Empirically, the visual distance between two segments is an important feature: in general, pieces of related information, such as the key and value of an entity, lie closer together in a document. Moreover, the shape of a text segment plays a critical role in representing its semantic meaning; for example, a text segment containing an address is usually longer than one containing a buyer name. Therefore, we use the edge embedding to encode the visual distance between two segments, the shape of the source node, and the relative size of the destination node.

Figure 3: Graph convolution of the document graph. Convolution is defined on node-edge-node triplets (t_i, r_ij, t_j). Each layer produces new embeddings for both nodes and edges.

To summarize, the node embedding encodes textual features, while the edge embedding primarily represents visual features.
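The edge embedding of equation (1) can be computed directly from the two bounding boxes. The sketch below assumes the TextSegment fields introduced earlier and measures the distances as offsets between box origins; the paper does not state exactly how the distances are measured, so treat this as an assumption.

```python
def edge_embedding(src, dst):
    """Edge feature r_ij of equation (1) for source node t_i and destination node t_j."""
    x_ij = dst.x - src.x        # horizontal distance between the two text boxes
    y_ij = dst.y - src.y        # vertical distance between the two text boxes
    return [x_ij,
            y_ij,
            src.w / src.h,      # aspect ratio of t_i
            dst.h / src.h,      # relative height of t_j
            dst.w / src.h]      # relative width of t_j
```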

3.3 Graph Convolution

Graph convolution is applied to compute the visual text embeddings of text segments in the graph, as shown in Figure 3. Different from existing works, we define convolution on the node-edge-node triplets (t_i, r_ij, t_j) instead of on the node alone. We compare the performance of models using nodes only and node-edge-node triplets in Section 5.3. For node t_i, we extract features h_ij for each neighbor t_j using a multi-layer perceptron (MLP) network,

h_{ij} = g(t_i, r_{ij}, t_j) = \mathrm{MLP}([t_i \| r_{ij} \| t_j]),   (2)

where ‖ is the concatenation operation. There are several benefits to using this triplet feature set. First, it combines visual features directly into the neighbor representation. Furthermore, the information of the current node is copied across its neighbors, so the neighbor features can potentially learn where to attend given the current node.

In our model, graph convolution is defined based on the self-attention mechanism. The idea is to compute the output hidden representation of each node by attending to its neighbors. In its most general form, each node can attend to all other nodes, assuming a fully connected graph.

Concretely, the output embedding t'_i of the layer for node t_i is computed by

t'_i = \sigma\left( \sum_{j \in \{1,\cdots,n\}} \alpha_{ij} h_{ij} \right),   (3)

where α_ij are the attention coefficients and σ is an activation function. In our experiments, the attention mechanism is designed as follows,

\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}(w_a^{T} h_{ij})\right)}{\sum_{j \in \{1,\cdots,n\}} \exp\left(\mathrm{LeakyReLU}(w_a^{T} h_{ij})\right)},   (4)

where w_a is a shared attention weight vector. We apply the LeakyReLU activation function to avoid the "dying ReLU" problem and to potentially increase the "contrast" of the attention coefficients.

The edge embedding output of the graph convolution layer is defined as

r'_{ij} = \mathrm{MLP}(h_{ij}).   (5)

The outputs t'_i and r'_ij are fed as inputs to the next graph convolution layer (as computed in equation 2) or to network modules for downstream tasks.
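Putting equations (2)-(5) together, one graph convolution layer can be sketched as below. This is a simplified reading of the section, not the released code: the single-linear "MLP", the ReLU/sigmoid choices and the dimension names are assumptions, and for stacking layers the node output dimension would need to match the node input dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvLayer(nn.Module):
    """One graph convolution layer over node-edge-node triplets (Section 3.3)."""
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        self.triplet_mlp = nn.Linear(2 * node_dim + edge_dim, hidden_dim)  # eq. (2)
        self.w_a = nn.Parameter(torch.randn(hidden_dim))                   # eq. (4)
        self.edge_mlp = nn.Linear(hidden_dim, edge_dim)                    # eq. (5)

    def forward(self, nodes, edges):
        # nodes: (n, node_dim); edges: (n, n, edge_dim) for a fully connected graph
        n = nodes.size(0)
        t_i = nodes.unsqueeze(1).expand(n, n, -1)   # current node copied across neighbors
        t_j = nodes.unsqueeze(0).expand(n, n, -1)   # neighbor nodes
        h = torch.relu(self.triplet_mlp(torch.cat([t_i, edges, t_j], dim=-1)))  # h_ij, eq. (2)
        scores = F.leaky_relu(h @ self.w_a)         # (n, n) attention logits
        alpha = torch.softmax(scores, dim=1)        # attention coefficients, eq. (4)
        new_nodes = torch.sigmoid((alpha.unsqueeze(-1) * h).sum(dim=1))     # t'_i, eq. (3)
        new_edges = self.edge_mlp(h)                # r'_ij, eq. (5)
        return new_nodes, new_edges
```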

3.4 BiLSTM-CRF with Graph Embeddings

We combine graph embeddings with token embeddings and feed them into a standard BiLSTM-CRF for entity extraction. As illustrated in Figure 4, the visual text embedding generated by the graph convolution layers for the current text segment is concatenated to each token embedding of the input sequence. Intuitively, the graph embedding adds contextual information to the input sequence.

Figure 4: BiLSTM-CRF with graph embeddings.

Formally, assume that for node t_i the input token sequence of the text segment is x_1, x_2, ..., x_m, and that the graph embedding of the node is t'_i. The input embedding u_i is defined as

u_i = e(x_i) \| t'_i,   (6)

where e is the token embedding lookup function; Word2Vec vectors are used as token embeddings in our experiments.

The input embeddings are then fed into a BiLSTM network to be encoded, and the output is further passed to a fully connected network and then a CRF layer.
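Equation (6) amounts to tiling the segment's graph embedding over its token sequence before the BiLSTM-CRF. A small sketch, under the assumption that the token embeddings are precomputed Word2Vec vectors stored in a tensor:

```python
import torch

def build_input_embeddings(token_embeddings, graph_embedding):
    """Build u_i of equation (6) for one text segment.

    token_embeddings: (seq_len, token_dim) tensor of Word2Vec vectors e(x_i)
    graph_embedding:  (graph_dim,) tensor t'_i for the segment's node
    """
    seq_len = token_embeddings.size(0)
    tiled = graph_embedding.unsqueeze(0).expand(seq_len, -1)
    # Concatenate the graph embedding to every token embedding of the sequence.
    return torch.cat([token_embeddings, tiled], dim=-1)   # (seq_len, token_dim + graph_dim)
```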

4 Model Supervision and Training

We build an annotation system to facilitate the labeling of the ground truth data. For each document, we label the values of each pre-defined entity and their locations (bounding boxes). To generate training data, we first identify the text segment each entity belongs to, and then label the text in the segment according to the IOB tagging format (Sang and Veenstra, 1999). We assign the label O to all tokens in empty text segments.

Since human-annotated bounding boxes cannot match the OCR-detected boxes exactly, we apply a simple heuristic based on overlap area to determine which text segment an entity belongs to. A text segment is considered to contain an entity if A_overlap / min(A_annotator, A_ocr) is larger than a manually set threshold, where A_annotator, A_ocr and A_overlap are the area of the annotated box of the corresponding entity, the area of the OCR-detected box, and the area of the overlap between the two boxes, respectively.
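The overlap heuristic can be written down directly. The sketch below assumes (x1, y1, x2, y2) boxes and uses 0.5 only as a placeholder threshold, since the paper says the threshold is set manually but does not report its value.

```python
def entity_in_segment(annot_box, ocr_box, threshold=0.5):
    """Return True if A_overlap / min(A_annotator, A_ocr) exceeds the threshold."""
    ax1, ay1, ax2, ay2 = annot_box
    bx1, by1, bx2, by2 = ocr_box
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    a_overlap = inter_w * inter_h
    a_annot = (ax2 - ax1) * (ay2 - ay1)
    a_ocr = (bx2 - bx1) * (by2 - by1)
    return a_overlap / min(a_annot, a_ocr) > threshold
```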

In our experiments, the graph convolution layers and the BiLSTM-CRF extractor are trained jointly. Furthermore, to improve prediction accuracy, we add a segment classification task, which classifies each text segment into a pre-defined tag, as an auxiliary task; we discuss the effect of multi-task learning in Section 5.4.3. We feed the graph embedding of each text segment into a sigmoid classifier to predict the tag. Since the parameters of the graph convolution layers are shared between the extraction task and the segment classification task, we employ a multi-task learning approach for model training. In multi-task training, the goal is to optimize the weighted sum of the two losses. In our experiments, the weights are determined using the principled approach described in (Kendall et al., 2017), whose idea is to adjust each task's relative weight in the loss function by considering task-dependent uncertainties.
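One common way to realize the uncertainty-based weighting of (Kendall et al., 2017) is to learn a log-variance per task and combine the losses as below. This is a generic sketch of that technique, not the exact formulation used in the paper.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine the extraction loss and the segment classification loss."""
    def __init__(self, num_tasks=2):
        super().__init__()
        # One learned log-variance s_k per task.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: iterable of scalar task losses, e.g. [L_extraction, L_segment_cls]
        total = 0.0
        for k, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[k]) * loss + self.log_vars[k]
        return total
```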

5 Experiments

We apply our model to information extraction from two real-world datasets: Value-Added Tax Invoices (VATI) and International Purchase Receipts (IPR).

5.1 Datasets Description

VATI consists of 3000 user-uploaded pictures and has 16 entities to extract. Example entities are the names of buyer/seller, the date and the tax amount. The invoices are in Chinese and follow a fixed template, since they are national standard invoices. However, the documents are noisy; the noise includes distracting objects in the image and skewed document orientation, to name a few. IPR is a dataset of 1500 scanned receipt documents in English with 4 entities to extract (Invoice Number, Vendor Name, Payer Name and Total Amount). There are 146 templates for the receipts, and the variable templates introduce additional difficulty to the IPR dataset. For both datasets, we assign 70% of each dataset for training, 15% for validation and 15% for testing. The number of text segments per document varies from 100 to 300.

5.2 Baselines

We compare the performance of our system with two BiLSTM-CRF baselines. Baseline I applies BiLSTM-CRF to each text segment, where each text segment is treated as an individual sentence. Baseline II applies the tagging model to the concatenated document: text segments in a document are concatenated from left to right and from top to bottom, following (Palm et al., 2017). Baseline II thus incorporates one-dimensional textual context into the model.

Model              VATI   IPR
Baseline I         0.745  0.747
Baseline II        0.854  0.820
BiLSTM-CRF + GCN   0.873  0.836

Table 1: F1 score. Performance comparisons.

Entities   Baseline I  Baseline II  Our model
Invoice #  0.952       0.961        0.975
Date       0.962       0.963        0.963
Price      0.527       0.910        0.943
Tax        0.584       0.902        0.924
Buyer      0.402       0.797        0.833
Seller     0.681       0.731        0.782

Table 2: F1 score. Performance comparisons for individual entities from the VATI dataset.

5.3 Results

We use the F1 score to evaluate the performance of our model in all experiments. The main results are shown in Table 1. As we can see, our proposed model outperforms both baselines by significant margins, which shows that capturing patterns from VRDs with a one-dimensional text sequence is difficult. More specifically, we present performance comparisons for six entities from the VATI dataset in Table 2. Compared with the two baselines, our model performs almost identically on "simple" entities that can be distinguished by the text segment's textual features alone (i.e., Invoice Number and Date), where visual features and context information are not necessary. However, our proposed model clearly outperforms the baselines on entities that cannot be represented by text alone, such as Price, Tax, Buyer, and Seller.

To further examine the contribution made by each sub-component of the graph convolution network, we perform the following ablation studies. In each study, we exclude the visual features (edge embeddings), the textual features (node embeddings), or the attention mechanism, respectively, to see their impact on the F1 scores on both datasets. As presented in Table 3, visual features play a critical role in the performance of our model; removing them leads to a more than 5% performance drop on both datasets. Intuitively, visual features provide more information about the contexts of the text segments and so improve performance by discriminating between text segments with similar semantic meanings. Textual features make a similar contribution. Furthermore, the attention mechanism is more effective on the variable-template dataset, where it yields a 1.5% performance gain, but it makes no contribution on the fixed-layout dataset. We discuss attention further in the next section.

Configurations/Datasets  VATI   IPR
Full model               0.873  0.836
w/o visual features      0.808  0.775
w/o textual features     0.871  0.817
w/o attention            0.872  0.821

Table 3: F1 score. Ablation studies of individual components of graph convolution.

5.4 Discussions

5.4.1 Attention Analysis

To better understand how attention works in our model, we study the attention weights (over all other text segments) of each text segment in the document graph. Interestingly, for the VATI dataset, we find that the attention weights are usually concentrated on a fixed text segment with strong textual features (the segment containing the address information), regardless of which text segment is studied. The reason may be that the attention mechanism tries to find an anchor point in the document, and one anchor is enough since VATI documents all share the same template. Strong textual features help locate the anchor more accurately.

For variable-layout documents, more attention is paid to nearby text segments, which reflects the local structure of the document. Specifically, the attention weights of segments to the left and above are even higher. Furthermore, segments that contain slot keywords such as "address", "amount" and "name" receive higher attention weights.

5.4.2 The Number of Graph Convolution Layers

Here we evaluate the impact of different numbers of graph convolution layers. In theory, a higher number of layers encodes more information and can therefore model more complex relationships. We perform the experiments on selected entity types of the VATI dataset, and the results are presented in Table 4. As we can see, additional layers in the network do not help simple tasks, where by a simple task we mean one that already achieves high accuracy with a single layer of graph convolution. However, more layers do improve the performance on more difficult tasks. As shown in the table, the optimal number of layers is two for our task, as three layers overfit the model. Ideally, the number of graph convolution layers should be adapted to the specific task; we leave this study to future work.

Entities   1 layer  2 layers  3 layers
Invoice #  0.959    0.975     0.964
Date       0.960    0.963     0.960
Price      0.931    0.943     0.931
Tax        0.915    0.924     0.917
Buyer      0.829    0.833     0.827
Seller     0.772    0.782     0.775

Table 4: F1 score. Performance comparisons of different numbers of graph convolution layers for individual entities from the VATI dataset.

Model             VATI   IPR
BiLSTM-CRF + GCN  0.873  0.836
 + Multi-task     0.881  0.849

Table 5: F1 score. Effectiveness of the multi-task learning approach.

5.4.3 Multi-Task Learning

As shown in Table 5, our task benefits from the segment classification task and the multi-task learning method on both datasets. The two tasks in our experiments are complementary, and compared with the single-task model, the multi-task learning model may generalize better by drawing on more information. Furthermore, we find that multi-task learning helps training converge much faster.

6 Conclusions and Future Works

This paper studies the problem of entity extraction from VRDs. A graph convolution architecture is proposed to encode text embeddings given the visually rich context, and a BiLSTM-CRF is applied to extract the final results. We manually annotated two real-world datasets of VRDs and performed comprehensive experiments and analysis. Our system outperforms BiLSTM-CRF baselines and presents a novel method for IE from VRDs. In the future, we plan to extend the graph convolution framework to other tasks on VRDs, such as document classification.

Acknowledgments

We thank the OCR team of Alibaba Group for providing us with OCR service support, and the anonymous reviewers for their helpful comments.

References

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 827–832.

Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Vincent Poulain d'Andecy, Emmanuel Hartmann, and Marcal Rusinol. 2018. Field extraction by hybrid incremental and a-priori structural templates. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pages 251–256. IEEE.

Andreas R. Dengel and Bertin Klein. 2002. smartFIX: A requirements-driven system for document analysis and understanding. In International Workshop on Document Analysis Systems, pages 433–444. Springer.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P. Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232.

Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034.

Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image generation from scene graphs. arXiv preprint.

Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Hohne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2D documents. arXiv preprint arXiv:1809.08799.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Eric Medvet, Alberto Bartoli, and Giorgio Davanzo. 2011. A probabilistic approach to printed document understanding. International Journal on Document Analysis and Recognition (IJDAR), 14(4):335–347.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - a configuration-free invoice analysis system using recurrent neural networks. arXiv preprint arXiv:1708.07403.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. arXiv preprint arXiv:1708.03743.

Marcal Rusinol, Tayeb Benkhelfallah, and Vincent Poulain d'Andecy. 2013. Field extraction from administrative documents by incremental structural templates. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 1100–1104. IEEE.

Erik F. Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics, pages 173–179. Association for Computational Linguistics.

Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix - end-user trained information extraction for document archiving. In 2013 12th International Conference on Document Analysis and Recognition, pages 101–105. IEEE.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph state LSTM. arXiv preprint arXiv:1808.09101.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Shaolei Wang, Yue Zhang, Wanxiang Che, and Ting Liu. 2018. Joint extraction of entities and relations based on a novel graph scheme. In IJCAI, pages 4461–4467.

Kun Xu, Lingfei Wu, Zhiguo Wang, and Vadim Sheinin. 2018. Graph2Seq: Graph to sequence learning with attention-based neural networks. arXiv preprint arXiv:1804.00823.