GraphIE: A Graph-Based Framework for Information Extraction

Most modern Information Extraction (IE) systems are implemented as sequential taggers and only model local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. In this paper, we introduce GraphIE, a framework that operates over a graph representing a broad set of dependencies between textual units (i.e. words or sentences). The algorithm propagates information between connected nodes through graph convolutions, generating a richer representation that can be exploited to improve word-level predictions. Evaluation on three different tasks — namely textual, social media and visual information extraction — shows that GraphIE consistently outperforms the state-of-the-art sequence tagging model by a significant margin.


Introduction
Most modern Information Extraction (IE) systems are implemented as sequential taggers. While such models effectively capture relations in the local context, they have limited capability of exploiting non-local and non-sequential dependencies. In many applications, however, such dependencies can greatly reduce tagging ambiguity, thereby improving overall extraction performance. For instance, when extracting entities from a document, various types of non-local contextual information such as co-references and identical mentions may provide valuable cues. See for example Figure 1, in which the non-local relations are crucial to discriminate the entity type of the second mention of Washington (i.e. PERSON, ORGANIZATION or LOCATION).
Figure 1: Example of the entity extraction task with an ambiguous entity mention (i.e. "...for Washington's request..."). Aside from the sentential forward and backward edges (green), which aggregate local contextual information, non-local relations, such as the co-referent edges (red) and the identical-mention edges (blue), provide additional valuable information to reduce tagging ambiguity (i.e. PERSON, ORGANIZATION or LOCATION).

Most of the prior work looking at non-local dependencies incorporates them by constraining
the output space in a structured prediction framework (Finkel et al., 2005; Reichart and Barzilay, 2012; Hu et al., 2016). Such approaches, however, mostly overlook the richer set of structural relations in the input space. With reference to the example in Figure 1, co-referent dependencies could not be readily exploited by simply constraining the output space, as co-referent mentions are not necessarily labeled as entities (e.g. pronouns). In an attempt to capture non-local dependencies in the input space, alternative approaches define a graph that outlines the input structure and engineer features to describe it. Designing effective features is, however, challenging, arbitrary and time-consuming, especially when the underlying structure is complex. Moreover, these approaches have limited capacity to capture node interactions informed by the graph structure.
In this paper, we propose GraphIE, a framework that improves predictions by automatically learning the interactions between local and non-local dependencies in the input space. Our approach integrates a graph module with the encoder-decoder architecture for sequence tagging. The algorithm operates over a graph, where nodes correspond to textual units (i.e. words or sentences) and edges describe their relations. At the core of our model, a recurrent neural network sequentially encodes local contextual representations, and then the graph module iteratively propagates information between neighboring nodes using graph convolutions (Kipf and Welling, 2016). The learned representations are finally projected back to a recurrent decoder to support tagging at the word level. We evaluate GraphIE on three IE tasks, namely textual, social media, and visual (Aumann et al., 2006) information extraction. For each task, we provide as input a simple task-specific graph, which defines the data structure without any major processing or external resources. Our model is expected to learn from the relevant dependencies to identify and extract the appropriate information. Experimental results on multiple benchmark datasets show that GraphIE consistently outperforms a strong and commonly adopted sequential model (SeqIE, i.e. a bidirectional long short-term memory (BiLSTM) followed by a conditional random field (CRF) module). Specifically, in the textual IE task, we obtain an improvement of 0.5% over SeqIE on the CONLL03 dataset, and an improvement of 1.4% on chemical entity extraction (Krallinger et al., 2015). In the social media IE task, GraphIE improves over SeqIE by 3.7% in extracting the EDUCATION attribute from Twitter users. Finally, in visual IE, we outperform the baseline by 1.2%.

Related Work
The problem of incorporating non-local and non-sequential context to improve information extraction has been extensively studied in the literature. The majority of methods have focused on enforcing constraints in the output space during inference, through various mechanisms such as posterior regularization or generalized expectations (Finkel et al., 2005; Mann and McCallum, 2010; Reichart and Barzilay, 2012; Li et al., 2013; Hu et al., 2016).
Research capturing non-local dependencies in the input space has mostly relied on feature-based approaches. Roberts et al. (2008) and Swampillai and Stevenson (2011) designed intra- and inter-sentential features based on discourse and syntactic dependencies (e.g., shortest paths) to improve relation extraction. Other work used document graphs to flexibly represent multiple types of relations between words (e.g., syntactic, adjacency and discourse relations).
Graph-based representations can be also learned with neural networks. The most related work to ours is the graph convolutional network by Kipf and Welling (2016), which was developed to encode graph structures and perform node classification. In our framework, we adapt GCN as an intermediate module that learns non-local context, which -instead of being used directly for classification -is projected to the decoder to enrich local information and perform sequence tagging.
A handful of other information extraction approaches have used graph-based neural networks. Miwa and Bansal (2016) applied Tree LSTM (Tai et al., 2015) to jointly represent sequences and dependency trees for entity and relation extraction. Along the same line of work, Peng et al. (2017) and Song et al. (2018) introduced Graph LSTM, which extends the traditional LSTM to graphs by enabling a varied number of incoming edges at each memory cell. Graph convolutions have also been exploited to pool information over pruned dependency trees, outperforming existing sequence- and dependency-based neural models in a relation extraction task. These studies differ from ours in several respects. First, they can only model word-level graphs, whereas our framework can learn non-local context from either word- or sentence-level graphs, using it to reduce ambiguity during tagging at the word level. Second, all these studies achieved improvements only when using dependency trees. We extend the graph-based approach to validate the benefits of using other types of relations in a broader range of tasks, such as co-references in named entity recognition, followed-by links in social media, and layout structure in visual information extraction.

Problem Definition
We formalize information extraction as a sequence tagging problem. Rather than simply modeling inputs as sequences, we assume there exists a graph structure in the data that can be exploited to capture non-local and non-sequential dependencies between textual units, namely words or sentences.

Figure 2: GraphIE framework: (a) an overview of the framework; (b) architecture for the sentence-level graph, where each sentence is encoded to a node vector and fed into the graph module, and the output of the graph module is used as the initial state of the decoder; (c) architecture for the word-level graph, where the hidden state of each word in the encoder is taken as the input node vector of the graph module, and the output is fed into the decoder.

We consider the input to be a set of sentences S = {s_1, ..., s_N} and an auxiliary graph G = (V, E), where V = {v_1, ..., v_M} is the node set and E ⊆ V × V is the edge set. Each sentence is a sequence of words. We consider two different designs of the graph: (1) a sentence-level graph, where each node is a sentence (i.e. M = N) and the edges encode sentence dependencies; (2) a word-level graph, where each node is a word (i.e. M is the number of words in the input) and the edges connect pairs of words, such as co-referent tokens.
The edges e_{i,j} = (v_i, v_j) in the graph can be either directed or undirected. Multiple edge types can also be defined to capture different structural factors underlying the task-specific input data. We use the BIO (Begin, Inside, Outside) tagging scheme in this paper: for each sentence, the model predicts a tag for every word.
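As an illustration of the BIO scheme, the following sketch converts entity spans to per-word tags; the tokens, span boundaries and entity types are invented for the example:

```python
# Convert entity spans to BIO (Begin, Inside, Outside) tags.
# Generic illustration of the tagging scheme, not code from the paper.
def to_bio(tokens, spans):
    """spans: list of (start, end, type) token spans, with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # continuation tokens
    return tags

tokens = ["Washington", "came", "from", "a", "prosperous", "family"]
print(to_bio(tokens, [(0, 1, "PER")]))
# → ['B-PER', 'O', 'O', 'O', 'O', 'O']
```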

Method
GraphIE jointly learns local and non-local dependencies by iteratively propagating information between node representations. Our model has three components:
• an encoder, which generates local context-aware hidden representations for the textual units (i.e. words or sentences, depending on the task) with a recurrent neural network;
• a graph module, which captures the graph structure, learning non-local and non-sequential dependencies between textual units;
• a decoder, which exploits the contextual information generated by the graph module to perform labelling at the word level.
Figure 2 illustrates the overview of GraphIE and the model architectures for both sentence- and word-level graphs. In the following sections, we first introduce the case of the sentence-level graph, and then explain how to adapt the model to the word-level graph.

Encoder
In GraphIE, we first use an encoder to generate text representations. Given a sentence s_i = (w_1^(i), ..., w_k^(i)) of length k, each word w_t^(i) is represented by a vector x_t^(i), which is the concatenation of its word embedding and a feature vector learned with a character-level convolutional neural network (CharCNN; Kim et al. (2016)). We encode the sentence with a recurrent neural network (RNN), defining the hidden states as

h_{1:k}^(i) = RNN(x_{1:k}^(i); 0, Θ_enc),

where 0 indicates that the initial hidden state is zero, and Θ_enc represents the encoder parameters. We implement the RNN as a bi-directional LSTM (Hochreiter and Schmidhuber, 1997), and encode each sentence independently.
We obtain the sentence representation for s_i by averaging the hidden states of its words, i.e.

Enc(s_i) = (1/k) Σ_{t=1}^{k} h_t^(i).

The sentence representations are then fed into the graph module.
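The averaging step can be sketched as follows; the hidden states below are random stand-ins for actual BiLSTM outputs:

```python
import numpy as np

# Sentence representation as the average of the encoder's per-word hidden
# states: Enc(s_i) = (1/k) * sum_t h_t. The BiLSTM itself is omitted.
def sentence_representation(hidden_states):
    """hidden_states: (k, d) array of per-word hidden states."""
    return hidden_states.mean(axis=0)

h = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # k=3 words, d=2
print(sentence_representation(h))  # → [3. 4.]
```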

Graph Module
The graph module is designed to learn the non-local and non-sequential information from the graph. We adapt the graph convolutional network (GCN) to model the graph context for information extraction.
Given the sentence-level graph G = (V, E), where each node v i (i.e. sentence s i ) has the encoding Enc(s i ) capturing its local information, the graph module enriches such representation with neighbor information derived from the graph structure.
Our graph module is a GCN which takes as input the sentence representations, i.e. g_i^(0) = Enc(s_i), and conducts graph convolution on every node, propagating information between neighbors and integrating such information into a new hidden representation. Specifically, each layer of the GCN has two parts. The first transforms the information of each node from the previous layer, i.e.

α_i^(l) = W_v^(l) g_i^(l-1),
where W_v^(l) is a weight to be learned. The second aggregates information from the neighbors of each node, i.e. for node v_i we have

β_i^(l) = (1 / d(v_i)) Σ_{v_j : e_{j,i} ∈ E} W_e^(l) g_j^(l-1),

where d(v_i) is the degree of node v_i (i.e. the number of edges connected to v_i), which is used to normalize β_i^(l), ensuring that nodes with different degrees have representations of the same scale. In the simplest case, where the edges in the graph are undirected and of the same type, we use the same weight W_e^(l) for all of them. In a more general case, where multiple edge types exist, we expect them to have different impacts on the aggregation. Thus, we model these edge types with different weights in the equation above, similar to the relational GCN proposed by Schlichtkrull et al. (2018). When edges are directed, i.e. edge e_{i,j} is different from e_{j,i}, the propagation mechanism should mirror such difference. In this case, we treat directed edges as two edge types (forward and backward), and use different weights for them.
Finally, α_i^(l) and β_i^(l) are combined to obtain the representation at the l-th layer,

g_i^(l) = σ(α_i^(l) + β_i^(l) + b^(l)),

where σ(·) is a non-linear activation function, and b^(l) is a bias parameter. Because each layer only propagates information between directly connected nodes, we can stack multiple graph convolutional layers to obtain a larger receptive field, i.e. each node can become aware of more distant neighbors. After L layers, for each node v_i we obtain a contextual representation, GCN(s_i) = g_i^(L), that captures both local and non-local information.
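The graph-convolution layer described above can be sketched in NumPy for the simplest case (undirected edges, single edge type); the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

# One graph-convolution layer as in the text:
#   alpha_i = W_v g_i,
#   beta_i  = (1/d_i) * sum over neighbors j of W_e g_j,
#   g'_i    = sigma(alpha_i + beta_i + b).
def gcn_layer(g, adj, W_v, W_e, b):
    """g: (n, d) node features; adj: (n, n) 0/1 adjacency (no self-loops)."""
    alpha = g @ W_v
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)  # avoid division by zero
    beta = (adj @ (g @ W_e)) / deg                       # degree-normalized neighbor sum
    return np.tanh(alpha + beta + b)                     # tanh as the non-linearity

rng = np.random.default_rng(0)
n, d = 4, 8
g = rng.normal(size=(n, d))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
out = gcn_layer(g, adj, rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d))
print(out.shape)  # → (4, 8)
```

Stacking several such layers, as noted above, lets each node aggregate information from increasingly distant neighbors.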

Decoder
To support tagging, the learned representation is propagated to the decoder.
In our work, the decoder is instantiated as a BiLSTM+CRF tagger (Lample et al., 2016). The output representation of the graph module, GCN(s_i), is split into two vectors of the same length, which are used as the initial hidden states of the forward and backward LSTMs, respectively. In this way, the graph contextual information is propagated to each word through the LSTM. Specifically, we have

z_{1:k}^(i) = RNN(h_{1:k}^(i); GCN(s_i), Θ_dec),

where h_{1:k}^(i) are the output hidden states of the encoder, GCN(s_i) is used as the initial state, and Θ_dec represents the decoder parameters. A simpler way to incorporate the graph representation into the decoder is to concatenate it with the decoder's input, but the empirical performance is worse than using it as the initial state.
Finally, we use a CRF layer (Lafferty et al., 2001) on top of the BiLSTM to perform tagging,

y* = argmax_{y ∈ Y_k} p(y | z_{1:k}^(i); Θ_crf),

where Y_k is the set of all possible tag sequences of length k, and Θ_crf represents the CRF parameters, i.e. the transition scores of the tags. The CRF combines the local predictions of the BiLSTM and the transition scores to model the joint probability of the tag sequence.^4
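To make the CRF inference step concrete, the following is a minimal Viterbi decoder for a linear-chain CRF, combining per-word emission scores (which would come from the BiLSTM) with tag-transition scores; the scores below are invented for the example, and this is a generic sketch rather than the paper's implementation:

```python
import numpy as np

# Viterbi decoding for a linear-chain CRF: find the tag sequence maximizing
# the sum of emission scores and tag-transition scores.
def viterbi(emissions, transitions):
    """emissions: (k, T) per-word tag scores; transitions: (T, T) scores a -> b."""
    k, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((k, T), dtype=int)
    for t in range(1, k):
        # total[a, b]: best score ending in tag b at step t via tag a at t-1
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(k - 1, 0, -1):   # follow back-pointers to recover the path
        best.append(int(back[t][best[-1]]))
    return best[::-1]

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
transitions = np.array([[0.0, -2.0], [-2.0, 0.0]])  # discourage tag switches
print(viterbi(emissions, transitions))  # → [0, 0, 0]
```

Note how the transition penalty overrides the second word's emission preference for tag 1, illustrating how the CRF enforces sequential consistency on top of local predictions.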

Adaptation to Word-level Graphs
GraphIE can be easily adapted to model word-level graphs. In this case, the nodes represent the words in the input, i.e. the number of nodes M equals the total number of words in the N sentences. Each word's hidden state in the encoder is then used as the input node vector g_i^(0) of the graph module. The GCN conducts graph convolution on the word-level graph and generates graph-contextualized representations for the words. Finally, the decoder directly operates on the GCN's outputs, i.e. the BiLSTM decoder takes the GCN output vectors as its input sequence. In this case, the BiLSTM initial states are set to the default zero vectors. The CRF layer remains unchanged.

^4 In GraphIE, the graph module models the structure of the input space, i.e. the dependencies between textual units (sentences or words), while the final CRF layer models the sequential connections of the output tags. Even though loops may exist in the input graph, the CRF operates sequentially, so inference remains tractable.
As can be seen in Figure 2(c), the word-level graph module differs from the sentence-level one in that it directly takes the word representations from the encoder and feeds its output to the decoder. In the sentence-level graph, by contrast, the GCN operates on sentence representations, which are then used as the initial states of the decoder BiLSTM.

Experimental Setup
We evaluate the model on three tasks, including two traditional IE tasks, namely textual information extraction and social media information extraction, and an under-explored task -visual information extraction. For each of these tasks, we created a simple task-specific graph topology, designed to easily capture the underlying structure of the input data without any major processing. Table 1 summarizes the three tasks.

Task 1: Textual Information Extraction
In this task, we focus on named entity recognition at discourse level (DiscNER). In contrast to traditional sentence-level NER (SentNER), where sentences are processed independently, in DiscNER, long-range dependencies and constraints across sentences have a crucial role in the tagging process. For instance, multiple mentions of the same entity are expected to be tagged consistently in the same discourse. Here we propose to use this (soft) constraint to improve entity extraction.
Dataset We conduct experiments on two NER datasets: the CoNLL-2003 dataset (CONLL03) (Tjong Kim Sang and De Meulder, 2003), and the CHEMDNER dataset for chemical entity extraction (Krallinger et al., 2015). We follow the standard split of each corpus. Statistics are shown in Table 2.

Graph Construction
In this task, we use a word-level graph where nodes represent words. We create two types of edges for each document:
• Local edges: forward and backward edges are created between neighboring words in each sentence, allowing local contextual information to be utilized.
• Non-local edges: re-occurrences of the same token (other than stop words) are connected, so that information can be propagated through them, encouraging global consistency of tagging.^5
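A minimal sketch of this edge-construction procedure; the stop-word list is an illustrative subset, not the one used in the paper:

```python
# Word-level graph for Task 1: local forward/backward edges between adjacent
# words in each sentence, plus non-local edges between re-occurrences of the
# same non-stopword token anywhere in the document.
STOPWORDS = {"the", "a", "of", "for", "in"}  # illustrative subset

def build_edges(sentences):
    edges, positions = [], {}
    idx = 0  # global word index across the document
    for sent in sentences:
        for j, tok in enumerate(sent):
            if j > 0:  # local edges between neighboring words
                edges.append((idx - 1, idx, "forward"))
                edges.append((idx, idx - 1, "backward"))
            low = tok.lower()
            if low not in STOPWORDS:  # non-local edges between identical mentions
                for prev in positions.get(low, []):
                    edges.append((prev, idx, "non-local"))
                    edges.append((idx, prev, "non-local"))
                positions.setdefault(low, []).append(idx)
            idx += 1
    return edges

edges = build_edges([["Washington", "spoke"], ["Washington", "agreed"]])
```

Here the two mentions of "Washington" (word indices 0 and 2) are linked, so tagging evidence can flow between them across sentence boundaries.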

Task 2: Social Media Information Extraction
Social media information extraction refers to the task of extracting information from users' posts in online social networks (Benson et al., 2011;Li et al., 2014). In this paper, we aim at extracting education and job information from users' tweets. Given a set of tweets posted by a user, the goal is to extract mentions of the organizations to which they belong. The fact that the tweets are short, highly contextualized and show special linguistic features makes this task particularly challenging.
Dataset We construct two datasets, EDUCATION and JOB, from the Twitter corpus released by Li et al. (2014). The original corpus contains millions of tweets generated by ≈10 thousand users, where the education and job mentions are annotated using distant supervision (Mintz et al., 2009). We sample the tweets of each user, maintaining the ratio between positive and negative posts.^6 The obtained EDUCATION dataset consists of 443,476 tweets generated by 7,208 users, and the JOB dataset contains 176,043 tweets generated by 1,772 users. Dataset statistics are reported in Table 3. Both datasets are split into 60% for training, 20% for development, and 20% for testing. We perform 5 different random splits and report the average results.

^5 Note that other non-local relations such as co-references (cf. the example in Figure 1) may be used for further improvement. However, these relations require additional resources to obtain, and we leave them to future work.
^6 Positive and negative refer here to whether or not an education or job mention is present in the tweet.
Graph Construction We construct the graph as ego-networks (Leskovec and McAuley, 2012), i.e. when we extract information about one user, we consider the subgraph formed by the user and his/her direct neighbors. Each node corresponds to a Twitter user, represented by the set of posted tweets. Edges are defined by the followed-by link, under the assumption that connected users are more likely to come from the same university or company. An example of the social media graph is reported in the appendices.
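The ego-network construction can be sketched as follows; the user names and the followed-by relation are invented for the example:

```python
# Ego-network for Task 2 (a sketch): for a target user, keep the subgraph
# formed by the user and their direct followed-by neighbors.
def ego_network(user, followed_by):
    """followed_by: dict mapping a user to the set of users who follow them."""
    nodes = {user} | followed_by.get(user, set())
    # keep only edges whose endpoints both lie inside the ego-network
    edges = {(u, v) for u in nodes for v in followed_by.get(u, set()) if v in nodes}
    return nodes, edges

followed_by = {"alice": {"bob", "carol"}, "bob": {"carol"}, "dave": {"alice"}}
nodes, edges = ego_network("alice", followed_by)
print(sorted(nodes))  # → ['alice', 'bob', 'carol']
```

Note how "dave", who is not a direct neighbor of "alice", is excluded from the subgraph along with all edges touching him.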

Task 3: Visual Information Extraction
Visual information extraction refers to the extraction of attribute values from documents formatted in various layouts. Examples include invoices and forms, whose format can be exploited to infer valuable information to support extraction.
Dataset The corpus consists of 25,200 Adverse Event Case Reports (AECR) recording drug-related side effects. Each case contains an average of 9 pages. Since these documents are produced by multiple organizations, they exhibit large variability in layout and presentation styles (e.g. text, table, etc.). 8 The collection is provided with a separate human-extracted ground truth database that is used as a source of distant supervision.
Our goal is to extract eight attributes related to the patient, the event, the drug and the reporter (cf. Table 6 for the full list). Attribute types include dates, words and phrases -which can be directly extracted from the document.
The dataset is split in 50% cases for training, 10% for development, and 40% for testing.

Graph Construction
We first convert the PDFs to text using PDFMiner, which provides the words along with their positions on the page (i.e. bounding-box coordinates). Consecutive words are then geometrically joined into text boxes. Each text box is considered a "sentence" in this task, and corresponds to a node in the graph.
Since the page layout is the major structural factor in these documents, we work on a page-by-page basis, i.e. each page corresponds to a graph. Edges are defined to horizontally or vertically connect nodes (text boxes) that are close to each other (i.e. when the overlap of their bounding boxes, in either the vertical or horizontal direction, is over 50%). Four types of edges are considered: left-to-right, right-to-left, up-to-down, and down-to-up. When multiple nodes are aligned, only the closest ones are connected. An example of a visual document graph is reported in the appendices.
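A simplified sketch of this construction, where boxes are (x0, y0, x1, y1) tuples with y increasing downward; the closest-node filtering mentioned above is omitted for brevity:

```python
# Layout-graph edges for Task 3 (a sketch): connect two text boxes when their
# bounding boxes overlap by more than 50% along one axis.
def overlap_ratio(a0, a1, b0, b1):
    """Overlap of intervals [a0, a1] and [b0, b1], relative to the shorter one."""
    inter = min(a1, b1) - max(a0, b0)
    return inter / min(a1 - a0, b1 - b0)

def layout_edges(boxes):
    edges = []
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            # same row (vertical overlap > 50%) and a entirely left of b
            if overlap_ratio(a[1], a[3], b[1], b[3]) > 0.5 and a[2] <= b[0]:
                edges.append((i, j, "left-to-right"))
                edges.append((j, i, "right-to-left"))
            # same column (horizontal overlap > 50%) and a entirely above b
            if overlap_ratio(a[0], a[2], b[0], b[2]) > 0.5 and a[3] <= b[1]:
                edges.append((i, j, "up-to-down"))
                edges.append((j, i, "down-to-up"))
    return edges

boxes = [(0, 0, 10, 2), (12, 0, 22, 2), (0, 4, 10, 6)]  # two on one row, one below
edges = layout_edges(boxes)
```

With these three boxes, the sketch produces a left-to-right/right-to-left pair for the two boxes on the same row and an up-to-down/down-to-up pair for the vertically aligned ones.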

Baseline and Our Method
We implement a two-layer BiLSTM with a conditional random field (CRF) tagger as the sequential baseline (SeqIE). This architecture and its variants have been extensively studied and demonstrated to be successful in previous work on information extraction (Lample et al., 2016). In the textual IE task (Task 1), our baseline is shown to obtain results competitive with the state-of-the-art method on the CONLL03 dataset. In the visual IE task (Task 3), in order to further increase the competitiveness of the baseline, we sequentially concatenate the horizontally aligned text boxes, thereby fully modeling the horizontal edges of the graph.
Table 4: NER accuracy on the CONLL03 and the CHEMDNER datasets (Task 1). Scores for our methods are the average of 5 runs. * indicates statistical significance of the improvement over SeqIE (p < 0.01).

Our baseline shares the same encoder and decoder architecture as GraphIE, but without the graph module. Both architectures have similar
computational cost. In Task 1, we apply GraphIE with word-level graph module (cf. Figure 2(c)), and in Task 2 and Task 3, we apply GraphIE with sentence-level graph module (cf. Figure 2(b)).

Implementation Details
The models are trained with Adam (Kingma and Ba, 2014) to minimize the CRF objective. For regularization, we apply dropout with a ratio of 0.1 to both the input word representations and the hidden layer of the decoder. The learning rate is set to 0.001. We use the development set for early stopping and for the selection of the best-performing hyperparameters. For CharCNN, we use 64-dimensional character embeddings and 64 filters of width 2 to 4 (Kim et al., 2016). The 100-dimensional pretrained GloVe word embeddings (Pennington et al., 2014) are used in Tasks 1 and 2, and 64-dimensional randomly initialized word embeddings are used in Task 3. We use a two-layer GCN in Task 1, and a one-layer GCN in Tasks 2 and 3. The encoder and decoder BiLSTMs have the same dimension as the graph convolution layer. In Task 3, we concatenate a positional encoding to each text box's representation by transforming its bounding-box coordinates into a vector of length 32 and then applying a tanh activation.

Figure 3: Analysis on the CONLL03 dataset. We compare with two alternative designs: (1) random connection, where we replace the constructed graph with a random graph with the same number of edges; (2) feature augmentation, where we use the average embedding of each node and its neighbors as the input to the decoder, instead of the GCN, which has additional parameters. We report F1 scores on the development set.
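The bounding-box positional encoding used in Task 3 can be sketched as follows; the projection matrix is a random stand-in for learned weights, and the coordinate normalization is an assumption for the example:

```python
import numpy as np

# Positional encoding for a text box (Task 3 sketch): project the four
# bounding-box coordinates to a 32-dimensional vector, then apply tanh.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 32))  # stand-in for a learned linear map

def positional_encoding(bbox):
    """bbox: (x0, y0, x1, y1), assumed normalized to [0, 1]."""
    return np.tanh(np.asarray(bbox) @ W)

enc = positional_encoding((0.1, 0.2, 0.4, 0.25))
print(enc.shape)  # → (32,)
```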

Task 1: Textual Information Extraction
The improvement over SeqIE on CONLL03 highlights once more the importance of modeling non-local and non-sequential dependencies, and confirms that our approach is an appropriate method to achieve this goal.^10 For CHEMDNER, we show the best performance reported in Krallinger et al. (2015), obtained with a feature-based method. Our baseline outperforms this feature-based method, and GraphIE further improves the performance by 1.4%.
Analysis To understand the advantage of GraphIE, we first investigate the importance of the graph structure to the model. As shown in Figure 3, using random connections clearly hurts performance, bringing the F1 score of GraphIE down from 95.12% to 94.29%. This indicates that the task-specific graph structure introduces a beneficial inductive bias. Trivial feature augmentation also does not work well, confirming the necessity of learning the graph embedding with the GCN.
We further conduct an error analysis on the test set to validate our motivation that GraphIE resolves tagging ambiguity by encouraging consistency among identical entity mentions (cf. Figure 1). Here we examine word-level tagging accuracy. We define as ambiguous the words that have more than one possible tag in the dataset. We find that among the 1.78% tagging errors of SeqIE, 1.16% involve ambiguous words and 0.62% unambiguous ones. GraphIE reduces the error rate to 1.67%, of which 1.06% are ambiguous and 0.61% unambiguous. Most of the error reduction is thus indeed attributable to the ambiguous words.

^10 We achieve the best reported performance among methods not using the recently introduced ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), which are pretrained on extra-large corpora and are computationally demanding.

Task 2: Social Media Information Extraction

Table 5 shows the results for the social media information extraction task. We first report a simple dictionary-based method as a baseline. The neural IE models achieve much better performance, showing that the models learn meaningful patterns rather than simply memorizing the entities in the training set. The proposed GraphIE outperforms SeqIE on both the EDUCATION and JOB datasets, and the improvement is more significant on the EDUCATION dataset (3.7% versus 0.3%). The reason for this difference is the variance in the affinity scores (Mislove et al., 2010) between the two datasets. Li et al. (2014) underline that the affinity value for EDUCATION is 74.3 while for JOB it is only 14.5, which means that neighbors in the datasets are five times more likely to have studied at the same university than to have worked at the same company. We can therefore expect a model like GraphIE, which exploits neighbors' information, to obtain a larger advantage on a dataset characterized by higher affinity.

Table 6 shows the results for the visual information extraction task.
GraphIE outperforms the SeqIE baseline on most attributes, and achieves a 1.2% improvement in the micro-average F1 score. This confirms the benefit of using the layout graph structure in visual information extraction. The extraction performance varies across the attributes, with Drug Name at 61.4% at the lower end. In the ablation test described in Table 7, we can see the contributions of: using separate weights for different edge types (+0.8%), horizontal edges (+3.1%), vertical edges (+5.4%), and the CRF layer (+5.7%).

Task 3: Visual Information Extraction
Generalization We also assess GraphIE's capacity to deal with unseen layouts through an extra analysis. From our dataset, we sample 2,000 reports containing the three most frequent templates, and train the models on this subset. We then test all models in two settings: 1) seen templates, consisting of 1,000 additional reports with the same templates used for training; and 2) unseen templates, consisting of 1,000 reports with two new template types.
The performance of GraphIE and SeqIE is reported in Figure 4. Both models achieve good results on seen templates, with GraphIE still scoring 2.8% higher than SeqIE. The gap becomes even larger when our model and the sequential one are tested on unseen templates (i.e. 20.3%), demonstrating that by explicitly modeling the richer structural relations, GraphIE achieves better generalizability.

Figure 4: Micro-average F1 scores tested on seen and unseen templates (Task 3).

Conclusions
We introduced GraphIE, an information extraction framework that learns local and non-local contextual representations from graph structures to improve predictions. The system operates over a task-specific graph topology describing the underlying structure of the input data. GraphIE jointly models the node (i.e. textual units, namely words or sentences) representations and their dependencies. Graph convolutions project information through neighboring nodes to finally support the decoder during tagging at the word level.
We evaluated our framework on three IE tasks, namely textual, social media and visual information extraction. Results show that it efficiently models non-local and non-sequential context, consistently enhancing accuracy and outperforming the competitive SeqIE baseline (i.e. BiLSTM+CRF).
Future work includes the exploration of automatically learning the underlying graphical structure of the input data.