A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing

Text-level discourse parsing remains a challenge.

Discourse parsing is the task of identifying the presence and the type of the discourse relations between discourse units.

2.1 HILDA discourse parser

Figure 3 demonstrates the overall work flow of our discourse parser.

For both intra- and multi-sentential parsing, our bottom-up tree-building process adopts a similar greedy pipeline framework like the HILDA discourse parser (discussed in Section 2.1), to guarantee efficiency for large documents.

After an intra- or multi-sentential discourse tree is fully built, we perform a post-editing to consider possible modifications to the current tree, by considering useful information from the discourse constituents on upper levels, which is unavailable in the bottom-up tree-building process.

Here we analyze the time complexity of each component in our discourse parser, to quantitatively demonstrate the time efficiency of our model.

In our local models, to encode two adjacent units, U j and U j+1, within a CRF chain, we use the following 10 sets of features, some of which are modified from J oty et al.’s model.

For preprocessing, we use the Stanford CoreNLP (Klein and Manning, 2003; de Marneffe et al., 2006; Recasens et al., 2013) to syntactically parse the texts and extract coreference relations, and we use Penn2Malt5 to lexicalize syntactic trees to extract dominance features.

9.1 Parsing accuracy

In this paper, we presented an efficient text-level discourse parser with time complexity linear in the total number of sentences in the document.

Appears in 21 sentences as: discourse parser (14) Discourse parsing (1) discourse parsing (6)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- Text-level discourse parsing remains a challenge.Page 1, “Abstract”
- Discourse parsing is the task of identifying the presence and the type of the discourse relations between discourse units.Page 1, “Introduction”
- While research in discourse parsing can be partitioned into several directions according to different theories and frameworks, Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) is probably the most ambitious one, because it aims to identify not only the discourse relations in a small local context, but also the hierarchical tree structure for the full text: from the relations relating the smallest discourse units (called elementary discourse units, EDUs), to the ones connecting paragraphs.Page 1, “Introduction”
- Conventionally, there are two major subtasks related to text-level discourse parsing : (l) EDU segmentation: to segment the raw text into EDUs, and (2) tree-building: to build a discourse tree from EDUs, representing the discourse relations in the text.Page 1, “Introduction”
- However, as an optimal discourse parser , Joty et al.’s model is highly inefficient in practice, with respect to both their DCRF-based local classifiers, and their CKY-like bottom-up parsing algorithm.Page 1, “Introduction”
- The main objective of this work is to develop a more efficient discourse parser , with similar or even better performance with respect to Joty et al.’s optimal parser, but able to produce parsing results in real time.Page 1, “Introduction”
- greedy bottom-up strategy, we develop a discourse parser with a time complexity linear in the total number of sentences in the document.Page 2, “Introduction”
- As a result of successfully avoiding the expensive non-greedy parsing algorithms, our discourse parser is very efficient in practice.Page 2, “Introduction”
- 2.1 HILDA discourse parserPage 2, “Related work”
- The HILDA discourse parser by Hernault et al.Page 2, “Related work”
- (2010) is the first attempt at RST-style text-level discourse parsing .Page 2, “Related work”

See all papers in *Proc. ACL 2014* that mention discourse parser.

See all papers in *Proc. ACL* that mention discourse parser.

Back to top.

Appears in 17 sentences as: EDUs (21)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- While research in discourse parsing can be partitioned into several directions according to different theories and frameworks, Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) is probably the most ambitious one, because it aims to identify not only the discourse relations in a small local context, but also the hierarchical tree structure for the full text: from the relations relating the smallest discourse units (called elementary discourse units, EDUs ), to the ones connecting paragraphs.Page 1, “Introduction”
- For example, Figure 1 shows a text fragment consisting of two sentences with four EDUs in total (el-e4).Page 1, “Introduction”
- shown below the text, following the notation convention of RST: the two EDUs el and e2 are related by a mononuclear relation CONSEQUENCE, where e2 is the more salient span (called nucleus, and e1 is called satellite); e3 and e4 are related by another mononuclear relation CIRCUMSTANCE, with e4 as the nucleus; the two spans em and e3;4 are further related by a multi-nuclear relation SEQUENCE, with both spans as the nucleus.Page 1, “Introduction”
- Conventionally, there are two major subtasks related to text-level discourse parsing: (l) EDU segmentation: to segment the raw text into EDUs, and (2) tree-building: to build a discourse tree from EDUs , representing the discourse relations in the text.Page 1, “Introduction”
- Figure 1: An example text fragment composed of two sentences and four EDUs , with its RST discourse tree representation shown below.Page 2, “Introduction”
- In particular, starting from EDUs , at each step of the tree-building, a binary SVM classifier is first applied to determine which pair of adjacent discourse constituents should be merged to form a larger span, and another multi-class SVM classifier is then applied to assign the type of discourse relation that holds between the chosen pair.Page 2, “Related work”
- Each sentence 5,, after being segmented into EDUs (not shown in the figure), goes through an intra-sentential bottom-up tree-building model Minna, to form a sentence-level discourse tree Tgi, with the EDUs as leaf nodes.Page 3, “Overall work flow”
- In particular, starting from the constituents on the bottom level ( EDUs for intra-sentential parsing and sentence-level discourse trees for multi-sentential parsing), at each step of the tree-building, we greedily merge a pair of adjacent discourse constituents such that the merged constituent has the highest probability as predicted by our structure model.Page 3, “Bottom-up tree-building”
- ,em}, which are the EDUs of the sentence; after merging el and 62 on the second level, we have E2 = {613,63, .Page 4, “Bottom-up tree-building”
- In contrast, J oty et al.’s computation of intra-sentential sequences depends on the particular pair of constituents: the sequence is composed of the pair in question, with other EDUs in the sentence, even if those EDUs have already been merged.Page 5, “Bottom-up tree-building”
- In addition to efficiency, our use of a single CRF chain for all constituents can better capture the sequential dependencies among context, by taking into account the information from partially built discourse constituents, rather than bottom-level EDUs only.Page 5, “Bottom-up tree-building”

See all papers in *Proc. ACL 2014* that mention EDUs.

See all papers in *Proc. ACL* that mention EDUs.

Back to top.

Appears in 15 sentences as: time complexity (15)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- However, their model has a high order of time complexity , and thus cannot be applied in practice.Page 1, “Abstract”
- In this work, we develop a much faster model whose time complexity is linear in the number of sentences.Page 1, “Abstract”
- greedy bottom-up strategy, we develop a discourse parser with a time complexity linear in the total number of sentences in the document.Page 2, “Introduction”
- Due to the 0(n3) time complexity , where n is the numberPage 2, “Related work”
- 2The time complexity will be reduced to 0(M2n2), if wePage 5, “Bottom-up tree-building”
- Here we analyze the time complexity of each component in our discourse parser, to quantitatively demonstrate the time efficiency of our model.Page 6, “Linear time complexity”
- overall time complexity to perform intra-sentential parsing is The reason is the following.Page 7, “Linear time complexity”
- The time complexity for performing forward-backward inference on the single chain is 0((mk — i) x M2) = 0(mk — i), where the constant M is the size of the output vocabulary.Page 7, “Linear time complexity”
- Starting from the EDUs on the bottom level, we need to perform inference for one chain on each level during the bottom-up tree-building, and thus the total time complexity is 2310(mk — i) =Page 7, “Linear time complexity”
- Therefore, the total time complexity Zz=10(m%) g n x 0(max1£j£n(m2-)) = n x 0(1) 2 0(n), i.e., linearPage 7, “Linear time complexity”
- By including a constant number k of discourse units in each chain, and considering a constant number I of such chains for computing each adjacent pair of discourse constituents (k = 4 for and k = 3 for Mfrfélti; l = 3), we have an overall time complexity of The reason is that it takes l x 0(kM2) = 0(1) time, where l,k,M are all constants, to perform exact inference for a given pair of adjacent constituents, and we need to perform such computation for all n — 1 pairs of adjacent sentences on the first level of the tree-building.Page 7, “Linear time complexity”

See all papers in *Proc. ACL 2014* that mention time complexity.

See all papers in *Proc. ACL* that mention time complexity.

Back to top.

Appears in 14 sentences as: CRF (14)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- To enhance the accuracy of the pipeline, we add additional constraints in the Viterbi decoding of the first CRF .Page 1, “Abstract”
- Specifically, in the Viterbi decoding of the first CRF , we include additional constraints elicited from common sense, to make more effective local decisions.Page 2, “Introduction”
- (2013) approach the problem of text-level discourse parsing using a model trained by Conditional Random Fields ( CRF ).Page 2, “Related work”
- Secondly, as a joint model, it is mandatory to use a dynamic CRF , for which exact inference is usually intractable or slow.Page 4, “Bottom-up tree-building”
- Figure 4a shows our intra-sentential structure model in the form of a linear-chain CRF .Page 4, “Bottom-up tree-building”
- Thus, different CRF chains have to be formed for different pairs of constituents.Page 5, “Bottom-up tree-building”
- In addition to efficiency, our use of a single CRF chain for all constituents can better capture the sequential dependencies among context, by taking into account the information from partially built discourse constituents, rather than bottom-level EDUs only.Page 5, “Bottom-up tree-building”
- Instead, we choose to take a sliding-window approach to form CRF chains for a particular pair of constituents, as shown in Figure 4b.Page 5, “Bottom-up tree-building”
- The linear-chain CRF contains a first layer of all discourse constituents U j’s in the sentence on level i, and a second layer of relation nodes R j’s to represent the relation between a pair of discourse constituents.Page 5, “Bottom-up tree-building”
- 3These leaf constituents are represented using a special feature vector i 8 _le a f = True; thus the CRF never labels them with relations other than LEAF.Page 5, “Bottom-up tree-building”
- In our local models, to encode two adjacent units, U j and U j+1, within a CRF chain, we use the following 10 sets of features, some of which are modified from J oty et al.’s model.Page 7, “Features”

See all papers in *Proc. ACL 2014* that mention CRF.

See all papers in *Proc. ACL* that mention CRF.

Back to top.

Appears in 12 sentences as: CRFs (12) CRF’s (1)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- Our model adopts a greedy bottom-up approach, with two linear-chain CRFs applied in cascade as local classifiers.Page 1, “Abstract”
- DCRF (Dynamic Conditional Random Fields) is a generalization of linear-chain CRFs , in which each time slice contains a set of state variables and edges (Sutton et al., 2007).Page 1, “Introduction”
- Second, by using two linear-chain CRFs to label a sequence of discourse constituents, we can incorporate contextual information in a more natural way, compared to using traditional discriminative classifiers, such as SVMs.Page 2, “Introduction”
- 4.1 Linear-chain CRFs as Local modelsPage 3, “Bottom-up tree-building”
- While our bottom-up tree-building shares the greedy framework with HILDA, unlike HILDA, our local models are implemented using CRFs .Page 3, “Bottom-up tree-building”
- Therefore, our model incorporates the strengths of both HILDA and Joty et al.’s model, i.e., the efficiency of a greedy parsing algorithm, and the ability to incorporate sequential information with CRFs .Page 3, “Bottom-up tree-building”
- Therefore, it is well motivated to use Conditional Random Fields ( CRFs ) (Lafferty et al., 2001), which is a discriminative probabilistic graphical model, to make predictions for a sequence of constituents surrounding the pair of interest.Page 3, “Bottom-up tree-building”
- chain CRFs to model the structure and the relation separately.Page 4, “Bottom-up tree-building”
- In contrast, for linear-chain CRFs , efficient algorithms and implementations for exact inference exist.Page 4, “Bottom-up tree-building”
- is almost identical to their counterparts of the bottom-up tree-building, except that the linear-chain CRFs in post-editing includes additional features to represent information from constituents on higher levels (to be introduced in Section 7).Page 6, “Post-editing”
- For local models, our structure models are trained using MALLET (McCallum, 2002) to include constraints over transitions between adjacent labels, and our relation models are trained using CRFSuite (Okazaki, 2007), which is a fast implementation of linear-chain CRFs .Page 8, “Experiments”

See all papers in *Proc. ACL 2014* that mention CRFs.

See all papers in *Proc. ACL* that mention CRFs.

Back to top.

Appears in 8 sentences as: joint model (4) joint modeling (2) jointly model (1) jointly modeled (2)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- 2.2 Joty et al.’s joint modelPage 2, “Related work”
- Second, they jointly modeled the structure and the relation for a given pair of discourse units.Page 2, “Related work”
- The strength of J oty et al.’s model is their joint modeling of the structure and the relation, such that information from each aspect can interact with the other.Page 2, “Related work”
- The inefficiency lies in both their DCRF—based joint model , on which inference is usually slow, and their CKY-like parsing algorithm, whose issue is more prominent.Page 2, “Related work”
- However, the major distinction between our models and theirs is that we do not jointly model the structure and the relation; rather, we use two linear-Page 3, “Bottom-up tree-building”
- Although joint modeling has shown to be effective in various NLP and computer Vision applications (Sutton et al., 2007; Yang et al., 2009; Wojek and Schiele, 2008), our choice of using two separate models is for the following reasons:Page 4, “Bottom-up tree-building”
- Then, in the tree-building process, we will have to deal with the situations where the joint model yields conflicting predictions: it is possible that the model predicts Sj = 1 and RJ- 2 NO-REL, or Vice versa, and we will have to decide which node to trust (and thus in some sense, the structure and the relation is no longer jointly modeled ).Page 4, “Bottom-up tree-building”
- Secondly, as a joint model , it is mandatory to use a dynamic CRF, for which exact inference is usually intractable or slow.Page 4, “Bottom-up tree-building”

See all papers in *Proc. ACL 2014* that mention joint model.

See all papers in *Proc. ACL* that mention joint model.

Back to top.

Appears in 8 sentences as: parsing algorithm (7) parsing algorithms (1)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- However, as an optimal discourse parser, Joty et al.’s model is highly inefficient in practice, with respect to both their DCRF-based local classifiers, and their CKY-like bottom-up parsing algorithm .Page 1, “Introduction”
- CKY parsing is a bottom-up parsing algorithm which searches all possible parsing paths by dynamic programming.Page 1, “Introduction”
- As a result of successfully avoiding the expensive non-greedy parsing algorithms , our discourse parser is very efficient in practice.Page 2, “Introduction”
- Their model assigns a probability to each possible constituent, and a CKY-like parsing algorithm finds the globally optimal discourse tree, given the computed probabilities.Page 2, “Related work”
- The inefficiency lies in both their DCRF—based joint model, on which inference is usually slow, and their CKY-like parsing algorithm , whose issue is more prominent.Page 2, “Related work”
- Therefore, our model incorporates the strengths of both HILDA and Joty et al.’s model, i.e., the efficiency of a greedy parsing algorithm , and the ability to incorporate sequential information with CRFs.Page 3, “Bottom-up tree-building”
- gSVMFH in the second row is our implementation of HILDA’s greedy parsing algorithm using Feng and Hirst (2012)’s enhanced feature set.Page 8, “Results and Discussion”
- First, as shown in Table 2, the average number of sentences in a document is 26.11, which is already too large for optimal parsing models, e.g., the CKY—like parsing algorithm in jCRF, let alone the fact that the largest document contains several hundred of EDUs and sentences.Page 9, “Results and Discussion”

See all papers in *Proc. ACL 2014* that mention parsing algorithm.

See all papers in *Proc. ACL* that mention parsing algorithm.

Back to top.

Appears in 7 sentences as: sentence-level (7)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- First, they decomposed the problem of text-level discourse parsing into two stages: intra-sentential parsing to produce a discourse tree for each sentence, followed by multi-sentential parsing to combine the sentence-level discourse trees and produce the text-level discourse tree.Page 2, “Related work”
- (2013), we perform a sentence-level parsing for each sentence first, followed by a text-level parsing to generate a full discourse tree for the whole document.Page 3, “Overall work flow”
- Each sentence 5,, after being segmented into EDUs (not shown in the figure), goes through an intra-sentential bottom-up tree-building model Minna, to form a sentence-level discourse tree Tgi, with the EDUs as leaf nodes.Page 3, “Overall work flow”
- We then combine all sentence-level discourse tree TS}: ’s using our multi-sentential bottom-up tree-building model Mmum to generate the text-level discourse tree TD.Page 3, “Overall work flow”
- Similar to sentence-level parsing, we also post-edit TD using Pmum to produce the final discourse tree T5 .Page 3, “Overall work flow”
- In particular, starting from the constituents on the bottom level (EDUs for intra-sentential parsing and sentence-level discourse trees for multi-sentential parsing), at each step of the tree-building, we greedily merge a pair of adjacent discourse constituents such that the merged constituent has the highest probability as predicted by our structure model.Page 3, “Bottom-up tree-building”
- The total time to generate sentence-level discourse trees for n sentences is ZZ=10(m%).Page 7, “Linear time complexity”

See all papers in *Proc. ACL 2014* that mention sentence-level.

See all papers in *Proc. ACL* that mention sentence-level.

Back to top.

Appears in 5 sentences as: Viterbi (5)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- To enhance the accuracy of the pipeline, we add additional constraints in the Viterbi decoding of the first CRF.Page 1, “Abstract”
- Specifically, in the Viterbi decoding of the first CRF, we include additional constraints elicited from common sense, to make more effective local decisions.Page 2, “Introduction”
- Therefore, to improve its accuracy, we enforce additional commonsense constraints in its Viterbi decoding.Page 4, “Bottom-up tree-building”
- Similar to Mfgfi‘cf’, for , we also include additional constraints in the Viterbi decoding, by disallowing transitions between two ones, and disallowing the sequence to be all zeros if it contains all the remaining constituents in the document.Page 5, “Bottom-up tree-building”
- Our approach was to adopt a greedy bottom-up tree-building, with two linear-chain CRFs as local probabilistic models, and enforce reasonable constraints in the first CRF’s Viterbi decoding.Page 9, “Conclusions”

See all papers in *Proc. ACL 2014* that mention Viterbi.

See all papers in *Proc. ACL* that mention Viterbi.

Back to top.

Appears in 3 sentences as: subtrees (3)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- Rather, each relation node R j attempts to model the relation of one single constituent U], by taking U j’s left and right subtrees U j; and U LR as its first-layer nodes; if U j is a single EDU, then the first-layer node of R J- is simply U j, and R j is a special relation symbol LEAF3.Page 5, “Bottom-up tree-building”
- Substructure features: The root node of the left and right discourse subtrees of each unit.Page 7, “Features”
- In future work, we wish to further explore the idea of post-editing, since currently we use only the depth of the subtrees as upper-level information.Page 9, “Conclusions”

See all papers in *Proc. ACL 2014* that mention subtrees.

See all papers in *Proc. ACL* that mention subtrees.

Back to top.

Appears in 3 sentences as: SVM (4)

In *A Linear-Time Bottom-Up Discourse Parser with Constraints and Post-Editing*

- In particular, starting from EDUs, at each step of the tree-building, a binary SVM classifier is first applied to determine which pair of adjacent discourse constituents should be merged to form a larger span, and another multi-class SVM classifier is then applied to assign the type of discourse relation that holds between the chosen pair.Page 2, “Related work”
- Also, the employment of SVM classifiers allows the incorporation of rich features for better data representation (Feng and Hirst, 2012).Page 2, “Related work”
- However, HILDA’s approach also has obvious weakness: the greedy algorithm may lead to poor performance due to local optima, and more importantly, the SVM classifiers are not well-suited for solving structural problems due to the difficulty of taking context into account.Page 2, “Related work”

See all papers in *Proc. ACL 2014* that mention SVM.

See all papers in *Proc. ACL* that mention SVM.

Back to top.