A Syntactic and Lexical-Based Discourse Segmenter
Tofiloski, Milan and Brooke, Julian and Taboada, Maite

Article Structure

Abstract

We present a syntactic and lexically based discourse segmenter (SLSeg) that is designed to avoid the common problem of over-segmenting text.

Introduction*

Discourse segmentation is the process of decomposing discourse into elementary discourse units (EDUs), which may be simple sentences or clauses in a complex sentence, and from which discourse trees are constructed.

Related Work

Soricut and Marcu (2003) construct a statistical discourse segmenter as part of their sentence-level discourse parser (SPADE), the only implementation available for our comparison.

Principles For Discourse Segmentation

Our primary concern is to capture interesting discourse relations, rather than all possible relations, i.e., capturing more specific relations such as Condition, Evidence or Purpose, rather than more general and less informative relations such as Elaboration or Joint, as defined in Rhetorical Structure Theory (Mann and Thompson, 1988).

Implementation

The core of the implementation involves the construction of 12 syntactically-based segmentation rules, along with a few lexical rules involving a list of stop phrases, discourse cue phrases and word-level parts of speech (POS) tags.
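
To make the description above more concrete, here is a minimal, hypothetical sketch of a lexically driven segmentation rule of the general kind described: the cue-phrase and stop-phrase lists, the function names and the POS heuristic are invented for illustration, and this is not a reproduction of the paper's actual 12 syntactic rules, which operate over full parse trees.

  # Hypothetical resources; SLSeg's real lists are not reproduced here.
  CUE_PHRASES = {"because", "although", "while", "if", "unless"}
  STOP_PHRASES = {"if necessary", "if possible"}  # cue uses that should NOT trigger a boundary

  def begins_stop_phrase(tagged_tokens, i):
      """True if the words starting at position i begin one of the stop phrases."""
      tail = " ".join(w.lower() for w, _ in tagged_tokens[i:])
      return any(tail.startswith(p) for p in STOP_PHRASES)

  def segment(tagged_tokens):
      """Insert a boundary before a discourse cue phrase that introduces a
      tensed clause. tagged_tokens is a list of (word, POS) pairs using
      Penn Treebank POS tags; requiring a tensed verb after the cue is a
      crude stand-in for the checks a syntactic parse would license."""
      boundaries = []
      for i, (word, _) in enumerate(tagged_tokens):
          if (word.lower() in CUE_PHRASES and i > 0
                  and not begins_stop_phrase(tagged_tokens, i)
                  and any(p in {"VBD", "VBP", "VBZ", "MD"}
                          for _, p in tagged_tokens[i + 1:])):
              boundaries.append(i)
      segments, start = [], 0
      for b in boundaries:
          segments.append(tagged_tokens[start:b])
          start = b
      segments.append(tagged_tokens[start:])
      return [" ".join(w for w, _ in seg) for seg in segments if seg]

  if __name__ == "__main__":
      sent = [("They", "PRP"), ("stayed", "VBD"), ("home", "NN"),
              ("because", "IN"), ("it", "PRP"), ("rained", "VBD")]
      print(segment(sent))   # ['They stayed home', 'because it rained']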

Data and Evaluation

5.1 Data

Results

Results are shown in Table 1. The combined informal and formal texts show SLSeg (using Charniak’s parser) with high precision; however, our overall recall was lower than both SPADE and the baseline.

Discussion

We have shown that SLSeg, a conservative rule-based segmenter that inserts fewer discourse boundaries, leads to higher precision compared to a statistical segmenter.

Topics

gold standard

Appears in 6 sentences as: gold standard (6)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. The gold standard test set consists of 9 human-annotated texts.
    Page 2, “Data and Evaluation”
  2. The texts were segmented by one of the authors, following guidelines established from the project’s beginning, and this segmentation was used as the gold standard.
    Page 2, “Data and Evaluation”
  3. Using F-score, average agreement of the two annotators against the gold standard was also high at .86.
    Page 2, “Data and Evaluation”
  4. use the coauthor’s segmentations as the gold standard.
    Page 3, “Data and Evaluation”
  5. Precision is the number of boundaries in agreement with the gold standard.
    Page 3, “Data and Evaluation”
  6. Recall is the total number of boundaries correct in the system’s output divided by the number of total boundaries in the gold standard.
    Page 3, “Data and Evaluation”
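
Items 5 and 6 above describe boundary-level scoring against the gold standard. The following is a small worked sketch of that computation under the usual definitions (precision over the boundaries the system inserted, recall over the gold boundaries); the boundary positions in the example are invented for illustration.

  def boundary_scores(system, gold):
      """Precision, recall and F-score over sets of boundary positions."""
      system, gold = set(system), set(gold)
      agree = len(system & gold)                     # boundaries in agreement
      precision = agree / len(system) if system else 0.0
      recall = agree / len(gold) if gold else 0.0
      f = (2 * precision * recall / (precision + recall)
           if precision + recall else 0.0)
      return precision, recall, f

  if __name__ == "__main__":
      # Hypothetical token offsets of boundaries in one text.
      print(boundary_scores(system=[12, 40, 77], gold=[12, 40, 55, 90]))
      # precision = 2/3, recall = 2/4, F = 0.571...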


discourse parser

Appears in 5 sentences as: discourse parser (3) discourse parsing (2)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. Segmentation is the first step in a discourse parser, a system that constructs discourse trees from elementary discourse units.
    Page 1, “Abstract”
  2. Since segmentation is the first stage of discourse parsing , quality discourse segments are critical to building quality discourse representations (Soricut and Marcu, 2003).
    Page 1, “Introduction*”
  3. Most parsers can break down a sentence into constituent clauses, approaching the type of output that we need as input to a discourse parser.
    Page 1, “Introduction*”
  4. Soricut and Marcu (2003) construct a statistical discourse segmenter as part of their sentence-level discourse parser (SPADE), the only implementation available for our comparison.
    Page 1, “Related Work”
  5. Besides its use in automatic discourse parsing, the system could
    Page 4, “Discussion”


Treebank

Appears in 5 sentences as: Treebank (6)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. SPADE is trained on the RST Discourse Treebank (Carlson et al., 2002).
    Page 1, “Related Work”
  2. (2004) construct a rule-based segmenter, employing manually annotated parses from the Penn Treebank.
    Page 1, “Related Work”
  3. Many of our differences with Carlson and Marcu (2001), who defined EDUs for the RST Discourse Treebank (Carlson et al., 2002), are due to the fact that we adhere closer to the original RST proposals (Mann and Thompson, 1988), which defined as ‘spans’ adjunct clauses, rather than complement (subject and object) clauses.
    Page 2, “Principles For Discourse Segmentation”
  4. The 9 documents include 3 texts from the RST literature, 3 online product reviews from Epinions.com, and 3 Wall Street Journal articles taken from the Penn Treebank.
    Page 2, “Data and Evaluation”
  5. High F-score in the Treebank data can be attributed to the parsers having been trained on Treebank.
    Page 4, “Results”


EDUs

Appears in 3 sentences as: EDUs (3)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. Discourse segmentation is the process of decomposing discourse into elementary discourse units (EDUs), which may be simple sentences or clauses in a complex sentence, and from which discourse trees are constructed.
    Page 1, “Introduction*”
  2. Many of our differences with Carlson and Marcu (2001), who defined EDUs for the RST Discourse Treebank (Carlson et al., 2002), are due to the fact that we adhere closer to the original RST proposals (Mann and Thompson, 1988), which defined as ‘spans’ adjunct clauses, rather than complement (subject and object) clauses.
    Page 2, “Principles For Discourse Segmentation”
  3. In particular, we propose that complements of attributive and cognitive verbs (He said (that)..., I think (that)...) are not EDUs.
    Page 2, “Principles For Discourse Segmentation”
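
Item 3 above is the principle that complements of attributive and cognitive verbs do not become separate EDUs. A hedged sketch of how such a check could be expressed follows; the verb list, function name and toy dependency representation are all hypothetical, not the paper's implementation.

  # Hypothetical list of attributive/cognitive verbs.
  ATTRIBUTION_VERBS = {"say", "said", "says", "think", "thinks", "thought",
                       "believe", "believes", "claim", "claims"}

  def is_attribution_complement(clause_head, tokens, governors):
      """Return True when the clause headed at index clause_head is the
      complement of an attributive or cognitive verb, in which case no
      segment boundary is inserted. governors[i] is the index of token i's
      head (-1 for the root), a minimal stand-in for a dependency parse."""
      gov = governors[clause_head]
      return gov != -1 and tokens[gov].lower() in ATTRIBUTION_VERBS

  if __name__ == "__main__":
      # "He said that prices would fall": the clause headed by "fall"
      # is the complement of "said", so it is not split off as an EDU.
      tokens = ["He", "said", "that", "prices", "would", "fall"]
      governors = [1, -1, 5, 5, 5, 1]
      print(is_attribution_complement(5, tokens, governors))   # True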


fine-grained

Appears in 3 sentences as: fine-grained (3)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. The segments produced by a parser, however, are too fine-grained for discourse purposes, breaking off complement and other clauses that are not in a discourse relation to any other segment.
    Page 1, “Introduction*”
  2. Since Sundance clauses are also too fine-grained for our purposes, we use a few simple rules to collapse clauses that are unlikely to meet our definition of EDU.
    Page 3, “Data and Evaluation”
  3. A clearer example that illustrates the pitfalls of fine-grained discourse segmenting is shown in the following output from SPADE:
    Page 3, “Data and Evaluation”


segmentations

Appears in 3 sentences as: segmentations (2) segmenters (1)
In A Syntactic and Lexical-Based Discourse Segmenter
  1. use the coauthor’s segmentations as the gold standard.
    Page 3, “Data and Evaluation”
  2. Additionally, we compared SLSeg and SPADE to the original RST segmentations of the three RST texts taken from RST literature.
    Page 4, “Results”
  3. Also to be investigated is a quantitative study of the effects of high-precision/low-recall vs. low-precision/high-recall segmenters on the construction of discourse trees.
    Page 4, “Discussion”
