SystemT: An Algebraic Approach to Declarative Information Extraction
Chiticariu, Laura and Krishnamurthy, Rajasekar and Li, Yunyao and Raghavan, Sriram and Reiss, Frederick and Vaithyanathan, Shivakumar

Article Structure

Abstract

As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important.

Introduction

In recent years, enterprises have seen the emergence of important text analytics applications like compliance and data redaction.

Grammar-based Systems and CPSL

A cascading grammar consists of a sequence of phases, each of which consists of one or more rules.

SystemT

SystemT is a declarative IE system based on an algebraic framework.

Grammar vs. Algebra

Having described both the traditional cascading grammar approach and the declarative approach

Experimental Evaluation

In this section we present an extensive comparison study between SystemT and implementations of expanded CPSL grammar in terms of quality, runtime performance and resource requirements.

Conclusion

In this paper, we described SystemT, a declarative IE system based on an algebraic framework.

Topics

regular expression

Appears in 5 sentences as: Regular Expression (2) regular expression (3) regular expressions (2)
In SystemT: An Algebraic Approach to Declarative Information Extraction
  1. Figure 3: Regular Expression Extraction Operator
    Page 3, “Grammar-based Systems and CPSL”
  2. Figure 3 illustrates the regular expression extraction operator in the algebra, which performs character-level regular expression matching.
    Page 3, “SystemT”
  3. The Extract operator (E) performs character-level operations such as regular expression and dictionary matching over text, creating a tuple for each match.
    Page 3, “SystemT”
  4. Character-Level Regular Expression CPSL cannot specify character-level regular expressions that span multiple tokens.
    Page 6, “Grammar vs. Algebra”
  5. (b) Example operations supported in AQL that cannot be expressed in expanded code-free CPSL grammars include (i) character-level regular expressions spanning multiple tokens, (ii) counting the number of annotations occurring within a given bounded window and (iii) deleting annotations if they overlap with other annotations starting later in the document.
    Page 7, “Grammar vs. Algebra”
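
The distinction drawn in the sentences above, between character-level matching and matching constrained to token boundaries, can be illustrated with a small sketch. This is not AQL and not SystemT's API: `Span`, `extract_regex`, `extract_regex_per_token`, the sample document and the phone-number pattern are all hypothetical names and data chosen for illustration. The per-token function is only a simplified stand-in for a matcher that sees one token at a time, not a model of CPSL itself.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """One output tuple: character offsets plus the matched text (hypothetical schema)."""
    begin: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


def extract_regex(pattern: str, doc: str) -> List[Span]:
    """Character-level extraction: one tuple per regex match anywhere in the text."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]


def extract_regex_per_token(pattern: str, doc: str) -> List[Span]:
    """Contrast case: the pattern must match a single whitespace-delimited token."""
    spans = []
    for tok in re.finditer(r"\S+", doc):
        if re.fullmatch(pattern, tok.group()):
            spans.append(Span(tok.start(), tok.end(), tok.group()))
    return spans


# Illustrative input: the first phone number spans three tokens, the second only one.
DOC = "Call 555 123 4567 or 555-987-6543 today."
PHONE = r"\d{3}[ -]\d{3}[ -]\d{4}"

char_level = extract_regex(PHONE, DOC)
token_level = extract_regex_per_token(PHONE, DOC)

for span in char_level:
    print(span)
```

Under these assumptions, the character-level function finds both phone numbers, while the per-token function finds only the one written without spaces, because no single token contains the space-separated number.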

See all papers in Proc. ACL 2010 that mention regular expression.


NER

Appears in 4 sentences as: NER (4)
In SystemT: An Algebraic Approach to Declarative Information Extraction
  1. NER: named-entity recognition for Person, Organization, Location, Address, PhoneNumber, EmailAddress, URL and DateTime.
    Page 7, “Experimental Evaluation”
  2. We chose NER primarily because named-entity recognition is a well-studied problem and standard datasets are available for evaluation.
    Page 7, “Experimental Evaluation”
  3. To the best of our knowledge, ANNIE (Cunningham et al., 2002) is the only publicly available NER library implemented in a grammar-based system (JAPE in GATE).
    Page 7, “Experimental Evaluation”
  4. For this purpose, we have configured both ANNIE and T-NE to identify only the same eight types of entities listed for the NER task.
    Page 8, “Experimental Evaluation”

rule-based

Appears in 4 sentences as: rule-based (4)
In SystemT: An Algebraic Approach to Declarative Information Extraction
  1. As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important.
    Page 1, “Abstract”
  2. In this paper, we describe SystemT, a rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars.
    Page 1, “Abstract”
  3. In recent years, these systemic requirements have led to renewed interest in rule-based IE systems (Doan et al., 2008; SAP, 2010; IBM, 2010; SAS, 2010).
    Page 1, “Introduction”
  4. Until recently, rule-based IE systems (Cunningham et al., 2000; Boguraev, 2003; Drozdzynski et al., 2004) were predominantly based on the cascading grammar formalism exemplified by the
    Page 1, “Introduction”
