Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
Yashar Mehdad, Matteo Negri and Marcello Federico

Article Structure

Abstract

We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic.

Introduction

Given two documents about the same topic written in different languages (e.g., Wiki pages), content synchronization deals with the problem of automatically detecting and resolving differences in the information they provide, in order to produce aligned, mutually enriched versions.

CLTE-based content synchronization

CLTE has been proposed by Mehdad et al. (2010) as an extension of textual entailment which consists of deciding, given a text T and a hypothesis H in different languages, if the meaning of H can be inferred from the meaning of T. The adoption of entailment-based techniques to address content synchronization looks promising, as several issues inherent to this task can be formalized as entailment-related problems.
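
As a rough illustration of how two directional CLTE judgments could drive a synchronization decision (the situations named in the abstract: semantic equivalence, or novel/more informative content on one side), here is a minimal Python sketch; the entails() stub and the decision labels are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch (not the paper's actual system): mapping two
# directional cross-lingual entailment judgments onto content
# synchronization decisions.

def entails(text: str, hypothesis: str) -> bool:
    """Hypothetical CLTE predicate: True if the meaning of `hypothesis`
    (in one language) can be inferred from `text` (in another language).
    A real system would use the lexical, syntactic and semantic features
    described in the paper; here it is left as a stub."""
    raise NotImplementedError

def synchronization_decision(t: str, h: str) -> str:
    t_entails_h = entails(t, h)
    h_entails_t = entails(h, t)
    if t_entails_h and h_entails_t:
        return "semantically equivalent"   # nothing to synchronize
    if t_entails_h:
        return "T is more informative"     # migrate content from T to H
    if h_entails_t:
        return "H is more informative"     # migrate content from H to T
    return "no entailment"                 # independent or conflicting content
```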

Beyond lexical CLTE

In order to enrich the feature space beyond pure lexical match through phrase table entries, our model builds on two additional feature sets, derived from i) semantic phrase tables and ii) dependency relations.

Experiments and results

4.1 Content synchronization scenario

Conclusion

We addressed the identification of semantic equivalence and information disparity in two documents about the same topic, written in different languages.

Topics

phrase table

Appears in 15 sentences as: Phrase Table (1) phrase table (7) phrase tables (7)
In Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
  1. The CLTE methods proposed so far adopt either a “pivoting approach” based on the translation of the two input texts into the same language (Mehdad et al., 2010), or an “integrated solution” that exploits bilingual phrase tables to capture lexical relations and contextual information (Mehdad et al., 2011).
    Page 1, “Introduction”
  2. CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way a semantic judgement about entailment is made exclusively on the basis of lexical evidence.
    Page 2, “CLTE-based content synchronization” (this phrase matching is sketched after the list)
  3. In order to enrich the feature space beyond pure lexical match through phrase table entries, our model
    Page 2, “Beyond lexical CLTE”
  4. builds on two additional feature sets, derived from i) semantic phrase tables, and ii) dependency relations.
    Page 2, “Beyond lexical CLTE”
  5. Semantic Phrase Table (SPT) matching represents a novel way to leverage the integration of semantics and MT-derived techniques.
    Page 2, “Beyond lexical CLTE”
  6. SPT matching extends CLTE methods based on pure lexical match by means of “generalized” phrase tables annotated with shallow semantic labels.
    Page 2, “Beyond lexical CLTE”
  7. SPTs, with entries in the form “[LABEL] word1 ... wordn [LABEL]”, are used as a recall-oriented complement to the phrase tables used in MT.
    Page 2, “Beyond lexical CLTE”
  8. A motivation for this augmentation is that semantic tags make it possible to match tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction.
    Page 2, “Beyond lexical CLTE”
  9. Like lexical phrase tables, SPTs are extracted from parallel corpora.
    Page 2, “Beyond lexical CLTE”
  10. Finally, we extract the semantic phrase table from the augmented aligned corpora using the Moses toolkit (Koehn et al., 2007).
    Page 2, “Beyond lexical CLTE”
  11. To build the English-German phrase tables we combined the Europarl, News Commentary and “de-news” parallel corpora.
    Page 3, “Experiments and results”
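
The lexical phrase matching described in sentence 2 above (counting the word sequences in H that can be mapped, via phrase table entries, onto word sequences in T) can be sketched roughly as follows; the dictionary-style phrase table and the coverage score are simplifying assumptions that ignore the translation probabilities a Moses phrase table actually carries. SPT matching (sentences 5-10) applies the same mechanism over text in which named entities have been replaced by semantic labels.

```python
# Minimal sketch of phrase-table-based lexical matching: estimate how much of
# the hypothesis H can be mapped, via bilingual phrase-table entries, onto
# word sequences occurring in the text T. Simplifying assumption: the phrase
# table is a dict {h_phrase: {t_phrase, ...}} without translation scores.

def ngrams(tokens, max_n=4):
    for n in range(max_n, 0, -1):               # prefer longer phrases
        for i in range(len(tokens) - n + 1):
            yield i, n, " ".join(tokens[i:i + n])

def phrase_table_coverage(t_tokens, h_tokens, phrase_table):
    t_text = " ".join(t_tokens)
    covered = [False] * len(h_tokens)
    for i, n, h_phrase in ngrams(h_tokens):
        if any(covered[i:i + n]):
            continue                             # keep phrase matches disjoint
        for t_phrase in phrase_table.get(h_phrase, ()):
            if t_phrase in t_text:               # source side of the entry occurs in T
                for j in range(i, i + n):
                    covered[j] = True
                break
    return sum(covered) / len(h_tokens) if h_tokens else 0.0

# Toy English-German example:
pt = {"das haus": {"the house"}, "ist": {"is"}}
print(phrase_table_coverage("the house is red".split(),
                            "das haus ist rot".split(), pt))  # 0.75
```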

cross-lingual

Appears in 9 sentences as: Cross-lingual (1) cross-lingual (8) “Cross-Lingual (1)
In Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
  1. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.
    Page 1, “Abstract”
  2. In this paper we frame this problem as an application-oriented, cross-lingual variant of the Textual Entailment (TE) recognition task (Dagan and Glickman, 2004).
    Page 1, “Introduction”
  3. (a) Experiments with multidirectional cross-lingual textual entailment.
    Page 1, “Introduction”
  4. So far, cross-lingual textual entailment (CLTE) has only been applied to: i) available TE datasets (unidirectional relations between monolingual pairs) transformed into their cross-lingual counterpart by translating the hypotheses into other languages (Negri and Mehdad, 2010), and ii) machine translation (MT) evaluation datasets (Mehdad et al., 2012).
    Page 1, “Introduction”
  5. Recently, a new dataset including “Unknown” pairs has been used in the “Cross-Lingual Textual Entailment for Content Synchronization” task at SemEval-2012 (Negri et al., 2012).
    Page 3, “Experiments and results”
  6. (3-way) demonstrates the effectiveness of our approach to capture meaning equivalence and information disparity in cross-lingual texts.
    Page 4, “Experiments and results”
  7. Cross-lingual models also significantly outperform pivoting methods.
    Page 4, “Experiments and results”
  8. This suggests that the noise introduced by incorrect translations makes the pivoting approach less attractive in comparison with the more robust cross-lingual models.
    Page 4, “Experiments and results”
  9. Our results in different cross-lingual settings prove the feasibility of the approach, with significant state-of-the-art improvements also on RTE-derived data.
    Page 4, “Conclusion”

parallel corpora

Appears in 8 sentences as: parallel corpora (8)
In Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
  1. CLTE has been previously modeled as a phrase matching problem that exploits dictionaries and phrase tables extracted from bilingual parallel corpora to determine the number of word sequences in H that can be mapped to word sequences in T. In this way a semantic judgement about entailment is made exclusively on the basis of lexical evidence.
    Page 2, “CLTE-based content synchronization”
  2. A motivation for this augmentation is that semantic tags make it possible to match tokens that do not occur in the original bilingual parallel corpora used for phrase table extraction.
    Page 2, “Beyond lexical CLTE”
  3. Like lexical phrase tables, SPTs are extracted from parallel corpora.
    Page 2, “Beyond lexical CLTE”
  4. As a first step we annotate the parallel corpora with named-entity taggers for the source and target languages, replacing named entities with general semantic labels chosen from a coarse-grained taxonomy (person, location, organization, date and numeric expression).
    Page 2, “Beyond lexical CLTE” (this annotation step is sketched after the list)
  5. For the matching phase, we first annotate T and H in the same way we labeled our parallel corpora.
    Page 2, “Beyond lexical CLTE”
  6. To build the English-German phrase tables we combined the Europarl, News Commentary and “de-news” parallel corpora.
    Page 3, “Experiments and results”
  7. The dictionary created during the alignment of the parallel corpora provided the lexical knowledge to perform matches when the connected words are different, but semantically equivalent in the two languages.
    Page 3, “Experiments and results”
  8. In order to build the English-Spanish lexical phrase table (PT), we used the Europarl, News Commentary and United Nations parallel corpora.
    Page 4, “Experiments and results”
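
As a sketch of the annotation step described in sentences 4 and 5 above, the snippet below replaces named-entity spans with coarse semantic labels before phrase extraction or matching; the tag_entities() helper stands in for whatever source- and target-language NE taggers are actually used, and is an assumption of this sketch.

```python
# Sketch of the preprocessing behind Semantic Phrase Tables (SPT): named
# entities are replaced by coarse-grained labels (person, location,
# organization, date, numeric expression) before phrase extraction with
# Moses, and the same annotation is applied to T and H at matching time.

COARSE_LABELS = {"PERSON", "LOCATION", "ORGANIZATION", "DATE", "NUMERIC"}

def tag_entities(tokens):
    """Assumed stand-in for a language-specific NE tagger: returns a list of
    (start, end, label) spans with labels drawn from COARSE_LABELS."""
    raise NotImplementedError

def generalize(tokens):
    """Replace each named-entity span with its semantic label token."""
    spans = sorted(tag_entities(tokens), key=lambda s: s[0], reverse=True)
    tokens = list(tokens)
    for start, end, label in spans:
        assert label in COARSE_LABELS
        tokens[start:end] = [f"[{label}]"]   # e.g. "Rome" -> "[LOCATION]"
    return tokens

# e.g. "John Smith visited Rome in 2010" ->
#      "[PERSON] visited [LOCATION] in [DATE]"
```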

Dependency relations

Appears in 6 sentences as: Dependency Relation (1) dependency relation (1) Dependency relations (2) dependency relations (2)
In Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
  1. builds on two additional feature sets, derived from i) semantic phrase tables, and ii) dependency relations.
    Page 2, “Beyond lexical CLTE”
  2. Dependency Relation (DR) matching aims to increase CLTE precision.
    Page 2, “Beyond lexical CLTE”
  3. We define a dependency relation as a triple that connects pairs of words through a grammatical relation.
    Page 3, “Beyond lexical CLTE”
  4. DR matching captures similarities between dependency relations, combining the syntactic and lexical levels.
    Page 3, “Beyond lexical CLTE” (a sketch of this matching follows the list)
  5. Dependency relations (DR) have been extracted by running the Stanford parser (Rafferty and Manning, 2008; De Marneffe et al., 2006).
    Page 3, “Experiments and results”
  6. Dependency relations (DR) have been extracted by parsing English texts and Spanish hypotheses with DepPattern (Gamallo and Gonzalez, 2011).
    Page 4, “Experiments and results”
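
The dependency-relation matching described in sentences 3 and 4 above can be sketched as follows; representing each relation as a (head, relation, dependent) triple and backing off to the bilingual dictionary for word matches (sentence 7 under “parallel corpora”) are simplifying assumptions of this sketch.

```python
# Sketch of Dependency Relation (DR) matching: each relation is a triple
# (head, relation_label, dependent); a relation in H matches one in T when
# the grammatical relation is the same and both connected words match,
# either identically or through a bilingual dictionary.

def words_match(w_t, w_h, dictionary):
    return w_t == w_h or w_h in dictionary.get(w_t, set())

def dr_match_score(t_relations, h_relations, dictionary):
    matched = 0
    for h_head, h_rel, h_dep in h_relations:
        for t_head, t_rel, t_dep in t_relations:
            if (t_rel == h_rel
                    and words_match(t_head, h_head, dictionary)
                    and words_match(t_dep, h_dep, dictionary)):
                matched += 1
                break
    return matched / len(h_relations) if h_relations else 0.0

# Toy English-Spanish example:
t_rels = [("house", "amod", "red")]
h_rels = [("casa", "amod", "roja")]
bi_dict = {"house": {"casa"}, "red": {"roja", "rojo"}}
print(dr_match_score(t_rels, h_rels, bi_dict))   # 1.0
```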

feature sets

Appears in 3 sentences as: feature set (1) feature sets (2)
In Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
  1. builds on two additional feature sets, derived from i) semantic phrase tables, and ii) dependency relations.
    Page 2, “Beyond lexical CLTE”
  2. (a) In both settings, all the feature sets used outperform the approaches taken as terms of comparison.
    Page 3, “Experiments and results”
  3. As shown in Table 1, the combined feature set (PT+SPT+DR) significantly outperforms the lexical model (64.5% vs. 62.6%), while SPT and DR features separately added to PT (PT+SPT and PT+DR) lead to marginal improvements over the results achieved by the PT model alone (about 1%).
    Page 4, “Experiments and results” (a sketch of this feature combination follows the list)
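
A minimal sketch of how the PT, SPT and DR scores could be combined into one feature vector for a supervised classifier; the directional score names and the choice of scikit-learn's SVC are illustrative assumptions, since the excerpt does not specify the learning algorithm used in the paper.

```python
# Sketch: concatenating the PT, SPT and DR matching scores (computed in both
# entailment directions) into one feature vector for a supervised classifier.
# scikit-learn's SVC is only an illustrative choice of learner.
from sklearn.svm import SVC

def feature_vector(scores):
    """`scores` is a dict of directional matching scores, e.g. produced by
    the phrase-table, SPT and dependency-relation sketches above."""
    return [
        scores["pt_t2h"], scores["pt_h2t"],      # lexical phrase table
        scores["spt_t2h"], scores["spt_h2t"],    # semantic phrase table
        scores["dr_t2h"], scores["dr_h2t"],      # dependency relations
    ]

def train_classifier(training_scores, labels):
    X = [feature_vector(s) for s in training_scores]
    return SVC(kernel="linear").fit(X, labels)   # labels: e.g. 3-way classes
```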
