Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
Li, Zhifei and Yarowsky, David

Article Structure

Abstract

Chinese abbreviations are widely used in modern Chinese texts.

Introduction

The modern Chinese language is a highly abbreviated one due to the mixed use of ancient single-character words with modern multi-character words and compound words.

Background: Chinese Abbreviations

In general, Chinese abbreviations are formed based on three major methods: reduction, elimination, and generalization (Lee, 2005; Yin, 1999).

Unsupervised Translation Induction for Chinese Abbreviations

In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora.

Experimental Results

4.1 Corpora

Related Work

Though automatically extracting the relations between full-form Chinese phrases and their abbreviations is an interesting and important task for many natural language processing applications (e.g., machine translation, question answering, information retrieval, and so on), not much work is available in the literature.

Conclusions

In this paper, we present a novel method that automatically extracts relations between full-form phrases and their abbreviations from monolingual corpora, and induces translation entries for these abbreviations by using their full-form as a bridge.

Topics

baseline system

Appears in 21 sentences as: Baseline System (1) baseline system (21)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets.
    Page 1, “Abstract”
  2. For example, if the baseline system knows that the translation for “香港总督” is “Hong Kong Governor”, and it also knows that “港督” is an abbreviation of “香港总督”, then it can translate “港督” to “Hong Kong Governor”.
    Page 2, “Introduction”
  3. We also need to make sure that the baseline system has at least one valid translation for the full-form phrase.
    Page 2, “Introduction”
  4. Moreover, our approach integrates the abbreviation translation component into the baseline system in a natural way, and thus is able to make use of the minimum-error-rate training (Och, 2003) to automatically adjust the model parameters to reflect the change of the integrated system over the baseline system.
    Page 2, “Introduction”
  5. Step-5: augment the baseline system with translation entries obtained in Step-4.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  6. Moreover, obtaining a list using a dedicated tagger does not guarantee that the baseline system knows how to translate the list.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  7. On the contrary, in our approach, since the Chinese entities are translation outputs for the English entities, it is ensured that the baseline system has translations for these Chinese entities.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  8. the baseline system.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  9. This is critical for the abbreviation translation to achieve performance gains over the baseline system, as will become clear later.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  10. It is worth pointing out that the baseline system may not be able to translate all the English entities.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  11. However, it does not mean that this English entity is the best translation that the baseline system has for the Chinese full-form phrase.
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
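
Taken together, these excerpts describe the bridging idea: once the system knows which full-form phrase an abbreviation stands for, the abbreviation can simply inherit the full-form's translations that the baseline system already has. Below is a minimal Python sketch of that induction step; the dictionary names (abbrev_to_full, full_to_english) are hypothetical and only illustrate the data flow, not the paper's actual implementation.

```python
# Minimal sketch of the "full-form as a bridge" idea (illustrative only).
#   full_to_english: full-form Chinese phrase -> list of (English translation, probability)
#                    pairs that the baseline system can already produce.
#   abbrev_to_full:  abbreviation -> its extracted full-form phrase.

def induce_abbreviation_entries(abbrev_to_full, full_to_english):
    """Give each abbreviation the translations of its full-form phrase."""
    new_entries = {}
    for abbrev, full_form in abbrev_to_full.items():
        translations = full_to_english.get(full_form)
        if not translations:
            continue  # the baseline system cannot translate the full form; skip
        new_entries[abbrev] = translations
    return new_entries

# Example: "港督" abbreviates "香港总督", which the baseline translates as "Hong Kong Governor".
entries = induce_abbreviation_entries(
    {"港督": "香港总督"},
    {"香港总督": [("Hong Kong Governor", 1.0)]},
)
print(entries)  # {'港督': [('Hong Kong Governor', 1.0)]}
```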

relation extraction

Appears in 10 sentences as: Relation Extraction (4) relation extraction (5) relation extractions (1)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. 3.3 Full-abbreviation Relation Extraction from Chinese Monolingual Corpora
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  2. 3.3.2 Full-abbreviation Relation Extraction Algorithm
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. Figure 2 presents the pseudocode of the full-abbreviation relation extraction algorithm.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  4. Figure 2: Full-abbreviation Relation Extraction
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  5. Table 4: Full-abbreviation Relation Extraction Precision
    Page 6, “Experimental Results”
  6. To further show the advantage of our relation extraction algorithm (see Section 3.3), in the third column of Table 4 we report the results on a simple baseline.
    Page 6, “Experimental Results”
  7. As shown in Table 4, the baseline performs significantly worse than our relation extraction algorithm.
    Page 7, “Experimental Results”
  8. Compared with the baseline, our relation extraction algorithm allows arbitrary abbreviation patterns as long as they satisfy the alignment constraints.
    Page 7, “Experimental Results”
  9. Note that the results here are highly biased toward our relation extraction algorithm (see Section 3.3).
    Page 7, “Experimental Results”
  10. Our method exploits the data co-occurrence phenomenon, which is very useful for relation extraction.
    Page 8, “Conclusions”
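
The paper's Figure 2 gives the actual pseudocode; as a rough illustration of the alignment constraint mentioned in sentence 8 above, the sketch below accepts a candidate abbreviation only if its characters appear, in order, within the full-form phrase. This subsequence check is a deliberately simplified stand-in for the paper's alignment model, not a reproduction of it.

```python
def satisfies_alignment_constraint(abbreviation, full_form):
    """Return True if every character of the abbreviation appears in the
    full-form phrase in the same relative order (a subsequence check)."""
    pos = 0
    for ch in abbreviation:
        pos = full_form.find(ch, pos)
        if pos == -1:
            return False
        pos += 1
    return True

print(satisfies_alignment_constraint("港督", "香港总督"))      # True
print(satisfies_alignment_constraint("冬奥会", "冬季奥运会"))  # True
print(satisfies_alignment_constraint("奥冬会", "冬季奥运会"))  # False: characters out of order
```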

parallel corpora

Appears in 9 sentences as: parallel corpora (9)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
    Page 1, “Abstract”
  2. While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries.
    Page 1, “Introduction”
  3. In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated.
    Page 1, “Introduction”
  4. To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
    Page 1, “Introduction”
  5. Even if an abbreviation has been seen in parallel corpora, it may still be worthwhile to consider its full-form phrase as an additional alternative to the abbreviation, since abbreviated words are normally semantically ambiguous, while the full-form contains more context information that helps the MT system choose the right translation for the abbreviation.
    Page 2, “Introduction”
  6. In this section, we describe an unsupervised method to induce translation entries for Chinese abbreviations, even when these abbreviations never appear in the Chinese side of the parallel corpora.
    Page 2, “Unsupervised Translation Induction for Chinese Abbreviations”
  7. Regarding the data resource used, Step-1, -2, and -3 rely on the English monolingual corpora, parallel corpora, and the Chinese monolingual corpora, respectively.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  8. This is because the entities are extracted from the English monolingual corpora, which have a much larger vocabulary than the English side of the parallel corpora.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  9. On the other hand, if the entry is already in the baseline phrase table, then we merge the entries by enforcing the translation probability as we obtain the same translation entry from two different knowledge sources (one is from parallel corpora and the other one is from the Chinese monolingual corpora).
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”

SMT system

Appears in 9 sentences as: SMT system (5) SMT systems (5)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. While the research in statistical machine translation (SMT) has made significant progress, most SMT systems (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) rely on parallel corpora to extract translation entries.
    Page 1, “Introduction”
  2. The richness and complexity of Chinese abbreviations impose challenges on SMT systems.
    Page 1, “Introduction”
  3. In particular, many Chinese abbreviations may not appear in available parallel corpora, in which case current SMT systems treat them as unknown words and leave them untranslated.
    Page 1, “Introduction”
  4. into its full-form, which the current SMT system knows how to translate.
    Page 2, “Introduction”
  5. Conceptually, the approach of translating an abbreviation by using its full-form as a bridge involves four components: identifying abbreviations, learning their full-forms, inducing their translations, and integrating the abbreviation translations into the baseline SMT system.
    Page 2, “Introduction”
  6. On the other hand, integrating an additional component into a baseline SMT system is notoriously tricky as evident in the research on integrating word sense disambiguation (WSD) into SMT systems: different ways of integration lead to conflicting conclusions on whether WSD helps MT performance (Chan et al., 2007; Carpuat and Wu, 2007).
    Page 2, “Introduction”
  7. Our approach exploits the data co-occurrence phenomena and does not require any additional annotated data except the parallel and monolingual corpora that the baseline SMT system uses.
    Page 2, “Introduction”
  8. We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
    Page 2, “Introduction”
  9. Moreover, our approach utilizes both Chinese and English monolingual data to help MT, while most SMT systems utilize only the English monolingual data to build a language model.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”

translation system

Appears in 9 sentences as: Translation System (1) translation system (7) translation systems (1)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
    Page 1, “Abstract”
  2. Our method does not require any additional annotated data other than the data that a regular translation system uses.
    Page 1, “Abstract”
  3. We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets.
    Page 1, “Abstract”
  4. Step-2: translate the list into Chinese using a baseline translation system;
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  5. Step-4 and -5 are natural ways to integrate the abbreviation translation component with the baseline translation system.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  6. However, since most statistical translation models (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) are symmetrical, it is relatively easy to train a translation system to translate from English to Chinese, except that we need to train a Chinese language model from the Chinese monolingual data.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  7. 3.5 Integration with Baseline Translation System
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  8. Our method is scalable enough to handle large amounts of monolingual data, and is essentially unsupervised as it does not require any annotated data beyond what the baseline translation system uses.
    Page 8, “Conclusions”
  9. We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets.
    Page 8, “Conclusions”

phrase table

Appears in 7 sentences as: phrase table (8) phrase tables (1)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. the baseline phrase table.
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  2. Since the obtained translation entries for abbreviations have the same format as the regular translation entries in the baseline phrase table, it is relatively easy to add them into the baseline phrase table.
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. Specifically, if a translation entry (identified by its Chinese and English strings) to be added is not in the baseline phrase table, we simply add the entry into the baseline table.
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  4. On the other hand, if the entry is already in the baseline phrase table, then we merge the entries by enforcing the translation probability as we obtain the same translation entry from two different knowledge sources (one is from parallel corpora and the other one is from the Chinese monolingual corpora).
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  5. Once we obtain the augmented phrase table, we should run the minimum-error-rate training (Och, 2003) with the augmented phrase table such that the model parameters are properly adjusted.
    Page 5, “Unsupervised Translation Induction for Chinese Abbreviations”
  6. As is clear in Table 7, it is important to rerun MERT (on MT02 only) with the augmented phrase table in order to get performance gains.
    Page 7, “Experimental Results”
  7. the MERT weights with different phrase tables.
    Page 8, “Experimental Results”
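
Sentences 2-5 above describe how the induced entries are folded into the baseline phrase table: new entries are simply added, duplicate entries are merged, and MERT is rerun afterwards so the feature weights reflect the augmented table. The sketch below illustrates the add-or-merge step; the merge rule shown (keeping the larger probability) is only an assumption for illustration, since the excerpts do not spell out the exact merging formula.

```python
# Sketch of augmenting a baseline phrase table with induced abbreviation entries.
# Entries are keyed by (chinese, english) and mapped to a translation probability.

def augment_phrase_table(baseline_table, induced_entries):
    augmented = dict(baseline_table)
    for key, prob in induced_entries.items():
        if key not in augmented:
            augmented[key] = prob  # new entry: simply add it to the table
        else:
            # same entry obtained from two knowledge sources: merge (illustrative rule)
            augmented[key] = max(augmented[key], prob)
    return augmented

baseline = {("香港总督", "Hong Kong Governor"): 0.8}
induced = {
    ("港督", "Hong Kong Governor"): 0.8,      # new abbreviation entry
    ("香港总督", "Hong Kong Governor"): 0.9,  # duplicate of a baseline entry
}
print(augment_phrase_table(baseline, induced))
```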

BLEU

Appears in 6 sentences as: BLEU (6)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
    Page 2, “Introduction”
  2. The feature functions are combined under a log-linear framework, and the weights are tuned by the minimum-error-rate training (Och, 2003) using BLEU (Papineni et al., 2002) as the optimization metric.
    Page 6, “Experimental Results”
  3. This precision is extremely high because the BLEU score (precision with brevity penalty) that one obtains for a Chinese sentence is normally between 30% and 50%.
    Page 7, “Experimental Results”
  4. 4.5.2 BLEU on NIST MT Test Sets
    Page 7, “Experimental Results”
  5. The MT performance is measured by lowercase 4-gram BLEU (Papineni et al., 2002).
    Page 7, “Experimental Results”
  6. Table 7: MT Performance measured by BLEU Score
    Page 7, “Experimental Results”
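
For reference, BLEU (Papineni et al., 2002), as used in these excerpts, is the brevity-penalty-scaled geometric mean of modified n-gram precisions; the standard 4-gram form is:

```latex
\mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big),
\qquad
\mathrm{BP} \;=\;
\begin{cases}
1, & c > r \\
e^{\,1 - r/c}, & c \le r
\end{cases}
```

where p_n is the modified n-gram precision, w_n = 1/4, c is the candidate length, and r is the reference length.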

co-occurrence

Appears in 6 sentences as: Co-occurrence (2) co-occurrence (4)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. 3.3.1 Data Co-occurrence
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  2. In a monolingual corpus, relevant words tend to appear together (i.e., co-occurrence).
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. The co-occurrence may imply a relationship (e.g., Bill Gates is the founder of Microsoft).
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  4. Table 2: Data Co-occurrence Example for the Full-abbreviation Relation (冬季奥运会, 冬奥会), meaning “winter Olympics”
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  5. By exploiting the data co-occurrence phenomena, we identify possible abbreviations for full-form phrases.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  6. Moreover, the HMM model is computationally expensive and unable to exploit the data co-occurrence phenomena that we
    Page 8, “Related Work”
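
As a concrete illustration of the data co-occurrence idea in these excerpts, the sketch below counts how often each candidate abbreviation appears in the same sentence as its full-form phrase in a monolingual corpus. This sentence-level count is a simplification: the paper's algorithm additionally enforces alignment constraints between the abbreviation and the full form.

```python
from collections import Counter

def cooccurrence_counts(sentences, full_form, candidates):
    """Count, for each candidate abbreviation, the number of sentences that
    contain both the candidate and its full-form phrase."""
    counts = Counter()
    for sent in sentences:
        if full_form not in sent:
            continue
        for cand in candidates:
            if cand != full_form and cand in sent:
                counts[cand] += 1
    return counts

corpus = [
    "冬季奥运会（简称冬奥会）每四年举行一次。",
    "他观看了冬奥会的开幕式。",
]
print(cooccurrence_counts(corpus, "冬季奥运会", ["冬奥会"]))  # Counter({'冬奥会': 1})
```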

NIST

Appears in 6 sentences as: NIST (6)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets.
    Page 1, “Abstract”
  2. We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
    Page 2, “Introduction”
  3. We compile a parallel dataset which consists of various corpora distributed by the Linguistic Data Consortium (LDC) for NIST MT evaluation.
    Page 5, “Experimental Results”
  4. 4.5.2 BLEU on NIST MT Test Sets
    Page 7, “Experimental Results”
  5. Table 7 reports the results on various NIST MT test sets.
    Page 7, “Experimental Results”
  6. We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets.
    Page 8, “Conclusions”

machine translation

Appears in 5 sentences as: machine translation (5)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. Due to the richness of Chinese abbreviations, many of them may not appear in available parallel corpora, in which case current machine translation systems simply treat them as unknown words and leave them untranslated.
    Page 1, “Abstract”
  2. Though automatically extracting the relations between full-form Chinese phrases and their abbreviations is an interesting and important task for many natural language processing applications (e.g., machine translation, question answering, information retrieval, and so on), not much work is available in the literature.
    Page 8, “Related Work”
  3. None of the above work has addressed the Chinese abbreviation issue in the context of a machine translation task, which is the primary goal in this paper.
    Page 8, “Related Work”
  4. To the best of our knowledge, our work is the first to systematically model Chinese abbreviation expansion to improve machine translation.
    Page 8, “Related Work”
  5. We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets.
    Page 8, “Conclusions”

language model

Appears in 4 sentences as: language model (3) language models (1)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. Moreover, our approach utilizes both Chinese and English monolingual data to help MT, while most SMT systems utilize only the English monolingual data to build a language model.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  2. However, since most statistical translation models (Koehn et al., 2003; Chiang, 2007; Galley et al., 2006) are symmetrical, it is relatively easy to train a translation system to translate from English to Chinese, except that we need to train a Chinese language model from the Chinese monolingual data.
    Page 4, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. To handle different directions of translation between Chinese and English, we built two trigram language models with modified Kneser-Ney smoothing (Chen and Goodman, 1998) using the SRILM toolkit (Stolcke, 2002).
    Page 6, “Experimental Results”
  4. Feature weights (Baseline vs. AAMT): language model 0.137 vs. 0.133; phrase translation 0.066 vs. 0.023; lexical translation 0.061 vs. 0.078; reverse phrase translation 0.059 vs. 0.103; reverse lexical translation 0.…
    Page 8, “Experimental Results”
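
These feature weights are the coefficients of the log-linear model mentioned under “BLEU” above (sentence 2): each translation hypothesis is scored by a weighted sum of feature functions, and MERT tunes the weights toward BLEU. In the standard notation of Och (2003):

```latex
\hat{e} \;=\; \arg\max_{e} \sum_{i} \lambda_i \, h_i(e, f)
```

where the h_i are features such as the language model and the (reverse) phrase and lexical translation probabilities listed above, and the \lambda_i are the tuned weights.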

named entity

Appears in 3 sentences as: named entities (1) named entity (2)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. While the abbreviations mostly originate from noun phrases (in particular, named entities), other general phrases are also abbreviatable.
    Page 2, “Background: Chinese Abbreviations”
  2. One may use a named entity tagger to obtain such a list.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. However, this relies on the existence of a Chinese named entity tagger with high precision.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”

parallel data

Appears in 3 sentences as: parallel data (3)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. To be able to translate a Chinese abbreviation that is unseen in available parallel corpora, one may annotate more parallel data.
    Page 1, “Introduction”
  2. This is particularly interesting since we normally have enormous amounts of monolingual data but only a small amount of parallel data.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”
  3. For example, in the translation task between Chinese and English, both the Chinese and English Gigaword have billions of words, but the parallel data has only about 30 million words.
    Page 3, “Unsupervised Translation Induction for Chinese Abbreviations”

phrase-based

Appears in 3 sentences as: phrase-based (3)
In Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
  1. Using the toolkit Moses (Koehn et al., 2007), we built a phrase-based baseline system by following
    Page 5, “Experimental Results”
  2. This is analogous to the concept of “phrase” in phrase-based MT.
    Page 6, “Experimental Results”
  3. We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets.
    Page 8, “Conclusions”
