Learning Translation Consensus with Structured Label Propagation
Shujie Liu, Chi-Ho Li, Mu Li and Ming Zhou

Article Structure

Abstract

In this paper, we address the issue of learning better translation consensus in machine translation (MT) research, and explore the search for translation consensus from similar, rather than identical, source sentences or their spans.

Introduction

Consensus in translation has received increasing attention in recent years.

Graph-based Translation Consensus

Our MT system with graph-based translation consensus adopts the conventional log-linear model.

Graph-based Structured Learning

In general, a graph-based model assigns labels to instances by considering the labels of similar instances.

Features and Training

The last section sketched the structured label propagation algorithm.

Graph Construction

A technical detail is still needed to complete the description of graph-based consensus, namely, how the actual consensus graph is constructed.

Experiments and Results

In this section, graph-based translation consensus is tested on Chinese-to-English translation tasks.

Conclusion and Future Work

In this paper, we extend the consensus method by collecting consensus statistics, not only from translation candidates of the same source sentence/span, but also from those of similar ones.

Topics

graph-based

Appears in 30 sentences as: Graph-based (4) graph-based (28)
In Learning Translation Consensus with Structured Label Propagation
  1. We convert such graph-based translation consensus from similar source strings into useful features, both for n-best output re-ranking and for the decoding algorithm.
    Page 1, “Abstract”
  2. Alexandrescu and Kirchhoff (2009) proposed a graph-based semi-supervised model to re-rank n-best translation output.
    Page 2, “Introduction”
  3. In this paper, we attempt to leverage translation consensus among similar (spans of) source sentences in bilingual training data, by a novel graph-based model of translation consensus.
    Page 2, “Introduction”
  4. Our MT system with graph-based translation consensus adopts the conventional log-linear model.
    Page 2, “Graph-based Translation Consensus”
  5. In addition to the commonly used features, two kinds of features are added to equation (1): graph-based consensus features, which capture consensus among the translations of similar sentences/spans, and local consensus features, which capture consensus among the translations of the same sentence/span.
    Page 2, “Graph-based Translation Consensus”
  6. In general, a graph-based model assigns labels to instances by considering the labels of similar instances.
    Page 2, “Graph-based Structured Learning”
  7. The gist of a graph-based model is that, if two instances are connected by a strong edge, then their labels tend to be the same (Zhu, 2005).
    Page 2, “Graph-based Structured Learning”
  8. This scenario differs from the general case of a graph-based model in two aspects.
    Page 3, “Graph-based Structured Learning”
  9. Therefore, the principle of graph-based translation consensus must be reformulated as, if two instances (source spans) are similar, then their labels (translations) tend to be similar (rather than the same).
    Page 3, “Graph-based Structured Learning”
  10. Thus their graph-based model is a normal example of the general graph-based model.
    Page 3, “Graph-based Structured Learning”
  11. 3.1 Label Propagation for General Graph-based Models
    Page 3, “Graph-based Structured Learning”
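
The snippets above describe the general graph-based setting: instances connected by a strong edge should receive the same label, and label distributions are pushed from labelled nodes to unlabelled ones along weighted edges. The sketch below illustrates only that general, non-structured propagation; the graph representation, the normalization and the fixed iteration count are assumptions chosen for illustration, and the paper's structured label propagation additionally lets similar, rather than identical, labels reinforce each other.

```python
# Minimal label-propagation sketch (illustrative; not the paper's structured
# variant). Each node holds a distribution over labels; labelled "seed" nodes
# stay fixed, unlabelled nodes repeatedly absorb their neighbours'
# distributions through normalized edge weights.

def propagate(edges, seeds, labels, iterations=20):
    """edges: {node: {neighbour: weight}}; seeds: {node: fixed distribution};
    labels: list of all label names."""
    nodes = set(edges) | {m for nbrs in edges.values() for m in nbrs}
    uniform = {l: 1.0 / len(labels) for l in labels}
    belief = {n: dict(seeds.get(n, uniform)) for n in nodes}

    for _ in range(iterations):
        new_belief = dict(belief)
        for node, nbrs in edges.items():
            if node in seeds:                     # labelled instances stay fixed
                continue
            total_w = sum(nbrs.values()) or 1.0
            # Weighted average of the neighbours' current label distributions.
            mixed = {l: sum(w * belief[m][l] for m, w in nbrs.items()) / total_w
                     for l in labels}
            z = sum(mixed.values()) or 1.0
            new_belief[node] = {l: v / z for l, v in mixed.items()}
        belief = new_belief
    return belief
```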

NIST

Appears in 9 sentences as: NIST (10)
In Learning Translation Consensus with Structured Label Propagation
  1. Experimental results show that our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-of-the-art baseline.
    Page 1, “Abstract”
  2. We conduct experiments with IWSLT and NIST data, and experimental results show that our method
    Page 2, “Introduction”
  3. We test our method with two data settings: one is the IWSLT data set, the other is the NIST data set.
    Page 7, “Experiments and Results”
  4. For the NIST data set, the bilingual training data we used is the NIST 2008 training set, excluding the Hong Kong Law and Hong Kong Hansard corpora.
    Page 7, “Experiments and Results”
  5. The baseline results on NIST data are shown in Table 2.
    Page 7, “Experiments and Results”
  6. Baselines for NIST data
    Page 7, “Experiments and Results”
  7. Consensus-based re-ranking and decoding for the NIST data set.
    Page 8, “Experiments and Results”
  8. We also conduct experiments on NIST data, and results are shown in Table 4.
    Page 8, “Experiments and Results”
  9. We conduct experiments on IWSLT and NIST data, and our method can improve the performance significantly.
    Page 8, “Conclusion and Future Work”

sentence pairs

Appears in 9 sentences as: sentence pair (3) sentence pairs (6)
In Learning Translation Consensus with Structured Label Propagation
  1. For the nodes representing the training sentence pairs, this posterior is fixed.
    Page 4, “Features and Training”
  2. If there are sentence pairs with the same source sentence but different translations, all the translations will be assigned as labels to that source sentence, and the corresponding probabilities are estimated by MLE.
    Page 5, “Graph Construction”
  3. There is no edge between training nodes, since we suppose all the sentences of the training data are correct, and it is pointless to reestimate the confidence of those sentence pairs.
    Page 5, “Graph Construction”
  4. Forced alignment performs phrase segmentation and alignment of each sentence pair of the training data using the full translation system as in decoding (Wuebker et al., 2010).
    Page 6, “Graph Construction”
  5. In simpler terms, for each sentence pair in the training data, a decoder is applied to the source side, and all the translation candidates that do not match any substring of the target side are deleted.
    Page 6, “Graph Construction”
  6. the decoder may not be able to produce the target side of a sentence pair.
    Page 6, “Graph Construction”
  7. The training data contains 81k sentence pairs, 655k Chinese words and 806k English words.
    Page 7, “Experiments and Results”
  8. The training data contains 354k sentence pairs, 8M Chinese words and 10M English words.
    Page 7, “Experiments and Results”
  9. re-ranking methods are performed in the same way as for IWSLT data, but for consensus-based decoding, the data set contains too many sentence pairs to be held in one graph for our machine.
    Page 8, “Experiments and Results”
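
The snippets above also sketch how training data enters the consensus graph: every training source sentence becomes a node, its observed target sides become labels, the label probabilities are estimated by MLE and then held fixed, and no edges are created among training nodes. A minimal sketch of that relative-frequency step, assuming the training data is available as plain (source, target) string pairs:

```python
from collections import Counter, defaultdict

def training_label_distributions(sentence_pairs):
    """Group training pairs by source sentence and turn translation counts into
    relative frequencies (MLE), as a sketch of the fixed posteriors that
    training nodes could carry."""
    counts = defaultdict(Counter)
    for src, tgt in sentence_pairs:
        counts[src][tgt] += 1
    dists = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        dists[src] = {tgt: c / total for tgt, c in tgt_counts.items()}
    return dists

# A source sentence observed with two different translations gets 0.5 / 0.5.
example = training_label_distributions(
    [("src-a", "translation-1"), ("src-a", "translation-2"), ("src-b", "translation-3")])
```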

similarity measure

Appears in 9 sentences as: similarity measure (5) similarity measures (4)
In Learning Translation Consensus with Structured Label Propagation
  1. w_{i,j} defines the weight of the edge, which is a similarity measure between nodes i and j.
    Page 3, “Graph-based Structured Learning”
  2. Propagation probability T_s(f, f') is as defined in equation (3), and T_l(e, e') is defined given some similarity measure sim(e, e') between labels e and e'.
    Page 4, “Graph-based Structured Learning”
  3. T_l(e, e') is the propagation probability in equation (8), with the similarity measure sim(e, e') defined as the Dice coefficient over the set of all n-grams in e and those in e'.
    Page 4, “Features and Training”
  4. defined in equation (3), takes symmetrical sentence level BLEU as the similarity measure:
    Page 4, “Features and Training”
  5. In theory we could use other similarity measures, such as edit distance or string kernels.
    Page 4, “Features and Training”
  6. Like GC, there are four features with respect to the value of n in the n-gram similarity measure.
    Page 5, “Features and Training”
  7. Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
    Page 7, “Experiments and Results”
  8. In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
    Page 8, “Conclusion and Future Work”
  9. In the future, we will explore other consensus features and other similarity measures, which may take document-level information or syntactic and semantic information into consideration.
    Page 8, “Conclusion and Future Work”
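
One of the two similarity measures quoted above, the Dice coefficient over the sets of all n-grams of two translations, is sketched below. Whitespace tokenization and the n-gram orders 1..4 are assumptions made for illustration:

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dice(a, b):
    """Dice coefficient over two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def ngram_dice_similarity(e, e_prime, max_n=4):
    """sim(e, e'): Dice coefficient over the sets of all n-grams up to max_n."""
    ta, tb = e.split(), e_prime.split()
    grams_a = set().union(*(ngrams(ta, n) for n in range(1, max_n + 1)))
    grams_b = set().union(*(ngrams(tb, n) for n in range(1, max_n + 1)))
    return dice(grams_a, grams_b)
```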

BLEU

Appears in 7 sentences as: BLEU (7)
In Learning Translation Consensus with Structured Label Propagation
  1. defined in equation (3), takes symmetrical sentence level BLEU as the similarity measure:
    Page 4, “Features and Training”
  2. sim(f, f') is the symmetrical sentence level BLEU defined in equation (10), where i-BLEU(f, f') is the IBM BLEU score computed over i-grams for hypothesis f using f' as reference.
    Page 4, “Features and Training”
  3. BLEU is not symmetric, which means different scores are obtained depending on which sentence is the reference and which is the hypothesis.
    Page 4, “Features and Training”
  4. This alternation of structured label propagation and MERT stops when the BLEU score on dev data converges, or a preset limit (10 rounds) is reached.
    Page 5, “Features and Training”
  5. In our experiment we measure similarity by symmetrical sentence level BLEU of source sentences, and 0.3 is taken as the threshold for edge creation.
    Page 6, “Graph Construction”
  6. Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
    Page 7, “Experiments and Results”
  7. In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
    Page 8, “Conclusion and Future Work”
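
The other recurring similarity is the symmetrical sentence-level BLEU of equation (10): because BLEU depends on which string plays the reference role, the score is computed in both directions and combined. The exact combination is not recoverable from the snippets, so the sketch below simply averages the two directions and uses NLTK's smoothed sentence_bleu as a stand-in scorer; both choices are assumptions rather than the paper's implementation. Per the graph-construction snippet above, an edge between two source sentences is created only when this similarity exceeds a threshold (0.3 in the quoted experiments).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method1

def symmetric_sentence_bleu(f, f_prime):
    """Average of smoothed sentence-level BLEU in both directions, so the score
    no longer depends on which sentence is treated as the reference."""
    a, b = f.split(), f_prime.split()
    forward = sentence_bleu([b], a, smoothing_function=_smooth)   # f against f'
    backward = sentence_bleu([a], b, smoothing_function=_smooth)  # f' against f
    return 0.5 * (forward + backward)

# Example edge-creation rule following the quoted setup (threshold 0.3).
def should_connect(src1, src2, threshold=0.3):
    return symmetric_sentence_bleu(src1, src2) > threshold
```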

feature weights

Appears in 6 sentences as: feature weights (6)
In Learning Translation Consensus with Structured Label Propagation
  1. where ψ is the feature vector, λ is the feature weights, and H(f) is the set of translation hypotheses in the search space.
    Page 2, “Graph-based Translation Consensus”
  2. Before elaborating how the graph model of consensus is constructed for both a decoder and N-best output re-ranking in section 5, we will describe how the consensus features and their feature weights can be trained in a semi-supervised way, in section 4.
    Page 2, “Graph-based Translation Consensus”
  3. Therefore, we can alternately update the graph-based consensus features and the feature weights in the log-linear model.
    Page 5, “Features and Training”
  4. The decoder then adds the new features and retrains all the feature weights by Minimum Error Rate Training (MERT) (Och, 2003).
    Page 5, “Features and Training”
  5. The decoder with new feature weights then provides new n-best candidates and their posteriors for constructing another consensus graph, which in turn gives rise to the next round of
    Page 5, “Features and Training”
  6. The development data utilized to tune the feature weights of our decoder is NIST’03 evaluation set, and test sets are NIST’05 and NIST’08 evaluation sets.
    Page 7, “Experiments and Results”
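
The first snippet under this topic is the standard log-linear decision rule: a hypothesis is scored by the dot product of its feature vector ψ with the weight vector λ, and the best translation is the arg-max over the hypothesis set H(f). A minimal sketch with features held in name-keyed dictionaries; the feature names, including the graph-consensus one, are hypothetical and only illustrate where the added consensus features would slot in:

```python
def loglinear_score(features, weights):
    """Dot product of the feature vector and the feature weights (the exponent
    of the log-linear model; exponentiation is irrelevant for the arg-max)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Pick the highest-scoring hypothesis from H(f).
    hypotheses: list of (translation, feature_dict) pairs."""
    return max(hypotheses, key=lambda h: loglinear_score(h[1], weights))

# Hypothetical feature names, with a graph-based consensus feature added.
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.2, "graph_consensus_2gram": 0.4}
```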

MT system

Appears in 6 sentences as: MT system (6) MT systems (1)
In Learning Translation Consensus with Structured Label Propagation
  1. The principle of consensus can be sketched as “a translation candidate is deemed more plausible if it is supported by other translation candidates.” The actual formulation of the principle depends on whether the translation candidate is a complete sentence or just a span of it, whether the candidate is the same as or similar to the supporting candidates, and whether the supporting candidates come from the same or different MT system.
    Page 1, “Introduction”
  2. Others extend consensus among translations from the same MT system to those from different MT systems.
    Page 1, “Introduction”
  3. For the source (Chinese) span in the example, the MT system produced the correct translation for the second sentence, but it failed to do so for the first one.
    Page 1, “Introduction”
  4. Our MT system with graph-based translation consensus adopts the conventional log-linear model.
    Page 2, “Graph-based Translation Consensus”
  5. Before elaborating the details of how the actual graph is constructed, we would like to first introduce how the graph-based translation consensus can be used in an MT system.
    Page 4, “Features and Training”
  6. When graph-based consensus is applied to an MT system, the graph will have nodes for training data, development (dev) data, and test data (details in Section 5).
    Page 5, “Features and Training”

semi-supervised

Appears in 6 sentences as: Semi-Supervised (1) semi-supervised (5)
In Learning Translation Consensus with Structured Label Propagation
  1. Alexandrescu and Kirchhoff (2009) proposed a graph-based semi-supervised model to re-rank n-best translation output.
    Page 2, “Introduction”
  2. Before elaborating how the graph model of consensus is constructed for both a decoder and N-best output re-ranking in section 5, we will describe how the consensus features and their feature weights can be trained in a semi-supervised way, in section 4.
    Page 2, “Graph-based Translation Consensus”
  3. Algorithm 1 Semi-Supervised Learning
    Page 5, “Features and Training”
  4. Algorithm 1 outlines our semi-supervised method for such alternative training.
    Page 5, “Features and Training”
  5. To perform consensus-based re-ranking, we first use the baseline decoder to get the n-best list for each sentence of the development and test data, then we create a graph using the n-best lists and training data as described in section 5.1, and perform semi-supervised training as mentioned in section 4.3.
    Page 7, “Experiments and Results”
  6. The features and weights are tuned with an iterative semi-supervised method.
    Page 8, “Conclusion and Future Work”
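
Algorithm 1, as quoted above and under the BLEU and feature-weights topics, alternates two steps: run structured label propagation over the consensus graph to produce the graph-based consensus features, then rerun MERT to retune all feature weights, stopping when the dev-set BLEU converges or after a preset limit of 10 rounds. The outline below is schematic; every callable it receives (decoder, graph builder, propagation, feature augmentation, MERT, BLEU scorer) is a placeholder for the corresponding real component:

```python
def semi_supervised_training(dev_data, training_data, weights,
                             decode, build_graph, propagate_labels,
                             add_consensus_features, run_mert, dev_bleu,
                             max_rounds=10, tol=1e-4):
    """Schematic alternation of structured label propagation and MERT."""
    prev = float("-inf")
    for _ in range(max_rounds):
        nbest = decode(dev_data, weights)                  # n-best lists + posteriors
        graph = build_graph(nbest, training_data)          # consensus graph (section 5)
        consensus = propagate_labels(graph)                # structured label propagation
        nbest = add_consensus_features(nbest, consensus)   # new graph-based features
        weights = run_mert(nbest, dev_data)                # retune all feature weights
        score = dev_bleu(dev_data, weights)
        if abs(score - prev) < tol:                        # dev BLEU converged
            break
        prev = score
    return weights
```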

language model

Appears in 4 sentences as: language model (6)
In Learning Translation Consensus with Structured Label Propagation
  1. We also use other fundamental features, such as translation probabilities, lexical weights, distortion probability, word penalty, and language model probability.
    Page 5, “Features and Training”
  2. The features we used are the commonly used features of a standard BTG decoder, such as translation probabilities, lexical weights, language model, word penalty and distortion probabilities.
    Page 7, “Experiments and Results”
  3. The language model is a 5-gram language model trained on the target sentences in the training data.
    Page 7, “Experiments and Results”
  4. The language model is a 5-gram language model trained on the Giga-Word corpus plus the English sentences in the training data.
    Page 7, “Experiments and Results”

log-linear

Appears in 4 sentences as: log-linear (4)
In Learning Translation Consensus with Structured Label Propagation
  1. Our MT system with graph-based translation consensus adopts the conventional log-linear model.
    Page 2, “Graph-based Translation Consensus”
  2. Therefore, we can alternately update the graph-based consensus features and the feature weights in the log-linear model.
    Page 5, “Features and Training”
  3. Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
    Page 7, “Experiments and Results”
  4. The consensus statistics are integrated into the conventional log-linear model as features.
    Page 8, “Conclusion and Future Work”

log-linear model

Appears in 4 sentences as: log-linear model (4)
In Learning Translation Consensus with Structured Label Propagation
  1. Our MT system with graph-based translation consensus adopts the conventional log-linear model.
    Page 2, “Graph-based Translation Consensus”
  2. Therefore, we can alternately update the graph-based consensus features and the feature weights in the log-linear model.
    Page 5, “Features and Training”
  3. Instead of using graph-based consensus confidence as features in the log-linear model, we perform structured label propagation (Struct-LP) to re-rank the n-best list directly, and the similarity measures for source sentences and translation candidates are symmetrical sentence level BLEU (equation (10)).
    Page 7, “Experiments and Results”
  4. The consensus statistics are integrated into the conventional log-linear model as features.
    Page 8, “Conclusion and Future Work”

machine translation

Appears in 4 sentences as: machine translation (4)
In Learning Translation Consensus with Structured Label Propagation
  1. In this paper, we address the issue of learning better translation consensus in machine translation (MT) research, and explore the search for translation consensus from similar, rather than identical, source sentences or their spans.
    Page 1, “Abstract”
  2. Experimental results show that our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-of-the-art baseline.
    Page 1, “Abstract”
  3. G-Re-Rank-GC and G-Decode-GC improve the performance of machine translation over the baseline.
    Page 8, “Experiments and Results”
  4. To calculate consensus statistics, we develop a novel structured label propagation method for structured learning problems, such as machine translation.
    Page 8, “Conclusion and Future Work”

model trained

Appears in 4 sentences as: model trained (3) model training (1)
In Learning Translation Consensus with Structured Label Propagation
  1. Note that, due to pruning in both decoding and translation model training, forced alignment may fail, i.e.
    Page 6, “Graph Construction”
  2. Our baseline decoder is an in-house implementation of Bracketing Transduction Grammar (BTG) (Wu, 1997) in CKY-style decoding, with a lexical reordering model trained with maximum entropy (Xiong et al., 2006).
    Page 7, “Experiments and Results”
  3. The language model is a 5-gram language model trained on the target sentences in the training data.
    Page 7, “Experiments and Results”
  4. The language model is a 5-gram language model trained on the Giga-Word corpus plus the English sentences in the training data.
    Page 7, “Experiments and Results”

n-gram

Appears in 4 sentences as: n-gram (4)
In Learning Translation Consensus with Structured Label Propagation
  1. Collaborative decoding (Li et al., 2009) scores the translation of a source span by its n-gram similarity to the translations by other systems.
    Page 1, “Introduction”
  2. Here simple n-gram similarity is used for the sake of efficiency.
    Page 4, “Features and Training”
  3. Like GC, there are four features with respect to the value of n in the n-gram similarity measure.
    Page 5, “Features and Training”
  4. Solid lines are edges connecting nodes with sufficient source side n-gram similarity, such as the one between "E A M N" and "E A B C".
    Page 6, “Graph Construction”

n-grams

Appears in 3 sentences as: n-grams (3)
In Learning Translation Consensus with Structured Label Propagation
  1. T_l(e, e') is the propagation probability in equation (8), with the similarity measure sim(e, e') defined as the Dice coefficient over the set of all n-grams in e and those in e'.
    Page 4, “Features and Training”
  2. where NGr_n(x) is the set of n-grams in string x, and Dice(A, B) is the Dice coefficient over sets A and B:
    Page 4, “Features and Training”
  3. In this paper, we only tried Dice coefficient of n-grams and symmetrical sentence level BLEU as similarity measures.
    Page 8, “Conclusion and Future Work”
