Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors
Yan, Rui and Gao, Mingkun and Pavlick, Ellie and Callison-Burch, Chris

Article Structure

Abstract

Crowdsourcing is a viable mechanism for creating training data for machine translation.

Introduction

Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora.

Related work

In the HCI community, several researchers have proposed protocols for collaborative translation efforts (Morita and Ishida, 2009b; Morita and Ishida, 2009a; Hu, 2009; Hu et al., 2010).

Crowdsourcing Translation

Setup We conduct our experiments using the data collected by Zaidan and Callison-Burch (2011).

Problem Formulation

The problem definition of the crowdsourcing translation task is straightforward: given a set of candidate translations for a source sentence, we want to choose the best output translation.

Evaluation

We are interested in testing our random walk method, which incorporates information from both the candidate translations and from the Turkers.

Conclusion

We have proposed an algorithm for using a two-step collaboration between nonprofessional translators and post-editors to obtain professional-quality translations.

Topics

Turker

Appears in 25 sentences as: Turker (13) Turkers (11) Turkers’ (1)
  1. …different Turkers for a collection of Urdu sentences that had been previously professionally translated by the Linguistics Data Consortium.
    Page 3, “Related work”
  2. They also hired US-based Turkers to edit the translations, since the translators were largely based in Pakistan and exhibited errors that are characteristic of speakers of English as a second language.
    Page 3, “Related work”
  3. 52 different Turkers took part in the translation task, each translating 138 sentences on average.
    Page 3, “Crowdsourcing Translation”
  4. In the editing task, 320 Turkers participated, averaging 56 sentences each.
    Page 3, “Crowdsourcing Translation”
  5. We form two graphs: the first graph (GT) represents Turkers (translator/editor pairs) as nodes; the second graph (GC) represents candidate translated and…
    Page 4, “Problem Formulation”
  6. GT = (VT, ET) is a weighted undirected graph representing collaborations between Turkers .
    Page 5, “Problem Formulation”
  7. The mutual reinforcement framework couples the two random walks on GT and GC that rank candidates and Turkers in isolation.
    Page 5, “Problem Formulation”
  8. We use two vectors c = [π(c)]_{|c|×1} and t = [π(t)]_{|t|×1} to denote the saliency scores of candidates and Turker pairs.
    Page 5, “Problem Formulation”
  9. We use [M]_{|c|×|c|} to describe the homogeneous affinity between candidates and [N]_{|t|×|t|} to describe the affinity between Turkers.
    Page 6, “Problem Formulation”
  10. where c = |VC| is the number of vertices in the candidate graph and t = |VT| is the number of vertices in the Turker graph.
    Page 6, “Problem Formulation”
  11. The adjacency matrix [M] denotes the transition probabilities between candidates, and analogously matrix [N] denotes the affinity between Turker collaboration pairs.
    Page 6, “Problem Formulation”
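Items 5–11 above describe the core machinery: two affinity graphs (one over candidate translations, one over translator/editor pairs) whose random walks are coupled so that each ranking reinforces the other. The summary does not reproduce the exact update equations, so the sketch below is one plausible instantiation of such a mutual-reinforcement walk; the cross-affinity matrix W (linking candidates to the Turker pairs who produced them), the coupling weight lam, and the damping factor mu defaults are assumptions.

```python
def col_normalize(A):
    """Column-normalize a matrix (list of rows) into transition probabilities."""
    cols = list(zip(*A))
    sums = [sum(col) or 1.0 for col in cols]
    return [[A[i][j] / sums[j] for j in range(len(cols))] for i in range(len(A))]

def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def coupled_random_walk(M, N, W, lam=0.5, mu=0.85, iters=100):
    """Two coupled random walks: M is candidate-candidate affinity,
    N is Turker-Turker affinity, W is candidate x Turker authorship
    affinity (an assumption; the paper's exact coupling may differ)."""
    nc, nt = len(M), len(N)
    M, N = col_normalize(M), col_normalize(N)
    Wc = col_normalize(W)                           # candidates <- Turkers
    Wt = col_normalize([list(r) for r in zip(*W)])  # Turkers <- candidates
    c = [1.0 / nc] * nc   # saliency of candidate translations
    t = [1.0 / nt] * nt   # saliency of translator/editor pairs
    for _ in range(iters):
        # each walk mixes its own graph with the other walk's scores (weight lam),
        # plus a PageRank-style uniform jump with probability 1 - mu
        c = [mu * ((1 - lam) * x + lam * y) + (1 - mu) / nc
             for x, y in zip(mat_vec(M, c), mat_vec(Wc, t))]
        t = [mu * ((1 - lam) * x + lam * y) + (1 - mu) / nt
             for x, y in zip(mat_vec(N, t), mat_vec(Wt, c))]
    return c, t
```

Ranking candidates by the converged vector c would then select, for each source sentence, the translation with the highest saliency score.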


TER

Appears in 12 sentences as: TER (12)
  1. "8.0 0.5 1.0 1.5 2.0 TER between pre- and post-edit translation
    Page 3, “Crowdsourcing Translation”
  2. Aggressiveness (x-axis) is measured as the TER between the pre-edit and post-edit version of the translation, and effectiveness (y-axis) is measured as the average amount by which the editing reduces the translation’s TERgold.
    Page 3, “Crowdsourcing Translation”
  3. We use translation edit rate ( TER ) as a measure of translation similarity.
    Page 3, “Crowdsourcing Translation”
  4. TER represents the amount of change necessary to transform one sentence into another, so a low TER means the two sentences are similar.
    Page 3, “Crowdsourcing Translation”
  5. [Table fragment] Editor A: TER 0.01–0.03
    Page 4, “Crowdsourcing Translation”
  6. To capture the quality (“professionalness”) of a translation, we take the average TER of the translation against each of our gold translations.
    Page 4, “Crowdsourcing Translation”
  7. We measure aggressiveness by looking at the TER between
    Page 4, “Crowdsourcing Translation”
  8. the pre- and post-edited versions of each editor’s translations; higher TER implies more aggressive editing.
    Page 4, “Crowdsourcing Translation”
  9. Lowest TER 35.78
    Page 8, “Evaluation”
  10. The first method selects the translation with the minimum average TER (Snover et al., 2006) against the other translations; intuitively, this would represent the “consensus” translation.
    Page 8, “Evaluation”
  11. The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER .
    Page 8, “Evaluation”
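The “minimum average TER” consensus baseline in items 10–11 can be sketched directly. True TER also counts block shifts (Snover et al., 2006); the proxy below uses plain word-level edit distance divided by reference length, which is a simplifying assumption, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (insert/delete/substitute)."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wa != wb))
    return dp[len(b)]

def ter(hyp, ref):
    """Simplified TER: word edits / reference length (no shift operation)."""
    h, r = hyp.split(), ref.split()
    return edit_distance(h, r) / max(len(r), 1)

def consensus_translation(candidates):
    """Index of the candidate with minimum average TER against the others,
    i.e. the 'consensus' translation baseline."""
    def avg_ter(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(ter(candidates[i], o) for o in others) / len(others)
    return min(range(len(candidates)), key=avg_ter)
```

The second baseline in item 11 applies the same idea per Turker: average each worker's TER over all their translations and keep the worker with the lowest mean.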


BLEU

Appears in 11 sentences as: BLEU (12)
  1. Metric Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy ( BLEU ) score (Papineni et al., 2002) for one professional translator (Pl) using the other three (P2,3,4) as a reference set.
    Page 7, “Evaluation”
  2. In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
    Page 7, “Evaluation”
  3. This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
    Page 7, “Evaluation”
  4. Table 2: Overall BLEU performance for all methods (with and without post-editing).
    Page 8, “Evaluation”
  5. The first oracle operates at the segment level on the sentences produced by translators only: for each source segment, we choose from the translations the one that scores highest (in terms of BLEU ) against the reference sentences.
    Page 8, “Evaluation”
  6. As expected, random selection yields bad performance, with a BLEU score of 30.52.
    Page 8, “Evaluation”
  7. Figure 5: Effect of candidate-Turker coupling (λ) on BLEU score.
    Page 8, “Evaluation”
  8. The approach which selects the translations with the minimum average TER (Snover et al., 2006) against the other three translations (the “consensus” translation) achieves BLEU scores of 35.78.
    Page 8, “Evaluation”
  9. Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to Zaidan and Callison-Burch (2011)’s reported score of 28.13, which they achieved using a linear feature-based classification.
    Page 8, “Evaluation”
  10. This boost in BLEU score confirms our intuition that the hidden collaboration networks between candidate translations and translator/editor pairs are indeed useful.
    Page 8, “Evaluation”
  11. In order to determine a value for λ, we used the average BLEU, computed against the professional references…
    Page 8, “Evaluation”
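The metric behind these comparisons is standard corpus BLEU with multiple references (items 1–3). As a reference point, a minimal implementation with clipped n-gram precisions and a brevity penalty (Papineni et al., 2002) looks roughly like the sketch below; it omits tokenization and smoothing, so real experiments should use an established tool such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references_list, max_n=4):
    """Corpus-level BLEU: clipped n-gram precision plus brevity penalty.

    references_list[i] is the list of reference strings for hypothesis i.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references_list):
        h = hyp.split()
        rs = [r.split() for r in refs]
        hyp_len += len(h)
        # closest reference length (ties broken toward the shorter reference)
        ref_len += min((abs(len(r) - len(h)), len(r)) for r in rs)[1]
        for n in range(1, max_n + 1):
            hc = ngrams(h, n)
            max_ref = Counter()
            for r in rs:                       # clip counts by the max over refs
                for g, cnt in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], cnt)
            matches[n - 1] += sum(min(cnt, max_ref[g]) for g, cnt in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

With four professional sets, the paper's leave-one-out setup scores each translator (or each method) against the other three sets passed in as `references_list`.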


machine translation

Appears in 8 sentences as: machine translation (8)
  1. Crowdsourcing is a viable mechanism for creating training data for machine translation .
    Page 1, “Abstract”
  2. Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora.
    Page 1, “Introduction”
  3. These have focused on an iterative collaboration between monolingual speakers of the two languages, facilitated with a machine translation system.
    Page 2, “Related work”
  4. In our setup the poor translations are produced by bilingual individuals who are weak in the target language, and in their experiments the translations are the output of a machine translation system.[1] Another significant difference is that the HCI studies assume cooperative participants.
    Page 2, “Related work”
  5. [1] A variety of HCI and NLP studies have confirmed the efficacy of monolingual or bilingual individuals post-editing machine translation output (Callison-Burch, 2005; Koehn, 2010; Green et al., 2013).
    Page 2, “Related work”
  6. Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).
    Page 2, “Related work”
  7. A state-of-the-art machine translation system (the syntax-based variant of Joshua) achieves a score of 26.91, which is reported in (Zaidan and Callison-Burch, 2011).
    Page 8, “Evaluation”
  8. In addition to its benefits of cost and scalability, crowdsourcing provides access to languages that currently fall outside the scope of statistical machine translation research.
    Page 9, “Conclusion”


BLEU score

Appears in 7 sentences as: BLEU score (5) BLEU scores (3)
  1. In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.
    Page 7, “Evaluation”
  2. This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.
    Page 7, “Evaluation”
  3. As expected, random selection yields bad performance, with a BLEU score of 30.52.
    Page 8, “Evaluation”
  4. Figure 5: Effect of candidate-Turker coupling (λ) on BLEU score.
    Page 8, “Evaluation”
  5. The approach which selects the translations with the minimum average TER (Snover et al., 2006) against the other three translations (the “consensus” translation) achieves BLEU scores of 35.78.
    Page 8, “Evaluation”
  6. Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to Zaidan and Callison-Burch (2011)’s reported score of 28.13, which they achieved using a linear feature-based classification.
    Page 8, “Evaluation”
  7. This boost in BLEU score confirms our intuition that the hidden collaboration networks between candidate translations and translator/editor pairs are indeed useful.
    Page 8, “Evaluation”


Mechanical Turk

Appears in 5 sentences as: Mechanical Turk (5)
  1. Rather than relying on volunteers or gamification, NLP research into crowdsourcing translation has focused on hiring workers on the Amazon Mechanical Turk (MTurk) platform (Callison-Burch, 2009).
    Page 1, “Introduction”
  2. Our setup uses anonymous crowd workers hired on Mechanical Turk , whose motivation to participate is financial.
    Page 2, “Related work”
  3. Most NLP research into crowdsourcing has focused on Mechanical Turk , following pioneering work by Snow et al.
    Page 2, “Related work”
  4. Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).
    Page 2, “Related work”
  5. This data set consists of 1,792 Urdu sentences from a variety of news and online sources, each paired with English translations provided by nonprofessional translators on Mechanical Turk.
    Page 3, “Crowdsourcing Translation”


graph-based

Appears in 4 sentences as: graph-based (4)
  1. We develop graph-based ranking models that automatically select the best output from multiple redundant versions of translations and edits, improving translation quality to near-professional levels.
    Page 1, “Abstract”
  2. • A new graph-based algorithm for selecting the best translation among multiple translations of the same input.
    Page 2, “Introduction”
  3. Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to Zaidan and Callison-Burch (2011)’ s reported score of 28.13, which they achieved using a linear feature-based classification.
    Page 8, “Evaluation”
  4. In contrast, our proposed graph-based ranking framework achieves a score of 41.43 when using the same information.
    Page 8, “Evaluation”


translation system

Appears in 4 sentences as: translation system (3) translation systems (1)
  1. These have focused on an iterative collaboration between monolingual speakers of the two languages, facilitated with a machine translation system .
    Page 2, “Related work”
  2. Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).
    Page 2, “Related work”
  3. …(2013) translated 1.5 million words of Levantine Arabic and Egyptian Arabic, and showed that a statistical translation system trained on the dialect data outperformed a system trained on 100 times more MSA data.
    Page 2, “Related work”
  4. A state-of-the-art machine translation system (the syntax-based variant of Joshua) achieves a score of 26.91, which is reported in (Zaidan and Callison-Burch, 2011).
    Page 8, “Evaluation”


PageRank

Appears in 3 sentences as: PageRank (3)
  1. The standard PageRank algorithm starts from an arbitrary node and randomly selects to either follow a random outgoing edge (considering the weighted transition matrix) or to jump to a random node (treating all nodes with equal probability).
    Page 6, “Problem Formulation”
  2. where 1 is a vector with all elements equal to 1, whose size corresponds to the size of VC or VT. μ is the damping factor, usually set to 0.85, as in the PageRank algorithm.
    Page 7, “Problem Formulation”
  3. We set the damping factor μ to 0.85, following the standard PageRank paradigm.
    Page 8, “Evaluation”
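The standard PageRank update described in item 1 (follow a weighted outgoing edge with probability μ = 0.85, otherwise jump to a uniformly random node) can be written as a short power iteration. The sketch below assumes column-normalized transition probabilities and uniform teleportation; it illustrates the paradigm rather than the paper's exact coupled variant.

```python
def pagerank(adj, mu=0.85, iters=100, tol=1e-12):
    """Power iteration for PageRank with damping factor mu.

    adj[i][j] is the weight of edge j -> i; each column is normalized into
    transition probabilities, and dangling columns fall back to uniform.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    P = [[adj[i][j] / col_sums[j] if col_sums[j] else 1.0 / n
          for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # with prob. mu follow an edge, with prob. 1 - mu jump uniformly
        r_new = [mu * sum(P[i][j] * r[j] for j in range(n)) + (1 - mu) / n
                 for i in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < tol:
            return r_new
        r = r_new
    return r
```

On a symmetric graph such as a complete graph, the scores converge to the uniform distribution, which is a quick sanity check for the damping arithmetic.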


parallel corpora

Appears in 3 sentences as: parallel corpora (3)
  1. Statistical machine translation (SMT) systems are trained using bilingual sentence-aligned parallel corpora .
    Page 1, “Introduction”
  2. Because of this, collecting parallel corpora for minor languages has become an interesting research challenge.
    Page 1, “Introduction”
  3. (2012) used MTurk to create parallel corpora for six Indian languages for less than $0.01 per word.
    Page 2, “Related work”


translation task

Appears in 3 sentences as: translation task (3)
  1. 52 different Turkers took part in the translation task , each translating 138 sentences on average.
    Page 3, “Crowdsourcing Translation”
  2. The problem definition of the crowdsourcing translation task is straightforward: given a set of candidate translations for a source sentence, we want to choose the best output translation.
    Page 4, “Problem Formulation”
  3. This suggests that both sources of information— the candidate itself and its authors— are important for the crowdsourcing translation task .
    Page 9, “Evaluation”
