Collecting Highly Parallel Data for Paraphrase Evaluation
Chen, David and Dolan, William

Article Structure

Abstract

A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years.

Introduction

Machine paraphrasing has many applications for natural language processing tasks, including machine translation (MT), MT evaluation, summary evaluation, question answering, and natural language generation.

Related Work

Since paraphrase data are not readily available, various methods have been used to extract parallel text from other sources.

Data Collection

Since our goal was to collect large numbers of paraphrases quickly and inexpensively using a crowd, our framework was designed to make the tasks short, simple, easy, accessible and somewhat fun.

Paraphrase Evaluation Metrics

One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU, which has played a crucial role in driving progress in MT.

Experiments

To verify the usefulness of our paraphrase corpus and the BLEU/PINC metric, we built and evaluated several paraphrase systems and compared the automatic scores to human ratings of the generated paraphrases.

Discussions and Future Work

While our data collection framework yields useful parallel data, it also has some limitations.

Conclusion

We introduced a data collection framework that produces highly parallel data by asking different annotators to describe the same video segments.

Topics

BLEU

Appears in 28 sentences as: +BLEU (1) BLEU (32)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. In addition to the lack of standard datasets for training and testing, there are also no standard metrics like BLEU (Papineni et al., 2002) for evaluating paraphrase systems.
    Page 1, “Introduction”
  2. One of the limitations to the development of machine paraphrasing is the lack of standard metrics like BLEU, which has played a crucial role in driving progress in MT.
    Page 5, “Paraphrase Evaluation Metrics”
  3. Thus, researchers have been unable to rely on BLEU or some derivative: the optimal paraphrasing engine under these terms would be one that simply returns the input.
    Page 5, “Paraphrase Evaluation Metrics”
  4. To measure semantic equivalence, we simply use BLEU with multiple references.
    Page 5, “Paraphrase Evaluation Metrics”
  5. In addition to measuring semantic adequacy and fluency using BLEU, we also need to measure lexical dissimilarity with the source sentence.
    Page 5, “Paraphrase Evaluation Metrics”
  6. In essence, it is the inverse of BLEU since we want to minimize the number of n-gram overlaps between the two sentences.
    Page 5, “Paraphrase Evaluation Metrics”
  7. Also notice that we do not put additional constraints on sentence length: while extremely short and extremely long sentences are likely to score high on PINC, they still must maintain semantic adequacy as measured by BLEU.
    Page 6, “Paraphrase Evaluation Metrics”
  8. We use BLEU and PINC together as a 2-dimensional scoring metric.
    Page 6, “Paraphrase Evaluation Metrics”
  9. BLEU
    Page 6, “Experiments”
  10. As more training pairs are used, the model produces more varied sentences (PINC) but preserves the meaning less well (BLEU).
    Page 6, “Experiments”
  11. As a comparison, evaluating each human description as a paraphrase for the other descriptions in the same cluster resulted in a BLEU score of 52.9 and a PINC score of 77.2.
    Page 7, “Experiments”
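
The excerpts above describe using multi-reference BLEU to measure semantic adequacy and PINC to measure lexical dissimilarity, combined as a 2-dimensional score. The following is a minimal sketch of such a scorer, not the authors' code: the example sentences are invented, NLTK's sentence_bleu stands in for whichever BLEU implementation the paper used, and the set-based n-gram counting is one plausible reading of the PINC description.

# Sketch only: score a candidate paraphrase with multi-reference BLEU
# (semantic adequacy) and PINC (lexical dissimilarity from the source).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def pinc(source, candidate, max_n=4):
    """Average, over n = 1..max_n, of the fraction of candidate n-grams
    that do not appear in the source sentence (set-based approximation)."""
    src, cand = source.split(), candidate.split()
    per_n = []
    for n in range(1, max_n + 1):
        cand_ngrams = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
        src_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
        if cand_ngrams:
            per_n.append(1 - len(cand_ngrams & src_ngrams) / len(cand_ngrams))
    return sum(per_n) / len(per_n) if per_n else 0.0

# Invented example: one source sentence, its candidate paraphrase, and two
# reference descriptions of the same video segment.
source = "a man is playing a guitar"
candidate = "someone plays the guitar"
references = ["a person plays a guitar", "a man strums a guitar"]

bleu = sentence_bleu([r.split() for r in references], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}  PINC: {pinc(source, candidate):.3f}")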


human judgments

Appears in 15 sentences as: human judges (4) human judgments (12)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
    Page 1, “Abstract”
  2. Without these resources, researchers have resorted to developing their own small, ad hoc datasets (Barzilay and McKeown, 2001; Shinyama et al., 2002; Barzilay and Lee, 2003; Quirk et al., 2004; Dolan et al., 2004), and have often relied on human judgments to evaluate their results (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005).
    Page 1, “Introduction”
  3. Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments.
    Page 2, “Introduction”
  4. While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Bannard and Callison-Burch, 2005) or indirect, task-based methods (Lin and Pantel, 2001; Callison-Burch et al., 2006), there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems.
    Page 2, “Related Work”
  5. In addition, the metric was shown to correlate well with human judgments.
    Page 2, “Related Work”
  6. However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric.
    Page 2, “Related Work”
  7. While PEM was shown to correlate well with human judgments, it has some limitations.
    Page 5, “Paraphrase Evaluation Metrics”
  8. The average scores of the two human judges are shown in Table 3.
    Page 7, “Experiments”
  9. 5.2 Correlation with human judgments
    Page 7, “Experiments”
  10. Having established rough correspondences between BLEU/PINC scores and human judgments of semantic …
    Page 7, “Experiments”
  11. Empirically, BLEU correlates fairly well with human judgments of semantic equivalence, although still not as well as the inter-annotator agreement.
    Page 7, “Experiments”
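
The correlation analyses mentioned in the excerpts above compare automatic BLEU/PINC scores against human ratings. As a generic illustration only (not the paper's analysis, and with invented placeholder values), such a correlation can be computed with Pearson's r:

# Illustration only: correlating automatic scores with human ratings.
# The values below are invented placeholders, not data from the paper.
from scipy.stats import pearsonr

bleu_scores = [45.0, 38.2, 61.5, 29.7, 53.1]    # hypothetical per-item BLEU
human_ratings = [3.6, 3.0, 4.1, 2.5, 3.8]       # hypothetical adequacy ratings

r, p_value = pearsonr(bleu_scores, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")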


parallel sentences

Appears in 10 sentences as: parallel sentences (13)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. Another method for collecting monolingual paraphrase data involves aligning semantically parallel sentences from different news articles describing the same event (Shinyama et al., 2002; Barzilay and Lee, 2003; Dolan et al., 2004).
    Page 2, “Related Work”
  2. While utilizing multiple translations of literary work or multiple news stories of the same event can yield significant numbers of parallel sentences, this data tend to be noisy, and reliably identifying good paraphrases among all possible sentence pairs remains an open problem.
    Page 2, “Related Work”
  3. Figure 3: Evaluation of paraphrase systems trained on different numbers of parallel sentences.
    Page 6, “Experiments”
  4. Overall, all the trained models produce reasonable paraphrase systems, even the model trained on just 28K single parallel sentences.
    Page 6, “Experiments”
  5. Examples of the outputs produced by the models trained on single parallel sentences and on all parallel sentences are shown in Table 2.
    Page 6, “Experiments”
  6. Systems trained on fewer parallel sentences are more conservative and make fewer mistakes.
    Page 7, “Experiments”
  7. On the other hand, systems trained on more parallel sentences often produce very good paraphrases but are also more likely to diverge from the original meaning.
    Page 7, “Experiments”
  8. We randomly selected 200 source sentences and generated 2 paraphrases for each, representing the two extremes: one paraphrase produced by the model trained with single parallel sentences, and the other by the model trained with all parallel sentences.
    Page 7, “Experiments”
  9. The results confirm our finding that the system trained with single parallel sentences preserves the meaning better but is also more conservative.
    Page 7, “Experiments”
  10. Table 3: Average human ratings of the systems trained on single parallel sentences and on all parallel sentences.
    Page 7, “Experiments”


machine translation

Appears in 6 sentences as: Machine Translation (1) machine translation (5)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years.
    Page 1, “Abstract”
  2. Machine paraphrasing has many applications for natural language processing tasks, including machine translation (MT), MT evaluation, summary evaluation, question answering, and natural language generation.
    Page 1, “Introduction”
  3. Despite the similarities between paraphrasing and translation, several major differences have prevented researchers from simply following standards that have been established for machine translation.
    Page 1, “Introduction”
  4. Professional translators produce large volumes of bilingual data according to a more or less consistent specification, indirectly fueling work on machine translation algorithms.
    Page 1, “Introduction”
  5. … for researchers working on paraphrasing to compare system performance and exploit the kind of automated, rapid training-test cycle that has driven work on Statistical Machine Translation.
    Page 2, “Introduction”
  6. In addition to paraphrasing, our data collection framework could also be used to produce useful data for machine translation and computer vision.
    Page 9, “Discussions and Future Work”


Mechanical Turk

Appears in 6 sentences as: Mechanical Turk (6)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. We designed our data collection framework for use on crowdsourcing platforms such as Amazon’s Mechanical Turk.
    Page 2, “Related Work”
  2. We deployed the task on Amazon’s Mechanical Turk, with video segments selected from YouTube.
    Page 3, “Data Collection”
  3. Figure 1: A screenshot of our annotation task as it was deployed on Mechanical Turk.
    Page 3, “Data Collection”
  4. We deployed our data collection framework on Mechanical Turk over a two-month period from July to September in 2010, collecting 2,089 video segments and 85,550 English descriptions.
    Page 4, “Data Collection”
  5. We randomly selected a thousand sentences from our data and collected two paraphrases of each using Mechanical Turk.
    Page 8, “Experiments”
  6. Deploying the framework on Mechanical Turk over a two-month period yielded 85K English descriptions for 2K videos, one of the largest paraphrase data resources publicly available.
    Page 9, “Conclusion”


evaluation metric

Appears in 5 sentences as: Evaluation Metric (1) evaluation metric (2) evaluation metrics (2)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years.
    Page 1, “Abstract”
  2. However, a lack of standard datasets and automatic evaluation metrics has impeded progress in the field.
    Page 1, “Introduction”
  3. Second, we define a new evaluation metric, PINC (Paraphrase In N-gram Changes), that relies on simple BLEU-like n-gram comparisons to measure the degree of novelty of automatically generated paraphrases.
    Page 1, “Introduction”
  4. The more recently proposed metric PEM (Paraphrase Evaluation Metric) (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n-gram similarity between candidate and reference sentences.
    Page 2, “Related Work”
  5. A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences.
    Page 6, “Paraphrase Evaluation Metrics”


n-gram

Appears in 5 sentences as: n-gram (5)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates.
    Page 1, “Abstract”
  2. Thus, no n-gram overlaps are required to determine the semantic adequacy of the paraphrase candidates.
    Page 5, “Paraphrase Evaluation Metrics”
  3. In essence, it is the inverse of BLEU since we want to minimize the number of n-gram overlaps between the two sentences.
    Page 5, “Paraphrase Evaluation Metrics”
  4. where N is the maximum n-gram considered and n-gram_s and n-gram_c are the lists of n-grams in the source and candidate sentences.
    Page 5, “Paraphrase Evaluation Metrics”
  5. A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences.
    Page 6, “Paraphrase Evaluation Metrics”
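
Excerpt 4 above truncates the definition it quotes; the PINC formula it refers to averages, over n-gram orders 1 through N, the fraction of candidate n-grams that do not appear in the source sentence. A reconstruction of that formula, using the n-gram_s / n-gram_c notation from the excerpts, is:

PINC(s, c) = \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{|\text{n-gram}_s \cap \text{n-gram}_c|}{|\text{n-gram}_c|} \right)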


n-grams

Appears in 4 sentences as: n-grams (5)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. We introduce a new scoring metric PINC that measures how many n-grams differ between the two sentences.
    Page 5, “Paraphrase Evaluation Metrics”
  2. … n-gram_s and n-gram_c are the lists of n-grams in the source and candidate sentences.
    Page 5, “Paraphrase Evaluation Metrics”
  3. The PINC score computes the percentage of n-grams that appear in the candidate sentence but not in the source sentence.
    Page 5, “Paraphrase Evaluation Metrics”
  4. … candidates for introducing new n-grams but not for omitting n-grams from the original sentence.
    Page 6, “Paraphrase Evaluation Metrics”
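
As a concrete illustration of the percentage computation described in the excerpts above (with invented sentences, and only unigrams and bigrams, i.e. N = 2, for brevity): for the source "the man is playing the guitar" and the candidate "a man plays the guitar", 2 of the 5 distinct candidate unigrams (a, plays) and 3 of the 4 candidate bigrams (a man, man plays, plays the) do not appear in the source, giving

\text{PINC} = \frac{1}{2}\left(\frac{2}{5} + \frac{3}{4}\right) = 0.575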


parallel data

Appears in 4 sentences as: parallel data (4)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. We quantified the utility of our highly parallel data by computing the correlation between BLEU and human ratings when different numbers of references were available.
    Page 8, “Experiments”
  2. While our data collection framework yields useful parallel data, it also has some limitations.
    Page 9, “Discussions and Future Work”
  3. By pairing up descriptions of the same video in different languages, we obtain parallel data without requiring any bilingual skills.
    Page 9, “Discussions and Future Work”
  4. We introduced a data collection framework that produces highly parallel data by asking different annotators to describe the same video segments.
    Page 9, “Conclusion”


model trained

Appears in 3 sentences as: model trained (3) models trained (1)
In Collecting Highly Parallel Data for Paraphrase Evaluation
  1. Overall, all the trained models produce reasonable paraphrase systems, even the model trained on just 28K single parallel sentences.
    Page 6, “Experiments”
  2. Examples of the outputs produced by the models trained on single parallel sentences and on all parallel sentences are shown in Table 2.
    Page 6, “Experiments”
  3. We randomly selected 200 source sentences and generated 2 paraphrases for each, representing the two extremes: one paraphrase produced by the model trained with single parallel sentences, and the other by the model trained with all parallel sentences.
    Page 7, “Experiments”
