Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
Chambers, Nathanael and Jurafsky, Daniel

Article Structure

Abstract

This paper improves the use of pseudo-words as an evaluation framework for selectional preferences.

Introduction

For many natural language processing (NLP) tasks, particularly those involving meaning, creating labeled test data is difficult or expensive.

History of Pseudo-Word Disambiguation

Pseudo-words were introduced simultaneously by two papers studying statistical approaches to word sense disambiguation (WSD).

How Frequent is Unseen Data?

Most NLP tasks evaluate their entire datasets, but as described above, most selectional preference evaluations have focused only on unseen data.

How to Select a Confounder

Given a test set S of pairs (v_d, n) ∈ S, we now address how best to select a confounder n'.
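
The Results section below contrasts two confounder strategies: completely random choice versus a confounder of neighboring corpus frequency. A minimal sketch of both, assuming a simple dictionary-based frequency table; the function and its signature are ours, not the paper's:

    import random

    def choose_confounder(noun, freq, strategy="neighbor"):
        """Pick a confounder n' for a test pair (v_d, n).

        freq maps nouns to corpus frequencies. "random" draws
        uniformly from the vocabulary; "neighbor" picks the noun
        whose corpus frequency is closest to that of n itself.
        """
        candidates = [w for w in freq if w != noun]
        if strategy == "random":
            return random.choice(candidates)
        # Neighboring frequency: minimize |freq(w) - freq(n)|.
        target = freq[noun]
        return min(candidates, key=lambda w: abs(freq[w] - target))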

Models

5.1 A New Baseline

Experiments

Our training data is the NYT section of the Gigaword Corpus, parsed into dependency graphs.

Results

Results are given for the two dimensions: confounder choice and training size.

Discussion

Confounder Choice: Performance is strongly influenced by the method used when choosing confounders.

Conclusion

Current performance on various natural language tasks is being judged and published based on pseudo-word evaluations.

Topics

conditional probability

Appears in 14 sentences as: conditional probabilities (1) Conditional probability (2) conditional probability (11)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. We propose a conditional probability baseline: P(n | v, d) = C(v_d, n) / C(v_d).
    Page 5, “Models”
  2. (1999) showed that corpus frequency and conditional probability correlate with human decisions of adjective-noun plausibility, and Dagan et al.
    Page 5, “Models”
  3. (1999) appear to propose a very similar baseline for verb-noun selectional preferences, but the paper evaluates unseen data, and so the conditional probability model is not studied.
    Page 5, “Models”
  4. If conditional probability is a reasonable baseline, better performance may just require more data.
    Page 5, “Models”
  5. Thus, we use conditional probability as defined in the previous section, but define the count C(v_d, n) as the number of times v and n (ignoring d) appear in the same n-gram.
    Page 5, “Models”
  6. The conditional probability is reported as Baseline.
    Page 6, “Experiments”
  7. The conditional probability Baseline falls from 91.5 to 79.5, a 12% absolute drop from completely random to neighboring frequency.
    Page 6, “Results”
  8. Accuracy is the same as recall when the model does not guess between pseudo-words that have the same conditional probabilities.
    Page 7, “Results”
  9. Training Size: Training data improves the conditional probability baseline, but does not help the smoothing model.
    Page 7, “Discussion”
  10. We optimized argument cutoffs for each training size, but the model still appears to suffer from additional noise that the conditional probability baseline does not.
    Page 7, “Discussion”
  11. High Precision Baseline: Our conditional probability baseline is very precise.
    Page 7, “Discussion”
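
A minimal sketch of the baseline proposed in item 1 above, assuming maximum-likelihood estimates over (verb, dependency, noun) triples; the class and method names are ours, and the abstain-on-ties behavior mirrors the accuracy/recall remark in item 8:

    from collections import defaultdict

    class ConditionalProbabilityBaseline:
        """P(n | v, d) = C(v_d, n) / C(v_d), estimated from
        (verb, dependency, noun) triples seen in training."""

        def __init__(self):
            self.pair_counts = defaultdict(int)  # C(v_d, n)
            self.slot_counts = defaultdict(int)  # C(v_d)

        def train(self, triples):
            for v, d, n in triples:
                self.pair_counts[(v, d, n)] += 1
                self.slot_counts[(v, d)] += 1

        def prob(self, v, d, n):
            total = self.slot_counts[(v, d)]
            return self.pair_counts[(v, d, n)] / total if total else 0.0

        def choose(self, v, d, n1, n2):
            # Pseudo-word test: prefer the argument with the higher
            # conditional probability; abstain (None) on ties.
            p1, p2 = self.prob(v, d, n1), self.prob(v, d, n2)
            if p1 == p2:
                return None
            return n1 if p1 > p2 else n2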


n-gram

Appears in 6 sentences as: n-gram (7)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. The third line across the bottom of the figure is the number of unseen pairs using Google n-gram data as proxy argument counts.
    Page 4, “How Frequent is Unseen Data?”
  2. Creating argument counts from n-gram counts is described in detail below in section 5.2.
    Page 4, “How Frequent is Unseen Data?”
  3. Using the Google n-gram corpus, we recorded all verb-noun co-occurrences, defined by appearing in any order in the same n-gram, up to and including 5-grams.
    Page 5, “Models”
  4. For instance, the test pair (throw_subject, ball) is considered seen if there exists an n-gram such that throw and ball are both included.
    Page 5, “Models”
  5. C(v_d, n) as the number of times v and n (ignoring d) appear in the same n-gram.
    Page 5, “Models”
  6. The Google n-gram backoff model is almost as good as backing off to the Erk smoothing model.
    Page 7, “Results”
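
A rough sketch of the counting scheme described in items 3-5 above, assuming the n-gram data arrives as (tokens, count) pairs. Processing only the highest-order n-grams is one plausible way to avoid the double-counting mentioned under "n-grams" below, though the paper's exact deduplication is not spelled out here:

    from collections import defaultdict
    from itertools import product

    def ngram_cooccurrence_counts(ngrams, verbs, nouns):
        """Approximate C(v, n): count a (verb, noun) pair once per
        n-gram in which both appear, in either order, ignoring the
        dependency label d."""
        counts = defaultdict(int)
        for tokens, count in ngrams:
            vs = {t for t in tokens if t in verbs}
            ns = {t for t in tokens if t in nouns}
            for v, n in product(vs, ns):
                counts[(v, n)] += count
        return counts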


development set

Appears in 5 sentences as: development set (5)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. We then record every (v_d, n) pair that is seen two or more times during training and then count the number of unseen pairs in the NYT development set (1455 tests).
    Page 3, “How Frequent is Unseen Data?”
  2. Figure 1: Percentage of NYT development set that is unseen when trained on varying amounts of data.
    Page 3, “How Frequent is Unseen Data?”
  3. Figure 2: Percentage of subject/object/preposition arguments in the NYT development set that is unseen when trained on varying amounts of NYT data.
    Page 3, “How Frequent is Unseen Data?”
  4. Corpus counts covered 2 years of the AP section, and we used the development set of the NYT section to extract the seen and unseen pairs.
    Page 4, “How Frequent is Unseen Data?”
  5. We randomly chose 9 documents from the year 2001 for a development set, and 41 documents for testing.
    Page 5, “Experiments”
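
A minimal sketch of the unseen-rate computation behind the figures cited in items 1-3 above, assuming training and development data are iterables of (v, d, n) triples; the function name and threshold handling are ours:

    from collections import Counter

    def unseen_rate(train_triples, dev_triples, min_count=2):
        """Fraction of development pairs never seen at least
        min_count times in training."""
        counts = Counter(train_triples)
        seen = {t for t, c in counts.items() if c >= min_count}
        return sum(1 for t in dev_triples if t not in seen) / len(dev_triples)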


word sense

Appears in 4 sentences as: word sense (4)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. While pseudo-words originally evaluated word sense disambiguation, they are now commonly used to evaluate selectional preferences.
    Page 1, “Abstract”
  2. One way to mitigate this problem is with pseudo-words, a method for automatically creating test corpora without human labeling, originally proposed for word sense disambiguation (Gale et al., 1992).
    Page 1, “Introduction”
  3. While pseudo-words are now less often used for word sense disambiguation, they are a common way to evaluate selectional preferences, models that measure the strength of association between a predicate and its argument filler, e.g., that the noun lunch is a likely object of eat.
    Page 1, “Introduction”
  4. Pseudo-words were introduced simultaneously by two papers studying statistical approaches to word sense disambiguation (WSD).
    Page 2, “History of Pseudo-Word Disambiguation”


n-grams

Appears in 3 sentences as: n-grams (3)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. The dotted line uses Google n-grams as training.
    Page 3, “How Frequent is Unseen Data?”
  2. We also avoided over-counting co-occurrences in lower-order n-grams that appear again in 4- or 5-grams.
    Page 5, “Models”
  3. For the web baseline (reported as Google), we stemmed all words in the Google n-grams and counted every verb v and noun n that appear in Gigaword.
    Page 6, “Experiments”


semantic role

Appears in 3 sentences as: semantic role (2) semantic roles (1)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. Selectional preferences are useful for NLP tasks such as parsing and semantic role labeling (Zapirain et al., 2009).
    Page 1, “Introduction”
  2. Our reported results include every (v_d, n) in the data, not a subset of particular semantic roles.
    Page 8, “Discussion”
  3. Conditional probability is thus a very strong starting point if selectional preferences are an internal piece to a larger application, such as semantic role labeling or parsing.
    Page 8, “Discussion”


sense disambiguation

Appears in 3 sentences as: sense disambiguation (3)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. While pseudo-words originally evaluated word sense disambiguation, they are now commonly used to evaluate selectional preferences.
    Page 1, “Abstract”
  2. One way to mitigate this problem is with pseudo-words, a method for automatically creating test corpora without human labeling, originally proposed for word sense disambiguation (Gale et al., 1992).
    Page 1, “Introduction”
  3. Pseudo-words were introduced simultaneously by two papers studying statistical approaches to word sense disambiguation (WSD).
    Page 2, “History of Pseudo-Word Disambiguation”


Statistical significance

Appears in 3 sentences as: Statistical significance (1) statistical significance (1) statistically significant (1)
In Improving the Use of Pseudo-Words for Evaluating Selectional Preferences
  1. Statistical significance tests were calculated using the approximate randomization test (Yeh, 2000) with 1000 iterations.
    Page 6, “Results”
  2. * indicates statistical significance with the column’s Baseline at the p < 0.01 level, † at p < 0.05.
    Page 7, “Results”
  3. All numbers are statistically significant* with p-value < 0.01 compared to the number to their left.
    Page 7, “Results”
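
A sketch of the approximate randomization test as commonly implemented following Yeh (2000): repeatedly swap each paired per-example outcome with probability 0.5 and check how often the shuffled difference is at least as large as the observed one. The function name and add-one p-value smoothing are our choices:

    import random

    def approx_randomization_test(correct_a, correct_b, iters=1000, seed=0):
        """Two-sided p-value for the accuracy difference between two
        systems, given paired 0/1 correctness lists."""
        rng = random.Random(seed)
        observed = abs(sum(correct_a) - sum(correct_b))
        at_least = 0
        for _ in range(iters):
            diff = 0
            for a, b in zip(correct_a, correct_b):
                if rng.random() < 0.5:  # randomly swap the pair
                    a, b = b, a
                diff += a - b
            if abs(diff) >= observed:
                at_least += 1
        # Add-one smoothing keeps the Monte Carlo p-value nonzero.
        return (at_least + 1) / (iters + 1)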
