A New Dataset and Method for Automatically Grading ESOL Texts
Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben

Article Structure

Abstract

We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts.

Introduction

The task of automated assessment of free text focuses on automatically analysing and assessing the quality of writing competence.

Cambridge Learner Corpus

The Cambridge Learner Corpus (CLC), developed as a collaborative project between Cambridge University Press and Cambridge Assessment, is a large collection of texts produced by English language learners from around the world, sitting Cambridge Assessment's English as a Second or Other Language (ESOL) examinations.

Approach

We treat automated assessment of ESOL text (see Section 2) as a rank preference learning problem (see Section 1).

Evaluation

In order to evaluate our AA system, we use two correlation measures, Pearson’s product-moment correlation coefficient and Spearman’s rank correlation coefficient (hereafter Pearson’s and Spearman’s correlation respectively).
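
Both measures compare the system's predicted scores against the examiners' marks: Pearson's measures linear association between the raw values, while Spearman's measures monotonic association between their ranks. A minimal sketch of computing both with scipy (the score arrays are illustrative toy data, not results from the paper):

    from scipy.stats import pearsonr, spearmanr

    predicted = [22.0, 31.5, 18.0, 27.0]   # system-assigned scores (toy data)
    examiner = [24.0, 30.0, 15.0, 29.0]    # examiners' marks (toy data)

    r, _ = pearsonr(predicted, examiner)     # linear association of raw scores
    rho, _ = spearmanr(predicted, examiner)  # monotonic association of ranks
    print(f"Pearson's r = {r:.3f}, Spearman's rho = {rho:.3f}")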

Validity tests

The practical utility of an AA system will depend strongly on its robustness to subversion by writers who understand something of its workings and attempt to exploit this to maximise their scores (independently of their underlying ability).

Previous work

In this section we briefly discuss a number of the more influential and/or better described approaches.

Conclusions and future work

Though many of the systems described in Section 6 have been shown to correlate well with examiners’ marks on test data in many experimental contexts, no cross-system comparisons are available because of the lack of a shared training and test dataset.

Topics

language model

Appears in 7 sentences as: language model (5), language models (2)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens.
    Page 4, “Approach”
  2. Next, we extend our language model with trigrams extracted from a subset of the texts contained in the CLC (CLC LM).
    Page 4, “Approach”
  3. As the CLC contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts to avoid introducing erroneous ones to our language model.
    Page 5, “Approach”
  4. A word trigram in test data is counted as an error if it is not found in the language model.
    Page 5, “Approach”
  5. We compute presence/absence efficiently using a Bloom filter encoding of the language models (Bloom, 1970).
    Page 5, “Approach”
  6. Extending our language model with frequent trigrams extracted from the CLC improves Pearson’s and Spearman’s correlation by 0.006 and 0.015 respectively.
    Page 5, “Evaluation”
  7. This suggests that there is room for improvement in the language models we developed to estimate the error-rate.
    Page 5, “Evaluation”
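
Items 1, 4 and 5 above together define the error-rate feature: word trigrams are collected from a large background corpus, encoded in a Bloom filter, and any test trigram absent from the filter counts as an error. A minimal sketch of that pipeline with a hand-rolled Bloom filter (the filter size, hash count, helper names and toy corpus are illustrative assumptions, not the paper's implementation):

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: k hashed bit positions per item.
        Lookups may give false positives (a few missed errors) but
        never false negatives, so presence tests stay sound."""
        def __init__(self, m_bits=8_000_000, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all((self.bits[p // 8] >> (p % 8)) & 1
                       for p in self._positions(item))

    def trigrams(tokens):
        return [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]

    def error_rate(script_tokens, lm):
        """Fraction of the script's word trigrams absent from the LM."""
        tris = trigrams(script_tokens)
        return sum(t not in lm for t in tris) / max(len(tris), 1)

    # Toy background corpus standing in for ukWaC:
    lm = BloomFilter()
    for t in trigrams("the cat sat on the mat".split()):
        lm.add(t)
    print(error_rate("the cat sat on a mat".split(), lm))  # 0.5: 2 of 4 unseen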

bigrams

Appears in 6 sentences as: bigrams (6)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. (a) Word unigrams (b) Word bigrams
    Page 4, “Approach”
  2. (a) PoS unigrams (b) PoS bigrams (c) PoS trigrams
    Page 4, “Approach”
  3. Word unigrams and bigrams are lower-cased and used in their inflected forms.
    Page 4, “Approach”
  4. PoS unigrams, bigrams and trigrams are extracted using the RASP tagger, which uses the CLAWS tagset.
    Page 4, “Approach”
  5. (a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence
    Page 7, “Validity tests”
  6. The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e.g. pass/fail, grades A-F) based on content and style features such as word unigrams and bigrams, sentence length, number of verbs, noun-verb pairs, etc.
    Page 8, “Previous work”
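
Snippets 3 and 4 above pin down how the n-gram features are built: word n-grams are lower-cased and kept in their inflected forms, while PoS n-grams come from the RASP tagger. A small sketch of the word n-gram counts (the tagger is not reproduced here; the helper name is illustrative):

    from collections import Counter

    def word_ngrams(tokens, n):
        """Lower-cased word n-gram counts over inflected forms."""
        toks = [t.lower() for t in tokens]
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    tokens = "The cat sat on the mat".split()
    unigrams = word_ngrams(tokens, 1)
    bigrams = word_ngrams(tokens, 2)
    print(unigrams[("the",)], bigrams[("the", "mat")])  # -> 2 1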

unigrams

Appears in 6 sentences as: unigrams (6)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. (a) Word unigrams (b) Word bigrams
    Page 4, “Approach”
  2. (a) PoS unigrams (b) PoS bigrams (c) PoS trigrams
    Page 4, “Approach”
  3. Word unigrams and bigrams are lower-cased and used in their inflected forms.
    Page 4, “Approach”
  4. PoS unigrams, bigrams and trigrams are extracted using the RASP tagger, which uses the CLAWS tagset.
    Page 4, “Approach”
  5. (a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence
    Page 7, “Validity tests”
  6. The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e.g. pass/fail, grades A-F) based on content and style features such as word unigrams and bigrams, sentence length, number of verbs, noun-verb pairs, etc.
    Page 8, “Previous work”

LM

Appears in 5 sentences as: LM (7)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. In order to estimate the error-rate, we build a trigram language model (LM) using ukWaC (ukWaC LM) (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens.
    Page 4, “Approach”
  2. feature                 Pearson's    Spearman's
     word ngrams             0.601        0.598
     +PoS ngrams             0.682        0.687
     +script length          0.692        0.689
     +PS rules               0.707        0.708
     +complexity             0.714        0.712
     Error-rate features:
     +ukWaC LM               0.735        0.758
     +CLC LM                 0.741        0.773
     +true CLC error-rate    0.751        0.789
    Page 5, “Approach”
  3. Next, we extend our language model with trigrams extracted from a subset of the texts contained in the CLC (CLC LM).
    Page 5, “Approach”
  4. feature removed         Pearson's    Spearman's
     none                    0.741        0.773
     word ngrams             0.713        0.762
     PoS ngrams              0.724        0.737
     script length           0.734        0.772
     PS rules                0.712        0.731
     complexity              0.738        0.760
     ukWaC+CLC LM            0.714        0.712
    Page 5, “Evaluation”
  5. In the experiments reported hereafter, we use the ukWaC+CLC LM to calculate the error-rate.
    Page 5, “Evaluation”
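
Snippet 3 above (see also item 3 under "language model") describes building the ukWaC+CLC LM: only trigrams that occur frequently in highly ranked CLC scripts are added, to avoid absorbing learner errors. A sketch of that filtering step, reusing the trigrams() helper from the earlier sketch (the score and frequency thresholds are illustrative assumptions, not the paper's values):

    from collections import Counter

    def extend_lm(ukwac_trigrams, clc_scripts, min_score=30.0, min_count=5):
        """Return the ukWaC+CLC trigram set: add only trigrams that recur
        across highly ranked CLC scripts (thresholds are illustrative)."""
        counts = Counter()
        for tokens, score in clc_scripts:   # (token list, examiner score) pairs
            if score >= min_score:          # keep highly ranked scripts only
                counts.update(trigrams(tokens))
        frequent = {t for t, c in counts.items() if c >= min_count}
        return set(ukwac_trigrams) | frequent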

feature set

Appears in 4 sentences as: Feature set (1), feature set (3)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. 3.2 Feature set
    Page 4, “Approach”
  2. Our full feature set is as follows:
    Page 4, “Approach”
  3. Although the above modifications do not exhaust the potential challenges a deployed AA system might face, they represent a threat to the validity of our system since we are using a highly related feature set.
    Page 7, “Validity tests”
  4. The addition of an incoherence metric to the feature set of an AA system has been shown to improve performance significantly (Miltsakaki and Kukich, 2000; Miltsakaki and Kukich, 2004).
    Page 9, “Conclusions and future work”
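
The incremental results table above (snippet 2 under "LM") names the groups making up the full feature set: word n-grams, PoS n-grams, script length, phrase-structure (PS) rules, complexity measures and the error-rate. A sketch of merging them into one sparse mapping per script, reusing word_ngrams() and error_rate() from the earlier sketches (the PS-rule and complexity inputs are assumed to arrive precomputed from a parser such as RASP):

    def feature_vector(tokens, ps_rules, complexity, lm):
        """One sparse feature mapping per script; keys are namespaced tuples.
        PoS n-grams would be added analogously from RASP tag sequences."""
        feats = {}
        for ng, c in (word_ngrams(tokens, 1) + word_ngrams(tokens, 2)).items():
            feats[("word",) + ng] = c
        for rule in ps_rules:                   # parser-derived PS rule names
            feats[("rule", rule)] = feats.get(("rule", rule), 0) + 1
        feats[("length",)] = len(tokens)        # script length
        feats[("complexity",)] = complexity     # e.g. a parse-complexity measure
        feats[("error_rate",)] = error_rate(tokens, lm)
        return feats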

feature weights

Appears in 3 sentences as: Feature weights (1), feature weights (2)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. One of the main approaches adopted by previous systems involves the identification of features that measure writing skill, and then the application of linear or stepwise regression to find optimal feature weights so that the correlation with manually assigned scores is maximised.
    Page 6, “Evaluation”
  2. Linear regression is used to assign optimal feature weights that maximise the correlation with the examiner’s scores.
    Page 8, “Previous work”
  3. Feature weights and/or scores can be fitted to a marking scheme by stepwise or linear regression.
    Page 8, “Previous work”
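
All three snippets describe the regression route: choose feature weights by linear (or stepwise) regression so that the predicted scores track the examiners' marks as closely as possible. A least-squares sketch with numpy (the toy feature matrix and marks are illustrative):

    import numpy as np

    # X: one row of feature values per script; y: examiner marks (toy data).
    X = np.array([[3.0, 1.0], [5.0, 0.5], [8.0, 0.1]])
    y = np.array([15.0, 25.0, 38.0])

    # Append a bias column and solve the least-squares problem Xb @ w ~= y.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    predicted = Xb @ w
    print(np.corrcoef(predicted, y)[0, 1])  # Pearson's correlation with marks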

machine learning

Appears in 3 sentences as: machine learning (3)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts.
    Page 1, “Abstract”
  2. Different techniques have been used, including cosine similarity of vectors representing text in various ways (Attali and Burstein, 2006), often combined with dimensionality reduction techniques such as Latent Semantic Analysis (LSA) (Landauer et al., 2003), generative machine learning models (Rudner and Liang, 2002), domain-specific feature extraction (Attali and Burstein, 2006), and/or modified syntactic parsers (Lonsdale and Strong-Krause, 2003).
    Page 1, “Introduction”
  3. We address automated assessment as a supervised discriminative machine learning problem and particularly as a rank preference problem (Joachims, 2002).
    Page 2, “Introduction”

SVM

Appears in 3 sentences as: SVM (4)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. In this paper, we report experiments on rank preference Support Vector Machines (SVMs) trained on a relatively small amount of data, on identification of appropriate feature types derived automatically from generic text processing tools, on comparison with a regression SVM model, and on the robustness of the best model to ‘outlier’ texts.
    Page 2, “Introduction”
  2. In its basic form, a binary SVM classifier learns a linear threshold function that discriminates data points of two categories.
    Page 3, “Approach”
  3. We trained an SVM regression model with our full set of feature types and compared it to the SVM rank preference model.
    Page 6, “Evaluation”
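
Snippets 1 and 3 contrast the rank preference model with SVM regression. Rank preference training in the style of Joachims (2002) can be emulated with a pairwise transform: every pair of scripts with different marks yields a signed difference vector, and a linear SVM learns to classify its sign. A toy sketch using scikit-learn's LinearSVC as a stand-in for a dedicated ranking SVM (the data, balancing scheme and hyperparameters are illustrative, not the paper's setup):

    import numpy as np
    from itertools import combinations
    from sklearn.svm import LinearSVC

    def pairwise_transform(X, scores):
        """Each unequal-score pair yields a difference vector whose
        sign encodes which script received the higher mark."""
        diffs, labels = [], []
        for k, (i, j) in enumerate(combinations(range(len(X)), 2)):
            if scores[i] == scores[j]:
                continue
            d, y = X[i] - X[j], 1 if scores[i] > scores[j] else -1
            if k % 2:               # flip alternate pairs to balance classes
                d, y = -d, -y
            diffs.append(d)
            labels.append(y)
        return np.array(diffs), np.array(labels)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))                       # 40 scripts, 5 features
    scores = X @ np.array([2.0, 1.0, 0.0, 0.0, -1.0])  # toy "examiner" marks

    D, y = pairwise_transform(X, scores)
    ranker = LinearSVC(C=1.0).fit(D, y)
    ranking = X @ ranker.coef_.ravel()  # higher value = predicted better script

The learned weight vector induces a full ordering over scripts, which is the sense in which the model optimises ranks rather than absolute grade points.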

text classification

Appears in 3 sentences as: text classification (3)
In A New Dataset and Method for Automatically Grading ESOL Texts
  1. Implicitly or explicitly, previous work has mostly treated automated assessment as a supervised text classification task, where training texts are labelled with a grade and unlabelled test texts are fitted to the same grade point scale via a regression step applied to the classifier output (see Section 6 for more details).
    Page 1, “Introduction”
  2. Discriminative classification techniques often outperform non-discriminative ones in the context of text classification (Joachims, 1998).
    Page 2, “Introduction”
  3. This system shows that treating AA as a text classification problem is viable, but the feature types are all fairly shallow, and the approach doesn’t make efficient use of the training data as a separate classifier is trained for each grade point.
    Page 8, “Previous work”
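
Snippet 1 frames most previous work as supervised text classification with a regression step, and item 6 under "bigrams" describes BETSY's Naive Bayes variant of that framing. A scikit-learn sketch of grading as multinomial Naive Bayes classification over word unigram and bigram counts (the toy texts and grade labels are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["I has went to school yesterday",
             "I went to school yesterday",
             "Yesterday I attended a lecture at the university"]
    grades = ["fail", "pass", "pass"]  # toy grade classes (e.g. pass/fail)

    # Word unigram + bigram counts, echoing BETSY's content/style features.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(texts, grades)
    print(clf.predict(["She go to school every day"]))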
