Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
Bhat, Suma and Xue, Huichao and Yoon, Su-Youn

Article Structure

Abstract

Designing measures that capture various aspects of language ability is a central task in the design of systems for automatic scoring of spontaneous speech.

Introduction

Assessment of a speaker’s proficiency in a second language is the main task in the domain of automatic evaluation of spontaneous speech (Zechner et al., 2009).

Related Work

Speaking in a nonnative language requires diverse abilities, including fluency, pronunciation, intonation, grammar, vocabulary, and discourse.

Shallow-analysis approach to measuring syntactic complexity

The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence.

Models for Measuring Grammatical Competence

We mentioned that the measure proposed in this study is derived from assumptions similar to those studied by Yoon and Bhat (2012).

Experimental Setup

Our experiments seek answers to the following questions.

Experimental Results

First, we compare the discriminative ability of the two measures of syntactic complexity (the VSM-based measure and the MaxEnt-based measure) across proficiency levels.

Discussions

We now discuss some of the observations and results of our study with respect to the following items.

Conclusions

Seeking alternatives to measuring syntactic complexity of spoken responses via syntactic parsers, we study a shallow-analysis based approach for use in automatic scoring.

Topics

bigrams

Appears in 22 sentences as: bigram (4) bigrams (21)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. In order to avoid the problems encountered with deep analysis-based measures, Yoon and Bhat (2012) explored a shallow analysis-based approach, based on the assumption that the level of grammar sophistication at each proficiency level is reflected in the distribution of part-of-speech (POS) tag bigrams.
    Page 3, “Related Work”
  2. The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence.
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  3. In a shallow-analysis approach to measuring syntactic complexity, we rely on the distribution of POS bigrams at every proficiency level.
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  4. Consider the two sentence fragments below taken from actual responses (the bigrams of interest and their associated POS tags are boldfaced).
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  5. Notice how these grammatical expressions (one erroneous and the other sophisticated) can be detected by the POS bigrams “MD-TO” and “NN-
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  6. The result is a new measure based on POS bigrams to assess ESL learners’ mastery of syntactic complexity.
    Page 4, “Shallow-analysis approach to measuring syntactic complexity”
  7. Then, regarding POS bigrams as terms, they construct POS-based vector space models for each score-class (there are four score classes denoting levels of proficiency as will be explained in Section 5.2), thus yielding four score-specific vector-space models (VSMs).
    Page 4, “Models for Measuring Grammatical Competence”
  8. • cos4: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and,
    Page 4, “Models for Measuring Grammatical Competence” (this cos4 scoring is sketched in code after this list)
  9. First, the VSM-based method is likely to overestimate the contribution of the POS bigrams when highly correlated bigrams occur as terms in the VSM.
    Page 4, “Models for Measuring Grammatical Competence”
  10. Consider the presence of a grammar pattern represented by more than one POS bigram.
    Page 4, “Models for Measuring Grammatical Competence”
  11. However, we note that the two bigrams are correlated and including them both results in an overestimation of their contribution.
    Page 4, “Models for Measuring Grammatical Competence”
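
Taken together, excerpts 7-9 describe the baseline VSM pipeline: POS bigrams act as terms, one tf-idf-weighted vector is built per score class, and a test response is scored by its cosine similarity to the level-4 vector (the cos4 feature). The sketch below is a minimal reconstruction under stated assumptions: the excerpts do not give the exact tf-idf variant, so a standard log-idf weighting is used, and the function names and toy data are invented for illustration.

```python
import math
from collections import Counter

def pos_bigrams(tags):
    """Turn a POS tag sequence into bigram terms like 'MD-TO'."""
    return [f"{a}-{b}" for a, b in zip(tags, tags[1:])]

def tfidf_vector(class_responses, all_classes):
    """tf-idf-weighted POS-bigram vector for one score class.

    tf: bigram counts pooled over the class's responses; idf: log(N/df)
    over the N=4 score classes. (Assumed variant; the paper's exact
    weighting scheme is not reproduced in the excerpts above.)
    """
    tf = Counter(bg for resp in class_responses for bg in pos_bigrams(resp))
    n = len(all_classes)
    def df(bg):  # number of score classes in which the bigram occurs
        return sum(any(bg in pos_bigrams(r) for r in cls) for cls in all_classes)
    return {bg: count * math.log(n / df(bg)) for bg, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented POS-tagged responses grouped by score class 1..4.
classes = [
    [["PRP", "MD", "TO", "VB"]],    # level 1: contains the erroneous MD-TO pattern
    [["PRP", "MD", "VB", "NN"]],    # level 2
    [["PRP", "VBZ", "VBG", "NN"]],  # level 3
    [["NN", "VBG", "NNS", "VBZ"]],  # level 4
]
vsms = [tfidf_vector(cls, classes) for cls in classes]

# Score a test response by its similarity to the highest score class.
test = dict(Counter(pos_bigrams(["NN", "VBG", "NNS"])))
cos4 = cosine(test, vsms[3])
print(f"cos4 = {cos4:.3f}")
```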

MaxEnt

Appears in 15 sentences as: MaXEnt (2) MaxEnt (14)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. The inductive classifier we use here is the maximum-entropy model ( MaxEnt ) which has been used to solve several statistical natural language processing problems with much success (Berger et al., 1996; Borthwick et al., 1998; Borthwick, 1999; Pang et al., 2002; Klein et al., 2003; Rosenfeld, 2005).
    Page 5, “Models for Measuring Grammatical Competence”
  2. The productive feature engineering aspects of incorporating features into the discriminative MaxEnt classifier motivate the model choice for the problem at hand.
    Page 5, “Models for Measuring Grammatical Competence”
  3. In particular, the ability of the MaxEnt model’s estimation routine to handle overlapping (correlated) features makes it directly applicable to address the first limitation of the VSM model.
    Page 5, “Models for Measuring Grammatical Competence”
  4. The second limitation, related to the ineffective weighting of terms via the tf-idf scheme, seems to be addressed by the fact that the MaxEnt model assigns a weight to each feature (in our case, POS bigrams) on a
    Page 5, “Models for Measuring Grammatical Competence”
  5. Therefore, a MaxEnt model has an advantage over the model described in Section 4.1 in that it uses four different weighting schemes (one per score level), each optimized for its score level.
    Page 5, “Models for Measuring Grammatical Competence”
  6. Subsequently, the feature extraction stage (a VSM or a MaxEnt model, as the case may be) generates the syntactic complexity feature, which is then incorporated in a multiple linear regression model to generate a score.
    Page 6, “Experimental Setup”
  7. We used the maximum entropy classifier implementation in the MaxEnt toolkit.
    Page 6, “Experimental Setup”
  8. The results that follow are based on the MaxEnt classifier’s parameter settings initialized to zero.
    Page 6, “Experimental Setup”
  9. The ASR data set was used to train the MaxEnt classifier and the features generated from the SM data set were used for evaluation.
    Page 7, “Experimental Setup”
  10. For a given response, the MaxEnt classifier calculates the conditional probability of a score-class given the response, in turn yielding conditional probabilities of each score group given the observation — p_i for score group i ∈ {1, 2, 3, 4}.
    Page 7, “Experimental Setup” (one way to scalarize these posteriors is sketched after this list)
  11. This permits us to better represent the score assigned by the MaxEnt classifier as a relative preference over score assignments.
    Page 7, “Experimental Setup”
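
Excerpts 10-11 say the classifier’s per-level posteriors are kept as a relative preference over scores rather than collapsed to a hard 1-4 label. One natural scalarization, sketched below, is the probability-weighted average of the four levels; this exact formula is an assumption on my part, since the excerpts only state that a relative preference is preserved.

```python
def expected_score(posteriors):
    """Collapse per-level posteriors (p_1, ..., p_4) into one scalar.

    A probability-weighted average keeps the classifier's relative
    preference over score levels instead of a hard argmax label.
    (Assumed form; the excerpts above do not give the exact formula.)
    """
    assert abs(sum(posteriors) - 1.0) < 1e-6, "posteriors must sum to 1"
    return sum(level * p for level, p in enumerate(posteriors, start=1))

# A response the classifier finds mostly level-3-like:
print(expected_score([0.05, 0.15, 0.60, 0.20]))  # -> 2.95
```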

language acquisition

Appears in 9 sentences as: language acquisition (10)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. Second, the measure makes sense theoretically, both from algorithmic and native language acquisition points of view.
    Page 1, “Abstract”
  2. Prior studies in language acquisition and second language research have conclusively shown that proficiency in a second language is characterized by several factors, some of which are fluency in language production, pronunciation accuracy, choice of vocabulary, and grammatical sophistication and accuracy.
    Page 1, “Introduction”
  3. • In the domain of native language acquisition, the presence or absence of a grammatical structure indicates grammatical development.
    Page 2, “Introduction”
  4. Informed by studies in second language acquisition and language testing that regard these factors as key determiners of spoken language proficiency, some researchers have focused on the objective measurement of these aspects of spoken language in the context of automatic assessment of language ability.
    Page 2, “Related Work”
  5. The idea that the level of syntactic complexity (in terms of its range and sophistication) can be assessed based on the distribution of POS-tags is informed by prior studies in second language acquisition.
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  6. However, when considered in the context of language acquisition studies, this approach seems to be justified.
    Page 9, “Discussions”
  7. Studies in native language acquisition have considered multiple grammatical developmental indices that represent the grammatical levels reached at various stages of language acquisition.
    Page 9, “Discussions”
  8. Similarly, Scarborough (1990) proposed the Index of Productive Syntax (IPSyn), according to which the presence of particular grammatical structures, from a list of 60 structures (ranging from simple ones, such as including only subjects and verbs, to more complex constructions, such as conjoined sentences), is evidence of language acquisition milestones.
    Page 9, “Discussions”
  9. We also make an interesting observation that the impressionistic evaluation of syntactic complexity is better approximated by the presence or absence of grammar and usage patterns (and not by their frequency of occurrence), an idea supported by studies in native language acquisition.
    Page 9, “Conclusions”

POS tagger

Appears in 9 sentences as: POS tag (2) POS tagger (5) POS tagging (1) POS tags (2)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. The idea of capturing differences in POS tag distributions for classification has been explored in several previous studies.
    Page 3, “Related Work”
  2. In the area of text-genre classification, POS tag distributions have been found to capture genre differences in text (Feldman et al., 2009; Marin et al., 2009); in a language testing context, they have been used in grammatical error detection and essay scoring (Chodorow and Leacock, 2000; Tetreault and Chodorow, 2008).
    Page 3, “Related Work”
  3. Consider the two sentence fragments below taken from actual responses (the bigrams of interest and their associated POS tags are boldfaced).
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  4. The first stage, ASR, yields an automatic transcription, which is followed by the POS tagging stage.
    Page 6, “Experimental Setup”
  5. The steps for automatic assessment of overall proficiency follow an analogous process (either including the POS tagger or not), depending on the objective measure being evaluated.
    Page 6, “Experimental Setup”
  6. 5.3.2 POS tagger
    Page 6, “Experimental Setup”
  7. POS tags were generated using the POS tagger implemented in the OpenNLP toolkit.
    Page 6, “Experimental Setup” (a stand-in tagging pipeline is sketched after this list)
  8. This POS tagger was trained on about 528K word/tag pairs.
    Page 6, “Experimental Setup”
  9. However, due to the substantial amount of speech recognition errors in our data, the POS error rate (resulting from the combined errors of ASR and the automated POS tagger) is expected to be higher.
    Page 6, “Experimental Setup”
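
The paper’s pipeline tags ASR transcriptions with the OpenNLP tagger. As a rough stand-in (a substitution of mine, not the paper’s toolchain), the sketch below uses NLTK’s default English tagger and emits the bigram terms the models above consume:

```python
import nltk
# One-time model downloads (resource names may vary across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_bigram_terms(text):
    """Tokenize, POS-tag, and emit bigram terms such as 'MD-TO'."""
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return [f"{a}-{b}" for a, b in zip(tags, tags[1:])]

# The erroneous modal+infinitive pattern surfaces as MD-TO:
print(pos_bigram_terms("he can to go there"))
# e.g. ['PRP-MD', 'MD-TO', 'TO-VB', 'VB-RB']
```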

error rate

Appears in 5 sentences as: error rate (4) error rates (1)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. Automatic recognition of nonnative speakers’ spontaneous speech is a challenging task, as evidenced by the error rate of state-of-the-art recognizers.
    Page 2, “Related Work” (a minimal WER computation is sketched after this list)
  2. For instance, Chen and Zechner (2011) reported a 50.5% word error rate (WER) and Yoon and Bhat (2012) reported a 30% WER in the recognition of ESL students’ spoken responses.
    Page 3, “Related Work”
  3. These high error rates at the recognition stage negatively affect the subsequent stages of the speech scoring system in general and, in particular, deep syntactic analysis, which operates on a long sequence of words as its context.
    Page 3, “Related Work”
  4. A word error rate (WER) of 31% on the SM dataset was observed.
    Page 6, “Experimental Setup”
  5. However, due to the substantial amount of speech recognition errors in our data, the POS error rate (resulting from the combined errors of ASR and the automated POS tagger) is expected to be higher.
    Page 6, “Experimental Setup”
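
The WER figures quoted above (50.5%, 30%, 31%) follow the standard definition: word-level edit distance between the recognizer’s hypothesis and a reference transcript, divided by the reference length. A minimal, self-contained implementation for checking such numbers:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("he wants to go there", "he once to go"))  # -> 0.4
```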

maximum entropy

Appears in 5 sentences as: Maximum Entropy (1) maximum entropy (4)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. This is done by resorting to a maximum entropy model-based approach, to which we turn next.
    Page 5, “Models for Measuring Grammatical Competence”
  2. 5.3.4 Maximum Entropy Model Classifier
    Page 6, “Experimental Setup”
  3. We used the maximum entropy classifier implementation in the MaxEnt toolkit.
    Page 6, “Experimental Setup” (an off-the-shelf equivalent is sketched after this list)
  4. One straightforward way of using the maximum entropy classifier’s prediction for our case is to directly use its predicted score level — 1, 2, 3 or 4.
    Page 7, “Experimental Setup”
  5. Empirically, we show that the proposed measure, based on a maximum entropy classification, satisfied the constraints of the design of an objective measure to a high degree.
    Page 9, “Conclusions”
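
A MaxEnt classifier in this setup is equivalent to multinomial logistic regression over the POS-bigram features. The sketch below uses scikit-learn as an off-the-shelf substitute for the MaxEnt toolkit the paper used (a substitution of mine), with binary presence features, consistent with the conclusion excerpted elsewhere on this page that presence/absence outperforms frequency; the training data is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy responses represented as space-joined POS bigram terms (invented).
train_docs = ["MD-TO TO-VB", "MD-VB VB-NN", "VBZ-VBG VBG-NN", "NN-VBG VBG-NNS"]
train_scores = [1, 2, 3, 4]

# Binary presence features rather than raw counts.
vec = CountVectorizer(binary=True, token_pattern=r"\S+")
X = vec.fit_transform(train_docs)

# Multinomial logistic regression == a MaxEnt classifier over these features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_scores)

# Per-level posteriors p_1..p_4 for a new response, as in the excerpts above.
posteriors = clf.predict_proba(vec.transform(["NN-VBG VBG-NNS"]))[0]
print(dict(zip(clf.classes_, posteriors.round(3))))
```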

native language

Appears in 5 sentences as: native language (4) natural language (1)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. Second, the measure makes sense theoretically, both from algorithmic and native language acquisition points of view.
    Page 1, “Abstract”
  2. • In the domain of native language acquisition, the presence or absence of a grammatical structure indicates grammatical development.
    Page 2, “Introduction”
  3. The inductive classifier we use here is the maximum-entropy model (MaxEnt) which has been used to solve several statistical natural language processing problems with much success (Berger et al., 1996; Borthwick et al., 1998; Borthwick, 1999; Pang et al., 2002; Klein et al., 2003; Rosenfeld, 2005).
    Page 5, “Models for Measuring Grammatical Competence”
  4. Studies in native language acquisition have considered multiple grammatical developmental indices that represent the grammatical levels reached at various stages of language acquisition.
    Page 9, “Discussions”
  5. We also make an interesting observation that the impressionistic evaluation of syntactic complexity is better approximated by the presence or absence of grammar and usage patterns (and not by their frequency of occurrence), an idea supported by studies in native language acquisition.
    Page 9, “Conclusions”

statistically significant

Appears in 5 sentences as: statistically significant (5)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. In addition, including our proposed measure of syntactic complexity in an automatic scoring model results in a statistically significant performance gain over the state-of-the-art.
    Page 2, “Introduction”
  2. The correlation was approximately 0.1 higher in absolute value than that of cos4, which was the best-performing feature in the VSM-based model, and the difference is statistically significant.
    Page 8, “Experimental Results”
  3. We note that the performance gain of Base+mescore over Base, as well as over Base+cos4, is statistically significant at level α = 0.01.
    Page 8, “Experimental Results”
  4. The performance gain of Base+cos4 over Base, however, is not statistically significant at level α = 0.01.
    Page 8, “Experimental Results”
  5. Including the measure of syntactic complexity in an automatic scoring model resulted in statistically significant performance gains over the state-of-the-art.
    Page 9, “Conclusions”

cosine similarity

Appears in 3 sentences as: cosine similarity (3)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. The similarity between a test response and a score-specific vector is then calculated by a cosine similarity metric.
    Page 4, “Models for Measuring Grammatical Competence”
  2. Although a total of 4 cosine similarity scores (one per score group) were generated, only cos4 from among the four similarity scores, and cosmax,
    Page 4, “Models for Measuring Grammatical Competence”
  3. • cos4: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and,
    Page 4, “Models for Measuring Grammatical Competence”

highest score

Appears in 3 sentences as: highest score (3)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. • cos4: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and,
    Page 4, “Models for Measuring Grammatical Competence”
  2. The measure of syntactic complexity of a response, cos4, is its similarity to the highest score class.
    Page 4, “Models for Measuring Grammatical Competence”
  3. We used the ASR data set to train a POS-bigram VSM for the highest score class and generated cos4 and cosmax, reported in Yoon and Bhat (2012), for the SM data set as outlined in Section 4.1.
    Page 6, “Experimental Setup”

model training

Appears in 3 sentences as: model training (2) models trained (1)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. A distinguishing feature of the current study is that the measure is based on a comparison of characteristics of the test response to models trained on large amounts of data from each score point, as opposed to measures that are simply characteristics of the responses themselves (which is how measures have been considered in prior studies).
    Page 5, “Models for Measuring Grammatical Competence”
  2. The ASR set, with 47,227 responses, was used for ASR training and POS similarity model training.
    Page 6, “Experimental Setup”
  3. Although the skewed distribution limits the number of score-specific instances for the highest and lowest scores available for model training, we used the data without modifying the distribution since it is representative of responses in a large-scale language assessment scenario.
    Page 6, “Experimental Setup”

regression model

Appears in 3 sentences as: regression model (3)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. Subsequently, the feature extraction stage (a VSM or a MaxEnt model, as the case may be) generates the syntactic complexity feature, which is then incorporated in a multiple linear regression model to generate a score.
    Page 6, “Experimental Setup”
  2. As in prior studies, here too the level of agreement is evaluated by means of the weighted kappa measure as well as unrounded and rounded Pearson’s correlations between machine and human scores (since the output of the regression model can either be rounded or regarded as is).
    Page 7, “Experimental Setup” (these agreement metrics are sketched in code after this list)
  3. The results reported are averaged over a 5-fold cross validation of the multiple regression model, where 80% of the SM data
    Page 8, “Experimental Results”
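
Item 2’s agreement metrics (weighted kappa plus unrounded and rounded Pearson correlations between machine and human scores) can be computed with standard library calls. In the sketch below the scores are invented, and the quadratic kappa weighting is an assumption, since the excerpt does not say which weighting the paper used.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([1, 2, 2, 3, 4, 3, 2, 4])                     # rater scores
machine = np.array([1.4, 2.1, 2.6, 2.8, 3.7, 3.2, 1.9, 3.9])   # regression output

r_unrounded, _ = pearsonr(human, machine)
rounded = np.rint(machine).astype(int)
r_rounded, _ = pearsonr(human, rounded)

# Quadratic weighting is a common choice for ordinal score scales
# (assumed here; the excerpt does not specify the weighting).
kappa = cohen_kappa_score(human, rounded, weights="quadratic")

print(f"r (unrounded) = {r_unrounded:.3f}, r (rounded) = {r_rounded:.3f}, "
      f"weighted kappa = {kappa:.3f}")
```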

syntactic parsing

Appears in 3 sentences as: syntactic parsers (1) syntactic parsing (2)
In Shallow Analysis Based Assessment of Syntactic Complexity for Automated Speech Scoring
  1. Not surprisingly, Chen and Zechner (2011) studied measures of grammatical complexity via syntactic parsing and found that a Pearson’s correlation coefficient of 0.49 between syntactic complexity measures (derived from manual transcriptions) and proficiency scores was drastically reduced to near nonexistence when the measures were applied to ASR word hypotheses.
    Page 3, “Related Work”
  2. The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence.
    Page 3, “Shallow-analysis approach to measuring syntactic complexity”
  3. Seeking alternatives to measuring syntactic complexity of spoken responses via syntactic parsers, we study a shallow-analysis based approach for use in automatic scoring.
    Page 9, “Conclusions”
