Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
Mohler, Michael and Bunescu, Razvan and Mihalcea, Rada

Article Structure

Abstract

In this work we address the task of computer-assisted assessment of short student answers.

Introduction

One of the most important aspects of the learning process is the assessment of the knowledge acquired by the learner.

Related Work

Several state-of-the-art short answer grading systems (Sukkarieh et al., 2004; Mitchell et al., 2002) require manually crafted patterns which, if matched, indicate that a question has been answered correctly.

Answer Grading System

We use a set of syntax-aware graph alignment features in a three-stage pipelined approach to short answer grading, as outlined in Figure 1.

Data Set

To evaluate our method for short answer grading, we created a data set of questions from introductory computer science assignments with answers provided by a class of undergraduate students.

Results

We independently test two components of our overall grading system: the node alignment detection scores found by training the perceptron, and the overall grades produced in the final stage.

Discussion and Conclusions

There are three things that we can learn from these experiments.

Topics

SVM

Appears in 13 sentences as: SVM (13)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. Using each of these as features, we use Support Vector Machines (SVM) to produce a combined real-number grade.
    Page 3, “Answer Grading System”
  2. The weight vector u is trained to optimize performance in two scenarios: Regression: An SVM model for regression (SVR) is trained using as target function the grades assigned by the instructors.
    Page 6, “Answer Grading System”
  3. Ranking: An SVM model for ranking (SVMRank) is trained using as ranking pairs all pairs of student answers (As, At) such that grade(Ai, As) > grade(Ai, At), where Ai is the corresponding instructor answer.
    Page 6, “Answer Grading System”
  4. At each grid point, the training data is partitioned into 5 folds which are used to train a temporary SVM model with the given parameters.
    Page 6, “Answer Grading System”
  5. We train the isotonic regression model on each type of system output (i.e., alignment scores, SVM output, BOW scores).
    Page 6, “Answer Grading System”
  6. 5.4 SVM Score Grading
    Page 8, “Results”
  7. The SVM components of the system are run on the full dataset using 12-fold cross-validation.
    Page 8, “Results”
  8. Both SVM models are trained using a linear kernel. Results from both the SVR and the SVMRank implementations are reported in Table 7 along with a selection of other measures.
    Page 8, “Results”
  9. [Table 7 header: Features | SVM Model | IR Model, each reported over ten folds]
    Page 8, “Results”
  10. The correlation for the BOW-only SVM model for SVMRank improved upon the best BOW feature
    Page 9, “Discussion and Conclusions”
  11. Likewise, using the BOW-only SVM model for SVR reduces the RMSE by .022 overall compared to the best BOW feature.
    Page 9, “Discussion and Conclusions”
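
As a rough illustration of the regression scenario in snippets 1-4, the sketch below trains an SVR with a linear kernel and selects its parameters by a grid search that partitions the training data into 5 folds. It uses scikit-learn rather than the authors' implementation; the feature matrix, grade scale, and parameter grid are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of the SVR grading stage, assuming scikit-learn.
# X holds one 30-dimensional feature vector per (instructor, student)
# answer pair; y holds the instructor-assigned grades.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((120, 30))            # placeholder feature vectors
y = rng.uniform(0.0, 5.0, size=120)  # placeholder grades (0-5 scale assumed)

# At each grid point the training data is partitioned into 5 folds,
# mirroring the parameter-selection step in snippet 4.
search = GridSearchCV(SVR(kernel="linear"),
                      param_grid={"C": [0.1, 1.0, 10.0],
                                  "epsilon": [0.01, 0.1]},
                      cv=5)
search.fit(X, y)
grades = search.predict(X)           # combined real-number grades
```

The ranking scenario of snippet 3 is not shown: SVMRank is a separate tool, and its training data would be the pairs of student answers (As, At) with grade(Ai, As) > grade(Ai, At) rather than the grades themselves.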

perceptron

Appears in 9 sentences as: Perceptron (2) perceptron (8)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. Following the same line of work in the textual entailment world are (Raina et al., 2005), (MacCartney et al., 2006), (de Marneffe et al., 2007), and (Chambers et al., 2007), which experiment variously with using diverse knowledge sources, using a perceptron to learn alignment decisions, and exploiting natural logic.
    Page 2, “Related Work”
  2. The scoring function is trained on a small set of manually aligned graphs using the averaged perceptron algorithm.
    Page 3, “Answer Grading System”
  3. In order to learn the parameter vector w, we use the averaged version of the perceptron algorithm (Freund and Schapire, 1999; Collins, 2002).
    Page 4, “Answer Grading System”
  4. The pseudocode for the learning algorithm is shown in Table 1. After training the perceptron, these 32 student answers are removed from the dataset, are not used for training further along in the pipeline, and are not included in the final results.
    Page 4, “Answer Grading System”
  5. Table 1: Perceptron Training for Node Matching.
    Page 4, “Answer Grading System”
  6. In addition, the student answers used to train the perceptron are removed from the pipeline after the perceptron training stage.
    Page 6, “Data Set”
  7. We independently test two components of our overall grading system: the node alignment detection scores found by training the perceptron, and the overall grades produced in the final stage.
    Page 7, “Results”
  8. 5.1 Perceptron Alignment
    Page 7, “Results”
  9. However, as the perceptron is designed to minimize error rate, this may not reflect an optimal objective when seeking to detect matches.
    Page 7, “Results”
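
Snippets 2-5 train the node-matching scorer with the averaged perceptron (Freund and Schapire, 1999; Collins, 2002). A minimal sketch under assumed inputs: X contains one feature vector per node pair, y holds ±1 match labels, and the epoch count is an arbitrary choice.

```python
# Averaged perceptron for binary match detection: accumulate the weight
# vector after every example and return the average, which is less
# sensitive to training-data order than the final weights.
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """X: (n, d) node-pair feature vectors; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)          # current weights
    w_sum = np.zeros(d)      # running sum for averaging
    for _ in range(epochs):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:  # mistake-driven update
                w += y[i] * X[i]
            w_sum += w
    return w_sum / (epochs * n)
```

Appending a constant bias feature to each vector makes the match/non-match threshold part of the learned weights; that is the threshold weight discussed under F-measure below.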

feature vector

Appears in 7 sentences as: feature vector (7)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We use ψ(xi, xs) to denote the feature vector associated with a pair of nodes (xi, xs), where xi is a node from the instructor answer Ai and xs is a node from the student answer As.
    Page 4, “Answer Grading System”
  2. For a given answer pair (Ai, As), we assemble the eight graph alignment scores into a feature vector
    Page 5, “Answer Grading System”
  3. We combine the alignment scores ψG(Ai, As) with the scores ψB(Ai, As) from the lexical semantic similarity measures into a single feature vector ψ(Ai, As) = [ψG(Ai, As) | ψB(Ai, As)].
    Page 5, “Answer Grading System”
  4. The feature vector ψG(Ai, As) contains the eight alignment scores found by applying the three transformations in the graph alignment stage.
    Page 5, “Answer Grading System”
  5. The feature vector ψB(Ai, As) consists of eleven semantic features (the eight knowledge-based features plus LSA, ESA, and a vector consisting only of tf*idf weights), both with and without question demoting.
    Page 5, “Answer Grading System”
  6. Thus, the entire feature vector ψ(Ai, As) contains a total of 30 features.
    Page 5, “Answer Grading System”
  7. We report the results of running the systems on three subsets of features ψ(Ai, As): BOW features ψB(Ai, As) only, alignment features ψG(Ai, As) only, or the full feature vector (labeled “Hybrid”).
    Page 8, “Results”
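
The arithmetic behind snippets 4-6 is plain concatenation: 8 alignment scores, plus 11 BOW scores computed twice (with and without question demoting), gives 30 features. A trivial sketch with invented score values:

```python
# Assembling psi(Ai, As) = [psi_G | psi_B]; all numbers are illustrative.
import numpy as np

psi_g = np.array([.8, .6, .7, .5, .9, .4, .6, .7])  # 8 graph alignment scores
bow = np.linspace(0.3, 0.8, 11)                     # 11 BOW similarity scores
bow_demoted = bow * 0.9                             # same, question-demoted
psi_b = np.concatenate([bow, bow_demoted])          # 22 BOW features
psi = np.concatenate([psi_g, psi_b])                # full hybrid vector
assert psi.shape == (30,)
```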

semantic similarity

Appears in 7 sentences as: Semantic Similarity (1) semantic similarity (6)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation.
    Page 1, “Abstract”
  2. Of these, 36 are based upon the semantic similarity
    Page 4, “Answer Grading System”
  3. 3.3 Lexical Semantic Similarity
    Page 5, “Answer Grading System”
  4. In order to address this, we combine the graph alignment scores, which encode syntactic knowledge, with the scores obtained from semantic similarity measures.
    Page 5, “Answer Grading System”
  5. Following Mihalcea et al. (2006) and Mohler and Mihalcea (2009), we use eight knowledge-based measures of semantic similarity: shortest path [PATH], Leacock & Chodorow (1998) [LCH], Lesk (1986), Wu & Palmer (1994) [WUP], Resnik (1995) [RES], Lin (1998), Jiang & Conrath (1997) [JCN], Hirst & St. Onge (1998) [HSO], and two corpus-based measures: Latent Semantic Analysis [LSA] (Landauer and Dumais, 1997) and Explicit Semantic Analysis [ESA] (Gabrilovich and Markovitch, 2007).
    Page 5, “Answer Grading System”
  6. Briefly, for the knowledge-based measures, we use the maximum semantic similarity — for each open-class word — that can be obtained by pairing it up with individual open-class words in the second input text.
    Page 5, “Answer Grading System”
  7. We combine the alignment scores ψG(Ai, As) with the scores ψB(Ai, As) from the lexical semantic similarity measures into a single feature vector ψ(Ai, As) = [ψG(Ai, As) | ψB(Ai, As)].
    Page 5, “Answer Grading System”
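
Snippet 6 describes a greedy aggregation: each open-class word contributes the maximum similarity it can achieve against the words of the other text. The sketch below uses NLTK's WordNet path similarity as a stand-in for the eight knowledge-based measures; tokenization, part-of-speech filtering, idf weighting, and sense handling are all simplified assumptions.

```python
# Greedy max-similarity text matching over WordNet (one direction only).
# Requires the WordNet corpus: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def max_pair_similarity(word, other_words):
    # Best path similarity over all synset pairs; None scores count as 0.
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(word)
              for other in other_words
              for s2 in wn.synsets(other)]
    return max(scores, default=0.0)

def text_similarity(text_a, text_b):
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    return sum(max_pair_similarity(w, words_b) for w in words_a) / len(words_a)

print(text_similarity("a stack stores elements", "a stack holds items"))
```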

F-measure

Appears in 6 sentences as: F-Measure (1) F-measure (5)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. For the alignment detection, we report the precision, recall, and F-measure associated with correctly detecting matches.
    Page 7, “Results”
  2. The threshold weight learned from the bias feature strongly influences the point at which real scores change from non-matches to matches, and given the threshold weight learned by the algorithm, we find an F-measure of 0.72, with precision (P) = 0.85 and recall (R) = 0.62.
    Page 7, “Results”
  3. By manually varying the threshold, we find a maximum F-measure of 0.76, with P=0.79 and R=0.74.
    Page 7, “Results”
  4. Figure 2 shows the full precision-recall curve with the F-measure overlaid.
    Page 7, “Results”
  5. [Figure 2 legend: Precision / Recall / F-Measure curves]
    Page 7, “Results”
  6. Figure 2: Precision, recall, and F-measure on node-level match detection
    Page 7, “Results”
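
Snippets 2 and 3 contrast the threshold implied by the learned bias weight (F = 0.72) with the best threshold found by a manual sweep (F = 0.76). The sweep itself is a few lines; the scores and gold labels below are placeholders.

```python
# Scan candidate thresholds over real-valued match scores and keep the
# one that maximizes F-measure against gold node alignments.
import numpy as np
from sklearn.metrics import f1_score

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])  # made-up match scores
gold = np.array([1, 0, 1, 0, 1, 0])                # made-up gold labels

best_t, best_f = 0.0, 0.0
for t in np.unique(scores):
    f = f1_score(gold, (scores >= t).astype(int))
    if f > best_f:
        best_t, best_f = t, f
```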

machine learning

Appears in 6 sentences as: machine learning (6)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation.
    Page 1, “Abstract”
  2. In this paper, we explore the possibility of improving upon existing bag-of-words (BOW) approaches to short answer grading by utilizing machine learning techniques.
    Page 1, “Introduction”
  3. First, to what extent can machine learning be leveraged to improve upon existing approaches to short answer grading.
    Page 1, “Introduction”
  4. A later implementation of the Oxford-UCLES system (Pulman and Sukkarieh, 2005) compares several machine learning techniques, including inductive logic programming, decision tree learning, and Bayesian learning, to the earlier pattern matching approach, with encouraging results.
    Page 2, “Related Work”
  5. We define a total of 68 features to be used to train our machine learning system to compute node-node (more specifically, subgraph-subgraph) matches.
    Page 4, “Answer Grading System”
  6. Before applying any machine learning techniques, we first test the quality of the eight graph alignment features ψG(Ai, As) independently.
    Page 8, “Results”

similarity measures

Appears in 6 sentences as: similarity measures (6)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation.
    Page 1, “Abstract”
  2. In the final stage (Section 3.4), we produce an overall grade based upon the alignment scores found in the previous stage as well as the results of several semantic BOW similarity measures (Section 3.3).
    Page 3, “Answer Grading System”
  3. All eight WordNet-based similarity measures listed in Section 3.3 plus the LSA model are used to produce these features.
    Page 4, “Answer Grading System”
  4. In order to address this, we combine the graph alignment scores, which encode syntactic knowledge, with the scores obtained from semantic similarity measures.
    Page 5, “Answer Grading System”
  5. We combine the alignment scores ψG(Ai, As) with the scores ψB(Ai, As) from the lexical semantic similarity measures into a single feature vector ψ(Ai, As) = [ψG(Ai, As) | ψB(Ai, As)].
    Page 5, “Answer Grading System”
  6. One surprise while building this system was the consistency with which the novel technique of question demoting improved scores for the BOW similarity measures.
    Page 7, “Results”

error rate

Appears in 3 sentences as: error rate (3)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. However, as the perceptron is designed to minimize error rate, this may not reflect an optimal objective when seeking to detect matches.
    Page 7, “Results”
  2. First, we can see from the results that several systems appear better when evaluating on a correlation measure like Pearson’s ρ, while others appear better when analyzing error rate.
    Page 9, “Discussion and Conclusions”
  3. Evaluating with a correlative measure yields predictably poor results, but evaluating the error rate indicates that it is comparable to (or better than) the more intelligent BOW metrics.
    Page 9, “Discussion and Conclusions”

lexical semantic

Appears in 3 sentences as: Lexical Semantic (1) lexical semantic (2)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation.
    Page 1, “Abstract”
  2. 3.3 Lexical Semantic Similarity
    Page 5, “Answer Grading System”
  3. We combine the alignment scores ψG(Ai, As) with the scores ψB(Ai, As) from the lexical semantic similarity measures into a single feature vector ψ(Ai, As) = [ψG(Ai, As) | ψB(Ai, As)].
    Page 5, “Answer Grading System”

regression model

Appears in 3 sentences as: regression model (3)
In Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments
  1. We train the isotonic regression model on each type of system output (i.e., alignment scores, SVM output, BOW scores).
    Page 6, “Answer Grading System”
  2. For each fold, one additional fold is held out for later use in the development of an isotonic regression model (see Figure 3).
    Page 8, “Results”
  3. This is likely due to the different objective function in the corresponding optimization formulations: while the ranking model attempts to ensure a correct ordering between the grades, the regression model seeks to minimize an error objective that is closer to the RMSE.
    Page 9, “Discussion and Conclusions”
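
As snippets 1 and 2 describe, an isotonic regression model is fit on held-out data to calibrate each type of raw system score onto the grade scale. A minimal sketch assuming scikit-learn; the scores and grades are invented.

```python
# Fit a monotone non-decreasing mapping from raw scores to grades, then
# apply it to new system outputs.
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw = np.array([0.10, 0.35, 0.40, 0.80, 0.90])  # e.g., SVM outputs (made up)
grade = np.array([1.0, 2.0, 2.5, 4.5, 5.0])     # instructor grades (made up)

iso = IsotonicRegression(out_of_bounds="clip")  # clip scores outside the range
iso.fit(raw, grade)
print(iso.predict(np.array([0.5, 0.85])))       # calibrated grades
```

Because the fitted mapping is monotone, it leaves the ranking of answers intact; it is a calibration step aimed at error metrics such as RMSE rather than at correlation.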
