Introduction | In the past decade, sequence labeling algorithms such as HMMs, CRFs, and Collins’ perceptrons have been extensively studied in the field of NLP (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002). |
Introduction | Among them, we focus on the perceptron algorithm (Collins, 2002). |
Introduction | In the perceptron, the score function f(x, y) is given as f(x, y) = w · φ(x, y), where w is the weight vector and φ(x, y) is the feature vector representation of the pair (x, y). By making the first-order Markov assumption, we have
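Introduction | To make the decomposition concrete, here is a minimal Python sketch of scoring under the first-order Markov assumption; the feature templates (emission and label-bigram indicators) and function names are illustrative assumptions, not the paper's actual feature set.

```python
def local_features(x, i, prev_label, label):
    """Hypothetical local feature map phi(x, i, y_{i-1}, y_i).

    Under the first-order Markov assumption the global feature vector
    phi(x, y) is the sum of these local vectors over positions i.
    """
    return {
        f"emit:{x[i]}:{label}": 1.0,         # observation/label feature
        f"trans:{prev_label}:{label}": 1.0,  # label-bigram feature
    }

def score(w, x, y):
    """f(x, y) = w . phi(x, y), computed by summing local scores."""
    total = 0.0
    prev = "<START>"
    for i, label in enumerate(y):
        for feat, val in local_features(x, i, prev, label).items():
            total += w.get(feat, 0.0) * val
        prev = label
    return total

# Usage: score({"emit:dog:NN": 1.2, "trans:DT:NN": 0.8}, ["the", "dog"], ["DT", "NN"])
```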
Discriminative training | We incorporate all our new features into a linear model and learn weights for each using the online averaged perceptron algorithm (Collins, 2002) with a few modifications for structured outputs inspired by Chiang et al. |
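Discriminative training | For reference, a minimal sketch of the online averaged perceptron for structured outputs; the `decode` and `features` callables are assumptions standing in for the authors' decoder and feature extraction, and the naive averaging shown here is usually replaced by a lazier scheme in practice.

```python
from collections import defaultdict

def averaged_perceptron(train_data, decode, features, epochs=10):
    """Online averaged perceptron for structured outputs (Collins, 2002).

    train_data: list of (x, gold_y) pairs
    decode(x, w): highest-scoring output under the current weights w
    features(x, y): sparse feature dict {name: value} for the pair (x, y)
    """
    w = defaultdict(float)       # current weight vector
    w_sum = defaultdict(float)   # running sum of weight vectors
    t = 0
    for _ in range(epochs):
        for x, gold_y in train_data:
            pred_y = decode(x, w)
            if pred_y != gold_y:
                # standard structured perceptron update
                for f, v in features(x, gold_y).items():
                    w[f] += v
                for f, v in features(x, pred_y).items():
                    w[f] -= v
            # accumulate for averaging (naive; real implementations
            # typically use lazy/delayed averaging for speed)
            for f, v in w.items():
                w_sum[f] += v
            t += 1
    return {f: v / t for f, v in w_sum.items()}
```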
Experiments | We use 1,000 sentence pairs and gold alignments from LDC2006E86 to train model parameters: 800 sentences for training, 100 for testing, and 100 as a second held-out development set to decide when to stop perceptron training. |
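Experiments | A small sketch of how a held-out development set can drive the decision of when to stop perceptron training; the `update` and `evaluate` callables and the patience heuristic are illustrative assumptions, not the authors' exact stopping rule.

```python
def train_with_early_stopping(train, dev, update, evaluate,
                              max_epochs=20, patience=3):
    """Stop perceptron training when dev-set accuracy stops improving.

    update(w, x, gold_y): performs one online perceptron update, returns w
    evaluate(w, data): accuracy (e.g. alignment F-score) under weights w
    """
    w = {}
    best_w, best_score, bad_epochs = dict(w), float("-inf"), 0
    for _ in range(max_epochs):
        for x, gold_y in train:
            w = update(w, x, gold_y)
        dev_score = evaluate(w, dev)
        if dev_score > best_score:
            best_w, best_score, bad_epochs = dict(w), dev_score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # dev score has stopped improving
    return best_w
```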
Experiments | Figure 8: Learning curves for 10 random restarts over time for parallel averaged perceptron training. |
Experiments | Perceptron training here is quite stable, converging to the same general neighborhood each time. |
Introduction | We train the parameters of the model using the averaged perceptron (Collins, 2002), modified for structured outputs, though the model could easily fit into a max-margin or related framework.
Conclusion | It is observed that active learning of parsing with the averaged perceptron, which is one of the large-margin classifiers, also works well for Japanese dependency analysis.
Experimental Evaluation and Discussion | 6.2 Averaged Perceptron |
Experimental Evaluation and Discussion | We used the averaged perceptron (AP) (Freund and Schapire, 1999) with polynomial kernels. |
Experimental Evaluation and Discussion | We found the best value for the number of epochs T of the averaged perceptron using the development set.
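Experimental Evaluation and Discussion | For illustration, a sketch of a polynomial kernel and the decision value of a kernelized perceptron; the degree, bias term, and representation of the averaged support coefficients are assumptions, not the paper's exact setup.

```python
def poly_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel K(x, z) = (x . z + c) ** degree."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (dot + c) ** degree

def kernel_perceptron_score(x, support, kernel=poly_kernel):
    """Decision value of a kernelized perceptron.

    support: list of (alpha_i, x_i) pairs collected during training,
    where alpha_i is the (signed, averaged) coefficient of example x_i.
    """
    return sum(alpha * kernel(x_i, x) for alpha, x_i in support)
```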
Parsing experiments | 7.2 Averaged perceptron training |
Parsing experiments | We chose the averaged structured perceptron (Freund and Schapire, 1999; Collins, 2002) as it combines highly competitive performance with fast training times, typically converging in 5-10 iterations.
Parsing experiments | Pass = % of dependencies surviving the beam in training data, Orac = maximum achievable UAS on validation data, Acc1/Acc2 = UAS of Models 1/2 on validation data, and Time1/Time2 = minutes per perceptron training iteration for Models 1/2, averaged over all 10 iterations.
Results | In our first experiment, we trained supertagger models using Generalised Iterative Scaling (GIS) (Darroch and Ratcliff, 1972), the limited memory BFGS method (BFGS) (Nocedal and Wright, 1999), the averaged perceptron (Collins, 2002), and the margin infused relaxed algorithm (MIRA) (Crammer and Singer, 2003). |
Results | GIS: 96.34, 96.43, 96.53, 96.62, 85.3; Perceptron: 95.82, 95.99, 96.30, -, 85.2; MIRA: 96.23, 96.29, 96.46, 96.63, 85.4
Results | For all four algorithms the training time is proportional to the amount of data, but the GIS and BFGS models trained on CCGbank alone took 4,500 and 4,200 seconds to train, while the equivalent perceptron and MIRA models took only 90 and 95 seconds.