Conclusion | With efficient approximate decoding, perceptron training on the whole Treebank becomes practical and can be done in about a day even with a Python implementation. |
Experiments | This result confirms that our feature set design is appropriate, and the averaged perceptron learner is a reasonable candidate for reranking. |
Experiments | We use the development set to determine the optimal number of iterations for the averaged perceptron, and report the F1 score on the test set. |
Experiments | column is for feature extraction; the training column shows the number of perceptron iterations that achieved the best results on the dev set, and the average time per iteration. |
Forest Reranking | 3.1 Generic Reranking with the Perceptron |
Forest Reranking | In this work we use the averaged perceptron algorithm (Collins, 2002), since it is an online algorithm that is much simpler and orders of magnitude faster than Boosting and MaxEnt methods. |
Forest Reranking | Shown in Pseudocode 1, the perceptron algorithm makes several passes over the whole training data, and in each iteration, for each sentence s_i, it tries to predict a best parse ŷ_i among the candidates cand(s_i) using the current weight setting. |
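Forest Reranking | As a concrete, hedged illustration of this loop (not the paper's exact Pseudocode 1), the sketch below implements a generic averaged-perceptron reranker in Python; the candidate sets, oracle indices, and feature dictionaries are assumptions of the sketch rather than details from the original.

```python
# Sketch of a generic averaged-perceptron reranker: pick the highest-scoring
# candidate under the current weights; if it differs from the oracle, update.
from collections import defaultdict

def dot(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def perceptron_rerank(train, n_iters=10):
    """train: list of (candidate_feature_dicts, oracle_index) pairs."""
    w = defaultdict(float)
    w_sum = defaultdict(float)   # running sum of weight vectors for averaging
    n_steps = 0
    for _ in range(n_iters):
        for candidates, oracle in train:
            pred = max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))
            if pred != oracle:
                for f, v in candidates[oracle].items():
                    w[f] += v
                for f, v in candidates[pred].items():
                    w[f] -= v
            n_steps += 1
            for f, v in w.items():     # naive averaging; see the lazier trick later
                w_sum[f] += v
    return {f: v / n_steps for f, v in w_sum.items()}
```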
Introduction | his parser, and Wenbin Jiang for guidance on perceptron averaging. |
Abstract | We show that these are relatively unproblematic for an algorithm operating under the 0-1 loss model, whereas for the commonly used voted perceptron algorithm, hard training cases could result in incorrect prediction on the uncontroversial cases at test time. |
Introduction | For example, the perceptron family of algorithms handles random classification noise well (Cohen, 1997). |
Introduction | We show in section 3.4 that the widely used Freund and Schapire (1999) voted perceptron algorithm could face a constant hard case bias when confronted with annotation noise in training data, irrespective of the size of the dataset. |
Introduction | 3 Voted Perceptron |
Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Background | Using the averaged perceptron algorithm (Collins, 2002), White & Rajkumar (2009) trained a structured prediction ranking model to combine these existing syntactic models with several n-gram language models. |
Introduction | Rajkumar & White (2011; 2012) have recently shown that some rather egregious surface realization errors—in the sense that the reader would likely end up with the wrong interpretation—can be avoided by making use of features inspired by psycholinguistics research together with an otherwise state-of-the-art averaged perceptron realization ranking model (White and Rajkumar, 2009), as reviewed in the next section. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature. |
Simple Reranking | The first one is the baseline generative model (hereafter, generative model) used in training the averaged perceptron model. |
Simple Reranking | The second one is the averaged perceptron model (hereafter, perceptron model), which uses all the features reviewed in Section 2. |
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Introduction | In the past decade, sequence labeling algorithms such as HMMs, CRFs, and Collins’ perceptrons have been extensively studied in the field of NLP (Rabiner, 1989; Lafferty et al., 2001; Collins, 2002). |
Introduction | Among them, we focus on the perceptron algorithm (Collins, 2002). |
Introduction | In the perceptron, the score function f(x, y) is given as f(x, y) = w · φ(x, y), where w is the weight vector and φ(x, y) is the feature vector representation of the pair (x, y). By making the first-order Markov assumption, we have |
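Introduction | The equation that originally followed here is lost to extraction; under a first-order Markov assumption the global feature vector is typically assumed to decompose over adjacent label pairs, roughly as below (the exact notation is our reconstruction, not the original's).

```latex
% Hedged reconstruction of the first-order decomposition
f(x, y) \;=\; \mathbf{w} \cdot \phi(x, y)
       \;=\; \sum_{t=1}^{T} \mathbf{w} \cdot \phi(x, y_{t-1}, y_t, t)
```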
Abstract | We investigate different ways of learning structured perceptron models for coreference resolution when using nonlocal features and beam search. |
Conclusion | We evaluated standard perceptron learning techniques for this setting both using early updates and LaSO. |
Conclusion | In the special case where only local features are used, this method coincides with standard structured perceptron learning that uses exact search. |
Experimental Setup | Unless otherwise stated we use 25 iterations of perceptron training and a beam size of 20. |
Introduction | This paper studies and extends previous work using the structured perceptron (Collins, 2002) for complex NLP tasks. |
Related Work | Perceptrons for coreference. |
Related Work | The perceptron has previously been used to train coreference resolvers either by casting the problem as a binary classification problem that considers pairs of mentions in isolation (Bengtson and Roth, 2008; Stoyanov et al., 2009; Chang et al., 2012, inter alia) or in the structured manner, where a clustering for an entire document is predicted in one go (Fernandes et al., 2012). |
Related Work | Stoyanov and Eisner (2012) train an Easy-First coreference system with the perceptron to learn a sequence of join operations between arbitrary mentions in a document, accessing nonlocal features through previous merge operations in later stages. |
Representation and Learning | We find the weight vector w by online learning using a variant of the structured perceptron (Collins, 2002). |
Representation and Learning | The structured perceptron iterates over training instances (x_i, y_i), where x_i are inputs and y_i are outputs. |
Representation and Learning | If τ is set to 1, the update reduces to the standard structured perceptron update. |
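Representation and Learning | For reference, a hedged reconstruction of the update being referred to (the exact definition of the step size τ is not given in this excerpt): the weights move toward the gold features and away from the predicted ones, scaled by τ.

```latex
% Hedged reconstruction of the scaled update; tau = 1 recovers the standard perceptron
\hat{y}_i \;=\; \arg\max_{y} \; \mathbf{w} \cdot \Phi(x_i, y), \qquad
\mathbf{w} \;\leftarrow\; \mathbf{w} + \tau \bigl( \Phi(x_i, y_i) - \Phi(x_i, \hat{y}_i) \bigr)
```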
Introduction | We propose a novel joint event extraction algorithm to predict the triggers and arguments simultaneously, and use the structured perceptron (Collins, 2002) to train the joint model. |
Introduction | Therefore we employ beam search in decoding, and train the model using the early-update perceptron variant tailored for beam search (Collins and Roark, 2004; Huang et al., 2012). |
Joint Framework for Event Extraction | Based on the hypothesis that facts are interdependent, we propose to use structured perceptron with inexact search to jointly extract triggers and arguments that co-occur in the same sentence. |
Joint Framework for Event Extraction | 3.1 Structured perceptron with beam search |
Joint Framework for Event Extraction | The structured perceptron (Collins, 2002) is an extension of the standard linear perceptron to structured prediction. |
Character Classification Model | Algorithm 1 Perceptron training algorithm. |
Character Classification Model | The classifier can be trained with online learning algorithms such as the perceptron, or offline learning models such as support vector machines. |
Character Classification Model | We choose the perceptron algorithm (Collins, 2002) to train the classifier for the character classification-based word segmentation model. |
Experiments | Figure 3: Learning curve of the averaged perceptron classifier on the CTB development set. |
Experiments | We train the baseline perceptron classifier for word segmentation on the training set of CTB 5.0, using the development set to determine the best number of training iterations. |
Experiments | Figure 3 shows the learning curve of the averaged perceptron on the development set. |
Knowledge in Natural Annotations | Algorithm 2 Perceptron learning with natural annotations. |
Learning with Natural Annotations | Algorithm 3 Online version of perceptron learning with natural annotations. |
Learning with Natural Annotations | Considering the online nature of the perceptron algorithm, if we were able to leverage much more data with natural annotations than the Chinese Wikipedia provides, the online version of the learning procedure shown in Algorithm 3 would be a better choice. |
Conclusions | In addition, since this response-based supervision is weak and ambiguous, we have also presented a method for using multiple reference parses to perform perceptron weight updates and shown a clear further improvement in end-task performance with this approach. |
Experimental Evaluation | Table 2 presents reranking results for our proposed response-based weight update (Single) for the averaged perceptron (cf. |
Modified Reranking Algorithm | Reranking using an averaged perceptron (Collins, 2002a) has been successfully applied to a variety of NLP tasks. |
Modified Reranking Algorithm | Algorithm 1 AVERAGED PERCEPTRON TRAINING WITH RESPONSE-BASED UPDATE. Input: a set of training examples (e_i, y_i*), where e_i is an NL sentence and y_i* = arg max_{y ∈ GEN(e_i)} EXEC(y). Output: the parameter vector w, averaged over all iterations 1, ..., T. |
Modified Reranking Algorithm | 1: procedure PERCEPTRON |
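Modified Reranking Algorithm | A hedged sketch of the response-based update idea: when no gold parse is available, the highest-scoring candidate whose execution yields the correct response stands in for the reference. GEN, the feature extractor, and the execution check below are placeholders, not the paper's exact interfaces.

```python
# Sketch only: the reference is re-chosen as the best-scoring candidate that
# executes to the correct response; the update is the usual perceptron difference.
def response_based_update(w, sentence, gen, feats, exec_ok):
    def score(y):
        return sum(w.get(f, 0.0) * v for f, v in feats(sentence, y).items())
    cands = gen(sentence)                                  # GEN(e_i): candidate outputs
    pred = max(cands, key=score)                           # current model prediction
    correct = [y for y in cands if exec_ok(sentence, y)]   # candidates with correct response
    if correct and not exec_ok(sentence, pred):
        ref = max(correct, key=score)                      # response-based "reference"
        for f, v in feats(sentence, ref).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in feats(sentence, pred).items():
            w[f] = w.get(f, 0.0) - v
    return w
```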
Abstract | We present a new perceptron learning algorithm using antagonistic adversaries and compare it to previous proposals on 12 multilingual cross-domain part-of-speech tagging datasets. |
Conclusion | Our approach was superior to previous approaches across 12 multilingual cross-domain POS tagging datasets, with an average error reduction of 4% over a structured perceptron baseline. |
Experiments | Learning with antagonistic adversaries performs significantly better than structured perceptron (SP) learning, L∞-regularization, and LRA across the board. |
Experiments | We note that on the in-domain dataset (PTB-biomedical), L∞-regularization performs best, but our approach also performs better than the structured perceptron baseline on this dataset. |
Experiments | The number of zero weights or very small weights is significantly lower for learning with antagonistic adversaries than for the baseline structured perceptron. |
Introduction | Section 2 introduces previous work on robust perceptron learning, as well as the methods discussed in the paper. |
Robust perceptron learning | Our framework will be averaged perceptron learning (Freund and Schapire, 1999; Collins, 2002). |
Experiment | The feature templates in (Zhao et al., 2006) and (Zhang and Clark, 2007) are used in training the CRFs model and Perceptrons model, respectively. |
Segmentation Models | This section briefly reviews two supervised models in these categories, a character-based CRFs model, and a word-based Perceptrons model, which are used in our approach. |
Segmentation Models | 2.2 Word-based Perceptrons Model |
Segmentation Models | Zhang and Clark (2007) first proposed a word-based segmentation model using a discriminative Perceptrons algorithm. |
Semi-supervised Learning via Co-regularizing Both Models | The model induction process is described in Algorithm 1: given labeled dataset D_l and unlabeled dataset D_u, the first two steps are training a CRFs (character-based) model and a Perceptrons (word-based) model on the labeled data D_l, respectively. |
Semi-supervised Learning via Co-regularizing Both Models | Afterwards, the agreements A are used as a set of constraints to bias the learning of CRFs (§ 3.2) and Perceptron (§ 3.3) on the unlabeled data. |
Semi-supervised Learning via Co-regularizing Both Models | 3.3 Perceptrons with Constraints |
Answer Grading System | The scoring function is trained on a small set of manually aligned graphs using the averaged perceptron algorithm. |
Answer Grading System | In order to learn the parameter vector w, we use the averaged version of the perceptron algorithm (Freund and Schapire, 1999; Collins, 2002). |
Answer Grading System | The pseudocode for the learning algorithm is shown in Table 1. After training the perceptron, these 32 student answers are removed from the dataset, are not used for training further along in the pipeline, and are not included in the final results. |
Data Set | In addition, the student answers used to train the perceptron are removed from the pipeline after the perceptron training stage. |
Related Work | Following the same line of work in the textual entailment world are (Raina et al., 2005), (MacCartney et al., 2006), (de Marneffe et al., 2007), and (Chambers et al., 2007), which experiment variously with using diverse knowledge sources, using a perceptron to learn alignment decisions, and exploiting natural logic. |
Results | We independently test two components of our overall grading system: the node alignment detection scores found by training the perceptron , and the overall grades produced in the final stage. |
Results | 5.1 Perceptron Alignment |
Results | However, as the perceptron is designed to minimize error rate, this may not reflect an optimal objective when seeking to detect matches. |
Experiments | For comparison, we also investigated training the reranker with Perceptron and MIRA. |
Experiments | The f-scores of the held-out and evaluation set given by T-MIRA as well as the Perceptron and |
Experiments | When very few labeled data are available for training (compared with the number of features), T-MIRA performs much better than the vector-based models MIRA and Perceptron. |
Introduction | Many learning algorithms applied to NLP problems, such as the Perceptron (Collins, |
Tensor Model Construction | As a way out, we first run a simple vector-model based learning algorithm (say the Perceptron) on the training data and estimate a weight vector, which serves as a “surrogate” |
Abstract | We present an incremental joint framework to simultaneously extract entity mentions and relations using structured perceptron with efficient beam-search. |
Algorithm 3.1 The Model | To estimate the feature weights, we use structured perceptron (Collins, 2002), an extension of the standard perceptron for structured prediction, as the learning framework. |
Algorithm 3.1 The Model | (2012) proved the convergence of the structured perceptron when inexact search is applied with violation-fixing update methods such as early-update (Collins and Roark, 2004). |
Algorithm 3.1 The Model | Figure 4 shows the pseudocode for structured perceptron training with early-update. |
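Algorithm 3.1 The Model | A minimal, hedged sketch of early-update training with beam search (not the paper's exact Figure 4): as soon as the gold partial structure falls out of the beam, decoding stops and an update fires against the current best partial hypothesis. The beam_partial helper is an assumed black box returning the top-k partial outputs of length t as tuples; a real implementation extends the beam incrementally rather than re-decoding.

```python
def early_update_example(w, x, gold, beam_partial, feats, beam_size=8):
    beam = [()]
    for t in range(1, len(gold) + 1):
        beam = beam_partial(x, w, t, beam_size)   # top-k partial hypotheses of length t
        gold_prefix = tuple(gold[:t])
        if gold_prefix not in beam:
            pred_prefix = beam[0]                 # early-update point
            break
    else:
        # gold survived to the end; standard update only if the prediction is wrong
        pred_prefix = beam[0]
        gold_prefix = tuple(gold)
        if pred_prefix == gold_prefix:
            return w
    for f, v in feats(x, gold_prefix).items():
        w[f] = w.get(f, 0.0) + v
    for f, v in feats(x, pred_prefix).items():
        w[f] = w.get(f, 0.0) - v
    return w
```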
Conclusions and Future Work | For the first time, we addressed this challenging task by an incremental beam-search algorithm in conjunction with structured perceptron . |
Introduction | Following the above intuitions, we introduce a joint framework based on structured perceptron (Collins, 2002; Collins and Roark, 2004) with beam-search to extract entity mentions and relations simultaneously. |
Introduction | Our previous work (Li et al., 2013) used a perceptron model with token-based tagging to jointly extract event triggers and arguments. |
Related Work | Our previous work (Li et al., 2013) used structured perceptron with token-based decoder to jointly predict event triggers and arguments based on the assumption that entity mentions and other argument candidates are given as part of the input. |
Baseline System | (2009) in using a decision tree classifier rather than an averaged linear perceptron . |
Baseline System | We find the decision tree classifier to work better than the default averaged perceptron (used by Stoyanov et al. |
Experiments | AvgPerc is the averaged perceptron baseline, DecTree is the decision tree baseline, and the +Feature rows show the effect of adding a particular feature incrementally (not in isolation) to the DecTree baseline. |
Experiments | We start with the Reconcile baseline but employ the decision tree (DT) classifier, because it has significantly better performance than the default averaged perceptron classifier used in Stoyanov et al. |
Experiments | (2009). Table 2 compares the baseline perceptron results to the DT results and then shows the incremental addition of the Web features to the DT baseline (on the ACE04 development set). |
Experiments | The baseline learner in our experiments is a pairwise ranking perceptron that is used on various features and training data and plugged into various meta- |
Experiments | The perceptron algorithm itself compares favorably to related learning techniques such as the MIRA adaptation of Chiang et al. |
Experiments | In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves a favorable 28.0 BLEU on the news-commentary test set. |
Joint Feature Selection in Distributed Stochastic Learning | The resulting algorithms can be seen as variants of the perceptron algorithm. |
Joint Feature Selection in Distributed Stochastic Learning | subgradient leads to the perceptron algorithm for pairwise ranking (Shen and Joshi, 2005): |
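Joint Feature Selection in Distributed Stochastic Learning | The formula following the colon is missing from this excerpt; the pairwise ranking perceptron update it points to is presumably of the standard form below, for a preference pair in which y⁺ should outrank y⁻ (margin and learning-rate details are our assumption).

```latex
% Presumed standard pairwise ranking perceptron update
\text{if } \mathbf{w} \cdot \bigl( \Phi(x, y^{+}) - \Phi(x, y^{-}) \bigr) \le 0 :\quad
\mathbf{w} \;\leftarrow\; \mathbf{w} + \Phi(x, y^{+}) - \Phi(x, y^{-})
```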
Joint Feature Selection in Distributed Stochastic Learning | (2010) also present an iterative mixing algorithm where weights are mixed from each shard after training a single epoch of the perceptron in parallel on each shard. |
Related Work | Examples for adapted algorithms include Maximum-Entropy Models (Och and Ney, 2002; Blunsom et al., 2008), Pairwise Ranking Perceptrons (Shen et al., 2004; Watanabe et al., 2006; Hopkins and May, 2011), Structured Perceptrons (Liang et al., 2006a), Boosting (Duh and Kirchhoff, 2008; Wellington et al., 2009), Structured SVMs (Tillmann and Zhang, 2006; Hayashi et al., 2009), MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang et al., 2009), and others. |
Abstract | Our approach automatically generalizes a seed lexicon and includes a scalable, parallelized perceptron parameter estimation scheme. |
Data | Number of perceptron epochs T. |
Data | A function PerceptronEpoch(T, θ, L) that runs a single epoch of the hidden-variable structured perceptron algorithm on training set T with initial parameters θ, returning a new parameter vector θ'. |
Learning | We use the hidden-variable structured perceptron algorithm to learn θ from a list of (question x, query z) training examples. |
Learning | We adopt the iterative parameter mixing variation of the perceptron (McDonald et al., 2010) to scale to a large number of training examples. |
Learning | Because the number of training examples is large, we adopt a parallel perceptron approach. |
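Learning | A hedged sketch of iterative parameter mixing in this setting: each shard runs one perceptron epoch (standing in for the PerceptronEpoch function mentioned above) from the current mixed parameters, and the resulting shard-local parameters are averaged before the next epoch. The uniform mixing weights, the sequential loop, and the sharding interface are assumptions of the sketch.

```python
# Sketch of iterative parameter mixing (McDonald et al., 2010): one epoch per shard
# (in parallel in practice; sequential here for clarity), then mix by averaging.
def iterative_parameter_mixing(shards, perceptron_epoch, n_epochs=10):
    theta = {}                                           # shared parameters
    for _ in range(n_epochs):
        local = [perceptron_epoch(shard, dict(theta)) for shard in shards]
        theta = {}
        for w in local:
            for k, v in w.items():
                theta[k] = theta.get(k, 0.0) + v / len(local)
    return theta
```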
Abstract | We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. |
Introduction | We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain. |
Method | Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document d_i and each topic t_j. |
Method | We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daume III and Marcu, 2005; Snyder and Barzilay, 2007). |
Method | Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). |
Rank(e_ij1 ... e_ijr, w_j) | Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than perform an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. |
Experiments | Due to the long CRF training time (days to weeks even for stochastic gradient descent training) for these large label size datasets, we choose the perceptron algorithm for training. |
Experiments | We note that the selection of training algorithm does not affect the decoding process: the decoding is identical for both CRF and perceptron training algorithms. |
Introduction | Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins’s perceptron (Collins, 2002) have been widely employed in NLP applications. |
Problem formulation | In this section, we formulate the sequential decoding problem in the context of perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001). |
Problem formulation | Formally, a perceptron model is |
Problem formulation | If x is given, the decoding is to find the best y which maximizes the score f(y, x) for the perceptron or the probability p(y|x) for CRFs. |
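Problem formulation | For concreteness, a hedged sketch of first-order Viterbi decoding for such a linear model; emit_score and trans_score stand in for the per-position and transition components of w · φ(x, y) and are not the paper's notation.

```python
# Viterbi decoding for a first-order linear model (assumes a nonempty input x).
def viterbi(x, labels, emit_score, trans_score):
    n = len(x)
    delta = [{y: emit_score(x, 0, y) for y in labels}]   # best score ending in y at t
    back = [{}]                                          # back-pointers
    for t in range(1, n):
        delta.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[t - 1][yp] + trans_score(yp, y))
            delta[t][y] = delta[t - 1][best_prev] + trans_score(best_prev, y) + emit_score(x, t, y)
            back[t][y] = best_prev
    # follow back-pointers from the best final label
    y_last = max(labels, key=lambda y: delta[n - 1][y])
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```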
Introduction | The discriminative model is global and trained with the structured perceptron. |
Introduction | We also show how perceptron learning with beam-search (Collins and Roark, 2004) can be extended to handle the additional ambiguity, by adapting the “violation-fixing” perceptron of Huang et al. |
The Dependency Model | We also show, in Section 3.3, how perceptron training with early-update (Collins and Roark, 2004) can be used in this setting. |
The Dependency Model | We use the averaged perceptron (Collins, 2002) to train a global linear model and score each action. |
The Dependency Model | Since there are potentially many gold items, and one gold item is required for the perceptron update, a decision needs |
Incorporating Syntactic Structures | The choice of action w is given by a learning algorithm, such as a maximum-entropy classifier, support vector machine, or Perceptron, trained on labeled data. |
Incorporating Syntactic Structures | (2011), a transition-based model with a Perceptron and a lookahead heuristic process. |
Introduction | We demonstrate our approach on a local Perceptron based part of speech tagger (Tsuruoka et al., 2011) and a shift reduce dependency parser (Sagae and Tsujii, 2007), yielding significantly faster tagging and parsing of ASR hypotheses. |
Syntactic Language Models | (2005) uses the Perceptron algorithm to train a global linear discriminative model |
Syntactic Language Models | We use a global linear model with Perceptron training. |
Syntactic Language Models | Perceptron training learns the parameters a. |
Conclusions | We find the best scoring derivation via forest reranking using both local and nonlocal features, that we train using the perceptron algorithm. |
Conclusions | Finally, distributed training strategies have been developed for the perceptron algorithm (McDonald et al., 2010), which would allow our generator to scale to even larger datasets. |
Problem Formulation | Algorithm 1: Averaged Structured Perceptron |
Problem Formulation | We estimate the weights α using the averaged structured perceptron algorithm (Collins, 2002), which is well known for its speed and good performance in similar large-parameter NLP tasks (Liang et al., 2006; Huang, 2008). |
Problem Formulation | As shown in Algorithm 1, the perceptron makes several passes over the training scenarios, and in each iteration it computes the best scoring (w, h) among the candidate derivations, given the current weights α. |
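Problem Formulation | Part of the appeal of the averaged perceptron is that the average never has to be accumulated explicitly: alongside w one keeps an auxiliary vector of counter-scaled updates. The sketch below follows the commonly used formulation of this trick (popularized by Daumé III), not necessarily the paper's exact bookkeeping; decode and feats are placeholders.

```python
# Averaged perceptron with the usual averaging trick: u accumulates counter-scaled
# updates, and the averaged weights are recovered at the end as w - u / c.
def averaged_perceptron(examples, decode, feats, n_iters=10):
    w, u = {}, {}
    c = 1                                            # global example counter
    for _ in range(n_iters):
        for x, gold in examples:
            pred = decode(x, w)                      # best-scoring output under w
            if pred != gold:
                for f, v in feats(x, gold).items():
                    w[f] = w.get(f, 0.0) + v
                    u[f] = u.get(f, 0.0) + c * v
                for f, v in feats(x, pred).items():
                    w[f] = w.get(f, 0.0) - v
                    u[f] = u.get(f, 0.0) - c * v
            c += 1
    return {f: w[f] - u.get(f, 0.0) / c for f in w}  # averaged weights
```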
Related Work | Our model is closest to Huang (2008) who also performs forest reranking on a hypergraph, using both local and nonlocal features, whose weights are tuned with the averaged perceptron algorithm (Collins, 2002). |
Conclusion | It is observed that active learning of parsing with the averaged perceptron, one of the large-margin classifiers, also works well for Japanese dependency analysis. |
Experimental Evaluation and Discussion | 6.2 Averaged Perceptron |
Experimental Evaluation and Discussion | We used the averaged perceptron (AP) (Freund and Schapire, 1999) with polynomial kernels. |
Experimental Evaluation and Discussion | We found the best value of the number of training epochs T of the averaged perceptron by using the development set. |
Discriminative training | We incorporate all our new features into a linear model and learn weights for each using the online averaged perceptron algorithm (Collins, 2002) with a few modifications for structured outputs inspired by Chiang et al. |
Experiments | We use 1,000 sentence pairs and gold alignments from LDC2006E86 to train model parameters: 800 sentences for training, 100 for testing, and 100 as a second held-out development set to decide when to stop perceptron training. |
Experiments | Figure 8: Learning curves for 10 random restarts over time for parallel averaged perceptron training. |
Experiments | Perceptron training here is quite stable, converging to the same general neighborhood each time. |
Introduction | We train the parameters of the model using the averaged perceptron (Collins, 2002) modified for structured outputs, though it could easily fit into a max-margin or related framework. |
Experiments | 5.1 Baseline Perceptron Classifier |
Experiments | We first report experimental results of the single perceptron classifier on CTB 5.0. |
Experiments | The first 3 rows in each subtable of Table 3 show the performance of the single perceptron |
Segmentation and Tagging as Character Classification | Algorithm 1 Perceptron training algorithm. |
Segmentation and Tagging as Character Classification | Several classification models can be adopted here, however, we choose the averaged perceptron algorithm (Collins, 2002) because of its simplicity and high accuracy. |
Related Work | A few other state-of-the-art CWS systems are using semi-Markov perceptron methods or voting systems based on multiple semi-Markov perceptron segmenters (Zhang and Clark, 2007; Sun, 2010). |
Related Work | Those semi-Markov perceptron systems are moderately faster than the heavy probabilistic systems using semi-Markov conditional random fields or latent variable conditional random fields. |
Baseline parser | We adopt the parser of Zhang and Clark (2009) for our baseline, which is based on the shift-reduce process of Sagae and Lavie (2005), and employs global perceptron training and beam search. |
Baseline parser | The model parameter vector w is trained with the averaged perceptron algorithm, applied to state items (sequences of actions) globally. |
Experiments | The optimal iteration number of perceptron learning is determined |
Improved hypotheses comparison | This turns out to have a significant empirical influence on perceptron training with early-update, where the training of the model interacts with search (Daume III, 2006). |
Citation Extraction Data | We then use the development set to learn the penalties for the soft constraints, using the perceptron algorithm described in section 3.1. |
Soft Constraints in Dual Decomposition | All we need to employ the structured perceptron algorithm (Collins, 2002) or the structured SVM algorithm (Tsochantaridis et al., 2004) is a black-box procedure for performing MAP inference in the structured linear model given an arbitrary cost vector. |
Soft Constraints in Dual Decomposition | This can be ensured by a simple modification to the perceptron and to the subgradient descent optimization of the structured SVM objective: truncate c coordinate-wise to be nonnegative at every learning iteration. |
Soft Constraints in Dual Decomposition | Intuitively, the perceptron update increases the penalty for a constraint if it is satisfied in the ground truth and not in an inferred prediction, and decreases the penalty if the constraint is satisfied in the prediction and not the ground truth. |
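Soft Constraints in Dual Decomposition | A hedged sketch of that penalty update (the names and the per-constraint indicator interface are illustrative, not the paper's): each soft-constraint penalty moves by the difference between the constraint's satisfaction in the gold output and in the predicted output, and is then truncated to remain nonnegative.

```python
# Sketch: c maps constraint names to penalties; each indicator returns True when
# the constraint is satisfied in a given output structure.
def update_penalties(c, indicators, gold_output, pred_output, lr=1.0):
    for name, satisfied in indicators.items():
        delta = lr * (float(satisfied(gold_output)) - float(satisfied(pred_output)))
        c[name] = max(0.0, c.get(name, 0.0) + delta)   # coordinate-wise truncation
    return c
```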
Introduction | In this case, learning can follow the online structured perceptron learning procedure by Collins (2002), where the weight updates for the k-th training example (x^(k), y^(k)) are given as: |
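Introduction | The update formula itself is missing from this excerpt; the standard Collins (2002) update it refers to is, presumably,

```latex
% Presumed standard structured perceptron update for the k-th example
\hat{y} \;=\; \arg\max_{y} \; \mathbf{w} \cdot \Phi\bigl(x^{(k)}, y\bigr), \qquad
\mathbf{w} \;\leftarrow\; \mathbf{w} + \Phi\bigl(x^{(k)}, y^{(k)}\bigr) - \Phi\bigl(x^{(k)}, \hat{y}\bigr)
```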
Introduction | While the Viterbi algorithm can be used for tagging optimal state-sequences given the weights, the structured perceptron can learn optimal model weights given gold-standard sequence labels. |
Introduction | In the M-step, we take the decoded state-sequences in the E-step as observed, and run perceptron learning to update the feature weights w_i. |
Experiments | We trained the parsers using the averaged perceptron (Freund and Schapire, 1999; Collins, 2002), which represents a balance between strong performance and fast training times. |
Experiments | of iterations of perceptron training, we performed up to 30 iterations and chose the iteration which optimized accuracy on the development set. |
Experiments | Due to the sparsity of the perceptron updates, however, only a small fraction of the possible features were active in our trained models. |
Abstract | parameter estimation of θ, we use the averaged perceptron as described in (Collins, 2002). |
Abstract | POS-/+: perceptron trained without/with POS. |
Abstract | These results are compared to the core perceptron trained without POS in (Jiang et al., 2008a). |
Abstract | In this paper, we combine easy-first dependency parsing and POS tagging algorithms with beam search and structured perceptron . |
Abstract | The proposed solution can also be applied to combine beam search and structured perceptron with other systems that exhibit spurious ambiguity. |
Training | Recent work (Huang et al., 2012) rigorously explained that only valid updates ensure the convergence of perceptron variants. |
Results | In our first experiment, we trained supertagger models using Generalised Iterative Scaling (GIS) (Darroch and Ratcliff, 1972), the limited memory BFGS method (BFGS) (Nocedal and Wright, 1999), the averaged perceptron (Collins, 2002), and the margin infused relaxed algorithm (MIRA) (Crammer and Singer, 2003). |
Results | GIS: 96.34 / 96.43 / 96.53 / 96.62 / 85.3; Perceptron: 95.82 / 95.99 / 96.30 / - / 85.2; MIRA: 96.23 / 96.29 / 96.46 / 96.63 / 85.4 |
Results | For all four algorithms the training time is proportional to the amount of data, but the GIS and BFGS models trained on only CCGbank took 4,500 and 4,200 seconds to train, while the equivalent perceptron and MIRA models took 90 and 95 seconds to train. |
Parsing experiments | 7.2 Averaged perceptron training |
Parsing experiments | We chose the averaged structured perceptron (Freund and Schapire, 1999; Collins, 2002) as it combines highly competitive performance with fast training times, typically converging in 5-10 iterations. |
Parsing experiments | Pass = % dependencies surviving the beam in training data, Orac = maximum achievable UAS on validation data, Acc1/Acc2 = UAS of Models 1/2 on validation data, and Time1/Time2 = minutes per perceptron training iteration for Models 1/2, averaged over all 10 iterations. |
Alignment-based coordinate structure analysis | Shimbo and Hara defined this measure as a linear function of many features associated with arcs, and used perceptron training to optimize the weight coefficients for these features from corpora. |
Improvements | The weight of these features, which eventually determines the score of the bypass, is tuned by perceptron just like the weights of other features. |
Introduction | Recently, Shimbo and Hara (2007) proposed to use a large number of features to model this symmetry, and optimize the feature weights with perceptron training. |
Approach | For this reason we choose as a ranking algorithm the Perceptron, which is both accurate and efficient and can be trained with online protocols. |
Approach | Specifically, we implement the ranking Perceptron proposed by Shen and Joshi (2005), which reduces the ranking problem to a binary classification problem. |
Approach | For regularization purposes, we use as a final model the average of all Perceptron models posited |