Baseline System | Our system differs from Stoyanov et al. (2009) in using a decision tree classifier rather than an averaged linear perceptron.
Baseline System | We find the decision tree classifier to work better than the default averaged perceptron used by Stoyanov et al. (2009).
Experiments | AvgPerc is the averaged perceptron baseline, DecTree is the decision tree baseline, and the +Feature rows show the effect of adding a particular feature incrementally (not in isolation) to the DecTree baseline.
Experiments | We start with the Reconcile baseline but employ the decision tree (DT) classifier, because it has significantly better performance than the default averaged perceptron classifier used in Stoyanov et al.
Experiments | (2009). Table 2 compares the baseline perceptron results to the DT results and then shows the incremental addition of the Web features to the DT baseline (on the ACE04 development set).
Experiments | The baseline learner in our experiments is a pairwise ranking perceptron that is used on various features and training data and plugged into various meta- |
Experiments | The perceptron algorithm itself compares favorably to related learning techniques such as the MIRA adaptation of Chiang et al.
Experiments | In contrast, the perceptron is deterministic when started from a zero-vector of weights and achieves a favorable 28.0 BLEU on the news-commentary test set.
Joint Feature Selection in Distributed Stochastic Learning | The resulting algorithms can be seen as variants of the perceptron algorithm. |
Joint Feature Selection in Distributed Stochastic Learning | subgradient leads to the perceptron algorithm for pairwise ranking (Shen and Joshi, 2005):
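A minimal sketch of such a pairwise ranking perceptron update, assuming pairs of feature vectors where the first element belongs to the higher-ranked candidate (function and variable names are illustrative, not taken from Shen and Joshi, 2005):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_ranking_perceptron(pairs, dim, epochs=5):
    # `pairs`: list of (f_hi, f_lo) feature vectors, where f_hi comes
    # from the higher-ranked candidate. Names here are illustrative.
    w = [0.0] * dim
    for _ in range(epochs):
        for f_hi, f_lo in pairs:
            # Subgradient step: update only when the ranking is violated.
            if dot(w, f_hi) <= dot(w, f_lo):
                w = [wi + hi - lo for wi, hi, lo in zip(w, f_hi, f_lo)]
    return w
```

The update is exactly the perceptron rule applied to the difference vector f_hi - f_lo, which is what taking the subgradient of the pairwise hinge loss yields.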
Joint Feature Selection in Distributed Stochastic Learning | (2010) also present an iterative mixing algorithm where weights are mixed from each shard after training a single epoch of the perceptron in parallel on each shard. |
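The iterative mixing scheme can be sketched as follows; this is a simplified sequential simulation (the per-shard epochs would run in parallel in a real system), with illustrative names and a plain binary perceptron standing in for the structured learner:

```python
def perceptron_epoch(w, shard):
    # One epoch of a binary perceptron on one shard; labels y are in {-1, +1}.
    w = list(w)
    for x, y in shard:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def iterative_mixing(shards, dim, epochs=10):
    # After each round of parallel epochs, average (mix) the shard weights
    # and redistribute the mixture as the starting point for the next round.
    w = [0.0] * dim
    for _ in range(epochs):
        # In a real system these per-shard epochs run in parallel.
        shard_weights = [perceptron_epoch(w, s) for s in shards]
        w = [sum(ws[i] for ws in shard_weights) / len(shards)
             for i in range(dim)]
    return w
```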
Related Work | Examples for adapted algorithms include Maximum-Entropy Models (Och and Ney, 2002; Blunsom et al., 2008), Pairwise Ranking Perceptrons (Shen et al., 2004; Watanabe et al., 2006; Hopkins and May, 2011), Structured Perceptrons (Liang et al., 2006a), Boosting (Duh and Kirchhoff, 2008; Wellington et al., 2009), Structured SVMs (Tillmann and Zhang, 2006; Hayashi et al., 2009), MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang et al., 2009), and others. |
Experiments | Due to the long CRF training time (days to weeks, even with stochastic gradient descent) on these datasets with large label sets, we choose the perceptron algorithm for training.
Experiments | We note that the selection of training algorithm does not affect the decoding process: the decoding is identical for both CRF and perceptron training algorithms. |
Introduction | Sequence tagging algorithms including HMMs (Rabiner, 1989), CRFs (Lafferty et al., 2001), and Collins’s perceptron (Collins, 2002) have been widely employed in NLP applications.
Problem formulation | In this section, we formulate the sequential decoding problem in the context of the perceptron algorithm (Collins, 2002) and CRFs (Lafferty et al., 2001).
Problem formulation | Formally, a perceptron model scores an output y for an input x as f(y, x) = w · Φ(x, y), where Φ(x, y) is a feature vector and w the weight vector.
Problem formulation | If x is given, decoding finds the best y, i.e., the y that maximizes the score f(y, x) for the perceptron or the probability p(y|x) for CRFs.
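For a first-order sequence model, this argmax decoding can be sketched with standard Viterbi dynamic programming; this is a generic illustration (not code from the papers), where the score matrices stand in for the feature dot products w · Φ:

```python
def viterbi_decode(emit, trans):
    # emit[t][s]: score of label s at position t; trans[a][b]: score of
    # the transition a -> b. Both stand in for dot products w · Φ(x, y).
    T, S = len(emit), len(emit[0])
    delta = list(emit[0])                  # best prefix score ending in s
    back = [[0] * S for _ in range(T)]     # back-pointers
    for t in range(1, T):
        new = []
        for s in range(S):
            best = max(range(S), key=lambda a: delta[a] + trans[a][s])
            back[t][s] = best
            new.append(delta[best] + trans[best][s] + emit[t][s])
        delta = new
    # Follow back-pointers from the best final label.
    y = [max(range(S), key=lambda s: delta[s])]
    for t in range(T - 1, 0, -1):
        y.append(back[t][y[-1]])
    return y[::-1]
```

The same routine serves both model types: for a CRF the max over label sequences of unnormalized log-probabilities coincides with the max over scores, which is why decoding is identical for CRF- and perceptron-trained models.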
Conclusions | We find the best-scoring derivation via forest reranking with both local and nonlocal features, whose weights we train using the perceptron algorithm.
Conclusions | Finally, distributed training strategies have been developed for the perceptron algorithm (McDonald et al., 2010), which would allow our generator to scale to even larger datasets. |
Problem Formulation | Algorithm 1: Averaged Structured Perceptron
Problem Formulation | We estimate the weights α using the averaged structured perceptron algorithm (Collins, 2002), which is well known for its speed and good performance in similar large-parameter NLP tasks (Liang et al., 2006; Huang, 2008).
Problem Formulation | As shown in Algorithm 1, the perceptron makes several passes over the training scenarios, and in each iteration it computes the best-scoring (w, h) among the candidate derivations, given the current weights α.
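The averaged structured perceptron loop described above can be sketched as follows; this is a generic rendering of Collins (2002), where the `decode` and `feats` callables are placeholders for the task-specific argmax and feature extraction, not names from the paper:

```python
def averaged_structured_perceptron(data, feats, decode, dim, epochs=5):
    # `data`: list of (x, y_gold); `decode(w, x)` returns the best-scoring
    # output under weights w; `feats(x, y)` maps a full (input, output)
    # pair to a feature vector. All names here are illustrative.
    w = [0.0] * dim
    w_sum = [0.0] * dim          # running sum for the averaged weights
    n = 0
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = decode(w, x)
            if y_hat != y_gold:  # update only on a decoding mistake
                f_gold, f_hat = feats(x, y_gold), feats(x, y_hat)
                w = [wi + g - h for wi, g, h in zip(w, f_gold, f_hat)]
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            n += 1
    return [s / n for s in w_sum]   # averaging stabilizes the final model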
Related Work | Our model is closest to Huang (2008) who also performs forest reranking on a hypergraph, using both local and nonlocal features, whose weights are tuned with the averaged perceptron algorithm (Collins, 2002). |
Incorporating Syntactic Structures | The choice of action w is given by a learning algorithm, such as a maximum-entropy classifier, support vector machine, or Perceptron, trained on labeled data.
Incorporating Syntactic Structures | (2011), a transition-based model with a Perceptron and a lookahead heuristic process.
Introduction | We demonstrate our approach on a local Perceptron-based part-of-speech tagger (Tsuruoka et al., 2011) and a shift-reduce dependency parser (Sagae and Tsujii, 2007), yielding significantly faster tagging and parsing of ASR hypotheses.
Syntactic Language Models | (2005) uses the Perceptron algorithm to train a global linear discriminative model |
Syntactic Language Models | We use a global linear model with Perceptron training. |
Syntactic Language Models | Perceptron training learns the parameters α.
Related Work | A few other state-of-the-art CWS systems use semi-Markov perceptron methods or voting systems based on multiple semi-Markov perceptron segmenters (Zhang and Clark, 2007; Sun, 2010).
Related Work | Those semi-Markov perceptron systems are moderately faster than the heavy probabilistic systems using semi-Markov conditional random fields or latent variable conditional random fields. |
Abstract | For the parameter estimation of θ, we use the averaged perceptron as described in (Collins, 2002).
Abstract | POS-/+: perceptron trained without/with POS. |
Abstract | These results are compared to the core perceptron trained without POS in (Jiang et al., 2008a).