Introducing Nonlocal Features | In other words, it is unlikely that we can devise a feature set informative enough for the weight vector to converge towards a solution that lets the learning algorithm see entire documents during training, at least when no external knowledge sources are used.
Introducing Nonlocal Features | Thus the learning algorithm always reaches the end of a document, avoiding the problem that early updates discard parts of the training data.
Introducing Nonlocal Features | When we applied LaSO, we noticed that it performed worse than the baseline learning algorithm when using only local features.
Introduction | The main reason why early updates underperform in our setting is that the task is too difficult and the learning algorithm is not able to profit from all of the training data.
Introduction | Put another way, early updates happen too early: the learning algorithm rarely reaches the end of an instance before it halts, updates, and moves on to the next one.
Representation and Learning | Algorithm 1 shows pseudocode for the learning algorithm, which we will refer to as the baseline learning algorithm.
Results | Since early updates do not always make use of complete documents during training, it can be expected that early updating will require either a very wide beam or more iterations to get up to par with the baseline learning algorithm.
Results | Recall that with only local features, delayed LaSO is equivalent to the baseline learning algorithm.
Results | From these results we conclude that we are better off when the learning algorithm handles one document at a time, instead of getting feedback within documents. |
Abstract | We propose an online learning algorithm based on tensor-space models. |
Abstract | We apply the proposed algorithm to a parsing task, and show that even with very little training data the learning algorithm based on a tensor model performs well, and gives significantly better results than standard learning algorithms based on traditional vector-space models.
Conclusion and Future Work | In this paper, we reformulated the traditional linear vector-space models as tensor-space models, and proposed an online learning algorithm named Tensor-MIRA. |
Introduction | Many learning algorithms applied to NLP problems, such as the Perceptron (Collins, |
Introduction | A tensor weight learning algorithm is then proposed in Section 4.
Online Learning Algorithm | Here we propose an online learning algorithm similar to MIRA but modified to accommodate tensor models. |
Online Learning Algorithm | , m, where x_i is the input and y_i is the reference or oracle hypothesis, are fed to the weight learning algorithm in sequential order.
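The sequential feed of (input, oracle) pairs described above pairs naturally with a MIRA-style update. A minimal sketch in plain Python, assuming a 1-best hypothesis and a hinge-style margin; the function names and the clip constant C are illustrative, not the paper's tensor variant:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mira_update(w, feat_gold, feat_pred, loss, C=1.0):
    """One MIRA-style 1-best update (illustrative sketch): move w just
    enough that the oracle hypothesis outscores the 1-best prediction
    by a margin of `loss`, with the step size clipped at C."""
    delta = [g - p for g, p in zip(feat_gold, feat_pred)]
    margin = dot(w, delta)            # current score gap (gold minus predicted)
    norm_sq = dot(delta, delta)
    if norm_sq == 0.0:
        return w                      # identical feature vectors: nothing to learn
    alpha = min(C, max(0.0, (loss - margin) / norm_sq))
    return [wi + alpha * di for wi, di in zip(w, delta)]
```

After the update, the gold hypothesis outscores the prediction by exactly `loss` (unless clipped by C), which is the usual passive-aggressive behavior.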
Tensor Model Construction | As a way out, we first run a simple vector-model based learning algorithm (say the Perceptron) on the training data and estimate a weight vector, which serves as a “surrogate”
Tensor Space Representation | Most of the learning algorithms for NLP problems are based on vector space models, which represent data as vectors φ ∈ R^n, and try to learn feature weight vectors w ∈ R^n such that a linear model y = w · φ is able to discriminate between, say, good and bad hypotheses.
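The linear vector-space model y = w · φ described above amounts to a dot product followed by a comparison; a minimal Python sketch (function names are illustrative):

```python
def score(w, phi):
    """Linear vector-space model: the score of a hypothesis is the
    dot product of the weight vector w and its feature vector phi."""
    return sum(wi * fi for wi, fi in zip(w, phi))

def prefers(w, phi_good, phi_bad):
    """Discriminate between two hypotheses: True if the first
    outscores the second under the current weights."""
    return score(w, phi_good) > score(w, phi_bad)
```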
Abstract | We introduce a provably correct learning algorithm for latent-variable PCFGs. |
Additional Details of the Algorithm | The learning algorithm for L-PCFGs can be used as an initializer for the EM algorithm for L-PCFGs.
Experiments on Parsing | This section describes parsing experiments using the learning algorithm for L-PCFGs. |
Experiments on Parsing | Table 1: Results on the development data (section 22) and test data (section 23) for various learning algorithms for L-PCFGs. |
Experiments on Parsing | In this special case, the L-PCFG learning algorithm is equivalent to a simple algorithm, with the following steps: 1) define the matrix Q with entries Q_{w1,w2} = count(w1, w2)/N, where count(w1, w2) is the number of times that bigram (w1, w2) is seen in the data, and N = Σ_{w1,w2} count(w1, w2).
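Step 1 of the special case above, building Q from normalized bigram counts, can be sketched as follows (a hypothetical helper, not the paper's code):

```python
from collections import Counter

def bigram_matrix(words):
    """Build Q as a sparse dict: Q[(w1, w2)] = count(w1, w2) / N,
    where count(w1, w2) is the number of occurrences of the bigram
    (w1, w2) and N is the total number of bigram tokens."""
    counts = Counter(zip(words, words[1:]))
    n = sum(counts.values())
    return {bigram: c / n for bigram, c in counts.items()}
```

By construction the entries of Q sum to one, so Q is a joint distribution over adjacent word pairs.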
The Learning Algorithm for L-PCFGS | Our goal is to design a learning algorithm for L-PCFGs. |
The Learning Algorithm for L-PCFGS | 4.2 The Learning Algorithm |
The Learning Algorithm for L-PCFGS | Figure 1 shows the learning algorithm for L-PCFGs. |
The Matrix Decomposition Algorithm | This section describes the matrix decomposition algorithm used in Step 1 of the learning algorithm . |
Abstract | Additive tree metrics can be leveraged by “meta-algorithms” such as neighbor-joining (Saitou and Nei, 1987) and recursive grouping (Choi et al., 2011) to provide consistent learning algorithms for latent trees. |
Abstract | In our learning algorithm, we assume that examples of the form (x^(i), y^(i)) for i ∈ [N] = {1, .
Abstract | The word embeddings are used during the learning process, but the final decoder that the learning algorithm outputs maps a POS tag sequence x to a parse tree.
Introduction | (2002) and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity. |
Introduction | To this end, we extend the existing word embedding learning algorithm (Collobert et al., 2011) and develop three neural networks to effectively incorporate the supervision from sentiment polarity of text (e.g. |
Introduction | In terms of the accuracy of polarity consistency between each sentiment word and its top N closest words, SSWE outperforms existing word embedding learning algorithms.
Related Work | Under this assumption, many feature learning algorithms have been proposed to obtain better classification performance (Pang and Lee, 2008; Liu, 2012; Feldman, 2013).
Related Work | We extend the existing word embedding learning algorithm (Collobert et al., 2011) and develop three neural networks to learn SSWE. |
Related Work | In the following sections, we introduce the traditional method before presenting the details of SSWE learning algorithms.
Abstract | We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. |
Conclusions and Future Work | and Waterman, 1981), and other learning algorithms for discriminating between cognates and non-cognates. |
Our Approach | Therefore, because the edit distance has been widely used in this research area and has produced good results, we are encouraged to employ orthographic alignment for identifying pairs of cognates, not only to compute similarity scores, as was previously done, but also to use aligned subsequences as features for machine learning algorithms.
Our Approach | 3.3 Learning Algorithms |
Implementation | The learning algorithm is applied in a shift-reduce parser, where the training data consists of the (unique) list of shift and reduce operations required to produce the gold RST parses. |
Introduction | Alternatively, our approach can be seen as a nonlinear learning algorithm for incremental structure prediction, which overcomes feature sparsity through effective parameter tying. |
Large-Margin Learning Framework | of our learning algorithm are different. |
Large-Margin Learning Framework | Algorithm 1: Mini-batch learning algorithm. Input: training set D, regularization parameters λ and τ, number of iterations T, initialization matrix A0, and threshold ε. while t = 1, . . . , T do
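The flattened pseudocode above is the skeleton of a standard mini-batch training loop; a generic Python sketch with the batch update left abstract (all names are illustrative, and the paper's regularization and threshold logic is omitted):

```python
def minibatch_train(data, update, w0, batch_size=8, epochs=3):
    """Generic mini-batch loop: for each of `epochs` passes over the
    training set, slice it into consecutive batches and apply the
    caller-supplied `update(w, batch)` rule to each batch."""
    w = w0
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            w = update(w, data[i:i + batch_size])
    return w
```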
Feature Weighting Methods | ing, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.
Related Work | Noise tolerance techniques aim to improve the learning algorithm itself to avoid over-fitting caused by mislabeled instances in the training phase, so that the constructed classifier becomes more noise-tolerant. |
Related Work | Decision tree (Mingers, 1989; Vannoorenberghe and Denoeux, 2002) and boosting (Jiang, 2001; Kalai and Servedio, 2005; Karmaker and Kwek, 2006) are two learning algorithms that have been investigated in many studies.
Related Work | For example, useful information can be removed with noise elimination, since annotation errors are likely to occur on ambiguous instances that are potentially valuable for learning algorithms.
Abstract | This section first introduces the fundamental supervised learning method, and then describes a baseline active learning algorithm . |
Abstract | 3.2 Active Learning Algorithm |
Abstract | 4.4 Bilingual Active Learning Algorithm |
Introduction | In order to address these unique challenges for wikification for the short tweets, we employ graph-based semi-supervised learning algorithms (Zhu et al., 2003; Smola and Kondor, 2003; Blum et al., 2004; Zhou et al., 2004; Talukdar and Crammer, 2009) for collective inference by exploiting the manifold (cluster) structure in both unlabeled and labeled data. |
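The graph-based semi-supervised learners cited above can be illustrated with a minimal label-propagation sketch in the spirit of Zhu et al. (2003): unlabeled nodes iteratively take the weighted average of their neighbors' scores while labeled nodes stay clamped. The graph representation and names here are assumptions for illustration, not the authors' implementation:

```python
def propagate_labels(weights, labels, iters=100):
    """Iterative label propagation on a weighted graph.

    weights: dict mapping node -> {neighbor: edge weight}
    labels:  dict mapping labeled node -> score in [0, 1]
    Unlabeled nodes start at 0.5 and converge toward the harmonic
    solution; labeled nodes are clamped to their given scores."""
    scores = {v: labels.get(v, 0.5) for v in weights}
    for _ in range(iters):
        for v, nbrs in weights.items():
            if v in labels:
                continue  # clamp labeled nodes
            total = sum(nbrs.values())
            if total > 0:
                scores[v] = sum(wt * scores[u] for u, wt in nbrs.items()) / total
    return scores
```

On a chain a–b–c with a labeled 1 and c labeled 0, the middle node settles at the average of its two neighbors, which is the harmonic-function behavior the cited work exploits for collective inference.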
Introduction | effort to explore graph-based semi-supervised learning algorithms for the wikification task. |
Semi-supervised Graph Regularization | We propose a novel semi-supervised graph regularization framework based on the graph-based semi-supervised learning algorithm (Zhu et al., 2003): |
Add arc <eC,ej> to GC with | As we employed the MIRA learning algorithm, it is possible to identify which specific features are useful by looking at the weights learned for each feature using the training data.
Add arc <eC,ej> to GC with | Other text-level discourse parsing methods include: (1) Percep-coarse: we replace MIRA with the averaged perceptron learning algorithm, and the other settings are the same as Our-coarse; (2) HILDA-manual and HILDA-seg are from Hernault (2010b)’s work, and their input EDUs are from RST-DT and their own EDU segmenter, respectively; (3) LeThanh indicates the results given by LeThanh et al.
Add arc <eC,ej> to GC with | We can also see that the averaged perceptron learning algorithm, though simple, can achieve comparable performance, better than HILDA-manual.
Introduction | Implicitly, the weight learning algorithm can be seen as a gradient descent procedure minimizing the difference between the scores of the highest-scoring (Viterbi) state sequences and the label state sequences.
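The score-difference view above corresponds to a structured-perceptron-style step: the (sub)gradient of score(Viterbi) − score(label) adds the label-sequence features and subtracts the Viterbi-sequence features. A minimal sketch with sparse feature dicts (names are illustrative, not the paper's code):

```python
def perceptron_step(w, feats_label, feats_viterbi, lr=1.0):
    """One gradient step on score(viterbi) - score(label) for a linear
    model: boost features of the label sequence, penalize features of
    the current Viterbi sequence. Weights and features are sparse
    dicts mapping feature name -> value."""
    w = dict(w)  # do not mutate the caller's weights
    for f, c in feats_label.items():
        w[f] = w.get(f, 0.0) + lr * c
    for f, c in feats_viterbi.items():
        w[f] = w.get(f, 0.0) - lr * c
    return w
```

When the Viterbi sequence already equals the label sequence, the two feature dicts coincide and the update is a no-op, which is exactly the perceptron's convergence condition.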
Introduction | Pseudocode of the learning algorithm for the partially labeled case is given in Algorithm 1. |
Introduction | We see that while all three learning algorithms perform better than the baseline, the performance of the purely unsupervised system is inferior to supervised approaches. |
Related Work | Our work builds on one such approach — SampleRank (Wick et al., 2011), a sampling-based learning algorithm . |
Sampling-Based Dependency Parsing with Global Features | We begin with the notation before addressing the decoding and learning algorithms . |
Sampling-Based Dependency Parsing with Global Features | Figure 4 summarizes the learning algorithm.