Introduction | based Feature Reduction for MaxEnt |
Introduction | In their Maximum Entropy (MaxEnt) based approach for Hindi NER development, Saha et al. |
Introduction | (2008) also observed that the performance of the MaxEnt-based model often decreases when a huge number of features is used in the model. |
Maximum Entropy Based Model for Hindi NER | The Maximum Entropy (MaxEnt) principle is a commonly used technique that provides the probability that a token belongs to a class. |
Maximum Entropy Based Model for Hindi NER | MaxEnt computes the probability p(o|h) for any o from the space of all possible outcomes O, and for every h from the space of all possible histories H. In NER, the history can be viewed as all information derivable from the training corpus relative to the current token. |
Maximum Entropy Based Model for Hindi NER | The computation of the probability p(o|h) of an outcome for a token in MaxEnt depends on a set of features that are helpful in making predictions about the outcome. |
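The conditional probability described in the rows above can be sketched in a few lines. This is a minimal illustration of a generic log-linear (MaxEnt) classifier, not the cited authors' implementation; the feature names, outcome labels, and weight values are hypothetical and chosen only for the example.

```python
import math

def maxent_probs(active_features, weights, outcomes):
    """Return p(o|h) for every outcome o, given the features active in history h."""
    # Each outcome's score is the sum of the weights of its active (feature, outcome) pairs.
    scores = {
        o: sum(weights.get((f, o), 0.0) for f in active_features)
        for o in outcomes
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {o: math.exp(s - top) for o, s in scores.items()}
    z = sum(unnormalized.values())  # normalizer over all outcomes
    return {o: u / z for o, u in unnormalized.items()}

# Hypothetical weights for a toy NER setting (assumed values, not learned from data).
ner_weights = {
    ("prev_word=shri", "PER"): 2.0,
    ("suffix=pur", "LOC"): 1.5,
    ("prev_word=shri", "O"): -0.5,
}
ner_probs = maxent_probs(["prev_word=shri"], ner_weights, ["PER", "LOC", "O"])
print(max(ner_probs, key=ner_probs.get))
```

In a trained model the weights would be estimated from the training corpus; here they are fixed by hand so that the effect of a single active feature is easy to follow.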
Decoding with Sense-Based Translation Model | MaxEnt classifiers |
Decoding with Sense-Based Translation Model | Once we get sense clusters for word tokens in test sentences, we load pre-trained MaxEnt classifiers of the corresponding word types. |
Experiments | We trained our MaxEnt classifiers with the off-the-shelf MaxEnt tool.4 We performed 100 iterations of the L-BFGS algorithm implemented in the training toolkit on the collected training events from the sense-annotated data as described in Section 3.2. |
Experiments | It took an average of 57.5 seconds for training a Maxent classifier. |
Sense-Based Translation Model | entropy ( MaxEnt ) based classifier that is used to predict the translation probability p(é|C(c)). |
Sense-Based Translation Model | The MaxEnt classifier can be formulated as follows. |
Sense-Based Translation Model | This is not an issue for the MaxEnt classifier as it can deal with arbitrary overlapping features (Berger et al., 1996). |
Alignment Link Confidence Measure | We combine the HMM alignment, the BM alignment and the MaxEnt alignment (ME) using the above link selection algorithm. |
Alignment Link Confidence Measure | Figure 3 shows such an example, where alignment errors in the MaxEnt alignment are shown with dotted lines. |
Improved MaxEnt Aligner with Confidence-based Link Filtering | In addition to the alignment combination, we also improve the performance of the MaxEnt aligner through confidence-based alignment link filtering. |
Improved MaxEnt Aligner with Confidence-based Link Filtering | Here we select the MaxEnt aligner because it has |
Introduction | In section 4 we show how to improve MaxEnt word alignment quality by removing low confidence alignment links, which also leads to improved translation quality as shown in section 5. |
Related Work | Regarding word alignment combination, in addition to the commonly used "intersection-union-refine" approach (Och and Ney, 2003), Ayan and Dorr (2006b) and Ayan et al. (2005) combined alignment links from multiple word alignments based on a set of linguistic and alignment features within the MaxEnt framework or a neural net model. |
Sentence Alignment Confidence Measure | HMM 54.72 -0.710 |
Sentence Alignment Confidence Measure | BM 62.53 -0.699 |
Sentence Alignment Confidence Measure | MaxEnt 69.26 -0.699 |
Sentence Alignment Confidence Measure | We randomly selected 512 Chinese-English (CE) sentence pairs and generated word alignment using the MaxEnt aligner (Ittycheriah and Roukos, 2005). |
Sentence Alignment Confidence Measure | For each sentence pair in the CE test set, we calculate the confidence scores of the HMM alignment, the Block Model alignment and the MaxEnt alignment, then select the alignment with the highest confidence score. |
Translation | We extract phrase translation tables from the baseline MaxEnt word alignment as well as the alignment with confidence-based link filtering, then translate the test set with each phrase translation table. |
Experimental Setup | Subsequently, the feature extraction stage (a VSM or a MaxEnt model as the case may be) generates the syntactic complexity feature which is then incorporated in a multiple linear regression model to generate a score. |
Experimental Setup | We used the maximum entropy classifier implementation in the MaxEnt toolkit4. |
Experimental Setup | The results that follow are based on the MaxEnt classifier's parameter settings initialized to zero. |
Models for Measuring Grammatical Competence | The inductive classifier we use here is the maximum-entropy model ( MaxEnt ) which has been used to solve several statistical natural language processing problems with much success (Berger et al., 1996; Borthwick et al., 1998; Borthwick, 1999; Pang et al., 2002; Klein et al., 2003; Rosenfeld, 2005). |
Models for Measuring Grammatical Competence | The productive feature engineering aspects of incorporating features into the discriminative MaxEnt classifier motivate the model choice for the problem at hand. |
Models for Measuring Grammatical Competence | In particular, the ability of the MaxEnt model’s estimation routine to handle overlapping (correlated) features makes it directly applicable to address the first limitation of the VSM model. |
Analysis | MAXENT seems to outperform the CLASS BASED baseline because it learns more from the training data. |
Analysis | [Figure legend: MaxEnt, ClassBased] |
Experiments | To evaluate our system ( MAXENT ) and our baselines, we partitioned the corpora into training and testing data. |
Experiments | For each NP in the test data, we generated a set of modifiers and looked at the predicted orderings of the MAXENT , CLASS BASED, and GOOGLE N-GRAM methods. |
Results | The MAXENT model consistently outperforms CLASS BASED across all test corpora and sequence lengths for both tokens and types, except when testing on the Brown and Switchboard corpora for modifier sequences of length 5, for which neither approach is able to make any correct predictions. |
Results | MAXENT also outperforms the GOOGLE N-GRAM baseline for almost all test corpora and sequence lengths. |
Results | For the Switchboard test corpus token and type accuracies, the GOOGLE N-GRAM baseline is more accurate than MAXENT for sequences of length 2 and overall, but the accuracy of MAXENT is competitive with that of GOOGLE N-GRAM. |
A Joint Model with Unlabeled Parallel Text | Maximum entropy ( MaxEnt ) models1 have been widely used in many NLP tasks (Berger et al., 1996; Ratnaparkhi, 1997; Smith, 2006). |
A Joint Model with Unlabeled Parallel Text | With MaxEnt , we learn from the input data: |
A Joint Model with Unlabeled Parallel Text | When λ1 is 0, the algorithm ignores the unlabeled data and degenerates to two MaxEnt models trained on only the labeled data. |
Experimental Setup 4.1 Data Sets and Preprocessing | MaxEnt: This method learns a MaxEnt classifier for each language given the monolingual labeled data; the unlabeled data is not used. |
Results and Analysis | By making use of the unlabeled parallel data, our proposed approach improves the accuracy, compared to MaxEnt, by 8.12% (or 33.27% error reduction) on English and 3.44% (or 16.92% error reduction) on Chinese in the first setting, and by 5.07% (or 19.67% error reduction) on English and 3.87% (or 19.4% error reduction) on Chinese in the second setting. |
Results and Analysis | 8 Significance is tested using paired t-tests with p<0.05: denotes statistical significance compared to the corresponding performance of MaxEnt; * denotes statistical significance compared to SVM; and r denotes statistical significance compared to Co-SVM. |
Results and Analysis | When λ1 is set to 0, the joint model degenerates to two MaxEnt models trained with only the labeled data. |
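The paired figures quoted above (an accuracy gain alongside an error reduction) are linked by simple arithmetic, which the sketch below works through. It assumes that each accuracy gain is an absolute difference in percentage points and that each error reduction is relative to the baseline error rate; the source does not state this explicitly, and the function name is hypothetical.

```python
def implied_baseline_error(abs_gain_points, rel_error_reduction_pct):
    """Baseline error rate (in %) implied by an absolute accuracy gain
    and the matching relative error reduction.

    Assumption: rel_error_reduction = abs_gain / baseline_error.
    """
    return abs_gain_points / (rel_error_reduction_pct / 100.0)

# Figures quoted in the text for the first setting.
english_error = implied_baseline_error(8.12, 33.27)
chinese_error = implied_baseline_error(3.44, 16.92)

# Implied baseline accuracies under the stated assumption.
english_accuracy = 100.0 - english_error
chinese_accuracy = 100.0 - chinese_error
print(round(english_accuracy, 1), round(chinese_accuracy, 1))
```

Under that reading, a gain of 8.12 points that removes about a third of the errors implies a baseline error rate of roughly a quarter, which is a quick consistency check on the reported pairs.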
Experiments and Results | The MaxEnt classifiers are also from the Stanford toolkit, and both the document and year mention classifiers use its default settings (quadratic prior). |
Experiments and Results | MaxEnt Unigram is our new discriminative model for this task. |
Experiments and Results | MaxEnt Time is the discriminative model with rich time features (but not NER) as described in Section 3.3.2 (Time+NER includes NER). |
Learning Time Constraints | Figure 2: Distribution over years for a single document as output by a MaxEnt classifier. |
Learning Time Constraints | We train a MaxEnt model on each year mention, to be described next. |
Learning Time Constraints | We use a MaxEnt classifier trained on the individual year mentions. |
Timestamp Classifiers | We used a MaxEnt model and evaluated with the same filtering methods based |
Timestamp Classifiers | Ultimately, this MaxEnt model vastly outperforms these NLLR models. |
Timestamp Classifiers | The above language modeling and MaxEnt approaches are token-based classifiers that one could apply to any topic classification domain. |
Distant Supervision | To increase coverage, we train a Maximum Entropy ( MaxEnt ) classifier (Manning and Klein, |
Distant Supervision | The MaxEnt model achieves an F1 of 61.2% on the SR corpus (Table 3, line 2). |
Distant Supervision | As described in Section 4, each document is represented as a graph of sentences and weights between sentences and source/sink nodes representing SR/SNR are set to the confidence values obtained from the distantly trained MaxEnt classifier. |
Sentiment Relevance | We divide both the SR and P&L corpora into training (50%) and test sets (50%) and train a Maximum Entropy ( MaxEnt ) classifier (Manning and Klein, 2003) with bag-of-word features. |
Discussions and Conclusions | We achieved a high accuracy of 84.7% for predicting such boundaries using a MaxEnt model on machine parse trees. |
Elementary Trees to String Grammar | During training, we label nodes with translation boundaries, as one additional function tag; during decoding, we employ the MaxEnt model to predict the translation boundary label probability for each span associated with a subgraph y, and discourage derivations accordingly for using nonterminals over the non-translation boundary span. |
Experiments | To learn our MaxEnt models defined in § 3.3, we collect events while extracting the elm2str grammar at training time, and learn the model using improved iterative scaling. |
Experiments | There are 16 thousand human parse trees with human alignment; an additional 1 thousand human-parsed and aligned sentence pairs are used as an unseen test set to verify our MaxEnt models and parsers. |
Experiments | It showed that our MaxEnt model is very accurate on human trees, with 94.5% accuracy, and reaches about 84.7% accuracy on machine-parsed trees. |
Introduction | The boundary cases were not addressed in the previous literature for trees, and here we include them in our feature sets for learning a MaxEnt model to predict the transformations. |
Introduction | The rest of the paper is organized as follows: in section 2, we analyze the projectable structures using human aligned and parsed data, to identify the problems for SCFG in general; in section 3, our proposed approach is explained in detail, including the statistical operators using a MaxEnt model; in section 4, we illustrate the integration of the proposed approach in our decoder; in section 5, we present experimental results; in section 6, we conclude with discussions and future work. |
Methods | This is performed by weighting features to maximise the likelihood of data and, for each instance, decisions are made based on features present at that point, thus maxent classification is quite suitable for our purposes. |
Methods | As feature weights are mutually estimated, the maxent classifier is capable of taking feature dependence into account. |
Methods | By downweighting such features, maxent is capable of modelling to a certain extent the special characteristics which arise from the automatic or weakly supervised training data acquisition procedure. |
Results | Here we decided not to check whether these keywords made sense in scientific texts or not, but instead left this task to the maximum entropy classifier, and added only those keywords that were found reliable enough to predict spec label alone by the maxent model trained on the training dataset. |
Results | This 54-keyword maxent classifier got an F_{β=1}(spec) score of 79.73%. |
Results | We manually examined all keywords that had a P(spec) > 0.5 given as a standalone instance for our maxent model, and constructed a dictionary of hedge cues from the promising candidates. |
Error Detection with a Maximum Entropy Model | We tune our model feature weights using an off-the-shelf MaxEnt toolkit (Zhang, 2004). |
Error Detection with a Maximum Entropy Model | During test, if the probability p(correct|¢) is larger than p(incorrect|¢) according to the trained MaxEnt model, the word is labeled as correct, otherwise as incorrect. |
Experiments | Starting with MaxEnt models with a single linguistic feature or word posterior probability based feature, we incorporated additional features incrementally by combining features together. |
Experiments | We conducted three groups of experiments using the MaxEnt based error detection model with various feature combinations. |
Experiments | Using discrete word posterior probabilities as features in the MaxEnt based error detection model is marginally better than word posterior probability thresholding in terms of CER, but obtains a 13.79% relative improvement in F measure. |
Introduction | We integrate two sets of linguistic features into a maximum entropy ( MaxEnt ) model and develop a MaxEnt-based binary classifier to predict the category (correct or incorrect) for each word in a generated target sentence. |
Conclusions | Furthermore, our framework is scalable for other local sentence level extractors in addition to the MaxEnt model. |
Experiments | Our ILP model and its variants all outperform Mintz++ in precision in both datasets, indicating that our approach helps filter out incorrect predictions from the output of the MaxEnt model. |
Experiments | However, in Riedel's dataset, Mintz++, the MaxEnt relation extractor, does not perform well, and our framework cannot improve its performance. |
Experiments | Hence, our framework does not perform well due to the poor performance of the MaxEnt extractor and the lack of clues. |
The Framework | By adopting ILP, we can combine the local information including MaxEnt confidence scores and the implicit relation backgrounds that are embedded into global consistencies of the entity tuples together. |
Consensus Decoding Algorithms | The standard Viterbi decoding objective is to find $e^* = \arg\max_e \lambda \cdot \theta(f, e)$. |
Consensus Decoding Algorithms | $\tilde{e} = \arg\max_e \mathbb{E}_{P(e'|f)}[S(e; e')] = \arg\max_e \sum_{e' \in E} P(e'|f) \cdot S(e; e')$ |
Consensus Decoding Algorithms | $\arg\max_{e \in E} \mathbb{E}_{P(e'|f)}[S(e; e')] = \arg\max_e \sum_{e' \in E} P(e'|f) \cdot \sum_j \omega_j(e) \cdot \phi_j(e')$ |
Experiment | To establish performance for the phonological standard, we use the IBPOT learner to find constraint weights but do not update M. The resultant learner is essentially MaxEnt OT with the weights estimated through Metropolis sampling instead of gradient ascent. |
Introduction | We consider this question by examining the dominant framework in modern phonology, Optimality Theory (Prince and Smolensky, 1993, OT), implemented in a log-linear framework, MaxEnt OT (Goldwater and Johnson, 2003), with output forms' probabilities based on a weighted sum of |
Phonology and Optimality Theory 2.1 OT structure | In IBPOT, we use the log-linear EVAL developed by Goldwater and Johnson (2003) in their MaxEnt OT system. |
Phonology and Optimality Theory 2.1 OT structure | MEOT also is motivated by the general MaxEnt framework, whereas most other OT formulations are ad hoc constructions specific to phonology. |
Phonology and Optimality Theory 2.1 OT structure | In MaxEnt OT, each constraint has a weight, and the candidates' scores are the sums of the weights of violated constraints. |
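The scoring scheme described in the row above can be illustrated with a short sketch. It assumes the usual log-linear reading in which a candidate's probability is proportional to the exponential of its negated score, so that heavier violations make a candidate less likely; the candidate forms, violation counts, and weights are hypothetical.

```python
import math

def maxent_ot_probs(violations, constraint_weights):
    """Map each candidate output form to its probability under a log-linear EVAL.

    violations:         candidate -> list of violation counts, one per constraint
    constraint_weights: non-negative weights, in the same constraint order
    """
    # Score of a candidate: weighted sum of its constraint violations.
    scores = {
        cand: sum(w * v for w, v in zip(constraint_weights, viols))
        for cand, viols in violations.items()
    }
    # Assumption: p(candidate) is proportional to exp(-score).
    # Shift by the best (lowest) score for numerical stability.
    best = min(scores.values())
    unnormalized = {c: math.exp(-(s - best)) for c, s in scores.items()}
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}

# Hypothetical tableau: three candidates evaluated against two constraints.
tableau = {"ta": [0, 1], "tat": [1, 0], "tata": [1, 2]}
ot_probs = maxent_ot_probs(tableau, [2.0, 0.5])
print(max(ot_probs, key=ot_probs.get))
```

Unlike classical OT with strict ranking, this formulation lets several lightly weighted violations outweigh one heavily weighted violation, and it assigns every candidate a nonzero probability.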
Generating reference reordering from parallel sentences | This model was significantly better than the MaxEnt aligner (Ittycheriah and Roukos, 2005) and is also flexible in the sense that it allows for arbitrary features to be introduced while still keeping training and decoding tractable by using a greedy decoding algorithm that explores potential alignments in a small neighborhood of the current alignment. |
Generating reference reordering from parallel sentences | The model thus needs a reasonably good initial alignment to start with for which we use the MaxEnt aligner (Ittycheriah and Roukos, 2005) as in McCarley et al. |
Results and Discussions | None - 35.5 Manual 180K 52.5 MaxEnt 70.0 3.9M 49.5 |
Results and Discussions | We see that the quality of the alignments matters a great deal to the reordering model; using MaxEnt alignments causes a degradation in performance over just using a small set of manual word alignments. |
Experiment | Method P R F OOV-R |
Experiment | Stanford 0.861 0.853 0.857 0.639 |
Experiment | ICTCLAS 0.812 0.861 0.836 0.602 |
Experiment | Li-Sun 0.707 0.820 0.760 0.734 |
Experiment | Maxent 0.868 0.844 0.856 0.760 |
Experiment | No-punc 0.865 0.829 0.846 0.760 |
Experiment | No-balance 0.869 0.877 0.873 0.757 |
Experiment | Our method 0.875 0.875 0.875 0.773 |
Experiment | Maxent only uses the PKU data for training, with neither punctuation information nor self-training framework incorporated. |
Experiment | The comparison of Maxent and No-punctuation |
Document-specific MT System | ment (HMM (Vogel et al., 1996) and MaxEnt (Ittycheriah and Roukos, 2005) alignment models, phrase pair extraction, MT model training (Ittycheriah and Roukos, 2007) and LM model training. |
Related Work | Target part-of-speech and null dependency link are exploited in a MaxEnt classifier to improve the MT quality estimation (Xiong et al., 2010). |
Static MT Quality Estimation | 17 decoding features, including phrase translation probabilities (source-to-target and target-to-source), word translation probabilities (also in both directions), maxent probabilities1, word count, phrase count, distor- |
Static MT Quality Estimation | 1 The maxent probability is the translation probability |
Discussion | It shows that 1) as expected, our classifiers do worse on the harder semantic reordering prediction than on syntactic reordering prediction; 2) thanks to the high accuracy obtained by the maxent classifiers, integrating either the syntactic or the semantic reordering constraints results in better reordering performance from both syntactic and semantic perspectives; 3) in terms of the mutual impact, the syntactic reordering models help improve semantic reordering more than the semantic reordering |
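Discussion | System Syntactic(l-m) Syntactic(rm) Semantic(l-m) Semantic(rm) |
Discussion | MR08 75.0 78.0 66.3 68.5 |
Discussion | +syn-reorder 78.4 80.9 69.0 70.2 |
Discussion | +sem-reorder 76.0 78.8 70.7 72.7 |
Discussion | +both 78.6 81.7 70.6 72.1 |
Discussion | Maxent Classifier 80.7 85.6 70.9 73.5 |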
Experiments | tactic parsing and semantic role labeling on the Chinese sentences, then train the models by using MaxEnt toolkit with L1 regularizer (Tsuruoka et al., 2009).3 Table 3 shows the reordering type distribution over the training data. |
Related Work | Marton and Resnik (2008) employed soft syntactic constraints with weighted binary features and no MaxEnt model. |
Available at http://nlp.stanford.edu/software/mimlre.shtml. | [Figure legend: Guided DS, Semi-MIML, DS+upsampling, MaxEnt] |
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Our baselines: 1) MaxEnt is a supervised maximum entropy baseline trained on human-labeled data; 2) DS+upsampling is an upsampling experiment, where MIML was trained on a mix of distantly-labeled and human-labeled data; 3) Semi-MIML is a recent semi-supervised extension. |
The Challenge | We experimentally tested alternative feature sets by building supervised Maximum Entropy (MaxEnt) models using the hand-labeled data (Table 3), and selected an effective combination of three features from the full feature set used by Surdeanu et al. (2011): |
The Challenge | Table 3: Performance of a MaxEnt model trained on hand-labeled data using all features (Surdeanu et al., 2011) vs. using a subset of two (types of entities, dependency path) or three (adding a span word) features, and evaluated on the test set. |
Introduction | System | BLEU | TER |
Introduction | standard | 34.79 | 56.93 |
Introduction | + depLM | 35.29* | 56.17** |
Introduction | + maxent | 35.40** | 56.09** |
Introduction | + depLM & maxent | 35.71** | 55.87** |
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model ( “maxent” ) significantly improves BLEU and TER on the development set, both separately and jointly. |
MultiLayer Context Model - MCM | and predicted dialog act by $\arg\max_a \hat{P}(a \mid u, d^*)$: |
MultiLayer Context Model - MCM | * N M311 a; = arg maXa [6351 a * HF”, M1: ] (6) |
MultiLayer Context Model - MCM | For each segment $w_{uj}$ in u, its predicted slot is determined by $\arg\max_s P(s_j \mid w_{uj}, d^*, s_{j-1})$: |
Abstract | In this paper we present a comprehensive treatment of ECs by first recovering them with a structured MaxEnt model with a rich set of syntactic and lexical features, and then incorporating the predicted ECs into a Chinese-to-English machine translation task through multiple approaches, including the extraction of EC-specific sparse features. |
Chinese Empty Category Prediction | We propose a structured MaxEnt model for predicting ECs. |
Chinese Empty Category Prediction | (1) is the familiar log-linear (or MaxEnt) model, where $f_k(e_{i-1}, T, e_i)$ is the feature function and |
Discussion | We use three sentiment classification techniques: Naïve Bayes, MaxEnt and SVM with unigrams, bigrams and trigrams as features. |
Discussion | MaxEnt (Movie) -0.29 (72.17) |
Discussion | MaxEnt (Twitter) -0.26 (71.68) |
Discussion | SVM (Movie) -0.24 (66.27) |
Discussion | SVM (Twitter) -0.19 (73.15) |
Discussion | MaxEnt has the highest negative correlation of -0.29 and -0.26. |
Experiments | unigram, bigram, trigram 92.6 MAXENT POS, chunks, NE, supertags |
Experiments | unigram, bigram, trigram 93.6 MAXENT POS, wh-word, head word |
Experiments | SVM 81.6 BINB 82.7 MAXENT 83.0 MAX-TDNN 78.8 NBOW 80.9 DCNN 87.4 |