Approaches | Figure 2 shows the factor graph for this joint model.
Discussion and Future Work | We find that we can outperform prior work in the low-resource setting by coupling the selection of feature templates based on information gain with a joint model that marginalizes over latent syntax. |
Discussion and Future Work | Our discriminative joint models treat latent syntax as a structured feature to be optimized for the end task of SRL, while our other grammar induction techniques optimize for unlabeled-data likelihood, optionally with distant supervision.
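Discussion and Future Work | To make the information-gain criterion concrete, here is a minimal sketch of template selection; the function names and the exact selection procedure are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of an empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(label; feature) = H(label) - H(label | feature)."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [y for y, f in zip(labels, feature_values) if f == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_cond

def select_templates(labels, templates, k):
    """templates: dict mapping template name -> per-instance feature values.
    Rank candidate templates by information gain and keep the top k."""
    ranked = sorted(templates,
                    key=lambda t: information_gain(labels, templates[t]),
                    reverse=True)
    return ranked[:k]
```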
Experiments | This highlights an important advantage of the pipeline-trained model: the features can consider any part of the syntax (e.g., arbitrary subtrees), whereas the joint model is limited to those features over which it can efficiently marginalize (e.g., short dependency paths).
Experiments | In the low-resource setting of the CoNLL-2009 Shared Task without syntactic supervision, our joint model (Joint) with marginalized syntax obtains state-of-the-art results with the IGC features described in § 4.2.
Experiments | These results begin to answer a key research question in this work: The joint models outperform the pipeline models in the low-resource setting. |
Introduction | Comparison of pipeline and joint models for SRL.
Introduction | The joint models use a non-loopy conditional random field (CRF) with a global factor constraining latent syntactic edge variables to form a tree. |
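Introduction | Schematically, a CRF of this kind defines a distribution over SRL variables y and latent syntactic edge variables z, with one hard global factor restricting z to trees; the notation below is an illustrative sketch, not the paper's exact parameterization:

$$p(y, z \mid x) = \frac{1}{Z(x)}\, \mathbb{1}\!\left[z \text{ forms a tree}\right] \prod_{\alpha} \exp\!\left(\theta^{\top} f_{\alpha}(y_{\alpha}, z_{\alpha}, x)\right)$$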
Introduction | Even without access to dependency path features, the joint models outperform pipeline-trained models, achieving state-of-the-art performance in the low-resource setting (§ 4.4).
Related Work | In both pipeline and joint models, we use features adapted from state-of-the-art approaches to SRL.
Bottom-up tree-building | However, the major distinction between our models and theirs is that we do not jointly model the structure and the relation; rather, we use two linear-chain CRFs.
Bottom-up tree-building | Although joint modeling has been shown to be effective in various NLP and computer vision applications (Sutton et al., 2007; Yang et al., 2009; Wojek and Schiele, 2008), our choice of using two separate models is for the following reasons:
Bottom-up tree-building | Then, in the tree-building process, we will have to deal with the situations where the joint model yields conflicting predictions: it is possible that the model predicts S_j = 1 and R_j = NO-REL, or vice versa, and we will have to decide which prediction to trust (and thus, in some sense, the structure and the relation are no longer jointly modeled).
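Bottom-up tree-building | The conflict described above is easy to state concretely; in this hypothetical sketch, S_j = 1 says two adjacent units should be merged while R_j says no relation holds between them:

```python
# Hypothetical joint prediction for candidate node j.
s_j = 1          # structure: merge the two adjacent discourse units
r_j = "NO-REL"   # relation: no discourse relation holds between them

# The two outputs contradict each other, so the tree-builder must decide
# which one to trust; at that point, structure and relation are in effect
# no longer jointly modeled.
if s_j == 1 and r_j == "NO-REL":
    print("conflict: structure says merge, relation says NO-REL")
```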
Related work | 2.2 Joty et al.’s joint model |
Related work | Second, they jointly modeled the structure and the relation for a given pair of discourse units. |
Related work | The strength of Joty et al.’s model is their joint modeling of the structure and the relation, such that information from each aspect can interact with the other.
Abstract | Experiments on Automatic Content Extraction (ACE) corpora demonstrate that our joint model significantly outperforms a strong pipelined baseline, which attains better performance than the best-reported end-to-end system.
Conclusions and Future Work | In addition, we aim to incorporate other IE components such as event extraction into the joint model.
Experiments | We compare our proposed method (Joint w/ Global) with the pipelined system (Pipeline), the joint model with only local features (Joint w/ Local), and two human annotators who annotated 73 documents in the ACE'05 corpus.
Experiments | Our joint model correctly identified the entity mentions and their relation. |
Experiments | Figure 7 shows the details when the joint model is applied to this sentence. |
Introduction | This is the first work to incrementally predict entity mentions and relations using a single joint model (Section 3). |
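Introduction | As a rough illustration of incremental joint prediction, the generic beam-search skeleton below extends each partial analysis token by token with mention (and, once mentions close, relation) decisions; `extend`, `score`, and the beam size are hypothetical stand-ins, not the paper's actual model:

```python
def beam_decode(tokens, extend, score, beam_size=8):
    """Generic incremental joint decoding with a beam.

    extend(hyp, i, tok) yields the legal decisions for hypothesis hyp
    at token i; score(hyp) returns a model score for a partial analysis.
    """
    beam = [()]  # start from a single empty partial analysis
    for i, tok in enumerate(tokens):
        candidates = [h + (d,) for h in beam for d in extend(h, i, tok)]
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    return beam[0]  # highest-scoring complete analysis
```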
Abstract | Here, we present a novel formulation for a neural network joint model (NNJM), which augments the NNLM with a source context window. |
Introduction | Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
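Introduction | Under this formulation, the translation probability decomposes per target word, with each t_i conditioned on the n-1 previous target words and an m-word window centered on the source position a_i aligned to t_i; this is a sketch of the standard decomposition:

$$P(T \mid S) \approx \prod_{i=1}^{|T|} P\!\left(t_i \,\middle|\, t_{i-1}, \ldots, t_{i-n+1},\; s_{a_i - \frac{m-1}{2}}, \ldots, s_{a_i + \frac{m-1}{2}}\right)$$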
Introduction | Unlike previous approaches to joint modeling (Le et al., 2012), our feature can be easily integrated into any statistical machine translation (SMT) decoder, which leads to substantially larger improvements than k-best rescoring only. |
Model Variations | Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy. |
Model Variations | This is consistent with our rescoring-only result, which indicates that k-best rescoring is too shallow to take advantage of the power of a joint model.
Model Variations | We have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model.
Neural Network Joint Model (NNJM) | To make this a joint model, we also condition on the source context vector s_i: P(t_i | t_{i-1}, ..., t_{i-n+1}, s_i).
Abstract | In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation. |
Evaluation | To illustrate how the model described above can learn geographically-informed semantic representations of words, table 1 displays the terms with the highest cosine similarity to wicked in Kansas and Massachusetts after running our joint model on the full 1.1 billion words of Twitter data; while wicked in Kansas is close to other evaluative terms like evil and pure and religious terms like gods and spirit, in Massachusetts it is most similar to other intensifiers like super, ridiculously and insanely. |
Evaluation | As one concrete example of these differences between individual data points, the cosine similarity between city and seattle in the -GEO model is 0.728 (seattle is ranked as the 188th most similar term to city overall); in the INDIVIDUAL model using only tweets from Washington state, δ_WA(city, seattle) = 0.780 (rank #32); and in the JOINT model, using information from the entire United States with deviations for Washington, δ_WA(city, seattle) = 0.858 (rank #6).
Evaluation | While the two models that include geographical information naturally outperform the model that does not, the JOINT model generally far outperforms the INDIVIDUAL models trained on state-specific subsets of the data. A model that can exploit all of the information in the data, learning core vector-space representations for all words along with deviations for each contextual variable, is able to learn more geographically-informed representations for this task than strict geographical models alone.
Model | A joint model has three a priori advantages over independent models: (i) sharing data across variable values encourages representations across those values to be similar; e.g., while city may be closer to Boston in Massachusetts and Chicago in Illinois, in both places it still generally connotes a municipality; (ii) such sharing can mitigate data sparseness for less-witnessed areas; and (iii) with a joint model, all representations are guaranteed to lie in the same vector space, so they can be compared directly.
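Model | A toy sketch of the additive structure behind point (i): each word gets a shared core vector plus a per-state deviation, and a word's representation in, say, Washington is their sum; the vectors here are random stand-ins, so only the structure, not the numbers, is meaningful:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim, words = 100, ("city", "seattle")

# Shared core vector per word, plus a (smaller) per-state deviation vector.
core = {w: rng.normal(size=dim) for w in words}
delta_wa = {w: rng.normal(scale=0.1, size=dim) for w in words}

print(cosine(core["city"], core["seattle"]))              # core space only
print(cosine(core["city"] + delta_wa["city"],
             core["seattle"] + delta_wa["seattle"]))      # WA-situated
```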
Conclusion | In addition, the joint model is efficient enough for practical use. |
Experiments | Other evaluation metrics are also proposed by Zheng et al. (2011a); these are only suitable for their system, since our system uses a joint model.
Experiments | The choice of K also directly bounds the running time of the joint model.
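Experiments | A minimal sketch of why a per-position candidate bound K caps the running time: joint decoding reduces to Viterbi over a candidate lattice, and with at most K candidates per position the decoder does O(n · K²) work; `emit` and `trans` are hypothetical scoring callables, not the system's actual model:

```python
def viterbi(candidates, emit, trans):
    """candidates[i] holds the <= K candidates for position i."""
    score = {c: emit(0, c) for c in candidates[0]}
    path = {c: [c] for c in candidates[0]}
    for i in range(1, len(candidates)):            # n - 1 iterations ...
        new_score, new_path = {}, {}
        for c in candidates[i]:                    # ... of <= K * K work each
            best = max(score, key=lambda p: score[p] + trans(p, c))
            new_score[c] = score[best] + trans(best, c) + emit(i, c)
            new_path[c] = path[best] + [c]
        score, path = new_score, new_path
    return path[max(score, key=score.get)]
```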
Experiments | The results obtained using the proposed joint model are shown in Table 3 and Table 4.
Pinyin Input Method Model | To make typo correction better, we consider integrating it with FTC conversion using a joint model.
Related Works | As we will propose a joint model |
Character-Level Dependency Tree | (2012) proposed a joint model for Chinese word segmentation, POS tagging, and dependency parsing, studying the influence of joint modeling and character features on parsing. Their model extends the arc-standard transition-based model and can be regarded as an alternative to the arc-standard model in our work when pseudo intra-word dependencies are used.
Character-Level Dependency Tree | (2012) investigate a joint model using pseudo intra-word dependencies. |
Character-Level Dependency Tree | To our knowledge, we are the first to apply the arc-eager system to joint models, achieving performance comparable to the arc-standard model.
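Character-Level Dependency Tree | For reference, a minimal sketch of the standard arc-eager transition system (independent of the intra-word extensions discussed in this work; the toy sentence and indices are illustrative):

```python
# Arc-eager transitions over token indices; arcs collects (head, dependent).
def shift(stack, buf, arcs):
    stack.append(buf.pop(0))

def left_arc(stack, buf, arcs):
    arcs.append((buf[0], stack.pop()))   # buffer front heads the stack top

def right_arc(stack, buf, arcs):
    arcs.append((stack[-1], buf[0]))     # stack top heads the buffer front
    stack.append(buf.pop(0))

def reduce_(stack, buf, arcs):
    stack.pop()                          # stack top already has its head

# "economic news improved" as indices 1..3, with 0 the artificial root.
stack, buf, arcs = [0], [1, 2, 3], []
shift(stack, buf, arcs)      # push "economic"
left_arc(stack, buf, arcs)   # news -> economic
shift(stack, buf, arcs)      # push "news"
left_arc(stack, buf, arcs)   # improved -> news
right_arc(stack, buf, arcs)  # root -> improved
assert arcs == [(2, 1), (3, 2), (0, 3)]
```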