A Generative PCFG Model | The Input The set of analyses for a token is thus represented as a lattice in which every arc corresponds to a specific lexeme $l$, as shown in Figure 1. A morphological analyzer $M : \mathcal{W} \rightarrow \mathcal{L}$ is a function mapping sentences in Hebrew ($W \in \mathcal{W}$) to their corresponding lattices ($M(W) = L \in \mathcal{L}$). |
A Generative PCFG Model | 212,, and a morphological analyzer $M$, we look for the most probable parse tree $\pi$ s.t. |
A Generative PCFG Model | Since the lattice L for a given sentence W is determined by the morphological analyzer M we have |
Experimental Setup | Morphological Analyzer Ideally, we would use an off-the-shelf morphological analyzer for mapping each input token to its possible analyses. |
Experimental Setup | patible with the one of the Hebrew Treebank. For this reason, we use a data-driven morphological analyzer derived from the training data similar to (Cohen and Smith, 2007). |
Experimental Setup | To control for the effect of the HSPELL-based pruning, we also experimented with a morphological analyzer that does not perform this pruning. |
Model Preliminaries | We represent all morphological analyses of a given utterance using a lattice structure. |
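The lattice representation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `Arc` type, the `paths` helper, and the transliterated toy token are all assumptions introduced here, and the lattice is assumed to be acyclic.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """One lattice arc, i.e. one candidate lexeme (hypothetical structure)."""
    start: int  # lattice node where the segment begins
    end: int    # lattice node where the segment ends
    form: str   # surface form of the segment
    tag: str    # part-of-speech tag of the segment


def paths(arcs, start, goal):
    """Enumerate every arc sequence from node `start` to node `goal`.

    Each sequence is one complete segmentation and tagging of the input.
    Assumes the lattice is acyclic.
    """
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.start].append(arc)

    def walk(node):
        if node == goal:
            yield []
            return
        for arc in outgoing[node]:
            for rest in walk(arc.end):
                yield [arc] + rest

    return list(walk(start))


# Hypothetical toy lattice for a single ambiguous token: either one noun,
# or a preposition followed by a shorter noun.
TOY_ARCS = [
    Arc(0, 2, "bcl", "NN"),
    Arc(0, 1, "b", "IN"),
    Arc(1, 2, "cl", "NN"),
]

for path in paths(TOY_ARCS, 0, 2):
    print(" ".join(f"{arc.form}/{arc.tag}" for arc in path))
```

Every path through the lattice is one candidate analysis of the whole input, so a parser operating on the lattice chooses among segmentations and tags at the same time.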
Previous Work on Hebrew Processing | Morphological analyzers for Hebrew that analyze a surface form in isolation have been proposed by Segal (2000), Yona and Wintner (2005), and recently by the Knowledge Center for Processing Hebrew (Itai et al., 2006). |
Previous Work on Hebrew Processing | Morphological disambiguators that consider a token in context (an utterance) and propose the most likely morphological analysis of an utterance (including segmentation) were presented by Bar-Haim et al. |
Previous Work on Hebrew Processing | Tsarfaty (2006) used a morphological analyzer (Segal, 2000), a PoS tagger (Bar-Haim et al., 2005), and a general purpose parser (Schmid, 2000) in an integrated framework in which morphological and syntactic components interact to share information, leading to improved performance on the joint task. |
Introduction | For the task of full morphological analysis, the lexicon must provide all possible morphological analyses for any given token. |
Introduction | In this paper, we investigate the characteristics of Hebrew unknowns for full morphological analysis, and propose a new method for handling such unavoidable lack of information. |
Introduction | In our evaluation, these learned distributions include the correct analysis for unknown words in 85% of the cases, contributing an error reduction of over 30% over a competitive baseline for the overall task of full morphological analysis in Hebrew. |
Method | model application is a set of possible full morphological analyses for the token, in exactly the same format as the morphological analyzer provides. |
Previous Work | Habash and Rambow (2006) used the root+pattern+features representation of Arabic tokens for morphological analysis and generation of Arabic dialects, which have no lexicon. |
Previous Work | They report high recall (95%–98%) but low precision (37%–63%) for token types and token instances, against gold-standard morphological analysis. |
Previous Work | Unlike Nakagawa, our model does not use any segmented text, and, on the other hand, it aims to select full morphological analysis for each token, |
Abstract | The first algorithm, PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis), employs a lower order model, whereas the second algorithm, PROMODES-H, is a novel development of the first using a higher order model. |
Introduction | This study is called morphological analysis. |
Introduction | four tasks are assigned to morphological analysis: word decomposition into morphemes, building morpheme dictionaries, defining morphosyntactical rules which state how morphemes can be combined to valid words and defining morphophonological rules that specify phonological changes morphemes undergo when they are combined to words. |
Introduction | Results of morphological analysis are applied in speech synthesis (Sproat, 1996) and recognition (Hirsimäki et al., 2006), machine translation (Amtrup, 2003) and information retrieval (Kettunen, 2009). |
Related work | We have presented two probabilistic generative models for word decomposition, PROMODES and PROMODES-H. Another generative model for morphological analysis has been described by Snover and Brent (2001) and Snover et al. |
Related work | Combining different morphological analysers has been performed, for example, by Atwell and Roberts (2006) and Spiegler et al. |
Abstract | In this paper, we investigate the usefulness of character-level part-of-speech in the task of Chinese morphological analysis. |
Abstract | Through experiments, we demonstrate that by introducing character-level POS information, the performance of a baseline morphological analyzer can be significantly improved. |
Conclusion | In our error analysis, we believe that by exploring the character-level POS and the internal word structure (Zhang et al., 2013) at the same time, it is possible to further improve the performance of morphological analysis and parsing. |
Conclusion | Corpus-based Japanese Morphological Analysis. |
Evaluation | In Table 6 we compare our approach with morphological analyzers in previous studies. |
Introduction | Therefore, compared to word-level POS, the character-level POS can produce information for more expressive features during the learning process of a morphological analyzer. |
Introduction | In this paper, we investigate the usefulness of character-level POS in the task of Chinese morphological analysis. |
Introduction | Through experiments, we demonstrate that by introducing character-level POS information, the performance of a baseline morphological analyzer can be significantly improved. |
Evaluation | Word Generation Tools and Settings For unsupervised learning of morphology, we use Morfessor CAT-MAP (V. 0.9.2) which was shown to be a very accurate morphological analyzer for morphologically rich languages (Creutz and Lagus, 2007). |
Evaluation | and thus we also have a morphological analyzer that can give all possible segmentations for a given word. |
Evaluation | By running the morphological analyzer on the OOVs, we can obtain the potential upper bound of OOV reduction by the system (labeled “∞” in Tables 2 and 3). |
Introduction | For low-resource languages, resources such as morphological analyzers are not usually available, and even good scholarly descriptions of the morphology (from which a tool could be built) are often not available. |
Conclusion and Future Work | In order to help with replication of the results in this paper, we have run the various morphological analysis steps and created the necessary training, tuning and test data files needed in order to train, tune and test any phrase-based machine translation system with our data. |
Conclusion and Future Work | We would particularly like to thank the developers of the open-source Moses machine translation toolkit and the Omorfi morphological analyzer for Finnish which we used for our experiments. |
Experimental Results | So, we ran the word-based baseline system, the segmented model (Unsup L-match), and the prediction model (CRF-LM) outputs, along with the reference translation through the supervised morphological analyzer Omorfi (Pirinen and Listenmaa, 2007). |
Models 2.1 Baseline Models | performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline. |
Related Work | Segmented translation performs morphological analysis on the morphologically complex text for use in the translation model (Brown et al., 1993; Goldwater and McClosky, 2005; de Gispert and Mariño, 2008). |
Related Work | Previous work in segmented translation has often used linguistically motivated morphological analysis selectively applied based on a language-specific heuristic. |
Translation and Morphology | In fact, in our experiments, unsupervised morphology always outperforms the use of a hand-built morphological analyzer. |
Conclusion | For future work, we want to expand our work to other dialects, while utilizing dialectal morphological analysis to improve conversion. |
Previous Work | Sawaf (2010) proposed a dialect to MSA normalization that used character-level rules and morphological analysis. |
Previous Work | We tokenized Egyptian and Arabic according to the ATB tokenization scheme using the MADA+TOKAN morphological analyzer and tokenizer v3.1 (Roth et al., 2008). |
Proposed Methods 3.1 Egyptian to EG’ Conversion | Perhaps a morphological analyzer, or just a part-of-speech tagger, could enforce (or probabilistically encourage) a match in parts of speech. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | In particular, using a morphological analyzer seems like a promising possibility. |
Proposed Methods 3.1 Egyptian to EG’ Conversion | One approach could be to run a morphological analyzer for dialectal Arabic (e.g. |
Evaluation | Morphisto, for example, generates alternative morphological analyses, so that the disambiguation algorithm performs a random choice between these. |
Extensions and Related Research | (V) Integration with other ontological knowledge sources in order to improve the recall of morphosyntactic and morphological analyses (e.g., for disambiguating grammatical case). |
Extensions and Related Research | These observations provide further support for our conclusion that the ontology-based integration of morphosyntactic analyses enhances both the robustness and the level of detail of morphosyntactic and morphological analyses. |
Ontologies and annotations | 2.2 Integrating different morphosyntactic and morphological analyses |
Processing linguistic annotations | (i) Morphisto, a morphological analyzer without contextual disambiguation (Zielinski and Simon, 2008), |
Introduction | In addition, morphological analysis plays a crucial role here, as highly frequent morpheme correspondences can be particularly revealing. |
Introduction | In addition, our model carries out an implicit morphological analysis of the lost language, utilizing the known morphological structure of the related language. |
Model | This interplay implicitly relies on a morphological analysis of words in the lost language, while utilizing knowledge of the known language’s lexicon and morphology. |
Problem Formulation | rect morphological analysis of words in the lost language must be learned, we assume that the inventory and frequencies of prefixes and suffixes in the known language are given. |
Problem Formulation | In summary, the observed input to the model consists of two elements: (i) a list of unanalyzed word types derived from a corpus in the lost language, and (ii) a morphologically analyzed lexicon in a known related language derived from a separate corpus, in our case nonparallel. |
Inflection prediction models | Morphological analysis: returns the set of possible morphological analyses $A_w = \{a_1, \ldots, a_n\}$ for $w$. A morphological analysis $a$ is a vector of categorical values, where each dimension and its possible values are defined by $L$. |
Inflection prediction models | For the morphological analysis operation, we used the same set of morphological features described in (Minkov et al., 2007), that is, seven features for Russian (POS, Person, Number, Gender, Tense, Mood and Case) and 12 for Arabic (POS, Person, Number, Gender, Tense, Mood, Negation, Determiner, Conjunction, Preposition, Object and Possessive pronouns). |
Inflection prediction models | The same is true with the operation of morphological analysis. |
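The morphological analysis operation described in the entries above, which returns a set of analyses where each analysis is a vector of categorical values, can be illustrated as follows. This is only a sketch under stated assumptions: the seven Russian feature names come from the text, but `make_analysis`, `analyze`, the toy lexicon, its transliterated entry, and the feature values are hypothetical and not taken from any cited system.

```python
# Feature dimensions listed in the text for Russian.
RUSSIAN_FEATURES = ("POS", "Person", "Number", "Gender", "Tense", "Mood", "Case")


def make_analysis(features, **values):
    """Build one analysis as a tuple aligned with `features`.

    Dimensions that do not apply to the word are left as None.
    """
    unknown = set(values) - set(features)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return tuple(values.get(name) for name in features)


# Hypothetical toy lexicon: a transliterated noun whose nominative and
# accusative singular forms coincide, giving two possible analyses.
TOY_LEXICON = {
    "stol": {
        make_analysis(RUSSIAN_FEATURES, POS="Noun", Number="Sg",
                      Gender="Masc", Case="Nom"),
        make_analysis(RUSSIAN_FEATURES, POS="Noun", Number="Sg",
                      Gender="Masc", Case="Acc"),
    },
}


def analyze(word, lexicon):
    """Return the set of possible analyses for `word` (empty if unknown)."""
    return lexicon.get(word, set())


for analysis in sorted(analyze("stol", TOY_LEXICON), key=str):
    print(dict(zip(RUSSIAN_FEATURES, analysis)))
```

Representing each analysis as a fixed-length tuple keeps analyses hashable, so they can be collected in a set and compared dimension by dimension.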
Introduction | Work in this area is motivated by two advantages offered by morphological analysis: (1) it provides linguistically motivated clustering of words and makes the data less sparse; (2) it captures morphological constraints applicable on the target side, such as agreement phenomena. |
Abstract | We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available. |
Data | While we do not explore a rule-writing approach to POS-tagging, we do consider the impact of rule-based morphological analyzers as a component in our semi-supervised POS-tagging system. |
Introduction | We also did not consider morphological analyzers as a form of type supervision, as suggested by Merialdo (1994). |
Introduction | Also, morphological analyzers help for morphologically rich languages when there are few labeled types or tokens (and, it never hurts to use them). |
Morphological Transducers | We use FSTs for morphological analysis: the FST accepts a word type and produces a set of morphological features. |
Abstract | On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. |
Experimental Setup and Results | On the Turkish side, we perform a full morphological analysis (Oflazer, 1994) and morphological disambiguation (Yuret and Ture, 2006) to select the contextually salient interpretation of words. |
Experimental Setup and Results | For example, the morphological analyzer outputs +A3sg to mark a singular noun, if there is no explicit plural morpheme. |
Related Work | Goldwater and McClosky (2005) use morphological analysis on the Czech side to get improvements in Czech-to-English statistical machine translation. |
Alignment Methods | morphological analyzers to normalize or split the sentence into morpheme streams (Corston-Oliver and Gamon, 2004). |
Introduction | A myriad of methods have been proposed to handle each of these phenomena individually, including morphological analysis, stemming, compound breaking, number regularization, optimizing word segmentation, and transliteration, which we outline in more detail in Section 2. |
Related Work on Data Sparsity in SMT | Previous works have attempted to handle morphology, decompounding and regularization through lemmatization, morphological analysis, or unsupervised techniques (Nießen and Ney, 2000; Brown, 2002; Lee, 2004; Goldwater and McClosky, 2005; Talbot and Osborne, 2006; Mermer and Akin, 2010; Macherey et al., 2011). |
Related Work on Data Sparsity in SMT | unified framework, requiring no language specific tools such as morphological analyzers or word segmenters. |
Experimental SetUp | This Bible edition is augmented by gold standard morphological analysis (including segmentation) performed by biblical scholars. |
Experimental SetUp | We obtained gold standard segmentations of the Arabic translation with a handcrafted Arabic morphological analyzer which utilizes manually constructed word lists and compatibility rules and is further trained on a large corpus of hand-annotated Arabic data (Habash and Rambow, 2005). |
Experimental SetUp | The accuracy of this analyzer is reported to be 94% for full morphological analyses, and 98%-99% when part-of-speech tag accuracy is not included. |
Multilingual Morphological Segmentation | The underlying assumption of our work is that structural commonality across different languages is a powerful source of information for morphological analysis . |
Conclusion | In the future, we plan to explore introducing multiple segmentation options into the lattice, and the application of our method to a full morphological analysis (as opposed to segmentation) of the target language. |
Related Work | The transformation might take the form of a morphological analysis or a morphological segmentation. |
Related Work | 2.1 Morphological Analysis |
Related Work | Many languages have access to morphological analyzers, which annotate surface forms with their lemmas and morphological features. |
MT System Selection | These features rely on language models, MSA and Egyptian morphological analyzers and a Highly Dialectal Egyptian lexicon to decide whether each word is MSA, Egyptian, Both, or Out of Vocabulary. |
MT System Selection | These features are: sentence length (in words), percentage of selected words and phrases, number of selected words, number of selected phrases, number of words morphologically selected as dialectal by a mainly Levantine morphological analyzer, number of words selected as dialectal by the tool’s DA-MSA lexicons, number of OOV words against the MSA-Pivot system training data, number of words in the sentences that appeared less than 5 times in the training data, number of words in the sentences that appeared between 5 and 10 times in the training data, number of words in the sentences that appeared between 10 and 15 times in the training data, number of words that have spelling errors and were corrected by this tool (e.g., word-lengthening), number of punctuation marks, and number of words that are written in Latin script. |
Machine Translation Experiments | The MSA portion of the Arabic side is segmented according to the Arabic Treebank (ATB) tokenization scheme (Maamouri et al., 2004; Sadat and Habash, 2006) using the MADA+TOKAN morphological analyzer and tokenizer v3.1 (Roth et al., 2008), while the DA portion is ATB-tokenized with MADA-ARZ (Habash et al., 2013). |
Related Work | Sawaf (2010) and Salloum and Habash (2013) used hybrid solutions that combine rule-based algorithms and resources such as lexicons and morphological analyzers with statistical models to map DA to MSA before using MSA-to-English MT systems. |
Experiment | All the data are annotated with information on morphological analysis, clause boundary detection and dependency analysis by hand. |
Linefeed Insertion Technique | In our method, a sentence, on which morphological analysis, bunsetsu segmentation, clause boundary analysis and dependency analysis are performed, is considered the input. |
Preliminary Analysis about Linefeed Points | The data is annotated by hand with information on morphological analysis, bunsetsu segmentation, dependency analysis, clause boundary detection, and linefeed insertion. |
Experimental Results | In these cases, the joint model, entertaining all morphological possibilities, was able to find the combination of links and morphological analyses that are collectively more likely. |
Introduction | To date, studies of morphological analysis and dependency parsing have been pursued more or less independently. |
Previous Work | Since space does not allow a full review of the vast literature on morphological analysis and parsing, we focus only on past research involving joint morphological and syntactic inference (§2.1); we then discuss Latin (§2.2), a language representative of the challenges that motivated our approach. |
Approach to Sentence-Level Dialect Identification | The aforementioned approach relies on language models (LM) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV. |
Approach to Sentence-Level Dialect Identification | Percentage of words in the sentence that are analyzable by an MSA morphological analyzer. |
Approach to Sentence-Level Dialect Identification | Percentage of words in the sentence that are analyzable by an EDA morphological analyzer. |
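The two percentage features above can be computed with a short helper. This is a hedged sketch, not the cited system's code: `analyzable_percentage` is a name introduced here, and the two analyzers are stand-in vocabulary sets with placeholder tokens rather than real MSA or EDA morphological analyzers.

```python
def analyzable_percentage(words, can_analyze):
    """Percentage of `words` for which `can_analyze(word)` is true.

    `can_analyze` is any callable returning a truthy value when the
    analyzer produces at least one analysis. Empty sentences score 0.0.
    """
    if not words:
        return 0.0
    hits = sum(1 for word in words if can_analyze(word))
    return 100.0 * hits / len(words)


# Stand-in analyzers (assumptions): plain vocabulary sets, with
# placeholder tokens w1..w4 instead of real Arabic words.
MSA_VOCAB = {"w1", "w2", "w3"}
EDA_VOCAB = {"w3", "w4"}

sentence = ["w1", "w2", "w3", "w4"]
features = {
    "msa_analyzable_pct": analyzable_percentage(sentence, MSA_VOCAB.__contains__),
    "eda_analyzable_pct": analyzable_percentage(sentence, EDA_VOCAB.__contains__),
}
print(features)
```

Passing the analyzer as a callable keeps the feature computation independent of any particular analyzer interface.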
Experiments | We used MeCab as a morphological analyzer and CaboCha (Kudo and Matsumoto, 2002) as the dependency parser to find the boundaries of the bunsetsu. |
Gazetteer Induction 2.1 Induction by MN Clustering | After preprocessing the first sentence of an article using a morphological analyzer, MeCab, we extracted the last noun after the appearance of the Japanese postposition “は (wa)” (≈ “is”). |
Using Gazetteers as Features of NER | Asahara and Matsumoto (2003) proposed using characters instead of morphemes as the unit to alleviate the effect of segmentation errors in morphological analysis and we also used their character-based method. |
Conclusion | As a next step, we will focus on morphological analysis and disambiguation of Turkish words. |
Conclusion | After determining the correct morphological analysis of Turkish words, we will use the parts of these analyses to replace the leaf nodes that we intentionally left as “*NONE*”. |
Corpus construction strategy | corresponds to the morphological analysis “geç-NEG-FUT-2SG” of the verb “geçmeyeceksin”. |