Spectral Unsupervised Parsing with Additive Tree Metrics

We propose a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery.

Appears in 15 sentences as: POS tag (5) POS tags (13)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- :0, is the POS tag of 21),).Page 2, “Abstract”
- The word embeddings are used during the leam-ing process, but the final decoder that the learning algorithm outputs maps a POS tag sequence a: to a parse tree.Page 2, “Abstract”
- While ideally we would want to use the word information in decoding as well, much of the syntax of a sentence is determined by the POS tags, and relatively high level of accuracy can be achieved by learning, for example, a supervised parser from POS tag sequences.Page 2, “Abstract”
- Just like our decoder, our model assumes that the bracketing of a given sentence is a function of its POS tags .Page 2, “Abstract”
- The POS tags are generated from some distribution, followed by a deterministic generation of the bracketing parse tree.Page 2, “Abstract”
- Therefore, the procedure to find a bracketing for a given POS tag a: is to first estimate the distance matrix sub-block fiww from raw text data (see §3.4), and then solve the optimization problem arg minueu using a variant of the Eisner-Satta algorithm where is identical to 00.0 in Eq.Page 6, “Abstract”
- We first defined a generative model that describes how a sentence, its sequence of POS tags , and its bracketing is generated (§2.3).Page 6, “Abstract”
- First an undirected u 6 L1 is generated (only as a function of the POS tags ), and then u is mapped to a bracketing using a direction mapping hdir.Page 6, “Abstract”
- 3This data sparsity problem is quite severe — for example, the Penn treebank (Marcus et a1., 1993) has a total number of 43,498 sentences, with 42,246 unique POS tag sequences, averaging to be 1.04.Page 6, “Abstract”
- More formally, an anchor is a function G that maps a word index j and a sequence of POS tags :13 to a local context G(j, The anchor we use is G(j, :13) = (j, 303-).Page 7, “Abstract”
- U stands for universal POS tagset, OB stands for conjoining original POS tags with Brown clusters and UB stands for conjoining universal POS tags with Brown clusters.Page 8, “Abstract”

See all papers in *Proc. ACL 2014* that mention POS tags.

See all papers in *Proc. ACL* that mention POS tags.

Back to top.

Appears in 10 sentences as: embeddings (14)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- The word embeddings are used during the leam-ing process, but the final decoder that the learning algorithm outputs maps a POS tag sequence a: to a parse tree.Page 2, “Abstract”
- Then, latent states are generated for each bracket, and finally, the latent states at the yield of the bracketing parse tree generate the words of the sentence (in the form of embeddings ).Page 2, “Abstract”
- L€t V I: {101, ..., mg, 21, ..., 2H}, With 20,- 1‘61)-resenting the word embeddings , and 21- representing the latent states of the bracketings.Page 3, “Abstract”
- Word embeddings As mentioned earlier, each 212,- can be an arbitrary feature vector.Page 8, “Abstract”
- For English, more sophisticated word embeddings are easily obtainable, and we experiment with neural word embeddings Turian et a1.Page 8, “Abstract”
- We also explored two types of CCA embeddings : OSCCA and TSCCA, given in Dhillon et a1.Page 8, “Abstract”
- The OSCCA embeddings behaved better, so we only report its results.Page 8, “Abstract”
- CCM is used with the initializer proposed in Klein and Manning (2002).5 NN, CC, and BC indicate the performance of our method for neural embeddings, CCA embeddings , and Brown clustering respectively, using the heuristic for hdir de-Page 8, “Abstract”
- We find that the neural embeddings modestly outperform the CCA and Brown cluster embeddings .Page 9, “Abstract”
- We didn’t have neural embeddings for German and Chinese (which worked best for English) and thus only used Brown cluster embeddings .Page 9, “Abstract”

See all papers in *Proc. ACL 2014* that mention embeddings.

See all papers in *Proc. ACL* that mention embeddings.

Back to top.

Appears in 9 sentences as: Learning Algorithm (1) learning algorithm (7) learning algorithms (1)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Additive tree metrics can be leveraged by “meta-algorithms” such as neighbor-joining (Saitou and Nei, 1987) and recursive grouping (Choi et al., 2011) to provide consistent learning algorithms for latent trees.Page 2, “Abstract”
- In our learning algorithm , we assume that examples of the form (1002,3302) for i E [N] = {1, .Page 2, “Abstract”
- The word embeddings are used during the leam-ing process, but the final decoder that the learning algorithm outputs maps a POS tag sequence a: to a parse tree.Page 2, “Abstract”
- Our learning algorithm focuses on recovering the undirected tree based for the generative model that was described above.Page 3, “Abstract”
- 3 Spectral Learning Algorithm based on Additive Tree MetricsPage 4, “Abstract”
- From here, we use d to denote dspectral, since that is the metric we use for our learning algorithm .Page 5, “Abstract”
- Algorithm 1 The learning algorithm for finding the latent structure from a set of examples (1007,3307), i EPage 6, “Abstract”
- The full learning algorithm is given in Figure l. The first step in the algorithm is to estimate the covariance matrix block 293(1) (j, k) for each training example ma) and each pair of preterminal positions (j, k) in ma).Page 7, “Abstract”
- Note that the learning algorithm is such that it ensures that Same, is) = imam M) if 0mm) = G(j’,a:(il)) and 6103,3307) = G(k’, 3304)).Page 7, “Abstract”

See all papers in *Proc. ACL 2014* that mention learning algorithm.

See all papers in *Proc. ACL* that mention learning algorithm.

Back to top.

Appears in 9 sentences as: parse tree (7) parse trees (3)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Cog-nitively, it is more plausible to assume that children obtain only terminal strings of parse trees and not the actual parse trees .Page 1, “Abstract”
- Most existing solutions treat the problem of unsupervised parsing by assuming a generative process over parse trees e.g.Page 1, “Abstract”
- Unlike in phylogenetics and graphical models, where a single latent tree is constructed for all the data, in our case, each part of speech sequence is associated with its own parse tree .Page 2, “Abstract”
- , N} are given, and the goal is to predict a bracketing parse tree for each of these examples.Page 2, “Abstract”
- The word embeddings are used during the leam-ing process, but the final decoder that the learning algorithm outputs maps a POS tag sequence a: to a parse tree .Page 2, “Abstract”
- The POS tags are generated from some distribution, followed by a deterministic generation of the bracketing parse tree .Page 2, “Abstract”
- Then, latent states are generated for each bracket, and finally, the latent states at the yield of the bracketing parse tree generate the words of the sentence (in the form of embeddings).Page 2, “Abstract”
- We could thus conclude that 2112 and 2113 should be closer in the parse tree than wlPage 2, “Abstract”
- The resulting t is a binary bracketing parse tree .Page 4, “Abstract”

See all papers in *Proc. ACL 2014* that mention parse tree.

See all papers in *Proc. ACL* that mention parse tree.

Back to top.

Appears in 7 sentences as: constituent parse (3) constituent parses (1) constituent parsing (3)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- We propose a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery.Page 1, “Abstract”
- More specifically, we approach unsupervised constituent parsing from the perspective of structure learning as opposed to parameter learning.Page 1, “Abstract”
- This undirected latent tree is then directed via a direction mapping to give the final constituent parse .Page 1, “Abstract”
- Figure 2: Candidate constituent parses for :13 = (VBD, DT, NN) (left-correct, right -incorrect)Page 2, “Abstract”
- Two candidate constituent parse structures are shown in Figure 2 and the correct one is boxed in green (the other in red).Page 2, “Abstract”
- As implied by the above definition of hdir, selecting which edge is the root can be interpreted as determining the top bracket of the constituent parse .Page 4, “Abstract”
- We described a spectral approach for unsupervised constituent parsing that comes with theoretical guarantees on latent structure recovery.Page 9, “Abstract”

See all papers in *Proc. ACL 2014* that mention constituent parse.

See all papers in *Proc. ACL* that mention constituent parse.

Back to top.

Appears in 7 sentences as: latent variables (7)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- We associate each sentence with an undirected latent tree graphical model, which is a tree consisting of both observed variables (corresponding to the words in the sentence) and an additional set of latent variables that are unobserved in the data.Page 1, “Abstract”
- However, due to the presence of latent variables , structure learning of latent trees is substantially more complicated than in observed models.Page 1, “Abstract”
- The latent variables can incorporate various linguistic properties, such as head information, valence of dependency being generated, and so on.Page 3, “Abstract”
- However, the fact that the z,-are latent variables makes this strategy substantially more complicated.Page 4, “Abstract”
- In particular, it becomes challenging to compute the distances among pairs of latent variables .Page 4, “Abstract”
- What is needed is a “special” distance function that allows us to reverse engineer the distances among the latent variables given the distances among the observed variables.Page 4, “Abstract”
- As we describe below, given the tree structure, the additive tree metric property allows us to compute “backwards” the distances among the latent variables as a function of the distances among the observed variables.Page 4, “Abstract”

See all papers in *Proc. ACL 2014* that mention latent variables.

See all papers in *Proc. ACL* that mention latent variables.

Back to top.

Appears in 4 sentences as: data sparsity (4)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- This leads to a severe data sparsity problem even for moderately long sentences.Page 2, “Abstract”
- Assume for this section is large (we address the data sparsity issue in §3.4).Page 4, “Abstract”
- We now address the data sparsity problem, in particular that ’D(a:) can be very small, and therefore estimating d for each POS sequence separately can be problematic.3Page 6, “Abstract”
- 3This data sparsity problem is quite severe — for example, the Penn treebank (Marcus et a1., 1993) has a total number of 43,498 sentences, with 42,246 unique POS tag sequences, averaging to be 1.04.Page 6, “Abstract”

See all papers in *Proc. ACL 2014* that mention data sparsity.

See all papers in *Proc. ACL* that mention data sparsity.

Back to top.

Appears in 4 sentences as: generative model (4)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Our generative model deterministically maps a POS sequence to a bracketing via an undirectedPage 3, “Abstract”
- The complete generative model that we follow is then:Page 3, “Abstract”
- Our learning algorithm focuses on recovering the undirected tree based for the generative model that was described above.Page 3, “Abstract”
- We first defined a generative model that describes how a sentence, its sequence of POS tags, and its bracketing is generated (§2.3).Page 6, “Abstract”

See all papers in *Proc. ACL 2014* that mention generative model.

See all papers in *Proc. ACL* that mention generative model.

Back to top.

Appears in 4 sentences as: graphical model (2) graphical models (2)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- We associate each sentence with an undirected latent tree graphical model , which is a tree consisting of both observed variables (corresponding to the words in the sentence) and an additional set of latent variables that are unobserved in the data.Page 1, “Abstract”
- Unlike in phylogenetics and graphical models , where a single latent tree is constructed for all the data, in our case, each part of speech sequence is associated with its own parse tree.Page 2, “Abstract”
- Following this intuition, we propose to model the distribution over the latent bracketing states and words for each tag sequence a: as a latent tree graphical model , which encodes conditional inde-pendences among the words given the latent states.Page 3, “Abstract”
- Generating a bracketing via an undirected tree enables us to build on existing methods for structure learning of latent-tree graphical models (Choi et al., 2011; Anandkumar et al., 2011).Page 3, “Abstract”

See all papers in *Proc. ACL 2014* that mention graphical model.

See all papers in *Proc. ACL* that mention graphical model.

Back to top.

Appears in 4 sentences as: syntactic context (5)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Although a: and 33’ are not identical, it is likely that 293/ (2, 3) is similar to 233(1, 2) because the determiner and the noun appear in similar syntactic context .Page 7, “Abstract”
- 233/ (5, 7) also may be somewhat similar, but 233/ (2, 7) should not be very similar to 233(1, 2) because the noun and the determiner appear in a different syntactic context .Page 7, “Abstract”
- The observation that the covariance matrices depend on local syntactic context is the main driving force behind our solution.Page 7, “Abstract”
- The local syntactic context acts as an “anchor,” which enhances or replaces a word index in a sentence with local syntactic context .Page 7, “Abstract”

See all papers in *Proc. ACL 2014* that mention syntactic context.

See all papers in *Proc. ACL* that mention syntactic context.

Back to top.

Appears in 4 sentences as: Treebank (1) treebank (4)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- 3This data sparsity problem is quite severe — for example, the Penn treebank (Marcus et a1., 1993) has a total number of 43,498 sentences, with 42,246 unique POS tag sequences, averaging to be 1.04.Page 6, “Abstract”
- For English we use the Penn treebank (Marcus et al., 1993), with sections 2—21 for training and section 23 for final testing.Page 7, “Abstract”
- For German and Chinese we use the Ne-gra treebank and the Chinese treebank respectively and the first 80% of the sentences are used for training and the last 20% for testing.Page 7, “Abstract”
- For both methods we chose the best parameters for sentences of length 6 g 10 on the English Penn Treebank (training) and used this set for all other experiments.Page 8, “Abstract”

See all papers in *Proc. ACL 2014* that mention treebank.

See all papers in *Proc. ACL* that mention treebank.

Back to top.

Appears in 4 sentences as: Word embeddings (1) word embeddings (4)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- The word embeddings are used during the leam-ing process, but the final decoder that the learning algorithm outputs maps a POS tag sequence a: to a parse tree.Page 2, “Abstract”
- L€t V I: {101, ..., mg, 21, ..., 2H}, With 20,- 1‘61)-resenting the word embeddings , and 21- representing the latent states of the bracketings.Page 3, “Abstract”
- Word embeddings As mentioned earlier, each 212,- can be an arbitrary feature vector.Page 8, “Abstract”
- For English, more sophisticated word embeddings are easily obtainable, and we experiment with neural word embeddings Turian et a1.Page 8, “Abstract”

See all papers in *Proc. ACL 2014* that mention word embeddings.

See all papers in *Proc. ACL* that mention word embeddings.

Back to top.

Appears in 3 sentences as: optima problems (1) optimization problem (2)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Unfortunately, finding the global maximum for these objective functions is usually intractable (Cohen and Smith, 2012) which often leads to severe local optima problems (but see Gormley and Eisner, 2013).Page 1, “Abstract”
- Directly attempting to maximize the likelihood unfortunately results in an intractable optimization problem and greedy heuristics are often employed (Harmeling and Williams, 2011).Page 4, “Abstract”
- Therefore, the procedure to find a bracketing for a given POS tag a: is to first estimate the distance matrix sub-block fiww from raw text data (see §3.4), and then solve the optimization problem arg minueu using a variant of the Eisner-Satta algorithm where is identical to 00.0 in Eq.Page 6, “Abstract”

See all papers in *Proc. ACL 2014* that mention optimization problem.

See all papers in *Proc. ACL* that mention optimization problem.

Back to top.

Appears in 3 sentences as: parsing algorithm (2) parsing algorithms (1)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- Although finding the “minimal” latent tree is NP-hard in general, for the case of projective trees we find that it can be found using bilexical parsing algorithms .Page 1, “Abstract”
- While this criterion is in general NP-hard (Desper and Gascuel, 2005), for projective trees we find that a bilexical parsing algorithm can be used to find an exact solution efficiently (Eisner and Satta, 1999).Page 2, “Abstract”
- However, if we restrict u to be in L1, as we do in the above, then maximizing over Ll can be solved using the bilexical parsing algorithm from Eisner and Satta (1999).Page 6, “Abstract”

See all papers in *Proc. ACL 2014* that mention parsing algorithm.

See all papers in *Proc. ACL* that mention parsing algorithm.

Back to top.

Appears in 3 sentences as: Penn Treebank (1) Penn treebank (2)

In *Spectral Unsupervised Parsing with Additive Tree Metrics*

- 3This data sparsity problem is quite severe — for example, the Penn treebank (Marcus et a1., 1993) has a total number of 43,498 sentences, with 42,246 unique POS tag sequences, averaging to be 1.04.Page 6, “Abstract”
- For English we use the Penn treebank (Marcus et al., 1993), with sections 2—21 for training and section 23 for final testing.Page 7, “Abstract”
- For both methods we chose the best parameters for sentences of length 6 g 10 on the English Penn Treebank (training) and used this set for all other experiments.Page 8, “Abstract”

See all papers in *Proc. ACL 2014* that mention Penn treebank.

See all papers in *Proc. ACL* that mention Penn treebank.

Back to top.