The Equivalence of Information-Theoretic and Likelihood-Based Methods for Neural Dimensionality Reduction
Ross S. Williamson, Maneesh Sahani, Jonathan W. Pillow

Abstract

Stimulus dimensionality-reduction methods seek a low-dimensional projection of the stimulus space that captures the features affecting a neuron's probability of spiking. One popular method, known as maximally informative dimensions (MID), uses an information-theoretic quantity known as "single-spike information" to identify this space. Here we examine MID from a model-based perspective. We show that MID is a maximum-likelihood estimator for the parameters of a linear-nonlinear-Poisson (LNP) model, and that the empirical single-spike information corresponds to the normalized log-likelihood under a Poisson model. This equivalence implies that MID does not necessarily find maximally informative stimulus dimensions when spiking is not well described as Poisson. We provide several examples to illustrate this shortcoming, and derive a lower bound on the information lost when spiking is Bernoulli in discrete time bins. To overcome this limitation, we introduce model-based dimensionality-reduction methods for neurons with non-Poisson firing statistics, and show that they can be framed equivalently in likelihood-based or information-theoretic terms. Finally, we show how to overcome practical limitations on the number of stimulus dimensions that MID can estimate by constraining the form of the nonparametric nonlinearity in an LNP model. We illustrate these methods with simulations and data from primate visual cortex.

Author Summary

A popular approach to the neural coding problem is to identify a low-dimensional linear projection of the stimulus space that preserves the aspects of the stimulus that affect a neuron's probability of spiking. Previous work has focused on both information-theoretic and likelihood-based estimators for finding such projections. Here, we show that these two approaches are in fact equivalent. We show that maximally informative dimensions (MID), a popular information-theoretic method for dimensionality reduction, is identical to the maximum-likelihood estimator for a particular linear-nonlinear encoding model with Poisson spiking. One implication of this equivalence is that MID may not find the information-theoretically optimal stimulus projection when spiking is non-Poisson, which we illustrate with a few simple examples. Using these insights, we propose novel dimensionality-reduction methods that incorporate non-Poisson spiking, and suggest new parametrizations that allow for tractable estimation of high-dimensional subspaces.

Introduction

Characterizing the relationship between sensory stimuli and neural responses is difficult in general because of the high dimensionality of natural signals. A substantial literature has therefore focused on dimensionality-reduction methods for identifying which stimuli affect a neuron's probability of firing. The basic idea is that many neurons compute their responses in a low-dimensional subspace, spanned by a small number of stimulus features. By identifying this subspace, we can more easily characterize the nonlinear mapping from stimulus features to spike responses [1–5].

For all such methods, the goal is to find a set of linear filters, specified by the columns of a matrix K, such that the probability of response r given a stimulus s depends only on the linear projection of s onto these filters, i.e., p(r|s) ≈ p(r|K^T s). Existing methods differ in computational complexity, modeling assumptions, and stimulus requirements. Typically, moment-based estimators have low computational cost but succeed only for restricted classes of stimulus distributions, whereas information-theoretic and likelihood-based estimators allow for arbitrary stimuli but have high computational cost. Previous work has established theoretical connections between moment-based and likelihood-based estimators [11, 14, 17, 19, 23], and between some classes of likelihood-based and information-theoretic estimators [14, 20, 21, 24].

We show that the MID estimator is formally identical to the maximum likelihood (ML) estimator for the parameters of a linear-nonlinear-Poisson (LNP) encoding model. Although previous work has demonstrated an asymptotic equivalence between these methods [20, 24, 25], we show that the correspondence is exact, regardless of time bin size or the amount of data. This equivalence follows from the fact that the plugin estimate for the single-spike information [26], the quantity that MID optimizes, is equal to a normalized Poisson log-likelihood.

One implication of this equivalence is that MID may fail to identify the most informative stimulus dimensions when spiking is not Poisson. We illustrate this shortcoming by showing that MID can fail to find information-maximizing filters for simulated neurons with binary or other non-Poisson spike count distributions. To overcome this limitation, we introduce new dimensionality-reduction estimators based on non-Poisson noise models, and show that they can be framed equivalently in information-theoretic or likelihood-based terms.

A second, practical limitation concerns the number of stimulus dimensions that MID can recover. The difficulty arises from the intractability of using histograms to estimate densities in high-dimensional subspaces. However, the single-spike information depends only on the ratio of densities, which is proportional to the nonlinearity in the LNP model. We show that by restricting the parametrization of this nonlinearity so that the number of parameters does not grow exponentially with the number of dimensions, we can obtain flexible yet computationally tractable estimators for models with many filters or dimensions.

Results

Background

Linear-nonlinear cascade models provide a useful framework for describing neural responses to high-dimensional stimuli. These models define the response in terms of a cascade of linear, nonlinear, and probabilistic spiking stages (see Fig. 1). The linear stage reduces the dimensionality by projecting the high-dimensional stimulus onto a set of linear filters, and a nonlinear function then converts the output of these filters to a nonnegative spike rate. Let θ = {K, α} denote the parameters of the LNP model, where K is a (tall, skinny) matrix whose columns contain the stimulus filters (for cases with a single filter, we will denote the filter with a vector k instead of the matrix K), and α is a set of parameters governing the nonlinear function f from feature space to instantaneous spike rate. Under this model, the probability of a spike response r given stimulus s is governed by a Poisson distribution:

p(r | s) = (1/r!) ( f(K^T s) Δ )^r exp( −f(K^T s) Δ ),   (1)

where Δ denotes the width of the time bin in which spikes are counted.

The defining feature of a Poisson process is that responses in non-overlapping time bins are conditionally independent given the spike rate. In a discrete-time LNP model, the conditional probability of a dataset D = {(s_t, r_t)}, consisting of stimulus-response pairs indexed by t ∈ {1, . . ., N}, is therefore a product of independent Poisson probabilities, giving the log-likelihood

L_lnp(θ; D) = Σ_t [ r_t log( f(K^T s_t) Δ ) − f(K^T s_t) Δ ] − Σ_t log r_t!,

where −(Σ_t log r_t!) is a constant that does not depend on θ. The ML estimate for θ is simply the maximizer of this log-likelihood.

Maximally informative dimensions (MID). The maximally informative dimensions (MID) estimator seeks to find an informative low-dimensional projection of the stimulus by maximizing an information-theoretic quantity known as the single-spike information [26]. This quantity, which we denote I_ss, is the average information that the time of a single spike (considered independently of other spikes) carries about the stimulus. Although first introduced as a quantity that can be computed from the peri-stimulus time histogram (PSTH) measured in response to a repeated stimulus, the single-spike information can also be expressed as the Kullback-Leibler (KL) divergence between two distributions over the stimulus (see [26], appendix B):

I_ss = D_KL( p(s|spike) || p(s) ) = ∫ p(s|spike) log_2 [ p(s|spike) / p(s) ] ds,

where p(s) denotes the marginal or "raw" distribution over stimuli, and p(s|spike) is the distribution over stimuli conditioned on observing a spike, also known as the "spike-triggered" stimulus distribution. Note that p(s|spike) is not the same as p(s|r = 1), the distribution of stimuli conditioned on a spike count of r = 1, since a stimulus that elicits two spikes will contribute twice as much to the spike-triggered distribution as a stimulus that elicits only one spike. The MID estimator [18] seeks to find the linear projection that preserves maximal single-spike information:

K_MID = arg max_K D_KL( p(K^T s | spike) || p(K^T s) ),

where p(K^T s) and p(K^T s|spike) are the raw and spike-triggered stimulus distributions projected onto the subspace defined by the columns of K, respectively. In practice, the MID estimator maximizes an estimate of the projected single-spike information, computed from histogram-based density estimates as described below.

The columns of K_MID can be conceived as "directions" or "axes" in stimulus space that are most informative about a neuron's probability of spiking, as quantified by single-spike information. Fig. 2 shows a simulated example illustrating the MID estimate for a single linear filter in a two-dimensional stimulus space.
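To make the estimator concrete, the following Python sketch (illustrative code, not from the original study) computes the histogram-based plug-in estimate of projected single-spike information for a single candidate filter; the names `plugin_single_spike_info`, `stimuli`, `counts`, and `n_bins` are our own assumptions:

```python
import numpy as np

def plugin_single_spike_info(k, stimuli, counts, n_bins=20):
    """Plug-in estimate of projected single-spike information (bits/spike).

    A minimal sketch: project stimuli onto a single filter k, histogram the
    raw and spike-triggered projections, and form the empirical KL divergence
    between the spike-weighted histogram and the raw one.
    """
    x = stimuli @ k                                          # projected stimuli, shape (N,)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    p_raw, _ = np.histogram(x, bins=edges)                   # raw counts per bin
    q_spk, _ = np.histogram(x, bins=edges, weights=counts)   # spike-weighted counts per bin

    p_hat = p_raw / p_raw.sum()                              # raw histogram fractions
    q_hat = q_spk / q_spk.sum()                              # spike-triggered fractions
    nz = (q_hat > 0) & (p_hat > 0)                           # empty bins contribute zero
    return np.sum(q_hat[nz] * np.log2(q_hat[nz] / p_hat[nz]))
```

The MID estimate of the filter would then be obtained by numerically maximizing this quantity over k with a generic optimizer.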

Equivalence of MID and maximum-likelihood LNP

Here we present a stronger result, showing that the equivalence is not merely asymptotic. We show that standard MID, using histogram-based estimators for the raw and spike-triggered stimulus densities p(s) and p(s|spike), is exactly the ML estimator for the parameters of an LNP model, regardless of spike rate, the time bins used to count spikes, or the amount of data. The standard implementation of MID [18, 20] uses histograms to estimate the projected stimulus densities. We will now unpack the details of this estimate in order to show its relationship to the LNP model log-likelihood. Let {B_1, . . ., B_m} denote a partition of the projected stimulus space into m histogram bins. The raw and spike-triggered histogram fractions are then

p̂_i = (1/N) Σ_t 1_{B_i}(x_t),    q̂_i = (1/n_sp) Σ_t r_t 1_{B_i}(x_t),

where x_t = K^T s_t denotes the linear projection of the stimulus s_t, n_sp = Σ_t r_t is the total number of spikes, and 1_{B_i}(·) is the indicator function for the set B_i, defined as 1_{B_i}(x) = 1 if x ∈ B_i and 0 otherwise. The estimates p̂ and q̂ are also known as the "plug-in" estimates, and correspond to maximum likelihood estimates for the densities in question. These estimates give us a plugin estimate for the projected single-spike information:

Î_ss = Σ_i q̂_i log_2 ( q̂_i / p̂_i ) = (1/n_sp) Σ_t r_t log_2 Q̂(x_t),   (8)

where the function Q̂(x) denotes the ratio of density estimates:

Q̂(x) = Σ_i ( q̂_i / p̂_i ) 1_{B_i}(x).

Note that Q̂(x) is a piecewise constant function that takes the value q̂_i/p̂_i over the i'th histogram bin B_i.

Given a projection matrix K, the ML estimate for the nonlinearity parameters α = (f_1, . . ., f_m) is the average number of spikes per stimulus in each histogram bin, divided by the time bin width Δ, that is:

f̂_i = ( n_sp q̂_i ) / ( N Δ p̂_i ).

Note that the functions f̂ and Q̂ are related by f̂(x) = ( n_sp / (NΔ) ) Q̂(x), and that Σ_t f̂(x_t) Δ = n_sp. This allows us to directly relate the empirical single-spike information (Equation 8) with the LNP model log-likelihood, normalized by the spike count, as follows:

Î_ss = (1/n_sp) [ L_lnp(θ̂; D) − L_lnp(θ_0; D) ],   (13)

(with logarithms taken in the same base on both sides), where L_lnp(θ_0; D) denotes the Poisson log-likelihood under a "null" model in which the spike rate does not depend on the stimulus, but takes the constant value λ_0 = n_sp/(NΔ) across the entire experiment.

In fact, the quantity −L_lnp(θ_0; D), normalized by the number of time bins, can be considered an estimate for the marginal entropy of the response distribution, H(r) = −Σ_r p(r) log p(r), since it is the average negative log-probability of the response under a Poisson model that is independent of the stimulus. This makes it clear that the single-spike information I_ss can be equally regarded as "LNP information".

This equality holds independent of the time bin size Δ, the number of samples N, and the number of spikes n_sp. From this relationship, it is clear that the linear projection K that maximizes Î_ss also maximizes the LNP log-likelihood L_lnp(θ; D), meaning that the MID estimate is the same as the ML estimate for the stimulus filters in an LNP model:

K_MID = arg max_K Î_ss(K) = arg max_K max_α L_lnp({K, α}; D) = K_ML.

Moreover, the histogram-based estimates of the raw and spike-triggered stimulus densities p̂ and q̂, which are used for computing the empirical single-spike information Î_ss, correspond to a particular parametrization of the LNP model nonlinearity f as a piecewise constant function over histogram bins. The ratio of these plugin estimates gives rise to the ML estimate for f. MID is thus formally equivalent to an ML estimator for both the linear filters and the nonlinearity of an LNP model.
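The identity above can be checked numerically. The sketch below (again illustrative, with assumed variable names) computes the plug-in single-spike information and the normalized difference of Poisson log-likelihoods for a one-dimensional projection `x`, using natural logarithms on both sides so that the two returned values agree exactly:

```python
import numpy as np

def check_mid_lnp_equivalence(x, counts, dt=1.0, n_bins=15):
    """Verify I_ss_hat = (L_lnp(theta_hat) - L_lnp(theta_0)) / n_sp.

    A sketch under the assumptions in the text: x is a 1D stimulus projection,
    counts are per-bin spike counts, the nonlinearity is piecewise constant
    over histogram bins, and information is measured in nats.
    """
    N, n_sp = len(x), counts.sum()
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)   # histogram bin of each sample

    n_i = np.bincount(idx, minlength=n_bins)                   # stimuli per bin
    s_i = np.bincount(idx, weights=counts, minlength=n_bins)   # spikes per bin
    p_hat, q_hat = n_i / N, s_i / n_sp

    # Plug-in single-spike information (nats/spike)
    nz = s_i > 0
    I_ss = np.sum(q_hat[nz] * np.log(q_hat[nz] / p_hat[nz]))

    # ML piecewise-constant nonlinearity and null (constant) rate
    f_hat = np.where(n_i > 0, s_i / np.maximum(n_i, 1) / dt, 0.0)
    lam0 = n_sp / (N * dt)

    rate = f_hat[idx]
    ll_lnp = np.sum(counts[rate > 0] * np.log(rate[rate > 0] * dt)) - np.sum(rate) * dt
    ll_null = n_sp * np.log(lam0 * dt) - N * lam0 * dt
    return I_ss, (ll_lnp - ll_null) / n_sp     # the two returned values coincide
```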

Selecting the number of parameters for the nonlinearity is important both for accurately estimating single-spike information from finite data and for successfully finding the most informative filter or filters. Fig. 3 illustrates this point using data from a simulated neuron with a single filter in a two-dimensional stimulus space. For small datasets, the MID estimate computed with many histogram bins (i.e., many parameters for the nonlinearity) substantially overestimates the true I_ss and yields large errors in the filter estimate. Even with 1000 stimuli and 200 spikes, a 20-bin histogram gives substantial upward bias in the estimate of single-spike information (Fig. 3D). Parametrization of the nonlinearity is therefore an important problem that should be addressed explicitly when using MID, e.g., by cross-validation or other model selection methods.

Models with Bernoulli spiking

However, real spike trains may exhibit more or less variability than a Poisson process [fl]. In particular, the Poisson assumption breaks down when the time bins in which the data are analyzed approach the length of the refractory period, since in that case each bin can contain at most one spike. In that case, a Bernoulli model provides a more accurate description of neural data, since it allows only 0 or 1 spike per bin. The Bernoulli and discrete-time Poisson models approach the same limiting Poisson process as the bin size (and single-bin spike probability) approaches zero while the average spike rate remains constant. However, as long as single-bin spike probabilities are above zero, the two models differ.

The equivalence derived above therefore implies a limitation of MID for non-Poisson responses. That is, if the spike count r given stimulus s is not a Poisson random variable, then MID does not necessarily find the subspace preserving maximal information between stimulus and response. To show this, we derive the mutual information between the stimulus and a Bernoulli-distributed spike count, and show that this quantity is closely related to the log-likelihood under a linear-nonlinear-Bernoulli encoding model.

We can define the linear-nonlinear-Bernoulli (LNB) model by analogy to the LNP model, but with Bernoulli instead of Poisson spiking. The parameters θ = {K, α} consist of a matrix K that determines a linear projection of the stimulus space, and a set of parameters α that govern the nonlinearity f. Here, the output of f is a spike probability λ in the range [0, 1]. The probability of a spike response r ∈ {0, 1} given stimulus s is governed by a Bernoulli distribution. We can express this model as

p(r | s) = λ^r (1 − λ)^(1 − r),    λ = f(K^T s).

If K has a single filter and the nonlinearity is restricted to be a logistic function, f(x) = 1/(1 + exp(−x)), this reduces to the logistic regression model. Note that the spike probability λ is analogous to the single-bin Poisson mean λΔ from the LNP model (Equation 1), and the two models become identical in the small-bin limit where the probability of spiking p(r = 1) goes to zero [24, 26].
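As a minimal sketch of the likelihood-based view (assuming a single filter `k` and a logistic nonlinearity, so that the model coincides with logistic regression), the LNB log-likelihood can be written as:

```python
import numpy as np

def lnb_loglikelihood(k, stimuli, spikes):
    """Log-likelihood of a single-filter linear-nonlinear-Bernoulli (LNB) model.

    A sketch, not the authors' code: the filter output is passed through a
    logistic nonlinearity to give the Bernoulli spike probability per bin.
    """
    x = stimuli @ k                                  # projected stimuli
    lam = 1.0 / (1.0 + np.exp(-x))                   # spike probability in (0, 1)
    lam = np.clip(lam, 1e-12, 1 - 1e-12)             # guard against log(0)
    r = spikes.ravel()
    return np.sum(r * np.log(lam) + (1 - r) * np.log(1 - lam))
```

The LNB filter estimate is obtained by maximizing this quantity over k (and, more generally, over the parameters of a flexible nonlinearity).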

We can derive an equivalent dimensionality-reduction estimator in information-theoretic terms. The mutual information between the projected stimulus x = K^T s and the binary spike count r can be written as a weighted sum of two KL divergences:

I(x; r) = p(r = 1) D_KL( p(x|r = 1) || p(x) ) + p(r = 0) D_KL( p(x|r = 0) || p(x) ).

If we normalize by the probability of observing a spike, we obtain a quantity with units of bits per spike that can be directly compared to single-spike information. We refer to this as the Bernoulli information:

I_Ber = I(x; r) / p(r = 1) = I_ss + I_0,

where I_0 = ( p(r = 0) / p(r = 1) ) D_KL( p(x|r = 0) || p(x) ) is the information (per spike) carried by silences, and I_ss = D_KL( p(x|r = 1) || p(x) ) is the single-spike information.

Whereas the single-spike information quantifies the information conveyed by each spike alone (no matter how many spikes might co-occur in the same time bin) but neglects the information conveyed by silences, the Bernoulli information reflects the information conveyed by both spikes and silences. Let Î_Ber = Î_0 + Î_ss denote the empirical or plugin estimate of the Bernoulli information, where Î_ss is the empirical single-spike information (Equation 8), and Î_0 is a plugin estimate of the KL divergence between p(x|r = 0) and p(x), weighted by (N − n_sp)/n_sp, the ratio of the number of silences to the number of spikes. It is straightforward to show that the empirical Bernoulli information equals the LNB model log-likelihood per spike plus a constant:

Î_Ber = (1/n_sp) L_lnb(θ; D) + (1/r̄) Ĥ[r],   (20)

where r̄ = n_sp/N denotes the mean spike count per bin and Ĥ[r] = −(n_sp/N) log (n_sp/N) − ((N − n_sp)/N) log ((N − n_sp)/N) is the plug-in estimate for the marginal response entropy. Because the second term is independent of θ, the maximum of the empirical Bernoulli information is identical to the maximum of the LNB model likelihood, meaning that once again, we have an exact equivalence between likelihood-based and information-based estimators.
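A plug-in estimate of the Bernoulli information can be computed directly from histograms of the projected stimulus, as in the following sketch (function and variable names are illustrative assumptions):

```python
import numpy as np

def plugin_bernoulli_info(x, r, n_bins=20):
    """Plug-in Bernoulli information (bits/spike) for a binary spike train.

    A sketch following the decomposition in the text: I_Ber = I_ss + I_0, where
    I_ss is the spike-triggered term and I_0 the weighted silence-triggered term.
    Assumes x is a 1D stimulus projection and r contains only 0s and 1s.
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    p_raw, _ = np.histogram(x, bins=edges)
    p_spk, _ = np.histogram(x[r == 1], bins=edges)
    p_sil, _ = np.histogram(x[r == 0], bins=edges)

    p = p_raw / p_raw.sum()
    q1 = p_spk / max(p_spk.sum(), 1)
    q0 = p_sil / max(p_sil.sum(), 1)

    def kl(q, pref):
        nz = q > 0
        return np.sum(q[nz] * np.log2(q[nz] / pref[nz]))

    n_sp, n_sil = (r == 1).sum(), (r == 0).sum()
    I_ss = kl(q1, p)
    I_0 = (n_sil / n_sp) * kl(q0, p)        # silence term, weighted per spike
    return I_ss + I_0, I_ss, I_0
```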

The empirical Bernoulli information is strictly greater than the estimated single-spike (or "Poisson") information for a binary spike train that is not all zeros or ones, since Î_0 > 0 and these spike absences are neglected by the single-spike information measure. Only in the limit of infinitesimal time bins, where p(r = 1) → 0, does Î_Ber converge to Î_ss [24, 26]. As a result, standard MID can fail to identify the most informative subspace when applied to a neuron with Bernoulli spiking. We illustrate this phenomenon with two (admittedly toy) simulated examples. For both examples, we compute the standard MID estimate k_MID by maximizing Î_ss, and the LNB filter estimate k_Ber, which maximizes the LNB likelihood.

The first example (Fig. 4) uses stimuli uniformly distributed on the right half of the unit circle. The Bernoulli spike probability λ increases linearly as a function of the stimulus angle θ: λ = (θ + π/2)/π, for θ ∈ (−π/2, π/2]. For this neuron, the most informative 1D axis is the vertical axis, which the LNB estimate k_Ber recovers accurately (Fig. 4A).

Fig 4. Illustration of MID failure mode due to non-Poisson spiking. (A) Stimuli were drawn uniformly on the unit half-circle, θ ~ Unif(−π/2, π/2). The simulated neuron had Bernoulli (i.e., binary) spiking, where the probability of a spike increased linearly from 0 to 1 as θ varied from −π/2 to π/2, that is: p(spike|θ) = θ/π + 1/2. Stimuli eliciting "spike" and "no-spike" are indicated by gray and black circles, respectively. For this neuron, the most informative one-dimensional linear projection corresponds to the vertical axis (k_Ber), but the MID estimator (k_MID) exhibits a 16° clockwise bias. (B) Information from spikes (black), silences (gray), and both (red), as a function of projection angle. The peak of the Bernoulli information (which defines k_Ber) lies close to π/2, while the peak of the single-spike information (which defines k_MID) exhibits the clockwise bias shown in A. Note that k_MID does not converge to the optimal direction even in the limit of infinite data, due to its lack of sensitivity to information from silences. Although this figure is framed in information-theoretic terms, equations (19) and (20) detail the equivalence between Î_Ber and L_lnb, so that this figure can be viewed from either an information-theoretic or likelihood-based perspective.

By contrast, k_MID exhibits a substantial clockwise bias, resulting from its failure to take into account the information from silences (which are more informative when the spike rate is high). Fig. 4B shows the breakdown of total Bernoulli information into Î_ss (spikes) and Î_0 (silences) as a function of projection angle, which illustrates the relative biases of the two quantities.

The second example (Fig. 5) uses stimuli drawn from a standard bivariate Gaussian (zero mean and identity covariance), in which standard MID makes a 90° error in identifying the most informative one-dimensional subspace. The neuron's nonlinearity (Fig. 5A) is excitatory along stimulus axis s_1 and suppressive along stimulus axis s_2 (indicating that a large projection onto s_1 increases spike probability, while a large projection onto s_2 decreases spike probability). For this neuron, both stimulus axes are clearly informative, but the (suppressive, vertical) axis s_2 carries 13% more information than the (excitatory, horizontal) axis s_1. However, the standard MID estimator identifies s_1 as the most informative axis (Fig. 5C), due once again to the failure to account for the information carried by silences.

However, it is clear that the assumption of Poisson firing can lead the standard MID estimator to make mistakes when spiking is actually Bernoulli. In general, we suggest that the question of which estimator performs better is an empirical one, and depends on which model describes the true spiking process more accurately.

Quantifying MID information loss for binary spike trains. In the limit of infinitesimal time bins, the information carried by silences goes to zero, and the plugin estimates for Bernoulli and single-spike information converge: Î_0 → 0 and Î_Ber → Î_ss. However, for finite time bins, the Bernoulli information can substantially exceed the single-spike information. In the previous section, we showed that this mismatch can lead to errors in subspace identification. Here we derive a lower bound on the information lost due to the neglect of I_0, the information (per spike) carried by silences, as a function of the marginal probability of a spike, p(r = 1).

The bound states that the fraction of total information neglected by MID is at least half the marginal spike probability, p(r = 1)/2. Thus, for example, if 20% of the bins in a binary spike train contain a spike, the standard MID estimator will necessarily neglect at least 10% of the total mutual information. We show that this bound holds in the asymptotic limit of small p(r = 1) (see Methods for details), but conjecture that it holds for all p(r = 1). The bound is tight in the Poisson limit, p(r = 1) → 0, but is substantially loose in the limit where spiking is common, p(r = 1) → 1, in which all information is carried by silences. Fig. 6 shows our bound compared to the actual (numerical) lower bound for an example with a binary stimulus.

Models with arbitrary spike count distributions

For the general case, then, we must consider an arbitrary distribution over counts conditioned on a stimulus. As we will see, maximizing the mutual information based on histogram estimators is once again equivalent to maximizing the likelihood of an LN model with piecewise constant mappings from the linear stimulus projection to count probabilities.

Fig 6. Minimum information loss for binary spiking. Lower bound on the fraction of total information neglected by MID for a Bernoulli neuron, as a function of the marginal spike probability p(spike) = p(r = 1), for the special case of a binary stimulus. Information loss is quantified as the ratio I_0/(I_0 + I_ss), the information due to no-spike events, I_0, divided by the total information due to spikes and silences, I_0 + I_ss. The dashed gray line shows the lower bound derived in the limit p(spike) → 0. The solid black line shows the actual minimum achieved for binary stimuli s ∈ {0, 1} with p(s = 1) = q, computed via a numerical search over the parameter q ∈ [0, 1] for each value of p(spike). The lower bound is substantially loose for p(spike) > 0, since as p(spike) → 1, the fraction of information due to silences goes to 1.

The linear-nonlinear-count (LNC) model, which includes LNB as a special case, is defined by a linear dimensionality-reduction matrix K and a set of nonlinear functions {f^(0), . . ., f^(r_max)} that map the projected stimulus to the probability of observing {0, . . ., r_max} spikes, respectively. We can write the probability of a spike response r given projected stimulus x = K^T s as:

p(r = j | x) = f^(j)(x),   j ∈ {0, . . ., r_max}.

Note that there is a constraint on the functions f^(j) requiring that Σ_j f^(j)(x) = 1 for all x, since the probabilities over possible counts must add to 1 for each stimulus. The log-likelihood for the model parameters can then be written

L_lnc(θ; D) = Σ_t Σ_j 1_j(r_t) log f^(j)(K^T s_t),

where 1_j(r_t) is an indicator function selecting time bins t in which the spike count is j. As before, we consider the case where f^(j) takes a constant value in each of m histogram bins {B_i}, so the maximum likelihood estimates for these values can be given in terms of the histogram probabilities:

f̂_i^(j) = n_i^(j) / n_i,

where n_i is the number of stimuli in bin B_i, n_i^(j) is the number of stimuli in bin B_i that elicited j spikes, N^(j) is the number of stimuli in all bins that elicited j spikes, and N is the total number of stimuli. The histogram fractions of the projected raw stimuli, p̂_i, are defined as in Equation 6, with the j-spike-conditioned histograms defined analogously: q̂_i^(j) = n_i^(j) / N^(j). Thus, the log-likelihood for projection matrix K, having already maximized with respect to the nonlinearities by using their plugin estimates, is

L_lnc(K; D) = Σ_j Σ_i n_i^(j) log ( n_i^(j) / n_i ).   (30)

If the binned spike counts r_t measured in response to stimuli s_t are not Poisson distributed, the projection matrix K that maximizes the mutual information between K^T s and r can be found as follows. Recalling that r_max is the maximal spike count possible in a time bin and writing x = K^T s, we have:

I(x; r) = Σ_{j=0}^{r_max} p(r = j) D_KL( p(x|r = j) || p(x) ).

To ease comparison with the single-spike information, which is measured in bits per spike, we normalize the mutual information by the mean spike count to obtain:

I_count = I(x; r) / E[r] = Σ_{j=0}^{r_max} I_j,   (35)

where I_j is the normalized information carried by the j-spike responses. Note that I_1, the information carried by single-spike responses, is not the same as the single-spike information I_ss, since the latter combines information from all responses with 1 or more spikes, by assuming that each spike is conditionally independent of all other spikes. We can estimate the mutual information from data using a histogram-based plugin estimator, replacing the densities above with the histogram fractions defined earlier. Comparison with the LNC model log-likelihood (Equation 30) reveals that:

Î_count = (1/n_sp) L_lnc(K; D) + (1/r̄) Ĥ[r],

where r̄ = n_sp/N is the mean spike count per bin and Ĥ[r] = −Σ_j (N^(j)/N) log (N^(j)/N) is the plug-in estimate for the marginal entropy of the observed spike counts. Note that this reduces to the relationship between Bernoulli information and LNB model log-likelihood (Equation 20) in the special case where r_max = 1. Thus, even in the most general case of an arbitrary distribution over spike counts given a stimulus, the subspace projection K that maximizes the histogram-based estimate of mutual information is identical to the maximum-likelihood K for an LN model with a corresponding piecewise constant parametrization of the nonlinearities.
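The corresponding plug-in estimator for the count information can be sketched as follows (illustrative code; `x` is a 1D stimulus projection and `counts` the per-bin spike counts):

```python
import numpy as np

def plugin_count_info(x, counts, n_bins=20):
    """Plug-in estimate of the count information I_count (bits/spike).

    A sketch: histogram the projected stimulus separately for each observed
    spike count j, form KL divergences against the raw histogram, weight each
    by the marginal probability of count j, and normalize by mean spike count.
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    p_raw, _ = np.histogram(x, bins=edges)
    p = p_raw / p_raw.sum()

    N, mean_count = len(x), counts.mean()
    mutual_info = 0.0
    for j in np.unique(counts):
        mask = counts == j
        q_j, _ = np.histogram(x[mask], bins=edges)
        q_j = q_j / q_j.sum()
        nz = q_j > 0
        kl_j = np.sum(q_j[nz] * np.log2(q_j[nz] / p[nz]))
        mutual_info += (mask.sum() / N) * kl_j   # p(r = j) * D_KL( p(x|r=j) || p(x) )
    return mutual_info / mean_count
```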

We formulate two simple examples to illustrate the sub-optimality of standard MID for neurons whose stimulus-conditioned count distributions are not Poisson. For both examples, the neuron was sensitive to a one-dimensional projection along the horizontal axis and emitted either 0, 1, or 2 spikes in response to a stimulus.

The first example (Fig. 7A) involves a deterministic neuron, where the spike count is 0, 1, or 2 according to a piecewise constant nonlinear function of the projected stimulus. Here, MID does not use the information from zero- or two-spike bins optimally; it ignores information from zero-spike responses entirely, and treats stimuli eliciting two spikes as two independent samples from p(x|spike). The Î_count estimator, by contrast, is sensitive to the non-Poisson statistics of the response, and combines information from all spike counts (Equation 35), yielding both higher information and faster convergence to the true filter.

The second example (Fig. 7B) involves a model neuron in which a sigmoidal nonlinearity determines whether it fires exactly 1 spike (with high probability at negative stimulus projections) or stochastically emits either 0 or 2 spikes, each with probability 0.5 (at positive stimulus projections). Thus, the nonlinearity does not change the mean spike rate, but strongly affects its variance. Because the mean spike rate, and hence the spike-triggered stimulus distribution, is not affected by the stimulus, the single-spike information is zero for all projections, and the MID estimate does not converge to the true filter even with infinite data. However, the full count information Î_count correctly weights the information carried by different spike counts and provides a consistent estimator for K.

Identifying high-dimensional subspaces

MID has usually been limited to estimation of only one or two filters, and we are unaware of a practical setting in which it has been used to recover more than three. This stands in contrast to methods like spike-triggered covariance (STC) [1, 7], information-theoretic spike-triggered average and covariance (iSTAC) [19], projection-pursuit regression [28], Bayesian spike-triggered covariance [14], and quadratic variants of MID [21, 22], all of which can tractably estimate ten or more filters. This capability may be important, given that V1 neurons exhibit sensitivity to as many as 15 dimensions [Q], and many canonical neural computations (e.g., motion estimation) require a large number of stimulus dimensions [22, 30]. Before we continue, it is helpful to consider why MID is impractical for high-dimensional feature spaces. The problem is not the number of filter parameters: these scale linearly with dimensionality, since a p-filter model with D-dimensional stimuli requires only Dp parameters, or indeed only (D − 1)p − ½p(p − 1) parameters to specify the subspace after accounting for degeneracies. The problem is instead the number of parameters needed to specify the densities p(x) and p(x|spike). For histogram-based density estimators, the number of parameters grows exponentially with dimension: a histogram with m bins along each of p filter axes requires m^p parameters, a phenomenon sometimes called the "curse of dimensionality".

A key benefit of the LNP model likelihood framework is that it shifts the focus of estimation away from the separate densities p(x|spike) and p(x) to a single nonlinear function f. This change in focus makes it easier to scale the likelihood approach to high dimensions for a few different reasons. First, direct estimation of a single nonlinearity in place of two densities immediately halves the number of parameters required to achieve a similarly detailed picture of the neuron's response to the filtered stimulus. Second, the dependence of the MID cost function on the logarithm of the ratio p(x|spike)/p(x) makes it very sensitive to noise in the estimated value of the denominator p(x) when that value is near 0. Unfortunately, as p(x) is also the probability with which samples are generated, these low-value regions are precisely where the fewest samples are available. This is a common difficulty in the empirical estimation of information-theoretic quantities, and others working in more general machine-learning settings have suggested direct estimation of the ratio rather than of its parts [31–33]. In LN neural modeling, such direct estimation of the ratio is equivalent to direct estimation of the nonlinearity.

For example, consider an experiment using natural visual images. While natural images presumably form a smooth manifold within the space of all possible pixel patterns, the structure of this manifold is neither simple nor known. The natural distribution of images does not factor over disjoint sets of pixels, nor over linear projections of pixel values. A small random perturbation in all pixels makes a natural image appear unnaturally noisy, violating the underlying presumption of kernel density estimators that local perturbations do not alter the density much. Indeed, the question of how best to model the distribution of natural stimuli is a matter of active research. By contrast, we might expect to be able to develop better parametric forms to describe the nonlinearities employed by neural systems. For instance, we might expect the neural nonlinearity to vary smoothly in the space of photoreceptor activation, and thus of filter outputs. Thus, locally kernel-smoothed estimates of the nonlinear mapping, or even parametric choices of function class, such as low-order polynomials, might be valid, even if the stimulus density changes abruptly. Alternatively, subunits within the neural receptive field might lead to additively or multiplicatively separable components of the nonlinearity that act on the outputs of different filters. In this case, it would be possible to factor f between two subsets of filter outputs, say to give f(x) = f1(x1)f2(x2), even though there is no reason for the stimulus distribution to factor: p(x) ≠ p(x1)p(x2). This reduction of f to two (or more) lower-dimensional functions would avoid the exponential parameter explosion implied by the curse of dimensionality.

Several existing methods adopt constrained parametrizations of this kind. In many such cases, f is parametrized by a quadratic form embedded in a 1D nonlinearity [14], so that the number of parameters scales only quadratically with the number of filters. A similar approach has been formulated in information-theoretic terms using a quadratic logistic Bernoulli model [21, 22, 24]. Another method, known as extended projection pursuit regression (ePPR) [28], parametrizes f as a sum of one-dimensional nonlinearities, in which case the number of parameters grows only linearly with the number of filters.

Parametrizing the many-filter LNP model. Here we provide a general formulation that encompasses both standard MID and constrained methods that scale to high-dimensional subspaces. We can rewrite the LNP model (Equation 1) as follows:

λ = f(K^T s) = g( Σ_{i=1}^{d} α_i φ_i(K^T s) ),

where the nonlinearity f is parametrized with a set of basis functions φ_1, . . ., φ_d, which are linearly combined with weights α_i and then passed through a scalar nonlinearity g. We refer to g as the output nonlinearity; its primary role is to ensure that the spike rate λ is positive regardless of the weights α_i. This can also be considered a special case of an LNLN model [15, 34, 35]. If we fix g and the basis functions {φ_i} in advance, fitting the nonlinearity simply involves estimating the parameters α_i from the projected stimuli and associated spike counts. If g is convex and log-concave, then the log-likelihood is concave in {α_i} given K, meaning the parameters governing f can be fit without getting stuck in non-global maxima [11].
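For fixed filters K, evaluating this parametrized log-likelihood is straightforward; the following sketch assumes the basis-function outputs have already been collected into a matrix `Phi` (our notation) and uses g = exp by default:

```python
import numpy as np

def basis_lnp_loglik(weights, Phi, counts, dt, g=np.exp):
    """LNP log-likelihood with nonlinearity f(x) = g(sum_i a_i * phi_i(x)).

    A sketch: Phi is an (N x d) matrix of basis-function outputs evaluated at
    the projected stimuli, weights holds the coefficients, and g is the output
    nonlinearity (exponential by default, which keeps the problem concave in
    the weights).
    """
    rate = g(Phi @ weights)                    # instantaneous spike rate per bin
    rate = np.maximum(rate, 1e-12)             # guard against log(0)
    return np.sum(counts * np.log(rate * dt) - rate * dt)
```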

Standard MID corresponds to the special case in which g is the identity and the basis functions φ_i are indicator functions for the histogram bins B_i. In this case, the maximum-likelihood weights {α_i} are proportional to the ratio between the number of spikes and the number of stimuli in the i'th histogram bin (Equation 10). As discussed above, the number of basis functions then scales exponentially with the number of filters, making this parametrization impractical for high-dimensional feature spaces. Another special case of this framework corresponds to Bayesian spike-triggered covariance analysis [14], in which the basis functions φ_i are taken to be linear and quadratic functions of the projected stimulus. If the stimulus is Gaussian, then standard STC and iSTAC provide an asymptotically optimal fit to this model under the assumption that g is exponential [14, 19].

Other reasonable choices include polynomials, sigmoids, sinusoids (i.e., Fourier components), cubic splines, radial basis functions, or any mixture of these bases. Alternatively, we could use nonparametric models such as Gaussian processes, which have been used to model low-dimensional tuning curves and firing rate maps [36, 37]. Theoretical convergence for arbitrary high-dimensional nonlinearities requires a scheme for increasing the complexity of the basis or nonparametric model as we increase the amount of data recorded [38–41]. We do not examine such theoretical details here, focusing instead on the problem of choosing a particular basis that is well suited to the dataset at hand. Below, we introduce basis functions {φ_i} that provide a reasonable tradeoff between flexibility and tractability for parametrizing high-dimensional nonlinear functions.

We propose to parametrize the nonlinearity for many-filter LNP models using cylindrical basis functions (CBFs), which we introduce by analogy to radial basis functions (RBFs). These functions are restricted in some directions of the feature space (like RBFs), but are constant along other dimensions. They are therefore the function-domain analogues of the probability "experts" used in product-of-experts models [42], in that they constrain a high-dimensional function along only a small number of dimensions, while imposing no structure on the others. We define a "first-order" CBF as a Gaussian bump in one direction of the feature space, parametrized by a center location μ and a characteristic width σ:

φ_{μ,σ}(x) = exp( −(x_i − μ)^2 / (2σ^2) ),

which affects the function along vector component x_i and is invariant along x_j for j ≠ i. Parametrizing f with first-order CBFs is tantamount to assuming f can be parametrized as the sum of 1D functions along each filter axis, that is, f(x) = g( f_1(x_1) + . . . + f_k(x_k) ), where each function f_i is parametrized with a linear combination of "bump" functions. This setup resembles the parametrization used in the extended projection-pursuit regression (ePPR) model [28], although the nonlinear transformation g confers some added flexibility. For example, we can have multiplicative combination when g(·) = exp(·), resulting in a separable f, or rectified additive combination when g(·) = max(·, 0), which is closer to ePPR. If we use d basis functions along each filter axis, the resulting nonlinearity requires kd parameters for a k-filter LNP model. We can define second-order CBFs as functions that have Gaussian dependence on two dimensions of the input space and that are insensitive to all others:

φ_{μ,σ}(x) = exp( −[ (x_i − μ_i)^2 + (x_j − μ_j)^2 ] / (2σ^2) ).

This basis represents f as a (transformed) sum of bivariate functions, giving (k choose 2)·d^2 parameters if we use d^2 basis functions for each of the possible pairs of k filter outputs, or merely (k/2)·d^2 if we instead partition the k filters into disjoint pairs. Higher-order CBFs can be defined analogously: k'-th order CBFs are Gaussian RBFs in a k'-dimensional subspace while remaining constant in the remaining k − k' dimensions. Of course, there is no need to represent the entire nonlinearity using CBFs of the same order. It might make sense, for example, to represent the nonlinear combination of the first two filter responses with second-order CBFs (which is comparable to standard MID with a 2D histogram representation of the nonlinearity), and then use first-order CBFs to represent the contributions of additional (less-informative) filter outputs.
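A first-order CBF feature matrix might be constructed as in the sketch below (the function name and the `centers`/`width` arguments are illustrative assumptions, not values from the paper):

```python
import numpy as np

def first_order_cbf_features(X, centers, width):
    """Evaluate first-order cylindrical basis functions on projected stimuli.

    A sketch under assumed parameters: X is an (N x k) matrix of filter outputs,
    centers is a list of d bump centers shared across filter axes, and width is
    a common Gaussian width. Each basis function is a Gaussian bump along one
    filter axis and constant along all others.
    """
    N, k = X.shape
    feats = []
    for i in range(k):                     # one set of bumps per filter output
        for mu in centers:
            feats.append(np.exp(-0.5 * ((X[:, i] - mu) / width) ** 2))
    return np.column_stack(feats)          # shape (N, k * d)
```

The resulting matrix can serve as `Phi` in the likelihood sketch above, giving k·d nonlinearity parameters for a k-filter model.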

To illustrate these methods, we analyzed a dataset of extracellular single-unit recordings from primate V1 simple and complex cells driven by an oriented 1D binary white noise stimulus sequence (i.e., "flickering bars") [29]. For each neuron, we fit an LNP model using: (1) the information-theoretic spike-triggered average and covariance (iSTAC) estimator [19]; and (2) the maximum likelihood estimator for an LNP model with a nonlinearity parametrized by first-order CBFs (cbf-LNP). The iSTAC estimator, which combines information from the STA and STC, returns a list of filters ordered by informativeness about the neural response. It models the nonlinearity as an exponentiated quadratic function (an instance of a generalized quadratic model [23]), and yields asymptotically optimal performance under the condition that stimuli are Gaussian. For comparison, we also implemented a model with a less-constrained nonlinearity, using Gaussian RBFs sensitive to all filter outputs (rbf-LNP). This approach is comparable to "classic" MID, although it exploits the LNP formulation to allow local smoothing of the nonlinearity (rather than square histogram bins). Even so, the number of parameters in the nonlinearity still grows exponentially with the number of filters, so computational concerns prevented us from recovering more than four filters with this method.

Fig. 8 compares the performance of these estimators on neural data, and illustrates the ability to tractably recover high-dimensional feature spaces using maximum likelihood methods, provided that the nonlinearity is parametrized appropriately. We used 3 CBFs per filter output for the cbf-LNP model (resulting in 3p parameters for the nonlinearity), and a grid with 3 RBFs per dimension for the rbf-LNP model (3^p parameters). By contrast, the exponentiated-quadratic nonlinearity underlying the iSTAC estimator requires O(p^2) parameters.


Fig 8. Estimation of high-dimensional subspaces using a nonlinearity parametrized with cylindrical basis functions (CBFs). (A) Eight most informative filters for an example complex cell, estimated with iSTAC (top row) and cbf-LNP (bottom row). For the cbf-LNP model, the nonlinearity was parametrized with three first-order CBFs for the output of each filter (see Methods). (B) Estimated 1D nonlinearity along each filter axis, for the filters shown in (A). Note that the third and fourth iSTAC filters are suppressive while the third and fourth cbf-LNP filters are excitatory. (C) Cross-validated single-spike information for iSTAC, cbf-LNP, and rbf-LNP, as a function of the number of filters, averaged over a population of 16 neurons (selected from [29] for having ≥ 8 informative filters). The cbf-LNP estimate outperformed iSTAC in all cases, while rbf-LNP yielded a slight further increase for the first four dimensions. (D) Computation time for the numerical optimization of the cbf-LNP likelihood for up to 8 filters. Even for 30 minutes of data and 8 filters, optimization took about 4 hours. (E) Average number of excitatory filters as a function of the total number of filters, for each method. (F) Information gain from excitatory filters, for each method, averaged across neurons. Each point represents the average amount of information gained from adding an excitatory filter, as a function of the number of filters.

We quantified performance using cross-validated single-spike information, which is equivalent to computing the test log-likelihood under the LNP model. For a subset of neurons determined to have 8 or more informative filters (16/59 cells), the cbf-LNP filters captured more information than the iSTAC filters (Fig. 8C). This indicates that the CBF nonlinearity captures the nonlinear mapping from filter outputs to spike rate more accurately than an exponentiated quadratic, and that this flexibility confers advantages in identifying the most informative stimulus dimensions. The first four filters estimated under the rbf-LNP model captured slightly more information than the cbf-LNP filters, indicating that first-order CBFs provide a slightly too restrictive parametrization for these neurons. Due to computational considerations, we did not attempt to fit the rbf-LNP model with more than 4 filters, but note that the cbf-LNP model scaled easily to 8 filters (Fig. 8D).

The cbf-LNP fit also reveals that excitatory filters provide more information than iSTAC attributes to them, and that excitatory filters should come earlier relative to suppressive filters when ordering by informativeness. Fig. 8A-B, which shows the first 8 filters and associated marginal one-dimensional nonlinearities for an example V1 complex cell, provides an illustration of this discrepancy. Under the iSTAC estimate (Fig. 8A, top row), the two most informative filters are excitatory but the third and fourth are suppressive (see nonlinearities in Fig. 8B). However, the cbf-LNP estimate (and the rbf-LNP estimate, not shown) indicates that the four most informative filters are all excitatory. This tendency holds across the population of neurons. We can quantify it in terms of the number of excitatory filters within the first n filters identified (Fig. 8E) or the total amount of information (i.e., log-likelihood) contributed by excitatory filters (Fig. 8F). This shows that iSTAC, which nevertheless provides a computationally inexpensive initialization for the cbf-LNP estimate, does not accurately quantify the information contributed by excitatory filters. Most likely, this reflects the fact that an exponentiated quadratic does not provide as accurate a description of the nonlinearity along excitatory stimulus dimensions as can be obtained with a nonparametric estimator.

Relationship to previous work

Here, we consider the relationship of the methods described in this study to earlier approaches. Rapela et al. [28] introduced a technique known as extended projection pursuit regression (ePPR), in which the high-dimensional estimation problem is reduced to a sequence of simpler low-dimensional ones. The approach is iterative: a one-dimensional model is found first, and the dimensionality is then progressively increased to optimize a cost function, with the search for filters restricted to dimensions orthogonal to all the filters already identified. From a theoretical perspective, this assumes that the spiking probability can be defined as a sum of functions of the different stimulus components; that is, f(K^T s) = Σ_i f_i(k_i^T s), where k_i denotes the i'th filter.

By contrast, we have focused here on parametrization rather than sequential optimization. In all cases, we optimized the log-likelihood simultaneously over all filter dimensions. For high-dimensional models, we advocate parametrization of the nonlinearity so as to avoid the curse of dimensionality. However, the CBF form we have introduced is more flexible than that of ePPR, both in that two or more dimensional components are easily included, and in that the outputs of the components can be combined non-linearly.

The iSTAC estimator, introduced by Pillow & Simoncelli [19], is based on maximization of the KL divergence between Gaussian approximations to the spike-triggered and raw stimulus ensembles, thus finding the feature space that maximizes the single-spike information under a Gaussian model of both ensembles. Park & Pillow [44] showed its relationship to an LNP model with an exponentiated-quadratic spike rate, which takes the form:

f(s) = exp( s^T C s + b^T s + a ),

for a symmetric matrix C, vector b, and scalar a.

Moreover, they proposed a new model, known as "elliptical LNP", which allows estimation of a nonparametric nonlinearity around the quadratic function (instead of assuming an exponential form). Rajan et al. [24] considered a similar model within an information-theoretic framework and proposed extending it to nonlinear combinations of outputs from multiple quadratic functions. In a similar vein, Sharpee and colleagues [45, 46] have developed closely related approaches.

The authors also proposed a "nonlinear MID" in which the standard MID estimator is extended by setting the firing rate to be a quadratic function of the form f(k^T s + s^T C s). This method is one-dimensional in a quadratic stimulus space (unlike multidimensional linear MID) and therefore avoids the curse of dimensionality. Other work has used independent component analysis to find directions in stimulus space in which the spike-triggered distribution has maximal deviations from Gaussianity [8].

Discussion

Distributional assumptions implicit in MID

Although the MID estimator was originally described in information-theoretic language, we have shown that, when used with plugin estimators for information-theoretic quantities, it is mathematically identical to the maximum likelihood estimator for a linear-nonlinear-Poisson (LNP) encoding model. This equivalence holds irrespective of spike rate, the amount of data, or the size of time bins used to count spikes. We have shown that this follows from the fact that the plugin estimate for single-spike information is equal (up to an additive constant) to the log-likelihood per spike of the data under an LNP model.

In practice, however, the model-agnosticism often attributed to information-theoretic methods can be difficult to achieve, as the need to estimate information-theoretic quantities from data requires the choice of a particular estimator. MID has the virtue of using a nonparametric estimator for the raw and spike-triggered stimulus densities, meaning that the number of parameters (i.e., the number of histogram bins) can grow flexibly with the amount of data. This allows it to converge for arbitrary densities in the limit of infinite data. However, for a finite dataset, the choice of the number of bins is critical for obtaining an accurate estimate. As we show in Fig. 3, a poor choice can lead to a systematic under- or overestimate of the single-spike information and, in turn, a poor estimate of the most informative stimulus dimensions. Determining the number of histogram bins should therefore be considered a model selection problem, validated with a statistical procedure such as cross-validation.

To be clear, the single-spike information represents a valid information-theoretic quantity that does not explicitly assume any model. As noted in [26], it is simply the information carried by a single spike time, independent of all other spike times. However, conditionally independent spiking is also the fundamental assumption underlying the Poisson model and, as we have shown, the standard MID estimator (based on the KL divergence between histograms) is mathematically identical to the maximum likelihood estimator for an LNP model with piecewise constant nonlinearity. Thus, MID achieves no more and no less than a maximum likelihood estimator for a Poisson response model. As we illustrate in Fig. 4, MID does not maximize the mutual information between the projected stimulus and the spike response when the distribution of spikes conditioned on stimuli is not Poisson; it is an inconsistent estimator for the relevant stimulus subspace in such cases.

MID makes different, but not necessarily fewer, assumptions when compared to other LN estimators. For instance, although the maximum-likelihood estimator for a generalized linear model assumes a less-flexible model for the neural nonlinearity than does MID, it readily permits estimation of certain forms of spike-interdependence that MID neglects. In particular, MID-derived estimates are subject to concerns regarding model mismatch that arise whenever the true generative family is unknown [47].

Where the information-theoretic and likelihood-based estimators are identical, nothing is lost by adopting the likelihood-based view. However, besides making assumptions explicit, the likelihood-based framework also readily facilitates the introduction of suitable priors for regularization, of hierarchical models [48, 49], or of more structured models of the type discussed here.

Generalizations

From a model-based perspective, these generalizations correspond to maximum likelihood estimators for a linear-nonlinear-Bernoulli (LNB) model (for binary spike counts) and a linear-nonlinear-count (LNC) model (for arbitrary discrete spike counts). For both models, we obtained an equivalent relationship between the log-likelihood and an estimate of the mutual information between stimulus and response. This correspondence extends previous work that showed only approximate or asymptotic relationships between information-theoretic and maximum-likelihood estimators [20, 24, 25]. The LNC model is the most general of the models we have considered. It requires the fewest assumptions, since it allows for arbitrary distributions over spike count given the stimulus. It includes both LNB and LNP as special cases (i.e., when the count distribution is Bernoulli or Poisson, respectively). We could analogously define arbitrary "LNX" models, where X stands in for any probability distribution over the neural response (analog or discrete), and perform dimensionality reduction by maximizing likelihood for the filter parameters under this model. The log-likelihood under any such model can be associated with an information-theoretic quantity, analogous to the single-spike, Bernoulli, and count information, using the difference of log-likelihoods (see also [35]):

I_X = (1/E[r]) E[ log ( p_X(r|s, θ) / p_X(r|θ_0) ) ],

where p_X(r|s, θ) denotes the conditional response distribution associated with the LNX model with parameters θ, and p_X(r|θ_0) describes the marginal distribution over r under the stimulus distribution p(s). The empirical or plugin estimate of this information is equal to the LNX model log-likelihood plus the estimated marginal entropy:

Î_X = (1/n_sp) L_X(θ; D) + (n/n_sp) Ĥ[r],

where n denotes the number of samples and θ_0 depends only on the marginal response distribution. The maximum likelihood estimate is therefore equally a maximal-information estimate.

Spike-history dependencies can influence the single-bin spike count distribution; for example, a Bernoulli model is more accurate than a Poisson model when the bin size is smaller than or equal to the refractory period, since the Poisson model assigns positive probability to the event of two or more spikes occurring in a single bin. The models we have considered can all be extended to capture spike-history dependencies by augmenting the stimulus with a vector representation of spike history, as in both conditional renewal models and generalized linear models [10, 12, 27, 50–52].

Standard implementations of MID employ histogram-based density estimators for p(K^T s) and p(K^T s|spike). However, dimensionality and parameter count can be a crippling issue given limited data, and density estimation becomes intractable in more than about three dimensions. Furthermore, the dependence of the information on the logarithm of the ratio of these densities amplifies sensitivity to errors in these estimates. The LNP-likelihood view suggests direct estimation of the nonlinearity f, rather than of the densities. Such estimates are naturally more robust, and are more sensibly regularized based on expectations about neuronal responses without reference to any regularities in the stimulus distribution. We have proposed a flexible yet tractable form for the nonlinearity in terms of linear combinations of basis functions cascaded with a second output nonlinearity. This approach yielded a flexible, computationally efficient, constrained version of MID that is able to estimate high-dimensional feature spaces. It is also general in the sense that it encompasses standard MID, generalized linear and quadratic models, and other constrained models that scale tractably to high-dimensional subspaces. Future work might seek to extend this flexible likelihood-based approach further, for example by including priors over the weights with which basis functions are combined to improve regularization, or perhaps by adjusting hyperparameters in a hierarchical model as has been successful with linear approaches [48, 49].

In many such applications, however, issues with dimensionality have prevented the estimation of feature spaces with more than two dimensions. The methods presented in this paper will help to overcome these issues, opening access to further important questions regarding the relationship between stimuli and their neural representation.

Methods

Bound on lost information under MID

Here we present a derivation of the lower bound on the fraction of total information carried by silences for a Bernoulli neuron, in the limit of rare spiking. For notational convenience, let p = p(r = 1) denote the marginal probability of a spike, so the probability of silence is p(r = 0) = 1 − p. Writing P_s = p(s) for the raw stimulus distribution and Q_1 = p(s|r = 1), Q_0 = p(s|r = 0) for the spike- and silence-conditioned distributions, the mutual information between stimulus and response can be written

I(s, r) = p D_KL( Q_1 || P_s ) + (1 − p) D_KL( Q_0 || P_s ).

Note that this is a generalized form of the Jensen-Shannon (JS) divergence; the standard JS divergence between Q_0 and Q_1 is obtained when p = 1/2. In the limit of small p (i.e., the Poisson limit), the mutual information is dominated by the first (spike) term.

Here we wish to bound the fraction of information carried by the Q_0 term. We can do this by computing second-order Taylor expansions of (1 − p) D_KL(Q_0 || P_s) and I(s, r) around p = 0, and showing that their ratio is bounded below by p/2 in the limit p → 0. We conjecture that the bound holds for all values of p. For the case of p = 1/2, this corresponds to an assertion about the relative contribution of each of the two terms in the JS divergence, namely that the Q_0 term accounts for at least one quarter of the JS divergence for any choice of distributions Q_0 and Q_1. We have been unable to find any counter-examples to this (or to the more general conjecture), but have so far been unable to find a general proof.

Single-spike information and Poisson log-likelihood

An important general corollary to the equivalence between MID and an LNP maximum likelihood estimate is that the standard single-spike information estimate Î_ss based on a PSTH measured in response to repeated stimuli is also a Poisson log-likelihood per spike (plus a constant). Specifically, the empirical single-spike information is equal to the log-likelihood ratio between an inhomogeneous and a homogeneous Poisson model of the repeat data (normalized by spike count):

Î_ss = (1/n_sp) [ L(λ̂_ML; r) − L(λ̄; r) ],

where λ̂_ML denotes the maximum-likelihood or plugin estimate of the time-varying spike rate (i.e., the PSTH itself), λ̄ is the mean spike rate across time, and L(λ; r) denotes the log-likelihood of the repeat data r under a Poisson model with time-varying rate λ.

Let {r_jt} denote spike counts collected during a "frozen noise" experiment, with repeat index j ∈ {1, . . ., n_rpt} and index t ∈ {1, . . ., n_t} over time bins of width Δ. Then T = n_t Δ is the duration of the stimulus and N = n_t n_rpt is the total number of time bins in the entire experiment. The single-spike information can be estimated with a discrete version of the formula provided in [26] (see eq. 2.5):

Î_ss = (1/n_t) Σ_{t=1}^{n_t} ( λ̂(t) / λ̄ ) log_2 ( λ̂(t) / λ̄ ),

where λ̂(t) = (1/(n_rpt Δ)) Σ_{j=1}^{n_rpt} r_jt is an estimate of the spike rate in the t'th time bin in response to the stimulus sequence s, and λ̄ = ( Σ_{t=1}^{n_t} λ̂(t) ) / n_t is the mean spike rate across the experiment. Note that this formulation assumes (as in [26]) that T is long enough that an average over stimulus sequences is well approximated by the average across time. The plugin (ML) estimator for the spike rate can be read off from the peri-stimulus time histogram (PSTH); it results from averaging the response across repeats for each time bin. Clearly, λ̄ = n_sp/(NΔ), where n_sp = Σ_{j,t} r_jt is the total spike count. This allows us to rewrite the single-spike information in terms of Poisson log-likelihoods of the raster given the time-varying rate vector λ̂:

Î_ss = (1/n_sp) [ L(λ̂; r) − L(λ̄; r) ],

which is identical to the relationship between single-spike information and Poisson log-likelihood expressed in Equation 13. Thus, even when estimated from raster data, I_ss is equal to the difference between Poisson log-likelihoods under an inhomogeneous (rate-varying) and a homogeneous (constant-rate) Poisson model, divided by the spike count (see also [57]). These normalized log-likelihoods can be conceived as entropy estimates, with −(1/n_sp) L(λ̄; r) providing an estimate of the prior entropy, measuring the prior uncertainty about spike times given the mean rate, and −(1/n_sp) L(λ̂; r) corresponding to the posterior entropy, measuring the posterior uncertainty once we know the time-varying spike rate. A similar quantity has been used to report the cross-validation performance of conditionally Poisson encoding models; in that setting, the single-spike information is evaluated using the rate estimate λ̂ obtained with parameters fit to training data and responses r from unseen test data. This results in the "cross-validated" single-spike information:

Î_ss^test = (1/n_sp^test) [ L(λ̂; r^test) − L(λ̄; r^test) ].
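For reference, a direct implementation of this PSTH-based plug-in estimator might look as follows (a sketch with assumed input conventions):

```python
import numpy as np

def psth_single_spike_info(raster, dt):
    """Single-spike information (bits/spike) from a repeat ("frozen noise") raster.

    A sketch of the plug-in estimator described in the text: raster has shape
    (n_repeats, n_timebins) and contains spike counts; dt is the bin width.
    """
    psth_rate = raster.mean(axis=0) / dt           # PSTH-based rate estimate per bin
    mean_rate = psth_rate.mean()                   # mean rate across the experiment
    ratio = psth_rate / mean_rate
    nz = ratio > 0                                 # empty bins contribute zero
    terms = np.where(nz, ratio * np.log2(np.where(nz, ratio, 1.0)), 0.0)
    return terms.mean()                            # average over time bins
```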

Note that this quantity can be negative in cases of extremely poor model fit, that is, when the model prediction on test data is worse than that of the best constant-spike-rate Poisson model. Cross-validated single-spike information provides a useful measure for comparing models with different numbers of parameters (e.g., a 1-filter vs. 2-filter LNP model), since units of "bits" are more interpretable than the raw log-likelihood of test data. Generally, $I_{ss}^{cv}$ can be considered a lower bound on the model's true predictive power, due to stochasticity in both training and test data. By contrast, the empirical $I_{ss}$ evaluated on training data tends to overestimate information due to over-fitting.
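To make the correspondence concrete, the sketch below simulates a frozen-noise raster, checks numerically that the empirical single-spike information computed from the PSTH equals the Poisson log-likelihood ratio per spike, and then evaluates a cross-validated version on held-out repeats. This is a minimal illustration, not the paper's code: the simulated rate, bin size, train/test split, the use of the test data's mean rate for the homogeneous reference, and the small floor on the predicted rate (to avoid $\log 0$ in held-out bins) are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rpt, n_t, Delta = 60, 200, 0.05
true_rate = 20 * (1.5 + np.sin(np.linspace(0, 6 * np.pi, n_t)))   # sp/s
R = rng.poisson(true_rate * Delta, size=(n_rpt, n_t))             # frozen-noise raster r[j, t]

def poisson_ll(rate, counts, dt):
    """Poisson log-likelihood (nats), dropping the count-factorial constant (it cancels in ratios)."""
    lam = np.broadcast_to(np.asarray(rate, dtype=float) * dt, counts.shape)
    with np.errstate(divide="ignore", invalid="ignore"):
        ll = counts * np.log(lam) - lam
    return np.where((counts == 0) & (lam == 0), 0.0, ll).sum()

def single_spike_info(rate_est, counts, dt):
    """Empirical I_ss (bits/spike): Poisson log-likelihood ratio per spike."""
    mean_rate = counts.sum() / (counts.size * dt)                  # homogeneous reference rate
    return (poisson_ll(rate_est, counts, dt) - poisson_ll(mean_rate, counts, dt)) \
        / (counts.sum() * np.log(2))

# 1) Equivalence on the full raster: direct PSTH formula vs. log-likelihood ratio.
psth = R.mean(axis=0) / Delta
xbar = psth.mean()
ratio = np.where(psth > 0, psth / xbar, 1.0)                       # zero-rate bins contribute 0
I_ss_direct = np.mean(ratio * np.log2(ratio))
print(I_ss_direct, single_spike_info(psth, R, Delta))              # these agree to numerical precision

# 2) Cross-validated version: PSTH from "training" repeats, evaluated on held-out repeats.
psth_train = np.maximum(R[: n_rpt // 2].mean(axis=0) / Delta, 1e-3)
print(single_spike_info(psth_train, R[n_rpt // 2:], Delta))        # typically somewhat lower
```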

Computation of model-based information quantities

Consider a world with two stimuli, 'A' and 'B', and two possible discrete stimulus sequences, $s_1 = AB$ and $s_2 = BA$, each of which occurs with equal probability, so $p(s_1) = p(s_2) = 0.5$. Assume each sequence lasts $T = 2$ s, so the natural time bin size for considering the spike response is $\Delta = 1$ s. Suppose that stimulus A always elicits 3 spikes, while B always elicits 1 spike. Thus, when sequence $s_1$ is presented, we observe 3 spikes in the first time interval and 1 spike in the second interval; when $s_2$ is presented, we observe 1 spike in the first time interval and 3 spikes in the second.

For this example, $x_1(t)$ takes the value 3 during (0, 1] and 1 during (1, 2], while $x_2(t)$ takes values 1 and 3 during the corresponding intervals. The mean spike rate for both stimuli is $\bar{x} = 2$ sp/s. Plugging these into Equation 54 gives a single-spike information of $I_{ss} = 0.19$ bits/spike. This result is slightly easier to grasp using an equivalent definition of single-spike information as the mutual information between the stimulus $s$ and a single spike time $\tau$ (see [26]). If one were told that a spike, sampled at random from the four spikes present during every trial, occurred during (0, 1], then the posterior $p(s\,|\,\tau \in (0,1])$ attaches 3/4 probability to $s = s_1$ and 1/4 to $s = s_2$. The posterior entropy is therefore $-0.25\log_2 0.25 - 0.75\log_2 0.75 = 0.81$ bits. We obtain the same entropy if the spike occurs in the second interval, so $H(s|\tau) = 0.81$ bits. The prior entropy is $H(s) = 1$ bit, so once again we have $I_{ss} = 1 - 0.81 = 0.19$ bits/spike. The Bernoulli information, by contrast, is undefined, since $r$ takes values outside the set {0, 1} and therefore cannot have a Bernoulli distribution. To make the Bernoulli information well defined, we would need to either truncate spike counts above 1 (e.g., [59]) or else use a smaller bin size so that no bin contains more than one spike. In the latter case, we would need to provide more information about the distribution of spike times within these finer bins. If, for example, the three spikes elicited by A are evenly spaced within the interval and we use bins of 1/3 s, then the Bernoulli information will clearly exceed the single-spike information, since the time of a no-spike response ($r = 0$, a term neglected by single-spike information) provides perfect information about the stimulus, as it occurs only in response to B.

We defined $I_{count}$ to be the mutual information normalized by the mean spike count per bin (Equation 35). Thus, $I_{count} = 0.5$ bits/spike, which is more than double the single-spike information.
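Because the numbers in this example are small, the calculation can be verified directly; the following snippet reproduces both values (a worked check of the quantities quoted above, using the stated 1 s bins).

```python
import numpy as np

Delta, x_bar = 1.0, 2.0                                # 1 s bins, mean rate 2 sp/s
rates = {"s1": np.array([3.0, 1.0]),                   # sp/s during (0,1] and (1,2]
         "s2": np.array([1.0, 3.0])}

# Single-spike information (Equation 54): average over time of (x/x_bar) * log2(x/x_bar)
I_ss = np.mean([(x / x_bar) * np.log2(x / x_bar) for x in rates.values()])
print(round(I_ss, 2))                                  # 0.19 bits/spike

# I_count: mutual information between stimulus and per-bin spike count, per spike.
# In any bin the count (3 or 1) identifies the stimulus exactly, so I(s; r) = H(s) = 1 bit,
# and the mean count per bin is 2 spikes.
I_count = 1.0 / 2.0
print(I_count)                                         # 0.5 bits/spike
```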

Gradient and Hessian of LNP log-likelihood

We performed joint optimization of the filter parameters K and basis-function weights $\{\alpha_i\}$ using MATLAB's fminunc function. We found this approach to converge much more rapidly than alternating coordinate ascent. We used an analytically computed gradient and Hessian of the joint log-likelihood to speed up performance, which we provide here. In the expressions below, $\theta = \{K, \alpha\}$ are the model parameters, $\Delta$ is the time bin size, and $\mathbf{1}$ denotes a vector of ones. The first and second derivatives of the log-likelihood follow, where multiplication, division, and exponentiation operations on vector quantities indicate component-wise operations.

Let $k_1, \ldots, k_m$ denote the linear filters, i.e., the columns of K. The required gradients of $\mathcal{L}$ with respect to the model parameters can then be written in terms of $g'$, the derivative of the inverse-link function $g$ evaluated at its input, and '$\circ$', the Hadamard or component-wise vector product. Lastly, the second-derivative blocks, which can be plugged into Equation 62 to form the Hessian, are given below.
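As a reduced illustration of how such analytic gradients are used, the sketch below writes the Poisson log-likelihood of a single-filter LNP model with a soft-rectifying inverse link $g(u) = \log(1+e^u)$ and its gradient with respect to the filter, and checks the gradient against finite differences. This is only a hedged, single-filter sketch: the paper's full cbf-LNP objective also involves the basis-function weights and the Hessian blocks, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_t, d, Delta = 500, 20, 0.01
S = rng.standard_normal((n_t, d))                # stimulus matrix, one row per time bin
k_true = rng.standard_normal(d) / np.sqrt(d)

def softplus(u):
    return np.log1p(np.exp(u))

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

r = rng.poisson(softplus(S @ k_true) * Delta)    # simulated spike counts

def negloglik(k):
    lam = softplus(S @ k)                        # spike rate per bin (sp/s)
    return -np.sum(r * np.log(lam * Delta) - lam * Delta)

def grad_negloglik(k):
    u = S @ k
    lam = softplus(u)
    # Positive log-likelihood gradient: sum_t (r_t/lam_t - Delta) * g'(u_t) * s_t, g'(u) = sigmoid(u)
    w = (r / lam - Delta) * sigmoid(u)
    return -(S.T @ w)

# Finite-difference check on one coordinate of the gradient
k0 = 0.1 * rng.standard_normal(d)
eps, i = 1e-6, 3
e = np.zeros(d); e[i] = eps
numeric = (negloglik(k0 + e) - negloglik(k0 - e)) / (2 * eps)
print(numeric, grad_negloglik(k0)[i])            # the two values should agree closely
```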

V1 data analysis

The spatiotemporal stimulus had between 8 and 32 spatial bars, and we considered 10 time bins for the temporal integration window. This made for a stimulus space with dimensionality ranging from 80 to 320. The cbf-LNP model was implemented with a cylindrical basis function (CBF) nonlinearity using three first-order CBFs per filter. For a k-filter model, this resulted in 3k parameters for the nonlinearity, and (240+3)k parameters in total for a stimulus with 24 bars.

Unlike the histogram-based parametrization discussed in the manuscript (which produces a piecewise-constant nonlinearity), this results in a smooth nonlinearity and, more importantly, a smooth log-likelihood with tractable analytic gradients. We defined a grid of RBFs with three grid points per dimension, so that the CBF and RBF models were identical for a 1-filter model. For a k-filter model, this resulted in $3^k$ parameters for the nonlinearity, and $240k + 3^k$ parameters in total, for a stimulus with 24 bars. For both models, the basis-function responses were combined linearly and transformed by a "soft-rectification" function, g(·) = log(1+exp(·)), to ensure positive spike rates. We also evaluated the performance of an exponential function, g(·) = exp(·), which yielded slightly worse performance (reducing single-spike information by ~0.02 bits/spike).
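The difference in how the two parametrizations scale can be illustrated with a small sketch. Here first-order CBFs are taken to be Gaussian bumps in a single projected coordinate, and RBFs to be Gaussian bumps on the full grid of center combinations; the centers, width, and Gaussian form are illustrative stand-ins rather than the paper's exact basis.

```python
import itertools
import numpy as np

def cbf_features(X, centers, width):
    """Cylindrical basis: one bump per (filter axis, center); k*d features for k filters."""
    feats = [np.exp(-(X[:, i:i+1] - centers) ** 2 / (2 * width ** 2))
             for i in range(X.shape[1])]
    return np.hstack(feats)

def rbf_features(X, centers, width):
    """Radial basis on the grid of all center combinations; d**k features for k filters."""
    grid = np.array(list(itertools.product(centers, repeat=X.shape[1])))
    sq_dist = ((X[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * width ** 2))

X = np.random.default_rng(3).standard_normal((1000, 4))   # projected stimuli K^T s, k = 4 filters
centers = np.array([-1.0, 0.0, 1.0])                       # 3 basis functions per axis
print(cbf_features(X, centers, 1.0).shape)                 # (1000, 12) -> 3k nonlinearity weights
print(rbf_features(X, centers, 1.0).shape)                 # (1000, 81) -> 3**k nonlinearity weights
```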

Both models were fit incrementally, with the (N+1)-dimensional model initialized with the parameters of the N-dimensional model, plus one additional filter (initialized with the iSTAC filter that provided the greatest increase in log-likelihood). The joint likelihood in K and $\alpha$ was ascended using MATLAB's fminunc optimization function, which exploits analytic gradients and Hessians. The models were fit to 80% of the data, with the remaining 20% used for validation. In order to calculate the information contributed by excitatory filters under the cbf-LNP model (Fig. 8F), we removed each filter from the model and refit the nonlinearity (using the training data) using just the other filters. We quantified the information contributed by each filter as the difference between the log-likelihood of the full model and the log-likelihood of the reduced model.

We sorted the filters by informativeness and computed the cumulative sum of information loss to obtain the trace shown in Fig. 8F.
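The bookkeeping for this leave-one-filter-out analysis is summarized in the sketch below. The callables fit_nonlinearity and log_likelihood are hypothetical stand-ins for the cbf-LNP refitting and evaluation routines; only the removal, sorting, and cumulative-sum logic is meant literally.

```python
import numpy as np

def filter_importance(K, train_data, fit_nonlinearity, log_likelihood):
    """Rank filters by log-likelihood loss when each is removed and the nonlinearity is refit."""
    full_params = fit_nonlinearity(K, train_data)
    ll_full = log_likelihood(K, full_params, train_data)
    deltas = []
    for i in range(K.shape[1]):
        K_reduced = np.delete(K, i, axis=1)                     # drop the i-th filter
        reduced_params = fit_nonlinearity(K_reduced, train_data)
        deltas.append(ll_full - log_likelihood(K_reduced, reduced_params, train_data))
    order = np.argsort(deltas)[::-1]                            # most informative filters first
    cumulative_loss = np.cumsum(np.asarray(deltas)[order])      # trace as in Fig. 8F
    return order, cumulative_loss
```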

The results shown in Fig. 8D were averaged over 100 repetitions using different random seeds. For each cell, four segments of activity were chosen randomly, with fixed lengths of 5, 10, 20, and 30 minutes, which contained between about 22000 and 173000 spikes. Even with 30 minutes of data, 8 filters could be identified within about 4 hours on a desktop computer, making the approach tractable even for large numbers of filters. Code will be provided at http://pillowlab.princeton.edu/codehtml.

Acknowledgments

We thank J. M. Beck and P. E. Latham for insightful discussions and for providing scientific input during the course of this project. We thank M. Day, B. Dichter, D. Goodman, W. Guo, and L. Meshulam for providing comments on an early version of this manuscript.

Author Contributions
