SciSurf: Index of 'Power Laws for Heavy-Tailed Distributions: Modeling Allele and Haplotype Diversity for the National Marrow Donor Program'

Power Laws for Heavy-Tailed Distributions: Modeling Allele and Haplotype Diversity for the National Marrow Donor Program

Noa Slater, Yoram Louzoun, Loren Gragert, Martin Maiers, Ansu Chatterjee, Mark Albrecht

Published in PLOS Comp. Biol., April 2015

Abstract

These measures are of particular interest in the field of hematopoietic stem cell transplant (HSCT). Donor/Recipient suitability for HSCT is determined by Human Leukocyte Antigen (HLA) similarity. Match predictions rely upon a precise description of HLA diversity, yet classical estimates are inaccurate given the heavy-tailed nature of the distribution. This directly affects HSCT matching and diversity measures in broader fields such as species richness. We, therefore, have developed a power-law based estimator to measure allele and haplotype diversity that accommodates heavy tails using the concepts of regular variation and occupancy distributions. Application of our estimator to 6.59 million donors in the Be The Match Registry revealed that haplotypes follow a heavy tail distribution across all ethnicities: for example, 44.65% of the European American haplotypes are represented by only 1 individual. Indeed, our discovery rate of all US. European American haplotypes is estimated at 23.45% based upon sampling 3.97% of the population, leaving a large number of unobserved haplotypes. Population coverage, however, is much higher at 99.4% given that 90% of European Americans carry one of the 4.5% most frequent haplotypes. Alleles were found to be less diverse suggesting the current registry represents most alleles in the population. Thus, for HSCT registries, haplotype discovery will remain high with continued recruitment to a very deep level of sampling, but population coverage will not. Finally, we compared the convergence of our power-law versus classical diversity estimators such as Capture recapture, Chao, ACE and Jackknife methods. When fit to the haplotype data, our estimator displayed favorable properties in terms of convergence (with respect to sampling depth) and accuracy (with respect to diversity estimates). This suggests that power-law based estimators offer a valid alternative to classical diversity estimators and may have broad applicability in the field of population genetics.

Author Summary

The heavy tail is eXpected from theoretical considerations and is observed in most populations. Accurate measures of diversity are difficult to achieve given that a limited number of common hap-lotypes represent the majority of the population, whereas the major contributor to haplo-type diversity comes from unique haplotypes that are “rare” and present in only a fraction of the population. A major issue for unrelated HSCT donor registries is estimating population coverage with respect to servicing the public need. We here use a power-law methodology that accommodates heavy-tails to estimate both the population coverage by ethnicity in the US and the genetic diversity of alleles and haplotypes. For the European American population, which has the deepest sampling amongst ethnicities, we show that registry population coverage is better than 99%, but the diversity of this sample only represents 40% of the unique haplotypes eXpected to be found in the population. Population coverage for other ethnicities was poorer and ranged down to 92% as was the case for Native Americans that had the worst coverage. We further show that the formalism developed here produces better estimates of the population properties than existing methods.

Introduction

In the context of hematopoietic stem cell transplant (HSCT) matching, proper estimates of allele and haplotype diversity are essential for describing the population genetics that govern the clinical suitability of a donor/patient match. Matching algorithms scan large unrelated donor registries to identify suitable matches based upon individual Human Leukocyte Antigen (HLA) genetic typings. HLA is defined by the super locus of genes contained in the major histocompatibility complex (MHC) on the 6th chromosome that encodes proteins governing the adaptive immune system response [1]. Successful transplantation requires careful matching of donor/ recipient HLA to avoid adverse immune reactions that result in graft rejection or graft versus host disease [2,3]. The ambiguous nature of HLA typing, however, presents challenges for transplant matching, given that donor registries contain a range of typing methods and allele definitions that have evolved since the 1980’s. This “mixed resolution” data contains uncertainties regarding the exact alleles each donor has and their phase on the chromosome. The standard approach for resolving these uncertainties is to impute each donor’s HLA typing to a common allele resolution and set chromosomal phase using expectation maximization (EM) algorithms [4]. Given a proper candidate set of haplotypes, EM algorithms work well to estimate the distribution of this defined population, which becomes the reference data for computing accurate donor/patient match predictions.

This problem arises due to the following main reasons: Lack of Adequate Sampling Depth—the MHC region has the highest allele diversity within the human genome, so even a large sample relative to the population is unlikely to observe all clinically relevant haplotypes. For example, in the European American population 44.65% of haplotypes in the Be The Match registry are singletons (i.e. are represented by only one individual), yet the most frequent haplotype represents 6.11% of the sample so one might falsely conclude that variation is lower than it really is. This motivates the need to develop reliable methods for assessing sampling depth and the benefits of continued sampling; Ambiguous Nature of HLA T yping—true haplotype variation is confounded by ambiguous typing methods, which describe each individual as a set of possible alleles at each loci (consistent with HLA typing). When candidate haplotypes are constructed from cross-combinations of these possible alleles, an intractably large set results due to the “curse of dimensionality”[5]. Currently, this exhaustive set of haplotypes is heavily trimmed prior to estimation with the EM algorithm, but the trimming strategies are not quantitatively informed; Variation in Diversity Among Populations—ancestral migration patterns have resulted in different patterns of allele and haplotype diversity among ethnic groups, implying that sampling depth requirements are likely to vary by population. Clearly, there is a need to develop quantitative estimates of allele and haplotype diversity.

Estimating A and H is a difficult problem given that we do not directly observe all alleles and haplotypes in the population. Instead, we must establish and extrapolate a relationship between the number of kinds we observe and the frequency of each kind. Mathematical frameworks for describing this relationship have been developed in the study of species richness and occupancy. Capture-recapture methods have been adapted from ecology to genetics for estimating the total number of different alleles/haplotypes [6]. Such methods could be applied to estimate A and H by tagging re-samples of alleles and haplotypes. Although interesting, these capture recapture methods require a very large sampling depth for heavy tailed distributions, as will be shown. Occupancy distributions [7]offer a more natural representation of our data sampling process and use the mathematical concept of regular variation to extrapolate A and H using a power law function that describes the distribution of allele and haplotype frequencies in the population [7,8]. The power law relationship was constructed under the assumption that an infinite number of categories (or kinds) exist in the population; so for our purposes we apply a boundary constraint to estimate a finite number of categories under this general framework. In the medical field of HSCT, estimates of A are largely catalog based; unique HLA genomic sequences are stored in databases as alleles using locus-specific nomenclature (the best known Table 1. List of symbols used. Symbol Definition pj Apriori probability of choosing a haplotype j for a person in the population. Ntotd Total population size. A Total number of unique alleles in the population. U(R) Expected number of unique haplotypes in a sample size of R. a Power of distribution. C Normalization factor. Xmin Minimal value of non zero p]. Xmax Maximal value of non zero p,-. R Sample size. Z(R) Expected probability for an unobserved haplotype in a sample of size R. X0 Point at which slope of distribution changes. pH (x) Density of p; values—the number of haplotypes in a region = the number of haplotypes pH(X) = VHlfIX) Density distribution function for the probability of obsen/ing a clone of frequency x.

Further information on the prevalence of specific HLA alleles in the general population is provided by the Common and Well Documented (CWD) allele list [10]. Estimates of H are typically based on EM derived HLA haplotype frequencies, which represent nomenclature-based allele combinations that are likely to be found in the population. These estimates take into account that genetic information is inherited in blocks making the true value of H much smaller than the eXpansion of all possible allele combinations. For example, estimates of H within the United States population in 2013 identified over 160,000 unique 6-locus haplotypes in the Be The Match Registry, where 99% of the time H was comprised of combinations of “common” alleles and roughly 90% of the known IMGT documented alleles (A) were absent from the haplotype frequencies [11]. Thus, current methods for estimating of A and H rely upon direct observation of a given allele or haplotype. For estimating the “common” portions of A and H, this is a reliable strategy for sufficient observational data is likely to exist. However, such an approach is unsuitable for estimating the “rare” portion of A and H lacking direct observational data. Evidence for the existence of a large number of “rare” alleles and haplotypes in the overall population is outlined by Klitz [12] , who modeled the effects of gene conversion, point mutations, and recombination to establish the heavy tailed nature of allele and haplotype distributions.

This power-law framework is useful for modeling the probability mass (and number) of A and H that are unseen in the sample and represented by the “invisible” tail of the distribution. We perform this on 6-locus HLA data (A~C~B~DRBX~DRB1~DQB1; where DRBX = DRB3/4/ 5) from the Be The Match Registry, which is one of the largest data sets characterizing human HLA population genetics containing the typing information for 6.59 million donors [11]. We also assess registry coverage, which is the proportion of the total population occupied by the observed haplotypes, of A and H with respect to the potential base of patients seeking HSCT to inform donor recruitment strategies. Last, we discuss broader applications of the power-law methodology outside HSCT for modeling species richness in the field of ecology, which is characterized by similar heavy-tailed distributions.

Results

Haplotype Distribution Formalism

In order to estimate the total number of unique haplotypes H having probability pj for haplo-type j 1gj§H,O§pj-§ 1, we define a counting function (vH (x)),also denoted an exceedance function, representing the number of haplotypes With a relative frequency (i.e. pj) above or equal to x. Observations suggest that vH (x) can be approximated by a scale free function:

However, we here take a continuous approximation of.VH (x) This approximation stops being valid at some maximal and minimal haplotype relative frequencies (XmiI1 and Xmax) Obviously, a (unique) haplotype cannot have a frequency of less than 1 and, thus, a relative frequency of less than1 / Ntotal (Xmin 2 1 / Ntoml), Where Ntoml is the total number of haplotypes in the population (tWice the total population size since each sampled person has two sets of haplotypes). Ntoml is typically much larger than the number of unique haplotypes, H. Similarly, above a given relative frequency, vH (x) becomes smaller than 1, and the approximation stops being valid. The total number of unique haplotypes in this formalism is H : VH (Xmin)Given the continuous formalism, one can also define a haplotype count density (Which is not a formal density function, since it is not normalized to 1), representing the number of unique haplotypes H in a given relative frequency interval of dx: [,4 H (x)is not a probability density function (PDF). However it can be normalized, by defining: which is a proper density distribution function. Thus, pH(x) provides a description of how common (i.e. the density) each pj is in the population. Further, a proper cumulative distribution function (GDP) for the 19175 (the relative haplotype frequencies) can be derived by normalizing VH(x) where Pr(pj g x) = 1 — vH (x) / H , since vH (x) / H represents the complementary CDF (i.e.1-Pr(pj§x)). Also, as noted above, the power law eXponent of yH(x) is 1 unit greater than the power law eXponent of VH(x)since these two quantities share the properties linking CDFs to PDFs, namely that a PDF is proportional to the first derivative of its CDF.

In order to estimate the properties of the real population haplotype distribution, a parallel theory is required for finite populations. The finite population affects the upper and lower bounds of the distribution. The lower bound is determined by the absence of relative frequencies smaller than 1 / Ntoml and is common to all types of distributions. The upper bound is specific to heavy tail distributions. In such distributions, the contribution of the right tail to the overall probability mass is considerable. As such, the normalization constant of the distribution is greatly affected by the maXimal relative frequency in the population (Xmax).

In order to incorporate the finite size of the population, minimal and maximal relative haplotype frequencies are defined (Xmin and Xmax). As mentioned, the minimal frequency for a haplotype cannot be less than one person in the whole Ntoml population that carries one copy of this haplotype (Xml-HZNtoml'l). However, in many realistic cases, the scale free distribution only starts at a higher value than that.

As mentioned, the minimum difference between relative frequencies is Ntoml'l. We can thus define Ntoml'lto be the minimal bin size. Per definition, at the largest ‘X’ value, there is typically one unique haplotype. Thus,vH(xmaX) = 1, leading to: Assuming Xmax is an upper threshold, we have the equal sign in the equation and in this case Xmax can then be numerically extracted from Eq 6. In many realistic cases, the scale free distribution has an exponential cutoff, starting at much lower values and XmX can be lower (see 82 Text). Estimates of H (Number of Unique Haplotypes)

It is thus of interest to estimate the total size of H, and the relation between the sampling size and the portion of H that we actually observe. If the haplotype relative frequency distribution pH (x) was light tailed, most haplotypes would have the same order of frequency. In such a case, it would have been enough to sample a limited part of the population to detect the vast majority of different haplotypes and estimate H. More specifically, if the sample would be at least an order larger than H, most unique haplotypes would be detected. However, in heavy tailed distributions (such as power-law distributions) the vast majority of haplotypes are very rare. A deep sampling is thus required for a precise estimate of H and a very deep sample is required to observe all unique haplotypes. On the other hand, the contribution of these rare haplotypes to the coverage is often negligible since most of the population has common haplotypes. The precise balance between these two phenomena is determined by the power eXponent of the distribution (a). Formally, the expected number of unique haplotypes (U(R)) in a sample size (R) can be estimated, assuming a truncated power law formalism as defined above, to be (see Sl—S3 Texts): Where y is the partial gamma function. The total number of unique haplotypes can be computed to be:

Estimates of the Distribution Properties

Many methods have been proposed to estimate this slope, including a fit of the CDF or the PDF, or a fit of the power of the Zipf plot [15]. However, such methods have been argued to be flawed, as discussed in an excellent reVieW by Clauset et al. [16]. The same authors have proposed estimates for the slope of the scale free distribution for either continuous [17] (Eq 9) or discrete (Eq 10) distributions: absoluteWhere n is the total number of observed haplotypes, pj's are the haplotype relative frequencies, pm" is the relative frequency of the rarest observed haplotype, y's are the parallel values in abso-lute frequencies for the discrete case and ymin is the minimal absolute frequency. Another estimator based on the asymptotic properties of regular variation was proposed by Kn

Note that this estimator was derived for the GDP and fl = 05-1, which is 1 less than our PDF based estimator. While these methods are precise for a full power law distribution, they do not converge to the proper value when the distribution is truncated either from above or from below (Fig 1A). We also checked the convergence of Eq 10 when using minimal absolute frequency y*which is larger or equal to yminmentioned 0 5 . This truncated power law estimate does converge to the y >|< — . right value, but very slowly (Fig 1A orange circles). Formally, we estimated the fit to the distribution using a Kolmogorov Smirnov test, and used the lower cutoff 1min producing the best fit.

The parameters are bounded by Eq 6 for XmX and Ntotal'l for Xmin. The numerical minimization is initialized With Xmax from the observed data, Xmin at its boundary value, and a as it is computed from Eq 9 (for a schematic figure see Fig 2). The value of a computed using this estimate converges to the proper value at limited sampling depth (Fig 1A).

Methodology Validation

We generated H haplotypes with relative frequencies (pj ’ 5) taken from a pure power law distribution with exponent of a. Next, we generated samples of different sizes (R) based on the pj ’ 5 values. We compared the expected value of U (R)(Eq 7) to the actual number of different unique haplotypes that we got in the sample, with an excellent fit (less than 1% deviation—SI Fig). A similar analysis with a truncated power law gave similar results.

Schematic figure. (A) We observe a population with different haplotypes. From the population we extract two measures—the haplotype frequency distribution (B) and the number of unique haplotypes as a function of the sample size (C). We assume that the frequency distribution (B) is a scale free distribution with upper and lower cutoff values Xmingngmax. For the sake of simplicity, we assume a zero probability to obsen/e haplotypes with frequencies beyond these values. We provide an initial guess for the lower cutoff to be limited by the total population size, and the upper cutoff, we limit by an upper estimate from the total population size (Eq B6). We use an initial guess of Xmin at its boundary, the Clauset estimate for the slope, and the obsen/ed highest frequench 0 max. We then fit the obsen/ed unique haplotype cun/e (C) with an log(observed (R’)) for different values of sample sizes R'. Finally, we extract the optimal parameters (E) and produce an estimate of the number of unique haplotypes for any population size N, where N is the target population size. We then compared the convergence rate of our method with other frameworks for estimating H. We used a simulated population from a truncated power law, as described above and formed sub samples of incremental size. We then generated estimates of H as a function of increasing subsample size.

This method was adapted to compute the total number of different alleles instead of the total population as is performed in ecology. In order to compute the capture recapture estimate, we divided each subsample into two smaller sub samples: A and B. We then computed the ratio between the product of the number of different alleles/haplotypes in A and B and the number of different alleles/haplo-types in their intersection:| Unique(A)| >|< |unique(B) | / | Unique(A U B) We further compared our method with a Iackknife based estimator [19]. Capture-recapture and power-law estimates were found to converge to the true value of H, but the capture-recapture method required a very deep sample of the population to attain accuracy whereas the power-law method converged quickly and offered accurate estimates, even with limited sampling (Fig 1B). The Jackknife based estimator did not converge to the proper value.

Many of these methods have been merged into the CatchAll software [21]. To further compare our method, with existing state of the art methods, we have tested these measures on the same samples. All methods proposed by CatchAll showed a very slow convergence, or actually no convergence (Fig 3).

Comparisons of methods to evaluate the haplotype number. We compare the Bunge and Barger (2008) method as implemented in the Catchall model using a population with scale free frequency distribution of haplotypes with a slope of -1 .5 and compare the estimated species richness as supplied by the Catchall software for all models where the software converges to a result. These models include parametric predictions procedures, such as Mixture-of-two-exponentials-mixed Poisson and Mixture-of-three-exponentials-mixed Poisson on the obsen/ed species richness or on their log values and nonparametric procedures, such as the ACE (Abundance-based Coverage Estimator) ACE1 (Abundance-based Coverage Estimator for highly heterogeneous cases) and different versions of the Chao-Bunge gamma-Poisson estimator. Each line represents the estimated number of haplotypes using one of the Catchall methods. The black line represents the real simulated species number. The purple line with diamonds represents our estimates. Our estimates converge much faster and to a much more precise estimate of the real haplotype numberthan any of the methods proposed by Catchall.

We then validated that the resulting values fit the observed distribution all the way to the full sample size (7.8e6) and extrapolated it to the total European American population as defined by the Census [22]. The final value is an estimate of the total number of haplotypes in the European American population in the US (Fig 4; see 82 Fig for the other populations). When we estimated Xmin, XmX and a from the distribution (purple full line) an excellent fit was obtained.

The fit diverges from the observed result (Fig 4 and 82 Fig). The strongest deviation is for a fixed a, and the smallest is for a fixed Xmin. A fixed Xmin is expected to have the smallest effect, since its estimate is closest to the reality in the studied distributions. To summarize, the method above has been shown to converge properly in simulated and sampled data. However, for accurate estimates, it is recommended to treat all parameters Xmin, Xmax and a as free parameters to be estimated in the model.

Haplotype Numbers in US Sub-populations

Values for other sub-populations are given in Table 2. For most populations, the haplotype frequency distribution follows a power law with eXponent or of the PDF between 1.4 and 1.9. Further, estimates of the power law eXponent converged before the full sample size was analyzed (Table 3 and 83—88 Figs). Caucasian population

Comparison of computed and observed expected number of haplotypes in a sample of size R : U(R)forthe European American population for different sample sizes R. The gray squares are observations. A|| estimates were performed using 10% of the sample. The lines are an estimation of U(R) with all free parameters (full purple), maximal Xmax and free a (dashed green line), free Xmax and free a but fixed Xmin (dashed black line) and free boundaries but a based on the Clauset estimate (double dashed red line). The inset shows the ratio between the computed number of haplotypes and the real one.

We therefore expect that haplotypes discovery rate should remain appreciable with continued sampling given that the current sample size is about 3.97% of the European American populations. In contrast, if the distribution were light-tailed, we would be eXpected to detect all of the haplotypes that exist. Since the distribution is heavy tailed, we estimate to have detected only 30% of H that exist in the US population (Table 2).

Population Coverage

When or is close to 1, most of the haplotypes have practically no contribution to the coverage. Actually only very feW haplotypes in the right hand tail of vH(x) (the larger values of x) contribute to the population coverage. For example, in the European American population, the most frequent 1% of haplotypes provide 73.18% of the coverage. The fraction of the population With haplotypes Which have already been identified can be computed to be (see S4 Text): max min where y is the partial gamma function. As was the case for Eq 7, we compared the values obtained from this calculation to the values obtained in the simulation mentioned above, and obtained an excellent fit (S9 Fig).

However, the coverage of these 88,620 haplotypes is about 99.4% (Fig 5C and Table 2). The relationship between the coverage and the number of haplotypes depends on 0c, and varies among the studied populations. We analyzed 5 combined populations, for which clear census based estimates of the total population size are available (African-Americans—AFA, Asian Pacific Islanders—API, European Americans—CAU, Hispanic—HIS and Native Americans—NAM), and calculated the sample size needed in order to have the same percent coverage of the European American population (Fig 5D and Table 2). The combined populations have similar distributions (Fig 5A), with the European American population having the flattest distribution (least diverse, or most contribution from frequent haplotypes), and the Native American having the steepest slope (most diverse, and the least contribution from frequent haplotypes). However, since as mentioned above the number of haplotypes (H) is affected by the total population size, the Native American population is actually the least diverse, while the European American population is the most diverse, when H is taken as a measure of the diversity (Fig 5B). While most haplotypes are not sampled, most individuals in the population carry known haplotypes. As mentioned earlier, 99.4% of the European American population is covered by observed haplotypes (Fig 5C). Coverage is also more than 98% for the Hispanic, African-American and Asian/Pacific Islander populations. Coverage is less for the Native American population at 93.4% percent. In order to reach the percent coverage that currently exists for the European American population, 1 to 6 million additional donors would need to be recruited for each of the other populations (Fig 5D). This represents a factor between the required and existing sample size of around 3 for the Asian Pacific Islanders, but up to 15 for Native Americans, as summarized in Table 2.

Probability of New Haplotype Discovery

The contrast between the small fraction of haplotypes known and the large coverage implies that the frequency of each undiscovered haplotype is expected to be very small. The probability that a haplotype from a newly sampled host is not present in the current sample of size R is (see S4 Text):

We validated the accuracy of our analytical equation using simulation (810 Fig). This function decreases slowly as a function of sample size R, Which explains the high degree of population coverage achieved even With limited discovery of all haplotypes in the population. Additionally, this value could be used to assign an eXpected frequency to missing haplotypes, Which in turn could be used in HSCT matching to assign haplo-type frequencies to newly discovered haplotypes not covered by the EM haplotype frequencies. Estimates of A (Number of Unique Alleles)

Each haplotype is composed of alleles at 6 genetic loci: ‘A’,’B’,’C’,’DQB1’,’DRB1’,’DRBX’. We separated the haplotype data and generated allele distributions for each of the 6 loci. Our assumptions about the alleles are just the same as for the haplotypes, and a similar analysis was performed on each loci. We fit the expected number of alleles U(R) as a function of the sample size R to the actual number of unique alleles in partial samples of the data we have for alleles (see Methods) to obtain the values of Xmin, Xmax and a.

Such powers give most of the importance to the frequent alleles and reduce the total number of alleles for a given population size. Indeed, estimates of A according to the current sample for the data of the five combined populations are not much higher than the observed number of alleles (Fig 6 left plot).

As eXpected by their reported diversity [9], Class I HLA alleles have a slightly higher power eXponent (1.2—1.3) than class II (around 1) (Table 4). The class II power exponents, which are all close 1, are eXpected from neutral evolution model [23], highlighting that if selection occurs, it is mainly focused on class I or it is occurring at the haplotype level. Surprisingly, the discovery rate for new alleles in the IMGT database appears to be increasing, in contrast with the conclusion that most alleles are known, raising the suspicion that the total number of existing alleles is much larger than current estimates. However, the estimates above propose that most alleles are currently known.

We plotted the expected number of alleles as a function of time and compared it to the number of alleles which were identified by that date in the IMGT (Fig 7). In analyzing the entire US population as a whole, a comparison between the expected number of alleles U(R) and those actually observed presents a contradiction: Initially, the number of observed alleles was significantly less than expected based upon sample size; presently, it is much larger than expected.

Second and more importantly, the transition to sequencing-based typing (SBT) changed the fundamental nature of allele definitions and improved capacity for allele discovery. With a much larger definition of alleles and a higher discovery rate, the fundamental power law relationship would be expected to change at some point in time as the data sources move towards SBT versus older oligo or serology based methods. For example, a recent data submission from a single SBT laboratory resulted in a 30% increase in the number of observed alleles [24]. Thus, our current power-law methodology appears flawed for providing accurate estimates for the number of unique alleles and requires future modifications to accommodate the mixed data sources.

Discussion

Quantitative estimates of A and H are important for assessing whether typed donor cohorts have sufficient sampling depth to capture the clinically relevant HLA variation among individuals. Estimates of A and H for the Be The Match Registry suggest that current registry sampling depth is sufficient to represent only 52—100% of A and 21—30% of H across the 5 main racial populations assessed. The number of “rare” alleles and haplotypes, however, is small compared to the number of “common” ones suggesting that the current diversity of our registry is sufficient to describe 99.94—100% and 93.4—99.4% of the overall United States population with respect to allele and haplotype coverage. Further, the estimation methods developed in this article quickly converged—with limited sampling depth—to accurate estimates of allele and haplotype diversity in the presence of heavy-tailed distributions. Popular methods intended for measuring diversity with bell shaped distributions were shown to require deep samples before achieving meaningful convergence, as was the case for capture-recapture models, or failed outright to converge toward accurate estimates. A detailed comparison of our methodology with state of the art species richness estimates, as performed by the CatchAll software, show a much faster and more precise convergence to the real haplo-type number using our methodology.

The common thread among these problems is the need to understand the behavior of “rare” events in a population with a heavy tail distribution: this can be problematic because “rare” events represent critical patterns that have minimal, if any, representation within the sample. Appropriate statistical methodologies for “rare” events recognize that light tailed distributions (often having the characteristic bell shape) are inappropriate for adequately describing population diversity. The assumption of a light tailed distribution implies that a relatively small sample is enough to detect and fully describe the population (i.e. captures all species or haplotype diversity). This intuition stems from the absence of extremely rare events, which is an untrue assumption for a heavy tail distribution. For example, in most realistic species distributions, the majority of species are very rare, and thus a very deep sample is required to detect most species. Moreover, if the tail is flat enough, another element should be incorporated, which is the extreme contribution of the most frequent species to the distribution. In such cases, most of the population will belong to a very small number of species.

The properties of our data fit these assumptions, which require an indeX of variation a 6 (1,2) for the PDF of frequencies: our indeX of variation ranged from 1.45 to 1.85 across the 21 populations. We note that to support the interpretability of our results, we modeled the PDF of frequencies whereas others have modeled the GDP of frequencies where values of a will be a - 1 units smaller. It is important to note that the convergence to the proper values of a occurred for a limited sample size. Thus, our method may be applicable for other surveys where the sample size is much smaller. Moreover, beyond the need to estimate the total number of different haplotype, the proposed method can be used to properly estimate a for small population sizes.

Note however, that in realistic situations, the most frequent haplotypes (or species) are well known, and their frequency is well estimated. Thus, a simpler approach could be to split the population into frequency known haplotypes and rare haplotypes, and focus on the distribution of the rare haplotypes, taking into account that their contribution to the population coverage may be very limited, even if they constitute the vast majority of the diversity.

In ecological scenarios, since most species are very rare, the current shrinking of the wildlife population is equivalent to a dramatic reduction in the total diversity. While in normal distributions we expect species to disappear and diversity to be significantly reduced only when the total population is close to extinction, the heavy tailed distribution of species frequencies leads to opposite conclusions. For example, assuming parameters similar to the ones obtained here, a 10 fold reduction in the total population would be equivalent to a 17—18% reduction in the number of species.

Certain species are less detectable based on experimental or behavioral elements (e.g. speed of movement). A future development of such methods should be the introduction of a frequency bias in the detection probability.

Although our predictions aligned reasonably with the absolute frequencies, we are concerned with the fact that the underlying data may be biased towards over-representing common alleles and haplotypes; thus, our estimates of A and H may under-represent the true population diversity. Our concern is based upon the mechanics of the EM algorithm, which trim the tail of the haplotype frequency distribution numerous times during the estimation process to generate a parsimonious and tractable candidate set; without trimming the number of potential haplotypes that greatly exceeds the number of donors. Further, expected changes in typing technology (that are already underway) from SBT to next generation sequencing (NGS) may also re-def1ne allele definitions and increase the capacity for allele discovery. For example, NGS has the ability to detect allelic variants previously “hidden” behind genotypes that only differ by recombination [30]. In short, EM and typing technology factors could increase estimates for A and H by changing the fundamental nature of the power law relationship.

Methods

Haplotype and Allele Frequency Data

In brief, an EM algorithm was utilized to resolve uncertainties in allele and chromosomal phase for a mixed resolution set of donor HLA typings (serology, sequence-specific oligonucleotide/primer, and sequence based typing). The only notable change from the Gragert methodology was that in the last iteration of the EM algorithm, a winner-take-all approach was applied where each donor contributed 1 unit of probability mass to their most likely pair of haplotypes; this is opposed to each donor assigning 1 unit of probability mass across a range of haplotype pairs—consistent with their HLA typing—in proportion to their conditional likelihood. Allele frequencies were derived as marginal sums of the haplotype frequencies.

Methodology Validation

We have performed simulations With chosen a and Ntoml. For each set of a and Ntotal values, we set Xmin to be 1 /Nt t l, and calculated the maximum value possible for Xmax using 0 a max min min were observed. We normalized the results to overcome the imprecise sum. We then produced a random population With these a priori probabilities, and compared the expected frequency of a haplotype unobserved in a sample (Z(R)), the expected number of unique haplotypes in the population (U(R)) and the fraction of the population covered by these haplotypes to the simulations results. Code for the estimation of U(R) and a test set are available in the Supp. Mat.

Numerical Optimization

We assume a zero probability of observing haplotypes With frequencies beyond these values. In order to estimate the properties of the distribution, the values of a, Xmm and Xmax should be estimated. We provide boundaries for sz-n and Xmax using Eq B1 and B6 in 82 Text. We then perform a numerical fit of the observed and expected (Eq 8) unique haplotypes, With a cost sample sizes R’. We use an initial guess of the parameter above using Eq B1 for Xmm, the highest observed frequency XOmaX, and the estimate of a, based on Eq 9. We then extract the optimal parameters (a, Xmax, Xmax), calculate the expected total number of unique haplotypes H and produce an estimate of the number of unique haplotypes for any population size U(R).

Supporting Information

S4 Text. Calculation of probability that a haplotype is not observed and the fraction of the population covered with current observations. (DOCX) 81 Fig. Ratio of analytical estimated U(R) values divided by simulation results for different values of sample size and a (Inset: Analytical estimation (gray solid line) and simulation values (black circles) of U(R) (black circles) for a = 1.5). 82 Fig. Comparison of computed and observed expected number of haplotypes in a sample of size R : U(R)for four different American populations for different sample sizes R. The gray squares are observations. All estimates were performed using 10% of the sample. The lines are an estimation of U(R)With all free parameters (full purple), maXimal Xmax and free a (dashed green line), free Xmax and free a but fixed Xmin (dashed black line) and free boundaries but a based on the Clauset estimator (double dashed red line).

Observations are in gray dashed lines With triangles, and black solid lines With circles are the linear regressions.

Observations are in gray dashed lines With triangles, and black solid lines With circles are the linear regressions. (TIFF)

Observations are in gray dashed lines With triangles, and black solid lines With circles are the linear regressions.

Observations are in gray dashed lines With triangles, and black solid lines With circles are the linear regressions. (TIFF)

Observations are in gray dashed lines With triangles, and black solid lines With circles are the linear regressions. (TIFF) S9 Fig. Ratio of analytical estimated fraction of covered population values divided by simulation results for different values of sample size and a (Inset: Analytical estimation (gray solid line) and simulation values (black circles) of fraction of covered population for a = 1 .5.). (TIFF) 810 Fig. Ratio of analytical estimated values of the probability that there exists at least one unobserved haplotype divided by simulations results for different values of sample size and a (Inset: Analytical estimation (gray solid line) and simulation values (black circles) of the probability that there exists at least one unobserved haplotype for a = 1.5.). (TIFF)

Acknowledgments

Thanks to Abeer Madbouly for her helpful edits and advice that helped improve the manuscript presentation. We thank Miriam Beller for the text editing of the current manuscript.

Author Contributions

Performed the experiments: NS. Analyzed the data: NS YL MA. Contributed reagents/materials/ analysis tools: LG MM AC MA. Wrote the paper: NS YL MM MA.