SciSurf: Index of 'PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population'

PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

Oren E. Livne, Lide Han, Gorka Alkorta-Aranburu, William Wentworth-Sheilds, Mark Abney, Carole Ober, Dan L. Nicolae

Published in PLOS Comp. Biol., March 2015

Abstract

Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

Author Summary

To overcome this limitation and design cost-efficient studies, we developed a two-step method: sequencing of relatively few members of a well-characterized founder population followed by pedigree-based whole genome imputation of many other individuals with genome-wide genotype data. We show that by sequencing only 98 Hutterites, we can impute 7 million variants in an additional 1,317 Hutterites with >99% accuracy and an average call rate of 87%. Furthermore, parental origin was assigned to 83% of the alleles. Such studies in the Hutterites and other founder populations should yield new insights into the genetic architecture of common diseases, gene expression traits, and clinically relevant biomarkers of disease, and ultimately provide outstanding opportunities for personalized medicine in these well-characterized populations.

Introduction

Therefore, approaches that allow accurate imputation of rare variants to large numbers of individuals based on the sequences of relatively few individuals could address this important question at minimal cost. Founder populations are particularly suitable to this strategy because pedigree relationships are either known or can be inferred from genotypes, facilitating imputation approaches that incorporate identity by descent (IBD) relationships between chromosomal segments and improving imputation accuracy. Moreover, variants that occur at low frequency (<5%) or are rare (<1%) in large outbred populations may occur at common frequencies (>5%) in founder populations due to the bottleneck at the time of their founding followed by random genetic drift effects in subsequent generations. Similar to mutations for rare monogenic disorders reaching relatively common frequencies in founder populations [3–6], subsets of the rare variants contributing to common complex diseases are also expected to occur at higher frequencies in these populations. This provides a unique opportunity to study the relative roles of rare and common variants on common disease risk in individuals exposed to similar environments, which further minimizes the contribution of non-genetic factors to inter-individual variation in disease risk and facilitates identification of disease-associated alleles.

LD-based imputation methods require a reference panel of genotype training data, usually from unrelated individuals, to infer local haplotype structure, and sharing of haplotype stretches is used for filling in missing genotypes [8–11]. These approaches typically result in high call rates at the expense of lower accuracy, especially for rare alleles [12]. In contrast, pedigree-based imputation approaches are more accurate because they rely on identifying regions of IBD sharing among the study subjects [13,14]. However, call rates are typically lower than from LD-based methods, and pedigree-based imputation can be significantly slower to implement due to complex pedigree structures, which often pose limitations on maximum family sizes and minimum relatedness of individuals [15].

We first phased the SNV genotypes using pedigree-based phasing algorithms [16,17] and determined IBD segments between each pair of haplotypes using a Hidden Markov Model [18]. We then organized IBD segments into an IBD clique dictionary, a novel data structure for efficient IBD lookup queries that enables fast pedigree-based imputation of the variants identified in the 98 genomes. We demonstrate that the accuracy of the algorithm is above 99% regardless of minor allele frequency, with a call rate of approximately 77%. To improve the call rate, the missing genotypes were imputed using the LD-based IMPUTE2 program [11], with the phased haplotypes of the 98 whole genome sequenced Hutterites as the reference panel. The result is a hybrid method that combines the benefits of pedigree and LD-based strategies to obtain similar accuracy (>99%), and higher call rates (87.3%). Moreover, using the IBD clique dictionary implemented in PRIMAL, we can infer the parental origin of 83% of alleles. We are also able to impute whole genome genotypes to recent ancestors with no available DNA. The PRIMAL algorithm and software will facilitate genetic studies of rare variants and parent-of-origin effects in the Hutterites and in other founder populations with similar data.

Materials and Methods

Ethics Statement

This study was conducted according to the principles expressed in the Declaration of Helsinki. All participants in the experiment provided written informed consent in approval with the University of Chicago Institutional Review Board.

Sample Composition

After a series of migrations and population bottlenecks, they settled in what is now South Dakota in the 1870s, and currently live on communal farms in the northern U.S. plains states and western Canadian provinces [19]. At present, there are over 14,000 Hutterites living in South Dakota, all of whom are descendants of just 64 founders and related to each other with a mean kinship coefficient of 3.4% [20]. This study includes 1,415 Hutterites who previously participated in one or more of our studies of Mendelian and common diseases and associated phenotypes (e.g., [5,21]). These individuals are related to each other through multiple lines of descent in a 3,671-person minimum pedigree.

Framework Genome-Wide Genotypes

As part of our quality control (QC) process, we removed SNVs with five or more Mendelian errors, Hardy-Weinberg p-values < 0.001, or call rates < 95%, resulting in 332,242 SNVs present on all three platforms. The final sample included 1,415 Hutterites with genotype call rate > 95%. We used the subset of 271,486 SNVs with minor allele frequency (MAF) ≥ 5% for phasing and imputation in this study. These SNVs are referred to as the “framework SNVs”, and genotyped individuals for whom both parents were not genotyped are referred to as the “quasi-founders” of this sub-pedigree.
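
The QC and framework filters described above can be written as two small predicates. This is only a sketch: the dictionary keys (`mendelian_errors`, `hwe_p`, `call_rate`, `maf`) are hypothetical field names chosen for illustration, not the data format used by the authors.

```python
def passes_qc(snv):
    """Return True if an SNV record survives the three QC filters.

    `snv` is a dict with hypothetical keys; thresholds follow the text:
    fewer than five Mendelian errors, Hardy-Weinberg p >= 0.001,
    and call rate >= 95%.
    """
    return (
        snv["mendelian_errors"] < 5
        and snv["hwe_p"] >= 0.001
        and snv["call_rate"] >= 0.95
    )


def is_framework_snv(snv, min_maf=0.05):
    """A framework SNV passes QC and has MAF >= 5%."""
    return passes_qc(snv) and snv["maf"] >= min_maf


example = {"mendelian_errors": 1, "hwe_p": 0.2, "call_rate": 0.99, "maf": 0.12}
print(is_framework_snv(example))
```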

Whole Genome Sequencing and QC

To achieve this we used a greedy algorithm described elsewhere [16] where subjects were selected sequentially to maximize the average kinship to the non-sequenced individuals, while imposing a kinship smaller than 0.1 with the sequenced individuals. Sequencing was performed by Complete Genomics, Inc. (Mountain View, CA). A total of 18.2 million variants (14.0M SNVs, 2.7M insertions, 1.4M deletions; Table 1) were discovered in the 98 WGS, including 11.6 million variants (9.2M SNVs, 1.3M insertions and 1.1M deletions) for which both alleles were called as high quality by Complete Genomics. Using the 332,242 SNVs, the concordance between the genotypes from the whole genome sequences and those determined by genotyping with the Affymetrix arrays was 99.8%.
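
A greedy selection of this kind might look like the sketch below. It is not the implementation from [16]: the function name, the square kinship matrix input, the tie-breaking rule (lowest index wins) and the choice to average over all not-yet-selected individuals are assumptions made for illustration.

```python
def select_for_sequencing(kinship, n_select, max_kinship=0.1):
    """Greedily pick individuals to sequence.

    kinship: square list of lists of pairwise kinship coefficients.
    At each step, pick the candidate with the highest mean kinship to the
    other not-yet-selected individuals, skipping candidates whose kinship
    with any already-selected individual is >= max_kinship.
    """
    remaining = set(range(len(kinship)))
    selected = []
    while len(selected) < n_select:
        best, best_score = None, -1.0
        for cand in sorted(remaining):
            if any(kinship[cand][s] >= max_kinship for s in selected):
                continue  # too closely related to a sequenced individual
            others = [j for j in remaining if j != cand]
            if not others:
                continue
            score = sum(kinship[cand][j] for j in others) / len(others)
            if score > best_score:
                best, best_score = cand, score
        if best is None:
            break  # no admissible candidate is left
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 4-person kinship matrix (made-up values).
K = [
    [0.50, 0.25, 0.05, 0.03],
    [0.25, 0.50, 0.08, 0.04],
    [0.05, 0.08, 0.50, 0.02],
    [0.03, 0.04, 0.02, 0.50],
]
print(select_for_sequencing(K, 2))
```

In the toy matrix, individual 1 is chosen first (highest mean kinship to the others), which then excludes individual 0 because their kinship of 0.25 exceeds the 0.1 cap.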

The method is an extension of the classical Mendelian error checking in families. However, in contrast to Mendelian checks that use parents and their offspring, our approach includes all pairs of related individuals, regardless of the distance of the relationship, using their IBD segments. High confidence IBD2 segments (i.e., IBD = 2, or regions where two individuals inherited the same chromosomal segments from a common ancestor) were previously calculated between each pair of individuals among the 98 Hutterites using the 332,242 framework SNVs [24]. Next, for each sequenced variant, we determined the number of IBD2 segments shared between pairs of individuals that contain the variant and counted the number of discordances (the number of pairs of IBD2 segments in which the genotypes for the variant under investigation did not match). We then estimated the variant calling discordant rate (the proportion of discordances) for each class of variants as the total number of discordances divided by the total number of pairs of IBD2 segments in that category. Discordant rates increased with decreasing call rate, suggesting poorer quality of genotype calls for variants with more missing data. Thus, we determined call rate cutoffs for each variant class to maintain a less than 0.5% discordant rate. This resulted in a final set of variants that included all non-singletons (i.e., variants in which the rare allele occurred at least twice) with rs numbers (in dbSNP135) with call rates > 90% and novel variants (no rs number in dbSNP135) with call rates > 99% (i.e., at most one missing call). Among singletons (variants with one copy of the rare allele in the sequenced subjects), we retained novel insertions with call rates > 90% and all other variant types with call rates > 99% (Table 1, and Fig S2 in S1 Text). The allele frequency distribution and functional annotation of the final set of 7,008,666 variants in the 98 Hutterites with WGS are shown in Fig S2 in S1 Text.
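
The discordant-rate calculation can be illustrated with a short sketch. The data layout (a genotype dictionary per variant and a list of IBD2 pairs covering it) and the decision to skip pairs with a missing call are assumptions for illustration, not details taken from the paper.

```python
def count_discordances(genotypes, ibd2_pairs):
    """Count discordant IBD2 pairs at one variant.

    genotypes:  dict sample -> genotype call (None if missing).
    ibd2_pairs: iterable of (sample_a, sample_b) pairs whose IBD2 segment
                contains the variant.
    Returns (discordant, compared). Pairs with a missing call are skipped.
    """
    discordant = compared = 0
    for a, b in ibd2_pairs:
        ga, gb = genotypes.get(a), genotypes.get(b)
        if ga is None or gb is None:
            continue
        compared += 1
        if ga != gb:
            discordant += 1
    return discordant, compared


def class_discordant_rate(per_variant_counts):
    """Pool (discordant, compared) counts over all variants of one class."""
    total_disc = sum(d for d, _ in per_variant_counts)
    total_cmp = sum(c for _, c in per_variant_counts)
    return total_disc / total_cmp if total_cmp else None


calls = {"s1": "A/G", "s2": "A/G", "s3": "A/A", "s4": None}
pairs = [("s1", "s2"), ("s1", "s3"), ("s2", "s4")]
print(count_discordances(calls, pairs))
```

Two individuals who are IBD2 at a locus carry the same two haplotypes there, so any genotype mismatch within such a pair indicates a calling error in at least one of them.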

These 15 individuals were sequenced on the Illumina platform at a 10–17X coverage. High quality (as determined by Illumina) genotypes were extracted for all the SNVs imputed using PRIMAL and that passed QC. One of the 15 subjects was sequenced on both platforms and this allowed us to estimate the joint sequencing error rate. Discordance rates between the Illumina sequence-based and PRIMAL-imputed genotypes were calculated as the proportion of differences in genotypes in each of the remaining 14 individuals using these two methods.

Software

The algorithm described in Results is implemented in software, PRIMAL V1, that is freely available for academic use from the website: https://github.com/orenlivne/ober

Results

1, steps 4–8). The first four require only the framework SNVs: (i) phasing; (ii) identifying IBD segments among all haplotype pairs; (iii) indexing IBD segments into a dictionary of IBD cliques; and (iv) assigning parental origin to haplotypes. In the fifth step, we phase the WGS-derived genotypes, and then perform fast pedigree-based imputation of all variants present in the WGS using the IBD clique dictionary.

Phasing

[17] and Glodzik et al. [13] and to our earlier phasing algorithm for Hutterite genotype data [16], but introduces two key improvements that boost its quality. First, we use a phased proband as a template to phase siblings in nuclear families as in Coop et al. [26] (Supplementary Materials S1 Text), and second, we employ a Hidden Markov Model (HMM) similar to the IBDLD model [24] to identify IBD segments between a proband and his/her surrogate parents (Fig S3 in S1 Text). The phasing workflow is outlined in Fig S3 and described in detail in S1 Text. Using this approach, only 0.5% of the framework genotypes remained unphased, 99.2% of the genotypes were correctly phased, and the remaining 0.3% of the framework genotypes were discordant.

IBD Segment Identification

Therefore, we created a complete IBD dictionary by identifying IBD segments between each pair of the 2x1,415 = 2,830 haplotypes in the sample (S1 Text). Computational complexity prevented us from using available software to estimate IBD segments in related individuals [27–29]. Our HMM is the haplotype analogue of the genotype HMM used for phasing, and is similar to the HBD-HMM developed previously [18]. However, only kinship coefficients are used instead of condensed identity coefficients. The complexity is quadratic in the number of samples, but the hidden constant is small because only two states (IBD or not IBD) are possible instead of the nine in the genotype HMM.

To verify the overall quality of the detected IBD segments, our fraction of the genome covered by IBD segments was compared to the fraction calculated by IBDLD [24]. The methods were concordant (correlation coefficient r = 0.96 with a slope of β = 1.01) and the length distribution followed an exponential distribution, in accordance with the

IBD Segment Indexing into Cliques

We organize IBD segments in an IBD segment index data structure, which consists of a set of IBD cliques at each SNV and allows quick O(1)-time queries of whether a pair of haplotypes is IBD at a certain SNV.

2) whose nodes are the 2,830 haplotypes of the 1,415 Hutterites, an edge indicates the two haplotypes are IBD, and the edge weight is the HMM posterior probability of IBD (S1 Text, Eq. (19c)). Large weights are thus given to haplotype pairs that have a higher probability of being IBD.

In practice, G is a perturbation of a clique union due to very low HMM certainty near segment ends and genotyping errors, and we would like to recover a “reasonable” set of cliques from it. Cluster editing methods (see for example [31]) find the minimum number of edges (or total edge weight) that need to be added or removed to transform G to a clique union. This is an NP-hard problem, and practical heuristic-based algorithms run in superlinear time in the number of edges. We chose a different heuristic inspired by the graph algebraic multigrid literature [32–34] that resulted in good imputation cross-validation accuracy and has linear complexity (S1 Text). First, we calculate new edge weights called affinities that measure the connectedness or affinity between the graph neighborhoods of the nodes (Fig. 2). A large affinity means that the nodes share many common neighbors, i.e., they are connected via many short paths. Next, we removed graph edges with weight < 0.85 or affinity < 0.9. These thresholds were chosen to minimize imputation errors in a cross-validation of several framework SNVs representing the entire MAF spectrum. Finally, each of the resulting graph’s connected components is transformed to a clique by adding links between all nodes that are not yet connected (Fig. 2 and Fig S4 in S1 Text). This method worked well for our data set, and these thresholds should be good default values for other data sets. However, threshold determination and a comparison with other clique-generation methods need to be investigated further in future research.
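
The last two steps (edge filtering, then completing each connected component to a clique) can be sketched with a union-find structure. This is an illustrative sketch, not PRIMAL's code: the affinity of each edge is taken as a precomputed input because its definition is given in S1 Text, and the function names and edge tuple layout are assumptions. Storing one clique label per haplotype makes the clique completion implicit and gives the O(1) pairwise IBD query.

```python
def build_cliques(n_haplotypes, edges, min_weight=0.85, min_affinity=0.9):
    """Label each haplotype with the id of its IBD clique at one SNV.

    edges: iterable of (i, j, weight, affinity), where weight is the HMM
    posterior of IBD and affinity is assumed to be precomputed.
    Edges below either threshold are dropped; every connected component
    of the remaining graph is treated as one clique.
    """
    parent = list(range(n_haplotypes))

    def find(x):
        # Find the component root, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, weight, affinity in edges:
        if weight < min_weight or affinity < min_affinity:
            continue
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    return [find(h) for h in range(n_haplotypes)]


def is_ibd(clique_id, i, j):
    """O(1) query: are haplotypes i and j in the same clique?"""
    return clique_id[i] == clique_id[j]


toy_edges = [
    (0, 1, 0.95, 0.95),
    (1, 2, 0.90, 0.92),
    (2, 3, 0.99, 0.50),  # dropped: affinity below 0.9
    (3, 4, 0.60, 0.99),  # dropped: weight below 0.85
]
labels = build_cliques(5, toy_edges)
print(is_ibd(labels, 0, 2), is_ibd(labels, 2, 3))
```

In the toy graph, haplotypes 0 and 2 share no direct edge but end up in the same clique through haplotype 1, which is the transitive closure the text describes.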

In addition, cliques allow the derivation of the maximum call rate obtainable per SNV from imputation, which is the ratio of the number of haplotypes in cliques containing haplotypes of sequenced individuals to the total number of haplotypes. The predicted imputation rate was 85% ± 9% for the framework SNVs. Note that, using pedigree-based imputation, the accuracy approaches 100% (because we rely on Mendelian rules).
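
As a small worked example of this ratio, the sketch below computes the maximum call rate at one SNV from a list of clique labels. The input layout (one clique id per haplotype, plus the set of haplotype indices belonging to sequenced individuals) is an assumption for illustration.

```python
def max_call_rate(clique_id, sequenced_haplotypes):
    """Upper bound on the imputation call rate at one SNV.

    clique_id: list giving the clique label of every haplotype.
    sequenced_haplotypes: indices of haplotypes of sequenced individuals.
    A haplotype can be imputed only if its clique contains at least one
    sequenced haplotype.
    """
    covered = {clique_id[h] for h in sequenced_haplotypes}
    reachable = sum(1 for c in clique_id if c in covered)
    return reachable / len(clique_id)


# Five haplotypes in three cliques; only haplotype 0 is sequenced.
print(max_call_rate([0, 0, 0, 3, 4], {0}))
```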

Parental Origin (PO) Assignment

Haplotypes of non-quasi-founders can be automatically labeled as paternal and maternal because their parents are included in our sample and haplotypes are assigned using Mendelian rules. However, because the quasi-founders do not have genotyped parents, the parental origin of the quasi-founder haplotypes is assigned in two stages. First, during phasing, we do not determine which haplotypes are paternal and maternal, but we ensure that the first haplotype of every child comes from the same parent (arbitrarily denoted A), and the second haplotype from the other parent (arbitrarily denoted B). This is achieved using the following steps: a) Regions of the children’s haplotypes are assigned to four different “bins” (illustrated as four colors in Fig S5 in S1 Text) that represent the four parental haplotypes. Regions that are IBD are in the same bin, under the constraint that the number of recombinations be minimized.

For each assignment, we calculate for each child C with haplotypes C1, C2 a separation measure as follows: let F1 be the fraction of C1 covered by A’s haplotypes plus the fraction of C2 covered by B’s haplotypes, and F2 be the fraction of C1 covered by B’s haplotypes plus the fraction of C2 covered by A’s haplotypes. The separation is the ratio max(F1,F2), which measures how decisively C’s haplotypes can be identified as paternal or maternal haplotypes. c) We pick the parental assignment that maximizes the minimum child separation, and order C1, C2 in all children so that the first always corresponds to parent A and the second to parent B. The separation measure is defined in S1 Text.

For each parent and each clique, we calculate the median of the set of kinship coefficients between the parent and all quasi-founders in the clique that are not siblings of the proband (the quasi-founder in question), resulting in a 2x2 matrix (Fig. 3; siblings and non-quasi-founders are excluded to minimize bias). For each SNV, indexed by s, we define a separation measure m(C, s) (precisely defined in S1 Text, Eq. (4)) such that −1 ≤ m(C, s) ≤ 1. The measure approaches −1 when the off-diagonal matrix elements are much larger than the diagonal elements, and approaches 1 when the diagonal elements dominate. If the proband is properly phased, m(C, s) must be consistently positive or negative across the chromosome. We consider only “informative variants”, those where m(C, s) is well separated from 0, i.e., |m(C, s)| > 0.25. Suppose there are n+ informative variants with m(C, s) > 0 and n− with m(C, s) < 0; the sample separation measure M(C) is defined as max(n+, n−)/(n+ + n−), that is, the fraction of variants exhibiting the “majority sign”. We assign parental origin when M(C) > 0.75. Using this approach we were able to assign parental origin to 76% (313 out of 411) of the quasi-founders’ chromosomes, with 279 having M(C) > 0.99 (Fig S6 in S1 Text). Including non-quasi-founders, we were able to assign parental origin to 93% of the sample.
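
The sample separation measure M(C) is simple enough to compute directly. The sketch below takes the per-SNV values m(C, s) as given (their definition is in S1 Text) and applies the 0.25 and 0.75 thresholds from the text; the function names and the handling of a chromosome with no informative variants are assumptions.

```python
def sample_separation(m_values, informative=0.25):
    """M(C): fraction of informative variants showing the majority sign.

    m_values: per-SNV separation measures m(C, s), each in [-1, 1].
    Returns None when no variant is informative.
    """
    n_pos = sum(1 for m in m_values if m > informative)
    n_neg = sum(1 for m in m_values if m < -informative)
    total = n_pos + n_neg
    if total == 0:
        return None
    return max(n_pos, n_neg) / total


def can_assign_parental_origin(m_values, threshold=0.75):
    """Parental origin is assigned only when M(C) exceeds the threshold."""
    big_m = sample_separation(m_values)
    return big_m is not None and big_m > threshold


print(sample_separation([0.9, 0.8, 0.6, -0.5, 0.7, 0.1]))
```

In the example, the value 0.1 is ignored as uninformative, leaving four positive and one negative variant.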

Pedigree-based Imputation

Once the IBD clique dictionary is constructed, imputation is performed separately and in parallel for each variant present in one or more of the 98 whole genome sequences. The main idea behind the approach is that each sequencing-based allele that is phased on a particular haplotype can be imputed to all the haplotypes in its IBD clique. First, homozygous genotypes are phased, and the alleles and indices of the two haplotypes are placed into a queue. We remove the first haplotype from the queue, and impute all haplotypes in its IBD clique with the same allele. If these include haplotypes of heterozygous genotypes in the 98 sequenced individuals, they can now be phased. For each such individual, we add its other haplotype index and allele to the end of the queue. The next entry in the queue is then similarly processed, except that, when there is conflicting allele information within a clique (when a two-third majority vote does not exist), no haplotype is imputed. We process queue entries one by one until the queue becomes empty.
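
The queue-driven procedure for a single variant can be sketched as follows. This is a simplified illustration under stated assumptions, not the PRIMAL implementation: haplotypes are represented as hypothetical `(sample, side)` tuples, cliques as plain lists, and the two-thirds vote is taken over the alleles already assigned within the clique at the moment it is processed.

```python
from collections import Counter, deque


def impute_variant(seq_genotypes, cliques):
    """Propagate sequenced alleles through IBD cliques at one variant.

    seq_genotypes: {sample: (allele, allele)} for sequenced individuals.
    cliques: list of cliques, each a list of (sample, side) haplotypes
             with side in {0, 1}.
    Returns {haplotype: allele} for every haplotype that could be called.
    """
    clique_of = {hap: k for k, members in enumerate(cliques) for hap in members}
    alleles = {}
    queue = deque()

    # Homozygous sequenced genotypes are trivially phased: seed the queue.
    for sample, (a, b) in seq_genotypes.items():
        if a == b:
            for side in (0, 1):
                alleles[(sample, side)] = a
                queue.append((sample, side))

    while queue:
        hap = queue.popleft()
        if hap not in clique_of:
            continue
        members = cliques[clique_of[hap]]
        votes = Counter(alleles[m] for m in members if m in alleles)
        allele, count = votes.most_common(1)[0]
        if 3 * count < 2 * sum(votes.values()):
            continue  # no two-thirds majority: leave this clique uncalled
        for m in members:
            if m in alleles:
                continue
            alleles[m] = allele
            sample, side = m
            geno = seq_genotypes.get(sample)
            other = (sample, 1 - side)
            # A sequenced heterozygote is now phased: its other haplotype
            # carries the other allele and is queued for propagation.
            if geno and geno[0] != geno[1] and allele in geno and other not in alleles:
                alleles[other] = geno[0] if geno[1] == allele else geno[1]
                queue.append(other)
    return alleles


genotypes = {"A": (1, 1), "B": (0, 1)}  # A homozygous, B heterozygous
ibd_cliques = [
    [("A", 0), ("B", 0), ("C", 0)],
    [("A", 1), ("C", 1)],
    [("B", 1), ("D", 0)],
]
print(impute_variant(genotypes, ibd_cliques))
```

In this toy pedigree, the homozygote A phases the heterozygote B, whose second haplotype then carries allele 0 to the unsequenced individual D; D's other haplotype stays uncalled because it belongs to no clique with a sequenced haplotype.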

Finding and indexing IBD segments into cliques takes the majority of computing time in the PRIMAL pipeline. The dominant complexity term is O(n²s), where n = 1,415 is the number of genotyped individuals and s = 271,486 is the number of framework markers (S1 Table in S1 Text, columns 2–3).
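
To give a feel for the size of this term, the short calculation below plugs in the sample sizes quoted in the text. It only evaluates the formula; constants and actual running times are not implied.

```python
n_individuals = 1415
n_framework_snvs = 271_486

# Each individual contributes two haplotypes; IBD is tested per haplotype pair.
n_haplotypes = 2 * n_individuals
haplotype_pairs = n_haplotypes * (n_haplotypes - 1) // 2

# Dominant O(n^2 s) term, ignoring constant factors.
dominant_term = n_individuals ** 2 * n_framework_snvs

print(f"haplotype pairs: {haplotype_pairs:,}")
print(f"n^2 * s:         {dominant_term:,}")
```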

The mean individual call rate was 75.5%; 547 out of 1,317 individuals (41%) had call rate ≥ 80%. Call rates were higher in regions with higher framework SNV density, lower recombination rate and farther from the telomeres (Fig S7a in S1 Text). Fig S8a in S1 Text shows that the MAF distributions of European ultra-rare SNVs (MAF = 0 in the 1000 Genomes CEU database) are comparable in both the 98 sequenced Hutterite sample set and the 98 sequenced + 1,317 imputed Hutterites (n = 1,415). Furthermore, we compared the Alternative Allele Frequency (AAF) in the Hutterites and CEU sample set. The Hutterite and CEU AAF were highly correlated (Fig S8b-d in S1 Text). Out of 6,715,275 variants that were not A/T or C/G SNVs, 5,299,330 had similar CEU and Hutterite AAFs (absolute difference < 0.1); there were more variants with larger AAF in the Hutterites than in CEU compared to the opposite case (880,912 vs. 534,012 variants).

Cross Validation

To check the accuracy of PRIMAL imputed genotypes, their concordance with the framework genotypes was assessed. First, we phased the framework (Affymetrix) genotypes, identified IBD

We then masked the framework genotypes of the 1,317 individuals whose genomes were not sequenced, imputed the framework genotypes, and calculated the concordance between the imputed and true genotypes over a sample of 53,861 framework SNVs (sorted by base-pair position, every 5th framework SNV was picked instead of using all SNVs to save computing time). The concordance was close to 100% regardless of MAF (Fig S7c in S1 Text). In addition, we also tested for heterozygote concordance rate within the variants with MAF < 5% because the concordance over all genotypes would be high even if they were randomly imputed. The heterozygous concordance also approached 100%.
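
The reason for reporting heterozygote concordance separately is easy to see with a toy calculation. In the sketch below, genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing; restricting to sites where at least one of the two calls is heterozygous is an assumption about how the restricted rate is defined.

```python
def concordance(true_gt, imputed_gt, heterozygous_only=False):
    """Fraction of matching genotype calls.

    Genotypes are alternate-allele counts (0, 1 or 2); None means missing.
    With heterozygous_only=True, only sites where at least one of the two
    calls is heterozygous are compared.
    """
    compared = matches = 0
    for t, g in zip(true_gt, imputed_gt):
        if t is None or g is None:
            continue
        if heterozygous_only and t != 1 and g != 1:
            continue
        compared += 1
        if t == g:
            matches += 1
    return matches / compared if compared else None


# A rare variant: one true heterozygote among ten individuals.
truth = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
naive = [0] * 10  # an "imputation" that always calls the major homozygote

print(concordance(truth, naive))
print(concordance(truth, naive, heterozygous_only=True))
```

A method that never calls the rare allele still scores high on overall concordance for a low-MAF variant, while its heterozygote concordance exposes the failure.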

Comparison with Genotypes from an Independent WGS Experiment

The concordance rate for each subject was larger than 99% (the concordance rates ranged from 99.3% to 99.8%) with an overall average of 99.7%. This overall rate is very similar to the rate of concordance obtained from the subject sequenced on both platforms.

Increasing Call Rates Using LD-based Imputation

However, while genotypes imputed by PRIMAL had high accuracy, the call rate (77%) is lower than the maximum predicted rate, most likely due to imperfect phasing of variants without a consensus allele. To mitigate this problem, we filled in as many genotypes as possible for the remaining 23% of variants using LD-based imputation. We chose IMPUTE2 [11] because of its ease of use, high speed and high imputation accuracy. Importantly, we used the high quality pedigree-based phased haplotypes from the 98 whole genome sequenced individuals as the reference panel. This boosted the IMPUTE2 accuracy (evidenced by the measures described below) and reflects the accuracy of our phasing. To obtain data that are consistent in format and accuracy to those generated by PRIMAL, IMPUTE2 genotype probabilities were converted to hard genotype calls only if the maximum probability among the three possible genotypes was > 99%; otherwise, they were not called. When using this criterion, the concordance rates between IMPUTE2 genotypes and those based on sequencing in the 14 individuals range between 99.5 and 99.8% with an overall average of 99.7% (identical to PRIMAL).
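
The conversion of genotype probabilities to hard calls can be sketched in a few lines. The input layout (a triple of probabilities for the three possible genotypes) is an assumption about how the IMPUTE2 output has been parsed; the 99% threshold is the one stated above.

```python
def hard_call(probs, threshold=0.99):
    """Convert three genotype probabilities into a hard call.

    probs: (p0, p1, p2), the probabilities of the three possible genotypes.
    Returns the index of the most probable genotype if its probability
    exceeds the threshold, otherwise None (genotype not called).
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] > threshold else None


print(hard_call((0.995, 0.004, 0.001)))
print(hard_call((0.60, 0.40, 0.00)))
```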

All genotypes called by both methods and called as heterozygous by at least one of them were included. IMPUTE2-imputed genotypes were retained only if the heterozygous concordance rate was ≥ 99% and the MAF ≥ 1% (heterozygous concordance rate drops significantly for variants with MAF < 1%; Fig S9 in S1 Text). Finally, the PRIMAL+IMPUTE2 combined method yielded an overall call rate of 87.3% with > 99% estimated accuracy (Table 2).

First, we created a data set with twice the number of samples (2N). For each subject, we created “paternal haploid” and “maternal haploid” sets. For unphased genotypes, the haploid entries were set to missing. We ran IMPUTE2 on the haploid data set. We then assigned parental origin to each genotype called by IMPUTE2 in the original data set only if both the PO of the paternal and maternal haplotypes were imputed with maximum probability > 99% and were compatible with the genotype. PRIMAL alone assigned PO to 80% of alleles, but with IMPUTE2 directly imputing from PO-assigned haplotypes, we increase the PO call rate to 83%.

Discussion

For example, family-based studies are particularly well suited for discovery of rare disease-associated variants and revealing parent-of-origin effects while minimizing potential confounding due to population substructure and genetic and environmental heterogeneity. Moreover, the family structure itself allows more extensive quality control checks of genotype data and ultimately more accurate genotype calls. Now, in the era of whole exome and whole genome sequencing, studies in families and founder populations offer a new, powerful framework for mapping studies because the genome or exome sequences of relatively few ‘founders’ are needed to impute highly accurate whole genome genotypes to other members of the pedigree with only framework genotypes.

The call rates and, to a lesser degree, the concordance rates are correlated with the degree of relatedness between the imputed individuals and the sequenced subjects. Fig S16 in S1 Text illustrates these relationships, and suggests that the rates are mostly influenced by the few sequenced subjects who are most related to the imputed individual. Note that similar accuracy can be achieved using IMPUTE2 (as detailed above), with a call rate of 84% when restricting to the high quality called genotypes.

This additional information is unique to this approach, and is crucial for many analyses, such as those looking for parent-of-origin effects in associated variants, and imprinting. PRIMAL can be applied to other founder populations or to large families to provide accurate and nearly complete genotype coverage at relatively small cost and with minimal computation time. The quantity and quality of the genotypes generated using PRIMAL will depend on several factors including the family structures, the extent of IBD sharing between the reference and target subjects, and the quality of framework genotypes that are used for inferring the IBD cliques. In addition to comprehensive surveys of the effects of all variants present in the Hutterite genomes on risk for common and Mendelian diseases and on disease-associated quantitative phenotypes, these data will facilitate association studies with the > 460,000 variants that are rare (<1%) in European populations but have risen to common (>5%) frequencies in the Hutterites and investigations of the effects of maternally-inherited versus paternally-inherited alleles on disease risks and quantitative trait values, and will allow the incorporation of the additional information from IBD sharing in more efficient genetic association studies. Such studies in the Hutterites and other founder populations should yield new insights into the genetic architecture of common diseases, gene expression traits, and clinically relevant biomarkers of disease, and ultimately provide outstanding opportunities for personalized medicine in these well-characterized populations.

Supporting Information

S1 Text. Supplementary methods showing detailed information on phasing and IBD esti-

Acknowledgments

The authors thank Rachel Myers, Katie Igartua, Lorenzo Pesce and Catherine Herman for useful discussions and comments on the manuscript.

Author Contributions

Performed the experiments: OEL LH WWS. Analyzed the data: OEL MA CO DLN. Contributed reagents/materials/analysis tools: OEL LH GAA WWS. Wrote the paper: OEL GAA MA CO DLN.

Topics

haplotypes

Appears in 30 sentences as: haplotype (11) Haplotypes (1) haplotypes (33)

In PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

LD-based imputation methods require a reference panel of genotype training data, usually from unrelated individuals, to infer local haplotype structure, and sharing of haplotype stretches is used for filling in missing genotypes [8–11].
Page 2, “Introduction”
We first phased the SNV genotypes using pedigree-based phasing algorithms [16,17] and determined IBD segments between each pair of haplotypes using a Hidden-Markov Model [18].
Page 2, “Introduction”
To improve the call rate, the missing genotypes were imputed using the LD-based IMPUTE2 program [11], with the phased haplotypes of the 98 whole genome sequenced Hutterites as the reference panel.
Page 2, “Introduction”
The first four require only the framework SNVs: (i) phasing; (ii) identifying IBD segments among all haplotype pairs; (iii) indexing IBD segments into a dictionary of IBD cliques; and (iv) assigning parental origin to haplotypes.
Page 5, “Results”
Therefore, we created a complete IBD dictionary by identifying IBD segments between each pair of the 2x1,415 = 2,830 haplotypes in the sample (S1 Text).
Page 5, “IBD Segment Identification”
Our HMM is the haplotype analogue of the genotype HMM used for phasing, and is similar to the HBD-HMM developed previously [18].
Page 5, “IBD Segment Identification”
A total of 97,821,947 IBD segments were identified among the 1,415 Hutterites (~1.1 segment per haplotype pair on average, because there are 2830x2829/2 = 4,003,035 individual pairs and 22 chromosomes).
Page 6, “IBD Segment Identification”
We organize IBD segments in an IBD segment index data structure, which consists of a set of IBD cliques at each SNV and allows quick O(1)-time queries of whether a pair of haplotypes is IBD at a certain SNV.
Page 7, “IBD Segment Indexing into Cliques”
2) whose nodes are the 2,830 haplotypes of the 1,415 Hutterites, an edge indicates the two haplotypes are IBD, and the edge weight is the HMM posterior probability of IBD (S1 Text, Eq.
Page 7, “IBD Segment Indexing into Cliques”
Large weights are thus given to haplotype pairs that have a higher probability of being IBD.
Page 7, “IBD Segment Indexing into Cliques”
Because IBD is a transitive relation, G must be a union of disjoint cliques (fully connected subgraphs), one for each ancestral haplotype present in the population.
Page 7, “IBD Segment Indexing into Cliques”

allele frequency

Appears in 5 sentences as: allele frequencies (1) Allele Frequency (1) allele frequency (3)

In PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies.
Page 1, “Abstract”
We demonstrate that the accuracy of the algorithm is above 99% regardless of minor allele frequency, with a call rate of approximately 77%.
Page 2, “Introduction”
We used the subset of 271,486 SNVs with minor allele frequency (MAF) ≥ 5% for phasing and imputation in this study.
Page 3, “Framework Genome-Wide Genotypes”
The allele frequency distribution and functional annotation of the final set of 7,008,666 variants in the 98 Hutterites with WGS are shown in Fig S2 in S1 Text.
Page 4, “Whole Genome Sequencing and QC”
Furthermore, we compared the Alternative Allele Frequency (AAF) in the Hutterites and CEU sample set.
Page 10, “Pedigree-based Imputation”

genome-wide

Appears in 4 sentences as: Genome-Wide (1) genome-wide (3)

In PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences.
Page 1, “Abstract”
To overcome this limitation and design cost-efficient studies, we developed a two step method: sequencing of relatively few members of a well-characterized founder population followed by pedigree-based whole genome imputation of many other individuals with genome-wide genotype data.
Page 1, “Author Summary”
To address the limitations of LD- and pedigree-based imputation methods, we developed PRIMAL (PedigRee IMputation ALgorithm), a fast phasing and imputation algorithm, to assign genotypes at 7 million bi-allelic variants that were discovered in the whole genome sequences of 98 Hutterites to an additional set of 1,317 Hutterites who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs).
Page 2, “Introduction”
Framework Genome-Wide Genotypes
Page 3, “Framework Genome-Wide Genotypes”

computing time

Appears in 3 sentences as: computation time (1) computing time (2)

In PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population

Finding and indexing IBD segments into cliques takes the majority of computing time in the PRIMAL pipeline.
Page 10, “Pedigree-based Imputation”
We then masked the framework genotypes of the 1,317 individuals whose genomes were not sequenced, imputed the framework genotypes, and calculated the concordance between the imputed and true genotypes over a sample of 53,861 framework SNVs (sorted by base-pair position, every 5th framework SNV was picked instead of using all SNVs to save computing time).
Page 11, “Cross Validation”
PRIMAL can be applied to other founder populations or to large families to provide accurate and nearly complete genotype coverage at relatively small cost and with minimal computation time.
Page 12, “Discussion”
