Analysis of CD data | Gene analyses with VEGAS and PLINK were performed using the mean SNP statistic for VEGAS and both the mean SNP statistic (PLINK-avg) and the top SNP statistic (PLINK-top) for PLINK. |
Analysis of summary SNP statistics | Analysis of summary SNP statistics |
Analysis of summary SNP statistics | These SNP-wise models first analyse the individual SNPs in a gene and combine the resulting SNP p-values into a gene test-statistic, and can thus be used even when only the SNP p-values are available. |
Analysis of summary SNP statistics | Although evaluation of the gene test-statistic does require an estimate of the LD between SNPs in the gene, estimates based on reference data with similar ancestry to the data the SNP p-values were computed from have been shown to yield accurate results [17,19]. |
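As a minimal sketch of the combination step described above, the following turns SNP p-values into per-SNP statistics and summarises them as a mean and a top statistic. The conversion of a two-sided p-value to a 1-df chi-square value is an assumption made for illustration, and the function names `snp_chi2` and `gene_statistics` are hypothetical. The sketch stops at the gene statistic: evaluating its significance would additionally require the LD estimate mentioned above, which is not modelled here.

```python
from statistics import NormalDist


def snp_chi2(p):
    """Convert a two-sided SNP p-value (0 < p <= 1) to a 1-df chi-square statistic.

    Assumption for illustration: the p-value came from a two-sided z-test.
    """
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; the sign is irrelevant after squaring
    return z * z


def gene_statistics(p_values):
    """Summarise the SNP statistics of one gene as a mean and a top (maximum) statistic."""
    stats = [snp_chi2(p) for p in p_values]
    return {"mean": sum(stats) / len(stats), "top": max(stats)}


if __name__ == "__main__":
    # Hypothetical p-values for the SNPs in one gene.
    print(gene_statistics([0.05, 0.5, 1.0]))
```

Because neighbouring SNPs are correlated, neither summary can be compared against a standard reference distribution without accounting for LD.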
Data | SNPs were annotated to genes based on dbSNP version 135 SNP locations and NCBI 37.3 gene definitions. |
Data | For the main analyses only SNPs located between a gene’s transcription start and stop sites were annotated to that gene, yielding 13,172 protein-coding genes containing at least one SNP in the CD data. |
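The positional annotation rule can be sketched as follows, assuming SNPs and genes are given as plain tuples with chromosome and base-pair coordinates. The tuple layout, the function name `annotate_snps`, and the choice to treat both gene boundaries as inclusive are illustrative assumptions, not details taken from the source.

```python
from collections import defaultdict


def annotate_snps(snps, genes):
    """Map each gene to the SNPs lying between its start and stop sites.

    snps:  iterable of (snp_id, chrom, pos)
    genes: iterable of (gene_id, chrom, start, stop)
    Boundaries are treated as inclusive (an assumption).
    """
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, start, stop in genes:
        genes_by_chrom[chrom].append((start, stop, gene_id))

    annotation = defaultdict(list)
    for snp_id, chrom, pos in snps:
        # A SNP can fall inside several overlapping genes and is annotated to each of them.
        for start, stop, gene_id in genes_by_chrom.get(chrom, ()):
            if start <= pos <= stop:
                annotation[gene_id].append(snp_id)
    # Genes without any SNP do not appear in the result.
    return dict(annotation)


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    genes = [("G1", "1", 100, 200), ("G2", "1", 150, 300), ("G3", "2", 100, 200)]
    snps = [("rs1", "1", 120), ("rs2", "1", 180), ("rs3", "1", 400), ("rs4", "2", 100)]
    print(annotate_snps(snps, genes))
```

The linear scan over genes per SNP keeps the sketch short; for genome-scale data a sorted or interval-tree lookup would be the more usual choice.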
Gene analysis | This model first projects the SNP matrix for a gene onto its principal components (PCs), pruning away PCs with very small eigenvalues, and then uses those PCs as predictors for the phenotype in the linear regression model. |
Gene analysis | By default only 0.1% of the variance in the SNP data matrix is pruned away. |
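The principal component regression described above can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the tool's implementation: the function name `pc_regression_f` is hypothetical, the pruning rule is read as "keep the leading PCs that together explain at least 99.9% of the variance", and the overall regression F statistic is used as the gene statistic because it is the standard test for a linear model with several predictors.

```python
import numpy as np


def pc_regression_f(X, y, prune_fraction=0.001):
    """Regress phenotype y on the principal components of the SNP matrix X.

    X: (n_individuals, n_snps) genotype matrix; y: (n_individuals,) phenotype.
    Returns (F, df1, df2) for the overall regression F-test.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Centre the SNPs; the squared singular values are then proportional to the PC variances.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    if var.sum() == 0:
        raise ValueError("SNP matrix has no variation")

    # Keep the smallest leading set of PCs explaining >= (1 - prune_fraction) of the variance.
    cum = np.cumsum(var) / var.sum()
    k = min(int(np.searchsorted(cum, 1.0 - prune_fraction)) + 1, len(s))
    if n - k - 1 <= 0:
        raise ValueError("too few individuals for the number of retained PCs")

    # Retained columns of U are orthonormal and orthogonal to the intercept,
    # so the model sum of squares is the squared projection of centred y.
    Z = U[:, :k]
    yc = y - y.mean()
    coef = Z.T @ yc
    ss_model = float(coef @ coef)
    ss_resid = float(yc @ yc) - ss_model

    df1, df2 = k, n - k - 1
    F = (ss_model / df1) / (ss_resid / df2)
    return F, df1, df2
```

Working with the left singular vectors avoids forming the SNP correlation matrix explicitly, and pruning removes directions that are redundant because of LD, which keeps the regression well conditioned.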
Introduction | More traditional gene analysis models are also implemented, for comparison and to provide analysis of SNP summary statistics. |
Other features and implementation | Efficient SNP to gene annotation and a batch mode for parallel processing are provided to simplify the overall analysis process. |
HiSeq Error Model | We note that the modeling of sequencing errors is similar to the approach taken by some SNP calling methods, such as the one employed by GATK [27]. |
Introduction | Reliable reference genomes are pivotal for finding genetic variations such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), which bolster downstream applications such as genome-wide association studies, population genomics and comparative biology [4–7]. |
Introduction | It is fundamentally different from copy-number detection algorithms, which are designed to look for departures from a normal situation of diploidy [24,25], and from SNP calling algorithms, which find variants based on the assumption of diploidy [26], or assume a user-defined ploidy level [27,28]. |
Simulations | Finally, the sensitivity of variant detection was high for all simulated situations, with false negative rates for SNP calling ranging from zero to 7.05%. |
Simulations | For example, a SNP with allele ratio 7:8 resulted in lower dosage accuracy than a SNP with allele ratio 13:2. |
Simulations | With higher coverage levels, both models had similar false negative rates of SNP discovery. |
Summary of the Ploidy Estimation Model | First, we define the probability of there being a SNP in any position as P(SNP). |
Switchgrass Dataset | Overall, we called 134,464 variants within the contigs, with an average density of one SNP every 47 nucleotides. |
Switchgrass Dataset | We also performed variant calling with GATK [27] and obtained a density of one SNP every 60 bases. |
Switchgrass Dataset | We note that these SNP densities may be inflated due to homoeologue collapse and may not reflect exclusively allelic variation. |
High concordance in SNP frequency between sequenced viral replicates from patients | High concordance in SNP frequency between sequenced viral replicates from patients |
High concordance in SNP frequency between sequenced viral replicates from patients | To evaluate possible biases resulting from our RT-PCR procedure, we compared SNP frequencies in technical replicates, finding a high level of concordance. |
High concordance in SNP frequency between sequenced viral replicates from patients | In each case, the paired replicates showed well-correlated SNP frequencies, even when ignoring SNPs that occur at <10% or >90% frequency in paired samples (R > 0.95 for each pair; Fig 1). |
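The replicate comparison can be sketched as a filtered Pearson correlation. The code below assumes each replicate is a mapping from SNP identifier to frequency, and it drops a SNP when its frequency falls below 10% or above 90% in either replicate. That reading of the filter, the data layout, and the function names are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def replicate_concordance(freq_a, freq_b, low=0.10, high=0.90):
    """Correlate SNP frequencies between two replicates.

    Only SNPs present in both replicates, with frequency in [low, high] in both,
    are kept. Returns (r, number_of_snps_used).
    """
    shared = sorted(set(freq_a) & set(freq_b))
    kept = [
        (freq_a[s], freq_b[s])
        for s in shared
        if low <= freq_a[s] <= high and low <= freq_b[s] <= high
    ]
    if len(kept) < 2:
        raise ValueError("need at least two SNPs after filtering")
    xs, ys = zip(*kept)
    return pearson_r(xs, ys), len(kept)


if __name__ == "__main__":
    # Hypothetical frequencies for illustration only.
    rep1 = {"s1": 0.20, "s2": 0.50, "s3": 0.80, "s4": 0.01, "s5": 0.99}
    rep2 = {"s1": 0.25, "s2": 0.45, "s3": 0.85, "s4": 0.02, "s5": 0.98}
    print(replicate_concordance(rep1, rep2))
```

Excluding near-fixed and near-absent SNPs matters because such sites cluster at the extremes and can inflate the correlation without reflecting agreement at intermediate frequencies.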