Abstract | To examine covariation of mutations between two different sites using deep sequencing data, we developed an approach to estimate tight bounds on the two-site bivariate probabilities in each viral sample, and the mutual information between pairs of positions based on all the bounds. |
Conventional sequence data | Conventional sequence data |
Conventional sequence data | Variation in the deep sequencing data was compared to protease sequence variation in the Stanford HIV Database and Gag/Gag-Pol sequence variation in the Los Alamos National |
Conventional sequence data | Entries with available nucleotide data were translated using IUPAC standard protein codes and, if any ambiguities existed in the translated sequence or nucleotide data was unavailable for that sequence, corresponding protein sequence data was used to fill in any ambiguities in the translated sequence. |
Correlation analysis in protease using bound estimates captures known pair correlations | Findings in these publications potentially serve as a benchmark that can be used to estimate how well we are able to recover information about correlated mutations from protease and gag deep sequencing data using the bounding procedure. |
Covariation of mutations in Gag-protease proteins | Identifying pairs of correlated mutations from deep sequencing data is not as straightforward as when given conventional multiple sequence alignments. |
Deep sequencing data | Deep sequencing data |
Discussion | Prior to this study, there was no direct method to extract two-site frequency counts from viral deep sequencing data of gag and pol, given the absence of sequence linkage due to the short sequencing reads. |
Discussion | Nevertheless, the procedure we have developed for identifying covariation from deep sequencing data with short reads, based on single-site mutation frequencies and bounds on joint marginals, serves as a good starting point upon which future studies may expand with datasets containing many deep sequenced samples. |
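The idea of bounding a two-site joint probability from single-site frequencies can be illustrated with the classical Fréchet inequalities. The sketch below is not the authors' procedure; it only assumes two binary sites (mutant or wild type) with known marginal mutation frequencies, and the function names `frechet_bounds` and `mutual_information` are hypothetical. It shows how the joint frequency is constrained when reads do not link the two sites, and how mutual information could be evaluated at any joint value inside those limits.

```python
import math


def frechet_bounds(p_a, p_b):
    """Return (lower, upper) limits on P(A and B) given only the marginals.

    p_a, p_b: single-site mutation frequencies, each in [0, 1].
    """
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper


def mutual_information(p_a, p_b, p_ab):
    """Mutual information (bits) of two binary sites.

    p_ab is an assumed joint mutation frequency; it should lie within
    the limits returned by frechet_bounds(p_a, p_b).
    """
    joint = {
        (1, 1): p_ab,
        (1, 0): p_a - p_ab,
        (0, 1): p_b - p_ab,
        (0, 0): 1.0 - p_a - p_b + p_ab,
    }
    marg_a = {1: p_a, 0: 1.0 - p_a}
    marg_b = {1: p_b, 0: 1.0 - p_b}
    mi = 0.0
    for (a, b), p in joint.items():
        # Cells with zero probability contribute nothing to the sum.
        if p > 0.0:
            mi += p * math.log2(p / (marg_a[a] * marg_b[b]))
    return mi


if __name__ == "__main__":
    lo, hi = frechet_bounds(0.7, 0.6)
    print("joint frequency lies in", (lo, hi))
    print("MI at lower limit:", mutual_information(0.7, 0.6, lo))
    print("MI at upper limit:", mutual_information(0.7, 0.6, hi))
```

When both marginals are close to 0 or 1 the interval is narrow, so the joint frequency is nearly determined by the single-site frequencies alone; when both are near 0.5 the interval is wide and a single sample says little about covariation.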
Mutations in protease and gag | In Fig 2, the observed variation in the deep sequencing data (top) is shown above with the variation present in 2378 drug-naive gag sequences from the Los Alamos HIV sequence database (bottom) (http://www. |
Mutations in protease and gag | We identified considerably more mutations from our deep sequencing data as compared with LANL data in the following regions of gag: in matrix both near the matrix/capsid (MA/CA) cleavage site and scattered throughout the central portion of MA; in p2 and nucleocapsid (NC) on either side of the p2/NC cleavage site; and throughout the first half of p6. |
Abstract | The method can be used for whole genome shotgun (WGS) sequencing data. |
Acknowledgments | The switchgrass sequence data were produced by the US Department of Energy Joint Genome Institute http://www.jgi.doe.gov/ in collaboration with the user community. |
Simulations | We simulated coverage levels varying from 10X per copy, which is typically less than optimal, to a coverage of 75X per haploid copy, which is higher than the usually employed datasets, although currently practicable given the continuously decreasing costs of next-generation sequencing data. |
Switchgrass Dataset | Next, we downloaded from the NCBI Sequence Read Archive whole genome shotgun reads from the same genotype, obtained through the Illumina HiSeq 2000 platform, in a total of 106.4 Gb of sequence data, and aligned all read pairs against the reference genome. |
Wheat Dataset | To investigate the effectiveness of ConPADE in that situation, as a validation procedure, we initially applied it to sequence data from the long arm of chromosome 5D (that is, chromosome 5 from the subgenome D). This data contains 236.8 Mb of sequence, with a contig L50 of 2,647 bp, and is expected to cover roughly half of the complete long arm of chromosome 5D. |
Discovery of non-coding RNA genes active in the immediate-early response | Of the four established ID-miRs for which we have CAGE data for the precursor, only two (hsa-mir-320a and hsa-mir-155) satisfied the expression criterion in the small RNA sequencing data. |
peak category. | Mature miRNA expression in MCF7 cells in response to HRG in the small RNA sequencing data. |
peak category. | Eleven ID-miRs were present in the small RNA sequencing data with expression above the minimum threshold. |
Abstract | Using transcriptome sequencing data from chronic lymphocytic leukemia, breast cancer and uveal melanoma tumor samples, we show that hundreds of cryptic 3’ splice sites (3’SSs) are used in cancers with SF3B1 mutations. |
Code, data, and reproducibility | Sequencing data is available through dbGaP (phs000767). |
Sample selection | RNA sequencing data was downloaded from CGHub [32]. |