Transcriptome Sequencing Reveals Potential Mechanism of Cryptic 3’ Splice Site Selection in SF3B1-mutated Cancers
Christopher DeBoever, Emanuela M. Ghia, Peter J. Shepard, Laura Rassenti, Christian L. Barrett, Kristen Jepsen, Catriona H. M. Jamieson, Dennis Carson, Thomas J. Kipps, Kelly A. Frazer

Abstract

Using transcriptome sequencing data from chronic lymphocytic leukemia, breast cancer and uveal melanoma tumor samples, we show that hundreds of cryptic 3’ splice sites (3’SSs) are used in cancers with SF3B1 mutations. We define the necessary sequence context for the observed cryptic 3’SSs and propose that cryptic 3’SS selection is a result of SF3B1 mutations causing a shift in the sterically protected region downstream of the branch point. While most cryptic 3’SSs are present at low frequency (<10%) relative to nearby canonical 3’SSs, we identified ten genes that preferred out-of-frame cryptic 3’SSs. We show that cancers with mutations in the SF3B1 HEAT 5-9 repeats use cryptic 3’SSs downstream of the branch point and provide both a mechanistic model consistent with published experimental data and affected targets that will guide further research into the oncogenic effects of SF3B1 mutation.

Author Summary

Many such recurrently mutated genes have been identified over the last few years, but we often do not know the underlying mechanisms by which they contribute to cancer growth. Unexpectedly, several genes in the spliceosome, the collection of RNAs and proteins that remove introns from transcribed RNAs, are recurrently mutated in different cancers. Here, we have examined mutations in the splicing factor SF3B1, a key component of the spliceosome, and identified a global splicing defect present in different cancers with SF3B1 mutations by comparing the expression of splice junctions using generalized linear models. While prior studies have reported a limited number of aberrant splicing events in SF3B1-mutated cancers, we have established that SF3B1 mutations are associated with usage of hundreds of atypical splice sites at the 3’ end of the intron. We have identified nucleotide sequence requirements for these cryptic splice sites that are consistent with a proposed mechanistic model. These findings greatly expand our understanding of the effect of SF3B1 mutations on splicing and provide new targets for determining the oncogenic effect of SF3B1 mutations.

Introduction

Recurrent mutations in the highly conserved HEAT 5-9 repeats of splicing factor 3B subunit 1 (SF3B1) have been reported in myelodysplastic syndrome, chronic lymphocytic leukemia (CLL), breast cancer (BRCA), uveal melanoma (UM), and pancreatic cancer [2-7]. SF3B1 mutation is associated with poor prognosis in CLL but improved prognosis in myelodysplasia and UM [2,7-9]. Prior studies have shown that mutated SF3B1 CLL samples have differential exon inclusion and use some cryptic 3’ splice sites (3’SSs) relative to wild-type SF3B1 CLL samples [5,6,8,10,11]. However, it is unknown whether SF3B1 mutation is associated with the same 3’SS selection defects in different cancers. The mechanism underlying the cryptic 3’SS selection and the functional consequences thereof remain unresolved as well.

SF3B1 also interacts with other spliceosomal proteins such as U2AF2, which binds the polypyrimidine tract (PPT) downstream of the BP [2,14,15]. The binding of the U2-snRNP and other spliceosome proteins around the BP prevents 3’SS selection in a ~12-18 bp region directly downstream of the BP due to steric hindrance [16,17]. Inherited cis-acting splicing mutations beyond this ~12-18 bp region downstream of the BP that result in the use of cryptic 3’SSs have been shown to occur in Mendelian disease genes [18]. Additionally, a competitive region exists ~12 bp downstream from the first 3’SS after the protected region where AG dinucleotides can compete to be used as 3’SSs based on sequence characteristics such as the PPT length, distance from the BP, nucleotide preceding the AG dinucleotide, and other features [17].

To test this, we examined splice site usage in transcriptome data from SF3B1 mutant and SF3B1 wild-type CLL, UM and BRCA cases. We identified 619 cryptic 3’SSs used more frequently in SF3B1 mutants and clustered 10-30 bp upstream of canonical 3’SSs. The majority of these cryptic 3’SSs were observed in all three tumor types despite the divergent clinical implications of SF3B1 mutation. Our analysis of tumors with SF3B1 mutations shows that cryptic 3’SS selection occurs only in samples with missense mutations at ~10 amino acid hotspots in the fifth to ninth HEAT repeats. We analyzed the organization of splicing motifs around the cryptic 3’SSs and found that only introns with an AG dinucleotide at the boundary of the sterically protected region downstream of the BP but >10 bp upstream of the canonical 3’SS are susceptible to cryptic 3’SS selection in SF3B1 mutants. We assessed the functional impact of SF3B1 mutation and found that the cryptic 3’SSs are typically used at low frequency in the SF3B1 mutants (<10% relative to the canonical splice site) and are sometimes present in the SF3B1 wild-types but at an even lower frequency (<0.5% relative to the canonical splice site). However, we identified 10 candidate genes, some previously implicated in tumorigenesis, for which there is a high amount of out-of-frame cryptic splice site usage that may affect the function of these genes.

Results

Cryptic 3’ splice sites 10-30 bp upstream of canonical 3’ splice sites are used in SF3B1 mutants

A splice junction is considered differentially used between mutant and wild-type samples if the expression level of that junction differs significantly after accounting for overall expression differences of the corresponding gene locus. All tested junctions were covered by at least 20 reads summed over all cancer samples in a given analysis, shared a 5’ splice site and/or 3’SS with a Gencode splice junction, and had a known splice site motif. We identified 1,749 junctions that were significantly differentially used between the SF3B1 mutant and SF3B1 wild-type samples across the three tumor types including 1,330 novel junctions, of which 1,117 are novel 3’SSs (BH-adjusted p < 0.1, S2 File). These 1,749 significant junctions were highly enriched for novel splice junctions compared to annotated junctions (Fisher exact, p < 10^-200) and the novel junctions were enriched for novel 3’SSs (Fisher exact, p < 10^-200), showing that SF3B1 mutations result in the usage of a large number of novel 3’SSs. These 1,749 significant junctions include 61 of 79 splice sites recently reported as specific to CLL cases with SF3B1 mutations [11], supporting the specificity of our approach while demonstrating an increased sensitivity that has allowed us to identify many more cryptic 3’SSs than previously reported. We plotted the distance between each significant novel 3’SS and its associated canonical 3’SS (defined as the nearest Gencode 3’SS that shared the same 5’ splice site; see Methods). Of the 1,117 significant novel 3’SSs, 619 were proximal cryptic 3’SSs clustered 10-30 bp upstream of their associated canonical 3’SSs while the remaining 498 cryptic 3’SSs were widely distributed (herein referred to as distal cryptic 3’SSs) (Fig. 1A, S3 File).

All of the 619 proximal cryptic 3’SSs were used more often in the SF3B1 mutant samples compared to the wild-type samples and 58% were out-of-frame relative to the nearby canonical 3’SSs, suggesting that these are not canonical 3’SSs missing from Gencode. 417 of the 498 distal cryptic 3’SSs were also used more highly in the SF3B1 mutants (S4 File). The distribution of the 1,117 significant novel 3’SSs is different from that of novel 3’SSs whose usage did not differ significantly between the SF3B1 mutants and wild-types (Fig. 1B,C), further demonstrating that the usage of proximal cryptic 3’SSs is a property of SF3B1 mutants. Examining each tumor type individually, we observed the same enrichment of cryptic 3’SSs 10-30 bp upstream of canonical splice sites (S2 Fig). Given these observations, SF3B1’s role in binding the BP, and the organization of the BP and splicing motifs in the last 30 bp of the intron [12], we focused our initial analyses on the 619 proximal cryptic 3’SSs.

Cryptic 3’SS selection is limited to tumors with mutations in HEAT repeat hotspots

We clustered all samples based on the read coverage of the 619 proximal cryptic 3’SSs and found that four SF3B1 mutated BRCA samples did not cluster with the other mutants (Fig. 1D). The SF3B1 mutation for one of these BRCA samples was a nonsense mutation not located in the HEAT 5-9 repeats while another sample had a subclonal (8.4%) HEAT 5-9 mutation with attenuated cryptic 3’SS selection (S3 Fig.). The other two samples had mutations in the HEAT 5-9 repeats but outside of the apparent ~10 amino acid mutational hotspots (Fig. 1E). We observed cryptic 3’SS selection in a TCGA lung adenocarcinoma sample with a hotspot mutation but not in lung cancer samples with SF3B1 mutations outside of the five hotspots (S4 Fig.). These results show that cryptic 3’SS selection only occurs in tumors carrying mutations in one of the five ~10 amino acid hotspots in the HEAT 5-9 repeats and is not limited to cancers in which SF3B1 is recurrently mutated.

Cryptic 3’SSs are shared across different cancer types

(Fig. 1D). Some cryptic 3’SSs were not used in one or two of the cancer types due to lower expression of the corresponding genes in those cancers. Differences in cryptic 3’SS usage due to varying gene expression may contribute to the divergent prognostic implications of SF3B1 mutation in various cancers [2,7].

The gene set with the second smallest p-value consists of genes up-regulated in chronic myelogenous leukemia and the seventh gene set contains genes up-regulated in aggressive uveal melanoma samples (GSEA [21], q < 10^-35). These results may reflect the fact that we are more likely to identify cryptic 3’SSs in genes that are highly expressed, which may bias such a gene set enrichment analysis. Nonetheless, several gene sets with potential importance for cancer development are enriched, such as genes positively correlated with BRCA1, ATM, and CHEK2 expression across normal tissues (GSEA, q < 10^-28).

Cryptic 3’SSs are located ~13-17 bp downstream of the branch point

(Fig. 2A). We chose 23,066 control 3’SSs (see Methods) and plotted the nucleotide frequency [22] for the last 50 bp of the introns for all control, associated canonical, and cryptic 3’SSs as well as the enrichment of adenines relative to the control introns. The control introns have a typical nucleotide composition with a 4-24 bp PPT preceding the 3’SS (Fig. 2B) [13]. The associated canonical 3’SS introns are enriched for adenines ~15-20 bp upstream of the 3’SS since the proximal cryptic 3’SSs are located in this region (Fig. 2C). However, the introns for proximal (Fig. 2D) and distal (Fig. 2E) cryptic 3’SSs have a strong enrichment of adenines concentrated ~15 bp upstream of the splice sites. These results suggest that the increased usage of the 619 proximal and 417 distal cryptic 3’SSs in the SF3B1 mutants may result from the same mechanism. The human BP motif is highly degenerate except for a largely invariant adenine [13], leading us to suspect that the adenine signal upstream of the cryptic 3’SSs is caused by the associated canonical 3’SSs’ BP adenines. We used SVM_BP [23] to predict BPs for the associated canonical 3’SSs and calculated the distance from the highest scoring predicted BPs to the cryptic splice sites. We found that AG dinucleotides that serve as cryptic 3’SSs are enriched ~13-17 bp downstream from the predicted BP (Fig. 3A) relative to random AG dinucleotides present in control 3’SS introns (Fig. 3B, p < 10^-7, Mann Whitney U). For cryptic 3’SSs not located 13-17 bp downstream from the highest scoring BP in Fig. 3A, we calculated the distance from the second highest scoring BP to the cryptic 3’SSs and found that overall, the majority of the cryptic 3’SSs were located 13-17 bp from either the highest or second highest scoring BP (Fig. 3C).

Fig. 2 (partial caption). ... samples to compare to the cryptic and associated canonical 3’SS ... intron for (B) 23,066 control 3’SSs; (C) 613 associated canonical ... p-values from Fisher exact tests for enrichment of adenines at each position relative to control 3’SSs. Horizontal line marks significance level of p = 0.05 (-log10 0.05 ≈ 1.3). The p-value box plots have different scales in (C), (D), and (E); the smallest p-values for each panel are labeled.

Proposed mechanism of cryptic 3’SS selection

3’SSs are typically not located within ~12-18 bp downstream of the BP because the proteins bound to the BP sterically hinder AG dinucleotides in this region and prevent them from being used as 3’SSs [16]. Our results suggest that AG dinucleotides serving as cryptic 3’SSs in SF3B1 mutants are located at the end of this sterically protected region downstream of the BP (Fig. 3D). Additionally, during the splicing reaction, the spliceosome searches ~12 bp downstream from the first 3’SS after the BP for any other 3’SSs and chooses the strongest 3’SS based on sequence features [16]. The lack of cryptic 3’SSs in the last 10 bp of the intron (Fig. 1A) indicates that cryptic 3’SSs used in SF3B1 mutants are located far enough upstream of the associated canonical 3’SSs to avoid competition for splicing. We observed that the distance between associated canonical 3’SSs and their predicted BPs is significantly greater than the distance between control 3’SSs and their BPs such that the cryptic 3’SSs at the edge of the protected region do not compete with the canonical 3’SS for splicing (p < 10^-23, Mann Whitney U, Fig. 3E,F). We also predicted BPs for the 619 proximal and 417 distal cryptic 3’SSs (as opposed to above where we predicted BPs for the canonical 3’SSs associated with the 619 proximal 3’SSs) and found that the majority of these cryptic 3’SSs were 13-17 bp downstream of their predicted BPs (S5 Fig.), providing further evidence that most cryptic 3’SSs (both proximal and distal) associated with SF3B1 mutations are located at the edge of the sterically protected region.

Studying the role of cryptic 3’SS in inherited Mendelian disease genes, Kralovicova et al. 2005 used splicing reporters with cryptic 3’SSs located in the PPT and found that moving the cryptic 3’SS into the ~12-18 bp sterically protected region reduced or eliminated cryptic 3’SS selection. On the other hand, moving an AG dinucleotide out of the sterically protected region allowed for its selection as a cryptic 3’SS [18]. These published experimental results and the rigid distance between the BP and the cryptic 3’SSs observed in our study are consistent with a model of altered 3’SS selection in SF3B1 mutants due to a change in the size of the sterically hindered region downstream of the BP.

For 900 of these introns, the potential cryptic 3’SSs also passed the coverage cutoff, of which 310 were used significantly more often in the SF3B1 mutants. This analysis demonstrates that not every potential cryptic 3’SS is differentially used in the mutants, so the sequence requirements described here appear to be necessary for cryptic 3’SS usage but not sufficient.

Cryptic 3’SSs are used infrequently relative to canonical 3’SSs

Although the cryptic splice sites described here are used significantly more often in the SF3B1 mutants, the biological effects are likely dependent on the proportion of transcripts that use the cryptic 3’SSs relative to the canonical 3’SSs. We therefore calculated the percent spliced in (PSI) for the proximal cryptic 3’SSs relative to their associated canonical 3’SSs in the CLL samples since they have a higher sequencing depth than the other tumor samples (S1 Fig.) that allows for more accurate quantification of splicing and because the distribution of well-characterized low and high-risk CLL prognostic factors was similar between the SF3B1 mutated and wild-type samples (Fig. 4A). To calculate PSI for the 325 proximal cryptic 3’SSs used significantly more often in the SF3B1 mutants from the CLL-only analysis (S6-S7 Files), we divided the number of reads that span the cryptic 3’SS by the total number of reads that span either the cryptic 3’SS or its associated canonical 3’SS. We observed that some cryptic 3’SSs are used exclusively in SF3B1 mutants while others are also used in SF3B1 wild-type samples but at a lower frequency relative to the mutants (Fig. 4A). 67% of the cryptic 3’SSs were included in

These results suggest that the cryptic splice sites are either included rarely even in the SF3B1 mutants or that transcripts with cryptic splice sites are subject to a higher rate of nonsense-mediated decay (NMD). To investigate the potential role of NMD, we identified differentially expressed genes between the SF3B1 mutant and wild-type samples in a joint analysis of all three cancers and performed a gene set enrichment analysis. We found that genes in the “Reactome NMD enhanced by the exon junction complex” set were enriched (GSEA [21], q < 10^-28) among the 272 differentially expressed genes (DESeq2, BH-adjusted p < 0.1, S8-S9 Files), suggesting that NMD may be different between the SF3B1 mutants and wild-types. 33 of the 582 genes that contained the 619 proximal cryptic 3’SSs were differentially expressed, with the expression of 29/33 of these genes lower in the SF3B1 mutants. Genes containing a proximal cryptic 3’SS were more likely to be differentially expressed (Fisher exact, p < 10^-8) and more likely to have lower expression in SF3B1 mutants (Fisher exact, p = 0.0009). These results suggest that cryptic 3’SS selection may affect gene expression for a subset of genes. However, the observation that in-frame cryptic 3’SSs likely not subject to NMD and out-of-frame cryptic 3’SSs potentially subject to NMD are included at similar rates relative to their associated canonical 3’SSs (Fig. 4A) suggests that the expression of most genes is not affected by cryptic 3’SS selection and that most cryptic 3’SSs are observed at a low frequency because they are spliced in infrequently compared to their associated canonical 3’SSs.

Despite the generally low PSI values for the 325 cryptic 3’SSs from the CLL-only analysis, we identified four genes previously implicated in cancer (TTI1 [24-26], MAP3K7 [27-29], FXYD5 [30], PFDN5 [31]) and six others (YIF1A, ORAI2, ZNF91, ZNF548, RP11-1280I22.1, RP11-532F12.5) with out-of-frame cryptic 3’SSs that were consistently preferred to the associated canonical 3’SS in the CLL SF3B1 mutant samples (Fig. 4B). Ferreira et al. identified the junctions in ORAI2, ZNF91, and TTI1 in CLL SF3B1 mutants as well [11]. Nine of the ten junctions were significant in our BRCA-only analysis and showed high differences in relative inclusion (S6 Fig., S10-S11 Files). These genes are not differentially expressed between the CLL SF3B1 mutant and wild-type samples (S12 File) but the frequent inclusion of out-of-frame cryptic 3’SSs may affect their biological function.

Discussion

We have shown that the cryptic 3’SSs have specific sequence requirements; AG dinucleotides used as cryptic 3’SSs in SF3B1 mutants are located at the end of the sterically protected region ~13-17 bp downstream of the BP but are >10 bp upstream of nearby canonical 3’SSs, allowing them to avoid competition for splicing. These sequence requirements limit the introns susceptible to cryptic 3’SS selection to those where the BP is located farther from the 3’SS than the typical ~24 bp. While these requirements appear necessary for cryptic 3’SS usage, they are not sufficient, as we did not detect cryptic 3’SS usage in all introns with AG dinucleotides that satisfy these requirements. Characteristics such as RNA conformation, RNA binding protein sites, BP prediction inaccuracies, cryptic or downstream canonical 3’SS strength, gene/transcript expression, sequencing depth, or other factors may also play a role in determining whether cryptic 3’SSs are used and detected by RNA sequencing.

When examining the three cancer types in our study individually, the number of cryptic 3’SSs identified was highly dependent on the sequencing depth of the samples (S1-S2 Figs., S2 File). Additionally, examining cryptic 3’SSs expressed higher in the SF3B1 mutants but not significantly (Fig. 1B) shows a modest enrichment of novel 3’SSs 10-30 bp upstream of canonical 3’SSs. These observations suggest that deeper sequencing will continue to reveal proximal cryptic 3’SSs in SF3B1 mutants that are used very infrequently or are present in lowly expressed genes.

Using a curated list of aberrant splice sites associated with different diseases from the literature, Kralovicova et al. 2005 found that in cases where cryptic 3’SS selection was not caused by mutation of the 3’YAG consensus sequence, cryptic 3’SSs were often located ~19 bp upstream of associated canonical 3’SSs and ~11-15 bp downstream of the BP [18]. Most of the diseases considered in Kralovicova et al. 2005 are Mendelian diseases where a cryptic 3’SS disrupts or abolishes the function of a single disease gene. In these cases, a mutation in the PPT between the sterically protected and competitive regions has introduced a cryptic 3’SS (Fig. 3D). For cancers with SF3B1 mutations, we suspect that the size of the sterically protected region is slightly altered, allowing for existing AG dinucleotides to be used as cryptic 3’SSs in hundreds of genes. It is also possible that SF3B1 mutations could cause destabilization of the U2 snRNP complex or alter interactions with U2AF2, affecting the ability to recognize the canonical 3’SS and leading to cryptic 3’SS selection. However, the rigid distance (~13-17 bp) from the predicted BPs to the cryptic 3’SSs for most of the cryptic 3’SSs is most consistent with a change in the size of the sterically protected region downstream of the branch point. We found that cryptic 3’SS selection is limited to tumors with mutations in the five ~10 amino acid hotspots in the SF3B1 HEAT 5-9 repeats and that these mutations are associated with cryptic 3’SS selection across different cancer types and even in cancers in which SF3B1 is not recurrently mutated. 58% of these cryptic 3’SSs are out-of-frame relative to nearby canonical 3’SSs, but the biological impact of these cryptic 3’SSs is likely a function of how frequently they are used relative to the nearby canonical 3’SSs.

We found that while the cryptic 3’SSs are used more often in the SF3B1 mutated samples compared to wild-type samples, they are used relatively infrequently (<10%) compared to nearby canonical 3’SSs. While the differentially expressed genes between the SF3B1 mutated and wild-type samples are enriched for genes in the NMD pathway, even in-frame cryptic 3’SSs are used at a low frequency, indicating that the associated canonical 3’SS is mostly preferred to the cryptic 3’SS even in SF3B1 mutants. Nonetheless, we identified ten genes, including four with known roles in cancer, which had a high frequency of cryptic splice site usage relative to the nearby canonical splice site. Further studies are required to determine whether low-frequency cryptic 3’SS selection in hundreds of genes, high-frequency cryptic 3’SS selection in a small group of genes, and/or other splicing alterations drive the oncogenic effect of SF3B1 mutation.

Methods

Sample selection

For the chronic lymphocytic leukemia (CLL) samples, the UCSD IRB approved the study and all subjects gave informed consent (Project #080918). Refer to the informed consent for The Cancer Genome Atlas and Harbour et al. for consent information for other cancer samples [7].

Seven SF3B1 mutated CLL cases and nine SF3B1 wild-type CLL cases were identified from the CLL Consortium database. The mutations were originally characterized by PCR and verified in the RNA-sequencing data [9]. Sample dates were chosen on average 95 days prior to treatment and at least 287 days after prior treatment to select samples with high tumor cell count. Samples were chosen to have relatively similar numbers of IGHV mutated/unmutated and ZAP-70 positive/negative samples (Fig. 4).

SF3B1 mutant samples were identified using the Broad GDAC TCGA analysis (http://gdac.broadinstitute.org/runs/analyses_2013_02_22/) in TCGA tumor types with no publication restrictions. Samples with SF3B1 mutations outside of Gencode version 14 exons were excluded. We excluded any cancer types with fewer than four SF3B1 mutants or for which paired-end RNA-sequencing data was not available, leaving breast cancer (BRCA), lung adenocarcinoma (LUAD), and lung squamous cell carcinoma (LUSC). We chose 1.25 times as many SF3B1 wild-type controls as mutated samples for each cancer type randomly from samples without mutations in SF3B1 or other splicing factors. RNA sequencing data was downloaded from CGHub [32].

Uveal melanoma samples were downloaded from the Short Read Archive (SRA062359) [7]. As reported in Furney et al., four uveal melanoma samples had SF3B1 mutations in codon 625 and four had wild-type copies of SF3B1 [33].

Library preparation and sequencing for CLL samples

RNA was polyA selected and processed using SMART cDNA synthesis (Clontech) to prepare sequencing libraries. Samples were sequenced on Illumina HiSeq 2000 instruments generating an average of 239 million paired 75 bp reads per sample (S1 Fig).

Adapter trimming

AAAAAAAAAAAAAAAAAAAAA) [34]. Read pairs where one or both reads were of length less than 20 were removed.

Read alignment

RNA-seq reads were aligned to the human genome (hg19) using STAR 2.3.0e (--alignSJDBoverhangMin 1 --seedSearchStartLmax 12 --alignSplicedMateMapLminOverLmate 0.08 --outFilterScoreMinOverLread 0.08 --outFilterMatchNminOverLread 0.08 --outFilterMultimapNmax 100 --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outSJfilterOverhangMin 6 6 6 6) and a splice junction database consisting of junctions from Gencode, UCSC knownGene, AceView, lincRNAs, and H-InV [19,35-39]. Duplicate read pairs were removed prior to alignment by comparing the sequences of all read pairs and keeping only one read pair per set of read pairs with identical sequences.
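
The duplicate-removal rule described above can be illustrated with a minimal Python sketch. It assumes read pairs are already loaded in memory as (name, first-mate sequence, second-mate sequence) tuples; the function name drop_duplicate_pairs and this representation are illustrative assumptions, not the code used in the study.

```python
def drop_duplicate_pairs(read_pairs):
    """Keep the first read pair seen for each distinct (mate 1, mate 2) sequence pair.

    read_pairs: iterable of (name, seq1, seq2) tuples (hypothetical layout).
    """
    seen = set()
    kept = []
    for name, seq1, seq2 in read_pairs:
        key = (seq1, seq2)  # pairs are duplicates only if both mates match
        if key not in seen:
            seen.add(key)
            kept.append((name, seq1, seq2))
    return kept
```

Keeping the first occurrence is an arbitrary choice in this sketch; the text only states that one pair per set of identical pairs is retained.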

Splice junction read coverage

Splice junction read coverages were obtained from the SJ.out.tab output file from STAR.
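
As a hedged illustration of how junction coverage might be collected and filtered, the sketch below parses SJ.out.tab lines (tab-separated; columns 1-3 give chromosome, intron start and intron end, and column 7 gives the number of uniquely mapping reads crossing the junction) and applies the 20-read cutoff summed over samples. Using only the uniquely mapping read column is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def read_sj_counts(lines):
    """Return {(chrom, start, end): unique_reads} from SJ.out.tab lines."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or truncated lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        counts[(chrom, start, end)] = int(fields[6])  # column 7: uniquely mapping reads
    return counts

def junctions_passing_cutoff(per_sample_counts, min_total=20):
    """Junctions whose coverage summed over all samples is at least min_total."""
    total = Counter()
    for counts in per_sample_counts:
        total.update(counts)  # adds per-junction counts across samples
    return {junction for junction, n in total.items() if n >= min_total}
```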

Novel splice junction identification

Novel splice junctions were defined as those junctions identified by STAR not present in Gencode version 14 that (i) were covered by at least 20 reads summed over all cancer samples in a given analysis, (ii) shared a 5’ splice site and/or 3’SS with a Gencode junction, and (iii) had one

Splice junction usage

Splice junctions used in more than one Gencode gene were removed. When multiple cancer types were analyzed, we provided cancer type as a covariate to DEXSeq. Raw p-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg procedure. To examine the impact of the coverage cutoff of 20 reads summed over all samples on our results, we increased the cutoff to 50, 75, and 100 reads summed over all samples and found that 42%, 32%, and 24% of the significant novel 3’SSs remained at each of these cutoffs, respectively. The enrichment for proximal cryptic 3’SSs remained at all cutoffs, so we used the 20 read cutoff to maximize sensitivity.
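
For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following is a minimal pure-Python sketch of the standard procedure. It is a generic illustration and not the implementation used by DEXSeq or DESeq2; the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A junction would then be called significant when its adjusted value falls below the chosen threshold (0.1 in the analyses described here).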

Identification of associated canonical 3’SSs for cryptic 3’SSs

First, all Gencode splice sites that shared a 5’ splice site with the novel 3’SS were identified. Then, the closest Gencode 3’SS from these splice sites that was downstream of the cryptic 3’SS was chosen as the associated canonical 3’SS for that cryptic 3’SS. If there was no Gencode 3’SS downstream of the cryptic 3’SS, the closest Gencode 3’SS upstream of the cryptic 3’SS was chosen as the associated canonical 3’SS.
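
The selection rule above can be sketched as follows. This is a minimal illustration that assumes genomic coordinates and a strand symbol are available, so that "downstream" means a larger coordinate on the + strand and a smaller coordinate on the - strand; the function name and input layout are hypothetical.

```python
def associated_canonical(cryptic_pos, annotated_positions, strand):
    """Pick the associated canonical 3'SS for a cryptic 3'SS.

    annotated_positions: coordinates of Gencode 3'SSs that share the
    cryptic site's 5' splice site (assumed to be precomputed).
    """
    sign = 1 if strand == "+" else -1
    # Signed offset > 0 means the annotated site is downstream in transcript orientation.
    offsets = [(sign * (pos - cryptic_pos), pos)
               for pos in annotated_positions if pos != cryptic_pos]
    downstream = [(off, pos) for off, pos in offsets if off > 0]
    if downstream:
        return min(downstream)[1]  # closest downstream annotated 3'SS
    upstream = [(-off, pos) for off, pos in offsets if off < 0]
    return min(upstream)[1] if upstream else None  # fall back to closest upstream
```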

Gene set enrichment for genes with cryptic 3’SS usage

We performed a gene set enrichment analysis using GSEA [21] for the genes that contained cryptic 3’SSs by combining the genes that contained the 619 proximal (S3 File) and the 417 dis

Identification of control 3’SSs

We identified 23,066 control 3’SSs by choosing splice sites that are annotated in Gencode, whose average coverage over BRCA, CLL, and UM samples is greater than 100, and whose 5’ splice site does not have any novel 3’SSs. We characterized intronic AG dinucleotides for these control junctions by analyzing the intronic sequence downstream of the predicted branch points minus the last 10 bp of the intron since alternative 3’SSs can be located in the last 10 bp of the intron.

Hierarchical clustering

All heatmap rows and columns were clustered using scipy.cluster.hierarchy.linkage with either the “complete” or “single” linkage method.

SF3B1 mutant allele frequency

Mutant allele frequency was determined by calculating per-base coverages using unique properly paired reads with samtools mpileup for the SF3B1 locus and counting the number of reads supporting either the reference or alternate alleles.
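
As a small worked example of the final step, the sketch below computes a mutant allele frequency from read counts. It assumes the reference- and alternate-supporting read counts at the mutated position have already been extracted from the pileup; the function name is hypothetical.

```python
def mutant_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a position that support the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage, frequency undefined
    return alt_reads / total
```

For instance, 10 alternate-supporting reads out of 100 covering reads would give a frequency of 0.1, the scale on which a subclonal mutation would appear.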

Gene expression

The remaining reads were realigned to the Gencode v14 transcriptome using Bowtie2 (v2.1.0, -t -k 400 -X 400 --no-mixed --no-discordant) and transcript expression was estimated using eXpress (v1.3.0, --max-indel-size 20) [40,41]. Gene expression was estimated by summing together the effective counts or FPKM values for all transcripts contained in a gene.

Relative average expression of genes with cryptic 3’SSs

The average expression values were then normalized for each gene by dividing by the largest average expression of the three cancers for that gene. Therefore each column in the green heatmap in Fig. 1D has one value of 1.0 while the other two values are between 0.0 and 1.0 and represent the expression of the gene in that cancer relative to the maximum.
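
The per-gene scaling described above can be sketched in a few lines of Python. The nested-dictionary input and the function name are assumptions made for illustration.

```python
def relative_expression(avg_expr):
    """Scale each gene's per-cancer average expression by that gene's maximum.

    avg_expr: {gene: {cancer_type: average_expression}} (hypothetical layout).
    """
    scaled = {}
    for gene, by_cancer in avg_expr.items():
        top = max(by_cancer.values())
        if top == 0:
            scaled[gene] = {c: 0.0 for c in by_cancer}  # gene not expressed anywhere
        else:
            scaled[gene] = {c: v / top for c, v in by_cancer.items()}
    return scaled
```

With averages of 10, 40 and 20 in BRCA, CLL and UM, a gene would be shown as 0.25, 1.0 and 0.5 respectively.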

Definition of HEAT repeats

HEAT repeat locations were defined according to the definition of HEAT repeats in Wang et al. 1998 [15].

COSMIC SF3B1 mutations

COSMIC v66 complete export was downloaded and the number of mutations at each location in the SF3B1 HEAT domains 5-9 was plotted for locations with at least two observed mutations in COSMIC [42].

Nucleotide frequency plots

Nucleotide frequency plots were constructed using WebLogo (unit_name = ’probability’) [22]. Adenine enrichment was calculated by counting the number of adenines and non-adenines at each intron position for a given splice site class and comparing to the number of adenines and non-adenines in control 3’SSs using a Fisher exact test.
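
To make the per-position adenine test concrete, the following is a minimal pure-Python sketch of a one-sided Fisher exact test on a 2x2 table of adenine and non-adenine counts. Whether the study used a one-sided or two-sided test is not stated, so the one-sided form here is an assumption, as is the function name.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    a, b: adenines and non-adenines at a position in the tested splice site class.
    c, d: adenines and non-adenines at the same position in control 3'SSs.
    Returns P(X >= a) under the hypergeometric null of no enrichment.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    numer = sum(comb(col1, k) * comb(total - col1, row1 - k)
                for k in range(a, upper + 1))
    return numer / denom
```

Applying this at each of the last 50 intron positions and plotting -log10 of the result would give a profile comparable in spirit to the enrichment panels.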

Branch point identification

The SVM_BP code was altered to allow for branch points eight bp from the 3’SS by setting mindist3ss = 8 in svm_getfeat.py (see https://github.com/cdeboever3/svm-bpfinder). SVM_BP was run with options “Hsap 50.” When multiple branch points were predicted for one 3’SS, we chose the branch point with the highest sequence score (bp_scr). In some instances, there was more than one cryptic 3’SS associated with a canonical 3’SS, so we randomly chose only one of these cryptic splice sites for further analysis. For Fig. 3C, we plotted the distance from highest scoring BP predicted for canonical 3’SSs to their associated cryptic 3’SSs as in Fig. 3A. However, the distances for cryptic 3’SSs located less than 13 bp or more than 17 bp from the BP in Fig. 3A were replaced with the distance from the second highest scoring BP. S5C-S5D Figs. were created similarly.
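
The branch point choice and the fallback used for Fig. 3C can be sketched as below. The sketch assumes each prediction is a dictionary holding the sequence score (bp_scr) and the distance from the branch point to the canonical 3’SS under a hypothetical key bp_to_canonical, and that the cryptic 3’SS position is given as its distance upstream of the canonical 3’SS. Exact coordinate conventions are assumptions.

```python
def bp_to_cryptic_distance(predictions, cryptic_upstream, low=13, high=17):
    """Distance from a predicted branch point to the cryptic 3'SS.

    Uses the highest scoring branch point; if that distance falls outside
    [low, high] and a second prediction exists, uses the second highest
    scoring branch point instead.
    """
    ranked = sorted(predictions, key=lambda p: p["bp_scr"], reverse=True)
    first = ranked[0]["bp_to_canonical"] - cryptic_upstream
    if low <= first <= high or len(ranked) < 2:
        return first
    return ranked[1]["bp_to_canonical"] - cryptic_upstream
```

Setting low and high to the same distance as the top prediction, or passing a single prediction, reproduces the simpler highest-score-only rule.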

Differential gene expression

We summed the effective counts from eXpress for all transcripts from each gene to obtain effective read counts for each gene. We provided these read counts to DESeq2 (v1.2.10, R v3.0.3) and tested for differential gene expression using nbinomWaldTest using cancer type as a covariate for the analysis with different cancers [43]. We only tested genes where the sum of effective read counts over all samples was greater than 100. p-values were adjusted using the Benjamini-Hochberg procedure. Gene set enrichment analysis was performed using GSEA [21].
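
The aggregation and filtering steps that precede the DESeq2 test can be illustrated with a short Python sketch. It assumes per-sample transcript counts and a transcript-to-gene mapping are available as dictionaries; the function names are hypothetical and the statistical test itself is not reproduced here.

```python
from collections import defaultdict

def gene_counts(transcript_counts, transcript_to_gene):
    """Sum effective counts over all transcripts of each gene for one sample."""
    totals = defaultdict(float)
    for transcript, count in transcript_counts.items():
        totals[transcript_to_gene[transcript]] += count
    return dict(totals)

def genes_to_test(per_sample_gene_counts, min_total=100):
    """Genes whose counts summed over all samples exceed min_total."""
    totals = defaultdict(float)
    for sample in per_sample_gene_counts:
        for gene, count in sample.items():
            totals[gene] += count
    return {gene for gene, n in totals.items() if n > min_total}
```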

Percent spliced in for cryptic 3’SSs relative to associated canonical 3’SSs

Percent spliced in (PSI) values for cryptic 3’SSs relative to canonical 3’SSs were calculated by dividing the number of reads that span the cryptic 3’SS (c) by the number of reads that span the cryptic 3’SS plus the number of reads that span the canonical 3’SS (a), $\frac{c}{c+a}$, for each sample. The ten 3’SSs with high PSI values in CLL were identified by selecting cryptic 3’SSs whose median PSI was greater than 50% in the CLL SF3B1 mutants but less than 20% in the wild-type samples and whose average coverage was at least 30 junction-spanning reads in the CLL mutant samples. These junctions were also chosen to be out-of-frame, although the cryptic 3’SS in ORAI2 is located in the 5’ untranslated region.
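A minimal sketch of the PSI calculation and the high-PSI filter follows. Several details are assumptions rather than statements from the text: coverage is taken as cryptic plus canonical junction-spanning reads, the wild-type threshold is applied to the median, samples with no reads at either junction are skipped, and the function names are hypothetical.

```python
from statistics import mean, median


def psi(cryptic_reads, canonical_reads):
    """Fraction of junction-spanning reads that use the cryptic 3'SS."""
    total = cryptic_reads + canonical_reads
    return cryptic_reads / total if total else None


def is_high_psi(mutant, wild_type, min_mut=0.5, max_wt=0.2, min_cov=30):
    """Apply the three criteria described above.

    mutant, wild_type: lists of (cryptic_reads, canonical_reads) per sample.
    """
    mut_psi = [p for p in (psi(c, a) for c, a in mutant) if p is not None]
    wt_psi = [p for p in (psi(c, a) for c, a in wild_type) if p is not None]
    if not mut_psi or not wt_psi:
        return False
    # Assumed definition: mean of cryptic + canonical reads across mutants.
    coverage = mean(c + a for c, a in mutant)
    return (median(mut_psi) > min_mut
            and median(wt_psi) < max_wt
            and coverage >= min_cov)
```

The out-of-frame requirement mentioned in the text is a separate annotation step and is not covered by this sketch.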

Code, data, and reproducibility

We have made the code and intermediate data files needed to replicate this study available on Github (https://github.com/cdeboever3/deboever-sf3b1-2015) and Figshare (http://dx.doi.org/

Instructions are provided in the Github repository for reproducing our figures, tables, and statistical analyses. Sequencing data is available through dbGaP (phs000767).

Supporting Information

We sequenced the transcriptomes of peripheral blood mononucleocytes from seven SF3B1 mutated chronic lymphocytic leukemia (CLL) cases and nine SF3B1 wild-type cases. We also obtained data from breast cancer (BRCA; 14 mutant, 18 wild-type), lung squamous cell carcinoma (LUSC; four mutant, five wild-type) and lung adenocarcinoma (LUAD; seven mutant, nine wild-type) samples from the TCGA and uveal melanoma (UM; four mutant, four wild-type) samples from Harbour et al. 2013.

log2 distance in base pairs from 280, 1,476, and 86 significantly differentially used novel 3’SSs (S2 File) to their associated canonical 3’SSs in (A) BRCA, (B) CLL, and (C) UM analyses respectively. Novel 3’SSs were associated with canonical 3’SSs only if they shared the same 5’ splice site. Zero represents the position of the canonical 3’SS. Negative and positive distances indicate that the cryptic 3’SS is respectively upstream or downstream from the canonical 3’SS. Inset shows base-by-base binning from zero to 50 base pairs upstream of canonical 3’SS. Red and blue histograms represent junctions with significantly higher usage in SF3B1 mutants or SF3B1 wild-type samples respectively. The number of cryptic 3’SSs identified varied with the overall sequencing depth of the different data sets.

Heatmap shows for each BRCA sample the log2 library-normalized count z-score for 192 proximal cryptic 3’SSs used significantly more often in the SF3B1 mutants and located 10-30 bp upstream of canonical 3’SSs (S2 File). SF3B1 mutants are labeled with the observed missense or nonsense (*) mutation as well as the frequency of the mutant allele in the RNA-sequencing data. Attenuated cryptic 3’SS selection is visible for the K700E mutant with only 8.4% allele frequency. A633V and Y765C mutants do not show evidence for cryptic 3’SS selection. Black and white colorbar indicates whether novel

Heatmap shows for each sample the log2 library-normalized count z-score for the 578 proximal cryptic 3’SSs used significantly more often in the SF3B1 mutants in the CLL, BRCA, UM, LUAD, and LUSC joint analysis (S2 File). Grey bars indicate frequency of SF3B1 mutant allele in RNA-seq data. Colorbars indicate SF3B1 mutation status, cancer type, and whether the SF3B1 mutation is located in the HEAT 5-9 repeats. Black and white colorbar indicates whether novel 3’SSs are out-of-frame (black) relative to canonical 3’SSs.

Distance from 3’SS to highest scoring predicted branch point (BP). We were able to predict BPs for (A) 584 of 619 proximal cryptic 3’SSs and (B) 405 of 417 distal cryptic 3’SSs (as opposed to predicting the BPs for the associated canonical 3’SSs as in Fig. 3). Distance from either highest or second highest scoring predicted BP to (C) proximal cryptic 3’SSs and (D) distal cryptic 3’SSs. Cryptic 3’SSs that are used more often in SF3B1 mutants have BPs located ~13-17 bp upstream regardless of whether they are 10-30 bp upstream of canonical 3’SSs. (TIF)

Beeswarm plots showing the PSI values for the cryptic 3’SS relative to the associated canonical 3’SS in nine of ten genes with high levels of cryptic 3’SS inclusion in CLL SF3B1 mutants (M) compared to wild-type (W) samples that were also expressed in the BRCA samples. The number in the upper corner of each plot is the distance in base pairs from the highest or second-highest scoring BP predicted for the associated canonical 3’SS to the cryptic 3’SS. (TIF)

Metadata for samples used in this study. SF3B1 mutated samples have columns for frequency of SF3B1 mutation in RNA-seq data, mutation type, codon change, and whether the mutation is in the HEAT 5-9 repeats. These columns are empty for SF3B1 wild-type samples.

S2 File. Summary of differential junction usage results from DEXSeq. DEXSeq was used to test for differential splice junction usage in a joint analysis of the CLL, BRCA, and UM samples as well as individually for each cancer type. “Novel” indicates that the junction is not annotated in Gencode. “Proximal” indicates that a novel 3’SS is 10-30 bp upstream of a canonical Gencode 3’SS.

619 cryptic 3’SSs located 10-30 bp upstream of canonical 3’SSs from joint BRCA, CLL, and UM analysis. Locations of 5’ splice sites and 3’SSs are one-based coordinates that denote the start and end of the intron. The columns COSMIC, TSgene, and ncg denote whether the gene is present in COSMIC, TSGene, or the Network of Cancer Genes respectively.

417 distal cryptic 3’SSs used more often in SF3B1 mutants from joint BRCA, CLL, and UM analysis. Locations of 5’ splice sites and 3’SSs are one-based coordinates that denote the start and end of the intron. The columns COSMIC, TSgene, and ncg denote whether the gene is present in COSMIC, TSGene, or the Network of Cancer Genes respectively.

GSEA results for 912 genes containing 619 proximal and 417 distal cryptic 3’ splice sites used more often in SF3B1 mutants. (XLS)

325 significant cryptic 3’SSs located 10-30 bp upstream of canonical 3’SSs and used more often in SF3B1 mutants from CLL-only DEXSeq analysis. Locations of 5’ splice sites and 3’SSs are one-based coordinates that denote the start and end of the intron. The columns COSMIC, TSgene, and ncg denote whether the gene is present in COSMIC, TSGene, or the Network of Cancer Genes respectively.

Percent spliced in for 325 cryptic 3’SSs located 10-30 bp upstream of canonical 3’SSs from CLL-only DEXSeq analysis. Note that there are only 324 values because one canonical 3’SS was filtered due to low coverage, so a PSI value could not be calculated.

272 genes that are differentially expressed between SF3B1 mutant and wild-type samples from joint analysis of CLL, BRCA, and UM using DESeq2. (TSV)

192 significant cryptic 3’SSs located 10-30 bp upstream of canonical 3’SSs and used more often in SF3B1 mutants from BRCA-only DEXSeq analysis. Locations of 5’ splice sites and 3’SSs are one-based coordinates that denote the start and end of the intron. The columns COSMIC, TSgene, and ncg denote whether the gene is present in COSMIC, TSGene, or the Network of Cancer Genes respectively.

Percent spliced in for 192 cryptic 3’SSs located 10-30 bp upstream of canonical 3’SSs from BRCA-only DEXSeq analysis. Note that there are only 191 values because one canonical 3’SS was filtered due to low coverage, so a PSI value could not be calculated.

S12 File. 33 genes that are differentially expressed between SF3B1 mutant and wild-type

Acknowledgments

The authors are grateful to the Chronic Lymphocytic Leukemia Research Consortium for providing the CLL samples and to Marco A. Marra, Richard A. Moore, Joanne E. Johnson, Andrew J. Mungall and the Transcriptome Consortium at Canada’s Michael Smith Genome Sciences Centre for performing transcriptome sequencing for the CLL samples.

Author Contributions
