Flexibility plays a key role in DNA supercoiling and DNA/protein binding, regulating DNA transcription, replication and repair. A specific interest in flexibility analysis concerns its relationship with human genome instability. Enrichment in flexible sequences has been detected in unstable regions of the human genome called fragile sites, where genes map and carry frequent deletions and rearrangements in cancer. Flexible sequences have been suggested to be the determinants of the proneness of fragile genes to breakage; however, their actual role and properties remain elusive. Our in silico analysis, carried out genome-wide via the StabFlex algorithm, shows the conserved presence of highly flexible regions in the budding yeast genome as well as in the genomes of other Saccharomyces sensu stricto species. Flexibility peaks in S. cerevisiae identify 175 ORFs, mapping on their 3’UTR, a region affecting mRNA translation, localization and stability. (TA)n repeats of different extension shape the central structure of peaks and co-localize with polyadenylation efficiency element (EE) signals. ORFs with flexible peaks share common features. Their transcripts are characterized by decreased halflife, which is considered peculiar to genes involved in regulatory systems with high turnover; consistently, their function affects biological processes such as cell cycle regulation or stress response. Our findings support the functional importance of flexibility peaks, suggesting that the flexible sequence may be derived from an expansion of the canonical TAYRTA polyadenylation efficiency element. The flexible (TA)n repeat amplification could be the outcome of an evolutionary neofunctionalization leading to differential 3’-end processing and expression regulation in genes with peculiar function. Our study provides new support for the functional role of flexibility in genomes and a strategy for its characterization inside human fragile sites.
High torsional flexibility of the DNA helix characterizes sequences that are enriched in fragile sites, loci of peculiar chromosome instability inside the human genome that are often associated with cancer genes. AT-rich flexible islands are suggested to be the determinants of chromosome fragility; however, the origin of their occurrence in cancer genes and the mechanism of chromosome breakage remain unknown. Here, we study DNA flexibility in budding yeast chromosomes. We found that flexibility is conserved in yeast species. Flexibility peaks identify 175 ORFs, mapping on their 3’-end untranslated region. (TA)n repeats of different extension shape the central structure of peaks and co-localize with polyadenylation signals. ORFs with peaks have decreased mRNA stability and prevalently regulatory functions. Our findings support the functional importance of flexibility peaks. They suggest that functional processes may also be at the origin of the presence of flexibility peaks inside cancer genes in human fragile sites. Defining the role of flexible sequences in genomes may help to understand the processes involved in cancer gene rearrangements.
DNA conformational flexibility is a function of the dsDNA sequence that defines how the molecule can bend or twist about its axis (torsion). Flexibility is important in DNA supercoiling and is particularly significant in DNA-protein interaction. Its relationship with nucleosome occupancy and DNA looping along genomes gives it a key role in many biological functions, including DNA regulation during transcription, replication and repair [1].
Fragile sites are regions peculiarly prone to DNA breakage, usually under conditions of replication stress; the common fragile sites often map in association with genes involved in tumorigenesis, such as FHIT and WWOX, and their instability causes cancer-specific recurrent deletion and translocation breakpoints [2]. While their molecular basis remains elusive, the identification in a number of them of AT-rich flexible islands capable of forming stable secondary structures has suggested that flexible regions are good candidates for determinants of chromosome fragility [3, 4]. Effects on DNA stability through structural interference with replication and a block of fork progression have been indicated as possible mechanisms of action of flexible sequences [5]. Stalled forks and mitotic entry before completion of replication have indeed been shown to be related to chromosome breakage in fragile regions [6]. New results, however, highlight that functional aspects are also involved in chromosome fragility. Mapping of fragile sites in different cell types confirmed that their setting is tissue dependent and so epigenetically determined [7]. Consistently, fragile sites expressed in human lymphocytes show correlated breakage and are enriched in genes involved in immunity and inflammation, which are cell-type specific processes [8].
Support for this model comes from the observation that, in the human genome, AT-rich flexibility peaks also lie at breakpoints of chromosome rearrangements involving the LCR22A-D region of chromosome 22q11.2, a highly unstable segmental duplication implicated in constitutional genomic diseases [10].
Yeast has a very compact genome that nevertheless comprises a large number of typical eukaryotic genomic elements. A very favourable condition is the wide availability of genome-wide data on its structural and functional aspects. To this aim, we developed a computer program that predicts the flexibility of the DNA helix by measuring the twist angle between consecutive base pairs; it implements the TwistFlex software previously developed [11] for the analysis of human fragile sites [3, 12] and adapts it to the fast analysis of long sequences.
We determined the presence of 183 flexibility peaks. We defined peaks as segments of genome with twist flexibility above a fixed threshold (i.e. twice the standard deviation). We mapped the location of the flexibility peaks within the yeast genome using the SGD [14] and data reported in literature, both uploaded into the UCSC Genome Browser [15]. Flexibility peaks appear on the 3’ UTR of 175 ORFs in S. cerevisiae, which share common features. The connection between flexibility peaks and ORFs could be the evolutionary outcome of modified canonical polyadenylation elements, leading to a differentiated 3’ end processing and gene expression regulation.
long on average (longest 975 bp, shortest 188 bp). In the following, peaks shall be denoted as, e.g., peakIV-16, meaning the 16th peak within chrIV. Their chromosomal map shows no enrichment at specific chromosome arms or at centromere or telomere positions/regions (Fig. 1). The longest chromosomes (chrIV, chrVII, chrXII and chrXV) contain the largest number of peaks, showing a generally good correlation between the distribution of peaks and chromosome content (see Table 1 in S1 File). However, peaks do not follow a regular pattern but show regions of intense presence as well as empty regions; the different distances between peaks are reported in Fig. 1 (inset).
First, we compared the location of peaks to ORFs, then to major genomic annotations. The results, reported in S1 Table, show that most flexibility peaks (170 peaks out of 183, 92.9%) are positioned within interORF regions (Fisher test: p < 10^-16). Of the remaining peaks, 11 lie inside ORFs, one peak lies on a telomere (peakI-2) and one peak lies on an rRNA locus (peakXII-
In the compact S. cerevisiae genome, the interORF regions make up only 27% of the genome length. Of them, 26% are upstream of two divergently transcribed genes and 49% are upstream of one gene and downstream of another, so including putative promoters; finally, 25% are downstream of two convergently transcribed genes, presumably containing only terminators [16]. Inspection of the interORF regions containing flexibility peaks reveals that 67 peaks (39.4%) lie in interORF regions between converging genes, 77 peaks (45.3%) lie between genes with unidirectional transcription, and only 26 peaks (15.3%) lie between two genes with divergent transcription (see S1 Table). This is not consistent with the 1:2:1 ratio distribution of the yeast genome, making the difference statistically significant for the converging regions (Fisher test:
For example, genes that are divergently expressed may share promoter and transcription factors and show similar regulation and functional relationship; similarly, convergent genes may share terminators or 3’-transcribed regions [17]. In this context, the observed prevalent position of flexibility peaks suggests that they could represent structural regulatory signals.
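The orientation counts quoted above (67 convergent, 77 unidirectional and 26 divergent peaks, against genome-wide shares of 25%, 49% and 26%) can be rechecked with a few lines of arithmetic. The sketch below is only illustrative: it uses a chi-square goodness-of-fit statistic in place of the Fisher test reported in the text, and the function name `orientation_summary` is a hypothetical helper, not part of the original analysis.

```python
# Observed interORF flexibility peaks by gene-pair orientation (counts from the text).
observed = {"convergent": 67, "unidirectional": 77, "divergent": 26}
# Genome-wide share of interORF regions by orientation (from the text, ref. [16]).
expected_share = {"convergent": 0.25, "unidirectional": 0.49, "divergent": 0.26}


def orientation_summary(observed, expected_share):
    """Return per-class (observed %, expected count) and a chi-square statistic.

    Assumption: a chi-square goodness-of-fit is swapped in for illustration;
    the paper itself reports Fisher tests.
    """
    total = sum(observed.values())
    rows = {}
    chi2 = 0.0
    for key, count in observed.items():
        expected = expected_share[key] * total
        rows[key] = (100.0 * count / total, expected)
        chi2 += (count - expected) ** 2 / expected
    return rows, chi2


rows, chi2 = orientation_summary(observed, expected_share)
for key, (pct, expected) in rows.items():
    print(f"{key}: observed {pct:.1f}%, expected count {expected:.1f}")
print(f"chi-square (2 degrees of freedom): {chi2:.2f}")
```

Under these assumptions the convergent class contributes most of the deviation, in line with the statement that the excess is significant for converging regions.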
[17] to analyze the possible co-localization of any of these regions with flexibility peaks. According to the cited authors, promoters and terminators were considered the sequences intermediate between the different untranslated regions; for the few ORFs without measured data, the average lengths of the 5’UTR and 3’UTR were reported. We found that all peaks lying between convergent genes, except 4 peaks, co-localize with the 3’UTR of one ORF or of both ORFs, as in the cases of very large peak extension or partial 3’UTR overlap (Fisher test: p < 10^-15). Peaks lying between genes with unidirectional transcription co-localize with the 3’UTR in 64 cases (Fisher test: p < 10^-15). To sum up, 127 peaks lie on a 3’UTR region and 175 ORFs have a peak in their 3’UTR. Finally, peaks between divergent genes co-localize with the 5’UTR in 18 cases (Fisher test: p < 10^-15). The features of peaks are reported in S1 Table.
By contrast, a RepeatMasker analysis revealed that all peaks were characterized by (TA)n or similar AT-rich repeats (Fig. 2). (TA)n repeats are predominant and characterize all peak types except the 11 peaks lying inside ORFs, all of which contain (TTA)n. Repeats show great length variability and comprise stretches of uninterrupted dinucleotide TA sequences mixed with degenerate TA sequences (from 23 to 89 bp). For this reason, in the following we shall refer to all types of AT-rich sequences as tandem repeats, without distinction. 17 other
Fig 2. Distribution of repeats within flexibility peaks.
cleavage site). EE is the upstream signal, mainly consisting of TATATA (consensus sequence: TAYRTA). PE occurs 16 to 27 nt downstream and the best word for this element is AATAAA (consensus sequence: AAWAAA); however, it is commonly described only as A-rich, since many functional sequences are characterized only by their adenosine content. The near-upstream element, as well as the near-downstream one, is characterized as T-rich [18].
The TAYRTA sequence provides the greatest effect on 3’ end processing, with the T/U at the first and fifth positions being the most critical for function; in a large-scale analysis (1017 yeast nuclear transcripts) more than half of the 3’UTRs (52%) contained this optimal EE sequence [19]; in other cases, transcripts contain several consecutive copies of the EE sequence [20]. Owing to these reported TA-rich EE structures, we searched for evidence of a general relationship between the tandem repeats (corresponding to flexibility peaks) and EE elements.
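The consensus sequences quoted for the efficiency and positioning elements (TAYRTA and AAWAAA) use IUPAC ambiguity codes, with Y = C or T, R = A or G, and W = A or T. As a minimal sketch, assuming a plain DNA string as input, such a consensus can be turned into a regular expression and scanned for overlapping matches. The helper names `consensus_to_regex` and `find_motif` are hypothetical and are not part of the tools used in the study.

```python
import re

# IUPAC codes needed for the consensus sequences quoted in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "W": "[AT]"}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus such as 'TAYRTA' into a regex string."""
    return "".join(IUPAC[base] for base in consensus)


def find_motif(seq, consensus):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A lookahead is used so that overlapping copies inside (TA)n runs are all reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]


ee_hits = find_motif("GGTATATATACC", "TAYRTA")   # two overlapping TATATA copies
pe_hits = find_motif("ccAATAAAcc", "AAWAAA")     # one AATAAA copy
print(ee_hits, pe_hits)
```

Reporting overlapping matches matters here because a (TA)n repeat contains a new TATATA copy every two bases.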
We found that the EE elements co-localize with an under-threshold flexible region (i.e. a genomic region where flexibility is enhanced, but does not reach the peak threshold). Similar results were obtained for the expanded EE element detected within the 3’UTR of the FBP1 gene, constituted by a (TA)14 repeat [21], which again co-localizes with an under-threshold flexible region; this last element is of special interest because it has been experimentally shown to be a very potent polyadenylation element in both strand orientations. The expanded EE has been suggested [22] to affect polyadenylation by offering several overlapping binding sites to Hrp1 or by allowing its association/dissociation at multiple binding sites. Thus, we speculated that all the flexibility peaks positioned at the 3’UTR might have the potential to serve as EEs, with an expansion linked to functional features, where the determinant for complex 3’ end formation could be just the DNA/RNA secondary structure due to helix flexibility. Ozsolak et al. [23] have obtained very informative data in a map of poly(A) cleavage sites in the yeast genome generated by direct RNA sequencing.
There are 2874 intense sites (out of 34444) that are closer than 500 nt to a repeat within a peak. As shown in Fig. 3, intense poly(A) sites occur in a highly position-specific manner, prevalently within a distance of 5 nt to 25 nt from repeats: 91.7% of them are closer than 100 nt and 73.8% are closer than 25 nt. If we limit this analysis to (TA)n only, then 75% are closer than 25 nt. Intense poly(A) cleavage sites are usually present as multiple, clustered elements within the range [0-25 nt] from repeats. Almost all peaks in convergent and unidirectional intergenic regions match intense poly(A) signals. The authors of [23] read weak and isolated signals as indicative of low transcriptional activity; this occurs in only nine peaks, so it is nearly negligible.
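The distance statistics above can be reproduced in outline by pairing each intense poly(A) site with the midpoint of the nearest repeat and counting the share of sites below a cutoff. The following is a minimal sketch for a single chromosome, assuming integer coordinates; the function names and the toy coordinates are hypothetical and do not come from the study's data.

```python
from bisect import bisect_left


def nearest_distance(site, midpoints):
    """Distance (nt) from a poly(A) site to the closest repeat midpoint.

    `midpoints` must be a non-empty list sorted in ascending order.
    """
    i = bisect_left(midpoints, site)
    # Only the neighbours just left and right of the insertion point can be closest.
    candidates = midpoints[max(i - 1, 0):i + 1]
    return min(abs(site - m) for m in candidates)


def fraction_closer_than(sites, midpoints, cutoff):
    """Fraction of sites lying strictly closer than `cutoff` nt to a repeat midpoint."""
    mids = sorted(midpoints)
    distances = [nearest_distance(s, mids) for s in sites]
    return sum(d < cutoff for d in distances) / len(distances)


# Toy coordinates (hypothetical): two repeat midpoints and five poly(A) sites.
toy_midpoints = [100, 500]
toy_sites = [90, 110, 130, 480, 1000]
share = fraction_closer_than(toy_sites, toy_midpoints, 25)
print(share)
```

A real analysis would repeat this per chromosome and per strand; those details are left out of the sketch.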
The analysis of repeats position and of strand direction of signals highlights a peculiar organization of 3’UTR extremity or of its extension. In unidirectional intergenic regions, the repeat sequence covers the extremity of mapped 3’UTR or lies slightly outside it,
Fig 3. Distance of each intense poly(A) site from repeats within nearest flexibility peak.
Distance of most intense poly(A) sites (score greater than 945)—following Ozsolak et al. [23]—from the midpoint of repeats inside each flexibility peak (see text for details on calculations). The outer bar (large and blue) refers to distance from (TA)n, only. The inner bar (thin and yellow) refers to distance from any repeat, indifferently.
The PE element, when present, may be positioned either upstream of the EE (within the 3’UTR) or downstream of the complete 3’-end forming signal, or in both positions within the same 3’UTR. Examples include the 3’-ends of the genes IME1 (peakX-5), DBF4 (peakIV-14) and CDC53 (peakIV-5) (see supporting S1 File, figure 1).
Examples are the peculiar 3’-ends of the convergent gene pairs TSR1 and RAD59 (peakIV-9, see Fig. 4), as well as ERV15 and AME1 (peakII-10), SNC1 and MYO4 (peakI-1), or DIG2 and PHO8 (peakIV-27) (see supporting S1 File, figure 2).
Sometimes the 3’-end signals lie on the 5’UTR with sense or antisense orientation with respect to the adjacent ORF, as happens for the region between the divergent PUF3 and YEH1 genes (peakXII-3); in other cases signals are distant from ORFs without any overlap with their components, as for the region of peakX-3 between the divergent TDH2 and MET3 genes (see supporting S1 File, figure 3). These findings clearly indicate the presence of termination signals in the absence of annotated transcriptional units; therefore, peaks which are positioned at 3’UTR may also mark non-coding RNA genes, which may frequently be antisense transcripts. A large quantity of antisense transcripts has been reported by both the Ozsolak and the Nagalakshmi studies [23, 25], and in yeast they are estimated to cover 80% of annotated ORFs. Antisense transcripts are less abundant and so are characterized by a low number of 3’ end signals; this explains the presence of weak signals in peaks that are not positioned at the 3’ end of ORFs. Finally, concerning peaks lying inside an ORF, we remark that we found poly(AAT) codons coding for a poly-Asn region of the polypeptide, instead of poly(A) signals. In conclusion, TAYRTA elements, closely adjacent to the cleavage site, have a non-canonical position in the peak-associated 3’UTRs. To explore the concomitant occurrence of further polyadenylation elements we performed a search for motifs by a MEME analysis [26], carried out on the 183 peak regions. We identified, as expected, a TATATATATATATATATGTATAT motif (MEME statistical significance E-value = 4.6 x 10^-585) in 145 peaks and an ATTATTATTATTATTATTATTATTATT motif (MEME statistical significance E-value = 3.7 x 10^-119) in 32 of them. Moreover, performing an analogous analysis on flexible regions±100 (i.e. peak regions including an additional 100 nt upstream and downstream), we found in 183 sites the novel A/T-rich motif CTTCTTTTCTTC. This last motif seems to have some function since it again occurs in all interORF peak regions.
They have been described in yeast, where they may depend on the dense arrangement of genes and may cause transcriptional interference [27]. It is plausible that, similarly, for unidirectional genes, failure to terminate transcription at the end of the first gene will result in inhibition of the next gene [28], and that this type of interference could act as a regulatory system for the differential expression of adjacent gene pairs or for sense-antisense transcription [29]. This suggests that the flexible elements inside the 3’UTR could characterize genes with specific types of termination, where peculiar signals are required, possibly to regulate a programmed RNA interference.
For this analysis a dataset by Scannell et al. [30, 31], containing the alignment of 4298 intergenic regions, was analysed. Out of the 170 flexible sequences (excluding those inside ORFs), 131 regions (77%) conserve a flexibility peak exceeding the fixed threshold in at least one species and 70 regions (41%) in all species; in most cases of conservation failure, under-threshold flexible regions were observed. Conservation of peaks is particularly strong for the convergent and unidirectional intergenic regions. Out of the 67 convergent ones, 55 regions (82.1%) conserve the flexibility
Significantly recurrent motifs identified by the MEME algorithm [26] on peak regions. Motif 1 has the consensus sequence TATATATATATATATATGTATAT (E-value = 4.6 x 10^-585) and is found in 145 peaks; motif 2 has the consensus sequence ATTATTATTATTATTATTATTATTATT (E-value = 3.7 x 10^-119) and is found in 32 peaks. Motif 3 has the consensus sequence CTTCTTTTCTTC (E-value = 1.8 x 10‘”) and is found in 183 peaks; in this case the analysis has been performed on peak sequences including an additional 100 nt upstream and downstream.
Consistently, 51 out of the 55 conserved flexible sequences are in regions with conserved synteny maintaining convergent transcription. The unidirectional regions conserving a flexibility peak in at least one species are 67 (81.8%), all maintaining unidirectional transcription. By contrast, peak conservation in divergent intergenic regions is significantly underrepresented (50%; Fisher test: p = 0.002).
These findings are indicative of an evolutionary differentiation among species with a substantial conservation of flexibility peaks, even when sequence conservation among the four genomes is weak. Notably, 38 conserved flexible ORFs (22 in converging and 11 in unidirectional transcription) were found to belong to the list of ohnologs, i.e. paralogous genes arising from whole genome duplication [32] (see S2 Table); in all cases except one, only one member of the ohnolog pair carries a flexibility peak in the 3’UTR. Usually, the members of an ohnolog pair underwent sequence modifications related to functional changes of different extent. Consequently, the peak sequence on one ohnolog may be a peculiar modification linked to functional divergence between pair members, possibly leading to sub- or neo-functionalization, which are processes already defined in yeast for a number of duplicated genes [33].
In yeast, for instance, adjacent genes are co-expressed to a significantly higher level than expected [35]; moreover, many highly co-expressed gene pairs take part in the same cellular processes [36]. Accordingly, the conservation of flexibility peaks in a convergent or unidirectional pattern may be related to the peculiar structural or functional aspects of gene pair expression.
mRNA stability is a key regulatory step controlling gene expression and ultimately affects protein levels and function. Notably, long- and short-lived transcripts appear to have systematic differences in the EE, suggesting peculiar roles of this poly(A) signal in mRNA stability [37]. Therefore we checked whether the ORFs with a peak in the 3’UTR could be related to a differential mRNA stability. We took advantage of the data on mRNA halflives derived by Wang et al. [38] from mRNA decay profiles measured by microarrays following transcriptional shutoff. Results for the 175 ORFs with a peak in the 3’UTR were compared with those for all other ORFs; they show that these ORFs are characterized by a significant lowering of both poly(A) halflife (t-test: p < 2.5 x 10^-2) and overall halflife (t-test: p < 1 x 10^-2), indicating that they produce unstable mRNAs (see Fig. 6). According to current models for the major decay pathways, in yeast poly(A) shortening precedes the decay of the entire transcript and is a rate-limiting step [39]. Differential degradation of mRNAs can play an important role in setting the basal level of mRNA expression and in how that mRNA level is modulated by environmental stimuli. It has been suggested that there is a general relationship between the stability of an mRNA and the physiological function of its product. Accordingly, mRNAs involved in central metabolic functions are generally relatively long-lived, whereas those involved in regulatory systems turn over relatively rapidly [38]. Consistently, flexibility peaks inside the 3’UTR may be proposed to be part of the regulatory machinery of short-lived mRNAs.
Comparison between overall mRNA decay rates (left) and poly(A) mRNA decay rates (right) in the 175 ORFs containing a 3’UTR peak against all the other ORFs (data from [38]). For each group, the histogram shows the mean value ± standard error of the halflives of mRNAs, either overall or poly(A). The halflives are measured in minutes.
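The comparison of halflives between ORFs with a 3’UTR peak and all other ORFs rests on group means, standard errors and a t statistic. A minimal sketch using only the Python standard library is given below. It assumes Welch's unequal-variance form of the t statistic, which the text does not specify, and the halflife values are hypothetical toy numbers, not data from [38].

```python
from math import sqrt
from statistics import mean, variance


def mean_and_sem(values):
    """Mean and standard error of the mean (sample variance, n - 1 denominator)."""
    return mean(values), sqrt(variance(values) / len(values))


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed test variant)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical halflives in minutes, for illustration only.
peak_orfs = [10.0, 12.0, 14.0]
other_orfs = [20.0, 22.0, 24.0, 26.0]

peak_mean, peak_sem = mean_and_sem(peak_orfs)
other_mean, other_sem = mean_and_sem(other_orfs)
t_stat = welch_t(peak_orfs, other_orfs)
print(f"peak ORFs: {peak_mean:.1f} ± {peak_sem:.2f} min")
print(f"other ORFs: {other_mean:.1f} ± {other_sem:.2f} min")
print(f"Welch t = {t_stat:.2f}")
```

A negative t statistic corresponds to shorter halflives in the first group; turning it into a p-value needs the t distribution, which is outside the standard library and is therefore omitted here.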
A functional analysis of all 175 such ORFs (listed in S3 Table) was carried out by identifying the Gene Ontology (GO) terms, using the YeastMine search engine [40]. The search reveals enrichment for 72 GO Biological Process categories (p < 1.1 x 10^-2) as well as for 14 GO Molecular Function categories
The first 10 GO BP terms (i.e. those with the lowest p-value) are identified for a range of 31 to 86 ORFs per GO term, with a mean value of 62.3 ORFs per GO term. The GO MF term “binding” is identified for 101 ORFs.
The outcomes for Biological Process GO terms (visualized as a treemap in supporting S1 File, figure 6, top) point out the presence of ORFs with a role in cell cycle, phosphorus/organic cyclic compound/nitrogen compound metabolism, phosphorylation, reproduction, growth, response to acid, and signaling. The 175 ORFs include genes expressing key components of cell cycle progression and regulation: TUB2 and TUB3, encoding α and β tubulins; CLB4 and PHO80, encoding cyclins; CDC53 and APC9, encoding respectively the cullin structural protein of SCF complexes and a subunit of the Anaphase-Promoting Complex/Cyclosome; moreover, AME1, RAD24, RAD59 and SWE1, involved in checkpoint maintenance; FUS3, DIG2 and SLT2, encoding MAP-kinases and their regulator; and BMH1, encoding the major isoform of 14-3-3 proteins. Further, they include IME1, encoding a master regulator of meiosis, and its convergent gene UME6, the key transcriptional regulator of early meiotic genes; moreover, MFA1, encoding the essential mating pheromone a-factor, and STE50, the major protein involved in mating response. Finally, they include ASG1, TSR1, ICT1, YAP1, PHO80, FRT1 and HAA1, regulators involved in the stress response. In accordance with the prevalent regulatory functions revealed for Biological Process GO terms, the REVIGO outcomes for Molecular Function GO terms point out the presence of numerous ORFs with a role in binding and in phosphatase and kinase activities (visualized as a treemap in supporting S1 File, figure 6, bottom). All these findings confirm the general involvement of ORFs with a peak in the 3’UTR in regulatory systems as well as their characterization by unstable transcripts. Moreover, these results seem to be consistent with the picture in which the regulatory function of genes is related to short halflife [38].
Nucleosome free regions or nucleosome depleted regions (NFR or NDR) were observed at regulatory regions such as gene TSS and TTS, affecting the binding of regulatory proteins, nucleosome ordering inside genes and transcriptional plasticity [44, 45]. Since AT-rich sequences in defined contexts have a nucleosome-disfavoring property, we evaluated whether the AT-rich sequence in flexible peaks in the 3’UTR could play a regulatory role by determining specific nucleosome positioning; thus, we analyzed the co-localization of peaks with NDR, obtained from [46]. We found that large distances occur between each peak and the nearest segment with high nucleosome depletion (Fig. 7 in supporting S1 File), indicating that AT-rich peak regions and NDR are not associated elements. A manual inspection was then performed on the nucleosome occupancy of all peaks localized in the 3’UTR of convergent genes, to be sure to consider only transcriptional terminators. Data on experimental nucleosome occupancy, reported by [47], together with nucleosome coverage predicted by a model based on in vitro sequence data, were available through the SwissRegulon server [48, 49]. We found that no peak shows altered nucleosome coverage. These are unexpected results, as many papers describe nucleosome depletion in yeast gene 3’-end termination. Nevertheless, they help to delineate the action of flexibility peaks, by suggesting that a flexible peak may exert its function on polyadenylation by affecting phases not directly dependent on local chromatin structure, for example by modulating the nascent
A complete description of the human ortholog genes related to diseases is reported in S4 Table, including, besides genes, related diseases and detailed references, the chromosome band localization and the coincidental occurrence of common fragile sites. We highlight that the map position of the human ortholog genes for eleven yeast genes coincides with that of known fragile sites [51]; moreover, most of the orthologs are implicated in cancer development. These findings support the relationship between peak-associated ORFs and fragile sites.
Indeed, the FHIT gene spans FRA3B, the most common human fragile site, characterized by the presence of clusters of high flexibility peaks [52]. The FHIT gene has been suggested to have biological effects similar to NIT and to share signaling pathways with it [53].
In this paper we systematically study the presence of flexibility peaks in the S. cerevisiae genome and explore their functional role.
The peculiar architecture of repeats and poly(A) signals inside peaks suggests that they could mark terminations in ORFs characterized by specific requirements in RNA cleavage. Consistently, we find that peaks in ORFs lie prevalently in regions where convergent transcription occurs. Peaks show a general conservation among different Saccharomyces yeast species, but with sequence variation in orthologous genes and a clear differentiation between paralogous genes, suggesting that they could be the result of evolutionary differentiation. We provide evidence that ORFs with a peak in the 3’UTR have transcripts with lower halflife, a feature considered peculiar to genes involved in regulatory systems with high turnover. Moreover, we show that ORFs with a peak in the 3’UTR share a number of common functions in biological processes such as cell cycle regulation or stress response. From these findings we infer that flexibility peaks could play a functional role as regulatory elements of gene expression for a peculiar set of genes. A regulation based on flexible sequences has so far no experimental foundation. However, we must consider that, while the impact of the 3’-end sequence on gene expression is well established, the understanding of how its effect is encoded in DNA is limited. Polyadenylation is critical for many aspects of mRNA metabolism, including mRNA stability, translation and transport. Poly(A) signals act as substrate for cleavage and polyadenylation, for which RNA structure is also a critical determinant [54]. Furthermore, RNA binding proteins regulate almost all post-transcriptional stages [55]. Specific sequence motifs in the 3’UTR that are involved in stabilization [56] and stress response [57] have been identified in yeast. In particular, an increased AT-content upstream of the polyadenylation site has been shown to modulate protein expression dynamics [58].
Thus, AT-rich tandem repeats and strand flexibility may be crucial in determining the interaction with polyadenylation factors, the mRNA structure and the accessibility of binding sites to multiple regulators. The notion that enriched tandem repeats in S. cerevisiae could guide transcriptional modulation has been established for genes carrying highly variable tracts of repeats in the promoter; the genes involved have the general feature of interacting with the cell environment and so requiring rapid response changes [59, 60]. Gene regulation differs greatly among related species, constituting a major source of phenotypic diversity. This issue is of particular significance for gene evolution, and tandem repeats have been considered able to drive transcriptional divergence and to confer evolvability on gene expression [61]. The variable repeat-based component of peaks inside the 3’UTR may have a similar origin and evolution. Tandem repeats are intrinsically prone to variation, often losing or gaining units by replication slippage [62]. Thus, long repeat stretches could be derived from the well-known polyadenylation enhancement elements; their potential in modulating gene expression regulation (termination efficiency and transcript halflife) may have been the feature that determined their fixation in peculiar genes.
Common fragile sites are chromosome regions prone to breakage upon replication stress. To date, 22 fragile sites, among the 230 mapped in human lymphocytes, are known at molecular level but the molecular basis of fragility remains unknown. They extend over megabase-long regions, tend to overlap very large genes and share a delayed completion of DNA replication. Recently, delayed replication has been correlated with a paucity of initiation events [63, 64]. Notably, the authors found that FRA3B and FRA16D, the most active fragile sites in human lymphocytes, have low levels of fragility in fibroblasts, where instead other sites show very high fragility; cell-type-specific replication programs characterize the commitment to fragility at different loci in each cell-type, indicating that fragility is epigenetically defined.
The second one is their enrichment in genes related to cell cycle regulation, apoptosis or similar processes involved in cancer development [65]. More in detail, chromosomal fragile sites FRA3B and FRA16D, carrying the FHIT and WWOX genes respectively, that are genes playing a major role in apoptosis, show correlated expression and association with failure of apoptosis in lymphocytes from cancer patients [66]. In the same perspective, all fragile sites belong to networks of correlated breakage, comparable to gene expression pathways activated in response to damage stress; in particular the correlated fragile sites, analyzed in lymphocytes, are enriched in genes involved in immunity and inflammation, that are cell-type specific processes of lymphocytes [8].
While yeast is a unicellular and quite simple organism, many processes are highly conserved; it is conceivable that conservation may concern the specific mechanisms that differentiate the expression of peculiar gene classes. In higher eukaryote evolution, these mechanisms may have been used in the commitment of different genes to the stress response, which is cell and tissue specific [67]. In this view, the regulatory role of flexibility peaks inferred for yeast genes could also hold for human fragile genes, even if not necessarily involving the 3’ end termination process. The extent of this correlation will be determined by a comparable genome-wide analysis of DNA flexibility in the human sequence.
We refer to the complete Saccharomyces cerevisiae RefSeq genome as obtained and annotated on SGD (SacCer2 assembly).
The conformation of DNA and its sequence dependence are mainly determined by the chemical structures of the base pairs and their interactions. The computational model by Sarai et al. [68] examines DNA flexibility on the basis of base pair interactions, and its results agree with available experimental observations. The STABFLEX algorithm is used to calculate potential local variations in the DNA structure, expressed as fluctuations in the twist angle (degrees, deg). It is a reimplementation of the TwistFlex software [11] and is designed to analyze very large sequences.
Within each window the flexibility is calculated for consecutive dinucleotide steps, and the average value of all steps in the window is assigned to the midpoint dinucleotide step. The flexibility is measured in degrees (deg) in the range [7 deg;16 deg].
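The windowed averaging described above can be sketched in Python as follows. This is a minimal illustration and not the STABFLEX implementation: the function name `window_flexibility`, the `window` parameter and the `TOY_STEP_FLEX` table are assumptions introduced here, and the table holds made-up placeholder numbers rather than the published twist-angle fluctuations.

```python
from typing import Dict, List

# Placeholder per-step values in deg (NOT the published parameters):
# every dinucleotide step gets 10.0 except two arbitrary AT-rich steps.
TOY_STEP_FLEX: Dict[str, float] = {a + b: 10.0 for a in "ACGT" for b in "ACGT"}
TOY_STEP_FLEX["TA"] = 14.0
TOY_STEP_FLEX["AT"] = 12.0


def window_flexibility(seq: str, step_flex: Dict[str, float], window: int) -> List[float]:
    """Average the per-step flexibility over sliding windows.

    `window` is the number of consecutive dinucleotide steps per window.
    Entry i of the result covers steps i .. i + window - 1 and would be
    assigned to the midpoint step of that window.
    """
    seq = seq.upper()
    # One flexibility value per consecutive dinucleotide step.
    steps = [step_flex[seq[i:i + 2]] for i in range(len(seq) - 1)]
    if window <= 0 or len(steps) < window:
        return []
    # Keep a running sum so that each additional window costs O(1).
    total = sum(steps[:window])
    averages = [total / window]
    for i in range(window, len(steps)):
        total += steps[i] - steps[i - window]
        averages.append(total / window)
    return averages


profile = window_flexibility("TATA", TOY_STEP_FLEX, window=2)
```

The sketch assumes a sequence made only of A, C, G and T; ambiguous bases such as N would need explicit handling before the lookup.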
Peaks emerge spontaneously as short genomic regions where signal is extremely high. They are marked by arrows in the top picture. The complete flexibility data for a genomic region are plotted as a quantized signal and each flexibility value refers to 100bp, as shown in the bottom zoomed snapshot.
Fig. 7 (top picture) shows the normalized distribution of window flexibility values for all 16 chromosomes of the yeast genome. As shown in Fig. 7 (bottom picture), for large flexibility values (greater than 12 deg) the distribution is no longer Gaussian. The non-Gaussian tail identifies flexibility peaks, as follows. First, we preselected regions with outstanding flexibility values, deviating significantly from the average (not lower than S = mean + 2 × standard deviation, which is 12.1 for all chromosomes). The value 12.1 may be read as the point where Gaussianity is lost (see the in-plot in Fig. 7). Regions correspond to the genomic sequence covered by overlapping consecutive windows simultaneously exceeding S. Second, those regions whose maximal flexibility value exceeds the threshold of 13.8 are defined as flexibility peaks. The threshold has been fixed as in the literature [12, 52]. Peaks are denoted as, for example, peakIV-16, meaning the 16th peak within chrIV.
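The two-step selection described above (preselection at S = 12.1, then a peak threshold of 13.8) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name `call_peaks`, its return format and the example values are assumptions, and regions are reported as window indices rather than genomic coordinates.

```python
from typing import List, Sequence, Tuple


def call_peaks(
    values: Sequence[float],
    s: float = 12.1,
    peak_threshold: float = 13.8,
) -> List[Tuple[int, int, float]]:
    """Return (first_window, last_window, max_value) for each flexibility peak.

    Step 1: a region is a maximal run of consecutive windows with value >= s.
    Step 2: a region is kept as a peak only if its maximum exceeds peak_threshold.
    """
    vals = list(values)
    peaks: List[Tuple[int, int, float]] = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last window.
    for i, v in enumerate(vals + [float("-inf")]):
        if v >= s:
            if start is None:
                start = i
        elif start is not None:
            region_max = max(vals[start:i])
            if region_max > peak_threshold:
                peaks.append((start, i - 1, region_max))
            start = None
    return peaks


# Made-up window values, for illustration only.
example = [11.0, 12.5, 13.9, 12.2, 11.0, 12.3, 12.4, 10.0, 14.0]
found = call_peaks(example)
```

In this toy input the run of windows 5 to 6 passes the preselection but never exceeds 13.8, so it is discarded at the second step.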
The statistical significance of properties and classifications has been assessed by means of Fisher’s exact test and the t-test. Fisher’s exact test is used in the analysis of 2 x 2 contingency tables built for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association between the two kinds of classification.
A: Normalized distribution of flexibility values for all the yeast chromosomes. B: Upper tail for flexibility values greater than 12 deg (within cthl), compared to a Gaussian distribution with the same mean and standard deviation. In-plot: values greater than 13 deg.
A t-test is a statistical hypothesis test in which the test statistic follows a Student’s t-distribution if the null hypothesis holds. It can be used to determine whether two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. For both tests, specific R programs have been designed and implemented by the authors.
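As an illustration of Fisher's exact test on a 2 x 2 contingency table, the following Python sketch computes a two-sided p-value from the hypergeometric probabilities of all tables with the same margins. It is not the authors' R program, which is not shown here; the function name `fisher_exact_two_sided` and the example counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as probable as the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Made-up counts: rows = objects with / without property 1,
# columns = objects with / without property 2.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

Summing the probabilities of tables no more likely than the observed one is one common definition of the two-sided p-value; other conventions exist and can give slightly different numbers.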
This applies to motifs found by MEME and to GO term enrichment. As stated by the authors in [26], MEME usually finds the most statistically significant (low E-value) motifs first. The E-value of a motif is based on its log likelihood ratio, width, sites, the background letter frequencies, and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio, and with the same width and site count, that one would find in a similarly sized set of random sequences.
Tools differ in the algorithms they use and the statistical tests they perform. All enrichment widgets list a term, a count and an associated p-value. The term can be something like a publication name or a GO term. The count is the number of times that term appears for objects in your list. The p-value is the probability that the result occurs by chance, thus a lower p-value indicates greater enrichment without corrections. The p-value is calculated using the hypergeometric distribution.
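The hypergeometric enrichment p-value mentioned above can be sketched as the upper tail of the hypergeometric distribution. This is a minimal illustration with no multiple-testing correction; the function name `enrichment_p_value`, the argument names and the example counts are assumptions and do not reproduce any specific tool.

```python
from math import comb


def enrichment_p_value(count: int, list_size: int, annotated: int, population: int) -> float:
    """Probability of seeing at least `count` annotated objects by chance.

    population: total number of objects in the background
    annotated:  background objects carrying the term
    list_size:  number of objects in the submitted list
    count:      objects in the list carrying the term
    """
    upper = min(list_size, annotated)
    # math.comb returns 0 when more items are requested than are available,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, list_size - i)
        for i in range(count, upper + 1)
    )
    return favourable / comb(population, list_size)


# Made-up counts: 2 of 3 listed objects carry a term that annotates
# 4 of the 10 objects in the background.
p_enriched = enrichment_p_value(count=2, list_size=3, annotated=4, population=10)
```

A smaller returned value means that the observed count is less likely under random sampling from the background.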
Individual chromosomal flexibility peak annotations in BED format, suitable for visualisation through the Genome Browser [15], are part of the online supplementary material. The algorithm STABFLEX is available at http://home.gna.org/stabfleX/.
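For readers unfamiliar with BED, each annotation is a tab-separated line giving the chromosome, a 0-based start, an exclusive end and an optional name. The sketch below formats peaks in this way; the function name `peaks_to_bed` and the coordinates are hypothetical and are not taken from the supplementary files.

```python
from typing import Iterable, Tuple


def peaks_to_bed(peaks: Iterable[Tuple[str, int, int, str]]) -> str:
    """Format (chrom, start, end, name) tuples as four-column BED lines."""
    # BED uses a 0-based start and an end that is not included in the feature.
    return "\n".join(f"{chrom}\t{start}\t{end}\t{name}" for chrom, start, end, name in peaks)


# Hypothetical coordinates, for illustration only.
bed_text = peaks_to_bed([("chrIV", 1000, 1300, "peakIV-16")])
```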
Peaks and ORFs involved. A .pdf file containing: a summary table on peaks and chromosome length; UCSC snapshots for peaks within unidirectional, convergent and divergent intergenic regions; alignments of peakIV-14 and peakIV-9 for Saccharomyces sensu stricto species; treemaps of the outcomes of REVIGO for Biological Process and Molecular Function GO terms, referring to the 175 ORFs characterized by a peak in the 3’UTR; results of the comparison of peaks with the nucleosome depleted regions. (PDF)
Genomic features of peaks. A .csv table containing all genomic features corresponding to flexibility peaks. (CSV)
Conservation of peaks. A .xls file containing six tables about conservation in Saccha-
S3 Table. ORFs involved in peaks in their 3’UTR. A .xls file containing the list of 175 ORFs with a peak in the 3’UTR and tables about GO term results. (XLS)
Peak positions. An archive containing the flexibility peak positions, in .bed format, suitable for UCSC visualization. (ZIP)
Human genes orthologous to ORFs involved in peaks. A .xls file containing the list of disease-associated human genes which are orthologous to yeast ORFs associated with peaks. (XLS)
Performed the experiments: GM. Analyzed the data: GM IS. Contributed reagents/materials/analysis tools: AB. Wrote the paper: GM IS RB.