visualization of alternative splicing - vizbi

Visualization of Alternative Splicing Yoseph Barash, University of Toronto March 2011

Page 1: Visualization of Alternative Splicing - VIZBI

Visualization of Alternative Splicing Yoseph Barash, University of Toronto

March 2011

Page 2: Visualization of Alternative Splicing - VIZBI

The Central Dogma

[Figure: DNA → RNA → Protein, with the steps labelled Transcription (DNA to RNA) and Translation (RNA to Protein). Inset: the exons of the pre-mRNA are joined by splicing to give the mRNA.]

Page 3: Visualization of Alternative Splicing - VIZBI

Visualizing a Transcript

[Figure labels: 5′ UTR, Start Codon, coding sequence, Stop Codon, 3′ UTR; Exon 1, Exon 2, Exon 3, Exon 4; Intron]
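The labelled parts of this figure map onto a small data structure. The sketch below is only an illustration, not code from the talk: the `Transcript` class, the half-open coordinate convention, and the toy exon positions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates (an assumed convention)


@dataclass
class Transcript:
    exons: List[Interval]  # sorted, non-overlapping exon intervals
    cds: Interval          # from the start codon to the end of the stop codon

    def introns(self) -> List[Interval]:
        """Introns are the gaps between consecutive exons."""
        return [
            (left_end, right_start)
            for (_, left_end), (right_start, _) in zip(self.exons, self.exons[1:])
        ]

    def render(self, scale: int = 50) -> str:
        """One character per `scale` bases: '#' coding exon, '=' UTR exon, '-' intron."""
        start, end = self.exons[0][0], self.exons[-1][1]
        chars = []
        for pos in range(start, end, scale):
            in_exon = any(s <= pos < e for s, e in self.exons)
            if not in_exon:
                chars.append("-")
            elif self.cds[0] <= pos < self.cds[1]:
                chars.append("#")
            else:
                chars.append("=")
        return "".join(chars)


# Hypothetical four-exon transcript, mirroring the slide's Exon 1..Exon 4 layout.
toy = Transcript(
    exons=[(0, 100), (200, 300), (400, 500), (600, 700)],
    cds=(50, 650),
)

if __name__ == "__main__":
    print(toy.introns())
    print(toy.render(50))
```

Storing only exons and the coding interval is a deliberate choice here: introns and UTRs are derived rather than stored, so they cannot drift out of sync with the exon list.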

Page 4: Visualization of Alternative Splicing - VIZBI

The Central Dogma

[Figure: DNA → RNA → Protein, with the steps labelled Transcription and Translation. Inset: the exons of the pre-mRNA are joined by alternative splicing to give the mRNA.]

Page 5: Visualization of Alternative Splicing - VIZBI

Visualizing two Transcript...

Page 6: Visualization of Alternative Splicing - VIZBI

Alternative Splicing Significantly Increases Complexity

• >90% of human genes alternatively spliced (Pan et al 2008; Wang et al 2008)

• Mostly tissue-dependent alternative splicing (Pan et al 2008)

• 15%-50% of complex human disease mutations affect alternative splicing (Wang et al 2007)

Organism | #Genes | % With Multi Exons | % With AS
S. cerevisiae | 6K | 3% | 0.002%
C. elegans | 19K | |
D. melanogaster | 14K | |
H. sapiens | 20K-25K | >90% | ~90%

types (Boise et al. 1993; Clarke et al. 1995; Minn et al. 1996). High Bcl-x(L)/Bcl-x(s) ratios are observed in a variety of cancer types, consistent with an important role for the long isoform in cancer cell survival (Xerri et al. 1996; Olopade et al. 1997; Takehara et al. 2001). This likely reflects a reduction in Bcl-x(s) in cancer, as an examination of endometrial carcinomas showed a down-regulation of Bcl-x(s) mRNA compared with normal endometrial tissue, with the extent of Bcl-x(s) down-regulation correlated with clinical staging (Ma et al. 2010). Illustrating the importance of this AS event to cancer cells, an antisense oligonucleotide complementary to the Bcl-x(L) isoform 5′ splice site shifted splicing of Bcl-x to the Bcl-x(s) isoform, and was sufficient to induce apoptosis in a prostate cancer (PCa) cell line (Mercatante et al. 2002).

A variety of signals and effectors that regulate Bcl-x AS have been identified. Prompted by reports that Sam68 overexpression can result in apoptosis in NIH-3T3 cells (Babic et al. 2006), and their observation that Sam68 interacts with the Bcl-x mRNA, Sette and colleagues (Paronetto et al. 2007) investigated a possible role for this protein in Bcl-x splicing. Overexpression of Sam68 in 293 cells resulted in an increase in Bcl-x(s) isoform, consistent with the proapoptotic effects observed upon Sam68 overexpression (Taylor et al. 2004; Paronetto et al. 2007). Interestingly, Sam68 phosphorylation by the Src-like tyrosine kinase Fyn reversed the effects of Sam68 overexpression, switching splicing of Bcl-x back to the long isoform. This result indicates that, in the presence of Sam68, growth factors or other signals that activate Fyn or other tyrosine kinases are necessary to maintain expression of Bcl-x(L), providing an additional connection between mitogenic signaling pathways and regulation of apoptosis.

While mitogenic signaling pathways have been implicated in maintaining high levels of Bcl-x(L), a proapoptotic pathway initiated by the sphingolipid ceramide has been suggested to promote Bcl-x(s) splicing (Chalfant et al. 2002; Pettus et al. 2002). Ceramide activates the serine/threonine phosphatases PP1 and PP2A, and treatment of cells with an inhibitor of PP1 negated the effects of ceramide on Bcl-x splicing (Chalfant et al. 2002). Ceramide-induced activation of PP1 has been shown to result in widespread dephosphorylation of SR proteins, although no direct connection between this event and Bcl-x splicing has been established (Chalfant et al. 2001). The RBP SAP155, best known as a member of the SF3b complex that associates with the U2 snRNP, has been shown to bind to a ceramide-responsive element present in the Bcl-x pre-mRNA and is necessary for the effects of ceramide on Bcl-x splicing (Massiello et al. 2006). Incidentally, SAP155 is a known target of PP1/PP2A prior to the second step of splicing (Shi et al. 2006). It is tempting to speculate that dephosphorylation of SAP155 by PP1 has a role in regulating Bcl-x splicing. In a separate pathway, expression of the transcription factor E2F1, which can promote apoptosis, resulted in an increase in Bcl-x(s) (Merdzhanova et al. 2008). Depletion of the SR protein SRSF2 (formerly SC35) (Manley and Krainer 2010), which is specifically induced by E2F1, reversed this effect, implicating SRSF2 in regulation of Bcl-x splicing.

Caspase-2

Caspase-2 is a highly conserved cysteine protease first identified as a mammalian homolog of the CED-3 caspase in Caenorhabditis elegans (Wang et al. 1994). While it was first implicated in apoptosis on the basis of its similarity to CED-3, it has since been shown to act as a tumor suppressor that participates in a wide variety of cellular processes (Ho et al. 2009; Kumar 2009). Caspase-2 mRNA is alternatively processed to produce multiple isoforms (Fig. 2B). The predominant form in most tissues, caspase-2L, produces a full-length protein with proapoptotic properties (Wang et al. 1994). However, in certain

Figure 2. Schematic representation of the AS events discussed in this review. In each case, isoforms that are up-regulated in cancer or that are otherwise shown to have positive effects on growth, survival, or invasive behavior are shown at the top of each diagram.

Alternative splicing in cancer

GENES & DEVELOPMENT 2345

Downloaded from genesdev.cshlp.org on December 29, 2010 - Published by Cold Spring Harbor Laboratory Press

Example: Bcl-2: (Boise et al. 1993) 5’ splice variant

Page 7: Visualization of Alternative Splicing - VIZBI

Alternative Splicing Types

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceeding previous sets of tissue-specific alternative splicing events that have typically numbered in the hundreds to low thousands [6–9, 18, 19]. Tissue-regulated skipped exon and MXE events are listed in Supplementary Tables 5 and 6, respectively. Binning events by expression level commonly yielded sigmoid curves for the fraction of tissue-regulated events of each type, enabling estimation of the true frequency of tissue regulation for each event type (Supplementary Figs 5 and 6). These estimates, ranging from 52% to 80% (Fig. 2), indicated that most alternative splicing events are regulated between tissues, providing an important element of support for the hypothesis that alternative splicing is a principal contributor to the evolution of phenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation between individuals in comparison to tissue-regulated alternative splicing, the correlations among the vectors of inclusion ratios for all expressed skipped exons between pairs of samples were determined (Fig. 3); this was performed similarly for other event types (not shown). In this analysis, strong clustering of the six cerebellar cortex samples was observed, with generally higher correlations among these samples than between pairs representing distinct tissues. Strong clustering of the five cell lines was also observed. This probably results from a combination of factors, including the common mammary epithelial origin of the cell lines studied, similar adaptations to culture conditions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression between individuals was also addressed by determining the number of differentially expressed exons among the six cerebellar cortex samples. Using the same approach as in Fig. 2, between ~10% and 30% of alternative transcript events showed individual-specific variation, depending on the event type (Supplementary Fig. 7), providing updated estimates of the scope of mRNA isoform variation between individuals [16]. These numbers are higher than estimates based on microarray analyses [20], but are in general agreement with an integrated analysis of multiple data types that estimated that ~21% of alternatively spliced genes are affected by polymorphisms that alter the relative abundances of alternative isoforms [17]. However, these frequencies are still below the 47–74% of events that showed variation among the ten tissues (Fig. 2), and approximately twofold to threefold less than the frequencies observed in comparisons among subsets of six tissues (Supplementary Fig. 7), indicating that, although inter-individual variation is fairly common, it is still substantially less frequent than variation between tissues. Thus, most of the differences observed between tissue samples are likely to represent tissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assessment of both subtle and switch-like alternative splicing events. By

[Figure 2 legend: Constitutive exon or region; Alternative exon or extension; Body read; Junction read; Inclusive/extended isoform; Exclusive isoform; Both isoforms; pA = Polyadenylation site]

Alternative transcript events | Total events (×10^3) | Number detected (×10^3) | Both isoforms detected | Number tissue-regulated | % Tissue-regulated (observed) | % Tissue-regulated (estimated)
Skipped exon | 37 | 35 | 10,436 | 6,822 | 65 | 72
Retained intron | 1 | 1 | 167 | 96 | 57 | 71
Alternative 5′ splice site (A5SS) | 15 | 15 | 2,168 | 1,386 | 64 | 72
Alternative 3′ splice site (A3SS) | 17 | 16 | 4,181 | 2,655 | 64 | 74
Mutually exclusive exon (MXE) | 4 | 4 | 167 | 95 | 57 | 66
Alternative first exon (AFE) | 14 | 13 | 10,281 | 5,311 | 52 | 63
Alternative last exon (ALE) | 9 | 8 | 5,246 | 2,491 | 47 | 52
Tandem 3′ UTRs | 7 | 7 | 5,136 | 3,801 | 74 | 80
Total | 105 | 100 | 37,782 | 22,657 | 60 | 68

Figure 2 | Pervasive tissue-specific regulation of alternative mRNA isoforms. Rows represent the eight different alternative transcript event types diagrammed. Mapped reads supporting expression of upper isoform, lower isoform or both isoforms are shown in blue, red and grey, respectively. Columns 1–4 show the numbers of events of each type: (1) supported by cDNA and/or EST data; (2) with ≥1 isoform supported by mRNA-Seq reads; (3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible technical variation [10]). Columns 5 and 6 show: (5) the observed percentage of events with both isoforms detected that were observed to be tissue-regulated; and (6) the estimated true percentage of tissue-regulated isoforms after correction for power to detect tissue bias (Supplementary Fig. 6) and for the FDR. For some event types, ‘common reads’ (grey bars) were used in lieu of (for tandem 3′ UTR events) or in addition to ‘exclusion’ reads for detection of changes in isoform levels between tissues.
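The caption defines column 5 as the share of events with both isoforms detected (column 3) that were called tissue-regulated (column 4). The short check below recomputes that percentage from the counts in the figure. It is only a sketch: the row labels assume the figure's top-to-bottom order, and the names `FIG2_ROWS`, `FIG2_TOTAL` and `observed_percent` are made up for this example.

```python
# Counts transcribed from Fig. 2 of Wang et al. (Nature 2008) as shown on the slide.
# Each row: (both isoforms detected, number tissue-regulated, reported % observed).
# Row labels are an assumption based on the figure's top-to-bottom order.
FIG2_ROWS = {
    "Skipped exon": (10436, 6822, 65),
    "Retained intron": (167, 96, 57),
    "A5SS": (2168, 1386, 64),
    "A3SS": (4181, 2655, 64),
    "MXE": (167, 95, 57),
    "AFE": (10281, 5311, 52),
    "ALE": (5246, 2491, 47),
    "Tandem 3' UTRs": (5136, 3801, 74),
}
FIG2_TOTAL = (37782, 22657, 60)


def observed_percent(both_detected: int, tissue_regulated: int) -> int:
    """Percentage of events with both isoforms detected that are tissue-regulated."""
    return round(100 * tissue_regulated / both_detected)


if __name__ == "__main__":
    for name, (detected, regulated, reported) in FIG2_ROWS.items():
        computed = observed_percent(detected, regulated)
        print(f"{name:16s} computed {computed:3d}%  reported {reported}%")
    # The per-type counts should add up to the figure's Total row.
    print(sum(d for d, _, _ in FIG2_ROWS.values()), FIG2_TOTAL[0])
    print(sum(r for _, r, _ in FIG2_ROWS.values()), FIG2_TOTAL[1])
```

The estimated percentages in column 6 cannot be recomputed this way, because they depend on the power and FDR corrections described in the paper's supplementary material.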

ARTICLES NATURE | Vol 456 | 27 November 2008

472 ©2008 Macmillan Publishers Limited. All rights reserved

Alternative isoform regulation in human tissue transcriptomes Wang et al. Nature 2008

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceedingprevious sets of tissue-specific alternative splicing events that havetypically numbered in the hundreds to low thousands6–9,18,19. Tissue-regulated skipped exon and MXE events are listed in SupplementaryTables 5 and 6, respectively. Binning events by expression level com-monly yielded sigmoid curves for the fraction of tissue-regulatedevents of each type, enabling estimation of the true frequency oftissue regulation for each event type (Supplementary Figs 5 and 6).These estimates, ranging from 52% to 80% (Fig. 2), indicated thatmost alternative splicing events are regulated between tissues, pro-viding an important element of support for the hypothesis thatalternative splicing is a principal contributor to the evolution ofphenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation betweenindividuals in comparison to tissue-regulated alternative splicing, thecorrelations among the vectors of inclusion ratios for all expressedskipped exons between pairs of samples were determined (Fig. 3); thiswas performed similarly for other event types (not shown). In thisanalysis, strong clustering of the six cerebellar cortex samples wasobserved, with generally higher correlations among these samplesthan between pairs representing distinct tissues. Strong clusteringof the five cell lines was also observed. This probably results from acombination of factors, including the common mammary epithelial

origin of the cell lines studied, similar adaptations to culture condi-tions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression betweenindividuals was also addressed by determining the number of differ-entially expressed exons among the six cerebellar cortex samples.Using the same approach as in Fig. 2, between ,10% and 30% ofalternative transcript events showed individual-specific variation,depending on the event type (Supplementary Fig. 7), providingupdated estimates of the scope of mRNA isoform variation betweenindividuals16. These numbers are higher than estimates based onmicroarray analyses20, but are in general agreement with an inte-grated analysis of multiple data types that estimated that ,21% ofalternatively spliced genes are affected by polymorphisms that alterthe relative abundances of alternative isoforms17. However, thesefrequencies are still below the 47–74% of events that showed vari-ation among the ten tissues (Fig. 2), and approximately twofold tothreefold less than the frequencies observed in comparisons amongsubsets of six tissues (Supplementary Fig. 7), indicating that,although inter-individual variation is fairly common, it is still sub-stantially less frequent than variation between tissues. Thus, most ofthe differences observed between tissue samples are likely to representtissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assess-ment of both subtle and switch-like alternative splicing events. By

Constitutive exon or region

Alternative exon or extension

Body read Junction read

Inclusive/extended isoform Exclusive isoform

Alternative transcript events

Total

Both isoforms

Skipped exon

Retained intron

Tandem 3′ UTRs

Alternative 5′ splicesite (A5SS)

Alternative 3′ splicesite (A3SS)

Mutually exclusive exon (MXE)

Alternative firstexon (AFE)

105

37

1

15

17

4

14

9

7

Totalevents(×103)

68

72

71

72

74

66

63

52

80

% Tissue-regulated

(estimated)

100

35

1

15

16

4

13

8

7

Numberdetected

(×103)

37,782

10,436

167

2,168

4,181

167

10,281

5,246

5,136

Bothisoformsdetected

60

65

57

64

64

57

52

47

74

% Tissue-regulated(observed)

22,657

6,822

96

1,386

2,655

95

5,311

2,491

3,801

Numbertissue-

regulated

pA

pA

Alternative lastexon (ALE)

pA Polyadenylation site

Figure 2 | Pervasive tissue-specific regulation of alternative mRNAisoforms. Rows represent the eight different alternative transcript eventtypes diagrammed. Mapped reads supporting expression of upper isoform,lower isoform or both isoforms are shown in blue, red and grey, respectively.Columns 1–4 show the numbers of events of each type: (1) supported bycDNAand/or EST data; (2) with$1 isoform supported bymRNA-Seq reads;(3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible

technical variation10). Columns 5 and 6 show: (5) the observed percentage ofeventswith both isoforms detected that were observed to be tissue-regulated;and (6) the estimated true percentage of tissue-regulated isoforms aftercorrection for power to detect tissue bias (Supplementary Fig. 6) and for theFDR. For some event types, ‘common reads’ (grey bars) were used in lieu of(for tandem 39UTR events) or in addition to ‘exclusion’ reads for detectionof changes in isoform levels between tissues.

ARTICLES NATURE |Vol 456 |27 November 2008

472 ©2008 Macmillan Publishers Limited. All rights reserved

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceedingprevious sets of tissue-specific alternative splicing events that havetypically numbered in the hundreds to low thousands6–9,18,19. Tissue-regulated skipped exon and MXE events are listed in SupplementaryTables 5 and 6, respectively. Binning events by expression level com-monly yielded sigmoid curves for the fraction of tissue-regulatedevents of each type, enabling estimation of the true frequency oftissue regulation for each event type (Supplementary Figs 5 and 6).These estimates, ranging from 52% to 80% (Fig. 2), indicated thatmost alternative splicing events are regulated between tissues, pro-viding an important element of support for the hypothesis thatalternative splicing is a principal contributor to the evolution ofphenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation betweenindividuals in comparison to tissue-regulated alternative splicing, thecorrelations among the vectors of inclusion ratios for all expressedskipped exons between pairs of samples were determined (Fig. 3); thiswas performed similarly for other event types (not shown). In thisanalysis, strong clustering of the six cerebellar cortex samples wasobserved, with generally higher correlations among these samplesthan between pairs representing distinct tissues. Strong clusteringof the five cell lines was also observed. This probably results from acombination of factors, including the common mammary epithelial

origin of the cell lines studied, similar adaptations to culture condi-tions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression betweenindividuals was also addressed by determining the number of differ-entially expressed exons among the six cerebellar cortex samples.Using the same approach as in Fig. 2, between ,10% and 30% ofalternative transcript events showed individual-specific variation,depending on the event type (Supplementary Fig. 7), providingupdated estimates of the scope of mRNA isoform variation betweenindividuals16. These numbers are higher than estimates based onmicroarray analyses20, but are in general agreement with an inte-grated analysis of multiple data types that estimated that ,21% ofalternatively spliced genes are affected by polymorphisms that alterthe relative abundances of alternative isoforms17. However, thesefrequencies are still below the 47–74% of events that showed vari-ation among the ten tissues (Fig. 2), and approximately twofold tothreefold less than the frequencies observed in comparisons amongsubsets of six tissues (Supplementary Fig. 7), indicating that,although inter-individual variation is fairly common, it is still sub-stantially less frequent than variation between tissues. Thus, most ofthe differences observed between tissue samples are likely to representtissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assess-ment of both subtle and switch-like alternative splicing events. By

Constitutive exon or region

Alternative exon or extension

Body read Junction read

Inclusive/extended isoform Exclusive isoform

Alternative transcript events

Total

Both isoforms

Skipped exon

Retained intron

Tandem 3′ UTRs

Alternative 5′ splicesite (A5SS)

Alternative 3′ splicesite (A3SS)

Mutually exclusive exon (MXE)

Alternative firstexon (AFE)

105

37

1

15

17

4

14

9

7

Totalevents(×103)

68

72

71

72

74

66

63

52

80

% Tissue-regulated

(estimated)

100

35

1

15

16

4

13

8

7

Numberdetected

(×103)

37,782

10,436

167

2,168

4,181

167

10,281

5,246

5,136

Bothisoformsdetected

60

65

57

64

64

57

52

47

74

% Tissue-regulated(observed)

22,657

6,822

96

1,386

2,655

95

5,311

2,491

3,801

Numbertissue-

regulated

pA

pA

Alternative lastexon (ALE)

pA Polyadenylation site

Figure 2 | Pervasive tissue-specific regulation of alternative mRNAisoforms. Rows represent the eight different alternative transcript eventtypes diagrammed. Mapped reads supporting expression of upper isoform,lower isoform or both isoforms are shown in blue, red and grey, respectively.Columns 1–4 show the numbers of events of each type: (1) supported bycDNAand/or EST data; (2) with$1 isoform supported bymRNA-Seq reads;(3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible

technical variation10). Columns 5 and 6 show: (5) the observed percentage ofeventswith both isoforms detected that were observed to be tissue-regulated;and (6) the estimated true percentage of tissue-regulated isoforms aftercorrection for power to detect tissue bias (Supplementary Fig. 6) and for theFDR. For some event types, ‘common reads’ (grey bars) were used in lieu of(for tandem 39UTR events) or in addition to ‘exclusion’ reads for detectionof changes in isoform levels between tissues.

ARTICLES NATURE |Vol 456 |27 November 2008

472 ©2008 Macmillan Publishers Limited. All rights reserved

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceedingprevious sets of tissue-specific alternative splicing events that havetypically numbered in the hundreds to low thousands6–9,18,19. Tissue-regulated skipped exon and MXE events are listed in SupplementaryTables 5 and 6, respectively. Binning events by expression level com-monly yielded sigmoid curves for the fraction of tissue-regulatedevents of each type, enabling estimation of the true frequency oftissue regulation for each event type (Supplementary Figs 5 and 6).These estimates, ranging from 52% to 80% (Fig. 2), indicated thatmost alternative splicing events are regulated between tissues, pro-viding an important element of support for the hypothesis thatalternative splicing is a principal contributor to the evolution ofphenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation betweenindividuals in comparison to tissue-regulated alternative splicing, thecorrelations among the vectors of inclusion ratios for all expressedskipped exons between pairs of samples were determined (Fig. 3); thiswas performed similarly for other event types (not shown). In thisanalysis, strong clustering of the six cerebellar cortex samples wasobserved, with generally higher correlations among these samplesthan between pairs representing distinct tissues. Strong clusteringof the five cell lines was also observed. This probably results from acombination of factors, including the common mammary epithelial

origin of the cell lines studied, similar adaptations to culture condi-tions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression betweenindividuals was also addressed by determining the number of differ-entially expressed exons among the six cerebellar cortex samples.Using the same approach as in Fig. 2, between ,10% and 30% ofalternative transcript events showed individual-specific variation,depending on the event type (Supplementary Fig. 7), providingupdated estimates of the scope of mRNA isoform variation betweenindividuals16. These numbers are higher than estimates based onmicroarray analyses20, but are in general agreement with an inte-grated analysis of multiple data types that estimated that ,21% ofalternatively spliced genes are affected by polymorphisms that alterthe relative abundances of alternative isoforms17. However, thesefrequencies are still below the 47–74% of events that showed vari-ation among the ten tissues (Fig. 2), and approximately twofold tothreefold less than the frequencies observed in comparisons amongsubsets of six tissues (Supplementary Fig. 7), indicating that,although inter-individual variation is fairly common, it is still sub-stantially less frequent than variation between tissues. Thus, most ofthe differences observed between tissue samples are likely to representtissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assess-ment of both subtle and switch-like alternative splicing events. By

Constitutive exon or region

Alternative exon or extension

Body read Junction read

Inclusive/extended isoform Exclusive isoform

Alternative transcript events

Total

Both isoforms

Skipped exon

Retained intron

Tandem 3′ UTRs

Alternative 5′ splicesite (A5SS)

Alternative 3′ splicesite (A3SS)

Mutually exclusive exon (MXE)

Alternative firstexon (AFE)

105

37

1

15

17

4

14

9

7

Totalevents(×103)

68

72

71

72

74

66

63

52

80

% Tissue-regulated

(estimated)

100

35

1

15

16

4

13

8

7

Numberdetected

(×103)

37,782

10,436

167

2,168

4,181

167

10,281

5,246

5,136

Bothisoformsdetected

60

65

57

64

64

57

52

47

74

% Tissue-regulated(observed)

22,657

6,822

96

1,386

2,655

95

5,311

2,491

3,801

Numbertissue-

regulated

pA

pA

Alternative lastexon (ALE)

pA Polyadenylation site

Figure 2 | Pervasive tissue-specific regulation of alternative mRNAisoforms. Rows represent the eight different alternative transcript eventtypes diagrammed. Mapped reads supporting expression of upper isoform,lower isoform or both isoforms are shown in blue, red and grey, respectively.Columns 1–4 show the numbers of events of each type: (1) supported bycDNAand/or EST data; (2) with$1 isoform supported bymRNA-Seq reads;(3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible

technical variation10). Columns 5 and 6 show: (5) the observed percentage ofeventswith both isoforms detected that were observed to be tissue-regulated;and (6) the estimated true percentage of tissue-regulated isoforms aftercorrection for power to detect tissue bias (Supplementary Fig. 6) and for theFDR. For some event types, ‘common reads’ (grey bars) were used in lieu of(for tandem 39UTR events) or in addition to ‘exclusion’ reads for detectionof changes in isoform levels between tissues.

ARTICLES NATURE |Vol 456 |27 November 2008

472 ©2008 Macmillan Publishers Limited. All rights reserved

(Fig. 2 and Supplementary Table 4). In all, a set of over 22,000 tissue-specific alternative transcript events was identified, far exceedingprevious sets of tissue-specific alternative splicing events that havetypically numbered in the hundreds to low thousands6–9,18,19. Tissue-regulated skipped exon and MXE events are listed in SupplementaryTables 5 and 6, respectively. Binning events by expression level com-monly yielded sigmoid curves for the fraction of tissue-regulatedevents of each type, enabling estimation of the true frequency oftissue regulation for each event type (Supplementary Figs 5 and 6).These estimates, ranging from 52% to 80% (Fig. 2), indicated thatmost alternative splicing events are regulated between tissues, pro-viding an important element of support for the hypothesis thatalternative splicing is a principal contributor to the evolution ofphenotypic complexity in mammals.

Individual-specific isoform expression

To assess the extent of alternative splicing isoform variation betweenindividuals in comparison to tissue-regulated alternative splicing, thecorrelations among the vectors of inclusion ratios for all expressedskipped exons between pairs of samples were determined (Fig. 3); thiswas performed similarly for other event types (not shown). In thisanalysis, strong clustering of the six cerebellar cortex samples wasobserved, with generally higher correlations among these samplesthan between pairs representing distinct tissues. Strong clusteringof the five cell lines was also observed. This probably results from acombination of factors, including the common mammary epithelial

origin of the cell lines studied, similar adaptations to culture condi-tions, and the high diversity of the tissues chosen.

The extent of variation in alternative isoform expression betweenindividuals was also addressed by determining the number of differ-entially expressed exons among the six cerebellar cortex samples.Using the same approach as in Fig. 2, between ,10% and 30% ofalternative transcript events showed individual-specific variation,depending on the event type (Supplementary Fig. 7), providingupdated estimates of the scope of mRNA isoform variation betweenindividuals16. These numbers are higher than estimates based onmicroarray analyses20, but are in general agreement with an inte-grated analysis of multiple data types that estimated that ,21% ofalternatively spliced genes are affected by polymorphisms that alterthe relative abundances of alternative isoforms17. However, thesefrequencies are still below the 47–74% of events that showed vari-ation among the ten tissues (Fig. 2), and approximately twofold tothreefold less than the frequencies observed in comparisons amongsubsets of six tissues (Supplementary Fig. 7), indicating that,although inter-individual variation is fairly common, it is still sub-stantially less frequent than variation between tissues. Thus, most ofthe differences observed between tissue samples are likely to representtissue-specific rather than individual-specific variation.

Switch-like alternatively spliced exons

The quantitative nature of the mRNA-Seq approach allowed assess-ment of both subtle and switch-like alternative splicing events. By

Constitutive exon or region

Alternative exon or extension

Body read Junction read

Inclusive/extended isoform Exclusive isoform

Alternative transcript events

Total

Both isoforms

Skipped exon

Retained intron

Tandem 3′ UTRs

Alternative 5′ splicesite (A5SS)

Alternative 3′ splicesite (A3SS)

Mutually exclusive exon (MXE)

Alternative firstexon (AFE)

105

37

1

15

17

4

14

9

7

Totalevents(×103)

68

72

71

72

74

66

63

52

80

% Tissue-regulated

(estimated)

100

35

1

15

16

4

13

8

7

Numberdetected

(×103)

37,782

10,436

167

2,168

4,181

167

10,281

5,246

5,136

Bothisoformsdetected

60

65

57

64

64

57

52

47

74

% Tissue-regulated(observed)

22,657

6,822

96

1,386

2,655

95

5,311

2,491

3,801

Numbertissue-

regulated

pA

pA

Alternative lastexon (ALE)

pA Polyadenylation site

Figure 2 | Pervasive tissue-specific regulation of alternative mRNA isoforms. Rows represent the eight different alternative transcript event types diagrammed. Mapped reads supporting expression of upper isoform, lower isoform or both isoforms are shown in blue, red and grey, respectively. Columns 1–4 show the numbers of events of each type: (1) supported by cDNA and/or EST data; (2) with ≥1 isoform supported by mRNA-Seq reads; (3) with both isoforms supported by reads; and (4) events detected as tissue-regulated (Fisher’s exact test) at an FDR of 5% (assuming negligible technical variation [10]). Columns 5 and 6 show: (5) the observed percentage of events with both isoforms detected that were observed to be tissue-regulated; and (6) the estimated true percentage of tissue-regulated isoforms after correction for power to detect tissue bias (Supplementary Fig. 6) and for the FDR. For some event types, ‘common reads’ (grey bars) were used in lieu of (for tandem 3′ UTR events) or in addition to ‘exclusion’ reads for detection of changes in isoform levels between tissues.
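The caption's tissue-regulation calls rest on Fisher's exact test. A minimal sketch of such a test on a 2×2 table of read counts follows, assuming (the caption does not state this layout) that rows are two tissues and columns are reads supporting the inclusive and exclusive isoforms. The function name and the read counts are hypothetical, and the FDR correction across events is not shown.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: (inclusion reads, exclusion reads) for one exon in two tissues.
brain = (40, 10)
liver = (12, 38)
p_value = fisher_exact_two_sided(brain[0], brain[1], liver[0], liver[1])
print(f"p = {p_value:.3g}")
```

In a genome-wide setting one such p-value would be computed per event and tissue comparison, with a multiple-testing procedure applied afterwards to control the FDR.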

ARTICLES NATURE |Vol 456 |27 November 2008

472 ©2008 Macmillan Publishers Limited. All rights reserved

• Sources:
  • EST/cDNA libraries
  • Microarrays (exon/junction probes)
  • High Throughput Sequencing

Page 8: Visualization of Alternative Splicing - VIZBI

A Splicing Graph of a Gene

Compact representation
More than just the list of exons in the gene
• May represent the only knowledge we have (HTS data)
• Longer reads may specify regional constraints

Isoforms are under-specified (#paths is exponential):
• Where are the starts? Ends?
• Which combinations actually exist?
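To make the "#paths is exponential" point concrete, here is a small sketch that treats a splicing graph as a directed acyclic graph (exons as nodes, splice junctions as edges) and counts the start-to-end paths, each of which is a candidate isoform. The exon names, the toy gene, and the helper names are hypothetical.

```python
from functools import lru_cache

# Hypothetical gene: each exon maps to the downstream exons it is spliced to.
junctions = {
    "E1": ["E2", "E3"],
    "E2": ["E3", "E4"],
    "E3": ["E4"],
    "E4": [],
}

def count_paths(graph, start, end):
    """Count distinct start-to-end paths (candidate isoforms) in a splicing graph."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == end:
            return 1
        return sum(walk(nxt) for nxt in graph[node])
    return walk(start)

def fully_connected(n_exons):
    """Worst case: every exon can be spliced to every downstream exon."""
    names = [f"E{i}" for i in range(1, n_exons + 1)]
    return {names[i]: names[i + 1:] for i in range(n_exons)}

print(count_paths(junctions, "E1", "E4"))             # 3
print(count_paths(fully_connected(10), "E1", "E10"))  # 256, i.e. 2 ** (10 - 2)
```

The graph itself stays small (nodes and edges grow with the number of exons and junctions), while the number of paths it implies can double with each added internal exon, which is why the graph alone does not say which combinations actually exist.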

“75 individuals, single tissue - 150,000 new splice junctions, low abundance isoforms” - Pickrell et al., Dec 2010

What are the actual transcripts?
How much support does each splice variant have?

In what context?

Page 9: Visualization of Alternative Splicing - VIZBI

The Genome Browser - A Real Life Example (RBM39)
Where is the problem?

Page 10: Visualization of Alternative Splicing - VIZBI

Argo

Page 11: Visualization of Alternative Splicing - VIZBI

Integrative Genomics Viewer (IGV)

Page 12: Visualization of Alternative Splicing - VIZBI

What are we missing for visualizing alternative splicing?

Relational paradigm
• Different entities (genes, exons, introns)
• Relational information

Representation of isoform-specific data:
• Coding/non-coding, PTC, validated transcript end/start, canonical splice junctions (gt/c-ag) etc.
• Type of experimental support
• Level of experimental support
• Context information
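One way to picture the relational, isoform-specific representation listed above is as a handful of linked record types plus a filter over them. The sketch below is only an illustration under assumed names: the classes, fields, and example values are hypothetical and do not come from any existing browser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exon:
    exon_id: str
    start: int
    end: int

@dataclass(frozen=True)
class Junction:
    donor_exon: str        # exon upstream of the intron
    acceptor_exon: str     # exon downstream of the intron
    intron_start: str      # first two intronic bases
    intron_end: str        # last two intronic bases
    support_type: str      # e.g. "EST", "microarray", "HTS"
    read_support: int      # level of experimental support

    def is_canonical(self):
        # Canonical junctions as on the slide: gt/c-ag.
        return self.intron_start.lower() in ("gt", "gc") and self.intron_end.lower() == "ag"

@dataclass(frozen=True)
class Isoform:
    isoform_id: str
    exon_ids: tuple        # ordered references to Exon records
    coding: bool
    has_ptc: bool          # premature termination codon

def select_junctions(junctions, min_reads=0, support_type=None, canonical_only=False):
    """Filter junctions by level and type of support, and optionally by canonical status."""
    return [
        j for j in junctions
        if j.read_support >= min_reads
        and (support_type is None or j.support_type == support_type)
        and (not canonical_only or j.is_canonical())
    ]

junctions = [
    Junction("E1", "E2", "GT", "AG", "HTS", 120),
    Junction("E1", "E3", "GC", "AG", "EST", 4),
    Junction("E2", "E3", "AT", "AC", "HTS", 2),
]
well_supported = select_junctions(junctions, min_reads=10, canonical_only=True)
print([(j.donor_exon, j.acceptor_exon) for j in well_supported])
```

Because exons, junctions, and isoforms are separate entities that reference one another, a viewer built on such a model could filter or restyle by any attribute rather than only by position on the sequence.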

“The data displayed....features on sequence. Simply put, a sequence is a string, a feature is a location on a string. Features are divided into logical and display groups called ‘tracks.’” (the Argo manual)

WYSIWYG = What You See Is What You Get
YCSWYDG = You Can’t See What You Don’t Get

Page 13: Visualization of Alternative Splicing - VIZBI

What are we missing for visualizing alternative splicing?

Appropriate filters/control knobs (discrete/real)
Alternative representations for transcript/data entities “Switch entity focus”
A matching rendering engine
A relational DB
A flexible interface definition
A “Visual Proxy” design pattern

AceView (NCBI)

Page 14: Visualization of Alternative Splicing - VIZBI

What isoform visualization should look like

Appropriate filters/control knobs (discrete/real)
Alternative representations for transcript/data entities “Switch entity focus”
A matching rendering engine
A relational DB
A flexible interface definition
A “Visual Proxy” design pattern

2D/3D display
Seamless natural language voice control
Gesture control
Cool & Fun

Page 15: Visualization of Alternative Splicing - VIZBI

Alternative Splicing Regulation

intron structures can be obtained, permitting development of statistical models of 5′ss and 3′ss motifs that capture second-order statistical interactions between positions to more accurately predict splice site locations (Yeo and Burge 2004). However, only a few dozen mammalian BPSs have been mapped. The limited size of the available BPS data set and the low information content of this motif make it difficult to derive a reliable sequence model to predict BPSs in introns. A recent study used comparative genomics to improve BPS prediction (Kol et al. 2005), and more human BPS sequences were identified recently by sequencing lariat RT-PCR products (Gao et al. 2008).

EXON DEFINITION AS A KEY STEP IN MAMMALIAN SPLICING REGULATION

A typical human gene contains relatively short exons (typically, 50–250 base pairs [bp] in length) separated by much larger introns (typically, hundreds to thousands of base pairs or more) that on average account for >90% of the primary transcript. This transcript geometry, and the predominant exon-skipping phenotype of splice site mutations, are consistent with the idea that in mammals splice sites are predominantly recognized in pairs across the exon through “exon definition” (Robberson et al. 1990; Nakai and Sakamoto 1994; Sterner et al. 1996). Exon definition involves initial interaction across the exon between factors recognizing the 5′ss and the upstream 3′ss, whereas in the alternative model, intron definition, interactions occur first across the intron between factors recognizing the 5′ss and the downstream 3′ss (for review, see Berget 1995). Recent analyses of the coevolution of the 5′ss and 3′ss have detected predominant cross-exon interactions in human and mouse, but cross-intron interactions in invertebrates, plants, and fungi, with pufferfishes representing an intermediate state, supporting the primacy of exon definition in mammals and intron definition in most other metazoans (Xiao et al. 2007). Because exon definition occurs early during splicing and may lead to commitment of the exon to splicing, this step is critical for splicing regulation and specificity. For example, polypyrimidine tract binding protein (PTB/hnRNP I) can inhibit exon definition complex formation by binding to an ESS sequence causing skipping of Fas exon 6 (Izquierdo et al. 2005); this same factor can also inhibit the spliceosome assembly across introns (intron definition) in repressing splicing of the c-src N1 exon (Sharma et al. 2005). Following initial splice site recognition in exon definition, a series of sequential structural rearrangements is required to activate the spliceosome, and commitment to alternative splice site pairing may occur after initial splice site recognition and E complex formation (Lim and Hertel 2004).

SPLICING REGULATORY ELEMENTS AND ASSOCIATED FACTORS

Monte Carlo simulations inserting artificial motifs with varying information content into transcripts in place of natural splice sites have been used to estimate that the core human splice site motifs contain only about half of the information required to accurately define exon/intron boundaries, even considering only short introns (Lim and

FIGURE 1. (A) Major forms of alternative splicing. In many cases, these common forms can be combined to generate more complicated alternative splicing events. (B) A schematic of regulated splicing. (Open boxes) Exons, (jagged lines) introns, (brackets) splice sites (ss). The consensus motifs of ss are shown in pictogram, and the branch point adenosine is indicated. (Dashed lines) Two alternative splicing pathways, with the middle exon either included or excluded. Splicing is regulated by cis-elements (ESE, ESS, ISS, and ISE) and trans-acting splicing factors (SR proteins, hnRNP, and unknown factors).

Assembling a splicing code

www.rnajournal.org 803


Downloaded from rnajournal.cshlp.org on April 30, 2010 - Published by Cold Spring Harbor Laboratory Press

Can we Predict? Can we Quantitate? Can we exploit to explore?

From a Parts List to an Integrated Splicing Code (Wang and Burge, RNA 2008)
“An important long-term goal is to determine a ‘splicing code’: a set of rules that can predict the splicing pattern of any primary transcript sequence.”

Page 16: Visualization of Alternative Splicing - VIZBI

The First Tissue Specific Alternative Splicing Computational Code

(ESEs and ISEs) and silencers (ESSs and ISSs), which are 6–8 nucleotides long and were identified without regard to possible tissue-dependent roles [24–26], and 314 5–7-nucleotide-long motifs that are conserved in intronic sequences neighbouring alternative exons [27]. There are also 460 region-specific counts of 1–3-nucleotide ‘short motifs’, because such features were previously associated with alternative splicing [28]. We included 57 ‘transcript structure’ features implicated in determining spliced transcript levels, such as exon/intron lengths, regional probabilities of secondary structures [29], and whether exon inclusion/exclusion introduces a premature termination codon (PTC).

In addition to the feature compendium, we constructed a set of ~1,800 ‘unbiased motifs’ by performing a de novo search [10] for each tissue type and direction of splicing change (Supplementary Information 3). Later, we report results obtained with and without using these features.

Assembling a high-information code

Our method seeks a code that is able to predict the splicing patterns of all exons as accurately as possible, based solely on the tissue type and proximal RNA features. The putative features for a particular exon are appended to make a feature vector r, and the corresponding prediction in tissue type c is denoted p(c,r). Like q, p(c,r) consists of probabilities of increased inclusion or exclusion, or no change. The code is combinatorial and accounts for how features cooperate or compete in a given tissue type, by specifying a subset of important features, thresholds on feature values and softmax parameters [30] relating active feature combinations to the prediction p(c,r) (Supplementary Information 4).
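As a rough illustration of how thresholds and softmax parameters could turn active feature combinations into a three-way prediction p(c,r), consider the sketch below. It is not the published model: the data layout, feature names, thresholds, and weights are all hypothetical, and the actual parameterization is given in the paper's Supplementary Information.

```python
import math

CLASSES = ("increased inclusion", "increased exclusion", "no change")

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(tissue, features, code):
    """Return p(c, r) for tissue c and feature vector r (a dict of name -> value).

    `code` maps a tissue to a list of (feature_names, thresholds, weights):
    a combination is active when every named feature reaches its threshold,
    and an active combination adds its weights to the three class scores.
    """
    scores = [0.0, 0.0, 0.0]
    for names, thresholds, weights in code.get(tissue, []):
        if all(features.get(n, 0.0) >= t for n, t in zip(names, thresholds)):
            scores = [s + w for s, w in zip(scores, weights)]
    return dict(zip(CLASSES, softmax(scores)))

# Hypothetical code with one single-feature rule and one two-feature combination.
code = {
    "CNS": [
        (("nova_motif_upstream",), (1.0,), (1.5, -0.5, 0.0)),
        (("ptb_motif_upstream", "short_exon"), (2.0, 1.0), (-1.0, 1.2, 0.0)),
    ]
}
features = {"nova_motif_upstream": 2.0, "ptb_motif_upstream": 0.0, "short_exon": 1.0}
probs = predict("CNS", features, code)
print(probs)
```

Letting a combination fire only when all of its features pass their thresholds is one simple way to express cooperation between features, while opposing weights on different combinations express competition.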

We use a measure of ‘code quality’ that is based on information theory [31] (see Methods). It can be viewed as the amount of information about genome-wide tissue-dependent splicing accounted for by the code. A code quality of zero indicates that the predictions are no better than guessing, whereas a higher code quality indicates improved prediction capability.
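The exact definition of code quality is in the paper's Methods and is not reproduced here. One plausible information-theoretic form with the stated property (zero when predictions are no better than guessing) is the reduction in Kullback-Leibler divergence, in bits, relative to a fixed baseline guess. The sketch below assumes that form; the function names and the example distributions are hypothetical.

```python
import math

def kl_bits(q, p):
    """Kullback-Leibler divergence D(q || p) in bits; p must be positive where q is."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def code_quality(targets, predictions, baseline):
    """Bits gained by the predictions over a fixed baseline guess, summed over cases."""
    return sum(kl_bits(q, baseline) - kl_bits(q, p) for q, p in zip(targets, predictions))

# Hypothetical target patterns q over (increased inclusion, increased exclusion, no change).
targets = [(0.8, 0.1, 0.1), (0.1, 0.1, 0.8)]
baseline = (1 / 3, 1 / 3, 1 / 3)
good_predictions = [(0.7, 0.15, 0.15), (0.15, 0.15, 0.7)]

print(code_quality(targets, [baseline, baseline], baseline))  # guessing scores zero
print(code_quality(targets, good_predictions, baseline))      # informative predictions score above zero
```

Under this assumed form the score is bounded above by the value obtained when every prediction equals its target, so it can be read as the share of available information that the predictions capture.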

To assemble a code, our method recursively selects features from the compendium, while optimizing their thresholds and softmax parameters to maximize code quality (Supplementary Information 5). The code quality increased monotonically during assembly, but diminished gains were observed after 200 features were included (Fig. 1b, c, based on fivefold cross-validation). The final assembled code contained ~200 features. When a code was assembled using the compendium plus the unbiased motifs, the increase in code quality did not exceed 1 s.d. in error (data not shown), but, interestingly, some of the unbiased motifs that did not correspond to any compendium features were selected and subsequently experimentally verified (see later).
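The assembly loop described above can be approximated by plain greedy forward selection: at each step, add whichever remaining feature raises the quality score most, and stop when the gain vanishes. The sketch below shows only that selection skeleton, with a toy scoring function standing in for code quality; it does not optimize thresholds or softmax parameters as the published method does, and all names and values are hypothetical.

```python
import math

def greedy_assemble(candidates, quality, max_features, min_gain=0.0):
    """Greedy forward selection: repeatedly add the feature with the largest quality gain."""
    selected = []
    best = quality(selected)
    history = [best]
    while len(selected) < max_features:
        trials = [(quality(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        new_best, feature = max(trials, key=lambda t: t[0])
        if new_best - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(feature)
        best = new_best
        history.append(best)
    return selected, history

# Toy score with diminishing returns; the per-feature values are made up.
informativeness = {"A": 3, "B": 2, "C": 1, "D": 0}

def toy_quality(features):
    return math.log2(1 + sum(informativeness[f] for f in features))

selected, history = greedy_assemble(list(informativeness), toy_quality, max_features=10)
print(selected)  # ['A', 'B', 'C']
print([round(h, 3) for h in history])
```

The history rises at every step but by smaller amounts each time, mirroring the monotonic increase with diminishing gains reported for the real assembly.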

To quantify the contributions of its different components, we compared our final assembled code to partial codes whose only inputs were the tissue type, previously described motifs, conservation levels, or the compendium with transcript structure features or conservation levels removed (Fig. 1d).

Predicting alternative splicing

On the task of distinguishing alternatively spliced exons from constitutively spliced exons, our method achieves a true positive rate of more than 60% at a false positive rate of 1% (Supplementary Information 6). To address the more difficult challenge of predicting tissue-dependent regulation, we applied the code to various sets of unique test exons (exons not similar to those used during code assembly) and verified the predictions using microarray data, PCR with reverse transcription (RT–PCR) and focused studies (see later and Supplementary Information 5).

We first asked whether the theoretical ranking of the different codes shown in Fig. 1d corresponds well to their relative abilities to predict microarray-assessed tissue-dependent regulation (see Methods). Indeed, the final assembled code achieved significantly higher accuracy than the partial codes (Fig. 2a). For exons in genes with median expression in the top 20th percentile, at a false positive rate of 1%, a true positive rate of 21% was achieved, and this rose to 51% for a false positive rate of 10%.

We next asked how well the splicing code predicts significant differences in the percentage exon inclusion between pairs of tissues, for cases where the predicted difference is large (Fig. 2b and Supplementary Fig. 12). For microarray data, the splicing code correctly predicted the direction of change (positive or negative) in 82.4% of cases (P < 1 × 10^-30, Binomial test; see Methods). For RT–PCR evaluation, 14 exons that the splicing code predicted would exhibit significant tissue-dependent splicing were profiled in 14 diverse tissues. The splicing code correctly predicted the direction of change in 93.3% of cases (P < 1 × 10^-10, Binomial test). A scatterplot comparing predictions and measurements (Fig. 2c) illustrates that the code is able to predict an exon’s direction of regulation better than its percentage inclusion level. Figure 2d shows RT–PCR data and predictions for four representative exons.
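The direction-of-change results above are reported with binomial tests. As a hedged illustration of that kind of calculation (not the paper's exact procedure, which is in its Methods), the sketch below computes the one-sided probability of getting at least k correct direction calls out of n if each call were a fair coin flip. The counts used are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more successes under the null."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical example: 9 of 10 predicted directions agree with the measurements.
p_value = binomial_tail(9, 10)
print(p_value)  # 0.0107421875, i.e. 11/1024
```

With many more exon and tissue-pair comparisons, the same accuracy yields far smaller tail probabilities, which is how very small P values can arise from accuracies in the 80 to 90% range.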

To assess whether the code recapitulates results from experimental studies of individual exons and tissue-specific splicing factors, we surveyed 97 CNS- and/or muscle-regulated exons targeted by Nova, Fox, PTB, nPTB and/or unknown factors [18,19,32–39]. For each test exon, we extracted its features, applied the code and examined whether or not it correctly predicts splicing patterns in CNS or muscle tissues (Supplementary Table 3). The code’s predictions were correct for 74% of the combined set of 97 exons (P < 1 × 10^-41, Bernoulli test), 65%

[Figure 1 panels. a, RNA feature extraction: a feature set (short motifs, known motifs, new motifs, transcript structure) is extracted from the alternatively spliced exon and flanking 300 nt regions and combined with tissue type by the splicing code to give the predicted change in exon inclusion. b, Code assembly: code quality (bits) versus number of RNA features. c, Feature detection rate versus number of RNA features for short motifs, known motifs, transcript structure and new motifs. d, Code quality (bits) for codes derived using different feature sets: tissue type only, known motifs only, cons only, w/o transcript structure, w/o cons, final assembled code.]

Figure 1 | Assembling the splicing code. a, The code extracts hundreds of RNA features (known/new/short motifs and transcript structure features) from any exon of interest (red), its neighbouring exons (yellow) and intervening introns (blue). It then predicts whether or not the exon is alternatively spliced, and if so, whether the exon’s inclusion level will increase or decrease in a given tissue, relative to others. b, c, Code assembly proceeds by recursively adding features to maximize an information measure of code quality (b), and different feature types are preferred at different stages of assembly (c). d, The final assembled code achieves higher code quality than simpler codes derived using previously reported features and feature subsets. Cons, conservation; w/o, without. Error bars represent 1 s.d.

ARTICLES NATURE |Vol 465 |6 May 2010

54 ©2010 Macmillan Publishers Limited. All rights reserved


~1000 Features:
Known motifs compendium
New motifs
Sequence conservation
RNA structure

➔How do we visualize the code??

Page 17: Visualization of Alternative Splicing - VIZBI

Visualizing Alternative Splicing Regulatory Information


[U]GCAUG (Fox)

C - CNS
M - Muscle
E - Embryo
D - Digestive
I - (Tissue) Independent

Page 18: Visualization of Alternative Splicing - VIZBI


Visualizing the Regulatory Code

(ESEs and ISEs) and silencers (ESSs and ISSs),which are 6–8nucleotideslong and were identified without regard to possible tissue-dependentroles24–26, and 314 5–7-nucleotide-long motifs that are conserved inintronic sequences neighbouring alternative exons27. There are also460 region-specific counts of 1–3-nucleotide ‘short motifs’, becausesuch features were previously associated with alternative splicing28.We included 57 ‘transcript structure’ features implicated in determin-ing spliced transcript levels, such as exon/intron lengths, regionalprobabilities of secondary structures29, and whether exon inclusion/exclusion introduces a premature termination codon (PTC).

In addition to the feature compendium, we constructed a set of,1,800 ‘unbiased motifs’ by performing a de novo search10 for eachtissue type and direction of splicing change (SupplementaryInformation 3). Later, we report results obtained with and withoutusing these features.

Assembling a high-information code

Ourmethod seeks a code that is able to predict the splicing patterns ofall exons as accurately as possible, based solely on the tissue type andproximal RNA features. The putative features for a particular exonare appended to make a feature vector r, and the correspondingprediction in tissue type c is denoted p(c,r). Like q, p(c,r) consistsof probabilities of increased inclusion or exclusion, or no change. Thecode is combinatorial and accounts for how features cooperate orcompete in a given tissue type, by specifying a subset of important

features, thresholds on feature values and softmax parameters30 relat-ing active feature combinations to the prediction p(c,r) (Supplemen-tary Information 4).

We use a measure of ‘code quality’ that is based on informationtheory31 (see Methods). It can be viewed as the amount of informa-tion about genome-wide tissue-dependent splicing accounted for bythe code. A code quality of zero indicates that the predictions are nobetter than guessing, whereas a higher code quality indicatesimproved prediction capability.

To assemble a code, our method recursively selects features fromthe compendium, while optimizing their thresholds and softmaxparameters to maximize code quality (Supplementary Informa-tion 5). The code quality increased monotonically during assembly,but diminished gains were observed after 200 features were included(Fig. 1b, c, based on fivefold cross-validation). The final assembledcode contained,200 features. When a code was assembled using thecompendium plus the unbiased motifs, the increase in code qualitydid not exceed 1 s.d. in error (data not shown), but, interestingly,some of the unbiased motifs that did not correspond to any com-pendium features were selected and subsequently experimentallyverified (see later).

To quantify the contributions of its different components, wecompared our final assembled code to partial codes whose onlyinputs were the tissue type, previously describedmotifs, conservationlevels, or the compendium with transcript structure features or con-servation levels removed (Fig. 1d).

Predicting alternative splicing

On the task of distinguishing alternatively spliced exons from con-stitutively spliced exons, our method achieves a true positive rate ofmore than 60% at a false positive rate of 1% (SupplementaryInformation 6). To address the more difficult challenge of predictingtissue-dependent regulation, we applied the code to various sets ofunique test exons (exons not similar to those used during codeassembly) and verified the predictions using microarray data, PCRwith reverse transcription (RT–PCR) and focused studies (see laterand Supplementary Information 5).

We first asked whether the theoretical ranking of the differentcodes shown in Fig. 1d corresponds well to their relative abilities topredict microarray-assessed tissue-dependent regulation (seeMethods). Indeed, the final assembled code achieved significantlyhigher accuracy than the partial codes (Fig. 2a). For exons in geneswith median expression in the top 20th percentile, at a false positiverate of 1%, a true positive rate of 21% was achieved, and this rose to51% for a false positive rate of 10%.

We next asked how well the splicing code predicts significantdifferences in the percentage exon inclusion between pairs of tissues,for cases where the predicted difference is large (Fig. 2b andSupplementary Fig. 12). For microarray data, the splicing codecorrectly predicted the direction of change (positive or negative) in82.4% of cases (P, 13 10230, Binomial test; see Methods). For RT–PCR evaluation, 14 exons that the splicing code predicted wouldexhibit significant tissue-dependent splicing were profiled in 14diverse tissues. The splicing code correctly predicted the directionof change in 93.3% of cases (P, 13 10210, Binomial test). A scatter-plot comparing predictions and measurements (Fig. 2c) illustratesthat the code is able to predict an exon’s direction of regulation betterthan its percentage inclusion level. Figure 2d shows RT–PCR dataand predictions for four representative exons.

To assess whether the code recapitulates results from experimental studies of individual exons and tissue-specific splicing factors, we surveyed 97 CNS- and/or muscle-regulated exons targeted by Nova, Fox, PTB, nPTB and/or unknown factors (refs 18, 19, 32–39). For each test exon, we extracted its features, applied the code and examined whether or not it correctly predicts splicing patterns in CNS or muscle tissues (Supplementary Table 3). The code’s predictions were correct for 74% of the combined set of 97 exons (P < 1 × 10^-41, Bernoulli test), 65%

[Figure 1 panels. a: RNA feature extraction from the alternatively spliced exon (four 300 nt regions marked) yields a feature set (short motifs, known motifs, new motifs, transcript structure); together with the tissue type, this is passed to the splicing code, which outputs the predicted change in exon inclusion. b, c: Code assembly, with axes code quality (bits), feature detection rate and number of RNA features; feature types are short motifs, known motifs, new motifs and transcript structure. d: Code quality (bits) for codes derived using different feature sets: tissue type only, cons only, known motifs only, w/o cons, w/o transcript structure, and final assembled code.]

Figure 1 | Assembling the splicing code. a, The code extracts hundreds of RNA features (known/new/short motifs and transcript structure features) from any exon of interest (red), its neighbouring exons (yellow) and intervening introns (blue). It then predicts whether or not the exon is alternatively spliced, and if so, whether the exon’s inclusion level will increase or decrease in a given tissue, relative to others. b, c, Code assembly proceeds by recursively adding features to maximize an information measure of code quality (b), and different feature types are preferred at different stages of assembly (c). d, The final assembled code achieves higher code quality than simpler codes derived using previously reported features and feature subsets. Cons, conservation; w/o, without. Error bars represent 1 s.d.

ARTICLES NATURE | Vol 465 | 6 May 2010

54 ©2010 Macmillan Publishers Limited. All rights reserved

(ESEs and ISEs) and silencers (ESSs and ISSs), which are 6–8 nucleotides long and were identified without regard to possible tissue-dependent roles (refs 24–26), and 314 5–7-nucleotide-long motifs that are conserved in intronic sequences neighbouring alternative exons (ref. 27). There are also 460 region-specific counts of 1–3-nucleotide ‘short motifs’, because such features were previously associated with alternative splicing (ref. 28). We included 57 ‘transcript structure’ features implicated in determining spliced transcript levels, such as exon/intron lengths, regional probabilities of secondary structures (ref. 29), and whether exon inclusion/exclusion introduces a premature termination codon (PTC).

In addition to the feature compendium, we constructed a set of ~1,800 ‘unbiased motifs’ by performing a de novo search (ref. 10) for each tissue type and direction of splicing change (Supplementary Information 3). Later, we report results obtained with and without using these features.

Assembling a high-information code

Our method seeks a code that is able to predict the splicing patterns of all exons as accurately as possible, based solely on the tissue type and proximal RNA features. The putative features for a particular exon are appended to make a feature vector r, and the corresponding prediction in tissue type c is denoted p(c,r). Like q, p(c,r) consists of probabilities of increased inclusion or exclusion, or no change. The code is combinatorial and accounts for how features cooperate or compete in a given tissue type, by specifying a subset of important features, thresholds on feature values and softmax parameters (ref. 30) relating active feature combinations to the prediction p(c,r) (Supplementary Information 4).
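To make the shape of p(c,r) concrete, here is a simplified sketch of a thresholded softmax predictor for a single tissue type. It is an illustration under stated assumptions, not the model from Supplementary Information 4: each feature contributes independently once it exceeds its threshold (the paper's code also models feature combinations), and the names `predict`, `thresholds`, `weights` and `biases` are hypothetical.

```python
import math

# The three outcomes that p(c, r) assigns probabilities to.
OUTCOMES = ("inclusion", "exclusion", "no_change")


def predict(features, thresholds, weights, biases):
    """Return softmax probabilities over OUTCOMES for one exon in one tissue.

    features:   feature name -> value for this exon (the vector r)
    thresholds: feature name -> cut-off above which the feature is "active"
    weights:    outcome -> {feature name -> weight}; assumed per-tissue parameters
    biases:     outcome -> bias term
    """
    # A feature is active when its value exceeds its threshold.
    active = {name: features[name] > cut for name, cut in thresholds.items()}

    scores = []
    for outcome in OUTCOMES:
        score = biases[outcome]
        for name, is_active in active.items():
            if is_active:
                score += weights[outcome].get(name, 0.0)
        scores.append(score)

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {outcome: e / total for outcome, e in zip(OUTCOMES, exps)}
```

With all weights and biases at zero the three outcomes are equally likely; a positive weight on an active feature shifts probability towards that outcome.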

We use a measure of ‘code quality’ that is based on information theory (ref. 31; see Methods). It can be viewed as the amount of information about genome-wide tissue-dependent splicing accounted for by the code. A code quality of zero indicates that the predictions are no better than guessing, whereas a higher code quality indicates improved prediction capability.
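One way to obtain a measure with these properties is to score predictions by how much closer they are to the target patterns q than a naive guess is, using Kullback-Leibler divergence in bits. The sketch below is an assumption about the general form of such a measure, not a transcription of the paper's Methods; `kl_bits`, `code_quality` and the `baseline` distribution are hypothetical names.

```python
import math


def kl_bits(q, p):
    """Kullback-Leibler divergence D(q || p) in bits.

    Assumes p[i] > 0 wherever q[i] > 0.
    """
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def code_quality(targets, predictions, baseline):
    """Total information gained over a naive guess, summed over exons.

    targets:     list of target distributions q (inclusion, exclusion, no change)
    predictions: list of predicted distributions p, aligned with targets
    baseline:    the single distribution a naive guesser would always output
    """
    return sum(
        kl_bits(q, baseline) - kl_bits(q, p)
        for q, p in zip(targets, predictions)
    )
```

Under this form, predicting the baseline for every exon scores exactly zero, and predictions closer to q than the baseline score above zero.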

To assemble a code, our method recursively selects features from the compendium, while optimizing their thresholds and softmax parameters to maximize code quality (Supplementary Information 5). The code quality increased monotonically during assembly, but diminished gains were observed after 200 features were included (Fig. 1b, c, based on fivefold cross-validation). The final assembled code contained ~200 features. When a code was assembled using the compendium plus the unbiased motifs, the increase in code quality did not exceed 1 s.d. in error (data not shown), but, interestingly, some of the unbiased motifs that did not correspond to any compendium features were selected and subsequently experimentally verified (see later).
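The assembly loop described above resembles greedy forward selection: at each step, add the feature that raises the quality score most, and stop when nothing helps. The sketch below assumes that reading and leaves out the joint optimization of thresholds and softmax parameters; `assemble` and the `quality` callback are hypothetical.

```python
def assemble(candidates, quality, max_features):
    """Greedy forward selection of features.

    candidates:   iterable of feature names (the compendium)
    quality:      function mapping a list of selected features to a score;
                  a stand-in for code quality, supplied by the caller
    max_features: upper bound on the number of features to select
    Returns the selected features and the score after each addition.
    """
    selected = []
    best = quality(selected)
    history = [best]
    while len(selected) < max_features:
        # Score every remaining candidate when added to the current selection.
        trials = [
            (quality(selected + [feature]), feature)
            for feature in candidates
            if feature not in selected
        ]
        if not trials:
            break
        score, feature = max(trials, key=lambda trial: trial[0])
        if score <= best:
            break  # no remaining feature improves the score
        selected.append(feature)
        best = score
        history.append(best)
    return selected, history
```

The returned `history` is non-decreasing by construction, which mirrors the monotonic increase in code quality reported above; a flattening history corresponds to the diminishing gains.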

To quantify the contributions of its different components, we compared our final assembled code to partial codes whose only inputs were the tissue type, previously described motifs, conservation levels, or the compendium with transcript structure features or conservation levels removed (Fig. 1d).


Page 19: Visualization of Alternative Splicing - VIZBI

Visualizing the Regulatory Features Interaction

For example, the code predicts a class of exons that insert a PTC after inclusion in transcripts and that are skipped in embryonic tissues but included in adult tissues. Later, we describe experiments indicating that these exons have an important role in the regulation of developmental stage-specific gene expression.

Predicting regulatory feature maps

By flagging regulatory elements in RNA sequences surrounding an alternative exon, the splicing code yields a visual feature map that partially accounts for how the exon is regulated. Predicted feature maps were initially evaluated by their overlap with 376 nucleotides of RNA sequence analysed by mutagenesis in more than 60 splicing reporter constructs from Agrn (ref. 33), Src (refs 19, 43), Casp2 (ref. 35) and the Slo K+ STREX exon (ref. 44). Our feature maps (Supplementary Figs 2–7 and Supplementary Information 10) achieve an overlap of 90% with a statistical significance of P < 0.002 (empirical, using maps from unrelated exons). In contrast, feature maps constructed using only known motifs achieve an overlap of 38% (P = 0.004) and maps derived solely from conservation information (ref. 27) achieve poor specificity (P = 0.27).
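An empirical P-value of this kind can be obtained by comparing the observed overlap against overlaps computed with feature maps taken from unrelated exons. The sketch below assumes overlap means the fraction of mutagenesis-analysed positions covered by the map, and uses an add-one correction so the estimate is never exactly zero; both choices, and the names `overlap_fraction` and `empirical_p`, are illustrative rather than taken from the Supplementary Information.

```python
def overlap_fraction(map_positions, validated_positions):
    """Fraction of experimentally analysed positions covered by the feature map."""
    validated = set(validated_positions)
    if not validated:
        raise ValueError("no validated positions supplied")
    return len(validated & set(map_positions)) / len(validated)


def empirical_p(observed, null_values):
    """Empirical P-value of an observed overlap against a null sample.

    null_values: overlaps obtained with maps from unrelated exons (assumed null).
    Counts null overlaps at least as large as the observed one, with an
    add-one correction in numerator and denominator.
    """
    hits = sum(1 for value in null_values if value >= observed)
    return (hits + 1) / (len(null_values) + 1)
```

With this estimator, the smallest reportable P-value is 1 / (number of null maps + 1), so the null sample size limits how small an empirical P-value can be.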

Code-generated feature maps can be used to guide focused mechanistic studies. We examined exon 16 of the Daam1 gene, which our

[Figure 3 panels. a: Feature map with one row per feature and one column group per region (C1, I1 (5′), I1 (3′), A, I2 (5′), I2 (3′), C2; four 300 nt regions marked around the alternative exon), each group split into columns I, C, M, E and D. Feature rows: U-rich; CUG-rich (Cugbp); [U]GCAUG (Fox); ACUAAY (Qkl); CU-rich (nPTB); YCAY (Nova); YGCUKY (Mbnl); ESE; ESS; GRYYcSYR (SC35); YRCYKM (SRp55); YYACWSS (SRp40); Mot[UGAUUUU]; Mot[UAAAUG]; Mot[UUGGU]; Mot[UUUAUAA]; Mot[GUAAG]; Mot[UGCUU]; Mot[UCAUUUC]; Mot[UUAGAA]; Mot[GUUUU]; Mot[UGGCUU]; Mot[AAGUC]; Mot[UUUAG]; Mot[AUUAAAU]; PTCinc; PTCexc; Frameshift; Junction score; Conservation; Length; Secondary structure; First[AG]. Column headings: tissue-independent splicing, CNS, muscle, embryo, digestive. Legend: exon inclusion, exon exclusion, feature enrichment, feature depletion. b–d: Feature interaction networks whose nodes carry feature labels from panel a, plus U-rich (Tia1/Tiar).]

Figure 3 | Graphical depiction of the splicing code. a, The region-specific activity of each feature in increased exon inclusion (red bar) or exclusion (blue bar) is shown for CNS (C), muscle (M), embryo (E) and digestive (D) tissues, plus a tissue-independent mixture (I). A bar with/without a black hat indicates activity due to feature depletion/enrichment. Bar size conveys enrichment P-value; P < 0.005 in all cases. Potential feature binding proteins are shown in parentheses. b–d, Unexpectedly frequent feature pairs were identified and used to generate feature interaction networks for CNS (b), muscle (c) and embryonic (d) tissues. Node size and colour indicate the feature’s P-value and region (see colour key in a). Red/blue edges correspond to increased inclusion/exclusion and edge thickness conveys interaction P-value (false discovery rate-corrected Fisher test); P < 0.05 in all cases. A thick/thin node boundary indicates activity due to feature depletion/enrichment.


Slide annotations on the figure: region colour map; motif feature; structural feature; depleted feature; co-occurrence strength; no co-occurrence.


Created using Cytoscape

Page 20: Visualization of Alternative Splicing - VIZBI

[Figure: Muscle-specific splicing code, with one panel for muscle inclusion and one for muscle skipping. Feature labels shown across the transcript regions include Srp40, SC35, ESE, CUG-rich (Cugbp), [U]GCAUG (Fox), ACUAAY (QKI), CU-rich (PTB), YCAY (Nova), YGCUKY (Mbnl), UGCUU, UCAUUUC, UGGCUU, UUUAG, AAGUC, PTCinc, splice site score, conservation, length and First [AG].]

Figure 1. Muscle-specific splicing code (adapted from Barash et al., 2010). Features enriched (in red) or depleted (in blue) associated with exon skipping (top) or exon inclusion (bottom) are shown for seven transcript regions, including upstream flanking exon (white box), 5′ upstream intron, 3′ upstream intron, cassette exon (in green), 5′ downstream intron, 3′ downstream intron and downstream flanking exon (white box). For structural features representing feature size (exon length or distance to first upstream AG), red denotes larger and blue smaller. For each feature, font size conveys enrichment P-value (P < 0.005; taken from Barash et al., 2010) and correlates with activity due to depletion (blue) or enrichment (red). Frequent feature pairs or networks are shown, where boxed features represent nodes, red/blue edges correspond to increased inclusion/skipping, and edge thickness conveys strength of interaction, as in Barash et al., 2010.

Combined Visualization

figure by Miriam Llorian, Chris Smith (unpublished)


Page 21: Visualization of Alternative Splicing - VIZBI

Motif Map Visualization

[Figure: motif map for an alternatively spliced exon, drawn over C1, A and C2 with 300 nt flanking intronic windows, in neuronal and non-neuronal versions. Legible track labels: Putative Features (Nova, Fox, (n)PTB, Mbnl, Cugbp, UCrich, Quaking), Conserved Motifs, Secondary Struct., Conservation, Feat. Compendium, Unbiased Motifs, RNA.]

Known + novel cis elements
Direct mutagenesis, double/triple/...

Page 22: Visualization of Alternative Splicing - VIZBI

Online Splicing Prediction + UCSC GB Visualization

Page 23: Visualization of Alternative Splicing - VIZBI

Very Open Questions

• How to represent complex combinations of regulatory features?
• How do we visually compare different codes?
• How do we visualize combined effects of motifs?
• How do we visualize more complex splicing events?
• Building a real splicing interactive & integrative tool....

Page 24: Visualization of Alternative Splicing - VIZBI

Visualization in scientific papers

Example: Journals on the iPad/Tablets
Dynamic visualization
Interactive visualization
• standards
• tools

Requires concerted, long term, effort

For example, the code predicts a class of exons that insert a PTC after inclusion in transcripts and that are skipped in embryonic tissues but included in adult tissues. Later, we describe experiments indicating that these exons have an important role in the regulation of developmental stage-specific gene expression.

Predicting regulatory feature maps

By flagging regulatory elements in RNA sequences surrounding an alternative exon, the splicing code yields a visual feature map that partially accounts for how the exon is regulated. Predicted feature maps were initially evaluated by their overlap with 376 nucleotides of RNA sequence analysed by mutagenesis in more than 60 splicing reporter constructs from Agrn (ref. 33), Src (refs 19, 43), Casp2 (ref. 35) and the Slo K+ STREX exon (ref. 44). Our feature maps (Supplementary Figs 2–7 and Supplementary Information 10) achieve an overlap of 90% with a statistical significance of P < 0.002 (empirical, using maps from unrelated exons). In contrast, feature maps constructed using only known motifs achieve an overlap of 38% (P = 0.004) and maps derived solely from conservation information (ref. 27) achieve poor specificity (P = 0.27).
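The overlap evaluation compares two sets of nucleotide positions: those flagged by a feature map and those shown to matter by mutagenesis. The sketch below shows one way to compute an overlap and an empirical P-value in Python. It takes "overlap" to mean the fraction of mutagenesis-validated nucleotides covered by the map, and it draws the null distribution from overlaps obtained with maps of unrelated exons; both readings are assumptions, and every position and value shown is made up.

```python
from typing import Iterable, Set

def overlap_fraction(flagged: Set[int], validated: Set[int]) -> float:
    """Fraction of validated nucleotides covered by the feature map."""
    if not validated:
        raise ValueError("validated set is empty")
    return len(flagged & validated) / len(validated)

def empirical_p(observed: float, null_values: Iterable[float]) -> float:
    """Share of null overlaps at least as large as the observed one.

    Uses the (count + 1) / (n + 1) convention so P is never exactly 0.
    """
    null_list = list(null_values)
    hits = sum(1 for v in null_list if v >= observed)
    return (hits + 1) / (len(null_list) + 1)

# Hypothetical positions: 20 validated nucleotides, 18 of them flagged.
validated = set(range(100, 120))
flagged = set(range(95, 118))
observed = overlap_fraction(flagged, validated)

# Hypothetical overlaps obtained with maps from unrelated exons.
null_overlaps = [0.10, 0.25, 0.40, 0.05, 0.30, 0.95, 0.20, 0.15, 0.35]
print(observed, empirical_p(observed, null_overlaps))
```

An empirical P-value of this kind can be no smaller than 1 / (n + 1), so a reported bound such as P < 0.002 implies that several hundred null maps were available.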

Code-generated feature maps can be used to guide focused mechanistic studies. We examined exon 16 of the Daam1 gene, which our

[Figure 3, panel a: feature map. Columns are the regions C1, I1(5′), I1(3′), A, I2(5′), I2(3′) and C2 (300 nt intronic windows around the alternative exon), each subdivided into I, C, M, E and D. Rows are the features: U-rich; CUG-rich (Cugbp); [U]GCAUG (Fox); ACUAAY (Qkl); CU-rich (nPTB); YCAY (Nova); YGCUKY (Mbnl); ESE; ESS; GRYYcSYR (SC35); YRCYKM (SRp55); YYACWSS (SRp40); Mot[UGAUUUU]; Mot[UAAAUG]; Mot[UUGGU]; Mot[UUUAUAA]; Mot[GUAAG]; Mot[UGCUU]; Mot[UCAUUUC]; Mot[UUAGAA]; Mot[GUUUU]; Mot[UGGCUU]; Mot[AAGUC]; Mot[UUUAG]; Mot[AUUAAAU]; PTCinc; PTCexc; Frameshift; Junction score; Conservation; Length; Secondary structure; First[AG]. Legend: exon exclusion, exon inclusion; feature enrichment, feature depletion; tissue-independent splicing, CNS, muscle, embryo, digestive.]

[Figure 3, panels b–d: feature interaction networks. Node labels include UUUAUAA, UGAUUUU, UCAUUUC, AUUAAAU, UGGCUU, secondary structure, YCAY (Nova), U-rich (Tia1/Tiar), YGCUKY (Mbnl), Length, Frameshift, First[AG], [U]GCAUG (Fox), PTCexc, ACUAAY (Qkl), CU-rich (nPTB), ESE, YYACWSS (SRp40) and CUG-rich (Cugbp).]

Figure 3 | Graphical depiction of the splicing code. a, The region-specific activity of each feature in increased exon inclusion (red bar) or exclusion (blue bar) is shown for CNS (C), muscle (M), embryo (E) and digestive (D) tissues, plus a tissue-independent mixture (I). A bar with/without a black hat indicates activity due to feature depletion/enrichment. Bar size conveys enrichment P-value; P < 0.005 in all cases. Potential feature binding proteins are shown in parentheses. b–d, Unexpectedly frequent feature pairs were identified and used to generate feature interaction networks for CNS (b), muscle (c) and embryonic (d) tissues. Node size and colour indicate the feature’s P-value and region (see colour key in a). Red/blue edges correspond to increased inclusion/exclusion and edge thickness conveys interaction P-value (false discovery rate-corrected Fisher test); P < 0.05 in all cases. A thick/thin node boundary indicates activity due to feature depletion/enrichment.
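The edges in panels b–d rest on a test of whether two features co-occur more often than expected, followed by a false discovery rate correction. A minimal Python sketch of that kind of analysis follows. It uses a one-sided Fisher exact test and the Benjamini-Hochberg procedure; the caption does not say which FDR method or test direction was used, so both are assumptions, and the contingency tables are hypothetical.

```python
from math import comb
from typing import List

def fisher_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact P for the 2x2 table [[a, b], [c, d]].

    a = exons with both features, b = first feature only,
    c = second feature only, d = neither feature.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Hypergeometric upper tail: tables with at least `a` co-occurrences.
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)

def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return FDR-adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical co-occurrence tables for three feature pairs.
tables = [(8, 2, 1, 9), (5, 5, 5, 5), (7, 3, 2, 8)]
raw = [fisher_greater(*t) for t in tables]
print(benjamini_hochberg(raw))
```

Pairs whose adjusted P-value falls below the chosen threshold would become edges, with the P-value mapped to edge thickness as the caption describes.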


Page 25: Visualization of Alternative Splicing - VIZBI

Acknowledgements

Brendan Frey, Ben Blencowe, John Calarco

Xinchen Wang, Sandy Pan

Weijun Gao, Leo Lee

AS D.B., RNA-Seq processing

Web Site, AS D.B.

RNA-Seq processing

Disease Study, Computational, Experimental

Experiments

Page 26: Visualization of Alternative Splicing - VIZBI

Thank You