20110524zurichngs 2nd pub

Next Generation Sequencing for Model and Non-Model Organisms

2nd day

Jun Sese and Kentaro [email protected]

Ph.D course lecture @ Institute of Plant Biology, Univ. of Zurich

26/05/2011


Today’s Menu

• Lecture

• Current RNA-Seq analysis

• Genome and RNA Assembly

• Introduction to AWK

• First step of programming

• Exercise

• Visualization of mapped reads

• RNA-Seq analysis

• Genome assembly


[Figure: analysis workflow. Sequencer's Output and Genome Sequence feed the Mapping Program, which produces the Mapping Result for Visualization and Further Analysis. This part covers RNA-Seq.]

RNA-Seq

• Which genes are highly expressed?

• Need to normalize by sequence length

• RPKM (Reads Per Kilobase per Million mapped reads) [Mortazavi et al. Nature Methods. 2008]

• An initial gene expression counting method

Think about two genes expressed in a cell. Suppose that one mRNA is expressed from each gene.

Short gene: 2 reads. Long gene: 8 reads.

The longer gene yields more reads even though both genes are expressed equally.
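The length normalization behind RPKM can be sketched in a few lines of Python. This is a minimal illustration, not the counting code of any particular tool. The read counts 2 and 8 come from the example above, while the gene lengths and the total number of mapped reads are hypothetical values chosen so that the long gene is four times the short one.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical lengths and library size; only the counts 2 and 8 are from the slide.
short_gene = rpkm(read_count=2, gene_length_bp=500, total_mapped_reads=10)
long_gene = rpkm(read_count=8, gene_length_bp=2000, total_mapped_reads=10)
print(short_gene, long_gene)
```

With these assumed lengths the two genes receive the same RPKM, which is the point of dividing by gene length.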

RNA-Seq (contd)

• Some corrections, including multiple-testing correction and fragment bias correction, will be required.

• Srivastava and Chen. NAR. 2010

• Li, Jiang and Wong. Genome Research. 2010

• No standard method.

• After mapping reads, some tools are available to count reads.

• Cufflinks

• HTSeq

• R packages

• DEGSeq [Wang et al. 2010]

• edgeR [Robinson, McCarthy and Smyth. 2010]

• DESeq [Anders and Huber. 2010]

[Figure: the analysis workflow (Sequencer's Output, Genome Sequence, Mapping Program, Mapping Result, Visualization, Further Analysis) with two added labels, Sequencer and Assemble. This part covers assembly.]

Assembly

• Genome/Gene assembly is a kind of puzzle.

• Assemble a long sequence by combining short reads

GATGCTAAGCATATGCGAGGCATGCCATATGGATG

GATGCTAAG

CATATGCGA

GGCATGCC

ATATGGATG

CTAAGCAT

CGAGGCAT

TGCCATAT

GATGCTAAG
    CTAAGCAT
         CATATGCGA
               CGAGGCAT
                  GGCATGCC
                      TGCCATAT
                          ATATGGATG

Assembly programs also depend on sequence length

• Sanger sequence

• Arachne

• Roche 454

• Mira3, Newbler

• Illumina/SOLiD sequencers

• Velvet, ABySS, SOAPdenovo,...

• Recently, gene (RNA) assembly programs have been developed

• Oases http://www.ebi.ac.uk/~zerbino/oases/

• Trinity [Grabherr et al. Nature Biotech. 2011]

Overlap-Layout-Consensus

• Mainly used to assemble Sanger and Roche 454 sequences.

[Figure from Kasahara and Morishita. Large-scale genome sequence processing. 2006.]
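The first step of Overlap-Layout-Consensus, finding suffix-prefix overlaps between reads, can be sketched as follows. This is a toy illustration using the reads from the assembly puzzle slide, not the algorithm of any specific assembler. The function name and the `min_length` cutoff are assumptions made for the example.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_length` bases exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Reads from the puzzle slide, listed in their true left-to-right order.
reads = ["GATGCTAAG", "CTAAGCAT", "CATATGCGA", "CGAGGCAT",
         "GGCATGCC", "TGCCATAT", "ATATGGATG"]

for left, right in zip(reads, reads[1:]):
    print(left, right, overlap(left, right))

# The last read also overlaps the first one by chance ("GATG").
print(overlap("ATATGGATG", "GATGCTAAG"))
```

The chance overlap between the last and the first read shows why the layout step is harder than it looks: overlaps alone do not tell us which ones reflect the true genome order.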

de Bruijn Graph approach

• Used in recent short read assemblers

• Velvet, ABySS,...

• Generate a k-mer graph (de Bruijn graph), and then find minimum paths covering all edges

• Originally introduced in Pevzner, Tang and Waterman, PNAS, 2001.

[Figures from Miller, Koren and Sutton. Genomics. 2010.]
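Building the k-mer graph itself is simple; the hard part is finding good paths through it. The sketch below only shows the graph construction, with (k-1)-mers as nodes and one edge per k-mer occurrence. The two short reads and the function name are made up for illustration, and real assemblers add error correction and graph simplification on top of this.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mers that follow it.

    Every k-mer found in a read contributes one edge, so repeated
    k-mers appear as repeated entries (edge multiplicity).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads; they share the 4-mer "TGCT".
graph = de_bruijn_edges(["GATGCT", "TGCTAA"], k=4)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Changing `k` changes which nodes and edges exist, which is why the assembly result depends so strongly on this parameter.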

Genome assembly problem has no correct answer.

• A true genome sequence does exist.

• In reality, we cannot know the whole genome sequence exactly.

• In most genome assembly studies, summary metrics are used to judge whether the assembly succeeded.

• Number of contigs

• Total length of contigs

• N50

• If you have EST sequences, they can be used to check the assembly quality.

• Note: ESTs used for this check must not also be used to build the assembly, so that the training set and the test set stay independent.
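N50 is the contig length at which the contigs of that length or longer cover at least half of the total assembly length. A minimal sketch of the calculation, using hypothetical contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths (total 290, so half is 145).
lengths = [80, 70, 50, 40, 30, 20]
print("Number of contigs:", len(lengths))
print("Total length:", sum(lengths))
print("N50:", n50(lengths))
```

Here the two longest contigs (80 + 70 = 150) are the first to reach half of the total, so the N50 is 70.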

Assembled sequences vary between assemblers

• Compare 5 assemblers for RNA assembly with Roche 454 reads

• Kumar and Blaxter. BMC Genomics. 2010.

• Compare Newbler, SeqMan, CLC (Commercial), CAP3 and MIRA3 (Free)

• No winner

• Newbler 2.5 generates the longest contigs

• SeqMan is the best for recapturing known genes

• MIRA3 is competitive with Newbler and SeqMan

Assembled sequences vary between assemblers (contd)

• Compare 6 assemblers for genome assembly

• Bao et al. J. Hum Gen. 2010.

• Use 1.5 million reads from human genome resequencing data. Read length is 76 bp.

• Authors conclude that SOAPdenovo was the best.

• High genome coverage, low memory and fast.

• SSAKE and ABySS generated much longer contigs than SOAPdenovo.

• Because so few reads were used, this comparison is not realistic.

• They subsampled reads because their machine had only 32 GB of memory.

• Genome assembly requires tuning various parameters to get a “good” result. The authors did not mention parameter tuning.

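[Figure: analysis workflow annotated with file formats: Sequencer's Output (Sequence Format), Mapping Program (BWA, Bowtie, etc.), Genome Sequence, Mapping Result (Output Format), Visualization, Further Analysis, with Change File Format steps in between.]

We have to change the file format between steps.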

Introduction to AWK

• “grep” is a very useful command, but we may require a more complicated search.

• e.g., select lines whose third column is “Chr1.”

• ‘grep “Chr1” file’ selects a line even when “Chr1” appears only in the first column.

• e.g., select lines whose values are less than 100.

• grep cannot compare values.

• Replace a word with another word in a file.

• Editors can do that if the file is small.

• AWK is one of the traditional and simple solutions.

• For more complicated tasks, scripting languages like Perl, Python and Ruby are useful.

• Here we introduce the “minimum” you need to know about AWK.

• You can find many introductory documents about AWK on the Web.

AWK in a nutshell

• Process each line

• $n means the n-th column.

• $1 is the first column and $2 is the second column.

• $0 means the whole line

$ cat nums.tab
11.2 13.8
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1

# Same as "cut -f2 nums.tab"
$ awk '{print $2}' nums.tab
13.8
7.7
7.0
10.9
9.1
# Print only lines whose second column equals "10.9"
# Compare with 'grep "10.9" nums.tab'
$ awk '{if($2 == "10.9") print $0}' nums.tab
9.4 10.9
# Compare as numerical values
$ awk '{if($2 > 10) print $0}' nums.tab
11.2 13.8
9.4 10.9
# Combine two conditions with && (logical AND)
$ awk '{if($2 > 10 && $2 < 12) print $0}' nums.tab
9.4 10.9

AWK in a nutshell (2)

$ cat nums.tab
11.2 13.8
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1

# Print lines that contain "9" in the second column
$ awk '{if($2 ~ /9/) print $0}' nums.tab
9.4 10.9
8.8 9.1
# Print lines whose first column starts with "9"
$ awk '{if($1 ~ /^9/) print $0}' nums.tab
9.4 10.9
# Replace a specific string
$ awk '{gsub(/10/,"15"); print $0}' nums.tab
11.2 13.8
15.9 7.7
15.2 7.0
9.4 15.9
8.8 9.1

“ ” is just a string, and / / is a regular expression.
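For comparison, the same kind of column filter can be written in Python, one of the scripting languages mentioned earlier for more complicated tasks. This is only a sketch: the helper name `select_rows` is made up for this example, and the data are the rows of nums.tab typed in as a list rather than read from the file.

```python
def select_rows(lines, column, predicate):
    """Yield lines whose whitespace-separated field at 1-based `column`
    satisfies `predicate` (a function taking the field as a string)."""
    for line in lines:
        fields = line.split()
        if len(fields) >= column and predicate(fields[column - 1]):
            yield line.rstrip("\n")


# Contents of nums.tab from the slides.
nums = ["11.2 13.8", "10.9 7.7", "15.2 7.0", "9.4 10.9", "8.8 9.1"]

# Equivalent of: awk '{if($2 > 10 && $2 < 12) print $0}' nums.tab
in_range = list(select_rows(nums, 2, lambda v: 10 < float(v) < 12))

# Equivalent of: awk '{if($1 ~ /^9/) print $0}' nums.tab
starts_with_9 = list(select_rows(nums, 1, lambda v: v.startswith("9")))

print(in_range)
print(starts_with_9)
```

To run it on a real file, pass an open file object instead of the list, for example `select_rows(open("nums.tab"), 2, ...)`.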

[Figure: analysis workflow. Sequencer's Output and Genome Sequence feed the Mapping Program, which produces the Mapping Result for Visualization and Further Analysis. This part covers RNA-Seq.]

Convert SAM to BAM

• SAM files are very large.

• We convert the SAM file into a BAM file, a binary format that is easier for programs to process.

• Install SAMtools

• http://samtools.sourceforge.net/

# $ curl -O http://switch.dl.sourceforge.net/project/samtools/samtools/0.1.16/samtools-0.1.16.tar.bz2
# $ bzip2 -dc samtools-0.1.16.tar.bz2 | tar xvf -
# $ ln -s samtools-0.1.16 samtools
# $ cd samtools
# $ make
# $ cd ..

$ ./samtools/samtools faidx TAIR10_chr_all.fas
# Generate TAIR10_chr_all.fas.fai
# "\" indicates that the command continues on the next line.
# You do not need to type the "\" if you enter the command on one line.
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o tha_reads.bam tha_reads.sam
# Generate tha_reads.bam
$ ./samtools/samtools sort tha_reads.bam tha_reads.sorted
# Sort reads and generate tha_reads.sorted.bam
$ ./samtools/samtools index tha_reads.sorted.bam tha_reads.sorted.bai
# Generate index of the BAM file as tha_reads.sorted.bai

Visualize mapped result (IGV)

1. Install IGV

2. Start IGV

$ unzip IGV_1.5.64.zip  # install
$ java -Xmx1g -jar IGV_1.5.64/igv.jar  # start IGV
# Wait a minute. A new window will appear.

3. Select A.thaliana (TAIR10)

4. File > Load from File > Select “tha_reads.sorted.bam”

5. Zoom in, zoom in... but it is difficult to find mapped reads :(

Mapped reads on Chr1

• Use SRR038985_chr1.sam

• Includes all reads mapped onto Chromosome 1

• Convert the SAM file into BAM, and load it in IGV

# We can skip this:
# $ ./samtools/samtools faidx TAIR10_chr_all.fas
$ ./samtools/samtools view -bt TAIR10_chr_all.fas.fai \
-o SRR038985_chr1.bam SRR038985_chr1.sam
$ ./samtools/samtools sort SRR038985_chr1.bam SRR038985_chr1.sorted
$ ./samtools/samtools index \
SRR038985_chr1.sorted.bam SRR038985_chr1.sorted.bai

Visualize mapped result (Ensembl)

• Install BEDTools

• http://code.google.com/p/bedtools/

• Using bamToBed in BEDTools, you can convert BAM format into BED format.

• BED format can describe simple track information. Ensembl and the UCSC genome browser can read this file and display its contents.

# Skip install process
# $ curl -O http://bedtools.googlecode.com/files/BEDTools.v2.12.0.tar.gz
# $ tar zxvf BEDTools.v2.12.0.tar.gz
# $ ln -s BEDTools-Version-2.12.0 BEDTools
# $ cd BEDTools-Version-2.12.0
# $ make
# $ cd ..

$ ./BEDTools/bin/bamToBed -i SRR038985_chr1.sorted.bam \
> SRR038985_chr1.sorted.bed




Visualize mapped result (Ensembl)

• Go to http://plants.ensembl.org in your browser

• Select Arabidopsis thaliana

• Click “Manage your data” in the left column

• Select “Upload Data” in left column

• Name for this upload: my_reads

• Data format: BED

• Upload file: select your bed file

• DON’T push Upload yet!!!

Problems...

• Two problems

• The BED file is too large to upload. The maximum file size we can upload to Ensembl Plants is 5 MB.

• We have to select a region in the BED file.

• The chromosome name is different

• In our BED file, the chromosome name is like “Chr1,” while in Ensembl, the name is just “1.”

• We have to convert the name.

• Finally, we can upload the BED file!

• It takes about a minute. Don’t push the “Upload” button repeatedly.

$ awk '{if($3 < 1000000) print $0}' SRR038985_chr1.sorted.bed \
> SRR038985_chr1_to_1M.sorted.bed
# You can change the region by replacing "$3 < 1000000" with "$3 < 100000 && $3 > 50000"
$ awk 'gsub(/^Chr/,"")' SRR038985_chr1_to_1M.sorted.bed \
> SRR038985_chr1_to_1M.ensembl.bed
$ ls -lh SRR038985_chr1_to_1M.ensembl.bed
# Check that the file size is less than 5 MB
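The two AWK steps above (select a region, then rename the chromosome) can also be combined in one small Python function. This is a sketch under the assumption that the BED file is tab-separated with the end coordinate in the third column. The function name and the two example records are made up for illustration.

```python
def to_ensembl_bed(lines, max_end=1000000):
    """Keep BED records whose end (column 3) is below `max_end`
    and strip the leading 'Chr' from the chromosome name."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if int(fields[2]) < max_end:
            if fields[0].startswith("Chr"):
                fields[0] = fields[0][len("Chr"):]
            yield "\t".join(fields)


# Hypothetical BED records (chrom, start, end, name, score, strand).
example = [
    "Chr1\t100\t150\tread_a\t37\t+",
    "Chr1\t2000000\t2000050\tread_b\t37\t-",
]
converted = list(to_ensembl_bed(example))
print(converted)
```

Unlike the `gsub` one-liner, which prints only lines where a replacement happened, this version also keeps records whose chromosome name has no “Chr” prefix.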

Visualize mapped result (Ensembl)

• Click the link “1:0-100000”

• You can see your reads on the “my_reads” track.

• Only you can see your track

• You have to upload the BED file again after you log out of your computer.

Count tags on each gene

• Most RNA-Seq tools depend on some libraries.

• We have to install several programs to use them.

• Some of them require administrator authority.

• We provide a simple Python script to count the number of tags.

# We skip downloading the GFF file.
# The GFF file contains gene positions on chromosomes.
# $ curl -O ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff

$ python count_reads_on_gene.py SRR038985_chr1.sam TAIR10_GFF3_genes.gff > SRR038985_chr1.exp

SRR038985_chr1.exp (columns: gene name, count)
...
AT1G01046 0
AT1G01050 1
AT1G01060 0
....

# Sort by count in reverse order
$ sort -k2 -nr SRR038985_chr1.exp
AT1G18745 59
AT1G16635 47
AT1G21650 27
AT1G75163 16
...
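The course script count_reads_on_gene.py is not reproduced in these slides. The following is an independent sketch of how such a counter could work, not the actual script. It assumes standard tab-separated SAM and GFF3 columns, counts a read for a gene when the read's leftmost mapped position lies inside the gene, and uses made-up gene IDs and reads as example data.

```python
from collections import Counter


def parse_genes(gff_lines):
    """Return (gene_id, chrom, start, end) for each 'gene' feature in GFF3 lines."""
    genes = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        genes.append((attributes.get("ID", "unknown"), fields[0],
                      int(fields[3]), int(fields[4])))
    return genes


def count_reads(sam_lines, genes):
    """Count mapped reads whose leftmost position lies inside each gene."""
    counts = Counter({gene_id: 0 for gene_id, _, _, _ in genes})
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 4:
            continue  # flag bit 0x4 means the read is unmapped
        # Simple linear scan; fine for a toy example, slow for a whole genome.
        for gene_id, gene_chrom, start, end in genes:
            if chrom == gene_chrom and start <= pos <= end:
                counts[gene_id] += 1
    return counts


# Hypothetical example data.
gff = [
    "##gff-version 3",
    "Chr1\tTAIR10\tgene\t100\t500\t.\t+\t.\tID=GENE_A;Name=GENE_A",
    "Chr1\tTAIR10\tmRNA\t100\t500\t.\t+\t.\tID=GENE_A.1;Parent=GENE_A",
    "Chr1\tTAIR10\tgene\t800\t900\t.\t-\t.\tID=GENE_B",
]
sam = [
    "@SQ\tSN:Chr1\tLN:1000",
    "r1\t0\tChr1\t120\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tChr1\t450\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tChr1\t600\t37\t4M\t*\t0\t0\tACGT\tIIII",
]
counts = count_reads(sam, parse_genes(gff))
for gene_id in sorted(counts):
    print(gene_id, counts[gene_id])
```

Raw counts like these still need length normalization (for example RPKM) before expression levels of different genes are compared.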

A.lyrata reads and visualization

• The A.lyrata genome paper was published in April 2011.

• The genome sequence consists of small contigs

• This state is similar to a genome just after sequence assembly

• We map reads on A.lyrata and visualize the data in IGV.

• In Ensembl Plants, the A.lyrata genome is already available. However, unpublished genome sequences are not available on the site.

• This is a limitation of web applications (web sites).

• Here we use IGV again.

• IGV does not contain A.lyrata genome information.

• We start by importing the genome and gene information.

Mapping A.lyrata Reads

# Archive includes these files
# $ curl -O \
# ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Alyrata/assembly/Alyrata_107_RM.fa.gz
# This file contains all chromosome sequences. Need not concatenate.
# $ curl -O \
# ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Alyrata/annotation/Alyrata_107_gene.gff3.gz
# $ gzip -d Alyrata_107_RM.fa.gz
# $ gzip -d Alyrata_107_gene.gff3.gz
$ ./bwa/bwa index -c Alyrata_107_RM.fa
$ python csfasta2fastq.py --bwa lyr_reads > lyr_reads.bwa
$ ./bwa/bwa aln -c Alyrata_107_RM.fa lyr_reads.bwa > lyr_reads.sai
$ ./bwa/bwa samse Alyrata_107_RM.fa lyr_reads.sai lyr_reads.bwa \
> lyr_reads.sam
$ ./samtools/samtools faidx Alyrata_107_RM.fa
$ ./samtools/samtools view -bt Alyrata_107_RM.fa.fai -o \
lyr_reads.bam lyr_reads.sam
$ ./samtools/samtools sort lyr_reads.bam lyr_reads.sorted
$ ./samtools/samtools index lyr_reads.sorted.bam lyr_reads.sorted.bai

$ java -jar ./IGV_1.5.64/igv.jar

Visualization of Mapped Result

• Load genome and genes. In IGV, File > Import Genome

• Name: A.lyrata (as you like!)

• Sequence File: Select your Alyrata_107_RM.fa

• Cytoband File: [empty]

• Gene File: Select your Alyrata_107_gene.gff3

• IGV checks the file contents, so you need to wait a moment.

• Then, save.

• Select file to save genome information.

• Load read information. In IGV, File > Load from File.

• Select “lyr_reads.sorted.bam”

Assemble reads with velvet

• This is a toy example. We just check the usage.

• Genome/Gene assembly requires huge main memory.

• Velvet requires “AT LEAST” 12 GB.

• Requires two steps: velveth and velvetg

• For SOLiD reads, use velveth_de and velvetg_de

• Options are the same.

• Before running velvet, we have to convert the read format using ABI’s scripts called denovo2.0 (SOLiD only)

• http://solidsoftwaretools.com/gf/project/denovo/frs/?action=FrsReleaseBrowse&frs_package_id=65

• After this process (if the reads come from a genome), you can run gene prediction programs (Fgenesh, EuGene, GenomeThreader etc.).

• Modern assemblers use a de Bruijn graph (k-mer graph). Changing the parameter k changes the assembly result drastically.

• We have to generate many assemblies with various parameters to obtain the best one.




# Download and install velvet
# $ curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.03.tgz
# $ gzip -dc velvet_1.1.03.tgz | tar xvf -
# $ ln -s velvet_1.1.03 velvet
# $ cd velvet
# $ make color
# Download ABI's scripts and extract them
$ gzip -dc denovo2.tgz | tar xvf -
# Preprocessing for velvet
$ perl ./denovo2/utils/solid_denovo_preprocessor_v1.2.pl --run_type \
fragment -output chr1_de --f3_file SRR038985_chr1.csfasta
# Run velvet
$ ./velvet/velveth_de assemble_chr1 17 -fasta -short \
chr1_de/doubleEncoded_input.de
$ ./velvet/velvetg_de assemble_chr1 -exp_cov auto
# assemble_chr1/contigs.fa contains the generated contigs

# Show assembly statistics
$ perl ./denovo2/utils/assembly_stats.pl assemble_chr1/contigs.fa

Sum contig length : 303616
Num contigs : 3796
Mean contig length : 79
Median contig length : 66
N50 value : 79
Max : 583
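Because the result depends so strongly on k, it helps to generate the commands for several k values instead of typing them by hand. The sketch below only builds the command lines, reusing the velveth_de and velvetg_de options shown above. The function name, the choice of k values and the output directory naming are assumptions for illustration.

```python
def velvet_commands(output_prefix, input_file, k_values):
    """Build velveth_de / velvetg_de command lines, one pair per k."""
    commands = []
    for k in k_values:
        out_dir = f"{output_prefix}_k{k}"  # hypothetical naming scheme
        commands.append(["./velvet/velveth_de", out_dir, str(k),
                         "-fasta", "-short", input_file])
        commands.append(["./velvet/velvetg_de", out_dir, "-exp_cov", "auto"])
    return commands


# Example k values chosen around the k=17 used on the slide.
cmds = velvet_commands("assemble_chr1", "chr1_de/doubleEncoded_input.de",
                       [15, 17, 19])
for cmd in cmds:
    print(" ".join(cmd))
```

Each printed line can be pasted into the shell, or passed to `subprocess.run(cmd, check=True)` to execute it. Afterwards, compare the contig statistics of each output directory to pick the best k.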


For Roche 454 (or IonTorrent)

• Reads are longer than Illumina and SOLiD reads

• Traditional Sanger sequence analysis can be used

• Homology search with BLAST or BLAT

• Assembly with CAP3 or MIRA3

• Combining Roche 454 with Illumina/SOLiD will produce better results.

• Recent assemblies of large genomes have used this combination.

• One of the problems with BLAST/BLAT is that these programs do not support modern file formats such as SAM/BAM.

• Some programs such as GMAP support the new formats.

• To solve the problem, we make a format-converting script and use it.

Mapping 454 Reads

• We use EST sequences

• EST sequences contain poly-A tails and vector sequences.

• For short reads, we skipped this step because the reads are too short to tell whether they are vector sequences.

• Procedure

• Remove these sequences

• Use lucy

• Map trimmed sequences against genome

• BLAST and BLAT

• Convert the result to SAM format

• Convert the SAM to BAM and check the result in viewer.


# Download and install lucy from http://lucy.sourceforge.net/
# curl -O http://jaist.dl.sourceforge.net/project/lucy/lucy/lucy%201.20/lucy1.20.tar.gz
# gzip -dc lucy1.20.tar.gz | tar xvf -
# cd lucy-1.20p
# make; ln -s lucy-1.20p lucy
# Download blat executable file (for Mac OS X) and set it up
# $ curl -O \
# http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/blat/blat
# $ chmod 755 blat
# To trim vector sequences, run lucy
$ ./lucy/lucy -vector pDNR.vec pDNRsplice.spl -out \
roche454_test_trim.fasta roche454_test_trim.qual \
roche454_test.fna roche454_test.qual
# Run blat. (About 7 mins required)
# We may need to change the score matrix to get meaningful alignments
$ ./blat -t=dna -q=dna -tileSize=8 -out=blast TAIR10_chr_all.fas \
roche454_test_trim.fasta roche454_test_TAIR10.result
# Convert the result into a SAM file
# The -t option specifies the maximum E-value threshold for the SAM file.
$ ruby blastn2sam.rb -t 0.00001 -s roche454_test_TAIR10.result \
> roche454_test_TAIR10_e5.sam
# After this process, you can follow the same procedure as for short reads
# (convert SAM to BAM and visualize the data in IGV).

Concluding Remarks

• The analysis in this lecture is a first step into bioinformatics and computer science.

• Software and methods for analyzing next-generation sequencing data are still in an early phase.

• Only mapping and assembly software is widely used. Other processes are under development.

• To use NGS, we have to keep track of software updates and unpublished information.

• Use mailing lists and QA sites.

• Most software in biology has a limited number of users.

• Think about Microsoft Word. Many users, but many...

• Much software has poor documentation.

• Bugs always exist.

• Good software is updated frequently to fix bugs and keep up with new information.

• If no software exists for your experiment, a simple script may help your analysis.
