Galaxy CARC Workshop 2016 - University of New Mexico


Using the Galaxy Local Bioinformatics Cloud at CARC

Lijing Bu
Sr. Research Scientist – Bioinformatics Specialist

Center for Evolutionary and Theoretical Immunology (CETI)
Department of Biology, University of New Mexico

CARC Galaxy Workshop @ UNM

Outline

• Self-introduction
• Galaxy
• Hands-on activity
  – Demo 1
  – Demo 2
• Useful information

Hands up if you have

• Worked with NGS data?
• Learned the FASTQ format (what the 4 lines mean)?
• Done RNA-Seq?
• Run command lines?
• Created a BLAST database?
• Installed tools on Linux/Unix?
• Used an online bioinformatics platform?
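For anyone unsure about the four lines of a FASTQ record, here is a minimal Python sketch. The record and the function name `parse_fastq_record` are made up for illustration and are not part of the workshop data.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into its parts."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return {
        "id": header[1:],      # line 1: '@' followed by the read identifier
        "sequence": sequence,  # line 2: the base calls
        "quality": quality,    # line 4: one quality character per base
    }


# Hypothetical record, for illustration only.
example = ["@read_1\n", "GATTACA\n", "+\n", "IIIIHHG\n"]
record = parse_fastq_record(example)
print(record["id"], len(record["sequence"]))
```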

Self-Introduction

• Name: Lijing Bu
• Department: CETI @ Biology
• Project: RNA-Seq, genome re-sequencing
• Tools:
  – Shell, Perl
  – BLAST, ClustalW
  – TopHat, Cufflinks, Trinity
  – R, edgeR, DESeq
  – ABySS, SOAPdenovo, Velvet

Big Data and Abundant Tools

• NGS data: 2~50 GB of initial data per project
• Analysis involves multiple steps and tools
• Computational challenge
• Command lines make it easy to construct workflows, but it takes time to master them

blastx -query Trinity.fasta -db /home/blast/2015-06-18/swissprot -out blastx.outfmt6 -evalue 1e-20 -num_threads 44 -max_target_seqs 1 -outfmt 6
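The blastx command above writes tabular output (-outfmt 6), which has 12 tab-separated columns per hit. A minimal Python sketch for reading one such line follows. The sample hit line and the function name `parse_outfmt6_line` are made up for illustration.

```python
# Standard column order of BLAST tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_outfmt6_line(line):
    """Turn one tab-separated hit line into a dictionary with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(OUTFMT6_FIELDS):
        raise ValueError("expected 12 tab-separated columns")
    hit = dict(zip(OUTFMT6_FIELDS, values))
    for name in ("pident", "evalue", "bitscore"):
        hit[name] = float(hit[name])
    for name in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
        hit[name] = int(hit[name])
    return hit


# Made-up hit line, for illustration only.
sample = "TRINITY_DN1_c0_g1_i1\tsp|P12345|EXAMPLE\t85.0\t120\t18\t0\t1\t360\t10\t129\t1e-50\t200.0\n"
hit = parse_outfmt6_line(sample)
print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```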

Bioinformatics Clouds

• Easy to use
• Share data, analysis steps
• Workflows

$$$$$ · Fixed workflows · Few apps

Galaxy @ PSU

• Open source
• 700 individual tools in 200 packages, 40 categories
• Easy to use
• Highly customizable
  – Local instance
  – Add almost any bioinformatics tool
  – Use customized reference databases
• Can use high-performance computer clusters
• Developers can publish new tools

Galaxy Interface

Tools Panel | View Panel | History Panel

Example  Workflow  – RNA-­‐Seq


Galaxy @ CARC

UNM Local Cloud for Bioinformatics

[Diagram components: Users, Administrator, Galaxy Web Server, CARC Manager, Ulam Cluster (16 nodes x 8 CPUs/32 GB), Xena Cluster (1 TB ~ 3 TB shared memory)]

Agenda of Galaxy @ CARC

• Phase I - Sputnik: Proof of concept
  – Local Galaxy test run.
  – Tools installation.
  – Connect to CARC server, submit PBS jobs.
• Phase II - Pluto: Internal test
  – Hardware connection to cluster, install Linux and Galaxy, set up PBS job submission, main page design.
  – Continue to add software, separate cluster jobs (60 s lag) from local jobs.
  – For a few tools, run benchmark tests to find the settings that give the best performance.
  – For some tools, extend PBS jobs to be submitted to the large shared-memory server (1 TB ~ 3 TB).
  – Open to a few internal users, workshops.
  – Fix and add more tools and local databases based on feedback.
• Phase III - Pluto: Open to more users
  – Install more tools as requested by users.
  – Build workflows from repeatedly used tools.
  – Develop tools/workflows for specific purposes, and publish/share them with the whole Galaxy group.
  – Possible hardware upgrade.
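The slides do not show how Galaxy at CARC builds its PBS jobs. As a rough idea of what a PBS job script looks like, here is a Python sketch that writes one. The function name `build_pbs_script`, the job name, the command, and all resource values are assumptions for illustration, not the CARC configuration.

```python
def build_pbs_script(job_name, command, nodes=1, cpus_per_node=8, walltime="01:00:00"):
    """Return the text of a PBS job script (all values are illustrative)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                         # job name shown in the queue
        f"#PBS -l nodes={nodes}:ppn={cpus_per_node}",  # nodes and CPUs per node
        f"#PBS -l walltime={walltime}",                # maximum run time
        "cd $PBS_O_WORKDIR",                           # directory the job was submitted from
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: run FastQC on one read file.
script = build_pbs_script("fastqc_demo", "fastqc reads_1.fastq")
print(script)
```

A script like this is normally handed to the scheduler with qsub; in a Galaxy setup the server prepares and submits the job for the user.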

Register a CARC Account

• PI applies for a project (approved in 1-2 days)
  – https://www.carc.unm.edu/getting-started/request-a-project.html
  – Name, email, title
  – Abstract
• Students apply for an account linked to the PI's project (approved in 1-2 days)
  – https://www.carc.unm.edu/getting-started/request-an-account.html
  – Name, email, and the project name to link to
  – Select the machines you want to use
• Contact Lijing Bu to create a Galaxy account

Recommended Links about Galaxy

• All about Galaxy
  – https://galaxyproject.org/
• Ask questions on BioStar
  – https://biostar.usegalaxy.org/
• Videos of various analyses using Galaxy
  – https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail

Galaxy – Pluto @ CARC

http://pluto.alliance.unm.edu

User name: workshop-user#, where # is your seat number
Password: carcgalaxy
Change your password after login!

Temporary user accounts were created for workshop use only.
All data/workflows of temporary user accounts will be deleted one month after the workshop.

Hands On

Demo 1
• Basic dataset management
  1. Shared histories
  2. NGS reads QC
  3. Workflow
• Handle multiple datasets
  1. Select multiple datasets as input
  2. Build a dataset collection

Demo 2
• RNA-Seq workflow
  – Copy datasets
  – Upload data with a link
  – View and run workflow
• Dataset management
  – Manage history
  – Delete/hide datasets
  – Share history

A PDF file with detailed instructions is at https://www.carc.unm.edu/education-outreach/workshops--training/workshop-materials/index.html

Derived from the Galaxy Project's online videos at https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail

Demo 1

• Basic dataset management
  1. Shared histories
  2. Reads QC
  3. Workflow
• Handle multiple datasets
  1. Select multiple datasets as input
  2. Build a dataset collection

Find Published History

Import History - 1

Import History - 2

Check the Reads Quality

• Click to view tools in this category.
• Eye: view dataset. Pencil: change features. Cross: delete dataset.
• Click to see a brief view.
• Download dataset.
• Run a single FastQC on fastq read file 1: click on the link to open the tool.
• Delete dataset 3, click to check deleted data, and undelete dataset 3.

FastQC Results – Single Input

FastQC generates 2 output files:
1. HTML webpage report (shown here, data 6)
2. Raw text report (data 7)

Good or bad Illumina data? See http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Reads Quality Filtering

• Find Trimmomatic in the tool panel and click on the link.
• Select single-end data.
• Run with the default settings.

Trimmomatic Results

FastQC on Filtered Data

1. Click the re-run button.
2. Switch to the filtered dataset.
3. Run.

Improved Reads Quality by Filtering

[Figure: FastQC reports before and after Trimmomatic filtering]
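To give a feel for how quality filtering shortens reads, here is a simplified sliding-window trimming sketch in Python. It is not Trimmomatic's own code, and the window size, quality threshold, read, and scores are assumptions chosen for illustration.

```python
def sliding_window_trim(sequence, qualities, window=4, min_mean_quality=20):
    """Cut the read at the first window whose mean quality falls below the threshold."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    for start in range(0, len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            # Keep only the bases before the first low-quality window.
            return sequence[:start], qualities[:start]
    return sequence, qualities


# Made-up read whose quality drops toward the 3' end.
seq = "ACGTACGTAC"
quals = [38, 37, 36, 35, 30, 28, 12, 10, 8, 5]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, trimmed_quals)
```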

Extract Workflow from History

• Uncheck datasets 2-5; keep the analysis steps on dataset 1 only.
• Click tools to add them to the current workflow.
• Mark output files to hide the rest in the history.
• Save & Run.

Select Multiple Files as Input

• Button to select multiple files.
• Shift + select: hold the Shift key to select a series of files.
• Control or Command key: hold to select or deselect multiple files.
• View from an individual tool.

Create a Dataset Collection for Multiple-Step Analysis

Build a list for paired-end read files.

Dataset Collection Created

FastQC - Select Dataset Collection

Results of FastQC on Collection

Instead of two output files, there are two lists of output files. Each list has 4 files.

Manage the History

Share your analysis with another user or with everyone.

Copy Datasets to a New History

• Select fastq datasets 1 to 5.
• Name the new history.

Demo 2

• RNA-Seq workflow
  – Copy datasets
  – Upload data with a link
  – View and run workflow
• Dataset management
  – Manage history
  – Delete/hide datasets
  – Share history

RNA-Seq Technology

http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf

Input: 6 files
• Reads: 2 samples x 2 replicates
• Reference sequences
• General Feature Format (GFF) file

Tools
• NGS aligner: TopHat2
• Reads counter/stats: Cufflinks

Upload Reference Sequence

1. In a new window, open the following address:
   • UCSC FTP site of human reference genome sequences
   • http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes
2. Right-click on the chromosome 19 file (chr19.fa.gz) and copy the link address.
   • Right-click: on a Mac, click with two fingers.
   • Correct link: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr19.fa.gz

Note: Be careful where you get data! The NCBI, UCSC, and ENSEMBL databases store data in slightly different formats (ID system, chromosome labels, GFF).
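One common case of the label difference is that UCSC files name chromosome 19 "chr19" while Ensembl files name it "19". A minimal Python sketch for checking that a reference and an annotation use the same labels follows. The function names and sample labels are made up for illustration, and unplaced scaffolds are not handled.

```python
def to_ucsc_label(label):
    """Convert a chromosome label such as '19' to the UCSC style 'chr19'."""
    if label.startswith("chr"):
        return label
    if label == "MT":  # Ensembl-style mitochondrial label
        return "chrM"
    return "chr" + label


def labels_match(reference_labels, annotation_labels):
    """True when every annotation label also appears in the reference."""
    return set(annotation_labels) <= set(reference_labels)


# Hypothetical labels: a UCSC-style reference and an Ensembl-style annotation.
reference = ["chr19"]
annotation = ["19"]
print(labels_match(reference, annotation))  # labels differ as downloaded
converted = [to_ucsc_label(label) for label in annotation]
print(labels_match(reference, converted))   # labels agree after conversion
```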

Upload Reference

• Galaxy is sensitive to data type! Most tools require the fastqsanger type for fastq files, rather than fastq, fastqcssanger, or fastqillumina.
• When you paste the link, make sure the size is not empty. If it is empty, type a space after the pasted link address.
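Why the data type matters: fastqsanger encodes quality scores with an ASCII offset of 33, while fastqillumina uses an offset of 64, so the same characters decode to different scores. A minimal Python sketch with a made-up quality string:

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores for a given ASCII offset."""
    return [ord(character) - offset for character in quality_string]


# Made-up quality string, for illustration only.
quality = "IIIIHHG"
sanger_scores = decode_quality(quality, offset=33)    # read as fastqsanger
illumina_scores = decode_quality(quality, offset=64)  # read as fastqillumina
print(sanger_scores)
print(illumina_scores)
```

Read with the wrong offset, high-quality bases look poor (or the other way round), which is why the type must be set correctly before running tools.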

Upload  Reference  Sequence


Find Published Workflows

!!! The default input file is the last file that fits the format type. For multiple files with the same format type (here fastq), the input order needs to be checked. !!!

Jobs Running

If the page doesn't reload automatically but the circle in the browser tab is spinning, the job is running. Be patient.

• Grey box: jobs are waiting
• Yellow box: jobs are running
• Red box: error messages

Manage Datasets in the History

• Click to show deleted files.
• Click to show hidden files.
• Click again to hide them.

In workflows, you can choose to hide unwanted intermediate files (more details in the workflow build section).

Demo Results

• BAM files: reads aligned to the reference.
• Differential expression analysis results.
• Newly found transcripts in GFF format (two samples merged).
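GFF files are tab-separated text with 9 columns per feature. A minimal Python sketch for reading one line follows. The sample line and the function name `parse_gff_line` are made up for illustration, and the attributes column is left as raw text.

```python
# The 9 columns of a GFF feature line.
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff_line(line):
    """Parse one GFF feature line; return None for comments and blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated columns")
    feature = dict(zip(GFF_COLUMNS, values))
    feature["start"] = int(feature["start"])  # 1-based start position
    feature["end"] = int(feature["end"])      # 1-based, inclusive end position
    return feature


# Made-up transcript line, for illustration only.
sample = "chr19\tCufflinks\ttranscript\t1000\t5000\t.\t+\t.\tID=transcript_1\n"
feature = parse_gff_line(sample)
print(feature["seqid"], feature["type"], feature["end"] - feature["start"] + 1)
```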