Galaxy CARC Workshop 2016 - University of New Mexico
Using the Galaxy Local Bioinformatics Cloud at CARC
Lijing Bu
Sr. Research Scientist – Bioinformatics Specialist
Center for Evolutionary and Theoretical Immunology (CETI)
Department of Biology, University of New Mexico
CARC Galaxy Workshop @ UNM 1
Outline
• Self-introduction
• Galaxy
• Hands on activity
– Demo 1
– Demo 2
• Useful information
Hands up if you have
• Worked with NGS data?
• Learned the FASTQ format (what the 4 lines mean)?
• Done RNA-Seq?
• Run command lines?
• Created a BLAST database?
• Installed tools on Linux/Unix?
• Used an online bioinformatics platform?
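For readers who have not met the FASTQ format yet, the following is a minimal sketch of what the 4 lines of a record hold. The function name parse_fastq and the example read are made up for illustration; real files are usually read from disk and may be compressed.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str    # line 1: starts with "@", followed by the read identifier
    sequence: str   # line 2: the base calls
    separator: str  # line 3: starts with "+" (may repeat the identifier)
    quality: str    # line 4: one quality character per base


def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-blank lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(cleaned), 4):
        header, seq, sep, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"Malformed record at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, sep, qual)


# Made-up record, for illustration only.
example = [
    "@read1 made-up example",
    "GATTACA",
    "+",
    "IIIIHHG",
]
records = list(parse_fastq(example))
print(records[0].read_id, len(records[0].sequence))
```

The length check between lines 2 and 4 is the property that most often breaks when a file is truncated during upload.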
Self-Introduction
• Name: Lijing Bu
• Department: CETI @ Biology
• Project: RNA-Seq, genome re-sequencing
• Tools:
– Shell, Perl
– Blast, ClustalW
– Tophat, Cufflinks, Trinity
– R, edgeR, DESeq
– Abyss, SOAPdenovo, Velvet
Big Data and Abundant Tools
• NGS data: 2~50 GB of initial data per project
• Analysis involves multiple steps and tools
• Computational challenge
• Command lines make it easy to construct workflows, but they take time to master

blastx -query Trinity.fasta -db /home/blast/2015-06-18/swissprot -out blastx.outfmt6 -evalue 1e-20 -num_threads 44 -max_target_seqs 1 -outfmt 6
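The command above writes tabular output (-outfmt 6), which has 12 tab-separated columns by default. As a rough sketch of how such a file could be read afterwards, the code below parses one line into named fields. The function name read_outfmt6 and the sample hit are made up for illustration and are not workshop results.

```python
import csv
import io

# The 12 default columns of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def read_outfmt6(handle):
    """Return a list of dicts, one per hit, with numeric fields converted."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up hit line, for illustration only.
sample = "contig_1\tsp|P00000|EXAMPLE\t85.5\t200\t29\t0\t1\t600\t10\t209\t1e-50\t250.0\n"
hits = read_outfmt6(io.StringIO(sample))
strong = [h for h in hits if h["evalue"] <= 1e-20]
print(len(strong), strong[0]["sseqid"])
```

With a real result file, the io.StringIO object would be replaced by an open file handle, for example open("blastx.outfmt6").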
Bioinformatics Clouds
• Easy to use
• Share data, analysis steps
• Workflows
$$$$$, Fixed workflows, Few Apps
Galaxy @ PSU
• Open source
• 700 individual tools in 200 packages, 40 categories
• Easy to use
• Highly customizable
– Local instance
– Add almost any bioinformatics tool
– Use customized reference databases
• Able to use high-performance computer clusters
• Developers can publish new tools
Galaxy @ CARC: UNM Local Cloud for Bioinformatics
[Diagram labels: Users, Administrator, CARC Manager, Galaxy Web Server, Ulam Cluster (16 nodes x 8 CPUs/32 GB), Xena Cluster (1 TB ~ 3 TB shared memory)]
Agenda of Galaxy @ CARC
• Phase I - Sputnik: Proof of concept
– Local Galaxy test run.
– Tool installation.
– Connect to the CARC server, submit PBS jobs.
• Phase II - Pluto: Internal test
– Hardware connection to the cluster, install Linux and Galaxy, set up PBS job submission, main page design.
– Continue to add software; separate cluster jobs (60 s lag) from local jobs.
– For a few tools, run benchmark tests to find the settings that give the best performance.
– For some tools, extend PBS jobs so they can be submitted to the large shared-memory server (1 TB ~ 3 TB).
– Open to a few internal users and workshops.
– Fix and add more tools and local databases based on feedback.
• Phase III - Pluto: Open to more users
– Install more tools as requested by users.
– Build workflows from repeatedly used tools.
– Develop tools/workflows for specific purposes, and publish/share them with the whole Galaxy group.
– Possible hardware upgrade.
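To give a feel for what "submit PBS jobs" involves, here is a small sketch that renders a generic PBS job script as text. It is not the configuration Galaxy uses at CARC: the function name render_pbs_script, the job name, the walltime, and the example command are all assumptions, and only the default of 8 CPUs per node echoes the Ulam figure mentioned in these slides.

```python
def render_pbs_script(job_name: str, command: str, nodes: int = 1,
                      cpus_per_node: int = 8,
                      walltime: str = "01:00:00") -> str:
    """Build the text of a simple PBS/Torque job script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                         # job name
        f"#PBS -l nodes={nodes}:ppn={cpus_per_node}",  # nodes and CPUs per node
        f"#PBS -l walltime={walltime}",                # maximum run time
        "cd $PBS_O_WORKDIR",                           # start where the job was submitted
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the file name reads_1.fastq is made up.
script = render_pbs_script("fastqc_demo", "fastqc reads_1.fastq")
print(script)
```

A script like this is normally saved to a file and handed to the scheduler with qsub. In Galaxy the web server prepares and submits such jobs for the user, which is why no command line is needed during the demos.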
Register CARC Account
• The PI applies for a project (approved in 1-2 days)
– https://www.carc.unm.edu/getting-started/request-a-project.html
– Name, email, title
– Abstract
• Students apply for an account linked to the PI's project (approved in 1-2 days)
– https://www.carc.unm.edu/getting-started/request-an-account.html
– Name, email, and the project name to link to
– Select the machines you want to use
• Contact Lijing Bu to create a Galaxy account
Recommended Links about Galaxy
• All about Galaxy
– https://galaxyproject.org/
• Ask questions on BioStar
– https://biostar.usegalaxy.org/
• Videos of various analyses using Galaxy
– https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail
Galaxy – Pluto @ CARC
http://pluto.alliance.unm.edu
User name: workshop-user#, where # is your seat number
Password: carcgalaxy
Change your password after login!
Temporary user accounts were created for workshop use only.
All data/workflows of temporary user accounts will be deleted one month after the workshop.
Hands On
Demo 1
• Basic dataset management
1. Shared histories
2. NGS reads QC
3. Workflow
• Handle multiple datasets
1. Select multiple datasets as input
2. Build a dataset collection
Demo 2
• RNA-Seq workflow
– Copy datasets
– Upload data with a link
– View and run workflow
• Dataset management
– Manage history
– Delete/hide datasets
– Share history
A PDF file with detailed instructions is at https://www.carc.unm.edu/education-outreach/workshops--training/workshop-materials/index.html
Derived from the Galaxy Project's online videos at https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail
Demo 1
• Basic dataset management
1. Shared histories
2. Reads QC
3. Workflow
• Handle multiple datasets
1. Select multiple datasets as input
2. Build a dataset collection
Import History - 2
Click to view tools in this category.
Eye: view dataset
Pencil: change attributes
Cross: delete dataset
Click for a brief view
Download dataset
Check the Read Quality
Run a single FastQC on fastq read file 1.
Delete dataset 3, click to show deleted data, and undelete dataset 3.
Click on the link to open the tool
FastQC Results – Single Input
FastQC generates 2 output files:
1. HTML webpage report (shown here, data 6)
2. Raw text report (data 7)
Good or bad Illumina Data? http://www.bioinformatics.babraham.ac.uk/projects/fastqc
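The raw text report is easier to check by script than the HTML page. The sketch below assumes the report follows the layout of FastQC's fastqc_data.txt, where each module begins with a line of the form ">>module name<TAB>status" and ends with ">>END_MODULE". The function name module_statuses and the sample text are made up for illustration.

```python
def module_statuses(report_text: str) -> dict[str, str]:
    """Map each FastQC module name to its status (pass, warn or fail)."""
    statuses = {}
    for line in report_text.splitlines():
        # Module headers start with ">>"; ">>END_MODULE" only closes a module.
        if line.startswith(">>") and line != ">>END_MODULE":
            name, _, status = line[2:].partition("\t")
            statuses[name] = status
    return statuses


# Made-up, heavily shortened report, for illustration only.
sample_report = (
    "##FastQC\t0.0.0\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    ">>END_MODULE\n"
    ">>Per base sequence quality\tfail\n"
    ">>END_MODULE\n"
)
statuses = module_statuses(sample_report)
failed = [name for name, status in statuses.items() if status == "fail"]
print(failed)
```

Listing only the failed modules gives a quick answer to the "good or bad" question before opening the full HTML report.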
Read Quality Filtering
Find Trimmomatic in the tool panel and click on the link
Run with default settings
Single-end data
FastQC on Filtered Data
1. Click the re-run button
2. Switch to the filtered dataset
3. Run
Extract Workflow from History
Uncheck datasets 2-5 and keep the analysis steps on dataset 1 only.
Click tools to add them into current workflow
Mark output files to hide the rest in the history.
Save & Run
Select Multiple Files as Input
Shift + Select: press the Shift key to select a series of files.
Control or Command key: press to select or deselect multiple files.
View from individual tool.
Button to select multiple files
Create a Dataset Collection for Multi-Step Analysis
Build a list for paired-end read files.
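Before building a list from paired-end files, it helps to confirm that every sample has both mates. The following sketch groups file names into forward/reverse pairs. The naming scheme (sample_1.fastq and sample_2.fastq), the function name group_pairs, and the file names are assumptions for illustration and may not match the workshop datasets.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming scheme: <sample>_1.fastq and <sample>_2.fastq
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq$")


def group_pairs(file_names):
    """Return {sample: (forward_file, reverse_file)} or raise if a mate is missing."""
    mates_by_sample = defaultdict(dict)
    for name in file_names:
        match = MATE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unexpected file name: {name}")
        mates_by_sample[match["sample"]][match["mate"]] = name
    incomplete = [s for s, mates in mates_by_sample.items() if set(mates) != {"1", "2"}]
    if incomplete:
        raise ValueError(f"Samples missing a mate file: {sorted(incomplete)}")
    return {s: (m["1"], m["2"]) for s, m in sorted(mates_by_sample.items())}


# Made-up file names, for illustration only.
files = ["sampleA_1.fastq", "sampleA_2.fastq", "sampleB_2.fastq", "sampleB_1.fastq"]
pairs = group_pairs(files)
print(pairs)
```

Four files in, two complete pairs out: the same bookkeeping the collection builder does when it shows the datasets it is about to group.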
Results of FastQC on Collection
Instead of two output files, there are two lists of output files. Each list has 4 files.
Copy Datasets to a New History
Select fastq datasets 1 to 5
Name the new history
Demo 2
• RNA-Seq workflow
– Copy datasets
– Upload data with a link
– View and run workflow
• Dataset management
– Manage history
– Delete/hide datasets
– Share history
RNA-‐Seq Technology
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rrnaseq/Rrnaseq.pdf
Input: 6 files
• Reads: 2 samples x 2 replicates
• Reference sequences
• General Feature Format (GFF) file
Tools
• NGS aligner – TopHat2
• Read counter/stats – Cufflinks
Upload Reference Sequence
1. In a new window, open the following address
• UCSC FTP site of human reference genome sequences
• http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes
2. Right-click on chromosome 19 (chr19.fa.gz) and copy the link address.
Right-click: on a Mac, click with two fingers.
Correct link: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr19.fa.gz
Note: Be careful where you get data! The NCBI, UCSC, and ENSEMBL databases store data in slightly different formats (ID system, chromosome labels, GFF).
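One common instance of the label difference is that UCSC names chromosomes chr19, chrX and chrM, while Ensembl uses 19, X and MT. The sketch below converts between the two styles for primary chromosomes only. The function names are made up for illustration, and NCBI accession-style IDs, scaffolds and alternate contigs are deliberately not handled.

```python
def to_ucsc_label(label: str) -> str:
    """Convert an Ensembl-style label ('19', 'X', 'MT') to UCSC style ('chr19', 'chrX', 'chrM')."""
    if label.startswith("chr"):
        return label  # already UCSC style
    if label == "MT":
        return "chrM"
    return "chr" + label


def to_ensembl_label(label: str) -> str:
    """Convert a UCSC-style label to Ensembl style."""
    if label == "chrM":
        return "MT"
    return label[3:] if label.startswith("chr") else label


def labels_agree(fasta_labels, gff_labels) -> bool:
    """True if every chromosome named in the GFF also exists in the reference."""
    return set(gff_labels) <= set(fasta_labels)


# A chr19 reference from UCSC next to an annotation that uses Ensembl labels.
print(labels_agree(["chr19"], ["19"]))
print(labels_agree(["chr19"], [to_ucsc_label("19")]))
```

When the reference and the GFF file disagree like this, tools that join the two by chromosome name typically find no overlapping features, so checking the labels before running a workflow saves time.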
Upload Reference
Galaxy is sensitive to data type! Most tools require the fastqsanger type for fastq files, rather than fastq, fastqcssanger, or fastqillumina.
When you paste the link, make sure the size is not empty. If it is empty, type a space after the pasted link address.
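The fastqsanger type stores quality scores as ASCII characters with an offset of 33, whereas older Illumina files used an offset of 64. As a rough sketch of how the two can sometimes be told apart, the code below looks at the lowest quality character: anything below ASCII 59 can only occur with offset 33. The function names are made up for illustration, and the check is a heuristic that returns None when the data cannot settle the question.

```python
def guess_phred_offset(quality_strings):
    """Return 33 if the data must be Sanger-encoded, or None if it is undetermined."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("No quality characters to inspect")
    # Offset-64 encodings never use characters below ASCII 59 (';').
    if min(codes) < 59:
        return 33
    return None  # high-quality offset-33 data and offset-64 data look alike here


def decode_quality(quality: str, offset: int = 33):
    """Turn a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


print(guess_phred_offset(["!5I"]))
print(decode_quality("!5I"))
```

A result of 33 supports setting the datatype to fastqsanger; a result of None means the sequencing platform or the file's origin has to be checked instead.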
Upload Reference Sequence
!!! The default input file is the last file that fits the format type. For multiple files with the same format type (here fastq), the input order needs to be checked. !!!
Jobs Running
If the page does not reload automatically but the circle in the browser tab is still spinning, the job is running. Be patient.
Manage Datasets in the History
Click to show deleted files.
Click to show hidden files.
Click again to hide them.
In workflows, you can choose to hide unwanted intermediate files (more details in the workflow build section).