implementing a proteomics data pipeline and database on … · 2017. 11. 1. · implementing a...

29
Implementing a Proteomics Data Pipeline and Database on LabKey to promote in-depth analysis, data sharing & integration Wen Yu , Jonathan Pryke, Gina Dangelo, Raghothama Chaerkady, Sonja Hess, Adolf Brown, Paula Gegwich and David Fenstermacher Research Bioinformatics, RD&I, Statistical Science and AD&PE MedImmune LLC, One MedImmune Way, Gaithersburg, Maryland 20878 USA

Upload: others

Post on 01-Feb-2021

16 views

Category:

Documents


0 download

TRANSCRIPT

  • Implementing a Proteomics Data Pipeline and Database on LabKey to promote in-depth analysis, data sharing & integration

    Wen Yu, Jonathan Pryke, Gina Dangelo, Raghothama Chaerkady, Sonja Hess, Adolf Brown, Paula Gegwich and David Fenstermacher

    Research Bioinformatics, RD&I, Statistical Science and AD&PE

    MedImmune LLC, One MedImmune Way, Gaithersburg, Maryland 20878 USA

  • EnablingData-drivenDiscoveryo  Systematic collection of quantitative molecular phenotypes to

    probe what has happened;

    o  Focused experimental design with specified outcomes; o  Big-data approaches leading to deep, holistic, perhaps

    unexpected understanding of the system and biology.

    o  DATA à Understanding à Predictive engineering à LabKey à assay, execution

    analytics à algorithms à interpretation

    vis/integration à network biology à big picture, discovery.

    Genotypes (Genomics, Genetics)

    Proteins, Peptides, Lipids & Metabolites

    Signals, Causality Targets, Biomarkers

    transcripts

  • ~8,000 (cells, tissue) or >1500 (plasma) proteins identified & quantified 480k MS/MS reads / 10-samples

    NextGenProteomics:~CompleteQuan?ta?on

    Sample 1 2 …

    10 TMT 126

    127N …

    131

    Sample

    11 12 …

    20 TMT 126

    127N …

    131

    FASP trypsin digest TMT Labelling/mux

    OrbiTrap Fusion Tribrid

  • Live poll from the audience at a recent ASMS workshop

  • raw PD xlsx Stat MetaBase Custom node

    Protein Set Enrichment Analysis

    Internal & public Proteomics Studies

    Phase 3

    Cell-surface Proteins

    Protein Expression Profiles or Signature

    Mechanism

    of Action Biomark

    ers Reports

  • Sample MEM A B

    Protein 1 yes expr expr Protein 2 no expr expr

    Uns

    uper

    vise

    d cl

    assi

    fica;

    on

    raw PD

    RDB

    Signature, Protein Matrix Proteins, Peptides Abundances, Spectra Expt, Sample, Study

    (2) M

    sf fi

    les

    Impo

    rt

    Custom node

    Visu

    aliz

    a;on

    Protein Set Enrichment Analysis

    Path

    way

    ana

    lysi

    s

    Signature

    QC Metrics Supporting Evidences

    Protein Matrix

    Internal & public Proteomics Studies

    FC & pval / fdr

    Cell-surface Proteins

    Network of overlapping gene sets

    TheProteomicsDataPipeline

    Phase 3

    Reports

    Protein Expression Profiles or Signature

    Mechanism

    of Action Biomark

    ers

    Stat

  • PickingaSolu?on,akaHouseHun?ng

    Fully Furnished Estate Commercialsystemswithbuilt-inanalysispipeline

    Custom Home

    Proprietaryapplica0on

  • ThePipelineDesign:theDIYEdi?onComponent Function Solution Study Design Sample annotation and experimental factors Proteome

    Discoverer Protein ID and Quant Peptide/protein ID, reporter ion intensities

    Data Preparation Protein quants, imputation, data cleansing & formatting Custom R-scripts

    Statistical modeling Ad hoc and formal analysis for the significant changes in protein abundances

    R/SAS, LabKey

    Data Visualization Delivery to the investigators for access and exploration. Ad hoc experimentation.

    R/Shiny, LabKey

    Pipelining Streamlining the workflow LabKey

    Data management Repository, project tracking, visualization, R-integration LabKey

    Pathway, G/PSEA Biological contextualization and hypothesis generation R/Shiny

  • Solu?onsintheEraofOpenSource

    Free Widgets DIY Designers Builders

    Free new construction

    LabKey,R/Shiny

  • LabKeyImplementa?on:DataImport

    Ini?atenewinves?ga?on

    • Dataacquisi?on&PDprocessing• LoaddatatoLabKeydatabase• SendprocesseddataoutforstatanalysisImportnewstudy

    Visualizeprimarydata• Browsethecontentsbytableorview’• In-depthreviewinShinyapp

    Importstatoutcomes• AUachdatatothestudy

    Visualizestudy&analysis• Drill-downfromproteinID• In-depthreviewinShinyapp

    medImmuneModule

  • DataModel:MetaData

    Study

    Sample

    MsRun

    Factor

    1..n

    n..n

    FactorOption SampleOption n..n

    SampleSet 1..n

    1..n

    TMT plate SILAC, H/L

    Investigation 1..n

    QuanMethod

    Channel 1..n

    ProjectOption

    1..n Analysis Config

  • Study

    Sample

    MsRun

    Factor

    1..n

    n..n

    FactorOption SampleOption n..n

    SampleSet 1..n

    1..n

    TMT plate SILAC, H/L

    Investigation 1..n

    QuanMethod

    Channel 1..n

    ProjectOption

    1..n Analysis Config

    FolderStructurevsMetaDataHierarchy

    Folder Structure

  • DataLoadingLogics:Samplesu  Populatethe‘Factor’and‘Op?on’

    tablesifnecessaryasdescribedpreviously

    u  Createentriesin‘SampleOp?on’tablethatlinksamplestoop?ons

    u  Cross-referencethe’Channel’tablevia‘Channel’fieldandpopulatethe‘ChannelId’

  • MSMS

    ProteinGrp

    Precursor

    Reporter

    PSM

    DataModel:MSData

    MsRun

    1..n

    1..n 1..n

    n..n

    IsobaricQuan

    Accession, Gene

    Sequences

    AnalysisId, Master_Accession, Confidence, Coverage, etc

    Feature

    1..n

    1..n

    FeatureMSMS

    Channel

    ProteinGrpQuan Sample

    Analysis

    ReporterQuan

    PsmPGrp

  • MSMS,PSM,PsmPGrpu  ForeachrowinPSM.all.Rdata

    –  Grab the FK to ‘MsRun’ with the ‘Run’ field –  Create an entry in ‘MSMS’ if the Run/Scan combination is not

    present already –  Create entry in ‘Precursor’ table with FK to ‘MSMS’

    –  Create entries in ‘Reporter’ table for each channels with FK to ‘Channel’ rows sharing the same ‘ChannelName’

    –  Create an entry in ‘PSM’ table with FK to the ‘MSMS’ entry

    –  Grab the FK to ‘ProteinGrp’ with the ‘Accessions’ and ‘AnalysisId’ fields;

    –  Create an entry in ‘PsmPGrp’ table with FK to PSM and ProteinGrp.

  • DataModel:Quan?ta?veOutcomes

    IsobaricQuan

    ProteinGrp

    ProteinGrpTest StatTest

    FC, pval, qval

    Factor, FO_A, FO_B, Model n..n

    FactorOption

  • medImmuneModuleandTMTTemplateu Alldatabasetablesandserver-

    sidelogicsareimplementedinanewMedImmunemodule

    u UIlayoutandinterac?verepor?ngarewriUenasa“template”folder,whichcanthenbeusedtocreateanew“inves?ga?on”.

  • 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    DataLoading

    One of the first workflow being implemented is TMT-based multiplex quantitation of the total proteome. Following the data acquisition and processing in ProteomeDiscoverer, the experimental design, peptide and protein identification and quantitation are imported into LabKey where a custom-built data ingestion pipeline written in R will transform the data and prepare them for deposition in a Microsoft SQL database.

    Configureoutput

    1. Upload PD data 2. Import to R pipeline

    3. Analysis Parameters

    4. Save analysis to database

  • LabKeyImpl:ResultsandVisualiza?on

    Ini?atenewinves?ga?on

    • Dataacquisi?on&PDprocessing• LoaddatatoLabKeydatabase• SendprocesseddataoutforstatanalysisImportnewstudy

    Visualizeprimarydata• Browsethecontentsbytableorview’• In-depthreviewinShinyapp

    Importstatoutcomes• AUachdatatothestudy

    Visualizestudy&analysis• Drill-downfromproteinID• In-depthreviewinShinyapp

    medImmuneModule

  • 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    StudyPortal

  • StudyPortal DIYUIandReportu  Datagrid+customqueryoffersflexible

    waytoreviewthedataset

    u  CustomURLenablemaster-detailviewintocomplexhierarchicaldata

    1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    /project/${container}/begin.view?pageId=MsRun%20%28New%29&qwp1.param.RunParam=${RowId}&qwp2.param.RunParam=${RowId}&qwp3.param.RunParam=${RowId}&qwp4.param.RunParam=${RowId}&qwp5.param.RunParam=${RowId}

  • Proteins,PSMandMSMSResultsfromaMS-Run

    1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    /project/${container}/begin.view?pageId=MsRun%20%28New%29&qwp1.param.RunParam=${RowId}&qwp2.param.RunParam=${RowId}&qwp3.param.RunParam=${RowId}&qwp4.param.RunParam=${RowId}&qwp5.param.RunParam=${RowId}

    •  A key strengths of LabKey is the flexibility of custom query, visualization and report with SQL/R or point-n-click interface.

    •  Once a study is imported, its experimental design, LcMsMs runs, protein identification and quantitation can be inspected via the web-interface as data grids or plots.

  • plt01 = dcast(data=labkey.data[,c("name","value","abundance")], abundance~name, value.var='value'); plt02 = dcast(data=labkey.data[,c("name","value","mswept")], mswept~name, value.var='value'); plt = rbind(plt01, plt02); bwplot(abundance~Cohort|Norm, groups=SampleID, data=plt, type=c("p"), layout=c(2, 1), par.settings=simpleTheme(pch=c(10:20),cex=1.25,lwd=2),xlab="Cohort", ylab="Estimated Protein Abundance (log2, mean-summarized)", scale=list(relation="free",alternating=1, y=list(tick.number=10, log=F, equispaced.log=F)), panel = function(x, y, ...) { panel.dotplot(x, y, par.settings=simpleTheme(pch=c(10:20),cex=0.75,lwd=1), cex=1, alpha=0.85, ...) panel.bwplot(x, y, pch = "|", ...) } );

    ProteinAbundances-SQL/RRepor?ng

    1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    For simple visualization, boxplot, volcano plot can be readily generated in LabKey and shared with other researchers.

  • PSMàMS/MSSpectrumviaOpenSlice

    1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    •  To visualize the raw MS and MS/MS data, another open-source program, OpenSlice, was adopted. It pre-processes the raw files to allow instantaneous review of spectrum and XIC trace.

    •  Custom URL in LabKey enables drill-down of the experimental evidences from summary levels downward with OpenSlice.

  • In-depthAnalysisinaR/ShinyApp

    1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies 1. Portal to the Internal and Public Studies

    •  Expose LabKey data to Shiny app for in-depth analysis •  Live data tables, linked volcano plots, enrichment

    analysis, heatmap, and integration with RNASeq, etc.

  • LessonsLearnedu  LabKey,duetoitsopen-source

    architectandabundanceofwidgetsandcustomizability,isanidealenvironmenttomanagecomplexomicsdatasuchproteomics

    u  Byexternalizingtheplacorm-specificprocesses,differentdatatypescanbereadilymanagedinLabKey.

    u  TemplatefolderprovidesagoodcompromisebetweenUIflexibilityandusability.

    u  BeUergraspofthe‘folder’conceptandthescopingruleiscrucial

    u  Properdivisionoflaborsiscri?cal–  Server-sidedatamanagement–  Client-sideUIcustomiza?onand

    repor?ng

    u  BeUerout-of-boxfeatureswillreducetheupfrontworksinacommercialseengs.

  • FuturePlan

    u Addi?onalworkflowsfor–  Label-freequan?ta?onbyMaxquant

    –  Targetedproteomicsusingthe“Panorama”module

    u UIrefinements–  toaccommodatemul?pleworkflowsandtoclarifyuserinputs

    –  Factors,factorop?onsandsampleaUributeeditor

    –  RequestforStatAnalysis

  • Acknowledgements

    u ResearchBioinforma?cs,u Proteomics,AD&PEu RD&Iu Sta?s?calScience

    u MedImmune,AZ

    u CoryNathe,FrankLeeu  JoshEckelsu SteveHanson,AvitalSadot

    u LabKey

  • DataPipelineforProteo/Metabolo/Lipidomics

    PD2.1

    UI

    Searching Clustering

    Stat

    SIGNALpreview

    Query,~,Studies

    Data

    Analytics

    RDBSlice & Dice

    Knowledge BasesPAPP,MetaBase,GeneSetsData store

    Annotation, RAW

    SIGNAL

    PIPE

    Pipelinekickoff

    SpotfireSlice&Dice

    Networking

    Compute Physical Server LabKeyServer Shiny ServerHPC

    Molecular Genotypes,

    Phenotypes + Observations

    Discovery, Insights, Decision

    modeling