Page 1: The myGrid Project

The myGrid Project

Professor Chris Greenhalgh

University of Nottingham

Page 2: The myGrid Project

• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture
• Targeted at Tool Developers, Bioinformaticians and Service Providers

[Map labels: Newcastle, Nottingham, Manchester, Southampton, Hinxton, Sheffield]

Page 3: The myGrid Project

Philosophy

• Openness
– open source
– open world of services
– open to wider eScience context
– open to user feedback
– open to third party metadata
• Collection of components for assembly
– Pick and mix

Page 4: The myGrid Project

Data-intensive bioinformatics

ID   MURA_BACSU STANDARD; PRT; 429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE 116 116 BINDS PEP (BY SIMILARITY).
FT   CONFLICT 374 374 S -> A (IN REF. 3).
SQ   SEQUENCE 429 AA; 46016 MW; 02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
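The record on this slide follows the Swiss-Prot flat-file layout, where each line starts with a two-letter line code (ID, DE, GN, OS, OC, KW, FT, SQ). As a rough illustration of why such data is awkward to handle by cut and paste, here is a minimal Python sketch that groups lines by their code. The helper name `group_by_line_code` is hypothetical, only a few lines of the record are used, and real parsing would normally be left to an established library.

```python
from collections import defaultdict

# A few lines taken from the record on this slide, one entry line per string.
RECORD_LINES = [
    "ID   MURA_BACSU STANDARD; PRT; 429 AA.",
    "GN   MURA OR MURZ.",
    "OS   BACILLUS SUBTILIS.",
    "OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;",
    "OC   BACILLUS.",
    "KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.",
]


def group_by_line_code(lines):
    """Group flat-file lines by their leading two-letter line code."""
    fields = defaultdict(list)
    for line in lines:
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
    # Join continuation lines that share a code into one string per code.
    return {code: " ".join(parts) for code, parts in fields.items()}


record = group_by_line_code(RECORD_LINES)
print(record["OS"])
print(record["OC"])
```

Joining repeated codes into one string is a simplification that suits DE, OC and KW lines; feature (FT) and sequence (SQ) lines would need their own handling.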

Page 5: The myGrid Project

Use Scenarios

Graves’ Disease
• Autoimmune disease of the thyroid
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle
• Discover all you can about a gene
• Annotation pipelines and Gene expression analysis
• Services from Japan, Hong Kong, various sites in UK

Williams-Beuren Syndrome
• Microdeletion of ~1.5 Mbases at 7q11.23 on Chromosome 7 (~155 Mbases)
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
• Characterise an unknown gene
• Annotation pipelines and Gene expression analysis
• Services from USA, Japan, various sites in UK

Page 6: The myGrid Project

Williams-Beuren Syndrome Microdeletion

[Figure: physical map of the Williams-Beuren syndrome region. Labels recovered from the slide:]
• Chr 7, ~155 Mb; deleted region ~1.5 Mb at 7q11.23
• Genes: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/EIF4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14
• Repeat blocks: C-cen, C-mid, A-cen, B-mid, B-cen, A-mid, B-tel, A-tel, C-tel
• Block A: STAG3, PMS2L
• Block B: GTF2IP, NCF1P, GTF2IRD2P
• Block C: FKBP6T, POM121, NOLR1
• Patient deletions: WBS, SVAS
• Physical Map: CTA-315H11, CTB-51J22, Gap

Page 7: The myGrid Project

Manually filling a genomic gap

• Numerous web-based services (e.g. BLAST, RepeatMasker)
• Cutting and pasting
• Large number of steps
• Frequently repeated – info now rapidly added to public databases
• Don’t always get results
• Time consuming
• Huge amount of interrelated data is produced – handled in lab book and files saved to local hard drive
• Mundane
• Much knowledge remains undocumented

∴ Bioinformatician does the analysis

Page 8: The myGrid Project

WBS Workflows:

[Figure: workflow diagram. Labels recovered from the slide; the pairing of each service with its output follows the order of the labels.]

Data items: GenBank Accession No; URL inc GB identifier; GenBank Entry; Nucleotide seq (Fasta); Query nucleotide sequence; Coding sequence; ORFs; Amino Acid translation

Nucleotide sequence services
• Seqret
• GenScan: Coding sequence
• sixpack: 6 ORFs
• restrict: Restriction enzyme map
• cpgreport: CpG Island locations and %
• RepeatMasker: Repetitive elements
• prettyseq: Translation/sequence file. Good for records and publications
• ncbiBlastWrapper: Blastn vs nr, est databases
• transeq: Amino Acid translation

Protein sequence services
• epestfind: Identifies PEST seq
• pscan: Identifies FingerPRINTS
• pepstats: MW, length, charge, pI, etc
• pepcoil: Predicts Coiled-coil regions
• Pepwindow? Octanol?: Hydrophobic regions
• SignalP, TargetP, PSORTII: Predicts cellular location
• InterPro, PFAM, Prosite, Smart: Identifies functional and structural domains/motifs
• ncbiBlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr

Other steps
• RepeatMasker
• Sort for appropriate Sequences only

Colour key
• Pink: Outputs/inputs of a service
• Purple: Tailor-made services
• Green: EMBOSS Soaplab services
• Yellow: Manchester Soaplab services
• Grey: Unknowns
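A workflow like the one on this slide is, in essence, a dependency graph: each service can run once the services that produce its inputs have finished. The following is a minimal Python sketch of that idea using the standard library's `graphlib`. The wiring between steps is an assumption read off the slide labels (only a subset of services is shown), not the actual workflow definition.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the steps whose output it consumes.
# The wiring is an assumption read off the slide labels, not the real workflow file.
DEPENDS_ON = {
    "Seqret": set(),
    "GenScan": {"Seqret"},
    "RepeatMasker": {"Seqret"},
    "restrict": {"Seqret"},
    "cpgreport": {"Seqret"},
    "transeq": {"GenScan"},
    "pepstats": {"transeq"},
    "pepcoil": {"transeq"},
}


def execution_order(depends_on):
    """Return one valid order in which every step runs after its inputs."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Steps with no dependency between them (for example restrict and cpgreport) may appear in either order, which is also what would allow an enactment engine to run them in parallel.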

Page 9: The myGrid Project

Workflow approach: in-silico experiments

Williams-Beuren Syndrome
• Manually: takes two days (+) including analysis
• Now takes 30 mins to produce results and half a day for analysis
• Manually: Do analysis as perform experiment
• Workflow: Do analysis at end of experiment
• Therefore need good result co-ordination for back-tracking

Page 10: The myGrid Project

(e-)Scientists…

• …Experiment
– Can workflow be used as an experimental method?
– How many times has this experiment been run?
• …Analyze
– How do we manage the results to draw conclusions from them?
– How reliable are these results?
• …Collaborate
– Can we share workflows, results, metadata etc?
• …Publish
– Can we link to these workflows and results from our papers?
• …Review
– Can I find, comprehend and review your work?
– How was that result derived?

Page 11: The myGrid Project

myGrid Service Stack

[Figure: layered architecture diagram. Labels recovered from the slide:]
• Fabric: Web Service (Grid Service) communication fabric
• Layers: Applications; Core services; External services
• Audiences: Bioinformaticians; Tool Providers; Service Providers
• Components: Work bench, Taverna, Talisman, Web Portal, Gateway, Personalisation, Provenance, Event Notification, Service and Workflow Discovery, myGrid Information Repository, Ontology Mgt, Metadata Mgt, Views, FreeFluo Workflow Enactment Engine, OGSA-DQP Distributed Query Processor, AMBIT Text Extraction Service, Native Web Services, SoapLab, GowLab, Legacy apps, Registries, Ontologies

Page 12: The myGrid Project

[Figure: the myGrid Service Stack diagram from the previous slide, repeated.]

Page 13: The myGrid Project

Page 14: The myGrid Project

FreeFluo Features

• Control flow, iteration and data flow
• Data sets and nested flows
• Configurable failure handling
• Incorporated Life Science Id resolution
• Provenance and status reporting
• Type and data management
• Plug-ins
• User notification
• Data entry wizard
• Libraries of SHIM services
• Libraries of workflows
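To make "configurable failure handling" concrete, here is a minimal Python sketch of one common policy: retry a service a configurable number of times, then fall back to an alternate. This is an illustration only; the function `call_with_fallback`, its parameters and the stand-in services are hypothetical and are not FreeFluo's or Taverna's API.

```python
import time


def call_with_fallback(services, payload, retries=2, delay_seconds=0.0):
    """Try each service in turn, retrying each one before moving to the next."""
    last_error = None
    for service in services:
        for _attempt in range(retries + 1):
            try:
                return service(payload)
            except Exception as error:  # a sketch: real code would catch narrower errors
                last_error = error
                time.sleep(delay_seconds)
    raise RuntimeError("all services failed") from last_error


# Hypothetical stand-ins for a primary service and an alternate.
def flaky_primary(sequence):
    raise ConnectionError("primary unavailable")


def alternate(sequence):
    return sequence.lower()


result = call_with_fallback([flaky_primary, alternate], "ACGT")
print(result)
```

With many redundant services available, a policy like this lets a workflow survive a single provider being down, at the cost of results that may come from different providers on different runs.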

Page 15: The myGrid Project

Domain Services

• Native WSDL Web services
– DDBJ, NCBI BLAST, PathPort, BioMOBY
• Wrapped legacy services
– SoapLab
– GowLab
• Web pages as web services
– One button wrapping
– Leveraged the EMBOSS Suite
– ~159 services
• Lots of them and lots of redundant services
• The joys of firewalls and licensing

EBI has agreed to support Soaplab services as core business
http://industry.ebi.ac.uk/soaplab/

For each application: CreateJob, Run, WaitFor, GetResults, Destroy
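The five-step job lifecycle named on this slide (CreateJob, Run, WaitFor, GetResults, Destroy) can be sketched in Python as follows. `InMemoryAnalysisService` is a hypothetical in-process stand-in written for this illustration; its method names mirror the slide but it is not the Soaplab API, and it runs the wrapped tool synchronously rather than over SOAP.

```python
import itertools


class InMemoryAnalysisService:
    """Hypothetical stand-in for a wrapped command-line tool; not the Soaplab API."""

    def __init__(self, tool):
        self._tool = tool
        self._jobs = {}
        self._ids = itertools.count(1)

    def create_job(self, inputs):
        job_id = next(self._ids)
        self._jobs[job_id] = {"inputs": inputs, "results": None, "done": False}
        return job_id

    def run(self, job_id):
        job = self._jobs[job_id]
        job["results"] = self._tool(job["inputs"])
        job["done"] = True

    def wait_for(self, job_id):
        # The stand-in runs synchronously, so there is nothing to wait on.
        return self._jobs[job_id]["done"]

    def get_results(self, job_id):
        return self._jobs[job_id]["results"]

    def destroy(self, job_id):
        del self._jobs[job_id]

    def job_count(self):
        return len(self._jobs)


def run_once(service, inputs):
    """Drive one job through create, run, wait, fetch results, destroy."""
    job_id = service.create_job(inputs)
    try:
        service.run(job_id)
        service.wait_for(job_id)
        return service.get_results(job_id)
    finally:
        service.destroy(job_id)


reverse_tool = InMemoryAnalysisService(lambda seq: seq[::-1])
print(run_once(reverse_tool, "ACGT"))
```

Because every wrapped application exposes the same lifecycle, a workflow engine needs only one client routine like `run_once` to drive all of them.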

Page 16: The myGrid Project

Two+ Paths

Core functionality
• Services – Soaplab and Gowlab
• Workflow enactment engine – Freefluo
• Workflow workbench – Taverna
• Data integration – OGSA-DQP
• Information model & management

Innovative work
• Service and workflow registration
• Semantic discovery
• Provenance management
• Text mining

In between
• Event notification
• Gateway

Page 17: The myGrid Project

Drilling Down: myGrid and Semantics

• Workflow and service discovery
– Prior to and during enactment
– Semantic registration
• Workflow assembly
– Semantic service typing of inputs and outputs
• Provenance of workflows and other entities
• Experimental metadata glue
• Use of RDF, RDFS, DAML+OIL/OWL
– Instance store, ontology server, reasoner
– Materialised vs at point of delivery reasoning
• myGrid Information Model
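Semantic typing of inputs and outputs supports workflow assembly by checking whether one service's output can feed another's input. A minimal Python sketch of that check, using a hand-written parent map in place of an ontology and reasoner, is given below. The term names and the hierarchy are illustrative assumptions and are not taken from the myGrid ontology.

```python
# A toy class hierarchy: each term maps to its parent term (None for the root).
# The term names are illustrative assumptions, not terms from the myGrid ontology.
PARENT = {
    "data": None,
    "sequence": "data",
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "report": "data",
    "BLAST_report": "report",
}


def is_a(term, ancestor):
    """True if `term` equals `ancestor` or is one of its descendants."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT[term]
    return False


def can_connect(output_type, input_type):
    """An output may feed an input whose declared type subsumes it."""
    return is_a(output_type, input_type)


print(can_connect("nucleotide_sequence", "sequence"))
print(can_connect("BLAST_report", "sequence"))
```

A description-logic reasoner generalises this walk up a tree to richer class definitions, which is where the choice between materialised and point-of-delivery reasoning comes in.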

Page 18: The myGrid Project

Provenance (1)

[Figure: provenance model diagram. Labels recovered from the slide:]
• Levels: Organisation level provenance; Process level provenance; Data/knowledge level provenance
• Entities: Person, Organisation, Project, Experiment design, Workflow design, Workflow run, Process, Service, Event, Data item
• Relationships:
– instanceOf
– partOf
– componentProcess, e.g. web service invocation of BLAST @ NCBI
– componentEvent, e.g. completion of a web service invocation at 12.04pm
– runBy, e.g. BLAST @ NCBI
– run for
– data derivation, e.g. output data derived from input data
– knowledge statements, e.g. similar protein sequence to

User can add templates to each workflow process to determine links between data items.

Page 19: The myGrid Project

[BLAST hit list: GI number, accession, score, description]
19747251  AC005089.3   831   Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
15145617  AC073846.6   815   Homo sapiens BAC clone RP11-622P13 from 7, complete sequence
15384807  AL365366.20  46.1  Human DNA sequence from clone RP11-553N16 on chromosome 1, complete sequence
7717376   AL163282.2   44.1  Homo sapiens chromosome 21 segment HS21C082
16304790  AL133523.5   44.1  Human chromosome 14 DNA sequence BAC R-775G15 of library RPCI-11 from chromosome 14 of Homo sapiens (Human), complete sequence
34367431  BX648272.1   44.1  Homo sapiens mRNA; cDNA DKFZp686G08119 (from clone DKFZp686G08119)
5629923   AC007298.17  44.1  Homo sapiens 12q22 BAC RPCI11-256L6 (Roswell Park Cancer Institute Human BAC Library) complete sequence
34533695  AK126986.1   44.1  Homo sapiens cDNA FLJ45040 fis, clone BRAWH3020486
20377057  AC069363.10  44.1  Homo sapiens chromosome 17, clone RP11-104J23, complete sequence
4191263   AL031674.1   44.1  Human DNA sequence from clone RP4-715N11 on chromosome 20q13.1-13.2 Contains two putative novel genes, ESTs, STSs and GSSs, complete sequence
17977487  AC093690.5   44.1  Homo sapiens BAC clone RP11-731I19 from 2, complete sequence
17048246  AC012568.7   44.1  Homo sapiens chromosome 15, clone RP11-342M21, complete sequence
14485328  AL355339.7   44.1  Human DNA sequence from clone RP11-461K13 on chromosome 10, complete sequence
5757554   AC007074.2   44.1  Homo sapiens PAC clone RP3-368G6 from X, complete sequence
4176355   AC005509.1   44.1  Homo sapiens chromosome 4 clone B200N5 map 4q25, complete sequence
2829108   AF042090.1   44.1  Homo sapiens chromosome 21q22.3 PAC 171F15, complete sequence

>gi|19747251|gb|AC005089.3| Homo sapiens BAC clone CTA-315H11 from 7, complete sequence
AAGCTTTTCTGGCACTGTTTCCTTCTTCCTGATAACCAGAGAAGGAAAAGATCTCCATTTTACAGATGAGGAAACAGGCTCAGAGAGGTCAAGGCTCTGGCTCAAGGTCACACAGCCTGGGAACGGCAAAGCTGATATTCAAACCCAAGCATCTTGGCTCCAAAGCCCTGGTTTCTGTTCCCACTACTGTCAGTGACCTTGGCAAGCCCTGTCCTCCTCCGGGCTTCACTCTGCACACCTGTAACCTGGGGTTAAATGGGCTCACCTGGACTGTTGAGCG

[Figure: RDF graph around a BLAST report. Labels recovered from the slide:]
• Data items: urn:lsid:taverna:datathing:15, urn:lsid:taverna:datathing:13
• Types (rdf:type): ..BLAST_Report, ..nucleotide_sequence
• Data relationships: ..similar_sequences_to, ..masked_sequence_of, ..filtered_version_of
• Other classes: service invocation, workflow invocation, workflow definition, experiment definition, project, person, group, service description, organisation
• Other relationships: ..created_by, ..described_by, ..run_during, ..invocation_of, ..part_of, ..works_for, ..author, ..run_for
• A: Relationship BLAST report has with other items in the repository
• B: Other classes of information related to BLAST report
• RDF Rules
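The graph on this slide can be read as a set of subject, predicate, object triples. Below is a minimal Python sketch, using plain tuples rather than an RDF library, of how such triples let a result be traced back to the workflow that produced it. Which identifier is the BLAST report, and the direction of each link, are assumptions read off the slide; the helper names are hypothetical.

```python
# Triples as (subject, predicate, object). The pairing of identifiers with
# types and the direction of each link are assumptions read off the slide.
BLAST_REPORT = "urn:lsid:taverna:datathing:15"
QUERY_SEQUENCE = "urn:lsid:taverna:datathing:13"

TRIPLES = [
    (BLAST_REPORT, "rdf:type", "BLAST_Report"),
    (QUERY_SEQUENCE, "rdf:type", "nucleotide_sequence"),
    (BLAST_REPORT, "similar_sequences_to", QUERY_SEQUENCE),
    (BLAST_REPORT, "created_by", "service invocation"),
    ("service invocation", "run_during", "workflow invocation"),
    ("workflow invocation", "invocation_of", "workflow definition"),
]


def objects_of(subject, predicate):
    """All objects linked from `subject` by `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def trace(start, predicates):
    """Follow a chain of predicates from `start`, taking the first match each time."""
    node = start
    for predicate in predicates:
        matches = objects_of(node, predicate)
        if not matches:
            return None
        node = matches[0]
    return node


print(trace(BLAST_REPORT, ["created_by", "run_during", "invocation_of"]))
```

The same traversal answers the reviewer's question "How was that result derived?" by walking from a data item to its service invocation, workflow run and workflow definition.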

Page 20: The myGrid Project

Information Model v2

• Resources and Identifiers
• People, teams and organizations
• Representing the e-science process
• Experimental methods for e-science
• Scientific data and the life-science identifier
– Types
– Identifier Types
– Values and Documents
• Provenance information
• Annotation and Argumentation

[Figure: UML class diagram. Labels recovered from the slide:]
• Classes: Resources.Resource (+getId:URIString), ProgrammeResource (+name:String), Programme, <<Resource>> Study (+name:String, +description:String, +startTime:DateTime, +endTime:DateTime, +status:String), Investigation, <<Resource>> ExperimentDesign, ExperimentInstance, <<Resource>> Operations.Operation, <<Resource>> PeopleAndTeams.Person, Agent, StudyRole (+roleName:String, +description:String), StudyParticipation, LabBookView (+name:String, +rule:String), SubjectObject, scmInvestigator
• Associations: uses, contains, selected studies, method, acts in, labBooks, has participants, participates in, has instances (multiplicities 1, 0..*, 1..*)

In the middle of deployment

Bioinformatics middleware – domain neutral

Page 21: The myGrid Project

LSIDs

• LSID provides a uniform naming scheme.
• LSID Resolver guarantees to resolve to same data object.
• LSID Authority dishes them out.
• Also returns metadata of object.
• Used throughout myGrid as an object naming device.
• myGrid Repository acts as an LSID Authority
• LSID allows universal access to results for collaboration, as well as for review.
• RDF+LSID explains the context of results, and provides guidance for further investigations.

Pioneered by myGrid

I3C / IBM / EBI proposal for a Life Science Identifier

http://www.i3c.org/wgr/ta/resources/lsid/docs/
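An LSID is a URN of the form urn:lsid:authority:namespace:object, optionally followed by a revision, as in the identifier urn:lsid:taverna:datathing:15 seen earlier in the deck. A minimal Python sketch of splitting one into its parts follows; the `Lsid` tuple and `parse_lsid` helper are hypothetical names for this illustration, and the sketch validates only the overall shape, not the character rules of the specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


print(parse_lsid("urn:lsid:taverna:datathing:15"))
```

The authority component is what a resolver uses to locate the service that hands back the data object and its metadata.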

Page 22: The myGrid Project

Using Haystack

Page 23: The myGrid Project

In a nutshell

[Figure: project timeline. Labels in slide order:]
Pre-Prototype
Prototype 1
Experimental
Web-based
Requirements gathering
Architectural workout
All services represented
NetBeans workbench
API-based integration
Info Repository oriented
XML-based process provenance
Workflow enactment engine
Prototype 2
Second generation services
Reworked information model
Open information management
Life Science Identifiers
RDF based provenance
Taverna workbench
Web-based portal
Demo at ISMB 2003
Full paper and demo at ISMB 2004
GSK deployment
Real biology

Page 24: The myGrid Project

To Dos

• Improve results management
• Deployment of mIR
• Portal for finding workflows, launching & monitoring workflows, launching Taverna, browsing results
• Deploying publicly accessible semantic registry
• Reinstate service discovery during enactment
• Large scale data throughput workflow engine
• Event notification on services
• Using provenance graphs for impact analysis
• Hiding LSIDs
• Lexicons for concept names
• Hardening semantic discovery
• Ambient Text
• Er.. Security
• Etc…
• “myGrid in a box”

Page 25: The myGrid Project

Ongoing/Future Activities

• myGrid-in-a-box
• Technical follow-ons
– Best practice (6) and OMII (Freefluo, Taverna, Event notification) bids
• Research follow-ons
– Semantic Grids, Data Grids, Workflow, Provenance services
– PhD students
• Science follow-ons
– Life Sciences: ISPIDER, e-Fungi
– Clinical: PsyGrid, CLEF-II
– PhD students
• Networking
– Link-up with BIRN/SEEK/GEON (SDSC) & SCEC/GriPhyN (ISI, USC)

Page 26: The myGrid Project

Wrap Up

• Managed the transition from generic middleware development to practical day to day useful services
– Real users (plural) fundamental to that
• End to end support for an entire scenario
– A broad view of the e-Science process
• Show stoppers for practical adoption are not sexy technical showstoppers
– Can I incorporate my favourite service?
– Can I manage the results?
• Tapping into (de facto) standards and communities to leverage others’ results and tools
– LSID, Haystack, Pedro…
• http://www.mygrid.org.uk

Page 27: The myGrid Project

Acknowledgements

myGrid is an EPSRC funded UK eScience Program Pilot Project

Particular thanks to the other members of the Taverna project, http://taverna.sf.net

Page 28: The myGrid Project

myGrid People

Core
• Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.

Users
• Simon Pearce and Claire Jennings, Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
• Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK

Postgraduates
• Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire

Industrial
• Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
• Robin McEntire (GSK)

Collaborators
• Keith Decker