Cancer Genome Anatomy Project (CGAP) Informatics Carl F. Schaefer December 7, 2001

Page 1

Cancer Genome Anatomy Project (CGAP) Informatics

Carl F. Schaefer

December 7, 2001

Page 2

Agenda

• Overview
• New Features
• Future plans
• Behind the scenes

url: http://cgap.nci.nih.gov

Page 3

CGAP Informatics

• Main CGAP
  – Susan Greenhut (OCG)
  – Denise Hise (NCICB)
  – Carl Schaefer (NCICB)
  – Kotien Wu (NCICB)
• GAI
  – Bob Clifford (LPG)
  – Michael Edmonson (LPG)
  – Ying Hu (LPG)
  – Cu Nguyen (LPG)

Page 4

Cancer Genome Anatomy Project

Find correlations between ...
• Gene expression
• Polymorphism
• Chromosomal aberrations
• Topography (tissue)
• Morphology (histology)

• Earlier informatics segments:
  – Tumor Gene Index
  – Cancer Chromosome Aberration Project
  – Gene Annotation Initiative

Page 5

History (Simplified)

[Timeline figure; labels as extracted]
• Organizations: NCBI; NCI/LPG; NCI/NCICB
• Phases: Prototype; Live, parallel; Live
• Dates marked: 01/00, 09/00, 02/01, 05/01

Page 6

NCI CGAP Site: Original Goals

• Organize by biology rather than by funding
  – Genes, tissues, chromosomes, reagents, …
• Add bio-functional component (e.g. pathways)
• Make the site
  – Consistent: search forms; lists; info pages
  – Coherent: tied together with internal links

Page 7

New Features

• Expression: DGED; SAGE data
• Function: GO and pathways
• Structure: protein motifs
• Chromosome aberrations in cancer

Page 8

Measuring Gene Expression

• Sequencing
  – ESTs (100-600 bp end or full sequence of single clone)
  – SAGE (10 bp “tag” excised with restriction enzymes)
• Hybridization
  – Spotted cDNA arrays
    • Longer probes, fewer features per slide
  – “Gene chip” (e.g. Affymetrix)
    • Multiple shorter probes, more features per chip

Page 9

Digital Gene Expression Display

• Evolved from a concrete request: help find vaccine targets
• Similar to NCBI’s DDD but more flexible interface
• Queries both EST and SAGE data
  – But better not to mix

Page 10

DGED-1

Page 11

DGED-2

Page 12

DGED-3

Page 13

Other Expression Stuff

• SAGE data
  – SAGE libraries accessible in library browser and DGED
  – Caveat: tag-to-gene mapping is ambiguous both ways
    • Stay tuned for improvement here
• Virtual Northern
  – For each gene, contrast cancer vs. normal, in ESTs and SAGE, for each of 50 tissues
  – Ratio: tags for gene G in a given (tissue, histology) divided by total tags in that (tissue, histology)
  – Convert to decile
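
The Virtual Northern arithmetic can be sketched as follows. The slide does not say which population the decile is taken over; this sketch assumes it is the set of ratios for all genes in the same (tissue, histology) pool. The function names and the sample numbers are illustrative only.

```python
from bisect import bisect_left


def expression_ratio(gene_tags, total_tags):
    """Tags for one gene in a (tissue, histology) pool over all tags in that pool."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return gene_tags / total_tags


def decile(value, population):
    """Return 1..10: the tenth of the sorted population that `value` falls in.

    Assumption: the population is the list of ratios for every gene in the
    same pool. The slide only says "convert to decile".
    """
    ranked = sorted(population)
    below = bisect_left(ranked, value)  # count of ratios strictly smaller
    return min(10, below * 10 // len(ranked) + 1)


# Made-up tag counts for ten genes in one pool of 1000 tags.
counts = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
ratios = [expression_ratio(c, 1000) for c in counts]
print([decile(r, ratios) for r in ratios])
```

Running the same computation separately for the cancer pool and the normal pool of a tissue gives the two deciles that the display contrasts.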

Page 14

Virtual Northern

Page 15

Functional Information

• Ontologies
  – Set membership
  – E.g., TP53 is in “DNA binding”, “DNA repair”, “transcription factor”, …
• Pathways
  – Set membership, e.g. TP53 is in “ATM Signaling Pathway”, “p53 Signaling Pathway” …
  – But also relations among members of a pathway, e.g. “is catalyst for”, “activates”, …
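
The difference between the two kinds of functional data can be sketched as a data structure: ontology terms carry only membership, while a pathway carries membership plus typed relations. The term and pathway names come from the slide; `GENE_X` and the relation attached to it are hypothetical placeholders, not curated facts.

```python
# Ontology terms: plain set membership.
go_terms = {
    "DNA binding": {"TP53"},
    "DNA repair": {"TP53"},
    "transcription factor": {"TP53"},
}

# Pathways: membership plus (subject, relation, object) triples.
# GENE_X and its relation are placeholders for illustration only.
pathways = {
    "ATM Signaling Pathway": {"members": {"TP53"}, "relations": []},
    "p53 Signaling Pathway": {
        "members": {"TP53", "GENE_X"},
        "relations": [("GENE_X", "activates", "TP53")],
    },
}


def terms_for(gene):
    """Ontology terms whose gene set contains `gene`."""
    return sorted(t for t, genes in go_terms.items() if gene in genes)


def pathways_for(gene):
    """Pathways whose member set contains `gene`."""
    return sorted(p for p, d in pathways.items() if gene in d["members"])


def relations_for(gene):
    """Relations in any pathway that mention `gene` on either side."""
    return [
        (name, triple)
        for name, d in sorted(pathways.items())
        for triple in d["relations"]
        if gene in (triple[0], triple[2])
    ]


print(terms_for("TP53"))
print(pathways_for("TP53"))
print(relations_for("TP53"))
```

Only the last query needs relations stored as data; the first two work on membership alone.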

Page 16

Gene Ontology

• Three top-level categories
  – Biological process
  – Molecular function
  – Cellular component
• Given gene may appear in multiple sets
• Mouse by JAX; human by Proteome
• An evolving vocabulary
  – The sorry fate of “tumor suppressor” and “oncogenesis”

Page 17

GO Browser

Page 18

BioCarta Pathways

• 95 pathway diagrams
• Artistic rendering …
• but dumb
  – Relations (e.g. “is catalyst for”) are drawn, but are not data

Page 19

AKT Signaling Pathway

Page 20

KEGG Pathways

• Mainly metabolic pathways; some regulatory
• Genes represented by EC numbers
  – Many can be hyperlinked to CGAP gene info pages
  – Some refs to non-human organisms
• Compounds appear under various names; each has a unique KEGG compound number
• Database contains representation of reactions (unlike BioCarta)
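
One way to handle compounds that appear under several names is to index every alias to its single compound number. The sketch below assumes a case-insensitive lookup. The identifier `C-EXAMPLE-1` is a made-up placeholder, not a real KEGG compound number.

```python
# Placeholder identifier; real KEGG compound numbers are not reproduced here.
compound_names = {
    "C-EXAMPLE-1": ["L-Glutamate", "L-Glutamic acid", "Glutamate"],
}

# Reverse index: lower-cased alias -> compound identifier.
alias_index = {
    alias.lower(): compound_id
    for compound_id, aliases in compound_names.items()
    for alias in aliases
}


def compound_id(name):
    """Return the identifier for any known alias, or None if unknown."""
    return alias_index.get(name.strip().lower())


print(compound_id("glutamate"))
print(compound_id("L-Glutamic acid"))
```

Every alias resolves to the same key, so pathway diagrams and compound info pages can link through one identifier regardless of the name shown.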

Page 21

D-Glutamine and D-Glutamate Metabolism

Page 22

L-Glutamate Compound Info Page

Page 23

Summary of Functional Information for CASPASE 7

Page 24

Structure: Protein Motifs

• GAI using HMMER to locate Pfam motifs on RefSeq (NM_ …) and MGC (BC…) transcripts
• Similarity among transcripts:
  – Raw sequence
  – Single motif occurrences
  – Multiple motif occurrences
• E-value: fit of motif to transcript
• P-value: relative probability that two transcripts are closely related
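
To illustrate the difference between comparing transcripts by single and by multiple motif occurrences, here is a small sketch using overlap ratios. This is not GAI's scoring (the slide reports E-values and P-values); Jaccard-style overlap is swapped in purely for illustration, and the transcript contents are invented. Only the motif names ICE_p10, ICE_p20 and CARD come from the slides.

```python
from collections import Counter


def set_similarity(motifs_a, motifs_b):
    """Overlap counting each motif once per transcript (single occurrences)."""
    a, b = set(motifs_a), set(motifs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multiset_similarity(motifs_a, motifs_b):
    """Overlap that also counts repeated motifs (multiple occurrences)."""
    a, b = Counter(motifs_a), Counter(motifs_b)
    union = sum((a | b).values())  # per-motif maximum count
    shared = sum((a & b).values())  # per-motif minimum count
    return shared / union if union else 0.0


# Invented motif hits for two hypothetical transcripts.
t1 = ["CARD", "ICE_p20", "ICE_p10"]
t2 = ["CARD", "CARD", "ICE_p20", "ICE_p10"]

print(set_similarity(t1, t2))
print(multiset_similarity(t1, t2))
```

The two transcripts look identical when each motif is counted once and differ only when the repeated CARD hit is taken into account.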

Page 25

Structure: Protein Motifs (Example: ICE_p10, ICE_p20, and CARD among the CASPASes)

Page 26

Mitelman Database of Chromosomal Aberrations in Cancer

• Data culled from literature -- 39,000 cases
• Case records:
  – Clinical/demographic
  – Topography/morphology
  – Karyotype
  – Reference
• Recurrent subset
• Separate dataset of associations, often to specific genes

Page 27

Future Plans

• Function: smarter pathways
• Expression:
  – New SAGE data and display
  – Microarray data (NCI 60 cell lines) (see CMAP presentation)
• Structure: gene query by motif
• Operations on lists of genes
  – Adding columns of information to gene lists

Page 28

Genes in AKT Signaling Pathway

Page 29

Clone List

Page 30

Genes in GO Apoptosis

Page 31

Pathways, Ontology, Tissues

Page 32

Behind the Scenes

• The build process (not a pretty sight)
• Software architecture

Page 33

Data Sources/Sizes

Source              Dataset               MB
UniGene             Hs.data              236
                    Hs.seq.all          1965
                    Hs.lib.info            1
                    Mm.data              151
                    Mm.seq.all          1022
                    Mm.lib.info            1
LocusLink           LL_tmpl               81
HomoloGene          hmlg.ftp              18
SAGE                tag_lib_freq          67
                    tag_cluster_map       26
Research Genetics   Hs verified clones    10
                    Mm verified clones     2
Felix Mitelman      cytogenetic data      27
NCBI custom         library.report         6
                    hierarchy.txt          1
                    BAC clones             1
DTP                 microarray ids         1
Gene Ontology       GO terms               1
BioCarta            Pathways               8
Total                                   3625

Page 34

[Build data-flow diagram with two labeled groups, "Raw Input" and "Oracle input"; the arrows between files are not recoverable.]

• Source and intermediate files: Hs.data, Hs.seq.all, Hs.lib.info, LL_tmpl, LL_stripped, hmlg.ftp, ResGen, svc, lib.rep, hier.txt, unilib.info, BLAST libs, SAGE tag-gene map, SAGE tag-lib-freq, Hs.seq.tmp, Hs.accs, GO hier.
• .dat files: hs_cluster.dat, all_libraries.dat, library_keyword.dat, gene_alias.dat, gene_keyword.dat, hs_gxs.dat, hs_svc.dat, hs_gene_tissue.dat, hs_clust2est.dat, hs_clust2sage.dat, hs_vn.dat, hs_vn_lib.dat, hs_decile.dat, hs_ug_clones.dat, go_genes.dat, go_names.dat, hs_gl.dat

Page 35

Build Process -- Goals

• Automated
• Current (with respect to external data sources)
• Internally consistent (i.e. new UniGene cluster numbers throughout)
• Efficient (only recompute when necessary)
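
The "only recompute when necessary" goal is what make's timestamp rule provides. A minimal Python sketch of that rule, assuming plain files on disk and a hypothetical helper name `needs_rebuild`:

```python
import os


def needs_rebuild(target, prerequisites):
    """True if `target` is missing or older than any prerequisite.

    Mirrors make's timestamp comparison. A missing prerequisite raises
    FileNotFoundError, much as make would stop with an error.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

A build driver would call this once per derived file and skip the step when it returns False, so an unchanged input never triggers a recomputation or a database reload.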

Page 36

Makefile Example

# Rebuild the gene-by-tissue data file when a prerequisite changes.
$(HS_GENE_TISSUE_DAT): $(TISSUE_SELECTION_DAT) $(HS_GXS_DAT)
	$(GENE_TISSUE_CMD) \
		Hs \
		$(ALL_LIBRARIES_DAT) \
		$(TISSUE_SELECTION_DAT) \
		$(LIBRARY_KEYWORD_DAT) \
		$(HS_GXS_DAT) \
		$(HS_GENE_TISSUE_DAT)

# Reload Oracle: drop indexes, bulk-load, recreate indexes, refresh statistics.
# The .mak target is a timestamp marker touched after the load.
$(DATA_DIR)/load_hs_gene_tissue.mak: $(HS_GENE_TISSUE_DAT)
	echo "drop index Hs_Gene_Tissue1;" | sqlplus $(DB_USER)
	echo "drop index Hs_Gene_Tissue2;" | sqlplus $(DB_USER)
	sqlldr userid=$(DB_USER) control=$(LOAD_DIR)/Hs_Gene_Tissue.ctl \
		>$(LOAD_DIR)/load_hs_gene_tissue.log 2>&1
	echo "create index Hs_Gene_Tissue1 on Hs_Gene_Tissue(tissue_code);" | sqlplus $(DB_USER)
	echo "create index Hs_Gene_Tissue2 on Hs_Gene_Tissue(cluster_number);" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for table;" | sqlplus $(DB_USER)
	echo "analyze table Hs_Gene_Tissue compute statistics for all indexes;" | sqlplus $(DB_USER)
	touch $(DATA_DIR)/load_hs_gene_tissue.mak

Page 37

CGAP Site Architecture (Overview)

[Architecture diagram; labels as extracted]
• Static Content
• Web Server (Zope)
• Content Management System (Zope)
• Python
• Perl
• Oracle
• Oracle Tables
• Flat Files (initialization data)
• Connection labels: socket, SQL
• Servers: GeneServer, LibServer, GLServer, GXSServer, CytSearchServer, BlastQueryServer

Page 38

Distributed Processing

GENE_HOST = Venus
GENE_PORT = 9000
BLAST_HOST = Mars
BLAST_PORT = 9001
GLS_HOST = Pluto
GLS_PORT = 9002
LIB_HOST = Pluto
LIB_PORT = 9003

[Diagram: Zope plus three hosts; a CGAPConfig box appears beside each server group.]
• Venus: GeneServer (GENE_PORT = 9000)
• Mars: BlastServer (BLAST_PORT = 9001)
• Pluto: GLSServer (GLS_PORT = 9002), LibServer (LIB_PORT = 9003)
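
A sketch of how a client could resolve a service to its host and port from these settings. The dictionary holds the values shown on the slide; the helper names `endpoint` and `services_on` are hypothetical, and the actual CGAPConfig format and wire protocol are not shown in the slides.

```python
# Settings copied from the slide; the dictionary layout is an assumption.
CGAP_CONFIG = {
    "GENE_HOST": "Venus", "GENE_PORT": 9000,
    "BLAST_HOST": "Mars", "BLAST_PORT": 9001,
    "GLS_HOST": "Pluto", "GLS_PORT": 9002,
    "LIB_HOST": "Pluto", "LIB_PORT": 9003,
}


def endpoint(service, config=CGAP_CONFIG):
    """Return (host, port) for a service prefix such as "GENE" or "BLAST"."""
    key = service.upper()
    try:
        return config[key + "_HOST"], config[key + "_PORT"]
    except KeyError:
        raise LookupError("no host/port configured for %r" % service) from None


def services_on(host, config=CGAP_CONFIG):
    """List the service prefixes whose *_HOST setting names `host`."""
    suffix = "_HOST"
    return sorted(
        k[: -len(suffix)]
        for k, v in config.items()
        if k.endswith(suffix) and v == host
    )


print(endpoint("gene"))
print(services_on("Pluto"))
```

Because callers only ask for a service by name, moving a server to another machine means changing one pair of settings rather than any calling code.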

Page 39

Application Support

• Security
  – Screen IP
  – Screen request
• Request status
  – OK
  – NO DATA
  – BAD REQUEST
  – SERVER FAIL
• Server loop
  – Wait for request
  – Replicate processing thread
  – Set timer
  – Reset server

# Define Services:
sub FindGene { ... }
sub GetClones { ... }

# "Publish" services:
SetSafe ("FindGene", "GetClones", ...);
SetForkable ("FindGene", "GetClones", ...);

# Start server loop:
StartServer(GENE_PORT);

Diagram labels: Application; Support
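
The screening and status logic listed above can be sketched in Python without any networking. This is not the Perl support library from the slide: the allow-list, the request format (service name followed by arguments), the mapping of failures to status codes, and the `find_gene` service are all assumptions made for illustration. Forking, timers and server reset are left out.

```python
OK, NO_DATA, BAD_REQUEST, SERVER_FAIL = "OK", "NO DATA", "BAD REQUEST", "SERVER FAIL"

ALLOWED_IPS = {"127.0.0.1"}  # hypothetical allow-list for the IP screen
SAFE = {}  # published services: name -> callable


def set_safe(*functions):
    """Publish functions so that requests may call them by name."""
    for fn in functions:
        SAFE[fn.__name__] = fn


def find_gene(symbol):
    """Hypothetical service: look up rows for a gene symbol."""
    genes = {"TP53": ["TP53 record"]}
    return genes.get(symbol, [])


def handle(client_ip, request):
    """Screen the caller and the request, run the service, return (status, rows)."""
    parts = request.split()
    if client_ip not in ALLOWED_IPS or not parts or parts[0] not in SAFE:
        return BAD_REQUEST, []
    try:
        rows = SAFE[parts[0]](*parts[1:])
    except Exception:
        # Any error inside a service is reported, not propagated.
        return SERVER_FAIL, []
    return (OK, rows) if rows else (NO_DATA, [])


set_safe(find_gene)

print(handle("127.0.0.1", "find_gene TP53"))
print(handle("127.0.0.1", "find_gene XYZ"))
print(handle("127.0.0.1", "drop_tables"))
```

Only names registered through `set_safe` can be invoked, which is the point of publishing services explicitly: a request cannot reach arbitrary code in the server process.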