ucsc genome tools and databases jim kent - genome bioinformatics group university of california...

Post on 21-Dec-2015

UCSC Genome Tools and Databases

Jim Kent - Genome Bioinformatics Group, University of California Santa Cruz

Behind the Genome Browser

• ‘Genome’ database, one for each assembly of each genome.
– hg17 (human genome assembly 17)
– mm6 (mus musculus 6)
– canFam1 (canis familiaris 1)

• hg17 has 1616 tables, but not really
– Some tables split across chromosomes for speed
– 228 logical tables
– Only ~30 different types of tables
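The gap between physical and logical table counts can be illustrated with a minimal Python sketch. It assumes split tables follow a `<chromosome>_<logicalName>` naming pattern such as `chr1_est`; the table names in the sample list, the regular expression, and the helper names are illustrative assumptions, not taken from the slides.

```python
import re
from collections import defaultdict

# Assumption: a split table is named "<chrom>_<logicalName>", e.g. "chr1_est"
# or "chr1_random_est". Anything else is treated as an unsplit table.
SPLIT_PREFIX = re.compile(r"^chr[0-9A-Za-z]+(?:_random)?_(?P<logical>.+)$")


def logical_name(table):
    """Return the logical table name for a physical table name."""
    match = SPLIT_PREFIX.match(table)
    return match.group("logical") if match else table


def group_logical(tables):
    """Map each logical table name to the physical tables behind it."""
    groups = defaultdict(list)
    for table in tables:
        groups[logical_name(table)].append(table)
    return dict(groups)


# Hypothetical listing of physical tables in one assembly database.
tables = ["chr1_est", "chr2_est", "chrX_est", "chr1_random_est",
          "cpgIsland", "ensGene"]
groups = group_logical(tables)
print(len(tables), "physical tables,", len(groups), "logical tables")
```

Run against a full table listing, the same grouping would show how many physical tables collapse into each logical one.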

Selected fields from related tables. Results: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).

Custom Track Output

• Useful for visualizing results of queries in genome browser

• The way to produce more complex queries.

681/3329 (20%) of Ensembl genes not in Known Genes are also not conserved

1728/33,666 (5%) of Ensembl genes in general are not conserved

Meta-data behind Table Browser

• The trackDb table describes each track.

• Table and field descriptions in AutoSql .as files, which also generate SQL code and C code to load/save from database and tab-separated files.

• Descriptions of how tables are connected in all.joiner file, which along with joinerCheck program checks database integrity.

.as Files - table and field docs

table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )

autoSql generates code from these. They also help document.
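To make the idea of generating code from a .as description concrete, here is a small Python sketch that parses the cpgIsland definition and emits a CREATE TABLE statement. It is not autoSql itself: the type mapping in `SQL_TYPES`, the regular expressions, and the function names are assumptions for illustration, and only the three field types used in this example are handled.

```python
import re

# Assumed mapping from .as field types to SQL column types (illustrative only).
SQL_TYPES = {"string": "varchar(255)", "uint": "int unsigned", "float": "float"}

AS_EXAMPLE = '''
table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )
'''

HEADER = re.compile(r'table\s+(\w+)\s+"([^"]*)"')
FIELD = re.compile(r'(\w+)\s+(\w+)\s*;\s*"([^"]*)"')


def parse_as(text):
    """Return (table name, table doc, [(type, field, doc), ...])."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no table header found")
    name, doc = header.groups()
    fields = FIELD.findall(text[header.end():])
    return name, doc, fields


def to_sql(name, fields):
    """Emit a CREATE TABLE statement with one column per .as field."""
    cols = ",\n".join(
        f"    {fname} {SQL_TYPES[ftype]} not null" for ftype, fname, _ in fields
    )
    return f"CREATE TABLE {name} (\n{cols}\n);"


name, doc, fields = parse_as(AS_EXAMPLE)
sql = to_sql(name, fields)
print(sql)
```

The per-field comment strings are kept in the parsed result, so the same data could also feed the field documentation.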

all.joiner - basic example

• The central concept is an identifier that appears in fields in multiple tables, sometimes even multiple databases.

• $gbd is a variable that contains a comma-separated list of databases.

• An identifier record ends with a blank line.

identifier softberryGeneName
"Link together Fgenesh++ gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name

# Genbank/trEMBL Accessions and meaningful subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
    $gbd.seq.acc

identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
    $gbd.all_bacends.qName dupeOk
    $gbd.bacEndPairs.lfNames comma
    $hg.fishClones.beNames comma minCheck=0.70

• typeOf - allows joins between parent and child, but not between siblings.

• dupeOk - allows more than one row with the same identifier in the primary table.

• comma - indicates the field is a comma-separated list of identifiers.

• minCheck - indicates that only a portion of the identifiers in the field are in the primary table.

identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
    $hg.refLink.name
    $hg.atlasOncoGene.locusSymbol
    $hg.kgAlias.alias
    $hg.kgXref.geneSymbol
    $hg.refFlat.geneName
    $hg.jaxOrtholog.humanSymbol
    hg13,hg15.geneBands.name

“Biological” names for human genes are so messy, no validation is done (note ‘fuzzy’ keyword).
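The record layout described in these slides (an `identifier` line with optional attributes, a quoted description, then one `database.table.field` line per use, ended by a blank line) can be read with a short parser. The Python below is a sketch under those assumptions only; it is not the joinerCheck implementation, the function and key names are hypothetical, and treating the first listed field as the primary one is an assumption.

```python
import re

JOINER_EXAMPLE = '''
identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
    $gbd.all_bacends.qName dupeOk
    $gbd.bacEndPairs.lfNames comma
    $hg.fishClones.beNames comma minCheck=0.70

identifier softberryGeneName
"Link together gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name
'''


def parse_identifiers(text):
    """Split text on blank lines and parse each identifier record."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Drop empty lines and '#' comment lines inside the block.
        lines = [ln.strip() for ln in block.splitlines()
                 if ln.strip() and not ln.strip().startswith("#")]
        if not lines or not lines[0].startswith("identifier "):
            continue
        head = lines[0].split()
        record = {
            "name": head[1],
            "attrs": head[2:],  # e.g. typeOf=..., external=..., fuzzy
            "doc": lines[1].strip('"') if len(lines) > 1 else "",
            "fields": [],
        }
        for ln in lines[2:]:
            spec, *flags = ln.split()
            db, table, field = spec.split(".", 2)
            record["fields"].append(
                {"db": db, "table": table, "field": field, "flags": flags}
            )
        records.append(record)
    return records


records = parse_identifiers(JOINER_EXAMPLE)
for rec in records:
    primary = rec["fields"][0]  # assumed: first listed field is the primary one
    print(rec["name"], "->", primary["table"] + "." + primary["field"])
```

A checker built on this structure could then compare the values in each secondary field against the primary field, honoring flags such as `comma` and `minCheck`.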

Other Databases

• Genome databases - one for each assembly of each organism: hg17, mm6, canFam1, etc.

• hgCentral - home to dbDb and user settings info. One database shared by all web servers.

• hgFixed - mostly microarray data.

• uniProt - Relationalized SwissProt/trEMBL database.

• go - Gene ontology terms and term/gene associations.

• genePix - gene image database

Gene Pix

• Image browser for in-situ and other gene-oriented pictures

• Hopefully in the long run will have a million images covering almost all vertebrate genes.

• (Needs new name, Gene Pix is a microarray analysis program. VisiGene?)

Data Sets

• Paul Gray - ~1000 mouse transcription factor genes - whole embryo & sections. These are in the database now.

• Other potential sources:
– German AxelDB frog in situs
– Japanese NIBB frog in situs (have nice browser)
– Genepaint.org - mouse stuff
– EMAGE and Jackson Lab mouse images

• From development and other journals, copyright issues.

– Nathaniel Heintz BAC expression constructs
– Eddy Rubin lab mouse embryos
– UCSF cell-localization stuff?

Types of images

• Whole animal vs. sectioned tissues, vs. single cell.

• Single vs. multiple probes within same image.

• Single image vs. image series (movies even).

• RNA, Antibody, Fusion protein.

Mitotic cell 3 stains

Gene Pix Programs

• genePixLoad - loads SQL database from a well defined format involving a .ra file and a tab separated file. See genePixLoad.doc

• loadMahoney - converts Paul Gray (Mahoney center) spreadsheet and image directory into genePixLoad format

• Hg/lib/genePix.c - interface with SQL database.

• hgGenePix - cgi script to display images

• knownToGenePix - makes table in mm5 (or other) genome database to connect known genes to genePix Ids.

Gene Pix Database

• Just a single database for all assemblies of all organisms.

• A knownToGenePix table in the assembly database.

GenePix tables

• fileLocation - directory

• bodyPart - whole, brain etc.

• sliceType - transverse, sagittal

• treatment - tech details

• contributor - who done it

• Journal - scientific journal

• submissionSet - info about a whole set of images from one author

• sectionSet - links together separate sections of same specimen.

• Gene - gene info

• geneSynonym

• Antibody - info on an antibody

• probeType - antibody, RNA, fusion protein

• Probe - links gene, primers, sequence Ab.

• probeColor - color probe is

• imageFile - file containing image

• Image - a single image.

• imageProbe - links image and probe

Some Anatomy Required

Especially with slices

Edinburgh mouse atlas

Theiler Stages

Later Stages

NIBB Japanese Frog Site

Earlier Stages

Who you gonna call?

Angie Hinrichs - developer of 2nd and 4th versions of Table Browser. Genome browser hacker extraordinaire.

Hiram Clawson - main mouse man at the moment. Developed ‘wiggle’ tracks.

Kate Rosenbloom - ENCODE project and multiple alignment display.

Bob Kuhn - Software and database quality assurance.

David Haussler - Ideas. Money. Comparative genomics.

More Acknowledgements

• UCSC - Robert Baertsch, Gill Bejerano, Galt Barber, Ron Chao, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Fan Hsu, Yontao Lu, Crystal Lynch, Donna Karolchik, Jennifer Jackson, Ann Pace, Jacob Pedersen, Andy Pohl, Katie Pollard, Ali Sultan-Qurraie, Brian Raney, Krishna Roskin, Adam Siepel, Chuck Sugnet, Paul Tatarsky, Daryl Thomas, Heather Trumbower

• Penn State - Scott Schwartz, Laura Elnitski, Belinda Giardine, Ross Hardison, Minmei Hou, Webb Miller, Anton Nekrutenko

• Funding - NHGRI, HHMI, NCI, UCSC