http://www.biomart.org/
Databases in Biomart format:
Ensembl, HapMap, HTGT, HGNC, Dictybase, Wormbase, Gramene, Europhenome, UniProt, Rat Genome Database, DroSpeGe, ArrayExpress DW, Eurexpress, GermOnLine, PRIDE, PepSeeker, VectorBase, Pancreatic Expression Database, Reactome, EU Rat Mart, Paramecium DB
“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research
(OICR) and the European Bioinformatics Institute (EBI).”
Open Source – LGPL
* Perl API → Web Interface, Web Services Interface, REST API
* Java API → Mart Explorer GUI, MartShell
* 3rd Party Software → Bioclipse, biomaRt-BioConductor, Cytoscape, Galaxy, Taverna, WebLab
A Mart is a collection of datasets (~=Database).
Marts are optimised for querying.
A Dataset has a main table, with an entry (and Primary Key) for each of the items of interest in that dataset (e.g. Mouse Transcripts).
Related pieces of information about these items hang off the main table in dimension tables (e.g. the Affy IDs corresponding to a gene).
More Info: http://www.biomart.org/user-docs.pdf
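The main-table / dimension-table layout can be sketched with two small data frames. This is only a toy illustration: the table names, column names and IDs below are all made up, and a real mart stores these tables in a relational database rather than in R.

```r
# Toy "main table": one row per transcript, keyed by transcript_id.
# All IDs and values here are invented for illustration.
main <- data.frame(
  transcript_id = c("T1", "T2", "T3"),
  gene_symbol   = c("GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Toy "dimension table": zero or more Affy probe IDs per transcript.
affy_dim <- data.frame(
  transcript_id = c("T1", "T1", "T3"),
  affy_id       = c("100_at", "101_at", "200_at"),
  stringsAsFactors = FALSE
)

# A left join keeps every main-table row, even one with no probe (T2).
joined <- merge(main, affy_dim, by = "transcript_id", all.x = TRUE)
print(joined)
```

Because the join is one-to-many, a transcript with two probes appears twice in the result, which is also why query results can contain more rows than the number of IDs asked for.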
Ensembl annotates everything at the transcript level:
AffyID     Ensembl transcript    HUGO Symbol
1939_at    ENST000003789         TP53
1939_at    ENST000003790         TP53
1939_at    ENST000003791         TP53
Affy Ids are mapped by Ensembl. If there is no clear match then that probe is not assigned to a gene.
Web Interface:
http://www.biomart.org/biomart/martview/
Choose a Database (mart) to query (e.g. Ensembl)
Choose a Dataset from that mart to query (e.g. Mus musculus genes)
Filters
Use filters to select the members of the dataset in which you're interested
e.g. limit to miRNA genes from Chr1
Attributes
Use attributes to define what bits of information you want to retrieve about the members of the dataset
eg. Gene ID, Transcript ID, Start, End and Status:
Results:
http://www.biomart.org/biomart/martview
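The difference between filters and attributes can be pictured on an ordinary data frame: filters pick rows, attributes pick columns. The sketch below uses a made-up four-gene table (the IDs, coordinates and column names are hypothetical) and does not contact BioMart.

```r
# Hypothetical stand-in for a dataset: one row per gene.
genes <- data.frame(
  gene_id         = c("G1", "G2", "G3", "G4"),
  chromosome_name = c("1", "1", "2", "1"),
  biotype         = c("miRNA", "protein_coding", "miRNA", "miRNA"),
  start           = c(100, 500, 900, 1500),
  end             = c(180, 2500, 990, 1570),
  stringsAsFactors = FALSE
)

# Filters choose which rows (members of the dataset) to keep ...
keep_rows <- genes$biotype == "miRNA" & genes$chromosome_name == "1"

# ... and attributes choose which columns to report about them.
keep_cols <- c("gene_id", "start", "end")

result <- genes[keep_rows, keep_cols]
print(result)
```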
www.bioconductor.org
source("http://bioconductor.org/biocLite.R")
# Default package set
biocLite()
# OR
biocLite("someBiocPkg")
# OR
biocLite(groupName="pkgGroupName")
“Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data.”
Core Packages:
affy, affydata, affyPLM, annaffy, annotate, Biobase, Biostrings, DynDoc, gcrma, genefilter, geneplotter, hgu95av2.db, limma, marray, matchprobes, multtest, ROC, vsn, xtable, affyQCReport.
Alternative Package Groups
lite, affy, graph, all
http://www.bioconductor.org/packages/release/BiocViews.html
Full Package Listing (software)
http://www.bioconductor.org/packages/release/data/annotation/
Full Package Listing (annotation)
Querying biomart from R:
# Install library
source("http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")
# Load library
library(biomaRt)
listMarts()
# result is just a data.frame, so you can subset it:
listMarts()[1:5,]
# or search it:
grep('ensembl', listMarts()[,1], value=TRUE)
# Select a mart
mart <- useMart('ensembl')
# List the available datasets (returns data.frame)
listDatasets(mart)
# Select a dataset
mart <- useDataset('mmusculus_gene_ensembl', mart=mart)
# Both in one:
mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
# Available Filters (returns data.frame)
listFilters(mart)
# Available Attributes (returns data.frame)
listAttributes(mart)
# A Simple Query
getBM(filters=c('ensembl_gene_id'), values=c('ENSMUSG00000029249','ENSMUSG00000048482'),
attributes=c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_start', 'transcript_end'), mart=mart)
      ensembl_gene_id ensembl_transcript_id transcript_start transcript_end
1  ENSMUSG00000029249    ENSMUST00000113448         77694516       77708955
2  ENSMUSG00000029249    ENSMUST00000113449         77695221       77715457
3  ENSMUSG00000029249    ENSMUST00000080359         77694516       77712009
4  ENSMUSG00000048482    ENSMUST00000053317        109514857      109567200
5  ENSMUSG00000048482    ENSMUST00000111052        109533720      109567200
6  ENSMUSG00000048482    ENSMUST00000111051        109516054      109567200
7  ENSMUSG00000048482    ENSMUST00000111050        109532593      109567200
8  ENSMUSG00000048482    ENSMUST00000111047        109516054      109567163
9  ENSMUSG00000048482    ENSMUST00000111049        109516054      109567163
10 ENSMUSG00000048482    ENSMUST00000111046        109517251      109567163
11 ENSMUSG00000048482    ENSMUST00000111045        109533720      109567163
12 ENSMUSG00000048482    ENSMUST00000111044        109534626      109567163
13 ENSMUSG00000048482    ENSMUST00000111043        109534626      109567163
14 ENSMUSG00000048482    ENSMUST00000111042        109534628      109567204
# If using multiple filters, values should be a list
# If chromosome_name, start and end filters are used they are automatically
# interpreted as 'search within this region'
getBM(filters=c('chromosome_name', 'start', 'end' ), values=list(10, 80000000,80050000), attributes= c('ensembl_gene_id', 'start_position','end_position'), mart=mart)
     ensembl_gene_id start_position end_position
1 ENSMUSG00000003346       80046400     80053049
2 ENSMUSG00000035397       80029874     80040066
3 ENSMUSG00000047417       80005138     80024286
4 ENSMUSG00000003341       79982330     80001869
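Two of the genes in the output above extend past the edges of the requested region (one starts before 80,000,000, another ends after 80,050,000), which suggests that 'search within this region' means 'overlaps this region'. That reading is inferred from the output, not taken from the biomaRt documentation. A minimal offline sketch of such an overlap test, using the four returned genes plus one invented gene ("FAKE_OUTSIDE") that lies outside the region:

```r
# The four genes from the output above, plus one made-up gene
# ("FAKE_OUTSIDE") that lies entirely outside the queried region.
genes <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000003346", "ENSMUSG00000035397",
                      "ENSMUSG00000047417", "ENSMUSG00000003341",
                      "FAKE_OUTSIDE"),
  start_position  = c(80046400, 80029874, 80005138, 79982330, 81000000),
  end_position    = c(80053049, 80040066, 80024286, 80001869, 81005000),
  stringsAsFactors = FALSE
)

region_start <- 80000000
region_end   <- 80050000

# A gene overlaps the region if it starts before the region ends
# and ends after the region starts.
overlaps <- genes$start_position <= region_end &
            genes$end_position   >= region_start

hits <- genes[overlaps, ]
print(hits)
```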
# Filters can be either numeric, string or boolean.
# Boolean filters need a TRUE or FALSE value
# Determine type of filter with:
filterType('with_unigene', mart)
# Attributes and filters are organised into categories
# To get a list of the categories:
attributeSummary(mart)
filterSummary(mart)
# You can then list attributes and filters limited to a
# specified category:
listAttributes(mart, category='Variations')
# Older versions of Ensembl are archived, useful if you've
# got genome positions from a previous build
old.mart <- useMart('ensembl_mart_46', dataset='mmusculus_gene_ensembl', archive=TRUE)
Retrieving Sequences:
# can get complicated with getBM. Use the getSequence wrapper
# Genome Sequences always 5'-3' but...
# Web-Services mode (default): Strand is context dependent
# MySQL mode: Always top strand
#eg...
# BRCA1 peptide sequence from gene symbol
getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart = mart)
# REST transcript 20 bases upstream
getSequence(id='ENSMUST00000113448', type='ensembl_transcript_id', seqType='transcript_flank', upstream=20, mart=mart)
# Exon sequences of genes on chromosome 4, 10,000,000-11,000,000
getSequence(chromosome=4, start=10000000, end=11000000, mart=mart, seqType="gene_exon", type="ensembl_gene_id")
seqTypes:
Note that any of the _flank types need an 'upstream' or 'downstream' argument to determine the size of the flanking region. At the moment, you can't specify both.
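The flank rule above can be checked before a query is sent. The helper below is hypothetical (it is not part of biomaRt) and assumes that flank seqTypes can be recognised by the '_flank' suffix in their name; it is a sketch of the rule as stated here, not of how getSequence validates its arguments.

```r
# Hypothetical helper (not part of biomaRt): checks the flank arguments
# before a call to getSequence().
check_flank_args <- function(seqType, upstream = NULL, downstream = NULL) {
  # Assumption: flank seqTypes are the ones whose name ends in "_flank".
  is_flank <- grepl("_flank$", seqType)
  # Count how many of the two size arguments were supplied.
  n_given  <- sum(!is.null(upstream), !is.null(downstream))
  if (is_flank && n_given != 1) {
    stop("flank seqTypes need exactly one of 'upstream' or 'downstream'")
  }
  invisible(TRUE)
}

check_flank_args("transcript_flank", upstream = 20)  # passes silently
check_flank_args("peptide")                           # not a flank type, passes
```

Calling it with both arguments, or with neither, for a flank type raises an error instead of sending a query that cannot be answered.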
Exporting Sequences:
# The exportFASTA function provides a quick way of saving
# sequences in FASTA format:
res <- getSequence(id="BRCA1", type="mgi_symbol", seqType="peptide", mart = mart)
exportFASTA(res, file='sequence.fa')
Linking Datasets...
# Make mart connections for each of the datasets:
mouse.mart <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
people.mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
# In Ensembl, datasets are made of transcripts
# from a single species.
# Linking datasets amounts to homology
#eg. Get pos of mouse homolog to human 'TP53' gene
getLDS(attributes = c("hgnc_symbol","chromosome_name", "start_position"), filters = "hgnc_symbol", values = "TP53", mart = people.mart, attributesL = c("chromosome_name","start_position"), martL = mouse.mart)
    V1 V2      V3 V4       V5
1 TP53 17 7512445 11 69393861
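The V1 to V5 column names in this result are not very informative. Judging by the values, the columns appear to follow the order of attributes (human) and then attributesL (mouse); that ordering is an assumption read off the output. Under that assumption, the columns can be relabelled as sketched below, with the result row rebuilt by hand so the example runs offline:

```r
# The single row from the output above, rebuilt by hand so the
# example runs without a network connection.
res <- data.frame(V1 = "TP53", V2 = "17", V3 = 7512445,
                  V4 = "11", V5 = 69393861,
                  stringsAsFactors = FALSE)

attributes_human <- c("hgnc_symbol", "chromosome_name", "start_position")
attributes_mouse <- c("chromosome_name", "start_position")

# Assumption: columns come back as 'attributes' followed by 'attributesL'.
# Prefix each name with its species so the two sets of
# chromosome/start columns can be told apart.
colnames(res) <- c(paste0("human_", attributes_human),
                   paste0("mouse_", attributes_mouse))
print(res)
```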
Pretty HTML Output:
library(annotate)
# Provides the htmlpage function. Salient args are:
# genelist – a list or dataframe of IDs to be made into links
# filename
# title – for the table
# othernames – a list of other things to add to the table as is
# table.head – a character vector of col headers for the table.
# repository – a list of repositories to use for creating links
ids <- c('ENSMUSG00000029249','ENSMUSG00000048482')
genelist <- getBM(attributes=c('uniprot_swissprot_accession', 'entrezgene'), filters='ensembl_gene_id', values=ids, output='list', na.value=' ', mart=mart)
othernames <- getBM(attributes=c('ensembl_gene_id','mgi_symbol', 'description'), filters='ensembl_gene_id', values=ids, output='list', na.value='&nbsp;',mart=mart)
htmlpage(genelist=genelist, othernames=othernames, title='Some Genes', table.head=c('Uniprot', 'Entrezgene', 'Ensembl','Name', 'Description'), repository=list('sp', 'en'), filename='genes.html')
# Note that all the lists are expected to be in the right order
More Info...
Bioconductor Mailing List:
http://www.bioconductor.org/docs/mailList.html
biomaRt Users' Guide:
vignette('biomaRt')
Biomart Website
http://www.biomart.org
Slides & examples:
http://www.cassj.co.uk/biomart_slides.ppt
http://www.cassj.co.uk/worksheet.txt
http://www.cassj.co.uk/worksheet_code.R