Page 1:

An Introduction to Bioinformatics

Department of Medical Informatics, Peking University Health Science Center
Qinghua Cui (崔庆华)

11-16, 2008

Page 2:

Introduction to basic concepts

Page 3:

Bioinformatics: a definition by NIH (1995)

Bioinformatics is defined as a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation, that combines the tools and techniques of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data.

Page 4:

Bio-informatics: the term

• Bio-informatics

• Computational biology

• Biological computing

Page 5:

Page 6:

Data……

★ Large-scale and high-throughput

★ High-dimensional

★ Non-linear

★ Noisy

★ Unequally distributed

Page 7:

Bioinformatics: what is most important?

• Algorithms?

• Data?

• Questions!

Page 8:

Bioinformatics: misconceptions

[Diagram labels: Biology; Computational; Theoretical; Experimental]

• Can it do everything?

• Biology / informatics

Page 9:

Sequences & Structures

Page 10:

Alignment

• blastall
– blastp
– blastn
– blastx
– tblastn
– tblastx

• ClustalX

E < 10^-20
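The E-value cutoff on this slide can be applied to BLAST tabular output (legacy `-m 8`, BLAST+ `-outfmt 6`), whose twelve default columns end with the E-value and bit score. The sketch below assumes that format; the function name `filter_hits` and the example rows are made up for illustration and are not real BLAST results.

```python
# Default column order of BLAST tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(lines, max_evalue=1e-20):
    """Return hits (as dicts of strings) whose E-value is below max_evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        if float(hit["evalue"]) < max_evalue:
            hits.append(hit)
    return hits


# Made-up rows for illustration only.
example = [
    "q1\tsA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380",
    "q1\tsB\t45.0\t80\t44\t2\t10\t89\t5\t84\t0.002\t38.1",
    "q2\tsC\t75.2\t150\t37\t1\t1\t150\t3\t152\t3e-25\t110",
]
strong = filter_hits(example)
print([(h["qseqid"], h["sseqid"]) for h in strong])
```

Only the first and third rows pass the cutoff; the second (E = 0.002) is discarded.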

Page 11:

Evolution

Constructing phylogenetic trees

• Phylip
• ClustalW
• PAML
• MEGA (Kumar et al., Briefings in Bioinformatics 2004)

Selection

• Coding region: Ka, Ks (dN, dS), Ka/Ks (dN/dS)
– PAML
– KaKs_Calculator
– K-Estimator
– MEGA
– Database: UCSC or Ensembl

• Non-coding region
– Ralph Haygood (Nature Genetics 2007)

• Recent populations
– LRH test (Sabeti et al., Nature 2002)
– iHS test (Voight et al., PLoS Biology 2006)
– XP-EHH (Sabeti et al., Nature 2007)
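As a rough illustration of how the Ka/Ks (dN/dS) ratio is usually read: a ratio above 1 suggests positive selection, near 1 neutral evolution, and below 1 purifying selection. The sketch below assumes Ka and Ks have already been estimated by one of the tools listed; the function name and the `tolerance` band are illustrative assumptions, not a statistical test.

```python
def classify_selection(ka, ks, tolerance=0.1):
    """Classify selection from Ka (dN) and Ks (dS).

    `tolerance` is an arbitrary illustrative band around 1; real analyses
    use a likelihood-ratio or similar test instead of a fixed band.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    omega = ka / ks
    if omega > 1 + tolerance:
        label = "positive selection"
    elif omega < 1 - tolerance:
        label = "purifying selection"
    else:
        label = "approximately neutral"
    return omega, label


# Made-up values for illustration.
for ka, ks in [(0.02, 0.2), (0.1, 0.1), (0.3, 0.1)]:
    print(ka, ks, classify_selection(ka, ks)[1])
```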

Page 12:

Evolution: an application

• Recent positive selection

– SLC24A5, SLC45A2: skin pigment, European population

– LARGE, DMD: Lassa fever virus, African population

– EDAR, EDA2R: the development of hair, teeth and exocrine glands, Asian population (Sabeti et al., Nature 2007).

Page 13:

Alternative Splicing (AS)

• Predicted from ESTs
• Predicted from cDNA clones
• Prediction of tissue-specific AS
• Splicing graphs and the EST assembly problem

Page 14:

Functional Domain

• TF binding sites
– TRANSFAC: a TF binding site database
– TESS: a web-based program

• Exons, introns, 5’UTR, 3’UTR
– UCSC

• Promoter
– CorePromoter

• Motif
– Weeder

• RNA family
– Rfam

• Protein domain
– Pfam: database
– InterPro: database
– HMMER: a program based on HMMs

Page 15:

Finding genes

Page 16:

Sequence mutations

Huang et al., Science 2007

Gymnopoulos et al., PNAS 2007

• Tool: SIFT & SAPRED
• Conservation score?
• Near functional sites?
• Similarity score?
• Surface?
• ...

PIK3CA

Page 17:

Modeling structures

• RNAfold
• RNAstructure

Page 18:

Modeling structures

• Homology modeling
– ESyPred3D
– SWISS-MODEL

• Ab initio prediction
– Rosetta

• Single mutation modeling
– MODELLER

• Visualization
– PyMOL

Page 19:

Optimization algorithms

Objective: max (or min) Y = f(x)
Constraint: x >= 0
Solution: find x = ?

Page 20:

Deterministic optimization algorithms - intelligent optimization

Genetic algorithms, simulated annealing
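A minimal sketch of simulated annealing for the problem form given earlier (min Y = f(x) subject to x >= 0). The objective function, cooling schedule, step size, and all parameter values below are illustrative assumptions; they are not taken from the slides.

```python
import math
import random


def simulated_annealing(f, x0, steps=5000, t0=1.0, cooling=0.999,
                        step_size=0.5, seed=0):
    """Minimize f(x) subject to x >= 0 with simulated annealing."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    t = t0
    for _ in range(steps):
        # Propose a neighbor and project it back onto the feasible set x >= 0.
        candidate = max(0.0, x + rng.gauss(0.0, step_size))
        fc = f(candidate)
        # Always accept improvements; accept worse moves with
        # probability exp(-(fc - fx) / t), which shrinks as t cools.
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = x, fx
        t *= cooling  # geometric cooling schedule
    return best_x, best_fx


def objective(x):
    """Hypothetical objective with its minimum value 1 at x = 3."""
    return (x - 3.0) ** 2 + 1.0


best_x, best_value = simulated_annealing(objective, x0=0.0)
print(round(best_x, 2), round(best_value, 2))
```

The search keeps the best point seen so far, so the returned solution never gets worse even when the walk accepts uphill moves at high temperature.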

Page 21:

DNA microarray data analysis

Page 22:

Microarray: overall workflow

Biological Question → Sample Preparation → Microarray Reaction → Microarray Detection → Data Analysis & Modelling

Taken from Schena & Davis

Page 23:

Microarray data matrix

Rows are genes g1, g2, ..., gi, ..., gN; columns are samples s1, s2, s3, ..., sj, ..., sM.

• Gene profile Gi: row i of the matrix
• Array profile Aj: column j of the matrix
• Mi,j: the entry for gene i in sample j

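Page 24:

Data preprocessing

• Missing data

– Causes
• Contaminated image
• Insufficient image resolution
• Dust or scratches on the chip

– Handling missing data
• Discard the data point (this also throws away useful information!)
• Repeat the experiment (too expensive!)
• Replace it with some value, such as the sample mean
• K-nearest neighbors estimation
• Singular value decomposition (SVD) estimation

• Normalization
– Log transformation
– Linear regression
– Scaling + shifting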
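K-nearest neighbors estimation followed by a log transformation can be sketched as below. This is a simplified illustration under stated assumptions: missing values are marked with `None`, distance is the root mean squared difference over columns observed in both genes, and the small expression matrix is made up.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k most similar gene rows."""
    def distance(a, b):
        # Compare two rows only on columns observed in both.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes observed in column j.
            neighbors = sorted(
                (distance(row, other), other[j])
                for r, other in enumerate(matrix)
                if r != i and other[j] is not None
            )[:k]
            if neighbors:
                filled[i][j] = sum(v for _, v in neighbors) / len(neighbors)
    return filled


# Made-up matrix: rows are genes, columns are samples.
expression = [
    [1.0, 2.0, None, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [8.0, 7.0, 6.0, 5.0],
]
filled = knn_impute(expression, k=2)
# Log transformation (base 2), valid here because all values are positive.
log_expression = [[math.log2(v) for v in row] for row in filled]
print(filled[0])
```

The missing value in the first gene is estimated from the two genes with the most similar profiles, and the dissimilar fourth gene is ignored.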

Page 25:

Microarray data: pattern classification

Preprocessing → Feature extraction → Machine learning → Decision

Training samples

New sample

Classifier → Decision

X → F(X) → Y

Page 26:

[Figure: samples plotted on axes x1 and x2; the line L: c1x1 + c2x2 - c = 0 separates group G1 from group G2]
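The separating line L: c1x1 + c2x2 - c = 0 turns into a classifier by checking on which side of the line a sample falls. In the sketch below, the coefficient values and the choice of which side is called G1 are illustrative assumptions.

```python
def classify(x1, x2, c1, c2, c):
    """Assign a sample to G1 or G2 by the sign of c1*x1 + c2*x2 - c."""
    score = c1 * x1 + c2 * x2 - c
    # Assumption: the positive side of the line is labeled G1.
    return "G1" if score > 0 else "G2"


# Hypothetical line x1 + x2 - 5 = 0.
print(classify(4.0, 3.0, c1=1.0, c2=1.0, c=5.0))  # score = 2
print(classify(1.0, 2.0, c1=1.0, c2=1.0, c=5.0))  # score = -2
```

Training a linear classifier amounts to choosing c1, c2, and c from labeled samples; the sketch only shows the decision step.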

Page 27:

Pattern classification algorithms

• Linear classifier
• Neural network
• Nearest neighbor
• Bayesian classifier
• Hidden Markov model classifier
• Decision tree
• Support vector machine

Page 28:

Microarray data: pattern clustering

• Hierarchical clustering
• K-means clustering
• Fuzzy C-means clustering
• Self-organizing map
• Replicator dynamics (Cui, 2004)
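Of the clustering methods listed, K-means is the simplest to sketch. The version below is a plain Lloyd iteration under simplifying assumptions: centroids start at the first k points (so the run is deterministic), and the two-sample expression profiles are made up.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's k-means on a list of equal-length tuples."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Made-up gene profiles measured in two samples.
profiles = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (1.2, 1.2)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

In practice the result depends on the initial centroids, so K-means is usually run from several random starts.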

Page 29:

Gene expression feature extraction

• Features that distinguish men from women

– Hair length?
– Skin smoothness?
– Voice?
– Height?
– Strength?
– Clothing?
– Posture?
– XX/XY

• Differentially expressed genes

• Gene set or pathway

• PCA

• SVD

• ISOMAP

• MDS

Page 30:

Characterizing relationships between genes

• Static relationship

– Pearson’s correlation

– Spearman’s correlation

– Mutual information

– Other similarity metrics

• Dynamic relationship

– Dynamic regression (Cui, 2005)

– Window-based correlation
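The two correlation measures and the window-based variant can be sketched in a few lines. The gene profiles below are made up; the windowed function is one plausible reading of "window-based correlation" (Pearson's correlation recomputed in each sliding window), which is an assumption.

```python
import math


def pearson(x, y):
    """Pearson's correlation; undefined if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def windowed_correlation(x, y, window):
    """Pearson's correlation in each sliding window of the given length."""
    return [pearson(x[i:i + window], y[i:i + window])
            for i in range(len(x) - window + 1)]


# Made-up profiles: gene_b rises monotonically but not linearly with gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
print(round(pearson(gene_a, gene_b), 3), round(spearman(gene_a, gene_b), 3))
print(len(windowed_correlation(gene_a, gene_b, window=3)))
```

Spearman's correlation is 1 for any strictly increasing relationship, while Pearson's correlation falls below 1 when the relationship is not linear.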

Page 31:

Gene expression networks

• Pearson’s correlation

– Hard threshold

– Weighted

• Mutual information

• Bayesian network
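A correlation-based co-expression network can be sketched as follows. The hard-threshold variant keeps an edge when |r| reaches a cutoff; for the weighted variant the sketch assumes a power-law weight |r|^beta, which is one common choice and not necessarily the one intended on the slide. Gene names, profiles, and parameter values are made up.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_network(profiles, threshold=0.8, beta=None):
    """Return {(gene_a, gene_b): weight} from pairwise Pearson correlation.

    beta=None: hard threshold, keep edges with |r| >= threshold, weight 1.
    beta set:  weighted network, every pair gets weight |r| ** beta
               (an assumed weighting scheme).
    """
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = abs(pearson(profiles[a], profiles[b]))
        if beta is None:
            if r >= threshold:
                edges[(a, b)] = 1.0
        else:
            edges[(a, b)] = r ** beta
    return edges


# Made-up expression profiles across four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
hard = coexpression_network(profiles, threshold=0.8)
weighted = coexpression_network(profiles, beta=6)
print(sorted(hard))
print(len(weighted))
```

The hard threshold discards weak correlations entirely, whereas the weighted form keeps every pair and lets the exponent suppress weak ones.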

Page 32:

Computational Systems Biology

Page 33:

What is Systems Biology?

• Not a new concept!

• Systems biology is an emergent field that aims at system-level understanding of biological systems (Kitano 2002).

• To understand biology at the system level, we must examine the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism.

Page 34:

Why Systems Biology?

http://www.newvisions.ucsb.edu/background/images/elephant.gif

[Figure: diagram with elements A, B, C, D, E and marks ++, +, -, 0]

Page 35:

Why Computational Systems Biology?

• Golden opportunity, now!

★ More than 16 international meetings in 2006

★ More than 10 books in the past two years

★ Journals: Molecular Systems Biology (Nature & EMBO), BMC Systems Biology, IET Systems Biology, EURASIP Journal on Bioinformatics and Systems Biology, etc.

★ Large-scale, high-throughput data

Page 36:

Fields of Computational Systems Biology?

• Construction of biological networks, such as gene regulatory networks, cellular signaling networks, metabolic networks, protein-protein interaction networks, genetic interaction networks, gene co-expression networks, and literature networks.

Page 37:

Fields of Computational Systems Biology?

• Properties of systems, such as topology, robustness, tolerance.

Albert et al., Nature 2000
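One simple way to probe robustness and tolerance, in the spirit of the error-and-attack-tolerance analysis cited here, is to remove nodes and see how much of the network stays connected. The sketch below is an illustration only: the toy hub network is hypothetical, and the size of the largest connected component is used as the measure of damage.

```python
from collections import deque


def largest_component(adjacency, removed=frozenset()):
    """Size of the largest connected component after removing some nodes."""
    seen = set(removed)  # removed nodes are never visited
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Hypothetical toy network: one hub linked to five peripheral nodes,
# two of which are also linked to each other.
network = {
    "hub": {"a", "b", "c", "d", "e"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
    "e": {"hub"},
}
hub = max(network, key=lambda n: len(network[n]))
intact = largest_component(network)
after_attack = largest_component(network, removed={hub})
after_failure = largest_component(network, removed={"c"})
print(intact, after_attack, after_failure)
```

Removing a peripheral node barely changes the connected part, while removing the highest-degree node fragments it, which is the contrast between random failure and targeted attack.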

Page 38:

Fields of Computational Systems Biology?

Goh et al., PNAS 2007

p53 region, TGFβ region, Ras region

Cui et al., MSB 2007

• Biological questions at the systems level, such as diseases, evolution, medicine, etc.

Page 39:

An application: microRNA-disease systems biology

[Figure: network diagram with disease nodes D1, D2, D3 and microRNA nodes M1, M2, M3, M4]

Page 40:

Human microRNA disease network

Page 41:

My suggestions, and problems where I need your help

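Page 42:

My Suggestions

First, read through the relevant literature once and record the relevant data.

Second, go through these slides once, or consult a bioinformatics specialist, to see whether there are problems that bioinformatics alone can solve.

Third, consider whether the data in the papers you read could themselves be analyzed with bioinformatics, for example by meta-analysis or systems biology.

Fourth, new knowledge, bioinformatics included, is not hard; you will appreciate this once you complete a project yourself!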

Page 43:

Work for which we need experimental validation

• The functions of mir-423 and mir-608, which are under recent positive selection

– SLC24A5, SLC45A2: skin pigment, European population

– LARGE, DMD: Lassa fever virus, African population

– EDAR, EDA2R: the development of hair, teeth and exocrine glands, Asian population (Sabeti et al., Nature 2007).

• Experimental validation of a potential liver-disease-related microRNA: miR-149

– SNP: rs2292832; CEU and YRI 80% C, 20% U; CHB and JPT 20% C, 80% U.

– The host gene is GPC1 (glypican 1, a heparan sulfate proteoglycan), which is overexpressed in pancreatic cancer; another member of this gene family (GPC3) is a liver cancer marker.

– GPC1 is a receptor for heparin-binding growth factors

– Not expressed in liver / expressed in liver

– Targets HEV and HGV

– Free energy: C: -54.9; U: -52.7

Page 44:

Work for which we need experimental validation

[Figure: chart; y-axis from 0.2 to 0.45, x-axis from 0 to 7]

• Cardiovascular
– miR-1
– miR-133
– miR-199a
– miR-21
– miR-23a
– miR-23b
– miR-208

• Liver (miR-122)
• Kidney
• Brain
• Lung
• ...

Page 45:

Thank you all. Comments and guidance are welcome.

Qinghua Cui (崔庆华): 15801250611, 82801585
Email: [email protected]
