An Introduction to Bioinformatics
Qinghua Cui (崔庆华), Department of Medical Informatics, Peking University Health Science Center
11-16, 2008
Introduction to basic concepts
Bioinformatics: a definition by NIH (1995)
Bioinformatics is defined as a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation, that combines the tools and techniques of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data.
Bio-informatics: the term
• Bio-informatics
• Computational biology
• Biological computing
Data
★ Large-scale and high-throughput
★ High-dimensional
★ Non-linear
★ Noisy
★ Unequally distributed
Bioinformatics: what is most important?
• Algorithms?
• Data?
• Questions!
Bioinformatics: misconceptions
Biology: computational, theoretical, experimental
• Can it do everything?
• Biology / informatics
Sequences & Structures
Alignment
• blastall
– blastp
– blastn
– blastx
– tblastn
– tblastx
• ClustalX
E < 10^-20
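The E-value cutoff above can be applied to BLAST results after the search. The following is a minimal sketch, assuming the hits were saved in tabular form (`-m 8` in the legacy `blastall`, `-outfmt 6` in BLAST+), where the default 12 columns place the E-value in column 11. The function name `filter_hits` and the sample rows are illustrative and do not come from the slides.

```python
def filter_hits(lines, max_evalue=1e-20):
    """Keep tabular BLAST hits whose E-value is below max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])  # column 11 of the default tabular layout
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue))  # query, subject, E-value
    return kept


# Made-up rows, for illustration only.
sample = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t380",
    "q1\ts2\t40.0\t80\t48\t2\t5\t84\t10\t89\t0.002\t35.1",
]
strong_hits = filter_hits(sample)
```

Only the first row survives the filter, since 0.002 is far above the cutoff.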
Evolution
Constructing phylogenetic trees
• Phylip
• Clustalw
• PAML
• MEGA (Kumar et al., Briefings in Bioinformatics 2004)
Selection
• Coding region: Ka, Ks (dn, ds), Ka/Ks (dn/ds)
– PAML
– Kaks_calculator
– K-estimator
– Mega
– Database: UCSC or ENSEMBL
• Non-coding region
– Ralph Haygood (Nature Genetics 2007)
• Recent populations
– LRH test (Sabeti et al., Nature 2002)
– iHS test (Voight et al., PLoS Biology 2006)
– XP-EHH (Sabeti et al., Nature 2007)
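For coding regions, the Ka/Ks ratio is usually read as follows: a ratio well above 1 suggests positive selection, a ratio near 1 suggests neutral evolution, and a ratio well below 1 suggests purifying selection. The sketch below only encodes that rule of thumb. The function name and the `tolerance` band around 1 are assumptions for illustration; tools such as PAML estimate Ka and Ks and test significance properly.

```python
def selection_regime(ka, ks, tolerance=0.1):
    """Classify a gene by its Ka/Ks ratio (rule of thumb, not a statistical test)."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return ratio, "positive selection"
    if ratio < 1 - tolerance:
        return ratio, "purifying selection"
    return ratio, "approximately neutral"
```

In practice the ratio alone is not evidence; it is a first screen before a likelihood-based test.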
Evolution: an application
• Recent positive selection
– SLC24A5, SLC45A2: skin pigment, European population
– LARGE, DMD: Lassa fever virus, African population
– EDAR, EDA2R: the development of hair, teeth and exocrine glands, Asian population (Sabeti, Nature 2007)
Alternative Splicing (AS)
• Predicted from ESTs
• Predicted from cDNA clones
• Prediction of tissue-specific AS
• Splicing graphs and EST assembly problem
Functional Domain
• TF binding sites
– TRANSFAC: a TF binding site database
– TESS: a web-based program
• Exons, introns, 5'UTR, 3'UTR
– UCSC
• Promoter
– CorePromoter
• Motif
– Weeder
• RNA family
– Rfam
• Protein domain
– Pfam: database
– InterPro: database
– HMMER: a program based on HMM
Finding genes
Sequence mutations
Huang et al., Science 2007
Gymnopoulos et al., PNAS 2007
• Tool: SIFT & Sapred
• Conservation score?
• Near functional sites?
• Similarity score?
• Surface?
• ...
PIK3CA
Modeling structures
• RNAfold
• RNAStructure
Modeling structures
• Homology modeling
– ESyPred3D
– Swiss Model
• Ab initio prediction
– Rosetta
• Single mutation modeling
– Modeller
• Visualization
– Pymol
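Optimization algorithms
Objective: max (or min) Y = f(x)
Constraint: x >= 0
Solution: find x = ?
Deterministic optimization algorithms vs. intelligent optimization
Intelligent optimization: genetic algorithms, simulated annealing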
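To make the objective / constraint / solution framing concrete, here is a minimal simulated annealing sketch that maximizes Y = f(x) subject to x >= 0. Everything specific in it is an assumption chosen for illustration: the toy objective, the geometric cooling schedule, the Gaussian proposal, and the fixed random seed.

```python
import math
import random


def simulated_annealing(f, x0=0.0, steps=5000, t_start=1.0, t_end=1e-3,
                        step_size=0.5, seed=0):
    """Maximize f(x) subject to x >= 0 and return the best (x, f(x)) visited."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        # Propose a neighbor and clamp it to respect the constraint x >= 0.
        candidate = max(0.0, x + rng.gauss(0.0, step_size))
        fc = f(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc >= fx or rng.random() < math.exp((fc - fx) / t):
            x, fx = candidate, fc
            if fx > best_f:
                best_x, best_f = x, fx
    return best_x, best_f


def objective(x):
    """Toy objective with its maximum at x = 2 (illustrative only)."""
    return -(x - 2.0) ** 2 + 3.0


best_x, best_f = simulated_annealing(objective)
```

Accepting some worse moves at high temperature is what lets the search escape local optima, which is the main reason to prefer it over a deterministic hill climb on rugged objectives.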
DNA microarray data analysis
Biological Question
Sample Preparation
Data Analysis & Modelling
Microarray Reaction
Microarray Detection
Taken from Schena & Davis
Microarray: overall workflow
Microarray data matrix
Rows are genes g1, g2, ..., gi, ..., gN; columns are samples s1, s2, ..., sj, ..., sM.
Row i is the gene profile Gi; column j is the array profile Aj; the entry in row i and column j is Mi,j.
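Data preprocessing
• Missing data
– Causes
  • Image contamination
  • Insufficient image resolution
  • Dust or scratches on the chip
– Ways to handle missing data
  • Discard the data point (which also throws away useful information!)
  • Repeat the experiment (too expensive!)
  • Substitute a value, such as the sample mean
  • K-nearest neighbors estimation
  • Singular value decomposition (SVD) estimation
• Normalization
– Log transformation
– Linear regression
– Scaling + shifting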
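Two of the preprocessing steps above are easy to sketch in plain Python: K-nearest neighbors estimation of missing values, and the log transformation. This is an illustrative version under stated assumptions: missing entries are encoded as `None`, neighbors are other genes (rows), and distance is the root mean square difference over the samples both genes have measured. Published KNN imputation methods differ in these details.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries of a genes x samples matrix from the k most similar genes."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue  # a neighbor must have a value in column j
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


def log2_transform(matrix):
    """Apply a log2 transformation to every (positive) entry."""
    return [[math.log2(v) for v in row] for row in matrix]


# Made-up intensities; the first gene is missing its third sample.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
completed = knn_impute(data, k=2)
```

The missing entry is filled from the two genes whose measured values are closest, so the outlying fourth gene does not influence the estimate.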
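Microarray data: pattern classification
Preprocessing → Feature extraction → Machine learning → Decision
Training samples
New samples
Classifier → Decision
Classifier: input X, output Y = F(X)
[Figure: two groups, G1 and G2, in the (x1, x2) plane, separated by the line L: c1x1 + c2x2 - c = 0]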
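The separating line L: c1x1 + c2x2 - c = 0 can be learned from labeled training samples. Below is a minimal perceptron sketch, one of the simplest linear classifiers; the slides do not say which training rule was intended, so the perceptron update, the learning rate, and the toy points are all assumptions. Labels are +1 for G1 and -1 for G2.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Learn (c1, c2, c) so that the sign of c1*x1 + c2*x2 - c matches the label."""
    c1 = c2 = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            predicted = 1 if c1 * x1 + c2 * x2 - c > 0 else -1
            if predicted != y:
                # Move the line toward the misclassified point.
                c1 += lr * y * x1
                c2 += lr * y * x2
                c -= lr * y
                errors += 1
        if errors == 0:
            break  # every training sample is on the correct side
    return c1, c2, c


def classify(point, coefficients):
    """Return +1 (G1) or -1 (G2) for a new sample."""
    c1, c2, c = coefficients
    x1, x2 = point
    return 1 if c1 * x1 + c2 * x2 - c > 0 else -1


# Toy training samples (illustrative only).
g1 = [(2.0, 3.0), (3.0, 3.0), (3.0, 4.0)]
g2 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
points = g1 + g2
labels = [1] * len(g1) + [-1] * len(g2)
line = train_perceptron(points, labels)
```

The perceptron only converges when the two groups are linearly separable; for overlapping groups, a support vector machine or another method from the list of classifiers is a better fit.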
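Pattern classification algorithms
• Linear classifier
• Neural network
• Nearest neighbor
• Bayesian classifier
• Hidden Markov model classifier
• Decision tree
• Support vector machine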
Microarray data: pattern clustering
• Hierarchical clustering
• K-means clustering
• Fuzzy C-means clustering
• Self-organizing map
• Replicator dynamics
(Cui, 2004)
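Of the clustering methods listed, K-means is the shortest to write down. The sketch below clusters gene expression profiles by alternating an assignment step and a centroid update. It is a bare-bones illustration: initializing centroids with the first k profiles is an assumption made to keep the example deterministic, and real analyses usually use several random restarts.

```python
def kmeans(points, k, iterations=100):
    """Cluster points into k groups; returns (assignment, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive, deterministic initialization
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Made-up expression profiles over three samples.
profiles = [
    (1.0, 1.1, 0.9),
    (5.0, 5.2, 4.9),
    (1.2, 0.9, 1.0),
    (4.8, 5.1, 5.0),
]
cluster_labels, centers = kmeans(profiles, k=2)
```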
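Gene expression feature extraction
• Features that distinguish men from women
– Hair length?
– Skin smoothness?
– Voice?
– Height?
– Strength?
– Clothing?
– Posture?
– XX/XY
• Differentially expressed genes
• Gene set or pathway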
• PCA
• SVD
• ISOMAP
• MDS
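As a concrete case of the first feature type, differentially expressed genes can be ranked by a two-group test statistic. The sketch below uses Welch's t statistic; the slides do not name a specific test, so this choice, the gene names, and the values are assumptions for illustration. It returns the statistic only and does not compute p-values or correct for multiple testing.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


def rank_genes(expression, n_a):
    """Rank genes by |t|; the first n_a values of each profile belong to group A."""
    scores = {gene: welch_t(values[:n_a], values[n_a:])
              for gene, values in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Hypothetical profiles: three samples per group.
expression = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # clearly shifted between groups
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no real difference
}
ranking = rank_genes(expression, n_a=3)
```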
Characterizing gene relationships
• Static relationship
– Pearson’s correlation
– Spearman’s correlation
– Mutual information
– Other similarity metric
• Dynamic relationship
– Dynamic regression (Cui, 2005)
– Window based correlation
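The two correlation measures under "static relationship" differ in what they detect: Pearson's correlation measures linear association, while Spearman's correlation is Pearson's correlation computed on ranks and therefore captures any monotonic relationship. A minimal sketch in plain Python follows; the function names are illustrative, and ties receive average ranks.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            result[index] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [v ** 3 for v in gene_a]  # monotonic but non-linear in gene_a
```

For `gene_a` and `gene_b`, Spearman's correlation is 1 while Pearson's is below 1, which is why rank-based measures are often preferred when the relationship is not expected to be linear.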
Gene expression networks
• Pearson's correlation
– Hard threshold
– Weighted
• Mutual information
• Bayesian network
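The difference between the hard-threshold and weighted variants of a correlation network can be shown in a few lines. In this sketch the pairwise correlations are taken as given, and the cutoff of 0.8 and the power transform |r|^beta with beta = 6 are assumed values; the power transform is one common way to weight edges, not necessarily the one the slides had in mind.

```python
def hard_threshold_network(correlations, cutoff=0.8):
    """Unweighted network: keep an edge only if |r| reaches the cutoff."""
    return {pair for pair, r in correlations.items() if abs(r) >= cutoff}


def weighted_network(correlations, beta=6):
    """Weighted network: every pair gets a weight |r| ** beta between 0 and 1."""
    return {pair: abs(r) ** beta for pair, r in correlations.items()}


# Hypothetical pairwise correlations between three genes.
correlations = {
    ("g1", "g2"): 0.95,
    ("g1", "g3"): -0.85,
    ("g2", "g3"): 0.30,
}
edges = hard_threshold_network(correlations)
weights = weighted_network(correlations)
```

The hard threshold discards the weak g2-g3 pair outright, whereas the weighted version keeps it with a weight close to zero, so no information is lost to an arbitrary cutoff.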
Computational Systems Biology
What is Systems Biology?
• Not a new concept!
• Systems biology is an emergent field that aims at system-level understanding of biological systems (Kitano 2002).
• To understand biology at the system level, we must examine the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism.
Why Systems Biology?
http://www.newvisions.ucsb.edu/background/images/elephant.gif
Why Computational Systems Biology?
• Golden opportunity, now!
★ More than 16 international meetings in 2006
★ More than 10 books in the past two years
★ Journals: Molecular Systems Biology (Nature & EMBO), BMC Systems Biology, IET Systems Biology, EURASIP Journal on Bioinformatics and Systems Biology, etc.
Large-scale, high-throughput data
Fields of Computational Systems Biology?
• Biological networks construction, such as gene regulatory networks, cellular signaling networks, metabolic networks, protein-protein interaction networks, genetic interaction networks, gene co-expression networks, literature networks.
Fields of Computational Systems Biology?
• Properties of systems, such as topology, robustness, tolerance.
Albert et al., Nature 2000
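Topology and robustness can be probed with a simple experiment in the spirit of the error-versus-attack comparison: remove a node and see how much of the network stays connected. The sketch below measures the largest connected component after removing either the best-connected hub or a peripheral node. The toy star-shaped network and the function names are assumptions for illustration.

```python
def largest_component(adjacency, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


def hub(adjacency):
    """Node with the highest degree."""
    return max(adjacency, key=lambda node: len(adjacency[node]))


# Toy network: node A is a hub connected to B, C, D and E.
network = {
    "A": {"B", "C", "D", "E"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
    "E": {"A"},
}
after_attack = largest_component(network, removed={hub(network)})
after_failure = largest_component(network, removed={"B"})
```

Losing a peripheral node barely changes connectivity, while losing the hub fragments the network completely; hub-dominated networks tolerate random failures well but are fragile under targeted attack.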
Fields of Computational Systems Biology?
Goh et al., PNAS 2007
p53 region, TGFβ region, Ras region
Cui et al., MSB 2007
• Biological questions at the systems level, such as disease, evolution, and medicine.
An application: microRNA-disease systems biology
[Figure: network linking microRNAs (M1 to M4) and diseases (D1 to D3)]
Human microRNA disease network
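One plausible way to build a disease network from microRNA-disease associations is a bipartite projection: two diseases are linked when they share at least one microRNA, and the edge weight counts the shared microRNAs. The slides do not spell out the construction, so treat this as an assumed sketch; the associations below reuse the M and D labels as placeholders and are not real data.

```python
from itertools import combinations


def disease_network(associations):
    """Project miRNA -> diseases associations onto disease pairs.

    Returns a dict mapping (disease_1, disease_2) to the number of shared miRNAs.
    """
    edges = {}
    for diseases in associations.values():
        for a, b in combinations(sorted(diseases), 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


# Placeholder associations (hypothetical).
associations = {
    "M1": {"D1", "D2"},
    "M2": {"D1"},
    "M3": {"D2", "D3"},
    "M4": {"D1", "D2"},
}
shared = disease_network(associations)
```

The same function yields a microRNA network if the input is inverted to map each disease to its set of microRNAs.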
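My suggestions, and questions where I need everyone's help
First, read through the relevant references and record the relevant data.
Second, go through this PPT once, or consult a bioinformatics specialist, to see whether there are problems that bioinformatics alone can solve.
Third, consider whether the data in the literature you read could itself be analyzed with bioinformatics, for example by meta-analysis or systems biology.
Fourth, new knowledge, bioinformatics included, is not hard; you will see this for yourself once you have completed a project!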
My Suggestions
Work for which we need experimental validation
• The functions of mir-423 and mir-608, which are under recent positive selection
– SLC24A5, SLC45A2: skin pigment, European population
– LARGE, DMD: Lassa fever virus, African population
– EDAR, EDA2R: the development of hair, teeth and exocrine glands, Asian population (Sabeti, Nature 2007)
• Experimental validation of a potential liver-disease-related microRNA: miR-149
– SNP: rs2292832; CEU and YRI 80% C, 20% U; CHB and JPT 20% C, 80% U.
– The host gene is GPC1 (Glypican 1, a heparan sulfate proteoglycan), which is overexpressed in pancreatic cancer; another member of this gene family, GPC3, is a liver cancer marker.
– GPC1 is a receptor for heparin-binding growth factors
– Not expressed in liver / expressed in liver
– Targets HEV and HGV
– Free energy: C: -54.9; U: -52.7
Work for which we need experimental validation
• Cardiovascular
– miR-1
– miR-133
– miR-199a
– miR-21
– miR-23a
– miR-23b
– miR-208
• Liver (miR-122)
• Kidney
• Brain
• Lung
• ...