Bioinformatics lecture notes
Post on 17-Nov-2015
HANOI UNIVERSITY OF AGRICULTURE, FACULTY OF BIOTECHNOLOGY
Lecture notes
APPLIED BIOINFORMATICS
NGUYỄN ĐỨC BÁCH
HANOI, 8/2013
PART 1. GENERAL INTRODUCTION
CHAPTER 1. INTRODUCTION TO BIOINFORMATICS
  1.1. Concept
  1.2. Biological foundations and the development of bioinformatics
  1.3. The role of bioinformatics in biological research
  1.4. Tasks and research directions of bioinformatics
  1.5. Development trends of bioinformatics
  Chapter 1 summary
  Chapter 1 review questions
CHAPTER 2. BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
  2.1. Nucleic acids and proteins
  2.2. Structure of nucleic acids
  2.3. Genomes and genome research
  2.4. Gene discovery and determination of gene function in the genome
  2.5. Functional activity of genes and regulation of gene activity
  2.6. The proteome and protein research (proteomics)
  2.7. Evolution and the molecular nature of the evolution of organisms
  2.8. Analysis of evolutionary relationships among organisms
  Chapter 2 summary
  Chapter 2 review questions
CHAPTER 3. SEARCHING AND MANAGING RESEARCH LITERATURE
  3.1. Methods of searching for information
  3.2. How to find literature for research
  3.3. Getting to know PubMed
  3.4. How to manage research literature
  Chapter 3 summary
  Chapter 3 review questions
PART 2. BIOLOGICAL DATABASES AND SEQUENCE SUBMISSION TO DATABASES
CHAPTER 4. BIOLOGICAL DATABASES
  4.1. Primary databases
    4.1.1. Nucleotide sequence databases
    4.1.2. Protein sequence databases
    4.1.3. Molecular structure databases
  4.2. Secondary databases
  4.3. Other databases
    4.3.1. Genotype and phenotype databases
    4.3.2. Genotype database (PhenomicDB)
    4.3.3. PubChem
  4.4. Gene banks
  Chapter 4 summary
  Chapter 4 review questions
CHAPTER 5. SEQUENCING AND SUBMITTING SEQUENCES TO GENE BANKS
  5.1. Nucleotide sequencing
  5.2. Genome sequencing
  5.3. Sequence assembly
  5.4. Sequence submission
  5.5. Sequence submission tools
    5.5.1. Information to prepare before submitting a sequence
    5.5.2. Example of sequence submission with WebIn
    5.5.3. Example of sequence submission with Sequin
  Chapter 5 summary
  Chapter 5 review questions
PART 3. ANALYSIS TOOLS: MINING AND PROCESSING BIOLOGICAL SEQUENCE DATA
CHAPTER 6. GENOME BROWSERS
  6.1. The concept of a genome browser
  6.2. Introduction to some important genome browsers
    6.2.1. Ensembl
    6.2.2. UCSC
    6.2.3. NCBI Genomes and MapViewer
  6.3. Characteristics and applications of genome browsers
  Chapter 6 summary
  Chapter 6 review questions
CHAPTER 7. GETTING TO KNOW THE TOOLS FOR ANALYZING BIOLOGICAL DATABASES
  7.1. Getting to know the basic analysis tools
    7.1.1. Finding and copying sequences
    7.1.2. Tools for searching for similar sequences
  7.2. Finding functional and conserved regions
    7.2.1. Multiple sequence alignment
    7.2.2. Restriction map construction
    7.2.3. Predicting the secondary and tertiary structure of proteins
    7.2.4. Nucleic acid sequence analysis
    7.2.5. Designing PCR primers and nucleic acid hybridization probes
    7.2.6. Identifying open reading frames
    7.2.7. Finding scientific articles
    7.2.8. Sequence assembly
    7.2.9. Analysis of evolutionary relationships
    7.2.10. Protein analysis
    7.2.11. Gene expression studies
  7.3. Groups of analysis tools
    7.3.1. NCBI analysis tools
    7.3.2. EMBL tools
    7.3.3. ExPASy tools
    7.3.4. Other tools
  Chapter 7 summary
  Chapter 7 review questions
CHAPTER 8. GETTING TO KNOW BIOLOGICAL DATA ANALYSIS
  8.1. Finding data in databases
    8.1.1. Sequence data
    8.1.2. Structure data
    8.1.3. Other data
  8.2. Sequence analysis
    8.2.1. Sequence comparison
    8.2.2. Analysis of open reading frames and coding regions
    8.2.3. Searching for promoters and gene regulatory regions
    8.2.4. Searching for functional regions of proteins (functional motif searching)
    8.2.5. Predicting and modeling protein interactions
CHAPTER 9. SEQUENCE ALIGNMENT AND ITS PRINCIPLES
  9.1. Introduction to sequence alignment
  9.2. Principles of sequence alignment
  9.3. Multiple sequence alignment and its principles
  9.4. Tools for searching for homologous sequences
CHAPTER 10. ANALYSIS OF EVOLUTIONARY RELATIONSHIPS
  10.1. Concept
  10.2. Data used to build phylogenetic trees
    10.2.1. Distance-based methods
    10.2.2. Character-based methods
  10.3. Choosing an evolutionary model
  10.4. Evaluating phylogenetic trees
PART 1. GENERAL INTRODUCTION
CHAPTER 1. INTRODUCTION TO BIOINFORMATICS
1.1. Concept
Bioinformatics is the science of applying mathematics and computer science to biology, especially molecular biology and medicine. The term was first introduced by Paulien Hogeweg in 1979 to describe the study of processes in biological systems. In the late 1980s the term was adopted in genetics and genome research. Bioinformatics deals with sequencing and with managing, analyzing and mining biological databases. Today it involves building and developing databases, algorithms, statistics and computational techniques to solve theoretical and experimental problems in the management and analysis of biological data. It also includes modeling and predicting interactions between molecules and biological processes.
Figure 1: Bioinformatics and its relationship to other fields
1.2. Biological foundations and the development of bioinformatics
The discovery that DNA carries genetic information, and the determination of its structural model, opened the era of molecular biology. DNA encodes mRNA and other types of RNA. Proteins translated from mRNA carry out many biological functions in the cell, including the regulation of gene activity and of other biological processes. Although sequencing the genome of an organism has now become straightforward, elucidating the genetic information contained in the genome, the functional activity of genes and the interactions among them remains a major challenge. In humans, for example, each somatic cell contains 23 pairs of chromosomes, and the genome is about 3.2 × 10^9 base pairs long and contains roughly 23,000 genes (1). Transcription and translation are now understood in outline, but determining the exact number of genes, their positions and their interactions is still a hard question.
(1) International Human Genome Sequencing Consortium (2004). "Finishing the euchromatic sequence of the human genome." Nature 431 (7011): 931-45.
With the rapid development of new techniques and technologies, biological data, mainly nucleotide and amino acid sequences, are generated in ever larger amounts every day. Collecting and storing these data, making them accessible and searchable, and analyzing and comparing the relationships among data held in huge databases is the task of bioinformatics. In practice this requires bioinformaticians and computer scientists to develop new algorithms that improve accuracy and reduce the time needed by biological researchers.
Bioinformatics is a multidisciplinary field. To some extent it rests on molecular biology (the source of the data to be analyzed), computer science (which provides the hardware for analysis and the computer networks for comparing and cross-checking results) and data analysis algorithms. These three elements are vital to bioinformatics. Molecular biology is itself a relatively new field built on several basic sciences, most importantly genetics, biochemistry and cell biology. The emergence, study and application of bioinformatics therefore require basic interdisciplinary knowledge together with an understanding of computer science. Below are some important historical milestones in the development of molecular biology and bioinformatics.
Year  Event
1930  Tiselius introduces electrophoresis for analyzing proteins in solution.
1951  Pauling and Corey propose the alpha helix and beta pleated sheet structures.
1953  Watson and Crick propose the DNA double helix model, based on X-ray diffraction data obtained by Franklin and Wilkins.
1954  Perutz's group develops the heavy-atom method to solve the phase problem in protein crystallography.
1955  The first protein sequence to be determined, that of bovine insulin, is reported by F. Sanger.
1970  The Needleman-Wunsch algorithm for sequence alignment is published.
1972  The first recombinant DNA molecule is created by Paul Berg and his group.
1973  The Brookhaven Protein Data Bank is announced.
1974  Vint Cerf and Robert Kahn develop TCP, the communication protocol that underlies the internet.
1975  Two-dimensional electrophoresis is developed by P. H. O'Farrell.
1975  The Southern blot method is described and published by E. M. Southern.
1977  The protein database, PDB, is formally described.
1977  Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) publish methods for DNA sequencing.
1977  The first complete genome sequence of an organism, phage ΦX174, is published. The genome contains 5,386 bases and encodes the nine genes known at the time.
1980  Multi-dimensional NMR is used to determine protein structure.
1981  The Smith-Waterman algorithm for sequence alignment is published.
1982  The Genetics Computer Group (GCG) creates a set of molecular biology analysis tools at the Wisconsin Biotechnology Center, University of Wisconsin.
1985  The FASTP algorithm is published.
1985  The PCR reaction is described by Kary Mullis and colleagues.
1986  The term "genomics" appears for the first time, describing the science of mapping, sequencing and analyzing genes. The term was coined by Thomas Roderick and later became the name of a well-known journal, Genomics.
1986  The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL).
1987  The yeast artificial chromosome (YAC) is introduced.
1987  The physical map of E. coli is published.
1987  The Perl programming language is developed by Larry Wall.
1988  The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine, National Institutes of Health.
1988  The Human Genome Project is initiated (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988).
1988  The FASTA algorithm for sequence comparison is published by Pearson and Lipman.
1988  Des Higgins and Paul Sharp announce the CLUSTAL program.
1990  The BLAST program is released (Altschul et al.).
1990  Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Its products, Look and SegMod, are used to design molecular and protein models.
1990  InforMax is founded in Bethesda, MD. Its products are software for sequence analysis, database management and analysis, searching, displaying data, clone construction, mapping and primer design.
1991  CERN, the research institute in Geneva, announces the creation of the protocols that make up the World Wide Web.
1997  The genome of E. coli (4.7 Mbp) is published.
1998  The genomes of Caenorhabditis elegans and baker's yeast are published.
1998  The Swiss Institute of Bioinformatics is established as a non-profit research foundation.
2000  The genome of Pseudomonas aeruginosa (6.3 Mbp) is published.
2000  The genome of Arabidopsis thaliana (100 Mb) is sequenced.
2000  The genome of Drosophila melanogaster (180 Mb) is sequenced.
2001  The human genome (3,000 Mbp) is published.
2004  The draft genome of the rat, Rattus norvegicus, is published.
2004  Next-generation sequencing begins, starting with 454 sequencing.
2008  The 1000 Genomes Project, which set out to sequence the genomes of about 1,000 people, is launched (http://www.1000genomes.org/).
1.3. The role of bioinformatics in biological research
In recent decades genomics and molecular biotechnology have developed rapidly and generated a very large volume of information that serves as the basis for comparative analyses. Analyzing these databases requires algorithms combined with computer science. By closely combining databases, algorithms and computer science, bioinformatics helps clarify the nature of biological processes. Its role can be summarized as follows:
- Collecting, organizing and managing biological data (databases);
- Developing data search tools (search tools, data mining);
- Sequence analysis, genome annotation and genomic comparison;
- Structure modeling, molecular interaction modelling and prediction of protein structure;
- Protein function analysis, protein interactions and metabolic pathways, modeling biological systems, and analysis of gene expression profiles;
- Genome sequence analysis to detect genes, mutated genes and cancer, to determine the role of genes and to move toward therapies (genome analysis and treatment);
- Analysis of evolutionary relationships and population genetics using software and computational tools;
- High-throughput image analysis;
- Developing algorithms and software that meet the needs of scientists in biology.
Sequence analysis
Sequence analysis is a process made up of many operations: retrieving sequence data, comparing sequences with one another, and combining the results with other tools to extract the information contained in the sequence under study. The information obtained includes homology, functional regions (domains), characteristic regions (motifs), the positions of genes in the genome (gene finding), and elements of gene structure and regulation (promoters, introns, exons, transcription-regulating regions).
In 1977 the first genome to be sequenced was that of phage ΦX174. Since then the genomes of thousands of organisms have been sequenced and stored in gene banks. Many important bioinformatics tools and programs that support the analysis and comparison of biological sequences have been developed and are widely used.
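As a small illustration of finding a motif in a sequence, the sketch below scans a protein for the N-glycosylation pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The function name `find_motif` and the example sequence are made up for this illustration; real motif searches rely on curated pattern databases rather than a single hand-written pattern.

```python
import re

# N-glycosylation pattern N-{P}-[ST]-{P} written as a regular expression.
# Wrapping it in a lookahead lets overlapping occurrences be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (1-based start position, matched residues) for each occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein.upper())]


# Hypothetical example sequence, not taken from any database.
hits = find_motif("MKNGTALNPSANNSTW")
for position, residues in hits:
    print(position, residues)
```

The candidate starting at position 8 (NPSA) is skipped because proline follows the asparagine, while the overlapping sites at positions 12 and 13 are both reported.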
Genome annotation
In genome research, the process of marking DNA sequences and attaching biological information to them is called annotation. The first software system for genome annotation was built by Dr. Owen White in 1995; its first subject was the bacterium Haemophilus influenzae. The initial aim was to find the genes, tRNAs and other elements in the genome and then to assign known biological functions to them. Many genome annotation systems have since been developed. They are fundamentally similar but differ in their algorithms and computer programs.
Genome comparison
The core of genome comparison is to establish the similarity or relationship between genes (orthology analysis) or other shared features in the genomes of different organisms. Genome comparisons are displayed as maps of correspondence between genomes, which make it possible to detect the events, and the extent of genome change, that during evolution produced the differences between genomes, between gene regions or between genes.
Complex evolutionary events at many levels drive genome evolution. At the lowest (molecular) level, point mutations change the genome at single nucleotides. Such changes may have serious consequences, may be neutral, or may have no effect at all. At a higher level, duplications, inversions, deletions and the movement of DNA sequences within chromosomes (jumping genes, transposable elements) change the physical organization of the genome. Over time, whole genomes take part in hybridization, polyploidization and endosymbiosis, leading to speciation. The complexity of genome evolution makes it difficult to develop algorithms and mathematical models that simulate it exactly. For this reason many algorithms in bioinformatics are heuristic rather than exact. Algorithms and models in common use today include heuristics, approximation algorithms, parsimony models, Markov chain Monte Carlo algorithms, Bayesian analysis and probabilistic models.
Building and modeling structures
Predicting the structure of protein molecules is one of the important applications of bioinformatics. The amino acid sequence of a protein can be determined directly or deduced from the nucleotide sequence of the gene that encodes it. To model a structure one needs specific information about the protein, ideally its crystal structure. When the protein is hard to crystallize, or only the amino acid sequence is available, the sequence of the protein or polypeptide can be compared with known proteins in databases, using algorithms that detect homology, to derive an approximate model of the unknown structure. As a rule of thumb, sequences that are more than 40% identical can be used for structure prediction. Although sequence similarity and structural similarity are closely correlated, in many cases proteins with similar structures have quite different amino acid sequences. Structure determination or modeling therefore cannot rely purely on algorithms or computer programs; in many cases a model is used only for screening and reference.
The similarity between human haemoglobin and the leghemoglobin of legumes is one example of the relationship between sequence and structure. Both proteins transport oxygen. Although their amino acid sequences are very different, their structures are remarkably similar. This also reflects the relationship between structure and function.
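The 40% figure mentioned above refers to sequence identity measured on an alignment. A minimal sketch of that calculation follows, assuming the two sequences have already been aligned (equal length, with "-" marking gaps) and that columns containing a gap are left out of the denominator. Other conventions for the denominator exist, and the function name `percent_identity` is chosen only for this illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments: 6 of 8 columns are identical.
identity = percent_identity("ACDEFGHK", "ACDQFGHR")
print(f"{identity:.1f}% identical")
```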
Modeling molecular interactions
Modeling molecular interactions means building models that describe what happens when two or more molecules come into contact. Information about an interaction includes its position, the interacting groups and the mechanism by which the interaction forms. Molecular interactions involve thermodynamic changes and changes in molecular state (changes in charge, shifts of bonding groups, changes in conformation and spatial geometry). Typical molecular interactions are protein-protein or protein-peptide interactions, enzyme-substrate interactions and the binding of a ligand to its partner molecule. The term commonly used today is docking, and the corresponding algorithms are called docking algorithms.
Supporting techniques include circular dichroism (CD), X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). An important question is whether analyzing the three-dimensional structures of the molecules, through protein-protein docking, is sufficient to predict their interactions, or whether specific protein-protein interaction experiments are needed for each pair of proteins.
Prediction of protein structure
Protein structure prediction draws on information such as the amino acid sequence, mass spectrometry (MS) results, crystallization and X-ray diffraction analysis, and shared biological characteristics (similarity based on carrying out the same biological function, or enzymes that catalyze the same type of reaction or act on the same group of substrates).
The algorithms are all based on calculating chemical bonds, the potential to form bonds, interactions between molecules, thermodynamics, free energy and binding energy in order to build models of spatial structure. At present, however, analyzing relationships and comparing with known structures and functions is still regarded as the foundation for predicting protein structure. New proteins whose structures have not been determined are therefore usually predicted by sequence comparison combined with physical and chemical properties.
Analysis of gene expression
Databases of mRNA, cDNA and ESTs help detect whether, and at what level, genes are expressed. Protein microarray and mass spectrometry (MS) databases play a very important role in analyzing or detecting the presence of a protein in a biological sample. Comparing and cross-checking these databases shortens research time. However, the process often becomes complex when large numbers of samples are processed (high-throughput analysis), and the data are noisy because of experimental error.
From genome sequence analysis to therapy
One of the main causes of cancer is the accumulation of mutations. Analyzing many sequences can identify latent mutations in cancer-related genes. Bioinformatics builds automated analysis systems that manage and store this information and so support searching and comparison among genes and genomes to detect polymorphisms (for example the databases dbVar, dbSNP and Cancer Chromosomes). The results of these analyses make diagnosis and treatment easier. A typical example is the development of different drugs tailored to each individual.
New techniques now in use include comparing nucleotide sequences to detect single-nucleotide differences and find point mutations (single-nucleotide polymorphism arrays) at many positions and regions of the genome. Algorithms currently used include hidden Markov models and change-point analysis methods.
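The simplest form of the comparison described above can be sketched in a few lines: given a reference sequence and a sample sequence of the same length, list every position where the bases differ. This is only an illustration under the assumption that the two sequences are already aligned without gaps; it is not a substitute for SNP arrays or the statistical models named above, and the name `point_differences` is chosen for this example.

```python
def point_differences(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences differing at positions 4 and 7.
for position, ref_base, sample_base in point_differences("ATGCCGTA", "ATGTCGAA"):
    print(f"position {position}: {ref_base} -> {sample_base}")
```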
Computational evolutionary biology
Evolutionary studies include determining the evolutionary origin of species as well as how species change and new species arise over time. Information technology and bioinformatics support biologists in several ways:
- Detecting evolution by comparing DNA sequences and detecting changes in them, rather than relying mainly on morphological change.
- Comparing whole genomes, which allows the study of complex evolutionary events such as duplication, exchange of genetic material or acquisition of part of the genetic material of another species (for example horizontal gene transfer, including transformation, transfection, transduction, symbiosis, genome recombination and gene transfer).
- Building computational models to predict the course and outcome of populations over time.
- Tracking and sharing information on large numbers of species and individuals.
- Building an overall picture of the phylogenetic tree.
Image analysis
Modern computing, together with large-scale automated experiments, produces huge numbers of images of very large size. In addition, information-rich images such as images of samples and diseased tissue, and medical and clinical images, need to be analyzed carefully at several levels. Storing these images is valuable when they must be compared and screened for information that supports diagnosis and treatment. Some examples of bioinformatics applications in image processing and analysis:
- Quantitative analysis of features within images, such as organelles, size, shape and the distribution of molecules, or of tomographic scans of tissues and organs.
- Identifying real-time patterns of air flow in animal lungs and of the transport of substances across cell membranes and tissues (drug delivery).
- Estimating the size of particles and clots that arise during surgery (real-time imagery) and during recovery after arterial injury.
- Analyzing infrared images to determine metabolic activity.
- Analyzing fluorescence images, for example in next-generation sequencing, fluorescent labeling techniques and real-time analysis.
Protein function analysis
Databases of MS data, sequences, structures, protein-protein interactions and protein docking are the foundation for analyzing protein function. Sequence comparison and alignment are very effective aids for detecting motifs, domains and patterns, and thus for identifying and analyzing protein function. Protein families, or proteins that perform the same function, are also detected on the basis of such comparisons.
Protein interactions and metabolic pathways
Studying the interactions among proteins and enzymes in biological processes has great practical significance, for example in finding the substrate of an enzyme or in identifying antigens and antibodies. Building models of protein interactions helps determine the role of the participating factors as well as the mechanisms that regulate the expression of the genes in a network. Disruption of, or change in, these interactions leads to disease. Treatment based on an understanding of the relationships among many factors can be highly effective. This is also a direction on which biologists and bioinformaticians are currently concentrating.
Modeling biological systems
This is essentially the computer simulation of biological processes taking place in living systems (cells, tissues or whole organisms). It requires combining systems biology with mathematical biology. Examples include cellular systems, organelles, and the metabolites and enzymes that form metabolic pathways, signal transduction pathways and gene regulation. All these processes need to be analyzed and visualized within the complex of components inside the cell or its organelles. In addition, bioinformatics and computational biology can simulate artificial life related to the evolution of organisms.
Developing software and analysis tools
Algorithms and challenges in computer science
Software and computer programs are built on many algorithms. Accuracy and processing speed depend on the algorithm and on the hardware. New algorithms can optimize analyses, shorten analysis time, reduce the use of computing resources and raise the reliability of analyses and simulations.
Tools for searching for similar and homologous sequences:
Homology: DNA sequences or traits are homologous when they share an origin, that is, an evolutionary relationship to a common ancestor. The degree of similarity between two or more sequences can be used to judge whether an apparent homology is real or due to chance.
Tools in this group determine the similarity between a novel query sequence, whose structure and function are unknown, and an entire database of known sequences. The main tools are FASTA, BLAST and their variants (see later chapters).
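To give a feel for how a query can be compared quickly against many database sequences, the sketch below indexes the short words (k-mers) of each database sequence and counts the words it shares with the query. This is only the general word-matching idea that fast search heuristics start from, greatly simplified; it is not the algorithm of FASTA or BLAST, which go on to extend and score the matches. The function names, the word length k=3 and the toy database are assumptions made for this illustration.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every k-letter word in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def shared_words(query, subject, k=3):
    """Return (query position, subject position) for every exact shared word."""
    index = index_words(subject, k)
    hits = []
    for q_start in range(len(query) - k + 1):
        for s_start in index.get(query[q_start:q_start + k], []):
            hits.append((q_start, s_start))
    return hits


def rank_database(query, database, k=3):
    """Rank database entries by the number of words they share with the query."""
    scores = {name: len(shared_words(query, seq, k)) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database with hypothetical entries.
database = {"seq1": "TTACAGG", "seq2": "CCCCCCC"}
print(rank_database("GATTACA", database))
```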
Protein function analysis:
- Function analysis: Identifying the function of, and mapping, the functional components of the genome, including the coding and non-coding parts of genes. This relies on programs and computational tools that compare a query protein sequence with secondary protein databases containing information on motifs and domains. The search returns a list of similar proteins, from which the function of the unknown protein can be predicted.
- Structure analysis: Allows unknown structures to be compared with databases of known structures. The function of a protein can be determined more accurately by comparing its structure than by comparing its amino acid sequence alone, because similar structures usually go together with similar function. Determining the 2D/3D structure of a protein is therefore extremely important for studying its function. This work goes hand in hand with purifying and crystallizing the protein and with crystallographic analysis.
- Sequence analysis: Tools in this group allow deeper analyses of unknown sequences, including evolutionary analysis, mutation detection, hydrophilic regions, CpG islands and compositional biases in base usage. The results support studies that aim to elucidate the function of unknown sequences.
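Two of the compositional measures mentioned above are easy to compute directly. The sketch below calculates the GC content of a DNA sequence and the observed/expected CpG ratio, (number of CG dinucleotides × length) / (number of C × number of G), a quantity commonly used when looking for CpG islands. The function names are chosen for this illustration, and in practice both values are computed in sliding windows along a genome rather than on a single short string.

```python
def gc_content(sequence):
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def cpg_observed_expected(sequence):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    sequence = sequence.upper()
    c_count = sequence.count("C")
    g_count = sequence.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return sequence.count("CG") * len(sequence) / (c_count * g_count)


# Hypothetical fragment: 2 C, 2 G and 2 CG dinucleotides in 8 bases.
fragment = "CGCGATAT"
print(gc_content(fragment), cpg_observed_expected(fragment))
```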
1.4. Tasks and research directions of bioinformatics
In the early stage of the genomics revolution, bioinformatics concentrated on collecting and storing biological information and data to form databases (mainly of amino acid and nucleotide sequences). This involved designing networks of linked databases and developing web interfaces through which researchers could both access the databases and submit new sequences and data, or revised and supplemented data. Scientists' need to search and analyze data (data mining) led to the development of search tools combined with data comparison. Using FASTA and BLAST, aligning sequences (sequence alignment), assembling sequences (genome assembly), finding genes in genomes (gene finding), analyzing the domains of protein molecules and determining their structure have become everyday routine for researchers. Higher-level and more complex applications, such as determining the position and role of genes on chromosomes (positional cloning), comparing the three-dimensional structures of proteins, predicting protein structure and protein-protein interactions, pattern recognition and gene expression profile prediction, are becoming common in strong research laboratories.
From the results of studies on the roles of genes and on gene interactions, scientists can compare the activities of normal and diseased cells. Doing so requires combining and cross-referencing biological databases to form an overall picture and to describe the relationships among cellular activities through the study of metabolic pathways (metabolomics). This is also one of the great challenges facing bioinformaticians.
Figure 2. Relationship among transcriptomics, proteomics and metabolic pathways (metabolomics) (Goodacre (2005) J Exp Bot 56: 245)
A further direction is to build models of metabolism and of the interactions among them; on this basis, gene expression patterns and the interactions among genes and groups of genes can be clarified. These results will contribute to controlling gene activity and to developing effective therapies.
Figure 3. Network of genes related to human diseases
(The human disease network. PNAS, vol. 104, no. 21, 8685-8690)
Research also develops new algorithms, software and analysis tools (software and tools), for example to help determine the presence and position of genes in a DNA sequence or on a chromosome, to predict protein structure and function, or to analyze and group protein sequences into families of related sequences.
The main bioinformatics tools
BLAST
BLAST stands for Basic Local Alignment Search Tool. It is a group of tools for comparing DNA and protein sequences with other sequences held in databases. Several BLAST variants exist, such as PSI-BLAST, PHI-BLAST and DELTA-BLAST. There are also specialized BLAST tools for the human genome, microbial genomes, the malaria parasite and other genomes. Further tools help detect sequences contaminated with vector sequence (especially important when submitting to gene banks), immunoglobulin sequences and conserved sequences.
FASTA
FASTA is a database search tool used to compare a nucleotide or amino acid sequence with a sequence database. The program is based on the fast sequence search algorithm of Lipman and Pearson, which was also the first widely used algorithm for finding similar sequences in databases.
EMBOSS
EMBOSS stands for European Molecular Biology Open Software Suite, a free, open-source package of analysis software for molecular biology. It contains more than 100 application programs for comparing sequences, retrieving sequences from databases, searching for patterns, finding domains and motifs in proteins by comparing amino acid sequences, comparing nucleotide sequences to detect patterns, and analyzing codon usage (codon bias analysis).
A list of the applications can be found at:
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/
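As a rough illustration of what a codon usage analysis starts from, the sketch below splits a coding sequence into codons and reports how often each one occurs. It assumes the sequence is already in the correct reading frame and ignores any trailing bases that do not form a full codon. It only counts codons; a real codon bias analysis would go on to compare synonymous codons for each amino acid. The function names and the example sequence are made up for this illustration and do not reproduce any EMBOSS program.

```python
from collections import Counter


def codon_counts(coding_sequence):
    """Count codons, reading the sequence in frame from its first base."""
    sequence = coding_sequence.upper()
    usable_length = len(sequence) - len(sequence) % 3  # drop incomplete codon
    return Counter(sequence[i:i + 3] for i in range(0, usable_length, 3))


def codon_frequencies(coding_sequence):
    """Relative frequency of each codon among all codons in the sequence."""
    counts = codon_counts(coding_sequence)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}


# Hypothetical coding sequence: ATG GCT GCT GCA TAA
for codon, frequency in sorted(codon_frequencies("ATGGCTGCTGCATAA").items()):
    print(codon, round(frequency, 2))
```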
ClustalW
ClustalW is a program for comparing DNA and protein sequences. Its purpose is to find the regions where sequences are similar and where they differ. On that basis it supports many other applications, such as the analysis of domains, motifs and patterns and the reconstruction of evolutionary relationships.
RasMol
RasMol is a very effective research tool for displaying the structure of DNA, proteins and small molecules. Protein Explorer is an easy-to-use derivative of RasMol.
Programming tools for bioinformatics
- Java: Because Java is platform-independent, it is an important component of bioinformatics (BioJava).
- Perl: Used to process biological data (BioPerl).
- BioXML: Part of the BioPerl project; a collection of XML documents and DTDs.
Building literature and journal databases for research
- Articles and journals (PubMed);
- Classification systems and taxonomic keys (Taxonomy);
- Books (Books);
- Articles, journals and documents related to biochemical assays (PubChem BioAssay);
- Documents on chemical compounds (PubChem Compound);
- Documents on chemical substances (PubChem Substance);
- Databases: genomics, proteomics, metabolomics, microarray gene expression and phylogenetics.
The information contained in biological databases includes gene names, gene sequences, the position of genes on the chromosome or genome (locus tag), the structure and function of genes, the consequences of gene mutations, related genes (gene families) and the structure of their products (protein, RNA and so on).
The data include gene sequences, descriptions of gene features (genes coding for mRNA, tRNA, rRNA), taxonomic terms (origin of the gene, organism containing the gene), citations (articles related to the gene or protein) and data tables (if any).
Database formats
Biological data come in several formats: text, sequence data, protein structures and links.
- Text: PubMed and OMIM.
- Sequence: GenBank (DNA) and UniProt (protein).
- Structure: PDB, SCOP and CATH.
Issues related to protein databases
Protein structure databases usually grow with more difficulty and more slowly than DNA sequence databases because the three-dimensional structure of a protein is very hard to determine. To determine it, the protein must be isolated or purified in large amounts, suitable conditions for crystallization must be found, and structure determination techniques must then be applied, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, circular dichroism (CD) and electron microscopy. Structural data are deposited in, and can be accessed through, the member databases of wwPDB (PDBe, PDBj and RCSB PDB) as well as SCOP and CATH.
Species-specific databases
Several species-specific databases have been published, mainly for research use, for example Colibase (for E. coli), FlyBase for Drosophila and WormBase for nematodes (Caenorhabditis elegans and Caenorhabditis briggsae). There are also databases for rice (Oryza sativa), Arabidopsis and other organisms.
1.5. Development trends of bioinformatics
Bioinformatics is developing mainly in the following directions:
- Algorithms and computational challenges
- Protein function analysis
- Protein interactions and metabolic pathways
- Clinical applications, drug discovery, and prediction of risks and hazards.
Current trends in bioinformatics
- Algorithms: 27%
- Machine learning: 21%
- Statistics: 18%
- Biology: 10%
- Databases: 10%
- Other directions: 14%
Current research topics:
- Methods: 26%
- Sequence analysis (motifs, domains) and sequence comparison: 25%
- Protein structure modeling: 19%
- Modeling gene structure and the regulation of gene activity: 12%
- Evolution-related sequence analysis: 12%
- Modeling and building metabolic networks (metabolome): 6%
Skills and human factors for developing bioinformatics:
- Broad and deep understanding of both fields: biology and informatics
- A grasp of the issues of interest in both fields
- Command of computer science and software: advising on and developing algorithms
To some extent bioinformatics can be described as an interesting, attractive, new, challenging and accessible field, one in which research can be extended, that has wide influence, and that offers opportunities for computing professionals.
Topics still to be explored:
- Database techniques for bioinformatics data
- Molecular genetics (foundation mainly in biology)
- Sequence comparison, patterns, profiles
- Pattern discovery
- Gene expression arrays
- Building protein structures (foundation mainly in biology)
- Building the spatial geometry (stereochemistry) of proteins (computational techniques and tools)
- Protein structure prediction
- Building biochemical networks, metabolome (foundation mainly in biology)
- Building metabolic pathways, regulatory pathways and gene regulation signals: databases, computational techniques and tools
Chapter 1 summary
Bioinformatics is a new scientific field that closely combines biology, mainly genetics and molecular biology, with the tools of statistics, mathematics and computer science. Chapter 1 introduced the concept and role of bioinformatics and the tools that serve research problems in modern molecular biology, such as searching databases for homologous or similar biological sequences, modeling and predicting interactions among molecules, and detecting gene expression patterns and relationships among genes. The main content of bioinformatics and the development trends of the field were also covered, giving students an overview of an applied science that supports researchers in molecular genetics, molecular biology, medicine and related fields.
Chapter 1 review questions
1. Explain the concept of bioinformatics.
2. Summarize the role of bioinformatics in biological research.
3. What is a biological sequence? Give some examples of biological sequence analysis.
4. What is sequence comparison? What is the purpose of comparing sequences?
5. Why study the structure of macromolecules? How does bioinformatics help in predicting molecular structure?
6. What role does knowledge of the functions of genes and of the relationships among genes play in modern medicine?
7. What is an evolutionary relationship among organisms? How does bioinformatics support evolutionary studies?
8. State the tasks and research directions of bioinformatics today.
9. Name the topics on which bioinformaticians are currently focusing.
10. What does it take to become a researcher in bioinformatics?
CHAPTER 2
THE BIOLOGICAL FOUNDATIONS OF BIOINFORMATICS
2.1. Nucleic acids and proteins
Nucleic acids and proteins are two classes of biological macromolecules that play essential roles in the living world. Deoxyribonucleic acid (DNA) carries genetic information, while ribonucleic acid (RNA) is involved in protein biosynthesis and in regulating the activity of the cell. The building blocks of nucleic acids are nucleotides; those of proteins are amino acids.
2.2. Structure of nucleic acids
DNA and RNA are built from monomers: deoxyribonucleotides and ribonucleotides, respectively. In DNA, each nucleotide consists of a phosphate group, a pentose sugar and a base. Nucleotides are joined by phosphodiester bonds between the 5'-phosphate group on the pentose of one nucleotide and the 3'-OH group on the pentose of the next. A nucleic acid strand therefore always has a 5'-phosphate end and a 3'-OH end. By convention, a nucleic acid sequence is always written in the 5' to 3' direction, from left to right.
Figure 4. DNA structure
Nucleic acids are built from five different bases: cytosine (C), uracil (U), thymine (T), adenine (A) and guanine (G). However, U normally occurs only in RNA and T only in DNA. DNA and RNA differ not only in base composition but also in their sugar: RNA contains ribose, whereas DNA contains 2'-deoxyribose. A DNA molecule consists of two polynucleotide chains wound around each other in antiparallel orientation. DNA can exist in single-stranded (ssDNA) and double-stranded (dsDNA) forms. In double-stranded DNA the two strands are held together by hydrogen bonds between the bases: two hydrogen bonds between A and T, and three between C and G. The two strands are complementary, so if the sequence of one strand is known, the sequence of the other can be deduced.
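The complementarity rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function name `reverse_complement` is chosen here for the example. It pairs A with T and C with G, then reverses the result so that the deduced strand is also written 5' to 3'.

```python
# Watson-Crick pairing for DNA: A<->T, C<->G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written in the 5' to 3' direction."""
    # Complement every base, then reverse because the strands are antiparallel.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCCG"))
```

Applying the function twice returns the original sequence, which is a quick sanity check on any implementation of this rule.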
Storage of genetic information
The sequence of bases carries the information that encodes proteins. Proteins are built from 20 amino acids, and each amino acid is encoded by a triplet of three nucleotides on the DNA molecule. Each such triplet is called a codon. Organisms tend to use codons differently; some prokaryotes, for example, use a genetic code that differs from that of eukaryotes. The genetic code of the mitochondrial genome also differs in some respects from that of the nuclear genome.
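As a rough sketch of how triplets map to amino acids, the snippet below builds the standard genetic code and translates a coding DNA sequence codon by codon. It assumes the standard (nuclear) code only; the variant codes mentioned above, such as the mitochondrial one, would need a different table. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in one-letter amino acid symbols ("*" = stop),
# listed with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):  # step through complete codons
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))
```

The table is keyed on DNA codons (T rather than U) so that it can be applied directly to a coding strand read 5' to 3'.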
Figure 4. The genetic code
The relationship between DNA, RNA and protein is described by the central dogma (Crick 1970).
Figure 5. The central dogma
The entire genetic information contained in the nucleus or nucleoid of an organism is called its genome. Except in RNA viruses such as retroviruses, whose genomes are RNA, genetic information is held in the nucleotide sequence of DNA. Apart from reverse transcription from RNA to DNA in some RNA viruses, information flows in one direction, from genome to transcriptome to proteome, through transcription and translation. The full set of RNA transcripts of an organism (mRNA, tRNA, rRNA and other non-coding RNAs) is called the transcriptome. The full set of proteins that can be translated from the mRNAs is called the proteome. The amino acid sequence of a protein is thus determined by the DNA sequence, and information flows from DNA to protein via mRNA.
The genomes of eukaryotes and prokaryotes differ in many ways. In prokaryotes, the genetic information of a gene is encoded on a continuous stretch of DNA, whereas in eukaryotes the coding sequences (exons) are separated by non-coding sequences called introns. In addition, producing a mature mRNA from DNA is far more complex in eukaryotes; for example, introns are removed during mRNA splicing. Because of this process, a single gene can give rise to several mRNAs and hence to several corresponding proteins. This explains why the genome of a higher organism contains a limited number of genes (humans, for example, have about 25,000), while the number of proteins actually produced is much larger, by some estimates about 1 million.
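The idea that one gene can yield several mRNAs can be illustrated with a toy example. The transcript and exon coordinates below are hypothetical and chosen only for the illustration (exons in upper case, introns in lower case); the function `splice` simply joins whichever exons are selected.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments given as 0-based, half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical pre-mRNA: three exons separated by two introns.
PRE_MRNA = "AUGGCC" + "guaaguuag" + "UUCGAA" + "gucag" + "GGAUAA"
EXONS = [(0, 6), (15, 21), (26, 32)]

if __name__ == "__main__":
    # Constitutive splicing keeps every exon.
    print(splice(PRE_MRNA, EXONS))
    # Skipping the middle exon gives a second, shorter mRNA from the same gene.
    print(splice(PRE_MRNA, [EXONS[0], EXONS[2]]))
```

Real splicing is directed by sequence signals at the exon-intron boundaries rather than by fixed coordinates, so this only shows the bookkeeping, not the biology of splice-site recognition.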
Figure 6. Structure of a gene region in prokaryotes and eukaryotes
Protein structure
Primary structure
Proteins are biological macromolecules built from about 20 kinds of amino acids. Under appropriate conditions a protein folds into a three-dimensional structure that carries its full set of biological properties and functions. The amino acid residues of the polypeptide chain determine the chemical properties of the protein, such as hydrophobicity, polarity, acidity and basicity. The primary structure of a protein is the order of the amino acids in the polypeptide chain. The primary structure determines the spatial structures of the protein.
In a protein, amino acids are linked together into a polypeptide chain. They are joined by peptide (amide) bonds between the carboxyl group of one amino acid and the amino group of the next. A polypeptide chain therefore has two ends, the N terminus and the C terminus. By convention, the N terminus is written on the left and the C terminus on the right.
Figure 7. Amino acids in a protein molecule
Secondary structure
The term secondary structure refers to local spatial regions of the polypeptide chain. Secondary structure involves α-helices, β-strands (which assemble into β-sheets) and loops. These structures arise from the geometric properties of the amino acid residues. In the 1930s and 1940s, Linus Pauling and Robert Corey described the peptide bond as planar and rigid (it does not rotate). A polypeptide chain can therefore be viewed as a series of planar units linked together. α-helices, β-sheets and loops together make up the secondary structure. α-helices and β-sheets are stabilized by hydrogen bonds. β-sheets occur in two forms, parallel and antiparallel (Figure 8).
Figure 8. Secondary structure of a protein molecule: α-helices and β-sheets. Disulfide bridges stabilize the tertiary structure and the regions involved in catalytic activity (yellow).
Tertiary and quaternary structure
Tertiary structure arises from the further arrangement and folding of the secondary-structure elements. Polypeptides longer than about 200 amino acids usually fold into several structural units called domains. Quaternary structure is the next level above tertiary structure. Proteins with quaternary structure are usually formed from several polypeptide chains (subunits).
In quaternary structure, the interactions between amino acids include hydrogen bonds between peptide chains, disulfide bridges between cysteine residues, ionic bonds between charged groups of the residues (side chains) and hydrophobic interactions.
2.3. Genomes and genome research
Genome
The genome contains all the genetic information of an organism. This information is encoded in DNA or RNA. Taking the human genome as an example, if the genome is a book, the book is divided into 23 chapters (corresponding to the 23 pairs of chromosomes). Each chapter contains 48 to 250 million letters (A, C, G, T). The whole book has more than 3.2 billion letters and sits in the nucleus of the cell.
The first genome sequencing project was completed in 1977 by Fred Sanger. He and his colleagues sequenced phage φX174, which contains 5,386 bases. The first bacterial genome to be sequenced was that of Haemophilus influenzae, in 1995. After that, the first eukaryotic genome to be sequenced was that of the yeast Saccharomyces cerevisiae. Today, thanks to rapid advances in technology (Illumina/Solexa, 454 pyrosequencing, Ion Torrent, SOLiD sequencing...), the number of species with sequenced genomes is growing rapidly.
Genome research (genomic research)
Genome research is not just a matter of cataloguing sequenced genomes or listing the number of genes in a genome and the corresponding traits. It also includes comparing genome size, chromosome number (karyotype), gene order, codon usage frequency, GC content and genome evolution. It further includes comparing multiple genomes to detect conserved regions and the changes that have taken place within genomes. The results of genome research are usually presented graphically through genome browsers.
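Two of the quantities listed above, GC content and codon usage frequency, are simple to compute from a sequence. The sketch below is illustrative only; the function names `gc_content` and `codon_usage` are assumptions made for this example, and the codon count assumes the input is a coding sequence already in the correct reading frame.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds: str) -> Counter:
    """Count each complete codon in an in-frame coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


if __name__ == "__main__":
    cds = "ATGGCCGCCTAA"
    print(gc_content(cds))
    print(codon_usage(cds).most_common())
```

Comparing these values between genomes, or between genes within one genome, is one of the simplest forms of the comparative analysis described in this section.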
Genomics is a discipline closely tied to genetics. It concerns the study of the genomes of organisms, including determining the DNA sequence of the whole genome and building high-resolution genetic maps (maps with closely spaced markers). Genomics also studies phenomena that occur within the genome, such as heterosis (hybrid vigour), epistasis (the masking of one gene's effect by another gene), pleiotropy (the effect of one gene on several traits) and interactions between loci and alleles within the genome. Unlike studies of the role and function of individual genes, genomics studies the overall relationships among the components of the genome.
Duplication plays a major role in the formation of new species. It can range from small-scale events (short tandem repeats) to the duplication of whole genes or gene clusters, of whole chromosomes, or even of the entire genome. These events are the basis for new genetic traits and hence for evolution. Horizontal gene transfer is important in explaining similarities between small portions of the genomes of two organisms that do not share a close evolutionary origin. Such gene exchange is fairly common among microorganisms; antibiotic resistance in microorganisms is a typical example. The transfer of genetic material from the mitochondrial and chloroplast genomes into the nuclear chromosomes of eukaryotic cells is another example.
The human genome
In 2001 the first draft of the human genome was published. By 2007 the human genome sequencing project was declared finished, with a very low error rate (about 1 in 20,000 bases). The human genome assemblies can be accessed through the UCSC Genome Browser and Ensembl.
Genome research on viruses (bacteriophages)
Bacteriophages have played an important role in bacterial genetics and molecular biology. Historically, they were used to define gene structure and to study the mechanisms and models of gene regulation. Because their genomes are small and contain no introns, bacteriophages were chosen as the first genomes to be sequenced. However, bacteriophage research did not open the genomics revolution, which began with the sequencing of bacterial genomes. Bacteriophage genome sequences are usually obtained by direct sequencing. Analysis of bacterial genomes shows that a substantial part of bacterial DNA consists of prophage and prophage-like sequences. Mining the information in bacteriophage databases therefore helps explain the role of prophages in shaping bacterial genomes.
Cyanobacteria genomics
At present, 24 cyanobacteria have been sequenced, 15 of which were isolated from the sea: six Prochlorococcus strains, seven marine Synechococcus strains, Trichodesmium erythraeum IMS101 and Crocosphaera watsonii WH8501. Several studies show that these sequences can be very useful for inferring the physiological and ecological characteristics of marine cyanobacteria. Many more genome sequencing projects are under way, including further Prochlorococcus and marine Synechococcus isolates, Acaryochloris and Prochloron, the nitrogen-fixing filamentous cyanobacteria Nodularia spumigena, Lyngbya aestuarii and Lyngbya majuscula, as well as bacteriophages that infect marine cyanobacteria. Genome research thus plays an important role in explaining the evolutionary origin of organisms and of biological processes such as photosynthesis.
The relationship between C-value and gene number:
The C-value is the DNA content of an organism's haploid genome. It varies enormously between species. There is no clear relationship between the C-value and the number of genes in an organism. The more complex the genome, the larger the proportion of non-coding DNA, which carries no information encoding RNA. In humans, non-coding DNA accounts for nearly 75% of the genome. The C-value paradox refers to this lack of proportionality between genome size and gene number.
2.4. Finding genes and determining gene function in the genome
Figure 10. Organization of the human genome
Once a genome sequencing project is finished, the result is a set of sequences ordered along the chromosomes. The next problem is to decode the information contained in those sequences. Decoding essentially means answering questions such as: (i) how many genes does the genome contain, (ii) where are the genes located on the chromosomes, (iii) what are the functions of the genes, and (iv) how is gene activity regulated, and how do genes interact to produce a phenotype or a disease? Answering these questions takes a great deal of time and effort, and in some cases no answer has yet been found. There are many approaches to decoding a genome, and bioinformatics tools play a very large role in them. To determine the number of genes, for example, one relies on the characteristic features of genes: coding sequences or open reading frames, promoter sequences, exon-intron junction sequences, and sequences that control gene activity (the 5' UTR and 3' UTR regions). Genome comparison and DNA sequence comparison are the first important steps in detecting genes and predicting their function.
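One of the gene features named above, the open reading frame, can be searched for with a very simple scan. The sketch below is a deliberately naive illustration and not the method of any particular gene-finding program: it looks only at the forward strand, treats every ATG as a possible start, and ignores introns, promoters and all other signals. The name `find_orfs` and the `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, str]]:
    """Return (start_index, orf) pairs for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon,
    inclusive; min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                orf = dna[start:pos + 3]
                if len(orf) // 3 >= min_codons:
                    orfs.append((start, orf))
                break
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

In practice an ORF scan of this kind is only a first filter; the other features listed in the text (promoters, splice junctions, UTRs) and comparison with known sequences are needed to decide whether an ORF is a real gene.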
Physical mapping, based on gene order and on what is already known about the genes, is also a first step in genome research. This information is displayed graphically in genome browsers. Determining gene function is regarded as one of the major challenges for genome researchers. Although more and more information on the sequence, structure and biological function of genes and other biological sequences is being published, predicting gene function is often very complicated. There are several approaches to this problem: one can start from the genome, from the gene product (protein) or from the phenotype. Suppose we want to know which gene encodes plant height, resistance to pests and diseases, flower colour or the protein content of milk. If the trait is monogenic, the task is relatively simple. If the trait is controlled by many genes (a quantitative trait), however, the task becomes extremely complex. The problem is how to pinpoint which gene or genes, located where in the genome (on which chromosome), directly encode or contribute to the trait. Beyond that, what are the activity patterns, mechanisms and expression conditions of those genes?
In practice, whatever method or approach is used, one must ultimately confirm whether the gene really does contribute to the trait. This verification is extremely difficult, especially for quantitative traits in higher organisms, because knock-out, knock-down and RNAi-based gene silencing techniques cannot always be applied, or applied successfully. Another approach to determining gene function is the microarray technique, which detects the appearance of mRNAs, or changes in their expression level, under particular conditions, and so helps to identify genes and study their function. Comparative studies of genomes, sequences and structures (data mining and analysis) are also a growing trend, and are the first step to take as databases of biological sequences continue to grow. However, the accuracy and reliability of the results depend heavily on the algorithms used and on how rich the information in the databases is.
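Gene numbers in different organisms
The human genome was initially predicted to contain about 50,000 to 100,000 genes. More recently the number has been put at just over 20,000. The mouse has a similar number. The fruit fly has about 14,000 genes, the roundworm Caenorhabditis elegans about 20,000, and rice, by some estimates, about 46,000. In humans, protein-coding sequences make up only about 1 to 2% of the genome.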
Gene structure
Figure 11. Diagram of the structure of a prokaryotic gene
For prokaryotes, by convention the 5' end of a gene is drawn on the left and the 3' end on the right. The structure of a typical gene is illustrated below.
Figure 12. Diagram of the promoter region in prokaryotes
Figure 13. Gene structure in eukaryotes (top) and the promoter region (bottom)
2.5. Gene function and the regulation of gene activity
Gene function is a complex process involving many components of the cell. In prokaryotes, gene function and its regulation are relatively simple. In eukaryotes, however, gene regulation is extremely complex and involves many processes: chromatin structure and the associated epigenetic mechanisms (methylation, acetylation, phosphorylation), transcription initiation, transcription, post-transcriptional modification, translation, post-translational modification and targeted transport. If studying the activity of a single gene is complex, the regulation of a metabolic pathway is more complex still, because many genes take part and many proteins and enzymes in the cell interact. For this reason, studying gene function requires comparison and cross-checking against many databases and many different genomes.
Figure 14. Gene regulation processes in eukaryotes
2.6. The proteome and protein research (proteomics)
The proteome is the full set of proteins expressed by a genome, cell, tissue or organism at a given time or under given conditions. In terms of diversity, the proteome is far larger than the genome, especially in eukaryotes. In other words, the number of proteins is much greater than the number of genes in the genome. The reasons are the splicing and editing of pre-mRNA and post-translational modifications such as phosphorylation and glycosylation. Whereas genome data consist mainly of DNA and RNA sequences, proteome data are more complex because, beyond amino acid sequences, they include data on structure, function and interactions between proteins.
Proteome research involves many sophisticated techniques, such as protein extraction and purification, two-dimensional electrophoresis, mass spectrometry, comparison of the similarity of peptide fragments and comparison of amino acid sequences. The key parts of proteomics are structural studies and functional studies. Information on amino acid sequence, structure and function helps researchers explain the nature of biological processes and the mechanisms of disorders and disease, and identify and predict the function of new proteins.
2.7. Evolution and the molecular basis of evolution in organisms
Mutation and the accumulation of mutations
Although the mechanisms and causes of evolution are still debated, in the modern view mutation is regarded as the raw material of evolution, because it is the route by which new alleles arise and regulatory regions are altered or created. Mutations can have serious consequences, but some are neutral or have no effect on the phenotype (for example, mutations in non-coding DNA).
Most mutations in structural genes affect the protein product, or lead to diversity in protein products through the splicing of exons in the mRNA. Changes in the structure and function of molecules show up as variant forms of individuals in a population. Through successive evolutionary events, these can eventually lead to divergence and the formation of new species. The question is why small changes in genes caused by mutation, especially point mutations, lead to the separation of one species from another. To answer it, both space and time must be considered. Space here means the random selection acting on mutant individuals. Time is the outcome of a long process of natural selection. Space and time are closely linked: if selection pressure is very strong, a new species can form, or a lineage go extinct, within a short time.
Gene and genome duplication
If a gene is duplicated, that is, present in several copies, a mutation in one copy may have no effect on the life of the cell. Gene duplication in a diploid individual creates an extra pair of genes, so one pair continues to function normally while the other may change or exist in different combinations. What, then, is the benefit of gene duplication? Over time, one copy may acquire a new function, providing a basis for adaptation during evolution. Even when the two copies persist as paralogs, with similar sequence and function, the existence of the copies is a form of redundancy (gene redundancy). This explains why, in some cases, knocking out a gene in mouse or yeast has no effect, or only a mild effect, on the phenotype: the function of the knocked-out gene can be compensated by a corresponding paralog.
After a gene is duplicated, one copy may be altered or lost over the course of evolution. Changes occurring in many genes and at many positions in the genome lead to barriers to mating and reproduction between populations (post-zygotic isolating mechanisms). These barriers can gradually lead to speciation.
Mutations in regulatory regions
Although all cells can be said to carry the same set of genes, not all genes are expressed equally in every cell. The differences depend on the cell type, the interaction of extracellular signals, transcription factors and so on.
There is considerable evidence that mutations in regulatory regions play an important role in evolution. For example, humans have a gene (LCT) that encodes lactase, the enzyme that breaks down lactose. In most people worldwide this gene is active in infancy but switched off in adulthood. In northern Europeans and three African tribes, however, the gene remains active, and milk is still part of their diet. The cause is a mutation in the regulatory region of the lactase gene that allows it to keep being expressed. Another example is the Prx1 gene, which encodes a transcription factor that governs forelimb formation in mammals. When the enhancer of the Prx1 gene in mice is replaced by the corresponding enhancer from bats (whose forelimbs are wings), the forelimbs grow 6% longer than normal. Thus some changes in morphology are driven not by changes in the Prx1 protein but by changes in the expression level of the gene.
2.8. Analysing evolutionary relationships between organisms
Evolution is the gradual change in the gene pool of a population over time. Although evolution by its nature takes place at the population level, evolutionary relationships can be identified and analysed at many levels: population, species, groups of individuals, cells, organelles and molecules. In applied bioinformatics, the analysis of evolutionary relationships relies mainly on the molecular level, that is, on molecular evolution. For example, DNA sequences encoding ribosomal RNA, cytochrome c, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), mitochondrial genes and so on have recently been used to classify organisms and assign them to taxonomic units (taxa). Of course, molecular analysis alone is not enough; it must be combined with the results of other studies.
Analogous
Put simply, analogous features are similar characteristics observed in two or more species that are not due to shared ancestry. Such similarities are usually the result of convergent evolution, in which traits change during evolution purely as adaptations to particular conditions. For example, the wings of birds and bats have similar structures suited to flight, but are fundamentally different in origin.
Homologous
Homologous traits share a common evolutionary origin. A homologous trait may be:
- Homoplasious: evolved independently, but from the same common ancestor
- Plesiomorphic: present in a common ancestor, but lost in some descendants during evolution
- (Syn)apomorphic: present in a common ancestor and in all of its descendants
Ortholog
Homologous sequences are called orthologous when they were separated by a speciation event, yet still share a most recent common ancestor. When a species diverges into two separate species, the copies that descend from a single gene are called orthologous. Orthologous genes are genes in different species that are similar because they are direct descendants of a single gene. For example, the regulatory protein Flu is present in both Arabidopsis (a multicellular higher plant) and Chlamydomonas (a unicellular green alga). The Chlamydomonas protein is more complex: it crosses the membrane twice rather than once as in Arabidopsis. When the gene is transferred from the alga into the plant genome by genetic engineering, it functions much as it does in its original cell. This result indicates that the two genes are orthologous and descend from a common ancestor.
To determine whether two similar genes are orthologous, one analyses the evolutionary origin of the genes. If the genes fall on the same branch, they are orthologs, descendants of a common ancestor. Orthologous genes usually have the same biological function.
Paralogous
Homologous sequences are called paralogous when they were separated by a gene duplication event. If a gene in an organism is duplicated and occupies two different positions in the same genome, the two copies are called paralogous (para means parallel). Paralogs usually have the same or similar functions, but not always. The reason is relaxed selection pressure: selection acts on only one copy of the duplicated gene, leaving the other free to mutate, change and acquire a new function.
Paralogous sequences provide much useful information about what happens within genomes. The genes encoding myoglobin and haemoglobin are regarded as ancient paralogs. The four classes of haemoglobin known so far (A, A2, B, F) are paralogs of one another. While each of these proteins performs the same function of oxygen transport, a slight variant such as haemoglobin F has a much higher affinity for oxygen than adult haemoglobin. The function of paralogous genes is therefore not necessarily conserved. Paralogous genes usually belong to the same species, but not always. For example, the human haemoglobin gene and the baboon myoglobin gene are also paralogs. This is a common problem in bioinformatics. When the genomes of different species are sequenced and compared, it is easy to conclude that genes are homologous, yet they may be paralogs whose functions have diverged.
Ohnology
Genes are called ohnologous when they originate from a whole-genome duplication. The term was coined by Ken Wolfe in honour of Susumu Ohno. Ohnologs are of particular interest in evolutionary analysis because they have all been diverging for the same length of time since their common origin in the whole-genome duplication.
Xenology
Homologs that arise through horizontal gene transfer between two organisms are called xenologs. Most xenologs have similar functions.
Gametology
Gametology describes the relationship between homologous genes on non-homologous chromosomes (for example the X and Y chromosomes in humans). Gametologs result from the origin of genetic sex determination and of barriers to recombination between the sex chromosomes.
Chapter 2 summary
1. Bioinformatics is built on the foundations of biology, especially molecular biology. Molecular biology studies the structure and function of molecules and the activities of cells, tissues, organs and organisms at the molecular level. In bioinformatics, molecular studies focus on determining nucleic acid (DNA, RNA) and amino acid (protein) sequences, and on studying the structure, function and interactions of these molecules.
2. Genetic information is stored in DNA and RNA and is expressed through transcription, translation and modification (post-transcriptional and post-translational). This is the content of the central dogma of molecular biology.
3. With the rapid development of techniques, sequencing genes and genomes has become routine in laboratories. Once a genome has been sequenced, describing the DNA sequences and attaching biological information to them is a task for both biologists and bioinformaticians. Biological findings on the composition and structure of prokaryotic and eukaryotic genes provide the basis for building algorithms and computer models.
4. Studies of the relationship between sequence and structure in nucleic acids and proteins, and between structure and biological function, provide the basis for modelling, predicting and comparing structures, and for predicting function from sequence comparison.
5. Mutations and changes in the sequence and structure of genes and genomes during evolution provide the basis for studying relationships between species, the origin of species, and the function of genes and genomes across organisms. By analysing and comparing biological sequences, one can determine genetic relationships, evolutionary origins and evolutionary trends at the level of individual genes, gene families, protein families and species.
Review questions for chapter 2
1. Describe the composition and structure of nucleic acids.
2. What is the genetic code, and what are its characteristics?
3. Present the content of the central dogma.
4. Describe the relationship between the structure and function of proteins.
5. What is a genome? What is the significance of genome research?
6. Describe the gene structure of prokaryotes and eukaryotes.
7. What is gene regulation?
8. Why study the evolutionary relationships between organisms?
CHAPTER 3
SEARCHING FOR AND MANAGING RESEARCH LITERATURE
3.1. Methods of searching for information
The rapid growth of the Internet and of the number of web pages has created a huge body of information that grows every day. Finding the information you need in this vast store requires search tools combined with an appropriate method. Chapter 3 introduces some general tools and methods for finding information on the Internet for study and research.
When you need to find web pages containing specific words or phrases, search engines such as Google return results quickly and effectively. However, the results sometimes include a great deal of information not directly related to the topic or scope of the search, so much time is lost in sifting. For a search focused on a specific field or topic, subject directories such as the World Wide Web Virtual Library (http://vlib.org/) can be used to narrow the field. In practice, however, search engines cover only about one third of the information that actually exists, because they cannot access the remaining sources. This is mainly due to network security measures and access barriers, which search engines are not permitted to cross.
There are two kinds of information searching: searching with general-purpose search engines (such as Google), and searching for specific data according to a research purpose or field. Whichever tool is used, a search involves the following steps: (i) choose the search tool or the websites that support searching; (ii) define the information to be found; (iii) build keywords that represent the search content (use phrases rather than single words; in English, avoid articles and prefer nouns); (iv) combine the keywords with logical (Boolean) operators such as and, or, not, or +, -, double quotation marks "" and the asterisk *, to filter and narrow the results.
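Steps (iii) and (iv) can be sketched as a small helper that assembles a Boolean query string from keyword lists. The helper names `phrase` and `build_query` are hypothetical, and the exact operator syntax accepted differs between search engines, so the output should be read as a generic pattern rather than the syntax of any particular tool.

```python
def phrase(text: str) -> str:
    """Wrap multi-word keywords in double quotes so they match as a phrase."""
    return f'"{text}"' if " " in text else text


def build_query(all_of=(), any_of=(), none_of=()) -> str:
    """Combine keywords with AND, OR and NOT (generic Boolean syntax)."""
    parts = [phrase(term) for term in all_of]
    if any_of:
        # Alternatives are grouped in parentheses and joined with OR.
        parts.append("(" + " OR ".join(phrase(term) for term in any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {phrase(term)}"
    return query.strip()


if __name__ == "__main__":
    print(build_query(
        all_of=["herbicide resistance", "tobacco"],
        any_of=["transgenic", "gene transfer"],
        none_of=["review"],
    ))
```

Keeping required terms, alternatives and exclusions in separate lists makes it easy to tighten or relax a search one step at a time while watching how the number of results changes.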
3.2. Finding literature for research
Google is currently regarded as the fastest and most effective search engine and is the one most people use. For general information searches, and even for directory-based searches, Google remains the dominant tool. In some cases Google can reach secured web pages and show them in the search results, but access to those sources is then blocked for network security reasons. Even so, for a broad search Google is usually the first tool to choose.
A search begins by defining the information needed and then building keywords. For researchers in biology, and especially in molecular biology, information comes mainly from foreign-language literature, so proficiency in English is almost mandatory. Keywords are built by combining words, mainly nouns, into key phrases. Google usually returns a very large number of results, so users must filter them by lengthening the keywords, grouping keywords into phrases, combining them with logical (Boolean) operators, or using the advanced search functions. However, Google only solves the problem of finding general, broad information; finding specific information for a research purpose requires searching again within the results, which costs a great deal of time and effort.
In biology, a large share of the material used for research and study consists of scientific articles published in specialist journals. Using information from such articles helps ensure that it is accurate and specific. PubMed is an NCBI database, built on MEDLINE, that lets users search a very large body of research results in biology and medicine, as full-text articles or abstracts. Recently many other journals have been added to the PubMed index, so its coverage of published articles is no longer limited to biomedicine but extends to other fields such as chemistry, physics, materials technology and information technology. Full-text articles that can be downloaded free of charge can be found in NCBI's PMC database.
PubMed search results are presented as articles with related information. Figure xxx shows a typical PubMed search result. Each result gives the article title, the author or group of authors, the name of the journal, the issue and the page numbers. PubMed also provides a link to the source of the article, where readers can access it either free of charge or with the permission of the site hosting it.
Figure 15. Searching for research literature in the PubMed database
3.3. Getting to know PubMed
PubMed is a free, openly accessible resource developed and maintained by NCBI, part of NIH. PubMed contains more than 20 million citations for biomedical literature from MEDLINE, life science journals and online books. It is a large database that brings together articles, abstracts, citations and links to other databases. MEDLINE originally contained journals and abstracts relating to the life sciences and biomedical topics. The United States National Library of Medicine (NLM) at NIH maintains this database as part of its information management and storage system. PubMed was first released in January 1996.
Counting from 1966 to the present, PubMed holds more than 22.7 million articles, some dating back as far as 1809. About 0.5 million new articles are added each year. Of the records in PubMed, about 13.1 million have abstracts and 14.2 million have links to full-text articles, of which 3.8 million can be downloaded free of charge.
PubMed also applies logical operators during a search, but does so automatically. The keywords entered are translated into variant forms of each word and into terms commonly used in connection with them, and these are combined with logical operators.
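PubMed can also be queried programmatically. The sketch below is not part of the original lecture: it assumes NCBI's public E-utilities ESearch endpoint and only builds the request URL, without sending it. The helper name `pubmed_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the lecture notes).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a PubMed query; nothing is sent over the network."""
    params = {
        "db": "pubmed",      # database to search
        "term": term,        # query text, may contain AND / OR / NOT
        "retmax": retmax,    # maximum number of record IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_search_url("PCR AND HIV", retmax=5))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return a list of matching PubMed identifiers. Automated use is subject to NCBI's usage policies, which should be checked before running many requests.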
Figure 16. PubMed search results
3.4. Managing research literature
Finding literature that suits a research purpose takes a great deal of time and effort. Even once articles relevant to the topic have been found, managing them effectively for reading, lookup and citation requires the researcher to sort and organize this information efficiently.
There are many ways to manage article information and data; EndNote is a fairly effective tool that lets researchers access and cite sources for many different purposes. One advantage is that EndNote accepts the search-result formats of several tools, most typically NCBI's MEDLINE format. EndNote also makes it possible to search for information and insert citations automatically in scientific articles, theses and dissertations, based on the library that has been created. The figure below illustrates the EndNote program. How to use EndNote is covered in detail in the practical classes that accompany these lecture notes.
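To give a feel for what the MEDLINE format mentioned above looks like, the sketch below parses a small piece of text laid out in that style: a tag of up to four characters, a hyphen and a space, then the value, with continuation lines indented and records separated by blank lines. This is an illustration only, not how EndNote itself imports records, and the sample records are invented.

```python
from collections import defaultdict


def parse_medline(text: str) -> list[dict]:
    """Parse MEDLINE-style tagged text into a list of {tag: [values]} records."""
    records, current, tag = [], None, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(dict(current))
            current, tag = None, None
        elif line[:4].strip() and line[4:6] == "- ":
            # New field such as "PMID- 1" or "TI  - Some title".
            tag = line[:4].strip()
            if current is None:
                current = defaultdict(list)
            current[tag].append(line[6:].strip())
        elif tag and current is not None:
            # An indented line continues the previous field.
            current[tag][-1] += " " + line.strip()
    if current:
        records.append(dict(current))
    return records


# Invented sample records, for illustration only.
SAMPLE = """PMID- 1
TI  - A hypothetical title that
      wraps onto a second line.
AU  - Nguyen A
AU  - Tran B

PMID- 2
TI  - Another title.
"""

if __name__ == "__main__":
    for record in parse_medline(SAMPLE):
        print(record["PMID"][0], record["TI"][0])
```

Repeated tags such as AU (author) are kept as lists, which mirrors the way a reference manager stores several authors for one article.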
Figure 17. Managing a database of scientific articles with EndNote
Chapter 3 summary
1. The Internet holds a vast store of information; search tools are needed to exploit it.
2. Searching for information involves identifying the information source, building keywords and search expressions, and finally choosing a search tool.
3. The reliability of information must be judged against criteria such as the purpose of whoever posted it, when it was posted and its links.
4. PubMed is one of NCBI's most important databases. Here researchers can search for and download many studies and articles published in reputable journals.
5. Managing literature with software tools helps researchers organize and arrange their references systematically. Citing references in articles, theses and dissertations with EndNote saves researchers time and effort.
Chapter 3 review questions
1. State the main steps of searching for information with a search tool. On what
basis is the reliability of the retrieved information assessed? Give a concrete
example of the steps for searching a research topic (for example, transferring
herbicide-resistance genes into tobacco) with Google.
2. Find images of the bacterium E. coli, of Xanthomonas oryzae pv. oryzae (the
bacterium that causes bacterial leaf blight of rice) and of the principle of the PCR
technique.
3. Using search tools, find literature on the PCR technique and its applications.
Requirements: state the keywords used and the number of results found. From those
results, select the single most reliable document.
4. Using what you have learned, find the addresses of and visit the home pages of the
world gene banks NCBI, EMBL, EBI and DDBJ, of PubMed, and of the International Rice
Research Institute (IRRI).
5. Go to PubMed and search for literature on the HIV virus or on hepatitis. Find at
least 10 full-text articles in the PubMed database, then use EndNote to store and
manage these articles as a library.
6. Using the library you have just built, search for articles by field (author name,
article title, year of publication, keywords). Then use EndNote to cite articles and
research works from that library automatically in a graduation thesis.
PART 2
BIOLOGICAL DATABASES
SUBMITTING SEQUENCES TO DATABASES
CHAPTER 4. BIOLOGICAL DATABASES
Databases
The most important foundation of applied bioinformatics is the database. Most of the
data in biological databases are biological sequences accompanied by detailed
descriptive information. Genome sequencing projects, for example, generate data every
day on a worldwide scale. To make these data usable, they must be organized and
arranged sensibly so that they can be stored, grouped, accessed, searched and
compared. In addition, because of the particular nature of biological data, there are
structure and function databases alongside the ordinary sequence databases.
Because the databases are complex and interrelated, it is very hard to sort them into
cleanly separated categories. By the origin of their data they can be divided into
primary and secondary databases. Primary databases contain nucleotide sequences,
amino acid sequences or structures determined experimentally, together with
descriptive information on function, related publications and cross-links to other
databases. Secondary databases contain data that have been selected from the primary
databases and arranged according to specific criteria. By the nature of their data,
databases can be divided into sequence databases, structure databases and other
databases (Figure 18). Databases play an extremely important role as the basis for
searching, analysing and comparing data. By combining analysis tools with the
cross-links between databases, researchers can identify, predict and analyse the
information contained in sequences and determine the properties and functions of new
biological sequences.
Figure 18. Classification of biological databases
4.1. Primary databases
4.1.1. Nucleotide sequence databases
GenBank
GenBank is regarded as the best-known and most widely used database of the NCBI
(National Center for Biotechnology Information) of the United States. GenBank is a
freely accessible database containing more than 189,000,000 sequences with a total of
more than 299,000,000,000 bases from more than 380,000 organisms (as of December
2010). GenBank works together with two other large banks, that of the European
Molecular Biology Laboratory (EMBL), hosted at the European Bioinformatics Institute
(EBI), and the DNA Data Bank of Japan (DDBJ), to form the International Nucleotide
Sequence Database Collaboration (INSDC).
Sequences submitted to NCBI must be at least 50 bases long and are described in
detail, including an accession number (AN). The accession number remains unchanged
even when the sequence is updated. In some cases a version number is placed after the
accession number, separated from it by a dot. Sequences enter GenBank through
sequence submission, which is done through a web interface (BankIt) or by email
(Sequin). Sequence submission is described in detail in the next chapter.
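The relationship between an accession number and its version can be illustrated with a small Python sketch. The function names `split_accession` and `same_record` are hypothetical, and the identifiers are used only as examples of the "accession.version" notation described above.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError("empty accession")
    if dot and not version.isdigit():
        raise ValueError(f"invalid version in {identifier!r}")
    return accession, (int(version) if dot else None)


def same_record(id_a, id_b):
    """True when both identifiers point to the same entry, whatever the version."""
    return split_accession(id_a)[0] == split_accession(id_b)[0]


print(split_accession("U49845.1"))
print(same_record("AB000001.1", "AB000001.3"))
```

Comparing only the part before the dot reflects the rule that the accession number stays fixed while the version changes with each sequence update.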
Each sequence stored in GenBank is called an entry. An entry begins with the keyword
LOCUS followed by the locus name. Like the AN, the locus name is unique; unlike the
accession number, however, it may change after the entry has been reviewed or
revised. The locus name consists of 8 characters: the initial letters of the genus
and species names, followed by the 6 digits of the accession number.
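As an illustration of the LOCUS line, the Python sketch below splits such a line into its fields. It assumes the whitespace-separated layout of the GenBank flat file (locus name, length, unit, molecule type, then division and date at the end of the line); the function name `parse_locus_line` is hypothetical and the example line is illustrative only.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into named fields (simplified sketch)."""
    fields = line.split()
    if len(fields) < 7 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return {
        "locus_name": fields[1],
        "length": int(fields[2]),
        "unit": fields[3],          # 'bp' for nucleotides
        "molecule": fields[4],
        "division": fields[-2],     # taken from the end: optional fields may precede it
        "date": fields[-1],
    }


# Illustrative line; the values are for demonstration only.
LOCUS_LINE = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
entry = parse_locus_line(LOCUS_LINE)
print(entry["locus_name"], entry["length"], entry["date"])
```

Reading the division and date from the end of the line keeps the sketch tolerant of lines that carry an extra field, such as the molecule topology, before them.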
EMBL and DDBJ
The European and Japanese partners of GenBank are EMBL/EBI and DDBJ, which are also
primary sequence repositories. The three databases GenBank, EMBL and DDBJ are linked
to form the INSDC. The partners exchange their data with one another daily, so a
sequence search can be carried out at any of the three banks. Although the entry
format of NCBI and DDBJ differs from that of EMBL, the information held in each entry
is the same.
4.1.2. Protein sequence databases
SWISSPROT
One of the largest databases of thoroughly annotated protein sequences is SWISSPROT,
hosted at the Swiss Institute of Bioinformatics (SIB). The database is served by a
system called ExPASy (Expert Protein Analysis System). SWISSPROT contains manually
curated sequences: every record is reviewed by experts and, where necessary, checked
against published work. For this reason the database is of very high quality and is
regarded as the gold standard for analysing and looking up protein information.
SWISSPROT is also part of the UniProt database.
Because new sequences and information are produced continuously, the SIB experts
cannot keep pace, so a second database, TrEMBL, was created alongside SWISSPROT.
TrEMBL stands for Translated EMBL: it contains all the protein sequences obtained by
translating DNA sequences. All of its annotation is produced automatically by
computer rather than by experts, so TrEMBL is less reliable. Both databases can be
accessed through the main SWISSPROT interface, where simple query sequences can be
entered in the search box. The search and analysis tools for these databases are
provided by SIB.
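The idea behind a translated database can be illustrated by translating a coding DNA sequence into protein. The Python sketch below uses the standard genetic code and reads a single frame from the first base; the function name `translate` and the example sequence are hypothetical, and the automatic pipelines behind real databases do much more (gene prediction, frame selection, annotation).

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence from its first base up to a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":                            # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```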
NCBI Protein database
Another very important sequence database maintained at NCBI is the Protein database.
It is not simply a set of sequence data but a collection of entries drawn from many
other protein sequence databases, for example UniProt, PIR and PDB.
UniProt
The amount of information on proteins continues to grow rapidly. Besides sequences,
expression patterns, predicted secondary structures and biological functions are also
stored and described. All these data are held in databases, some of which are
specialised databases devoted to a single field, so gathering all the information on
a protein of interest can take a great deal of time. For this reason EBI, SIB and
Georgetown University built a central resource for protein information called the
Universal Protein Resource, or UniProt. UniProt was established in 2002 by combining
protein databases such as Swissprot, TrEMBL and PIR. UniProt has 3 parts: (i) the
UniProt Knowledgebase (UniProtKB), (ii) the UniProt Reference Clusters database
(UniRef), which holds clustered protein sequences, and (iii) the UniProt Archive
(UniParc), a collection of protein sequences together with their history.
Of the three UniProt databases, UniProtKB is the most complete; it combines Swissprot
and TrEMBL. Proteins can be searched for in UniProtKB with long keywords or with
combinations of keywords. UniRef is a non-redundant sequence database, meaning that
each sequence appears only once. It is well suited to searching for homologous
sequences. UniRef exists in 3 forms, UniRef100, UniRef90 and UniRef50, which group
sequences that are 100%, more than 90% and more than 50% identical, respectively.
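The effect of the three identity levels can be illustrated with a toy clustering in Python. This is only a sketch under strong simplifying assumptions: identity is computed position by position on equal-length sequences without any alignment, and the greedy procedure, the function names `identity` and `greedy_cluster` and the example sequences are all hypothetical. It is not the procedure used to build UniRef.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = {}  # representative sequence -> list of member sequences
    unique = dict.fromkeys(sequences)            # drop exact duplicates, keep order
    for seq in sorted(unique, key=len, reverse=True):
        for rep in clusters:
            if len(rep) == len(seq) and identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]                # seq starts a new cluster
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTGYLAKQK", "GSHMLEDPVA"]
for level in (1.0, 0.9, 0.5):
    print(level, len(greedy_cluster(seqs, level)))
```

Lowering the threshold merges more sequences into each cluster, which is why a search against the 50% level is faster but less specific than one against the 100% level.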
PIR
The Protein Information Resource (PIR) provides scientists with a reliable database
of protein sequences together with accurate information on their functions. PIR gives
strong support to research in genomics, proteomics and systems biology.
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) to
help researchers identify and annotate protein sequence information. This includes
comparing protein sequences and identifying evolutionarily related sequences on the
basis of sequence alignment.
Figure 19. The PIR database
Over more than four decades, beginning with the Atlas of Protein Sequence and
Structure, PIR has provided protein databases and analysis tools that scientists can
use and access free of charge, including the Protein Sequence Database (PSD).
4.1.3. Molecular structure databases
PDB
The Protein Data Bank (PDB) is a database of three-dimensional structures of
biological macromolecules such as proteins and nucleic acids. The data usually come
from experimental studies that use crystallization and X-ray crystallography or NMR
spectroscopy. They are collected from the results of scientists and research groups
all over the world. PDB is regarded as the largest source of biological structure
data and is linked to other major databases such as GenBank, EMBL and SwissProt.
Starting in 1976 with only 3 protein structures, PDB held a total of 90,611 molecular
structures by mid-May 2013.
Experimental method    Proteins  Nucleic acids  Protein/DNA complexes  Other molecules  Total
X-ray diffraction         74593           1457                   3864                2  79916
NMR                        8700           1029                    192                7   9928
Electron microscopy         374             45                    126                0    545
Hybrid                       46              3                      2                1     52
Other                       147              4                      6               13    170
Total                     83860           2538                   4190               23  90611
Figure 20. The PDB protein structure database
PDB files can be displayed with free or open-source programs such as PyMOL, UCSF
Chimera, RasMol and Swiss-PDB Viewer, and some viewers are integrated directly into
the website. The web-based viewers usually require the latest version of JavaScript
support.
Besides storing molecular structure data, PDB provides tools that let researchers
compare protein sequences, model structures and compare protein structures.
SCOP
SCOP (Structural Classification of Proteins) classifies proteins of known structure
in a hierarchical system. Proteins that perform similar biological functions and are
closely related in evolution have similar structures, at least in the regions of
their active centres. The function of an unknown protein can therefore be predicted
by comparing its structure with the structures of known proteins. SCOP supports this
kind of function prediction and is divided into three levels: protein families,
protein superfamilies and folds. A protein family comprises proteins with a clear and
close evolutionary relationship, defined by a sequence identity of at least 30% over
the whole length of the proteins. Proteins that do not meet this criterion are still
placed in the same family if they are similar in structure and function. Proteins
with low sequence identity whose structural and functional features nevertheless
indicate a relationship are grouped into superfamilies. Proteins that have the same
kind of secondary structures arranged in the same folding pattern are placed in the
same fold.
CATH (Class, Architecture, Topology and Homologous Superfamily)
The CATH database groups protein structures hierarchically into 4 levels: Class (C),
Architecture (A), Topology (T) and Homologous superfamily (H). Assignment to a class
is carried out largely automatically: the proportion of each kind of secondary
structure is calculated without regard to how the secondary structures are arranged
and connected. Four classes are distinguished: (i) proteins built mainly of helices
(chiefly alpha helices), (ii) proteins built mainly of beta sheets, (iii) proteins
with both helices and sheets (alpha-beta) and (iv) proteins with very little
secondary structure.
The Architecture (A) level describes the arrangement of the secondary structure
elements and is assigned manually. The Topology level describes the overall shape of
the protein and the connectivity of its secondary structure elements. Topology
assignment relies on an algorithm that uses empirically derived parameters for
grouping domains. The Homologous superfamily (H) level comprises homologous domains,
that is, domains with a common origin. Similarity is established by sequence
comparison followed by structure comparison, depending on the Topology assignment.
Besides these 4 levels there is a fifth level called the sequence family (S). At this
level domains are grouped by high sequence similarity (at least 35% identity over
more than 60% of the length of the larger domain), so these proteins usually have
similar functions.
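The sequence family criterion quoted above (at least 35% identity over 60% of the larger domain) can be written as a short check. The Python sketch below is only an illustration: the function name `same_sequence_family` and its parameters are hypothetical, both thresholds are treated as inclusive, and the numbers of identical and aligned positions are assumed to come from a pairwise alignment computed elsewhere.

```python
def same_sequence_family(identical, aligned_length, len_a, len_b,
                         min_identity=0.35, min_coverage=0.60):
    """Check the identity and coverage thresholds for two aligned domains."""
    if not 0 < aligned_length <= min(len_a, len_b):
        raise ValueError("aligned_length must be between 1 and the shorter domain")
    if not 0 <= identical <= aligned_length:
        raise ValueError("identical must be between 0 and aligned_length")
    seq_identity = identical / aligned_length         # identity within the alignment
    coverage = aligned_length / max(len_a, len_b)      # share of the larger domain
    return seq_identity >= min_identity and coverage >= min_coverage


# 50 identical positions in a 120-residue alignment of domains of 150 and 180 residues
print(same_sequence_family(50, 120, 150, 180))
# Same identity count, but the alignment covers only half of the larger domain
print(same_sequence_family(50, 90, 150, 180))
```

The coverage test matters because a short, highly similar stretch alone does not show that two whole domains are related.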
4.2. Secondary databases
PROSITE
PROSITE is a secondary database in which proteins are grouped by conserved motifs
(short sequence regions of 10 to 20 amino acids that are highly conserved among
closely related proteins). It is a very important basis for studying protein
function.
Searching for proteins that share the same motifs makes it possible to infer their
function, which is very useful when studying an unknown protein. The motifs detected
in the unknown protein can suggest its function and some of its biological
properties. Motif detection is based on the principle of sequence alignment (see
Chapter 8).
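Motif searching can be illustrated by converting a pattern written in PROSITE syntax into a regular expression. The Python sketch below assumes the common elements of that syntax: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends, and `-` between positions. The function name `prosite_to_regex` and the test sequence are hypothetical, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                           # drop the final period
    regex = regex.replace("-", "")                        # position separators
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")     # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")     # sequence ends
    return regex


motif = prosite_to_regex("N-{P}-[ST]-{P}.")
print(motif)
print(re.findall(motif, "MKNGSAQNPTL"))
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
```

The curly braces are converted before the parentheses because the regular-expression repeat syntax reuses curly braces with a different meaning.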
PRINTS
Sequences in the PRINTS database are distinguished on the principle of
fingerprinting. A fingerprint consists of several sequence motifs. PRINTS exploits
the fact that proteins with similar functional regions share several sequence motifs.
By comparing several fingerprint regions, the relationship of a protein to a known
protein family can be established even when some of the motifs are missing.
PRINTS is cross-linked to the entries of related databases, which lets users reach
many sources of information on a protein family. Like PROSITE, PRINTS holds
information on each protein family and, where possible, on the biological function of
each motif in the fingerprints.
Pfam
The Pfam database classifies proteins by profiles. Each profile is defined by the
probability that a given amino acid, an insertion or a deletion occurs at each
position in a protein sequence. Proteins in Pfam are grouped on the basis of sequence
alignment. The alignment results make it possible to relate function, structure and
evolutionary relationship to one another.
4.3. Other databases
4.3.1. Genotype and phenotype databases
The relationship between genotype and phenotype is studied through the phenotypic
changes caused by mutated genes. Several genotype/phenotype databases have been
created to store the relationships between genes and the biological characteristics
of organisms. They include OMIM (Online Mendelian Inheritance in Man) of NCBI.
Another is dbGaP (database of Genotypes and Phenotypes), also of NCBI, whose data are
used to analyse the statistical significance of genotype-phenotype relationships. In
addition, OMIA (Online Mendelian Inheritance in Animals) at NCBI covers
genotype-phenotype relationships in many animal species other than mouse and human.
For mouse, the corresponding database is MGD (Mouse Genome Database). For two
important model organisms, the fruit fly (D. melanogaster) and the nematode (C.
elegans), these relationships are stored in FlyBase and WormBase. Both databases hold
information on the relationship between genotype and phenotype.
4.3.2. Genotype database (PhenomicDB)
PhenomicDB stores genotype and phenotype information for many species, from humans
to well-studied organisms such as mouse, fish, fruit fly, nematode, yeast and
Arabidopsis thaliana. It combines data from many other databases.
A distinctive feature of PhenomicDB is that it allows cross-species comparison based
on genotype-phenotype relationships. The comparison is made by combining the data for
orthologous genes (genes that diverged from a common ancestor).
4.3.3. PubChem
PubChem is an NCBI database that stores small molecules and information on their
biological activities. It has 3 components: PubChem Compound, PubChem Substance and
PubChem BioAssay. PubChem Compound contains more than 11 million molecules (2007)
together with their two-dimensional structures.
PubChem Substance allows searches for substances from many manufacturers, for
compounds of unknown composition and for natural compounds whose two-dimensional
structure is unknown. PubChem BioAssay provides data on biological assays. The
database can be searched with query keywords. PubChem is very useful because its
data are linked both internally and to external databases such as PubMed. For
example, once a substance is known to inhibit an enzyme, many other substances with a
similar inhibitory effect can be found. Moreover, small molecules with different
structures may show the same biological activity in biological assays. This is the
basis for discovering and developing new structures for therapeutic drugs.
Specialised databases
Besides the databases described above, there are now thousands of databases that
store information on biological sequences, molecular structures, gene maps and
genotype-phenotype relationships. With the rapid development of next-generation
genome sequencing, tens of thousands of genomes have been sequenced. Genome databases
and their accompanying annotation are of great value for mining genome information,
comparing genomes and studying the functions of genes and proteins through
comparisons made not only at the molecular level but across whole genomes.
For some thoroughly studied organisms, detailed information is available on every
gene and on the mechanisms that regulate gene activity. Typical examples are the
databases for Arabidopsis thaliana, for rice and for some other important crops.
The rapid growth in the number of genomes and the results of genome comparison have
given rise to databases of single nucleotide polymorphisms (SNPs). SNP databases are
important for analysing polymorphism among organisms and the relationship between
SNPs and traits, including diseases. SNP research also helps in studying how
individuals respond differently to environmental influences or to therapeutic drugs.
For livestock, SNP databases also supply molecular markers for use in breeding.
Research on genes and the functional activity of genes