mining associations for organism characteristics in .... pavlovic-l… · – thermal adaptation...

39
Mining associations for organism characteristics in prokaryotes – an integrative approach G. Pavlovic-Lazetic 1 , V. Pajic 2 , N. Mitic 1 , J. Kovacevic 1 , M. Beljanski 3 1 Faculty of Mathematics,University of Belgrade, Belgrade, Serbia 2 Faculty of Agriculture,University of Belgrade, Belgrade, Serbia 3 Institute of General and Physical Chemistry, Serbia

Upload: others

Post on 07-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Mining associations for organism characteristics in prokaryotes –

an integrative approach

G. Pavlovic-Lazetic1, V. Pajic2, N. Mitic1, J. Kovacevic1, M. Beljanski3

1Faculty of Mathematics,University of Belgrade, Belgrade, Serbia 2Faculty of Agriculture,University of Belgrade, Belgrade, Serbia

3Institute of General and Physical Chemistry, Serbia

Page 2: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

OVERVIEW

• On the problem – association of phenotype to genotype characteristics

• Related work • Databases • Database mining: Example • Text mining; Method • Example • Conclusion

Page 3: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

On the problem – association of phenotype to genotype characteristics

• Genotype • Phenotype • One of the main problems in the post-genomic era is to

associate phenotypic characteristics of an organism to proteins and metabolites encoded by its genome.

• Provide for deeper comprehension of evolutionary processes

• Some prediction possibilities, e.g., trends prediction of some pandemia

• Importance for bio-defense

Page 4: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Related work • Microbial phenotypes are typically due to the joined action of

multiple gene functions – Inferring gene function from cross-organismal distribution of

phenotypic traits; reliable when the phenotype does not arise from many alternate mechanisms – (Jim et al, 2004)

– Co-occurrence between sets of genes and the phenotype; association rule mining algorithms; NETCAR, PCAR (Tamura, D’haeseleer, 2008)

– Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010)

– Genotype-phenotype associations by combining information from a biomedical database with the molecular information from COGs database

– Association of genes to phenotypes by literature mining and comparative genome analysis (Korbel et al, 2005)

Page 5: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Databases • Plenty of genotype data and gene sequences for different

organisms • Usually well structured and placed into databases • Data on phenotypic characteristics of organisms often

scattered across different text documents, e.g., scientific papers or encyclopedias

• Public databases (both kinds of information), e.g., – NCBI Entrez Genome database (the most extensive) – Comprehensive Microbial Resource – Genome Atlas Database – DOE-Joint Genome Institute databases – Databases and tools for specific types of genotype-phenotype

research, etc.

Page 6: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

OVERVIEW

• On the problem – association of phenotype to genotype characteristics

• Related work • Databases • Database mining: Example • Text mining; Method • Example • Conclusion

Page 7: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example

• NCBI Entrez Genome database • An instance (2011)- collection (table) organism_info: • Characteristics:

– genome size, GC content – shape, oxygen, habitat, salinity, temperature, gram,

motility, pathogenicity – number of isolates / different species : 7467/2163 – some columns sparse – under half-populated

Page 8: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example

• Statistics on organism_info:

010002000300040005000600070008000

isol

ates

spec

ies

oxyg

-isol

oxyg

-spe

c

hab-

isol

hab-

spec

tem

p-is

ol

tem

p-sp

ec

gram

-isol

gram

-spe

c

o-h-

isol

o-h-

spec

o-h-

t-is

ol

o-h-

t-sp

ec

o+h+

t+g_

isol

o-h-

t-g-

spec

TABLE Organism_info

Page 9: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example • Modalities +--------------------------------+---------------+ | oxygen | count(*) | +--------------------------------+---------------+ | | 4182 | | Aerobic | 1064 | | Anaerobic | 792 | | Facultative | 1300 | | Microaerophilic | 129 | +--------------------------------+---------------+

+--------------------------------+---------------+ | habitat | count(*) | +--------------------------------+---------------+ | | 4273 | | Aquatic | 502 | | Host-associated | 1429 | | Multiple | 856 | | Specialized | 204 | | Terrestrial | 203 | +--------------------------------+---------------+

Page 10: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example • Modalities +--------------------------------+---------------+ | temp | count(*) | +--------------------------------+---------------+ | | 3835 | | Cryophilic | 1 | | Hyperthermophilic | 78 | | Mesophilic | 3377 | | Psychrophilic | 35 | | Thermophilic | 141 | +--------------------------------+---------------+ +---------+------------+ | gram | count(*) | +--------+-------------+ | | 4043 | | + | 1371 | | - | 2047 | | _ | 6 | +--------+-------------+

Page 11: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example

• Association rule mining – Given a set of transactions consisting of one or more elements (items),

find rules that predict occurrence of an item based on occurrence of other items in the transaction

– Support / Confidence / Support*Confidence / Lift

• IBM Intelligent Miner - two examples: – geno/pheno – pheno

Page 12: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example 1

Page 13: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Database mining: Example 2

Page 14: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

OVERVIEW

• On the problem – association of phenotype to genotype characteristics

• Related work • Databases • Database mining: Example • Text mining; method • Example • Conclusion

Page 15: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Text mining; method

• Much more information about microbes and other organisms in unstructured and semi-structured documents

• Text mining: Goal - mining genotype / phenotype organism characteristics from text

• Method: – two phase method based on finite state transducers (FST)

for information extraction from text • Tool: system UNITEX; regular expressions; graphs;

example

Page 16: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

FSTs for genome size

Page 17: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

OVERVIEW

• On the problem – association of phenotype to genotype characteristics

• Related work • Databases • Database mining: Example • Text mining; method • Example • Conclusion

Page 18: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example

• Text mining in bioinformatics – Materials: encyclopedia of microorganisms – Bergey's Manual of Systematic Bacteriology, Volume 2 : The

Proteobacteria (2005) – Bergey's Manual of Systematic Bacteriology, Volume 3: The Firmicutes

(2009) – Bergey's Manual of Systematic Bacteriology, Volume 4: The

Bacteroidetes, Spirochaetes, Tenericutes (Mollicutes), Acidobacteria, Fibrobacteres, Fusobacteria, Dictyoglomi, Gemmatimonadetes, Lentisphaerae, Verrucomicrobia, Chlamydiae, and Planctomycetes, (2010)

Page 19: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype
Page 20: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: database

Page 21: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: graph “genome size” • “genome sizes of four G. oxydans strains were estimated to be

between 2240 and 3787 kb” • “genome size of R. prowazekii is 1,111,523 bp” • “genome size of R. africae is 1.248 kb” • “genome size of R. australis is 1256–1276 kbp” • “genome size is 2.62 X 109 Da” • “Genome size: 2.73 X 109 Da” • “genome size is 1.713 Mbp” • “genome size was estimated to be approximately 4061 kb” • “genome size of all the classical strains examined was about

3000 kb”

Page 22: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: graph “gram”

Page 23: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: phase 1

Page 24: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: phase 2

Page 25: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype
Page 26: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: postprocessing

• Manually / biocuration • E.g., habitat -> aquatic, terrestrial, host-associated,… • Tables Species, Genus

0

500

1000

1500

2000

2500

3000

total oxygen habitat temp gram ph

Table Species

TABLES Species+Genus

Page 27: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: Species vs.Organism_info Species vs. Organism_info: TABLE Species

0

500

1000

1500

2000

2500

3000

total IN Organism_info NOT IN Organism_info

Species vs.Organism_info

TABLE Species

Species vs. Organism_info: TABLE Species

050

100150

200250300

350400

450500

oxygen habitat temp gram

in Organism_info

in Orgnism_info: equal

in Organism_info: not equal

Page 28: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: Species vs.Organism_info

Species vs. Organism_info: TABLE Organism_info

0

200

400

600

800

1000

1200

1400

1600

oxygen habitat temp gram

in Speciesin Species: equalin Species: not equal

Page 29: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: integration of data sources

• Tables: organism_info, species • Populating the table organism_info by using data from the

table species • Table organism_info_int

0

1000

2000

3000

4000

5000

6000

oxygen habitat temp gram

TABLE Orgnism_info

TABLE Organism_info_int

Page 30: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: integration of data sources

• Tables: organism_info_int, species – union of the tables organism_info_int and species; – projection to common attribute

• Table species_int

0100020003000400050006000700080009000

10000

total

Oxyge

n

Habita

tTem

pGram

ox-ha

b

ox+h

ab+tem

p

ox+h

ab+tem

p+gra

m

Organism_info

Organism_info_int

vrste_int

Page 31: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Example: association rule mining

• 1. tables organism_info, organism_info_int – (some) genotype / phenotype chracteristics (genome size,

GC%, shape, habitat, oxygen, temperature, gram, ph)

• 2. tables organism_info, organism_info_int,

species_int – Smaller attribute set; phenotype (eco) characteristics

(habitat, oxygen, temperature, gram)

Page 32: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype
Page 33: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype
Page 34: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

OVERVIEW

• On the problem – association of phenotype to genotype characteristics

• Related work • Databases • Database mining: Example • Text mining; method • Example • Conclusion

Page 35: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Conclusion (and beyond…) • Significant enlargement of databases • Applicable to many other specific areas and tasks • Association rules – modest • Sparse structured data • Integration of several microbial databases • Rich set of characteristics

– Genotypic – Phenotypic – Structural – Protein disorder – COGs – RNA secondary structures – …

• Multivariate analysis

Page 36: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

References • A Cross-Genomic Approch for Systematic Mapping of Phenotypic Traits to Genes, Km Jim,

Kush Parmr, Mona Singh, Saeed Tavazoie, Genome Research, 14, 2004, 109-115 • Microbial genotype-phenotype mapping by class association rule mining, Makio Tamura,

patrik D’haeseleer, Bioinformatics, 24 (13), 2008, 1523-1529 • Ontology-guided data preparation for discovering genotype-phenotype relationships,

Adrien Coulet12*, Malika Smaïl-Tabbone2, Pascale Benlian3, Amedeo Napoli2 and Marie-Dominique Devignes2 , BMC Bioinformatics 2008, 9(Suppl 4):S3

• . J. Korbel, T. Doerks, L. J. Jensen, C.Perez-Iratxeta, S. Kaczanowski, S. D. Hooper, M. A. Andrade, P. Bork: Systematic association of genes to phenotypes by genome and literature mining, PLoS Biol, 3, 2005, pp. 134-134.

• 11. N. J. MacDonald, R. G. Beiko, Efficient learning of microbial genotype-phenotype association rules, Bioinformatics, 26, 2010, pp. 1834-1840.

• Reduction in Structurl Disorder and Functional Complexity in the Therml Adpttion of Prokryotes, Prasad V. Burr, Lajos Kalmr, Peter Tompa, Plos One, 2010. 5(8), e12069

• Goh, C.S., Gianoulis, T.A., Liu, Y., Li, J., Paccanaro, A., Lussier, Y.A. and Gerstein, M. (2006) • ‘Integration of curated databases to identify genotype-phenotype associations’, BMC

Genomics, 7, pp.257–257.

• .

Page 37: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype
Page 38: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Accepted in International Journal of Data Mining in Bioinformatics, Inderscience Enterprise Ltd. .

Page 39: Mining associations for organism characteristics in .... Pavlovic-L… · – Thermal adaptation vs. structural disorder and functional complexity (Burra et al, 2010) – Genotype-phenotype

Information Extraction from Semi-structured Resources: A Two-Phase Finite State Transducers Approach