text mining and data integration

Lars Juhl Jensen Text mining and data integration

Upload: lars-juhl-jensen

Post on 10-May-2015

TRANSCRIPT

Page 1: Text mining and data integration

Lars Juhl Jensen

Text mining and data integration

Page 2: Text mining and data integration

exponential growth

Page 3: Text mining and data integration
Page 4: Text mining and data integration
Page 5: Text mining and data integration

~45 seconds per paper

Page 6: Text mining and data integration

information retrieval

Page 7: Text mining and data integration

named entity recognition

Page 8: Text mining and data integration

information extraction

Page 9: Text mining and data integration

association networks

Page 10: Text mining and data integration

data integration

Page 11: Text mining and data integration

information retrieval

Page 12: Text mining and data integration

find the relevant papers

Page 13: Text mining and data integration

ad hoc retrieval

Page 14: Text mining and data integration

user-specified query

Page 15: Text mining and data integration

“yeast AND cell cycle”

Page 16: Text mining and data integration

PubMed

Page 17: Text mining and data integration
Page 18: Text mining and data integration

indexing

Page 19: Text mining and data integration

fast lookup
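
To make the idea of indexing for fast lookup concrete, here is a minimal sketch of an inverted index answering a boolean AND query such as the "yeast AND cell cycle" example. It is an illustration only, not how PubMed is implemented; the three toy documents and the function names are assumptions, and the phrase "cell cycle" is simply treated as two separate terms.

```python
from collections import defaultdict

PUNCTUATION = ".,;:()\"'"

def tokenize(text):
    """Lowercase and split on whitespace, stripping surrounding punctuation."""
    tokens = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [t for t in tokens if t]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Toy corpus (made up for illustration).
docs = {
    1: "Cell cycle regulation in yeast",
    2: "Yeast metabolism under stress",
    3: "The cell cycle of human cells",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
```

The index is built once; each query then only intersects a few precomputed sets instead of scanning every document.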

Page 20: Text mining and data integration

stemming

Page 21: Text mining and data integration

word endings
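
Stemming can be sketched as stripping common word endings so that inflected forms map to the same index term. The suffix list below is a deliberately crude assumption for illustration; real search engines typically use a more careful algorithm such as the Porter stemmer.

```python
# Hypothetical, deliberately small suffix list; longer endings are tried first.
SUFFIXES = ("ing", "ion", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

forms = ["phosphorylated", "phosphorylates", "phosphorylating", "phosphorylation"]
stems = {naive_stem(w) for w in forms}
```

All four forms reduce to the same stem, so a query for one of them can retrieve documents that use any of the others.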

Page 22: Text mining and data integration

dynamic query expansion

Page 23: Text mining and data integration

MeSH terms
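
Query expansion can be sketched as replacing each user term with a group of OR-ed alternatives before the search runs. The expansion table below is a hand-written stand-in, not actual MeSH content, and `expand_query` is a hypothetical helper name.

```python
# Hand-written toy table standing in for a controlled vocabulary (assumption).
EXPANSIONS = {
    "yeast": ["yeasts", "saccharomyces cerevisiae"],
    "cell cycle": ["cell division cycle"],
}

def expand_query(terms, expansions=EXPANSIONS):
    """Turn AND-ed terms into AND-ed groups of OR-ed alternatives."""
    groups = []
    for term in terms:
        alternatives = [term] + expansions.get(term.lower(), [])
        # Quote multi-word alternatives so they stay together as phrases.
        quoted = [f'"{a}"' if " " in a else a for a in alternatives]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

query = expand_query(["yeast", "cell cycle"])
```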

Page 24: Text mining and data integration

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 25: Text mining and data integration

no tool will find that

Page 26: Text mining and data integration

named entity recognition

Page 27: Text mining and data integration

computer

Page 28: Text mining and data integration

as smart as a dog

Page 29: Text mining and data integration

teach it specific tricks

Page 30: Text mining and data integration
Page 31: Text mining and data integration
Page 32: Text mining and data integration

identify the concepts

Page 33: Text mining and data integration

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 34: Text mining and data integration

comprehensive lexicon

Page 35: Text mining and data integration

CDC2

Page 36: Text mining and data integration

cyclin dependent kinase 1

Page 37: Text mining and data integration

orthographic variation

Page 38: Text mining and data integration

upper- and lower-case

Page 39: Text mining and data integration

CDC2

Page 40: Text mining and data integration

Cdc2

Page 41: Text mining and data integration

spaces and hyphens

Page 42: Text mining and data integration

cyclin dependent kinase 1

Page 43: Text mining and data integration

cyclin-dependent kinase 1

Page 44: Text mining and data integration

prefixes and postfixes

Page 45: Text mining and data integration

CDC2

Page 46: Text mining and data integration

hCDC2

Page 47: Text mining and data integration

“black list”

Page 48: Text mining and data integration

SDS
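
The ingredients listed above (a lexicon, tolerance for case, spaces, hyphens and prefixes, plus a black list) can be combined in a small dictionary-based tagger. This is a sketch under stated assumptions, not the tagger used in the talk: the three lexicon entries, the identifiers they map to, and the single "h" prefix are illustrative choices.

```python
import re

# Illustrative lexicon: surface name -> identifier (assumed values).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
BLACKLIST = {"sds"}   # names too ambiguous to tag, in normalized form
PREFIXES = ("h",)     # e.g. "h" for human, as in hCDC2

def normalize(name):
    """Lowercase and drop spaces and hyphens."""
    return re.sub(r"[\s\-]+", "", name.lower())

NORMALIZED = {normalize(k): v for k, v in LEXICON.items()}
MAX_WORDS = max(len(k.split()) for k in LEXICON)

def lookup(candidate):
    """Return the identifier for a candidate name, or None."""
    key = normalize(candidate)
    keys = [key] + [key[len(p):] for p in PREFIXES if key.startswith(p)]
    for k in keys:
        if k in NORMALIZED and k not in BLACKLIST:
            return NORMALIZED[k]
    return None

def tag(text):
    """Greedy longest-match tagging over word sequences."""
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    matches, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            entity = lookup(candidate)
            if entity:
                matches.append((candidate, entity))
                i += n
                break
        else:
            i += 1  # no name starts at this word
    return matches

found = tag("hCDC2 and Cdc2 encode cyclin-dependent kinase 1, run on SDS gels")
```

Normalizing both the lexicon and the text once keeps the lexicon small: one entry covers every case, spacing, and hyphenation variant.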

Page 49: Text mining and data integration

scalable implementation

Page 50: Text mining and data integration

>10 km

<10 hours

Page 51: Text mining and data integration

augmented browsing

Page 52: Text mining and data integration

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 53: Text mining and data integration

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 54: Text mining and data integration

Reflect

Page 55: Text mining and data integration

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

O’Donoghue et al., Journal of Web Semantics, 2010

Page 56: Text mining and data integration

information extraction

Page 57: Text mining and data integration

formalize the facts

Page 58: Text mining and data integration

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1

and this modification served as a priming step to promote subsequent

Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 59: Text mining and data integration

two approaches

Page 60: Text mining and data integration

co-mentioning

Page 61: Text mining and data integration

counting

Page 62: Text mining and data integration

within documents

Page 63: Text mining and data integration

within paragraphs

Page 64: Text mining and data integration

within sentences

Page 65: Text mining and data integration

co-mentioning score
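
Counting co-mentions within documents, paragraphs, and sentences can be sketched as a weighted sum per entity pair. The weights and the nested-list input format below are assumptions made for illustration; scoring schemes used in practice typically also normalize for how often each entity is mentioned overall, which this sketch omits.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs(entities):
    """All unordered pairs of distinct entities, as sorted tuples."""
    return {tuple(sorted(p)) for p in combinations(set(entities), 2)}

def comention_scores(corpus, weights=WEIGHTS):
    """corpus: documents -> paragraphs -> sentences -> tagged entity names."""
    scores = Counter()
    for document in corpus:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += weights["sentence"]
                par_entities |= set(sentence)
            for pair in pairs(par_entities):
                scores[pair] += weights["paragraph"]
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += weights["document"]
    return scores

# One toy document with two paragraphs (entities already tagged).
corpus = [[
    [["Cdc28", "Swe1"], ["Clb2"]],
    [["Cdc5"]],
]]
scores = comention_scores(corpus)
```

A pair mentioned in the same sentence is also in the same paragraph and document, so it collects all three weights.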

Page 66: Text mining and data integration

NLPNatural Language Processing

Page 67: Text mining and data integration

grammatical analysis

Page 68: Text mining and data integration

part-of-speech tagging

Page 69: Text mining and data integration

multiword detection

Page 70: Text mining and data integration

semantic tagging

Page 71: Text mining and data integration

sentence parsing

Page 72: Text mining and data integration

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
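
Once a sentence has been chunked and tagged like this, a relation can be pulled out with a pattern over the chunks. The sketch below assumes that `nxpg` marks a gene/protein noun chunk and `nxexpr` an expression noun chunk, and it hard-codes a single passive pattern ("is controlled by"); a real system would use a grammar with many such rules.

```python
import re

tagged = ("[nxexpr The expression of [nxgene the cytochrome genes "
          "[nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]")

# One hand-written rule: <expression chunk> is controlled by <gene/protein chunk>.
PATTERN = re.compile(
    r"^(?P<target>\[nxexpr .*\]) is controlled by (?P<agent>\[nxpg [^\[\]]+\])$"
)

def names(chunk):
    """Collect the individual names inside all innermost [nxpg ...] chunks."""
    found = []
    for inner in re.findall(r"\[nxpg ([^\[\]]+)\]", chunk):
        found.extend(re.split(r"\s+and\s+|\s*,\s*", inner))
    return found

def extract(sentence):
    """Return (regulator, relation, target) triples, or [] if the rule fails."""
    m = PATTERN.match(sentence)
    if not m:
        return []
    return [(agent, "regulates expression of", target)
            for agent in names(m["agent"])
            for target in names(m["target"])]

facts = extract(tagged)
```

Because the rule only fires on an explicit statement, what it extracts is usually right, but any sentence phrased differently is missed.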

Page 73: Text mining and data integration

extract stated facts

Page 74: Text mining and data integration

high precision

Page 75: Text mining and data integration

poor recall

Page 76: Text mining and data integration

text corpus

Page 77: Text mining and data integration

most use abstracts

Page 78: Text mining and data integration

few use full-text articles

Page 79: Text mining and data integration

no access

Page 80: Text mining and data integration

PDF files

Page 81: Text mining and data integration
Page 82: Text mining and data integration

layout-aware extraction

Page 83: Text mining and data integration

my corpus

Page 84: Text mining and data integration

~22 million abstracts

Page 85: Text mining and data integration

~4 million articles

Page 86: Text mining and data integration

association networks

Page 87: Text mining and data integration

guilt by association

Page 88: Text mining and data integration
Page 89: Text mining and data integration

STRING

Page 90: Text mining and data integration

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

Page 91: Text mining and data integration

computational predictions

Page 92: Text mining and data integration

gene fusion

Page 93: Text mining and data integration

Korbel et al., Nature Biotechnology, 2004

Page 94: Text mining and data integration

gene neighborhood

Page 95: Text mining and data integration

Korbel et al., Nature Biotechnology, 2004

Page 96: Text mining and data integration

phylogenetic profiles

Page 97: Text mining and data integration

Korbel et al., Nature Biotechnology, 2004
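
Phylogenetic profiles can be sketched as presence/absence vectors across genomes, compared with a similarity measure. The profiles and the choice of Jaccard similarity below are assumptions for illustration, not necessarily the measure used in the cited work.

```python
def jaccard(a, b):
    """Jaccard similarity of two presence/absence profiles of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Made-up profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": (1, 0, 1, 1, 0, 1),
    "geneB": (1, 0, 1, 1, 0, 1),
    "geneC": (0, 1, 0, 1, 1, 0),
}
same = jaccard(profiles["geneA"], profiles["geneB"])
different = jaccard(profiles["geneA"], profiles["geneC"])
```

Genes that are gained and lost together across genomes, like geneA and geneB here, are candidates for being functionally associated.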

Page 98: Text mining and data integration

a real example

Page 99: Text mining and data integration
Page 100: Text mining and data integration
Page 101: Text mining and data integration
Page 102: Text mining and data integration

Cell

Cellulosomes

Cellulose

Page 103: Text mining and data integration

experimental data

Page 104: Text mining and data integration

gene coexpression

Page 105: Text mining and data integration
Page 106: Text mining and data integration

physical interactions

Page 107: Text mining and data integration

Jensen & Bork, Science, 2008

Page 108: Text mining and data integration

curated knowledge

Page 109: Text mining and data integration

pathways

Page 110: Text mining and data integration

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 111: Text mining and data integration

many databases

Page 112: Text mining and data integration

different formats

Page 113: Text mining and data integration

different identifiers

Page 114: Text mining and data integration

variable quality

Page 115: Text mining and data integration

not comparable

Page 116: Text mining and data integration

hard work

Page 117: Text mining and data integration

quality scores

Page 118: Text mining and data integration

von Mering et al., Nucleic Acids Research, 2005

Page 119: Text mining and data integration

calibrate vs. gold standard

Page 120: Text mining and data integration

von Mering et al., Nucleic Acids Research, 2005
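
Calibrating raw scores against a gold standard can be sketched as binning the scored pairs and measuring, per bin, the fraction that the gold standard confirms. The pairs, scores, bin edges, and gold-standard set below are made up; the cited work may use a different procedure, for example fitting a smooth curve rather than using fixed bins.

```python
def calibrate(scored_pairs, gold_standard, bin_edges):
    """Per raw-score bin, return the fraction of pairs found in the gold standard."""
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for pair, score in scored_pairs:
        for b in range(n_bins):
            if bin_edges[b] <= score < bin_edges[b + 1]:
                totals[b] += 1
                hits[b] += pair in gold_standard
                break
    # None marks bins that received no pairs.
    return [h / t if t else None for h, t in zip(hits, totals)]

# Toy data (assumptions): raw scores from one evidence type.
scored_pairs = [
    (("A", "B"), 0.9), (("A", "C"), 0.8),
    (("B", "C"), 0.3), (("C", "D"), 0.2), (("A", "D"), 0.1),
]
gold_standard = {("A", "B"), ("B", "C")}
reliability = calibrate(scored_pairs, gold_standard, [0.0, 0.5, 1.01])
```

Mapping every evidence type onto the same scale in this way is what makes otherwise incomparable raw scores comparable.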

Page 121: Text mining and data integration

data integration

Page 122: Text mining and data integration

general approach

Page 123: Text mining and data integration

suite of web resources

Page 124: Text mining and data integration

STITCH

Page 125: Text mining and data integration

STRING + 300k chemicals

Page 126: Text mining and data integration

Kuhn et al., Nucleic Acids Research, 2012

Page 127: Text mining and data integration

COMPARTMENTS

Page 128: Text mining and data integration

subcellular localization

Page 129: Text mining and data integration

compartments.jensenlab.org

Page 130: Text mining and data integration

TISSUES

Page 131: Text mining and data integration

tissue expression

Page 132: Text mining and data integration

tissues.jensenlab.org

Page 133: Text mining and data integration

DISEASES

Page 134: Text mining and data integration

disease genes

Page 135: Text mining and data integration

unification

Page 136: Text mining and data integration

curated knowledge

Page 137: Text mining and data integration

text mining

Page 138: Text mining and data integration

experimental data

Page 139: Text mining and data integration

computational predictions

Page 140: Text mining and data integration

common identifiers

Page 141: Text mining and data integration

quality scores
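
Unification through common identifiers and quality scores can be sketched in two steps: map every name to a shared identifier, then combine the calibrated scores from the different evidence types. The alias table, channel names, and scores are illustrative, and the combination rule (one minus the product of the complements, which assumes independent evidence) is one common choice rather than a description of the exact scheme behind these resources.

```python
# Illustrative alias table: source-specific name -> common identifier.
ALIASES = {
    "CDC2": "CDK1", "Cdk1": "CDK1", "CDK1": "CDK1",
    "cyclin B1": "CCNB1", "CCNB1": "CCNB1",
}

def unify(records, aliases=ALIASES):
    """Group (name_a, name_b, channel, score) records by identifier pair."""
    merged = {}
    for name_a, name_b, channel, score in records:
        pair = tuple(sorted((aliases[name_a], aliases[name_b])))
        merged.setdefault(pair, {})[channel] = score
    return merged

def combine(scores):
    """Combine probabilistic scores assuming independent evidence."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining

# The same association reported under different names by two sources.
records = [
    ("CDC2", "cyclin B1", "textmining", 0.5),
    ("Cdk1", "CCNB1", "experiments", 0.8),
]
merged = unify(records)
combined = {pair: combine(by_channel.values()) for pair, by_channel in merged.items()}
```

Without the identifier mapping the two records would look like two unrelated associations; with it they reinforce each other.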

Page 142: Text mining and data integration

visualization

Page 143: Text mining and data integration

dissemination

Page 144: Text mining and data integration

web interfaces

Page 145: Text mining and data integration
Page 146: Text mining and data integration

evidence viewers

Page 147: Text mining and data integration
Page 148: Text mining and data integration

web services

Page 149: Text mining and data integration

diseases.jensenlab.org

Page 150: Text mining and data integration

bulk download

Page 151: Text mining and data integration

thank you!