termite & termite expressions
Post on 25-May-2015
© 2013 SciBite Limited. © 2013 SciBite Limited
Termite and Termite-Express
SciBite
Powerful Text-Indexing For Life Sciences
What’s The Issue?
So much public, private, professional textual content available…
Standard text-search tools don’t help much because they aren’t semantic…
Semantic means searching by “thing” not by synonym (keyword)…
Semantic means more accurate and complete results!
Users & Applications
Researchers
You Are: A life science professional whose job involves hunting for key facts in literature, patents, grants and internal documents
We Offer: The ability to data-mine millions of documents to identify critical mentions and relationships
Enterprise Search
You Are: A company wishing to make its internal search portals more accurate
We Offer: The ability to enhance your existing search tool to find key biological entities more accurately, making your users happier and more productive!
Content Providers
You Are: Anyone who produces or supplies textual content in the life sciences
We Offer: The opportunity to enrich your content for search and navigation, and significantly increase its value to your consumers
Selecting A Semantic Recognition Engine
Choices
You want something that:
Is commercially supported
Is highly configurable
Is accurate
Is scalable (millions of documents)
Is fast (MB/sec processing)
Is flexible (abstracts to full documents)
Supports batch & on-demand (web service) processing
Is tuned to life sciences data
Comes supplied with highly curated thesauri
Is comfortable with the ambiguity of life science texts
Goes beyond recognition to identify critical phrases in a document
Termite meets all these criteria
Semantic Entity Recognition: Basics
Two main approaches:
Thesaurus: match text to a list of known synonyms
Algorithmic: try to identify an entity synonym “on the fly”
Termite uses both mechanisms to identify entities with high accuracy
Thesauri are often an afterthought from tool providers, pointing to free public sources
While these are good starting points, they will deliver variable results
Our view: Commercial grade text-mining requires commercial grade thesauri
Our Thesauri Products
Thesauri are at our heart, not an afterthought:
Combine crowd-sourced and professional curation with experienced biomedical/pharma researchers
Thesauri are built to tackle real world text
Integration-ready: We use public identifiers by default
Include mappings to other resources and many are organised via ontologies
Some Examples
Human Gene
We have over 4.5 million synonyms, and when combined with our on-the-fly algorithms, we match over 30 million gene name mentions
Indication (Disease)
We have extensive coverage of over 5000 of the most important human diseases, along with over 63,000 manually verified synonyms
Protein Type
Recognises concepts such as “interleukin”, “cytokine”, “ion channel”, rather than specific genes. Arguably these terms are used more often in biomedical text than gene names, yet such entities are very poorly identified by other tools
Drugs
Recognise over 1 million synonyms covering >60,000 launched and research therapeutics. Updated on a daily basis from our internet-wide scanning at SciBite.com
We also cover: Adverse Events, Cells, Tissues, Species (and species-specific gene thesauri), Companies, Micro RNAs, Mutations, Hormones & Messengers, Investigative procedures (e.g. Biopsy), Laboratory Chemicals, Laboratory Procedures, Restriction Enzymes, Plasmids, General Laboratory Products & more!
Synonyms Aren’t Easy…
Biomedical terms are very ambiguous
GSK (GlaxoSmithKline or Glycogen Synthase Kinase?)
Hedgehog (Animal or developmental regulator protein?)
Android (The FDA approved drug or the Phone OS?)
Transgene (The company or the technique?)
MCD (macular dystrophy (corneal) or malformations of cortical development?)
Pacific (Pacific Biotechnology or the ocean?)
EGFR (The kinase receptor or estimated glomerular filtration rate?)
Ambiguity: Termite’s Strength
Termite’s engine and thesauri understand which synonyms are Fairly Dependable (e.g. Pfizer), Often Ambiguous (e.g. MCD) or Correct But Very Dangerous (e.g. Pacific)
As a document is analysed, Termite uses all three of:
Synonym Range: Which synonyms are used, and how ambiguous are they as a whole, not just one-by-one?
Synonym Metrics: Frequency and position of synonyms, relationship of abbreviations and full terms
Document Context: Does the document mention key terms (but not synonyms) that increase or decrease the chances the ambiguous synonym is correct?
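The idea of combining a synonym's reliability class with document context can be sketched roughly as below. This is only an illustration, not Termite's scoring method: the prior values, the context word lists, the `step` weight and the `score` function are all hypothetical.

```python
import re

# Hypothetical priors for the three reliability classes named on the slide.
PRIOR = {"dependable": 0.9, "ambiguous": 0.5, "dangerous": 0.1}

# Hypothetical context terms for one ambiguous synonym ("hedgehog" as a protein).
CONTEXT = {
    "supporting": {"signalling", "pathway", "embryo", "protein"},
    "opposing": {"hibernation", "spines", "wildlife", "garden"},
}


def score(reliability: str, document: str, step: float = 0.15) -> float:
    """Start from the class prior, then move up or down per context term found."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    value = PRIOR[reliability]
    value += step * len(words & CONTEXT["supporting"])
    value -= step * len(words & CONTEXT["opposing"])
    # Clamp to the range 0..1 so the result can be read as a confidence.
    return max(0.0, min(1.0, value))
```

With these made-up numbers, a sentence about the hedgehog signalling pathway scores above the prior for an ambiguous synonym, and a sentence about a hedgehog in a garden scores below it.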
Bottom Line
Termite allows you to use ambiguous synonyms in your thesauri to increase recall without returning a lot of rubbish!
β-actin
actin-β
b-actin
β actin
Beta actin
ß-actin
Actin, beta
b- actin
The German sharp S (ß) isn’t beta, but that doesn’t stop people using it
Including HTML entity codes!
Termite handles Greek characters with ease
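One way to collapse the variants listed above onto a single lookup key is sketched here. It is not Termite's implementation: the `GREEK_TO_NAME` table, the rule that a lone "b" means "beta", and the token sorting are assumptions made for illustration.

```python
import html
import re
import unicodedata

# Illustrative mapping only; the sharp S is included because it is often
# typed in place of beta. Characters not listed here are simply dropped.
GREEK_TO_NAME = {"α": "alpha", "β": "beta", "γ": "gamma", "ß": "beta"}


def normalise(term: str) -> str:
    """Map variants such as 'β-actin', 'b- actin' or '&beta;-actin' to one key."""
    text = html.unescape(term)  # HTML entity codes: '&beta;' becomes 'β'
    text = unicodedata.normalize("NFKC", text).lower()
    for char, name in GREEK_TO_NAME.items():
        text = text.replace(char, f" {name} ")
    tokens = re.findall(r"[a-z0-9]+", text)
    # Assumption: a lone 'b' stands for 'beta'; only safe in a gene-name context.
    tokens = ["beta" if token == "b" else token for token in tokens]
    # Sorting makes the key independent of word order ('Actin, beta').
    return " ".join(sorted(tokens))
```

Every variant on the slide, plus the entity-coded form `&beta;-actin`, maps to the same key under these rules.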
Muscarinic M1 Receptor(s)
Muscarinic (M1) Receptor
M1 Muscarinic Receptor
Muscarinic Receptor M1
Muscarinic Receptors M1
Muscarinic Receptor type M1
The usual variations…
Termite handles variations with ease
M1/M2 muscarinic receptors
H1 and H2 Histamine Receptors
Kinases ERK1 and 2
ERK1/2
Termite handles “broken” phrases with ease
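A minimal sketch of how shortened lists like those above could be expanded into full identifiers follows. The regular expression and the `expand_coordination` function are illustrative assumptions, not Termite's actual approach, and cover only the "prefix + number" shapes shown on the slide.

```python
import re

# Letters, a number, a joiner ('/' or 'and'), then either a bare number
# ("ERK1/2") or the same prefix repeated before the number ("M1/M2").
COORDINATION = re.compile(r"\b([A-Za-z]+)(\d+)\s*(?:/|and)\s*(?:\1)?(\d+)\b")


def expand_coordination(text: str) -> list[str]:
    """Return the full identifiers implied by a shortened two-item list."""
    expanded = []
    for prefix, first, second in COORDINATION.findall(text):
        expanded.extend([f"{prefix}{first}", f"{prefix}{second}"])
    return expanded
```

For example, `expand_coordination("Kinases ERK1 and 2")` yields both `ERK1` and `ERK2`, so each can be looked up in a thesaurus separately.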
MANY MORE EXAMPLES
HTTP://WWW.SLIDESHARE.NET/SCIBITELY/TERMITE-DEALING-WITH-REAL-WORLD
[Pie chart] Accuracy of Termite on a random selection of 400 entries from the BioCreative Gene-Mention Task. Legend: Correct, Disagreement, Incorrect. Values: 88%, 7%, 5%.
http://biocreative.sourceforge.net/
WebConnect – “Termite Live”
http://scibite.com/site/p3/webconnect.html
TEXPRESS
Going Beyond Recognition
NER, Patterns, NLP
Termite is a Named Entity Recognition (NER) engine: it finds mentions of “things” in text
Natural Language Processing (NLP) is an area of linguistics that seeks to develop a computer-understandable representation of human text
NLP is both powerful and complex. Human language varies greatly, so there are many facets to consider in NLP results
Critically, many use-cases do not require full NLP; users simply wish to “identify any relationships between entities in the text”
TExpress uses “patterns” to achieve this
An Example
Use case: Scan an input set of documents, identify disease-gene relationships within the text and output these to a file for downstream processing
We supply a simple pattern, Indication{0,3}(Gene|Protein_Class), which means:
Find an indication
Followed by 0-3 other words
And then a gene or protein class
It’s critical to use “(Gene|Protein_Class)” when looking for gene/protein info, as classes are often used (see purple text below).
For example, on the text:
“Simvastatin induces heme oxygenase-1 expression but fails to reduce inflammation in the capsule surrounding a silicone shell implant in rats”
[DRUG:CHEMBL1064]simvastatin [VERB:!INDUCES]induces [GENE:HMOX1]heme _oxygenase _1 [VERB:!EXPRESSION]expression but {NEG}fails to [VERB:!REDUCE]reduce [INDICATION:D007249]inflammation in the capsule surrounding a silicone shell implant in [ORG:RAT]rats
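For downstream processing, an annotated line like the one above could be turned into structured records along these lines. This is a sketch based only on the single example shown: the `Tagged` type, the `parse` and `has_negation` helpers, and the reading that a leading underscore marks the continuation of a multi-word entity are all assumptions, not a documented TExpress format.

```python
import re
from typing import NamedTuple


class Tagged(NamedTuple):
    entity_type: str  # e.g. "DRUG", "GENE", "VERB"
    entity_id: str    # e.g. "CHEMBL1064", "HMOX1", "!INDUCES"
    text: str         # the matched words from the source sentence


# '[TYPE:ID]' followed by one word, plus any further words that start with
# an underscore (assumed to continue a multi-word entity).
TAG = re.compile(r"\[([A-Z_]+):([^\]]+)\](\S+(?:\s_\S+)*)")


def parse(line: str) -> list[Tagged]:
    """Extract every tagged entity, rejoining multi-word entity text."""
    return [
        Tagged(kind, ident, re.sub(r"\s_", " ", text))
        for kind, ident, text in TAG.findall(line)
    ]


def has_negation(line: str) -> bool:
    """True when the annotated line carries a {NEG} marker."""
    return "{NEG}" in line


LINE = (
    "[DRUG:CHEMBL1064]simvastatin [VERB:!INDUCES]induces "
    "[GENE:HMOX1]heme _oxygenase _1 [VERB:!EXPRESSION]expression but "
    "{NEG}fails to [VERB:!REDUCE]reduce [INDICATION:D007249]inflammation "
    "in the capsule surrounding a silicone shell implant in [ORG:RAT]rats"
)
```

Calling `parse(LINE)` gives seven records, including a `GENE` record with ID `HMOX1`, and `has_negation(LINE)` flags the "fails to" clause.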
Identifying Causal Relationships
E.g. we want to look for drugs that treat Lymphocytic Choriomeningitis Virus (LCMV)
We use the pattern: DRUG.{0,1}:treat.{0,1}:INDICATION(D001117)
Which is translated as:
Find any drug in close proximity to the verb “treat”
Followed closely by the specific indication (D001117 is the ID for LCMV)
From the following text, we obtain the computer-readable result below:
“To investigate its therapeutic potential, we used rapamycin to treat Lymphocytic Choriomeningitis Virus (LCMV)-infected perforin-deficient (Prf1(-/-)) mice according to a well-established model of HLH”
[DRUG:CHEMBL413]rapamycin to [VERB:!TREAT]treat [D001117]lymphocytic _choriomeningitis _virus
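The "drug, then treat, then indication, each within one word" reading of the pattern can be sketched as a proximity search over tagged tokens. This is not the TExpress matcher: the `Token` type and `find_treatments` function are hypothetical, and the indication token is given an explicit `INDICATION` kind here even though the output line above shows only its ID.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    text: str
    kind: Optional[str] = None   # "DRUG", "VERB", "INDICATION"; None for plain words
    ident: Optional[str] = None  # e.g. "CHEMBL413", "!TREAT", "D001117"


def find_treatments(tokens: list, indication_id: str, max_gap: int = 1) -> list:
    """Return (drug, indication) pairs where DRUG, the verb 'treat' and the
    given indication follow each other with at most max_gap words between."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.kind != "DRUG":
            continue
        # The verb may sit up to max_gap tokens after the drug.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j].kind != "VERB" or tokens[j].ident != "!TREAT":
                continue
            # The indication may sit up to max_gap tokens after the verb.
            for k in range(j + 1, min(j + 2 + max_gap, len(tokens))):
                if tokens[k].kind == "INDICATION" and tokens[k].ident == indication_id:
                    hits.append((tok, tokens[k]))
    return hits


SENTENCE = [
    Token("rapamycin", "DRUG", "CHEMBL413"),
    Token("to"),
    Token("treat", "VERB", "!TREAT"),
    Token("lymphocytic choriomeningitis virus", "INDICATION", "D001117"),
]
```

On `SENTENCE`, `find_treatments(SENTENCE, "D001117")` returns the rapamycin and LCMV pair; with `max_gap=0` it returns nothing, because the word "to" sits between the drug and the verb.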
Other features
Extension (pattern will match multiple entities in a list)
<Indication><gene> will find all genes in “Cancer due to mutations in p53, SCA1 and BRCA1”
Negativity
TExpress will note where the extracted phrase contains negative keywords or sentiment
Verb Extraction
Identify causal/action relationships and return the verb used, e.g. <gene> <any_verb> <gene> on “p53 binds mdm2” => binds
Auto-continuation
We’ll match multiple entities of the same type in a list (e.g. matching both drugs in the phrase “cancer can be treated with paclitaxel and bortezomib”) using an “<INDICATION><DRUG>” pattern
Why TExpress?
Built on Termite with all its advantages (quality thesauri, ambiguity processing, coverage)
Simple patterns, easy to create and understand
High performance/scalability (around 10% slower than Termite alone)
Supports narrow focus (e.g. “<Gene1> inhibits <Gene2>”) and wide focus (e.g. “<Gene1> <any_verb> <Gene2>”) relationships
Simple JSON, TSV or XML output
Want to know more?
Ask us for a demo today!
Email: info@scibite.com
Twitter: @scibitely
Call Us: +44 (0)20 8819 2776