camtology: intelligent information access for scienceejb1/camtology-demo.pdfcamtology: intelligent...

Camtology: Intelligent Information Access for Science

Ted Briscoe1,2, Karl Harrison5, Andrew Naish-Guzman4, Andy Parker1,Advaith Siddharthan3, David Sinclair4, Mark Slater5 and Rebecca Watson2

1University of Cambridge 2iLexIR Ltd 4Camtology Ltd 5University of [email protected],[email protected],

[email protected] of Aberdeen

[email protected]

[email protected],[email protected]

[email protected],[email protected]

AbstractWe describe a novel semantic search enginefor scientific literature. The Camtology sys-tem allows for sentence-level searches of PDFfiles and combines text and image searches,thus facilitating the retrieval of informationpresent in tables and figures. It allows the userto generate complex queries for search termsthat are related through particular grammati-cal/semantic relations in an intuitive manner.The system uses Grid processing to parallelisethe analysis of large numbers of papers.

1 IntroductionScientific, technological, engineering and medi-cal (STEM) research is entering the so-called 4thParadigm of “data-intensive scientific discovery”, inwhich advanced data mining and pattern discoverytechniques need to be applied to vast datasets in or-der to drive further discoveries. A key componentof this process is efficient search and exploitation ofthe huge repository of information that only exists intextual or visual form within the “bibliome”, whichitself continues to grow exponentially.

Today’s computationally driven research methodshave outgrown traditional methods of searching forscientific data, creating a widespread and unfulfilledneed for advanced search and information extrac-tion. Camtology combines text and image process-ing to create a unique solution to this problem.

2 StatusCamtology has developed a search and informationextraction system which is currently undergoing us-ability testing with the curation team for FlyBase1,a $1m/year NIH-funded curated database coveringthe functional genomics of the fruit fly. To providea scalable solution capable of analysing the entireSTEM bibliome of over 20m electronic journal and

1http://flybase.org/

conference papers, we have developed a robust sys-tem that can be used with a grid of computers run-ning distributed job management software.

This system has been deployed and tested usinga subset of the resources provided by the UK Gridfor Particle Physics (Britton et al., 2009), part of theworldwide Grid of around 200000 CPUs assembledto allow analysis of the petabyte-scale data volumesto be recorded each year by experiments at the LargeHadron Collider in Geneva. Processing of the Fly-Base archive of around 15000 papers required about8000 hours of CPU, and has been successfully com-pleted in about 3 days, with up to a few hundred jobsrun in parallel. A distributed spider for collectingopen-source PDF documents has also been devel-oped. This has been run concurrently on over 2000CPU cores, and has used to retrieve over 350000subject-specific papers, but these are not consideredin the present demo.

3 FunctionalityCamtology’s search and extraction engine is the firstto integrate a full structural analysis of a scien-tific paper in PDF format identifying headings, sec-tions, captions and associated figures, citations andreferences with a sentence-by-sentence grammati-cal analysis of the text and direct visual search overfigures. Combining these capabilities allows us totransform paper search from keyword based paperretrieval, where the end result is a set of putativelyrelevant PDF files which need to be read, to informa-tion extraction based on the ability to interactivelyspecify a rich variety of linguistic patterns whichreturn sentences in specific document locales, andwhich combine text with image-based constraints;for instance:

“all sentences in figure captions which containany gene name as the theme of the action ‘ex-press’ where the figure is a picture of an eye”

Camtology allows the user to build up such com-plex queries quickly though an intuitive process ofquery refinement.

Figures often convey information crucial to theunderstanding of the content of a paper and are typ-ically not available to search. Camtology’s searchengine integrates text search to the figure and cap-tion level with the ability to re-rank search returns onthe basis of visual similarity to a chosen archetype(ambiguities in textual relevance are often resolvedby visual appearance). Figure 1 provides a compactoverview of the search functionality supported byour current demonstrator. Interactively, constructingand running such complex queries takes a few sec-onds in our intuitive user interface, and allows theuser to quickly browse and then aggregate informa-tion across the entire collection of papers indexed bythe system. For instance, saving the search resultfrom the example above would yield a computer-readable list of gene names involved in eye develop-ment (in fruit flies in our demonstrator) in a secondor so. With existing web portals and keyword basedselection of PDF files (for example, Google Scholar,ScienceDirect, DeepDyve or PubGet), a query likethis would typically take many hours to open andread each one using cut and paste to extract genenames (and excludes the possibility of ordering re-sults on a visual basis). The only other alterna-tive would require expensive bespoke adaptation ofa text mining system by IT professionals using li-censed software (such as Ariadne Genomics, Temisor Linguamatics). This option is only available to atiny minority of researchers working for large well-funded corporations.

4 Summary of Technology4.1 PDF to SciXMLThe PDF format represents a document in amanner designed to facilitate printing. In short,it provides information on font and position fortextual and graphical units. To enable informa-tion retrieval and extraction, we need to convertthis typographic representation into a logical onethat reflects the structure of scientific documents.We use an XML schema called SciXML (firstintroduced in Teufel et al. (1999)) that we extendto include images. We linearise the textual ele-ments in the PDF, representing these as <div>elements in XML and classify these divisions as{Title|Author|Affiliation|Abstract|Footnote|Caption|

Heading|Citation| References|Text} in a constraintsatisfaction framework.

In addition, we identify all graphics in the PDF,including lines and images. We then identify ta-bles by looking for specific patterns of text andlines. A bounding box is identified for a tableand an image is generated that overlays the texton the lines. Similarly we overlay text onto im-ages that have been identified and identify bound-ing boxes for figures. This representation allowsus to retrieve figures and tables that consist of textand graphics. Once bounding boxes for tables orfigures have been identified, we identify a one-oneassociation between captions and boxes that min-imises the total distance between captions and theirassociated figures or tables. The image is then ref-erenced from the caption using a “SRC” attribute;for example, in (abbreviated for space constraints):

<CAPTION SRC=”FBrf0174566 fig 6 o.png”><b>Fig. 6. </b> Phenotypicanalysis of denticle belt fusionsduring embryogenesis. (A)The denticle belt fusion phe-notype resulted in folds aroundthe surrounding fused... ...(G)...the only cuticle phenotypeof the DN-EGFR-expressingembryos was strong denticlebelt fusions in alternatingparasegments (<i>paired</i>domains).</CAPTION>

Note how informative the caption is, and the valueof being able to search this caption in conjunctionwith the corresponding image (also shown above).

4.2 Natural Language ProcessingEvery sentence, including those in abstracts, titlesand captions, is run through our named-entity recog-niser and syntactic parser. The output of these sys-tems is then indexed, enabling semantic search.

Named Entity RecognitionNER in the biomedical domain was implementedas described in Vlachos (2007). Gene Mentiontagging was performed using Conditional RandomFields and syntactic parsing, using features derivedfrom grammatical relations to augment the tagging.We also use a probabilistic model for resolution ofnon-pronominal anaphora in biomedical texts. Themodel focuses on biomedical entities and seeks tofind the antecedents of anaphora, both coreferentand associative ones, and also to identify discourse-new expressions (Gasperin and Briscoe, 2008).

ParsingThe RASP toolkit (Briscoe et al., 2006) is usedfor sentence boundary detection, tokenisation, PoStagging and finding grammatical relations (GR) be-tween words in the text. GRs are triplets consistingof a relation-type and two arguments and also en-code morphology, word position and part-of-speech;for example, parsing “John likes Mary.” gives us asubject relation and a direct object relation:

(|ncsubj| |like+s:2 VVZ| |John:1 NP1|)(|dobj| |like+s:2 VVZ| |Mary:3 NP1|)

Representing a parse as a set of flat triplets allowsus to index on grammatical relations, thus enablingcomplex relational queries.

4.3 Image ProcessingWe build a low-dimensional feature vector to sum-marise the content of each extracted image. Colourand intensity histograms are encoded in a short bitstring which describes the image globally; this isconcatenated with a description of the image derivedfrom a wavelet decomposition (Jacobs et al., 1995)that captures finer-scale edge information. Efficientsimilar image search is achieved by projecting thesefeature vectors onto a small number of randomly-generated hyperplanes and using the signs of theprojections as a key for locality-sensitive hashing(Gionis et al., 1999).

4.4 Indexing and SearchWe use Lucene (Goetz, 2002) for indexing and re-trieving sentences and images. Lucene is an opensource indexing and information retrieval librarythat has been shown to scale up efficiently and han-dle large numbers of queries. We index using fieldsderived from word-lemmas, grammatical relationsand named entities. At the same time, these complexrepresentations are hidden from the user, who, as afirst step, performs a simple keyword search; for ex-ample “express Vnd”. This returns all sentences thatcontain the words “express” and “Vnd” (search is onlemmatised words, so morphological variants of ex-press will be retrieved). Different colours representdifferent types of biological entities and processes(green represents a gene), and blue words show theentered search terms in the result sentences an exam-ple sentence retrieved for the above query follows:

It is possible that like ac , sc and l’sc ,vnd is expressed initially in cell clusters and

then restricted to single cells .

Next, the user can select specific words in thereturned sentences to indirectly specify a relation.Clicking on a word will select it, indicated by un-derlining of the word. In the example above, thewords “vnd” and “expressed” have been selected bythe user. This creates a new query that returns sen-tences where “vnd” is the subject of “express” andthe clause is in passive voice. This retrieval is basedon a sophisticated grammatical analysis of the text,and can retrieve sentences where the words in therelation are far apart; an example of a sentence re-trieved for the refined query is shown below:

First , vnd might be spatially regulated in amanner similar to ac and sc and selectivelyexpressed in these clusters .

Camtology offers two other functionalities. Theuser can browse the MeSH (Medical Subject Head-ings) ontology and retrieve papers relevant to aMeSH term. Also, for both search and MeSH brows-ing, retrieved papers are plotted on a world map; thisis done using converting the affiliations of the au-thors into geospatial coordinates. The user can thendirectly access papers from a particular site.

5 Script OutlineI Quick overview of existing means of searching sci-

ence (PubMed, FlyBase, Google Scholar).II Walk through the functionality of Camtology (these

are numbered in Figure 1:• (1) Initial query through textual search box; (2)

Retrieval of relevant sentences; (3) Query re-finement by clicking on words; (4) Using im-plicit grammatical relations for new search;

• Alternative to search: (5) Browse MeSH On-tology to retrieve papers with MeSH terms.

• (6) Specifically searching for tables/figures• (7) Viewing the affiliation of the authors of re-

trieved papers on a world map.• (8) Image search using similarity of image.

6 AcknowledgementsThis work was supported in part by a STFC miniP-IPSS grant to the University of Cambridge andiLexIR Ltd.

ReferencesT. Briscoe, J. Carroll, and R. Watson. 2006. The second

release of the RASP system. In Proc. ACL 2006.D. Britton, AJ Cass, PEL Clarke, et al. 2009. GridPP:

the UK grid for particle physics. Philosophical Trans-actions A, 367(1897):2447.

C. Gasperin and T. Briscoe. 2008. Statistical anaphoraresolution in biomedical texts. In Proc. COLING’08.

A. Gionis, P. Indyk, and R. Motwani. 1999. Similaritysearch in high dimensions via hashing. In Proc. 25thACM Internat. Conf. on Very Large Data Bases.

B. Goetz. 2002. The Lucene search engine: Powerful,flexible, and free. Javaworld http://www. javaworld.com/javaworld/jw-09-2000/jw-0915-lucene. html.

C.E. Jacobs, A. Finkelstein, and D.H. Salesin. 1995. Fast

multiresolution image querying. In Proc. 22nd ACMannual conference on Computer graphics and interac-tive techniques.

S. Teufel, J. Carletta, and M. Moens. 1999. An annota-tion scheme for discourse-level argumentation in re-search articles. In Proc. EACL’99.

A. Vlachos. 2007. Tackling the BioCreative2 gene men-tion task with CRFs and syntactic parsing. In Proc.2nd BioCreative Challenge Evaluation Workshop.

Figu

re1:

Scre

ensh

ots

show

ing

func

tiona

lity

ofth

eC

amto

logy

sear

chen

gine

.

camtology: intelligent information access for scienceejb1/camtology-demo.pdfcamtology: intelligent...

Documents