Using Physical Quantities to Find Similar Documents John Tait johntait.net Ltd. [email protected]

Post on 28-Dec-2015

Overview

• The Problem
• Finding relevant physical quantities in documents
• Some Solutions
• Concluding Remarks

The Problem

• Many searches in Business Intelligence areas like Technology Landscaping, or Techno-Legal areas like Freedom-to-Operate searches, involve searching documents which mention physical quantities like metres, kilograms and degrees centigrade
• Document similarity searching (including Boolean) based on bag-of-words models works badly for these sorts of queries

Example Search 1

Are there any enforceable patents related to a manufacturing process using trifluoromethanesulfonate as a reagent at approximately 22°C?

• Relevant documents include those with temperatures expressed in K and °F
• Note also the implied range
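As a rough illustration of what "approximately 22°C" implies for a search, the sketch below turns the query value into a range and expresses that range in K and °F. The ±2 °C tolerance and all function names are assumptions chosen for the example; the slides do not say how wide the implied range should be.

```python
# Sketch only: the tolerance is an assumed value, not one given in the slides.

def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15


def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


def approx_range_celsius(value, tolerance=2.0):
    """Interpret 'approximately value' as the interval value +/- tolerance."""
    return (value - tolerance, value + tolerance)


low_c, high_c = approx_range_celsius(22.0)
range_k = (celsius_to_kelvin(low_c), celsius_to_kelvin(high_c))
range_f = (celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c))

print(f"{low_c:.1f} to {high_c:.1f} °C")
print(f"{range_k[0]:.2f} to {range_k[1]:.2f} K")
print(f"{range_f[0]:.1f} to {range_f[1]:.1f} °F")
```

With the assumed tolerance, a document reporting 295 K or 72 °F falls inside the range even though it never mentions 22°C.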

Example Search 2

Have we any internal test documents which report a torque in excess of 1000 ft lbf for an electric motor suitable for installation in a car?

• Relevant documents include documents reporting N m, possibly also bhp, kW etc.
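To make the unit mismatch concrete, here is a minimal sketch that compares a torque reported in N m against a threshold given in ft lbf. The function and table names are hypothetical. The conversion uses the standard definitions of the foot (0.3048 m) and the pound-force (4.4482216152605 N). Units such as bhp and kW are left out because they measure power, and relating them to torque would also need a rotational speed.

```python
# Sketch only: names are illustrative, not taken from any search product.

FOOT_IN_METRES = 0.3048
LBF_IN_NEWTONS = 4.4482216152605

# Factors that convert each torque unit into newton metres.
TORQUE_TO_NEWTON_METRES = {
    "N m": 1.0,
    "ft lbf": FOOT_IN_METRES * LBF_IN_NEWTONS,
}


def torque_in_newton_metres(value, unit):
    """Normalise a torque value to newton metres."""
    return value * TORQUE_TO_NEWTON_METRES[unit]


def exceeds_threshold(value, unit, threshold_value, threshold_unit):
    """True if the reported torque is strictly above the threshold."""
    reported = torque_in_newton_metres(value, unit)
    threshold = torque_in_newton_metres(threshold_value, threshold_unit)
    return reported > threshold


threshold_nm = torque_in_newton_metres(1000, "ft lbf")
print(f"1000 ft lbf is about {threshold_nm:.1f} N m")
print(exceeds_threshold(1400, "N m", 1000, "ft lbf"))
print(exceeds_threshold(1300, "N m", 1000, "ft lbf"))
```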

Four solutions

• Based on a survey in the LinkedIn Group “Information Access and Search Professional”
• Thanks to: Mathew Kesler, Helmut Berger, Seth Grimes, Kevin Watters, Gerard DuPont, Christopher Frenz, Robert Peterson, Marat Shaidulatov

Solution 1: Synonym Query Expansion

• Use a comprehensive list of units (e.g. http://www.unc.edu/~rowlett/units/index.html ) to identify synonyms for the units in the search specification, then use e.g. Boolean searching and manual result set refinement to obtain a suitable result set
• Probably effective but heavy on searcher effort
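A minimal sketch of the expansion step follows. The synonym table is a tiny hand-written sample standing in for a comprehensive unit list, and the Boolean syntax (quoted phrases joined with AND/OR) is a generic assumption rather than the syntax of any particular search engine.

```python
# Sketch only: UNIT_SYNONYMS is an illustrative sample, not a complete list.

UNIT_SYNONYMS = {
    "degree Celsius": ["°C", "deg C", "degrees Celsius", "degrees centigrade"],
    "foot-pound force": ["ft lbf", "ft-lb", "foot-pound", "foot pounds"],
}


def expand_unit(unit_name):
    """Return an OR group covering the known spellings of a unit."""
    terms = UNIT_SYNONYMS.get(unit_name, [unit_name])
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(keywords, unit_name):
    """Combine topic keywords with the expanded unit group."""
    keyword_part = " AND ".join(f'"{keyword}"' for keyword in keywords)
    return f"{keyword_part} AND {expand_unit(unit_name)}"


query = build_query(["torque", "electric motor"], "foot-pound force")
print(query)
```

The expansion only covers spellings of the same unit. Equivalent values in other units, and the manual refinement of the result set, remain the searcher's job, which is where the effort noted above comes from.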

Solution 2: System with faceting

• Use physical quantities as a facet
• Ensure documents contain suitable metadata to facilitate the search
• Endeca looks good here: http://www.endeca.com/en/home.html although it remains to be seen what will happen under Oracle ownership
• Requires good metadata – can be hard to arrange for large existing collections
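The idea of a physical-quantity facet can be sketched in a few lines of plain Python. The documents, the `quantity_kinds` metadata field and the helper names are all invented for illustration; this is not Endeca's API, only the general shape of facet counting and drill-down.

```python
# Sketch only: invented documents and field names, no real faceting engine.
from collections import Counter

documents = [
    {"id": "doc-1", "quantity_kinds": ["temperature", "pressure"]},
    {"id": "doc-2", "quantity_kinds": ["torque"]},
    {"id": "doc-3", "quantity_kinds": ["temperature"]},
    {"id": "doc-4", "quantity_kinds": []},  # no quantity metadata available
]


def facet_counts(docs, field):
    """Count how many documents carry each facet value."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.get(field, [])))
    return counts


def filter_by_facet(docs, field, value):
    """Drill down to the documents tagged with one facet value."""
    return [doc["id"] for doc in docs if value in doc.get(field, [])]


print(facet_counts(documents, "quantity_kinds"))
print(filter_by_facet(documents, "quantity_kinds", "temperature"))
```

A document without the metadata, like `doc-4` here, never appears under any facet, which illustrates why missing metadata is a problem for large existing collections.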

Solution 3: Normalise input documents

• Use a text annotation system like Gate Mimir (http://gate.ac.uk/family/mimir.html ) combined with the Tagger Measurements (http://gate.ac.uk/gate/doc/plugins.html#Tagger_Measurements ) and possibly machine learning to annotate the documents with normalised measurements
• Use a standard search system (e.g. Lucene/SOLR) to do the searching
• Requires a project for your application
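The annotate-then-normalise step can be illustrated with a small regular-expression stand-in. This is not the GATE API and is far cruder than a real measurement tagger: it handles only three temperature units, and the pattern, field names and sample sentence are assumptions made for the example.

```python
# Sketch only: a regex stand-in for a measurement tagger, temperatures only.
import re

# A number, optional whitespace, then one of three temperature units.
MEASUREMENT = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|°F|K)\b")

TO_KELVIN = {
    "°C": lambda v: v + 273.15,
    "°F": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
    "K": lambda v: v,
}


def annotate_temperatures(text):
    """Return one annotation per temperature, normalised to kelvin."""
    annotations = []
    for match in MEASUREMENT.finditer(text):
        value = float(match.group(1))
        unit = match.group(2)
        annotations.append({
            "text": match.group(0),
            "start": match.start(),
            "kelvin": round(TO_KELVIN[unit](value), 2),
        })
    return annotations


sample = "The reagent was added at 72 °F and the mixture later held at 295 K."
for annotation in annotate_temperatures(sample):
    print(annotation)
```

The normalised `kelvin` values could then be stored in a numeric field of a standard search index so that a single range query covers all source units. Hyphenated ranges such as "20-25 °C" are not handled by this pattern, which is one reason a real deployment needs a proper tagger and a project of its own.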

Solution 4: Specialised Search system

• Use a system with in-built knowledge of ranges, units and physical quantities on both query and indexing sides
• E.g. Max.recall’s Quantalyze (https://www.quantalyze.com/en/ )
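What "knowledge of ranges on both query and indexing sides" might mean can be sketched as interval matching over normalised values. This is a hypothetical toy, not a description of how Quantalyze works: the class, the in-memory index and the kelvin values are all invented for illustration.

```python
# Sketch only: a toy in-memory index, not any product's actual design.
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantityRange:
    """A closed interval in a normalised unit (kelvin in this example)."""
    low: float
    high: float

    def overlaps(self, other):
        """True if the two closed intervals share at least one value."""
        return self.low <= other.high and other.low <= self.high


# Index side: each document's temperature, already normalised to kelvin.
# A single reported value is stored as a zero-width range.
index = {
    "doc-a": QuantityRange(293.15, 298.15),  # reported as a range
    "doc-b": QuantityRange(350.0, 350.0),    # single value, too hot
    "doc-c": QuantityRange(295.37, 295.37),  # single value
}


def search(quantity_index, query_range):
    """Return the ids of documents whose range overlaps the query range."""
    return sorted(
        doc_id
        for doc_id, doc_range in quantity_index.items()
        if doc_range.overlaps(query_range)
    )


# Query side: an approximate temperature expressed as a range in kelvin.
query = QuantityRange(293.15, 297.15)
print(search(index, query))
```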

Conclusions

• Searching for physical quantities is a real and pressing problem for many professional searchers
• Effective solutions now exist for both one-off requirements and long-term needs

Acknowledgements

• Francisco De Sousa Webber, CEO of the IRF, who originally introduced me to the problem
• Mike Baycroft, CEO of Fairview Research and IFI Claims, for many stimulating discussions

taming quantities in text

For more information, email [email protected]