Overview
• The Problem: finding relevant physical quantities in documents
• Some Solutions
• Concluding Remarks
The Problem
• Many searches in Business Intelligence areas such as Technology Landscaping, and in techno-legal areas such as Freedom-to-Operate, involve searching documents which mention physical quantities expressed in units like metres, kilograms and degrees centigrade
• Document similarity searching (including Boolean searching) based on bag-of-words models works badly for these sorts of queries, because such models treat equivalent values such as 22 °C and 295 K as unrelated terms
Example Search 1
Are there any enforceable patents related to a manufacturing process using trifluoromethanesulfonate as a reagent at approximately 22 °C?
• Relevant documents include those with temperatures expressed in K and °F
• Note also the implied range ("approximately")
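To make the unit and range issue in Example Search 1 concrete, here is a minimal Python sketch that converts temperatures reported in K or °F to °C and checks them against "approximately 22 °C". The function names and the ±3 °C tolerance are assumptions made for illustration; how wide the implied range should be is a decision for the searcher.

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature given in °C, K or °F to degrees Celsius."""
    if unit == "°C":
        return value
    if unit == "K":
        return value - 273.15
    if unit == "°F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported temperature unit: {unit!r}")


def matches_approximately(value: float, unit: str,
                          target_c: float = 22.0,
                          tolerance_c: float = 3.0) -> bool:
    """Return True if the temperature lies within target ± tolerance.

    The default tolerance of 3 °C is an assumed reading of
    "approximately"; it is not taken from the search specification.
    """
    return abs(to_celsius(value, unit) - target_c) <= tolerance_c
```

With these assumptions, a document reporting 295 K or 72 °F would be treated as relevant, while one reporting 350 K would not.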
Example Search 2
Do we have any internal test documents which report a torque in excess of 1000 ft lbf for an electric motor suitable for installation in a car?
• Relevant documents include those reporting N m, and possibly also bhp, kW etc.
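The conversions behind Example Search 2 can be sketched as follows. Torque in ft lbf converts directly to N m. Figures in kW or bhp are power rather than torque, so they can only be compared with the torque threshold if the document also gives a rotational speed. The function names are hypothetical, and "bhp" is assumed here to mean mechanical horsepower (550 ft lbf per second).

```python
import math

# 1 ft = 0.3048 m and 1 lbf = 4.4482216152605 N, so 1 ft lbf in N m is:
LBF_FT_TO_NM = 0.3048 * 4.4482216152605

# Power units in watts; bhp is assumed to be mechanical horsepower.
WATTS_PER_UNIT = {"W": 1.0, "kW": 1000.0, "bhp": 550 * LBF_FT_TO_NM}


def torque_nm(value: float, unit: str) -> float:
    """Convert a torque given in N m or ft lbf to N m."""
    if unit == "N m":
        return value
    if unit == "ft lbf":
        return value * LBF_FT_TO_NM
    raise ValueError(f"unsupported torque unit: {unit!r}")


def torque_from_power_nm(power: float, unit: str, rpm: float) -> float:
    """Derive torque in N m from power and rotational speed (P = torque * omega)."""
    omega = 2.0 * math.pi * rpm / 60.0  # angular speed in rad/s
    return power * WATTS_PER_UNIT[unit] / omega


def exceeds_threshold(value: float, unit: str,
                      threshold_ft_lbf: float = 1000.0) -> bool:
    """True if the reported torque is above the threshold from the query."""
    return torque_nm(value, unit) > torque_nm(threshold_ft_lbf, "ft lbf")
```

A report of 1400 N m exceeds the 1000 ft lbf threshold (about 1356 N m), whereas 1300 N m does not.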
Four Solutions
Based on a survey in the LinkedIn group “Information Access and Search Professional”
Thanks to: Mathew Kesler, Helmut Berger, Seth Grimes, Kevin Watters, Gerard DuPont, Christopher Frenz, Robert Peterson, Marat Shaidulatov
Solution 1: Synonym Query Expansion
• Use a comprehensive list of units (e.g. http://www.unc.edu/~rowlett/units/index.html ) to identify synonyms for the units in the search specification, then use e.g. Boolean searching and manual refinement to obtain a suitable result set
• Probably effective, but heavy on searcher effort
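A minimal sketch of synonym query expansion in Python is shown below. The synonym table is a small hand-written assumption standing in for a comprehensive list of units, and the generic AND/OR syntax is illustrative rather than that of any particular search system.

```python
# Hypothetical, deliberately tiny synonym table; a real one would be
# built from a comprehensive list of units.
UNIT_SYNONYMS = {
    "°C": ["°C", "deg C", "degrees Celsius", "degrees centigrade"],
    "N m": ["N m", "Nm", "newton metre", "newton meter"],
}


def expand_unit(unit: str) -> str:
    """Return an OR group covering the known spellings of a unit."""
    variants = UNIT_SYNONYMS.get(unit, [unit])
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"


def build_query(keywords: list[str], unit: str) -> str:
    """Combine the subject keywords with the expanded unit group."""
    parts = [f'"{k}"' for k in keywords] + [expand_unit(unit)]
    return " AND ".join(parts)


query = build_query(["electric motor", "torque"], "N m")
```

Note that this expands only the spellings of one unit. It does not convert values between units, which is one reason the result set still needs manual refinement.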
Solution 2: System with Faceting
• Use physical quantities as a facet
• Ensure documents contain suitable metadata to facilitate the search
• Endeca looks good here: http://www.endeca.com/en/home.html although it remains to be seen what will happen under Oracle ownership
• Requires good metadata, which can be hard to arrange for large existing collections
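The idea of a physical quantity as a facet can be sketched in a few lines of Python. This assumes each document already carries a normalised metadata field; the field name `temperature_c` and the bucket boundaries are hypothetical, and the sketch does not reflect how Endeca or any other product implements faceting.

```python
from collections import Counter

# Hypothetical facet ranges in °C (lower bound inclusive, upper exclusive).
BUCKETS = [(-273.15, 0.0), (0.0, 50.0), (50.0, 200.0)]


def bucket_label(temp_c: float) -> str:
    """Return the label of the facet range containing the temperature."""
    for low, high in BUCKETS:
        if low <= temp_c < high:
            return f"{low:g} to {high:g} °C"
    return "other"


def facet_counts(docs: list[dict]) -> Counter:
    """Count documents per temperature range, as a faceted UI would display."""
    return Counter(bucket_label(d["temperature_c"]) for d in docs)


docs = [
    {"id": "a", "temperature_c": 22.0},
    {"id": "b", "temperature_c": 21.85},
    {"id": "c", "temperature_c": 120.0},
]
counts = facet_counts(docs)
```

The searcher would then pick the range of interest instead of guessing which units and values the documents use. The hard part, as noted above, is getting the metadata into the documents in the first place.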
Solution 3: Normalise Input Documents
• Use a text annotation system like GATE Mimir (http://gate.ac.uk/family/mimir.html ) combined with the Tagger Measurements plugin (http://gate.ac.uk/gate/doc/plugins.html#Tagger_Measurements ) and possibly machine learning to annotate the documents with normalised measurements
• Use a standard search system (e.g. Lucene/Solr) to do the searching
• Requires a project for your application
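To illustrate what "annotate the documents with normalised measurements" means, here is a rough regular-expression stand-in written in Python. It is not the GATE measurements tagger and does not use its API; the conversion table, field names and pattern are all assumptions, and cases such as ranges ("20-25 °C") are not handled.

```python
import re

# Hypothetical table: unit spelling -> (normalised field, conversion function).
TO_BASE = {
    "°C": ("temperature_K", lambda v: v + 273.15),
    "K": ("temperature_K", lambda v: v),
    "°F": ("temperature_K", lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15),
    "N m": ("torque_Nm", lambda v: v),
    "ft lbf": ("torque_Nm", lambda v: v * 1.3558179483314004),
}

# Longest unit spellings first, so "ft lbf" is tried before shorter units.
_UNITS = "|".join(re.escape(u) for u in sorted(TO_BASE, key=len, reverse=True))
# A number, optional whitespace, a known unit, and no letter directly after
# the unit (so the "K" in "Kg" is not read as kelvin).
MEASUREMENT = re.compile(rf"(-?\d+(?:\.\d+)?)\s*({_UNITS})(?![A-Za-z])")


def annotate(text: str) -> list[dict]:
    """Find measurements in text and return them in normalised units."""
    annotations = []
    for match in MEASUREMENT.finditer(text):
        field, convert = TO_BASE[match.group(2)]
        annotations.append({
            "field": field,
            "value": convert(float(match.group(1))),
            "span": match.span(),
        })
    return annotations


anns = annotate("The mixture was stirred at 72 °F, then heated to 295 K.")
```

The normalised values could then be indexed as numeric fields in a standard search system and queried by range, regardless of the unit used in the original text.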
Solution 4: Specialised Search System
• Use a system with in-built knowledge of ranges, units and physical quantities on both the query and indexing sides
• E.g. Max.recall’s Quantalyze (https://www.quantalyze.com/en/ )
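What "ranges on both query and indexing sides" could mean in practice is sketched below in Python: both the document's stated range and the query's implied range are held as intervals in one canonical unit, and a document matches when the intervals overlap. This is purely illustrative and makes no claim about how Quantalyze works internally; the class, function and document names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed range of values in one canonical unit (here kelvin)."""
    low: float
    high: float

    def overlaps(self, other: "Interval") -> bool:
        return self.low <= other.high and other.low <= self.high


def search(index: dict[str, Interval], query: Interval) -> list[str]:
    """Return the ids of documents whose stated range overlaps the query."""
    return [doc_id for doc_id, rng in index.items() if rng.overlaps(query)]


# Index side: ranges extracted from documents, already converted to kelvin.
index = {
    "doc-a": Interval(290.0, 300.0),   # "between 290 and 300 K"
    "doc-b": Interval(350.0, 360.0),
    "doc-c": Interval(295.0, 295.0),   # a single stated value
}
# Query side: "approximately 22 °C", read here as 22 ± 3 °C (an assumption).
hits = search(index, Interval(292.15, 298.15))
```

Here `doc-a` and `doc-c` match and `doc-b` does not, without the searcher having to enumerate units or values.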
Conclusions
• Searching for physical quantities is a real and pressing problem for many professional searchers
• Effective solutions now exist for both one-off requirements and long-term needs
Acknowledgements
• Francisco De Sousa Webber, CEO of the IRF, who originally introduced me to the problem
• Mike Baycroft, CEO of Fairview Research and IFI Claims, for many stimulating discussions