Post on 12-Jan-2016

Text Mining: Opportunities and Barriers

John McNaught

Deputy Director

National Centre for Text Mining

John.McNaught@manchester.ac.uk

Topics

• What is text mining? (briefly)
• What can it offer? (selectively)
• What are the obstacles? (mostly)

NaCTeM

• First publicly-funded (JISC) national text mining centre in the world

• Remit: provide services to research community

• Initial focus on biology, then social sciences, medicine, chemistry, …

• Processing on a large scale, e.g. for UKPMC (Wellcome Trust + 17 other funders)

• www.nactem.ac.uk

What is text mining?

• Goal: Discover new knowledge from old
• How:
  – Process very large amounts of text
    • Millions of documents, the more the better
  – Identify and extract information
  – (Link extracted information to already curated knowledge)
  – Mine to discover implicit significant associations
  – Flag (unknown) associations for researcher to investigate further
  – Spin-off on the way: render information explicit
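
The steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not NaCTeM's pipeline: the three sentences, the term list, and the function names are all hypothetical, term extraction is plain dictionary lookup, and the "mining" step is a simple co-occurrence heuristic (two terms that never appear together but share a linking term are flagged for a researcher to check).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus and term dictionary, invented for illustration.
DOCS = [
    "fish oil reduces blood viscosity",
    "high blood viscosity is observed in raynaud disease",
    "fish oil is rich in omega-3",
]
TERMS = ["fish oil", "blood viscosity", "raynaud disease", "omega-3"]


def extract_terms(text, terms):
    """Identify and extract: return the known terms mentioned in the text."""
    lowered = text.lower()
    return {t for t in terms if t in lowered}


def cooccurrence(docs, terms):
    """Map each term to the set of terms it shares at least one document with."""
    links = defaultdict(set)
    for doc in docs:
        for a, b in combinations(sorted(extract_terms(doc, terms)), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def implicit_associations(links):
    """Flag pairs that never co-occur directly but share a linking term."""
    results = []
    for a, c in combinations(sorted(links), 2):
        if c in links[a]:
            continue  # already explicit in the text, nothing new to flag
        bridges = links[a] & links[c]
        if bridges:
            results.append((a, c, sorted(bridges)))
    return results


links = cooccurrence(DOCS, TERMS)
candidates = implicit_associations(links)
for a, c, bridges in candidates:
    print(f"{a} -- {c} (via {', '.join(bridges)})")
```

On this toy input the sketch flags "fish oil" and "raynaud disease" as a candidate association via "blood viscosity", although no single sentence mentions both. A real system would need proper entity recognition, statistical significance testing, and far more text.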

From text to new knowledge

What does it offer?

• Finds unsuspected knowledge
  – E.g. Disease-gene associations
• Enables discoveries human effort could not achieve (information overload/overlook)
• Enables better search/navigation of literature
  – Semantic search via extracted semantic metadata
• Reduces time spent searching
  – 15-48% of researcher time spent on classic search, 20-50% of classic searches unsatisfied
• E.g. Systematic reviews: months to weeks

What does it offer?

• Text mining boosts research
  – Makes research possible that would otherwise be impossible or unfeasible
• Research drives growth and innovation
• Research produces more information
• More information is available for text mining
• Text mining boosts research …

Barriers

• Access to the literature
• Format issues (tied to next point…)
  – “PDF is evil” (Lynch)
• Main blocks: copyright and licensing issues
  – <8% of scientific claims found in full article appear in its abstract (Blake)
  – Abstracts deficient on argumentation, discussion, methods, background, …
  – Full texts needed to realise full benefits of TM

Barriers

• Need to copy documents to analyse them
• Licences typically not favourable to TM
• Licences established on per-institution basis
  – Prevents community-oriented services
    • Results only for internal use by institutional users
  – Hinders mining over collections of content from different providers
• Inconsistency: human can search and manually analyse, but cannot use machine to do same job on same data already subscribed to

Barriers

• Problem even with liberal OA licences
  – Author attribution required
• Author attribution in a data mining environment is impossible/unfeasible
  – Association finding: cannot track positive, negative, neutral individual author contributions
• Derived works in a TM environment
  – Every author of every text processed to produce new derived knowledge may have a claim…
  – Rights clearance thus an effective barrier

Barriers

• Laudable effort 1: NESLi2 model licence (JISC Collections) allows TM
  – Publisher <> single institution
  – But how many publishers retain TM provisions?
  – But cannot display annotations produced by TM on document itself
• Laudable effort 2: NPG licence for self-archived content allows TM
  – But “content must be destroyed when experiment complete” is vague. So services for community?

Conclusion

• Copyright and licensing restrictions block full realisation of TM benefits
  – Economic savings and potential for growth are stifled
• Japan has introduced an information analysis exception to copyright law
  – National Diet Library (= British Library) has recently changed its motto to: “Through knowledge we prosper”
  – Can we say the same in the UK?

Extras

Info=degree of surprise
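
One standard way to make this slogan precise, assuming it refers to Shannon's information content (the slide itself does not say), is the surprisal of an event x that occurs with probability p(x). The rarer the event, the more bits of information its occurrence carries:

```latex
I(x) = -\log_2 p(x)
\qquad\text{e.g.}\quad
p(x) = \tfrac{1}{2} \;\Rightarrow\; I(x) = 1\ \text{bit},
\qquad
p(x) = \tfrac{1}{1024} \;\Rightarrow\; I(x) = 10\ \text{bits}
```

Read this way, an association that is rarely or never stated in the literature is more informative when found than one that every abstract repeats.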

Finding unknown associations: reproducing a discovery reported 5 days ago in Nature Medicine

UKPMC EvidenceFinder by NaCTeM: Questions generated by deep analysis, with known answers

Click on a question to see relevant extracted evidence (from OA subset of the archive)
