NFAIS Annual Conference 2004: Text Mining and the New Breed of Licensee: The Information Provider’s Perspective. February 23, 2004. Jane L. Rosov, National Library of Medicine, 301-496-7706, [email protected]


Page 1

NFAIS Annual Conference 2004

Text Mining and the New Breed of Licensee: The Information Provider’s Perspective

February 23, 2004

Jane L. Rosov
National Library of Medicine

[email protected]@nlm.nih.gov

Page 2

Where is NLM?

Department of Health and Human Services
Public Health Service
National Institutes of Health
National Library of Medicine
Library Operations
Bibliographic Services Division
MEDLARS Management Section

Page 3

Agenda

Brief overview of NLM’s data distribution program
Who leases our data
Recent changes increased interest in leasing MEDLINE® and streamlined distribution processes
Examples of mining projects
How the Library has adapted to the increased demand for leasing its data

Page 4

NLM Mission

To collect, organize, and disseminate the world’s health-related and biomedical information

Page 5

Dissemination of Data

Web-based products and services
● MEDLINEplus®
● LOCATORplus
● TOXNET®
● ClinicalTrials.gov
● Unified Medical Language System®
● NLM Gateway
● Entrez retrieval system including PubMed®/MEDLINE
Data distribution (leasing) program

Page 6

What is MEDLINE?

12 million+ biomed and life science journal citations
Worldwide coverage; currently 4,700 journals
Advisory committee recommends titles
Includes abstracts if present in the published journals
Controlled vocabulary: Medical Subject Headings
~600,000 new records in 2004
New and revised records daily; annual MeSH® changes
Does not include full text of article
Primary component of PubMed

Page 7

Additional Records in PubMed

Slightly broader journal coverage in life sciences
Citations prior to date journal selected for MEDLINE
Citations to out-of-scope-for-MEDLINE articles
In process records
OLDMEDLINE
~98% of PubMed records are exported

Page 8

Data Distribution Web Pages

Prospective licensees: http://www.nlm.nih.gov/databases/leased.html
Existing licensees: http://www.nlm.nih.gov/bsd/licensee.html
● Paperwork
● DTDs that define the XML format
● Files containing sample records
● Information about distribution media
● Announcements
● Documentation

Page 9

Key Elements of NLM’s Licensing Program

Standard licenses - basic and non-US research-only; no customizing
No charges - funded by NLM Appropriations
No search software - data only
US licensees may redistribute - clauses in license to ensure accuracy and currency, etc.
Advised to consult with legal counsel on re-use of abstracts
● NLM does not claim copyright but others might
● Citations and MeSH Headings are in the public domain

Page 10

Intended Use Worksheet

Indicate databases to lease
Categorize and briefly describe intended use of the data
Indicate organization type
Data used in summary form for reports to Congress

Page 11

Alternatives To Leasing

Web links to PubMed
Downloading from PubMed using utilities
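
The slide says only "utilities". As a minimal sketch, assuming this refers to NCBI's Entrez Programming Utilities, a search request for PubMed could be assembled as below. The base URL, the helper name `build_search_url`, and the example search term are illustrative assumptions, not details from the talk, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-utilities search endpoint (not stated in the talk).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, max_results=20):
    """Build (but do not send) a PubMed search URL for the given term."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical search term, chosen only for illustration.
url = build_search_url("text mining")
print(url)
```

A downloader built this way suits small, targeted result sets; bulk use of the whole database is the case the leasing program addresses.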


Page 13

Then and Now

International MEDLARS Centers, Redistributors, Researchers

FIRST: International MEDLARS Centers
● Public institutions - bilateral agreements with NLM to perform as biomedical information resource centers in their countries
● Encouraged to provide access to NLM’s data - particularly important when telecommunications for worldwide online access to data in US was less advanced

THEN: Redistributors
Including ARIES, BRS (now OVID), Cambridge Scientific Abstracts (now CSA), Dialog, SilverPlatter, EBSCO, and many others

Page 14

Then and Now (cont.)

International MEDLARS Centers, Redistributors, Researchers

NOW: Researchers
● Academic institutions or biotechnology, pharmaceutical and software development companies
● Mine MEDLINE to discover new clinical, public health and health services information or develop better software to assist in the scientific research

Page 15

Non-US Research-Only License

2001
Internal use
No commercial redistribution of records


Page 17

Self-Described Intended Use - 2003

Total MEDLINE Licensees 219

Research purposes 152 = 69%

Data / Text mining 157 = 72%

Both 123 = 56%
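
The percentages above can be checked with a few lines of arithmetic. This is a sketch that assumes the "Both" row counts licensees who appear in both of the other two rows. Under that assumption, inclusion-exclusion also gives the number of licensees who named at least one of the two uses, a figure that does not appear on the slide.

```python
# Counts taken from the slide.
total_licensees = 219
research = 152
mining = 157
both = 123  # assumed to be the overlap of the two rows above


def pct(count, total=total_licensees):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# Inclusion-exclusion: licensees naming research, mining, or both.
either = research + mining - both
neither = total_licensees - either

print(f"Research purposes: {pct(research)}%")
print(f"Data / text mining: {pct(mining)}%")
print(f"Both: {pct(both)}%")
print(f"At least one of the two: {either} = {pct(either)}%")
print(f"Neither: {neither} = {pct(neither)}%")
```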


Page 19

Why Growth and Switch From Redistributors To Researchers?

Events outside NLM
● Recent developments in computer technology and informatics
● Boom of the biotechnology industry
● Mapping of the human genome
● Increasing volume of information
● More researchers are seeking cures to disease or are developing new tools to support this research
Reinvention at NLM

Page 20

NLM’s Reinvention

Purpose: Move from outmoded and expensive legacy mainframe-based systems to a more flexible, powerful, and maintainable system to support streamlined internal processing and innovative new services

Result: New software environment for building and maintaining MEDLINE

Page 21

New Data Creation and Maintenance System

Enabled changes:
Distribution media - transition from old tape technology to state-of-the-art tapes and FTP
Distribution format - transition from legacy data format to widely accepted XML format
Use of new media and format is more time and cost efficient for NLM and licensees

Page 22

Benefits of Reinvention: Distribution Media

Through 2000:
Tapes from mainframe
150+ tapes for 10 million records
10 hours for one set
Weekly or monthly updates
Cost recovery

2001 After Reinvention:
State-of-the-art DLT tapes and FTP
Single tape for 12+ million records
4.5 hours per tape
Updates 5 days per week via FTP
No cost

Page 23

Benefits of Reinvention: Distribution Media

In 2002 FTP alternative to DLT tape
FTP times
● 3 quickest times for download: ½ hour, 1 hour, 2 hours (all in the US)
● 3 longest times for download: 25 hours, 17.5 hours, 15.5 hours (all outside the US)
Strong preference for FTP - ~70% in 2004 instead of DLT tape

Page 24

Getting Files From NLM’s Public FTP Server

Hidden directories
Restrict access to data files by IP address
Access available 23 hours every day

Page 25

Benefits of Reinvention: Distribution Format

Transition from legacy homegrown ELHILL® Unit Record Format to widely accepted and documented XML format
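
One practical benefit of an XML distribution format is that licensees can read records with standard parsers instead of custom code for a homegrown format. The sketch below illustrates this with Python's standard library. The sample record is hypothetical, and its element names are assumptions made for illustration; the authoritative structure is whatever NLM's DTDs define.

```python
import xml.etree.ElementTree as ET

# Hypothetical record for illustration only; the real element names and
# structure are defined by NLM's DTDs, not by this sketch.
SAMPLE = """
<MedlineCitationSet>
  <MedlineCitation>
    <PMID>0000001</PMID>
    <Article>
      <ArticleTitle>Example title</ArticleTitle>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Example Heading A</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName>Example Heading B</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</MedlineCitationSet>
"""


def parse_citations(xml_text):
    """Extract identifier, title, and subject headings from each citation."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for citation in root.iter("MedlineCitation"):
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "mesh": [d.text for d in citation.iter("DescriptorName")],
        })
    return records


records = parse_citations(SAMPLE)
for record in records:
    print(record["pmid"], record["title"], record["mesh"])
```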

Page 26

Examples Of Mining Projects

Academic Researchers
Bio-acronyms and abbreviations, bio-relations, and proteins
Gene, drug, and disease relationships
Interfaces for performing efficient and effective searches
Identification, prevention, and treatment of emerging infectious disease or biothreat

Page 27

Examples Of Mining Projects (cont.)

Biotechnology Companies
Gene-to-gene interactions and connection to diseases and/or existing drugs
Vaccine research and antibacterial drug discovery

Page 28

Examples Of Mining Projects (cont.)

Pharmaceutical Companies
Support drug discovery and development efforts
Internal access to add unique value to their previously derived data; privacy concerns with external web access

Page 29

Examples Of Mining Projects (cont.)

Software Developers
Data mining methods to help uncover gene/disease relationships leading to discovery of new drugs

Page 30

Winning Combination

High quality content
Identifiable new areas of outside interest
Well-accepted data distribution format
Inexpensive, easy-to-use distribution media