NFAIS Annual Conference 2004
Text Mining and the New Breed of Licensee: The Information Provider’s Perspective
February 23, 2004
Jane L. Rosov
National Library of Medicine
[email protected]@nlm.nih.gov

Where is NLM?
Department of Health and Human Services
Public Health Service
National Institutes of Health
National Library of Medicine
Library Operations
Bibliographic Services Division
MEDLARS Management Section

Agenda
Brief overview of NLM’s data distribution program
Who leases our data
Recent changes increased interest in leasing MEDLINE® and streamlined distribution processes
Examples of mining projects
How the Library has adapted to the increased demand for leasing its data

NLM Mission
To collect, organize, and disseminate the world’s health-related and biomedical information

Dissemination of Data
Web-based products and services
● MEDLINEplus®
● LOCATORplus
● TOXNET®
● ClinicalTrials.gov
● Unified Medical Language System®
● NLM Gateway
● Entrez retrieval system including PubMed®/MEDLINE
Data distribution (leasing) program

What is MEDLINE?
12 million+ biomedical and life science journal citations
Worldwide coverage; currently 4,700 journals
Advisory committee recommends titles
Includes abstracts if present in the published journals
Controlled vocabulary: Medical Subject Headings
~600,000 new records in 2004
New and revised records daily; annual MeSH® changes
Does not include full text of article
Primary component of PubMed

Additional Records in PubMed
Slightly broader journal coverage in life sciences
Citations prior to date journal selected for MEDLINE
Citations to out-of-scope-for-MEDLINE articles
In process records
OLDMEDLINE
~98% of PubMed records are exported

Data Distribution Web Pages
Prospective licensees: http://www.nlm.nih.gov/databases/leased.html
Existing licensees: http://www.nlm.nih.gov/bsd/licensee.html
● Paperwork
● DTDs that define the XML format
● Files containing sample records
● Information about distribution media
● Announcements
● Documentation

Key Elements of NLM’s Licensing Program
Standard licenses - basic and non-US research-only; no customizing
No charges - funded by NLM Appropriations
No search software - data only
US licensees may redistribute - clauses in license to ensure accuracy and currency, etc.
Advised to consult with legal counsel on re-use of abstracts
● NLM does not claim copyright but others might
● Citations and MeSH Headings are in the public domain

Intended Use Worksheet
Indicate databases to lease
Categorize and briefly describe intended use of the data
Indicate organization type
Data used in summary form for reports to Congress

Alternatives To Leasing
Web links to PubMed
Downloading from PubMed using utilities
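
As an illustration of the second alternative, the sketch below builds a request URL for downloading citation records in XML. It assumes that "utilities" refers to NCBI's Entrez Programming Utilities (E-utilities) and their EFetch endpoint, which the slide does not spell out. The helper name build_efetch_url and the PMIDs are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the "utilities" on the slide are the Entrez Programming Utilities.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, retmode="xml"):
    """Return an EFetch URL requesting the given PubMed IDs (hypothetical helper)."""
    if not pmids:
        raise ValueError("at least one PMID is required")
    params = {
        "db": "pubmed",                          # PubMed database
        "id": ",".join(str(p) for p in pmids),   # comma-separated ID list
        "retmode": retmode,                      # ask for XML records
    }
    return EFETCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder PMIDs, used only to show the shape of the URL.
    print(build_efetch_url([11111111, 22222222]))
```

Fetching a handful of records this way suits small, targeted needs; mining the whole database is the case the leasing program is meant for.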

Then and Now
International MEDLARS Centers, Redistributors, Researchers
FIRST: International MEDLARS Centers
● Public institutions - bilateral agreements with NLM to perform as biomedical information resource centers in their countries
● Encouraged to provide access to NLM’s data - particularly important when telecommunications for worldwide online access to data in US was less advanced
THEN: Redistributors
Including ARIES, BRS (now OVID), Cambridge Scientific Abstracts (now CSA), Dialog, SilverPlatter, EBSCO, and many others

Then and Now (cont.)
International MEDLARS Centers, Redistributors, Researchers
NOW: Researchers
● Academic institutions or biotechnology, pharmaceutical and software development companies
● Mine MEDLINE to discover new clinical, public health and health services information or develop better software to assist in the scientific research

Non-US Research-Only License
2001
Internal use
No commercial redistribution of records

Self-Described Intended Use - 2003
Total MEDLINE Licensees 219
Research purposes 152 = 69%
Data / Text mining 157 = 72%
Both 123 = 56%
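
The percentages above can be re-derived from the counts, as in the short check below. The inclusion-exclusion step is an added inference rather than something stated on the slide: it assumes "Both" counts licensees who reported research purposes and data/text mining together.

```python
# Counts as reported on the slide.
total_licensees = 219
research = 152
mining = 157
both = 123


def share(count, total=total_licensees):
    """Percentage of licensees, rounded to the nearest whole percent."""
    return round(100 * count / total)


# Inclusion-exclusion: licensees reporting research, mining, or both.
# Assumes "Both" is the overlap of the two categories above.
either = research + mining - both
neither = total_licensees - either

if __name__ == "__main__":
    print(share(research), share(mining), share(both))  # 69 72 56
    print(either, share(either), neither)               # 186 85 33
```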

Why Growth and Switch From Redistributors To Researchers?
Events outside NLM
● Recent developments in computer technology and informatics
● Boom of the biotechnology industry
● Mapping of the human genome
● Increasing volume of information
● More researchers are seeking cures to disease or are developing new tools to support this research
Reinvention at NLM

NLM’s Reinvention
Purpose: Move from outmoded and expensive legacy mainframe-based systems to a more flexible, powerful, and maintainable system to support streamlined internal processing and innovative new services
Result: New software environment for building and maintaining MEDLINE

New Data Creation and Maintenance System
Enabled changes:
Distribution media - transition from old tape technology to state-of-the-art tapes and FTP
Distribution format - transition from legacy data format to widely accepted XML format
Use of new media and format is more time and cost efficient for NLM and licensees

Benefits of Reinvention: Distribution Media
Through 2000: Tapes from mainframe
● 150+ tapes for 10 million records
● 10 hours for one set
● Weekly or monthly updates
● Cost recovery
2001 After Reinvention: State-of-the-art DLT tapes and FTP
● Single tape for 12+ million records
● 4.5 hours per tape
● Updates 5 days per week via FTP
● No cost

Benefits of Reinvention: Distribution Media
In 2002 FTP alternative to DLT tape
FTP times
● 3 quickest times for download: ½ hour, 1 hour, 2 hours (all in the US)
● 3 longest times for download: 25 hours, 17.5 hours, 15.5 hours (all outside the US)
Strong preference for FTP instead of DLT tape - ~70% in 2004

Getting Files From NLM’s Public FTP Server
Hidden directories
Restrict access to data files by IP address
Access available 23 hours every day
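
Restricting access by IP address can be pictured as an allowlist check, sketched below. This is not NLM's actual server configuration, which the slides do not describe. The function name is_allowed and the networks are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical allowlist; these are documentation-only address ranges,
# standing in for the addresses registered by licensees.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # a whole licensee network
    ipaddress.ip_network("198.51.100.17/32"),  # a single licensee host
]


def is_allowed(client_ip):
    """Return True if client_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.44", "198.51.100.17", "203.0.113.5"):
        print(ip, is_allowed(ip))
```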

Benefits of Reinvention: Distribution Format
Transition from legacy homegrown ELHILL® Unit Record Format to widely accepted and documented XML format
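
One practical benefit of the XML format is that licensees can read records with a standard parser instead of custom code for a homegrown format. The sketch below is a minimal example under stated assumptions: the element names (MedlineCitationSet, MedlineCitation, PMID, ArticleTitle, DescriptorName) are taken to follow NLM's MEDLINE DTD and should be checked against the DTD version actually distributed. The sample record is invented for illustration, and extract_citations is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample record; the PMID, title, and headings are placeholders,
# not real data. Element names are assumed to match the MEDLINE DTD.
SAMPLE = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>0</PMID>
    <Article>
      <ArticleTitle>Placeholder title for illustration.</ArticleTitle>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName>Genomics</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</MedlineCitationSet>"""


def extract_citations(xml_text):
    """Return one dict per citation with its PMID, title, and MeSH descriptors."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "mesh": [d.text for d in citation.iter("DescriptorName")],
        })
    return records


if __name__ == "__main__":
    for record in extract_citations(SAMPLE):
        print(record["pmid"], record["title"], record["mesh"])
```

For the full database, which runs to millions of records, a streaming parser such as ET.iterparse would be a more memory-friendly choice than loading each file whole.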

Examples Of Mining Projects
Academic Researchers
Bio-acronyms and abbreviations, bio-relations, and proteins
Gene, drug, and disease relationships
Interfaces for performing efficient and effective searches
Identification, prevention, and treatment of emerging infectious disease or biothreat

Examples Of Mining Projects (cont.)
Biotechnology Companies
Gene-to-gene interactions and connection to diseases and/or existing drugs
Vaccine research and antibacterial drug discovery

Examples Of Mining Projects (cont.)
Pharmaceutical Companies
Support drug discovery and development efforts
Internal access to add unique value to their previously derived data; privacy concerns with external web access

Examples Of Mining Projects (cont.)
Software Developers
Data mining methods to help uncover gene/disease relationships leading to discovery of new drugs

Winning Combination
High quality content
Identifiable new areas of outside interest
Well-accepted data distribution format
Inexpensive, easy-to-use distribution media