TRANSCRIPT
DR MELITA GIUMMARRA
Can text mining and machine learning help reduce systematic review workload for injury researchers?
Senior Research Fellow, ARC DECRA
Pre-hospital, Emergency and Trauma Research Group
MONASH PUBLIC HEALTH & PREVENTIVE MEDICINE
@MelitaGiummarra [email protected]
ACKNOWLEDGEMENTS
STUDY TEAM
Dr Melita Giummarra
Ms Georgina Lau
Dr Genevieve Grant
Professor Belinda Gabbe
FUNDING DISCLOSURES
Dr Giummarra is supported by an ARC DECRA fellowship
Professor Gabbe is supported by an ARC Future Fellowship.
Paper reporting review
findings (in press):
Methods paper
SYSTEMATIC REVIEWS FOR EVIDENCE SYNTHESIS
• Systematic reviews employ rigorous methods to evaluate the state of the science
▪ Comprehensive search strategy (MeSH/EMTREE terms & keywords)
▪ Pool results from multiple databases (typically 4-7)
▪ Each stage is conducted by two researchers and cross-referenced
▪ Generate narrative and quantitative (e.g., meta-analysis) synthesis of the evidence
▪ Evaluate level of evidence relative to risk of bias/quality of the science
▪ As a result, systematic reviews:
▪ Are often highly cited
▪ Help reduce “research waste”
▪ Identify benefits/harms of “exposures” or interventions
▪ Identify gaps in knowledge
▪ But over-identify potential literature (typically rejecting >95% of retrieved citations)
EMERGING CHALLENGES
The rapidly increasing publication rates mean that reviews are practically out of date by the time they’re complete!
CAN MACHINE LEARNING & TEXT MINING HELP?
• ML/TM are statistical tools to detect patterns and extract knowledge from unstructured natural language text
• Explore and categorize unstructured data (e.g., term co-occurrences or frequency)
• Minimise human effort requirements – especially for large bodies of unstructured data.
• For systematic reviews, ML/TM tools have been developed for various stages of the review:
• Literature search
• Screening
• Data extraction
• Risk of bias evaluation
• Review updates
• To date, recommendations are that we do NOT rely on text mining solely, but it may be used as a “second reviewer”.
ABSTRACKR
Abstrackr is a free web-based platform developed by Byron Wallace (Brown University, USA).
• It uses an active learning algorithm (using uni-grams and bi-grams) to generate predictions of relevance from the words in citation titles, abstracts and keywords based on judgements by a reviewer.
• Once predictions are generated, citations are sorted according to probability of relevance; researchers can then more quickly identify articles likely to be relevant and eliminate those with low probability.
• Abstrackr has been shown to reduce the burden of conducting and updating systematic reviews in specific health topics (e.g., genetics) without compromising sensitivity and specificity to identify eligible citations for full text review.
http://abstrackr.cebm.brown.edu
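The active-learning ranking idea described above can be illustrated with a toy sketch. This is not Abstrackr's actual implementation — just a minimal naive-Bayes ranker over unigrams and bigrams that re-scores the unlabelled queue after each reviewer judgement:

```python
from collections import Counter
import math

def ngrams(text):
    """Unigram + bigram features, mirroring the feature types named above."""
    toks = text.lower().split()
    return toks + [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

class RelevanceRanker:
    """Toy citation ranker: update word statistics after each human label,
    then sort unlabelled citations by predicted relevance."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def label(self, text, relevant):
        # A reviewer judgement: update per-class n-gram counts
        self.counts[relevant].update(ngrams(text))
        self.docs[relevant] += 1

    def score(self, text):
        # Log-odds of relevance under multinomial naive Bayes (add-1 smoothing)
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        total = {c: sum(self.counts[c].values()) for c in (True, False)}
        s = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for f in ngrams(text):
            s += math.log((self.counts[True][f] + 1) / (total[True] + vocab + 1))
            s -= math.log((self.counts[False][f] + 1) / (total[False] + vocab + 1))
        return s

    def rank(self, unlabelled):
        # Most-likely-relevant citations first, as in the sorted screening queue
        return sorted(unlabelled, key=self.score, reverse=True)
```

After a few labels, `rank()` floats probable includes to the top of the queue, which is what lets a reviewer stop once no further citations are predicted relevant.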
AIMS OF MY STUDY
REVIEW AIM:
Determine relationships between fault attribution and socio-economic/health outcomes after transport injury.
METHODS-SPECIFIC AIMS
• Examine whether machine learning is appropriate for systematic review citation screening in injury research
• Examine whether text analysis of full text articles provides workload savings for injury recovery systematic reviews
SEARCH STRATEGY
STEP 1: SCREENING CITATIONS
(TITLE & ABSTRACT)
SCREENING STRATEGY
Endnote library (n = 16,324): Medline (n = 4,291), Embase (n = 7,482), PsycINFO (n = 1,315), CINAHL (n = 2,667), Cochrane (n = 569)
Duplicates removed (n = 5,764)
Citations screened (n = 10,559)
Reviewer 1: Traditional manual screening against eligibility criteria:
Population: Transport injury, adults aged >15
Design: Cohort, observational and prospective studies
Outcomes: Work, Pain, Psychological or Health outcomes reported
Reviewer 2: Screening against eligibility criteria in Abstrackr, stopping when no more citations were predicted relevant.
SCREENING CITATIONS (Abstrackr)
RESULTS: SCREENING STAGE 1 (citations)
Reviewer 1:
• Screened 10,559 citations
• 61 hours of screening
• Identified 401 articles for full text screening
Reviewer 2:
• Screened 1,809 articles
• 16 hours of screening
• Identified 634 citations for full text screening
RESULTS: SCREENING STAGE 1 (citations)
[Figure: cumulative screening time by stopping point — 2.50 hrs (n = 343), 5.47 hrs (n = 649), 7.37 hrs (n = 818), 11.52 hrs (n = 1,244), 13.90 hrs (n = 1,374), 16.30 hrs (n = 1,809)]
KEY OBSERVATIONS: CITATION SCREENING
✓ Excellent workload savings (Reviewer 2 screened only 17.1% of citations; total screening saving = 63.4% vs traditional review)
✓ Excellent (low) false negative rate – esp. when considering full text inclusion
✓ Workload savings, specificity and false negative rate optimised by a more generous stopping rule
Precision was variable and moderate, and sensitivity was low, probably due to:
1. Abstract text was missing for some citations
2. Abstracts often failed to report design and population features
3. Complex inclusion criteria with multiple outcomes
4. Highly imbalanced dataset (relevant : irrelevant)
• Only 689 (15.3%) citations were relevant for full text review (134 of these were subsequently excluded as ineligible for full text screening, e.g., conference abstracts, books).
The findings are consistent with previous Abstrackr methods evaluations of unbalanced and broad review topics (e.g., Rathbone, 2015, Systematic Reviews; Gates, 2018, Systematic Reviews).
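The screening performance measures discussed above (sensitivity, specificity, precision, false negative rate) all derive from a confusion matrix over screening decisions. A minimal sketch with illustrative counts (not the study's actual numbers):

```python
def screening_metrics(tp, fp, fn, tn):
    """Confusion-matrix metrics for citation screening.

    tp: predicted relevant and truly eligible
    fp: predicted relevant but ineligible
    fn: predicted irrelevant but actually eligible (missed citations)
    tn: predicted irrelevant and ineligible
    """
    return {
        "sensitivity": tp / (tp + fn),          # eligible citations caught
        "specificity": tn / (tn + fp),          # ineligible citations rejected
        "precision": tp / (tp + fp),            # suffers in imbalanced datasets
        "false_negative_rate": fn / (fn + tp),  # eligible citations missed
    }

# Illustrative only: a highly imbalanced screening dataset (~10% eligible)
m = screening_metrics(tp=90, fp=210, fn=10, tn=690)
```

With eligible citations this rare, precision stays low (0.30 here) even when sensitivity is high (0.90) — which is why an imbalanced relevant : irrelevant ratio drags precision down without necessarily compromising the false negative rate.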
STEP 2: SCREENING FULL TEXTS
SCREENING STRATEGY: FULL TEXT
Full text screened
(n = 555)
Reviewer 1: Traditional manual screening against all eligibility criteria
Reviewer 2: Text mining with a fault dictionary** to restrict full texts to those with a fault-related concept in the methods/results,
using Wordstat & QDA Miner, then manually screening those with a fault term against all eligibility criteria.
* Full text ineligible for screening: Publication format (conference abstract, non-empirical, dissertation), missed duplicate, not available in English
** Dictionary was developed from a survey of 20 injury and trauma experts
Citations judged irrelevant
(n = 9,870)
Full text ineligible*
(n = 134)
SCREENING FULL TEXT ARTICLES (REVIEWER 2)
1. Add semantic anchors before methods & after results
2. Prepare and import PDFs for text mining
1. Optimize PDFs for text extraction (e.g., via “Edit PDF” in Adobe Acrobat)
2. Import PDFs into Wordstat (v. 14) with the Document Conversion Wizard (n=555)
3. Review failed PDFs and attempt to load via QDA Miner (n=89)
4. Reserve those that still failed for manual screening (n=25)
3. Load fault terms (categorization dictionary, 46 terms) in Wordstat
4. Analyse fault term frequency between semantic anchors
a) Test sensitivity of terms and drop irrelevant terms (32 terms dropped)
b) Save keyword frequencies as a Stata file to identify full texts for manual screening (papers with ≥1 fault term)
Fault Dictionary
Attribution of responsibility
Blame
Common law
Compensable
Compensation
Fault
Insurance
Lawyer
Legal
Liability
Litigation
Passenger
Pedestrian
Tort
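The dictionary-based restriction in steps 1–4 above can be sketched outside Wordstat as well. A minimal Python version, assuming hypothetical anchor strings and using a subset of the study's fault dictionary:

```python
import re

# Subset of the fault dictionary shown above
FAULT_TERMS = ["blame", "compensation", "fault", "insurance", "liability",
               "litigation", "passenger", "pedestrian", "tort"]

def fault_term_counts(full_text,
                      start_anchor="<<METHODS>>",
                      end_anchor="<<END-RESULTS>>"):
    """Count fault-dictionary terms between the semantic anchors only,
    so matches in the introduction or discussion are ignored."""
    m = re.search(re.escape(start_anchor) + r"(.*?)" + re.escape(end_anchor),
                  full_text, flags=re.DOTALL | re.IGNORECASE)
    section = m.group(1).lower() if m else ""
    return {t: len(re.findall(r"\b" + re.escape(t) + r"\b", section))
            for t in FAULT_TERMS}

def needs_manual_screen(full_text):
    # Papers with >= 1 fault term in methods/results go on to manual screening
    return any(fault_term_counts(full_text).values())
```

Papers whose methods/results contain no dictionary term are set aside, which is exactly the workload saving reported below — provided (as the study checked) that no eligible paper lacks a fault term.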
SCREENING FULL TEXT ARTICLES (Wordstat)
Reviewer 1
• Reviewed 555 full texts
• 39 hours screening time
Reviewer 2
• 25 PDFs did not import and were manually screened
• 342 (64.5% of the 530 full texts that loaded) contained ≥1 fault term and were manually screened
• The most frequent fault terms identified were insurance (n=171), compensation (n=136), passenger (n=104) and pedestrian (n=100)
• 8.75 hours screening time (including PDF formatting)
KEY OBSERVATIONS
✓ None of the full-text articles without a fault-related term was judged to be eligible by Reviewer 1.
✓ Text mining reduced screening workload by 29.7% (screening time)
TAKE HOME MESSAGE
• Abstrackr and text mining offer excellent work efficiencies, good accuracy and very low false negative rates
• Learnings to improve citation screening in Abstrackr:
• Identify citations missing abstracts (or, when removing duplicates, take care to keep the source record that includes the abstract)
• Eliminate ineligible citation types in Endnote before importing into Abstrackr (e.g., book chapters and conference abstracts)
• Considerations for using text mining of full text articles
• Significant time is required for document preparation (e.g., uploading only the methods/results sections) and dictionary testing, but this still saves time relative to manual screening.
• Conclusion?
• These tools ARE beneficial to support systematic reviews in public health/injury research
• Recommend machine learning especially if the outcomes or exposures are clearly defined
• Text mining cannot and should not completely replace human screeners when examining complex literature, populations or health outcomes.
QUESTIONS?
@MelitaGiummarra