
Confidence in neural networks: methodological issues arising from a review of safety-related applications

P.J.G. Lisboa

[email protected]

Computing and Mathematical Sciences

Liverpool John Moores University

Outline

Developments in commercial safety-related systems comprising artificial neural networks

Increasing demand for decision support e.g. healthcare

Where is the evidence of healthcare benefit from ANNs?

Framework for assuring confidence in neural networks

Design assurance; risk analysis; evidence of effectiveness

Methodological issues arising from the review

Fire alarm for office blocks

Siemens FP11

FirePrint Technology

Very high specificity

Commercial safety-related systems

Automotive industry

Tow-by-wire: NACT (http://www.mech.gla.ac.uk/~nact/nact.html)

Fuel injection: FAMIMO (http://iridia.ulb.ac.be/~famimo/)

Electronic ABS: H2C (http://www.control.lth.se/H2C/)

Lisboa, P.J.G. ‘Industrial use of safety-related artificial neural networks’ HSE CR, 2001 http://www.hse.gov.uk/crr_pdf/crr01327.pdf

Papnet

Cytology screening

FDA approved for secondary screening

Proven sensitivity

Specificity left to user

Cost-effective

In the US, 44,000 - 98,000 preventable deaths per year are attributed to medical errors (Weingart et al, BMJ 2000)

Exceeds the combined toll from motor crashes, suicides, falls, poisonings and drowning

Epidemiology of medical error

http://www.bmj.com/content/vol320/issue7237/images/large/weis18fh.f1.jpeg

Epidemiology of medical error

Managing error

The just-world hypothesis

Systemic approaches

(Reason, BMJ 2000)

http://www.bmj.com/content/vol320/issue7237/images/large/reaj26ja.f1.jpeg

Randomised Controlled Trials

Clinical areas: Oncology; Critical care; Cardiology; Other

Diagnosis and staging
Prostatic cancer (2). Cervical cancer. Breast cancer. Acute leukemia.
Intracranial haemorrhage in neonates.
Acute Myocardial Infarction (2).
Appendicitis.

Outcome prediction
Response to therapy in head & neck cancer. Recurrence of breast cancer in axillary node-negative patients.
Length-of-stay in preterm neonates.
Tacrolimus blood levels. Effect of treatment in schizophrenia and depression. Rib fracture injury.

Radiology
MRI of osteosarcoma.
Perfusion scintigraphy for detection of coronary stenosis. Doppler microembolic signal counts in patients with prosthetic heart valves.

Physiological monitoring
Fetal surveillance during labour from fetal ECG.

Lisboa, P.J.G. ‘A review of evidence of health benefit from artificial neural networks in medical intervention’, Neural Networks, 15, 1, 9-37, 2002.

Clinical Trials

Clinical areas: Oncology; Critical care; Cardiology; Neurology; Other

Diagnosis and staging
Cervical cancer (3). Pre-cancerous breast.
Transient ischaemia. Acute ischaemia.
Embolus detection in stroke. Spontaneous EEG. Sleep EEG (2). Quantitative EEG. Ventilation mode recognition.
Referral methods for patients with third molars. Bladder outlet obstruction. Tear protein patterns. Haemodialysis. Ovulation time. Pure tone thresholds.

Outcome prediction
Multiple myeloma.
Stone growth after lithotripsy.

Radiology
Myocardial perfusion images. Detection of stenoses from Doppler u/s waveforms.
MRS of epilepsy. PET of 5-HT reuptake sites. PET in Alzheimer’s.
MRS of muscle.

Physiological monitoring
EEG in Pediatrics.
Single trial PVEP (2). Correlation of EEG and MEG. Lorazepam and sleep EEG. Evoked potentials in multiple sclerosis.
Oxygenation in infants. EGG of gastric emptying (2). Subcutaneous adipose tissue. Nonstress tests in obstetrics. Bone demineralization.

Prostatic cancer

Reference: Gamito et al, 2000
No. of subjects: 4,133
Clinical function: Prediction of risk of lymph node spread (LNS) from age, race, PSA, PSA velocity, Gleason sum and TNM
Performance assessment: External validation (n=660)
Results: 98% accuracy in detection of low risk of LNS with an MLP

Cervical cancer

Reference: Prismatic team, 1999
No. of subjects: NNA
Clinical function: Assessment as a primary screening tool for categorization of cervical smears as negative, mild, moderate or severe dyskaryosis, invasion, glandular neoplasia and borderline nuclear changes
Performance assessment: External validation (n=21,700)
Results: 89.9% agreement across all classes was found between PAPNET and conventional primary screening. Similar sensitivity (82 cf. 83%), with PAPNET having improved specificity (77 cf. 42%) and faster processing (3.9 min. cf. 10.4 min.)

Reference: Doornewaard et al, 1999
No. of subjects: NNA
Clinical function: Assessment as a primary screening tool for the early detection of cervical dysplasia
Performance assessment: External validation (n=6,063)
Results: PAPNET testing has similar diagnostic value to conventional screening of Pap smears, with AUROC 95% CIs of 78-82% for control and 77-81% for PAPNET

Reference: Mango et al, 1998
No. of subjects: NNA
Clinical function: Comparison of yield in re-screening of node-negative PAP smears between NNA and conventional unassisted cytology
Performance assessment: External validation (n=10,000)
Results: PAPNET returned a yield of 6.2% versus 0.6% for manual re-screening

Neonates

Reference: Zernikow et al, 1999
No. of subjects: 2,144
Clinical function: Predicting length-of-stay in preterm neonates from 40 first-day-of-life items
Performance assessment: Train/test/validation
Results: First-day-of-life data is predictive of length-of-stay of pre-term neonates with correlation CIs of 0.85-0.90 for MLR and 0.87-0.92 for MLP

Ischaemia

Reference: Polak et al, 1997
No. of subjects: 1,367
Clinical function: Prediction of transient ischaemia during ambulatory Holter monitoring, from a resting 12-lead ECG. Univariate t-tests were used to inform model selection
Performance assessment: Train/test
Results: LDA and adaptive logic networks were superior to the MLP in predicting the likelihood of ischaemic episodes

Reference: Selker et al, 1995
No. of subjects: 3,453
Clinical function: Clinical indicators available within 10 minutes of emergency department care were used to predict AMI and unstable angina pectoris, in a real-time clinical setting
Performance assessment: External validation (n=2,320)
Results: Limiting the inputs to 8 readily available variables, AUROCs for LogR, CART and MLP were 0.887, 0.858 and 0.902, respectively. Each is a clinically useful predictor of clinical outcome

Is there evidence of clinical benefit?

Clinician performance patient outcome

Primary to secondary care referrals of patients with third molars:

                                      Sens.   Spec.   Acc.
1) Control group                      0.97    0.22    0.83
2) Paper-based clinical algorithm     0.56    0.93    0.73
3) MLP-based recommendation           0.56    0.79    0.67

Which is the best performing system?


$$\frac{\mathrm{Sensitivity}}{1 - \mathrm{Specificity}}$$ for each system: 1) 1.2   2) 8.0   3) 2.7

Performance estimation

ROC framework

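Boost factor:

PPV = True positives / Predicted positives

$$\frac{\mathrm{PPV}}{1 - \mathrm{PPV}} = \frac{\mathrm{Sensitivity}}{1 - \mathrm{Specificity}} \times \frac{\Pr(\mathrm{class})}{\Pr(\sim\mathrm{class})}$$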
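The values 1.2, 8.0 and 2.7 quoted for the three third-molar referral systems are consistent with sensitivity / (1 - specificity). A minimal sketch of that arithmetic follows, together with the conversion from boost factor to PPV through the prior odds. The function names and the prevalence of 0.5 are illustrative assumptions, not figures from the study.

```python
def boost_factor(sensitivity, specificity):
    """Ratio by which a positive prediction multiplies the prior odds."""
    return sensitivity / (1.0 - specificity)


def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV obtained from posterior odds = boost factor * prior odds."""
    prior_odds = prevalence / (1.0 - prevalence)
    posterior_odds = boost_factor(sensitivity, specificity) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Sensitivity and specificity as listed for the third-molar referral study.
systems = {
    "Control group": (0.97, 0.22),
    "Paper-based clinical algorithm": (0.56, 0.93),
    "MLP-based recommendation": (0.56, 0.79),
}

ASSUMED_PREVALENCE = 0.5  # hypothetical value, for illustration only

for name, (sens, spec) in systems.items():
    bf = boost_factor(sens, spec)
    ppv = ppv_from_prevalence(sens, spec, ASSUMED_PREVALENCE)
    print(f"{name}: boost factor {bf:.1f}, PPV {ppv:.2f} at prevalence {ASSUMED_PREVALENCE}")
```

On this measure the control group, despite the highest accuracy, adds the least information to a positive referral decision.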

Predictive models

Deficiencies in standard modelling methods:

(Altman & Royston, Stat. Med. 2000)

1. Overoptimistic assessment of predictive performance

2. Multiple regression using stepwise variable selection

3. EPV < 10 (samples/parameters)

4. Case-mix (cohort variations)

5. External evaluation (protocol changes)

• Retrospective vs. Temporal

• Prospective vs. External
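Item 3 can be checked before any model is fitted. The sketch below reads EPV as events per variable, that is, the number of cases in the less frequent outcome class divided by the number of fitted parameters (the slide glosses it as samples/parameters). The counts used are hypothetical.

```python
def events_per_variable(n_events, n_nonevents, n_parameters):
    """EPV using the less frequent outcome class as the limiting sample size."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return min(n_events, n_nonevents) / n_parameters


def epv_warning(n_events, n_nonevents, n_parameters, minimum=10.0):
    """Return True when the design falls below the EPV >= 10 rule of thumb."""
    return events_per_variable(n_events, n_nonevents, n_parameters) < minimum


# Hypothetical cohort: 60 events, 940 non-events, 8 candidate parameters.
epv = events_per_variable(60, 940, 8)
print(f"EPV = {epv:.1f}, below threshold: {epv_warning(60, 940, 8)}")
```

For a neural network the parameter count would be the number of weights, which makes the threshold much harder to meet than for a regression with the same inputs.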

Continuum of inference models

[Figure: continuum of inference models]

Axis labels: numeric to numeric; numeric to symbolic; symbolic to symbolic; unsupervised; supervised

Families: signal processing; statistical methods; neural networks; data driven; knowledge driven

Methods shown: FFT; wavelets; independent components analysis; logistic regression; kernel methods inc. SVM; radial basis functions; multi-layer perceptron; ART; SOM/GTM; CART; rule induction; axiomatic; production rules; fuzzy logic; reinforcement learning; k-means clustering; rule extraction; control

Software life-cycle

Requirements analysis & specification

Functional specification & data requirements

Model design

Implementation & training

Test of model predictions

Integration test

System evaluation

Unit level

Integration level

Systems level

Risk analysis

Extract from the FDA guidelines

The continuum of evidence (Drug development)

Stages: Animal models; Healthy humans; Clinical trials; RCT; Post-market surveillance

Phases: Pre-clinical: Rationale; Phase I: Modelling; Phase II: Exploratory trial; Phase III: Definitive trial; Phase IV: Follow-up

The continuum of evidence (Campbell et al, 2000)

Phases: Pre-clinical: Rationale; Phase I: Modelling; Phase II: Exploratory trial; Phase III: Definitive trial; Phase IV: Follow-up

Stages: Methodology; Retrospective studies; Prospective studies; RCT; Post-market surveillance

Ph I: Theory

Regularisation framework

Ph II: Performance optimisation

Complexity control

Ph III: Generalisation

HAZOP/FMEA

RCT: case-control study

Clinical effectiveness


Medical Devices Directives

Essential requirements

Performance validation

Doctrine of Substantially Equivalent Products

Model evaluation

Requirement for Learned Intermediaries

Risk assessment

Procedure for post-marketing surveillance

H & S requirements

The continuum of evidence

Methodological issues arising

Confidence: Data; Regularisation; Calibration

Transparency: Rule-extraction; Linear-in-the-parameters statistical inference; Fuzzy or rule-based supervisory models

Effectiveness: Performance estimation

Reliability: Novelty-detection

Generalisation

Performance estimation

Power calculations

Bootstrap CI: [equation: AUROC estimated from C_train and C_test]
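As a generic illustration of a bootstrap confidence interval for the AUROC, the sketch below uses the plain percentile bootstrap on a held-out test set. This is a standard technique substituted for illustration and is not necessarily the estimator shown on the slide. The data are synthetic and all names are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that lost one of the classes
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test set: positives score higher on average than negatives.
data_rng = random.Random(1)
test_labels = [1 if data_rng.random() < 0.4 else 0 for _ in range(100)]
test_scores = [data_rng.gauss(1.0 if y else 0.0, 1.0) for y in test_labels]

point = auroc(test_labels, test_scores)
ci_low, ci_high = bootstrap_ci(test_labels, test_scores)
print(f"AUROC {point:.3f}, 95% bootstrap CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The width of the interval, rather than the point estimate alone, is what supports a claim that one model outperforms another.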

Rule extraction

Axis parallel boxes & network pruning


Rule extraction

[Figure panels: Surface Map; Variance Map]

Lisboa, P.J.G., Etchells, T.A. and Pountney, D.C. ‘Minimal MLPs do not model the XOR logic’, Neurocomputing, Rapid communication, 48, 1-4, 1033-1037, 2002.

Axis parallel boxes & network pruning
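An axis-parallel box corresponds to a rule of the form "IF lo_i <= x_i <= hi_i for every input i THEN class 1". The sketch below is a deliberately crude illustration and not the extraction method used in the talk. It takes the bounding box of the grid points that a stand-in model labels positive and measures how often the resulting rule agrees with the model (its fidelity). The `toy_model` function is a hypothetical substitute for a trained network.

```python
def toy_model(x):
    """Hypothetical stand-in for a trained network: positive inside a disc."""
    return 1 if (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2 <= 0.1225 else 0


def bounding_box(points):
    """Smallest axis-parallel box containing all the given points."""
    dims = range(len(points[0]))
    return [(min(p[d] for p in points), max(p[d] for p in points)) for d in dims]


def in_box(box, x):
    """Evaluate the rule: every coordinate lies within its interval."""
    return all(lo <= v <= hi for (lo, hi), v in zip(box, x))


# Probe the model on a regular grid over the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
positives = [x for x in grid if toy_model(x) == 1]

box = bounding_box(positives)
agree = sum(1 for x in grid if int(in_box(box, x)) == toy_model(x))
fidelity = agree / len(grid)

print("Rule: " + " AND ".join(f"{lo} <= x{d} <= {hi}" for d, (lo, hi) in enumerate(box)))
print(f"Fidelity to the model on the grid: {fidelity:.3f}")
```

The gap between the box and the curved decision region is where a single rule misrepresents the model, which is why fidelity needs to be reported alongside any extracted rule.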

Good practice

Embodying a safety-culture

Specification
• Statistically significant vs. clinically useful

Transparency
• Verify against clinical prior knowledge

HAZOP & FMEA
• Reliability - novelty detection?
• Maintainability - incremental learning?
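One simple way to picture novelty detection is to flag inputs that lie far from the training data before the model's prediction is trusted. The sketch below uses a per-feature z-score with a threshold of 3. Both the scoring rule and the threshold are illustrative assumptions rather than a method from the talk, and the training rows are made up.

```python
import statistics


def fit_reference(train_rows):
    """Per-feature mean and standard deviation of the training inputs."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds


def novelty_score(x, means, stds):
    """Largest absolute z-score across the input features."""
    return max(abs(v - m) / s for v, m, s in zip(x, means, stds))


def is_novel(x, means, stds, threshold=3.0):
    """True when the input lies outside the range covered by training data."""
    return novelty_score(x, means, stds) > threshold


# Made-up training inputs with two features.
train = [[0.40, 1.0], [0.50, 1.2], [0.60, 0.9], [0.50, 1.1], [0.45, 1.0]]
means, stds = fit_reference(train)

for case in ([0.5, 1.05], [5.0, 1.0]):
    print(case, "novel" if is_novel(case, means, stds) else "within training range")
```

In a deployed system a flagged case would be referred to the clinician without a model recommendation.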

Summary

Assuring confidence in complex systems is not an issue specific to neural networks but applies to all inference systems

A framework can be constructed based on a life-cycle model of safety-related software:

• Good practice in the design of data-based models

• Need to switch between evidence-based and knowledge-based cultures for V&V
