21/03/05 National Centre for Text Mining
The National Centre for Text Mining
Anne E Trefethen
Deputy Director,
e-Science Core Programme
…and its ramifications for e-Science and the other way round
A Definition of e-Science
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
John Taylor
Director General of Research Councils
Office of Science and Technology
Licklider’s Vision for the Internet
“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job.”
Larry Roberts – Principal Architect of the ARPANET
UK e-Science Programme
Collaborative projects
Director’s Management Role
Director’s Awareness and Co-ordination Role
Generic Challenges: EPSRC (£15m) £16.2m, DTI (£15m)
Industrial Collaboration
Pilot Application Programme
• PPARC (£26m) £31.6m
• BBSRC (£8m) £10.0m
• MRC (£8m) £13.1m
• NERC (£7m) £8.0m
• ESRC (£3m) £10.6m
• EPSRC (£17m) £18.0m
• CLRC (£5m) £5.0m
Research Councils (£74m) £96.3m, DTI (£5m)
Technical Advisory Group
£250m government investment over 5 yrs
Powering the Virtual Universe
http://www.astrogrid.ac.uk (Edinburgh, Belfast, Cambridge, Leicester, London, Manchester, RAL)
Multi-wavelength showing the jet in M87: from top to bottom – Chandra X-ray, HST optical, Gemini mid-IR, VLA radio. AstroGrid will provide advanced, Grid based, federation and data mining tools to facilitate better and faster scientific output.
Picture credits: “NASA / Chandra X-ray Observatory / Herman Marshall (MIT)”, “NASA/HST/Eric Perlman (UMBC)”, “Gemini Observatory/OSCIR”, “VLA/NSF/Eric Perlman (UMBC)/Fang Zhou, Biretta (STScI)/F Owen (NRAO)”
AstroGrid slides courtesy of Nick Walton, Cambridge
Gamma Ray Bursts
Image from ESO. Image + IRIS data. D. Ducros, ESA.
• Reprocessing of ionospheric STP data: change coords from earth to celestial
• Collate data from multiple telescopes over months: metadata issues
• Localise GRB alert in minutes, as they fade rapidly
• SWIFT satellite observes gamma ray burst
• Compare against SN light curves: bump shows evidence for a SN in the GRB (Price et al, 2002)
• Interaction with observatory pipelines
• Cross reference multi-λ data: ID precursor and/or environment
• Large computational photometric redshift calcs on multi-λ data > gives distance
Dark Matter + Large Scale Structure
X-ray cluster: Chandra X-ray (Mullis) overlaid on a deep BRI image (Clowe & Luppino).
Image from ESO
• Multiple large image sources: registration & association
• Source ID from multiplexed spectral data
• Multi-TB ΛCDM models, e.g. Millennium Sim
• Generate shear maps, c.f. CDM models > DM distribution with redshift
• Remove stars, correlate gals with z
• Colour-colour relationships: classification in multi-phase space
• Automatic cluster finding techniques
Some facts on Astronomy data
• Virtual observatories
– Many national virtual observatories containing data at different wavelengths. Estimated:
• US NVO project alone will store 500 Terabytes/year
• Laser Interferometer Gravitational-Wave Observatory (LIGO) generates 250 Terabytes/year
• VISTA, Visible and Infrared Survey Telescope, estimated to generate 250 Gigabytes of raw data/night – 10 Terabytes of stored data/year
• Together with data analysis, need to combine with previously published knowledge on those astronomical time/space events
The eDiaMoND Project
Hardware, Software and People Skills
People Skills
University Relations, Life Sciences, Worldwide Grid
Breast Screening Programmes
Engineering and Physical Sciences Research Council, Medical Research Council
eDiaMoND slides courtesy of David Gavaghan, Oxford
UK Breast Screening – Today
• Began in 1988
• Women 50-64, screened every 3 years, 1 view/breast
• ~100 Breast Screening Programmes: Scotland, Wales, Northern Ireland, England
• 1,300,000 screened in 2001-02
• 65,000 recalled for assessment
• 8,545 cancers detected
• 300 lives per year saved
• 230 radiologists (double reading)
• Film, paper
Statistics from NHS Cancer Screening web site
UK Breast Screening – Challenges
• 230 radiologists (double reading)
• 50% workload increase
• 2,000,000 screened every year
• 120,000 recalled for assessment
• 10,000 cancers
• 1,250 lives saved
• Began in 1988
• Women 50-70, screened every 3 years, 2 views/breast, plus demographic increase
• ~100 Breast Screening Programmes: Scotland, Wales, Northern Ireland, England
• Digital
UK Breast Screening – Workflow
Screening
• Call: 1000
• All Clear: 960
• Recall: 40
• Missed: 1 (Interval Cancers)
Assessment
• All Clear: 34
• Cancer: 6
Training, Epidemiology
~100 Breast Screening Programmes
Previous and current images
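The workflow counts can be put on the same footing as the national 2001-02 statistics by expressing both per 1,000 women screened. The sketch below is illustrative only: the helper name `per_thousand` is an assumption, and the inputs are simply the figures quoted on the slides.

```python
def per_thousand(count: int, screened: int) -> float:
    """Express a count as a rate per 1,000 women screened."""
    return 1000 * count / screened


# Figures as quoted on the slides (workflow diagram and 2001-02 statistics).
workflow = {"screened": 1000, "recalled": 40, "cancers": 6}
national_2001_02 = {"screened": 1_300_000, "recalled": 65_000, "cancers": 8_545}

for label, figures in (("workflow", workflow), ("2001-02", national_2001_02)):
    recall = per_thousand(figures["recalled"], figures["screened"])
    detected = per_thousand(figures["cancers"], figures["screened"])
    print(f"{label}: {recall:.1f} recalled, {detected:.1f} cancers per 1,000 screened")
```

On these numbers the workflow diagram and the national statistics are of the same order: 40 versus 50 recalls, and 6 versus roughly 6.6 cancers, per 1,000 screened.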
eDiaMoND – Scope
Epidemiology, Teaching, Diagnosis, Screening
Data: 32 MB / Image, 256 TB / Year
Epidemiology, Training, Screening
Workstation, Grid
Previous and current images
Compute: Standard Mammo Format, Data Mining, CADe, CADi
~4 Breast Screening Programmes
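The 256 TB / year figure is consistent with the screening numbers on the Challenges slide. The sketch below shows one way to arrive at it; it is an assumption that the slide's figure was derived this way, that two breasts are imaged per woman, and that decimal units (1 TB = 1,000,000 MB) are meant. The function name is hypothetical.

```python
MB_PER_IMAGE = 32             # from the slide
WOMEN_PER_YEAR = 2_000_000    # from the Challenges slide
VIEWS_PER_BREAST = 2          # from the Challenges slide
BREASTS_PER_WOMAN = 2         # assumption, not stated on the slides


def annual_volume_tb(women: int, views_per_breast: int,
                     breasts: int = BREASTS_PER_WOMAN,
                     mb_per_image: int = MB_PER_IMAGE) -> float:
    """Estimate yearly image volume in decimal terabytes."""
    images = women * views_per_breast * breasts
    return images * mb_per_image / 1_000_000  # MB -> TB (decimal)


print(annual_volume_tb(WOMEN_PER_YEAR, VIEWS_PER_BREAST))
```

Under these assumptions the estimate is 8,000,000 images a year at 32 MB each, which reproduces the slide's 256 TB.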
eDiaMoND – Compute
Mammograms have different appearances, depending on image settings and acquisition systems
Standard Mammo Format
Temporal mammography
Computer Aided Detection
3D View
eDiaMoND – Data
Data: Images
Logical View is One Resource
Patient  Age  …  Image
107258   55   …  1.dcm
236008   62   …  2.dcm
700266   59   …  3.dcm
895301   58   …  4.dcm
…        …    …  …
Data: DICOM (four sources shown)
Grid
Compute: Standard Mammo Format, Data Mining, CADe, CADi
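The "one logical resource" idea can be sketched as a thin view over several physical stores. This is a minimal illustration only, not the eDiaMoND implementation: the `Record` and `LogicalView` names, the two sites, and the way the four sample rows are split between them are all assumptions.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    patient: int
    age: int
    image: str  # name of the DICOM file


# Hypothetical per-site stores; the split of rows between sites is assumed.
SITE_A = [Record(107258, 55, "1.dcm"), Record(236008, 62, "2.dcm")]
SITE_B = [Record(700266, 59, "3.dcm"), Record(895301, 58, "4.dcm")]


class LogicalView:
    """Present several physical stores as a single queryable resource."""

    def __init__(self, *sites: Iterable[Record]) -> None:
        self._sites = sites

    def __iter__(self) -> Iterator[Record]:
        # Callers iterate over one table and never see the site boundaries.
        return chain.from_iterable(self._sites)

    def where(self, min_age: int) -> list[Record]:
        return [record for record in self if record.age >= min_age]


view = LogicalView(SITE_A, SITE_B)
print([record.image for record in view.where(min_age=59)])
```

A query against `view` returns matching rows from every site without the caller naming any of them.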
myGrid: Directly Supporting the e-Scientist
Partners: Manchester, EBI, Southampton, Nottingham, Newcastle, Sheffield, AstraZeneca, GlaxoSmithKline, Merck KGaA, Epistemics Ltd, GeneticXchange, Network Inference, IBM, SUN Microsystems
http://mygrid.man.ac.uk
myGrid slides courtesy of Carole Goble
myGrid Project
• Imminent ‘deluge’ of genomics data
• Highly heterogeneous
• Highly complex and inter-related
• Convergence of data and literature archives
(courtesy of Carole Goble, Manchester)
Information Weaving
• Large amounts of different kinds of data & many applications.
• Highly heterogeneous.
– Different types, algorithms, forms, implementations, communities, service providers
• High autonomy.
• Highly complex and inter-related, & volatile.
(courtesy of Carole Goble, Manchester)
An in silico experiment = a web of interconnected information and components
Provenance record of workflow runs
Provenance of the workflow template. Related workflows.
People
Ontologies describing workflows
Services used
Notes
Data in and out
Literature
(courtesy of Carole Goble, Manchester)
The eBank Project
• Building links between e-research data from the CombeChem project and scholarly communication and other on-line sources
• Investigating the role of aggregator services in linking data-sets from Grid-enabled projects to open data archives contained in digital repositories through to peer-reviewed articles as resources in portals
• JISC-funded project led by UKOLN in partnership with the Universities of Southampton and Manchester
Entire E-Science Cycle: encompassing experimentation, analysis, publication, research, learning
• Grid
• E-Scientists
• E-Experimentation
• Institutional Archive
• Local Web
• Publisher Holdings
• Digital Library
• Graduate Students
• Undergraduate Students
• Virtual Learning Environment
• Technical Reports
• Reprints
• Peer-Reviewed Journal & Conference Papers
• Preprints & Metadata
• Certified Experimental Results & Analyses
• Data, Metadata & Ontologies
Generic Issues
• In next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
NSF “Atkins” report on Cyberinfrastructure
• ‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’.
• ‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’.
Generic Issues cont
• Data Deluge from e-Science projects requires grid technologies to facilitate discovery, analysis, curation of data
• Sheer volume of text published, and of new results appearing, is impossible for researchers to read and correlate
• Effective automated processing is required to research, locate, gather and make use of knowledge encoded electronically in available literature
Bioscience and biomedicine
• Bioscience and biomedicine have resulted in a huge volume of domain literature
• Open Access publishers such as BioMed Central have a growing number of full-text articles
• Integration of literature and data analysis is of increasing importance: linking factual biodatabases to literature, using publishers to check, complete or complement contents of such databases
• NaCTeM is establishing high-quality service provision in text mining for the academic community, with an initial focus on biological and biomedical science
• Enabling e-Science applications!
Grid Technologies enabling Text mining
• Text mining process involves many steps
• Potentially many tools
• Large amounts of text and data to be analysed
• Requiring temporary storage of intermediate results
• Access to large resources, ontologies, document collections etc
• Compute-intensive algorithms
• Portal access to data and compute resources
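The points above (many steps, many tools, intermediate results held in temporary storage) can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not NaCTeM's tooling: the three steps, their names, and the sample sentence are all hypothetical, and simple term counting stands in for real text mining components.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path


def tokenise(text: str) -> list[str]:
    """Step 1: split text into lower-case word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_terms(tokens: list[str]) -> dict[str, int]:
    """Step 2: count how often each token occurs."""
    return dict(Counter(tokens))


def top_terms(counts: dict[str, int], n: int = 3) -> list[str]:
    """Step 3: keep the n most frequent terms (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Each step could be a separate tool; here they are plain functions.
STEPS = [("tokens", tokenise), ("counts", count_terms), ("top", top_terms)]


def run_pipeline(text: str, workdir: Path):
    """Run every step in order, saving each intermediate result to workdir."""
    data = text
    for name, step in STEPS:
        data = step(data)
        (workdir / f"{name}.json").write_text(json.dumps(data))
    return data


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    result = run_pipeline("Gene A binds gene B; gene B inhibits protein C.", workdir)
    saved = sorted(path.name for path in workdir.iterdir())

print(result, saved)
```

On a Grid, each step could run on a different resource, with the temporary directory replaced by shared storage; that mapping is not shown here.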
Conclusions
• The vast amount of electronic literature and digital text demands automatic capabilities for effective analysis
• Combining this capability with data analysis is of growing importance for some research areas
• The future services provided by NaCTeM will form a significant piece of the toolset for e-Science applications
Acknowledgements
Thanks to
• Carole Goble and the myGrid team
• Liz Lyon and the eBank team
• Dave Gavaghan and the eDiaMoND project
• Nick Walton and AstroGrid