Vidya Jyothi Prof V. K. Samaranayake Memorial Oration 2016
Big Myths and Big Opportunities in Big Data Analytics: A Data Engineering Perspective
Professor Saman Halgamuge
Optimization and Pattern Recognition Group
The University of Melbourne
Outline
• A personal tribute to the great visionary Prof V. K. Samaranayake: stories of inspiration
• Why is Data Science/Data Engineering the continuation of his legacy?
• What is Big Data? How does it affect us?
• Data Engineering: an engineering enabled approach to data analytics
• Some examples of our applied research in Big Data Analytics in collaboration with industry and other organisations
• Two examples of Big Data Analytic Problems relevant to Sri Lanka
Two Pioneers of Statistics, Data Science and Computing in the Region
Prof Prasanta Chandra Mahalanobis, Founder of the Indian Statistical Institute in Calcutta
Prof V. K. Samaranayake, Founder of the University of Colombo School of Computing
Vision, Courage, Efficiency and Strategy
Inspiring, Caring and Nurturing
The inventor of iHelmet – Forbes Asia 30 under 30: Consumer Tech (2016)
Minor Planet (26441) named after him (2010)
At the age of 15: Young Computer Scientist (YCS) - Gold award, Sri Lanka Association for the Software Industry (SLASI), 2006
Real complex problems may not have many clues or reliable data!
Big Data: A Data Engineering Perspective
• The 3 Vs of Big Data:
– Velocity
– Variety (Variability, Veracity)
– Volume
• Need to capture, curate, store, visualize, update, process/analyse and pay attention to information privacy
• Data engineers develop and use electronic/mechanical/chemical/biological hardware/“wetware” for
– Capturing data
– Storing data
– Processing/analysing big data quickly (Big Data Analytics)
– Information privacy/security
– …
Big Data Analytics: Good old wine in new bottles?
• We create data profiles about our behaviours that can be captured, stored and sold!
• Yet some problems are still unsolved: e.g. MH 370
• How does Big Data Analytics differ from “Data Analysis”, “Pattern Recognition”, “Data Mining”…?
• Need to develop new and faster methods to analyse large volumes of data of multiple varieties/uncertainties/inaccuracies!
• Sometimes we know only a little or nothing about the information hidden in Data!
Big Data Analytics: Machine Learning Approaches
• Supervised Learning (know everything)
• Unsupervised Learning (know nothing)
– Self Organizing Maps
– Growing Self Organizing Maps (Our past work)
– DEEP UNSUPERVISED LEARNING (Our new work)
• Semi-supervised Learning (know something)
– Near Supervised Learning: most available methods
– Near Unsupervised Learning (NUL): Our recent work
• What are the major differences? – Speed, Accuracy, Applicability…
APPROACH: Supervised Learning
[Figure: Supervised Learning region on a plot of percentage of known data labels against percentage of known classes]
DATA ANALYTICS: Classification/Clustering Approaches - Supervised
EXAMPLE: SEPARATE Bananas from Apples
We teach a child using some “samples”:
• APPLES are ROUND
• BANANAS are NOT ROUND
• Taste different
• Look different
• …
The child can then learn to differentiate Apples from Bananas
TRAIN A CHILD: Need a supervisor (Supervised Learning)
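The apples-and-bananas example above can be sketched in code. This is a minimal illustration only: the two features (roundness and length), the sample values and the nearest-centroid rule are assumptions made for the sketch, not the method used in the talk. The labelled samples play the role of the supervisor.

```python
from statistics import mean

def train_centroids(samples):
    """Average the feature vectors of each labelled class (the 'teaching' step)."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(centroids, features):
    """Assign the label of the closest class centroid (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

# Hypothetical labelled samples: (roundness between 0 and 1, length in cm)
training = [
    ((0.95, 7.0), "apple"),
    ((0.90, 8.0), "apple"),
    ((0.20, 18.0), "banana"),
    ((0.25, 20.0), "banana"),
]
centroids = train_centroids(training)
prediction_round = predict(centroids, (0.92, 7.5))   # a round, short fruit
prediction_long = predict(centroids, (0.30, 19.0))   # a long, not round fruit
print(prediction_round, prediction_long)
```

Without the labels in `training`, `train_centroids` has nothing to learn from, which is the sense in which this approach needs a supervisor.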
APPROACH: Unsupervised Learning
[Figure: Unsupervised Learning region on a plot of percentage of known data labels against percentage of known classes]
DATA ANALYTICS: Classification/Clustering Approaches - Unsupervised
EXAMPLE: No clue about the different types of fruits in the basket
If Supervised Learning is about training a child, Unsupervised Learning is about asking a learned or experienced person (a professor, say) for an opinion or best guess
No training is possible
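The fruit-basket case can be sketched with a small Self Organizing Map, one of the unsupervised methods named earlier in the talk. This is a minimal one-dimensional SOM under stated assumptions: the two features, the sample values, the number of map nodes (2) and the deterministic initialisation are all chosen for illustration, and it is not the Growing Self Organizing Map used in the group's work. No labels are given to the algorithm.

```python
import math

def best_matching_unit(nodes, row):
    """Index of the node whose weight vector is closest to `row`."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((w - x) ** 2 for w, x in zip(nodes[i], row)),
    )

def train_som(data, n_nodes=2, epochs=50, lr0=0.5, radius0=1.0):
    """Train a 1-D Self Organizing Map and return the node weight vectors."""
    # Deterministic initialisation (an assumption): copy evenly spaced samples.
    step = (len(data) - 1) / (n_nodes - 1)
    nodes = [list(data[round(i * step)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrinks from 1 towards 0
        lr = lr0 * decay                      # learning rate
        radius = max(radius0 * decay, 1e-3)   # neighbourhood width
        for row in data:
            bmu = best_matching_unit(nodes, row)
            for i, node in enumerate(nodes):
                # Nodes near the winner on the map are pulled along with it.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d, x in enumerate(row):
                    node[d] += lr * influence * (x - node[d])
    return nodes

# Hypothetical unlabelled basket: (roundness, elongation), both scaled to 0..1
basket = [
    (0.95, 0.10), (0.90, 0.15), (0.92, 0.12),
    (0.20, 0.90), (0.25, 0.85), (0.22, 0.95),
]
nodes = train_som(basket)
groups = [best_matching_unit(nodes, fruit) for fruit in basket]
print(groups)
```

The map only reports which fruits resemble each other; naming the groups still needs the judgement of an experienced person.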
Near Unsupervised Learning (our ARC funded work)
[Figure: Supervised, Semi-Supervised, Near-Unsupervised and Unsupervised Learning regions on a plot of percentage of known data labels against percentage of known classes]
Positive Unlabelled Learning (Current research with applications)
[Figure: Supervised, Semi-Supervised, Near-Unsupervised, Unsupervised and Positive Unlabelled Learning regions on a plot of percentage of known data labels against percentage of known classes]
My Interpretation of Data Analytics Space (marked in Yellow)
[Figure: Supervised, Semi-Supervised, Unsupervised and Positive Unlabelled Learning regions on a plot of percentage of known data labels against percentage of known classes]
Good news for enthusiastic Computer Scientists, Mathematicians, Engineers…: mostly unexplored territory as we move into the era of “BIG DATA”
THE MICROBIAL WORLD
Image: http://commons.wikimedia.org/
BIG DATA APPLICATIONS: LEARNING FROM THE SMALLEST AND THE OLDEST LIFE FORMS
Application in Metabolomics
• Metabolomics (2013)
Application in Neural Engineering
Computational Neuroscience
Microelectrode Array (MEA): New Technology Demanding Big Data Analytics
Culture Preparation & Data Acquisition
Dissection of newborn mice and extracting the cortex
Culture Preparation
Plating MEAs
Maintaining cultures
Electrophysiological recording
Images courtesy of Multichannel Systems, Potter Lab (Georgia Tech), Nature Protocols
High frequency features
MEA voltage signals (Electrodes, Neurons) -> Data acquisition -> High Pass Filtering -> Spike Detection -> Burst Detection -> Network Burst (NB) Detection
• Spike analysis: Firing rate, Amplitude (dep/hyp), Spike width
• Burst analysis: Rate, Duration, Spikes in Bursts
• NB analysis: Rate, Intervals, Duration, Spikes in NBs, Channels in NBs, Jitter
[Figure: channels against time (s), marking a Spike, a Burst, a Network Burst (NB) and a Tonic channel]
Low frequency features
Data acquisition -> Low Pass Filtering -> LFP Detection
• LFP analysis: Rate, Amplitudes, Intervals, Width
[Figure: Drug A, Drug B]
• Local Field Potentials (LFPs): Low-frequency events that represent ionic gradients near electrodes, resulting from the activity of multiple neurons
• Can be used to differentiate drugs
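The spike and burst detection steps of the high-frequency pipeline can be sketched as follows. This is an illustrative sketch under assumptions: the slides do not give the detection rules, so the amplitude threshold (a multiple of a median-based noise estimate), the refractory period, the inter-spike-interval rule for bursts and all parameter values are assumed here. The input is taken to be one channel that has already been high-pass filtered.

```python
from statistics import median

def detect_spikes(signal, sampling_rate_hz, k=5.0, refractory_s=0.002):
    """Return spike times (s) where |signal| exceeds k times a robust noise estimate."""
    # Assumed noise estimate: median absolute amplitude scaled to a standard deviation.
    noise_sd = median(abs(v) for v in signal) / 0.6745
    threshold = k * noise_sd
    refractory = int(round(refractory_s * sampling_rate_hz))
    spikes, last = [], -refractory - 1
    for i, v in enumerate(signal):
        if abs(v) > threshold and i - last > refractory:
            spikes.append(i / sampling_rate_hz)
            last = i
    return spikes

def detect_bursts(spike_times, max_isi_s=0.05, min_spikes=3):
    """Group spikes separated by at most max_isi_s; keep groups of min_spikes or more."""
    bursts, current = [], []
    for t in spike_times:
        if current and t - current[-1] > max_isi_s:
            if len(current) >= min_spikes:
                bursts.append((current[0], current[-1], len(current)))
            current = []
        current.append(t)
    if len(current) >= min_spikes:
        bursts.append((current[0], current[-1], len(current)))
    return bursts

# Synthetic one-second recording at 1 kHz: small alternating noise plus 8 spikes.
rate_hz = 1000
signal = [1.0 if i % 2 == 0 else -1.0 for i in range(1000)]
spike_samples = [100, 110, 120, 130, 500, 800, 805, 810]
for i in spike_samples:
    signal[i] = -20.0

spikes = detect_spikes(signal, rate_hz)
bursts = detect_bursts(spikes)          # (start time, end time, spike count)
firing_rate_hz = len(spikes) / (len(signal) / rate_hz)
print(spikes, bursts, firing_rate_hz)
```

Features such as firing rate, burst rate and spikes per burst can then be read off the two lists; network bursts would need the same grouping applied across channels.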
Data Analysis Pipeline
• Input: Microelectrode Array (MEA) data
• Steps shown: Select concentrations and time points, Feature extraction, Normalization, Outlier removal, Feature selection, Exploratory Data Analysis, Machine Learning
• Result: Signature network activity patterns for genotypes and drugs
Data Analytic System
• Unlabelled drugs or genotypes -> Model -> Predict classes
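Three of the pipeline steps named above (outlier removal, normalization and feature selection) can be sketched on a small feature table. Everything specific here is an assumption for illustration: the feature names and values, the modified z-score outlier rule with a cutoff of 3.5, z-score normalization, and ranking features by the gap between class means. The slides do not state which methods the actual pipeline uses.

```python
from statistics import mean, median, pstdev

def remove_outliers(rows, labels, cutoff=3.5):
    """Drop rows whose modified z-score exceeds `cutoff` in any feature."""
    keep = set(range(len(rows)))
    for column in zip(*rows):
        med = median(column)
        mad = median(abs(v - med) for v in column)
        if mad == 0:
            continue  # no spread in this feature, so nothing to flag
        for i, v in enumerate(column):
            if 0.6745 * abs(v - med) / mad > cutoff:
                keep.discard(i)
    kept = sorted(keep)
    return [rows[i] for i in kept], [labels[i] for i in kept]

def normalize(rows):
    """Scale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def rank_features(rows, labels, names):
    """Order features by the gap between the two class means (largest first)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = {}
    for j, name in enumerate(names):
        a = [r[j] for r, lab in zip(rows, labels) if lab == classes[0]]
        b = [r[j] for r, lab in zip(rows, labels) if lab == classes[1]]
        scores[name] = abs(mean(a) - mean(b))
    return sorted(names, key=lambda n: scores[n], reverse=True)

# Hypothetical per-recording features extracted from MEA data.
names = ["firing_rate", "burst_rate", "spike_width"]
rows = [
    (5.0, 0.50, 1.0), (5.5, 0.55, 1.1), (4.8, 0.45, 0.9),    # drug A
    (9.0, 0.52, 1.0), (9.4, 0.48, 1.1), (8.8, 0.51, 0.9),    # drug B
    (50.0, 0.50, 1.0),                                        # recording artefact
]
labels = ["A", "A", "A", "B", "B", "B", "B"]

clean_rows, clean_labels = remove_outliers(rows, labels)
ranked = rank_features(normalize(clean_rows), clean_labels, names)
print(len(clean_rows), ranked)
```

The selected features would then feed a model that predicts the class of unlabelled drugs or genotypes.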
Epileptic Phenotypes in Cultured Neuronal Networks
Neuronal Networks -> Microelectrode Array -> Extracellular voltages
[Figure: extracellular voltages, Electrodes/Channels against Time]
Epileptic Phenotypes in Cultured Neuronal Networks
• Wild type versus cultures with genetic mutations (“transgenic”): change in signal dynamics
• Anti-epileptic Drugs? Electrical Stimulations?
[Figure: Electrodes against Time]
Epileptic Phenotypes in Cultured Neuronal Networks
With genetic mutations: Spontaneous network activity, Network activity after drug application
Wild type: Spontaneous network activity, Network activity after drug application
Change in signal dynamics
Features:
• Amplitudes
• Frequencies
• Action potential (spike) timing
• Bursting
• Synchronization
• Action propagation patterns
Exploratory data analysis + Pattern Recognition
BIG DATA PROBLEM CREATED DUE TO NEW MULTI ELECTRODE ARRAY TECHNOLOGY
Clustering Weekly recordings of Neuronal cultures
Transgenic (with mutation)
Wild Type
Biological variability:
• Variability between cultures containing neurons from different animals
• Variability within different cultures containing neurons from the same animal
• Variability among different “culture ages”
GOAL: Investigating separability between transgenic and wild type cultures amidst biological variability
Unsupervised Learning Method: Growing Self Organizing Map with nodes representing 3 weeks of recordings each of cultures from 6 different mice
PhD project of Dulini Mendis supervised by Steve Petrou and Saman Halgamuge
Can we eradicate Cancer?: CRISPR
[Figure labels: Clean data, Incomplete data, Noise, Culprit “Target Gene”, CRISPR Technology]
Typical scenario: data of interest hidden among noise and largely incomplete data
CRISPR: Clustered regularly interspaced short palindromic repeats
Data Engineering: CRISPR
[Figure labels: Noise, CRISPR]
Generate more data where needed (informed by “smart” learning algorithms)
Big Data in Marketing and Advertising
• Information collected about you and me:
– Anytime, Anywhere, Anything
• We create data profiles about our behaviours that can be captured, stored and sold!
• Targeted advertising using our browsing history
• US internet advertising revenue: $40bn in 2012
• The internet accounts for only 13-19% of total media advertising expenditure
Part 1: Synergy in Online Advertising Sequences
[Figure values: 20%, 50%, 30%]
Part 2: Credit Allocation for Online Advertising Channels (Attribution)
How does On-line Advertising “convert” us to purchase?
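Credit allocation across advertising channels can be illustrated with a simple baseline. The slides do not describe the attribution method used in this research, so the sketch below uses plain linear attribution (every channel in a converting sequence gets an equal share of the credit) purely as a reference point. The channel names and journeys are hypothetical.

```python
from collections import defaultdict

def linear_attribution(journeys):
    """Share conversion credit equally among the channels of each converting journey.

    journeys: list of (channel_sequence, converted) pairs.
    Returns each channel's fraction of the total credit.
    """
    credit = defaultdict(float)
    for channels, converted in journeys:
        if not converted or not channels:
            continue  # no purchase, so no credit to allocate
        share = 1.0 / len(channels)
        for channel in channels:
            credit[channel] += share
    total = sum(credit.values())
    return {ch: c / total for ch, c in credit.items()} if total else {}

# Hypothetical browsing journeys: the ads seen, in order, and whether a purchase followed.
journeys = [
    (["display", "search", "email"], True),
    (["search"], True),
    (["display", "email"], False),
    (["display", "search"], True),
]
shares = linear_attribution(journeys)
top_channel = max(shares, key=shares.get)
print(shares, top_channel)
```

A baseline like this ignores the order of the ads and any synergy between them, which are the aspects the two parts of the work listed above address.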
“Not Enough Data available in Sri Lanka”?
Crop Variation in Sri Lanka
Decision Makers
Example: Forward planning in Agriculture
Locally collected inaccurate data
NOISE
How far away are we from “Precision Agriculture”?
Photo credit: https://mosesorganic.org/wp-content/uploads/Publications/Broadcaster/March2014/cover-crop-field.jpg
Can Data Engineers help?
“Not Enough Energy Storage in Sri Lanka”?
Geothermal heating and cooling
Sources:
http://www.smartgrids.eu/News_2014_and_before
https://www.bspq.com.au/tesla-powerwall-review
http://thephilanews.com/eu-u-s-coordinating-electric-vehicle-smart-grid-development-41215.htm
Least Cost Storage & Transmission Assets
Two Current relevant PhD Projects: Khalid Abdulla and Hansani Weeratunga
WHY ARE SOME PEOPLE AFRAID OF CONSIDERING OFF-THE-GRID SOLUTIONS?
We consider them as options for other places, for example in Indonesia
“Unreliable, inefficient, expensive…”?
• Utilizing renewable energy resources such as solar, wind, geothermal, hydro power, tidal and biofuel in a controlled manner, with clever storage management, would overcome the drawbacks of a single system.
• It would lead to a hybrid renewable energy system that can support off-the-grid electrification, enabling a reliable power supply and increasing the overall percentage of renewable energy generation capacity.
Source: http://energy.gov/eere/femp/renewable-energy-technologies-federal-projects
Can we predict the peak demand and reduce its dominance?
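How storage can reduce the dominance of the peak can be sketched with a simple dispatch rule. This is an illustration under strong assumptions and not the method of the PhD projects mentioned above: the demand profile is hypothetical and treated as perfectly predicted, the battery is lossless and starts full, and the rule simply discharges above a chosen cap and recharges below it.

```python
def shave_peaks(demand_kw, cap_kw, capacity_kwh, step_h=1.0):
    """Return grid draw per time step when a battery holds the draw at or below cap_kw.

    Assumptions: lossless battery, starts full, no power limit other than stored energy.
    If the battery runs empty, the grid draw exceeds the cap for that step.
    """
    stored = capacity_kwh
    grid = []
    for demand in demand_kw:
        if demand > cap_kw:
            # Discharge to cover the excess, limited by the energy left.
            discharge = min(demand - cap_kw, stored / step_h)
            stored -= discharge * step_h
            grid.append(demand - discharge)
        else:
            # Recharge with the headroom below the cap, limited by free capacity.
            charge = min(cap_kw - demand, (capacity_kwh - stored) / step_h)
            stored += charge * step_h
            grid.append(demand + charge)
    return grid

# Hypothetical hourly demand (kW) with an evening peak.
demand = [2.0, 2.0, 3.0, 6.0, 8.0, 5.0, 3.0, 2.0]
grid = shave_peaks(demand, cap_kw=5.0, capacity_kwh=4.0)
print(max(demand), max(grid), grid)
```

In this example the same total energy is drawn from the grid, but the 8 kW peak is flattened to the 5 kW cap. In practice the cap depends on how well the peak can be predicted and on the storage available.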
Acknowledgment
This work is partially supported by 15 University of Melbourne PhD Scholarships, the Australian Research Council Grants “Near Unsupervised computational methods for exploring omic data” (DP150103512) and “Discovering Patterns using Near Unsupervised Learning to Support the Quick Detection of New Animal Disease Outbreaks Caused by Viruses” (LP140100670), and YourGene Australia.
Current PhD students C. Wijetunga, D. Mendis, Y. Deerasooriya, H. Weeratunga, P. Hamead, C. Jayawardena, D. Herath, D. Senanayake, A. Khalaj, W. Wei, Y. Sun and K. Abdulla; previous students K. Amarasinghe, Z. Li, U. Premaratne, S. Jayasekara, D. Jayasundara, D. Alahakoon and K. Chan; and research collaborators S. L. Tang, S. Petrou, U. Kayande, K. Steer, A. Wirth, B. Chang, M. Premaratne, A. Hsu, I. Saeed, U. Roessner, M. Kirley, K. Verspoor, J. Browne, G. Narsilio, D. Ackland and A. Bacic are acknowledged.
Thank you