Download - Vidya Jyothi Prof V. K. Samaranayake Memorial Oration 2016 ... · PowerPoint Presentation Author: ucsc Created Date: 7/7/2016 12:00:48 PM


Vidya Jyothi Prof V. K. Samaranayake Memorial Oration 2016

Big Myths and Big Opportunities in Big Data Analytics: A Data Engineering Perspective

Professor Saman Halgamuge

Optimization and Pattern Recognition Group, The University of Melbourne


Outline

• A personal tribute to the great visionary Prof V. K. Samaranayake: stories of inspiration

• Why is Data Science/Data Engineering the continuation of his legacy?

• What is Big Data? How does it affect us?

• Data Engineering: an engineering enabled approach to data analytics

• Some examples of our applied research in Big Data Analytics in collaboration with industry and other organisations

• Two examples of Big Data Analytic Problems relevant to Sri Lanka


Two Pioneers of Statistics, Data Science and Computing in the Region

Prof Prasanta Chandra Mahalanobis, Founder of Indian Statistical Institute in Calcutta

Prof V. K. Samaranayake, Founder of University of Colombo School of Computing

Vision, Courage, Efficiency and Strategy

Inspiring, Caring and Nurturing

The inventor of iHelmet – Forbes Asia 30 under 30: Consumer Tech (2016)

Minor Planet (26441) named after him (2010)

At the age of 15: Young Computer Scientist (YCS) - Gold award, Sri Lanka Association for the Software Industry (SLASI), 2006

Real complex problems may not have many clues or reliable data!


Big Data: A Data Engineering Perspective

• The 3 Vs of Big Data:
– Velocity
– Variety (Variability, Veracity)
– Volume

• Need to capture, curate, store, visualize, update, process/analyse and pay attention to information privacy

• Data engineers develop and use electronic/mechanical/chemical/biological hardware/“wetware” for:
– Capturing data
– Storing data
– Processing/analysing big data quickly (Big Data Analytics)
– Information privacy/security
– …


Big Data Analytics: Good old wine in new bottles?

• We create data profiles about our behaviours that can be captured, stored and sold!

• Yet some problems are still unsolved: e.g. MH 370

• How does Big Data Analytics differ from “Data Analysis”, “Pattern Recognition”, “Data Mining”…?

• Need to develop new and faster methods to analyse large volumes of data of multiple varieties/uncertainties/inaccuracies!

• Sometimes we know only a little or nothing about the information hidden in Data!


Big Data Analytics: Machine Learning Approaches

• Supervised Learning (know everything)

• Unsupervised Learning (know nothing)
– Self Organizing Maps
– Growing Self Organizing Maps (our past work)
– Deep Unsupervised Learning (our new work)

• Semi-supervised Learning (know something)
– Near Supervised Learning: most available methods
– Near Unsupervised Learning (NUL): our recent work

• What are the major differences?
– Speed, Accuracy, Applicability…

APPROACH: Supervised Learning

[Figure: Supervised Learning placed on a chart with axes "Percentage of known data labels" and "Percentage of known classes", each marked at 50% and 100%]

DATA ANALYTICS: Classification/Clustering Approaches - Supervised

EXAMPLE: Separate Bananas from Apples

We teach a child using some “samples”:

• APPLES are ROUND
• BANANAS are NOT ROUND
• Taste different
• Look different
• …

The child can then learn to differentiate Apples from Bananas

TRAIN A CHILD: need a supervisor (Supervised Learning)
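The apples-versus-bananas example can be sketched as a very small supervised classifier. This is only an illustration, not a method from the talk: the `roundness` feature, its values and the nearest-centroid rule are all assumptions chosen to mirror "APPLES are ROUND, BANANAS are NOT ROUND".

```python
# Hypothetical labelled samples: (roundness score in [0, 1], label).
# The feature and its values are invented for illustration.
TRAINING_SAMPLES = [
    (0.95, "apple"), (0.90, "apple"), (0.85, "apple"),
    (0.20, "banana"), (0.30, "banana"), (0.25, "banana"),
]


def train(samples):
    """Return the mean roundness of each labelled class (the 'supervisor' step)."""
    sums, counts = {}, {}
    for roundness, label in samples:
        sums[label] = sums.get(label, 0.0) + roundness
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(centroids, roundness):
    """Assign the label whose class mean is closest to the new sample."""
    return min(centroids, key=lambda label: abs(centroids[label] - roundness))


centroids = train(TRAINING_SAMPLES)
print(classify(centroids, 0.88))  # a fairly round fruit
print(classify(centroids, 0.15))  # a clearly non-round fruit
```

The labels supplied in `TRAINING_SAMPLES` play the role of the supervisor: without them, `train` would have nothing to average per class.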

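APPROACH: Unsupervised Learning

[Figure: Unsupervised Learning placed on the chart of percentage of known data labels against percentage of known classes, each marked at 50% and 100%]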


DATA ANALYTICS: Classification/Clustering Approaches - Unsupervised

EXAMPLE: No clue about the different types of fruits in the basket

If Supervised Learning is about training a child, Unsupervised Learning is about asking a learned, experienced person (a professor, say) for an opinion or best guess

No training is possible
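The fruit-basket example without any labels can be sketched with a basic clustering loop. The talk's own unsupervised methods are Self Organizing Maps and their variants; the two-group k-means below is a simpler stand-in, and the `roundness` values are hypothetical.

```python
def two_means(values, iterations=20):
    """Split 1-D values into two groups with a basic k-means loop (k = 2)."""
    centres = [min(values), max(values)]  # simple deterministic initialisation
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Recompute each centre; keep the old one if a group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


# Hypothetical roundness scores for an unlabelled basket of fruit.
basket = [0.95, 0.20, 0.90, 0.30, 0.85, 0.25]
low_group, high_group = two_means(basket)
print(sorted(low_group), sorted(high_group))
```

The algorithm separates the basket into two groups but cannot name them; deciding which group is "apples" still needs outside knowledge.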

Near Unsupervised Learning (our ARC funded work)

[Figure: Supervised, Semi-Supervised, Near-Unsupervised and Unsupervised Learning placed on the chart of percentage of known data labels against percentage of known classes, each marked at 50% and 100%]

Positive Unlabelled Learning (Current research with applications)

[Figure: Positive Unlabelled Learning shown alongside Near-Unsupervised Learning and Supervised Learning on the chart of percentage of known data labels against percentage of known classes, each marked at 50% and 100%]


My Interpretation of Data Analytics Space (marked in Yellow)

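[Figure: the data analytics space, marked in yellow, on the chart of percentage of known data labels against percentage of known classes, with Positive Unlabelled Learning indicated]

Good news for enthusiastic Computer Scientists, Mathematicians, Engineers…: mostly unexplored territory when moving into the era of “BIG DATA”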

THE MICROBIAL WORLD


Image: http://commons.wikimedia.org/

BIG DATA APPLICATIONS: LEARNING FROM THE SMALLEST AND THE OLDEST LIFE FORMS


Application in Metabolomics

• Metabolomics (2013)

Application in Neural Engineering

Computational Neuroscience


Micro Electrode Array: New Technology demanding Big data Analytics

Culture Preparation & Data Acquisition

Dissection of newborn mice and extracting the cortex

Culture Preparation

Plating MEAs

Maintaining cultures

Electrophysiological recording

Images courtesy of Multichannel Systems, Potter Lab (Georgia Tech), Nature Protocols

[Figure: electrodes and neurons on the micro electrode array]

High frequency features

• MEA voltage signals
• Data acquisition
• High Pass Filtering
• Spike Detection
• Burst Detection
• Network Burst (NB) Detection

• NB analysis:
– Rate
– Intervals
– Duration
– Spikes in NBs
– Channels in NBs
– Jitter

• Burst analysis:
– Rate
– Duration
– Spikes in Bursts

• Spike analysis:
– Firing rate
– Amplitude (dep/hyp)
– Spike width

[Figure: Channels against Time (s), showing a Spike, a Burst, a Network Burst (NB) and a Tonic channel]

Low frequency features

• Data acquisition
• Low Pass Filtering
• LFP Detection

• LFP analysis:
– Rate
– Amplitudes
– Intervals
– Width

[Figure: comparison of Drug A and Drug B]

• Local Field Potentials (LFPs): low frequency events that represent ionic gradients near electrodes, resulting from the activity of multiple neurons

• Can be used to differentiate drugs
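The high frequency stages (high pass filtering, spike detection, burst detection) can be sketched on a single synthetic channel. This is a rough illustration under stated assumptions, not the pipeline used in the talk: the moving-average filter, the threshold, the gap and the minimum spike count are all hypothetical choices.

```python
def high_pass(signal, window=5):
    """Crude high-pass filter: subtract a centred moving average from each sample."""
    half = window // 2
    filtered = []
    for i in range(len(signal)):
        segment = signal[max(0, i - half): i + half + 1]
        filtered.append(signal[i] - sum(segment) / len(segment))
    return filtered


def detect_spikes(signal, threshold):
    """Return sample indices where the signal crosses the threshold from below."""
    return [i for i in range(1, len(signal))
            if signal[i] >= threshold > signal[i - 1]]


def detect_bursts(spike_indices, max_gap, min_spikes):
    """Group spikes at most max_gap samples apart; keep groups of min_spikes or more."""
    bursts, current = [], []
    for idx in spike_indices:
        if current and idx - current[-1] > max_gap:
            if len(current) >= min_spikes:
                bursts.append(current)
            current = []
        current.append(idx)
    if len(current) >= min_spikes:
        bursts.append(current)
    return bursts


# Synthetic trace: constant offset of 2.0 with unit-height spikes (invented data).
trace = [2.0] * 40
for position in (5, 8, 11, 30):
    trace[position] = 3.0

spikes = detect_spikes(high_pass(trace), threshold=0.5)
bursts = detect_bursts(spikes, max_gap=5, min_spikes=3)
print(spikes, bursts)
```

Features such as burst rate, burst duration or spikes per burst can then be computed from the `bursts` list; real recordings would need a proper filter design and a noise-based threshold.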

Data Analytic System

Data Analysis Pipeline (stages shown on the slide):

• Microelectrode Array (MEA) data
• Select concentrations and time points
• Feature extraction
• Normalization
• Outlier removal
• Feature selection
• Exploratory Data Analysis
• Machine Learning

Signature network activity patterns for:

• genotypes
• drugs

Model: unlabelled drugs or genotypes are passed to the model to predict classes
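Two of the pipeline stages named on this slide, normalization and outlier removal, can be sketched as follows. The z-score rule, the cutoff and the feature values are assumptions for illustration; the slides do not say which methods were used.

```python
from statistics import mean, pstdev


def normalise(rows):
    """Z-score each feature column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def remove_outliers(rows, cutoff):
    """Drop any recording with a normalised feature beyond +/- cutoff."""
    return [row for row in rows if all(abs(x) <= cutoff for x in row)]


# Hypothetical feature rows, one per recording: [firing rate, burst rate].
features = [
    [1.0, 0.50], [1.1, 0.55], [0.9, 0.45],
    [1.0, 0.52], [1.05, 0.48], [10.0, 0.50],
]
cleaned = remove_outliers(normalise(features), cutoff=2.0)
print(len(features), len(cleaned))
```

With only six rows a z-score can never exceed about 2.04, so the cutoff of 2.0 is chosen to suit this toy table; larger datasets would typically use a looser cutoff or a median-based rule.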

Epileptic Phenotypes in Cultured Neuronal Networks

[Figure: extracellular voltages recorded from neuronal networks on a microelectrode array, plotted as Electrodes/Channels against Time]

Epileptic Phenotypes in Cultured Neuronal Networks

Wild type vs. with genetic mutations (“transgenic”): change in signal dynamics

Anti-epileptic Drugs? Electrical Stimulations?

[Figure: recordings plotted as Electrodes against Time]

Epileptic Phenotypes in Cultured Neuronal Networks


Wild type vs. with genetic mutations: change in signal dynamics

Recordings compared: spontaneous network activity and network activity after drug application

Features:
• Amplitudes
• Frequencies
• Action potential (spike) timing
• Bursting
• Synchronization
• Action propagation patterns

Exploratory data analysis + Pattern Recognition

BIG DATA PROBLEM CREATED DUE TO NEW MULTI ELECTRODE ARRAY TECHNOLOGY

Clustering Weekly recordings of Neuronal cultures

Transgenic (with mutation)

Wild Type


Biological variability:
• Variability between cultures containing neurons from different animals
• Variability within different cultures containing neurons from the same animal
• Variability among different “culture ages”

GOAL: Investigating separability between transgenic and wild type cultures amidst biological variability

Unsupervised Learning Method: Growing Self Organizing Map with nodes representing 3 weeks of recordings each of cultures from 6 different mice

PhD project of Dulini Mendis supervised by Steve Petrou and Saman Halgamuge

Can we eradicate Cancer?: CRISPR

[Figure: CRISPR Technology; the culprit “Target Gene” surrounded by clean data, incomplete data and noise]

Typical scenario: data of interest hidden among noise and largely incomplete data

CRISPR: Clustered regularly interspaced short palindromic repeats

Data Engineering: CRISPR

Generate more data where needed (informed by “smart” learning algorithms)

[Figure: CRISPR amid noise]


Big Data in Marketing and Advertising

• Information collected about you and me:

– Anytime, Anywhere, Anything

• We create data profiles about our behaviours that can be captured, stored and sold!

• Targeted advertising using our browsing history

• US internet advertising revenue: $40bn in 2012

• Internet: only 13-19% of total media advertising expenditure

Part 1: Synergy in Online Advertising Sequences

20 % 50 % 30 %

Part 2: Credit Allocation for Online Advertising Channels (Attribution)

How does On-line Advertising “convert” us to purchase?
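Credit allocation across advertising channels can be sketched as splitting the value of one conversion over the touchpoints that preceded it. The slides do not describe how the credits are derived, so this is only an assumption-laden illustration: the channel names are hypothetical, and the 20/50/30 percentages shown on the slides are reused purely as example weights.

```python
def allocate_credit(touchpoints, weights, conversion_value):
    """Split one conversion's value across ad touchpoints in proportion to weights."""
    if len(touchpoints) != len(weights):
        raise ValueError("one weight per touchpoint is required")
    total = sum(weights)
    credit = {}
    for channel, weight in zip(touchpoints, weights):
        # A channel seen more than once accumulates credit from each appearance.
        credit[channel] = credit.get(channel, 0.0) + conversion_value * weight / total
    return credit


path = ["display", "search", "email"]  # hypothetical channel names
credit = allocate_credit(path, [20, 50, 30], conversion_value=100.0)
print(credit)
```

The hard part of attribution is estimating the weights from data; this sketch only shows how a given set of weights turns into per-channel credit.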

“Not Enough Data available in Sri Lanka”?

Crop Variation in Sri Lanka

Decision Makers

Example: Forward planning in Agriculture

Locally collected inaccurate data

NOISE

How far are we away from “Precision Agriculture”?

Photo credit: https://mosesorganic.org/wp-content/uploads/Publications/Broadcaster/March2014/cover-crop-field.jpg

Can Data Engineers help?

“Not Enough Energy Storage in Sri Lanka”?

Geothermal heating and cooling

Sources:
http://www.smartgrids.eu/News_2014_and_before
https://www.bspq.com.au/tesla-powerwall-review
http://thephilanews.com/eu-u-s-coordinating-electric-vehicle-smart-grid-development-41215.htm

Least Cost Storage & Transmission Assets

Two Current relevant PhD Projects: Khalid Abdulla and Hansani Weeratunga

WHY ARE SOME PEOPLE AFRAID OF CONSIDERING OFF-THE-GRID SOLUTIONS?

We consider them as options for other places, for example in Indonesia

“Unreliable, inefficient, expensive…”?

• Utilizing renewable energy resources such as solar, wind, geothermal, hydro power, tidal and biofuel in a controlled manner with clever storage management would overcome the drawbacks of a single system.

• It would lead to a hybrid renewable energy system that can support off-the-grid electrification, enabling a reliable power supply and increasing the overall percentage of renewable energy generation capacity.

Source: http://energy.gov/eere/femp/renewable-energy-technologies-federal-projects

Can we predict the peak demand and reduce its dominance?
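Reducing the dominance of the peak can be sketched as a greedy battery dispatch over a demand profile that is assumed to be already known or forecast. The demand values, the target level, the battery capacity, the one-hour time steps and the lossless battery are all hypothetical simplifications, not details from the talk.

```python
def shave_peaks(demand, target, capacity):
    """Greedy battery dispatch: discharge above the target, recharge below it."""
    stored = capacity  # assume the battery starts full
    grid = []
    for load in demand:
        if load > target:
            # Cover as much of the excess as the battery still holds.
            discharge = min(load - target, stored)
            stored -= discharge
            grid.append(load - discharge)
        else:
            # Use spare headroom below the target to recharge.
            charge = min(target - load, capacity - stored)
            stored += charge
            grid.append(load + charge)
    return grid


hourly_demand = [3, 2, 2, 4, 6, 9, 7, 4]  # hypothetical kW values, one per hour
grid_load = shave_peaks(hourly_demand, target=6, capacity=4)
print(max(hourly_demand), max(grid_load))
```

In this toy profile the grid never sees more than the target level; with a smaller battery or a longer peak the storage would run out and part of the peak would pass through.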


Acknowledgment

This work is partially supported by 15 University of Melbourne PhD Scholarships, Australian Research Council Grants “Near Unsupervised computational methods for exploring omic data” (DP150103512) and “Discovering Patterns using Near Unsupervised Learning to Support the Quick Detection of New Animal Disease Outbreaks Caused by Viruses” (LP140100670), and YourGene Australia.

Current PhD students C. Wijetunga, D. Mendis, Y. Deerasooriya, H. Weeratunga, P. Hamead, C. Jayawardena, D. Herath, D. Senanayake, A. Khalaj, W. Wei, Y. Sun and K. Abdulla, previous students K. Amarasinghe, Z. Li, U. Premaratne, S. Jayasekara, D. Jayasundara, D. Alahakoon and K. Chan, and

research collaborators S. L. Tang, S. Petrou, U. Kayande, K. Steer, A. Wirth, B. Chang, M. Premaratne, A. Hsu, I. Saeed, U. Roessner, M. Kirley, K. Verspoor, J. Browne, G. Narsilio, D. Ackland and A. Bacic are acknowledged.

Thank you

