

  • 1 Prof. Kai Hwang, USC, March 24, 2014

    Big-Data Analytics, Smart Clouds and Intelligent IoT Applications

    Kai Hwang

    University of Southern California, currently visiting Wuhan University

    Presentation in Wuhan, March 17, 2016


    What is Data Science?

    Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and analytical hypothesis analysis.

    A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle.

    Big Data refers to digital data volume, velocity and/or variety whose management requires scalability across coupled horizontal resources.


    The Five Vs of Big Data

  • 5

    [Figure: Data Science diagram. Labels: Domain Expertise, Math/Statistics, Programming Skills, Analytics, Algorithms, Models, Distributed Computing, Hadoop, Spark, Statistics, Machine Learning, Deep Learning (Neural Networks), Natural Language Processing, Data Mining, Data Visualization, Social Network & Graph Analysis, Medical Engineering & Science, Linear Algebra & Programming]

    How Big Is the Data Industry Today?

    All rights reserved, Kai Hwang, July 2015

    The Evolution of Scalable Parallel Computing on Clouds: From MapReduce to Hadoop and Spark in the Last 10 Years

    - Google MapReduce paradigm, written in C++: from search engine to Google App Engine
    - Hadoop library for MapReduce programming in a Java environment
    - Extending Hadoop from MapReduce batch processing over a bipartite workflow using distributed disks
    - Using Spark for in-memory processing in streaming mode over any DAG computing paradigm
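The two-stage (map, then reduce) workflow mentioned above can be illustrated with a minimal single-process sketch. This is not Hadoop or Google code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for illustration, and a real framework would distribute each phase across many machines and disks.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input, invented for this example.
docs = ["big data on clouds", "big data analytics", "clouds and IoT"]
counts = word_count(docs)
print(counts["big"], counts["clouds"], counts["iot"])
```

Because every job has exactly these two stages, a multi-step computation must be chained as several jobs with intermediate results written out between them, which is the constraint that DAG-based engines such as Spark relax.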


    Distributed and Cloud Computing, by Kai Hwang, Geoffrey Fox, and Jack Dongarra, published by Morgan Kaufmann, 2012 (648 pages), covers clouds, grids, social networks, P2P systems, the Internet of Things (IoT), big data, and security studies. Second edition to appear in late 2016.

    Data Deluge Enabling New Challenges

  • 9

    Architecture of the Internet of Things

    [Figure: three-layer IoT architecture]
    - Application Layer: Merchandise Tracking, Environment Protection, Intelligent Search, Telemedicine, Intelligent Traffic, Smart Home
    - Network Layer: Cloud Computing Platform, Mobile Telecom Network, The Internet, Information Network
    - Sensing Layer: RFID, RFID Label, Sensor Network, Sensor Nodes, GPS, Road Mapper

    Four Major IoT Components:

    - Wireless Sensor Network (WSN): Spatially distributed sensors that monitor physical or environmental conditions. WSNs emphasize information perception through all kinds of sensor nodes, a basic scenario of the IoT.
    - Machine-to-Machine (M2M) Communication: Typically, M2M refers to data communications with little or no human intervention among terminal devices such as computers, embedded processors, smart sensors/actuators, and mobile devices.
    - Body-Area Network (BAN): Uses advances in lightweight, small-size, ultra-low-power, intelligent wearable sensors that continuously monitor a person's physiological conditions for health status and motion control.
    - Cyber-Physical System (CPS): A system of collaborating computational elements controlling physical entities.

    Body-Area Networks (BAN) for Health-Care and Other Personal Applications

    Prof. Kai Hwang, USC, Nov. 25, 2013

    51 Use Cases of Big Data: from TBs to PBs (NIST 2013)

    1. Government Operation: National Archives and Records Administration, Census Bureau

    2. Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping

    3. Defense: Sensors, Image surveillance, Situation Assessment

    4. Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity

    5. Deep Learning and Social Media: Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets

  • 13

    51 Big-Data Use Cases: from TBs to PBs (Continued)

    6. The Ecosystem for Research: Metadata, Collaboration, Language Translation, Light source experiments

    7. Astronomy and Physics: Sky Surveys, Large Hadron Collider at CERN and Belle Accelerator in Japan

    8. Earth, Environmental and Polar Science: Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors

    9. Energy: Smart Grid

    Example: 2012 US Election


    Progression from Simple Analysis of Small Data To Cloud Analytics for Big Data

    Cloud Analytics on Big Data

    [Figure labels: Bayesian Classifiers, ANN, SVM]

  • 17

    Analytics Process Model


    Execution Layers of Cloud for Big-Data Mining and Analytics Apps.


    Cloud Mining of Big Data

    Traditional Database / Data Warehouse:
    - Structured data; low processing efficiency of unstructured data
    - Expensive hardware; poor compatibility
    - High scalability at the cost of vendor lock-in

    Distributed architecture, server supercluster (Data Warehouse) + Hadoop:
    - Hybrid analysis capabilities of structured/unstructured data
    - x86 servers; good compatibility
    - High scalability, over 10,000 node-level deployment

    Data scale: TB, PB, EB, ZB

  • 21

    Evolution of Hadoop Programming On Internet Clouds In 10 Years


    Machine Learning Approaches (1)

    1. Decision Tree Learning: Using a multi-way tree to make categorical decisions.

    2. Association Rule Learning: Discovering interesting relations between variables in large databases.

    3. Artificial Neural Networks (ANN): A learning algorithm inspired by the structure and functional aspects of biological neural networks.

    4. Support Vector Machines (SVM): Using supervised learning methods for classification and regression.

    Machine Learning Approaches (2)

    5. Clustering Analysis: Grouping sample data into clusters with similar properties or by some predefined criteria.

    6. Bayesian Networks: A belief network that represents a set of random variables and their conditional independencies.

    7. Representation Learning: Preserving the input information as a preprocessing step for other classification algorithms.

    8. Genetic Algorithms (GA): A search heuristic that mimics the process of natural selection and uses methods such as mutation and crossover to generate new genotypes that lead toward better decisions.
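As one concrete instance of clustering analysis (item 5 above), here is a minimal sketch of k-means on one-dimensional data. The slides do not name a specific clustering algorithm, so the choice of k-means is an assumption, as are the function name `kmeans_1d`, the sample points, and the initial centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd-style k-means on 1-D data with caller-supplied initial centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters


# Hypothetical sample data and starting centers, invented for this example.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
final_centers, final_clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(final_centers, final_clusters)
```

The "similar properties" criterion here is plain distance to the cluster mean; other criteria (density, connectivity) lead to different algorithms.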


    Bayesian Classifiers (1)

    - Consider each attribute and class label as random variables.
    - Given a record with attributes (A1, A2, ..., An), the goal is to predict class C.
    - Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An).
    - Can we estimate P(C | A1, A2, ..., An) directly from data?

  • 25

    Bayesian Classifiers (2)

    - Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using Bayes' theorem.
    - Choose the value of C that maximizes P(C | A1, A2, ..., An).
    - This is equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C).
    - How to estimate P(A1, A2, ..., An | C)?

    \[
    P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}
    \]

    Naïve Bayes Classifier (3)

    - Assume independence among attributes Ai when the class is given:

    \[
    P(A_1, A_2, \ldots, A_n \mid C_j) = P(A_1 \mid C_j)\, P(A_2 \mid C_j) \cdots P(A_n \mid C_j)
    \]

    - P(Ai | Cj) can be estimated for all Ai and Cj.
    - A new point is classified to Cj if P(Cj) ∏i P(Ai | Cj) is maximal.

    Example of Naïve Bayes Classifier (4)

    Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
    human          yes         no       no             yes        mammals
    python         no          no       no             no         non-mammals
    salmon         no          no       yes            no         non-mammals
    whale          yes         no       yes            no         mammals
    frog           no          no       sometimes      yes        non-mammals
    komodo         no          no       no             yes        non-mammals
    bat            yes         yes      no             yes        mammals
    pigeon         no          yes      no             yes        non-mammals
    cat            yes         no       no             yes        mammals
    leopard shark  yes         no       yes            no         non-mammals
    turtle         no          no       sometimes      yes        non-mammals
    penguin        no          no       sometimes      yes        non-mammals
    porcupine      yes         no       no             yes        mammals
    eel            no          no       yes            no         non-mammals
    salamander     no          no       sometimes      yes        non-mammals
    gila monster   no          no       no             yes        non-mammals
    platypus       no          no       no             yes        mammals
    owl            no          yes      no             yes        non-mammals
    dolphin        yes         no       yes            no         mammals
    eagle          no          yes      no             yes        non-mammals

    Record to classify:

    Give Birth  Can Fly  Live in Water  Have Legs  Class
    yes         no       yes            no         ?

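    A: attributes of the record to classify; M: mammals; N: non-mammals

    \[
    P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06
    \]

    \[
    P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042
    \]

    \[
    P(A \mid M)\, P(M) = 0.06 \times \frac{7}{20} = 0.021
    \]

    \[
    P(A \mid N)\, P(N) = 0.0042 \times \frac{13}{20} = 0.0027
    \]

    P(A|M)P(M) > P(A|N)P(N)

    => Mammals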
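The calculation above can be reproduced with a short sketch that counts relative frequencies directly from the 20 training records. The function name `naive_bayes_scores` and the tuple layout are assumptions chosen for illustration. No smoothing is applied, so an attribute value never seen with a class would force that class's score to zero.

```python
from collections import Counter

# Training records transcribed from the table:
# (give_birth, can_fly, live_in_water, have_legs, class)
RECORDS = [
    ("yes", "no", "no", "yes", "mammals"),             # human
    ("no", "no", "no", "no", "non-mammals"),           # python
    ("no", "no", "yes", "no", "non-mammals"),          # salmon
    ("yes", "no", "yes", "no", "mammals"),             # whale
    ("no", "no", "sometimes", "yes", "non-mammals"),   # frog
    ("no", "no", "no", "yes", "non-mammals"),          # komodo
    ("yes", "yes", "no", "yes", "mammals"),            # bat
    ("no", "yes", "no", "yes", "non-mammals"),         # pigeon
    ("yes", "no", "no", "yes", "mammals"),             # cat
    ("yes", "no", "yes", "no", "non-mammals"),         # leopard shark
    ("no", "no", "sometimes", "yes", "non-mammals"),   # turtle
    ("no", "no", "sometimes", "yes", "non-mammals"),   # penguin
    ("yes", "no", "no", "yes", "mammals"),             # porcupine
    ("no", "no", "yes", "no", "non-mammals"),          # eel
    ("no", "no", "sometimes", "yes", "non-mammals"),   # salamander
    ("no", "no", "no", "yes", "non-mammals"),          # gila monster
    ("no", "no", "no", "yes", "mammals"),              # platypus
    ("no", "yes", "no", "yes", "non-mammals"),         # owl
    ("yes", "no", "yes", "no", "mammals"),             # dolphin
    ("no", "yes", "no", "yes", "non-mammals"),         # eagle
]


def naive_bayes_scores(records, query):
    """Return {class: P(A | class) * P(class)} from plain relative frequencies."""
    class_counts = Counter(r[-1] for r in records)
    total = len(records)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(C)
        for i, value in enumerate(query):
            matches = sum(1 for r in records if r[-1] == label and r[i] == value)
            score *= matches / n_label  # conditional P(A_i | C)
        scores[label] = score
    return scores


# The record to classify: gives birth, cannot fly, lives in water, has no legs.
query = ("yes", "no", "yes", "no")
scores = naive_bayes_scores(RECORDS, query)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are unnormalized posteriors: the shared denominator P(A1, ..., An) is omitted because it does not affect which class is largest.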


    Decision Tree Training Process

  • 29

    What is Apache Spark? Very Powerful in Big Data Processing

    - It is a cluster computing platform designed to be fast in general-purpose applications, compared with the use of MapReduce or Hadoop.
    - On the speed side, Spark extends the MapReduce model to support interactive queries and streaming processing.
    - Spark offers the ability to run computations in memory, which is more efficient than MapReduce running on disks for complex applications.
    - Spark is highly accessible, offering simple APIs in Python, Java, Scala, SQL, etc. Spark can run in Hadoop clusters and access any Hadoop data source, such as Cassandra.

    Key Spark Libraries by 2015

    - Spark SQL deals with structured data.
    - Spark Streaming handles live streams of data.
    - The MLlib library contains common machine learning functionality.
    - GraphX is for manipulating social network graphs.
    - Spark can run with the following cluster managers: Hadoop YARN, Apache Mesos, and Spark's own Standalone Scheduler.

    The SMACT Technologies

  • 33

    Services-Oriented Architecture (SOA)

    Gartner's 2015 Hype Cycle of Emerging New Technologies

    [Table: SMACT Technologies. Columns: SMACT Technology, Theoretical Foundations, Hardware Advances, Software Tools and Libraries, Networking Enablers, Representative Service Providers]

    Centralized Cloud Control and Processing

  • 37

    Cloud Computing
    - Theoretical Foundations: Virtualization, Parallel and Distributed Computing
    - Hardware Advances: Server clusters, Clouds, Virtual Machines, Interconnection networks
    - Software Tools and Libraries: OpenStack, GFS, HDFS, MapReduce, Hadoop, Spark, Storm, Cassandra
    - Networking Enablers: Virtual Networks, OpenFlow Networks, Software-Defined Networks
    - Representative Service Providers: AWS, GAE, IBM, Salesforce, GoGrid, Apache, Azure, Rackspace, DropBox

    Internet of Things (IoT)
    - Theoretical Foundations: Sensing Theory, Cyber Physics, Navigation, Pervasive Computing
    - Hardware Advances: Sensors, RFID, GPS, Robotics, Satellites, ZigBee, Gyroscope
    - Software Tools and Libraries: TinyOS, WAP, WTCP, IPv6, Mobile IP, Android, iOS, WPKI, UPnP, JVM
    - Networking Enablers: Wireless LAN, PAN, MANET, WLAN Mesh, VANET, Bluetooth
    - Representative Service Providers: IoT Council, IBM, Health-Care, Smart Grid, Social Media, Smart Earth, Google, Samsung

    SMACT Technologies by Basic Theories, Typical Hardware, Software Tooling, Networking and Service Providers

    Concluding Remarks:

    - We must leverage computing clouds and data analytics for the storing, processing and mining of big data, which changes in time and space all the time.
    - Apache Hadoop has made big-data processing possible on large server clusters or on elastic clouds in the past decade.
    - Since 2009, Berkeley Spark has freed up many constraints in MapReduce and Hadoop programming for general-purpose static or streaming big-data applications.
    - Big data and cloud growth demand a major overhaul of our educational programs in computer science and technology.
    - The clouds, mobile, IoT and social networks are changing our world, reshaping human relations, promoting the global economy and triggering societal reforms on a global scale.