big data 101

NEW ENGLAND SQL SERVER Big Data 101 SPONSORED BY Paresh Motiwala

Upload: paresh-motiwala-pmp

Post on 13-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Big data 101

NEW ENGLAND SQL SERVER

Big Data 101

SPONSORED BY

Paresh Motiwala

Page 2: Big data 101

BIG DATA 101
SERIOUSLY, THIS IS JUST 101
BY PARESH MOTIWALA
PREPARED FOR NESQL

Page 3: Big data 101

PARESH MOTIWALA, PMP ®

• [email protected]
• http://www.linkedin.com/in/pareshmotiwala
• @pareshmotiwala
• www.circlesofgrowth.com

Page 4: Big data 101

BIG DATA 101
• Who should attend
  • DBAs
  • CIO
  • Marketing peeps
  • Developers
  • Big Data Enthusiasts
• Who should not attend

Page 5: Big data 101

BIG DATA 101
• Agenda for the day:
  • Sources
  • Privacy concerns
  • Storing – Hadoop
  • Processing – MapReduce
  • Presentation
  • Summary

Page 6: Big data 101

BIG DATA 101

Page 7: Big data 101

SO WHY SHOULD I CARE ABOUT THIS?

Data is the new Electricity (Satya Nadella, Spring 2016)
https://www.microsoft.com/en-us/sql-server/data-driven
Companies generate data, distribute, meter, and use it

Where is data stored?
• Current: SQL Server, Oracle, Teradata, DB2, Netezza; open source databases: Cassandra, MySQL, MongoDB
• Unstructured: Hadoop, Spark, Data Lakes

What type of data is stored?
• Traditional: Rows and columns
• Big Data Explosion: Images, streaming data, internet-connected devices (IoT), machine data

Page 8: Big data 101

BIG DATA IS DRIVING TRANSFORMATIVE CHANGES

Data characteristics
• Traditional: Relational data with highly modeled schema
• Big Data: All data with schema agility

Costs
• Traditional: Specialized HW
• Big Data: Commodity HW

Culture
• Traditional: Operational reporting, focus on rear-view analysis
• Big Data: Experimentation leading to intelligent action, with machine learning, graph, A/B testing

Page 9: Big data 101

BIG DATA 101
• Sources
  • Cell Phones
  • Social Media
  • Credit Cards
  • GPSs
  • Bread Crumbs

Page 10: Big data 101

BIG DATA 101
• 5 Vs of Big Data
  • Volume
  • Variety
  • Velocity
  • Veracity
  • Value

Page 11: Big data 101

BIG DATA 101
• Desired Properties:
  • Robustness – Fault Tolerance
  • Low Latency
  • Scalability
  • Generalization
  • Extensibility
  • Ad hoc Queries
  • Minimal Maintenance
  • Debuggability

Page 12: Big data 101

BIG DATA 101
• Flow
  • Collection
  • Pre-processing
  • Hygiene
  • Analysis
  • Visualization
  • Intervention

OVER 90% OF TODAY’S DATA WAS CREATED IN THE PAST 2 YEARS

Page 13: Big data 101

BIG DATA 101
• 5 Rs of Data Quality
  • Relevancy
  • Recency
  • Range
  • Robustness
  • Reliability
• Ephemeral vs. Durable
• Refresh of Data

Page 14: Big data 101

BIG DATA 101
• Privacy of Data
  • If I collect the data, is it mine?
  • Ownership vs. Rights
  • Share Answers, not Data
  • OpAl (http://www.trust.mit.edu/projects/)
  • Enigma
  • Let them know
    • Why you are collecting
    • What you are collecting
• FIPP – Fair Information Practice Principles
  • Individual Control
  • Transparency
  • Respect for Context
  • Security
  • Access and Accuracy
  • Focused Collection
• FERPA – Family Educational Rights and Privacy Act
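The "share answers, not data" idea can be illustrated with a small sketch: the data holder runs the query locally and releases only aggregates, suppressing groups that are too small. This is not how OpAl or Enigma is implemented; the records, the function name `average_spend_by_zip`, and the `MIN_GROUP_SIZE` threshold are all hypothetical, and a minimum group size on its own is only one simple safeguard.

```python
from statistics import mean

# Hypothetical threshold: groups smaller than this are never reported.
MIN_GROUP_SIZE = 3

# Hypothetical raw records that stay with the data holder.
PRIVATE_RECORDS = [
    {"zip": "02139", "spend": 40.0},
    {"zip": "02139", "spend": 50.0},
    {"zip": "02139", "spend": 60.0},
    {"zip": "02142", "spend": 10.0},
    {"zip": "02142", "spend": 20.0},
]


def average_spend_by_zip(records, min_group_size=MIN_GROUP_SIZE):
    """Return the average spend per zip code, omitting small groups."""
    groups = {}
    for record in records:
        groups.setdefault(record["zip"], []).append(record["spend"])
    return {
        zip_code: round(mean(values), 2)
        for zip_code, values in groups.items()
        if len(values) >= min_group_size
    }


if __name__ == "__main__":
    # Only the aggregate answer leaves the data holder, never the rows.
    print(average_spend_by_zip(PRIVATE_RECORDS))
```

The caller receives an answer for the zip code with three records and nothing for the one with two, so individual rows are never exposed.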

Page 15: Big data 101

BIG DATA 101
WHAT IS A DATA LAKE?
Courtesy: James Serra

A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

• A place to store unlimited amounts of data in any format inexpensively, especially for archive purposes
• Allows collection of data that you may or may not use later: “just in case”
• A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
• Complements EDW and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
• Frees up expensive EDW resources (storage and processing), especially for data refinement
• Allows for data exploration to be performed without waiting for the EDW team to model and load the data (quick user access)
• Some processing is better done with Hadoop tools than ETL tools like SSIS
• Easily scalable
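"Schema on read" can be sketched in a few lines: raw records are stored as they arrive, and each query supplies the fields and types it needs at read time. The sample events, the `read_with_schema` helper, and the two schemas below are hypothetical and stand in for files in a data lake.

```python
import json

# Hypothetical raw events, stored as-is with no schema enforced on write.
RAW_EVENTS = [
    '{"device": "phone-1", "temp_c": "21.5", "ts": "2017-04-13T10:00:00"}',
    '{"device": "gps-7", "lat": 42.36, "lon": -71.06}',
    '{"device": "phone-1", "temp_c": "22.0", "note": "extra field"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading.

    Records missing any requested field are skipped; extra fields are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {name: convert(record[name]) for name, convert in schema.items()}


# Two different "schemas" applied to the same raw data at query time.
temperature_schema = {"device": str, "temp_c": float}
location_schema = {"device": str, "lat": float, "lon": float}

if __name__ == "__main__":
    print(list(read_with_schema(RAW_EVENTS, temperature_schema)))
    print(list(read_with_schema(RAW_EVENTS, location_schema)))
```

The same stored lines serve both queries; nothing had to be modeled or loaded into a warehouse table before the data could be explored.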

Page 16: Big data 101

BIG DATA 101
THE “DATA LAKE” USES A BOTTOMS-UP APPROACH
Courtesy: James Serra

• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop

[Diagram labels: Devices, Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics]

Page 17: Big data 101

BIG DATA 101

Page 18: Big data 101

BIG DATA 101

Page 19: Big data 101

BIG DATA 101

Page 20: Big data 101

BIG DATA 101
• MapReduce
  • Map – Sends Queries
  • Reduce – Collects Results
  • Job Tracker
  • Task Tracker
  • YARN
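A minimal single-process sketch of the MapReduce pattern, using word count as the example. In the usual formulation, map emits key/value pairs from each input split, the framework groups the values by key, and reduce combines each group. The function names and input strings here are hypothetical, and the scheduling pieces (Job Tracker and Task Tracker in Hadoop 1, YARN in Hadoop 2) are not modeled.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, shuffle, then reduce each key."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    splits = ["big data is data", "data is the new electricity"]
    print(word_count(splits))
```

On a real cluster each split would be mapped on the node that stores it, and only the grouped intermediate pairs would move across the network to the reducers.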

Page 21: Big data 101

Azure Services

Base Architecture: Big Data Advanced Analytics Pipeline
• Near real-time data analytics pipeline using Azure Stream Analytics
• Big data analytics pipeline using Azure Data Lake
• Interactive analytics and predictive pipeline using Azure Data Factory

Pipeline stages:
• Data Sources
• Ingest
• Prepare (normalize, clean, etc.)
• Analyze (stat analysis, ML, etc.)
• Publish (for programmatic consumption, BI/visualization)
• Consume (Alerts, Operational Stats, Insights)

[Architecture diagram. Recoverable labels:
• Data sources: Telemetry; Sensor Readings; Device Health; Local DB; Logs; Customer MIS; Operational Logs; OnPrem Data; Historic Laser Data (1 time drop); Fault and Maintenance Data (1 time drop); Realtime Readings and Operational Data
• Data in Motion: Data Stream; Event Hub; Stream Analytics (real-time analytics); Machine Learning (Anomaly Detection); Live / real-time data stats, anomalies and aggregates
• Data at Rest: Azure Storage Blob; HDI Custom ETL Aggregate / Partition; Azure Data Lake Storage; Azure Data Lake Analytics (Big Data Processing); Machine Learning; Azure SQL (Predictions); Azure SQL; Legacy (Replaced by Azure SQL); Scheduled hourly transfer using Azure Data Factory
• Outputs: PowerBI dashboard; dashboard of predictions / alerts; dashboard of operational stats]

Page 22: Big data 101

VISION FOR BIG DATA AND DATA WAREHOUSING

Comprehensive, Connected, Choice
Your data, your workload, your business, your way

[Diagram. Recoverable labels:
• Azure Data Factory + Federated Query
• Axes: On-Premises and Cloud; Data Warehouse and “Big Data”
• Products shown: VMs, SQL DW, APS, SQL Server, HDInsight, Data Lake, HDP
• Platforms: Microsoft Azure, Microsoft SQL Server]

Page 23: Big data 101

BIG DATA 101
PRESENTATION
• R
• Python
• Power BI
• Power BI Desktop

Page 24: Big data 101

BIG DATA 101
• Someday Big Data will just become data

Page 25: Big data 101

BIG DATA 101
• Summary:
  • Sources
  • Privacy concerns
  • Storing – Hadoop
  • Processing – MapReduce
  • Presentation

Page 26: Big data 101

BIG DATA 101 - CONCLUSION
• SQL Server is the best Relational Database
• The world is much bigger than any one relational database
• What is your company’s data strategy?
• What is your company’s cloud strategy?
• Learn adjacent technologies that will make you valuable.
  • Power BI?
  • Hadoop?
  • NoSQL?

Page 27: Big data 101

BIG DATA 101
• BIBLIOGRAPHY
  • http://www.datasciencecentral.com/
  • https://www.youtube.com/playlist?list=PLt-0mOCwxJ6B_OxTlpevxJNAa7GfCLd3l
  • https://www.dezyre.com/article/hadoop-components-and-architecture-big-data-and-hadoop-training/114
  • MIT Big Data Analytics Course
  • Data Lake presentation by James Serra
  • Future of Data…..(or something like that) by George Walters

Page 28: Big data 101

BIBLIOGRAPHY - BIG DATA 101
• Ignite (IT Pros) - https://myignite.microsoft.com/videos
• Channel9 (Developers) - https://channel9.msdn.com/
• Microsoft Virtual Academy (Both) - http://mva.microsoft.com
• Technet Virtual Labs (Hands-on!) - https://technet.microsoft.com/en-us/virtuallabs/default
• Free Azure for 1 month - https://azure.microsoft.com/en-us/free/
• Free HDInsight (Hadoop as a service) for a week - https://azure.microsoft.com/en-us/services/hdinsight/information-request/
• MSDN? Link that to Azure for monthly Azure money.
• Github - https://github.com/