AWS Webcast: Managing Big Data in the AWS Cloud (2014-09-24)
DESCRIPTION
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on the dimensions of your data source (structured or unstructured data, volume, item size, and transfer rates) and application considerations such as latency, cost, and durability. It will also share customer success stories and resources to help you get started.
TRANSCRIPT
Managing Big Data in the AWS Cloud
Siva Raghupathy
Principal Solutions Architect
Amazon Web Services
Agenda
• Big data challenges
• AWS big data portfolio
• Architectural considerations
• Customer success stories
• Resources to help you get started
• Q&A
Data Volume, Velocity, & Variety
• 4.4 zettabytes (ZB) of data exists in the digital universe today
– 1 ZB = 1 billion terabytes
• 450 billion transactions per day by 2020
• More unstructured data than structured data
(Chart: data volume growing from GB through TB, PB, and EB to ZB, 1990–2020)
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly/monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your web site’s click stream: tells you what deal or ad to try next time
• Daily fraud reports: tells you if there was fraud yesterday
Real-time Big Data
• Real-time metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
Big Data: Best Served Fresh
Data Analysis Gap
(Chart: generated data vs. data available for analysis, 1990–2020; the gap between the two keeps widening)
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
Big Data
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Time to results is key
• Hard to configure/manage

AWS Cloud
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Parallel compute clusters from a single data source
• Managed services
AWS Big Data Portfolio
• Collect / Ingest: Amazon Kinesis, Amazon SQS, AWS Import/Export, AWS Direct Connect
• Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS
• Process / Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, AWS Data Pipeline
• Visualize / Report: partner BI and visualization tools
Ingest: The act of collecting and storing data
Why Data Ingest Tools?
• Data ingest tools convert many random streams of data into a smaller set of sequential streams
– Sequential streams are easier to process
– Easier to scale
– Easier to persist
Data Ingest Tools
• Facebook Scribe – data collector
• Apache Kafka – data collector
• Apache Flume – data movement and transformation
• Amazon Kinesis – data collector
Amazon Kinesis
• Real-time processing of streaming data
• High throughput
• Elastic
• Easy to use
• Connectors for EMR, S3, Redshift, DynamoDB
Amazon Kinesis Architecture
Kinesis Stream: Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data inside the 24-hour window
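These per-shard limits make capacity planning simple arithmetic. A minimal sketch (Python; the limits are the ones quoted above, the workload figures are hypothetical):

    # Estimate the number of Kinesis shards needed for a workload, using the
    # per-shard limits above: 1 MB/s in, 1,000 records/s in, 2 MB/s out.
    import math

    def shards_needed(in_mb_per_sec, records_per_sec, out_mb_per_sec):
        return max(
            math.ceil(in_mb_per_sec / 1.0),      # ingest bandwidth limit
            math.ceil(records_per_sec / 1000.0),  # ingest record-rate limit
            math.ceil(out_mb_per_sec / 2.0),      # egress bandwidth limit
        )

    # Hypothetical workload: 5 MB/s in, 12,000 records/s, 8 MB/s out -> 12 shards
    print(shards_needed(5, 12000, 8))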
Simple PUT interface to store data in Kinesis
• Producers use a PUT call to store data in a stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key is supplied by the producer and used to distribute PUTs across shards
• Kinesis MD5-hashes the supplied partition key to map each record to a shard’s hash key range
• A unique sequence number is returned to the producer upon a successful PUT call
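A minimal producer sketch using boto3, the AWS SDK for Python (the stream name and payload are hypothetical; any AWS SDK exposes the same PutRecord call):

    # Put one record into a Kinesis stream; names are hypothetical.
    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    record = {"user": "abc", "action": "click"}
    resp = kinesis.put_record(
        StreamName="example-stream",
        Data=json.dumps(record).encode("utf-8"),  # up to 1 MB per record
        PartitionKey="abc",                       # MD5-hashed to pick a shard
    )
    # A unique sequence number is returned on a successful PUT
    print(resp["ShardId"], resp["SequenceNumber"])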
Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing
• Java client library; source available on GitHub
• Build and deploy your app with the KCL on your EC2 instance(s)
• The KCL is the intermediary between your application and the stream
– Automatically starts a Kinesis worker for each shard
– Simplifies reading by abstracting individual shards
– Increases/decreases workers as the number of shards changes
– Checkpoints to keep track of a worker’s location in the stream; restarts workers if they fail
• Integrates with Auto Scaling groups to redistribute workers to new instances
Sending & Reading Data from Kinesis Streams
Sending (write):
• HTTP POST
• AWS SDK
• LOG4J
• Flume
• Fluentd
Reading (read):
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
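For readers not using the KCL, the raw Get* APIs look like this. A sketch with boto3 (stream name hypothetical); the KCL automates this per-shard loop plus checkpointing and worker rebalancing:

    # Read records from the oldest end of one shard's 24-hour window.
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    stream = "example-stream"  # hypothetical

    shard_id = kinesis.describe_stream(StreamName=stream)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest stored record
    )["ShardIterator"]

    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for rec in out["Records"]:
        print(rec["SequenceNumber"], rec["Data"])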
AWS Partners for Data Load and Transformation
• HParser, Big Data Edition
• Flume, Sqoop
Storage
(Chart: services positioned by data structure complexity vs. query structure complexity)
• Structured, simple query – NoSQL: Amazon DynamoDB; cache: Amazon ElastiCache (Memcached, Redis)
• Structured, complex query – SQL: Amazon RDS; data warehouse: Amazon Redshift; search: Amazon CloudSearch
• Unstructured, no query – cloud storage: Amazon S3, Amazon Glacier
• Unstructured, custom query – Hadoop/HDFS: Amazon Elastic MapReduce
Amazon S3
• Store anything
• Object storage
• Scalable
• Designed for 99.999999999% durability

Why is Amazon S3 good for Big Data?
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all systems
• High bandwidth
• 99.999999999% durability
• Versioning, lifecycle policies
• Glacier integration
Amazon S3 Best Practices
• Use a random hash prefix for keys
• Ensure a random access pattern
• Use Amazon CloudFront for high-throughput GETs and PUTs
• Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink
– Durable sink between data services
– Supports de-coupling and asynchronous delivery
• Consider RRS for lower-cost, lower-durability storage of derivatives or copies
• Consider parallel threads and multipart upload for faster writes
• Consider parallel threads and range GET for faster reads
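A short sketch of two of these practices, the random hash prefix and multipart upload, using boto3 (bucket and file names are hypothetical):

    # Prefixing keys with a short hash spreads objects across S3's key space
    # for a flatter, more random access pattern. Names are hypothetical.
    import hashlib
    import boto3

    s3 = boto3.client("s3")

    def hashed_key(natural_key):
        prefix = hashlib.md5(natural_key.encode()).hexdigest()[:4]
        return "%s/%s" % (prefix, natural_key)

    # e.g. "logs/2014/09/24/host1.gz" -> "3f2a/logs/2014/09/24/host1.gz"
    key = hashed_key("logs/2014/09/24/host1.gz")

    # upload_file splits large objects into parallel multipart uploads for you
    s3.upload_file("host1.gz", "example-bucket", key)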
Aggregate All Data in S3, Surrounded by a Collection of the Right Tools
• EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline
• Spark Streaming, Cassandra, Storm
Amazon DynamoDB
• Fully-managed NoSQL database service
• Built on solid-state drives (SSDs)
• Consistent low-latency performance
• Any throughput rate
• No storage limits

DynamoDB Concepts
• Tables contain items; items are made of attributes
• Schema-less: schema is defined per attribute
DynamoDB: Access and Query Model
• Two primary key options
– Hash key: key lookups: "Give me the status for user abc"
– Composite key (hash with range): "Give me all the status updates for user abc that occurred within the past 24 hours"
• Support for multiple data types
– String, number, binary… or sets of strings, numbers, or binaries
• Supports both strong and eventual consistency
– Choose your consistency level when you make the API call
– Different parts of your app can make different choices
• Global secondary indexes
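The composite-key query from the slide, sketched with boto3 (table and attribute names are hypothetical):

    # Fetch all status updates for user "abc" from the past 24 hours.
    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("status-updates")

    day_ago = int(time.time()) - 24 * 3600
    resp = table.query(
        KeyConditionExpression=Key("user_id").eq("abc")
                               & Key("timestamp").gte(day_ago),
        ConsistentRead=True,  # choose strong consistency per API call
    )
    for item in resp["Items"]:
        print(item)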
DynamoDB: High Availability and Durability
• Regional service
• Synchronous replication to three Availability Zones
• Writes acknowledged only when they are on disk in at least two Availability Zones
What does DynamoDB handle for me?
• Scaling without downtime
• Automatic sharding
• Security inspections, patches, upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration designed specifically for DynamoDB
• Performance tuning
…and a lot more
Amazon DynamoDB Best Practices
• Keep item size small
• Store metadata in Amazon DynamoDB and blobs in Amazon S3
• Use a table with a hash key for extremely high scale
• Use a hash-range key to model
– 1:N relationships
– Multi-tenancy
• Avoid hot keys and hot partitions
• Use a table per day, week, month, etc. for storing time-series data
• Use conditional updates
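A conditional-write sketch with boto3 (table and attribute names are hypothetical): the write succeeds only if no item with the same key exists, so concurrent writers cannot silently overwrite each other:

    # Conditional put: reject the write if the key is already present.
    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("status-updates")

    try:
        table.put_item(
            Item={"user_id": "abc", "timestamp": 1411574400, "status": "ok"},
            ConditionExpression="attribute_not_exists(user_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print("item already exists; write rejected")
        else:
            raise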
Amazon RDS
• Relational databases
• Fully managed; zero admin
• MySQL, PostgreSQL, Oracle & SQL Server
Process and Analyze
Processing Frameworks
• Batch processing
– Take a large amount (>100 TB) of cold data and ask questions; takes hours to get answers back
– On AWS: Amazon EMR (Hadoop), Amazon Redshift
• Stream processing (real-time)
– Take a small amount of hot data and ask questions; takes a short amount of time to get your answer back
– Frameworks: Spark Streaming, Storm
Amazon Redshift
• Columnar data warehouse
• ANSI SQL compatible
• Massively parallel
• Petabyte scale
• Fully-managed
• Very cost-effective
Amazon Redshift architecture
• Leader node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
– DW1: HDD; scale from 2 TB to 1.6 PB
– DW2: SSD; scale from 160 GB to 256 TB
(Diagram: SQL clients and BI tools connect via JDBC/ODBC to the leader node, which coordinates the compute nodes (128 GB RAM, 16 TB disk, 16 cores each) over 10 GigE (HPC); ingestion, backup, and restore run against Amazon S3 and DynamoDB.)
Amazon Redshift Best Practices
• Use the COPY command to load large data sets from Amazon S3, Amazon DynamoDB, or Amazon EMR/EC2/Unix/Linux hosts
– Split your data into multiple files
– Use GZIP or LZOP compression
– Use a manifest file
• Choose a proper sort key
– Range or equality on the WHERE clause
• Choose a proper distribution key
– Join column, foreign key or largest dimension, GROUP BY column
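A minimal load sketch; Redshift speaks the PostgreSQL wire protocol, so any PostgreSQL driver works. Here psycopg2 runs a COPY with a manifest and GZIP (endpoint, credentials, table, and paths are hypothetical):

    # Load gzipped, pre-split files listed in a manifest from Amazon S3.
    import psycopg2

    conn = psycopg2.connect(
        host="example.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...",
    )
    cur = conn.cursor()
    cur.execute("""
        COPY events
        FROM 's3://example-bucket/events/load.manifest'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        MANIFEST GZIP DELIMITER '|'
    """)
    conn.commit()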
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
How Does EMR Work?
1. Put the data into S3
2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
• You can easily resize the cluster
• You can launch parallel clusters using the same data in S3
• Use Spot nodes to save time and money
The Hadoop ecosystem works inside EMR
Amazon EMR Best Practices
• Balance transient vs. persistent clusters to get the best TCO
• Leverage Amazon S3 integration
– Consistent view for EMRFS
• Use compression (LZO is a good pick)
• Avoid small files (< 100 MB; s3distcp can help!)
• Size the cluster to suit each job
• Use EC2 Spot Instances
Amazon EMR Nodes and Size
• Tuning cluster size can be more efficient than tuning Hadoop code
• Use the m1 and c1 families for functional testing
• Use m3 and c3 xlarge and larger nodes for production workloads
• Use cc2/c3 for memory- and CPU-intensive jobs
• Use hs1, hi1, and i2 instances for HDFS workloads
• Prefer a smaller cluster of larger nodes
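A sketch of launching a transient cluster with boto3's EMR API (the names, instance counts, and release label are illustrative assumptions, not recommendations):

    # Launch a transient EMR cluster that shuts down when its steps finish.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    resp = emr.run_job_flow(
        Name="nightly-hive-job",
        ReleaseLabel="emr-4.0.0",
        Applications=[{"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "c3.xlarge",
            "InstanceCount": 10,
            "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
        },
        LogUri="s3://example-bucket/emr-logs/",
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(resp["JobFlowId"])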
Partners – Analytics (scientific, algorithmic, predictive, etc.)
Visualize
Partners - BI & Data Visualization
Putting All The AWS Data Tools Together & Architectural Considerations
One tool to rule them all
Data Characteristics: Hot, Warm, Cold

               Hot        | Warm      | Cold
Volume         MB–GB      | GB–TB     | PB
Item size      B–KB       | KB–MB     | KB–TB
Latency        ms         | ms, sec   | min, hrs
Durability     Low–High   | High      | Very High
Request rate   Very High  | High      | Low
Cost/GB        $$–$       | $–¢¢      | ¢
Service     | Avg. latency         | Data volume        | Item size        | Request rate             | Cost ($/GB/month) | Durability
ElastiCache | ms                   | GB                 | B–KB             | Very High                | $$                | Low–Moderate
DynamoDB    | ms                   | GB–TB (no limit)   | B–KB (64 KB max) | Very High                | ¢¢                | Very High
RDS         | ms, sec              | GB–TB (3 TB max)   | KB (~row size)   | High                     | ¢¢                | High
CloudSearch | ms, sec              | GB–TB              | KB (1 MB max)    | High                     | $                 | High
Redshift    | sec, min             | TB–PB (1.6 PB max) | KB (64 K max)    | Low                      | ¢                 | High
EMR (Hive)  | sec, min, hrs        | GB–PB (~nodes)     | KB–MB            | Low                      | ¢                 | High
S3          | ms, sec, min (~size) | GB–PB (no limit)   | KB–GB (5 TB max) | Low–Very High (no limit) | ¢                 | Very High
Glacier     | hrs                  | GB–PB (no limit)   | GB (40 TB max)   | Very Low (no limit)      | ¢                 | Very High
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
           | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
Scenario 1 | 300                       | 2,048               | 1,483                 | 777,600,000
Scenario 2 | 300                       | 32,768              | 23,730                | 777,600,000

DynamoDB or S3?
• Scenario 1 (many small objects): use Amazon DynamoDB
• Scenario 2 (larger objects, more storage): use Amazon S3
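The reasoning behind the recommendation is simple cost arithmetic: request-heavy, small-object workloads are dominated by S3's per-request charge, while DynamoDB costs scale with provisioned write throughput and storage. A back-of-the-envelope sketch (the unit prices are illustrative assumptions, not quoted rates; plug in current pricing before relying on the result):

    # Rough monthly cost model for the two scenarios above.
    S3_PUT_PER_1000 = 0.005   # $ per 1,000 PUT requests (assumed)
    S3_STORAGE_GB   = 0.030   # $ per GB-month (assumed)
    DDB_WRITE_HR    = 0.0065  # $ per 10 write units per hour (assumed)
    DDB_STORAGE_GB  = 0.25    # $ per GB-month (assumed)

    def monthly_cost(writes_per_sec, obj_kb, gb_per_month):
        objects = writes_per_sec * 30 * 24 * 3600          # writes per month
        s3 = (objects / 1000.0 * S3_PUT_PER_1000
              + gb_per_month * S3_STORAGE_GB)
        write_units = writes_per_sec * max(1, obj_kb)      # 1 write unit per KB
        ddb = (write_units / 10.0 * DDB_WRITE_HR * 720     # 720 hours/month
               + gb_per_month * DDB_STORAGE_GB)
        return s3, ddb

    print(monthly_cost(300, 2, 1483))    # small objects: DynamoDB is cheaper
    print(monthly_cost(300, 32, 23730))  # larger objects: S3 is cheaper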
Lambda Architecture
Putting it all together: de-coupled architecture
• Multi-tier data processing architecture
• Ingest & store de-coupled from processing
• Ingest tools write to multiple data stores
• Processing frameworks (Hadoop, Spark, etc.) read from data stores
• Consumers can decide which data store to read from depending on their data processing requirements
(Diagram: Kinesis/Kafka ingest into NoSQL/DynamoDB, Hadoop HDFS, and S3. Spark Streaming and Storm answer over hot data at low latency; Spark, Impala, and EMR/Hadoop cover warm data; EMR/Hadoop and Redshift cover cold data. Latency of answers rises as data temperature falls from hot to cold.)
Customer Use Cases
Yelp: autocomplete search, recommendations, automatic spelling corrections

A look at how it works: automatic spelling corrections
• Data analyzed using EMR: months of user history and common misspellings (e.g., Westen, Wistin, Westan, Whestin)
• Months of user search data: search terms, misspellings, final click-throughs
• Yelp web site log data goes into Amazon S3
• Amazon Elastic MapReduce spins up a 200-node Hadoop cluster
• All 200 nodes of the cluster simultaneously look for common misspellings (Westen, Wistin, Westan, …)
• A map of common misspellings and suggested corrections is loaded back into Amazon S3
• Then the cluster is shut down; Yelp only pays for the time they used it
Each of Yelp’s 80 engineers can do this whenever they have a big data problem: Yelp spins up over 250 Hadoop clusters per week in EMR.
Data Innovation Meets Action at Scale at NASDAQ OMX
• NASDAQ’s technology powers more than 70 marketplaces in 50 countries
• NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• NASDAQ owns and operates 26 markets, including 3 clearinghouses and 5 central securities depositories
• More than 5,500 structured products are tied to NASDAQ’s global indexes, with a notional value of at least $1 trillion
• NASDAQ powers 1 in 10 of the world’s securities transactions
NASDAQ’s Big Data Challenge
• Archiving market data
– A classic "Big Data" problem
• Power surveillance and business intelligence/analytics
• Minimize cost
– Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
(Chart: NASDAQ exchange daily peak messages, roughly 100M–600M per day; market data is big data. Charts courtesy of the Financial Information Forum)
NASDAQ’s Legacy Solution
• On-premises MPP DB
– Relatively expensive, finite storage
– Required periodic additional expenses to add more storage
– Ongoing IT (administrative) human costs
• Legacy BI tool
– Requires developer involvement for new data sources, reports, dashboards, etc.
New Solution: Amazon Redshift
• Cost effective
– Redshift is 43% of the cost of the legacy solution, assuming equal storage capacities
– Doesn’t include ongoing IT costs!
• Performance
– Outperforms NASDAQ’s legacy BI/DB solution
– Inserts 550K rows/second on a 2-node 8XL cluster
• Elastic
– NASDAQ can add additional capacity on demand; easy to grow their cluster

New Solution: Pentaho BI/ETL
• Amazon Redshift partner: http://aws.amazon.com/redshift/partners/pentaho/
• Self service
– Tools empower BI users to integrate new data sources and create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective
Net Result
• New solution is cheaper, faster, and offers capabilities that NASDAQ didn’t have before
– Empowers NASDAQ’s business users to explore data like they never could before
– Reduces IT and development as bottlenecks
– Margin improvement (expense reduction and supports business decisions to grow revenue)
NEXT STEPS
AWS is here to help
Solution Architects
Professional Services
Premium Support
AWS Partner Network (APN)
aws.amazon.com/partners/competencies/big-data
Partner with an AWS Big Data expert
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Marketplace
AWS Online Software Store
aws.amazon.com/marketplace
Shop the big data category
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
AWS Grants Program
AWS in Education
aws.amazon.com/grants
AWS Big Data Test Drives
APN Partner-provided labs
aws.amazon.com/testdrive/bigdata
AWS Training & Events
Webinars, Bootcamps, and Self-Paced Labs
aws.amazon.com/training
aws.amazon.com/events
Big Data on AWS
Course on Big Data
aws.amazon.com/training/course-descriptions/bigdata
reinvent.awsevents.com
aws.amazon.com/big-data
Thank You!