AWS Summit Barcelona: Data Analysis on AWS
AWS Summit 2013 Barcelona Oct 24 – Barcelona, Spain
Carlos Conde
Sr. Mgr. Solutions Architecture
DATA ANALYSIS ON AWS
GENERATE STORE ANALYZE SHARE
THE COST OF DATA
GENERATION IS FALLING
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT
GENERATE STORE ANALYZE SHARE
Lower cost,
higher throughput
Highly
constrained
[Chart: DATA VOLUME. Series: Generated data, Available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
GENERATE STORE ANALYZE SHARE
ACCELERATE
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE STORE ANALYZE SHARE
AWS Import/Export
AWS Direct Connect
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AMAZON S3 SIMPLE STORAGE SERVICE
AMAZON DYNAMODB
HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
DURABLE & AVAILABLE: CONSISTENT, DISK-ONLY WRITES (SSD)
LOW LATENCY: AVERAGE READS < 5 MS, WRITES < 10 MS
NO ADMINISTRATION
500,000 WRITES PER SECOND
DURING SUPER BOWL
AMAZON REDSHIFT
FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…
AMAZON REDSHIFT
A Whole Lot Simpler
A Lot Cheaper
A Lot Faster
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage
30 MINUTES
DOWN TO
12 SECONDS
Extra Large Node
(HS1.XL)
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
AMAZON REDSHIFT LETS YOU
START SMALL AND GROW BIG
Eight Extra Large Node (HS1.8XL)
Cluster 2-100 Nodes (32 TB – 1.6 PB)
CREATE A DATA WAREHOUSE IN MINUTES
JDBC/ODBC
                      Price Per Hour for       Effective Hourly    Effective Annual
                      HS1.XL Single Node       Price Per TB        Price Per TB
On-Demand             $0.850                   $0.425              $3,723
1 Year Reservation    $0.500                   $0.250              $2,190
3 Year Reservation    $0.228                   $0.114              $999
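The per-TB columns in the pricing slide can be reproduced with simple arithmetic. The sketch below assumes that the effective hourly price per TB is the node price divided by the 2 TB of an HS1.XL node, and that the annual figure is that price times 8,760 hours; the function and variable names are illustrative, not taken from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a non-leap year
NODE_TB = 2                # HS1.XL single node: 2 TB, as stated on the slides

# Hourly price of one HS1.XL node for each pricing option (from the slide).
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price, node_tb=NODE_TB):
    """Return (hourly price per TB, annual price per TB) for one node."""
    per_tb_hour = hourly_node_price / node_tb
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


if __name__ == "__main__":
    for option, price in HOURLY_NODE_PRICE.items():
        per_hour, per_year = effective_prices(price)
        print(f"{option}: ${per_hour:.3f}/TB/hour, ${per_year:,.0f}/TB/year")
```

Rounded to whole dollars, the annual values match the slide for all three pricing options.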
DATA WAREHOUSING DONE THE AWS WAY
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
Easy to provision and scale up massively
USAGE SCENARIOS
[Diagram: EMR, S3, Redshift, Reporting and BI]
[Diagram: OLTP Web Apps, DynamoDB, Redshift, Reporting and BI]
[Diagram: OLTP ERP, RDBMS, Redshift, Reporting & BI]
Social Point Analytics in AWS Marc Canaleta (CTO)
@mcanaleta AWS Summit Barcelona 2013
Social games developer for mobile and Facebook
Founded in 2008, offices in Barcelona (22@), 170 people.
Top #20 mobile grossing games worldwide
Top #3 Facebook developer
Social games: interaction between friends, virality
Freemium model: playing is free, some items are paid
Midcore segment
Leader in Breeding & Collecting strategy games
Top 20 grossing on the iOS App Store worldwide
Recently launched on Android, featured on Google Play
6M DAU on Facebook
No hardware to maintain or plan for: increases the speed of the business
Flexible: pay per use
Makes scaling easier: Auto Scaling
Makes high availability easier: multiple availability zones
Managed components: load balancers, databases, …
Analytics driven. Analytics are needed by almost all of our teams:
Engineers: real-time analytics, monitoring, problem detection
Product: decision making, A/B testing, game balancing, …
Marketing: campaign optimization
Finance: business tracking
[Architecture diagram]
CLIENTS: Flash, iOS, Android
BACKEND SERVERS: Symfony 2
ANALYTICS QUEUES: Redis
LOGFILES STORAGE: AWS S3
ANALYTICS DATABASE: AWS Redshift
REDIS
The backend writes events to Redis lists
Why Redis? Cost and performance: 10K events/second/server
Problem: it is an in-memory database, so the queues have to be drained constantly
Scaling and HA: N servers, distributed randomly
[Diagram: BACKEND, REDIS, REDIS]
Python processes consume the queues constantly and:
Calculate real-time metrics
Store event logfiles to upload them to S3
Enqueue the URL of the S3 object in SQS
[Diagram: EVENT GENERATION and DATA LOADING]
Redis Queue to Consumer: LPOP event
Consumer to Redis Real Time: INCR counter
Consumer to Event Log File: write event
Event Log File to Amazon S3: put object
Consumer to Amazon SQS: enqueue S3 object URL
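The consumer flow on this slide (LPOP an event, INCR a counter, write the event to a logfile, put the object in S3, enqueue its URL in SQS) could look roughly like the sketch below. It is not Social Point's code: `FakeRedis`, `drain_queue`, the `count:` key prefix, the JSON event format, and the `upload` and `enqueue` callbacks (stand-ins for the S3 and SQS calls) are all assumptions made so the example runs without any external service.

```python
import json
from collections import defaultdict, deque


class FakeRedis:
    """In-memory stand-in exposing only the Redis commands used here."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.counters = defaultdict(int)

    def rpush(self, key, value):
        self.lists[key].append(value)
        return len(self.lists[key])

    def lpop(self, key):
        return self.lists[key].popleft() if self.lists[key] else None

    def incr(self, key):
        self.counters[key] += 1
        return self.counters[key]


def drain_queue(redis_client, queue_key, max_events, upload, enqueue):
    """Pop up to max_events events, update counters, ship one logfile.

    Assumes events are JSON strings with a "type" field (hypothetical).
    upload(body) must store the logfile and return its URL (S3 stand-in);
    enqueue(url) must publish that URL to the importers (SQS stand-in).
    """
    lines = []
    for _ in range(max_events):
        raw = redis_client.lpop(queue_key)       # LPOP event
        if raw is None:                          # queue is empty
            break
        event = json.loads(raw)
        redis_client.incr("count:" + event["type"])  # INCR counter
        lines.append(raw)                        # write event to the logfile
    if not lines:
        return None
    url = upload("\n".join(lines) + "\n")        # put object
    enqueue(url)                                 # enqueue S3 object URL
    return url
```

With a real Redis client the same function should work as long as the client returns strings rather than bytes; the callbacks keep the S3 and SQS details out of the loop so they can be swapped or tested separately.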
Python is well suited to writing workers and processing data
Redis: structures such as counters, sets, and sorted sets for real-time metrics
S3: virtually unlimited space, scalable, highly available
SQS: reliability and availability, at a higher price than Redis
[Diagram: Amazon S3, Amazon SQS, Importer, TSV, Redshift]
The importers read URLs from SQS
They download the logfiles from S3
They convert them to TSV
They bulk import into Redshift (N logfiles at a time)
EVENT PROCESSING
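The TSV conversion step of the importers might be sketched as follows. The slides do not show the logfile format, so this assumes one JSON object per line; the column list and the function names are hypothetical. Reading from SQS, downloading from S3, and the bulk load itself (for example with Redshift's COPY command) are outside this sketch.

```python
import json

# Hypothetical column order of the target table; not given in the slides.
COLUMNS = ["event_time", "user_id", "type", "amount"]


def _clean(value):
    """Render one field, removing characters that would break a TSV row."""
    if value is None:
        return ""
    text = str(value)
    return text.replace("\t", " ").replace("\n", " ").replace("\r", " ")


def logfile_to_tsv(logfile_text, columns=COLUMNS):
    """Convert a logfile of JSON lines into tab-separated rows.

    Missing fields become empty strings; blank lines are skipped.
    """
    rows = []
    for line in logfile_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        rows.append("\t".join(_clean(event.get(col)) for col in columns))
    return "\n".join(rows) + "\n" if rows else ""
```

Replacing tabs and newlines with spaces is the simplest way to keep rows well-formed; an importer that must preserve those characters would need an escaping scheme agreed with the loader instead.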
Lets us stay flexible: schema changes without downtime
Very scalable (with write downtime)
Low deployment risk: offline system, backups
Minimal maintenance: vacuums, disk space
Good SQL support, unlike other columnar databases
Daily transformations and calculations implemented in SQL
Example: UPDATE USER SET total_revenues = (SELECT SUM(amount) FROM transaction t
WHERE t.user_id = user.user_id);
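To see what the daily UPDATE above computes, the same correlated subquery can be run against a throwaway in-memory database. This sketch uses Python's built-in sqlite3 rather than Redshift, and renames the tables to `users` and `transactions` (an assumption, chosen to avoid reserved words); the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, total_revenues REAL);
CREATE TABLE transactions (user_id INTEGER, amount REAL);
INSERT INTO users (user_id) VALUES (1), (2), (3);
INSERT INTO transactions VALUES (1, 4.99), (1, 9.99), (2, 0.99);
""")

# Same shape as the slide's example: one correlated SUM per user row.
conn.execute("""
UPDATE users SET total_revenues = (
    SELECT SUM(amount) FROM transactions t WHERE t.user_id = users.user_id
)
""")

rows = conn.execute(
    "SELECT user_id, total_revenues FROM users ORDER BY user_id"
).fetchall()
for user_id, total in rows:
    print(user_id, total)
```

User 3 has no transactions, so the subquery yields NULL and `total_revenues` comes back as `None`; wrapping the sum in `COALESCE(..., 0)` would be one way to store zero instead.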
Why not Hadoop?
Far more complex and slower; for now, SQL operations meet all of our requirements
Would you like to work in the video game industry?
We are looking for talent. Talent attracts talent.
www.socialpoint.es/jobs
THANK YOU!
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon Elastic MapReduce
AMAZON ELASTIC MAPREDUCE
HADOOP AS A SERVICE
• A FRAMEWORK
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS
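The three steps in the bullets above (split the data into pieces, let processing occur on each piece, gather the results) can be illustrated with a single-process word count. This is only a sketch of the pattern, not of Hadoop or Elastic MapReduce themselves; all function names are made up for the example, and on a real cluster the per-piece processing would run in parallel on many nodes.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(records, piece_size):
    """Split the input into consecutive pieces of at most piece_size records."""
    return [records[i:i + piece_size] for i in range(0, len(records), piece_size)]


def process_piece(piece):
    """Process one piece independently: count the words in its lines."""
    counts = Counter()
    for line in piece:
        counts.update(line.split())
    return counts


def gather(partials):
    """Combine the per-piece results into one final result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def run_job(records, piece_size=2):
    pieces = split_into_pieces(records, piece_size)
    # Sequential here; a framework would distribute this step across nodes.
    partials = [process_piece(piece) for piece in pieces]
    return gather(partials)
```

Because each piece is processed without looking at the others, adding nodes only changes where `process_piece` runs, which is what makes the approach scale.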
[Diagram sequence: Corporate Data Center and Elastic Data Center]
Application data and logs for analysis pushed to S3
Amazon Elastic MapReduce name node to control analysis
Hadoop cluster started by Elastic MapReduce
Adding many hundreds or thousands of nodes
Disposed of when job completes
Results of analysis pulled back into your systems
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
PUBLIC DATA SETS http://aws.amazon.com/publicdatasets
GENERATE STORE ANALYZE SHARE
FROM DATA TO
ACTIONABLE
INFORMATION