Ad-Tech on AWS Seminar | AWS and Data Analytics

Data Analytics on AWS
AWS and Data Analytics

Piljoong Kim (@PiljoongKim), Solutions Architect
Amazon Web Services Korea

Agenda
• Data Analytics
• Big Data and data analytics
• Related AWS services
• AWS and data analytics

Before we begin

Let's start with a big smile!

The explosive growth of data
• Volume
• Velocity
• Variety

The evolution of big data
• Batch → reports
• Real-time → alerts
• Prediction → forecasts

Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Cassandra, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch

Too many tools

Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability

Object storage
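As a rough illustration of what eleven nines of durability means, the sketch below treats the figure as an annual per-object survival probability (an assumption about how to read the number) and computes the expected number of objects lost per year. The object count of 10 million is hypothetical.

```python
from decimal import Decimal


def expected_annual_loss(object_count: int, durability_percent: str) -> Decimal:
    """Expected objects lost per year, reading durability as an annual
    per-object survival probability (an interpretive assumption)."""
    loss_probability = Decimal(1) - Decimal(durability_percent) / Decimal(100)
    return Decimal(object_count) * loss_probability


# Hypothetical bucket holding 10 million objects.
loss = expected_annual_loss(10_000_000, "99.999999999")
years_per_lost_object = Decimal(1) / loss
print(f"expected loss per year: {loss} objects")
print(f"about one object every {years_per_lost_object} years")
```

Decimal arithmetic is used so that the very small loss probability is not distorted by binary floating-point rounding.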

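Amazon Redshift
• Relational data warehouse
• Massively parallel processing, at petabyte scale
• Fully managed service
• SSD and HDD platforms available
• From $1,000 per TB per year, starting at $0.25 per hour
• Reserved Node option available

Structured data processing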

Amazon EMR
• Hadoop offered as a service
• Hive, Impala, Spark, Presto, and others
• Easy to use and fully managed
• Spot Instances can be used
• HDFS and S3 file systems

Semi-structured/unstructured data processing

Amazon Kinesis
• Real-time stream processing
• High throughput and elasticity
• Easy to use
• Integration with S3, Lambda, Redshift, and DynamoDB

Stream processing

Amazon ML
• Easy to use; a managed service built for developers
• Robust technology based on Amazon's internal systems
• Builds models from data already stored in AWS

Predictive analytics

AWS Lambda
• Serverless compute service that runs code in response to events
• Extends AWS services with user-defined custom logic
• Billed only for the requests served and the compute time used

Event processing

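Coming back to data analytics…

Many people ask the following questions:
• Is there a reference architecture I can follow?
• There are too many options. Which should I use?
• How should I use them?
• Why should I pick that one out of so many?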

Architecture principles

• Decoupled "data bus"
– Data → Store → Process → Answers
• Use the right tool for the job
– Data structure, latency, throughput, access patterns
• Use the Lambda architecture
– Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
– No/low admin
• Big data != Big cost

Simplify Big Data Processing

data → ingest / collect → store → process / analyze → consume / visualize → answers

Trade-offs along the pipeline: time to answer (latency), throughput, cost

Reference Architecture (diagram)

Collect → Store → Analyze → Consume

• Collect: applications (web apps, mobile apps on iOS and Android), logging (Logstash), IoT
• Data types: transactional data, file data, stream data, search data
• Store: SQL (Amazon RDS), NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache), search (Amazon ES), file storage (Amazon S3, Amazon Glacier), stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB)
• Analyze: batch and interactive (Amazon Redshift, Amazon Elastic MapReduce, Impala, Pig), ML (Amazon ML), stream processing (Amazon Kinesis, AWS Lambda, Streaming), ETL
• Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, apps & APIs
• Diagram labels: hot / warm / cold data, fast / slow processing

Reference Architecture (diagram, Collect → Store): stream storage highlighted; the other store groups shown are database, file storage, and search.

Stream storage options

AWS managed services
• Amazon Kinesis: stream
• Amazon DynamoDB Streams: table + streams
• Amazon SQS: queue
• Amazon SNS: pub/sub

Self-managed
• Apache Kafka: stream

Which stream store should I use?

                   Amazon Kinesis   Amazon DynamoDB Streams   Amazon SQS / Amazon SNS   Kafka
Managed            Yes              Yes                       Yes                       No
Ordering           Yes              Yes                       No                        Yes
Delivery           at-least-once    exactly-once              at-least-once             at-least-once
Lifetime           7 days           24 hours                  14 days                   Configurable
Replication        3 AZ             3 AZ                      3 AZ                      Configurable
Throughput         No limit         No limit                  No limit                  ~ Nodes
Parallel clients   Yes              Yes                       No (SQS)                  Yes
MapReduce          Yes              Yes                       No                        Yes
Record size        1 MB             400 KB                    256 KB                    Configurable
Cost               Low              Higher (table cost)       Low-Medium                Low (+admin)
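One way to use the comparison table is as a filter over requirements. The sketch below encodes a few rows of the table (managed, ordering, record size) and returns the stores that satisfy a set of needs. The class, function, and parameter names are hypothetical; the values are copied from the slide and may not reflect current service limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StreamStore:
    name: str
    managed: bool
    ordering: bool
    max_record_kb: Optional[int]  # None means "configurable"


# Values taken from the comparison table on the slide.
STORES = [
    StreamStore("Amazon Kinesis", True, True, 1024),
    StreamStore("Amazon DynamoDB Streams", True, True, 400),
    StreamStore("Amazon SQS / Amazon SNS", True, False, 256),
    StreamStore("Apache Kafka", False, True, None),
]


def candidates(managed_only: bool = False,
               need_ordering: bool = False,
               record_kb: int = 0) -> List[str]:
    """Return the names of stores that meet every stated requirement."""
    matches = []
    for store in STORES:
        if managed_only and not store.managed:
            continue
        if need_ordering and not store.ordering:
            continue
        if store.max_record_kb is not None and record_kb > store.max_record_kb:
            continue
        matches.append(store.name)
    return matches


print(candidates(managed_only=True, need_ordering=True, record_kb=500))
```

Cost, retention, and delivery semantics are left out of the filter to keep the sketch short; they would be added the same way.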

Reference Architecture (diagram, Collect → Store): file storage highlighted; the other store groups shown are database and search.

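Why is Amazon S3 good for big data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No compute cluster needed just for storage (unlike HDFS)
• Hadoop clusters can run on Amazon EC2 Spot Instances
• Several kinds of clusters (Spark, Hive, Presto) can use the same data at the same time
• Unlimited number of objects
• Designed for 99.999999999% durability
• High availability: survives an AZ failure
• Tiered storage through lifecycle policies (Standard, IA, Amazon Glacier)
• Security: SSL, client/server-side encryption at rest
• Low cost
• Very high bandwidth: no limit on aggregate throughput

Using S3 together with HDFS and Amazon Glacier

• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard - IA for infrequently accessed data
• Archive rarely accessed (cold) data in Amazon Glacier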
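The tiering guidance above can be sketched as a small decision function. The slide only says "very frequently", "frequently", "infrequently", and "rarely", so the numeric thresholds below are purely illustrative assumptions, as is the function name.

```python
def choose_storage_tier(reads_per_day: float) -> str:
    """Map an access frequency to a storage tier.

    The cut-off values are hypothetical; pick thresholds that match
    your own workload and pricing.
    """
    if reads_per_day >= 1000:        # very frequently accessed (hot)
        return "HDFS"
    if reads_per_day >= 1:           # frequently accessed
        return "Amazon S3 Standard"
    if reads_per_day >= 1 / 30:      # roughly monthly: infrequently accessed
        return "Amazon S3 Standard - IA"
    return "Amazon Glacier"          # rarely accessed (cold): archive


for rate in (5000, 20, 0.1, 0.001):
    print(rate, "->", choose_storage_tier(rate))
```

In practice the transitions from Standard to IA to Glacier would typically be driven by S3 lifecycle rules based on object age rather than by application code.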

Reference Architecture (diagram, Collect → Store): database + search tier highlighted.

Database + Search Tier

• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Redis, Memcached
• SQL: Amazon Aurora, MySQL, MariaDB, PostgreSQL, Oracle, SQL Server
• NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB

Best practice: use the right tool for the job

Applications

Concrete examples

Data structure and access patterns

Access pattern                          What to use?
Put/Get (key, value)                    Cache, NoSQL
Simple relationships → 1:N, M:N         NoSQL
Cross-table joins, transactions, SQL    SQL
Faceting, search                        Search

Data structure                          What to use?
Fixed schema                            SQL, NoSQL
Schema-free (JSON)                      NoSQL, Search
(Key, value)                            Cache, NoSQL
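The two tables can be combined: a store type is a good fit when it appears in both the row for the access pattern and the row for the data structure. The sketch below shows that intersection; the dictionary keys and the function name are hypothetical labels for the table rows.

```python
from typing import List

# Rows of the "access pattern" table.
BY_ACCESS_PATTERN = {
    "put_get": {"Cache", "NoSQL"},
    "simple_relationships": {"NoSQL"},
    "joins_transactions": {"SQL"},
    "faceting_search": {"Search"},
}

# Rows of the "data structure" table.
BY_DATA_STRUCTURE = {
    "fixed_schema": {"SQL", "NoSQL"},
    "schema_free": {"NoSQL", "Search"},
    "key_value": {"Cache", "NoSQL"},
}


def recommend(access_pattern: str, data_structure: str) -> List[str]:
    """Store types suggested by both tables, in alphabetical order."""
    return sorted(BY_ACCESS_PATTERN[access_pattern] & BY_DATA_STRUCTURE[data_structure])


print(recommend("put_get", "key_value"))
print(recommend("joins_transactions", "fixed_schema"))
print(recommend("faceting_search", "schema_free"))
```

An empty result means the two tables disagree for that combination, which is a hint to revisit the data model or split the workload across more than one store.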

Data temperature (diagram)

                Hot data → Warm data → Cold data
Request rate    High → Low
Cost/GB         High → Low
Latency         Low → High
Data volume     Low → High

Stores shown: Cache, SQL, NoSQL, Search, Glacier. Vertical axis: structure (low to high).

Reference Architecture (diagram, repeated): Collect → Store → Analyze → Consume

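Processing and analysis

Data analysis is the process of inspecting, cleansing, transforming, and modeling data in order to discover useful information, suggest conclusions, and support decision-making.

Examples:
• Interactive dashboards → interactive analytics
• Daily/weekly/monthly reports → batch analytics
• Billing/fraud alerts, 1-minute metrics → real-time analytics
• Sentiment analysis, prediction models → machine learning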

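Interactive analytics
• Works on large amounts of (warm/cold) data
• Takes seconds to get an answer
• Example: self-service dashboards

Batch analytics
• Works on large amounts of (warm/cold) data
• Takes minutes to hours to get an answer
• Example: generating daily, weekly, and monthly reports

Real-time analytics
• Works on small amounts of hot data
• Takes a short time (milliseconds to seconds) to get an answer
• Real time (event): respond in real time to events in the data stream. Example: billing/fraud alerts
• Near real time (micro-batch): near-real-time operations on micro-batches of the data stream. Example: 1-minute metrics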
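The three analytics styles differ mainly in data temperature and acceptable time to answer. The sketch below turns that into a simple classifier. The one-second and sixty-second boundaries are illustrative assumptions chosen to match the "milliseconds to seconds", "seconds", and "minutes to hours" ranges on the slides, and the function name is hypothetical.

```python
def analytics_mode(data_temperature: str, max_latency_seconds: float) -> str:
    """Suggest an analytics style from data temperature and latency budget.

    data_temperature: "hot", "warm", or "cold".
    Thresholds are illustrative, not prescribed by the slides.
    """
    if data_temperature == "hot":
        if max_latency_seconds <= 1:
            return "real-time (event)"
        return "near real-time (micro-batch)"
    if max_latency_seconds < 60:
        return "interactive"
    return "batch"


print(analytics_mode("hot", 0.05))    # e.g. fraud alert
print(analytics_mode("hot", 60))      # e.g. 1-minute metrics
print(analytics_mode("warm", 5))      # e.g. self-service dashboard
print(analytics_mode("cold", 3600))   # e.g. daily report
```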

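Prediction through machine learning

Machine learning (ML) gives computers the ability to learn without being explicitly programmed.

Machine learning algorithms:
• Supervised learning ← "teach" the program
– Classification ← Is this transaction fraudulent? (Yes/No)
– Regression ← What is the customer's LTV?
• Unsupervised learning ← let it learn by itself
– Clustering ← market segmentation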

Analytics tools and frameworks

• Machine learning: Mahout, Spark ML, Amazon ML
• Interactive analytics: Amazon Redshift, Presto, Impala, Spark
• Batch processing: MapReduce, Hive, Pig, Spark
• Stream processing
– Micro-batch: Spark Streaming, KCL, Hive, Pig
– Real-time: Storm, AWS Lambda, KCL

Analyze (diagram): batch and interactive (Amazon Redshift, Amazon Elastic MapReduce, Impala, Pig), ML (Amazon Machine Learning), stream processing (Amazon Kinesis, AWS Lambda, Streaming)

Customer case study: Hearst

Hearst is one of the world’s largest media and information companies, with more than 360 businesses.

“I don’t know how we could have made our clickstream data pipeline work without Amazon Kinesis.”
Peter Jaffe, Data Scientist, Hearst Corporation

• Needed to build a platform for analyzing real-time clickstream events and trending content
• Uses Amazon Kinesis Streams and Amazon Kinesis Firehose to transfer the 30 TB of clickstream data generated every day
• Uses Amazon Redshift for complex data science work and analytical queries
• Processes data generated by more than 300 websites
• Delivers clickstream data to editors within minutes
• Recirculation of trending content increased by more than 25 percent

https://aws.amazon.com/solutions/case-studies/hearst/


Hearst clickstream pipeline (diagram)

• Components shown: users to Hearst properties, clickstream, Node.js app (proxy), Amazon Kinesis, S3 storage, ETL on EMR, Amazon Redshift, data science application, models, aggregated data, API-ready data, Buzzing API
• Latency / throughput labels: milliseconds at 100GB/day, 30 seconds at 5GB/day, 100 seconds at 1G/day, 5 seconds at 1G/day



Data Science Toolbox

• IPython Notebook
• On Spark and Amazon Redshift
• Code sharing (and insights)
• User-friendly development environment for data scientists
• Auto-convert .ipynb → .py

Diagram labels: data, models, Amazon Redshift

Let's take a closer look at Redshift

Redshift use cases in ad tech

• Attribution analysis
• Campaign performance
• Data management
• Real-time bidding
• Retargeting

Why Redshift?

• Huge amounts of data
– 160 GB to 2 TB
– Access to S3
– Single cluster vs. multiple clusters
• As cheap as possible!
– $1,000/TB/year
– You cannot afford to lose data because of cost
– Data can be online or offline!
• Time is money!
– MPP columnar: run queries over billions of events and get results!
– SSD
– Approximate functions

Approximate COUNT DISTINCT

• Exact COUNT(DISTINCT): 692.8 s
• Approximate COUNT(DISTINCT): 34.9 s
• Error: < 0.76%
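To give an intuition for why an approximate distinct count can be so much faster, here is a toy HyperLogLog sketch. Redshift's documentation describes its approximate COUNT(DISTINCT) as HyperLogLog-based, but the code below is an independent teaching implementation, not Redshift's. The class name, register count, and sample data are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: fixed memory regardless of how many items are added."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        # 64-bit hash derived from SHA-1 (deterministic across runs).
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        width = 64 - self.p
        index = h >> width                       # first p bits pick a register
        rest = h & ((1 << width) - 1)            # remaining bits
        rank = width - rest.bit_length() + 1     # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Hypothetical event stream: 50,000 distinct users, each seen twice.
hll = HyperLogLog()
for repeat in range(2):
    for user in range(50_000):
        hll.add(f"user-{user}")

exact = 50_000
approx = hll.count()
print(f"exact={exact} approx={approx:.0f} error={abs(approx - exact) / exact:.2%}")
```

The exact count needs to remember every distinct value, while the sketch keeps only 4,096 small registers, which is what makes the approach attractive for billions of events.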

COPY from JSON

• Ingest JSON directly into Amazon Redshift

• If you have a 1:1 mapping between JSON elements and column names, use ‘auto’

• Map elements to columns using a JSONPaths file
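The two COPY options above can be sketched as follows. The helper builds the SQL text for a Redshift COPY from JSON and, for the explicit mapping case, the contents of a JSONPaths file. The table name, bucket paths, IAM role ARN, and function names are hypothetical, and the statement is only assembled as a string here, not executed.

```python
import json
from typing import Iterable, Optional


def jsonpaths_document(paths: Iterable[str]) -> str:
    """Contents of a JSONPaths file: one path per target column, in column order."""
    return json.dumps({"jsonpaths": list(paths)}, indent=2)


def copy_from_json(table: str, s3_source: str, iam_role: str,
                   jsonpaths_uri: Optional[str] = None) -> str:
    """Build a COPY statement. Uses 'auto' unless a JSONPaths file is given.

    Illustration only: values are interpolated directly, so do not pass
    untrusted input.
    """
    json_option = f"'{jsonpaths_uri}'" if jsonpaths_uri else "'auto'"
    return (
        f"COPY {table}\n"
        f"FROM '{s3_source}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON {json_option};"
    )


ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole"  # hypothetical

# 1:1 mapping between JSON element names and column names.
print(copy_from_json("clicks", "s3://example-bucket/clicks/", ROLE))

# Explicit mapping through a JSONPaths file stored in S3.
print(jsonpaths_document(["$.cID", "$.campID", "$.refVar"]))
print(copy_from_json("clicks", "s3://example-bucket/clicks/", ROLE,
                     "s3://example-bucket/clicks_jsonpaths.json"))
```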

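Data management

• Analysis results are typically delivered to end customers
• A central cluster works on all of the data, and a separate cluster is run for each customer
• Each customer's cluster can be scaled independently without affecting the others
• Ten clusters of one node each cost the same as one cluster of ten nodes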

Neustar's experience with Amazon Redshift

re:Invent 2014 (ADV403)
Slides: http://bit.ly/NeustarAWS
Video: http://bit.ly/AWSNeustarVideo

Four problems

• Frequency
• Attribution
• Overlap
• Ad-hoc

Four solutions

Frequency + Attribution + Overlap + Ad-hoc = 2.5 + 2 + 2.5 + 1.5 = 8.5 hours required

One cluster for all four workloads:

Workload                                       Node count   Node type     Restore   Maint.   Exec.
Frequency & Attribution & Overlap & Ad hoc     16           dw2.8xlarge   2h        1h       6h

= $691.20

One cluster per workload:

Workload      Node count   Node type     Restore   Maint.   Exec.
Frequency     8            dw2.8xlarge   1.5h      0.5h     2.5h
Attribution   8            dw2.8xlarge   1.5h      0.5h     2h
Overlap       8            dw2.8xlarge   1h        0.5h     2.5h
Ad-hoc        8            dw2.8xlarge   0h        0.5h     1.5h

= $556.80 (-19%)
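The two totals can be reproduced with simple arithmetic: cost = nodes × (restore + maintenance + execution hours) × hourly rate. The slide does not state the hourly rate; $4.80 per dw2.8xlarge node-hour is inferred here because it is the rate that makes both totals come out exactly, so treat it as an assumption.

```python
HOURLY_RATE = 4.80  # USD per dw2.8xlarge node-hour; inferred from the slide totals


def cluster_cost(nodes: int, restore_h: float, maint_h: float, exec_h: float,
                 rate: float = HOURLY_RATE) -> float:
    """Cost of keeping a cluster up for its restore, maintenance, and execution time."""
    return nodes * (restore_h + maint_h + exec_h) * rate


# One 16-node cluster running all four workloads.
single = cluster_cost(16, 2, 1, 6)

# Four 8-node clusters, one per workload: (restore, maintenance, execution) hours.
per_workload = {
    "Frequency": (1.5, 0.5, 2.5),
    "Attribution": (1.5, 0.5, 2),
    "Overlap": (1, 0.5, 2.5),
    "Ad-hoc": (0, 0.5, 1.5),
}
split = sum(cluster_cost(8, *hours) for hours in per_workload.values())

saving = (single - split) / single
print(f"single cluster: ${single:.2f}")
print(f"split clusters: ${split:.2f}")
print(f"saving: {saving:.0%}")
```

The split layout wins because each smaller cluster is only paid for while its own workload is restoring, being maintained, or running.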

Lesson Learned

Orchestrating Amazon Redshift clusters turned out to be really easy!

Don’t scale up, scale out.

Ad tech implemented on AWS

• (Front-end) Beanstalk: clickstream ingestion
• Kinesis: real-time data stream
• (Back-end) Beanstalk: KCL apps (Kinesis -> S3)
• Lambda: event-driven processing (S3 -> Redshift)
• Redshift: business intelligence reporting with in-house BI tool
• EMR: data processing on Spark

Architecture (diagram)

• Clickstream data collection: mobile device (SDK integration) → Elastic Beanstalk
• Data feeds: Kinesis → Elastic Beanstalk
• Log storage, data processing & analysis: S3, EMR, Lambda, Redshift
• Database: ElastiCache, DynamoDB
• Visualize & report: adbrix user, BI user

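“By building our next-generation big data system on EMR-Spark, we achieved cost savings of more than 60 percent.” …

“Using S3-Lambda-Redshift, it took one person about 10 business days to build the entire micro-batch analytics system.”

- Baek Jeong-sang (백정상), Development Team Lead at IGAWorks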

adbrix architecture on AWS (diagram)

• Entry points: adbrix user, mobile device, Route 53
• EC2: adbrix analytics; database for adbrix analytics (Amazon RDS, ElastiCache, DynamoDB)
• Elastic Beanstalk (activity tracker) → Amazon Kinesis → Elastic Beanstalk (activity process) → Amazon S3 (activity storage) → AWS Lambda (micro-batch loading) → Amazon Redshift (BI analysis)
• EMR-Spark: daily batch analysis
• Regions: AWS Tokyo (ap-northeast-1) and AWS N. Virginia (us-east-1), with cross-region replication

Using AWS Elastic Beanstalk

http://<elastic beanstalk app>/pixel.jpg?cID=10049&cdid=5961&campID=8&&ic_ch=&refVar=http%3A%2F%2Fwww.cosmopolitan.com%2F&icxid=1415035174637-8824780787007880&ic_uq=1415035296585-3799348233235675&ic_mid=&ic_js_ver=20140917&icctm_ht_athr=Tess%2520Koman&icctm_ht_aid=cosmo.article.32782&icctm_ht_attl=Terminally%2520lll%252029-Year-Old%2520Brittany%2520Maynard%2520Ends%2520Her%25200wn%2520Life%2520as%2520Planned&icctm_ht_chnl=Lifestyle&icctm_ht_dspb=NaN&icctm_ht_gack=1047615795&icct_m_ht_scck=&icctm_ht_q=&icctm_ht_kw=brittany%2520Ends%2520Her%2520wn%2520Life%2520as%2520Planned&icctm_ht_pgtyp=news&icctm_ht_dtpub=2014-11-03%252002%3A00%3A00&icctm_ht_sthr=Lifestyle&icctm_ht_stnm=cosmopolitan.com&icctm_ht_sfid=21422*FA0711DBFB-180E7D89E340EDB8&icctm_ht_cnocl=http%3A%2F%2Fwww.cosmopolitan.com%2Flifestyle%2Fnews%2Fa32782%2Fbrittany-maynard-dies%2F

Client browser → (image request) → AWS Elastic Beanstalk running Node.js → (post to Kinesis) → Amazon Kinesis → Amazon Kinesis-enabled app
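The tracking-pixel flow above can be sketched in a few lines: the web tier receives an image request, parses the query string, and turns it into a stream record. The original tier runs Node.js; Python is used here only for illustration. The function name and the choice of `icxid` as the partition key are assumptions, and the actual send is left as a comment so the sketch runs without AWS credentials.

```python
import json
from urllib.parse import parse_qs, urlsplit


def pixel_request_to_record(url: str) -> dict:
    """Convert a tracking-pixel URL into a record shaped for a stream put."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # parse_qs returns lists; keep the first value of each parameter.
    fields = {name: values[0] for name, values in query.items()}
    return {
        # Assumption: icxid identifies the visitor, so events for one visitor
        # land on the same shard and keep their order.
        "PartitionKey": fields.get("icxid", "unknown"),
        "Data": json.dumps(fields, sort_keys=True).encode("utf-8"),
    }


# Shortened version of the pixel URL shown on the slide (host is a placeholder).
url = ("http://example.invalid/pixel.jpg?cID=10049&cdid=5961&campID=8&&ic_ch="
       "&refVar=http%3A%2F%2Fwww.cosmopolitan.com%2F"
       "&icxid=1415035174637-8824780787007880")
record = pixel_request_to_record(url)
print(record["PartitionKey"])
print(record["Data"].decode("utf-8"))

# With boto3 installed and credentials configured, the record could be sent with:
#   boto3.client("kinesis").put_record(StreamName="example-clickstream", **record)
# where "example-clickstream" is a hypothetical stream name.
```

The handler would then return the 1x1 image immediately, so the browser is never blocked on downstream processing.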

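Mobile retargeting

Goal: a machine-learning-based real-time analytics platform for predicting customer propensity

Process: collected data → data cleansing → exploratory data analysis → data enrichment → propensity modeling → algorithm execution

• Collected data: user campaign history, user/device profiles, user browsing history (websites visited, products viewed, actions taken)
• Data cleansing: handle missing values, remove duplicates, correct inaccurate values
• Exploratory data analysis: univariate analysis, bivariate analysis
• Data enrichment: create new variables, transform variables
• Propensity modeling: compare models, select the best model
• Algorithm execution: adjust marketing campaigns, monitor feedback, tune algorithms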

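Mobile retargeting (architecture)

Services: Amazon Kinesis, Amazon ML, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Elastic Beanstalk

• Data collection: user visit history, device profiles, customer demographics
• Data processing: distributed data cluster (real-time + batch processing, relational + NoSQL)
• Computation: ad-serving algorithms (regression models, artificial neural networks), bid price optimization, business rule tuning
• Components: customer, CDN, real-time bidding, retargeting platform, reporting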

Big data streaming (diagram)

Producers → Aggregator (Amazon Kinesis) → Continuous processing (real-time apps, KCL apps, archiver) → Store (Amazon S3, event replay) → Analytics (Qubole)

• A DSP running its big data processing platform on AWS
• Evaluating 30 trillion (30T) ad opportunities monthly
• Processing 86 billion (86B) messages daily on Kinesis
• 72% monthly saving on operational costs

Real-time Analytics (diagram)

• Producer → store: Apache Kafka, Amazon Kinesis, DynamoDB Streams
• Process: KCL, AWS Lambda, Spark Streaming, Apache Storm, Amazon ML
• Outputs: Amazon SNS, Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES
• Output labels: notifications, alert, app state, real-time prediction, KPI

Interactive & Batch Analytics (diagram)

• Producer → store: Amazon S3
• Process, batch: Amazon EMR (Hive, Pig, Spark)
• Process, interactive: Amazon Redshift, Amazon EMR (Presto, Impala, Spark)
• Amazon ML: batch prediction, real-time prediction
• Consume

Lambda Architecture (diagram)

• Data → Amazon Kinesis (store) → process
• Batch layer: Amazon Kinesis S3 Connector → Amazon S3 → Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark) → answer
• Speed layer: KCL, AWS Lambda, Spark Streaming, Storm, Amazon ML → answer
• Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES → answer → applications
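The three layers of the Lambda architecture can be illustrated with a tiny in-memory model: an append-only log as the master dataset, a batch view recomputed from the whole log, a speed view covering events since the last batch run, and a query that merges both. This is a single-process teaching sketch with hypothetical names; in the diagram the log would be Kinesis/S3 and the views would live in stores such as Redshift or DynamoDB.

```python
from collections import Counter


class LambdaArchitectureSketch:
    """Counts events per campaign using batch, speed, and serving layers."""

    def __init__(self) -> None:
        self.log = []                 # immutable, append-only master dataset
        self.batch_view = Counter()   # recomputed from the full log
        self.speed_view = Counter()   # incremental counts since the last batch run

    def append(self, campaign_id: str) -> None:
        """Ingest one event: write to the log and update the speed layer."""
        self.log.append(campaign_id)
        self.speed_view[campaign_id] += 1

    def run_batch(self) -> None:
        """Batch layer: rebuild the view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.log)
        self.speed_view.clear()

    def query(self, campaign_id: str) -> int:
        """Serving layer: merge the batch view with the real-time delta."""
        return self.batch_view[campaign_id] + self.speed_view[campaign_id]


pipeline = LambdaArchitectureSketch()
for event in ["a", "a", "b"]:
    pipeline.append(event)
before_batch = pipeline.query("a")

pipeline.run_batch()
pipeline.append("a")
after_batch = pipeline.query("a")
print(before_batch, after_batch)
```

Because the log is never modified, the batch view can always be rebuilt from scratch, which is what makes errors in the speed layer recoverable.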

Key takeaways from this session

• Build a decoupled "data bus"!
– Data → Store → Process → Answers
• Use the right tool for the job!
– Data structure, latency, throughput, access patterns
• Seriously consider the Lambda architecture!
– Immutable (append-only) log, batch/speed/serving layer
• Use AWS managed services!
– No/low admin
• Always keep cost in mind!
– Big Data != Big Cost

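Key takeaways from this session

• Many small clusters are often better than one giant cluster!
– Take full advantage of the cloud: you can switch clusters on and off at any time
• Try using S3 as your data lake!
– It integrates very freely with other services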

Sacrificial Architecture

For many people throwing away a code base is a sign of failure, perhaps understandable given the inherent exploratory nature of software development, but still failure. But often the best code you can write now is code you'll discard in a couple of years time.

http://martinfowler.com/bliki/SacrificialArchitecture.html

Feedback is always welcome!

AWS official blog: http://aws.amazon.com/ko/blogs/korea

AWS official social media

@AWSKorea AWSKorea

AmazonWebServices AWSKorea

Thank you
