Building a Continuous Data Analysis Environment with Jenkins and Hadoop

Building a Continuous Data Analysis Environment. adingo, Inc., Engineer, Kenta

Upload: voyage-group

Post on 27-Jun-2015

TRANSCRIPT

  • 1. adingo

2. Meta Information
    2006.4 to 2012.3: Keio University (Artificial Intelligence, Semantic Web, Ontology Engineering)
    2011.2 to 2012.3: CTO at Trippiece, Inc. (Software Engineering)
    2012.4: Engineer at adingo, Inc. (Data Analysis, Operation Engineering)
    twitter: @suzu_v
    http://blog.kentasuzuki.net
4. Web, CRM, LTV http://cosmi.io
7. Photo: http://www.flickr.com/photos/darwinbell/5827849044/sizes/o/in/photostream/
10. AB 30
11. Audience Data + Data of Service
    Photo: http://www.flickr.com/photos/chef_ele/3791289276/sizes/l/in/photostream/
12. Photo: http://www.flickr.com/photos/catikaoe/232832224/sizes/o/in/photostream/
13. LOG: impression, click, conversion
14. months, weeks, days, hours, minutes, seconds
    Photo: http://www.flickr.com/photos/gadl/284995199/sizes/o/in/photostream/
15. Log: X0GB / day
16. Photo: http://www.flickr.com/photos/eole/1394588888/sizes/o/in/photostream/
19. Infrastructure of cosmi
21. [Architecture diagram labels: Adnetworks, media, Web Sites, SSP, DSP, Amazon Elastic Load Balancing, (Web) and Web API tiers on EC2 in Auto scaling Groups, DB (MongoDB), Amazon Simple Storage Service (S3), Amazon Elastic MapReduce]
    http://aws.amazon.com/jp/solutions/case-studies/adingo/
22. AWS: S3 -> Elastic MapReduce
24. Apache Hive
26. -- EX.) schema of access log
    CREATE EXTERNAL TABLE access_log (
      stamp bigint,
      ipaddress string,
      request_status string,
      latency string,
      user_agent string,
      referer string,
      response_out_byte int,
      input_byte int,
      connection string,
      status string
    )
    PARTITIONED BY (dt string)
    STORED AS SEQUENCEFILE
    LOCATION 's3n://sample-bucket/access_log/';
27. -- defining table for output daily referer count by pages.
    CREATE EXTERNAL TABLE daily_referer_count (
      referer string,
      referer_count int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3n://sample-bucket/reports/${EVENTDATE}/daily_referer_count_ranking';
28.
    S3
    -- aggregation daily count of each referer.
    INSERT OVERWRITE TABLE daily_referer_count
    SELECT referer, count(*) AS referer_count
    FROM access_log
    WHERE dt="${EVENTDATE}"
    GROUP BY referer
    DISTRIBUTE BY referer SORT BY referer_count DESC, referer ASC
    LIMIT 100;
33. Deployment & Running Batch with Jenkins
34. Jenkins
    Welcome to Jenkins CI! | Jenkins CI http://jenkins-ci.org/
35. Jenkins / API / MapReduce / Java / s3 / Hive / s3
36. $ vim apps/hoge/output_log_format
    $ git commit -m "#123"
    $ git push origin master
    [Diagram labels: push, Team]
37. [Diagram labels: push, Team, mirroring, github, push, mirror, fetch]
    $ git clone --mirror git@github.com:hoge/repos.git
    $ git remote update
38. [Diagram labels: push, Team, mirroring, :release, :master, Jenkins]
39. [Diagram labels: push, Team, mirroring, :release, development, :master, EC2, master]
    $ fab -f app_fabfile.py development deploy testing migrate
40. [Diagram labels: push, Team, mirroring, :release, production (EC2 x3, Auto scaling Group), preview (EC2, AWS CloudFormation), development, :master, EC2, "Servers belonging to a certain data domain.", pre, CloudFormation, release]
    $ fab -f app_fabfile.py preview deploy migrate
    $ fab -f app_fabfile.py production deploy migrate
41. [Same diagram as slide 40; added labels: .jar, hive, S3, MapReduce, Hive, S3]
    $ s3cmd put --recursive /path/to/workspace/* s3://example/hive/
42. [Same diagram as slide 40; added labels: .jar, hive, S3, Amazon Elastic MapReduce]
    $ TODAY=`date +%Y/%m/%d`
    $ elastic-mapreduce --create --name LogAggregation \
        --num-instances 10 --instance-type m1.small \
        --hive-script --arg s3://example/hive/query/1st_step.q \
        --args -d,EVENTDATE=$TODAY --step-name FirstStep \
        --hive-script --arg s3://example/hive/query/2nd_step.q \
        --args -d,EVENTDATE=$TODAY --step-name SecondStep
44. Photo: http://www.flickr.com/photos/proimos/4199675334/sizes/o/in/photostream/
46. Jenkins
47.
    [Diagram labels: ./elastic-mapreduce, Amazon Elastic MapReduce, MapReduce, Jenkins]
48. [Diagram labels: ./elastic-mapreduce, Amazon Elastic MapReduce, Trigger Parameterized Build (param: job_id), job_id]
49. [Diagram labels: ./elastic-mapreduce, Amazon Elastic MapReduce, Trigger Parameterized Build (param: job_id), monitoring state of the job, MapReduce]
50. [Diagram labels: ./elastic-mapreduce, FAILED., Amazon Elastic MapReduce, Trigger Parameterized Build (param: job_id), monitoring state of the job]
51. [Diagram labels: ./elastic-mapreduce, FAILED., Amazon Elastic MapReduce, Trigger Parameterized Build (param: job_id), Getting Log, Elastic MapReduce api]
52. [Diagram labels: ./elastic-mapreduce, Amazon Elastic MapReduce, Trigger Parameterized Build (param: job_id), Getting Log, mail, IRC, etc. including the log.]
55. Photo: http://www.flickr.com/photos/bogenfreund/556656621/sizes/o/in/photostream/
57. Photo: http://www.flickr.com/photos/proimos/4199675334/sizes/o/in/photostream/
59. Photo: http://www.flickr.com/photos/darwinbell/5827849044/sizes/o/in/photostream/
60. Photo: http://www.flickr.com/photos/srtagomez/5416367341/sizes/l/in/photostream/
62. Photo: http://www.flickr.com/photos/columna/236353428/sizes/l/in/photostream/
64. Jenkins
    Welcome to Jenkins CI! | Jenkins CI http://jenkins-ci.org/
    Parameterized Build - Jenkins - Jenkins Wiki https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Build
    Parameterized Trigger Plugin - Jenkins - Jenkins Wiki https://wiki.jenkins-ci.org/display/JENKINS/Parameterized+Trigger+Plugin (MapReduce, job_id)
    Build Pipeline Plugin - Jenkins - Jenkins Wiki https://wiki.jenkins-ci.org/display/JENKINS/Build+Pipeline+Plugin (MapReduce, git commit)
    IRC Plugin - Jenkins - Jenkins Wiki https://wiki.jenkins-ci.org/display/JENKINS/IRC+Plugin (IRC)
    Email-ext plugin - Jenkins - Jenkins Wiki https://wiki.jenkins-ci.org/display/JENKINS/Email-ext+plugin (Jenkins)
    O'Reilly Japan - Jenkins http://www.oreilly.co.jp/books/9784873115344/ (Jenkins, Parameterized Trigger)
65.
    Elastic MapReduce
    Amazon Elastic MapReduce Ruby Client : Developer Tools : Amazon Web Services http://aws.amazon.com/developertools/2264 (Jenkins)
    Running Hive on Amazon Elastic MapReduce : Articles & Tutorials : Amazon Web Services http://aws.amazon.com/articles/2857 (Elastic MapReduce, Hive)
    Contextual Advertising using Apache Hive and Amazon EMR : Articles & Tutorials : Amazon Web Services http://aws.amazon.com/articles/2855 (Hive)
66. Apache Hive
    Programming Hive - O'Reilly Media http://shop.oreilly.com/product/0636920023555.do (Hive, Web, JIRA)
67. Fabric
    Github: fabric/fabric https://github.com/fabric/fabric (fabric, python)
    Fabric documentation http://docs.fabfile.org/
68. cosmi Relationship Suite
    adingo CRM cosmi Relationship Suite http://pressrelease.adingo.jp/news/2012/07/adingocrmcosmi--7f06.html
    cosmi Relationship Suite https://www.facebook.com/note.php?note_id=222292327824741 (DMP)
69. http://www.infoq.com/jp/articles/Continous-Delivery-Patterns (Jenkins)
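
Slides 47 to 52 outline a monitoring flow: one Jenkins job launches the job flow with ./elastic-mapreduce, hands the job_id to a downstream job through Trigger Parameterized Build, and that job watches the state and, on FAILED, fetches the log and sends it by mail or IRC. The following is a minimal Python sketch of what the downstream job's logic might look like, not the speaker's actual script. The names wait_for_job, monitor_job, get_state, fetch_log, and notify are hypothetical, and the set of terminal states is an assumption; the callables stand in for whatever wraps the Elastic MapReduce API and the notification plugins.

```python
import time
from typing import Callable

# Assumption: states after which the job flow will not change any more.
TERMINAL_STATES = frozenset({"COMPLETED", "FAILED", "TERMINATED"})


def wait_for_job(
    job_id: str,
    get_state: Callable[[str], str],
    poll_seconds: float = 60.0,
    max_polls: int = 1440,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state(job_id) until a terminal state is seen, then return it."""
    for _ in range(max_polls):
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def monitor_job(
    job_id: str,
    get_state: Callable[[str], str],      # hypothetical wrapper around the EMR API
    fetch_log: Callable[[str], str],      # hypothetical: returns the job log as text
    notify: Callable[[str], None],        # hypothetical: mail, IRC, etc.
    **wait_options,
) -> int:
    """Return 0 on success, 1 otherwise, so Jenkins can mark the build failed."""
    state = wait_for_job(job_id, get_state, **wait_options)
    if state == "COMPLETED":
        return 0
    # Any other terminal state is treated as a failure: attach the log and notify.
    notify(f"Job {job_id} ended in state {state}.\n{fetch_log(job_id)}")
    return 1
```

In a Jenkins shell step the return value would typically be passed to sys.exit, with job_id read from the build parameter of the same name.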