Scalding @ Coursera

Post on 14-Jun-2015

Category:

Software

DESCRIPTION

A lightning talk I gave about how Coursera decided to use Scalding.

TRANSCRIPT

1. Scalding @ Coursera
   Daniel Chia, @DanielJHChia
   Software Engineer, Infrastructure

2. Overview
   • Context
   • Growing Needs
   • Hive / Pig / Scalding

3. Technical (Online Stack)
   • 100% hosted on AWS
   • Service-oriented architecture
   • Mix of MySQL and Cassandra for persistence
   • Scala

4. Existing Warehouse
   • Streaming

5. Future Warehouse Flow
   • S3
   • Event Data

6. Need 1: Expressive
   • Joins
   • Aggregations
   • Secondary sort
   • Multiple map-reduce

7. Need 2: Semi-structured Data
   • Increased usage of Cassandra
   • Events data

8. {"timestamp": 1411359695744, "membershipState": "LearnerEnrolled"}

9. {"typeName": "multipart",
    "definition": {
      "assignmentParts": {
        "id1": {"typeName": "plainText", "order": 0,
                "definition": {"prompt": "Write a sentence describing what you think about cereal."}},
        "id2": {"typeName": "richText", "order": 1,
                "definition": {"prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal."}},
        "id3": {"typeName": "url", "order": 2,
                "definition": {"prompt": "Post a link to your favorite cereal."}},
        "id4": {"typeName": "plainText", "order": 3,
                "definition": {

10. Choices
   • Hive
   • Pig
   • Scalding

11. Hive
   • SQL-like language
   • Great for simple rollups and aggregations
   • Procedural transforms difficult to express

12. Pig
   • Mature
   • Procedural
   • Pig Latin + lots of UDFs

13. Scalding Pros
   • Succinct
   • Expressive
   • All code in one language
   • Re-use online data models

14. Scalding Pros
   • Easy to test

15. Scalding Cons
   • Have to learn Scala
   • More heavyweight for simple experimental things
   • Many layers abstracted from MapReduce

16. Scalding Example
   • User event data
   • Want to join with course and topic data

17. Scalding Example
    val events = TypedTsv /* load data */ .toTypedPipe
    val courses = TypedTsv .toTypedPipe
    val topics = TypedTsv .toTypedPipe

18. Scalding Example
    events.groupBy(_.courseId)
      .leftJoin(courses.groupBy(_.courseId))
      .groupBy(_._2.topicId)
      .leftJoin(topics.groupBy(_.topicId))
      /* more analysis */

19. Scalding Example
    events.groupBy(_.courseId)
      .leftJoin(courses.groupBy(_.courseId))
      .groupBy(_._2.topicId)
      .leftJoin(topics.groupBy(_.topicId))
      /* more analysis */

20. Scalding Example
    events.groupBy(_.courseId)
      .leftJoin(courses.groupBy(_.courseId))
      .groupBy(_._2.topicId)
      .sketch(reducer = 100)
      .leftJoin(topics.groupBy(_.topicId))

21. Scalding Wish-list
   • More documentation
   • Scala 2.11 soon, please?

22. Questions?
    We're hiring! coursera.org/jobs
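
The two left joins on slides 17 to 20 can be illustrated with plain Scala collections. This is only a minimal in-memory sketch of the join semantics, not Scalding code and not Coursera's pipeline: the `Event`, `Course`, and `Topic` case classes, their fields, and the `leftJoin` and `enrich` helpers are all hypothetical names introduced here, and the helper assumes at most one right-hand row per key.

```scala
// Hypothetical record types; the slides do not show the real data models.
final case class Event(userId: String, courseId: String)
final case class Course(courseId: String, topicId: String)
final case class Topic(topicId: String, name: String)

object LeftJoinSketch {
  // In-memory left join: every left row is kept, and the right side is
  // None when no key matches. Assumes at most one right row per key
  // (if keys repeat, toMap keeps only the last row for that key).
  def leftJoin[K, L, R](left: Seq[L], right: Seq[R])(
      leftKey: L => K,
      rightKey: R => K): Seq[(L, Option[R])] = {
    val index: Map[K, R] = right.map(r => rightKey(r) -> r).toMap
    left.map(l => (l, index.get(leftKey(l))))
  }

  // Events joined with courses by courseId, then with topics by topicId,
  // mirroring the shape of the pipeline on the slides.
  def enrich(
      events: Seq[Event],
      courses: Seq[Course],
      topics: Seq[Topic]): Seq[(Event, Option[Course], Option[Topic])] = {
    val withCourse = leftJoin(events, courses)(_.courseId, _.courseId)
    // The second key is optional because the first join may have found no course.
    val withTopic = leftJoin(withCourse, topics)(
      pair => pair._2.map(_.topicId),
      topic => Option(topic.topicId))
    withTopic.map { case ((event, course), topic) => (event, course, topic) }
  }
}
```

An event whose course is unknown survives both joins with `None` in the course and topic positions, which is the behavior a left join is chosen for. In a real Scalding job the same shape would run as distributed map-reduce steps rather than over in-memory sequences.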