Column-Oriented Storage Techniques for MapReduce

DESCRIPTION

Column-Oriented Storage Techniques for MapReduce. Most slides & paper by: Avrilia Floratou (University of Wisconsin – Madison), Jignesh M. Patel (University of Wisconsin – Madison), Eugene J. Shekita (while at IBM Almaden Research Center), Sandeep Tata (IBM Almaden Research Center).

TRANSCRIPT

Column-Oriented Storage Techniques for MapReduce

Most slides & paper by: Avrilia Floratou (University of Wisconsin – Madison), Jignesh M. Patel (University of Wisconsin – Madison), Eugene J. Shekita (while at IBM Almaden Research Center), Sandeep Tata (IBM Almaden Research Center)

Ohio State CSE 788.11 WI 2012 presentation by: David Fuhry and Karthik Tunga

Talk Outline
- Motivation
- Merits/Limits of Row- & Column-store with MapReduce (recap)
- Lazy Tuple Construction
- Compression
- Experimental Evaluation
- Conclusion

Motivation
- MapReduce is increasingly used for Big Data analysis: scalability, ease of use, fault tolerance, price
- But MapReduce implementations lack some advantages often seen in parallel DBMSs: efficiency & performance, SQL, indexing, updating, transactions

Both this work (CIF) and RCFile address efficiency and performance through column-oriented storage.

Motivation (diagram): column-oriented storage brings database performance to MapReduce's programmability and fault tolerance.

Row-Store: Merits/Limits with MapReduce
A table with columns A, B, C, D (values 101..105, 201..205, 301..305, 401..405) is stored row-wise across HDFS blocks.
+ Data loading is fast (no additional processing); all columns of a data row are located in the same HDFS block
- Not all columns are used by a query (unnecessary storage bandwidth)
- Compressing columns of different types together may add additional overhead

Column-Store: Merits/Limits with MapReduce
The same table is stored column-wise: column groups (A), (B), and (C, D) occupy separate HDFS blocks.
+ Unnecessary I/O costs can be avoided: only the needed columns are loaded, and compression is easy
- Additional network transfers are needed for column grouping

Challenges
- How can columnar storage be incorporated into an existing MapReduce system (Hadoop) without changing its core parts?

- How can columnar storage operate efficiently on top of a distributed file system (HDFS)?

- Is it easy to apply well-studied techniques from the database field to the MapReduce framework, given that MapReduce:
  - processes one tuple at a time,
  - does not use a restricted set of operators,
  - is used to process complex data types?

Column-Oriented Storage in Hadoop

Example table:

    Name    Age   Info
    Joe     23    hobbies: {tennis}; friends: {Ann, Nick}
    David   32    friends: {George}
    John    45    hobbies: {tennis, golf}
    Smith   65    hobbies: {swimming}; friends: {Helen}

With row storage, the first two rows go to the 1st node and the last two to the 2nd node, each row stored whole. To eliminate unnecessary I/O, column-oriented storage instead stores the Name, Age, and Info columns of each split in separate files. A new InputFormat is introduced: ColumnInputFormat (CIF).

Replication and Co-location
Under the default HDFS replication policy, the separate column files covering the same rows can be replicated to different nodes (Node A, Node B, Node C, Node D).
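The core idea of storing each column of a split in its own file can be sketched in plain Java. The class and method names below are illustrative, not the actual ColumnInputFormat API:

```java
// Sketch of the CIF storage idea: the columns of a row-oriented split are
// written to separate per-column "files" (lists here), so a job projecting
// only {Age} never touches the bytes of {Name, Info}.
import java.util.*;

public class ColumnStoreSketch {
    // Turn a list of rows into one value list ("column file") per column.
    static Map<String, List<String>> toColumns(String[] header, List<String[]> rows) {
        Map<String, List<String>> cols = new LinkedHashMap<>();
        for (String h : header) cols.put(h, new ArrayList<>());
        for (String[] row : rows)
            for (int i = 0; i < header.length; i++)
                cols.get(header[i]).add(row[i]);
        return cols;
    }

    public static void main(String[] args) {
        String[] header = {"Name", "Age"};
        List<String[]> rows = List.of(new String[]{"Joe", "23"},
                                      new String[]{"David", "32"});
        Map<String, List<String>> cols = toColumns(header, rows);
        // A projection query now reads only the "Age" column file.
        System.out.println(cols.get("Age"));  // [23, 32]
    }
}
```

A real implementation writes each column to its own HDFS file per split; this sketch only shows the row-to-column transposition.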

Reassembling a record would then require reading column files from remote nodes. A new column placement policy (CPP) is therefore introduced, so that the column files belonging to the same rows are co-located on the same nodes, just as the corresponding row data would be.

Example

Record columns: Age (23, 32, 45, 30, 50) and Name (Joe, David, John, Mary, Ann).

Map method: if (age < 35) return name — emits (23, Joe) and (32, David).

What if age > 35? Can we avoid reading and deserializing the name field?

Outline
- Column-Oriented Storage
- Lazy Tuple Construction
- Compression
- Experiments
- Conclusions

Lazy Tuple Construction

Deserialization of each record field is deferred to the point where it is actually accessed, i.e. when the get() methods are called. The mapper is unchanged except for the record type.

Eager version:

    Mapper(NullWritable key, Record value) {
        String name;
        int age = value.get("age");
        if (age < 35)
            name = value.get("name");
    }

Lazy version:

    Mapper(NullWritable key, LazyRecord value) {
        String name;
        int age = value.get("age");
        if (age < 35)
            name = value.get("name");
    }

Skip List (Logical Behavior)
Records R1, R2, ..., R100 are linked at several granularities: consecutive records at the bottom level, every 10th record (R1, R10, R20, ..., R90, R100) one level up, and every 100th record (R1, R100) at the top, so a reader can skip 10 or 100 records in a single hop.
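The deferred deserialization above can be sketched as follows. LazyRecord here is a hypothetical stand-in, not the paper's actual class: it keeps raw bytes per field and parses a field only when get() is called, and a counter shows that filtered-out fields are never parsed:

```java
// Minimal sketch of lazy tuple construction: raw bytes are kept per field
// and a field is deserialized only on first access via get().
import java.nio.charset.StandardCharsets;
import java.util.*;

public class LazyRecordSketch {
    static int deserializations = 0;  // how many fields were actually parsed

    static class LazyRecord {
        private final Map<String, byte[]> raw = new HashMap<>();
        LazyRecord put(String field, String value) {
            raw.put(field, value.getBytes(StandardCharsets.UTF_8));
            return this;
        }
        String get(String field) {            // deserialize on access
            deserializations++;
            return new String(raw.get(field), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        LazyRecord r = new LazyRecord().put("age", "45").put("name", "John");
        int age = Integer.parseInt(r.get("age"));
        if (age < 35) r.get("name");          // skipped: age >= 35
        System.out.println(deserializations); // prints 1: only "age" was parsed
    }
}
```

Because age is 45, the name field stays as unparsed bytes, which is exactly the saving the slide's "What if age > 35?" question points at.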

Example: if (age < 35) return name

The Age column (23, 39, 45, 30, ...) is read for every record. The Name column (Joe, Jane, David, John, Mary, Ann, ...) stores skip pointers alongside its values: here Skip100 = 9017 bytes over a block of 100 rows, and Skip10 = 1002 and Skip10 = 868 bytes over blocks of 10 rows. When records fail the age predicate, the reader follows the skip pointers past the corresponding name bytes instead of deserializing them.

Example: if (age < 35) return hobbies

The same idea applies to the complex Info column (hobbies: {tennis}, friends: {Ann, Nick} | null | friends: {George} | hobbies: {tennis, golf} | ...), with Skip100 = 19400, Skip10 = 2013, and Skip10 = 1246.

Outline
- Column-Oriented Storage
- Lazy Record Construction
- Compression
- Experiments
- Conclusions
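How such skip pointers are derived can be sketched as follows, assuming an illustrative serialization where each record's field occupies a known number of bytes (the layout below is not the paper's exact on-disk format):

```java
// Sketch of the skip-list layout: every N records, the column stores the
// number of bytes to the record N positions ahead, so a reader can jump
// over N unneeded values in one hop instead of parsing them one by one.
import java.util.*;

public class SkipListSketch {
    // sizes[i] = serialized size in bytes of record i's field value.
    // Returns skip[i] = bytes from the start of record i to record i+n,
    // filled in at every n-record boundary.
    static int[] buildSkip(int[] sizes, int n) {
        int[] skip = new int[sizes.length];
        for (int i = 0; i + n <= sizes.length; i += n) {
            int bytes = 0;
            for (int j = i; j < i + n; j++) bytes += sizes[j];
            skip[i] = bytes;
        }
        return skip;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 5, 4, 2, 6, 3, 3, 4, 5, 2, 4, 4};
        int[] skip10 = buildSkip(sizes, 10);
        // To skip the first 10 records, advance skip10[0] bytes in one hop.
        System.out.println(skip10[0]);  // 37
    }
}
```

With pointers at the 10- and 100-record levels, a reader that knows 100 consecutive records fail the predicate advances one Skip100 pointer rather than ten Skip10 hops.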

Compression

Compressed Blocks: each column is stored as a sequence of LZO/ZLIB-compressed blocks. A header records the number of records in each block (# records in B1, # records in B2) and its RID range (e.g. B1 holds RIDs 0-9, B2 holds RIDs 10-35), so a block whose records are all skipped never has to be decompressed.

Dictionary Compressed Skip Lists: for complex fields, a dictionary maps frequent keys to small integer codes (hobbies: 0, friends: 1), and the values are stored as skip lists of dictionary-compressed entries (0: {tennis}, 1: {Ann, Nick} | null | 1: {George} | 0: {tennis, golf}) with skip pointers (Skip100 = 1709, Skip10 = 210, Skip10 = 304).

Outline
- Column-Oriented Storage
- Lazy Record Construction
- Compression
- Experiments
- Conclusions
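The dictionary step of this scheme can be sketched as follows; the encoding is illustrative, not the paper's exact format:

```java
// Sketch of lightweight dictionary compression for a low-cardinality
// column: each distinct key of a complex field (e.g. "hobbies", "friends")
// is replaced by a small integer code assigned on first sight.
import java.util.*;

public class DictCompressSketch {
    static Map<String, Integer> dict = new LinkedHashMap<>();

    // Return the code for a key, assigning the next free code if unseen.
    static int encode(String key) {
        return dict.computeIfAbsent(key, k -> dict.size());
    }

    public static void main(String[] args) {
        String[] keys = {"hobbies", "friends", "friends", "hobbies"};
        int[] codes = new int[keys.length];
        for (int i = 0; i < keys.length; i++) codes[i] = encode(keys[i]);
        System.out.println(Arrays.toString(codes)); // [0, 1, 1, 0]
        System.out.println(dict);                   // {hobbies=0, friends=1}
    }
}
```

Repeated string keys shrink to one small integer each, which is why the slides report that this lightweight compression pays off for complex map-typed columns.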

RCFile

RCFile partitions the table horizontally into row groups and stores each row group column by column with its own metadata:
- Row Group 1: Name (Joe, David); Age (23, 32); Info ({hobbies: {tennis}, friends: {Ann, Nick}}, {friends: {George}})
- Row Group 2: Name (John, Smith); Age (45, 65); Info ({hobbies: {tennis, golf}}, {hobbies: {swimming}, friends: {Helen}})

RCFile: Inside each Row Group
Each row group stores compressed metadata followed by the compressed columns A, B, C, D in sequence; no data is duplicated.

CIF: Separate file for each column
CIF stores compressed metadata and one file per compressed column (A, B, C, D) for each 128 MB split; no data is duplicated.

Experimental Setup
- 42-node cluster
- Each node: 2 quad-core 2.4 GHz sockets, 32 GB main memory, four 500 GB HDDs
- Network: 1 Gbit Ethernet switch

Overhead of Columnar Storage
Synthetic dataset: 57 GB, 13 columns (6 integers, 6 strings, 1 map). Query: select *. Single-node experiment.

Benefits of Column-Oriented Storage
Query: projection of different columns. Single-node experiment.

Workload
Schema:

    URLInfo {
        String url
        String srcUrl
        time fetchTime
        String inlink[]
        Map metadata
        Map annotations
        byte[] content
    }

Query: if the url contains "ibm.com/jp", find all the distinct encodings reported by the page.
Dataset: 6.4 TB. Query selectivity: 6%.

Comparison of Column-Layouts (Map phase): SEQ baseline is 754 sec. (Chart.)
Comparison of Column-Layouts (Total job): SEQ baseline is 806 sec. (Chart.)

Conclusions
- Describe a new column-oriented binary storage format for MapReduce.
- Introduce the skip list layout.
- Describe the implementation of lazy record construction.
- Show that lightweight dictionary compression for complex columns can be beneficial.
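The row-group organization that distinguishes RCFile from CIF can be sketched as follows (an illustrative transposition, not the actual RCFile implementation):

```java
// Sketch of RCFile's layout: rows are partitioned horizontally into row
// groups, and within each row group the values are stored column by column.
import java.util.*;

public class RowGroupSketch {
    // groups.get(g).get(c) = values of column c within row group g.
    static List<List<List<String>>> toRowGroups(List<String[]> rows,
                                                int groupSize, int nCols) {
        List<List<List<String>>> groups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += groupSize) {
            List<List<String>> cols = new ArrayList<>();
            for (int c = 0; c < nCols; c++) cols.add(new ArrayList<>());
            for (int r = start; r < Math.min(start + groupSize, rows.size()); r++)
                for (int c = 0; c < nCols; c++)
                    cols.get(c).add(rows.get(r)[c]);
            groups.add(cols);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"Joe", "23"}, new String[]{"David", "32"},
            new String[]{"John", "45"}, new String[]{"Smith", "65"});
        List<List<List<String>>> g = toRowGroups(rows, 2, 2);
        System.out.println(g.get(1).get(0)); // [John, Smith]: names of row group 2
    }
}
```

Because a row group holds all columns of its rows, RCFile needs no placement-policy change; CIF's fully separate column files do, which is the trade-off the comparison table below summarizes.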

CIF / RCFile comparison

                                   CIF        RCFile
    Modify block placement policy  Y          N
    Modify block content           Y          Y
    Metadata stored                In block   Separate file
    Row group size                 Large      Flexible
    Lazy tuple deserialization     Y          Y
    Skip lists                     Y          N

Comparison of Sequence Files (chart). RCFile (chart).

Comparison of Column-Layouts

    Layout         Data Read (GB)  Map Time (sec)  Map Time Ratio  Total Time (sec)  Total Time Ratio
    Seq - uncomp.  6400            1416            -               1482              -
    Seq - record   3008            820             -               889               -
    Seq - block    2848            806             -               886               -
    Seq - custom   3040            754             1.0x            806               1.0x
    RCFile         1113            702             1.1x            761               1.1x
    RCFile - comp  102             202             3.7x            291               2.8x
    CIF - ZLIB     36              12.8            59.1x           77                10.4x
    CIF            96              12.4            60.8x           78                10.3x
    CIF - LZO     54              12.4            61.0x           79                10.2x
    CIF - SL       75              9.2             81.9x           70                11.5x
    CIF - DCSL     61              7.0             107.8x          63                12.8x

Comparison of Column-Layouts (SEQ: 754 sec): CIF-DCSL results in the highest map-time speedup and improves the total job time by more than an order of magnitude (12.8x).

RCFile (SEQ: 754 sec, chart). Comparison of Sequence Files (SEQ: 754 sec, chart).