Hive Object Model


Posted on 15-Jan-2015







This set of slides describes the efficient Java object model in Hive.


  • 1. Efficient Object Model in Java
      Slides by Zheng Shao, Facebook
      Part of Apache Hadoop Hive Project

  • 2. Object Inspector

  • 3. On-disk Data Format
      Single on-disk format systems
        - Simplicity
      Multiple on-disk format systems
        - Ease of use
        - Ease of integration
        - Flexibility: better trade-off between space, performance, etc.
      Hive allows multiple on-disk formats

  • 4. Example: Multiple On-disk Formats
      File format
        - Row-based
        - Column-based
        - Block-based
      Row format
        - Text-based
        - Binary-based
      Customized index format

  • 5. In-memory Data Format
      Single in-memory format systems
        - Simplicity: simpler code
      Multiple in-memory format systems
        - Ease of integration: other systems may use their own formats
        - Performance: multiple on-disk/external formats plus efficient loading lead to multiple in-memory formats
      Hive allows multiple in-memory formats

  • 6. Example: Multiple In-memory Formats
      Integer
        - Integer
        - IntWritable
        - LazyInteger
      String
        - String
        - Text

  • 7. Multiple In-memory Format Design Patterns
      Object-oriented
        - A single interface/base class for Integer
        - Multiple derived classes
      Delegation
        - Data is stored in the object; format and operations are stored in the objectInspector
        - A pair of object and objectInspector represents a data unit
      It is possible to wrap either one up to conform to the other pattern.

  • 8. Multiple In-memory Format Design Patterns
      In OO, we need an interface HiveInteger to represent Integers
        - Make the Integer and IntWritable classes all implement it.
        - However, the Integer class is final (not extendable) and does not implement HiveInteger.
        - We need to do a conversion every time we exchange data with UDF, SerDe (Thrift), or other libraries (unless they know HiveInteger; this is a bad assumption to make in an open system).
      Delegation will be a better idea because
        - For Integer, we have a JavaIntegerObjectInspector
        - For IntWritable, we have a WritableIntegerObjectInspector
        - We convert parameters and return values only if necessary

  • 9. Delegation Method List
      General methods
        - isNull(object o)
        - hashCode(object o)
        - compare(object o)
        - clone(object o)
      Primitive Objects
        - primitive getValue(object o)
      String Objects
        - String getString(object o)
        - Text getText(object o)
      List Objects
        - getListSize(object o)
        - getListElement(object o)
        - getList(object o)
      Map Objects
        - getMapSize(object o)
        - getValueForKey(object o)
        - getMap(object o)
      Struct Objects
        - getStructField(object o)
        - getStructAsAList(object o)

  • 10. SerDe

  • 11. Where is SerDe?
      [Diagram. Files on HDFS, map output files and user-script streams hold serialized rows, for example thrift_record< > records or text rows such as "imp 1.0 3 54", "imp 0.2 1 33", "clk 2.2 8 212" and "imp 0.7 2 22". FileFormat / Hadoop Serialization turns them into Writable objects, e.g. Text("imp 1.0 3 54") (UTF-8 encoded) or BytesWritable(\x3F\x64\x72\x00). The SerDe sits between the Writables and the hierarchical objects that Hive operators in the mapper and reducer access through ObjectInspectors. Three kinds of hierarchical object are shown: Standard Object (ArrayList for struct and array, HashMap for map), LazyObject (lazily deserialized) and Java Object (object of a Java class).]

  • 12. SerDe, ObjectInspector and TypeInfo
      [Diagram. One row shown in several forms. Serialized Writable: Text("a=av:b=bv 23 1:2=4:5 abcd") or BytesWritable(\x3F\x64\x72\x00). Hierarchical object: HashMap("a" -> "av", "b" -> "bv"), 23, List(List(1,null), List(2,4), List(5,null)), "abcd". Java classes: class HO { HashMap a; Integer b; List c; String d; } and class ClassC { Integer a; Integer b; } (generic type parameters were lost in extraction). TypeInfo: a struct containing map, int, list and string types. The SerDe offers serialize, deserialize and getOI. ObjectInspector1, ObjectInspector2 and ObjectInspector3 offer getType together with navigation methods such as getFieldOI, getStructField, getMapValueOI and getMapValue.]

  • 13. LazySimpleSerDe components
      [Diagram. The byte[] data "a=av:b=bv 23 1:2=4:5 abcd" is held by a LazyStruct whose parts are a LazyMap of LazyStrings, a LazyInteger, a LazyArray of LazyStructs of LazyIntegers, and a LazyString. The ObjectInspectors shown are LazyStructOI, LazyMapOI (separators ":" and "="), LazyArrayOI (separator ":"), LazyStringOI, LazyIntegerOI and StandardIntegerOI. Labels in the diagram: "Hierarchical Object / LazyObject", "One per SerDe instance", "LazyObjectInspector" and "Singleton".]

  • 14. LazyPrimitive
      LazyString/LazyInteger
        - setAll(byte[] data, int start, int length)
            LazyString: parse the data and create a String object
            LazyInteger: parse the data and create an Integer object
        - getObject() returns the corresponding String/Integer object
      Future
        - Replace String/Integer with Text/IntWritable
        - The Text/IntWritable object is owned by the LazyString/LazyInteger object.

  • 15. LazyNonPrimitive
      LazyStruct/LazyArray/LazyMap
        - setAll(byte[] data, int start, int length)
            Remember data, start and length, and set parsed to false.
        - getStructField/getArrayElement/getMapValue
            If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value.
            For Struct/Array, do setAll on the corresponding LazyObject and return it.
            For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).

  • 16. Why another SerDe?
      Functionality
        - MetadataTypedColumnSetSerDe can only deal with String columns
        - DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
      Efficiency
        - Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows

  • 17. Features of LazySimpleSerDe
      Functionality
        - Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
        - Fully supports all nested types (Map keys must be primitive)
      Efficiency
        - Fully supports lazy deserialization: only deserialize the field (and create Objects) when asked.
        - Reuses multiple levels of LazyObjects.
        - Read numbers without UTF-8 decoding
        - (TODO) Fully reuse objects: IntWritable for Integer, Text for String
        - (TODO) Write numbers without UTF-8 encoding

  • 18. Profiling result of a mapper
      17%: TrackedRecordReader (should include InputFileFormat and decompression)
      22%: Operator.close
        |- 12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
        |-  4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
      50%: Operator.forward
        |- 18%: Text.decode (from LazySerDe)
        |   |- 7%: CharacterSet.decode() (UTF-8 decoding)
        |   |- 5%: toString() (where we create the string object)
        |-  3%: LazyStruct.parse (the code that searches for separators in the row)
        |-  3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
        |-  8%: GroupByOperator.processHashAggr
        |-  3%: HashMap.get() in GroupByOperator
      * Performance data from Rodrigo Schmidt

  • 19. TypeInfo String specification
      Why not Thrift?
        - Hard to parse
      Simple syntax
        - Type: PrimitiveType | MapType | ArrayType | StructType
        - PrimitiveType: int | bigint | tinyint | smallint | double | string
        - MapType: map<...>
        - ArrayType: array<...>
        - StructType: struct< [Name : Type]+ >
        - Example: array[...],c:double>>>
      [The type parameters of map and array and the middle of the example, marked [...], were lost in extraction.]

  • 20. Future Works

  • 21. Future Works of ObjectInspector
      Delegate all methods described earlier
        - isNull(), hashCode(), compare() etc. are not delegated yet
      Support UNION data type: HIVE-537

  • 22. Future Works of SerDe
      LazyBinarySerDe: HIVE-553
        - A binary-format sortable SerDe: serialized sorting order is the same as deserialized sorting order
        - A binary-format compact SerDe: saving space
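
The delegation pattern from slides 7 and 8 can be illustrated with a small, self-contained Java sketch. It is not Hive's actual ObjectInspector API: `IntInspector`, `JavaIntInspector`, `BoxIntInspector` and `DelegationDemo` are hypothetical names, and `MutableIntBox` is a stand-in for Hadoop's IntWritable so that the example compiles without Hadoop on the classpath. The point it tries to show is that the data objects stay untouched while the inspector carries the operations.

```java
/** Hypothetical inspector interface: operations live here, not on the data object. */
interface IntInspector {
    int get(Object o);
}

/** Inspector for java.lang.Integer, which is final and cannot implement a Hive-specific interface. */
final class JavaIntInspector implements IntInspector {
    public int get(Object o) {
        return (Integer) o;
    }
}

/** Assumption: a minimal stand-in for Hadoop's IntWritable. */
final class MutableIntBox {
    int value;

    MutableIntBox(int value) {
        this.value = value;
    }
}

/** Inspector for the IntWritable stand-in. */
final class BoxIntInspector implements IntInspector {
    public int get(Object o) {
        return ((MutableIntBox) o).value;
    }
}

final class DelegationDemo {
    /** The caller only sees (object, inspector) pairs and never converts one representation into another. */
    static int sum(Object[] values, IntInspector[] inspectors) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += inspectors[i].get(values[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        Object[] values = { Integer.valueOf(23), new MutableIntBox(19) };
        IntInspector[] inspectors = { new JavaIntInspector(), new BoxIntInspector() };
        System.out.println(sum(values, inspectors)); // prints 42
    }
}
```

Supporting a further in-memory integer format in this sketch would mean adding one more inspector class; neither `sum` nor the existing data classes would change.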
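
The lazy behaviour described on slides 14 and 15 (setAll only remembers the byte range, and parsing happens on first field access) can be sketched as follows. This is a simplified illustration under stated assumptions, not the LazyStruct implementation: `LazyIntRow` is a hypothetical class that handles a single level of integer fields with one separator byte, and `parseCount` exists only so that the laziness can be observed.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified lazily parsed row of integer fields. */
final class LazyIntRow {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private int[] fieldStarts; // field i occupies fieldStarts[i] .. fieldStarts[i + 1] - 2
    private int fieldCount;
    int parseCount;            // number of times the separators were scanned

    LazyIntRow(byte separator) {
        this.separator = separator;
    }

    /** Remember data, start and length, and mark the row as not parsed. No scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Scan once for separators and remember where each field starts. */
    private void parse() {
        if (fieldStarts == null || fieldStarts.length < length + 2) {
            fieldStarts = new int[length + 2];
        }
        int end = start + length;
        fieldCount = 0;
        fieldStarts[0] = start;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) {
                fieldCount++;
                fieldStarts[fieldCount] = i + 1;
            }
        }
        fieldCount++;
        fieldStarts[fieldCount] = end + 1; // sentinel, as if a separator followed the last field
        parsed = true;
        parseCount++;
    }

    /** Returns field index as an Integer, or null if the field is missing or empty. */
    Integer getField(int index) {
        if (!parsed) {
            parse();
        }
        if (index < 0 || index >= fieldCount) {
            return null;
        }
        int from = fieldStarts[index];
        int len = fieldStarts[index + 1] - from - 1;
        if (len <= 0) {
            return null;
        }
        return Integer.valueOf(new String(data, from, len, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] bytes = "23 1 54".getBytes(StandardCharsets.UTF_8);
        LazyIntRow row = new LazyIntRow((byte) ' ');
        row.setAll(bytes, 0, bytes.length);
        System.out.println(row.parseCount);  // 0: nothing parsed yet
        System.out.println(row.getField(2)); // 54
        System.out.println(row.parseCount);  // 1: parsed on first access
    }
}
```

Calling `setAll` again with the next row reuses the same `LazyIntRow` and its `fieldStarts` array, which is the kind of object reuse slide 17 refers to.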
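
Slide 17 lists reading numbers without UTF-8 decoding among the efficiency items. One way this could look, assuming decimal ASCII digits in the serialized row, is to accumulate the value straight from the bytes so that no String is created. `AsciiIntParser` is a hypothetical helper written for this illustration, and it does not detect overflow.

```java
import java.nio.charset.StandardCharsets;

final class AsciiIntParser {
    /**
     * Parses a decimal int from data[start, start + length) without building a String.
     * Overflow is not detected in this sketch.
     */
    static int parseInt(byte[] data, int start, int length) {
        if (length <= 0) {
            throw new NumberFormatException("empty field");
        }
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("sign without digits");
        }
        int result = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            // Accumulate as a negative number so that Integer.MIN_VALUE fits.
            result = result * 10 - digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        byte[] row = "imp 54 -7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseInt(row, 4, 2)); // 54
        System.out.println(parseInt(row, 7, 2)); // -7
    }
}
```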
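
The type-string grammar on slide 19 is small enough for a recursive-descent parser. The sketch below is an illustration, not Hive's own parsing code. It assumes the usual Hive spelling `map<Type,Type>` and `array<Type>` for the two productions whose parameters are missing from this transcript, it accepts no whitespace, and `TypeNode` and `TypeStringParser` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical parse-tree node for a type string. */
final class TypeNode {
    final String category;                              // primitive name, "map", "array" or "struct"
    final List<String> fieldNames = new ArrayList<>();  // filled for struct only
    final List<TypeNode> children = new ArrayList<>();

    TypeNode(String category) {
        this.category = category;
    }
}

final class TypeStringParser {
    private static final Set<String> PRIMITIVES = new HashSet<>(
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string"));

    private final String text;
    private int pos;

    private TypeStringParser(String text) {
        this.text = text;
    }

    /** Parses a complete type string; throws IllegalArgumentException on malformed input. */
    static TypeNode parse(String text) {
        TypeStringParser parser = new TypeStringParser(text);
        TypeNode node = parser.parseType();
        if (parser.pos != text.length()) {
            throw new IllegalArgumentException("trailing input at " + parser.pos);
        }
        return node;
    }

    private TypeNode parseType() {
        String name = readName();
        if (PRIMITIVES.contains(name)) {
            return new TypeNode(name);
        }
        if (!name.equals("map") && !name.equals("array") && !name.equals("struct")) {
            throw new IllegalArgumentException("unknown type: " + name);
        }
        TypeNode node = new TypeNode(name);
        expect('<');
        if (name.equals("array")) {
            node.children.add(parseType());
        } else if (name.equals("map")) {
            node.children.add(parseType());
            expect(',');
            node.children.add(parseType());
        } else {
            do { // struct: one or more name:Type pairs
                node.fieldNames.add(readName());
                expect(':');
                node.children.add(parseType());
            } while (tryConsume(','));
        }
        expect('>');
        return node;
    }

    private String readName() {
        int begin = pos;
        while (pos < text.length()
                && (Character.isLetterOrDigit(text.charAt(pos)) || text.charAt(pos) == '_')) {
            pos++;
        }
        if (begin == pos) {
            throw new IllegalArgumentException("name expected at " + pos);
        }
        return text.substring(begin, pos);
    }

    private boolean tryConsume(char c) {
        if (pos < text.length() && text.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!tryConsume(c)) {
            throw new IllegalArgumentException("'" + c + "' expected at " + pos);
        }
    }
}
```

For example, `TypeStringParser.parse("map<string,array<struct<a:int,b:double>>>")` returns a `map` node whose second child is an `array` of a `struct` with fields `a` and `b`. The sketch does not enforce the rule from slide 17 that map keys must be primitive.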