Hadoop (Pig). Processing of large data (by Eugene Smertenko) - Big Data Tech Hangout - 2013.10.26
DESCRIPTION
On Saturday, October 26, the second external meeting of the Tech Hangout Community took place at Creative Space 12, a cultural and educational center in Kiev. The event was held under the motto «Discover the value of Big Data!»
Tech Hangout is an event organized by developers for developers to share knowledge and experience. The format is a 30-minute talk on a previously defined topic, followed by a roundtable discussion of the same length. The initiative proved so popular that Tech Hangout soon acquired its own logo, blog and Facebook group, where attendees can discuss what they heard.
Join the discussion - https://www.facebook.com/groups/techhangout/
Read us - http://hangout.innovecs.com/
TRANSCRIPT
Hadoop - Pig
Processing of large data
Yevgen Smertenko
Engineering Team Lead. BI Developer.
How it works
[Diagram: data, Pig, BI engineer, clear result]
Hadoop - Software Framework
Provides Massively Parallel Processing (MPP) of data
MapReduce program
• Input read
• Map
• Partition / Combine
• Copy / Compare / Merge
• Reduce
• Output write
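The stages listed above can be sketched in plain Python as a word count. This is only an in-memory illustration, not Hadoop's API: the function names (`map_phase`, `combine`, `partition`, `shuffle`, `reduce_phase`, `word_count`), the toy hash and the sample input are all assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combine: pre-aggregate counts locally, before data leaves the mapper."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Partition: pick the reducer for a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Copy / compare / merge: route pairs to reducers, group values by key,
    and hand each reducer its keys in sorted order."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

def reduce_phase(key, values):
    """Reduce: fold all values of one key into a single result."""
    return key, sum(values)

def word_count(lines, num_reducers=2):
    # Input read + map + combine, one "mapper" per input line
    mapper_outputs = [combine(map_phase(line)) for line in lines]
    reducer_inputs = shuffle(mapper_outputs, num_reducers)
    output = {}
    for bucket in reducer_inputs:
        for key, values in bucket:
            word, total = reduce_phase(key, values)
            output[word] = total  # output write
    return output

result = word_count(["pig runs on hadoop", "hadoop runs mapreduce", "pig pig"])
```

In a real cluster each stage runs on different machines and the intermediate pairs travel over the network; the sketch only shows how the data changes shape from stage to stage.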
MapReduce Data Flow
MapReduce functionality
The Hadoop Ecosystem
PIG
• Data types
• Relational operators
• UDF – user defined functions
Pig Latin - a language for describing data flows
Pig. Data Types
Simple Types
• int
• long
• float
• double
• chararray
• bytearray
• boolean
• datetime
Complex Types
• tuple (.., ..)
• map [key#value]
• bag {(), .., ()}
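As a rough illustration of the three complex types, the sketch below models them with Python built-ins and prints them in the bracket notation shown on the slide. The mapping (tuple to `tuple`, map to `dict`, bag to `list`) and the helper `to_pig` are assumptions made for this example, not part of Pig.

```python
def to_pig(value):
    """Render a Python value in the slide's notation for Pig complex types."""
    if isinstance(value, tuple):   # tuple: ordered set of fields
        return "(" + ",".join(to_pig(v) for v in value) + ")"
    if isinstance(value, dict):    # map: key#value pairs
        return "[" + ",".join(f"{k}#{to_pig(v)}" for k, v in value.items()) + "]"
    if isinstance(value, list):    # bag: collection of tuples, duplicates allowed
        return "{" + ",".join(to_pig(v) for v in value) + "}"
    return str(value)              # simple types fall through unchanged

person = ("alice", 30)                       # tuple
attrs = {"city": "Kiev"}                     # map
people = [("alice", 30), ("bob", 25)]        # bag of tuples
nested = ("alice", attrs, people)            # complex types can nest

tuple_text = to_pig(person)
map_text = to_pig(attrs)
bag_text = to_pig(people)
nested_text = to_pig(nested)
```

A Python list is ordered while a Pig bag is not, so the list here only stands in for "a collection of tuples".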
Pig. Relational operators
• SPLIT
• UNION
• FILTER
• DISTINCT
• SAMPLE
• FOREACH
• STREAM
• JOIN
• GROUP / COGROUP
• CROSS
• ORDER
• LOAD
• STORE
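To give a feel for how several of these operators chain into a data flow, here is a small in-memory Python analogue of a LOAD, FILTER, GROUP, FOREACH, ORDER pipeline. The sample rows, field names and the Pig Latin lines quoted in the comments (including the file name 'logs.tsv') are hypothetical and serve only as an illustration.

```python
from operator import itemgetter

# LOAD: in Pig this could be
#   logs = LOAD 'logs.tsv' AS (user:chararray, bytes:int);
# here the rows are simply an in-memory list of tuples.
logs = [("ann", 120), ("bob", 80), ("ann", 300), ("cid", 150), ("bob", 500)]

# FILTER: big = FILTER logs BY bytes > 100;
big = [row for row in logs if row[1] > 100]

# GROUP: grouped = GROUP big BY user;
# each key maps to a bag (list) of the full tuples that share it.
grouped = {}
for row in big:
    grouped.setdefault(row[0], []).append(row)

# FOREACH: totals = FOREACH grouped GENERATE group, SUM(big.bytes);
totals = [(user, sum(nbytes for _, nbytes in bag)) for user, bag in grouped.items()]

# ORDER: ordered = ORDER totals BY $1 DESC;
ordered = sorted(totals, key=itemgetter(1), reverse=True)
```

Each step takes a relation and produces a new one, which is the pattern the operators above share; Pig compiles such a chain into MapReduce jobs instead of running it in one process.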
PIG. UDF
Eval Functions (EvalFunc)
• Filter Functions
• Aggregate Functions
• Algebraic Interface
• Accumulator Interface
Load/Store Functions (LoadFunc / StoreFunc)
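The idea behind the Algebraic interface is that an aggregate is split into an initial, an intermediate and a final step, so partial results can be computed on the map and combine side. Real Pig UDFs implementing it are Java classes; the Python sketch below only mimics that three-step decomposition for an average, and the function names and sample chunks are assumptions made for the example.

```python
def initial(value):
    """Initial step: turn one input value into a partial (sum, count)."""
    return (value, 1)

def intermed(partials):
    """Intermediate step: merge partial results; may be applied repeatedly."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)

def final(partials):
    """Final step: merge what is left and produce the actual average."""
    total, count = intermed(partials)
    return total / count if count else None

# Three hypothetical chunks of input, as if seen by three different mappers
chunks = [[10, 20], [30], [40, 50, 60]]

# Each mapper reduces its chunk to a single (sum, count) pair
per_chunk = [intermed([initial(v) for v in chunk]) for chunk in chunks]

# The reducer only has to merge three small pairs
average = final(per_chunk)
```

An average cannot be merged directly (the average of averages is wrong for chunks of unequal size), which is why the partial result carries both the sum and the count.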
piggybank
How it works
[Diagram: data, Pig, BI engineer, clear result]
THANKS FOR YOUR ATTENTION!