Apache Pig
Apache PIG : User Group Presentation
Welcome to BigData Cloud Architects Meetup Group
Weekly, Sunday 6:00p - 8:00p
Exciting Topic Every Week, Please Register @ http://www.meetup.com/BigData-Cloud-Meetup/
6:00 - 6:15 : Socializing
6:15 - 6:30 : Meetup Introduction
6:30 - 7:55 : Topic of the week
7:55 - 8:00 : Wrap Up
Started in Jun 2013
33 weeks - 3 weeks = 30 weeks
Suresh Mandava
CyberSecurity Cloud
Principal, CSC
24+ Years of IT Experience in Infrastructure, Platform and Data Management in Mission Critical Enterprise Environments
Founder/CTO of Network Security and Cloud IaaS Product Companies
Apache Accumulo
Market Need: BigData Architect
Must-have Technical skills and Experience:
• 5-7 years of strong system design/development experience in building massively large scale distributed systems and products.
• 1 to 2 years of good experience in working with Hadoop and Big Data technologies including core Hadoop (HDFS and MapReduce), Flume and Sqoop, HBase and Zookeeper, Oozie, Hive and Pig.
• Troubleshooting: The candidate must be able to engage in solving complex problems. Programming problems are a good example.
• Databases: Advanced SQL knowledge is a must. Advanced data warehousing and MPP knowledge is good to have but not a must.
• Linux/Unix and system administration: Advanced Linux knowledge is a must. Understanding of shell, debugging, etc. The candidate should be able to find their way around Linux and get things to work.
• Networking and Hardware: Advanced networking and hardware knowledge required.
• Programming: Strong in one programming language. Java would be preferable. If not Java, the candidate must have a good handle on one language and display the ability to pick up a new one if required. Experience in programming infrastructure management or automation tools is required.
• Big picture / High level architecture: The candidate must be able to think at a high level about the overall systems and goals of the projects.
Jan 9, 2014
Hadoop 2.0 GA Release Oct 16, 2013
Apache PIG
Suresh Mandava
Jan 19, 2014 : Session 30
Handle large data volume
Run queries spanning days/months
GB/TB/PBs
Structured, Semi-structured and Unstructured data
Computationally intensive
Deep analytics
Machine learning algorithms
Hadoop Cluster
MapReduce (Computation) + HDFS (Storage)
But writing MapReduce jobs in Java is painful. Let's see why …
Java and PIG
---------- Start ----------
A = load '/app_logs/2012/01/*/' using PigStorage();
uLogs = FILTER A BY $0 == 'U';
uLogFields = FOREACH uLogs GENERATE $1 as orgId, $2 as userId;
orgUserGroup = GROUP uLogFields BY (orgId, userId);
uCount = FOREACH orgUserGroup GENERATE group, COUNT(uLogFields);
STORE uCount INTO 'output';
---------- END ----------
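To make it easier to see what the Pig script above computes, here is a rough in-memory Python sketch of the same steps: keep records whose first field is 'U', pick fields 1 and 2 as orgId and userId, group on that pair, and count. It is only an illustration on one machine, not how Pig or MapReduce executes the job. The function name `count_user_logs` and the sample records are made up for this example; tab is used as the delimiter because that is the PigStorage default.

```python
from collections import Counter


def count_user_logs(lines, delimiter="\t"):
    """Count 'U' log records per (orgId, userId) pair.

    Hypothetical helper that mirrors the Pig script step by step.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # FILTER A BY $0 == 'U'
        if len(fields) < 3 or fields[0] != "U":
            continue
        # FOREACH uLogs GENERATE $1 as orgId, $2 as userId
        org_id, user_id = fields[1], fields[2]
        # GROUP BY (orgId, userId) followed by COUNT
        counts[(org_id, user_id)] += 1
    return dict(counts)


# Made-up sample records: record type, orgId, userId, request path.
sample = [
    "U\torg1\talice\t/home",
    "U\torg1\talice\t/search",
    "A\torg1\talice\t/login",
    "U\torg2\tbob\t/home",
]
result = count_user_logs(sample)
print(result)
```

The six-line Pig script expresses the same filter, project, group, and count pipeline, but Pig turns it into MapReduce jobs that run across the cluster.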
Performance of Pig vs MapReduce
PIG Philosophy ?
Pigs Eat Everything
• Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
Pigs Live Everywhere
• Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but we do not intend that to be only on Hadoop.
Pigs are Domestic Animals
• Pig is designed to be easily controlled and modified by its users.
• Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals. These functions can be written in Java or scripting languages that can compile down to Java (e.g. Jython).
• Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance, combines Map Reduce jobs together, etc. However, users can easily turn this optimizer off to prevent it from making changes that do not make sense in their situation.
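As a rough illustration of the user-defined functions mentioned above, here is a minimal sketch of two Jython-style UDFs in Python: a field transformation and a conditional. The function names `email_domain` and `is_internal` are hypothetical. The `outputSchema` fallback is an assumption that lets the file also run outside Pig, where no such decorator is supplied.

```python
# Assumption: when Pig loads this file through its Jython engine, an
# outputSchema decorator is already available. The fallback below is a
# no-op stand-in so the file can be run and tested with plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("domain:chararray")
def email_domain(email):
    """Field transformation: lower-cased domain of an e-mail, or None."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


def is_internal(email, company_domain):
    """Conditional: True when the e-mail belongs to company_domain."""
    return email_domain(email) == company_domain.lower()


print(email_domain("Alice@Example.COM"))
print(is_internal("bob@corp.com", "CORP.com"))
```

In a Pig script, a file like this would typically be made available with a statement such as `REGISTER 'udfs.py' USING jython AS myudfs;` (the file name and alias are placeholders), after which `myudfs.email_domain(...)` can be called inside a FOREACH.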
Pigs Fly
• Pig processes data quickly. We want to consistently improve performance, and not implement features in ways that weigh pig down so it can't fly.
About PIG
Large Code Contributions from: Yahoo, Twitter, Hortonworks
PIG Execution Stages
PIG VS SQL
PIG (PROCEDURAL) VS SQL (DECLARATIVE)
FEATURES OF PIG
PIG : Configure
PIG : Data Types
PIG : Run
PIG : Latin Constructs
Sample Data
Employee > 100K salary
More : Pig Latin
Is that ALL ?
PIG is Extremely Flexible
PigLatin : UDF
UDF : Types
UDF Library
UDF : How to write (EvalFunc)
UDF : How to use custom UDF in PIG
UDF : How to execute pig script with custom UDF
Our Goal
Master the Eco-System one bite at a time, every week
Appreciate your Feedback @ Group Reviews
Thank You.
www.bigdatapro.io
Thank You and Have a Nice Week!!
Meet you all Next Week