apache pig

Welcome to BigData Cloud Architects Meetup Group

Weekly, Sunday 6:00p - 8:00p

Exciting Topic Every Week, Please Register @ http://www.meetup.com/BigData-Cloud-Meetup/

6:00 - 6:15 : Socializing
6:15 - 6:30 : Meetup Introduction
6:30 - 7:55 : Topic of the week
7:55 - 8:00 : Wrap Up

Started in Jun 2013

33 weeks - 3 weeks = 30 weeks

Suresh Mandava

CyberSecurity Cloud

Principal, CSC

24+ Years of IT Experience in Infrastructure, Platform and Data Management in Mission Critical Enterprise Environments

Founder/CTO of Network Security and Cloud IaaS Product Companies

Apache Accumulo

Market Need: BigData Architect

Must-have Technical skills and Experience:

• 5-7 years of strong system design/development experience in building massively large scale distributed systems and products.

• 1 to 2 years of good experience in working with Hadoop and Big Data technologies including core Hadoop (HDFS and MapReduce), Flume and Sqoop, HBase and Zookeeper, Oozie, Hive and Pig.

• Troubleshooting: The candidate must be able to engage in solving complex problems. Programming problems are a good example.

• Databases: Advanced SQL knowledge a must. Good to have advanced data warehousing and MPP knowledge but it's not a must have.

• Linux/Unix and system administration: Advanced Linux knowledge is a must. Understanding of shell, debugging things etc. The candidate should be able to get their way around Linux and get things to work.

• Networking and Hardware: Advanced networking and hardware knowledge required.

• Programming: Strong in one programming language. Java would be preferable. If not Java, the candidate must have a good handle on one language and display the ability to pick up a new one if required. Experience on programming infrastructure management or automation tools is required.

• Big picture / High level architecture: The candidate must be able to think at a high level about the overall systems and goals of the projects.

Jan 9, 2014

http://www.indeed.com/q-flume-$130,000-l-Chesterfield,-MO-jobs.html

http://www.apple.com

Hadoop 2.0 GA Release Oct 16, 2013

Apache PIG

Suresh Mandava

Jan 19, 2014 : Session 30

Handle large data volume
Run queries spanning days/months
GB/TB/PBs
Structured, Semi and Unstructured data
Computationally intensive
Deep analytics
Machine learning algorithms

Hadoop Cluster

MapReduce (Computation)

+

HDFS (Storage)

But writing MapReduce jobs in Java is painful. Let's see why…

PIG

---------- Start ----------
A = LOAD '/app_logs/2012/01/*/' USING PigStorage();
uLogs = FILTER A BY $0 == 'U';
uLogFields = FOREACH uLogs GENERATE $1 AS orgId, $2 AS userId;
orgUserGroup = GROUP uLogFields BY (orgId, userId);
uCount = FOREACH orgUserGroup GENERATE group, COUNT(uLogFields);
STORE uCount INTO 'output';
---------- END ----------
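To make the data flow of the script above concrete, here is a minimal Python sketch of the same filter, project, group and count steps, written as explicit map, shuffle and reduce phases. It only illustrates the idea and is not how Pig actually compiles the script; the sample rows and the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_user_logs`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (log type, orgId, userId). Not from the original slides.
SAMPLE_LOGS = [
    ("U", "org1", "alice"),
    ("U", "org1", "alice"),
    ("A", "org1", "alice"),
    ("U", "org2", "bob"),
]


def map_phase(rows):
    """FILTER on $0 == 'U', then emit ((orgId, userId), 1) for each kept row."""
    for log_type, org_id, user_id in rows:
        if log_type == "U":
            yield (org_id, user_id), 1


def shuffle(pairs):
    """Collect emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """COUNT the values collected for each (orgId, userId) group."""
    return {key: len(values) for key, values in grouped.items()}


def count_user_logs(rows):
    """Run the three phases in order and return {(orgId, userId): count}."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    for group, count in sorted(count_user_logs(SAMPLE_LOGS).items()):
        print(group, count)
```

Even in this toy form, the grouping and counting plumbing has to be written by hand, which is the part the six-line Pig script leaves to the framework.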

Performance of Pig vs MapReduce

PIG Philosophy ?

Pigs Eat Everything

• Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.

Pigs Live Everywhere

• Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but we do not intend that to be only on Hadoop.

Pigs are Domestic Animals

• Pig is designed to be easily controlled and modified by its users.

• Pig allows integration of user code wherever possible, so it currently supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals. These functions can be written in Java or in scripting languages that can compile down to Java (e.g. Jython).

• Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance, combines Map Reduce jobs together, etc. However, users can easily turn this optimizer off to prevent it from making changes that do not make sense in their situation.
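As a rough illustration of the user-defined field transformation functions mentioned above, the sketch below defines one in Python, which Pig can run through Jython. The function `normalize_org_id` and its behavior are hypothetical. The `outputSchema` decorator is expected to be supplied by Pig when it loads the file through Jython, so a stand-in is defined here only so that the file also runs as plain Python.

```python
try:
    outputSchema  # expected to exist already when Pig loads this file via Jython
except NameError:
    def outputSchema(schema):
        """Stand-in decorator for running outside Pig: records the schema string."""
        def decorate(func):
            func.output_schema = schema
            return func
        return decorate


@outputSchema("orgId:chararray")
def normalize_org_id(raw):
    """Hypothetical field transformation: trim and upper-case an org id."""
    if raw is None:  # null field values arrive as None
        return None
    return raw.strip().upper()


if __name__ == "__main__":
    print(normalize_org_id("  org1 "))
```

In a Pig script, such a file would typically be registered with something like `REGISTER 'udfs.py' USING jython AS myfuncs;` and then called as `myfuncs.normalize_org_id(orgId)`; the file name `udfs.py` and the alias `myfuncs` are assumptions for illustration.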

Pigs Fly

• Pig processes data quickly. We want to consistently improve performance, and not implement features in ways that weigh Pig down so it can't fly.

About PIG

Large Code Contributions from: Yahoo, Twitter, Hortonworks

PIG Execution Stages

PIG VS SQL

PIG (PROCEDURAL) VS SQL (DECLARATIVE)
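One way to see the contrast is to compute the same per-user count twice: once as a single declarative SQL statement, and once as a sequence of named steps in the style of a Pig Latin script. The sketch below uses Python's built-in `sqlite3` module for the SQL side; the table name, column names and sample rows are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical sample rows: (log type, orgId, userId).
ROWS = [
    ("U", "org1", "alice"),
    ("U", "org1", "alice"),
    ("A", "org1", "alice"),
    ("U", "org2", "bob"),
]


def declarative_counts(rows):
    """Declarative: describe the result and let the engine choose the steps."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE logs (log_type TEXT, org_id TEXT, user_id TEXT)")
        conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT org_id, user_id, COUNT(*) FROM logs "
            "WHERE log_type = 'U' GROUP BY org_id, user_id "
            "ORDER BY org_id, user_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def procedural_counts(rows):
    """Procedural: spell out each step in order, one named relation at a time."""
    u_logs = [row for row in rows if row[0] == "U"]            # FILTER
    u_log_fields = [(org, user) for _, org, user in u_logs]     # FOREACH ... GENERATE
    counts = Counter(u_log_fields)                              # GROUP + COUNT
    return sorted((org, user, n) for (org, user), n in counts.items())


if __name__ == "__main__":
    print(declarative_counts(ROWS))
    print(procedural_counts(ROWS))
```

Both functions return the same rows; the difference is that the procedural version exposes every intermediate relation by name, which is how a Pig Latin script reads.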

FEATURES OF PIG

PIG : Configure

PIG : Data Types

PIG : Run

PIG : Latin Constructs

Sample Data

Employee > 100K salary

More : Pig Latin

Is that ALL ?

PIG is Extremely Flexible

PigLatin : UDF

UDF : Types

UDF Library

UDF : How to write (EvalFunc)

UDF : How to use custom UDF in PIG

UDF : How to execute pig script with custom UDF

Q & A

Our Goal

Master the Eco-System one bite at a time, every week

Appreciate your Feedback @ Group Reviews

Thank You.

www.bigdatapro.io

Thank You and Have a Nice Week!!

Meet you all Next Week
