Course Description

The Big Data Science course covers advanced analytical and machine learning techniques using popular tools in the analytics industry such as Hadoop, Python and Spark. The course takes a case-study approach to learning and blends analytics with technology, making it suitable for aspirants who want to develop advanced data science skills.

Course Objective

This is our advanced Big Data training, in which students gain a practical skill set not only in Hadoop in detail but also in advanced analytics concepts through Python, Hadoop and Spark. For extensive hands-on practice, students receive several assignments and projects. At the end of the program, candidates are awarded the Certified Big Data Science Certification on successful completion of the projects provided as part of the training.

Who should do this course?

Students from an IT, software or data warehouse background who want to move into the Big Data Analytics domain.

Prerequisites

Knowledge of Excel is mandatory and a quantitative background is preferred. Knowledge of any programming language and exposure to data analytics would be an advantage.

Who are the trainers?

Our trainers are highly qualified industry experts and certified instructors with more than 10 years of global analytical experience.

Projects – Case Studies

Key Drivers for Customer Spending
Objective: To provide end-to-end steps for building and validating a regression model that identifies the key drivers of customer spend using Python-Spark.
Problem Statement: A leading bank would like to identify the key drivers of customer spending so that it can define a strategy to optimize product features.

Predicting Bad Customers (Defaulters) Using Credit Customer Application Data
Objective: To provide end-to-end steps for building and validating a classification model using Python-Spark.
Problem Statement: A leading bank would like to predict bad customers (defaulters) based on the data customers provide in their applications.

Telecom Customer Segmentation
Objective: To apply advanced algorithms such as factor and cluster analysis for data reduction and customer segmentation based on customer behavioural data.
Problem Statement: Build an enriched customer segmentation for a leading telecom company and profile the segments using different KPIs in order to define a marketing strategy.

Air Passengers Forecasting
Objective: To give hands-on experience in applying different time series forecasting techniques (averages/smoothing, decomposition, ARIMA, etc.).
Problem Statement: A leading travel company would like to predict the number of air passengers travelling to Europe so that it can define its marketing strategy accordingly.
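As an illustration of the smoothing techniques named in the forecasting case study, here is a minimal sketch of simple exponential smoothing in plain Python. The function name, the `alpha` value and the passenger figures are assumptions made up for this example; they are not course material or real data, and the course itself may use a library instead.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s, where s[0] = x[0] and
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [float(series[0])]
    for value in series[1:]:
        # Blend the new observation with the previous smoothed level.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    # Hypothetical monthly passenger counts (illustrative only).
    passengers = [10, 20, 30]
    levels = simple_exponential_smoothing(passengers, alpha=0.5)
    # The last smoothed level serves as the one-step-ahead forecast.
    print(levels, "next-period forecast:", levels[-1])
```

A larger `alpha` makes the forecast react faster to recent observations; a smaller one produces a smoother series.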

Who provides the certification?

The certification is provided by Databyte Academy. Upon successful completion of the program, students will be conferred with dual certification:
- Certificate of Completion
- CERTIFIED BIG DATA SCIENTIST*

*In order to be "Certified" as part of the course, students need to complete the assignments and examination. Once all your assignments are submitted and evaluated, the certificate shall be awarded.

Course Outcome

Ability to understand big data and use Big Data ecosystem tools to store and process it. Students also get hands-on exposure to using big data technology to improve performance across functions by storing, managing and processing big data efficiently.

CERTIFIED BIG DATA SCIENTIST (Hadoop - Spark - Python)

Total Duration: 120 Hours (15 Days)

Databyte Academy Sdn Bhd (1176678-V)
No. 18-4, Jalan 13/48A
Sentul Boulevard Shop Office
Jalan Sentul, 51000 Kuala Lumpur, Malaysia

[email protected]

www.databyte.com.my

+603-4045 5000

+603-4045 6000

Course Content

Introduction to Data Science

• What is Data Science?
• Data Science vs. Analytics vs. Data Warehousing, OLAP, MIS Reporting
• Relevance in industry and need of the hour
• Types of problems and objectives in various industries
• How leading companies are harnessing the power of Data Science
• Different phases of a typical Analytics / Data Science project

Introduction to Big Data

• Introduction and Relevance
• Uses of Big Data analytics in various industries such as Telecom, E-commerce, Finance and Insurance
• Problems with Traditional Large-Scale Systems

Hadoop and MapReduce (YARN)

• HDFS Overview & Data storage in HDFS
• Getting data into Hadoop from a local machine and vice versa (Data Loading Techniques)
• MapReduce Overview (Traditional way vs. MapReduce way)
• Concept of Mapper & Reducer
• Understanding the MapReduce program framework
• Develop a MapReduce program using Java (Basic)
• Develop a MapReduce program with the streaming API (Basic)
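The Mapper and Reducer concepts in this module can be sketched in a few lines of plain Python. This is a single-process simulation written only for illustration: the function names are assumptions, and a real Hadoop job distributes the map, shuffle and reduce phases across a cluster rather than running them in one script.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, which the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big insights", "data science"]
    print(word_count(sample))
```

With the Hadoop streaming API, the mapper and reducer would typically be separate scripts that read lines from standard input and write key-value pairs to standard output.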

Hadoop (Big Data) Ecosystem

• Motivation for Hadoop
• Different types of projects by Apache
• Role of projects in the Hadoop Ecosystem
• Key technology foundations required for Big Data
• Limitations and Solutions of existing Data Analytics Architecture
• Comparison of traditional data management systems with Big Data management systems
• Evaluate key framework requirements for Big Data analytics
• Hadoop Ecosystem & Hadoop 2.x core components
• Explain the relevance of real-time data
• Explain how to use Big Data and real-time data as a business planning tool

Hadoop Cluster - Architecture - Configuration File

• Hadoop Master-Slave Architecture
• The Hadoop Distributed File System - Concept of data storage
• Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
• Hadoop cluster setup - Installation
• Hadoop 2.x Cluster Architecture
• A typical enterprise cluster - Hadoop Cluster Modes
• Understanding cluster management tools such as Cloudera Manager and Apache Ambari

PYTHON: Data Analysis and Visualization

• Introduction to exploratory data analysis
• Descriptive statistics, frequency tables and summarization
• Univariate Analysis (distribution of data & graphical analysis)
• Bivariate Analysis (cross tabs, distributions & relationships, graphical analysis)
• Creating graphs (bar, pie, line chart, histogram, box plot, scatter, density, etc.)
• Important packages for exploratory analysis (NumPy arrays, Matplotlib, Pandas, scipy.stats, etc.)

PYTHON: Basic Statistics

• Basic Statistics - Measures of Central Tendency and Variance
• Building blocks - Probability Distributions - Normal distribution - Central Limit Theorem
• Inferential Statistics - Sampling - Concept of Hypothesis Testing
• Statistical Methods - Z/t-tests (one sample, independent, paired), ANOVA, Correlations and Chi-square
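To make the one-sample t-test from this module concrete, the sketch below computes the test statistic with Python's standard library only. The function name and the sample values are assumptions for illustration; in practice a package such as scipy.stats would also return the p-value.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical measurements (illustrative only).
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    t_stat, dof = one_sample_t(data, mu0=4)
    print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

The statistic is then compared against the t distribution with `n - 1` degrees of freedom to decide whether to reject the null hypothesis.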

PYTHON: Accessing / Importing and Exporting Data

• Importing data from various sources (CSV, TXT, Excel, Access, etc.)
• Database input (connecting to a database)
• Viewing data objects - subsetting
• Exporting data to various formats

PYTHON: Data Manipulation - Cleansing

• Cleansing data with Python
• Data manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting, etc.)
• Data manipulation tools (operators, functions, packages, control structures, loops, arrays, etc.)
• Python built-in functions (text, numeric, date, utility functions)
• Python user-defined functions
• Stripping out extraneous information
• Normalizing data
• Formatting data
• Important Python packages for data manipulation (Pandas, NumPy, etc.)
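Several of the cleansing steps in this module (stripping extraneous whitespace, type conversion, removing duplicates and normalizing) can be sketched as follows. The syllabus names Pandas and NumPy for this work; the sketch deliberately uses only the standard library so it runs anywhere, and the `name` and `spend` fields are hypothetical.

```python
def clean_records(rows):
    """Strip whitespace, convert spend to float, and drop rows that have
    a missing spend value or duplicate an earlier row."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        raw_spend = row["spend"].strip()
        if not raw_spend:
            continue  # skip rows with a missing value
        record = (name, float(raw_spend))
        if record in seen:
            continue  # skip duplicates
        seen.add(record)
        cleaned.append({"name": name, "spend": record[1]})
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


if __name__ == "__main__":
    # Hypothetical raw records (illustrative only).
    raw = [
        {"name": "  alice ", "spend": "100"},
        {"name": "Alice", "spend": "100.0"},
        {"name": "bob", "spend": " 300 "},
        {"name": "carol", "spend": ""},
        {"name": "dave", "spend": "200"},
    ]
    records = clean_records(raw)
    print(records)
    print(min_max_normalize([r["spend"] for r in records]))
```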

PYTHON: Machine Learning - Predictive Modeling Basics

• Introduction to Machine Learning & Predictive Modeling
• Types of business problems - Mapping of techniques
• Major classes of learning algorithms - Supervised vs. Unsupervised Learning
• Different phases of predictive modeling (data pre-processing, sampling, model building, validation)
• Overfitting (bias-variance trade-off) & performance metrics
• Types of validation (bootstrapping, K-fold validation, etc.)
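The K-fold validation listed in this module splits the data into k parts and uses each part once as the test set. The sketch below shows only the index bookkeeping, using contiguous folds and no shuffling; the function name is an assumption, and libraries such as scikit-learn provide a ready-made equivalent.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


if __name__ == "__main__":
    for fold, (train, test) in enumerate(k_fold_indices(10, 3), start=1):
        print(f"fold {fold}: train={train} test={test}")
```

A model would be fitted on each `train` split and scored on the matching `test` split, with the k scores averaged to estimate out-of-sample performance. Real data is usually shuffled before splitting.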

PYTHON: Machine Learning in Practice

• Linear Regression
• Logistic Regression
• Segmentation - Cluster Analysis (K-Means)
• Decision Trees (CHAID/CART/C5.0)
• Artificial Neural Networks (ANN)
• Support Vector Machines (SVM)
• Ensemble Learning (Random Forest, Bagging & Boosting)
• Other Techniques (KNN, Naïve Bayes)
• Important packages for Machine Learning (scikit-learn, scipy.stats, etc.)
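Of the techniques listed in this module, K-nearest neighbours (KNN) is compact enough to sketch in full. The version below uses only the standard library and a toy data set invented for this example; the function name is an assumption, and the course itself points to scikit-learn for practical work.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature customers labelled by spend level (illustrative only).
    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    labels = ["low", "low", "low", "high", "high", "high"]
    print(knn_predict(points, labels, (2, 2)))
    print(knn_predict(points, labels, (8.5, 8.5)))
```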

PYTHON: Introduction and Essentials

• Overview of Python - Starting Python
• Introduction to Python editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
• Custom environment settings
• Concept of packages/libraries - Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc.)
• Installing & loading packages & namespaces
• Data types & data objects/structures (tuples, lists, dictionaries)
• List and dictionary comprehensions
• Variable & value labels - Date & time values
• Basic operations - mathematical - string - date
• Reading and writing data
• Simple plotting
• Control flow
• Debugging
• Code profiling
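As a small taste of the data structures and comprehensions covered in this module, the snippet below builds a list, a dictionary and a tuple from the same input. The variable names and values are made up for illustration.

```python
# A list of tool names used as example input.
tools = ["hadoop", "spark", "python", "hive"]

# Dictionary comprehension: map each name to its length.
name_lengths = {name: len(name) for name in tools}

# List comprehension with a filter: upper-case names longer than four letters.
long_names = [name.upper() for name in tools if len(name) > 4]

# Tuples are immutable sequences; here one pairs the first and last tool.
first_and_last = (tools[0], tools[-1])

if __name__ == "__main__":
    print(name_lengths)
    print(long_names)
    print(first_and_last)
```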

Data Integration Using SQOOP & FLUME

• Integrating Hadoop into an existing enterprise
• Loading data from an RDBMS into HDFS by using Sqoop
• Managing real-time data using Flume
• Accessing HDFS from legacy systems

SPARK Introduction

• Introduction to Apache Spark
• Streaming data vs. in-memory data
• MapReduce vs. Spark
• Modes of Spark
• Spark installation demo
• Overview of Spark on a cluster
• Spark standalone cluster

SPARK: SPARK Meets HIVE

• Analyze Hive and Spark SQL architecture
• Analyze Spark SQL
• Context in Spark SQL
• Implement a sample example for Spark SQL
• Integrating Hive and Spark SQL
• Support for JSON and Parquet file formats
• Implement data visualization in Spark
• Loading of data
• Hive queries through Spark
• Performance tuning tips in Spark
• Shared variables: broadcast variables & accumulators

SPARK GraphX

• Overview of the GraphX module in Spark
• Creating graphs with GraphX

Introduction to Machine Learning Using Spark

• Understand the machine learning framework
• Implement some of the ML algorithms using Spark MLlib

Project

• Consolidate all the learnings
• Work on a Big Data project by integrating various key components

SPARK Streaming

• Extract and analyze data from Twitter using Spark Streaming
• Comparison of Spark and Storm - Overview

SPARK: SPARK in Practice

• Invoking the Spark shell
• Creating the Spark context
• Loading a file in the shell
• Performing some basic operations on files in the Spark shell
• Caching overview
• Distributed persistence
• Spark Streaming overview (example: streaming word count)

Data Analysis Using HIVE

• Apache Hive - Hive vs. Pig - Hive use cases
• Discuss the Hive data storage principle
• Explain the file formats and record formats supported by the Hive environment
• Perform operations with data in Hive
• HiveQL: joining tables, dynamic partitioning, custom Map/Reduce scripts
• Hive script, Hive UDF
• Hive persistence formats
• Loading data in Hive - Methods
• Serialization & deserialization
• Handling text data using Hive
• Integrating external BI tools with Hadoop Hive

Data Analysis Using IMPALA

• Impala & architecture
• How Impala executes queries and its importance
• Hive vs. Pig vs. Impala
• Extending Impala with user-defined functions

Introduction to other Ecosystem Tools

• NoSQL database - HBase
• Introduction to Oozie

Data Analysis Using PIG

• Introduction to data analysis tools
• Apache Pig - MapReduce vs. Pig, Pig use cases
• Pig's data model
• Pig streaming
• Pig Latin program & execution
• Pig Latin: relational operators, file loaders, Group operator, COGROUP operator, joins and COGROUP, Union, diagnostic operators, Pig UDF
• Writing Java UDFs
• Embedded Pig in Java
• Pig macros
• Parameter substitution
• Use Pig to automate the design and implementation of MapReduce applications
• Use Pig to apply structure to unstructured Big Data
