Iasi Code Camp, 20 April 2013: Testing Big Data - Anca Sfecla, Embarcadero
Testing Big Data
Prepared by: Anca Andreea Sfecla, Quality Assurance Manager Embarcadero Technologies Romania
@ CODECAMP 2013,
20th April 2013
Prepared by Anca Sfecla, QAM - Embarcadero Technologies
What is Big Data?
• “Big Data is the frontier of a firm’s ability to store, process, and access all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.” - Forrester Research
• “Big data creates a new layer in the economy which is all about information, turning information, or data, into revenue. In 2013, big data is forecast to drive $34 billion of IT spending” – Gartner Research
Big Data Characteristics
Big Data
Volume
Variety
Velocity
Value
Big Data Success Stories
• Detecting infections in premature infants up to 24 hours before they exhibit symptoms
• Reducing the cost of sequencing a genome from $10,000 to less than $100
• Predicting flu outbreaks by analyzing massive numbers of Google searches related to flu symptoms
EDW versus Big Data
EDW | Big Data
Clean Data | Unclean Data
Gigabytes to Terabytes (1000 GB) | Petabytes (1000 TB) to Exabytes (1000 PB)
Simplified, Structured | Complex, Semi or Unstructured
Data from relational database | Data from non-relational flat file storage
Centralized data | Distributed data
Structured Database Schema | Customized-instant schema, generated
Big Data Solutions
Microsoft Big Data Solution
Big Data Processing using Hadoop Framework
Big Data Architecture
[Diagram: Big Data Architecture]
• Data sources: Web Logs, Streaming Data, Social Data, Transactional Data (RDBMS)
• Data Load using Sqoop
• Hadoop: Pig, Hive, MapReduce (Job Execution), HBase (NoSQL DB), HDFS (Hadoop Distributed File System)
• Processed Data
• ETL Process
• Enterprise Data Warehouse
• Big Data Analytics
1 Pre-Hadoop Processing
Possible problems
• incorrect data captured from source systems
• incorrect storage of data
• incomplete or incorrect replications
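One way to catch incomplete or incorrect loads before any Hadoop processing starts is to reconcile the records captured at the source against the records that actually landed in storage. The sketch below is only an illustration under stated assumptions: records are modelled as plain tuples, and the helper names `record_fingerprint` and `reconcile` are hypothetical, not part of any Hadoop tool.

```python
import hashlib
from collections import Counter

def record_fingerprint(record):
    """Return a stable hash for one record (a tuple of field values)."""
    joined = "\x1f".join(str(field) for field in record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def reconcile(source_records, loaded_records):
    """Compare two record collections regardless of order.

    Reports record counts plus the fingerprints that are missing from
    the load or that appear in the load without a matching source record.
    """
    source = Counter(record_fingerprint(r) for r in source_records)
    loaded = Counter(record_fingerprint(r) for r in loaded_records)
    return {
        "source_count": sum(source.values()),
        "loaded_count": sum(loaded.values()),
        "missing": source - loaded,      # in source, not in load
        "unexpected": loaded - source,   # in load, not in source (e.g. duplicates)
    }

# Illustrative data: record 2 was dropped and record 3 was replicated twice.
source = [(1, "alice", 10.5), (2, "bob", 7.0), (3, "carol", 3.25)]
loaded = [(1, "alice", 10.5), (3, "carol", 3.25), (3, "carol", 3.25)]
report = reconcile(source, loaded)
```

Note that the record counts match in this example (3 and 3), so a count check alone would pass; the fingerprint comparison is what exposes the dropped and duplicated records.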
2 Map-Reduce process validation
Possible problems
• coding issues in map-reduce jobs
• jobs that work correctly on a standalone node but incorrectly on multiple nodes
• incorrect aggregations, incorrect node configurations, and incorrect output format
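A job that passes on a standalone node but fails on several nodes often hides an aggregation that is not safe to compute per split. The sketch below simulates this in plain Python, with no Hadoop involved: `split_input`, `buggy_mean`, and `correct_mean` are hypothetical names, and the round-robin split is an assumption made only for the illustration. The test idea is to run the same logic with one split and with several and compare the results.

```python
def split_input(values, num_splits):
    """Deal values round-robin into num_splits input splits."""
    return [values[i::num_splits] for i in range(num_splits)]

def buggy_mean(values, num_splits):
    """Each split emits its local mean; the reduce step averages the means.

    This is only correct when there is a single split (or all splits
    happen to have the same size).
    """
    partials = [sum(s) / len(s) for s in split_input(values, num_splits) if s]
    return sum(partials) / len(partials)

def correct_mean(values, num_splits):
    """Each split emits (sum, count); the reduce step combines them."""
    partials = [(sum(s), len(s)) for s in split_input(values, num_splits) if s]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count

values = [1, 2, 3, 4, 100]

single_node = buggy_mean(values, 1)    # matches the true mean
multi_node = buggy_mean(values, 2)     # differs once the input is split
reference = correct_mean(values, 2)    # stable regardless of split count
```

Comparing `single_node` with `multi_node` flags the defect, while `correct_mean` returns the same value for any number of splits.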
3 Data Extract and Load Process
Possible problems
• incorrectly applied transformation rules
• incomplete data extract from HDFS
• incorrect load of HDFS files into analysis tools
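Checks for this phase can be sketched as applying the expected transformation rule to each extracted row independently and comparing the result with what was loaded. Everything in the example is an assumption for illustration: the rule in `transform` (trim and uppercase the country code, convert cents to currency units), the field names, and the helper `validate_load` are hypothetical.

```python
def transform(row):
    """Hypothetical transformation rule applied between extract and load."""
    return {
        "id": row["id"],
        "country": row["country"].strip().upper(),
        "amount": row["amount_cents"] / 100,
    }

def validate_load(extracted_rows, loaded_rows):
    """Return ids missing from the load and ids whose loaded values
    differ from the expected transformed values."""
    loaded_by_id = {row["id"]: row for row in loaded_rows}
    missing, mismatched = [], []
    for row in extracted_rows:
        expected = transform(row)
        actual = loaded_by_id.get(row["id"])
        if actual is None:
            missing.append(row["id"])        # incomplete extract or load
        elif actual != expected:
            mismatched.append(row["id"])     # rule applied incorrectly
    return missing, mismatched

extracted = [
    {"id": 1, "country": " ro", "amount_cents": 1250},
    {"id": 2, "country": "us", "amount_cents": 999},
    {"id": 3, "country": "de", "amount_cents": 500},
]
loaded = [
    {"id": 1, "country": "RO", "amount": 12.5},
    {"id": 2, "country": "us", "amount": 9.99},   # country not uppercased
]
missing, mismatched = validate_load(extracted, loaded)
```

Here row 3 never reached the target and row 2 was loaded without the uppercase rule applied, so each problem type from the list shows up in a separate result.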
Reports testing
Possible problems
• report definitions not set as per requirement
• report data issues
• layout and format issues
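Report definition and report data issues can be approached by checking each report row against the required columns and by recomputing the aggregates from the underlying detail rows. The following is a minimal sketch; the function `check_report`, the column names, and the sample figures are all hypothetical.

```python
def check_report(report_rows, required_columns, detail_rows, group_key, measure):
    """Return a list of human-readable issues found in a report."""
    issues = []

    # Definition check: every report row must carry the required columns.
    for index, row in enumerate(report_rows):
        absent = [c for c in required_columns if c not in row]
        if absent:
            issues.append(f"row {index}: missing columns {absent}")

    # Data check: recompute the totals from the detail rows.
    expected = {}
    for detail in detail_rows:
        key = detail[group_key]
        expected[key] = expected.get(key, 0) + detail[measure]

    for row in report_rows:
        key = row.get(group_key)
        if key in expected and row.get(measure) != expected[key]:
            issues.append(
                f"{key}: report shows {row.get(measure)}, expected {expected[key]}"
            )
    return issues

detail = [
    {"region": "north", "sales": 10},
    {"region": "north", "sales": 5},
    {"region": "south", "sales": 7},
]
report = [
    {"region": "north", "sales": 15},
    {"region": "south", "sales": 8},   # does not match the detail data
]
issues = check_report(report, ["region", "sales"], detail, "region", "sales")
```

Layout and format issues are not covered by this sketch; those usually need a visual or template-based comparison.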
Non-Functional Testing
Possible problems
• imbalance in input splits
• redundant sorts
• moving most of the aggregation computations to the Reduce process
• node failures
• data corruption
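Imbalance in input splits can be quantified with a simple skew ratio: the size of the largest split divided by the average split size. The sketch below assumes split sizes (for example in MB) have already been collected; the function names and the 1.5 threshold are assumptions chosen for illustration, not recommended values.

```python
from statistics import mean

def split_skew(split_sizes):
    """Ratio of the largest split to the average split size.

    1.0 means perfectly balanced; larger values mean one task
    will dominate the running time of the map phase.
    """
    return max(split_sizes) / mean(split_sizes)

def is_imbalanced(split_sizes, threshold=1.5):
    """Flag a job whose skew ratio exceeds the (assumed) threshold."""
    return split_skew(split_sizes) > threshold

balanced = [128, 128, 128, 128]
skewed = [128, 128, 128, 640]

balanced_skew = split_skew(balanced)
skewed_skew = split_skew(skewed)
```

The same ratio can be applied to the number of records per reducer key to spot aggregation work that is concentrated in a few Reduce tasks.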
New to the tester
• Semi-structured and unstructured data
• Immense volumes of dynamic, complex data
• Test environment
• Big Data ecosystem
• Pure programming tools
• Non-SQL queries
Testing Big Data
• Big
• Fast
• Complex
• Rewarding
Q&A
Thank you!
Please fill in your evaluation form.