www.decideo.fr/bruley
MapReduce
April 2012
Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
What is MapReduce?
Restricted parallel programming model meant for large clusters
– User implements Map() and Reduce() functions
Parallel computing framework
– Libraries take care of EVERYTHING else
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
Useful model for many practical tasks
Map and Reduce
The ideas of Map and Reduce are more than 40 years old
– Present in functional programming languages generally
– See, e.g., APL, Lisp and ML
Alternate name for Map: Apply-All
Higher Order Functions
– take function definitions as arguments, or
– return a function as output
Map and Reduce are higher-order functions.
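To make the higher-order idea concrete, here is a minimal Python sketch using the built-in `map` and `functools.reduce`. The helper names (`square`, `add`, `make_adder`) are hypothetical and chosen only for illustration; they do not come from the original slides.

```python
from functools import reduce

def square(x):
    """An ordinary function that will be passed as an argument."""
    return x * x

def add(accumulator, x):
    """Combining function handed to reduce()."""
    return accumulator + x

numbers = [1, 2, 3, 4]

# map() takes a function as an argument and applies it to every element.
squares = list(map(square, numbers))

# reduce() takes a function as an argument and folds the list into one value.
total = reduce(add, squares, 0)

def make_adder(n):
    """A higher-order function of the other kind: it returns a function."""
    def adder(x):
        return x + n
    return adder

add_ten = make_adder(10)
shifted = list(map(add_ten, numbers))
```

Here `map` and `reduce` take functions as arguments, while `make_adder` returns one, which covers both halves of the definition above.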
Map and Reduce Functions
Functions borrowed from functional programming languages (e.g., Lisp)
Map()
– Processes a key/value pair to generate intermediate key/value pairs
Reduce()
– Merges all intermediate values associated with the same key
Example: Counting Words
Map()
– Input: <filename, file text>
– Parses the file and emits <word, count> pairs
• e.g., <"hello", 1>
Reduce()
– Sums all values for the same key and emits <word, TotalCount>
• e.g., <"hello", (3 5 2 7)> => <"hello", 17>
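The word-count example can be sketched in a few lines of single-process Python. This is only an illustration of the Map() and Reduce() contracts, not how a real framework runs them: the function names (`map_words`, `reduce_counts`, `run_word_count`) and the sample file contents are assumptions, and the grouping step that a framework would perform across machines is simulated here with a dictionary.

```python
from collections import defaultdict

def map_words(filename, text):
    """Map(): takes <filename, file text> and emits <word, 1> pairs."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_counts(word, counts):
    """Reduce(): sums all values for one key and emits <word, TotalCount>."""
    return (word, sum(counts))

def run_word_count(files):
    """Groups intermediate pairs by key, then reduces each group."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_words(filename, text):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

# Hypothetical input files, for illustration only.
files = {"a.txt": "hello world hello", "b.txt": "hello mapreduce"}
totals = run_word_count(files)
```

Calling `reduce_counts("hello", [3, 5, 2, 7])` reproduces the slide's example and yields `("hello", 17)`.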
Execution on Clusters
1. Split input files (M splits)
2. Assign master and workers
3. Run map tasks
4. Write intermediate data to disk (R regions)
5. Read and sort intermediate data
6. Run reduce tasks
7. Return
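The seven steps above can be imitated on one machine to see how the pieces fit together. The sketch below is a simplification under stated assumptions: everything stays in memory rather than on disk, there is no master or worker assignment (step 2), tasks run sequentially, and the names `run_job`, `count_words`, and `sum_counts` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, map_fn, reduce_fn, num_splits, num_regions):
    # Step 1: split the input into M pieces (round-robin for simplicity).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Step 2 (assigning a master and workers) is skipped: one process does all the work.

    # Steps 3-4: run each map task and place its output into one of R regions.
    # Lists stand in for the intermediate files a real cluster writes to disk.
    regions = [[] for _ in range(num_regions)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[hash(key) % num_regions].append((key, value))

    # Steps 5-6: each reduce task reads its region, sorts by key, and reduces
    # every run of identical keys.
    output = {}
    for region in regions:
        region.sort(key=itemgetter(0))
        for key, group in groupby(region, key=itemgetter(0)):
            output[key] = reduce_fn(key, [value for _, value in group])

    # Step 7: return the combined result to the caller.
    return output

def count_words(line):
    for word in line.split():
        yield (word, 1)

def sum_counts(word, counts):
    return sum(counts)

lines = ["a b a", "b c", "a"]
result = run_job(lines, count_words, sum_counts, num_splits=2, num_regions=3)
```

The result does not depend on the values chosen for `num_splits` and `num_regions`; they only change how the work is divided.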
Map/Reduce Cluster Implementation
[Figure: input files (split 0 to split 4) feed M map tasks, which write intermediate files; R reduce tasks read them and write the output files (Output 0, Output 1)]
Several map or reduce tasks can run on a single computer
Each intermediate file is divided into R partitions by a partitioning function
Each reduce task corresponds to one partition
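A partitioning function can be as simple as a hash of the key taken modulo R. The sketch below assumes that scheme and uses `zlib.crc32` so the result is stable across runs; the names `partition` and `partition_map_output` and the sample pairs are hypothetical. The point it illustrates is that the same key from different map tasks always lands in the same partition, so a single reduce task sees all of its values.

```python
import zlib

def partition(key, num_partitions):
    """Assigns a key to one of R partitions using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_map_output(pairs, num_partitions):
    """Divides one map task's output into R buckets, one per reduce task."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition(key, num_partitions)].append((key, value))
    return buckets

R = 3
# Output of two hypothetical map tasks.
map_task_0 = [("hello", 1), ("world", 1)]
map_task_1 = [("hello", 1), ("cluster", 1)]

buckets_0 = partition_map_output(map_task_0, R)
buckets_1 = partition_map_output(map_task_1, R)

# Both map tasks send "hello" to the same partition index.
hello_partition = partition("hello", R)
```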
Map Reduce vs. Parallel Databases
Map Reduce is widely used for parallel processing
– Google, Yahoo, and hundreds of other companies
– Example uses: computing PageRank, building keyword indices, analyzing web click logs, …
Database people say:
– But parallel databases have been doing this for decades
Map Reduce people say:
– We operate at scales of thousands of machines
– We handle failures seamlessly
– We allow procedural code in map and reduce, and data of any type
Map Reduce Implementations
Google
– Not available outside Google
Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
Teradata Aster
– Cluster-optimized SQL database that also implements MapReduce
• IITB alumnus among founders
And several others, such as Cassandra at Facebook
MapReduce vs. Hadoop
                            MapReduce    Hadoop
Organization                Google       Yahoo/Apache
Implementation language     C++          Java
Distributed file system     GFS          HDFS
Database                    Bigtable     HBase
Distributed lock manager    Chubby       ZooKeeper
Solutions Stack for Teradata Aster
[Figure: solutions stack in three layers: Aster Data Ecosystem, Aster Data Platform, and Infrastructure. Labels shown: Aster Data nCluster; Business Intelligence Tools; Analytics Specialists; Data Integration / ETL; Systems Management; Security; Query Tools; Servers; Operating System; Cloud Infrastructure; Storage]
Teradata Aster Platform Infrastructure
For physical infrastructure (non-cloud) deployments
Server Hardware
– Certified commodity (x86) server hardware with internal storage
Operating System
– Certified Linux operating system
Aster Data Analytic Platform
– Aster Data nCluster packaged software
Teradata Aster Infrastructure
For cloud deployments
Compute Instance
– Compute instance from cloud provider (e.g., Amazon Web Services EC2)
– Instance labels shown: CC, xLarge
Storage
– Storage connected to cloud computing capacity
– Storage labels shown: EBS, Ephemeral
Operating System
– Linux operating system
Aster Data Analytic Platform
– Aster Data nCluster packaged software
Teradata Aster Architecture for Analytics
Your Analytics & Advanced Reporting Applications
Aster Data nCluster
Unified Interface: SQL, SQL-MapReduce
• Standard SQL interface
• MapReduce processing integrated with SQL via SQL-MapReduce interface
Analytic Functions and Frameworks
• Rich libraries of MapReduce analytics from Aster Data and partners
• Visual development environment: develop in hours
Analytics Processing Engines: SQL, MapReduce, …
• Optimized SQL engine
• Fully-integrated in-database MapReduce
Apps
• Support for in-database processing of custom applications written in a broad variety of languages
• Integration with third-party packaged software via ODBC/JDBC or in-database integration
Massively Parallel Data Stores
• Hybrid row/column DBMS
• Linear, incremental scalability
• Commodity hardware
Teradata Aster Ecosystem
Partner | Product | Product release | Platform for Certification
MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
Informatica | Powercenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM | Cognos | 10.1FP1 | n/a
Tableau | Tableau Server | 6 | Windows (SS: TBU)
Microsoft | SSLS, SSAS, SSFS, SSIS; .NET Framework | SQL Server 2008; 2.0 | Windows Server, 2008 64-bit; Windows 2003, 32-bit
*Oracle BIEE certification currently in process