CUSTOMER APPLICATIONS OF HADOOP ON RED HAT STORAGE SERVER
DESCRIPTION
"If analyzing data were easy, wouldn’t everybody do it? The fact is, several steps are necessary to effectively use Hadoop out of the box to process data for the information you need. From preprocessing relevant data to preparing it for consumption to copying large data sets to another location for analysis—the time it takes to get value from your data might be so long that the data is no longer relevant. So why bring your data to the analytics engine when you can bring the analytics engine to your data? In this session, you’ll learn how Red Hat customers use Red Hat Storage Server to more quickly start mining their data by allowing the application that generates the data they need to write directly to where the analytics will run. And in the process of doing so, they can skip making an extra copy of the data."TRANSCRIPT
CUSTOMER APPLICATIONS OF HADOOP ON RED HAT STORAGE
Jacob Shucart, Sr. Solution Architect, Red Hat
Erin Boyd, Principal Software Engineer, Red Hat
Diane Feddema, Principal Software Engineer, Red Hat
AGENDA
What is Red Hat Storage Server?
•Overview of technology
How does it integrate with Hadoop?
•Summary of integration
Customer use case
•DreamWorks Animation
Demonstration
Q&A
OPEN SOFTWARE-DEFINED STORAGE
TRADITIONAL STORAGE: HARDWARE CENTRIC • LOCKED DOWN • PROPRIETARY • VENDOR CONTROLLED
THE RED HAT WAY: SOFTWARE CENTRIC • FLEXIBLE • OPEN • INNOVATIVE
ARCHITECTURAL HIGHLIGHTS
DESIGNED FOR THE NEW DATA LANDSCAPE – PETABYTE SCALE
LINEAR SCALABILITY – PERFORMANCE AND CAPACITY
HIGHLY AVAILABLE – WITHOUT BREAKING THE BANK
ACCESSIBLE – FROM ANY APPLICATION, THROUGH ANY DEVICE
OPEN AND INTEROPERABLE – STANDARDS AND OPEN TECHNOLOGIES
SELF-HEALING, SELF-MANAGING – REDUCING OPERATIONAL OVERHEAD
HYBRID DATACENTER FOUNDATION - PRIVATE, PUBLIC, AND HYBRID CLOUDS
EXTENSIBLE – INNOVATE TO MEET YOUR UNIQUE BUSINESS REQUIREMENTS
DESIGNED FOR TODAY’S IT ECONOMICS – DO MORE WITH WHAT YOU HAVE
SINGLE PLATFORM FOR ALL YOUR DATA
FILE SERVICES • OPEN OBJECT APIs • CONVERGED COMPUTE AND STORAGE
OPEN, SOFTWARE-DEFINED STORAGE PLATFORM
PERSISTENT DATA STORES
[Slide diagram: file, application, big data, cloud, and mobile workloads served from scale-out storage deployed on physical (standard x86), virtual (idle or legacy), and cloud (EBS) infrastructure]
SCALE-OUT STORAGE ARCHITECTURE
ELASTIC SCALE-OUT ARCHITECTURE
GROW AS YOU GO
EXPAND TRANSPARENTLY
NO METADATA COMPLEXITY
SCALE OUT NEW DATA CAPACITY
SINGLE, SHARED NAMESPACE
[Slide diagram: apps and clients accessing scale-out storage through a single mount point, /mnt/point]
BRING APPLICATIONS CLOSER TO THE DATA
DATA LOCALITY
REDUCE FOOTPRINT
ISOLATE WORKLOAD
REMAIN SUPPORTED
STORAGE RESIDENT APPLICATIONS
FOUNDATION FOR BIG DATA
DIRECT DATA ACCESS
DON'T COPY YOUR DATA
FULL API COMPATIBILITY
SCALES AS NEEDED
BRING HADOOP TO DATA
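To make the "direct data access" point concrete, here is a minimal sketch of a Hadoop client reading a file in place through the GlusterFS Hadoop plugin. The plugin class name, the glusterfs:// scheme, and the /mnt/point path are assumptions based on the glusterfs-hadoop project; your plugin version and volume configuration may differ.

// Minimal sketch: reading a file in place through the GlusterFS Hadoop plugin.
// Assumes the glusterfs-hadoop plugin jar is on the classpath and that the
// plugin class name and glusterfs:// scheme match your installed version.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlusterRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point Hadoop at the GlusterFS FileSystem implementation instead of HDFS.
        conf.set("fs.glusterfs.impl", "org.apache.hadoop.fs.glusterfs.GlusterFileSystem");
        conf.set("fs.defaultFS", "glusterfs:///");

        FileSystem fs = FileSystem.get(URI.create("glusterfs:///"), conf);
        // The producing application wrote this file directly to the storage
        // volume, so no copy into HDFS is needed (illustrative path).
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/mnt/point/logs/sample.log"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}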
ENABLING HADOOP ON RHS VIA AMBARI
Ambari provides ease of installation and configuration of Hadoop
•High Touch Beta (HTB) enabled the ability to install and configure Hadoop on GlusterFS
•Initial goals:
• Remove the tight coupling to HDFS that created crippling dependencies between services
• Enable a GlusterFS presence in the UI
• Enable minimal monitoring of GlusterFS
HDP STACKS
•Hadoop services are loaded into Ambari via the concept of a stack
•A stack consists of service definitions, RPMs, and configurations for Hadoop services
•Development enabled GlusterFS as a client service in a new HDP/RHS stack:
• Client services have no control enablement (stop/start function) from the management console
• Client services cannot be monitored via Ganglia
• The GlusterFS client was deployed via standard Ambari agents and validated to be running properly for the Hadoop installation
SERVICES IN THE STACK
Stack 2.0.6.GlusterFS
•Contains all the core services of 2.0.6 with the addition of GlusterFS:
•GANGLIA •NAGIOS •GLUSTERFS •HBASE •HDFS •MAPREDUCE2 •OOZIE •SQOOP •YARN •HCATALOG •HIVE •PIG •WEBHCAT •ZOOKEEPER
EXPOSE GLUSTERFS OPTIONS FOR HADOOP
•Enabling the raw local filesystem
•Security enablement for multiple users
•Allow users in the Hadoop group to run jobs on Hadoop without compromising others’ data (see the configuration sketch below)
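A minimal sketch of how these two options might be set through Hadoop's standard Configuration API. fs.file.impl and RawLocalFileSystem are real Hadoop names; mapping them to the "raw local filesystem" option above, and the shared "hadoop" group convention, are assumptions for the sketch.

import org.apache.hadoop.conf.Configuration;

public class GlusterHadoopOptions {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Raw local filesystem: skip Hadoop's client-side .crc checksum files
        // and trust the underlying GlusterFS volume for data integrity.
        conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        // Multi-user security relies on POSIX ownership on the volume: users
        // in a shared "hadoop" group (an assumed convention) can run jobs
        // without being able to read one another's data.
        return conf;
    }
}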
DREAMWORKS ANIMATION WITH HADOOP ON RED HAT STORAGE
Diane Feddema, Principal Software Engineer, Red Hat
OVERVIEW
Introduction
● What does DreamWorks Animation want to know?
Using RHS to do big data analysis in place
● Price effective and uses existing infrastructure
What we found
● Some surprising findings
What we learned
● How to find the right questions and produce the appropriate log data for those questions
DREAMWORKS ANIMATION WEB SERVICE INFRASTRUCTURE
Render farm 1 with traffic servers
Render farm 2 with traffic servers
Render farm 3 with traffic servers
Data volume:
• 1 ms granularity
• ~3.5 TB per week
• From 8 traffic servers
• Currently, much of the data is discarded
Data analysis to date:
• Hourly analysis using Splunk, which is expensive and, in DreamWorks Animation’s view, provides generic, limited answers to their questions
• A number of questions about workflow and efficiencies remain outstanding
THE NATURE OF DREAMWORKS ANIMATION WEB SERVICE INFRASTRUCTURE DATA
The present log files provide:
• One record for each service request (and sub-service request)
Each record provides:
– Unix epoch timestamp and duration (ms) of a process
– Type of data being transferred (e.g., JSON, jpeg)
– Source, traffic, and destination IP information
– Code for the movie name
– Payload size and payload headers
– Cache result: HIT or MISS
– Correlation ID: unique job/sub-job identifiers
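As an illustration of the record structure just described, here is a hypothetical Java type for one log record. The whitespace delimiter and field order are assumptions for the sketch; the actual DreamWorks Animation log format was not shown in the talk.

public class ServiceRequestRecord {
    public long epochMillis;      // Unix epoch timestamp, 1 ms granularity
    public long durationMs;       // duration of the process
    public String contentType;    // type of data transferred, e.g. JSON or jpeg
    public String sourceIp, trafficIp, destIp;
    public String movieCode;      // code for the movie name
    public long payloadBytes;     // payload size
    public boolean cacheHit;      // cache result: HIT or MISS
    public String correlationId;  // unique job/sub-job identifier

    // Parse one whitespace-delimited record (assumed layout).
    public static ServiceRequestRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 10) {
            throw new IllegalArgumentException("malformed record: " + line);
        }
        ServiceRequestRecord r = new ServiceRequestRecord();
        r.epochMillis = Long.parseLong(f[0]);
        r.durationMs = Long.parseLong(f[1]);
        r.contentType = f[2];
        r.sourceIp = f[3];
        r.trafficIp = f[4];
        r.destIp = f[5];
        r.movieCode = f[6];
        r.payloadBytes = Long.parseLong(f[7]);
        r.cacheHit = "HIT".equals(f[8]);
        r.correlationId = f[9];
        return r;
    }
}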
THE DREAMWORKS ANIMATION QUESTIONS
DreamWorks Animation wanted to answer six questions about their web service infrastructure:
1. What is the service infrastructure HIT and MISS rate over time?
2. What does the dominant service call look like? What chain of sub-calls is most common?
3. How often are correlation IDs reused?
4. How do production and non-production jobs compare on these questions? (Some jobs run only to test the health of the system and are NOT production runs; production and non-production runs are mixed together on these traffic servers at all times.)
5. What is the average service call duration per traffic server?
6. How many non-zero duration events are there for traffic server cache hits and misses?
QUESTION 5
What is the average service call duration per traffic server?
QUESTION 6
How many non-zero duration events are there for traffic server cache hits and misses?
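As a sketch of how question 5 might be answered in place, here is a minimal MapReduce job that averages service call duration per traffic server. The field positions (duration in column 2, traffic server in column 5) follow the assumed record layout from the earlier sketch, not a published DreamWorks Animation format; with the GlusterFS plugin configured, the input path can point directly at the logs on the storage volume.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgDurationPerServer {

    // Emit (traffic server, duration in ms) for each well-formed record.
    public static class DurationMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().trim().split("\\s+");
            if (f.length < 5) return; // skip malformed records
            try {
                // f[1] = duration (ms), f[4] = traffic server IP (assumed layout)
                ctx.write(new Text(f[4]), new LongWritable(Long.parseLong(f[1])));
            } catch (NumberFormatException e) {
                // skip records with a non-numeric duration field
            }
        }
    }

    // Average the durations seen for each traffic server.
    public static class AvgReducer extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text server, Iterable<LongWritable> durations, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable d : durations) {
                sum += d.get();
                count++;
            }
            ctx.write(server, new Text(String.format("%.2f ms", (double) sum / count)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "avg-duration-per-traffic-server");
        job.setJarByClass(AvgDurationPerServer.class);
        job.setMapperClass(DurationMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Analyze the logs in place on the storage volume; no copy into HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}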
FINDINGS
• Demonstrated the usefulness of Hadoop on RHS for analyzing log data in place
• Found conditions that they did not know could happen:
– Found “impossible” missing-ID cases
– Events with single and double correlation IDs dominate, where they expected longer chains of service calls to dominate
• You need to make sure you have the right data for the questions asked; for example, the correlation ID data does not allow tracking how jobs are split
CONCLUSIONS AND FUTURE COLLABORATION
• Look at more time series questions, e.g., show farmid jobs and how they overlap in time
• Develop a Hadoop-based dashboard to monitor the “health” of the system and create automated alerts
• Rethink the questions that can be answered from the existing data
• Need to be able to examine individual cases to determine whether some applications of the system are not using the API properly
• Consider adding or substituting information in the log file to provide more useful data
THANK YOU AND Q&A
Questions and Answers
Contact Information
•Jacob Shucart – [email protected]
•Erin Boyd – [email protected]
•Diane Feddema – [email protected]
Thank you!
Check Out Other Red Hat Storage Activities at The Summit
•Enter the raffle for a chance to win a $500 gift card or a trip to LegoLand!
• Entry cards are available in all storage sessions; the more you attend, the more chances you have to win!
•Talk to Storage Experts:
• Red Hat Booth (# 211)
• Infrastructure
• Infrastructure-as-a-Service
•Storage Partner Solutions Booth (# 605)
•Upstream Gluster projects
• Developer Lounge
Follow us on Twitter, Facebook: @RedHatStorage