Expert Big Data Tips

TIPS from the Experts

Upload: qubole

Post on 11-Nov-2014



Data & Analytics


Whether you are interested in healthcare data analytics or looking to get started with big data and marketing, these fundamental principles from data experts will contribute to your success.


Page 1: Expert Big Data Tips

TIPS from the Experts

Page 2: Expert Big Data Tips

Table of Contents

Setup is Key
Think wide
Tool integration

Evaluate and Adapt
Sharing
Encryption

A data science mindset
Innovation
Real-time action


Page 4: Expert Big Data Tips

Create a data lake and give your business and data analysts access to all your data – structured and unstructured – with SQL engines like Hive. They will surprise you with the insight and value they can extract, and your development team will have less work answering ad-hoc queries.

Grant Unlimited Access


—Christian Prokopp, Principal Consultant at Big Data Partnership
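As a rough illustration of the kind of ad-hoc question an analyst can answer alone once the data is reachable through SQL, here is a minimal sketch. SQLite stands in for Hive only so the example runs locally; the `page_views` table and its columns are hypothetical, not taken from the tip.

```python
import sqlite3

# SQLite is an assumption used as a local stand-in for a SQL engine such as Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/pricing", 40),
        ("u2", "/pricing", 95),
        ("u1", "/docs", 300),
        ("u3", "/docs", 120),
    ],
)

# An ad-hoc question answered without developer help:
# which pages hold attention longest on average?
query = """
    SELECT page, COUNT(*) AS views, AVG(seconds) AS avg_seconds
    FROM page_views
    GROUP BY page
    ORDER BY avg_seconds DESC
"""
results = conn.execute(query).fetchall()
for page, views, avg_seconds in results:
    print(page, views, avg_seconds)
```

A query of this shape would look much the same in HiveQL, although table definitions and storage details differ.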

Page 5: Expert Big Data Tips

A frequent question is when to use MapReduce/Pig/Hive versus HBase/Cassandra/Impala. Non-functional requirements (NFRs) have to be considered when choosing a framework. MapReduce/Pig/Hive are used for high-throughput, high-latency requirements, as in batch processing and ETL. HBase/Cassandra/Impala are used for low-throughput, low-latency requirements, as in the case of a customer filling out an online application.

Select the Right Tools


—Praveen Sripati, Hadoop trainer and author of Hadoop Tips
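The trade-off in this tip can be written down as a small decision helper. This is only a sketch: the function name `suggest_frameworks` and the one-second threshold are illustrative assumptions, and a real decision would weigh more requirements than latency alone.

```python
# Illustrative only: the threshold and the names below are assumptions,
# not figures from the tip.
BATCH = ("MapReduce", "Pig", "Hive")
INTERACTIVE = ("HBase", "Cassandra", "Impala")


def suggest_frameworks(max_latency_seconds, threshold_seconds=1.0):
    """Return the framework family suited to the stated latency requirement."""
    if max_latency_seconds <= threshold_seconds:
        return INTERACTIVE
    return BATCH


print(suggest_frameworks(0.2))   # e.g. a customer filling out an online application
print(suggest_frameworks(3600))  # e.g. a nightly ETL job
```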

Page 6: Expert Big Data Tips

Improve query performance by considering Presto with RCFile or ORC File format.

Use Presto


—Minesh Patel, Qubole

Page 7: Expert Big Data Tips

Use Robust Machine Learning Algorithms to extract the data – Data collection and massive storage are only the enabling infrastructure. You should leverage existing and also proprietary machine learning algorithms that will discover hidden patterns and learn from the data what is important for the analyst to view and examine, and what is not.

Incorporate Machine Learning


—Idan Tendler, CEO of Fortscale

Page 8: Expert Big Data Tips

There is a big need for automation in Big Data. Security is an important industry that has proven the value of Big Data. But it has just as quickly proved that Big Data is valueless without automation wrapped around it to make it practical. Only once you make Big Data practical can you begin to perform analytics, which is where the value of Big Data in the security industry really gets unlocked.

Automation is Key


—Sean Brady, VP of Product Management at Vorstack

Page 9: Expert Big Data Tips

Segment the data based on demographic and/or firmographic information. This is an easy and inexpensive way to highlight trends in the primary customers and industries served. This information is very helpful when determining what new products and/or services should be offered. In addition, look for trends in behavioral transaction information and further optimize the customer’s experience with relevant marketing and messaging.

Identify Easy Wins


—David Handmaker, CEO of Next Day Flyers
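A minimal sketch of this kind of segmentation, assuming a list of order records. The field names (`industry`, `region`, `amount`) and the values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical order records; field names and values are illustrative assumptions.
orders = [
    {"industry": "Real Estate", "region": "West", "amount": 120.0},
    {"industry": "Restaurants", "region": "West", "amount": 80.0},
    {"industry": "Real Estate", "region": "East", "amount": 200.0},
    {"industry": "Restaurants", "region": "East", "amount": 40.0},
    {"industry": "Real Estate", "region": "West", "amount": 60.0},
]


def revenue_by_segment(records, key):
    """Sum order amounts for each value of the chosen segmentation field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


by_industry = revenue_by_segment(orders, "industry")
# Print the largest segments first.
for segment, total in sorted(by_industry.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{segment}: {total:.2f}")
```

The same function can segment on any other field, for example `revenue_by_segment(orders, "region")`.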

Page 10: Expert Big Data Tips

Identify all of the data you have access to and/or will produce, and explore possible audiences and use cases for it. Oftentimes, big data plays are geared toward a fairly narrow audience and set of use cases based on the original inspiration for the solution. Or, there is not an active and explicit exploration of the full potential of what you have to offer. I can all but assure you that there are major opportunities for your offering that you haven’t even considered yet. The earlier you have a crisp view of the potential of your big data and offering, the better able you will be to build the right thing, in the right way, to exploit the potential of that idea.

Think Broad


—Dirk Knemeyer, founder of Involution Studios

Page 11: Expert Big Data Tips

Setup is Key

Page 12: Expert Big Data Tips

Big Data tools (MapReduce, Hive, etc.) are known for their latency problems, but on the other hand they are excellent for processing petabytes of data in a distributed computing environment. When it comes to integration with any BI/reporting tools, big data technologies should be used in an appropriate manner so that you can avoid the negatives and leverage the strength of these technologies.

For example – if you are building an integrated pipeline with BI tools, try to aggregate as much as you can and utilize the caching or cube technologies with the BI tools to make it a faster experience for the end user. Real time connectivity with big data sources like Hive/HDFS is not a great end user experience in the BI space, so it should be avoided.

Careful and Smart Integration with BI tools


—Ashish Dubey, Solutions Architect at Qubole
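The aggregate-then-cache pattern recommended here could look roughly like the sketch below. The event data is hypothetical and stands in for rows that would live in Hive or HDFS; `functools.lru_cache` plays the part of a BI tool's cache.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical raw events standing in for rows stored in Hive/HDFS.
RAW_EVENTS = [
    ("2014-11-01", "signup"),
    ("2014-11-01", "purchase"),
    ("2014-11-01", "signup"),
    ("2014-11-02", "purchase"),
]


def build_daily_summary(events):
    """Batch step: collapse raw events into (day, event_type) counts."""
    return dict(Counter(events))


# Built once in the batch layer; the reporting layer only reads this small table.
DAILY_SUMMARY = build_daily_summary(RAW_EVENTS)


@lru_cache(maxsize=1024)
def events_on(day, event_type):
    """Serving step: answer dashboard lookups from the summary, with caching."""
    return DAILY_SUMMARY.get((day, event_type), 0)


print(events_on("2014-11-01", "signup"))
```

The point of the design is that the end user never waits on a scan of the raw data: the expensive work happens ahead of time, and the interactive path touches only the small aggregate.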

Page 13: Expert Big Data Tips

As a rule of thumb, invest 80% of your time in your data lake and data pipeline (mining, extracting, cleaning, transforming, loading), and 20% in the high-level data science and machine learning effort. Data in the wild is complex, wrong, contradictory, and hard to access and find. Consequently, more, faster, and more accurate data usually has a higher impact than more complex models and makes for a robust system.

Invest in Your Pipeline


—Christian Prokopp, Principal Consultant at Big Data Partnership

Page 14: Expert Big Data Tips

Everyone with a Big Data project wants to rush straight into analysis. That is where things usually fall apart, however, because there is simply too much data flowing across the network and it is mostly in a format that current analytics software cannot handle.

Don’t Rush Into Analysis


—Rick Aguirre, president of Cirries Technologies

Page 15: Expert Big Data Tips

Big Data success requires three steps of heavy lifting before you ever analyze the data.

Step 1 is data control.

Most of the Big Data torrent is a big nothing and not relevant. Decide what data you want to analyze and set up algorithms to locate and corral it.

Step 2 is data capture.

You want to capture the data you need as it comes across the network. It may not be relevant in just a few minutes, or you may need to store it for a number of years if, as one example, it is data that might be needed later for law enforcement purposes.

Step 3 is data humanization.

This is where you convert whatever format the data is in to a format that your analytics software can use. Only now, at this step, do you have the right data in the right format that you can then use for whatever kind of analytics you have in mind.

Start with Heavy Lifting


—Rick Aguirre, president of Cirries Technologies
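The three steps named on this slide (data control, data capture, data humanization) can be sketched as a small pipeline, chained here in one plausible order. The pipe-delimited record format and the function names are assumptions made for illustration.

```python
import json

# Hypothetical raw network records in a pipe-delimited format.
RAW_STREAM = [
    "2014-11-11T10:00:00|login|alice|ok",
    "2014-11-11T10:00:01|heartbeat|router7|ok",
    "2014-11-11T10:00:02|login|bob|failed",
]


def control(lines, wanted_types):
    """Data control: keep only the record types worth analyzing."""
    for line in lines:
        if line.split("|")[1] in wanted_types:
            yield line


def capture(lines, store):
    """Data capture: persist each selected record as it arrives."""
    for line in lines:
        store.append(line)
        yield line


def humanize(lines):
    """Data humanization: convert records into a format analytics tools can read."""
    for line in lines:
        timestamp, event_type, subject, status = line.split("|")
        yield json.dumps(
            {"timestamp": timestamp, "type": event_type,
             "subject": subject, "status": status},
            sort_keys=True,
        )


archive = []
ready = list(humanize(capture(control(RAW_STREAM, {"login"}), archive)))
print(ready)
```

Only the two login records reach the archive and the analytics-ready output; the heartbeat record is dropped at the control step.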

Page 16: Expert Big Data Tips

Once data is collected, you have easy access for advanced analytics – don’t stop at only analyzing one log source or one dimension of data – analyze across log sources and multiple entities. For example, in order to discover advanced cyber attacks that leveraged users’ credentials, we profile across behavioral activity of users – including their permissions configuration, their access to files and systems and their web activity. We analyze their historical activity as well as comparing them against their peers.

Think wide


—Idan Tendler, CEO of Fortscale
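Comparing users against their peers can be illustrated with a simple z-score check. This is a generic sketch, not Fortscale's method: the access counts, the team grouping, and the 1.5 threshold are all assumptions.

```python
from statistics import mean, pstdev

# Hypothetical daily file-access counts per user, grouped by peer team.
access_counts = {
    "finance": {"alice": 40, "bob": 45, "carol": 38, "dave": 310},
    "support": {"erin": 120, "frank": 130, "grace": 125},
}


def outliers_vs_peers(groups, z_threshold=1.5):
    """Flag users whose activity is far above the average of their peer group."""
    flagged = []
    for team, counts in groups.items():
        mu = mean(counts.values())
        sigma = pstdev(counts.values())
        if sigma == 0:
            continue  # every peer behaves identically; nothing to flag
        for user, count in counts.items():
            if (count - mu) / sigma > z_threshold:
                flagged.append((team, user))
    return flagged


print(outliers_vs_peers(access_counts))
```

A fuller profile would combine several dimensions (permissions, systems accessed, web activity) and each user's own history, as the tip describes.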

Page 17: Expert Big Data Tips

Perform BI Analytics and Visualization with the ODBC Driver.

Use the ODBC Driver


—Minesh Patel, Qubole

Page 18: Expert Big Data Tips

I always start by looking at a subsample of the data. You often get a very good impression of what the main focus of the data munging or cleaning will be just by looking at some numbers (or characters).

Use a Subsample


—Benedikt Koehler, Data Scientist and Blogger at Beautiful Data
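One way to pull a subsample from a data set too large to load at once is reservoir sampling, sketched below. The technique is swapped in for illustration and is not attributed to the tip's author; the synthetic rows are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            sample.append(row)
        else:
            # Keep the new row with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = row
    return sample


# Hypothetical data set: synthetic rows produced lazily, never held in memory at once.
rows = (f"row-{i}" for i in range(100_000))
for row in reservoir_sample(rows, k=5):
    print(row)
```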

Page 19: Expert Big Data Tips

Evaluate and Adapt

Page 20: Expert Big Data Tips

Measure and record everything, and keep an eye on your key metrics. Things change and tests become obsolete, sometimes in surprising ways, especially when you depend on external data. For example, data sources you mine may introduce rolling changes, which are hard to catch as an error but easy to identify in metrics.

Measure Everything


—Christian Prokopp, Principal Consultant at Big Data Partnership
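A minimal sketch of catching a rolling change through a metric rather than an error. Every batch below parses cleanly, so an error-based test would pass, yet the share of missing values jumps. The feed, the two-day baseline, and the 0.2 tolerance are illustrative assumptions.

```python
# Hypothetical daily batches from an external feed; None marks a missing price.
batches = {
    "2014-11-01": [9.99, 12.50, None, 7.25],
    "2014-11-02": [10.49, None, 8.00, 11.75],
    "2014-11-03": [None, None, 9.10, None],
}


def null_rate(values):
    """Fraction of records in a batch whose value is missing."""
    return sum(v is None for v in values) / len(values)


def drifted_days(daily_batches, baseline_days=2, tolerance=0.2):
    """Return days whose null rate exceeds the baseline average by more than tolerance."""
    days = sorted(daily_batches)
    rates = {day: null_rate(daily_batches[day]) for day in days}
    baseline = sum(rates[d] for d in days[:baseline_days]) / baseline_days
    return [d for d in days[baseline_days:] if rates[d] - baseline > tolerance]


print(drifted_days(batches))
```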

Page 21: Expert Big Data Tips

Sharing is Caring


—Idan Tendler, CEO of Fortscale

Page 22: Expert Big Data Tips

Encrypting data at rest is a best practice.



—Minesh Patel, Qubole

—Idan Tendler, CEO of Fortscale

Page 23: Expert Big Data Tips

A common question is whether to go for a distribution from Apache or a vendor. When there is enough expertise in the organization to know the internals of the different frameworks for integrating and resolving any issues quickly, then go with the Apache distribution. If that expertise is not available, use a distribution through a vendor and get commercial support to resolve any issues that may arise.

Pick the Right Distribution


—Praveen Sripati, Hadoop trainer and author of Dattamsha

Page 24: Expert Big Data Tips

Developing a Big Data strategy is all about starting small and making gradual steps in becoming more data-driven. Start with breaking down the data silos within your organization to gain the most insights from your data when you start analyzing it through a variety of tools.

Start Small


—Mark van Rijmenam – CEO / Founder BigData-Startups

Page 25: Expert Big Data Tips

There is often a perception that there is gold in an organization’s data, and that if you just look hard enough, you will find it. In reality, this perception can lead to fruitless efforts with no real direction and no payoff. Instead, start with a business intent in mind. What are the actions you would take—and the value to your business—if data can provide the answer to a certain question?

Have a Business Intent


—Sean Stauth, Director, Client Services, Silicon Valley Data Science

Page 26: Expert Big Data Tips

Your data strategy should be a living document that helps you get the most value from your data. As your goals, your technical environment, or the market change, keep it updated to help you follow those changes and stay on course.

Update Your Strategy


—Scott Kurth, VP, Advisory Services, Silicon Valley Data Science

Page 27: Expert Big Data Tips

A Data Science Mindset

Page 28: Expert Big Data Tips

Have an always-on data science mindset — Successful big data initiatives start with a holistic 360 view of the problem space. This includes understanding the inputs (data types, sources, features), the desired outputs (decisions, goals, predictions), and the constraints (model parameters, boundary conditions, optimization constraints). To achieve this perspective, one must be thinking like a scientist from start to finish: collect data, infer a testable hypothesis, design an experiment, test and evaluate the results, refine your hypothesis, and repeat (if necessary).

Data Science Mindset


—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant

Page 29: Expert Big Data Tips

The most important ROI in Big Data Analytics projects is Return On Innovation. What are you doing that’s different and consequential? What sets you apart from the rest of the multitudes in this space?

Return on Innovation


—Kirk Borne, Data Scientist, Astrophysicist and Big Data Science Consultant

Page 30: Expert Big Data Tips

Developing a big data platform requires focusing on the users. Serve a few users well, and let their processing scale up with your capabilities. “Premature platformization” or trying to satisfy too many use cases too early in the project leads to failures. Make the initial users successful, and the ecosystem will thrive and grow.

Focus on the Users


—Owen O’Malley – Sr. Architect and Co-founder of Hortonworks

Page 31: Expert Big Data Tips

Using the API: samples for Java SDK, Python SDK, and REST.

Use the API


—Minesh Patel, Qubole

Page 32: Expert Big Data Tips

If you cannot take real-time action, you have no need of real-time processing. There will always be batch processing workloads supporting the enterprise, and increasingly dynamic decision areas can be effectively supported by analytical systems because of advances in data architectures.

Take Real-Time Action


—Sanjay Mathur, CEO, Silicon Valley Data Science

Page 33: Expert Big Data Tips

State—the full context of an event, like a customer visit or the completion of a step in a manufacturing process—can be expensive to reassemble after the fact. This is particularly true with highly relational systems: witness the complex ETL (extract, transform, load) workloads that enterprise data warehouse systems struggle to scale. Storing denormalized state, e.g. rich logs, for analysis has proven highly successful for the web businesses of Silicon Valley, and those techniques can be applied to industries across the economy.

Store Denormalized State


—John Akred, CTO, Silicon Valley Data Science
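A rich log line of the kind described here might be produced as in the sketch below. The lookup tables, field names, and the `rich_log_line` helper are hypothetical; the idea is only that the event is written with its full context, so no joins are needed at analysis time.

```python
import json

# Hypothetical normalized lookup tables, as a relational system might hold them.
CUSTOMERS = {17: {"name": "Acme Corp", "segment": "enterprise"}}
PRODUCTS = {"sku-9": {"title": "Flyer pack", "price": 49.0}}


def rich_log_line(timestamp, customer_id, sku, action):
    """Serialize the event together with its full context at write time."""
    event = {
        "timestamp": timestamp,
        "action": action,
        "customer": {"id": customer_id, **CUSTOMERS[customer_id]},
        "product": {"sku": sku, **PRODUCTS[sku]},
    }
    return json.dumps(event, sort_keys=True)


line = rich_log_line("2014-11-11T09:30:00Z", 17, "sku-9", "add_to_cart")
print(line)
```

The trade-off is extra storage per event in exchange for cheap reads: the customer segment and product price are captured as they were when the event happened.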

Page 34: Expert Big Data Tips

Whether you are thinking about migrating towards Big Data or whether you are just starting out with data altogether, it helps to focus on building and maintaining a common platform. Similar to software development platforms, data platforms should also include source control, change management, and testing scenarios. This will help reduce future migration costs and will lead to long-term sustainable, competitive data capabilities.

Build a Common Platform


—Ryan Kirk, Sr. Data Scientist at Hipcricket

Page 35: Expert Big Data Tips


Looking for additional big data tips and advice? Subscribe to Qubole's email newsletter.