
Seminar Report On Big Data


2. INTRODUCTION

Big Data is not about the size of the data; it's about the value within the data.

We are awash in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

In 2010 the term 'Big Data' was virtually unknown, but by mid-2011 it was being widely touted as the latest trend, with all the usual hype. Like 'cloud computing' before it, the term has today been adopted by everyone, from product vendors to large-scale outsourcing and cloud service providers keen to promote their offerings.

In short, Big Data is about quickly deriving business value from a range of new and emerging data sources, including social media data, location data generated by smartphones and other roaming devices, public information available online, and data from sensors embedded in cars, buildings and other objects, and much more besides.

Much has been written about "Big Data" in the last couple of years, but just what is it? As now commonly used, the term Big Data refers not just to the explosive growth in data that almost all organizations are experiencing, but also to the emergence of data technologies that allow that data to be leveraged. Big Data is a holistic term used to describe the ability of any company, in any industry, to find advantage in the ever larger amount of data that now flows continuously into those enterprises, as well as in the semi-structured and unstructured data that was previously either ignored or too costly to deal with. The problem is that as the world becomes more connected via technology, the amount of data flowing into companies grows exponentially and identifying value in that data becomes more difficult: as the data haystack grows larger, the needle becomes harder to find. So Big Data is really about finding the needles: gathering, sorting and analyzing the flood of data to find the valuable information on which sound business decisions are made.

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. The trend towards larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." The term big data refers to data sets so large and complex that traditional tools, like relational databases, are unable to process them in an acceptable time frame or within a reasonable cost range. Problems occur in sourcing, moving, searching, storing, and analyzing the data.

Definition of Big Data

Big Data has been defined in various ways by different organizations over the years. A few of these definitions include:

IBM defines it as follows: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

In simple words, it is a set of technology advances that have made capturing and analyzing data at high scale and speed vastly more efficient.

Big data is a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.

Some examples of Big Data:

An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.

Twitter has over 500 million registered users.

1. The USA leads with 141.8 million accounts, representing 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.

2. 79% of US Twitter users are more likely to recommend brands they follow.

3. 67% of US Twitter users are more likely to buy from brands they follow.

4. 57% of all companies that use social media for business use Twitter.

How fast data is increasing:

Consider a picture of what happens every 60 seconds on the internet: it shows how much data is generated in a second, a minute, a day or a year, and how exponentially that volume is growing. As per analysis by Tech News Daily, we might generate more than 8 zettabytes of data by 2015.

Defining Big Data: the 3V model & characteristics of Big Data

Many analysts use the 3V model to define Big Data. The three Vs stand for volume, velocity and variety.

Volume refers to the fact that Big Data involves analysing comparatively huge amounts of information, typically starting at tens of terabytes. Velocity reflects the sheer speed at which this data is generated and changes. For example, the data associated with a particular hashtag on Twitter often has a high velocity: tweets fly by in a blur. In some instances they move so fast that the information they contain can't easily be stored, yet it still needs to be analysed.

Variety describes the fact that Big Data can come from many different sources, in various formats and structures. For example, social media sites and networks of sensors generate a stream of ever-changing data. As well as text, this might include geographical information, images, videos and audio.

Big Data Problem:

Traditional systems built within a company for handling relational databases may not be able to support or scale to data that arrives with high volume, at high velocity and in a wide variety of forms.

Volume: As an example, terabytes of posts generated on Facebook or 400 billion annual tweets on Twitter could mean Big Data. This enormous amount of data has to be stored somewhere and analysed to produce data science reports for different solutions and problem-solving approaches.

Velocity: Big data requires fast processing. The time factor plays a very crucial role in several organizations. For instance, millions of records are generated in the stock market, and they need to be stored and processed at the same speed as they come into the system.
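
As an illustrative sketch only (the price ticks and window size below are invented), a simple sliding-window average over an incoming stream shows the kind of computation that has to keep pace with the arrival rate:

    from collections import deque

    def sliding_average(stream, window=100):
        """Yield a running average over the most recent `window` values of a stream."""
        buf = deque(maxlen=window)
        for price in stream:
            buf.append(price)
            yield sum(buf) / len(buf)

    # Hypothetical usage with a made-up tick stream:
    ticks = [100.0, 100.5, 99.8, 101.2, 100.9]
    for avg in sliding_average(ticks, window=3):
        print(round(avg, 2))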

Variety: There is no specific format of Big Data. It could be in any form, such as structured or unstructured data, text, images, audio, video, log files, emails, simulations, 3D models, etc. Until now we have mostly been working with structured data only, and it can be difficult to handle the quality and quantity of the unstructured or semi-structured data that we are generating on a daily basis.
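
To make the distinction concrete, here is a small illustration (the customer record is made up) of structured, semi-structured and unstructured forms of the same kind of information:

    import json

    # Structured: fixed fields, as in a relational table row, e.g.
    #   CREATE TABLE customers (id INT, name VARCHAR(50), city VARCHAR(50));
    structured_row = (101, "Alice", "Mumbai")

    # Semi-structured: self-describing JSON; fields may differ between records
    semi_structured = json.dumps({
        "id": 101,
        "name": "Alice",
        "contacts": {"email": "alice@example.com"},
        "interests": ["sensors", "analytics"],  # not every record carries this field
    })

    # Unstructured: free text (an email body, a tweet, a document) with no schema
    unstructured = "Hi team, attached are the photos and notes from yesterday's site visit."

    print(semi_structured)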

How Big Data handles the above problems:

Distributed File System (DFS): In a DFS, we can divide a large set of data files into smaller blocks and load these blocks onto multiple machines, which are then ready for parallel processing. For example, if we have 1 terabyte of data to read with 1 machine that has 4 input/output channels, each with a reading speed of 100 MB/sec, the whole 1 TB will be read in roughly 45 minutes. On the other hand, if we have 10 machines, we can divide the 1 TB of data across them and read it in parallel, which reduces the total time to only about 4.5 minutes.
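
The arithmetic behind those figures can be checked with a short sketch (using the same assumed numbers as the example: 4 I/O channels per machine at 100 MB/sec each, and decimal units for 1 TB):

    # Rough estimate of scan time for 1 TB, sequentially vs. spread over a cluster.
    DATA_MB = 1_000_000          # 1 TB in megabytes (decimal units)
    CHANNELS_PER_MACHINE = 4
    CHANNEL_SPEED_MB_S = 100     # MB/sec per I/O channel

    def read_time_minutes(machines):
        """Time to read the full data set when blocks are spread evenly across machines."""
        throughput = machines * CHANNELS_PER_MACHINE * CHANNEL_SPEED_MB_S  # MB/sec
        return DATA_MB / throughput / 60

    print(f"1 machine:   {read_time_minutes(1):.1f} minutes")   # ~42 min (~45 in the text)
    print(f"10 machines: {read_time_minutes(10):.1f} minutes")  # ~4.2 min (~4.5 in the text)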

Parallel Processing: When the data resides on N servers and the combined power of those N servers can be harnessed, the data can be processed in parallel for analysis, which reduces the time the user has to wait for the final report or analysed data.

Fault Tolerance: The fault-tolerance feature of Big Data frameworks (like Hadoop) is one of the main reasons for using such a framework to run jobs. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Big Data frameworks can guide jobs to successful completion because the data is replicated across multiple nodes/slaves.

Use of Commodity Hardware: Most Big Data tools and frameworks only need commodity hardware to work, which reduces the cost of the total infrastructure and makes it very easy to add more machines to the cluster as the data size increases.

The Potentials and Difficulties of Big Data

Big data needs to be considered in terms of how the data will be manipulated. The size of the data set will impact data capture, movement, storage, processing, presentation, analytics, reporting, and latency. Traditional tools can quickly become overwhelmed by the large volume of big data. Latency, the time it takes to access the data, is as important a consideration as volume. Suppose you need to run an ad hoc query against the large data set or a predefined report. A large data storage system is not a data warehouse, however, and it may not respond to queries in a few seconds. It is, rather, the organization-wide repository that stores all of its data and is the system that feeds into the data warehouses used for management reporting.

One solution to the problems presented by very large data sets might be to discard parts of the data so as to reduce data volume, but this isn't always practical. Regulations might require that data be stored for a number of years, or competitive pressure could force you to save everything. Also, who knows what future benefits might be gleaned from historic business data? If parts of the data are discarded, then the detail is lost and so too is any potential future competitive advantage. Instead, a parallel processing approach can do the trick: think divide and conquer. In this ideal solution, the data is divided into smaller sets and is processed in a parallel fashion.

What would you need to implement such an environment? For a start, you need a robust storage platform that is able to scale to a very large degree as the data grows and that allows for system failure. Processing all this data may take thousands of servers, so the price of these systems must be affordable to keep the cost per unit of storage reasonable. In licensing terms, the software must also be affordable because it will need to be installed on thousands of servers. Further, the system must offer redundancy in terms of both data storage and the hardware used. It must also operate on commodity hardware, such as generic, low-cost servers, which helps to keep costs down. It must additionally be able to scale to a very high degree, because the data set will start large and will continue to grow. Finally, a system like this should take the processing to the data, rather than expect the data to come to the processing; if the latter were the case, networks would quickly run out of bandwidth.

Requirements for a Big Data System

This idea of a big data system requires a tool set that is rich in functionality. For example, it needs a unique kind of distributed storage platform that is able to move very large data volumes into the system without losing data. The tools must include some kind of configuration system to keep all of the system servers coordinated, as well as ways of finding data and streaming it into the system in some type of ETL-based stream. (ETL, or extract, transform, load, is a data warehouse processing sequence.) Software also needs to monitor the system and to provide downstream destination systems with data feeds so that management can view trends and issue reports based on the data. While this big data system may take hours to move an individual record, process it, and store it on a server, it also needs to monitor trends in real time.

In summary, to manipulate big data, a system requires the following:

A method of collecting and categorizing data

A method of moving data into the system safely and without data loss

A storage system that:

- is distributed across many servers
- is scalable to thousands of servers
- offers data redundancy and backup
- offers redundancy in case of hardware failure
- is cost-effective

A rich tool set and community support

A method of distributed system configuration

Parallel data processing

System-monitoring tools

Reporting tools

ETL-like tools (preferably with a graphic interface) that can be used to build tasks that process the data and monitor their progress

Scheduling tools to determine when tasks will run and show task status

The ability to monitor data trends in real time

Local processing where the data is stored, to reduce network bandwidth usage

3. New technologies in Big Data

Dealing with Big Data, which can run to multiple petabytes in size (a single petabyte is a quadrillion bytes of data), requires new technologies and new approaches to efficiently process large quantities of data within tolerable elapsed times. Traditional relational database technologies, such as SQL databases, have proven inadequate in terms of response times when applied to very large datasets such as those found in Big Data implementations. To address this shortcoming, Big Data implementations are leveraging new technologies that provide a framework for processing the massive data stores that define Big Data.

The Big Data landscape is dominated by two classes of technology:

1. Operational: Systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. The focus is on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria.

2. Analytical: Systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. The focus is on high throughput; queries can be very complex and touch most if not all of the data in the system at any time.

Both classes of system tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records.

Currently trending technologies:

1. Column-oriented databases

2. Schema-less / No-SQL databases

3. Map Reduce

4. Hadoop

5. Hive

6. PIG

7. WibiData

8. PLATFORA

9. SkyTree

Column-oriented databases

Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
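
A toy sketch of the idea (the table and query are invented): storing each column contiguously means a query that touches only one column scans far less data, and runs of repeated values compress well.

    # The same three records, stored row-wise and column-wise (illustrative only).
    rows = [
        {"id": 1, "country": "US", "amount": 20.0},
        {"id": 2, "country": "US", "amount": 35.5},
        {"id": 3, "country": "BR", "amount": 12.0},
    ]

    # Column-oriented layout: one array per column.
    columns = {
        "id":      [1, 2, 3],
        "country": ["US", "US", "BR"],   # repeated values compress well
        "amount":  [20.0, 35.5, 12.0],
    }

    # A query like "SELECT SUM(amount)" reads just one column here...
    print(sum(columns["amount"]))             # 67.5
    # ...whereas the row layout touches every field of every record.
    print(sum(r["amount"] for r in rows))     # 67.5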

Schema-less databases, or NoSQL databases

There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.
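
A minimal in-memory sketch of the key-value idea (no persistence, replication or consistency guarantees; the documents are made up): values are fetched by key, and records need not share a schema.

    # Toy key-value/document store backed by a plain dictionary.
    store = {}

    def put(key, document):
        store[key] = document

    def get(key, default=None):
        return store.get(key, default)

    # Records with different shapes can live side by side:
    put("user:1", {"name": "Alice", "followers": 1200})
    put("user:2", {"name": "Bob", "location": "Berlin", "tags": ["sensor", "iot"]})

    print(get("user:2")["location"])   # Berlin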

MapReduce

This is a programming paradigm that allows for massive job execution scalability against thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:

The "Map" task, where an input dataset is converted into a different set of key/value pairs, or tuples;

The "Reduce" task, where several of the outputs of the "Map" task are combined to form a reduced set of tuples (hence the name).
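
A single-process sketch of the two phases using the classic word-count example (this is not an actual Hadoop job; the input lines are invented):

    from collections import defaultdict

    def map_task(line):
        """Map: turn one input line into (word, 1) key/value pairs."""
        return [(word.lower(), 1) for word in line.split()]

    def reduce_task(word, counts):
        """Reduce: combine all values for one key into a single tuple."""
        return (word, sum(counts))

    lines = ["big data is big", "data about data"]

    # Shuffle/sort step: group the intermediate pairs by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_task(line):
            grouped[word].append(one)

    results = [reduce_task(word, counts) for word, counts in grouped.items()]
    print(sorted(results))   # [('about', 1), ('big', 2), ('data', 3), ('is', 1)]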

Hadoop

Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive

Hive is a "SQL-like" bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it's a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.
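
For illustration only, this is the sort of query such a bridge accepts; `run_hive_query` is a hypothetical helper standing in for whichever client (Beeline, JDBC, PyHive, and so on) actually submits it to the cluster, and the `page_views` table is made up:

    # Hypothetical helper: would submit HiveQL to the cluster and return rows.
    def run_hive_query(sql):
        raise NotImplementedError("stand-in for a real Hive client connection")

    # The query reads like ordinary SQL, but Hive turns it into jobs that run
    # over files stored in the Hadoop cluster.
    query = """
        SELECT country, COUNT(*) AS views
        FROM page_views
        WHERE view_date = '2014-01-01'
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """
    # rows = run_hive_query(query)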

PIG

PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-like" language that allows for query execution over data stored on a Hadoop cluster, instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.

WibiData

WibiData is a combination of web analytics and Hadoop, built on top of HBase, which is itself a database layer on top of Hadoop. It allows websites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.

PLATFORA

Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases. PLATFORA is a platform that turns users' queries into Hadoop jobs automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize datasets stored in Hadoop.

Storage Technologies

As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.

SkyTree

SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods, unfeasible or too expensive.

Big Data in the cloud

Big Data and cloud computing go hand in hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling blazing-fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.

4. Big Data Architecture

Data can be classified into three discrete categories, namely:

1. Structured Data

2. Unstructured Data

3. Semi-structured Data

1. Structured Data: Data that resides in a fixed field within a record or file is called structured data. This includes data contained in relational databases and spreadsheets. Structured data first depends on creating a data model, i.e. defining what fields of data will be stored and how that data will be stored: the data type and any restrictions on the data input. Structured data has the advantage of being easily entered, stored, queried and analyzed. It is organized in a highly mechanized and manageable way, and is often managed using Structured Query Language (SQL).

2. Unstructured Data: Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content. Examples include email messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. Although these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly. Unstructured data is raw and unorganized, and digging through it can be cumbersome and costly. Big Data generally has to do with this large collection of unstructured data that is growing in size daily and swiftly.

3. Semi-structured Data: This is a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as a self-describing structure. In semi-structured data, entities belonging to the same class may have different attributes even though they are grouped together, and the order of the attributes is not important. Semi-structured data has become increasingly common since the advent of the Internet, where full-text documents and databases are not the only forms of data and many different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.

Big Data architecture is premised on a skill set for developing reliable, scalable, completely automated data pipelines. That skill set requires profound knowledge of every layer in the stack, beginning with cluster design and spanning everything from Hadoop tuning to setting up the top chain responsible for processing the data. The following diagram shows the complexity of the stack, as well as how data pipeline engineering touches every part of it.

The main detail here is that data pipelines take raw data and convert it into insight (or value). Along the way, the Big Data engineer has to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally, what tools to use to process the data, and eventually the manner of providing access to the outside world. The latter could be BI or other analytic tools; for the processing itself, the likely tools are ones such as Impala or Apache Spark.

5. System Design

Companies today already use, and appreciate the value of, business intelligence. Business data is analyzed for many purposes: a company may perform system log analytics and social media analytics for risk assessment, customer retention, brand management, and so on. Typically, such varied tasks have been handled by separate systems, even if each system includes common steps of information extraction, data cleaning, relational-like processing (joins, group-by, aggregation), statistical and predictive modelling, and appropriate exploration and visualization tools. With Big Data, the use of separate systems in this fashion becomes prohibitively expensive given the large size of the data sets. The expense is due not only to the cost of the systems themselves, but also to the time needed to load the data into multiple systems. Big Data has made it necessary to run heterogeneous workloads on a single infrastructure that is sufficiently flexible to handle all these workloads.

The challenge here is not to build a system that is ideally suited for all processing tasks. Instead, the need is for the underlying system architecture to be flexible enough that the components built on top of it for expressing the various kinds of processing tasks can tune it to efficiently run these different workloads. Here we focus on the programmability requirements. If users are to compose and build complex analytical pipelines over Big Data, it is essential that they have appropriate high-level primitives to specify their needs in such flexible systems. The MapReduce framework has been tremendously valuable, but it is only a first step. Even declarative languages that exploit it, such as Pig Latin, are at a rather low level when it comes to complex analysis tasks. Similar declarative specifications are required at higher levels to meet the programmability and composition needs of these analysis pipelines. Besides the basic technical need, there is a strong business imperative as well: businesses typically will outsource Big Data processing, or many aspects of it.

(Figure: Big Data classification)

6. Algorithms used in Big Data

Big data is data so large that it does not fit in the main memory of a single machine, and the need to process big data with efficient algorithms arises in Internet search, network traffic monitoring, machine learning, scientific computing, signal processing, and several other areas.

For decades, researchers across different disciplines of computer science have envisioned the need for techniques to handle data-intensive computing. With the boom of the internet and the explosion of data in every socio-economic aspect, what was once futuristic research has now transformed itself into a dire requirement. Big Data comes with immense opportunity, but turning this seriously high-volume, high-velocity, structured or unstructured, heterogeneous, often noisy and high-dimensional data into something one can use is a huge challenge. Commonly used families of techniques include:

1. Streaming: Sampling and Sketching (a small sketch of reservoir sampling follows this list)

2. Dimensionality Reduction

3. External Memory and Semi-streaming Algorithms

4. Map-Reduce Framework and Extensions

5. Near Linear Time Algorithm Design

6. Property Testing

7. Metric Embedding

8. Sparse Transformation

9. Crowd sourcing
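
As one concrete instance of the streaming/sampling idea in item 1 above, here is a minimal sketch of reservoir sampling, which maintains a fixed-size uniform random sample of a stream far too large to hold in memory (the stream here is just a range of integers for illustration):

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep the new item with probability k / (i + 1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Illustrative stream of one million integers, sampled down to 10 items.
    print(reservoir_sample(range(1_000_000), k=10))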

Among the most commonly used big data techniques are A/B testing and data analytics in general. As with all machine learning algorithms, you do much better by understanding the domain of the data well and also understanding the underlying models, their strengths and weaknesses, and so on. It is much the same with A/B testing.
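
As a small, self-contained sketch of what an A/B test computes (the conversion counts below are invented), here is a two-proportion z-test using only the Python standard library:

    from math import sqrt
    from statistics import NormalDist

    def ab_test(conv_a, n_a, conv_b, n_b):
        """Two-proportion z-test: is variant B's conversion rate different from A's?"""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
        return z, p_value

    # Made-up example: 200/10000 conversions for variant A vs. 260/10000 for B.
    z, p = ab_test(200, 10_000, 260, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")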

7. Features of Big Data

The traditional data warehousing approach of bringing data into a central repository is costly and time consuming. Moreover, there is a significant shift in the way information is managed and consumed in a big data environment. The key features include:

1. High-capacity, inexpensive storage

2. High-performance, inexpensive processing power

3. High-velocity data stream processing

4. Data integration and quality capabilities

5. Relational database acceleration/scale

6. Unstructured text management and search

8. Drawbacks of Big Data

There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas, but it is not without its drawbacks.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Third, many tools that are based on big data can be easily gamed. Even Google's celebrated search engine, rightly seen as a big data success story, is not immune to Google bombing and spamdexing, wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren't intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data, whose estimates of flu prevalence eventually drifted far from what official surveillance data showed. A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound.

Sixth, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their historical importance or cultural contributions, based on data drawn from Wikipedia. Finally, big data is at its best when analyzing things that are extremely common, but it often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, rely heavily on patterns that occur frequently and handle rare words and phrases far less well.

9. Future advances in Big Data

Big data for all

Currently Big Data is seen predominantly as a business tool. Increasingly, though, consumers will also have access to powerful Big Data applications. In a sense, they already do (e.g. Google, social media search tools, etc.). But as the number of public data sources grows and processing power becomes ever faster and cheaper, increasingly easy-to-use tools will emerge that put the power of Big Data analysis into everyone's hands.

Data evolution

It is also certain that the amount of data stored will continue to grow at an astounding rate. This inevitably means Big Data applications and their underlying infrastructure will need to keep pace. More governments will initiate open data projects, further boosting the variety and value of available data sources. Linked Data databases will become more popular and could potentially push traditional relational databases to one side due to their increased speed and flexibility. This means businesses will be able to develop and evolve applications at a much faster rate.

Data security will always be a concern, and in future data will be protected at a much more granular level than it is today. As data increasingly comes to be viewed as a valuable commodity, it will be freely traded, manipulated, added to and re-sold.

Dawn of the databots

As volumes of stored data continue to grow exponentially and data becomes more openly accessible, databots will increasingly crawl organisations' linked data, unearthing new patterns and relationships in that data over time. These databots will initially be small applications or programs that follow simple rules, but as time moves on they will become more sophisticated, self-learning entities. The artificial intelligence programs they employ will continue to grow more effective because they can operate over time and learn from ever larger data sets.

The Final Word on Big Data

The vision is about linking people, products and services with information. In this context, the digital world offers mechanisms that allow individuals and organisations to share and collaborate on a planetary scale.

In this fast-moving, connected world, intuition, experience and training will not be enough to give businesses the insight they need. They need to apply scientific and data analysis to their questions and find the answers in the time frame demanded so they can make the right decisions. And that all requires scalable, global solutions.

10. Conclusion

Big Data is a huge amount of data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and for improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and are therefore not cost-effective to address in the context of one domain alone. Furthermore, these challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. We must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of Big Data.