Research on Big Data Analytics for Different Domains
Posted on 22-Jan-2018
RESEARCH ON BIG DATA ANALYTICS
Authors:
Anmol Seth (anmolseth1298@gmail.com)
Divya Mehta (divyasmehta@gmail.com)
Kalpi Savalia (savaliakalpi@gmail.com)
Rajvi Thakkar (rrajvithakkar@gmail.com)
Ahmedabad University, Big Data Analytics Summer Internship
Index
Abstract
Keywords
Introduction
Literature survey
  Introduction to Big Data
  Common sources of big data
  Big data analytics
  Challenges in data processing
  Hadoop
  Hadoop vs. conventional relational database
  Alternatives of Hadoop
  Why use Hadoop?
  Origin of Hadoop
  Hadoop Architecture
  Hadoop Daemons
  Hadoop's two main components: HDFS and MapReduce
  Two functions of MapReduce
  Advantages of Hadoop
  Disadvantages of Hadoop
Case studies in different domains
  1) Social Media
  2) Retail
  3) Internet of Things
  4) Natural Calamity
Sentiment analysis on tweets from Twitter for a particular hashtag
  Results for #Maggi
  Results for #SalmanVerdict
Future Scope
References
Abstract: The size of the databases in use today is growing at an exponential rate. The explosion of information systems, the Internet, web-based applications, social networks and new technologies has given rise to a humongous amount of information, also known as "Big Data". There are billions of active users on social media, with terabytes and petabytes of information uploaded each day. Storing this gigantic volume of information and extracting from it the knowledge that is so useful for decision-making are difficult with traditional methods of processing data. Hadoop is one of the tools that can manage such data using HDFS, MapReduce and other components. This research gives an insight into Big Data and Hadoop, and shows how sentiment analysis using text mining is done in RapidMiner.
Keywords: Big Data, social media analytics, sentiment analysis, Hadoop, RapidMiner.
Introduction: In this techno age, with ever more devices and new upcoming technologies, databases are overwhelmed by the data entered each day. Data is generated through many sources, viz. transactions, business processes, social networking sites, web-based applications, web servers, etc. Business processes commonly use this huge data for introducing new schemes or products. Today's business and politics face fierce competition: industry uses big data technologies for decision-making and generating business intelligence, while politicians use them for analyzing citizens' behavior. Processing this enormous amount of data and extracting meaningful information from it is a tedious task.
Literature survey:
Introduction to Big Data:
People generally have the misconception that Big Data is a new technology. Big Data is actually not a technology; rather, it is a term for a massive collection of large and complex datasets. It is not limited to the data lying on servers: the term actually refers to each and every piece of data that an organization has stored up to now. Data is mainly of three types: structured, unstructured and semi-structured [Dai Clegg, "Semi-Structured data analytics: Relational or Hadoop platform? Part 1", IBM blog, Available URL: http://www.ibmbigdatahub.com/blog/semi-structured-data-analytics-relational-or-hadoop-platform-part-1].
Figure 1 [Wayne Eckerson, "Big Data Analytics: Profiling use of analytical platforms in user organizations", Available URL: http://www.slideshare.net/calidadgmv/big-data-analytics-research-report-29783323]
Structured data generally refers to data that has a defined length and format and is usually stored in a database.
Unstructured data is not properly defined or structured; over 80% of all data is unstructured. [IBM Business Analytics, "IBM Smarter Analytics", YouTube video, Available URL: https://www.youtube.com/watch?v=3rFtrCzfWVQ]
Semi-structured data is also often called instrumentation data, when it comes from sensors or other
instrumented sources.
In any sector, such data is considered extremely valuable. The amount of data in the world is exploding: data is growing at a 40 percent compound annual rate and will reach nearly 45 ZB by 2020 [Jeff Heaton, "Big Data Presentation at Maryville University, March 25, 2015", YouTube video, Available URL: https://www.youtube.com/watch?v=l2lYXX2hwaI].
1 ZB = 1 billion TB. Today data is so voluminous that it is difficult to store on disks or in centralized systems, so data processing is shifting to distributed architectures.
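The unit relationship quoted above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) prefixes, i.e. 1 TB = 10^12 bytes and 1 ZB = 10^21 bytes, rather than binary units:

```python
# Unit arithmetic behind "1 ZB = 1 billion TB" (decimal SI prefixes assumed).
TB = 10 ** 12   # terabyte: 10^12 bytes
ZB = 10 ** 21   # zettabyte: 10^21 bytes

print(ZB // TB)   # 1000000000 -> one billion terabytes per zettabyte
print(45 * ZB)    # total bytes in the projected 45 ZB of world data
```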
Three V's of big data according to IBM: Volume, Velocity and Variety.
• Volume: the huge amount of data.
• Velocity: the rate at which data is being generated on the internet.
• Variety: the many different kinds of data being collected.
Common sources of big data:
R. A. Fadenavis, Samrudhi Tabhane, (date), "Big Data Processing Using Hadoop": data generated from the Internet of Things, sensors, mobile devices, search engines, social networking websites, scientific experiments and enterprises all contributes to this huge explosion in data.
Gali Halevi, Dr. Henk Moed, (30-Sept-2012), "The Evolution of Big Data as a Research and Scientific Topic": Big Data can be seen in finance and business, where enormous amounts of stock exchange, banking, and online and onsite purchasing data flow through computerized systems every day and are captured and stored for inventory monitoring, customer behavior and market behavior analysis. It can also be seen in the life sciences, where big datasets such as genome sequences, clinical data and patient data are analyzed and used to drive breakthroughs in science and research.
Big data analytics:
Big data analytics [http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course] refers to the process of collecting, organizing and analyzing large sets of data ("big data") to discover patterns and other useful information.
Big data requires tools and methods that can analyze the available large data and extract patterns from it; Hadoop is one such big data tool.
Challenges in data processing:
(19-May-2015), "Challenges and Opportunities with Big Data": capturing data from various sources, filtering it, storing it (a major challenge in itself), searching, scaling, sharing, transferring, analyzing, presentation, timeliness, privacy, human collaboration and processing of data.
The major disadvantage of the data warehouse was that its algorithms ran on a small sample of the data collected from different sources. Collecting the data, cleaning it and then running analytics on it takes a huge turnaround time, which makes the results go stale. This lack of capability to process lots of data, and to do so quickly, started to cause problems.
Other challenges included the growing volume and limited scalability of data in RDBMS systems. To overcome the challenges of RDBMS, distributed systems were introduced, but in a distributed system a problem arises when a single machine crashes, which then requires a manual check. To overcome the above challenges, Hadoop was introduced.
Hadoop:
Hadoop is an amalgamation of open source programs and procedures which anyone can use as a core ingredient for their big data operations [Bernard Marr, "What's Hadoop? Here's a Simple Explanation for Everyone", Available URL: http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone].
Hadoop vs. Conventional relational database:
- In Hadoop, data is distributed across many nodes and its processing is distributed too, whereas in a conventional relational database all the data conceptually sits on one server in one database.
- In Hadoop, once you have written data you are not allowed to modify it: you can delete it, but you cannot modify it. In a relational database, data can be rewritten any number of times, like the balance of your account.
- Hadoop is optimized for write-once workloads: archival data about telephone calls or transactions does not need to change once written. Relational databases, by contrast, are built around SQL updates.
- Hadoop does not natively support conventional SQL; its ecosystem instead offers NoSQL stores and lightweight SQL-like query layers.
Alternatives of Hadoop:
Following are the different alternatives of Hadoop:
• BashReduce
• Disco Project
• Spark
• GraphLab
Why use Hadoop?
Despite the above alternatives, Hadoop is much more popular, for the following reasons:
• Data is distributed to multiple systems, so parallel processing can be carried out.
• Hadoop is highly available: data is always available and processing never stops.
• Fault tolerance: even if a fault occurs, processing still never stops.
• High throughput: thanks to parallelism, Hadoop achieves high throughput.
Popular companies working on Hadoop project development:
• Apache
• Cloudera
• Hortonworks
• Intel
• IBM, etc.
Origin of Hadoop:
Hadoop is an open source project developed by Doug Cutting and his team after being inspired by Google's white papers.
Finding something relevant on the web 10 years ago was very difficult; today search engines provide the best results in a few milliseconds. Back then, industry realized that it needed clusters of computers to process big amounts of data, as the data to be processed exceeded the capacity of even a single powerful server. Moreover, the data was constantly increasing and changing. Yahoo started to face big data problems, as it was looking at data from the complete World Wide Web, and it even developed its own versions of distributed computing.
Google then published its papers on the Google File System and MapReduce; at the time, these were just high-level designs.
Doug Cutting and Mike Cafarella took these ideas further, and by early 2006 the project was named HADOOP.
In August 2010, Doug Cutting joined Cloudera. In 2013, Apache released a stable version of Hadoop 2, also known as YARN. All versions of Hadoop are open source and managed by Apache.
Hadoop was thus derived from Google's MapReduce and the Google File System. Yahoo was an originator, has been a major contributor, and uses Hadoop across its business.
Hadoop Architecture:
Figure 2 [Antonio Cangiano, "Why Big SQL", Available URL: http://sqlforhadoop.com/2013/05/why-bigsql/]
Hadoop comprises multiple concepts and modules like HDFS, MapReduce, HBase, Pig, Hive, Sqoop and ZooKeeper to perform easy and fast processing of huge data. [http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course]
ZooKeeper: A high-performance coordination service for distributed applications.
Pig: A high-level data-flow language and execution framework for parallel computation.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
HBase: A scalable, distributed database that supports structured data storage for large tables.
[Dominique A. Heger, "Hadoop Design, Architecture & MapReduce Performance", Available URL: http://www.datanubes.com/mediac/HadoopArchPerfDHT.pdf]
Hadoop is software that provides a set of classes and interfaces. It is conceptually different from relational databases and can process high-volume, high-velocity and high-variety data to generate value.
Hadoop Daemons:
There are five Daemons running in background:
• NameNode
• DataNode
• JobTracker
• TaskTracker
• Secondary NameNode
Hadoop has two main components:
• HDFS, used for storing data.
• MapReduce, used for processing data.
HDFS - Hadoop Distributed File System:
HDFS is the storage container of the Hadoop system, designed to store large amounts of data. It splits data into multiple blocks and stores each block redundantly across a pool of servers to enable reliable, extremely rapid computation.
• HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware.
• HDFS is suitable for applications that have large datasets.
• HDFS maintains the metadata on a dedicated server called the NameNode, while application data is kept on separate nodes called DataNodes.
Figure 3 [Antonio Cangiano, "Why Big SQL", Available URL: http://sqlforhadoop.com/2013/05/why-bigsql/]
NameNode: The HDFS namespace is a hierarchy of files and directories. The NameNode knows everything about a given file; if a client asks to read a file, the request is sent first to the NameNode. It maintains the metadata, which is held in two files:
• FsImage
• EditLogs
Whenever data is first stored, an initial image of the file system is stored in FsImage; every subsequent block update is then recorded in the EditLogs.
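The FsImage/EditLogs split can be illustrated with a tiny checkpoint-plus-log model. This is a hypothetical in-memory sketch of the idea (a full snapshot plus a replayable log of updates), not HDFS's actual on-disk format or API:

```python
# Illustrative sketch of the FsImage/EditLogs idea: a full checkpoint of the
# namespace plus a log of incremental updates that is replayed on restart.
fsimage = {"/data/file1": ["blk_1", "blk_2"]}   # checkpoint: full namespace
editlog = []                                     # updates since the checkpoint

def add_block(path, block):
    editlog.append(("add", path, block))         # record only the change

def restart():
    """Rebuild current state: load the checkpoint, then replay the log."""
    state = {p: list(blocks) for p, blocks in fsimage.items()}
    for op, path, block in editlog:
        if op == "add":
            state.setdefault(path, []).append(block)
    return state

add_block("/data/file1", "blk_3")
add_block("/data/file2", "blk_4")
print(restart())
# {'/data/file1': ['blk_1', 'blk_2', 'blk_3'], '/data/file2': ['blk_4']}
```

Keeping the checkpoint untouched and appending only deltas is what makes each write cheap while still allowing full recovery.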
Files and directories are represented on the NameNode using inodes, which record attributes like filename, block details, replication factor, permissions, modification and access times, namespace and disk-space quotas.
The file content is split into blocks (typically 128 MB), and each block of the file is independently replicated at multiple DataNodes. The NameNode maintains the mapping of file blocks to DataNodes.
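The block-splitting and replication described above can be sketched in a few lines. The round-robin node choice here is a hypothetical stand-in for HDFS's real (rack-aware) placement policy, and the node names are invented:

```python
# Sketch: splitting a file into 128 MB blocks and assigning each block to
# 3 DataNodes (a common default replication factor).
BLOCK_SIZE = 128 * 1024 * 1024        # 128 MB
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_and_place(file_size):
    n_blocks = -(-file_size // BLOCK_SIZE)        # ceiling division
    placement = {}
    for i in range(n_blocks):
        # replicate block i on REPLICATION distinct nodes (round-robin)
        nodes = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        placement[f"blk_{i}"] = nodes
    return placement

mapping = split_and_place(300 * 1024 * 1024)      # a 300 MB file
print(len(mapping))                               # 3 blocks (128 + 128 + 44 MB)
```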
DataNodes: Two files represent each data block in the host's native file system: one file contains the data itself and the other contains the block's metadata.
Client machines: A client machine is neither a NameNode nor a DataNode, but has Hadoop installed on it. Clients are responsible for loading data into the cluster, submitting MapReduce jobs and viewing the results once a job completes.
MapReduce:
MapReduce is a framework, written in Java, for managing and processing vast amounts of unstructured data in parallel, based on dividing a big data item into smaller independent task units.
Hadoop is not really a database: it only stores data and lets you fetch it out; there are no queries involved as in SQL. Multiple mappers run in parallel on different machines.
The main aim of MapReduce is to group tasks of a similar nature so that data of the same type is placed on the same nodes.
Two functions of MapReduce:
Mapper tasks process the input records.
Reducer tasks use the output of the mappers (coming from multiple servers) and merge the results together.
Once the tasks have been created, they are spread across multiple nodes and run simultaneously (the "map" step). The "reduce" phase then combines the results.
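The map, shuffle and reduce steps above can be sketched on a single machine with the classic word-count example. This is an illustrative simulation of the programming model, not Hadoop's actual Java API:

```python
# Single-machine sketch of the map -> shuffle -> reduce flow: word count.
from collections import defaultdict

def mapper(line):
    # emit (key, value) pairs: one ("word", 1) per word in the record
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # merge all values seen for one key
    return key, sum(values)

lines = ["Big data needs big tools", "Hadoop handles big data"]

# map phase (in a real cluster, mappers run in parallel on many nodes)
shuffled = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        shuffled[key].append(value)        # shuffle: group values by key

# reduce phase
counts = dict(reducer(k, v) for k, v in shuffled.items())
print(counts["big"])    # 3
```

The framework's job is exactly the part simulated by `shuffled` here: routing every value for the same key to the same reducer, across machines.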
The JobTracker and TaskTracker handle delegation of tasks.
• JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and divided
among nodes within the cluster.
• TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work and alerts
the JobTracker once it’s done.
Figure 4 [Praveen Kumar, "Efficient Capabilities of Processing of Big Data using Hadoop Map Reduce", Available URL: http://www.ijarcce.com/upload/2014/june/IJARCCE7F%20s%20praveen%20vashisht%20Efficient%20Capabilities.pdf]
From HDFS, data is sent to the different mappers residing on different machines; the intermediate results are shuffled and then sent to the reduce phase.
Advantages of Hadoop according to Gennaro (Rino) Persico (Database Software Leader at IBM):
• Hadoop does not require any specialized hardware to process data.
• Hadoop moves processing closer to the data.
• Data and server failures are managed: HDFS replicates data blocks among the nodes, so a single failure does not impact processing.
• Query and manage with Hive: Apache Hive is built on top of Hadoop and is used to query and manage data in distributed storage.
• When not to use it: if your data is not documents or content, if it is composed of records, or if your workload is transactional, you require database services instead.
[Bernard Marr, "What's Hadoop? Here's a Simple Explanation for Everyone", Available URL: http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone]
Disadvantages of Hadoop:
1. Cluster management is hard: [http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html] in the cluster, operations like debugging, distributing software and collecting logs are difficult.
2. Security concerns: Hadoop's security model is disabled by default, so if whoever uses the platform does not know how to enable it, the data may be at huge risk.
3. Vulnerable by nature: [http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/] the Hadoop framework is written in Java, a very popular language that has been heavily exploited by cybercriminals, so there are chances of data being hacked.
Case Studies in different domains:
1) Social Media:
Facebook used a commercial RDBMS, but its data was growing too fast and the infrastructure was not adequate, so Facebook switched to Hadoop. Facebook uses Hadoop in a variety of ways, such as log processing, recommendation systems and data warehousing. Data is stored in the Hadoop/Hive warehouse and the Hadoop archival store, and is heavily used for sentiment analysis, business intelligence, emotion tracing and many other applications. [Ashish Thusoo, "Hive - A Petabyte Scale Data Warehouse using Hadoop", Available URL: https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919, accessed on: 10-Jun-2009]
2) Retail:
Walmart started using the term Big Data before it was well known in the industry. It uses a 250-node Hadoop cluster and built Polaris, a semantic search engine whose objective is to deliver meaningful results for a shopper's keywords, using a constellation of methods that take the shopper's query and intuit his or her interest. [Rick Moss, "Walmart.com's improved Search engine powered by 'Social Genome'", Available URL: http://www.retailwire.com/discussion/16260/walmart-coms-improved-search-engine-powered-by-social-genome]
3) Internet of Things:
According to a Cisco estimate, by 2020 there will be 11 billion devices connected with each other. By using an enterprise-grade Hadoop architecture, you can create enormous opportunities for your business: the partnership between Cisco and MapR provides an infrastructure that can readily handle the massive amounts of real-time big data from the Internet of Things. [Rajeev Tiwari, "Functions that can benefit from Hadoop", Available URL: http://www.dqindia.com/functions-that-can-benefit-from-hadoop/]
4) Natural Calamity:
An innovative company called Terra Seismic believes that earthquakes can be predicted 20-30 days before they occur. Using big data and information from satellites, it collects data on each region where there is a higher probability of an earthquake.
This data is then combined with a huge number of earthquake precursors, such as groundwater-level changes, sudden clouds, changes in ground conductivity, and geomagnetic and gravity anomalies.
A Hadoop cluster supports this growing dataset, and job execution time falls as the number of nodes increases. [Srikanth R P, "Could the Nepal Earthquake have been predicted by using Big Data?", Available URL: http://www.dqindia.com/could-the-nepal-earthquake-have-been-predicted-by-using-big-data/, accessed on: 4-May-2015]
Sentiment Analysis on tweets from Twitter for a particular hashtag:
For this analysis, we have used RapidMiner. RapidMiner is a software platform through which we can perform data mining, text mining, predictive analytics and business analytics.
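To illustrate the kind of polarity classification such a workflow performs, here is a minimal lexicon-based scorer in Python. The word lists and example tweets are toy assumptions, not the lexicon or algorithm RapidMiner actually uses:

```python
# Minimal lexicon-based sentiment scorer (illustrative only): a tweet is
# positive if it contains more positive words than negative ones.
POSITIVE = {"love", "great", "good", "happy", "best"}
NEGATIVE = {"ban", "bad", "unsafe", "worst", "sad"}

def polarity(tweet):
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweets = ["I love Maggi it is the best",
          "Maggi ban is bad news",
          "Maggi back in stores"]
results = [polarity(t) for t in tweets]
print(results)   # ['positive', 'negative', 'neutral']
```

Real tools refine this basic idea with tokenization, negation handling and weighted lexicons, but the positive-minus-negative scoring shown here is the core of the approach.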
The following steps were followed for the analysis:
1. A dataset is loaded into RapidMiner to perform sentiment analysis.
2. Using NodeXL, a free and open-source Microsoft Excel template for network analysis and visualization, we collected a dataset of tweets related to #Maggi from Twitter.
3. After loading the #Maggi dataset into RapidMiner, we performed the steps from [http://blog.aylien.com/post/98466399268/analyzing-text-in-rapidminer-part-1]. Below are the results obtained.
Results for #Maggi:
From the polarity, we can learn the views of people on the ban of Maggi.
Results for #SalmanVerdict:
From the polarity, we can learn the views of people regarding the court's verdict in the Salman Khan hit-and-run case.
Tweet analysis for #Maggi:
positive 20%, negative 40%, neutral 40%
Tweet analysis for #SalmanVerdict:
positive 47%, negative 20%, neutral 33%
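The percentage breakdowns above are simply label counts normalized by the total number of classified tweets. A sketch of that final aggregation step, using hypothetical counts chosen to reproduce the #Maggi split:

```python
# Deriving percentage breakdowns from per-tweet polarity labels.
from collections import Counter

labels = ["negative"] * 40 + ["positive"] * 20 + ["neutral"] * 40
counts = Counter(labels)
total = sum(counts.values())
percentages = {label: round(100 * n / total) for label, n in counts.items()}
print(percentages)   # {'negative': 40, 'positive': 20, 'neutral': 40}
```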
Future Scope:
• Practical implementation of Big Data Analytics using Hadoop.
• Voice sentiment analysis of recordings.
References:
[1] https://www.youtube.com/watch?v=l2lYXX2hwaI
[2] R. A. Fadenavis, Samrudhi Tabhane, (date), "Big Data Processing Using Hadoop"
[3] http://www.ibmbigdatahub.com/blog/semi-structured-data-analytics-relational-or-hadoop-platform-part-1
[4] https://www.youtube.com/watch?v=3rFtrCzfWVQ
[5] http://www.edupritstine.com/courses/big-data-hadoop-program/big-data-hadoop-course
[6] Unknown, (19-May-2015), "Challenges and Opportunities with Big Data"
[7] Gali Halevi, Dr. Henk Moed, (30-Sept-2012), "The Evolution of Big Data as a Research and Scientific Topic"
[8] http://www.slideshare.net/calidadgmv/big-data-analytics-research-report-29783323
[9] http://smartdatacollective.com/bernardmarr/205456/what-s-hadoop-here-s-simple-explanation-everyone
[10] http://blogs.forrester.com/mike_gualtieri/13-06-07-what_is_hadoop
[11] http://www.dqindia.com/functions-that-can-benefit-from-hadoop/
[12] http://www.j2eebrain.com/java-J2ee-hadoop-advantages-and-disadvantages.html
[13] http://www.bigdatacompanies.com/5-big-disadvantages-of-hadoop-for-big-data/
[14] https://www.facebook.com/notes/facebook-engineering/hive-a-petabyte-scale-data-warehouse-using-hadoop/89508453919
[15] http://www.dqindia.com/could-the-nepal-earthquake-have-been-predicted-by-using-big-data/
[16] http://www.retailwire.com/discussion/16260/walmart-coms-improved-search-engine-powered-by-social-genome