Online Tweet Sentiment Analysis with Apache Spark

Davide Nardone, 0120/131, Parthenope University


Post on 16-Apr-2017




1. Introduction
2. Bag-of-words
3. Spark Streaming
4. Apache Kafka
5. DataFrame and SQL operation
6. Machine Learning library (MLlib)
7. Apache Zeppelin
8. Implementation and results


Introduction

Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and text analysis to extract, identify, or otherwise characterize the sentiment content of a text unit.

[Figure: three example sentences, "The main dish was delicious", "It is an [...] dish", and "The main dish was salty and [...]", tagged with the sentiment classes Positive, Neutral, and Negative.]

Introduction (cont.)

Existing SA approaches can be grouped into three main categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.

Statistical methods take advantage of Machine Learning (ML) techniques such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB), Support Vector Machines (SVM), etc.

Bag-of-words

The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR).

In this model, a text is represented as the bag of its words, ignoring grammar and even word order but keeping multiplicity.

The bag-of-words model is commonly used in document classification methods, where the number of occurrences of each word (term frequency, TF) is used as a feature for training a classifier.
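As a minimal illustration of the representation described above, the following Python sketch builds a bag of words with the standard library. The function name `bag_of_words` and the whitespace tokenizer are assumptions made for this example; they are not taken from the original slides.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping, ignoring grammar and word order."""
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    return Counter(text.lower().split())

# Same words in a different order give the same bag.
bow_a = bag_of_words("the main dish was delicious")
bow_b = bag_of_words("delicious was the main dish")

# Multiplicity is kept: "good" is counted twice.
bow_c = bag_of_words("good good dish")
```

The counts in such a bag are the term frequencies that a classifier would receive as features.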


Bag-of-words (cont.)

1. Tokenizing;
2. Stopping (stop-word removal);
3. Stemming;
4. Computation of TF-IDF (term frequency-inverse document frequency);
5. Using a machine learning classifier for the tweet classification (e.g., Naïve Bayes, Support Vector Machine, etc.).
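The first three preprocessing steps can be sketched in plain Python as follows. This is only an illustration: the stop-word list and the suffix-stripping `crude_stem` function are hypothetical stand-ins chosen for brevity, not a standard stop-word list or a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it", "of", "to"}

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The waiters served the dishes")
```

The resulting token list is what the TF-IDF and classification steps would consume.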

Spark Streaming

Spark Streaming is an extension of the core Spark API.

Data can be ingested from many sources, such as Kafka.

Processed data can be pushed out to filesystems, databases, etc.

Furthermore, it is possible to apply Spark's machine learning algorithms to data streams.

Spark Streaming (cont.)

Spark Streaming receives live input data streams and divides the data into batches.

Spark Streaming provides a high-level abstraction called Discretized Stream (DStream), which represents a continuous stream of data.

DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams.
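The idea of dividing a live stream into batches can be illustrated without Spark. The sketch below is a plain-Python model, not the Spark Streaming API: the function `discretize` and the `(timestamp, value)` record layout are assumptions made for this example.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive time batches.

    Each batch holds the values whose timestamp falls in
    [k * batch_interval, (k + 1) * batch_interval). Intervals with no
    records are simply skipped in this simplified model.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.2, "tweet a"), (0.9, "tweet b"), (1.1, "tweet c"), (2.5, "tweet d")]
batches = discretize(stream, batch_interval=1.0)

# Each batch can then be processed independently, e.g. by counting its records.
batch_sizes = [len(batch) for batch in batches]
```

With a one-second interval, the four records above fall into three batches that are processed one after another.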

Apache Kafka

Kafka is a distributed streaming platform that behaves like a partitioned, replicated commit log service.

It provides the functionality of a messaging system.

Kafka is run as a cluster on one or more servers. The Kafka cluster stores streams of records in categories called topics.

Apache Kafka (cont.)

Two of Kafka's four core APIs are:
1. The Producer API, which allows an application to publish a stream of records to one or more Kafka topics;
2. The Consumer API, which allows an application to subscribe to one or more topics and process the stream of records produced to them.

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
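The producer/consumer relationship can be modeled with a small in-memory example. This is a toy sketch and not a Kafka client: the class `ToyBroker` and its `publish` and `poll` methods are hypothetical names, and the model omits partitioning, replication, and networking entirely.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a Kafka cluster: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, record):
        """Producer side: append a record and return its offset in the log."""
        self._logs[topic].append(record)
        return len(self._logs[topic]) - 1

    def poll(self, topic, offset):
        """Consumer side: return all records at or after the given offset."""
        return self._logs[topic][offset:]

broker = ToyBroker()
broker.publish("tweets", "I love this phone")
broker.publish("tweets", "Worst service ever")

# The consumer tracks its own position (offset) in the topic log.
offset = 0
received = broker.poll("tweets", offset)
offset += len(received)

# Records published later are picked up from the stored offset onward.
broker.publish("tweets", "It is okay")
later = broker.poll("tweets", offset)
```

Because the log is append-only and the consumer keeps its own offset, records are served in the order in which they were produced.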

Output operations for DStream

Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame as of Spark 1.3), which provides support for structured and semi-structured data.

Spark SQL also provides JDBC connectivity and can access several databases through both Hadoop and Spark connectors.

To store data in a database, or retrieve data from it, it is necessary to:
- Define an SQLContext (the entry point) for using all of Spark's functionality;
- Create a table schema by means of a StructType, to which a specific method is applied to create a DataFrame.

Using JDBC drivers, the DataFrame built from this schema is then written to a database.

Machine Learning with MLlib

MLlib is Spark's library of machine learning functions.

MLlib contains a variety of learning algorithms and is accessible from all of Spark's programming languages.

It consists of common learning algorithms and features, including classification, regression, clustering, etc.

Feature extraction

The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features.

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
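To make the TF-IDF weighting concrete, here is a small plain-Python sketch. It does not use MLlib; the function name `tf_idf` is hypothetical, and the smoothed inverse document frequency log((N + 1) / (df + 1)) is an assumption of this example (other variants of the formula exist).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            # Weight = raw term count times smoothed inverse document frequency.
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights

docs = [["good", "dish"], ["bad", "dish"], ["good", "good", "service"]]
weights = tf_idf(docs)
```

In the second document, the rarer term "bad" receives a higher weight than "dish", which appears in two of the three documents.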

Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data.

Both classification and regression use the LabeledPoint class in MLlib.

MLlib includes a variety of methods for classification and regression, including simple linear methods and decision trees and forests.


Naïve Bayes

Naïve Bayes is a multiclass classification algorithm that scores how well each point belongs in each class based on a linear function of the features.

It is commonly used in text classification with TF-IDF features, among other applications such as Tweet Sentiment Analysis.

In MLlib, it is possible to use Naïve Bayes through the mllib.classification.NaiveBayes class.
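The "linear function of the features" can be seen in a small multinomial Naïve Bayes sketch: the log-score of a class is its log prior plus a sum of per-term log likelihoods, one per occurrence of the term. The code below is a plain-Python illustration with additive smoothing; the class `MultinomialNB` and the toy training data are assumptions of this example and are unrelated to the MLlib implementation or to the dataset used in the slides.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes over token lists with additive smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing

    def fit(self, documents, labels):
        self.vocab = {term for doc in documents for term in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.term_counts[label].update(doc)
        return self

    def scores(self, doc):
        """Log-score per class: log prior plus sum of log term likelihoods."""
        total_docs = sum(self.class_counts.values())
        result = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.smoothing * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for term in doc:
                if term in self.vocab:  # ignore terms never seen in training
                    score += math.log((counts[term] + self.smoothing) / denom)
            result[label] = score
        return result

    def predict(self, doc):
        """Return the class with the highest log-score."""
        scores = self.scores(doc)
        return max(scores, key=scores.get)

train_docs = [["great", "food"], ["love", "it"], ["awful", "food"], ["hate", "it"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = MultinomialNB().fit(train_docs, train_labels)
prediction = model.predict(["great", "great", "service"])
```

Here "service" never appears in the training data and is ignored, while the two occurrences of "great" push the score toward the positive class.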

Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity.

Unlike supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data.

It is commonly used in data exploration and in anomaly detection.


Streaming K-means

In addition to the popular "offline" K-means algorithm, MLlib also provides an "online" version for clustering data streams.

When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the data;
2. Updates the centroids of the clusters.
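The two steps can be sketched as a mini-batch update in plain Python. This is an illustration under assumptions, not the MLlib code: the functions `nearest` and `update` are hypothetical, and the `decay` parameter is an assumed forgetting factor that down-weights the old centroid (with `decay=1.0`, each centroid is the running mean of all points assigned to it so far).

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def update(centroids, counts, batch, decay=1.0):
    """Apply one mini-batch update; returns the new centroids and counts."""
    # Step 1: estimate the cluster membership of each point in the batch.
    members = [[] for _ in centroids]
    for point in batch:
        members[nearest(point, centroids)].append(point)

    # Step 2: move each centroid toward the mean of its new members.
    new_centroids, new_counts = [], []
    for centroid, n, pts in zip(centroids, counts, members):
        m = len(pts)
        if m == 0:
            # No new members in this batch: leave the cluster unchanged.
            new_centroids.append(centroid)
            new_counts.append(n)
            continue
        batch_mean = [sum(dim) / m for dim in zip(*pts)]
        old_weight = n * decay
        new_centroids.append([
            (c * old_weight + x * m) / (old_weight + m)
            for c, x in zip(centroid, batch_mean)
        ])
        new_counts.append(n + m)
    return new_centroids, new_counts

centroids, counts = [[0.0, 0.0], [10.0, 10.0]], [1, 1]
centroids, counts = update(centroids, counts, [[2.0, 0.0], [10.0, 12.0]])
```

Calling `update` once per incoming batch keeps the centroids current without revisiting earlier data.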

Streaming K-means (cont.)

In MLlib, it is possible to use Streaming K-means through the mllib.clustering.StreamingKMeans class.

Principal Component Analysis (PCA)

Given a dataset of points in a high-dimensional space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools.

For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently.

In MLlib, it is possible to use PCA through the mllib.feature.PCA class.
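As a small illustration of dimensionality reduction, the sketch below approximates the first principal component by power iteration on the sample covariance matrix and projects the points onto it. It is plain Python rather than MLlib, the function names are hypothetical, and power iteration is a technique chosen for brevity here; it recovers only the leading component and assumes the starting vector is not orthogonal to it.

```python
import math

def first_principal_component(points, iterations=100):
    """Approximate the top principal component by power iteration."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    vector = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    return means, vector

def project(points, means, component):
    """Reduce each point to a single coordinate along the component."""
    return [
        sum((p[j] - means[j]) * component[j] for j in range(len(component)))
        for p in points
    ]

# Points lying on the line y = 2x: one direction explains all the variance.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
means, component = first_principal_component(points)
reduced = project(points, means, component)
```

For these points the component is proportional to (1, 2), so the two-dimensional data is reduced to one coordinate per point without losing any of its spread.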

Apache Zeppelin

Apache Zeppelin is a web-based notebook that enables interactive data visualization.

Apache Zeppelin's interpreter concept allows any language or data-processing backend, such as JDBC, to be plugged into Zeppelin.

Implementation and results

Because the Spark Streaming Python API lacks support for accessing a Twitter account, the tweet streams were simulated using Apache Kafka.

In particular, the entity responsible for this task is a Producer, which publishes a stream of data on a specific topic.

The training and testing data streams were retrieved from [1].

On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map, transform, etc.


Naïve Bayes classification results


Clustering results

Future work

- Integrate the Twitter API's methods to retrieve tweets from accounts.
- Use an alternative feature extraction method for the Streaming K-means task.

[1]
[2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford's Law and Zipf's Law.



For any questions, contact me at: [email protected]