TRANSCRIPT
Online Tweet Sentiment Analysis with Apache Spark
Davide Nardone 0120/131
Parthenope University
1. Introduction
2. Bag-of-words
3. Spark Streaming
4. Apache Kafka
5. DataFrame and SQL operations
6. Machine Learning library (MLlib)
7. Apache Zeppelin
8. Implementation and results
Summary
Sentiment Analysis (SA) refers to the use of Natural Language Processing (NLP) and Text Analysis to extract, identify or otherwise characterize the sentiment content of a text unit.
Introduction
Example sentences and their sentiment:
 "The main dish was delicious": Positive
 "It is a dish": Neutral
 "The main dish was salty and horrible": Negative
Existing SA approaches can be grouped into three main categories:
1. Knowledge-based techniques;
2. Statistical methods;
3. Hybrid approaches.
Statistical methods take advantage of elements of Machine Learning (ML) such as Latent Semantic Analysis (LSA), Multinomial Naïve Bayes (MNB), Support Vector Machines (SVM), etc.
Introduction (cont.)
The bag-of-words model is a simplifying representation used in NLP and Information Retrieval (IR).
In this model, a text is represented as the bag of its words, ignoring grammar and even word order but keeping multiplicity.
The bag-of-words model is commonly used in methods of document classification where the occurrence of each word (TF) is used as feature for training a classifier.
Bag-of-words
1. Tokenizing;
2. Stopping (stop-word removal);
3. Stemming;
4. Computation of TF (term frequency) and IDF (inverse document frequency);
5. Using a machine learning classifier for tweet classification (e.g., Naïve Bayes, Support Vector Machine, etc.).
Bag-of-words (cont.)
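The five steps above can be sketched in plain Python, independently of Spark. The stop-word list and the crude suffix-stripping "stemmer" below are illustrative stand-ins for a real NLP library (e.g., NLTK's Porter stemmer):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "was", "is", "a", "an", "and", "it"}  # toy stop list

def tokenize(text):
    """1. Tokenizing: lowercase and split on non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    """2. Stopping: drop very common, uninformative words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """3. Stemming: crude illustrative suffix stripping (not a real stemmer)."""
    for suffix in ("ing", "ly", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    """4. TF: count occurrences of each processed term (the bag of words)."""
    return Counter(stem(t) for t in remove_stop_words(tokenize(text)))

print(term_frequencies("The main dish was salty and horrible"))
```

The resulting counts (step 4) would then be turned into feature vectors and fed to the classifier of step 5.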
Spark Streaming is an extension of the core Spark API.
Data can be ingested from many sources like Kafka, etc.
Processed data can be pushed out to filesystems, databases, etc.
Furthermore, it’s possible to apply Spark’s machine learning algorithms on data streams.
Spark Streaming
Spark Streaming receives live input data streams and divides the data into batches.
Spark Streaming provides a high-level abstraction called Discretized Stream, or DStream, which represents a continuous stream of data.
A DStream can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams.
Spark Streaming (cont.)
Kafka is a Distributed Streaming Platform that behaves like a partitioned, replicated commit log service.
It provides the functionality of a messaging system.
Kafka is run as a cluster on one or more servers. The Kafka cluster stores streams of records in categories called topics.
Apache Kafka
Kafka has four core APIs, two of which are used here:
1. The Producer API allows an application to publish a stream of records to one or more Kafka topics;
2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
Apache Kafka (cont.)
So, at high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers.
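A toy in-memory model (no real Kafka) makes the commit-log behaviour concrete: producers append records to a topic's log, and each consumer group reads from its own offset, so the same records can be served to many independent consumers. The class and topic names are illustrative:

```python
class ToyKafka:
    """In-memory stand-in for a Kafka broker (one partition per topic)."""

    def __init__(self):
        self.topics = {}   # topic name -> append-only list of records
        self.offsets = {}  # (topic, consumer group) -> next offset to read

    def produce(self, topic, record):
        """Producer API: append a record to the topic's log."""
        self.topics.setdefault(topic, []).append(record)

    def consume(self, topic, group):
        """Consumer API: return records published since this group's last read."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((topic, group), 0)
        self.offsets[(topic, group)] = len(log)
        return log[start:]

broker = ToyKafka()
broker.produce("tweets", "first tweet")
broker.produce("tweets", "second tweet")
print(broker.consume("tweets", "sentiment-app"))  # both records
print(broker.consume("tweets", "sentiment-app"))  # nothing new yet
```

Because consumption only advances an offset and never deletes records, a second consumer group starting later still sees the whole log, which is exactly the property the talk relies on to replay simulated tweet streams.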
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark SQL also provides JDBC connectivity and can access several databases using both Hadoop connectors and Spark connectors.
To store data in, or retrieve data from, a database it is necessary to:
 Define an SQLContext, the entry point for using all of Spark's functionality;
 Create a table schema by means of a StructType, to which a specific method is applied to create a DataFrame.
By using JDBC drivers, the resulting DataFrame is written to the database.
Output operations for DStream
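The steps above can be sketched with PySpark's (pre-2.0) SQLContext API. The column names, table name, and JDBC URL are illustrative placeholders; the function is defined but not invoked here, since running it needs a Spark installation and a reachable database:

```python
# Column layout of the tweets table (name, SQL type) -- names are illustrative.
TWEET_COLUMNS = [("text", "string"), ("sentiment", "double")]

def write_tweets_to_db(rows, jdbc_url="jdbc:postgresql://localhost/tweets_db"):
    """Build a DataFrame with an explicit StructType schema and push it to a
    database over JDBC. Not called here: it requires a running Spark cluster
    and a database reachable at `jdbc_url`."""
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, DoubleType)

    sc = SparkContext(appName="tweet-sa")
    sql_context = SQLContext(sc)          # entry point for Spark SQL

    # Table schema built by means of a StructType, as described above.
    schema = StructType([
        StructField("text", StringType(), True),
        StructField("sentiment", DoubleType(), True),
    ])
    df = sql_context.createDataFrame(rows, schema)

    # Write the DataFrame to the database through a JDBC driver.
    df.write.jdbc(url=jdbc_url, table="tweets", mode="append")

print(TWEET_COLUMNS)
```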
MLlib is Spark's library of machine learning functions.
MLlib contains a variety of learning algorithms and is accessible from all Spark’s programming languages.
It consists of common learning algorithms and utilities, including classification, regression, clustering, etc.
Machine Learning with MLlib
The mllib.feature package contains several classes for common feature transformations. These include algorithms to construct feature vectors from text and ways to normalize and scale features.
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Feature extraction
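The weighting that MLlib's TF-IDF transformers compute can be written out directly for a tiny corpus. The sketch below uses the textbook idf(t) = log(N / df(t)); MLlib's IDF applies a smoothed variant, log((N + 1) / (df + 1)), and the documents are illustrative:

```python
import math

def tf_idf(corpus):
    """Textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        # TF (occurrences in this document) times IDF (corpus-wide rarity).
        weights.append({term: doc.count(term) * math.log(n_docs / df[term])
                        for term in doc})
    return weights

docs = [["good", "food"], ["bad", "food"], ["good", "good", "service"]]
weights = tf_idf(docs)
print(weights[2])  # "service" is rare, so it gets the largest weight
```

A term that occurs in every document gets weight zero, which is why TF-IDF down-weights uninformative words even without a stop list.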
Classification and regression are two common forms of supervised learning, where algorithms attempt to predict a variable from features of objects using labeled training data.
Both classification and regression use the LabeledPoint class in MLlib.
MLlib includes a variety of methods for classification and regression, including simple linear methods, decision trees, and forests.
Classification
Naïve Bayes is a multiclass classification algorithm that scores how well each point fits each class based on a linear function of the features.
It’s commonly used in text classification with TF-IDF features, among other applications such as Tweet Sentiment Analysis.
In MLlib, it’s possible to use Naïve Bayes through the mllib.classification.NaiveBayes class.
Naïve Bayes
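A minimal multinomial Naïve Bayes over word counts shows the model that mllib.classification.NaiveBayes fits at scale: in log space, each class score is the prior plus a linear function of the word counts. The smoothing constant, toy training tweets, and labels are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_mnb(examples, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods
    from (tokens, label) training pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_c in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(n_c / len(examples))
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom)
                    for w in vocab}
        unseen = math.log(alpha / denom)  # weight for out-of-vocabulary words
        model[label] = (log_prior, log_like, unseen)
    return model

def predict(model, tokens):
    """Pick the class maximizing a linear function of the word counts."""
    def score(label):
        log_prior, log_like, unseen = model[label]
        return log_prior + sum(log_like.get(t, unseen) for t in tokens)
    return max(model, key=score)

train = [(["great", "food"], "pos"), (["love", "it"], "pos"),
         (["horrible", "food"], "neg"), (["hate", "it"], "neg")]
model = train_mnb(train)
print(predict(model, ["great", "service"]))
```

In the actual pipeline the word counts would be replaced by the TF-IDF features computed earlier; the algorithm is unchanged.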
Clustering is the unsupervised learning task that involves grouping objects into clusters of high similarity.
Unlike the supervised tasks, where the data is labeled, clustering can be used to make sense of unlabeled data.
It is commonly used in data exploration and in anomaly detection.
Clustering
In addition to the popular "offline" K-means algorithm, MLlib also provides an "online" version for clustering data streams.
When data arrive in a stream, the algorithm dynamically:
1. Estimates the cluster membership of the new data;
2. Updates the centroids of the clusters.
Streaming K-means
In MLlib, it’s possible to use Streaming K-means through the mllib.clustering.StreamingKMeans class.
Streaming K-means (cont.)
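The per-batch update that StreamingKMeans applies can be written out for a single cluster: with decay factor alpha, the new centroid is a weighted average of its history and the incoming batch, following Spark's documented rule c' = (c·n·alpha + x·m) / (n·alpha + m). The 1-D data is illustrative, and a full streaming K-means would first assign each point to its nearest of k centroids:

```python
def update_centroid(c, n, batch, alpha=1.0):
    """One StreamingKMeans-style update for a single 1-D cluster.
    c: current centroid, n: effective number of points seen so far,
    batch: new points, alpha: decay factor (1.0 = remember everything)."""
    m = len(batch)
    if m == 0:
        return c, n * alpha
    x = sum(batch) / m                     # centroid of the new batch
    n_new = n * alpha + m
    c_new = (c * n * alpha + x * m) / n_new
    return c_new, n_new

c, n = 0.0, 0.0
for batch in [[1.0, 3.0], [2.0, 2.0], [10.0]]:
    c, n = update_centroid(c, n, batch)
print(c, n)  # with alpha=1 this is the running mean of all points: 3.6, 5.0
```

Setting alpha below 1 makes old batches count less, so the centroids track drift in the stream, which is the point of the "online" variant.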
Given a dataset of points in high-dimension space, we are often interested in reducing the dimensionality of the points so that they can be analyzed with simpler tools.
For example, we might want to plot the points in two dimensions, or just reduce the number of features to train models more efficiently.
In MLlib, it’s possible to use PCA through the mllib.feature.PCA class.
Principal Component Analysis (PCA)
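For 2-D points the principal component can be computed in closed form from the covariance matrix, a hand-rolled illustration of what mllib.feature.PCA does with optimized linear algebra on many dimensions. The sample points are illustrative:

```python
import math

def first_principal_component(points):
    """Leading eigenvector of the 2x2 covariance matrix of (x, y) points:
    the direction along which the data varies the most."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix entries [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    if abs(sxy) < 1e-12:  # axis-aligned case: pick the higher-variance axis
        return (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    vx, vy = lam - syy, sxy  # eigenvector for lam, then normalized
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Points spread roughly along the line y = x: the first component
# comes out close to (1, 1) / sqrt(2).
pts = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]
print(first_principal_component(pts))
```

Projecting each point onto this direction reduces the data to one dimension while keeping most of its variance, which is the dimensionality reduction described above.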
Apache Zeppelin is a web-based notebook that enables interactive data visualization.
Apache Zeppelin’s interpreter concept allows any language or data-processing backend, such as JDBC, to be plugged into Zeppelin.
Apache Zeppelin
Because the Spark Streaming Python API lacks support for accessing a Twitter account, the tweet streams have been simulated using Apache Kafka.
In particular, the entity in charge of this task is a Producer, which publishes a stream of data on a specific topic.
The training and testing data stream have been retrieved from [1].
On the other side, each received DStream is processed by a Consumer, using stateless Spark functions such as map, transform, etc.
Implementation and results
Naïve Bayes classification results
Clustering results
Future work:
 Integrate the Twitter API’s methods to retrieve tweets from accounts.
 Use an alternative feature extraction method for the Streaming K-means task.
[1] http://help.sentiment140.com/for-students/
[2] Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc., 2015.
[3] Bogomolny, A. Benford’s Law and Zipf’s Law. http://www.cut-the-knot.org/doyouknow/zipfLaw.shtml
References
For any questions, contact me at: [email protected]