Text Classification using Convolution Neural Networks
Matt Falkenhainer
Classification Problems
Sentiment analysis
  Determine whether a document is positive or negative towards a specific subject
Categorization
  Given a set of n classes, determine which class a document belongs to
Current Methods
Support Vector Machine (SVM)
  Finds an optimal hyperplane to linearly separate data
  Create n “1 vs All” SVMs to classify more than 2 classes, choose class with highest confidence
  Can work poorly if classes are not linearly separable
Clustering with mean-NN classification
  Group similar feature vectors into clusters
  Choose most popular class in the nearest cluster
  Requires manually choosing cluster count
  Data may be difficult to separate into distinct clusters
Naive Bayes
  Estimate the probability of each word given a class, and the probability of each class
  Combine the probabilities of all words in the document for each class
  Choose class with highest probability
  Assumes all words are independent, which is rarely true
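The Naive Bayes steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the add-one smoothing, the whitespace tokenizer, and the four toy documents are assumptions added here and do not come from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log P(class) and log P(word | class).

    Add-one smoothing is an assumption of this sketch.
    """
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like

def classify(doc, log_prior, log_like):
    """Sum the log-probabilities per class and return the most probable class."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc.split() if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Toy data, invented for illustration.
docs = ["great fun movie", "boring slow movie", "great acting", "slow and boring"]
labels = ["pos", "neg", "pos", "neg"]
log_prior, log_like = train_naive_bayes(docs, labels)
prediction = classify("great movie", log_prior, log_like)
```

Summing log-probabilities is the independence assumption in action: each word contributes to the score on its own, regardless of its neighbors.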
Features
Bag of words (BOW)
  Binary or frequency vector of all words in a document
  Loses word order
  Feature size equal to vocabulary size
One-hot (sequence) vectors
  Single binary vector per word in a document
  Keeps word order
  Very sparse, feature size quickly becomes unmanageable as document size increases
N-grams
  Used with BOW or one-hot
  Use runs of n neighboring words (pairs, for n = 2) as the vocabulary instead of individual words
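The three feature types can be compared on one short sentence. The sketch below assumes whitespace tokenization and a vocabulary built from the sentence itself; the function names are made up for this example.

```python
from collections import Counter

def bag_of_words(tokens, vocab, binary=False):
    """One vector per document: a count (or 0/1 flag) for each vocabulary word."""
    counts = Counter(tokens)
    return [min(counts[w], 1) if binary else counts[w] for w in vocab]

def one_hot_sequence(tokens, vocab):
    """One binary vector per word, so word order is preserved."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tokens:
        v = [0] * len(vocab)
        v[index[t]] = 1
        vectors.append(v)
    return vectors

def ngrams(tokens, n):
    """Runs of n neighboring words, usable as vocabulary entries."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the movie was not good not good at all".split()
vocab = sorted(set(tokens))  # ['all', 'at', 'good', 'movie', 'not', 'the', 'was']

bow_freq = bag_of_words(tokens, vocab)             # one vector of length 7
bow_bin = bag_of_words(tokens, vocab, binary=True)
sequence = one_hot_sequence(tokens, vocab)         # 9 vectors of length 7
bigrams = ngrams(tokens, 2)                        # 8 neighboring pairs
```

The BOW vector stays at vocabulary size however long the document is, while the one-hot sequence grows by one full vocabulary-sized vector per word.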
Classical Neural Networks
Network of layered perceptrons (think SVM)
Each perceptron applies a non-linear function to its output
  Most popular right now is the Rectified Linear Unit ( max(0, x) )
Networks are used to learn a complex, non-linear function
  These functions allow us to classify complex, non-linear datasets more accurately
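A layer of perceptrons with a ReLU output can be written directly from the description above. The weights, biases, and inputs below are arbitrary values chosen for illustration.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through ReLU."""
    return relu(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    """A layer is a set of perceptrons that all read the same inputs."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two perceptrons reading a 2-dimensional input (values are illustrative).
hidden = layer([1.0, 2.0], [[0.5, 0.25], [-1.0, 0.25]], [0.0, 0.0])
# First unit: 0.5 + 0.5 = 1.0; second unit: -1.0 + 0.5 = -0.5, clipped to 0.0.
```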
Training Neural Nets
Typically trained using gradient descent via back-propagation
Loss function at the end of the network tells you how right/wrong your prediction is
  Gradient Descent: move the weights in the direction that decreases the loss, using the gradient of the loss with respect to the weights
  Back-Propagation: send the loss backwards through the network, with each node passing on the gradient of the loss with respect to its own weights
Usually require large amounts of data to train
Training tends to be very slow, especially with larger networks
Idea was created in the 60s, but was thought to be too slow
Recent advancements in tech and technique allow for efficient training
  Can use a GPU's many cores to speed up computation
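Gradient descent is easiest to see on the smallest possible network. The sketch below trains a single linear unit with a mean squared error loss, so back-propagation reduces to one application of the chain rule. The learning rate, epoch count, and toy data are assumptions chosen for illustration.

```python
def train(data, lr=0.05, epochs=500):
    """Fit pred = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    losses = []
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = loss = 0.0
        for x, y in data:
            err = (w * x + b) - y   # forward pass: prediction minus target
            loss += err ** 2
            grad_w += 2 * err * x   # d(err^2)/dw by the chain rule
            grad_b += 2 * err       # d(err^2)/db
        # Step against the gradient, i.e. in the direction of decreasing loss.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
        losses.append(loss / n)
    return w, b, losses

# Toy data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, losses = train(data)
```

In a multi-layer network the same update is applied to every weight, with the gradients passed backwards layer by layer instead of being written out by hand.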
Convolutional Neural Networks
Uses convolution layers with a learned filter on a subset of the input
Pooling layers condense subsets of the input using an averaging or max function
Popular in computer vision
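A 1-dimensional version of both layer types fits in a few lines. In this sketch the filter is fixed by hand, whereas a real network would learn it, and the signal values are arbitrary. As in most neural network libraries, the "convolution" slides the filter without flipping it.

```python
def conv1d(signal, kernel):
    """Slide the filter over the input; each output sees only a small window."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(values, size):
    """Condense each non-overlapping window of `size` values to its maximum."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [1, 3, 2, 5, 4, 6, 1, 0]
kernel = [1, 0, -1]                    # hand-picked; learned in a real network
feature_map = conv1d(signal, kernel)   # [-1, -2, -2, -1, 3, 6]
pooled = max_pool1d(feature_map, 2)    # [-1, -1, 6]
```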
Using Convolution on Text
Single 1-dimensional convolution layer followed by a max pooling layer combining neighboring vectors
Final linear classifier for each class, select class with highest confidence score
Goal is to learn a region-based text embedding
Proposed by Rie Johnson and Tong Zhang
Choosing an Input Representation
Localized convolution allows for region based BOW
Region based sequence vectors shown to also work
What region size do we pick? Binary or frequency counts?
  Depends on the problem
Sentiment classification shown to work well with region size of 3 and a binary BOW
Categorization shown to work well with large regions (~20) and a BOW containing frequency counts
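A region-based BOW input can be sketched as one small bag-of-words vector per window of neighboring words. The function name, the lack of padding at the document edges, and the example sentence are assumptions of this sketch, not details taken from the slides.

```python
from collections import Counter

def region_bow(tokens, vocab, region_size, binary=True):
    """Build one BOW vector for every window of `region_size` neighboring words."""
    regions = []
    for start in range(len(tokens) - region_size + 1):
        counts = Counter(tokens[start:start + region_size])
        regions.append([min(counts[w], 1) if binary else counts[w] for w in vocab])
    return regions

tokens = "i did not like it".split()
vocab = sorted(set(tokens))              # ['did', 'i', 'it', 'like', 'not']
regions = region_bow(tokens, vocab, 3)   # region size 3, binary BOW
# "i did not"    -> [1, 1, 0, 0, 1]
# "did not like" -> [1, 0, 0, 1, 1]
# "not like it"  -> [0, 0, 1, 1, 1]
```

Each region vector stays at vocabulary size whatever the region size, which is what makes large regions (~20 words) practical; word order is lost inside a region but kept between regions.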
Extending the Network
We can get more information by looking at multiple region sizes at the same time
Designed to represent both local and broad features equally
Results
IMDB and Elec used to test sentiment classification
RCV1 used to test categorization on 103 different topics
Test Error Rates
A Signal Based Approach
Instead of representing words, what if we looked at individual letters as a constantly changing signal?
Encode letters using a one-hot representation from a 70 character alphabet.
Classify 1014 characters at a time; this can be extended based on the task
Proposed by Xiang Zhang, Junbo Zhao, and Yann LeCun
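The character encoding step might look like the sketch below. The alphabet here is a short illustrative stand-in, not the 70-character alphabet from the paper. Lowercasing the input, mapping unknown characters to all-zero vectors, and zero-padding short inputs are likewise assumptions of this sketch.

```python
# Illustrative stand-in alphabet (NOT the paper's 70 characters).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?"

def quantize(text, max_len, alphabet=ALPHABET):
    """Encode text as a fixed-length sequence of one-hot character vectors."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    frames = []
    for ch in text.lower()[:max_len]:      # truncate long inputs
        frame = [0] * len(alphabet)
        if ch in index:                    # unknown characters stay all-zero
            frame[index[ch]] = 1
        frames.append(frame)
    # Pad short inputs with all-zero frames so every input has the same length.
    frames.extend([0] * len(alphabet) for _ in range(max_len - len(frames)))
    return frames

encoded = quantize("Hi!", max_len=8)   # 8 frames, one per character position
```

The result is a `max_len` by alphabet-size binary matrix, which a 1-dimensional convolution can treat like a multi-channel signal.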
Network Model
Contains 6 convolution layers, 3 of which are followed by max pooling, then 3 fully connected layers before outputting class scores
Two network sizes
  Large network outputs 4x as many features per layer compared to the small network
Results
Datasets include a mixture of categorization and sentiment classification
Dataset size increases from left to right
Results demonstrate how important training data size is when training deeper networks
Test Error Rates
Data Augmentation
Slightly better results can be obtained by pre-processing the training data
Use a thesaurus to replace some words with other words that mean the same thing
Scale frequency counts to sum to 1
Convert all letters to lowercase
  Tests were done on both variations; using only lowercase gave a very small accuracy gain
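The three pre-processing steps can be sketched as follows. The two-entry thesaurus, the replacement probability, and the function names are hypothetical and exist only for this example; a real setup would draw synonyms from a full thesaurus.

```python
import random

# Hypothetical toy thesaurus, for illustration only.
THESAURUS = {"good": ["great", "fine"], "movie": ["film"]}

def augment(tokens, thesaurus, replace_prob, rng):
    """Replace some words that have synonyms with a randomly chosen synonym."""
    out = []
    for t in tokens:
        synonyms = thesaurus.get(t)
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(t)
    return out

def normalize(counts):
    """Scale frequency counts so they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

rng = random.Random(0)
tokens = "A Good Movie".lower().split()           # lowercase before lookup
augmented = augment(tokens, THESAURUS, 1.0, rng)  # probability 1.0: replace every match
scaled = normalize([1, 1, 2])                     # [0.25, 0.25, 0.5]
```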
Overview
Both networks presented show positive results for the use of convolutional nets in text classification
Deeper networks are known to take longer to train and require a very large amount of data
New network models or input representations could help to improve accuracy
References
R. Johnson, T. Zhang. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Association for Computational Linguistics (ACL), 2014.
X. Zhang, J. Zhao, Y. LeCun. Character-level Convolutional Networks for Text Classification. In Neural Information Processing Systems (NIPS), 2015.