
Text Classification using Convolution Neural Networks

Matt Falkenhainer

Classification Problems

Sentiment analysis
  Determine if a document is positive/negative towards a specific subject

Categorization
  Given a set of n classes, determine which class a document belongs to

Current Methods

Support Vector Machine (SVM)
  Finds an optimal hyperplane to linearly separate data
  Create n “1 vs All” SVMs to classify more than 2 classes; choose the class with the highest confidence
  Can work poorly if classes are not linearly separable
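The “1 vs All” decision rule can be sketched as follows. This is a minimal illustration, not a trained SVM: the class names and the (weights, bias) pairs are hypothetical, hand-picked values standing in for the output of n separately trained linear classifiers.

```python
def decision_score(weights, bias, x):
    """Score w.x + b of one linear classifier; larger means more confident."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def one_vs_all_predict(classifiers, x):
    """Return the class whose "1 vs All" classifier gives the highest score."""
    return max(classifiers, key=lambda c: decision_score(*classifiers[c], x))


# Hypothetical (weights, bias) pairs, one per class. In practice each pair
# would come from training one SVM on "this class" vs "everything else".
classifiers = {
    "sports":   ([1.0, -0.5, 0.0], -0.2),
    "politics": ([-0.5, 1.0, 0.0], -0.2),
    "tech":     ([0.0, -0.5, 1.0], -0.2),
}

predicted = one_vs_all_predict(classifiers, [0.0, 2.0, 0.0])
```

Here the "politics" classifier scores 1.8 while the other two score -1.2, so "politics" is selected.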

Clustering with mean-NN classification
  Combine features into similar clusters
  Choose most popular class in the nearest cluster
  Requires manually choosing cluster count
  Data may be difficult to separate into distinct clusters
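A sketch of the classification step only, assuming the clusters have already been formed by some clustering algorithm. The centroids and labels below are hypothetical toy values, and Euclidean distance is an assumption.

```python
import math
from collections import Counter


def nearest_cluster_label(x, clusters):
    """clusters: list of (centroid, labels of the training points in that cluster).

    Finds the cluster whose centroid is closest to x (Euclidean distance,
    an assumption) and returns the most common label inside it.
    """
    _, labels = min(clusters, key=lambda c: math.dist(x, c[0]))
    return Counter(labels).most_common(1)[0][0]


# Hypothetical clusters over 2-dimensional document features.
clusters = [
    ([0.0, 0.0], ["neg", "neg", "pos"]),
    ([5.0, 5.0], ["pos", "pos"]),
]

label_a = nearest_cluster_label([1.0, 0.5], clusters)
label_b = nearest_cluster_label([4.0, 6.0], clusters)
```

The first point lands in a mixed cluster and takes its majority label, which shows why data that does not separate into distinct clusters hurts accuracy.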

Naive Bayes
  Create probabilities of each word given a class and of each class
  Combine all probabilities of words in document for each class
  Choose class with highest probability
  Assumes all words are independent, which is rarely true
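The Naive Bayes steps can be sketched as below. This is a minimal version under stated assumptions: add-one smoothing and log-probabilities are choices made here for numerical safety, and the four training documents are made-up toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, tokens):
    """Combine log P(class) and log P(word | class) and pick the best class."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training set.
docs = [
    ("great fun movie".split(), "pos"),
    ("boring bad movie".split(), "neg"),
    ("great acting".split(), "pos"),
    ("bad plot".split(), "neg"),
]
model = train_naive_bayes(docs)
prediction = predict_naive_bayes(model, "great movie".split())
```

Summing the per-word log-probabilities is exactly where the independence assumption enters: each word contributes on its own, regardless of its neighbors.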

Features

Bag of words (BOW)
  Binary or frequency vector of all words in a document
  Loses word order
  Feature size equal to vocabulary size

One-hot (sequence) vectors
  Single binary vector per word in a document
  Keeps word order
  Very sparse; feature size quickly becomes unmanageable as document size increases

N-grams
  Used with BOW or one-hot
  Use groups of n neighboring words (e.g. pairs) as vocabulary instead of individual words
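A small sketch of these features, assuming whitespace-tokenized documents. The helper names are illustrative and not taken from any library.

```python
from collections import Counter


def build_vocab(docs):
    """Map every distinct token to an index (sorted for a stable order)."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def bow_vector(tokens, vocab, binary=True):
    """Binary or frequency bag-of-words vector of length len(vocab)."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = [0] * len(vocab)
    for tok, count in counts.items():
        vec[vocab[tok]] = 1 if binary else count
    return vec


def bigrams(tokens):
    """Neighboring word pairs, usable as vocabulary items in place of single words."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


docs = [["good", "not", "bad"], ["bad", "not", "good"]]
vocab = build_vocab(docs)               # {'bad': 0, 'good': 1, 'not': 2}

same_bow = bow_vector(docs[0], vocab) == bow_vector(docs[1], vocab)
pairs_a = bigrams(docs[0])              # ['good not', 'not bad']
pairs_b = bigrams(docs[1])              # ['bad not', 'not good']
```

Both documents get the identical BOW vector even though they mean opposite things, while their bigrams differ. That is the word-order information BOW loses and n-grams partly recover.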

Classical Neural Networks

Network of layered perceptrons (think SVM)

Each perceptron applies a non-linear function to its output
  Most popular right now is the Rectified Linear Unit ( max(0, x) )

Networks are used to learn a complex, non-linear function
  These functions allow us to classify complex, non-linear datasets more accurately
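One layer of such a network can be sketched like this. The weights, biases, and inputs are arbitrary illustrative numbers, not learned values.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def layer(inputs, weights, biases):
    """One layer of perceptrons: a weighted sum plus bias per unit, then ReLU."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Two perceptrons reading a 2-dimensional input (hypothetical values).
outputs = layer(
    inputs=[1.0, -2.0],
    weights=[[1.0, 1.0], [0.5, -1.0]],
    biases=[0.0, 0.5],
)
# outputs == [0.0, 3.0]: the first unit's sum is -1.0 and is clipped to 0.
```

Stacking several such layers is what lets the network represent a non-linear decision function. Without the ReLU, a stack of layers would collapse into a single linear map.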

Training Neural Nets

Typically trained using gradient descent via back-propagation
  Loss function at the end of the network tells you how right/wrong your prediction is
  Gradient Descent: Move the weights in the direction that decreases the loss, using the gradient of the loss with respect to the weights
  Back-Propagation: Send the loss gradient backwards through the network, with each node computing the gradient with respect to its own weights

Usually requires large amounts of data to train

Training tends to be very slow, especially with larger networks
  Idea was created in the 60s, but was thought to be too slow
  Recent advancements in tech and technique allow for efficient training
  Can use a GPU's many cores to speed up computation
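Gradient descent with back-propagation can be illustrated on a deliberately tiny network with one hidden ReLU unit and one output weight. The squared-error loss, the learning rate, the starting weights, and the single training example are all assumptions chosen to keep the sketch short.

```python
def train(x, target, w1=0.5, w2=0.5, lr=0.1, steps=50):
    """Fit pred = w2 * relu(w1 * x) to one (x, target) pair."""
    losses = []
    for _ in range(steps):
        # Forward pass.
        h_pre = w1 * x
        h = max(0.0, h_pre)
        pred = w2 * h
        losses.append((pred - target) ** 2)

        # Backward pass: apply the chain rule from the loss back to each weight.
        d_pred = 2 * (pred - target)               # dLoss/dpred
        d_w2 = d_pred * h                          # dLoss/dw2
        d_h = d_pred * w2                          # dLoss/dh
        d_h_pre = d_h if h_pre > 0 else 0.0        # through the ReLU
        d_w1 = d_h_pre * x                         # dLoss/dw1

        # Gradient descent step: move against the gradient.
        w1 -= lr * d_w1
        w2 -= lr * d_w2
    return w1, w2, losses


w1, w2, losses = train(x=1.0, target=2.0)
```

With these settings the loss shrinks at every step and the product w1 * w2 approaches the target of 2.0.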

Convolutional Neural Networks

Uses convolution layers with a learned filter on a subset of the input

Pooling layers condense subsets of the input using an averaging or max function

Popular in computer vision
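The two layer types can be sketched on a plain list of numbers. The kernel values here are fixed by hand for illustration, whereas in a real network they are learned. The pooling windows are non-overlapping, which is one common choice and an assumption here.

```python
def conv1d(seq, kernel, bias=0.0):
    """Slide the kernel over the sequence; each output sees len(kernel) inputs."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[i:i + k])) + bias
        for i in range(len(seq) - k + 1)
    ]


def mean(values):
    return sum(values) / len(values)


def pool1d(seq, size, combine=max):
    """Condense each non-overlapping window of `size` values into one value."""
    return [combine(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


signal = [0, 1, 3, 2, 0, 4, 1]
features = conv1d(signal, kernel=[1, 1])      # [1, 4, 5, 2, 4, 5]
max_pooled = pool1d(features, size=2)         # [4, 5, 5]
avg_pooled = pool1d(features, size=2, combine=mean)  # [2.5, 3.5, 4.5]
```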

Using Convolution on Text

Single 1-dimensional convolution layer followed by a max pooling layer combining neighboring vectors

Final linear classifier for each class; select the class with the highest confidence score

Goal is to learn a region-based text embedding

Proposed by Rie Johnson and Tong Zhang
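A forward pass through this kind of network might look roughly like the sketch below. It is not the authors' implementation: the filters, class weights, 3-word vocabulary, and region vectors are hypothetical hand-set values, and the ReLU after the convolution is an assumption.

```python
def relu(x):
    return max(0.0, x)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conv_layer(region_vectors, filters):
    """Apply every filter to every region vector: one feature vector per region."""
    return [[relu(dot(f, r)) for f in filters] for r in region_vectors]


def max_pool(feature_vectors, pool_size):
    """Element-wise max over each group of `pool_size` neighboring vectors."""
    pooled = []
    for i in range(0, len(feature_vectors), pool_size):
        group = feature_vectors[i:i + pool_size]
        pooled.append([max(column) for column in zip(*group)])
    return pooled


def classify(region_vectors, filters, class_weights, pool_size):
    """Convolution, pooling, then one linear scorer per class; highest score wins."""
    pooled = max_pool(conv_layer(region_vectors, filters), pool_size)
    flat = [value for vec in pooled for value in vec]
    scores = {label: dot(w, flat) for label, w in class_weights.items()}
    return max(scores, key=scores.get), scores


# Hypothetical setup: vocabulary order is [good, bad, not].
filters = [[1.0, 0.0, -1.0],   # fires on "good" without "not"
           [0.0, 1.0, -1.0]]   # fires on "bad" without "not"
regions = [[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1]]
class_weights = {"pos": [1, -1, 1, -1], "neg": [-1, 1, -1, 1]}

label, scores = classify(regions, filters, class_weights, pool_size=2)
```

In this toy run the region containing "not" and "bad" together activates neither filter, so only the "good" region contributes and the "pos" classifier gets the higher score.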

Choosing an Input Representation

Localized convolution allows for region-based BOW

Region-based sequence vectors shown to also work

What region size do we pick? Binary or frequency counts?
  Depends on the problem
  Sentiment classification shown to work well with a region size of 3 and a binary BOW
  Categorization shown to work well with large regions (~20) and a BOW containing frequency counts
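A region-based BOW input can be sketched as one small bag-of-words vector per sliding window of words. The vocabulary and sentence are toy examples, and the sketch uses no padding at the document edges, which is a simplification assumed here.

```python
def region_bow(tokens, vocab, region_size, binary=True):
    """One BOW vector per window of `region_size` consecutive words."""
    vectors = []
    for start in range(len(tokens) - region_size + 1):
        vec = [0] * len(vocab)
        for tok in tokens[start:start + region_size]:
            if tok in vocab:
                idx = vocab[tok]
                vec[idx] = 1 if binary else vec[idx] + 1
        vectors.append(vec)
    return vectors


vocab = {"good": 0, "is": 1, "not": 2, "this": 3}
tokens = "this is not not good".split()

binary_regions = region_bow(tokens, vocab, region_size=3)
# [[0, 1, 1, 1], [0, 1, 1, 0], [1, 0, 1, 0]]
count_regions = region_bow(tokens, vocab, region_size=3, binary=False)
# [[0, 1, 1, 1], [0, 1, 2, 0], [1, 0, 2, 0]]
```

Each vector has vocabulary size no matter how large the region is, which is what keeps large regions (such as ~20 words) manageable compared with concatenating one-hot vectors.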

Extending the Network

We can get more information by looking at multiple region sizes at the same time

Designed to represent both local and broad features equally

Results

IMDB and Elec used to test sentiment classification

RCV1 used to test categorization on 103 different topics

Test Error Rates

A Signal Based Approach

Instead of representing words, what if we looked at individual letters as a constantly changing signal?

Encode letters using a one-hot representation from a 70 character alphabet.

Classify 1014 characters at a time; the input length can be extended based on the task

Proposed by Xiang Zhang, Junbo Zhao, and Yann LeCun
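Character-level one-hot encoding can be sketched as below. The alphabet here is a shortened, hypothetical 41-character stand-in, not the 70-character alphabet from the slides. Lowercasing, padding with all-zero rows, and mapping unknown characters to all-zero rows are assumptions of this sketch. `max_len` plays the role of the fixed input length and is set very small for the demo.

```python
import string

# Hypothetical alphabet: 26 letters + 10 digits + 5 extra characters = 41.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(text, max_len):
    """Return max_len rows, each a one-hot vector over ALPHABET."""
    matrix = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:              # unknown characters stay all-zero
            row[CHAR_INDEX[ch]] = 1
        matrix.append(row)
    while len(matrix) < max_len:          # pad short texts with all-zero rows
        matrix.append([0] * len(ALPHABET))
    return matrix


encoded = encode("Hi!", max_len=5)
```

The result is a 5 x 41 matrix: three rows with a single 1 each, followed by two padding rows. The convolution then slides along the character axis, treating the text as a signal.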

Network Model

Contains 6 convolution and max pooling pairs, followed by 3 fully connected layers before outputting class scores

Two network sizes
  Large network outputs 4x as many features per layer as the small network

Results

Datasets include a mixture of categorization and sentiment classification

Dataset size increases from left to right

Results demonstrate how important training data size is when training deeper networks

Test Error Rates

Data Augmentation

Slightly better results can be obtained by pre-processing the training data

Use a thesaurus to replace some words with other words that mean the same thing

Scale frequency counts to sum to 1

Convert all letters to lowercase

Tests were done on both variations; converting to lowercase alone gave only a very small accuracy gain
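These pre-processing steps can be sketched as follows. The thesaurus is a tiny hypothetical dictionary, and the replacement probability is an assumed parameter rather than a value from the slides.

```python
import random

# Hypothetical thesaurus: word -> list of synonyms.
THESAURUS = {"good": ["great", "fine"], "movie": ["film"]}


def augment(tokens, thesaurus, replace_prob, rng):
    """Replace each word that has synonyms with probability `replace_prob`."""
    out = []
    for tok in tokens:
        synonyms = thesaurus.get(tok)
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(tok)
    return out


def normalize(counts):
    """Scale frequency counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)


def lowercase(text):
    return text.lower()


rng = random.Random(0)                      # seeded for repeatability
tokens = lowercase("A GOOD movie").split()
augmented = augment(tokens, THESAURUS, replace_prob=1.0, rng=rng)
scaled = normalize([1, 1, 2])               # [0.25, 0.25, 0.5]
```

With `replace_prob=1.0` every word that has an entry in the thesaurus is swapped. A lower value would keep most of the original text while still producing new training examples.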

Overview

Both networks presented show positive results for the use of convolutional nets in text classification

Deeper networks are known to take longer to train and require a very large amount of data

New network models or input representations could help to improve accuracy

References

R. Johnson, T. Zhang. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2015.

X. Zhang, J. Zhao, Y. LeCun. Character-level Convolutional Networks for Text Classification. In Neural Information Processing Systems (NIPS), 2015.
