
Page 1: Vectorization - Georgia Tech - CSE6242 - March 2015

Vectorization: Core Concepts in Data Mining

Georgia Tech – CSE6242 – March 2015

Josh Patterson

Page 2: Vectorization - Georgia Tech - CSE6242 - March 2015

Presenter: Josh Patterson

• Email: [email protected]

• Twitter: @jpatanooga

• GitHub: https://github.com/jpatanooga

Past

• Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm” – grad work in meta-heuristics and ant algorithms

• Tennessee Valley Authority (TVA) – Hadoop and the smart grid

• Cloudera – Principal Solution Architect

Today: Patterson Consulting

Page 3: Vectorization - Georgia Tech - CSE6242 - March 2015

Topic Index

• Why Vectorization?

• Vector Space Model

• Text Vectorization

• General Vectorization

Page 4: Vectorization - Georgia Tech - CSE6242 - March 2015

WHY VECTORIZATION?

“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”

--- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Page 5: Vectorization - Georgia Tech - CSE6242 - March 2015

Classic Scenario:

“Classify some tweets for positive vs. negative sentiment”

Page 6: Vectorization - Georgia Tech - CSE6242 - March 2015

What Needs to Happen?

• We need each tweet as some structure that can be fed to a learning algorithm – to represent the knowledge of a “negative” vs. “positive” tweet

• How does that happen? – We take the raw text and convert it into what is called a “vector”

• Vectors relate to the fundamentals of linear algebra – “solving sets of linear equations”

Page 7: Vectorization - Georgia Tech - CSE6242 - March 2015

Wait. What’s a Vector Again?

• An array of floating point numbers

• Represents data

– Text

– Audio

– Image

• Example:

– [ 1.0, 0.0, 1.0, 0.5 ]

Page 8: Vectorization - Georgia Tech - CSE6242 - March 2015

VECTOR SPACE MODEL

“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”

--- HAL 9000, “2001: A Space Odyssey”

Page 9: Vectorization - Georgia Tech - CSE6242 - March 2015

Vector Space Model

• A common way of vectorizing text – every possible word is mapped to a specific integer

• If we have a large enough array, then every word fits into a unique slot in the array – the value at that index is the number of times the word occurs

• Most often our array size is smaller than our corpus vocabulary – so we need a “vectorization strategy” to account for this (see the sketch below)
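
To make the word-to-index mapping concrete, here is a minimal Python sketch of the vector space model; the toy documents and the way the vocabulary is built are illustrative assumptions, not from the slides:

```python
# Minimal vector space model: map every word to a fixed integer index,
# then count occurrences of each word at its index.
docs = ["the cat sat", "the cat ate the mouse"]

# Build the vocabulary: every unique word gets its own slot in the array.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

def vectorize(doc):
    vec = [0.0] * len(vocab)
    for word in doc.split():
        vec[vocab[word]] += 1.0  # value at the index = number of times the word occurs
    return vec

for d in docs:
    print(d, "->", vectorize(d))
```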

Page 10: Vectorization - Georgia Tech - CSE6242 - March 2015

Text Vectorization Can Include Several Stages

• Sentence Segmentation – can skip straight to tokenization, depending on the use case

• Tokenization – find the individual words

• Lemmatization – find the base or stem of words

• Removing Stop Words – “the”, “and”, etc.

• Vectorization – take the output of the process and make an array of floating-point values
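
As a rough illustration of these stages, here is a toy Python pipeline; the stop-word list and the one-line “stemmer” are placeholder assumptions (a real system would use a proper lemmatizer, e.g., from NLTK or spaCy):

```python
# Toy pipeline: tokenize -> remove stop words -> crude stem -> tokens ready to vectorize.
STOP_WORDS = {"the", "and", "a", "is"}  # illustrative stop-word list

def tokenize(text):
    return text.lower().split()  # find the individual words

def stem(word):
    # Placeholder for real lemmatization: just strip a trailing "s".
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The cats and the dog"))  # ['cat', 'dog']
```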

Page 11: Vectorization - Georgia Tech - CSE6242 - March 2015

TEXT VECTORIZATION STRATEGIES

“A man who carries a cat by the tail learns something he can learn in no other way.”

--- Mark Twain

Page 12: Vectorization - Georgia Tech - CSE6242 - March 2015

Bag of Words

• A group of words or a document is represented as a bag – or “multi-set” of its words

• Bag of words is a list of words and their word counts– simplest vector model – but can end up using a lot of columns due to number of words

involved.

• Grammar and word ordering is ignored – but we still track how many times the word occurs in the

document

• has been used most frequently in the document classification – and information retrieval domains.
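
The “bag” is literally a multiset; Python’s standard-library Counter shows the idea in one line (the example document is illustrative):

```python
from collections import Counter

# A document as a bag (multiset) of words: grammar and word order are
# discarded, only the per-word counts remain.
doc = "to be or not to be"
bag = Counter(doc.split())
print(bag)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```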

Page 13: Vectorization - Georgia Tech - CSE6242 - March 2015

Term Frequency–Inverse Document Frequency (TF-IDF)

• Fixes some issues with “bag of words”

• Allows us to leverage the information about how often a word occurs in a document (TF) – while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF)

• More accurate than the basic bag-of-words model – but computationally more expensive
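
TF-IDF has several variants; one common formulation is tf(t, d) · log(N / df(t)), sketched below in plain Python on toy, pre-tokenized documents (all illustrative assumptions):

```python
import math

docs = [["cat", "sat"], ["cat", "ate", "mouse"], ["dog", "sat"]]

def tf(term, doc):
    return doc.count(term) / len(doc)  # how often the word occurs in this document

def idf(term, docs):
    n_containing = sum(term in d for d in docs)  # documents containing the term
    return math.log(len(docs) / n_containing)    # rarer words get a larger weight

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("cat", docs[0], docs))    # common word -> modest weight
print(tf_idf("mouse", docs[1], docs))  # rare word -> larger weight
```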

Page 14: Vectorization - Georgia Tech - CSE6242 - March 2015

Kernel Hashing

• When we want to vectorize the data in a single pass– making it a “just in time” vectorizer.

• Can be used when we want to vectorize text right before we feed it to our learning algorithm.

• We come up with a fixed sized vector that is typically smaller than the total possible words that we could index or vectorize– Then we use a hash function to create an index into

the vector.
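
A minimal sketch of the hashing trick, assuming CRC32 as the hash function (any deterministic hash works; occasional collisions are the price of the fixed size):

```python
import zlib

VECTOR_SIZE = 8  # deliberately smaller than the possible vocabulary

def hash_vectorize(text):
    # Single pass, no stored vocabulary: index = hash(word) mod vector size.
    vec = [0.0] * VECTOR_SIZE
    for word in text.split():
        idx = zlib.crc32(word.encode()) % VECTOR_SIZE  # collisions share a slot
        vec[idx] += 1.0
    return vec

print(hash_vectorize("the cat sat on the mat"))
```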

Page 15: Vectorization - Georgia Tech - CSE6242 - March 2015

GENERAL VECTORIZATION STRATEGIES

“Everybody good? Plenty of slaves for my robot colony?”

--- TARS, Interstellar

Page 16: Vectorization - Georgia Tech - CSE6242 - March 2015

Four Major Attribute Types

• Nominal – Ex: “sunny”, “overcast”, and “rainy”

• Ordinal – like nominal, but with an order

• Interval – e.g., “year”, expressed in fixed and equal units

• Ratio – the scheme defines a zero point and then a distance from this fixed zero point
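
One way these four attribute types typically end up as numbers in a vector; the specific values and encodings below are illustrative assumptions:

```python
# Nominal: no inherent order -> one-hot encode.
weather = ["sunny", "overcast", "rainy"]
one_hot = {v: [1.0 if i == j else 0.0 for j in range(len(weather))]
           for i, v in enumerate(weather)}
print(one_hot["overcast"])  # [0.0, 1.0, 0.0]

# Ordinal: order matters -> map onto ordered integers.
size = {"small": 0, "medium": 1, "large": 2}

# Interval: fixed, equal units but an arbitrary zero point (e.g., a year).
year = 2015.0

# Ratio: a true zero point, so ratios are meaningful (e.g., a distance).
distance_km = 42.0
```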

Page 17: Vectorization - Georgia Tech - CSE6242 - March 2015

Techniques of Feature Engineering

• Taking the values directly from the attribute, unchanged – if the value is something we can use out of the box

• Feature scaling – standardizing or normalizing an attribute

• Binarization of features – 0 or 1

• Dimensionality reduction – use only the most interesting features
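
A small sketch of two of these techniques, standardization and binarization, on made-up values (thresholding at the mean is just one possible binarization rule):

```python
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

standardized = [(v - mean) / std for v in values]       # zero mean, unit variance
binarized = [1.0 if v > mean else 0.0 for v in values]  # 0 or 1 per feature value

print(standardized)
print(binarized)
```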

Page 18: Vectorization - Georgia Tech - CSE6242 - March 2015

Canova

• Command Line Based – we don’t want to write custom code for every dataset

• Examples of Usage
– Convert the MNIST dataset from raw binary files to the svmLight text format
– Convert raw text into TF-IDF-based vectors in a text vector format {svmLight, ARFF}

• Scales out on multiple runtimes – local, Hadoop

• Open Source, ASF 2.0 Licensed – https://github.com/deeplearning4j/Canova