How to Spot a Bear - An Intro to Machine Learning for SEO

@TomAnthonySEO

April 2015 - BrightonSEO

HOW TO SPOT A BEAR: A Machine Learning Introduction for SEOs

Can you define a list of rules for spotting bears?

Let’s start with:

1. Four legs.

List of rules (first half): (when I asked in the office)

1. Four legs. 2. Breathes. 3. Furry. 4. Long snout.

List of rules:

1. Four legs. 2. Breathes. 3. Furry. 4. Long snout.

5. Brown. 6. Not always brown. 7. Mammal. 8. No tail.

(how do you spot a mammal?!)

Let’s check our rules…

Rules say:

Bear

Rules say:

Harmless Furry Thing (fewer than 4 legs)

Rules say:

Odd Grey Creature (no long snout)

Remove ‘long snout’, and rules say:

Bear (Extra-terrestrial bear?!)

Our rules suck.
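The same checklist, sketched in Python to show where it breaks. The rule names come from the slides (the Brown / Not always brown pair is left out because it cancels itself); the attribute names and the three animals are hypothetical stand-ins for the photos, not data from the talk.

```python
# Rules from the slide as (name, test) pairs. Attribute names are illustrative.
RULES = [
    ("four legs", lambda a: a["legs_on_ground"] == 4),
    ("breathes", lambda a: a["breathes"]),
    ("furry", lambda a: a["furry"]),
    ("long snout", lambda a: a["long_snout"]),
    ("mammal", lambda a: a["mammal"]),
    ("no tail", lambda a: not a["visible_tail"]),
]


def failed_rules(animal, rules=RULES):
    """Return the names of the rules this animal breaks (empty list means 'Bear')."""
    return [name for name, test in rules if not test(animal)]


# Hypothetical stand-ins for the photos in the talk.
bear_on_all_fours = {
    "legs_on_ground": 4, "breathes": True, "furry": True,
    "long_snout": True, "mammal": True, "visible_tail": False,
}
bear_standing_up = dict(bear_on_all_fours, legs_on_ground=2)
short_snouted_creature = dict(bear_on_all_fours, long_snout=False)

print(failed_rules(bear_on_all_fours))       # []
print(failed_rules(bear_standing_up))        # ['four legs']
print(failed_rules(short_snouted_creature))  # ['long snout']

# Dropping the 'long snout' rule lets the third animal through as a bear too.
relaxed = [rule for rule in RULES if rule[0] != "long snout"]
print(failed_rules(short_snouted_creature, relaxed))  # []
```

Tighten a rule and a real bear fails it; loosen a rule and a non-bear passes.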

A different bear: Google’s Panda

Can you define a list of rules for spotting spammy pages?

Same problem as bears!

Good page

Commercial page, still good.

Hrm…

Seems legit…

Google can’t write rules.

What we can do is identify spammy or non-spammy attributes.

Some Possible Spam Signals

Are there adverts on the page?

Are there lots of spelling mistakes?

Is there little text content?

Are there Calls To Action in ALL CAPS?

Smooth segue to:

Machine Learning

List of pages we’ve manually classified.

List of attributes that we believe are important to classifying pages.

Example Data

Site    Adverts on page?  More than 5 spelling mistakes?  Less than 200 words of content?  CTA in ALL CAPS?  Classification
site A  Y                 Y                               Y                                Y                 Spam Site
site B  N                 N                               Y                                Y                 Good Site
site C  Y                 N                               N                                N                 Spam Site
site D  N                 Y                               N                                Y                 Spam Site
site E  N                 Y                               N                                N                 Good Site

Neural Networks: A Perceptron

Inputs -> Neuron -> Output

if: inputs >= 1, output TRUE

Inputs:  1, 0, 1, 0
Weights: 0.5, 0.5, 0.5, 0.5

1 x 0.5 = 0.5
0 x 0.5 = 0
1 x 0.5 = 0.5
0 x 0.5 = 0
______
Total: 1
Output: TRUE

Inputs:  1, 0, 0, 0
Weights: 0.5, 0.5, 0.5, 0.5

1 x 0.5 = 0.5
0 x 0.5 = 0
0 x 0.5 = 0
0 x 0.5 = 0
______
Total: 0.5
Output: FALSE

Inputs:  1, 0, 1, 0
Weights: 0.5, 0.5, 0.4, 0.5

1 x 0.5 = 0.5
0 x 0.5 = 0
1 x 0.4 = 0.4
0 x 0.5 = 0
______
Total: 0.9
Output: FALSE
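The three worked examples, as a minimal Python sketch. The function name `perceptron` and the `threshold` parameter are my own labels; the inputs, weights and the ">= 1" rule are the ones on the slides.

```python
def perceptron(inputs, weights, threshold=1.0):
    """Return (total, fired): the weighted sum and whether it reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total, total >= threshold


# (inputs, weights) for the three slides above.
examples = [
    ([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]),  # total 1.0 -> TRUE
    ([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]),  # total 0.5 -> FALSE
    ([1, 0, 1, 0], [0.5, 0.5, 0.4, 0.5]),  # total 0.9 -> FALSE
]

for inputs, weights in examples:
    total, fired = perceptron(inputs, weights)
    print(inputs, weights, "total:", round(total, 2), "output:", fired)
```

Same inputs, one weight nudged from 0.5 to 0.4, and the answer flips. Everything the neuron "knows" lives in the weights.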

Untrained Neuron

Is site spam?

Input                   Weight
adverts                 0.5
>5 spelling mistakes    0.5
< 200 words content     0.5
CTA in ALL CAPS         0.5

if: inputs >= 1, output TRUE

Training

Input                   Value   Weight
adverts                 0       0.5
>5 spelling mistakes    0       0.5
< 200 words content     1       0.5
CTA in ALL CAPS         1       0.5

if: inputs >= 1, output TRUE

Output: SPAM!

Training

Input                   Weight
adverts                 0.5
>5 spelling mistakes    0.5
< 200 words content     0.6
CTA in ALL CAPS         0.6

if: inputs >= 1, output TRUE
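The slides don't say which update rule produced their numbers, so here is a rough sketch using the classic perceptron learning rule as a stand-in. That rule, the step size of 0.1 and the function names are all assumptions, so the weights it arrives at are not the ones shown on the slides. The site data is the Example Data table; fractions are used so that sums landing exactly on the threshold aren't upset by floating-point rounding.

```python
from fractions import Fraction

# Example Data from the slides: (adverts, >5 spelling mistakes,
# <200 words content, CTA in ALL CAPS) -> 1 for Spam Site, 0 for Good Site.
SITES = {
    "site A": ([1, 1, 1, 1], 1),
    "site B": ([0, 0, 1, 1], 0),
    "site C": ([1, 0, 0, 0], 1),
    "site D": ([0, 1, 0, 1], 1),
    "site E": ([0, 1, 0, 0], 0),
}


def predict(inputs, weights, threshold=1):
    """1 if the weighted inputs total at least the threshold, else 0."""
    return int(sum(x * w for x, w in zip(inputs, weights)) >= threshold)


def train(passes, step=Fraction(1, 10)):
    """Classic perceptron rule (an assumption, not stated in the slides)."""
    weights = [Fraction(1, 2)] * 4  # the untrained neuron's 0.5s
    for _ in range(passes):
        for inputs, label in SITES.values():
            error = label - predict(inputs, weights)  # -1, 0 or +1
            # Only weights on active inputs (x == 1) move.
            weights = [w + step * error * x for w, x in zip(weights, inputs)]
    return weights


def correct(weights):
    """How many of the five sites the neuron classifies as labelled."""
    return sum(predict(x, weights) == label for x, label in SITES.values())


for passes in (0, 2, 5):
    w = train(passes)
    print(passes, "passes:", [float(v) for v in w], f"{correct(w)}/5 correct")
```

With this rule the untrained neuron gets 3/5, two passes over the table get 4/5, and five passes get all five. That only works because the table has five rows.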

After training: 4/5 sites correct

Is site spam?

Input                   Weight
adverts                 0.2
>5 spelling mistakes    0.7
< 200 words content     0.4
CTA in ALL CAPS         0.5

if: inputs >= 1, output TRUE
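To check the 4/5 figure, here is a small sketch that runs the trained weights from the slide over the Example Data table. The weights, the Y/N answers and the labels are from the slides; the helper name `is_spam` is mine.

```python
# Trained weights from the slide, in the order:
# adverts, >5 spelling mistakes, <200 words content, CTA in ALL CAPS.
TRAINED_WEIGHTS = [0.2, 0.7, 0.4, 0.5]

# Example Data from the slides: Y/N per attribute, then the manual label.
SITES = {
    "site A": ("YYYY", "Spam Site"),
    "site B": ("NNYY", "Good Site"),
    "site C": ("YNNN", "Spam Site"),
    "site D": ("NYNY", "Spam Site"),
    "site E": ("NYNN", "Good Site"),
}


def is_spam(answers, weights, threshold=1.0):
    """Add up the weights of every 'Y' answer and compare to the threshold."""
    total = sum(w for a, w in zip(answers, weights) if a == "Y")
    return total >= threshold


hits = 0
for name, (answers, label) in SITES.items():
    verdict = "Spam Site" if is_spam(answers, TRAINED_WEIGHTS) else "Good Site"
    hits += verdict == label
    print(f"{name}: neuron says {verdict}, labelled {label}")

print(f"{hits}/5 sites correct")
```

The miss is site C: its only signal is adverts, and 0.2 on its own never reaches 1.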

ANNs typically have many neurons

source: http://www.teco.edu/~albrecht/neuro/html/node18.html

Deep Learning

Humans are good at pattern matching

We’re better than machines…

source: Pawan Sinha (http://web.mit.edu/bcs/sinha/papers/sinha_recog_review_NN.pdf)

ML can learn to recognise cats from examples

Deep Learning learns more like us

Ok, so what does this have to do with Google?

Panda

ML-based algorithm updates

Old index Caffeine

Caffeine - Infrastructure Update (we believe this made Panda+Penguin possible)

Hummingbird is to ??? as Caffeine is to Panda+Penguin

Hummingbird: Is it similar to Caffeine? Is it the basis for new natural language algorithms?

Where is Google going next with ML?

Idea: Image Search 2.0

Image Labelling

Video Labelling

ML Generated Image Descriptions

“Two pizzas sitting on top of a stove top oven”

Idea: Natural Language Faceted Search

‘show me Olympic athletes’

‘show me the women’

How about:

“Find well-rated vegetarian cooking books written after 1990”

Idea: Factual Accuracy as a Ranking Factor

Fact Checking

Knowledge Vault

Idea: Bad Facts

Estimating ‘Trustworthiness’

Idea: Entirely ML Generated Algorithm?

http://dis.tl/ml-algo

Thanks! :)

@TomAnthonySEO
