Analysis of Websites as Graphs for SEO

Rubén Martínez – June 2015 – Open Analytics Madrid

Upload: ruben-martinez

Post on 04-Aug-2015



Items (books, music, etc.) used to be arranged in tight silos by category.

There is more to websites than meets the eye

Has a website ever been this boring?

We tend to think of websites as a homepage at the top, followed by a second layer of child webpages (categories), a third level below (sub-categories) and pages of items (products, articles, etc.) at the bottom.

Happily, reality is not so simple!

First-ever website - 1990

Source: Tim Berners-Lee's web catalog at CERN. A copy is available at http://www.w3.org/History/19921103-hypertext/hypertext/WWW/TheProject.html

Not even the first-ever website was a simple hierarchical tree of categories and sub-categories.

Websites are graphs

Graph theory
A graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or links.

Websites
Websites are graphs whose webpages are nodes and whose links are directed edges.

Actual websites are a more organic, messy business.

Visualization of a 300-page ecommerce website
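The definition G = (V, E) can be sketched in a few lines of Python. The URLs below are hypothetical and only illustrate that links are directed: the pages a page links to are not the same as the pages linking to it.

```python
from collections import defaultdict

# Hypothetical internal links: (source page, destination page).
edges = [
    ("/", "/books"),
    ("/", "/music"),
    ("/books", "/books/graph-theory"),
    ("/music", "/"),
    ("/books/graph-theory", "/music"),
]

# V is the set of pages, E the set of directed links between them.
V = {page for edge in edges for page in edge}
E = set(edges)

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages linking to it (inlinks)
for source, destination in E:
    out_links[source].add(destination)
    in_links[destination].add(source)

for page in sorted(V):
    print(page, "out:", len(out_links[page]), "in:", len(in_links[page]))
```

The same two dictionaries are enough to answer most questions in the rest of the talk, such as which pages receive the most inlinks.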


Link analysis in graph theory

PageRank is a link analysis algorithm. It outputs a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at any particular page.

PageRank assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.

Google's reasonable surfer model weights hyperlinks by their position on the page.
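The probability distribution described on this slide is commonly written as the recurrence below. The notation is not from the slides and is introduced here as an assumption: N is the number of pages, d the damping factor (0.85 is the value usually quoted), M(p_i) the set of pages linking to p_i, and L(p_j) the number of outbound links on p_j.

```latex
PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

Each page splits its own score evenly among the pages it links to, so a page with many outbound links passes little to each of them. This plain form treats every link equally; the reasonable surfer model mentioned above would weight links unequally, which this formula does not capture.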


Optimization of PageRank in websites

PageRank is diluted with every level down the structure of categories and sub-categories.

This is a waste of expensive PageRank.

Same information on a leaner, more efficient web architecture.

PageRank is not as important in SEO as it used to be. It is still useful for optimising web architectures.

On-page SEO is mostly about analysing graphs, measuring them and optimising them empirically and iteratively.
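The dilution can be checked on a toy example. The sketch below is a plain power-iteration PageRank, not Google's production algorithm, applied to two hypothetical architectures holding the same four item pages. It assumes every page links back to the homepage and uses a damping factor of 0.85.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical deep architecture: home > category > sub-category > item.
deep = {
    "home": ["cat1", "cat2"],
    "cat1": ["sub1", "home"],
    "cat2": ["sub2", "home"],
    "sub1": ["item1", "item2", "home"],
    "sub2": ["item3", "item4", "home"],
    "item1": ["home"], "item2": ["home"], "item3": ["home"], "item4": ["home"],
}

# Hypothetical leaner architecture: the same items, one level fewer.
lean = {
    "home": ["cat1", "cat2"],
    "cat1": ["item1", "item2", "home"],
    "cat2": ["item3", "item4", "home"],
    "item1": ["home"], "item2": ["home"], "item3": ["home"], "item4": ["home"],
}

deep_rank = pagerank(deep)
lean_rank = pagerank(lean)
print("item1, deep:", round(deep_rank["item1"], 4))
print("item1, lean:", round(lean_rank["item1"], 4))
```

In this toy case the item pages of the lean version end up with a larger share of the total score than in the deep version. Real sites should be measured before and after a change rather than assumed to behave the same way.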


Steps of the analysis of websites

Crawling a website

Cleaning the output of inlinks

csv file

Source,Destination

Visualizing the graph

Analysing the relations of specific nodes

Parameterizing the whole graph

SEO experts are usually presented with inefficient websites that require rationalization and, more often than not, extensive re-indexation on Google. Understanding and parameterizing the graph of a website before and after radical changes to its structure is key. We build a comma-separated values file with pairs of URLs linking to other URLs.

The csv file contains the data of the connected graph, which can be visualized, parameterized and analysed.
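As a minimal sketch of the csv file described above, the snippet below writes a handful of hypothetical link pairs with the Source,Destination header and reads them back as directed edges. An in-memory buffer stands in for a real file, and the URLs are invented for illustration.

```python
import csv
import io

# Hypothetical crawl output: one (source, destination) pair per internal link.
links = [
    ("/", "/program"),
    ("/", "/speakers"),
    ("/program", "/talks/graphs-for-seo"),
    ("/talks/graphs-for-seo", "/speakers"),
]

# Write the edge list with the Source,Destination header used on the slide.
# The buffer stands in for a file opened with open("inlinks.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Destination"])
writer.writerows(links)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back as a list of directed edges, ready for visualization or analysis.
reader = csv.DictReader(io.StringIO(csv_text))
edges = [(row["Source"], row["Destination"]) for row in reader]
```

Keeping the file as a two-column edge list means tools such as Gephi or igraph can load it without further conversion.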


Crawling and exporting a csv file of inlinks

1st step – Crawl a significant sample of the webpages of a website

Desktop applications
• Screaming Frog (fee per licence, all OS)
• Xenu Link Sleuth (free, Windows)

Bash scripts using command-line tools. Beware: poorly written scripts might not be polite.
• cURL
• Wget

(2nd step – Scrape if you have to get specific snippets of text from the crawled pages)
Scrapy in Python

$ pip install scrapy

(3rd step – Extract data if you have to get specific URLs linked from the scraped text)
Beautiful Soup: a Python library for pulling data out of HTML and XML files.
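The extraction step can be sketched without any third-party package. The example below deliberately swaps in Python's standard-library html.parser for Beautiful Soup so that it runs as is. The page URL, the HTML snippet and the helper names are hypothetical, and no page is downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.page_url, href))


def internal_edges(page_url, html):
    """Return (source, destination) pairs for links that stay on the same host."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    host = urlparse(page_url).netloc
    return [(page_url, link) for link in parser.links if urlparse(link).netloc == host]


# Hypothetical page content; a real crawl would download it politely first.
sample_html = """
<a href="/program">Program</a>
<a href="speakers.html">Speakers</a>
<a href="http://other.example/">Partner</a>
"""
page = "http://www.example.com/index.html"
edges = internal_edges(page, sample_html)
for source, destination in edges:
    print(source, "->", destination)
```

Each pair returned here corresponds to one row of the inlinks csv file. Links to other hosts are dropped because only the internal graph is of interest.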


Cleansing & grooming of the output .csv file

Output: csv files with the crawled inlinks

Origin, Destination
URL 1, URL 2
URL 2, URL 3
URL 1, URL 3
…
URL n, URL m

Clean and filter: best with bash one-liners

#!/bin/bash
FILE=
DOMAIN=

# Keep columns 2 and 3, strip the site's own domain, turn tabs into commas,
# then drop assets, mailto links, external URLs and URLs with parameters.
cut -f2,3 "$FILE" |
  sed -e "s/http:\/\/$DOMAIN//g" -e "s/http:\/\/www\.$DOMAIN//g" -e 's/\t/,/g' |
  grep -viE '\.jpg|\.css|\.js|\.gif|\.png|@|mailto|xml|http|\?|=' > filtered.csv
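For readers who prefer to stay in Python, the same cleaning can be sketched as below. It is an approximation of the one-liner, not a replacement for it: it strips the domain only when it appears at the start of a URL, whereas sed removes every occurrence. The rows and the example.com domain are hypothetical.

```python
import re

# Same exclusions as the grep pattern on the slide, matched case-insensitively.
EXCLUDE = re.compile(r"\.jpg|\.css|\.js|\.gif|\.png|@|mailto|xml|http|\?|=", re.IGNORECASE)


def clean_edges(rows, domain):
    """Strip the site's own domain from each URL pair and drop unwanted rows."""
    prefixes = ("http://www." + domain, "http://" + domain)
    cleaned = []
    for source, destination in rows:
        pair = []
        for url in (source, destination):
            for prefix in prefixes:
                if url.startswith(prefix):
                    url = url[len(prefix):]
                    break
            pair.append(url)
        line = ",".join(pair)
        # Anything still containing "http" at this point is an external URL.
        if not EXCLUDE.search(line):
            cleaned.append(line)
    return cleaned


# Hypothetical crawler rows: (source URL, destination URL).
rows = [
    ("http://www.example.com/", "http://www.example.com/program"),
    ("http://www.example.com/program", "http://www.example.com/img/logo.png"),
    ("http://www.example.com/program", "http://twitter.com/share?u=1"),
    ("http://example.com/talks", "http://example.com/speakers"),
]
for line in clean_edges(rows, "example.com"):
    print(line)
```

Only the two page-to-page links survive: the image and the external sharing URL are filtered out, as they would be by the grep step.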


Visualization of a website or part of it

Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.

It performs poorly with large graphs (tens of thousands of nodes and hundreds of thousands of inlinks).

Other tools? Promising:
KeyLines http://keylines.com/neo4j
Tulip http://tulip.labri.fr/TulipDrupal/

Example 1 - Graph of the website of an annual conference

The home page (dark green node in the center) links down to categories (light green or light orange), such as the program page, which in turn links down to item pages (dark orange) with a description of each talk, the bio of the speaker, etc.

This web architecture seems efficient, but item pages might be better connected to the whole graph.

The cluster on the right is the 1st edition of the event (few talks).

The cluster on the left is the 2nd edition of the event (more talks).

Example 2 - Graph of the website of a shopping website

The orange dots are products and the green balls are categories. Why do they ALL connect to each other? Aren't some products more relevant to users and to the business than others?

Some products get more traffic but yield less margin. The optimal web architecture gives more weight in its internal linking to the most popular products with the highest revenue or margin.

This looks like a programmatic linking scheme.

Ecommerce is usually more complex than it is represented here.

Example 3 - Graphs of 2 directly competing websites

This looks like an organic network of clusters connecting other clusters and distant nodes with thin links.

This is a dense pack of many webpages connecting to many other webpages without discernible patterns or clusters.

These graphs are small samples of 2 large websites competing for the same keywords on Google.

Both websites are successful SEO propositions with radically different approaches. Why?

The power of weak links

Thin connections tend to link the clusters, allowing information to move between them.

Source: Giles, Jim. Making the links. Nature - Aug 23rd 2012

These networks are usually efficient enough in terms of SEO.

Analysis of the whole graph

igraph is a collection of network analysis tools. It is available in R.

library(igraph)
dat = read.csv(file.choose(), header=TRUE)  # choose an edge list in .csv file format
summary(dat)
g = graph.data.frame(dat, directed=TRUE)  # build a directed graph from the edge list
vcount(g)  # number of vertices (webpages)
200637
ecount(g)  # number of edges (links)
4174400
centralization.degree(g)
0.4998589

Analysis of the whole graph - parameters

transitivity(g)
0.001666909
graph.density(g)
0.0001036989

igraph calculates metrics of whole graphs with built-in functions. Transitivity, or clustering coefficient, measures the probability that the adjacent vertices of a vertex are connected. This metric, along with graph density, is a useful reference for comparing websites with each other, or one website before and after changes in its web architecture.

website5 has the lowest values of transitivity and density: increasing them would result in improved SEO.

graph      vertices  edges    diameter  transitivity  graph density
website1   8305      34185    30        0.007959      0.000499
website2   10852     88732    16        0.004671      0.000721
website3   11272     71035    20        0.004017      0.000639
website4   11593     47380    32        0.003730      0.001088
website5   200637    4174400  n/a       0.001667      0.000104
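To make the two metrics less of a black box, here is a small pure-Python sketch. It assumes that density of a directed graph is edges divided by n(n-1), and that transitivity is the ratio of closed to connected triples with link direction ignored, which is how igraph is understood to treat directed graphs. The four-page graph is hypothetical; the last line plugs in the vertex and edge counts of website5.

```python
from itertools import combinations


def density(n_vertices, n_edges):
    """Density of a directed graph without self-loops: edges / possible edges."""
    return n_edges / (n_vertices * (n_vertices - 1))


def transitivity(edges):
    """Global clustering coefficient, ignoring edge direction and self-loops."""
    neighbours = {}
    for a, b in edges:
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    closed = triples = 0
    for adjacent in neighbours.values():
        for a, b in combinations(adjacent, 2):
            triples += 1            # a - node - b is a connected triple
            if b in neighbours[a]:
                closed += 1         # a and b are linked too: a closed triple
    return closed / triples if triples else float("nan")


# Hypothetical four-page site.
edges = [("home", "a"), ("home", "b"), ("a", "b"), ("b", "c")]
vertices = {page for edge in edges for page in edge}
print(density(len(vertices), len(edges)))
print(transitivity(edges))

# Vertex and edge counts of website5.
print(density(200637, 4174400))
```

Under that assumption the last line agrees with the graph.density(g) value shown for website5. The other rows are not re-derived here.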


Analysis of specific nodes

http://console.neo4j.org/

MATCH (n:Crew)-[r:LOVES*]-(m)
WHERE n.name='Neo'
RETURN n,m

n                        m
(0:Crew {name:"Neo"})    (2:Crew {name:"Trinity"})

Analysis of specific nodes

Count the number of nodes connected to one node

MATCH (n { name: 'Neo' })-->(x)
RETURN n, count(*)

n                        count(*)
(0:Crew {name:"Neo"})    2

MATCH (n { name: 'Neo' })-->(x)
RETURN x

(2:Crew {name:"Trinity"})
(1:Crew {name:"Morpheus"})

Analysis of specific nodes

MATCH (n:Crew)-[r:KNOWS*]-(m:Matrix)
WHERE n.name='Neo'
RETURN m

(3:Crew:Matrix {name:"Cypher"})
(4:Matrix {name:"Agent Smith"})

Find the shortest path between n and m of type :LOVES

MATCH p = shortestPath((n:Crew)-[:LOVES]->(m:Matrix))
WHERE n.name='Neo'
RETURN p AS Neo,m
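Applied to a website, the same two questions become "how many pages does this page link to?" and "how many clicks separate two pages?". The sketch below answers both with a breadth-first search in plain Python rather than Cypher. The link structure is hypothetical.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over directed links; returns the list of pages or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for target in links.get(page, []):
            if target not in parents:
                parents[target] = page
                queue.append(target)
    return None


# Hypothetical internal links: page -> pages it links to.
links = {
    "/": ["/program", "/speakers"],
    "/program": ["/talks/1", "/talks/2"],
    "/speakers": ["/talks/2"],
    "/talks/2": ["/"],
}

# Number of pages linked from the homepage (the count(*) query above).
print(len(links["/"]))

# Fewest clicks from the homepage to a talk page (the shortestPath query above).
print(shortest_path(links, "/", "/talks/2"))
```

A page that cannot be reached from the start page returns None, which is itself a useful finding when auditing a site.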


That’s all Folks!

Thank you.

Rubén Martínez

@ruben_at_it  

[email protected]