Lecture 4

Upload: tonya

Post on 15-Jan-2016


Page 1: Lecture 4

Lecture 4

Page 2: Lecture 4

Modules/communities in networks

Page 3: Lecture 4

What is a module?

Nodes in a given module (or community, group, or functional unit) tend to connect with other nodes in the same module.

• Biology: proteins of the same function (e.g. DNA repair) or sub-cellular localization (e.g. nucleus)
• WWW: websites on a common topic (e.g. physics) or organization (e.g. EPFL)
• Internet: autonomous systems/routers grouped by geography (e.g. Switzerland) or domain (e.g. educational or military)

Page 4: Lecture 4

Sometimes easy to discover

Page 5: Lecture 4

Sometimes hard

Page 6: Lecture 4

Hierarchical clustering

• Calculate the "similarity weight" Wij for all pairs of vertices (e.g. the number of independent paths between i and j)
• Start with all n vertices disconnected
• Add edges between pairs one by one in order of decreasing weight
• Result: nested components, where one can take a "slice" at any level of the tree
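The procedure above can be sketched in a few lines. This is only an illustration, assuming the weights Wij are already given as a dictionary; the function names `components` and `slice_at` and the toy weights are hypothetical. Adding edges in order of decreasing weight and stopping at some level is the same as keeping every pair whose weight is at least that level, which is what `slice_at` does.

```python
def components(n, edges):
    """Connected components (as sorted lists) of vertices 0..n-1 joined by edges."""
    parent = list(range(n))            # every vertex starts disconnected

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)      # merge the two components

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


def slice_at(n, weights, level):
    """Slice of the tree: components formed by pairs with weight >= level."""
    strong = [pair for pair, w in weights.items() if w >= level]
    return components(n, strong)


# Hypothetical similarity weights: a tight triangle 0-1-2, a tight pair 3-4,
# and one weak link 2-3 between them.
W = {(0, 1): 3, (1, 2): 3, (0, 2): 3, (3, 4): 3, (2, 3): 1}
for level in (4, 2, 1):
    print(level, slice_at(5, W, level))
```

Lowering the level merges components, so the slices are nested: five singletons at level 4, two groups at level 2, one group at level 1.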

Page 7: Lecture 4

Girvan & Newman (2002): betweenness clustering

The betweenness of an edge i -- j is the number of shortest paths going through this edge.

Algorithm:
• compute the betweenness of all edges
• remove the edge with the highest betweenness
• recalculate the betweenness

Caveats:
• Betweenness needs to be recalculated at each step
• Very expensive: all-pairs shortest paths cost $O(N^3)$
• The procedure may need to be repeated up to N times
• Does not scale to more than a few hundred nodes, even with the fastest algorithms
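A minimal sketch of one splitting round, for illustration only. It assumes an undirected, unweighted graph stored as a dictionary of neighbour sets, and it computes edge betweenness with a breadth-first accumulation in the style of Brandes rather than a full all-pairs shortest-path table. The function names and the toy graph are hypothetical. When a pair of nodes has several shortest paths, each path counts fractionally.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path count through each edge (fractional when paths tie)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                           # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:     # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for v in reversed(order):              # accumulate from the farthest nodes
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    return {e: b / 2.0 for e, b in bet.items()}   # each pair was counted twice


def connected_components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return sorted(comps)


def split_once(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {u: set(vs) for u, vs in adj.items()}    # work on a copy
    start = len(connected_components(adj))
    while len(connected_components(adj)) == start:
        bet = edge_betweenness(adj)                # recalculated at every step
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return connected_components(adj)


# Hypothetical test graph: two triangles joined by the single edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(edge_betweenness(graph)[frozenset((2, 3))])
print(split_once(graph))
```

The bridge 2-3 carries all nine shortest paths between the two triangles, so it is removed first and the two triangles come out as separate modules.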

Page 8: Lecture 4

Using random walks/diffusion to discover modules in networks

K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen, PRL 90, 148701 (2003)

Page 9: Lecture 4

Why diffusion?

• Any dynamical process equilibrates faster within modules and more slowly between modules
• Thus its slow modes reveal modules
• Diffusion is the simplest dynamical process (people also use others, such as the Ising/Potts model)

Page 10: Lecture 4

Random walkers on a network

• Study the behavior of many VIRTUAL random walkers on a network
• At each time step each random walker steps to a randomly selected neighbor
• They equilibrate to a steady state ni ~ ki (in solid state physics: ni = const)
• The slow modes of equilibration to the steady state allow one to detect modules in a network
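The steady state ni ~ ki is easy to check numerically. The sketch below is an illustration under simple assumptions: an undirected graph given as neighbour lists, and walker densities evolved deterministically (each node sends an equal share of its walkers to every neighbour) instead of simulating individual walkers. The function name `diffuse` and the toy graph are hypothetical.

```python
def diffuse(adj, n, steps):
    """Evolve walker densities n for a number of time steps.

    adj -- dict: node -> list of neighbours (undirected graph)
    n   -- dict: node -> current fraction of walkers on that node
    """
    for _ in range(steps):
        new = {i: 0.0 for i in adj}
        for j, walkers in n.items():
            share = walkers / len(adj[j])    # each neighbour receives 1/K_j of node j
            for i in adj[j]:
                new[i] += share
        n = new
    return n


# Hypothetical test graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {i: 0.0 for i in graph}
start[0] = 1.0                               # all walkers start on node 0
steady = diffuse(graph, start, 500)
total_degree = sum(len(neighbours) for neighbours in graph.values())
for i in sorted(graph):
    # second column: simulated density, third column: K_i / sum of degrees
    print(i, round(steady[i], 4), round(len(graph[i]) / total_degree, 4))
```

On this graph the two columns agree: the density on each node ends up proportional to its degree. The approach to that state is slowest for the imbalance between the two triangles, which is the kind of slow mode used to detect modules.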

Page 11: Lecture 4

Matrix formalism

$$n_i(t+1) = \sum_j \hat{T}_{ij}\, n_j(t), \qquad \hat{T}_{ij} = \begin{cases} 1/K_j & \text{if } j \to i \\ 0 & \text{otherwise} \end{cases}$$

Page 12: Lecture 4

Eigenvectors of the transfer matrix Tij

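$$\lambda^{(\alpha)} v_i^{(\alpha)} = \sum_j \hat{T}_{ij}\, v_j^{(\alpha)}$$

$$n_i(t) = \sum_\alpha c^{(\alpha)} \left(\lambda^{(\alpha)}\right)^t v_i^{(\alpha)}, \qquad -1 \le \lambda^{(\alpha)} \le 1$$

(the amplitudes $c^{(\alpha)}$ are fixed by the initial distribution of walkers)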

Page 13: Lecture 4

Similarity transformation

• The matrix Tij is asymmetric
• It could in principle have complex eigenvalues/eigenvectors
• Luckily, the symmetric matrix $S_{ij} = 1/\sqrt{K_i K_j}$ (for linked i and j, zero otherwise) has the same eigenvalues, with eigenvectors $v_i/\sqrt{K_i}$
• This is known as a similarity transformation
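As an illustration of how a slow mode exposes modules, the sketch below finds the second eigenvector of the symmetric matrix by power iteration. Several things here are assumptions rather than part of the lecture: S is taken in the standard symmetrised form $S_{ij} = A_{ij}/\sqrt{K_i K_j}$, the steady state (proportional to $\sqrt{K_i}$) is projected out at every step, and the iteration uses the "lazy" matrix (S + 1)/2 so that eigenvalues near -1 cannot dominate. The function name `slowest_mode` and the toy graph are hypothetical.

```python
import math


def slowest_mode(adj, iterations=200):
    """Second eigenvector of S_ij = A_ij / sqrt(K_i K_j) by power iteration."""
    nodes = sorted(adj)
    k = {i: len(adj[i]) for i in nodes}
    norm = math.sqrt(sum(k.values()))
    top = {i: math.sqrt(k[i]) / norm for i in nodes}        # steady state, eigenvalue 1
    x = {i: float(pos + 1) for pos, i in enumerate(nodes)}  # arbitrary starting vector
    for _ in range(iterations):
        overlap = sum(x[i] * top[i] for i in nodes)
        x = {i: x[i] - overlap * top[i] for i in nodes}     # project out the steady state
        # Lazy step (S + 1)/2: an assumption of this sketch, see the lead-in.
        y = {i: 0.5 * x[i]
                + 0.5 * sum(x[j] / math.sqrt(k[i] * k[j]) for j in adj[i])
             for i in nodes}
        size = math.sqrt(sum(value * value for value in y.values()))
        x = {i: y[i] / size for i in nodes}
    return x


# Hypothetical test graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
mode = slowest_mode(graph)
positive = sorted(i for i in graph if mode[i] > 0)
negative = sorted(i for i in graph if mode[i] <= 0)
print(positive, negative)
```

The sign of the slowest mode separates the two triangles. Which triangle gets the positive sign is arbitrary, since an eigenvector is only defined up to an overall factor.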

Page 14: Lecture 4

Density of states ρ(λ)

• filled circles: real AS network
• empty squares: degree-preserving randomized version

Page 15: Lecture 4

Participation ratio: $PR(\lambda) = 1 \Big/ \sum_i \left(v_i^{(\lambda)}\right)^4$

[Figure: participation ratio (vertical axis, 0 to 250) over the range -1 to 1 on the horizontal axis]
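A small sketch of what the participation ratio measures, assuming the standard definition with a normalized eigenvector (the sum of squared components equals 1). The function below normalizes internally, so it accepts any nonzero vector; the name `participation_ratio` is hypothetical.

```python
def participation_ratio(v):
    """Effective number of nodes on which the vector v has weight."""
    norm2 = sum(x * x for x in v)
    if norm2 == 0:
        raise ValueError("the zero vector has no participation ratio")
    # Equivalent to 1 / sum(v_i^4) once v is normalized to sum(v_i^2) = 1.
    return norm2 ** 2 / sum(x ** 4 for x in v)


print(participation_ratio([1, 1, 1, 1]))     # spread evenly over 4 nodes
print(participation_ratio([1, 0, 0, 0]))     # localized on a single node
```

A mode spread evenly over n nodes has a participation ratio of n, and a mode concentrated on one node has a participation ratio of 1, so the quantity estimates the size of the module a mode lives on.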

Page 16: Lecture 4

[Figure labels: US Military; Russia]

Page 17: Lecture 4

Mode  Eigenvalue  Country codes of member nodes (two groups per mode)
2     0.9626      RU RU RU RU CA RU RU
                  ?? ?? US US US US ?? (US Department of Defence)
3     0.9561      ?? FR FR FR ?? FR ??
                  RU RU RU ?? ?? RU ??
4     0.9523      US ?? US ?? ?? ?? ?? (US Navy)
                  NZ NZ NZ NZ NZ NZ NZ
5     0.9474      KR KR KR KR KR ?? KR
                  UA UA UA UA UA UA UA

Page 18: Lecture 4

Hacked Ford AS

Page 19: Lecture 4
Page 20: Lecture 4

Using random walks/diffusion to rank information networks

e.g. PageRank made Google worth $160 billion

Page 21: Lecture 4

Information networks

• $3\times10^5$ Phys Rev articles connected by $3\times10^6$ citation links
• $10^{10}$ webpages in the world
• To find relevant information one needs to efficiently search and rank!

Page 22: Lecture 4

Ranking webpages

• Assign an "importance factor" Gi to every webpage
• Given a keyword (say "jaguar"), find all the pages that have it in their text and display them in order of descending Gi
• One solution still used in scientific publishing is Gi = Kin(i) (the number of incoming links), but:
  • Too democratic: it doesn't take into account the importance of the nodes sending links
  • Easy to trick and artificially boost the ranking (for the WWW)

Page 23: Lecture 4

How Google works

• Google's recipe (circa 1998) is to simulate the behavior of many virtual "random surfers"
• PageRank: Gi ~ the number of virtual hits the page gets. It is also ~ the steady-state number of random surfers at a given page
• Popular pages send more surfers your way: PageRank ~ Kin weighted by the popularity of the webpage sending each hyperlink
• Surfers get bored following links: with probability α = 0.15 a surfer jumps to a randomly selected page (not following any hyperlinks)
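The random-surfer recipe can be sketched as a simple iteration. This is an illustration, not Google's implementation. It assumes the web is given as a dictionary of outgoing links, normalizes the ranks so that their average is 1, and treats a page without outgoing links as sending its surfers to a random page (that last rule is an assumption of this sketch). The function name `pagerank` and the four-page web are hypothetical.

```python
def pagerank(links, alpha=0.15, iterations=100):
    """Steady-state density of random surfers, normalized to an average of 1.

    links -- dict: page -> list of pages it links to
    alpha -- probability of jumping to a random page instead of following a link
    """
    pages = sorted(links)
    n = len(pages)
    g = {p: 1.0 for p in pages}
    for _ in range(iterations):
        total = sum(g.values())
        new = {p: alpha * total / n for p in pages}     # bored surfers land anywhere
        for j in pages:
            out = links[j]
            if out:
                for i in out:                            # follow one of K_out links
                    new[i] += (1 - alpha) * g[j] / len(out)
            else:                                        # dangling page (assumption)
                for i in pages:
                    new[i] += (1 - alpha) * g[j] / n
        g = new
    return g


# Hypothetical four-page web: a cycle a -> b -> c -> a, plus d linking to a.
web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
rank = pagerank(web)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

Page d has no incoming links, so it is visited only through random jumps and its rank equals α. Page a collects surfers from both c and d and comes out on top.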

Page 24: Lecture 4

How communities in the WWW influence Google ranking

H. Xie, K.-K. Yan, SM, cond-mat/0409087, physics/0510107

Physica A 373 (2007) 831–836

Page 25: Lecture 4

How do WWW communities influence their average Gi?

• Pages in a web community preferentially link to each other. Examples:
  • Pages from the same organization (e.g. EPFL)
  • Pages devoted to a common topic (e.g. physics)
  • Pages in the same geographical location (e.g. Switzerland)
• Naïve argument: communities "trap" random surfers to spend more time inside, so they should increase the average Google ranking of the community

Page 26: Lecture 4

Test of a naïve argument

[Figure: log10(<G>c) versus the number of intra-community links, for Community #1 and Community #2]

The naïve argument is wrong: it could go either way

Page 27: Lecture 4

[Figure labels: Ecc; Eww]

Page 28: Lecture 4

• Gc: average Google rank of pages in the community; Gw ≈ 1: average Google rank in the outside world
• Ecw Gc/<Kout>c: current from C to W
• It must be equal to Ewc Gw/<Kout>w: current from W to C
• Thus Gc depends on the ratio between Ecw and Ewc, the numbers of edges (hyperlinks) from the community to the world and from the world to the community
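Written out, the balance gives Gc directly. The step below is a short worked version of the argument above, ignoring random jumps and using Gw ≈ 1:

```latex
\frac{E_{cw}\, G_c}{\langle K_{out}\rangle_c}
  = \frac{E_{wc}\, G_w}{\langle K_{out}\rangle_w}
\quad\Longrightarrow\quad
G_c \approx \frac{E_{wc}}{E_{cw}} \cdot
            \frac{\langle K_{out}\rangle_c}{\langle K_{out}\rangle_w}
\qquad (G_w \approx 1)
```

For example, a community that receives twice as many links from the world as it sends out, with the same average out-degree as the world, would have Gc ≈ 2.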

Page 29: Lecture 4

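Balancing currents for nonzero α

$$J_{cw} = (1-\alpha)\, \frac{E_{cw}\, G_c}{\langle K_{out}\rangle_c} + \alpha\, G_c N_c$$ (current from C to W)

It must be equal to:

$$J_{wc} = (1-\alpha)\, \frac{E_{wc}\, G_w}{\langle K_{out}\rangle_w} + \alpha\, G_w N_w \frac{N_c}{N_w}$$ (current from W to C)

With $G_w \approx 1$ this gives

$$G_c = \frac{(1-\alpha)\,\dfrac{E_{wc}}{N_c \langle K_{out}\rangle_w} + \alpha}{(1-\alpha)\,\dfrac{E_{cw}}{N_c \langle K_{out}\rangle_c} + \alpha} = \frac{(1-\alpha)\,\dfrac{E_{wc}}{E_{wc}^{(random)}} + \alpha}{(1-\alpha)\,\dfrac{E_{cw}}{E_{cw}^{(random)}} + \alpha}$$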

Page 30: Lecture 4

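What are the consequences?

$$G_c = \frac{(1-\alpha)\,\dfrac{E_{wc}}{E_{wc}^{(random)}} + \alpha}{(1-\alpha)\,\dfrac{E_{cw}}{E_{cw}^{(random)}} + \alpha}$$

• For very isolated communities ($E_{cw}/E_{cw}^{(r)} < \alpha$ and $E_{wc}/E_{wc}^{(r)} < \alpha$) one has Gc = 1. Their Google rank is decoupled from the outside world!
• Overall range: α < Gc < 1/α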

Page 31: Lecture 4

WWW: the empirical data

• We have data for ~10 US universities (plus all UK and Australian universities)
• Looked closely at UCLA and Long Island University (LIU):
  • UCLA has different departments
  • LIU has 4 campuses

Page 32: Lecture 4

[Figure: PageRank versus Kin on logarithmic axes, for α = 0.15]

Page 33: Lecture 4

[Figure: α = 0.001, with a group of pages marked "Abnormally high PageRank"]

Page 34: Lecture 4

Top PageRank LIU websites for α = 0.001 don't make sense

• #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/hear.html
• #5 …/higher_ed/index.html
• #9 …/higher_ed/courses.html

[Figure labels: World; Strongly connected component]

Page 35: Lecture 4
Page 36: Lecture 4
Page 37: Lecture 4

What about citation networks?

• Better to use α = 0.5 instead of α = 0.15: people don't click through papers as easily as through webpages
• Time arrow: papers only cite older papers, so small values of α give older papers an unfair advantage
• New algorithm: CiteRank (as in PageRank). Random walkers start from recent papers with probability ~ $\exp(-t/\tau_d)$, where t is the age of the paper

Page 38: Lecture 4
Page 39: Lecture 4

Summary

• Diffusion and modules (communities) in a network affect each other
• In the "hardware" part of the Internet (autonomous systems or routers), diffusion allows one to detect modules
• In the "software" part, a diffusion-like process is used for ranking (Google's PageRank)
• WWW communities affect this ranking in a non-trivial way

Page 40: Lecture 4

THE END

Page 41: Lecture 4
Page 42: Lecture 4

Part 2: Opinion networks

"Extracting Hidden Information from Knowledge Networks", S. Maslov and Y.-C. Zhang, Phys. Rev. Lett. (2001).

"Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.

Page 43: Lecture 4

Predicting customers' tastes from their opinions on products

• Each of us has personal tastes
• Information about them is contained in our opinions on products
• Matchmaking: opinions of customers with tastes similar to mine can be used to forecast my opinions on untested products
• The Internet allows this to be done on a large scale (see amazon.com and many others)

Page 44: Lecture 4

Opinion networks

[Figure: two networks shown side by side. WWW: webpages linking to other webpages. Opinions of movie-goers on movies: customers linked to movies, each link carrying an opinion.]

Page 45: Lecture 4

Storing opinions

X X X 2 9 ? ?

X X X ? 8 ? 8

X X X ? ? 1 ?

2 ? ? X X X X

9 8 ? X X X X

? ? 1 X X X X

? 8 ? X X X X

Matrix of opinions Ω_IJ

Network of opinions

[Figure: network linking customers to movies, with links labeled by the opinions 9, 8, 8, 1, 2]

Page 46: Lecture 4

Using correlations to reconstruct customers' tastes

Similar opinions ⇒ similar tastes

Simplest model:
• Movie-goers → M-dimensional vector of tastes $T_I$
• Movies → M-dimensional vector of features $F_J$
• Opinions → scalar product: $\Omega_{IJ} = T_I \cdot F_J$

[Figure: network of customers and movies with links labeled by opinions]
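A tiny numerical illustration of the scalar-product model, with M = 2. All names and numbers here are hypothetical and chosen only to show that customers with similar taste vectors end up with similar opinions.

```python
def opinion(tastes, features):
    """Scalar product of a taste vector and a feature vector (both of length M)."""
    if len(tastes) != len(features):
        raise ValueError("taste and feature vectors must have the same length M")
    return sum(t * f for t, f in zip(tastes, features))


# Hypothetical tastes and features along two axes (say, "action" and "romance").
customers = {"Ann": [1.0, 0.0], "Bob": [0.9, 0.1], "Cat": [0.0, 1.0]}
movies = {"Heist": [8.0, 1.0], "Wedding": [1.0, 9.0]}

omega = {(c, m): opinion(t, f)
         for c, t in customers.items()
         for m, f in movies.items()}

for (customer, movie), value in sorted(omega.items()):
    print(customer, movie, round(value, 2))
```

Ann and Bob have nearly parallel taste vectors and rate both movies almost identically, while Cat's ratings are reversed. This is the correlation a matchmaker exploits when predicting an opinion that has not been recorded.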

Page 47: Lecture 4

Loop correlation

• Predictive power: $1/M^{(L-1)/2}$
• One needs many loops to best reconstruct unknown opinions
• Example with L = 5 known opinions: the predictive power for an unknown opinion is $1/M^2$

[Figure: loop in the opinion network, with one link marked "An unknown opinion"]

Page 48: Lecture 4

Main parameter: density of edges

• The larger the density of edges p, the easier the prediction
• At $p_1 \approx 1/N$ (N = Ncustomers + Nmovies) macroscopic prediction becomes possible. Nodes are connected, but the vectors TI and FJ are not fixed: ordinary percolation threshold
• At $p_2 \approx 2M/N > p_1$ all tastes and features (TI and FJ) can be uniquely reconstructed: rigidity percolation threshold

Page 49: Lecture 4

[Figure: π(ρ) (vertical axis, 0.5 to 1) versus ρ (horizontal axis, 0 to 1)]

Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by 73000 users, 1.6 million opinions!

Page 50: Lecture 4

Spectral properties of Ω

• For M < N the matrix Ω_IJ has N-M zero eigenvalues and M positive ones: $\Omega = R R^+$
• Using SVD one can "diagonalize" $R = U D V^+$ such that the matrices V and U are orthogonal, $V^+ V = 1$, $U U^+ = 1$, and D is diagonal. Then $\Omega = U D^2 U^+$
• The amount of information contained in Ω is $NM - M(M-1)/2 \ll N(N-1)/2$, the number of off-diagonal elements

Page 51: Lecture 4

Recursive algorithm for the prediction of unknown opinions

1. Start with Ω0, in which all unknown elements are filled with <Ω> (zero in our case)

2. Diagonalize it and keep only the M largest eigenvalues and their eigenvectors

3. In the resulting truncated matrix Ω'0, replace all known elements with their exact values and go back to step 2
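The loop can be sketched without any libraries for a small symmetric matrix. This is an illustration under several assumptions that are not in the lecture: the diagonal elements are treated as known, the M largest eigenpairs are found by power iteration with deflation (which picks eigenvalues of largest magnitude and so relies on the dominant ones being positive), and the loop runs for a fixed number of rounds. The function names and the toy data (M = 1, one hidden element) are hypothetical.

```python
import math


def top_eigenpairs(a, m, sweeps=200):
    """M dominant eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(a)
    a = [row[:] for row in a]                  # work on a copy
    pairs = []
    for _ in range(m):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(sweeps):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            size = math.sqrt(sum(x * x for x in w))
            if size == 0.0:
                break
            v = [x / size for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):                     # deflate: remove the found component
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return pairs


def predict(known, n, m, rounds=100):
    """Fill in a symmetric n x n matrix from its known elements {(i, j): value}."""
    def impose_known(matrix):
        for (i, j), value in known.items():
            matrix[i][j] = value
            matrix[j][i] = value

    omega = [[0.0] * n for _ in range(n)]      # step 1: unknown elements set to zero
    impose_known(omega)
    for _ in range(rounds):
        pairs = top_eigenpairs(omega, m)       # step 2: keep the M largest modes
        omega = [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(n)]
                 for i in range(n)]
        impose_known(omega)                    # step 3: restore the known elements
    return omega


# Hypothetical data with M = 1: Omega_ij = r_i * r_j, with element (0, 3) hidden.
r = [1.0, 2.0, 3.0, 4.0]
known = {(i, j): r[i] * r[j]
         for i in range(4) for j in range(i, 4) if (i, j) != (0, 3)}
result = predict(known, n=4, m=1)
print(round(result[0][3], 6), "true value:", r[0] * r[3])
```

In this small example the hidden element approaches its true value round after round, while the known elements stay fixed at their exact values.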

Page 52: Lecture 4

Convergence of the algorithm

• Above p2 the algorithm converges exponentially to the exact values of the unknown elements

• The rate of convergence scales as $(p-p_2)^2$

Page 53: Lecture 4

Reality check: sources of errors

• Customers are not rational! $\Omega_{IJ} = r_I \cdot b_J + \epsilon_{IJ}$ (idiosyncrasy)
• Opinions are delivered to the matchmaker through a narrow channel:
  • Binary channel $S_{IJ} = \mathrm{sign}(\Omega_{IJ})$: 1 or 0 (liked or not)
  • Experience rated on a scale of 1 to 5, or 1 to 10 at best
• If the number of edges K and the size N are large while M is small, these errors can be reduced

Page 54: Lecture 4

How to determine M?

• In real systems M is not fixed: there are always finer and finer details of tastes
• Given the number of known opinions K, one should choose Meff ~ K/(Nreaders + Nbooks) so that systems are below the second transition p2; tastes should be determined hierarchically

Page 55: Lecture 4

Avoid overfitting

• Divide known votes into training and test sets
• Select Meff so as to avoid overfitting!

[Figure panels: Reasonable fit; Overfit]

Page 56: Lecture 4

Knowledge networks in biology

• Interacting biomolecules: key and lock principle
• Matrix of interactions (binding energies): $\Omega_{IJ} = k_I \cdot l_J + l_I \cdot k_J$
• The matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones
• Many experiments measure $S_{IJ} = \theta(\Omega_{IJ} - \Omega_{th})$

[Figure labels: k(1), k(2), l(1), l(2)]

Page 57: Lecture 4

THE END