
Efficiently Learning Structure

Edwin Hancock Department of Computer Science

University of York

Supported by a Royal Society Wolfson Research Merit Award

Structural Variations

Variation in object appearance

• These graphs represent different views of the same object.

• They vary in detail, but represent the same object in different poses.


Protein-Protein Interaction Networks

Variations in complexity

• These graphs represent PPIs from organisms at different stages of evolution.

• The organisms and their PPIs differ in structure and complexity.


Financial Data

• Represents trading on the NYSE over a 6000-day period.

• Nodes are stocks; edges indicate that the closing-price time series of the two stocks are correlated.

• The modularity (structure) of the network changes sharply during trading crises.


Questions

• Can we learn models of structure when there are both different variants of the same object present and objects of different intrinsic complexity?

• How do we capture variance and complexity at the structural level?


Graph data

• Problems based on graphs arise in areas such as language processing, proteomics/chemoinformatics, data mining, computer vision and complex systems.

• Relatively little methodology is available, and vectorial methods from statistical machine learning are not easily applied, since there is no canonical ordering of the nodes in a graph.

• Considerable progress can be made if we develop permutation-invariant characterisations of variations in graph structure.


Characterising graphs

• Topological: e.g. average degree, degree distribution, edge density, diameter, cycle frequencies, etc.

• Spectral or algebraic: use the eigenvalues of the adjacency matrix or Laplacian, or equivalently the coefficients of the characteristic polynomial.

• Complexity: use information-theoretic measures of structure (e.g. Shannon entropy). A small sketch of one descriptor from each family follows below.
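
A minimal sketch (NumPy only; the particular descriptors are illustrative choices rather than a fixed feature set from the talk) computing one example from each family for an undirected graph given by its 0/1 adjacency matrix:

```python
import numpy as np

def graph_descriptors(A):
    """One example descriptor from each family for adjacency matrix A:
    topological (average degree, edge density), spectral (adjacency
    spectral gap), and a simple entropy as a complexity measure."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    avg_degree = deg.mean()
    edge_density = A.sum() / (n * (n - 1))          # undirected, no self-loops
    adj_eigs = np.sort(np.linalg.eigvalsh(A))[::-1]
    spectral_gap = adj_eigs[0] - adj_eigs[1]
    # Shannon entropy of the degree distribution as a crude complexity index
    p = deg / deg.sum()
    p = p[p > 0]
    degree_entropy = float(-(p * np.log(p)).sum())
    return {"avg_degree": avg_degree,
            "edge_density": edge_density,
            "spectral_gap": spectral_gap,
            "degree_entropy": degree_entropy}

# Example on a small ring graph
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1
print(graph_descriptors(A))
```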

Complex systems

• Spatial and topological indices: node degree statistics; edge density.

• Communicability: communities, measures of centrality, separation, etc. (Barabási; Watts and Strogatz; Estrada).

• Processes on graphs: Markov processes, Ising models, random walks, searchability (Kleinberg).

Links explored in this talk

• Structure: discriminate between graphs on the basis of their detailed structure.

• Complexity: determine whether different non-isomorphic structures are of similar or different intrinsic complexity.

• Learning: learn generative model of structure that gives minimum complexity description of training data (MDL).

Complexity

Information theory, graphs and kernels.

Graph Entropy

• Entropic measures of complexity: many possibilities, including Shannon, Erdős-Rényi and von Neumann entropies.

• Problems: difficult to compute for graphs. They require either a probability distribution over the nodes of the graph or a combinatorial (micro-canonical state) characterisation. This remains an open problem in the literature.

• Uses: complexity-level analysis of graphs, learning structure via description length, and constructing information-theoretic kernels.

Entropy

• Thermodynamics: a measure of disorder in a system. The change of entropy with energy measures the temperature of the system: dE = T dH.

• Statistical mechanics: entropy measures the uncertainty over the microstates of a system, H = -k Σ_i p_i ln p_i (Boltzmann).

• Quantum mechanics: confusion of states, H = -k Tr[ρ ln ρ], in terms of the density matrix ρ for the states of an operator O (von Neumann).

• Information theory: Shannon information, H = -Σ_i p_i ln p_i, in terms of the probability of transmission of a message over an information channel.

Von Neumann entropy

• Passerini and Severini: the normalised Laplacian L = D^{-1/2}(D - A)D^{-1/2}, scaled by 1/|V|, is a density matrix for the graph.

• Exploited to compute an approximate von Neumann entropy for undirected graphs by Han et al. (PRL, 2013) and for directed graphs by Cheng et al. (Phys. Rev. E, 2014).

• Used for graph kernel construction (Bai, JMIV, 2013) and for learning generative models of graphs (Han, SIMBAD, 2011).

• Recently used as a unary feature to analyse and classify complex-network time series.

Von Neumann entropy and node degree

Von Neumann Entropy

• Passerini and Severini: the normalised Laplacian is the density matrix for the graph Hamiltonian,

  L = D^{-1/2}(D - A)D^{-1/2} = Φ Λ Φ^T

• The associated von Neumann entropy is

  H_VN = -Σ_{i=1}^{|V|} (λ_i/|V|) ln(λ_i/|V|),  where λ_i are the eigenvalues of L
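
A minimal sketch (plain NumPy; assumes an unweighted, undirected graph given by its 0/1 adjacency matrix) of the exact von Neumann entropy computed from the scaled normalised-Laplacian spectrum defined above:

```python
import numpy as np

def von_neumann_entropy(A):
    """Exact von Neumann entropy from the normalised Laplacian spectrum.

    A : (n, n) symmetric 0/1 adjacency matrix of an undirected graph.
    The density matrix is the normalised Laplacian scaled by 1/|V|
    (Passerini and Severini), so H = -sum_i (l_i/|V|) ln(l_i/|V|).
    """
    n = A.shape[0]
    d = A.sum(axis=1)
    # D^{-1/2}, guarding against isolated nodes (degree 0)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    L = np.diag(d) - A
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    lam = np.linalg.eigvalsh(L_norm) / n          # scaled eigenvalues
    lam = lam[lam > 1e-12]                        # convention: 0 ln 0 = 0
    return float(-(lam * np.log(lam)).sum())

# Example: entropy of a 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(von_neumann_entropy(A))
```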

Simplified entropy

• The quadratic approximation of the von Neumann entropy reduces to

  H_VN = 1 - 1/|V| - (1/|V|^2) Σ_{(u,v)∈E} 1/(d_u d_v)

Computed in quadratic time. Most spectral methods are at least cubic.

Some graph-entropy computations are combinatorial.
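
A quick sketch of the degree-based approximation above, assuming a simple undirected graph given as a node count and an edge list; it needs only the node degrees, so no eigendecomposition is required:

```python
import numpy as np

def approx_vn_entropy(n_nodes, edges):
    """Quadratic (degree-based) approximation of the von Neumann entropy:
    H ~= 1 - 1/|V| - (1/|V|^2) * sum_{(u,v) in E} 1/(d_u * d_v).
    Runs in O(|V| + |E|)."""
    deg = np.zeros(n_nodes)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    s = sum(1.0 / (deg[u] * deg[v]) for u, v in edges)
    return 1.0 - 1.0 / n_nodes - s / n_nodes ** 2

# Example: the 4-cycle again
print(approx_vn_entropy(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```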

Properties

Based on degree statistics

Extremal values for cycle and star-graphs

Can be used to distinguish Erdős-Rényi, small-world, and scale-free networks.

Uses

• Complexity-based clustering (especially protein-protein interaction networks).

• Defining information theoretic (Jensen-Shannon) kernels.

• Controlling complexity of generative models of graphs.

Entropy component analysis



• For each graph, construct a 2D histogram indexed by the node degrees of each edge.

• Increment each bin by the entropies of the edges with the relevant degree configuration.

• Vectorise the histogram bin contents and perform PCA on the sample of vectors for different graphs. (A sketch of this pipeline follows below.)
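
A minimal sketch of this pipeline in NumPy. Two details are assumptions for illustration: the per-edge contribution of the quadratic entropy approximation is used as the "edge entropy", and the histogram is truncated at an arbitrary maximum degree max_deg:

```python
import numpy as np

def entropy_histogram(n_nodes, edges, max_deg=20):
    """2D histogram indexed by the degrees (d_u, d_v) of each edge,
    accumulating each edge's contribution to the approximate entropy."""
    deg = np.zeros(n_nodes, dtype=int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    H = np.zeros((max_deg, max_deg))
    for u, v in edges:
        # bin indices, capped at max_deg and order-normalised
        i, j = sorted((min(deg[u], max_deg) - 1, min(deg[v], max_deg) - 1))
        # per-edge term of the quadratic von Neumann entropy approximation
        H[i, j] += 1.0 / (deg[u] * deg[v] * n_nodes ** 2)
    return H.ravel()

def pca_embed(X, k=2):
    """Project a sample of vectorised histograms onto their first k
    principal components (plain SVD, no external ML library)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Example: embed a few random graphs
rng = np.random.default_rng(0)
vectors = []
for _ in range(10):
    n = 30
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < 0.1]
    vectors.append(entropy_histogram(n, edges))
print(pca_embed(np.array(vectors)).shape)   # (10, 2)
```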

Financial Market Data

• Look at time-series correlations for a set of leading stocks.

• Create undirected or directed links on the basis of time-series correlation.

• The directed von Neumann entropy changes during Black Monday, 1987.

• The entropy shows a sharp drop on Black Monday and recovers within a few trading days.

Black Monday

Fig. 4. PCA plot for directed graph embedding on financial stock market data. Black: Black Monday, cyan and green: background, blue: dot-com bubble, red: subprime crisis.

The four clusters representing different eras are clearly seen and Black Monday is also separated, implying our graph characterization is effective.

PCA applied to entropy feature vectors: distinct epochs of market evolution occupy different regions of the subspace and can be separated. Black Monday is a clear outlier. There appears to be some underlying manifold structure.

Graph kernels


Jensen-Shannon Kernel

• Defined in terms of J-S divergence

• Properties: extensive, positive semidefinite.

• The JSD is the difference between the entropy of the graph union and the average of the individual graph entropies.

  JS(G_i, G_j) = H(G_i ⊕ G_j) - (1/2) [H(G_i) + H(G_j)]

  K_JS(G_i, G_j) = ln 2 - JS(G_i, G_j)
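
A hedged sketch of the kernel above. It reuses approx_vn_entropy() from the earlier sketch and takes the graph union to be the union of the two edge sets, which assumes the node sets are already in correspondence:

```python
import math

def js_kernel(edges_i, edges_j, n_nodes):
    """Jensen-Shannon kernel between two graphs on a common node set:
    K_JS = ln 2 - JSD, with JSD = H(Gi (+) Gj) - (H(Gi) + H(Gj)) / 2.
    The 'graph union' here is the union of the two edge sets, an
    assumption that the nodes are already aligned."""
    union_edges = sorted(set(edges_i) | set(edges_j))
    h_union = approx_vn_entropy(n_nodes, union_edges)
    h_i = approx_vn_entropy(n_nodes, edges_i)
    h_j = approx_vn_entropy(n_nodes, edges_j)
    jsd = h_union - 0.5 * (h_i + h_j)
    return math.log(2.0) - jsd

# Example: kernel between a 4-cycle and a 4-path
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]
print(js_kernel(cycle, path, 4))
```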

Structural Learning

Deep learning

• Deep belief networks (Hinton 2006, Bengio 2007).

• Compositional networks (Amit+Geman 1999, Fergus 2010).

• Markov models (Leonardis 200

• Stochastic image grammars (Zhu, Mumford, Yuille)

• Taxonomy/category learning (Todorovic+Ahuja, 2006-2008).

Description length

• Wallace+Freeman: minimum message length.

• Rissanen: minimum description length.

Use the log-posterior probability to locate the model that is optimal with respect to code length.

Similarities/differences

• MDL: selection of model is aim; model parameters are simply a means to this end. Parameters usually maximum likelihood. Prior on parameters is flat.

• MML: Recovery of model parameters is central. Parameter prior may be more complex.

Coding scheme

• Usually assumed to follow an exponential distribution.

• Alternatives are universal codes and predictive codes.

• MML has two part codes (model+parameters). In MDL the codes may be one or two-part.

Method

• Model: a supergraph (i.e. a graph prototype) formed by graph union.

• Sample data observation model: a Bernoulli distribution over nodes and edges.

• Model complexity: the von Neumann entropy of the supergraph.

• Fitting criterion: MDL-like (make ML estimates of the Bernoulli parameters); MML-like (two-part code for the data-model fit plus the supergraph complexity).

Model overview

• Description length criterion:

  code length = negative log-likelihood + model code length (entropy)

Data-set: set of graphs G

Model: prototype graph+correspondences with it

Updates by expectation maximisation: Model graph adjacency matrix (M-step) + correspondence indicators (E-step).

  L(G, Γ) = LL(G | Γ) + H(Γ)
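
A rough sketch of this two-part criterion under strong simplifying assumptions: the node correspondences are taken as already resolved (the data graphs are pre-aligned to the supergraph node order) and the model term is the supergraph entropy, e.g. von_neumann_entropy() from the earlier sketch:

```python
import numpy as np

def code_length(data_graphs, M, entropy_fn):
    """Two-part code length L(G, Gamma) = LL(G | Gamma) + H(Gamma).

    data_graphs : list of (n, n) 0/1 adjacency matrices, assumed already
                  aligned to the supergraph's node order.
    M           : (n, n) Bernoulli edge probabilities of the supergraph.
    entropy_fn  : complexity term, e.g. von_neumann_entropy from above.
    """
    eps = 1e-12
    P = np.clip(M, eps, 1.0 - eps)
    # negative log-likelihood of the observed adjacency entries
    nll = 0.0
    for D in data_graphs:
        nll -= np.sum(D * np.log(P) + (1.0 - D) * np.log(1.0 - P))
    # model code length: entropy of the (thresholded) supergraph
    supergraph = (M > 0.5).astype(float)
    return nll + entropy_fn(supergraph)
```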

Data Codelength

• Depends on the correspondences s between the data-graph adjacency matrix elements D and the model-graph adjacency matrix elements M.


Experiments

Delaunay graphs from images of different objects.

Figure panels: COIL dataset and Toys dataset.

Experiments---validation

■ COIL dataset: the model complexity increases, the graph data log-likelihood increases, and the overall code length decreases during the iterations.

■ Toys dataset: the model complexity decreases, the graph data log-likelihood increases, and the overall code length decreases during the iterations.

Experiments---classification task

We compare the performance of our learned supergraph on a classification task with two alternative constructions: the median graph and a supergraph learned without using MDL. The table below shows the average classification rates from 10-fold cross-validation, followed by their standard errors.

Experiments---graph embedding

Pairwise graph distance based on the Jensen-Shannon divergence and the von Neumann entropy of graphs.

Compute supergraph for each pair of graphs.

Experiments---graph embedding

Figure panels: edit-distance embedding and JSD-distance embedding.

Generative model

• Train on graphs with a set of predetermined characteristics.

• Sample using Monte Carlo.

• Reproduces characteristics of the training set, e.g. spectral gap, node degree distribution, etc. (A sampling sketch follows after the examples below.)

Erdős-Rényi

Barabási-Albert (scale free)

Delaunay graphs
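
A hedged sketch of the Monte Carlo sampling step, assuming the learned model is summarised by a matrix of Bernoulli edge probabilities; the talk does not fix the exact sampler, so independent edge trials are used here purely for illustration:

```python
import numpy as np

def sample_graphs(M, n_samples, rng=None):
    """Monte Carlo sampling from a learned edge-probability matrix M:
    each candidate edge (u, v) is included independently with
    probability M[u, v] (one simple choice of sampler)."""
    rng = np.random.default_rng() if rng is None else rng
    n = M.shape[0]
    upper = np.triu(np.ones((n, n), dtype=bool), 1)   # u < v only
    samples = []
    for _ in range(n_samples):
        A = ((rng.random((n, n)) < M) & upper).astype(float)
        samples.append(A + A.T)                       # symmetrise (undirected)
    return samples

# Check that sampled degree statistics track the model
M = np.full((20, 20), 0.2)
np.fill_diagonal(M, 0.0)
degrees = np.concatenate([A.sum(axis=1) for A in sample_graphs(M, 50)])
print(degrees.mean())   # should be close to 0.2 * 19 = 3.8
```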

Quantum machine learning


• Learn kernels by optimising the quantum Jensen-Shannon divergence.

• Work with density matrices rather than probability distributions.

• Quantum walkers can hit exponentially faster than classical ones on symmetric structure.

• Also sensitive to long range structure and interference effects.


Initialization

• Construct a compositional graph.

• Allow the initial walks to interfere.

• Emphasise constructive and destructive interference. For isomorphic graphs, the kernel maximises the distinguishability between the states of the time-averaged density matrices.

Bibliography

• C. Ye, R.C. Wilson, C. Comin, C. da F. Costa, E.R. Hancock, "Approximate von Neumann Entropy for Directed Graphs", Physical Review E, 89, 052804, 2014.

• L. Han, R.C. Wilson, E.R. Hancock, "Generative Graph Prototypes from Information Theory", IEEE TPAMI, 2015.

• L. Rossi, A. Torsello, E.R. Hancock, R.C. Wilson, "Characterizing Graph Symmetries through Quantum Jensen-Shannon Divergence", Physical Review E, 88, 032806, 2013.

• L. Bai, L. Rossi, E.R. Hancock, "An Aligned Subtree Kernel for Weighted Graphs", International Conference on Machine Learning (ICML), 2015.

Conclusions

• Shown how the von Neumann entropy can be used as a characterisation of graph complexity for component analysis, kernel construction and structural learning.

• Presented an MDL framework which uses this complexity characterisation to learn a generative model of graph structure.

• Future: Deeper measures of structure (symmetry) and detailed dynamics of network evolution.
