efficiently learning structure · 2016-08-03 · edwin hancock department of computer science...
TRANSCRIPT
Efficiently Learning Structure
Edwin Hancock Department of Computer Science
University of York
Supported by a Royal Society Wolfson Research Merit Award
Structural Variations
Variation in object appearance
• These graphs represent different views of the same object.
• They vary in detail, but represent the same object in different poses.
Protein-Protein Interaction Networks
Variations in complexity
• These graphs represent PPIs from organisms at different stages of evolution.
• The organisms and their PPIs differ in structure and complexity.
Financial Data
• Represents trading on the NYSE over a 6000-day period.
• Nodes are stocks; edges indicate that closing-price time series are correlated.
• Modularity (structure) of network changes sharply during trading crises.
Questions
• Can we learn models of structure when there are both different variants of the same object present and objects of different intrinsic complexity?
• How do we capture variance and complexity at the structural level?
Graph data
• Problems based on graphs arise in areas such as language processing, proteomics/chemoinformatics, data mining, computer vision and complex systems.
• Relatively little methodology available, and vectorial methods from statistical machine learning are not easily applied since there is no canonical ordering of the nodes in a graph.
• Can make considerable progress if we develop permutation-invariant characterisations of variations in graph structure.
Characterising graphs
• Topological: e.g. average degree, degree distribution, edge density, diameter, cycle frequencies etc.
• Spectral or algebraic: use eigenvalues of adjacency matrix or Laplacian, or equivalently the co-efficients of characteristic polynomial.
• Complexity: use information theoretic measures of structure (e.g. Shannon entropy).
Complex systems
• Spatial and topological indices: node degree statistics; edge density.
• Communicability: communities, measures of centrality, separation, etc. (Barabási, Watts and Strogatz, Estrada).
• Processes on graphs: Markov process, Ising models, random walks, searchability (Kleinberg).
Links explored in this talk
• Structure: discriminate between graphs on the basis of their detailed structure.
• Complexity: determine whether different non-isomorphic structures are of similar or different intrinsic complexity.
• Learning: learn generative model of structure that gives minimum complexity description of training data (MDL).
Complexity
Information theory, graphs and kernels.
Graph Entropy
• Entropic measures of complexity: many possibilities - Shannon, Erdős-Rényi, von Neumann.
• Problems: difficult to compute for graphs; require either a probability distribution over the nodes of the graph or a combinatorial (micro-canonical state) characterisation. Remains an open problem in the literature.
• Uses: complexity-level analysis of graphs, learning structure via description length, constructing information-theoretic kernels.
Entropy
• Thermodynamics: a measure of disorder in a system. The change of entropy with energy measures the temperature of the system: dE = T dH.
• Statistical mechanics: entropy measures the uncertainty over the microstates of a system, H = -k Σ_i p_i ln p_i (Boltzmann).
• Quantum mechanics: mixture of states, H = -k Tr[ρ ln ρ], in terms of the density matrix ρ (von Neumann).
• Information theory: Shannon information, H = -Σ_i p_i ln p_i, in terms of the probability of transmission of a message through an information channel.
Von Neumann entropy
• Passerini and Severini - the normalised Laplacian L̂ = D^{-1/2}(D - A)D^{-1/2}, scaled by 1/|V|, is a density matrix for the graph.
• Exploited to compute approximate VN entropy for undirected graphs by Han et al (PRL) 2013 and by Cheng et al (Phys Rev E 2014) for directed graphs.
• Used for graph kernel construction (Bai JMIV 2013) and learning generative models of graphs (Han SIMBAD 2011).
• Recently used as unary feature to analyse and classify complex network time series.
Von Neumann entropy and node degree
Von-Neumann Entropy
• Passerini and Severini – normalised Laplacian is density matrix for graph Hamiltonian
• The associated von Neumann entropy is

  H_VN = - Σ_{i=1}^{|V|} (λ̂_i / |V|) ln(λ̂_i / |V|)

  where the λ̂_i are the eigenvalues of the normalised Laplacian

  L̂ = D^{-1/2}(D - A)D^{-1/2} = Φ̂ Λ̂ Φ̂^T
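As a concrete illustration (not from the slides), the Passerini-Severini entropy can be computed directly from the eigenvalues of the normalised Laplacian; the function name and the no-isolated-nodes assumption are ours:

```python
import numpy as np

def von_neumann_entropy(A):
    """Von Neumann entropy of an undirected graph, from the eigenvalues
    of the normalised Laplacian scaled by 1/|V| (Passerini-Severini
    density matrix). Assumes no isolated nodes."""
    d = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    # L_hat = D^{-1/2} (D - A) D^{-1/2}
    L_hat = np.diag(inv_sqrt) @ (np.diag(d) - A) @ np.diag(inv_sqrt)
    lam = np.linalg.eigvalsh(L_hat) / len(d)   # eigenvalues sum to one
    lam = lam[lam > 1e-12]                     # 0 ln 0 = 0 by convention
    return float(-np.sum(lam * np.log(lam)))
```

For the triangle graph the scaled spectrum is {0, 1/2, 1/2}, so the entropy is ln 2.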
Simplified entropy
The quadratic approximation of the von Neumann entropy reduces to

  H_VN ≈ 1 - 1/|V| - (1/|V|²) Σ_{(u,v)∈E} 1/(d_u d_v)

• Computed in quadratic time; most spectral methods are at least cubic.
• Some graph-entropy computations are combinatorial.
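A minimal sketch of the degree-based approximation, assuming an edge list and a degree map (names are illustrative):

```python
def vn_entropy_approx(edges, degree, n):
    """Quadratic approximation of the von Neumann entropy:
    H ≈ 1 - 1/|V| - (1/|V|^2) * sum over edges (u, v) of 1/(d_u * d_v).
    Runs in O(|E|) once the degrees are known."""
    s = sum(1.0 / (degree[u] * degree[v]) for u, v in edges)
    return 1.0 - 1.0 / n - s / (n * n)
```

For the triangle graph (three edges, all degrees 2) this gives 1 - 1/3 - 1/12 = 7/12, close to the exact value ln 2.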
Properties
• Based on degree statistics.
• Extremal values for cycle and star graphs.
• Can be used to distinguish Erdős-Rényi, small-world, and scale-free networks.
Uses
• Complexity-based clustering (especially protein-protein interaction networks).
• Defining information theoretic (Jensen-Shannon) kernels.
• Controlling complexity of generative models of graphs.
Entropy component analysis
• For each graph, construct a 2D histogram indexed by the node degrees of each edge.
• Increment each bin by the entropies of edges with the relevant degree configuration.
• Vectorise the histogram bin contents and perform PCA on the sample of vectors for different graphs.
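The three steps above can be sketched as follows; the bin size, the use of the edge-entropy contribution 1/(|V|² d_u d_v), and the plain SVD-based PCA are our assumptions:

```python
import numpy as np

def eca_features(graphs, max_deg=20):
    """Entropy component analysis sketch. Each graph is given as
    (edges, degree, n). Builds a 2D histogram over the degree pair of
    each edge, accumulating the edge entropy term 1/(|V|^2 d_u d_v),
    then vectorises the bins and projects with PCA (via SVD)."""
    feats = []
    for edges, degree, n in graphs:
        h = np.zeros((max_deg, max_deg))
        for u, v in edges:
            i = min(degree[u], max_deg) - 1
            j = min(degree[v], max_deg) - 1
            h[i, j] += 1.0 / (degree[u] * degree[v] * n * n)
        feats.append(h.ravel())          # vectorised histogram bins
    X = np.vstack(feats)
    Xc = X - X.mean(axis=0)              # centre the sample
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                 # leading two components
```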
Financial Market Data
• Look at time series correlation for set of leading stocks.
• Create undirected or directed links on basis of time series correlation.
• Directed von Neumann entropy changes during Black Monday, 1987.
• The entropy shows a sharp drop on Black Monday and recovers within a few trading days.
Black Monday
Fig. 4. PCA plot for directed graph embedding on financial stock market data. Black: Black Monday, cyan and green: background, blue: dot-com bubble, red: subprime crisis.
The four clusters representing different eras are clearly seen and Black Monday is also separated, implying our graph characterization is effective.
PCA applied to entropy feature vectors: distinct epochs of market evolution occupy different regions of the subspace and can be separated. Black Monday is a clear outlier. There appears to be some underlying manifold structure.
Graph kernels
Jensen-Shannon Kernel
• Defined in terms of J-S divergence
• Properties: extensive, positive semidefinite.
• JSD is the difference between the entropy of the graph union and the mean of the individual graph entropies.
  K_JS(G_i, G_j) = ln 2 - JS(G_i, G_j)

  JS(G_i, G_j) = H(G_i ⊕ G_j) - ½ { H(G_i) + H(G_j) }
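A hedged sketch of the kernel, reading the "union" as the equal mixture of the two Passerini-Severini density matrices (one common construction; the slides do not fix the details, and equal-size graphs are assumed):

```python
import numpy as np

def density_matrix(A):
    # Passerini-Severini: normalised Laplacian scaled to unit trace
    d = A.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1.0)), 0.0)
    L_hat = np.diag(inv_sqrt) @ (np.diag(d) - A) @ np.diag(inv_sqrt)
    return L_hat / np.trace(L_hat)

def vn_entropy(rho):
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]               # 0 ln 0 = 0 by convention
    return float(-np.sum(lam * np.log(lam)))

def js_kernel(A_i, A_j):
    """K_JS = ln 2 - JS, where JS is the entropy of the mixture of the
    two density matrices minus the mean of the individual entropies."""
    rho_i, rho_j = density_matrix(A_i), density_matrix(A_j)
    js = vn_entropy(0.5 * (rho_i + rho_j)) \
         - 0.5 * (vn_entropy(rho_i) + vn_entropy(rho_j))
    return float(np.log(2.0) - js)
```

With this construction JS lies in [0, ln 2], so the kernel value is maximal (ln 2) for identical graphs.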
Structural Learning
Deep learning
• Deep belief networks (Hinton 2006, Bengio 2007).
• Compositional networks (Amit+Geman 1999, Fergus 2010).
• Markov models (Leonardis 200
• Stochastic image grammars (Zhu, Mumford, Yuille)
• Taxonomy/category learning (Todorovic+Ahuja, 2006-2008).
Description length
• Wallace+Freeman: minimum message length.
• Rissanen: minimum description length.
Use log-posterior probability to locate model that is optimal with respect to code-length.
Similarities/differences
• MDL: selection of model is aim; model parameters are simply a means to this end. Parameters usually maximum likelihood. Prior on parameters is flat.
• MML: Recovery of model parameters is central. Parameter prior may be more complex.
Coding scheme
• Usually assumed to follow an exponential distribution.
• Alternatives are universal codes and predictive codes.
• MML has two part codes (model+parameters). In MDL the codes may be one or two-part.
Method
• Model is a supergraph (i.e. a graph prototype) formed by graph union.
• Sample data observation model: Bernoulli distribution over nodes and edges.
• Model complexity: von Neumann entropy of the supergraph.
• Fitting criterion: MDL-like - make ML estimates of the Bernoulli parameters; MML-like - two-part code for data-model fit + supergraph complexity.
Model overview
• Description length criterion: code length = negative log-likelihood + model code length (entropy):

  L(G, Γ) = LL(G | Γ) + H(Γ)

• Data set: set of graphs G.
• Model: prototype graph + correspondences with it.
• Updates by expectation maximisation: model graph adjacency matrix (M-step) + correspondence indicators (E-step).
Data Codelength
• Depends on correspondences s between data-graph adjacency matrix elements D and model-graph adjacency matrix elements M
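A sketch of the data codelength under the Bernoulli observation model; the function and argument names are illustrative, and the model codelength H(Γ) is omitted:

```python
import numpy as np

def data_codelength(D, M, s):
    """Negative Bernoulli log-likelihood of data-graph adjacency matrix
    D under supergraph edge probabilities M, given a correspondence s
    mapping each data node to a model node (illustrative sketch)."""
    eps = 1e-9
    n = len(D)
    total = 0.0
    for a in range(n):
        for b in range(n):
            # clamp the Bernoulli parameter away from 0 and 1
            p = min(max(M[s[a], s[b]], eps), 1.0 - eps)
            # code length -ln p(D_ab) for the observed entry
            total -= D[a, b] * np.log(p) + (1 - D[a, b]) * np.log(1 - p)
    return float(total)
```

For a two-node data graph and uniform edge probabilities of 0.5, every entry costs ln 2 nats, giving 4 ln 2 in total.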
Experiments
Delaunay graphs from images of different objects.
COIL dataset Toys dataset
Experiments---validation
■ COIL dataset: model complexity increases, graph data log-likelihood increases, and overall code length decreases during iterations.
■ Toys dataset: model complexity decreases, graph data log-likelihood increases, and overall code length decreases during iterations.
Experiments---classification task
We compare the performance of our learned supergraph on a classification task with two alternative constructions: the median graph and the supergraph learned without using MDL. The table below shows the average classification rates from 10-fold cross-validation, followed by their standard errors.
Experiments---graph embedding
• Pairwise graph distance based on the Jensen-Shannon divergence and the von Neumann entropy of graphs.
• Compute a supergraph for each pair of graphs.
Experiments---graph embedding
Edit distance JSD distance
Generative model
• Train on graphs with set of predetermined characteristics.
• Sample using Monte-Carlo.
• Reproduces characteristics of the training set, e.g. spectral gap, node degree distribution, etc.
Erdos Renyi
Barabasi Albert (scale free)
Delaunay Graphs
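Sampling from the learned generative model can be sketched as a simple Bernoulli/Monte Carlo draw over edges (the edge-probability matrix M is assumed to come from the fitted supergraph):

```python
import numpy as np

def sample_graph(M, rng=None):
    """Draw one undirected graph from the supergraph model: edge (u, v)
    is included independently with Bernoulli probability M[u, v]."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(M)
    A = np.zeros((n, n), dtype=int)
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < M[u, v]:
                A[u, v] = A[v, u] = 1
    return A
```

With a constant matrix M this reduces to Erdős-Rényi sampling; structured M reproduces the characteristics of the training set.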
Quantum machine learning
Quantum machine learning
• Learn kernels by optimising the quantum Jensen-Shannon divergence.
• Work with density matrices rather than probability distributions.
• Quantum walkers can hit exponentially faster than classical ones on symmetric structure.
• Also sensitive to long range structure and interference effects.
Initialization
• Construct a compositional graph.
• Allow the initial walks to interfere.
• Emphasise constructive and destructive interference. For isomorphic graphs, the kernel maximises the distinguishability between the states of the time-averaged density matrices.
Bibliography
• C. Ye, R.C. Wilson, C. Comin, L. da F. Costa, E.R. Hancock, "Approximate von Neumann Entropy for Directed Graphs", Physical Review E, 89, 052804, 2014.
• L. Han, R.C. Wilson, E.R. Hancock, "Generative Graph Prototypes from Information Theory", IEEE TPAMI, 2015.
• L. Rossi, A. Torsello, E.R. Hancock, R.C. Wilson, "Characterizing Graph Symmetries through Quantum Jensen-Shannon Divergence", Physical Review E, 88, 032806, 2013.
• L. Bai, L. Rossi, E.R. Hancock, "An Aligned Subtree Kernel for Weighted Graphs", International Conference on Machine Learning (ICML), 2015.
Conclusions
• Shown how the von Neumann entropy can be used as a characterisation of graph complexity for component analysis, kernel construction and structural learning.
• Presented MDL framework which uses complexity characterisation to learn generative model of graph structure.
• Future: Deeper measures of structure (symmetry) and detailed dynamics of network evolution.