Spectral Approaches to Learning Predictive Representations
Thesis Proposal

Byron Boots
Machine Learning Department
Carnegie Mellon University

June, 2011

Thesis Committee:
Geoffrey J. Gordon (Chair)
J. Andrew Bagnell
Dieter Fox, University of Washington
Arthur Gretton, University College London

Abstract

A central problem in artificial intelligence is to choose actions to maximize reward in a partially observable, uncertain environment. To do so, we must obtain an accurate environment model, and then plan to maximize reward. However, for complex domains, specifying a model by hand can be a time-consuming process. This motivates an alternative approach: learning a model directly from observations. Unfortunately, learning algorithms often recover a model that is too inaccurate to support planning or too large and complex for planning to succeed; or, they require excessive prior domain knowledge or fail to provide guarantees such as statistical consistency.

To address this gap, we propose spectral subspace identification algorithms which provably learn compact, accurate, predictive models of partially observable dynamical systems directly from sequences of action-observation pairs. Our research agenda includes several variations of this general approach: batch algorithms and online algorithms, kernel-based algorithms for learning models in high- and infinite-dimensional feature spaces, and manifold-based identification algorithms. All of these approaches share a common framework: they are statistically consistent, computationally efficient, and easy to implement using established matrix-algebra techniques. Additionally, we show that our framework generalizes a variety of successful spectral learning algorithms in diverse areas, including the identification of Hidden Markov Models, recovering structure from motion, and discovering manifold embeddings. We will evaluate our learning algorithms on a series of prediction and planning tasks involving simulated data and real robotic systems.

We anticipate several difficulties while moving from smaller problems and synthetic problems to larger practical applications. The first is the challenge of scaling learning algorithms up to the higher-dimensional state spaces that more complex tasks require. The second is the problem of integrating expert knowledge into the learning procedure. The third is the problem of properly accounting for actions and exploration in controlled systems. We believe that overcoming these remaining difficulties will allow our models to capture the essential features of an environment, predict future observations well, and enable successful planning.




Contents

1 Introduction
2 Thesis
3 Related Work
    3.1 Partially Observable Markov Decision Processes and Predictive State Representations
        3.1.1 Transformed PSRs
    3.2 Planning in Learned Models
    3.3 Spectral Dimensionality Reduction and Manifold Learning
    3.4 Subspace Identification
    3.5 Probability Distributions in Hilbert Space
4 Completed Work
    4.1 Reduced-Rank Hidden Markov Models
    4.2 Learning Predictive State Representations
        4.2.1 Learning TPSRs
        4.2.2 Connections to Robotics
    4.3 Planning in Learned Predictive State Representations
        4.3.1 Value Iteration
        4.3.2 Policy Iteration
    4.4 Learning in Infinite-dimensional Feature Spaces
    4.5 Online Learning Algorithms for PSRs
    4.6 Two-Manifold Problems: Dynamical Systems on the Manifold
5 Proposed Work
    5.1 Incorporating Information into the Learning Process
    5.2 Hilbert Space Embeddings of Predictive State Representations
    5.3 Robotics Applications
6 Timeline


1 Introduction

Modeling partially-observable discrete-time dynamical systems under uncertainty is an important endeavor, occupying a central position in a wide range of scientific and engineering fields including computer vision, robotics, econometrics and financial modeling, speech and language modeling, and bioinformatics. Typically, building accurate models requires a great deal of domain knowledge and engineering know-how accumulated over time by careful experimentation. For well-studied domains, a dynamical system may be characterized as a function of a small set of latent variables, and it may be possible to specify the system completely by hand. However, for complex domains, manually searching for and specifying the latent variables and dynamics of a system is a difficult and time-consuming process. This motivates an alternative approach: learning models of dynamical systems, either partially or completely, from observations.

Early successes in learning models of dynamical systems have focussed on latent variable models like Hidden Markov Models (HMMs) and Linear Dynamical Systems (LDSs), which model the joint distribution of latent variables and observations. Different assumptions about the latent variable lead to either the HMM or the LDS, each with distinct characteristics, advantages and disadvantages. Such models showed promise, but, at first, proved to be difficult to learn.

A breakthrough came with the advent of subspace identification (SSID) methods for LDSs [1]. Subspace identification methods calculate LDS parameters through a spectral decomposition of a matrix of observations to yield an estimate of the underlying state space, and then derive parameter estimates using least squares. Here we use the most straightforward such technique, which relies on the singular value decomposition (SVD) [2], although there are variations based on canonical correlation analysis (CCA) [3] or reduced-rank regression (RRR) [4]. See [1, 5] for variations. SSID quickly became the standard way to find parameters of an LDS due to its low computational cost and the ease and robustness of its implementation. Recently, SSID algorithms were also derived for HMMs [6], conveying the same benefits. However, since they are not based on optimizing or integrating a likelihood, SSID techniques have proven difficult to integrate with other machine learning techniques (e.g., graphical models). This problem has slowed adoption of SSID within the machine learning community.

At the same time as SSID techniques were being developed for latent variable models, new types of predictive models were proposed: Observable Operator Models (OOMs), Linear Predictive State Representations (PSRs), and Predictive Linear Gaussians (PLGs). These models have been shown to be equally powerful as (and often more compact than) popular dynamical system models like Partially Observable Markov Decision Processes (POMDPs) and LDSs. PSRs generalize discrete-observation HMMs, and PLGs subsume the LDS model. Instead of modeling state by a latent variable, however, predictive models model the state of the dynamical system by a set of statistics defined on future observable events. This dependence on observable quantities promised to make it easier to learn consistent parameter settings and avoid local minima in predictive models [7], although initially, well-developed learning algorithms were scarce.

As we will see, spectral SSID learning algorithms and PSR models are very well matched: SSID naturally produces a predictive representation of the state of a dynamical system, and PSRs are naturally defined in terms of a subspace in which distributions over future observable events are embedded. This connection leads to the central thesis of our work. In the next section we explore this connection more deeply.

2 Thesis

We consider the problem of modeling a dynamical system when the state $s \in S$ is partially observable, and when the parameters of the system are unknown. We receive information about $s$ by taking actions $a \in A$ and receiving observations $o \in O$. In this case, the information that we have about state is not an element of the unobserved set $S$, but rather a history (an ordered sequence of action-observation pairs $h = [a^h_1, o^h_1, \ldots, a^h_t, o^h_t]$ that have been executed and observed prior to time $t$). Our task is then to use (features of) a history to predict the future. In particular, we define a state to be a set of features of history that are sufficient to define the probability of any future observable event.

Figure 1: A general principle for modeling state. [Diagram labels: data about past (many samples), compress, bottleneck (state), expand, predict, data about future (many samples).]

The central thesis of this work is that we can approach the problem of finding a good set of features, and thus a predictive state, from a bottleneck perspective. That is, given some signal from history, in this case a large arbitrary set of features, we would like to find a compression that preserves only relevant information for predicting features of the future. This idea is illustrated in Figure 1. If we think of the bottleneck as a linear compression of features of history, then we are attempting to identify a predictive subspace of these features. If the bottleneck is defined to be a rank constraint on a covariance matrix of features of histories and features of futures, then the subspace can be identified by means of a spectral method such as SVD, CCA, or RRR.
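To make the bottleneck idea concrete, the following is a minimal NumPy sketch of the SVD variant, not the learning algorithm developed later in this proposal. It assumes synthetic Gaussian data in which future features depend on history features only through a two-dimensional state; the function name `predictive_subspace`, the bottleneck size `k`, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np

def predictive_subspace(H, F, k):
    """Return a (d_h x k) linear compression of history features.

    H: (n x d_h) history features, one row per sample.
    F: (n x d_f) future features, aligned row by row with H.
    k: bottleneck dimension (hypothetical name used in this sketch).
    """
    n = H.shape[0]
    Hc = H - H.mean(axis=0)
    Fc = F - F.mean(axis=0)
    C = Hc.T @ Fc / n                          # empirical cross-covariance
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s                         # top-k left singular vectors

# Synthetic data: futures depend on histories only through a 2-d state.
rng = np.random.default_rng(0)
n, d_h, d_f, k = 5000, 10, 8, 2
H = rng.standard_normal((n, d_h))
W_in = rng.standard_normal((d_h, k))           # history features -> state
W_out = rng.standard_normal((k, d_f))          # state -> future features
F = H @ W_in @ W_out + 0.1 * rng.standard_normal((n, d_f))

U_k, s = predictive_subspace(H, F, k)
state = H @ U_k                                # compressed (n x k) state estimate
print(np.round(s, 3))                          # singular values drop sharply after the k-th
```

The rank constraint shows up as a gap in the singular values: only `k` of them are far from zero, so keeping the top `k` left singular vectors discards directions of history that carry no information about the future.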

There are many other ways to learn models of dynamical systems, including maximum likelihood via EM [8], gradient descent [9], etc.; or Bayesian inference via Gibbs sampling [10], Metropolis-Hastings [11], etc. In contrast to these algorithms, we will show:

• Predictive dynamical system models can be learned via spectral learning algorithms with no local optima and large gains in computational efficiency.

• Nonparametric (kernel-based) versions of these learning algorithms handle near-arbitrary observation distributions.

• One general principle yields algorithms for HMMs, OOMs and PSRs, structure from motion, range-only SLAM, LDS identification, value function approximation, etc.

• Good results can be obtained from a general purpose machine learning algorithm on problems typically tackled by lots of engineering.

The remainder of this proposal is organized as follows. We first review three of the main building blocks of our work: predictive models of dynamical systems, spectral dimensionality reduction algorithms, and Hilbert space embeddings of probability distributions. We then present our contributions to the literature on learning models of dynamical systems, including a powerful new class of models called reduced-rank Hidden Markov Models, spectral learning algorithms for PSRs, kernel-based methods for learning HMMs in Hilbert spaces, online and approximate variations of PSR learning algorithms, and manifold learning methods for dynamical system modeling. Finally, we propose several new directions of research that will complete the thesis and suggest a timeline for the remainder of this work.


3 Related Work

We briefly describe several major areas of research that we draw on and contribute to in our own work. First we discuss dynamical system models, with special focus on predictive state representations, which represent belief as a set of probabilities assigned to observable quantities. We also discuss spectral approaches to reducing the dimensionality of observations, and subspace identification, which leverages these spectral algorithms for learning dynamical system models. These ideas are critical to our work in Section 4, where we show how subspace identification can be used to learn models of predictive state representations. Finally we discuss how kernel methods can be used to reason about probability distributions embedded in high- or infinite-dimensional Hilbert spaces, a technique that we use to build powerful non-parametric SSID algorithms.

3.1 Partially Observable Markov Decision Processes and Predictive State Representations

Partially Observable Markov Decision Processes (POMDPs) [12, 13] are a general framework for single-agent planning. POMDPs model the state of the world as a latent variable and explicitly reason about uncertainty in both action effects and state observability. Plans in POMDPs are expressed as policies, which specify the action to take given any possible probability distribution over states. Unfortunately, exact planning algorithms such as value iteration [12] are computationally intractable for most realistic POMDP planning problems. Furthermore, researchers have had only limited success learning POMDP models from data. There are arguably two primary reasons for these problems [14]. The first is the "curse of dimensionality": for a POMDP with $n$ states, the optimal policy is a function of an $(n-1)$-dimensional distribution over latent state. The second is the "curse of history": the number of distinct policies increases exponentially in the planning horizon.

Predictive State Representations (PSRs) [15] and the closely related Observable Operator Models (OOMs) [16] are generalizations of POMDPs that have attracted interest because they both have greater representational capacity than POMDPs and yield representations that are at least as compact [7, 17]. In contrast to the latent-variable representations of POMDPs, PSRs and OOMs represent the state of a dynamical system by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Because tests and histories are observable quantities, it has been suggested that learning PSRs and OOMs should be easier than learning POMDPs. A final benefit of PSRs and OOMs is that many successful approximate planning techniques for POMDPs can be used to plan in PSRs and OOMs with minimal adjustment. Accordingly, PSR and OOM models of dynamical systems have potential to overcome both the curse of dimensionality, by compactly representing state, and the curse of history, by utilizing approximate planning techniques.

Formally, a PSR consists of five elements $\langle A, O, Q, m_1, F \rangle$. $A$ is a finite set of possible actions, and $O$ is a finite set of possible observations. $Q$ is a core set of tests, i.e., a set whose vector of predictions $Q(h)$ is a sufficient statistic for predicting the success probabilities $\tau(h)$ of all tests $\tau$. $F$ is the set of functions $f_\tau$ which embody these predictions: $\tau(h) = f_\tau(Q(h))$. And, $m_1 = Q(\epsilon)$ is the initial prediction vector. In this work we will restrict ourselves to linear PSRs, in which all prediction functions are linear: $f_\tau(Q(h)) = r_\tau^\top Q(h)$ for some vector $r_\tau \in \mathbb{R}^{|Q|}$. Finally, a core set $Q$ for a linear PSR is said to be minimal if the tests in $Q$ are linearly independent [16, 7], i.e., no one test's prediction is a linear function of the other tests' predictions.

Since $Q(h)$ is a sufficient statistic for all tests, it is a state for our PSR: i.e., we can remember just $Q(h)$ instead of $h$ itself. After action $a$ and observation $o$, we can update $Q(h)$ recursively: if we write $M_{ao}$ for the matrix with rows $r_{ao\tau}^\top$ for $\tau \in Q$, then we can use Bayes' Rule to show:

$$Q(hao) = \frac{M_{ao} Q(h)}{\Pr[o \mid h, \mathrm{do}(a)]} = \frac{M_{ao} Q(h)}{m_\infty^\top M_{ao} Q(h)} \tag{1}$$

where $m_\infty$ is a normalizer, defined by $m_\infty^\top Q(h) = 1$ for all $h$, and $hao$ is a history extended by action $a$ and observation $o$.

In addition to the above PSR parameters, we need a few additional definitions for reinforcement learning: a reward function $R(h) = \eta^\top Q(h)$ mapping predictive states to immediate rewards, a discount factor $\gamma \in [0, 1]$ which weights the importance of future rewards vs. present ones, and a policy $\pi(Q(h))$ mapping from predictive states to actions. (Specifying a reward in terms of the core test predictions $Q(h)$ is fully general: e.g., if we want to add a unit reward for some test $\tau \notin Q$, we can instead equivalently set $\eta := \eta + r_\tau$, where $r_\tau$ is defined (as above) so that $\tau(h) = r_\tau^\top Q(h)$.)
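The recursive update in Eq. 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a learned model: the operators `M[(a, o)]` are built from a small hypothetical two-state, two-action, two-observation system whose numbers are made up, so the state vector here behaves like a belief vector standing in for $Q(h)$, and the normalizer $m_\infty$ is the all-ones vector.

```python
import numpy as np

def psr_update(q, M_ao, m_inf):
    """One step of Eq. 1: returns (Q(hao), Pr[o | h, do(a)])."""
    unnormalized = M_ao @ q
    prob = float(m_inf @ unnormalized)        # denominator of Eq. 1
    if prob <= 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / prob, prob

# Hypothetical system (illustrative numbers only).
# T[a][i, j] = Pr[next state i | state j, action a]; columns sum to 1.
T = {0: np.array([[0.9, 0.2], [0.1, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
# O[o, i] = Pr[observation o | state i]; columns sum to 1.
O = np.array([[0.8, 0.3], [0.2, 0.7]])
# Operator for "take action a, then observe o".
M = {(a, o): np.diag(O[o]) @ T[a] for a in T for o in range(2)}
m_inf = np.ones(2)                            # satisfies m_inf^T q = 1
q = np.array([0.5, 0.5])                      # initial state m_1

history = [(0, 1), (1, 0), (0, 0)]            # (action, observation) pairs
for a, o in history:
    q, p = psr_update(q, M[(a, o)], m_inf)
    print(f"a={a} o={o} Pr[o | h, do(a)]={p:.4f} state={np.round(q, 4)}")
```

Dividing by $m_\infty^\top M_{ao} Q(h)$ keeps the state normalized, so the same quantity serves both as the one-step observation probability and as the normalizer.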

3.1.1 Transformed PSRs

Transformed PSRs (TPSRs) [18, 19] are a generalization of PSRs: for any invertible matrix $J$, if the parameters $m_1$, $M_{ao}$, and $m_\infty$ represent a PSR, then the transformed parameters $b_1 = J m_1$, $B_{ao} = J M_{ao} J^{-1}$, and $b_\infty = J^{-\top} m_\infty$ represent an equivalent TPSR. In addition to the initial TPSR state $b_1$, we define normalized conditional internal states $b_t$, which we can update similarly to Eq. 1:

$$b_{t+1} \equiv \frac{B_{ao_{1:t}} b_1}{b_\infty^\top B_{ao_{1:t}} b_1} = \frac{B_{a_t o_t} b_t}{b_\infty^\top B_{a_t o_t} b_t} \tag{2}$$

Pairs $J^{-1} J$ cancel during the update, showing that predictions are equivalent as claimed:

$$\Pr[o_{1:t} \mid \mathrm{do}(a_{1:t})] = m_\infty^\top M_{ao_{1:t}} m_1 = m_\infty^\top J^{-1} J M_{ao_{1:t}} J^{-1} J m_1 = b_\infty^\top B_{ao_{1:t}} b_1 \tag{3}$$
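The cancellation in Eq. 3 is easy to check numerically. The sketch below assumes a hypothetical single-action system (so operators are indexed by observation only) and an arbitrary fixed invertible matrix `J`; all numbers are made up for illustration.

```python
import numpy as np

# Hypothetical PSR-style parameters (illustrative numbers only).
T = np.array([[0.9, 0.2], [0.1, 0.8]])        # columns sum to 1
O = np.array([[0.8, 0.3], [0.2, 0.7]])        # O[o, i] = Pr[o | state i]
M = [np.diag(O[o]) @ T for o in range(2)]     # one operator per observation
m1 = np.array([0.6, 0.4])
m_inf = np.ones(2)

# Any invertible J gives an equivalent TPSR; this one has determinant -3.5.
J = np.array([[2.0, 1.0], [0.5, -1.5]])
J_inv = np.linalg.inv(J)
b1 = J @ m1
B = [J @ M_o @ J_inv for M_o in M]
b_inf = J_inv.T @ m_inf                       # b_inf = J^{-T} m_inf

def sequence_probability(operators, init, normalizer, observations):
    """Compute normalizer^T * (product of operators) * init, as in Eq. 3."""
    v = init
    for o in observations:
        v = operators[o] @ v
    return float(normalizer @ v)

obs = [0, 1, 1, 0]
p_psr = sequence_probability(M, m1, m_inf, obs)
p_tpsr = sequence_probability(B, b1, b_inf, obs)
print(p_psr, p_tpsr)                          # agree up to floating-point error
```

The TPSR parameters look nothing like probabilities (entries of `b1` and `B` can be negative or larger than one), yet they assign the same probability to every observation sequence.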

By choosing the invertible transform $J$ appropriately, we can think of TPSRs as maintaining a small number of sufficient statistics which are linear combinations of predictions for a (potentially very large) core set of tests. As we show in our work [19], this view leads to the main benefit of TPSRs over regular PSRs: given a core set of tests, we can find low dimensional parameters using spectral methods and regression instead of combinatorial search. In this respect, TPSRs are closely related to the transformed representations of LDSs and HMMs found by subspace identification [1, 5, 20, 6]. Furthermore, to make it practical to work with data gathered from complex real-world systems, we can learn from finite-dimensional features of the past and future, rather than an extremely large or even infinite core set of tests. Additional details regarding the relationship between TPSRs and PSRs can be found in [19].

3.2 Planning in Learned Models

Planning a sequence of actions or a policy to maximize reward has long been considered a fundamental problem for autonomous agents. Generally, planning algorithms like value iteration [21, 22], policy iteration [23, 22], lookahead search [22], and policy gradient [24, 25] are applied to a known, accurate model. However, in the hardest version of the problem, an agent must form a plan based solely on its own experience, without the aid of a human engineer who can design problem-specific models, features or heuristics; it is this version of the problem which we must solve to build a truly autonomous agent.

The quality of an optimized policy for a POMDP, PSR, or OOM depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans. A fully autonomous agent must be able to learn models from data, but, due to the difficulty of learning, it is far more common to see planning algorithms applied to hand-specified models, and therefore to small systems where there is extensive and goal-relevant domain knowledge. For example, recent extensions of approximate planning techniques for PSRs have only been applied to hand-constructed models [26, 27].


Work that does learn models for planning in partially observable environments has so far met with only limited success. For example, Expectation-Maximization (or EM; see, e.g., [28]) does not avoid local minima or scale to large state spaces; and, although many learning algorithms have been proposed for PSRs [29, 30, 31, 32, 33] and OOMs [16, 34, 35], none have been shown to learn models that are accurate enough for planning. Several researchers have, however, made progress in the problem of planning using a learned model. In one instance [36], researchers obtained a POMDP heuristically from the output of a model-free algorithm [37] and demonstrated planning on a small toy maze. In another instance [11], researchers used Markov Chain Monte Carlo (MCMC) inference both to learn a factored Dynamic Bayesian Network (DBN) representation of a POMDP in a small synthetic network administration domain and to perform online planning. Due to the cost of the MCMC sampler used, this approach is still impractical for larger models. In a third example, researchers learned Linear-Linear Exponential Family PSRs from an agent traversing a simulated environment, and found a policy using a policy gradient technique with a parameterized function of the learned PSR state as input [38, 39]. In this case both the learning and the planning algorithm were subject to local optima. In addition, the authors determined that the learned model was too inaccurate to support value-function-based planning methods [39]. Finally, there is a successful line of research which computes closed-loop controllers from learned or partly-learned models, starting from linear subspace identification [1] and ranging to controllers for helicopters [40] and bird-like robots [41]. This line of research focuses on control-like problems, in which accurate state estimation and dealing with continuous controls are the main sources of difficulty, in contrast to the planning-like problems we are generally interested in, in which longer-term lookahead and discrete choices are more important.

3.3 Spectral Dimensionality Reduction and Manifold Learning

Here we briefly review a number of spectral approaches to dimensionality reduction that are fundamental for subspace identification approaches to learning models of dynamical systems.

Multidimensional Scaling and Principal Components Analysis. In classical multidimensional scaling (CMDS), the aim is to embed $n$ points in a low dimensional Euclidean space so that the inter-point distances in the low dimensional space are as close as possible to given dissimilarities between the points in the original space. Principal components analysis (PCA) is a method for finding a subspace of an input space that preserves the greatest possible fraction of the variance of a set of data points. PCA and CMDS are strongly related: if we start from a data matrix $X \in \mathbb{R}^{n \times m}$, where $n$ is the number of data points and $m$ is the number of features, CMDS effectively employs an eigendecomposition of the similarity matrix $X X^\top$, while PCA employs an eigendecomposition of the sample covariance matrix $\frac{1}{n} X^\top X$.
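The relationship between the two eigendecompositions can be checked directly. The sketch below assumes a centered random data matrix (all sizes are arbitrary choices) and compares the PCA projection with the embedding read off from the eigenvectors of $X X^\top$; the two agree up to the sign of each coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 5, 2                            # points, features, embedding dim
X = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))
X = X - X.mean(axis=0)                        # center the data

# PCA: eigendecomposition of the (m x m) sample covariance.
cov = X.T @ X / n
evals_cov, V = np.linalg.eigh(cov)            # eigh returns ascending order
order = np.argsort(evals_cov)[::-1]
evals_cov, V = evals_cov[order], V[:, order]
pca_embedding = X @ V[:, :k]

# CMDS: eigendecomposition of the (n x n) similarity matrix.
G = X @ X.T
evals_G, U = np.linalg.eigh(G)
order = np.argsort(evals_G)[::-1]
evals_G, U = evals_G[order], U[:, order]
cmds_embedding = U[:, :k] * np.sqrt(evals_G[:k])

# Eigenvectors are only defined up to sign, so align signs before comparing.
signs = np.sign(np.sum(pca_embedding * cmds_embedding, axis=0))
max_gap = np.max(np.abs(pca_embedding - cmds_embedding * signs))
print(max_gap)                                # tiny: the embeddings coincide
```

Both decompositions are views of the same SVD $X = U S V^\top$: the nonzero eigenvalues of $X X^\top$ equal $n$ times those of the sample covariance, which is why either matrix can be used depending on whether $n$ or $m$ is smaller.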

Singular Value Decomposition. Singular Value Decomposition (SVD) is a spectral decomposition of a (possibly) non-square matrix. We are especially interested in the SVD of a cross-covariance matrix $X^\top Y$, where $X$ and $Y$ are data matrices. The singular values of $X^\top Y$ are defined to be the square roots of the eigenvalues of $(X^\top Y)^\top (X^\top Y)$; the right singular vectors are the corresponding eigenvectors, and the left singular vectors are the eigenvectors of $(X^\top Y)(X^\top Y)^\top$.

One can also compute whitened variations of SVDs. Whitening means transforming a covariance matrix to the identity: define $L_x$ to be the lower triangular Cholesky factor of the empirical covariance $\frac{1}{n} X^\top X$. Then $X L_x^{-\top}$ is a whitening transformation of $X$: the covariance of $X L_x^{-\top}$ is $\frac{1}{n} L_x^{-1} X^\top X L_x^{-\top} = I$. Similarly, define $L_y$ to be the lower triangular Cholesky factor of $\frac{1}{n} Y^\top Y$. In a finite-dimensional space, SVD of the whitened cross-covariance matrix $L_x^{-1} X^\top Y L_y^{-\top}$ is called canonical correlation analysis (CCA) [3], and the resulting singular values are called canonical correlations. Whitening can also be used to find a factored coefficient matrix $\beta$ for reduced-rank regression (RRR) [4]: if we want to minimize the squared error in $Y \approx X\beta$ subject to the constraint $\mathrm{rank}(\beta) \le k$, we can compute the optimal $\beta$ from the first $k$ singular values and vectors of $L_x^{-1} X^\top Y$.
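As a sketch of the whitened SVD described above, the following NumPy code computes canonical correlations from Cholesky-whitened data. Two choices here are assumptions of the sketch rather than statements from the text: the cross-covariance is scaled by $1/n$ so that the singular values land in $[0, 1]$, and the synthetic data (one shared latent signal plus independent noise coordinates) is invented for illustration.

```python
import numpy as np

def cca_via_whitened_svd(X, Y):
    """Canonical correlations from the SVD of L_x^{-1} C_xy L_y^{-T}.

    X, Y: centered data matrices with one sample per row.
    """
    n = X.shape[0]
    Lx = np.linalg.cholesky(X.T @ X / n)      # lower triangular, Lx Lx^T = cov(X)
    Ly = np.linalg.cholesky(Y.T @ Y / n)
    Cxy = X.T @ Y / n                         # cross-covariance, scaled by 1/n
    W = np.linalg.solve(Lx, Cxy)              # L_x^{-1} C_xy
    W = np.linalg.solve(Ly, W.T).T            # ... times L_y^{-T}
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt, Lx, Ly

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 1))               # shared latent signal
X = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, 2))])
Y = np.hstack([z + 0.1 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, 1))])
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

U, s, Vt, Lx, Ly = cca_via_whitened_svd(X, Y)
Xw = np.linalg.solve(Lx, X.T).T               # whitened data X L_x^{-T}
print(np.round(s, 3))                         # one large correlation, one near zero
```

Using triangular solves instead of explicit inverses is a design choice for numerical stability; the first canonical correlation picks out the shared signal while the remaining one reflects only sampling noise.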

Kernel PCA. Kernel PCA is a generalization of PCA [42]: we first map our inputs to a higher-dimensional feature space $\mathcal{F}$ using a feature mapping $\phi : \mathbb{R}^d \to \mathcal{F}$, and then find the principal components in this new space. If the features are sufficiently expressive, kernel PCA can find structure that regular PCA misses. However, if $\mathcal{F}$ is high- or infinite-dimensional, PCA in $\mathcal{F}$ is in general intractable. Kernel PCA overcomes this problem by assuming that $\mathcal{F}$ is a reproducing-kernel Hilbert space (RKHS), and that the feature mapping $\phi$ is implicitly defined via an efficiently-computable kernel function $K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. Popular kernels include the linear kernel $K(x, x') = x \cdot x'$ (which identifies the feature space with the input space) and the RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$; kernels have also been defined on a variety of structured objects including strings and graphs.
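A minimal sketch of kernel PCA on a Gram matrix follows. It assumes the standard recipe of centering the kernel matrix in feature space and embedding training points with the scaled top eigenvectors; the bandwidth `gamma = 0.5`, the random data, and the helper names are hypothetical choices for this illustration.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))   # clip tiny negative distances

def kernel_pca(K, k):
    """Embed the training points using the top-k eigenpairs of the centered Gram matrix."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc = C @ K @ C                                # centers phi(x) in feature space
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:k]
    evals, evecs = evals[order], evecs[:, order]
    Z = evecs * np.sqrt(np.maximum(evals, 0.0))   # (n x k) embedding
    return Z, evals

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))

Z_rbf, evals_rbf = kernel_pca(rbf_kernel(X, gamma=0.5), k=2)

# With the linear kernel, kernel PCA reduces to ordinary PCA on centered data.
Z_lin, evals_lin = kernel_pca(X @ X.T, k=3)
print(Z_rbf.shape, Z_lin.shape)
```

Only the $n \times n$ Gram matrix is ever formed, so the dimension of the feature space never enters the computation; this is what makes the infinite-dimensional RBF feature space tractable.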

Manifold Learning. Manifold learning algorithms are non-linear methods for embedding a set of data points to a low-dimensional space while preserving the local geometry of the manifold on which the data points lie. Recently, there has been a great deal of interest in spectral approaches to learning manifolds. These algorithms may aptly be described as kernel eigenmap methods and include Isomap [43], Locally Linear Embedding (LLE) [44], Laplacian Eigenmaps (LE) [45], Maximum Variance Unfolding (MVU) [46], and Maximum Entropy Unfolding (MEU) [47]. Each of these approaches can be viewed as applying kernel principal component analysis [42] to a specific choice of kernel learned from the data [48].

3.4 Subspace Identification

Models of stochastic discrete-time dynamical systems have important applications in a wide range of fields. Hidden Markov Models (HMMs) [49] and Linear Dynamical Systems (LDSs) [50] are two examples of latent variable models which assume that sequential data points are noisy, incomplete observations of a latent state that evolves over time. HMMs are typically learned using Expectation-Maximization (EM) [49], which is prone to local optima, especially in large state spaces. On the other hand, LDSs are often learned using SSID [1]. The latter is a spectral method: it finds an approximate factorization of the estimated covariance between past and future observations by means of a (possibly whitened) SVD. And, it learns an observable representation, whose parameters can be simply related to directly-measurable quantities. In part because of these qualities, subspace ID is free of local optima and statistically consistent, though (unlike EM) it does not typically find even a local optimum of the log-likelihood for finite sample sizes. In general, it is possible to analyze the theoretical properties of subspace identification algorithms by bounding the error in the eigenvalues of the estimated covariance matrix of past and future observations. Recently, Hsu, Kakade and Zhang (HKZ for short) proposed a spectral algorithm which learns observable representations of HMMs [6]. The HKZ algorithm is free of local optima and statistically consistent, with a finite-sample bound on L1 error in joint probability estimates. However, learning large-state-space HMMs is still difficult: the number of parameters grows prohibitively with the size of the state space.
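To illustrate what an observable representation looks like, the sketch below follows the HKZ construction as we understand it from [6], with two simplifying assumptions that should be read as ours: the low-order moments are computed exactly from a small hypothetical HMM (made-up numbers) rather than estimated from data, and the transition and observation matrices have full column rank. Under those assumptions the spectral parameters reproduce the HMM's joint probabilities exactly.

```python
import itertools
import numpy as np

# Hypothetical HMM (illustrative numbers only).
T = np.array([[0.7, 0.2],
              [0.3, 0.8]])                    # T[i, j] = Pr[h' = i | h = j]
O = np.array([[0.6, 0.1],
              [0.3, 0.2],
              [0.1, 0.7]])                    # O[x, j] = Pr[x | h = j]
pi = np.array([0.5, 0.5])                     # initial state distribution
m, n_obs = 2, 3                               # hidden states, observation symbols

# Exact moments of the first one, two and three observations.
P1 = O @ pi                                   # P1[i] = Pr[x1 = i]
P21 = O @ T @ np.diag(pi) @ O.T               # P21[i, j] = Pr[x2 = i, x1 = j]
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T
        for x in range(n_obs)]                # Pr[x3 = i, x2 = x, x1 = j]

# Spectral step: top-m left singular vectors of P21.
U = np.linalg.svd(P21)[0][:, :m]

# Observable parameters, built only from moments of observations.
b1 = U.T @ P1
b_inf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n_obs)]

def spectral_joint(obs):
    """Pr[x_1, ..., x_t] from the observable representation."""
    v = b1
    for x in obs:
        v = B[x] @ v
    return float(b_inf @ v)

def forward_joint(obs):
    """Pr[x_1, ..., x_t] from the latent-variable HMM (forward algorithm)."""
    v = pi
    for x in obs:
        v = T @ (O[x] * v)
    return float(v.sum())

max_err = max(abs(spectral_joint(seq) - forward_joint(seq))
              for seq in itertools.product(range(n_obs), repeat=3))
print(max_err)                                # zero up to floating-point error
```

In practice the moments would be replaced by empirical frequencies, in which case the agreement is approximate and governed by the finite-sample bounds mentioned above.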

3.5 Probability Distributions in Hilbert Space

Often we are interested in representing probability distributions of past and future observations in terms of features. We can reason about these distributions by thinking about them as points embedded in some feature space. Therefore, recent work on representing and updating probability distributions embedded as points in Hilbert spaces [51, 52, 53] will be important for some of our work.

Let $\mathcal{F}$ be a reproducing kernel Hilbert space (RKHS) associated with kernel $K(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. Then for all functions $f \in \mathcal{F}$ and $x \in \mathcal{X}$ we have the reproducing property: $\langle f, \phi(x) \rangle_{\mathcal{F}} = f(x)$, i.e., the evaluation of function $f$ at $x$ can be written as an inner product. Let $\mathcal{P}$ be the set of probability distributions on $\mathcal{X}$, and $X$ the random variable with distribution $\mathbb{P} \in \mathcal{P}$. Following [51], we define the mapping of $\mathbb{P} \in \mathcal{P}$ to RKHS $\mathcal{F}$, $\mu_X := \mathbb{E}_{X \sim \mathbb{P}}[\phi(X)]$, as the Hilbert space embedding of $\mathbb{P}$, or simply the mean map. For all $f \in \mathcal{F}$, $\mathbb{E}_{X \sim \mathbb{P}}[f(X)] = \langle f, \mu_X \rangle_{\mathcal{F}}$ by the reproducing property. A characteristic RKHS is one for which the mean map is injective: that is, each distribution has a unique embedding [52]. This property holds for many commonly used kernels (e.g., the Gaussian and Laplace kernels when $\mathcal{X} = \mathbb{R}^d$).

As a special case of the mean map, the marginal probability vector of a discrete variable $X$ is a Hilbert space embedding, i.e. $(P(X = i))_{i=1}^{M} = \mu_X$. Here the kernel is the delta function $K(x, x') = \mathbb{I}[x = x']$, and the feature map is the 1-of-M representation for discrete variables.

Given $m$ i.i.d. observations $\{x_l\}_{l=1}^{m}$, an estimate of the mean map is straightforward: $\hat{\mu}_X := \frac{1}{m} \sum_{l=1}^{m} \phi(x_l) = \frac{1}{m} \Phi 1_m$, where $\Phi := (\phi(x_1), \ldots, \phi(x_m))$ is a conceptual arrangement of feature maps into columns. Furthermore, this estimate computes an approximation within an error of $O_p(m^{-1/2})$ [51].
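The estimate above can be sketched for the discrete (delta kernel) case, where the feature map is the 1-of-M encoding and the mean map is the marginal probability vector. This is a minimal illustration, not code from the proposal: the function names, the three-valued distribution, and the test function `f` are all assumptions chosen for the example.

```python
import numpy as np

def one_hot_features(xs, num_values):
    """Feature map for the delta kernel: column l is the 1-of-M encoding of xs[l]."""
    Phi = np.zeros((num_values, len(xs)))
    Phi[xs, np.arange(len(xs))] = 1.0
    return Phi

def empirical_mean_map(Phi):
    """Estimate mu_X = (1/m) * Phi @ 1_m from a feature matrix with one column per sample."""
    m = Phi.shape[1]
    return Phi @ np.ones(m) / m

rng = np.random.default_rng(0)
true_p = np.array([0.5, 0.3, 0.2])          # hypothetical distribution over 3 values
samples = rng.choice(3, size=20000, p=true_p)
mu_hat = empirical_mean_map(one_hot_features(samples, 3))

# With the delta kernel, a function f in the RKHS is a length-M vector,
# and E[f(X)] is approximated by the inner product <f, mu_hat>.
f = np.array([1.0, -2.0, 4.0])
expectation_via_embedding = f @ mu_hat
expectation_via_samples = f[samples].mean()
```

Here `mu_hat` approaches `true_p` as the sample size grows, and the two expectation estimates agree because the inner product with the embedding is just a reordering of the sample average.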

Mean maps corresponding to embeddings of joint distributions $P(X, Y)$ over two variables $X$ on $\mathcal{X}$ and $Y$ on $\mathcal{Y}$, called covariance operators, and of conditional distributions $P(Y \mid x)$, called conditional embedding operators, can be defined analogously. See [54] for details.

4 Completed Work

We briefly discuss several large pieces of work that we have completed toward developing a family of spectral algorithms for learning predictive representations of data. First we discuss Reduced-Rank Hidden Markov Models, a class of models closely related to PSRs that admits a computationally and statistically efficient spectral learning algorithm. This was our first attempt to generalize SSID algorithms from HMMs toward PSRs. We next discuss a generalization of this algorithm to PSRs in Section 4.2; this will form the centerpiece of our thesis. We also discuss the problem of planning in learned models and some of the approaches that we have taken to solve this problem. Finally we discuss several extensions to the learning algorithm, including kernel-based learning for infinite dimensional feature spaces, online and approximate learning algorithms for computationally efficient learning, and manifold dimensionality reduction approaches for learning PSRs from structured training data. These extensions allow us to handle continuous observations, to learn non-parametric models, and to manage large quantities of data, which substantially improves practical usability.

4.1 Reduced-Rank Hidden Markov Models

Reduced-Rank Hidden Markov Models (RR-HMMs) are smoothly evolving dynamical models with the ability to represent nonconvex predictive distributions by relating discrete-state and continuous-state models. HMMs can approximate smooth state evolution by tiling the state space with a very large number of low-observation-variance discrete states with a specific transition structure. However, inference and learning in such a model is highly inefficient due to the large number of parameters, and due to the fact that existing HMM learning algorithms, such as Expectation Maximization (EM) [49], are prone to local minima. RR-HMMs allow us to reap many of the benefits of large-state-space HMMs without incurring the associated inefficiency during inference and learning. We showed that all inference operations in the RR-HMM can be carried out in the low-dimensional space where the dynamics evolve, decoupling their computational cost from the number of hidden states. This makes rank-k RR-HMMs (equivalent to an HMM with a rank-k transition matrix and any number of states) as computationally efficient as k-state HMMs, but much more expressive. Though the RR-HMM is in itself novel, its low-dimensional $\mathbb{R}^k$ representation is related to existing models such as Predictive State Representations (PSRs) [15], Observable Operator Models (OOMs) [16], generalized HMMs [55], and weighted automata [56, 57], as well as the representation of LDSs learned using Subspace Identification [1].


[Figure 2 image omitted. Panels: A. HMM; B. Stable LDS; C. RR-HMM.]

Figure 2: State space manifold and video frames simulated by an HMM, a stable LDS, and an RR-HMM learned using clock pendulum video (manifold scales are arbitrary). (A) 10-state HMM. The number of discrete states is not sufficient to represent the dynamical system. (B) 10-dim LDS. The number of dimensions is sufficient to represent the dynamical system, but the linear-Gaussian assumption results in blurred simulations. (C) Rank 10 RR-HMM. Can represent the dynamics, and exhibits the competitive inhibition necessary to simulate from multimodal observation distributions.

To learn RR-HMMs from data, we adapted a recently proposed spectral learning algorithm by Hsu, Kakade and Zhang [6] (henceforth referred to as HKZ) that learns observable representations of HMMs using matrix decomposition and regression on empirically estimated observation probability matrices of past and future observations. An observable representation of an HMM allows us to model sequences with a series of operators without knowing the underlying stochastic transition and observation matrices. We showed how to generalize the HKZ bounds to the low-rank transition matrix case and derive tighter bounds that depend on the rank $k$ instead of the number of hidden states $m$, allowing us to learn rank-k RR-HMMs of arbitrarily large $m$ in $O(Nk^2)$ time, where $N$ is the number of samples.
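To make the idea of an observable representation concrete, the following sketch applies the HKZ construction [6] to a small, randomly generated full-rank HMM. It uses exact probability matrices rather than empirical estimates, and it is the plain HMM case rather than the reduced-rank variant described above. All names (`P1`, `P21`, `P3x1`, `prob_observable`, and so on) and the model sizes are assumptions made for illustration.

```python
import numpy as np

def random_stochastic(rows, cols, rng):
    """Random column-stochastic matrix (each column sums to 1)."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
m, n = 3, 5                               # hidden states, observation symbols
T = random_stochastic(m, m, rng)          # T[i, j] = Pr(next state i | state j)
O = random_stochastic(n, m, rng)          # O[x, j] = Pr(observation x | state j)
pi = random_stochastic(m, 1, rng)[:, 0]   # initial state distribution

# Exact low-order moments; in practice these are estimated from data.
P1 = O @ pi                               # P1[i] = Pr(x1 = i)
P21 = O @ T @ np.diag(pi) @ O.T           # P21[i, j] = Pr(x2 = i, x1 = j)
# P3x1[x][i, j] = Pr(x3 = i, x2 = x, x1 = j)
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(n)]

# Spectral step: project onto the top-m left singular vectors of P21.
U = np.linalg.svd(P21)[0][:, :m]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def prob_observable(seq):
    """Joint probability of seq using only the observable operators."""
    b = b1
    for x in seq:
        b = B[x] @ b
    return binf @ b

def prob_forward(seq):
    """Same probability from the true HMM parameters (forward algorithm)."""
    alpha = pi
    for x in seq:
        alpha = T @ (O[x] * alpha)
    return alpha.sum()
```

The two functions agree on any observation sequence even though `prob_observable` never touches `T`, `O`, or `pi` directly, which is the sense in which the representation is observable. For a rank-k model one would presumably keep only k singular vectors in `U`.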

We demonstrated that RR-HMMs are able to compactly model smooth evolution and competitive inhibition in a clock pendulum video (Figure 2), as well as in real-world mobile robot vision data captured in an office building and slot car inertial measurement data (Figure 6). Robot vision data and slot car data (and, in fact, most real-world multivariate time series data) exhibit smoothly evolving dynamics requiring multimodal predictive beliefs, for which RR-HMMs are particularly suited.

RR-HMMs are a special case of PSRs, and the RR-HMM learning algorithm is also a special case of the TPSR learning algorithm discussed below in Section 4.2.

4.2 Learning Predictive State Representations

The heart of our work is a spectral learning algorithm for PSRs. For some PSR, let $\mathcal{Q}$ be a minimal core set of tests. Then, let $\mathcal{T}$ be a (larger) core set of tests, and let $\mathcal{H}$ be a mutually exclusive and exhaustive partition of the set of all possible histories. (Elements of $\mathcal{H}$ are called indicative events [16].) Let $\mathcal{AO}$ be the set of all possible action-observation pairs. Define $\phi^{\mathcal{H}}(h)$ for $h \in \mathcal{H}$ to be a vector of indicative features, i.e., features of history, and define $\phi^{\mathcal{AO}}(a, o)$ to be a vector of features of a present action and observation. Finally, define $\phi^{\mathcal{T}}(h)$ to be a vector of characteristic features: that is, each entry of $\phi^{\mathcal{T}}(h)$ is a linear combination of some set of test predictions.

We will also assume that we execute a known exploration policy from each sampled history; with this assumption, it is possible to construct unbiased samples of $\phi^{\mathcal{T}}(h)$ by importance weighting [33, 58]. When our algorithms below call for samples of $\phi^{\mathcal{T}}(h)$, we use this importance weighting trick to provide them.

We define $\Phi^{\mathcal{T}}$, $\Phi^{\mathcal{H}}$, and $\Phi^{\mathcal{AO}}$ as matrices of characteristic, indicative, and present features respectively, with first dimension equal to the number of features and second dimension equal to $|\mathcal{H}|$. An entry of $\Phi^{\mathcal{H}}$ is the expectation of one of the indicative features given the occurrence of one of the indicative events and the exploration policy; an entry of $\Phi^{\mathcal{T}}$ is the expectation of one of our characteristic features given one of the indicative events; and an entry of $\Phi^{\mathcal{AO}}$ is the expectation of one of the present features given one of the indicative events and the exploration policy. We also define $\psi = P[\mathcal{H}]$, $D = \mathrm{diag}(\psi)$, $R \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{Q}|}$ as the matrix with rows $r_{\tau_i}^{\top}$, $S \in \mathbb{R}^{|\mathcal{Q}| \times |\mathcal{H}|}$ as the expected state $E[\mathcal{Q} \mid \mathcal{H}]$, and $M$ as a $|\mathcal{Q}| \times |\mathcal{AO}| \times |\mathcal{Q}|$ third-order tensor (each $|\mathcal{Q}| \times |\mathcal{Q}|$ slice, $M_{ao}$, of $M$ is the transition matrix for a particular action-observation pair).

Given the above notation, we define several covariance and “trivariance” matrices which are related to the parameters of the PSR. In several of the following equations we use tensor-matrix multiplication $\times_v$, also known as a mode-v product: $\times_v$ multiplies the second dimension of a matrix with the vth mode of a tensor.
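The mode-v product just described can be sketched in NumPy as follows. The function name `mode_product` and the array shapes are assumptions for illustration; note that the code uses 0-based mode indices, whereas the text counts modes from 1.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-v product: contract the matrix's second dimension with the given
    (0-based) mode of the tensor. That mode is replaced by the matrix's
    first dimension; all other modes are left unchanged."""
    # tensordot appends the matrix's remaining axis last; move it back to `mode`.
    out = np.tensordot(tensor, matrix, axes=(mode, 1))
    return np.moveaxis(out, -1, mode)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6, 4))   # hypothetical |Q| x |AO| x |Q| tensor
A = rng.standard_normal((3, 4))      # acts on the first mode
B = rng.standard_normal((2, 6))      # acts on the second mode

result = mode_product(mode_product(M, A, 0), B, 1)
reference = np.einsum('ijk,ai,bj->abk', M, A, B)
```

Products along different modes commute, so chains such as $M \times_1 A \times_2 B$ can be applied in any order; the `einsum` line expresses the same contraction in a single call.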

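\begin{align*}
[\mu_{\mathcal{H}}]_i &\equiv E[\phi^{\mathcal{H}}_i(h)] \\
\implies \mu_{\mathcal{H}} &= \Phi^{\mathcal{H}} \psi \tag{4a} \\
[\Sigma_{\mathcal{AO},\mathcal{AO}}]_{i,j} &\equiv E[\phi^{\mathcal{AO}}_i(a, o) \cdot \phi^{\mathcal{AO}}_j(a, o)] \\
\implies \Sigma_{\mathcal{AO},\mathcal{AO}} &= \Phi^{\mathcal{AO}} D \Phi^{\mathcal{AO}\top} \tag{4b} \\
[\Sigma_{\mathcal{T},\mathcal{H}}]_{i,j} &\equiv E[\phi^{\mathcal{T}}_i(\tau^O) \cdot \phi^{\mathcal{H}}_j(h) \mid do(\tau^A)] \\
\implies \Sigma_{\mathcal{T},\mathcal{H}} &= \Phi^{\mathcal{T}} R S D \Phi^{\mathcal{H}\top} \tag{4c} \\
[\Sigma_{\mathcal{T},\mathcal{AO},\mathcal{H}}]_{i,j,k} &\equiv E[\phi^{\mathcal{T}}_i(\tau^O) \cdot \phi^{\mathcal{H}}_j(h) \cdot \phi^{\mathcal{AO}}_k(ao) \mid do(\tau^A, a)] \\
\implies \Sigma_{\mathcal{T},\mathcal{AO},\mathcal{H}} &= M \times_1 (\Phi^{\mathcal{T}} R) \times_2 (\Phi^{\mathcal{AO}} D) \times_3 (\Phi^{\mathcal{H}} D S^{\top}) \tag{4d} \\
[\Sigma_{R,\mathcal{H}}]_i &\equiv E[R \cdot \phi^{\mathcal{H}}_i(h)] \\
\implies \Sigma_{R,\mathcal{H}} &= \eta^{\top} S D \Phi^{\mathcal{H}\top} \tag{4e}
\end{align*}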

Now, if we are given an additional matrix $U$ such that $U^{\top} \Phi^{\mathcal{T}} R$ is invertible, we can use Equations 4a–e to define a TPSR whose parameters are only a similarity transform away from the original PSR parameters.

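\begin{align*}
b_* &\equiv U^{\top} \Sigma_{\mathcal{T},\mathcal{H}}\, e = (U^{\top} \Phi^{\mathcal{T}} R)\, m_* \tag{5a} \\
b_\infty^{\top} &\equiv \mu_{\mathcal{H}}^{\top} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} = m_\infty^{\top} (U^{\top} \Phi^{\mathcal{T}} R)^{-1} \tag{5b} \\
B_{ao} &\equiv \Sigma_{\mathcal{T},\mathcal{AO},\mathcal{H}} \times_1 U^{\top} \times_2 \Phi^{\mathcal{AO}\top} (\Sigma_{\mathcal{AO},\mathcal{AO}})^{-1} \times_3 (\Sigma_{\mathcal{T},\mathcal{H}}^{\top} U)^{\dagger} \\
&= (U^{\top} \Phi^{\mathcal{T}} R)\, M_{ao}\, (U^{\top} \Phi^{\mathcal{T}} R)^{-1} \tag{5c} \\
b_\eta^{\top} &\equiv \Sigma_{R,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} = \eta^{\top} (U^{\top} \Phi^{\mathcal{T}} R)^{-1} \tag{5d}
\end{align*}

Here $b_*$ is a feasible TPSR state, $b_\infty$ is a normalization vector, the matrices $B_{ao}$ are transition matrices, one for each action-observation pair $ao$, and $b_\eta$ is a linear TPSR reward function.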

4.2.1 Learning TPSRs

The identities in Equation 5a–d yield a straightforward spectral learning algorithm [19]: we build empirical estimates $\hat{\mu}_{\mathcal{H}}$, $\hat{\Sigma}_{\mathcal{AO},\mathcal{AO}}$, $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, and $\hat{\Sigma}_{\mathcal{T},\mathcal{AO},\mathcal{H}}$ of the matrices defined in Equation 4. Once we have constructed $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$, we can compute $U$ as the matrix of $n$ leading left singular vectors of $\hat{\Sigma}_{\mathcal{T},\mathcal{H}}$. Finally, we plug the estimated covariances and $U$ into Equation 5 to compute estimated PSR parameters. One of the advantages of spectral subspace identification is that the complexity of the model can be tuned by selecting the number of singular vectors in $U$, at the risk of losing prediction quality.

[Figure 3 image omitted. Panels: A. Simulated Environment (outer walls, inner walls); B. 3-d View (to scale); C. Learned Subspace; D. Learned Representation Mapped to Geometric Space.]

Figure 3: Learning the Autonomous Robot Domain. (A) The robot uses visual sensing to traverse a square domain with multi-colored walls and a central obstacle. Examples of images recorded by the robot occupying two different positions in the environment are shown at the bottom of the figure. (B) A to-scale 3-dimensional view of the environment. (C) The 2nd and 3rd dimension of the learned subspace (the first dimension primarily contained normalization information). Each point is the embedding of a single history, displayed with color equal to the average RGB color in the first image in the highest probability test. The star-shaped manifold captures the visual space of the robot with each “point” of the star containing concentrations of embeddings that predict images predominantly composed of a particular color. (D) The same points in (C) projected onto the environment’s geometric space, demonstrating that the manifold sensibly captures features of geometric space.

As we include more data in our averages, the law of large numbers guarantees that our estimates converge to the true matrices $\mu_{\mathcal{H}}$, $\Sigma_{\mathcal{AO},\mathcal{AO}}$, $\Sigma_{\mathcal{T},\mathcal{H}}$, and $\Sigma_{\mathcal{T},\mathcal{AO},\mathcal{H}}$. So by continuity of the formulas above, if our system is truly a PSR of finite rank, our estimates $\hat{b}_*$, $\hat{b}_\infty$, and $\hat{B}_{ao}$ converge, with probability 1, to the true parameters up to a linear transform; that is, our learning algorithm is consistent (see footnote 1).

We demonstrated the representational capacity of our model and the effectiveness of our learning algorithm by learning a compact model from simulated autonomous robot vision data (Figure 3). In Section 4.3.1 we plan using value iteration in this learned model.
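The subspace-selection step of the learning algorithm in this section, where $U$ is taken to be the leading left singular vectors of the estimated test-history covariance, can be illustrated in isolation. The sketch below is an assumption-laden toy: the covariance is a synthetic rank-3 matrix with small additive noise standing in for sampling error, and the function name `leading_left_singular_vectors` is hypothetical.

```python
import numpy as np

def leading_left_singular_vectors(Sigma_TH, n):
    """Return the n leading left singular vectors and all singular values."""
    U, s, _ = np.linalg.svd(Sigma_TH, full_matrices=False)
    return U[:, :n], s

rng = np.random.default_rng(0)
true_rank = 3
# Synthetic rank-3 "covariance" plus small noise standing in for sampling error.
Sigma_true = rng.standard_normal((20, true_rank)) @ rng.standard_normal((true_rank, 15))
Sigma_hat = Sigma_true + 1e-3 * rng.standard_normal(Sigma_true.shape)

U, singular_values = leading_left_singular_vectors(Sigma_hat, true_rank)

# Projecting onto span(U) gives the best rank-n approximation of Sigma_hat.
projected = U @ (U.T @ Sigma_hat)
rel_error = np.linalg.norm(Sigma_hat - projected) / np.linalg.norm(Sigma_hat)
```

In this toy setting the singular values drop sharply after the third one, so the gap suggests a model dimension. With real data the spectrum usually decays more gradually, and the choice of n trades model size against prediction quality, as noted above.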

4.2.2 Connections to Robotics

In addition to the major new contributions outlined in this section, we have discovered connections between the spectral learning algorithms stated above and several well-known problems in robotics, including structure from motion and range-only simultaneous localization and mapping with known correspondences. The details of these connections will be the subject of a future paper.

4.3 Planning in Learned Predictive State Representations

In addition to computationally and statistically efficient learning algorithms for PSRs, we have also shown that it is feasible to plan in PSRs with value iteration and policy iteration.

4.3.1 Value Iteration

Footnote 1 (Section 4.2.1): This argument for continuity holds if we fix $U$; a similar but more involved argument works if we estimate $U$ as well.

The primary motivation for modeling a controlled dynamical system is to reason about the effects of taking a sequence of actions in the system. PSR value iteration is a straightforward extension of POMDP value iteration [27, 26]. Given a discount factor $\gamma$, the problem for TPSRs is to find a policy that maximizes the expected discounted sum of rewards, $E[\sum_t \gamma^t R(b_t, a_t)]$. The optimal policy can be compactly represented using the optimal value function $V^*(b)$, which specifies the expected sum of future rewards in each TPSR state. The value function is defined recursively as:

$$V^*(b) \equiv \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, V^*(b_{ao}) \right] \tag{6}$$

where $b_{ao}$ is the state obtained from $b$ after executing action $a$ and observing $o$. We have implicitly assumed that the expected reward is a linear function of the TPSR state; we can ensure that this assumption holds by including the reward as an observation when we learn the TPSR dynamics (or, if the reward is not directly observable, by including its expectation given all observable information). We can obtain the optimal action by taking the arg max instead of the max in Equation 6:

$$\pi^*(b) = \arg\max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, V^*(b_{ao}) \right] \tag{7}$$
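The greedy action selection in Equation 7 can be sketched as a one-step lookahead. The sketch assumes the standard TPSR prediction and update rules, $\Pr(o \mid b, a) = b_\infty^{\top} B_{ao} b$ and $b_{ao} = B_{ao} b / (b_\infty^{\top} B_{ao} b)$, which are not restated in this section, and it assumes one linear reward vector per action (`b_eta[a]`). The two-state toy system, the value function, and all names are hypothetical.

```python
import numpy as np

def greedy_action(b, B, b_inf, b_eta, value_fn, gamma):
    """One-step lookahead: argmax_a [ R(b,a) + gamma * sum_o Pr(o|b,a) V(b_ao) ].

    B[a][o] is the transition matrix for action a and observation o;
    b_eta[a] gives the linear reward R(b, a) = b_eta[a] @ b (an assumption).
    """
    scores = []
    for a in range(len(B)):
        total = b_eta[a] @ b
        for B_ao in B[a]:
            unnorm = B_ao @ b
            p = b_inf @ unnorm              # Pr(o | b, a)
            if p > 1e-12:                   # skip impossible observations
                total += gamma * p * value_fn(unnorm / p)
        scores.append(total)
    return int(np.argmax(scores)), scores

# Toy 2-state system written as a PSR: action 0 stays, action 1 switches state.
T = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]   # T[a][s, s'] = Pr(s' | s, a)
Obs = np.array([[0.9, 0.1], [0.1, 0.9]])              # Obs[s', o] = Pr(o | s')
B = [[np.diag(Obs[:, o]) @ T[a].T for o in range(2)] for a in range(2)]
b_inf = np.ones(2)
b_eta = [np.zeros(2), np.zeros(2)]                    # no immediate reward

value_fn = lambda belief: belief[1]                   # hypothetical V: prefer state 1
action, scores = greedy_action(np.array([1.0, 0.0]), B, b_inf, b_eta, value_fn, 0.9)
```

Starting in state 0, the lookahead prefers the switching action because it leads to the state that the assumed value function rewards. A point-based planner would apply a backup of this form at each sampled belief point.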

Exact value iteration for TPSRs optimizes the value function over all possible beliefs or state vectors. However, computing the exact value function is problematic because the number of sequences of actions that must be considered grows exponentially with the planning horizon, called the “curse of history.” Approximate point-based planning techniques specifically target the curse of history by attempting only to calculate the best sequence of actions at some finite set of belief points. Unfortunately, in high dimensions, approximate planning techniques have difficulty adequately sampling the space of possible beliefs. This is called the “curse of dimensionality.” Because TPSRs often admit a compact low-dimensional representation, they can reduce the effect of the curse of dimensionality, and so approximate point-based planning techniques such as Point-Based Value Iteration (PBVI) [21] can work well in these models.

We demonstrate value iteration in a learned model (the model in Figure 3), with successful results (Figure 4). See [19] for details.

[Figure 4 image omitted. Panels: A. Estimated Value Function; B. Policies Executed in Learned Subspace; C. Paths Taken in Geometric Space; D. bar graph of Mean # of Actions (Optimistic, Learned, Random; extracted bar values 12.8, 22.8, 291.5) and line graph of Cumulative Density against # of Actions (Opt., Learned, Random).]

Figure 4: Planning in the learned state space. (A) The value function computed for each embedded point; lighter indicates higher value. (B) Policies executed in the learned subspace. The red, green, magenta, and yellow paths correspond to the policy executed by a robot with starting positions facing the red, green, magenta, and yellow walls respectively. (C) The paths taken by the robot in geometric space while executing the policy. Each of the paths corresponds to the path of the same color in (B). The darker circles indicate the starting and ending positions, and the tick-mark indicates the robot’s orientation. (D) Analysis of planning from 100 randomly sampled start positions to the target image (facing blue wall). In the bar graph: the mean number of actions taken by the optimistic solution found by A* search in configuration space (left); taken by executing the policy found by Perseus in the learned model (center); and taken by executing a random policy (right). Line graph illustrates the cumulative density of the number of actions given the optimal, learned, and random policies.


4.3.2 Policy Iteration

In policy iteration one alternately estimates a value function and a greedy policy based on this value function. SSID techniques are a natural fit for the value function estimation step, since they are reliable, efficient, and effective. So, we introduce Predictive State Temporal Difference (PSTD) learning, a new approach to feature discovery for temporal difference methods, which demonstrates how insights from system identification can benefit reinforcement learning. Specifically, we use PSTD to estimate a value function within one step of policy iteration. PSTD can be viewed either as a model-free approach that automatically chooses a small set of features that preserves only predictive information useful for finding a value function, or as a PSR model-based approach that first estimates the parameters of a TPSR and then applies a Bellman recursion to estimate the value function.

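For a fixed policy $\pi$, a TPSR’s value function is a linear function of state, $J^{\pi}(s) = w^{\top} b$, and is the solution of the TPSR Bellman equation [26]: for all $b$, $w^{\top} b = b_\eta^{\top} b + \gamma \sum_{o \in O} w^{\top} B_{\pi o} b$, or equivalently,

$$w^{\top} = b_\eta^{\top} + \gamma \sum_{o \in O} w^{\top} B_{\pi o}$$

If we substitute in our learned PSR parameters from Equations 5(a–d), we get

\begin{align*}
w^{\top} &= \Sigma_{R,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} + \gamma \sum_{o \in O} w^{\top} U^{\top} \Sigma_{\mathcal{T},\pi o,\mathcal{H}} (U^{\top} \Sigma_{\mathcal{T},\mathcal{H}})^{\dagger} \\
w^{\top} &= \Sigma_{R,\mathcal{H}} \left( U^{\top} \Sigma_{\mathcal{T},\mathcal{H}} - \gamma \sum_{o \in O} U^{\top} \Sigma_{\mathcal{T},\pi o,\mathcal{H}} \right)^{\dagger} \tag{8}
\end{align*}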
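The fixed-policy Bellman equation $w^{\top} = b_\eta^{\top} + \gamma \sum_{o} w^{\top} B_{\pi o}$ is a linear system in $w$, which the following sketch solves directly. It is an illustration of that linear solve only, not of the full PSTD estimator: the operators `B_pi` are random matrices rescaled so the system is well posed, and the function name is an assumption.

```python
import numpy as np

def tpsr_policy_value_weights(B_pi, b_eta, gamma):
    """Solve w^T = b_eta^T + gamma * sum_o w^T B_pi[o] for w."""
    k = b_eta.shape[0]
    B_sum = sum(B_pi)
    # w^T (I - gamma * B_sum) = b_eta^T  <=>  (I - gamma * B_sum)^T w = b_eta
    return np.linalg.solve((np.eye(k) - gamma * B_sum).T, b_eta)

rng = np.random.default_rng(0)
k, num_obs, gamma = 4, 3, 0.9

# Hypothetical operators, rescaled so that sum_o B_o has spectral radius 1
# and (I - gamma * sum_o B_o) is therefore invertible.
B_pi = [rng.random((k, k)) for _ in range(num_obs)]
scale = max(abs(np.linalg.eigvals(sum(B_pi))))
B_pi = [B_o / scale for B_o in B_pi]
b_eta = rng.standard_normal(k)

w = tpsr_policy_value_weights(B_pi, b_eta, gamma)

# Residual of the Bellman equation (transposed form); should be numerically zero.
residual = w - (b_eta + gamma * sum(B_o.T @ w for B_o in B_pi))
```

Equation 8 has the same structure, with the estimated covariances in place of the explicit operators and a pseudo-inverse in place of the exact solve.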

By comparison, if we instead solve the value function estimation problem by LSTD, and then use CCA to select a relevant subspace of the feature space, we can show that we get the exact same answer [58]. In addition to adding to our understanding of both methods, an important corollary of this result is that PSTD is a statistically consistent algorithm for PSR value function approximation; to our knowledge, this is the first such result for a TD method. Our experiments suggest that the new method is not just a nice theoretical tie between direct and indirect reinforcement learning methods, but a highly effective planning method: see Figure 5.

[Figure 5 image omitted. Axes: Expected Reward (y) against Policy Iteration (x). Legend: LSTD (16), LSTD, PSTD, LARS-TD, Threshold.]

Figure 5: Pricing a high-dimensional derivative via policy iteration. Error bars show standard error. The y-axis is expected reward for the current policy at each iteration. The optimal threshold strategy (sell if price is above a threshold [59]) is in black, LSTD (16 canonical features) is in blue, LSTD (on the 16 canonical and 204 additional features) is cyan, LARS-TD [60] (feature selection from set of 220) is in green, and PSTD (16 dimensions, compressing 220 features (16 + 204)) is in red. PSTD performs better than competing methods by finding a subspace of predictive features that preserves information relevant to estimating a value function. See [58] for details.


4.4 Learning in Infinite-dimensional Feature Spaces

In the previous sections, we detailed several spectral learning algorithms for identifying the state of a dynamical system where observations are embedded in a finite dimensional Euclidean space. Here we extend dynamical system identification to observations embedded in potentially infinite-dimensional Hilbert spaces and provide a new kernel-based representation and kernelized spectral learning algorithm for HMMs; this new representation and algorithm will allow us to learn HMMs in any domain where we can define a kernel. Furthermore, our algorithm is free of local minima and admits finite-sample generalization guarantees.

In particular, we will represent HMMs using a recent concept called Hilbert space embedding [51, 52]. The essence of Hilbert space embedding is to represent probability measures (in our case, corresponding to distributions over observations and latent states in an HMM) as points in Hilbert spaces. We can then perform inference in the HMM by updating these points, entirely in their Hilbert spaces, using covariance operators [61] and conditional embedding operators [53]. By making use of the Hilbert space’s metric structure, our method works naturally with continuous and structured random variables, without the need for discretization.

In addition to generalizing HMMs to arbitrary domains where kernels are defined, our learning algorithm contributes to the theory of Hilbert space embeddings with hidden variables. Previously, [53] derived a kernel algorithm for HMMs; however, they only provided results for fully observable models, where the training data includes labels for the true latent states. By contrast, our algorithm only requires access to an (unlabeled) sequence of observations. See [54] for details.

We provide experimental results comparing embedded HMMs learned by our spectral algorithm to several other well-known approaches to learning models of time series data. The results demonstrate that our novel algorithm exceeds the previous state-of-the-art performance, often beating the next best algorithm by a substantial margin (Figure 6), although other recent methods such as Gaussian process models [62] can achieve similar results on this data at higher computational cost.

Kernel-based methods represent a powerful extension to our spectral learning algorithms. Although the learning algorithm that we developed is only proven to be statistically consistent for HMMs at this time, we believe that it may actually be a consistent learning algorithm for uncontrolled PSRs as well.

[Figure 6 image omitted. Panels: A. Example Images, with Environment and Path; B. and D. Avg. Prediction Err. (x 10^6) against Prediction Horizon (0 to 100), with legend RR-HMM, LDS, HMM, Mean, Last, Embedded; C. IMU, Slot Car, Racetrack.]

Figure 6: Prediction tasks for various models on two problems: robot vision data and slot car inertial measurement data. (A) Sample images from the robot’s camera. The figure below depicts the hallway environment with a central obstacle (black) and the path that the robot took through the environment (the red counter-clockwise ellipse). (B) Squared error for prediction with different estimated models and baselines. (C) The slot car platform and the IMU (top) and the racetrack (bottom). (D) Squared error for prediction with different estimated models and baselines. Hilbert space embeddings of HMMs outperform other models, sometimes by a substantial margin.

4.5 Online Learning Algorithms for PSRs

Spectral algorithms for learning dynamical systems have, until now, had an important drawback: they are batch methods (needing to store their entire training data set in memory at once) instead of online ones (with space complexity independent of the number of training examples and time complexity linear in the number of training examples).

To remedy this drawback, we proposed a fast, online spectral algorithm for TPSRs. Since TPSRs subsume HMMs, PSRs, and POMDPs [7, 18], our algorithm also improves on past algorithms for these other models. Our method leverages fast, low-rank modifications of the thin singular value decomposition [63], and uses tricks such as random projections to scale to extremely large numbers of examples and features per example. Consequently, the new method can handle orders of magnitude larger data sets than previous methods, and can therefore scale to learn models of systems that are too complex for previous methods.

Experiments showed that our online spectral learning algorithm recovered the parameters of a nonlinear dynamical system well in two partially observable domains. In our first experiment we empirically demonstrated that our online spectral learning algorithm is unbiased by recovering the parameters of a small but difficult synthetic Reduced-Rank HMM. In our second experiment we demonstrated the performance of the new method on a difficult, high-bandwidth video understanding task. See [64] for details.

4.6 Two-Manifold Problems: Dynamical Systems on the Manifold

We propose a class of problems called two-manifold problems where two sets of corresponding data points, generated by a latent variable, lie on or near two different manifolds. We design algorithms by relating two-manifold problems to cross-covariance operators in RKHS, and show that these algorithms result in a significant improvement over standard manifold learning approaches in the presence of noise or limited data. Furthermore, the relationship to cross-covariance operators suggests that manifold learning can be intuitively integrated into supervised learning methods. We demonstrate this fact by designing a manifold version of our kernel-based approach to learning Hilbert space embeddings of HMMs. Interestingly, this algorithm can be interpreted as learning a dynamical system model on a low-dimensional manifold of the training data, or as a kernel-based learning algorithm for Hilbert space embeddings of HMMs where the kernels are learned from the training data. This work has been submitted for publication but not yet accepted.

5 Proposed Work

Our completed work has made several novel contributions to the problem of learning models of dynamical systems. We developed new batch and online learning algorithms in finite and infinite dimensional feature spaces and established connections to a range of spectral dimensionality reduction methods including kernel-based methods and manifold learning; preliminary experiments have shown promise in prediction and planning tasks. The main thrust of our proposed work is to apply our learning algorithms to difficult real-world problems, analyze the potential difficulties associated with these tasks, and develop new techniques, if necessary.

Our main application focus will be robotics. We anticipate several difficulties while moving from smaller problems and synthetic problems to larger practical applications. The first is the challenge of scaling up to the higher-dimensional predictive state spaces that more complex tasks require, the second is integrating expert knowledge into the learning process, and the third is properly accounting for actions and exploration in controlled systems. Below we provide details of some of these subgoals that we are interested in tackling while pursuing applications to robotics.

5.1 Incorporating Information into the Learning Process

The examples that we have focused on so far have not required a very large predictive state space (typically < 20 dimensions) and have also been learned with a relatively large number of samples (typically 1,000 to 10,000 training data points). We have not demonstrated that we can sample densely enough to learn models that are tractable for planning and prediction tasks in real-world circumstances. Overcoming this problem may require new techniques. One possibility is to guide learning by incorporating prior knowledge into the learning process. In this situation we might specify part of the model and learn part of the model. Recent work on developing a kernel Bayes rule [65] could be helpful in this regard. Progress on this front will greatly increase the applicability of our learning algorithms in practice.

5.2 Hilbert Space Embeddings of Predictive State Representations

In previous work, we developed a non-parametric kernel-based learning algorithm for continuous-valued HMMs (Section 4.4). We believe that this learning algorithm can also be interpreted as a learning algorithm for an uncontrolled PSR. Part of our future work will be to establish the relationship between Hilbert space embeddings of HMMs and PSRs, and to prove that the learning algorithms in the uncontrolled models are actually equivalent. Finally, we will attempt to develop the theoretical framework necessary for sample complexity bounds on non-parametric learning algorithms for both controlled and uncontrolled PSRs in Hilbert space. In addition to extending PSR learning algorithms to infinite dimensional feature spaces, this will also give us bounds for PSRs in finite dimensional Euclidean spaces as a special case.

We also wish to focus on spectral learning algorithms for PSRs for off-policy planning. Although we have already detailed how this can be done in theory, by importance weighting observation trajectories, we have not yet attempted to do this in practice in a domain of significant size. We believe that effectively dealing with bias induced by a data collection policy in our spectral learning algorithms will be critical for applying our algorithms in practice, especially for bigger problems where exploration is necessary.

5.3 Robotics ApplicationsWe are interested in a number of potential robotics applications. A few are mentioned below for which weeither have access to data or have conducted preliminary experiments.

Visual Mapping We are interested in learning maps of place locations directly from monocular videostreams. This learning task is essentially a more difficult version of the synthetic experiment in Figure 3.Given the success of this experiment, and some small successes in learning manifolds from video data, webelieve that this task is possible. While we do not believe that such mapping will immediately supplantmore sophisticated approaches to learning detailed maps of an environment from vision-based sensors, wedo believe that the technical difficulties associated with this task are interesting in their own right; further-more, these same technical difficulties will arise in other applications, and solving them in the relatively wellexplored area of mapping will allow us to make rapid progress in other domains.

Manipulation We are interested in applying spectral learning techniques to the problem of manipulationplanning and the problem of perception for manipulation. Some interesting aspects of these problems includelearning models of push-grasping and belief compression for trajectory planning. Siddartha Srinivasa isinterested in providing data and expert advice in this domain.

Quadrotor Sensing Quadrotor helicopters are flying robotic platforms with a variety of sensors including laser rangefinders, video cameras, and inertial measurement units. We are interested in directly modeling combined sensor data from these robots. We hope to learn low-dimensional manifolds (maps) of the sensor data, to filter and predict sensor data, and to attempt to plan in the quadrotor’s data space. Nick Roy and Maxim Likhachev have provided data and are willing to work with us to deploy software that we provide.


6 Timeline

The following is a tentative timeline for the completion of the thesis:

June, 2011: Thesis Proposal

Jul-Aug, 2011: Collaboration with Arthur at UCL

• Hilbert Space Embeddings of PSRs

• PSTD in Hilbert Space

Fall, 2011 - Spring 2012: Robotics Applications

• Grasping with Sidd

• Quadrotor sensing and mapping

• Additional theoretical work, as needed

Summer, 2012: Write thesis

Sept, 2012: Graduation


References

[1] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.

[2] Roger Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[3] Harold Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.

[4] Gregory C. Reinsel and Rajabather Palani Velu. Multivariate Reduced-rank Regression: Theory and Applications. Springer, 1998.

[5] Tohru Katayama. Subspace Methods for System Identification. Springer-Verlag, 2005.

[6] Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[7] Satinder Singh, Michael James, and Matthew Rudary. Predictive state representations: A new theory for modeling dynamical systems. In Proc. UAI, 2004.

[8] Sam Roweis and Zoubin Ghahramani. A unified view of linear gaussian models. Neural Computation, 11:305–345, 1999.

[9] Pierre Baldi and Yves Chauvin. Smooth on-line learning algorithms for hidden markov models, 1994.

[10] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. Bayesian Nonparametric Inference of Switching Dynamic Linear Models. IEEE Transactions on Signal Processing, 59(4), 2011.

[11] Stephane Ross and Joelle Pineau. Model-based Bayesian reinforcement learning in large structured domains. In Proc. UAI, 2008.

[12] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.

[13] Anthony R. Cassandra, Leslie P. Kaelbling, and Michael R. Littman. Acting optimally in partially observable stochastic domains. In Proc. AAAI, 1994.

[14] Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research (JAIR), 27:335–380, 2006.

[15] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems (NIPS), 2002.

[16] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.

[17] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Planning in POMDPs using multiplicity automata. In UAI, 2005.

[18] Matthew Rosencrantz, Geoffrey J. Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proc. ICML, 2004.

[19] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.

[20] S. Soatto and A. Chiuso. Dynamic data factorization. Technical report, UCLA, 2001.

[21] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. IJCAI, 2003.

[22] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming. Athena scientific optimization and computation series. Athena Scientific, 1996.

[23] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. J. Mach. Learn. Res., 4:1107–1149, 2003.

[24] R. S. Sutton, David McAllester, S. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063, Cambridge, MA, 2000. MIT Press.


[25] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531–1538. MIT Press, 2002.

[26] Michael R. James, Ton Wessling, and Nikos A. Vlassis. Improving approximate value iteration using memories and predictive state representations. In AAAI, 2006.

[27] Masoumeh T. Izadi and Doina Precup. Point-based planning for predictive state representations. In Proc. Canadian AI, 2008.

[28] Jeff Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, 1997.

[29] Satinder Singh, Michael L. Littman, Nicholas K. Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In Proc. ICML, 2003.

[30] Britton Wolfe, Michael James, and Satinder Singh. Learning predictive state representations in dynamical systems without reset. In Proc. ICML, 2005.

[31] Peter McCracken and Michael Bowling. Online discovery and learning of predictive state representations. In Proc. NIPS, 2005.

[32] Eric Wiewiora. Learning predictive representations from a history. In Proc. ICML, 2005.

[33] Michael Bowling, Peter McCracken, Michael James, James Neufeld, and Dana Wilkinson. Learning predictive state representations using non-blind policies. In Proc. ICML, 2006.

[34] H. Jaeger, M. Zhao, and A. Kolling. Efficient training of OOMs. In NIPS, 2005.

[35] M. Zhao, H. Jaeger, and M. Thon. A bound on modeling error in observable operator models and an associated learning algorithm. Neural Computation, 2009.

[36] Guy Shani, Ronen I. Brafman, and Solomon E. Shimony. Model-based online learning of POMDPs. In Proc. ECML, 2005.

[37] A. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.

[38] David Wingate and Satinder Singh. Efficiently learning linear-linear exponential family predictive representations of state. In Proc. ICML, 2008.

[39] David Wingate. Exponential Family Predictive Representations of State. PhD thesis, University of Michigan, 2008.

[40] Andrew Y. Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse, Eric Berger, and Eric Liang. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

[41] Russ Tedrake, Zack Jackowski, Rick Cory, John William Roberts, and Warren Hoburg. Learning to fly like a bird. In Under Review, 2009.

[42] Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[43] Joshua B. Tenenbaum, Vin De Silva, and John Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[44] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.

[45] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002.

[46] Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, pages 839–846. ACM Press, 2004.

[47] Neil D. Lawrence. Spectral dimensionality reduction via maximum entropy. In Proc. AISTATS, 2011.


[48] Jihun Ham, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. A kernel view of the dimensionality reduction of manifolds, 2003.

[49] L. R. Rabiner. A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77(2):257–285, 1989.

[50] R.E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 1960.

[51] A.J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In E. Takimoto, editor, Algorithmic Learning Theory, Lecture Notes on Computer Science. Springer, 2007.

[52] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proc. Annual Conf. Computational Learning Theory, 2008.

[53] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions. In International Conference on Machine Learning, 2009.

[54] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proc. 27th Intl. Conf. on Machine Learning (ICML), 2010.

[55] Vijay Balasubramanian. Equivalence and Reduction of Hidden Markov Models. MSc. Thesis, MIT, 1993.

[56] M. P. Schutzenberger. On the definition of a family of automata. Inf Control, 4:245–270, 1961.

[57] M. Fliess. Matrices de Hankel. J. Math. Pures Appl., 53:197–222, 1974.

[58] Byron Boots and Geoff Gordon. Predictive state temporal difference learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 271–279. 2010.

[59] John N. Tsitsiklis and Benjamin Van Roy. Optimal stopping of markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44:1840–1851, 1997.

[60] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 521–528, New York, NY, USA, 2009. ACM.

[61] C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

[62] Jonathan Ko and Dieter Fox. Learning gp-bayesfilters via gaussian process latent variable models. In Proc. RSS, 2009.

[63] Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30, 2006.

[64] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), 2011.

[65] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes’ rule. Stat, 1050(2):21, 2010.
