
Programming Language Support for Autoassociative Memory

Dept. of CIS - Senior Design 2014-2015∗

Fan Yin
[email protected]
Univ. of Pennsylvania, Philadelphia, PA

Haolin Lu
[email protected]
Univ. of Pennsylvania, Philadelphia, PA

Yukuan Zhang
[email protected]
Univ. of Pennsylvania, Philadelphia, PA

ABSTRACT
Data structures such as arrays or hashmaps use exact key lookups and memory addresses to retrieve entries. This differs from a more relational model of memory such as human memory, which learns from heuristics and pattern matching as it receives input from its surrounding environment. Analogously, an autoassociative memory receives an input and returns an output value that may only be similar to the input and to the other stored values according to learned patterns. Our programming language library introduces a novel interface to this autoassociative memory in order to improve programming efficiency for the end user. Specifically, the interface provides an easier way for the user to handle structured input data, as well as to manage user-defined types in the autoassociative memory.

1. INTRODUCTION
Autoassociative memories are frequently used in machine learning applications, and while these applications generally handle data in the form of plain, flat vectors, there are certain applications where programmers would want to preserve their highly structured, rich representations of data. For instance, machine learning tasks related to NLP (Natural Language Processing) often employ n-ary parse trees, and these parse trees can become quite complicated due to their recursive nature and non-binary branching. Moreover, sentences or phrases take on different semantic meanings due to their structure, so it would clearly behoove programmers to somehow preserve structure in their applications. In response, this programming language library offers end users an intuitive interface to autoassociative memory that preserves the structure of data and gives them the capability to define their own custom data types.

First, this paper will discuss background information relevant to autoassociative memory, structured data, and machine learning in order to contextualize the more technical aspects of the library. Next, the paper will delineate research related to autoassociative memories and give an overview of the specific methods that were used in the library. Then, it will establish a model and workflow for the interface, and the process of using this interface for autoassociative memories in machine learning applications. Afterwards, it will delve into the implementation details, and discuss the language and third-party libraries that were used. Finally, the performance and functionality of the library will be assessed, and possible improvements and new features will be discussed.

*Advisor: Steve A. Zdancewic ([email protected]).

2. BACKGROUND

2.1 Machine Learning
Autoassociative memory is part of the much broader topic of machine learning. Machine learning is widely used today to solve problems that lack easily definable patterns (e.g., natural language processing, computer vision, classification). The goal of machine learning algorithms is to build models based on an abundant amount of training data. Since the types of data vary wildly from one problem to the next, from sentences in different languages to video data from a camera, a method of reconciling the data with mathematical models is required. In this respect, researchers usually convert elements of the data set into a list of features, represented using a vector of real numbers. Each feature is a measurable value that encapsulates some aspect of the original data. Mathematical models are built to take in these feature vectors and produce results in some fashion.

2.2 Autoassociative Memory
Autoassociative memories use learned patterns about the data to retrieve information, whereas traditional memories rely on a unique address to recall data. Autoassociative memory mirrors the memory of human minds, which use heuristics to interpret new observations using previous experiences. In essence, human minds also implicitly perform feature extraction. Since most humans do not have photographic memories, it is important to commit the specific features of observations to memory. For example, a dog can be visually described by its size, fur length, color, etc. Using those features, one could vaguely reconstruct what the original dog looked like. An autoassociative memory emulates these aspects of human memory using mathematical models, and is frequently implemented via neural networks. A neural network consists of a series of functions acting on the features of given input values. Using these functions, it can find relationships or make predictions based on its inputs. It also reacts and changes whenever it sees new input to enhance accuracy on future input. Autoassociative memories do not maintain a one-to-one relationship between inputs and stored entries; outputs are constructed based on knowledge obtained from all of the stored entries.
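To make the feature-extraction idea concrete, the minimal OCaml sketch below flattens the dog example into a feature vector as described in Section 2.1; the record type, its fields, and the numeric encodings are illustrative assumptions, not part of the library.

(* A dog described by measurable features, flattened into a feature
   vector. The [dog] type and its fields are illustrative only. *)
type dog = { size_cm : float; fur_length_cm : float; color_hue : float }

let to_features (d : dog) : float array =
  [| d.size_cm; d.fur_length_cm; d.color_hue |]

let () =
  let rex = { size_cm = 60.0; fur_length_cm = 3.5; color_hue = 0.1 } in
  to_features rex |> Array.iter (Printf.printf "%.2f ")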

2.3 Autoencoders
Autoassociative memory can be implemented using a type of neural network called an autoencoder. An autoencoder finds relationships between data by removing noise. Often, there are many features associated with a particular piece of data, but they are not all equally useful in every application. Autoencoders seek to find the "essential" features of a given input by encoding the input vectors into a lower dimension via an encoding matrix. This encoded representation offers a more direct and accurate way to determine relationships between data.

Figure 1: Autoencoder Encoding (We * x + b = y)

For example, consider the encoding step of an autoencoder represented graphically in Figure 1. The autoencoder will use a matrix We ∈ R^(m×n) to encode a vector x ∈ R^n as another vector y ∈ R^m using the equation

y = f(We x + b),  (1)

where f is a non-linear transformation such as a sigmoid function and b is a bias vector. The encoded representation y can then be decoded using a different decoding matrix to produce an output with the original dimension. This output is essentially what the memory "remembers" the original input to be. Through training, the values of the encoding and decoding matrices as well as the bias values are optimized both for finding relevant combinations of features in the encoding and for reconstruction of the original input.
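As a concrete illustration of Equation (1), the following OCaml sketch computes y = f(We x + b) with plain arrays; the library itself performs these operations through the Lacaml bindings described in Section 5.2, so this naive version is for exposition only.

(* Encoding step of Equation (1): y = f(We x + b).
   [we] is an m x n matrix stored as an array of rows, [b] has
   length m, and [x] has length n. *)
let sigmoid z = 1.0 /. (1.0 +. exp (-.z))

let encode (we : float array array) (b : float array) (x : float array) =
  Array.mapi
    (fun i row ->
      let sum = ref b.(i) in
      Array.iteri (fun j w -> sum := !sum +. (w *. x.(j))) row;
      sigmoid !sum)
    we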

The work presented here relies heavily upon the ideas in autoencoders, and these ideas are extended to work with recursively structured data.

3. RELATED WORK
Machine learning applications generally deal with linear, flat data in the form of vectors, and cannot process structured data as easily. There have been some strides in machine learning to deal with this issue; Sperduti [8] studied the training of linear autoencoders for modeling structured input. There has also been research that applies autoencoders to NLP. In particular, Richard Socher [7] applied semi-supervised recursive autoencoders to sentiment analysis at the sentence level.

In addition, Socher [6] used the method of recursive autoencoders to perform paraphrase detection by measuring word and phrase similarity. These recursive autoencoders can function as a model of autoassociative memory, since they take in input data, recombine and apply patterns upon it, and output a value of the same type and format as the input data. However, Socher's autoencoders only worked on binary trees, so his parse trees had to first be binarized. Thus, the input data still had to be transformed in a way that did not preserve its structure. Our programming language library improves upon Socher's recursive autoencoders by supporting ways to encode data without sacrificing its complex structure.

4. SYSTEM MODEL

4.1 Interface
An overview of the library workflow is presented in Figure 2. As shown in the figure, the central component of this library interface is a memory, which requires a data type that is specified by the user. This type will correspond to the format of the training data, which the user also provides to the memory.

After the memory trains on the data, the user obtains an output from the memory based on the training from the previous step and the input value given to the memory.

Finally, it is likely that the user will have to apply some post-processing to the output values, for whatever purposes he desires.

Figure 2: Library Interface Workflow as Applied to Machine Learning

4.2 Recursive Autoencoder
The main component of our library requires an autoassociative memory implementation that can handle arbitrary structured data. We begin by describing a traditional recursive autoencoder that only works on binary trees with vector-valued leaves, and then discuss the changes to this model that allow us to use any given recursive structure with the autoencoder.

A traditional recursive autoencoder works in the following fashion: suppose we have a binary tree with a node y and children x1 and x2 that are leaves holding n-dimensional data vectors. We can encode the two children into a single node by concatenating the two vectors and encoding them using an n × 2n encoding matrix We. The equation for this encoding is fairly similar to that of a standard autoencoder:

y = f(We [x1; x2] + b),  (2)


where f is a sigmoid function and b is a bias value. Notice that y will have dimension n, so that the parent of y will be able to encode y recursively (refer to Figure 3). As long as the binary tree is full and only has vector values at the leaves, this autoencoder will be able to encode a whole tree as a single vector of dimension n.

Figure 3: Recursive Autoencoder
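A minimal OCaml sketch of this bottom-up encoding is given below; [encode_pair] stands in for one application of Equation (2) (an n × 2n matrix, bias, and sigmoid) and is passed in rather than implemented here.

(* A full binary tree with vector-valued leaves, as in Figure 3. *)
type tree = Leaf of float array | Node of tree * tree

(* Collapse a whole tree into a single n-dimensional vector by
   recursively encoding pairs of children (Equation (2)). *)
let rec encode_tree encode_pair (t : tree) : float array =
  match t with
  | Leaf x -> x
  | Node (l, r) ->
      encode_pair (encode_tree encode_pair l) (encode_tree encode_pair r)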

Decoding works in an intuitive fashion as well. Given an encoded tree represented as a single vector y ∈ R^n, the autoencoder uses a decoding matrix Wd ∈ R^(2n×n) to decode y into a 2n-dimensional vector, which can be split into two n-dimensional vectors y1 and y2 and decoded recursively:

[y1; y2] = f(Wd y + b).  (3)

The recursion ends when the decoded structure matches that of the input structure.
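Decoding mirrors the encoding sketch, with the shape of the original tree driving the recursion; in this sketch [decode_pair] stands in for Equation (3), splitting a decoded 2n-vector into two n-vectors, and the [tree] type is the one from the encoding sketch above.

(* Decode a single vector back into a tree, guided by the structure
   [shape] of the original input (Equation (3)). *)
let rec decode_tree decode_pair (shape : tree) (y : float array) : tree =
  match shape with
  | Leaf _ -> Leaf y
  | Node (l, r) ->
      let y1, y2 = decode_pair y in
      Node (decode_tree decode_pair l y1, decode_tree decode_pair r y2)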

This model of recursive autoencoder works well for the purposes it was designed for, but it has two main drawbacks. First, the structure that it can work on is very rigidly defined, which necessitates that users convert their data types. Second, in the decoding step, the original structure must be supplied to create a sensible output. To address these issues, we propose augmenting the traditional recursive autoencoder with tags as well as a variable number of encoding/decoding matrices so that it better handles generic data structures.

4.3 Tagged Recursive Autoencoder
In the recursive autoencoder described above, it was sufficient to keep one encoding matrix of size n × 2n. We can examine the type definition of the structure to confirm that this works. For the tree described above, there are only two possible identities for a given node in the tree: either it is a leaf or it has two children. A leaf does not need to be encoded, so only one matrix is needed to handle the recursive case. This obviously does not work in general, as one can define a tree whose nodes have two or three children, or that contains values at the internal nodes. For example, if a node had three children, the autoencoder would need an n × 3n matrix to encode it.

For the autoencoder to work with many different recursive cases of a given data type, it will need to maintain a different matrix for each recursive case. Although it is possible to generate these matrices using compiler introspection of the given data type, that is out of the scope of this paper; we discuss it further in the Future Work section.

Since we cannot automatically generate the matrices based on the data type, the user must provide a "description" of his type definition, which is essentially a list of the entries in each recursive case. Upon initialization, the autoencoder will create the required matrices using the "description" given by the user. For encoding, the only change is choosing which encoding matrix to use based on the recursive case at runtime. Using multiple encoding matrices, it is possible to obtain an encoding of the data structure as a single vector. Decoding can be done in the same way as in a traditional recursive autoencoder, by providing the input structure and recursively decoding until the structure matches.
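The sketch below illustrates one plausible form such a "description" could take: the arity of each recursive case, from which one randomly initialized encoding matrix per case is created. The representation and names are assumptions for illustration, not the library's actual API.

(* Create one n x (arity * n) encoding matrix per recursive case,
   given a "description" listing the arity of each case. *)
let make_case_matrices (n : int) (description : int list) =
  List.map
    (fun arity ->
      Array.init n (fun _ ->
          Array.init (arity * n) (fun _ -> Random.float 0.1)))
    description

(* e.g. a tree type whose internal nodes have two or three children: *)
let _matrices = make_case_matrices 10 [ 2; 3 ]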

Now, we discuss the second aforementioned drawback of traditional recursive autoencoders: the structure of the input must be provided in order to decode. Since the autoencoder can learn how to encode the data vectors as a single vector, it is a natural extension to try to encode the structure itself. To incorporate the structure into the autoencoder, we can use a set of orthonormal vectors, or tags, to annotate which recursive case each piece of data belongs to. At construction, the autoencoder can create a set of random tags based on the number of recursive cases, much like the encoding/decoding matrices. Described below are two ways in which tags can be used to encode the structure of the data along with the data vectors.

4.3.1 Tags as Data
One possible method is to simply treat the tags as part of the data. Given a node x ∈ R^n in the tree, either a leaf or an encoded representation of its subtree, we can prepend a tag t ∈ R^k to x (where k is the number of recursive cases) to get an (n+k)-dimensional vector yt. Then, yt can be encoded as a data vector y ∈ R^n using a special tag encoding matrix T ∈ R^(n×(n+k)) (refer to Figure 4). Afterwards, the autoencoder can proceed as before by concatenating the children and applying the appropriate transformation matrix.

Figure 4: Tags as Data (tagging, injecting, and combining steps via T and W)

The decoding step is where we can make use of the tags to recover the structure of the data. Starting with a single vector y, we apply the tag decoding matrix Td to get a vector yt ∈ R^(n+k). Next, yt can be split into a tag t and a data vector x. Based on the tag t, the autoencoder will choose the most likely structure that the original data had, and decode the data portion accordingly.
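The tag manipulations themselves are simple vector operations; the sketch below shows the prepend, split, and case-selection steps under the assumption that a recovered tag is matched against the autoencoder's (orthonormal) tag set by dot product. All names are illustrative.

(* Prepend a k-dimensional tag to an n-dimensional data vector. *)
let prepend_tag t x = Array.append t x

(* Split a decoded (n + k)-vector back into its tag and data parts. *)
let split_tag k yt = (Array.sub yt 0 k, Array.sub yt k (Array.length yt - k))

(* Choose the recursive case whose tag best matches the recovered tag,
   using the dot product as the similarity measure. *)
let classify_tag (tags : float array array) (t : float array) : int =
  let dot a b =
    let s = ref 0.0 in
    Array.iteri (fun i ai -> s := !s +. (ai *. b.(i))) a;
    !s
  in
  let best = ref 0 in
  Array.iteri
    (fun i tag -> if dot tag t > dot tags.(!best) t then best := i)
    tags;
  !best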

4.3.2 Parallel Tag Training
Another possible method is to train the tags in the same fashion that the data is trained. Given two nodes x1 and x2, and their tags t1 and t2 as defined in the previous section, we can concatenate the tags together with a tag denoting the "node" structure, t3. The data is concatenated as in the original recursive autoencoder, and the concatenated tag and data vectors are individually passed through the encoding matrices T and W, respectively, where T ∈ R^(k×3k). In the end, a (k+n)-sized vector is returned.

Figure 5: Parallel Tags (tags t1, t2, t3 combined via T in parallel with data combined via W)

For decoding, a (k+n)-sized vector is split into the k-sized tag vector and the n-sized data vector. The tag vector is decoded and compared to the node tag, and also compared to the leaf tag without decoding, in order to decide whether the data should be interpreted as a node or a leaf. It is then recursively decoded as necessary.
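A sketch of one parallel-tag encoding step appears below; [apply_t] and [apply_w] stand in for the T (k × 3k) and W applications, each assumed to be implemented as in Equation (1).

(* One parallel-tag encoding step: tags and data travel through
   separate matrices, and the results are concatenated into a single
   (k + n)-sized vector. *)
let encode_parallel ~apply_t ~apply_w ~node_tag (t1, x1) (t2, x2) =
  let tag = apply_t (Array.concat [ t1; t2; node_tag ]) in (* T, k x 3k *)
  let data = apply_w (Array.append x1 x2) in               (* W, n x 2n *)
  Array.append tag data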

4.4 Optimization
In a similar fashion to standard machine learning applications, the user will start with a set of training data. The user will then need to specify the type and structure of the data he is inputting to the library, so that it knows how to process the data appropriately.

Moreover, the error function will need to be supplied by the user as well. While many machine learning packages that deal with vectors offer standard preset error functions such as the Euclidean norm, applying those error functions to data structures is not standardized. Thus, an error function is a required input to this library.

To train, the library takes the provided training data and calculates the reconstitution error for each entry. That is, the data is encoded and decoded, and then compared to the original data using the provided error function. The sum of the reconstitution errors is the function that this library attempts to optimize, and serves as the benchmark for accuracy.
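In code, this objective is a fold over the training set; the sketch below assumes [encode], [decode], and [err] as stand-ins for the memory's operations and the user-supplied error function.

(* Sum of reconstitution errors over the training data: each entry is
   encoded, decoded, and compared against the original. *)
let total_error ~encode ~decode ~err training =
  List.fold_left (fun acc x -> acc +. err x (decode (encode x))) 0.0 training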

5. SYSTEM IMPLEMENTATION

5.1 Choice of Language
In order to implement a generalized process for handling structured data, the programming language library has been written in OCaml. OCaml is a functional language, so it offers more fluidity when handling structured data. In addition, OCaml has procedural elements that facilitate implementing and prototyping the various algorithms and models that have been encountered so far.

An OCaml code snippet in Figure 6 is provided as an example of the interface delineated in Section 4.1.

Figure 6: Example Interface Written in OCaml

As is clear from the comments, tree is just an example of a type that can be used with this library.

In addition, the encode and decode functions work together to obtain a value from the memory: after the training data is fed to the memory, an object can be encoded as a vector, and then decoded back into an object, where both operations take place in conjunction with the memory. This implements the output retrieval scheme outlined in Section 4.1.
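Figure 6 itself is not reproduced here; the hypothetical module signature below only sketches the shape of interface the surrounding text describes, and none of these names are claimed to be the library's actual API.

(* A hypothetical signature for the memory interface: train on values
   of a user-defined type, then encode to and decode from vectors. *)
module type MEMORY = sig
  type t     (* a trained memory *)
  type data  (* the user-defined type, e.g. a tree *)

  val train : data list -> t
  val encode : t -> data -> float array
  val decode : t -> float array -> data
end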

5.2 Third-party Libraries
We utilized certain third-party libraries in order to achieve our performance and functionality goals. For adjusting the OCaml runtime itself, we used Jane Street's Core library [1] to tune the garbage collector and improve performance.

To perform the linear algebra operations involved in implementing the different autoencoders delineated in Section 4, we used a library called Lacaml [3] that provides OCaml bindings for BLAS (Basic Linear Algebra Subprograms) [4] and LAPACK (Linear Algebra PACKage) [5]. These are two well-established, robust, and performant linear algebra libraries that are commonly used for machine learning applications. Notably, low-level BLAS is widely used for scientific computation in many more complicated higher-level languages, such as MATLAB or R. This arrangement allows efficient matrix and mathematical operations, while much of the higher-level logic remains within OCaml.

5.3 Gradient Descent
Gradient descent is used to perform the optimizations on the encoding/decoding matrices of the autoencoders. It is a standard, yet relatively simple, method of minimizing a function. To do so, it requires the gradient of the function, that is, the partial derivatives of the function with respect to all of its parameters. Since the partial derivative of a function with respect to a parameter represents the rate of change of that function under a small positive change in the parameter, a positive partial derivative indicates that the parameter should be lower, while a negative partial derivative indicates that the parameter should be higher.

However, it is possible that by increasing or decreasing the parameter values, the algorithm "overshoots", and thus ends up with a higher function value. Therefore, the aggressiveness of this parameter adjustment must be adaptive in order to maintain a balance between efficiency and correctness. In this way, given starting parameters, gradient descent can eventually find slightly more optimal values from that starting point.

This library's implementation of gradient descent is relatively simple. The structure of the routines was initially based roughly on example code posted online by the writers of F# for Scientists, adapted for this library [2]. It takes in a function, the derivative of that function, and starting parameters, and continually adjusts parameter values as described above until either a fixed point is reached or a specified maximum number of iterations is reached. It utilizes the bindings in Lacaml to perform vector and matrix operations efficiently, while the main loop remains in OCaml. This way, the flexibility of addressing data structures in error calculation is maintained while still obtaining faster performance.
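A stripped-down sketch of such a loop is shown below, using plain float arrays rather than Lacaml; halving the rate on an overshoot is one simple form of the adaptive adjustment described above, not necessarily the exact scheme the library uses.

(* Gradient descent over a parameter vector: step against the
   gradient, damp the rate on overshoot, and stop at a fixed point or
   after [max_iter] iterations. *)
let gradient_descent ~f ~grad ~max_iter (p0 : float array) =
  let rec loop p rate iter =
    if iter >= max_iter then p
    else
      let g = grad p in
      let p' = Array.mapi (fun i pi -> pi -. (rate *. g.(i))) p in
      if f p' > f p then loop p (rate /. 2.0) (iter + 1) (* overshot *)
      else if f p' = f p then p (* fixed point reached *)
      else loop p' rate (iter + 1)
  in
  loop p0 0.1 0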

6. RESULTS

6.1 Experiments
Experiments for the programming library were performed on two datasets, one "small" and one "large". The small dataset consisted of two trees with 10-dimensional vectors as the internal data type, and two trees with 20-dimensional vectors as the internal data type.

We encountered significant performance issues in certain portions of our code. A profiler was run, and a large amount of time was being spent in a vector map function, which we call to apply a sigmoid function after matrix transformations. While the runtime was reduced by using a faster sigmoid function (we had previously been using tanh), the number of calls could not be reduced, and thus this remains a bottleneck.

Two other issues were a large number of calls to methods that find matrix and vector dimensions, and a large amount of time spent in the garbage collector. In order to fix these issues, we experimented with certain optimizations, which included compiling the library with an "unsafe" option to remove array bounds checking, and tuning the OCaml garbage collection parameters for better memory performance. Finally, we tested these datasets on both tagged and untagged implementations of our programming library. The graphs for these experiments¹ are shown in Figure 7 and Figure 8.

¹These experiments were performed on a quad-core Intel Core i7-4700MQ @ 2.4 GHz.

Figure 7: Graph with Small Data Sets

Figure 8: Graph with Large Data Sets

6.2 Evaluation
These graphs compare the training error (with respect to the error function) to the number of iterations it took to complete the test. The lower bound for training error is 0; the closer the curves converge to 0, the better the error.

For the untagged tests, we used the standard Euclidean metric in our tree error function. The error of two trees T1 and T2 is simply the sum of the distances of each leaf in T1 to its corresponding leaf in T2 (a code sketch is given below).

As can be seen from the graphs, the optimized versions of the untagged tests for both small and large datasets took fewer iterations to complete, so our optimizations indeed improved the performance of our library.
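Concretely, the untagged tree error function can be sketched as follows, reusing the [tree] type from the Section 4.2 sketch and assuming the two trees share the same structure (the non-matching case, needed for the tagged tests, is described next).

(* Sum of Euclidean distances between corresponding leaves of two
   structurally identical trees. *)
let rec tree_error (t1 : tree) (t2 : tree) : float =
  match (t1, t2) with
  | Leaf x1, Leaf x2 ->
      let s = ref 0.0 in
      Array.iteri (fun i a -> s := !s +. ((a -. x2.(i)) ** 2.0)) x1;
      sqrt !s
  | Node (l1, r1), Node (l2, r2) -> tree_error l1 l2 +. tree_error r1 r2
  | _ -> invalid_arg "tree_error: mismatched structure"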

On the other hand, the tagged tests struggled to converge. While they appear to be better in earlier iterations, that is simply due to an adjustment that needed to be made to the error function to handle non-matching tree structures. To deal with trees whose structures differ, the norm of any vector in a non-matching area is added to the error; however, it is first multiplied by a factor that scales with the inverse of the depth of that vector in the tree.

The lack of convergence was determined to likely be due to the non-continuous nature of the error function once the structure of the data was introduced into the encoding/decoding. Since mismatched structures are likely to be penalized more than mismatched values, and since structural decoding is done with strict thresholds, the optimization functions could not determine the direction of optimization unless the initial values were already on the cusp of changing the structure of the encoded/decoded data. Therefore, the training simply attempted to optimize for smaller values in the mismatched regions, which resulted in convergence to a local minimum.

7. FUTURE WORK

7.1 Code Generation for Optimization
Our optimization routines currently accept functions with a fixed number of arguments to ensure an accurate compiler-level representation of the types involved. That is, if the error function takes in two matrices and two vectors as inputs, that should be reflected in the typing. However, this approach leads to difficulties when it comes to generalizing code to any supplied structure. One solution would simply be to accept functions that take lists of matrices and vectors; however, this removes the semantics behind the arguments. A better solution may be to dynamically generate type-safe code that encapsulates the types of the arguments.

7.2 Compiler Introspection
Currently, while our system model supports user-supplied data structures, it requires a substantial amount of user specification for the library to function. The end goal would be to achieve a level of flexibility where the user may simply provide data, and the runtime type would be determined and all of the memory operations would adapt to account for the type. However, this requires a higher level of introspection than OCaml provides, as variants cannot be enumerated programmatically. Thus, some degree of compiler-level introspection must be used in order to provide this functionality.

7.3 Error Functions for Tagged Structures
For the tagged structure experiments, our optimization routines struggled to find values that resulted in accurately encoded and decoded structures. As mentioned earlier, this is likely due to the error function no longer being differentiable with respect to the parameters. Since the gradient will likely be too short-sighted to accurately reflect the necessary adjustments, a possible route forward would be to annotate the data structures with the tag, resulting in a "soft" structure declaration alongside the actual structure. The error function would then be able to detect not only whether two structures are different, but also how different they are, and could therefore optimize in the direction that causes the structures to become more similar.

8. ETHICS
Since this library interface is heavily research-based, there are no obvious ethical issues to consider. However, if the library were put into a production system, it would likely see use in machine learning applications, and there are plenty of ethical issues related to machine learning. In this context, our programming library interface would face similar ethical issues. For instance, machine learning applications work on large sets of data, and some of this data can be quite personal and private. In a certain sense, machine learning models that only work on flat vectors obfuscate this data, because the data must be converted and transformed from its original structure into something different. This interface, however, would not perform this data obfuscation because it accepts the data in its original structure. This could potentially create privacy issues, since the end user of this library would have an easier way, either intentionally or unintentionally, of gathering personal, private information from the collected datasets.

9. CONCLUSIONS
This library started as an endeavor to create an interface offering programming support for autoassociative memories. There were two main goals: offer a flexible user interface, and support structured data. Structured data was handled in a method similar to that of recursive autoencoders, with additional experimental methods explored as well. Unfortunately, this complicated the user interface, which currently requires substantial user specification of data structure-related information. However, the modularity of our methods implies that there are distinct avenues forward for providing a better user experience. Moreover, the exploration of tagged autoencoders also indicates potential ways forward with "soft structure" error functions.

10. REFERENCES

[1] Jane Street. janestreet/core. http://github.com/janestreet/core/, 2011. Accessed: 2015-01-27.

[2] Jon Harrop. F# news: Gradient descent. http://fsharpnews.blogspot.com/2011/01/gradient-descent.html, 2014. Accessed: 2015-01-27.

[3] Markus Mottl. mmottl/lacaml. http://github.org/mmottl/lacaml/, 2014. Accessed: 2015-01-27.

[4] Netlib. BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/, 2014. Accessed: 2015-01-27.

[5] Netlib. LAPACK (Linear Algebra PACKage). http://www.netlib.org/lapack/, 2014. Accessed: 2015-01-27.

[6] Richard Socher, Eric Huang, et al. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, 2011.

[7] Richard Socher, Jeffrey Pennington, et al. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Conference on Empirical Methods in Natural Language Processing, 2011.

[8] Alessandro Sperduti. Linear autoencoder networks for structured data. In Ninth International Workshop on Neural-Symbolic Learning and Reasoning, 2013.