
Future Generation Computer Systems 16 (2000) 705–716

Abstract machine design on a multithreaded architecture

Zsolt Németh
MTA Computer and Automation Research Institute, H-1518 Budapest, PO Box 63, Hungary

Abstract

Typical approaches to designing a Prolog abstract machine primarily aim at exploiting different kinds of parallelism (e.g. AND-, OR-parallelism). Our novel approach focuses instead on some specific features, like macro-dataflow properties derived from the Prolog language. The work is aimed at finding a proper relationship between the essential characteristics of the macro-dataflow computational model and the underlying architecture. In this paper a part of this design work is introduced, where the variable binding method is established and its possible effect on performance is analysed. © 2000 Published by Elsevier Science B.V. All rights reserved.

Keywords: Prolog abstract machine; Multithreading; Variable binding; Performance prediction

1. Introduction

Recently, much research effort has been devoted to multithreaded architectures. They offer a possible solution to some basic questions of multiprocessing: tolerating memory latencies and synchronisation [2]. However, from the programming point of view it is a new challenge to write programs which can benefit from the features offered by the architecture. Roughly speaking, the main question is whether the program can be split into a satisfactory number of threads. At a low number of threads multithreaded architectures perform far worse than conventional ones, whereas a high number of threads can lead to other sources of performance degradation. Obviously, this burden cannot be put on the shoulders of high-level application programmers. One way to achieve this goal is the development of new compilers which are capable of extracting threads from programs written in conventional languages. Another possible way is compiling high-level languages to abstract machine code and interpreting the abstract code, where the interpreter is written in conformance with the architecture.

The work presented in this paper was supported by a National Science Research Fund (OTKA) project under grant number T022106.

E-mail address: [email protected] (Z. Németh)

The other important area of investigation is the relationship of the architecture and the computational model. Interestingly, multithreaded architectures span a wide spectrum of possible models from von Neumann to pure dataflow ones. The most promising ones are the hybrid dataflow/von Neumann architectures between the two extremes. Obviously, their efficient operation strongly depends on the applied computational model, which need not necessarily be a von Neumann one [23].

The topic of the current research work is an implementation of a Prolog interpreter on a multithreaded architecture. However, the implementation represents just the framework and the goal is different: investigating how a certain logicflow model fits a kind of hybrid multithreaded architecture and what the conditions of efficient operation are. In a more general sense, it deals with the relationship of computational models and the underlying physical architecture. The lessons learned from this research work can be generalised to find ways in which the gap between models and architectures can, at least, be narrowed.

0167-739X/00/$ – see front matter © 2000 Published by Elsevier Science B.V. All rights reserved. PII: S0167-739X(99)00068-0

LOGFLOW is a highly parallel implementation of Prolog for distributed memory architectures [14]. The abstract execution model is the so-called logicflow model, which is a higher abstraction of dataflow principles. In this model, Prolog programs are compiled into the so-called dataflow search graph (DSG), where nodes represent Prolog features like unification, alternatives, body goals, facts, built-ins, etc. The execution is governed by token streams. A single token appearing on an input arc of a node can make the node fire. As a result, the node can generate several answer tokens which are forwarded to the next node on the request arc. A Prolog program is launched by a single request token on the request arc of the first node, and a query is solved when all the tokens (representing all possible solutions) have appeared on the answer arc of the same node.
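The firing rule can be illustrated schematically as follows (a Python sketch, not LOGFLOW's implementation; the node type and token representation are invented for the example):

```python
from collections import deque

class FactNode:
    """A DSG leaf node: a single request token makes it fire, and it
    emits one answer token per matching fact (names are illustrative)."""
    def __init__(self, facts):
        self.facts = facts

    def fire(self, request_token):
        # One request token in; possibly several answer tokens out,
        # one per possible solution.
        goal = request_token["goal"]
        return [{"answer": fact} for fact in self.facts if fact[0] == goal]

# Execution is governed by token streams held in a queue.
tokens = deque([{"goal": "parent"}])
node = FactNode([("parent", "tom", "bob"), ("parent", "bob", "ann"),
                 ("age", "tom", 60)])
answers = node.fire(tokens.popleft())
# Two facts match the request, so two answer tokens are produced.
assert len(answers) == 2
```

A real DSG node would also carry inner state and forward the answer tokens to the next node on its request arc.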

In summary, the logicflow model can be considered as a kind of macro-dataflow model, where the nodes do not represent single instructions but a short sequence of instructions, and nodes can have an inner state. The compiled code is executed (i.e. interpreted) by an abstract machine, called Distributed Data Driven Prolog Abstract Machine (3DPAM) [13]. The abstract machine consists of an engine which basically performs the logicflow based execution sketched above, a scheduler (responsible for proper load balancing), input and output managers, and possibly other components for debugging and monitoring purposes. The 3DPAM has been implemented on various transputer configurations and experiments were made on clusters of workstations [15].

The target architecture of the current Multithreaded Prolog Abstract Machine (MPAM) implementation is the Kyushu University Multimedia Processor on Datarol (KUMP/D) [24]. KUMP/D is a successor of the multithreaded Datarol [1] and Datarol-II [17] machines. Some drawbacks (inability to use the high-speed registers and pipeline technique of conventional RISC processors) severely limited the performance of the Datarol machine, and hence an optimised redesign became necessary. The new version, called Datarol-II, extracts fine-grain threads from a dataflow graph, and executes these threads by means of a program-counter-based pipeline equipped with high-speed registers similar to those of conventional RISC processors. However, the continuation-based thread activation control, which includes thread synchronisation and split-phase operations, is still executed in a circular pipeline, as in the original Datarol machine. In Datarol-II, threads are invoked by packets. Each instance has logical registers and matching counters which are used for synchronisation. Using these logical registers and matching counters, threads are activated by a data-driven mechanism.

KUMP/D is based on a commercially available Pentium and a co-processor (Fine-grain Message Driven Processor, FMP). The FMD (Fine-grain Message Driven Mechanism) is a revised version of the Datarol-II execution mechanism. The basic FMD message is an explicitly addressed remote memory write message which also carries a continuation thread after its write operation. The basic runtime model of the FMD is similar to that of the Datarol. A number of function instances are created during program execution. Such an instance has shared program code and a private environment. The code is split into threads, i.e. code blocks, which are executed without any interruption until the termination point. A context switch may occur at the end of the thread. The FMP is an implementation of the FMD mechanism which assists fine-grain message passing, thread synchronisation, remote memory access and instance frame management.

The target of the current research is the elaboration of MPAM, an abstract machine which forms a bridge between the logicflow model and the KUMP/D architecture. In the following sections the founding principles of the work are introduced, then a certain design issue, namely variable handling, is described. Some performance models were chosen in order to predict the performance, and a certain design decision of the variable binding was supported by them.

2. A multithreaded abstract machine design

The development of LOGFLOW on KUMP/D can be split into two phases. The first one, called macro thread level design, aims at setting up a proper runtime model for the MPAM. This phase is in an advanced stage and several key issues have been discovered which show how promising the idea of logicflow on a multithreaded architecture is.

The 3DPAM was based on the concept of tokens. This means that a node (represented by a code fragment) was initiated by a data packet taken from the waiting queue. At this point the state of the node is updated from the context tables. The token and the context information give the necessary parameters for the execution. After the execution of the code belonging to the node, the token and context tables are updated again. As can be seen, the implementation of dataflow features requires a considerable amount of overhead. The execution semantics of KUMP/D are nearly the same, yet it has a dataflow nature at the processor level.

The basic idea of the new MPAM, according to the execution model of KUMP/D, is that each node activation is represented by a new, separate procedure instance. Such a procedure instance has shared program code and a private execution environment which is called an instance frame. In this way, the token and context information is handled as local variables, making the token and context tables completely superfluous and, moreover, eliminating the load and store procedures. Token and context information is propagated as parameters to a function call.

Nodes usually have multiple entry points (and multiple procedures) according to different activation cases. These procedures in the new abstract machine are represented as multiple threads. A thread is a sequential procedure which is initiated either by a function call or by a synchronisation event. A node is a function instance with many threads.

Threads which are ready to run are kept in a Thread Queue. The FMP takes a thread from the queue and the CPU (which acts as the engine) executes it sequentially. The smallest scheduling unit is the thread, and thus an even finer granularity can be achieved than by token based scheduling. As a consequence, tokens are eliminated from the system while the same semantics are maintained.
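The thread-level scheduling described above can be sketched as follows (an illustrative Python model; the real CPU/FMP interaction is of course in hardware):

```python
from collections import deque

thread_queue = deque()            # threads that are ready to run

def spawn(thread):
    # The FMP enqueues a ready-to-run thread.
    thread_queue.append(thread)

def run():
    # The CPU (engine) takes one thread at a time and runs it to its
    # termination point without interruption; a context switch may
    # occur only between threads.
    log = []
    while thread_queue:
        thread = thread_queue.popleft()
        log.append(thread())      # sequential execution of the thread body
    return log

spawn(lambda: "unify")            # entry points of a node become threads
spawn(lambda: "answer")
assert run() == ["unify", "answer"]
```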

The abstract machine can be considered as a middle layer between the high level language (abstraction) and the physical machine. The goal is to put the abstract level as close to the physical world as possible, thus reducing the runtime overheads of interpretation (Fig. 1). In our case the aim of the first stage in the design is finding all the relevant similarities between the logicflow model and the KUMP/D runtime model and exploiting these coincidences. As can be seen, essential simplifications to the 3DPAM were made possible, allowing a near 'native' execution on KUMP/D. In the next section an important issue, the variable binding method, is introduced.

Fig. 1. The goal of the investigation is narrowing the gap between the abstract machine and the physical architecture.

2.1. Variable handling

Originally, LOGFLOW was based entirely on the closed binding environment scheme. A closed environment in Conery's definition [7] is a set of frames E such that no pointers or links originating from E refer to slots in frames that are not members of E. In other words, the environment holds all the necessary information to continue the computation at a given point. Thus, tokens holding an environment are allowed to be distributed over the processor space freely, since the piece of work represented by the token can be executed using the information held by the token only. This method has two special procedures to close an environment with respect to another one: the (two-phase) unification and the back-unification.
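Conery's condition can be made concrete with a small check (an illustrative sketch; the frame and slot representation is invented):

```python
def is_closed(E):
    """A set of frames E is closed iff no pointer originating from E
    refers to a slot in a frame that is not a member of E."""
    member = set(id(f) for f in E)
    for frame in E:
        for slot in frame["slots"]:
            target = slot.get("ref")   # a reference to another frame, or None
            if target is not None and id(target) not in member:
                return False
    return True

f2 = {"slots": [{"ref": None}]}
f1 = {"slots": [{"ref": f2}]}
assert is_closed([f1, f2])     # every reference stays inside the set
assert not is_closed([f1])     # f1 points out to f2, so {f1} is not closed
```

The closing procedures mentioned in the text are exactly what restores this property after unification has introduced references across frame boundaries.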

The environment closing is a costly operation both in time and in memory consumption. In [19] the sources of overhead were investigated and divided into three groups: overhead from environment scanning, overhead from environment extension and overhead from structure copying. Comparing these overheads to each other leads to the conclusion that the main sources of overhead are scanning the compound terms and copying the compound terms.

Solutions have been proposed to eliminate these overheads by combining the local addressing scheme of the closed environment with the global addressing scheme of non-closed environments. In [19] the quasi-closed environment (QCE) is introduced, where variables are divided into two groups according to whether they can occur in a structure or not. Variables occurring in a structure are stored by global address, whereas the others are still stored locally. In this way, the most expensive part of the closing procedure, the structure scanning and copying, can be discarded; more precisely, it can be postponed until the task is really migrated between processors.

Fig. 2. Elements of the new hybrid binding scheme. The heap contains symbolic references to the variables, which can be resolved by the Hash Table. The Pointer Array serves as a key to the Hash Table and is intended to assure separate access to the table for different environments. The sharing of structures would have been impossible in the closed scheme.

The redesign of the variable binding scheme in LOGFLOW resulted in the new hybrid binding scheme, which combines ideas from several binding methods. The two main sources of the hybrid scheme are Conery's closed environment and the QCE idea. By default, variables are handled in the usual way according to the closed binding environment principles. Variables are made global, and thus handled differently, when they are in the arguments of a structure.

The details of the design and the description of thehybrid binding scheme can be found in [20].

The hybrid scheme has the following working model (Fig. 2). New, global names are assigned to variables occurring in a structure. Their values are stored in an auxiliary data structure, called the Hash Table. Binding methods based on some kind of stack sharing use a number of such structures for private data, e.g. the Binding Array, Directory Tree, Version Vector, etc. What they have in common is that they separate the private data by workers, whose number is known in advance. In LOGFLOW, in contrast, environments are separated by execution threads, which are created dynamically and whose number is several orders of magnitude greater than that of the workers. This is the reason why a Hash Table, which can be handled in a dynamic way, was chosen.

The other important part of the solution is the Pointer Array. It serves as a key to the Hash Table. Different environments have different indices into the Array, and thus they can see different parts of the Hash Table (the effect is similar to the Hash Table having been divided into clusters). This solution allows a simple and elegant way of duplicating the environments at alternatives: only the corresponding item in the Pointer Array needs to be copied, and the data in the Hash Table are shared between the two environments.

Whenever a variable must be accessed, a hash address is calculated by applying a function to the name of the variable. The index in the environment points to an element in the Pointer Array, and the hash address can be considered as a displacement within that item. With this information the necessary variable can be found.
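The lookup and the cheap environment duplication can be sketched as follows (a minimal Python illustration of the described mechanism; the table sizes, hash function and data representation are our own assumptions, not LOGFLOW's actual ones):

```python
HASH_SIZE = 8

def hash_addr(var_name):
    # A hash address is calculated by applying a function to the
    # variable's name (collisions are ignored for brevity).
    return sum(map(ord, var_name)) % HASH_SIZE

class BindingStore:
    """Illustrative model: the Pointer Array gives each environment its
    own view of the shared Hash Table."""

    def __init__(self):
        self.hash_table = []      # shared cells holding the variable values
        self.pointer_array = []   # per-environment items: hash addr -> cell

    def new_environment(self):
        self.pointer_array.append({})
        return len(self.pointer_array) - 1    # index kept in the environment

    def duplicate(self, env):
        # At an alternative only the Pointer Array item is copied; the
        # cells in the Hash Table remain shared by both environments.
        self.pointer_array.append(dict(self.pointer_array[env]))
        return len(self.pointer_array) - 1

    def bind(self, env, var_name, value):
        self.hash_table.append(value)
        self.pointer_array[env][hash_addr(var_name)] = len(self.hash_table) - 1

    def lookup(self, env, var_name):
        # The environment's index selects a Pointer Array item; the hash
        # address acts as a displacement within that item.
        return self.hash_table[self.pointer_array[env][hash_addr(var_name)]]

store = BindingStore()
e1 = store.new_environment()
store.bind(e1, "X", "foo")
e2 = store.duplicate(e1)                  # cheap: one item copied
assert store.lookup(e2, "X") == "foo"     # the binding itself is shared
```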

Detailed test results on transputer networks showed that the overhead related to environment closing can be significantly reduced within a processing element (PE) by combining different binding methods. However, the frequent structure copying between PEs (which is related to the fine-grain property of LOGFLOW) cannot be eliminated or hidden by these binding schemes. It is suggested that some kind of hardware support for accessing remote structures is needed. These results pointed out that this kind of hybrid binding scheme is an excellent candidate for a binding environment on multithreaded architectures, where the intra-processor overhead associated with closing procedures is reduced by the binding method, whereas the inter-processor copying overhead is eliminated by the facilities of the architecture. How can the hybrid method be modified for proper use in a massively parallel, multithreaded environment?

Let us denote by θ a set of variables and by σ a set of compound terms that can be accessed at a given point of execution. In the closed binding scheme both θ and σ constitute the environment frame, which must be closed, and this results in frequent structure copying even inside a processing element.

Table 1
Comparison of binding schemes (locality means that the data item can be accessed by local read operations; the closed property means that the environment holds all the necessary data without external references)

  Scheme                  θ1                           θg                                            σ
  Closed                  Local, closed, transferred   Local, closed, transferred                    Local, closed, transferred
                          between PEs                  between PEs                                   between PEs
  Hybrid                  Local, closed, transferred   Local, global within PE,                      Local, global within PE,
                          between PEs                  transferred between PEs                       transferred between PEs
  Hybrid-multithreaded    Local, closed, transferred   ?, global in processor space but              ?, global in processor space,
                          between PEs                  local with respect to σ, never transferred    never transferred

The hybrid environment divides θ into two subsets, θ1 and θg, according to their occurrence. θg and σ are handled globally within a processing element, while θ1 represents the local environment information which is still closed with respect to other frames. However, when a task physically migrates to a different processing element, not only θ1 but θg and σ must be copied as well, because remote memory accesses are prohibitively expensive.

In a multithreaded environment the hardware can hide the latencies caused by such remote memory accesses. Therefore, θg and σ are organised globally for the whole processor space. This means that while the computation can migrate over the space freely, holding the closed θ1 set of local variables, the structures (σ) and the variables occurring in the structures (θg) are never moved. Table 1 contains a comparison of these schemes.

The hybrid binding scheme must be extended at two points. First, unique global variable names must be generated for the whole processor space. This can easily be solved by encoding in the variable name the identifier of the processor where it was created.
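A possible encoding (an illustrative bit layout, not KUMP/D's actual one) is:

```python
PE_BITS = 8   # illustrative: up to 256 processing elements

def make_global_name(pe_id, local_serial):
    # Unique across the processor space: the creating PE's identifier
    # is packed into the high bits of the 32-bit variable name.
    return (pe_id << (32 - PE_BITS)) | local_serial

def home_pe(name):
    # Recover the PE on which the variable was created.
    return name >> (32 - PE_BITS)

n = make_global_name(pe_id=5, local_serial=42)
assert home_pe(n) == 5
```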

Second, the state of the computation is represented by data which are distributed in the processor space (in fact they can be imagined as portions left behind on those processing elements where the thread has ever been executed). Therefore, it must be recorded what the states of θg and σ are at different places with respect to the current thread. Since σ is excluded from all the closing and migration procedures, its elements can be considered as skeletons and their addresses are static. On the other hand, θg is changing. The Pointer Array is the only way of accessing the Hash Table, and thus the environment should contain not just an index to the Pointer Array but a vector of indices, where an index is assigned to the Pointer Array on a certain processing element. Using this information, the necessary variable can be accessed.

The way the data access is carried out needs further investigation. Question marks in Table 1 indicate that the given information can be either local or remote. The amount of remote accesses fundamentally influences the utilisation of a processor, and thus the performance. Therefore, the locality aspects of the new data layout are analysed in a later section.

3. Analytical and computational models for multithreaded processing

In [9] the possible analytical models are divided into four groups based on whether the architecture belongs to the group of small or large latencies and whether the model is discrete-time or continuous. Our research work is aimed at large-latency architectures. The discrete-time model for large-latency architectures distinguishes two sets of system states: context switch (CS) and non-context switch (NCS) ones. The idle state is hidden and represented by those states (either CS or NCS) where the number of ready to run threads is zero. All the essential parameters, L (latency), R (run length) and C (context switch cost), are assumed to be stochastic with geometric distribution. The model is based on a time-homogeneous, discrete-time Markov chain. The paper gives an iterative formula for calculating the probabilities of the individual states. The charts presented in this paper are derived from these formulas.


In [21] a Petri-net based analytical model for multithreaded execution is introduced. The individual threads can be in the following states: running, leaving, blocked and ready. Together the threads determine the state of the processor, which can be: busy, switching and idle.

Within this framework several models are derived. The roughest assumption is that R is constant. This approach leads to a closed-form expression; however, it is unrealistic in most cases. Nevertheless, it is able to describe the basic working ranges of the multithreaded architecture, namely the linear and saturated modes.
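The closed-form expression itself is not reproduced in the paper; a commonly used form under the constant-R assumption (following Saavedra-style multithreading models) puts utilisation at N·R/(R + L + C) in the linear mode and R/(R + C) in saturation. A sketch, with an illustrative value of R:

```python
def utilisation(N, R, L, C):
    """Processor utilisation with N threads under constant run length R,
    latency L and context switch cost C: linear mode while the latency
    is not yet hidden, saturated mode once enough threads are ready."""
    linear = N * R / (R + L + C)
    saturated = R / (R + C)
    return min(linear, saturated)

# L = 76 and C = 9 as in the paper's Fig. 7 setting; R = 16 is chosen
# here purely for illustration.
R, L, C = 16, 76, 9
assert utilisation(1, R, L, C) < utilisation(4, R, L, C)    # linear growth
assert utilisation(1000, R, L, C) == R / (R + C)            # saturation
```

The crossover between the two branches gives the number of threads needed to saturate the processor, which is exactly the working-range distinction mentioned above.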

If R is assumed to be stochastic, the model leads to a Markov-chain model, and the expressions for the probability of the busy, idle and switching states become significantly more complex regarding both their derivation and their calculation. This model gives a better estimation; however, some parameters are still treated as constants. Although the calculation methods can give a fast result (within the possibilities of the model), the derivation of the mathematical form requires too many assumptions, which does not allow the precision to be improved beyond a limit.

Fig. 3. A snapshot from a simple visual multithreaded simulation.

In [21] only the mathematical forms were introduced. The formulas presented there were taken into consideration in our work as well. Furthermore, the Petri-net model of the multithreaded architecture can also be simulated. In this case the temporal behaviour of all the parameters can be observed and some special features (e.g. the effect of the cache) can be described. The simulation can validate the results generated by the numerical methods. However, while the numerical calculation can produce results promptly, running a simulation may take several hours to obtain acceptable results.

Finally, if exactness can be sacrificed but a quick qualitative view is necessary, the simulation can be visualised on-line (we implemented a simple experimental visualisation, which can be seen in Fig. 3). Although visualisation is an interfacing technique, in this context it means a qualitative way of observation. Here all the state transitions can be observed, and the basic working cycle can be understood. It also gives a quick insight into how the parameters can influence certain features of the execution; however, precise expressions and estimates cannot be established. On the other hand, the temporal behaviour of the system can be observed best by visualisation tools.

4. Analysis of Prolog data structures

After surveying the possible ways of performance analysis and introducing a performance model, let us consider a concrete example: the design of the data organisation of the multithreaded Prolog abstract machine on the KUMP/D. The discussion presented here will not give precise results, since most of the practical work is still ahead of us. The intention of this example is to show the possible means of dealing with a complex design phase, together with their capabilities and limits.

The ability of multithreaded architectures to hide the memory latency holds only in a narrow range of parameters. Therefore, if the variable binding scheme partially relies on the features of the architecture, it must be carefully investigated whether the parameters can fulfil the conditions of optimality.

4.1. The scope of investigation

The hybrid-multithreaded binding scheme consists of the following parts:
• The environment.
• The heap.
• The Hash Table.
• The Pointer Array.

The Hash Table and the Pointer Array are supplementary data structures to the heap. Two points are obvious: the environment frames are always local, and thus moved with the token, while the heap is organised in a virtually shared way, i.e. most of the heap accesses are targeted at a remote memory block. The auxiliary data structures, i.e. the Hash Table and the Pointer Array, are organised in such a way that the data they contain are always local with respect to the structure that refers to them. By this arrangement the ratio of local to remote loads, and thus the fundamental R parameter, can be altered.

Structures can appear at unification in two cases: in the head of a clause or in the arguments of a call. By analysing all the possible occurrences, it can be shown that the only subject of this investigation is the unification in READ mode, when the structure (and the accompanying data structures) are remote but the environment is local. The unification of a structure has the following steps: get_struct, where the name and arity of the structure are matched, and then a sequence of unify instructions (unify_var, unify_value, unify_atom, unify_const, etc.). The location of the structure can be determined at the get_struct instruction. In the sequence of unify instructions, variables are dereferenced (their values are obtained) and possibly bound. This phase may involve most of the remote load and write instructions, and thus the variable unification forms the object of the following analysis.
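The READ-mode unify sequence can be sketched as follows (a heavily simplified Python illustration; the instruction names follow the text, while the heap representation and helper functions are invented for the example):

```python
def get_struct(heap, addr, name, arity):
    # Matches the name and arity of the structure; this is also the
    # point where the (possibly remote) location of the structure is
    # determined.
    tag, functor = heap[addr]
    return tag == "STR" and functor == (name, arity)

def deref(heap, binding, addr):
    # Follow variable bindings until an unbound variable or a value is
    # reached; in Case 1 this is where the remote loads occur.
    while heap[addr][0] == "REF" and addr in binding:
        addr = binding[addr]
    return addr

def unify_var(heap, binding, arg_addr, local_env, var):
    # READ mode: dereference the structure argument and bind the local
    # variable to the result.
    local_env[var] = deref(heap, binding, arg_addr)

# A toy heap: cell 0 holds the structure point/2, cell 1 an integer,
# cell 2 an unbound variable.
heap = {0: ("STR", ("point", 2)), 1: ("INT", 3), 2: ("REF", None)}
binding, env = {}, {}
assert get_struct(heap, 0, "point", 2)
unify_var(heap, binding, 1, env, "X")
assert env["X"] == 1
```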

The decision to be made concerns the data access methods. Two versions are taken into consideration, and it should be clarified which one is better and under what circumstances. Through this example the modelling, calculation and simulation methods are introduced. It will be shown how far analytical models can lead, and at what point practical experience, in the form of simulation, becomes essential.

4.1.1. Case 1: a naive approach
Case 1 represents a naive data access method, i.e. there are no special procedures to shorten the data access time. The variables are accessed uniformly, no matter whether they are local or remote. This approach results in very frequent remote loads in the dereferencing phase (see Fig. 4). If the architecture can hide the latency caused by these loads, this approach could prove effective, since there are no extra procedures related to data access.

4.1.2. Case 2: an intelligent approach
In Case 2 the remote loads are entirely eliminated by starting a micro-thread on the remote site. This thread can access every variable locally. However, the price of this solution is starting a new thread, which in this case involves moving a considerable amount of data (Fig. 5).

4.2. Analysis of Case 1

The goal of the preliminary analysis is to clarify whether the naive approach results in performance degradation or not, and what the critical thresholds of the parameters are.

Fig. 4. Variables are accessed uniformly, no matter whether they are local or remote. This approach can lead to frequent remote loads.

Fig. 5. The entire unification process is realised by a thread on the site where the data are present. Every access is local; however, the creation of a new thread represents overhead.

A remote read instruction is executed in the following way (Fig. 6). The instruction itself consumes a few (6 machine instructions of 1 cycle each and a memory write, i.e. 6 + W) CPU cycles before the thread is terminated. The period of latency starts at the termination point. The Fine-grain Message Processor FMP(A) decodes and executes the instruction (prepares a message packet), which takes F cycles. The packet is transferred via the network. The time of transmission is determined by the network bandwidth, the packet size and the number of hops (NOH). On the remote (target) side, FMP(B) receives the packet, decodes and executes it (F cycles), reads the necessary memory cell (W cycles) and prepares the answer packet (F cycles). The transfer time is k·NOH again, where k is a constant derived from the physical capabilities of the network and the packet size. Finally, FMP(A) decodes the answer (F cycles), updates the synchronisation cell (Wf cycles) and writes the answer into the memory (W cycles). If the synchronisation was successful (there are no more pending reads), the thread can be selected for execution.

Fig. 6. The working cycle of a remote read.

Table 2
Some typical values for the latency

  NOH         1          2
  F and Wf    4    8     4     8
  L           76   96    124   144

From this analysis the latency can be calculated, atleast, the range of latencies can be set between thebest and the worst cases (see Table 2).W is 4 cycles,F andWf can be either 4 or 8 cycles.k is 24 for thecurrent design of KUMP/D. NOH is typically 1 on atorus of 16 processors, seldom 2 and almost never 3(assuming a scheduling strategy that takes into accountthe distance).
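The step-by-step cost description above can be checked numerically. The following sketch (the function name is ours, not from the paper) reconstructs the latency of one remote read from the listed components; it reproduces the values of Table 2.

```python
# Latency of a single remote read, reconstructed from the working cycle
# of Fig. 6. All quantities are CPU cycles:
#   W  - memory access          F   - one FMP decode/execute step
#   Wf - synchronisation update k   - network constant of KUMP/D
#   NOH - number of hops
def remote_read_latency(F, Wf, NOH, W=4, k=24):
    send = F + k * NOH            # FMP(A) builds the packet, network transfer
    serve = F + W + F             # FMP(B) decodes, reads memory, builds answer
    reply = k * NOH + F + Wf + W  # transfer back, decode, sync update, write
    return send + serve + reply

# the best and worst cases of Table 2
for NOH in (1, 2):
    for FWf in (4, 8):
        print(NOH, FWf, remote_read_latency(FWf, FWf, NOH))
```

The four combinations give exactly the range of 76 to 144 cycles shown in Table 2.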

This calculation assumed that FMP(B) can serve the request at once. In fact, other remote read requests may arrive at the same time; they are queued, and a given request is served after a certain amount of time. In this case the analytical model treats L as a stochastic value, so some amount of uncertainty is taken into account. A more precise analysis would require the modelling of the queue, and the result would not differ significantly from the range set.

The other important parameter, the switch cost (C), can be obtained by the following consideration. The thread queue is kept in the main memory. The CPU takes the ready-to-run threads from there. If the queue is empty, the CPU tries to get new threads from the FMP (this is some kind of synchronisation between CPU and FMP). The switch cost depends on whether the appropriate parts of the thread queue can be found in the cache or not. According to these cases, there is not a wide range of possible C parameters but two distinct values: 9 and 27. Additionally, some registers may need to be saved and restored at context switch. This cost depends on the number of registers to be saved and the state of the cache, and represents a few additional cycles.

Fig. 7. Utilisation as a function of the number of threads, parametrised by R. L = 76, C = 9 and 27, respectively.

These parameters are determined by the hardware, and from those data the possible efficiency charts can be drawn (Fig. 7). The L parameter does not influence the utilisation in saturated mode, but the C parameter is essential; therefore, two cases of different C values are illustrated in Fig. 7. From the charts it can be concluded that the average run length (R) should be at least 35 in the phase of structure unification in order to avoid a serious performance degradation (even if some register saving is taken into consideration by C = 12 and C = 15). Interestingly, the number of threads is not essential in this case; the system is in saturated state if there are at least 6–7 threads.
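Charts such as Fig. 7 can be generated from a simple deterministic utilisation model of multithreaded processors (in the spirit of the models in [21]; the function name and the min() formulation below are our own sketch, not the paper's exact model):

```python
# Deterministic utilisation of a multithreaded processor.
#   N - number of threads      R - average run length between remote reads
#   L - remote-read latency    C - context switch cost   (all in cycles)
def utilisation(N, R, L, C):
    # Below saturation each thread contributes R useful cycles out of
    # every R + C + L cycle period; in saturation the latency is fully
    # hidden behind other threads and only the switch cost stays visible.
    linear = N * R / (R + C + L)
    saturated = R / (R + C)
    return min(linear, saturated)

# In saturated mode L drops out and only R and C matter:
print(round(utilisation(10, 35, 76, 9), 2))   # C = 9
print(round(utilisation(10, 35, 76, 27), 2))  # C = 27
```

The saturated utilisation R/(R+C) shows why the switch cost dominates: with R = 35, the higher switch cost of C = 27 costs roughly a quarter of the utilisation achieved with C = 9.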

However, from this point the analytical model cannot tell more, and it should be completed by some kind of simulation tools. The relative occurrence of C = 9 and C = 27 is an open issue. Furthermore, the analysis of the assembly source can only be done dynamically, where branches are taken according to some real statistics. The real behaviour of the CPU (which is a Pentium in KUMP/D), e.g. the instruction pairing and the pipeline utilisation, can be observed by a sophisticated simulator. Therefore, the estimation of the average run length needs further elaboration of a complex simulator.

Table 3
Code fragment of thread invocation (caller side). (The cost formulae can be explained by Fig. 6.)

Label  Instruction                   Cost formula
L1     s=newscell(CF,L2,1)           6+W
       !allocf(PE,sizeof(f),&r,s)    6+W+F+24*NOH
L2     link(p,0,arg0)                6+2*W+2*F+24*NOH
       link(p,1,arg1)                ...
       ...
       link(p,n-1,argn-1)
       s=newscell(CF,L3,m)           6+W
       rlink(p,n,&r0,s)              6+2*W+2*F+24*NOH
       rlink(p,n+1,&r1,s)            ...
       ...
       rlink(p,n+m-1,&rm-1,s)
       !start(p,F0)                  6+2*W+2*F+24*NOH
L3     ...

Yet, the results of the analytical modelling can give some hints. The structure unification consists of several abstract instructions and only a few of them are critical, e.g. unify_var or unify_value. The average run length in the case of the other instructions is far above the threshold, and by analysing unify_var, there is a small range of instructions where the run length between remote reads is below 20. It can be concluded that for small structures the naive unification approach will likely not lead to degradation in utilisation.

4.3. Analysis of Case 2

While in the previous case the goal was the maintenance of an acceptable processor utilisation, in Case 2 the central issue is the overhead. There are significantly fewer thread terminations due to waiting for pending remote reads. Therefore, the processor utilisation is likely not demanding, but the extra instructions necessary for thread invocation can in some cases be comparable to the useful computation. In Case 2 the amount of the introduced overhead should be calculated.

Let us assume there are n input parameters, arg0 ... argn−1, and m output parameters, r0 ... rm−1 (Table 3). f denotes the name of the called function, p denotes the frame to be allocated for the called function, and F0 is the initial thread of the called function. L1 is the thread which creates the new frame on the remote site. L2 is the thread which makes the parameter passing. L3 is the one which is executed after the thread has been terminated.

L1 consists of two instructions which require 2*(6+W) CPU cycles. L2 involves the main part of the thread invocation and requires (m+n)*(6+W) CPU cycles for parameter passing and 2*(6+W) cycles for other activities. The return phase incorporates m*(6+W) cycles and 3+W for the thread termination. So, the total CPU overhead (additional cycles) is (4+2*m+n)*(6+W). Taking into account that the typical value for n is 20...30 and for m is 5...15, the additional CPU cycles are comparable with the useful computation for very small (one- or two-element) structures.
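The total overhead formula above is easy to evaluate for the typical parameter counts (the helper name below is ours, chosen for illustration):

```python
# Caller-side CPU overhead of a remote thread invocation, using the
# total from the text: (4 + 2*m + n) * (6 + W) cycles, where n is the
# number of input and m the number of output parameters.
def invocation_overhead(n, m, W=4):
    # 2 instructions in L1, (m + n) + 2 in L2 and m in the return
    # phase, each costing 6 + W CPU cycles on the caller site.
    return (4 + 2 * m + n) * (6 + W)

# a typical call (n = 25 inputs, m = 10 outputs)
print(invocation_overhead(25, 10))
```

With n = 25 and m = 10 the overhead is 490 cycles, which indeed dwarfs the useful work done when unifying a one- or two-element structure.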

The usual analytical models would give good utilisation curves for this situation. However, there are some aspects they cannot take into consideration. The communication phase in L2 can be approximated as 24*NOH+(m+n)*(2*F+W)−(m+n)*(6+W), which is in the magnitude of the latency. During this period the thread on the caller site has been terminated but the new thread has not been started yet on the remote site. It represents a concentrated, high network traffic, where the network bandwidth and the capacity of the FMPs can limit the performance. Furthermore, the number of threads is decreased on one site while increased on the other one.

Although the formal analysis of Case 2 provided several characteristic values, the decision between the two cases cannot be made without the help of some simulation tools. It is anticipated from the formal analysis that for large structures the locality results in long run lengths and good processor utilisation, and hence the overhead introduced by the thread invocation is negligible, whereas for extremely small structures the naive approach works better. However, this situation cannot be neglected, since lists, which are quite common data structures, are built up of compound terms of size 2. The simulation can help to explore the precise working cycle of this solution. Furthermore, this approach increases the number of threads, so a visual simulation can help to observe the temporal behaviour of the processors.

4.4. Summary

The data obtained from the simple analytical modelling and calculation suggest that the two possibilities, namely the remote load instructions and the remote procedure invocation, can both be effective in different cases.

Remote load instructions of the naive approach have nearly no additional cost in CPU time; however, the thread must be suspended after the issue. The latency of a single read can be hidden by a relatively low number of threads. This solution is optimal for smaller structures.

On the other hand, the remote thread invocation has a high overhead in CPU time but only one thread suspension. Obviously, it works better with larger structures. However, the threshold cannot be established without the simulation of the communication phase. The decision can be critical due to the possibly large number of small structures.

These basic cases showed that none of the performance estimation methods can reach a conclusion alone, but a combined application of these approaches can lead to a theoretically and experimentally well supported decision. They showed the necessity of a wide range of computational methods in designing a complex hardware and software system.

5. Related works

The parallel implementation techniques of Prolog were investigated thoroughly in the 1980s. The issue in those research projects was to establish different models for exploiting the inherently available parallelism of logic programming, like OR-parallel models (e.g. Aurora [4], Muse [16], Opera [3], etc.), AND-parallel models (e.g. [5,8,11]), and combined AND–OR models (e.g. [6,10]).

Multithreading techniques are used in massively parallel environments and at processor level as well. The massively parallel multithreaded models cover a wide spectrum from von Neumann types to dataflow ones [23]. The KUMP/D machine and its predecessors Datarol and Datarol-II are in between, trying to combine both the dataflow and the sequential features. The nearest relatives of the Datarol family are the EM-4, where the Strongly Connected Block is introduced [22], and the MIT Hybrid machine with the Scheduling Quantum concept [12].

The most comprehensive analytical models for multithreaded processors were presented in [9]. A special case, the discrete time model for long latency models, was also elaborated in [21]. Both papers compare their work with practical results.

In our case the goal is matching a given computational model (which has been derived from Prolog) and a certain multithreaded architectural model. This approach is somewhat orthogonal to the previous parallel Prolog research work: not the type of exploited parallelism is investigated but rather the level of parallelism. To our knowledge, this issue has been addressed by only one research group [18] so far, where a fine-grained model with dataflow features was elaborated.

6. Conclusion and further work

The primary goal of this paper is to give an insight into some design issues of a software system on a multithreaded architecture. The paper does not conclude any optimal parameters of a multithreaded execution but rather introduces some of the problems and possible solutions in a multithreaded design.

The next step in the project is the development of several simulation environments, where different aspects of the multithreaded execution can be observed. The data derived from the analytical modelling can give a starting point for these simulations, and results from the simulation may verify the models or help with refining them.

Acknowledgements

The author gratefully acknowledges the work of the Japanese colleagues at Kyushu University, Fukuoka. Detailed information about the Datarol family was provided courtesy of Prof. Makoto Amamiya, Hiroshi Tomiyasu and Shigeru Kusakabe. Prof. Ferenc Vajda made many suggestions and gave a major contribution to the paper.

References

[1] M. Amamiya, R. Taniguchi, Datarol: a massively parallel architecture for functional language, Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing, 1990, pp. 726–735.


[2] Arvind, R.A. Iannucci, Two fundamental issues in multiprocessing, Proceedings of the DFVLR Conference 1987 on Parallel Processing in Science and Engineering, Bonn-Bad Godesberg, 1987.

[3] J. Briat, M. Favre, C. Geyer, J. Chassin de Kergommeaux,OPERA: OR-parallel Prolog system on supernode, in: P.Kacsuk, M.J. Wise (Eds.), Implementations of DistributedProlog, Wiley, New York, 1992.

[4] M. Carlsson, Design and Implementation of an OR-ParallelProlog Engine, SICS Dissertation Series 02.

[5] Chang, A.M. Despain, D. DeGroot, AND-parallelism of logic programs based on a static data dependency analysis, Proceedings of the IEEE Symposium on Logic Programming, 1985.

[6] J. Chassin de Kergommeaux, P. Codognet, Parallel logic programming systems, ACM Computing Surveys 26(3) (1994).

[7] J.S. Conery, Binding environments for parallel logic programs in non-shared memory multiprocessors, Proceedings of the 1987 Symposium on Logic Programming, 1987.

[8] D. DeGroot, Restricted And-parallelism, Proceedings of the International Conference on Fifth Generation Computer Systems, 1984.

[9] P.K. Dubey, A. Krishna, M. Squillante, Performance modeling of a multithreaded processor spectrum, in: Bagchi, Walrand, Zobrist (Eds.), Advanced Computer Performance Modeling and Simulation, Gordon and Breach, Newark.

[10] G. Gupta, M.V. Hermenegildo, ACE: AND/OR-parallel copying-based execution of logic programs, Proceedings of the ICLP'91 Pre-Conference Workshop.

[11] M.V. Hermenegildo, K.J. Greene, &-Prolog and its performance: exploiting independent And-parallelism, Proceedings of the Seventh International Conference on Logic Programming, 1990, pp. 253–268.

[12] R.A. Iannucci, Towards a dataflow/von Neumann hybrid architecture, Proceedings of the 15th Annual International Symposium on Computer Architecture, May 1988.

[13] P. Kacsuk, Distributed data driven Prolog abstract machine, in: P. Kacsuk, M.J. Wise (Eds.), Implementations of Distributed Prolog, Wiley, New York, 1992.

[14] P. Kacsuk, Execution models for a massively parallel Prolog implementation, J. Comput. Artificial Intell. 17(4) (1998).

[15] P. Kacsuk, Zs. Németh, Zs. Puskás, Tools for mapping, load balancing and monitoring in the LOGFLOW parallel Prolog project, Parallel Computing 22 (1997) 1853–1881.

[16] R. Karlsson, A high performance OR-parallel Prolog system,SICS Dissertation Series 07, March 1992.

[17] T. Kawano, S. Kusakabe, R. Taniguchi, M. Amamiya, Fine-grain multi-thread processor architecture for massively parallel processing, Proceedings of the First IEEE Symposium on High Performance Computer Architecture, 1995.

[18] H. Kim, J.-L. Gaudiot, Exploitation of fine-grain parallelism in logic languages on massively parallel architectures, Proceedings of the IFIP Working Conference on Parallel Architectures and Compilation Techniques, Montreal, 1994.

[19] H. Kim, J.-L. Gaudiot, A binding environment for processing logic programs on large-scale parallel architectures, Technical Report, University of Southern California, 1994.

[20] Zs. Németh, P. Kacsuk, Analysis and improvement of the variable binding scheme in LOGFLOW, Workshop on Parallelism and Implementation Technology for (Constraint) Logic Programming Languages, Port Jefferson, 1997.

[21] R.H. Saavedra-Barrera, D.E. Culler, T. von Eicken, Analysis of multithreaded architectures for parallel computing, Second Annual Symposium on Parallel Algorithms and Architectures, Crete, Greece, July 1990.

[22] M. Sato, Y. Kodama, S. Sakai, Y. Yamaguchi, Y. Koumura, Thread-based programming for the EM-4 hybrid dataflow machine, Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992.

[23] D. Sima, T. Fountain, P. Kacsuk, Advanced Computer Architectures, Addison-Wesley, Reading, MA, 1997.

[24] H. Tomiyasu, T. Kawano, R. Taniguchi, M. Amamiya, KUMP/D: the Kyushu University Multi-media Processor, Proceedings of Computer Architectures for Machine Perception, CAMP'95, pp. 367–374.

Zsolt Németh received his MSc degree from the Technical University of Budapest in 1994. Since then he has been a research assistant at the Laboratory of Parallel and Distributed Systems, where he is involved in the LOGFLOW project. He participated in its multi-transputer implementation. Currently he is investigating the possibility of a multithreaded realisation of LOGFLOW.