
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(6), 621–631 (JUNE 1997)

SPMD programming in Java

SUSAN FLYNN HUMMEL∗, TON NGO AND HARINI SRINIVASAN

IBM T. J. Watson Research Center, P.O. Box 217, Yorktown Heights, NY 10598, U.S.A.

SUMMARY

We consider the suitability of the Java concurrent constructs for writing high-performance SPMD code for parallel machines. More specifically, we investigate implementing a financial application in Java on a distributed-memory parallel machine. Despite the fact that Java was not expressly targeted to such applications and architectures per se, we conclude that efficient implementations are feasible. Finally, we propose a library of Java methods to facilitate SPMD programming. © 1997 by John Wiley & Sons, Ltd.

1. MOTIVATION

Although Java was not specifically designed as a high-performance parallel-computing language, it does include concurrent objects (threads), and its widespread acceptance makes it an attractive candidate for writing portable computationally-intensive parallel applications. In particular, Java has become a popular choice for numerical financial codes, an example of which is arbitrage – detecting when the buying and selling of securities is temporarily profitable. These applications involve sophisticated modeling techniques such as successive over-relaxation (SOR) and Monte Carlo methods[1]. Other numerical financial applications include data mining (pattern discovery) and cryptography (secure transactions).

In this paper, we use an SOR code for evaluating American options (see Figure 1)[1], to explore the suitability of using Java as a high-performance parallel-computing language. This work is being conducted in the context of a research effort to implement a Java run-time system (RTS) for the IBM POWERparallel System SP machine[2], which is designed to effectively scale to large numbers of processors. The RTS is being written in C with calls to MPI (message passing interface)[3] routines. Plans are to move to a Java plus MPI version when one becomes available.

The typical programming idiom for highly parallel machines is called data-parallel or single-program multiple-data (SPMD), where the data provide the parallel dimension. Parallelism is conceptually specified as a loop whose iterates operate on elements of a, perhaps multidimensional, array. Data dependences between parallel-loop iterates lead to a producer–consumer type of sharing, wherein one iterate writes variables that are later read by another, or collective communication, wherein all iterates participate. The communication pattern between iterates is often very regular, for example a bidirectional flow of variables between consecutive iterates (as in the code in Figure 1).

This paper explores the suitability of the Java concurrency constructs for writing SPMD programs. In particular, the paper:

1. identifies the differences between the parallelism supported by Java and data parallelism

∗Also with Polytechnic University and Cornell Theory Centre.

CCC 1040–3108/97/060621–11 $17.50 Received 1 February 1997
© 1997 by John Wiley & Sons, Ltd.


2. discusses compiler optimizations of Java programs, and the limits imposed on such optimizations because of the memory-consistency model defined at the Java language level

3. discusses the key features of data-parallel programming, and how these features can be implemented using the Java concurrent and synchronization constructs

4. identifies a set of library methods that facilitate data parallel programming in Java.

public void SORloop()
{
    int i;
    double y;

    for (int t = 0; t < Psor.Convtime; t++) {
        error = 0.0;
        for (i = 0; i < Psor.N; i++) {
            if (i == 0)
                y = values[i+1]/2;
            else if (i == Psor.N-1)
                y = values[i-1]/2;
            else
                y = (values[i-1] + values[i+1])/2;
            y = Math.max(obstacle[i], values[i] + Psor.omega*(y - values[i]));
            values[i] = y;
        }
    }
}

Figure 1. Java SOR code for evaluating American options

1.1. Organization of the paper

In Section 2, we elaborate on the differences between data parallelism and the type of parallelism that Java was designed to support, which we call control parallelism. In the subsequent sections, we discuss the concurrent programming features of Java, and their suitability for writing data-parallel programs. Section 3 summarizes the Java parallel programming model, and its impact on performance optimizations in multi-threaded Java programs. In Section 4, we consider data-parallel programming using the concurrent and synchronization constructs supported by Java; we illustrate data-parallel programming in Java using a Successive Over-Relaxation code for evaluating American options. Finally, in Section 5, we propose a library of methods that would facilitate SPMD programming in Java on the SP.

2. CONTROL VS. DATA PARALLELISM

Although there is a continuum in the types of parallelism found in systems and numeric applications, it is instructive to compare the two extremes of what we call control and data parallelism (respectively):

1. Parallelism in systems: Parallel constructs were originally incorporated into programming languages to express the inherent concurrency in uniprocessor operating systems[4]. These constructs were later effectively used to program multiple-processor and distributed systems[5–9]. This type of parallelism typically involves heterogeneous, ‘heavy weight’ processes, i.e. each process executes different code sequences and demarcates a protection domain. The spawning of processes usually does not follow a regular pattern during program execution, and processes compete for resources, for example, a printer. It is thus important to enforce mutually exclusive access to resources; however, it is often not necessary to impose an order on accesses to resources – the system functions correctly as long as only one process accesses a resource at a time.

2. Parallelism in numerical applications: The goal of parallel numeric applications is performance, and the type of parallelism differs significantly from that found in parallel and distributed systems. Data parallelism[10] is homogeneous, ‘light weight’, and regular – for example, the iterates of a forAll loop operating on an array. Parallel iterates are co-operating to solve a problem, and hence, it is important to ensure that events, such as the writing and reading of variables, occur in the correct order. Thus, producer–consumer access to shared data must be enforced. Iterates may also need to be synchronized collectively, for example to enforce a barrier between phases of execution or to reduce the elements of an array into a single value using a binary operation (such as addition or maximum). A barrier can sometimes collectively subsume individual producer–consumer synchronization between iterates.

As discussed in the next Section, the Java parallel constructs were designed for interactive and Internet (distributed) programming, and lie somewhere between the above two extremes. However, with some minor modifications SPMD programming could be expressed in Java more naturally.

3. THE JAVA PARALLEL PROGRAMMING MODEL

Based on the classification described in the previous Section, the parallel programming model provided in Java fits best under control parallelism. This Section presents a high-level view of the support in Java for parallelism; for further details, the reader is referred to the Java language specification[11].

Java has built-in classes and methods for writing multi-threaded programs. Threads are created and managed explicitly as objects; they perform the conventional thread operations such as start, run, suspend, resume, stop, etc. The ThreadGroup class allows a user to name and manage a collection of threads. The grouping can be nested but cannot overlap; in other words, a nested thread group forms a hierarchical tree. A ThreadGroup can be used to collectively suspend, resume, stop and destroy all thread members; however, this class does not provide methods to create or start all thread members simultaneously.
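As an illustration (our own minimal sketch, not code from the paper), the standard ThreadGroup API lets threads be grouped and queried collectively, yet each thread must still be constructed and started one at a time:

// Minimal sketch: threads placed in a ThreadGroup can be managed as a
// collection, but there is no group-wide start() operation.
public class GroupDemo {
    public static void main(String[] args) {
        ThreadGroup workers = new ThreadGroup("spmd-workers");
        Thread[] t = new Thread[4];
        for (int i = 0; i < t.length; i++) {
            final int id = i;
            t[i] = new Thread(workers, new Runnable() {
                public void run() {
                    System.out.println("worker " + id + " running");
                }
            }, "worker-" + id);
            t[i].start();           // each thread must be started individually
        }
        System.out.println(workers.activeCount() + " threads in the group");
    }
}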

Java adopts a shared-memory model, allowing a public object to be directly accessible by any thread. The shared-memory model is easily mapped onto uniprocessor and shared-memory multiprocessor machines. It can also be implemented on disjoint-memory multicomputers, such as the SP, using message passing to implement a virtual shared memory. For distributed machines, Java's Remote Method Invocation provides an RPC-like interface that is appropriate for coarse-grained parallelism.

Java threads synchronize through statements or methods that have a synchronized attribute. These statements and methods function as monitors: only one thread is allowed to execute synchronized code that accesses the same variables at a time. Locks are not provided explicitly in the language, but are employed implicitly in the implementation of monitors (they are embedded in the bytecode instructions monitorenter and monitorexit[12]). Conceptually, each object has a lock and a wait queue associated with it; consequently, a synchronized block of code is associated with the objects it accesses.

When two threads compete for entry to synchronized code by attempting to lock the same object, the thread that successfully acquires the lock continues execution, while the other thread is placed on the wait queue of the object. When executing synchronized code, a thread can explicitly transfer control by using the notify, notifyAll and wait methods. notify moves one randomly chosen thread from the wait queue of the associated object to the run queue, while notifyAll moves all threads from the wait queue to the run queue. If the thread owning the monitor executes a wait, it relinquishes the lock and places itself on its wait queue, allowing another thread to enter the monitor.

The Java shared-memory model is not the sequentially consistent model[13], which is commonly used for writing multi-threaded programs. Instead, Java employs a form of weak consistency[14], to allow for shared-data optimization opportunities. In the sequential consistency model, any update to a shared variable must be visible to all other threads. Java allows a weaker memory model wherein an update to a shared variable is only guaranteed to be visible to other threads when they execute synchronized code. For instance, a value assigned by a thread to a variable outside a synchronized statement may not be immediately visible to other threads because the implementation may elect to cache the variable until the next synchronization point (with some restrictions). The following subsection discusses the impact of the Java shared variable rules on compiler optimizations.
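To make the hazard concrete, the following minimal sketch (ours, not from the paper) has one thread spinning on a flag written by another; without volatile or synchronized access, the Java memory model does not guarantee that the reader ever observes the update:

// Sketch: visibility of an unsynchronized update is not guaranteed.
public class VisibilityDemo {
    static volatile boolean done = false;   // remove 'volatile' and the loop below may spin forever

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(new Runnable() {
            public void run() {
                while (!done) { /* busy-wait until the update becomes visible */ }
                System.out.println("observed update");
            }
        });
        reader.start();
        Thread.sleep(100);   // let the reader start spinning
        done = true;         // visible because the flag is volatile
        reader.join();
    }
}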

3.1. Optimizing memory accesses in multi-threaded programs

The goal of SPMD programming is high performance, and exploiting data locality is arguably one of the most important optimizations on parallel machines. The ability to perform data-locality optimizations, such as caching and prefetching, depends on the underlying language memory model. Communication optimizations[15,16] require an analysis of memory-access patterns in an SPMD program, and therefore, the ability to perform communication optimizations also depends on the language memory model.

The Java memory model[11] is illustrated in Figure 2. Every Java thread has, in addition to a local stack, a working memory in which it can temporarily keep copies of the variables that it accesses. The main memory contains the master copy of all variables. In the terminology of Java, the interactions between the stack and the working memory are assign, use, lock and unlock; the interactions between the working memory and the main memory are store and load, and those within the main memory are write and read.

[Figure 2. Java memory abstraction: within the Java Virtual Machine, a thread's stack exchanges values with its working memory via use and assign actions, the working memory exchanges values with main memory via load and store actions, and main memory performs the corresponding read and write actions.]

The Java specification[11] states that an ‘. . . implementation is free to update main memory in an order that may be surprising’. The intention is to allow a Java implementation to maintain the value of a variable in the working memory of a thread; a user must use the synchronized keyword to ensure that data updates are visible to other threads, i.e. the main shared memory is made consistent. Alternatively, a variable can be declared volatile so that it will not be cached by any thread. Any update to a volatile variable by a thread is immediately visible to all other threads; however, this disables some code optimizations.1

Java defines the required ordering and coupling of the interactions between the memories (the complete set of rules is complex and is not repeated here). For instance, the read/write actions to be performed by the main memory for a variable must be executed in the order of arrival. Likewise, the rules call for the main memory to be updated at synchronization points by invalidating all copies in the working memory of a thread on entry to a monitor (monitorenter) and by writing to main memory all newly generated values on exit from a monitor (monitorexit).

Although the Java shared-data model is weakly consistent, it is not weak enough to allow the oblivious caching and out-of-order access of variables between synchronization points. For example, Java requires that (a) the updates by one thread be totally ordered and (b) the updates to any given variable/lock be totally ordered. Other languages, such as Ada[17], only require that updates to shared variables be effective at points where threads (tasks in Ada) explicitly synchronize, and therefore, an RTS can make better use of memory hierarchies[18]. A Java implementation does not have enough freedom to fully exploit optimization opportunities provided by caches and local memories. A stronger memory consistency model at the language level forces an implementation to adopt at least the same, if not a stronger, memory consistency model. A strong consistency model at the implementation level increases the complexity of code optimizations – the compiler/RTS has to consider interactions between threads at all points where shared variables are updated, not just at synchronization points[19].

1 Note that the term volatile adopted by Java designates a variable as noncacheable, whereas in parallel programming this term has traditionally meant that a variable is cacheable but may be updated by other processors[17].


If whole-program information is available, then a compiler/RTS can determine which variables are shared, and which shared variables are not updated in synchronized code. Such information is necessary to perform many standard compiler optimizations, e.g. code motion[20], in the presence of threads and synchronization. With the separate compilation of packages, where whole-program information is not always available, it is difficult to distinguish between shared and private variables. In such situations, therefore, several common code optimizations are effectively disabled.

4. DATA PARALLELISM IN JAVA

Data parallelism arises from simultaneous operations on disjoint partitions of data by multiple processors. Conceptually, each thread executes on a different virtual processor and accesses its own partition of the data. Synchronization is needed to enforce the ordering necessary for correctness. Communication is necessary when a thread has to access a data partition owned by a different thread, and, as described earlier, the producer/consumer relationship conveniently encapsulates both of these requirements. Determining the producer (the thread that generates the data) and the consumer (the thread that accesses the data) is straightforward since the relationship is specified by the algorithm. The following features are idiomatic in data-parallel programming:

1. creating and starting multiple threads simultaneously
2. producer–consumer operation for synchronization and data communication between threads
3. collective communication/synchronization between a set of threads.

Although Java provides language constructs for parallelism, there are several factors that make expressing data parallelism in Java awkward. First, as noted in the previous Section, there is no mechanism for creating and starting threads in a ThreadGroup simultaneously. A forAll must be implemented as a serial loop in Java:

SPMDThread spmdthread[] = new SPMDThread[NumT];

for (i = 0; i < NumT; i++) {
    spmdthread[i] = new SPMDThread();
    spmdthread[i].start();
}

A simple high-level construct for creating multiple threads simultaneously is more desirable. Furthermore, if thread creation is expensive and thread startup is not well synchronized across the parallel machine, the serialization will result in idle time that degrades the speedup of a program.
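One direction such a construct could take is sketched below; this is purely our illustration (the helper name startAll and its shape are hypothetical, and are not part of Java or of the library proposed in Section 5). All threads are created first and held on a shared release object, so the skew between thread start-ups is limited to a single notifyAll rather than NumT serial start() calls:

// Hypothetical helper: create all worker threads, then release them together.
public class StartAll {
    public static Thread[] startAll(int numT, final Runnable body) {
        final Object go = new Object();
        final boolean[] released = { false };
        Thread[] t = new Thread[numT];
        for (int i = 0; i < numT; i++) {
            t[i] = new Thread(new Runnable() {
                public void run() {
                    synchronized (go) {                 // wait for the collective release
                        while (!released[0]) {
                            try { go.wait(); } catch (InterruptedException e) { return; }
                        }
                    }
                    body.run();                         // all workers begin here together
                }
            });
            t[i].start();
        }
        synchronized (go) { released[0] = true; go.notifyAll(); }  // release every worker
        return t;
    }
}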

Secondly, implementing producer–consumer data sharing using the Java synchronized statements, wait, notify and notifyAll, is problematic for the following reasons. Because notifys are not queued, it is complicated to enforce a specific ordering. Consider the producer–consumer code in Figure 3: since there is no guarantee that the Producer method will be invoked after the Consumer method, an early notify will be lost. A programmer must therefore add mutual exclusion and auxiliary variables to enforce the correct ordering, resulting in a low-level style of programming. It is not clear that a compiler can always generate efficient code for these producer–consumer methods.


public void Producer()
{
    . . .
    synchronized (spmdthread[my_id+1]) {
        spmdthread[my_id+1].notify();
    }
    . . .
}

public void Consumer() throws InterruptedException
{
    . . .
    synchronized (spmdthread[my_id]) {
        spmdthread[my_id].wait();
    }
    . . .
}

Figure 3. Incorrect Java producer–consumer code


Another problem is that the implementation of notify selects a random thread on the associated wait queue of an object; however, producer–consumer semantics requires that the notify selects a specific thread from the wait queue. Therefore, to implement producer–consumer synchronization, we use a separate object for each producer–consumer pair to ensure that only one thread can appear on the wait queue.

public void Produce(Semaphore mySem) throws InterruptedException
{
    synchronized (mySem) {
        if (mySem.counter == 1) {
            mySem.counter = 0;
            mySem.notify();
        } else {
            mySem.wait();
            mySem.counter = 0;
        }
    }
}

public void Consume(Semaphore mySem) throws InterruptedException
{
    synchronized (mySem) {
        mySem.counter = 1;
        mySem.notify();
        mySem.wait();
    }
}

Figure 4. Java producer–consumer methods


The language specification[11] claims that wait and notify are ‘. . . especially appropriate in situations where threads have a producer–consumer relationship’; however, this mainly applies to bounded-buffer applications where it is not necessary to associate a producer with a specific consumer. Producer–consumer Java methods are given in [21], for example, that use synchronized statements and methods to implement a linked list. A somewhat simpler Java implementation of producer–consumer methods that adheres to our semantics is given in Figure 4.
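The Semaphore class used by Figure 4 (and by the barrier in Figure 5 below) is not shown in the paper; a minimal sketch consistent with its use as a per-pair monitor object, with the field name taken from the figures, might be:

// Assumed helper object for Figures 4 and 5: it carries no behaviour of its
// own and simply serves as the monitor (lock plus wait queue) shared by one
// producer–consumer pair or by all barrier participants.
public class Semaphore {
    public int counter = 0;   // 1 while a consumer is waiting; thread count for a barrier
}

// Each producer–consumer pair shares exactly one Semaphore instance, so at
// most one thread can ever appear on its wait queue:
//   Semaphore link = new Semaphore();
//   producer thread:  Produce(link);
//   consumer thread:  Consume(link);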

Collective communication in Java must also employ mutual exclusion and auxiliary variables. An example barrier code is given in Figure 5.

public void Barrier(Semaphore barrier) throws InterruptedException
{
    synchronized (barrier) {
        barrier.counter -= 1;
        if (barrier.counter > 0) {
            barrier.wait();
        } else {
            barrier.counter = num_threads;
            barrier.notifyAll();
        }
    }
}

Figure 5. Java barrier code

Note that the RTS can capitalize on the MPI library to implement the Java producer–consumer data sharing and collective communications. In this case, the message-passing routines are C functions with Java wrappers and can be invoked as native methods in a data-parallel Java program.
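As an illustration only (the class, library and method names below are hypothetical, not part of the paper's RTS or of any existing MPI binding), such wrappers would be declared as native methods whose C implementations call the corresponding MPI routines:

// Hypothetical Java-side wrapper; the C code behind these declarations
// would invoke MPI_Send and MPI_Recv.
public class MPILink {
    static {
        System.loadLibrary("spmdmpi");   // hypothetical native library wrapping MPI
    }
    public native void send(double[] buf, int count, int destThread);
    public native void recv(double[] buf, int count, int srcThread);
}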

A data-parallel version of the main loop of the SOR code from Figure 1 is given in Figure 6. The main loop is executed by each thread on its local partition of the valuesOld and valuesNew arrays. Each thread computes valuesNew using its valuesOld and the boundary valuesOld elements of its neighbors. The thread then copies valuesNew into valuesOld for the next outer-loop iterate. The individual producer–consumer sharing is subsumed by barriers. The code in Figure 6 could be made more data parallel using the native library methods, e.g. forAll, that we propose in the next Section.

5. SPMD JAVA LIBRARY INTERFACE

For efficiency, we opted to localize shared data structures in our SPMD-Java library; thus, each thread has a local partition of each shared array. Localizing shared data structures is a natural extension of the per-thread working-memory model supported by Java. This design choice necessitates that threads explicitly communicate. As discussed in Section 3.1, to further enable optimizations, it is desirable to designate shared variables as such. Library methods for creating thread arrays and for establishing communication links between them according to various topologies are provided. The library also includes methods for producer–consumer data sharing and for collective communication. The proposed library methods are given in Tables 1 and 2 and are described in more detail below.


public void SPMD_SORloop() throws InterruptedException
{
    int i, t, N;
    double y, z;
    N = SPMDsor.N/spmd_num_threads();
    for (t = 0; t < SPMDsor.ConvTime; t++) {
        for (i = 0; i < N; i++) {
            if (i == 0) {
                if (thread_id() != 0) {
                    z = SPMDsor.threads[thread_id()-1].valuesOld[N-1];
                    y = (z + valuesOld[i+1])/2;
                } else
                    y = valuesOld[i+1]/2;
            } else if (i == N-1) {
                if (thread_id() != spmd_num_threads()-1) {
                    z = SPMDsor.threads[thread_id()+1].valuesOld[0];
                    y = (z + valuesOld[i-1])/2;
                } else
                    y = valuesOld[i-1]/2;
            } else
                y = (valuesOld[i-1] + valuesOld[i+1])/2;
            y = Math.max(obstacle[i], valuesOld[i] + SPMDsor.omega*(y - valuesOld[i]));
            valuesNew[i] = y;
        }
        Barrier(SPMDsor.barrier);
        for (i = 0; i < N; i++)
            valuesOld[i] = valuesNew[i];
        Barrier(SPMDsor.barrier);
    }
}

Figure 6. SPMD Java SOR code for evaluating American options

The library methods for managing threads are given in Table 1. The start_all method implements a forAll loop, and is parameterized with a keyword indicating how shared data structures are localized and, in particular, how initial values are assigned. Static distributions are blocked – that is, each thread is given an equal-sized consecutive chunk of each array dimension. Dynamic distributions are also blocked, but elements may be migrated during execution by the RTS to improve load balancing. Our RTS supports fractiling[22] dynamic scheduling, which allocates iterates in decreasing-size chunks.
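For concreteness, the blocked static distribution just described amounts to the following index arithmetic (our own sketch; the helper name and signature are not part of the proposed library):

// Thread 'id' of 'numThreads' owns the consecutive index range [lo, hi)
// of a dimension of length n; the last thread absorbs any remainder.
public class BlockDist {
    public static int[] blockBounds(int n, int numThreads, int id) {
        int chunk = n / numThreads;
        int lo = id * chunk;
        int hi = (id == numThreads - 1) ? n : lo + chunk;
        return new int[] { lo, hi };
    }
}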

The spmd_setProducerMap and spmd_setConsumerMap methods create thread communication channels for an arbitrary topology. We also provide library methods to specify several common communication-channel topologies, such as one-, two- and three-dimensional grids or a binary tree (spmd_thread_1dgrid, spmd_thread_2dgrid, spmd_thread_3dgrid, spmd_thread_tree).

The library methods for thread communication are given in Table 2. The spmd_produce and spmd_consume methods are parameterized by a keyword specifying which thread the caller is being synchronized with. The parameter can be a specific thread or a logical thread based on the ThreadGroup topology. There are methods for broadcasting and multicasting variables to threads in a ThreadGroup. A barrier method is provided to synchronize phases of a parallel algorithm. The library includes methods for common reduction and scan operations, which are the cornerstones of many parallel algorithms[23]. A reduction applies a binary operation to reduce the elements of an array into a single element, while a scan applies a binary operation to reduce subsequences of the elements of an array.

Table 1. Thread management methods

Name                                 Function

create_spmd_groups                   create an SPMD thread group
start_all (static, dynamic)          create threads and begin their execution
thread_id                            return thread ID
num_threads                          return number of threads
spmd_thread_1dgrid                   define a linear array of producer–consumers
spmd_thread_2dgrid                   define a 2D grid of producer–consumers
spmd_thread_3dgrid                   define a 3D array of producer–consumers
spmd_thread_tree                     define a tree of producer–consumers
spmd_setProducerMap,
spmd_setConsumerMap                  define an arbitrary producer–consumer topology

Table 2. Thread communication methods

Name                                              Function

spmd_produce                                      signal to a specific consumer
spmd_consume                                      wait for a signal from a specific producer
spmd_barrier                                      wait for all threads to arrive
spmd_gather                                       compact a sparse array into local arrays
spmd_scatter                                      uncompact a sparse array into local arrays
spmd_permute                                      permute the local portions of an array
spmd_reduce: (add, multiply, or, and, max, min)   global reduction
spmd_scan: (add, multiply, or, and, max, min)     parallel prefix
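To illustrate what a reduction buys the programmer, the following sketch (ours, not the paper's) writes a global max-reduction out by hand with the constructs already shown; this is the pattern that a single spmd_reduce (max) call would replace. It is assumed to live in the same SPMDThread class as Figure 6, so thread_id(), spmd_num_threads() and Barrier are available, and partial is assumed to be a shared array with one slot per thread:

// Publish the local value, wait for everyone, then combine identically on
// every thread. The Barrier's synchronized block makes the partial[] writes
// visible under the Java memory model described in Section 3.
public double reduceMax(double localValue, double[] partial, Semaphore barrier)
    throws InterruptedException
{
    partial[thread_id()] = localValue;
    Barrier(barrier);
    double global = partial[0];
    for (int p = 1; p < spmd_num_threads(); p++)
        global = Math.max(global, partial[p]);
    Barrier(barrier);   // keep partial[] stable until every thread has read it
    return global;
}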

6. CONCLUSION

The concurrent constructs of Java were selected to facilitate the writing of interactive and Internet applications. In this paper, we explored writing SPMD programs in Java, and demonstrated that it is possible, if somewhat awkward, with the current language specification. We identified several features that would make expressing SPMD parallelism in Java more natural, for example, by including keywords to specify that threads are to be started simultaneously, and by requiring that notifys be queued. In lieu of the addition of these features, we propose that a standard library of SPMD routines be adopted. This paper is an initial step towards this end.

Since the goal of parallel computing is primarily efficiency, there are other changes to the language that are desirable for SPMD computing. One of the most important of these is the explicit declaration of shared variables and the relaxing of the Java memory consistency model, so that more code optimizations can be performed.


REFERENCES

1. J. Dewynne, P. Wilmott and S. Howison, Option Pricing: Mathematical Models and Computation, Oxford Financial Press, 1993.
2. T. Agerwala, J. L. Martin, J. H. Mizra, D. C. Sadler, D. M. Dias and M. Snir, ‘SP2 system architecture’, IBM Syst. J., 34(2), 152–184 (1995).
3. MPI Forum, ‘Document for a standard message passing interface’, Technical Report CS-93-214, University of Tennessee, November 1993.
4. P. Brinch Hansen, ‘The programming language Concurrent Pascal’, IEEE Trans. Softw. Eng., 1(2), 199–206 (1975).
5. Parallel Computing Forum, ‘PCF Parallel FORTRAN extensions’, Special issue, FORTRAN Forum, 10(3), (1991).
6. IBM, Parallel Fortran Language and Library Reference, March 1988. Pub. No. SC23-0431-0.
7. S. Flynn Hummel and R. Kelly, ‘A rationale for massively parallel programming with sets’, J. Program. Lang., 1, (1993).
8. I. Foster, R. Olsen and S. Tuecke, ‘Programming in Fortran M, version 1.0’, Technical Report, Argonne National Laboratory, October 1993.
9. H. F. Jordan, M. S. Benten, G. Alaghband and R. Jakob, ‘The force: A highly portable parallel programming language’, in E. C. Plachy and Peter M. Kogge (Eds.), Proc. 1989 International Conf. on Parallel Processing, vol. II, St. Charles, IL, August 1989, pp. II-112–II-117.
10. High Performance Fortran Forum, ‘High Performance Fortran language specification, version 1.0’, Technical Report CRPC-TR92225, Rice University, May 1993.
11. B. Joy, J. Gosling and G. Steele, The Java Language Specification, Addison-Wesley, 1996.
12. T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, 1997.
13. Leslie Lamport, ‘How to make a multiprocessor computer that correctly executes multiprocess programs’, IEEE Trans. Comput., C-28(9), 690–691 (1979).
14. Michel Dubois, Christoph Scheurich and Faye Briggs, ‘Memory access buffering in multiprocessors’, in Conf. Proc. 13th Annual International Symp. on Computer Architecture, Tokyo, June 1986, pp. 434–442.
15. Manish Gupta, Edith Schonberg and Harini Srinivasan, ‘A unified data-flow framework for optimizing communication’, IEEE Trans. Parallel Distrib. Syst., 7(7), (1996).
16. S. P. Amarasinghe and M. S. Lam, ‘Communication optimization and code generation for distributed memory machines’, in Proc. ACM SIGPLAN ’93 Conference on Programming Language Design and Implementation, Albuquerque, NM, June 1993.
17. Ada 95 Rationale, Intermetrics Inc., 1995.
18. S. Flynn Hummel, R. B. K. Dewar and E. Schonberg, ‘A storage model for Ada on hierarchical-memory multiprocessors’, in A. Alverez (Ed.), Proc. of the Ada-Europe Int. Conf., Cambridge University Press, 1989.
19. Harini Srinivasan and Michael Wolfe, ‘Analyzing programs with explicit parallelism’, in Utpal Banerjee, David Gelernter, Alexandru Nicolau and David A. Padua (Eds.), Languages and Compilers for Parallel Computing, Springer-Verlag, 1992, pp. 405–419.
20. Robert Strom and Kenneth Zadeck, personal communication, March 1996.
21. D. Lea, Concurrent Programming in Java, Addison-Wesley, 1997.
22. S. Flynn Hummel, I. Banicescu, C. Wang and J. Wein, ‘Load balancing and data locality via fractiling: An experimental study’, in Boleslaw K. Szymanski and Balaram Sinharoy (Eds.), Proc. Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, Kluwer Academic Publishers, Boston, MA, 1995, pp. 85–89.
23. G. Blelloch, ‘Scan primitives and parallel vector models’, Ph.D. Dissertation MIT/LCS/TR-463, MIT, October 1989.