A SIMD APPROACH TO LARGE-SCALE REAL-TIME SYSTEM AIR TRAFFIC CONTROL USING ASSOCIATIVE PROCESSOR AND CONSEQUENCES FOR PARALLEL COMPUTING
A dissertation submitted to Kent State University in partial
fulfillment of the requirements for the degree of Doctor of Philosophy
by
Man Yuan
August 2012
Dissertation written by
Man Yuan
B.S., Hefei University of Technology, China 2001
M.S., University of Western Ontario, Canada 2003
Ph.D., Kent State University, 2012
Approved by
Dr. Johnnie W. Baker , Chair, Doctoral Dissertation Committee
Dr. Lothar Reichel , Members, Doctoral Dissertation Committee
Dr. Mikhail Nesterenko
Dr. Ye Zhao
Dr. Richmond Nettey
Accepted by
Dr. Javed I. Khan , Chair, Department of Computer Science
Dr. Raymond Craig , Dean, College of Arts and Sciences
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Flynn’s Taxonomy and Classification . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD) . . . 9
2.1.2 Single Instruction Stream Multiple Data Stream (SIMD) . . . . . 10
2.2 Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Air Traffic Control (ATC) . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Task Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 The worst-case environment of ATC . . . . . . . . . . . . . . . . 16
2.3.3 ATC Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 An Associative Processor for ATC . . . . . . . . . . . . . . . . . . . . . . 18
2.5 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Avoid False Sharing . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Optimize Barrier Use . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.3 Avoid the Ordered Construct . . . . . . . . . . . . . . . . . . . . 24
2.5.4 Avoid Large Critical Regions . . . . . . . . . . . . . . . . . . . . 24
2.5.5 Maximize Parallel Regions . . . . . . . . . . . . . . . . . . . . . 24
2.5.6 Avoid Parallel Regions in Inner Loops . . . . . . . . . . . . . . . 25
2.5.7 Improve Load Balance . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.8 Using Compiler Features to Improve Performance . . . . . . . . . 26
3 Survey of Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Previous Work on ATC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Classical MIMD Real-Time Scheduling Theory . . . . . . . . . . . . . . 31
3.3 Other Similar Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 AP Solution to ATC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Overview of AP Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Advantages of AP over MIMD for ATC . . . . . . . . . . . . . . . . . . . 33
4.3 AP properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Emulating the AP on the ClearSpeed CSX600 . . . . . . . . . . . . . . . . . . 39
5.1 Overview of ClearSpeed CSX600 . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Emulating the AP on the ClearSpeed CSX600 . . . . . . . . . . . . . . . 41
6 ATC System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1 ATC Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Static Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Scaling up from CSX600 to a Realistic Size AP . . . . . . . . . . . . . . 45
7 AP Solution for ATC Tasks Implemented on CSX600 . . . . . . . . . . . . . . 48
7.1 Report Correlation and Tracking . . . . . . . . . . . . . . . . . . . . . . 48
7.2 Cockpit Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.3 Controller Display Update . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.4 Automatic Voice Advisory . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.5 Sporadic Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.6 Conflict Detection and Resolution (CD&R) . . . . . . . . . . . . . . . . . 55
7.6.1 Conflict Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.6.2 Conflict Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.7 Terrain Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.8 Final Approach (Runways) . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.9 Comparison of AP and MIMD solutions for ATC Tasks . . . . . . . . . . 63
7.10 Implementation of Algorithm Issues . . . . . . . . . . . . . . . . . . . . . 65
8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.2.1 Comparison of Performance on MTI, MIMD and CSX600 . . . . . 69
8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN . . 70
8.2.3 How close to the AP is our CSX600 emulation of the AP . . . . . 72
8.2.4 Comparison of Predictability on CSX600 and MIMD . . . . . . . 74
8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD . . . . . 75
8.2.6 Timings for 8 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.2.7 Video and Actual STARAN Demos . . . . . . . . . . . . . . . . . 78
9 Observations and Consequences of the Preceding Results . . . . . . . . . . . . 80
10 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
LIST OF FIGURES
1 SIMD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 AP Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Fork-Join Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 A Possible AP Architecture for ATC . . . . . . . . . . . . . . . . . . . . 36
5 CSX600 accelerator board . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 MTAP architecture of CSX600 . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Static schedule of ATC tasks implemented in AP . . . . . . . . . . . . . 45
8 Track/Report correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
9 Search box size for flight maneuver . . . . . . . . . . . . . . . . . . . . . 52
10 Conflict detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
11 Timing of tracking task. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
12 Timing of terrain avoidance task. . . . . . . . . . . . . . . . . . . . . . . 70
13 Timing of CD&R task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
14 Comparison of tracking task of MTI, CSX600 and STARAN. . . . . . . . 71
15 Comparison of terrain avoidance task of MTI, CSX600 and STARAN. . . 72
16 Comparison of CD&R task of MTI, CSX600 and STARAN. . . . . . . . 72
17 Comparison of display processing task of MTI, CSX600 and STARAN. . 73
18 Time of tracking for number of aircraft from 10 to 384. . . . . . . . . . . 73
19 Time of CD&R for number of aircraft from 10 to 384. . . . . . . . . . . . 74
20 Comparison of MTI, STARAN and Super-CSX600 with at most one air-
craft per PE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
21 Predictability of execution times for report correlation and tracking task. 75
22 Predictability of execution times for terrain avoidance task. . . . . . . . . 75
23 Predictability of execution times for CD&R task. . . . . . . . . . . . . . 76
24 Number of iterations missing deadlines when scheduling tasks. . . . . . . 76
25 Overall ATC System Design. . . . . . . . . . . . . . . . . . . . . . . . . . 100
LIST OF TABLES
1 Timings of Associative Functions . . . . . . . . . . . . . . . . . . . . . . 42
2 Static Schedule for ATC Tasks . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Statically Scheduled Solution Time for Worst Case Environment of ATC 47
4 Data Structures for Radar Reports . . . . . . . . . . . . . . . . . . . . . 50
5 Data Structures for Tracks . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Comparison of Required ATC Operations . . . . . . . . . . . . . . . . . . 64
7 Performance of One Flight/PE . . . . . . . . . . . . . . . . . . . . . . . . 76
8 Performance of Eight Tasks for Ten Flights/PE . . . . . . . . . . . . . . 77
Acknowledgements
I would like to sincerely thank my advisor, Dr. Johnnie W. Baker, for his constant
support and encouragement throughout my Ph.D. study. He has been the advisor for
my research work throughout these years and also the mentor for my professional ca-
reer development. Without his encouragement and guidance, this dissertation would be
impossible.
My special thanks go to the many people who have helped me at various stages of
this work: Will Meilander, Frank Drews, Sue Peti, Marcy Curtiss, and others. In particular,
Professor Will Meilander has contributed valuable information for my dissertation. He
has read through all my publications and provided many useful comments.
I would also like to thank my dissertation committee members for their willingness
to serve on the committee and their valuable time.
On a personal note, I would like to express my gratitude to my parents Zhen Yuan
and Xiaohang Xu for their love and emotional support. Finally, I would like to thank
my girlfriend Sijin Zhang for her moral support and for the incredible amount of
patience she had with me so that I could eventually realize my dream.
Abstract
This dissertation has two complementary focuses. First, it provides a solution to large
scale real-time system air traffic control (ATC) using an enhanced SIMD machine model
called an associative processor (AP). The second is the comparison of this implementa-
tion with a multiprocessor implementation and the implications of these comparisons.
This paper demonstrates how one application, ATC, can more easily, more simply, and
more efficiently be implemented on an AP than is generally possible on other types of
traditional hardware. The AP implementation of ATC will take advantage of its de-
terministic hardware to use static scheduling. Our solution differs from previous ATC
systems that are designed for MIMD computers and have a great deal of difficulty meet-
ing the predictability requirements for ATC, which are critical for meeting the strict
certification standards required for safety critical software components. The proposed
AP solution supports accurate predictions of worst case execution times and guarantees
all deadlines are met. Furthermore, the software developed based on the AP model is
much simpler and smaller in size than the current corresponding ATC software. As the
associative processor is built from SIMD hardware, it is considerably cheaper and sim-
pler than the MIMD hardware currently used to support ATC. While APs were used
for ATC-type applications earlier, these are no longer available. We use a ClearSpeed
CSX600 accelerator to emulate the AP solutions of ATC on an ATC prototype consist-
ing of eight data-intensive ATC real-time tasks. Its performance is evaluated in terms of
execution time and predictability and is compared with an 8-core multiprocessor (MP)
using OpenMP. Our extensive experiments show that the AP implementation meets all
deadlines while the MP will regularly miss a large number of deadlines. It is shown that
the proposed AP solution will support accurate predictions of worst case execution times
and will guarantee that all deadlines are met. In addition, the AP code will be similar
in size to sequential code for the same tasks and will avoid all of the additional support
software needed with an MP to handle dynamic scheduling, load balancing, shared re-
source management, race conditions, and false sharing, etc. At this point, essentially
only MIMD systems are built. Many of the advantages of using an AP to solve an ATC
problem would carry over to other applications. AP solutions for a wide variety of
applications will be cited in this paper. Applications that involve a high degree of data
parallelism, such as database management, text processing, image processing, graph
processing, bioinformatics, weather modeling, and managing UAS (Unmanned Aircraft
Systems, or drones), are good candidates for AP solutions. This raises the issue of whether
we should routinely consider using non-multiprocessor hardware like the AP for appli-
cations where substantially simpler software solutions will normally exist. It also raises
the question of whether the use of both AP and MIMD hardware in the same system
could provide more versatility and efficiency. Either the AP or MIMD could serve as
the primary system but could hand off jobs it could not handle efficiently to the other
system.
CHAPTER 1
Introduction
The Air Traffic Control (ATC) system is a typical real-time system that continuously
monitors, examines, and manages space conditions for thousands of flights by processing
large volumes of data that are dynamically changing due to reports by sensors, pilots, and
controllers, and gives the best estimate of position, speed and heading of every aircraft in
the environment at all times. The ATC software consists of multiple real-time tasks that
must be completed in time to meet their individual deadlines. By performing various
tasks, the ATC system keeps timely correct information concerning positions, velocities,
contiguous space conditions, etc., for all flights under control. Since any lateness or
failure of task completion could cause disastrous consequences, time constraints for tasks
are critical. [1–3]
In the past, solutions to this problem have been implemented on multicomputer sys-
tems with records for various aircraft stored in the memory accessible to the processors
in this system. The dispersed nature of this dynamic ATC system and the necessity of
maintaining data integrity while providing rapid access to this data by multiple
instruction streams (MIS) increase the difficulty and complexity of handling air traffic control.
It is difficult for these MIMD implementations to satisfy reasonable predictability stan-
dards that are critical for meeting the strict certification standards needed to ensure
safety for critical software components. Massive efforts have been devoted to finding an
efficient MIMD solution to the ATC problems since 1963 but none of them has been
satisfactory yet [4–7]. To a large degree, this is due to the fact that the software used by
MIMDs in solving real-time problems involves solutions to many difficult problems such
as dynamic scheduling, load balancing, etc [6]. It is currently widely accepted that large
real-time problems such as ATC do not have a polynomial time MIMD solution [8, 9].
Even though many researchers have put extensive efforts into producing a reasonably
good (e.g., using heuristic algorithms) solution to the ATC problem, the current ATC
system has repeatedly failed to meet the requirements of the ATC automation system.
It has frequently been reported that the current ATC system periodically loses radar data
without any warnings and misses tracking of aircraft in the sky. This problem has even
occurred with Air Force One. [3–5, 7]
Instead of using the traditional MIMD approach, we implement a prototype of the
ATC system on an enhanced SIMD hardware system called an associative processor (AP)
or associative SIMD, where interactions are much simpler and more efficiently controlled,
due in large part to avoiding the need to coordinate the interactions of multiple
instruction streams. The AP supports some additional constant time operations, e.g.,
broadcasting, associative searches, AND/OR and maximum/minimum reductions (assuming word
length is a constant) [4, 10], details of which will be illustrated in Section 2.4. It is shown
that an efficient polynomial time solution to a real-time problem for Air Traffic Control
can be obtained [3–7]. In addition, associative SIMD computers have been built which
support the AP model and can support this polynomial time solution to the ATC and
meet the required deadlines. The first AP system was designed by Kenneth Batcher and
implemented in the Goodyear Aerospace STARAN computer, which was specifically de-
signed to support ATC [11–13]. A second generation STARAN, called the ASPRO [14],
also an AP designed at Goodyear Aerospace, was used extensively by the Navy for Air-
borne Air Defense Systems applications for many years [15]. In addition, normal SIMDs
cannot handle ATC because they cannot input and output radar sensor data efficiently.
STARAN and ASPRO overcame this I/O limitation with the flip network of their
Multidimensional Access (MDA) memory; details are given in Section 2.4.
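As a hedged illustration of the associative operations named above (broadcast, associative search, and minimum reduction), the following C sketch emulates them sequentially; on a real AP each runs in constant time across the PEs for a fixed word length, and the loops here merely stand in for the parallel hardware. The record fields and function names are hypothetical, not taken from the STARAN or ASPRO instruction sets.

```c
#include <stddef.h>
#include <limits.h>

/* One record per PE; the responder bit is set by an associative
 * search, mimicking the AP's tag/responder hardware. */
typedef struct {
    int altitude;   /* an illustrative track field */
    int responder;  /* 1 if this PE responded to the last search */
} Track;

/* Associative search: the broadcast key is compared against every
 * PE's field at once; each PE sets its own responder bit. */
void assoc_search_eq(Track t[], size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        t[i].responder = (t[i].altitude == key);
}

/* Minimum reduction over the responding PEs only;
 * returns INT_MAX when no PE responded. */
int min_responder(const Track t[], size_t n) {
    int min = INT_MAX;
    for (size_t i = 0; i < n; i++)
        if (t[i].responder && t[i].altitude < min)
            min = t[i].altitude;
    return min;
}
```

In an AP the search and the reduction each take one broadcast step regardless of the number of records, which is what makes the worst-case timing of such operations predictable.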
We use ClearSpeed CSX600 to emulate an AP and implement eight key ATC tasks for
our ATC prototype, namely report correlation and tracking, cockpit display, controller
display update, sporadic requests, automatic voice advisory, terrain avoidance, conflict
detection and resolution (CD&R) and final approach (runway optimization). The se-
lection of the specific tasks was based on information in the following references: FAA
Grants for Aviation Research Program Solicitation [16], FAA’s NextGen Implementation
Plan [17], and the FAA 1963 ATC Specifications [18]. Reference [18] indicates that the
tasks we selected were similar to the ones selected in 1963 for implementation by indus-
tries interested in competing for the job of implementing ATC for FAA. Reference [16]
describes the FAA’s efforts to improve the aircraft capacity of the airspace while main-
taining high safety standards and aircraft safety technology for conflict detection and
resolution. These indicate that the tasks of the type that we have selected are key
to ATC implementation. Further, the NextGen plan [17] discusses the new standards
for ATC including developing capabilities in traffic flow management, dynamic airspace
configuration, separation assurance, super density operations, and airport surface opera-
tions. The purpose is to achieve a safe, efficient and high-capacity airspace system. The
following subtopics in [17] address current research that needs improvement: comprehen-
sive analysis of uncertainties in the National Airspace System; air traffic management
functional allocation using advanced computing and networking; and guaranteeing the
safety of air-to-air and air-to-ground operations. The type of activities discussed there
will require the use of tasks similar to the ones that we have selected for implementation,
which shows that our eight ATC tasks capture most of the workload of ATC and that
nothing major is missing that might impact ATC performance. In our careful search of
the literature on ATC, the tasks that kept recurring were all included among the ones
we chose for our prototype to use for benchmarking.
Several of our previous papers [4–6,19] have used the AP to manage ATC computation
for an ATC sector. Similar SIMD parallel approaches have been used for collision
avoidance algorithms among multiple agents in real-time simulations by S. Guy et al. [20].
We used the ClearSpeed CSX600 to emulate an AP in our previous work [21]. However, the
assumed maximum number of aircraft being tracked is 4000 IFR (instrument flight rules)
aircraft and 10000 VFR (visual flight rules) aircraft, for a total of 14000 aircraft [21].
Because the CSX600 emulation tool can only process a small number of aircraft, an
AP system of realistic size would have to be much larger than the CSX600. Our paper [21] has
implemented two ATC tasks, i.e., report correlation tracking and conflict detection and
resolution (CD&R) on CSX600, and compared the performance of only one task, report
correlation and tracking, on both CSX600 and MIMD. However, in the comparison tests,
only one task was executed repeatedly, without any competition from other tasks. Also,
no adjustment was included for the greater combined computational speed of MIMD. In
our most recent papers [22, 23], we not only implement the 8 tasks on the CSX600, but also
schedule them on both the CSX600 and an 8-core MIMD, with the fastest host-only version
implemented using OpenMP [24].
Our results show that some deadlines are regularly missed by these tasks when imple-
mented on a MIMD, while the AP implementation meets all deadlines for the tasks that
can be statically scheduled. The results indicate that the AP emulation has much better
scalability, efficiency and predictability than the MIMD implementation. Moreover, the
proposed AP solution will support accurate predictions of worst case execution times and
will guarantee that all deadlines are met. In contrast, MIMD can only support average
case timings, not worst case timings, and cannot guarantee that all deadlines are met all
the time. Also, the software used by the AP is much simpler and smaller in size than the
current corresponding ATC software, as the MIMD software must also support additional
activities such as dynamic scheduling and load balancing, which are not needed by
the AP. The solution used in AP is very similar to a sequential solution for this problem,
both in style and in code size. As a result, the size of this software solution is only a very
small fraction of the size of the MIMD solutions to the ATC problem. This results in a
dramatic drop in the cost of both creating and maintaining this software
when compared to the MIMD solutions that have been given to the ATC problem. An
important consequence of these features is that the V&V (Validation and Verification)
process will be considerably simpler than for current ATC software. Moreover, the AP
hardware is considerably cheaper, simpler, and easier to build than the MIMD hardware
currently used to support ATC. Furthermore, the AP solution to ATC can be applied
in many other real-time problems, dynamic database problems, Unmanned Aircraft Sys-
tems (UASs), and many other applications. Thus we should consider SIMD solutions to
solve many parallel computing problems for which SIMD or AP solutions can be sim-
pler and more efficient than MIMD solutions. A large portion of our research has been
presented in our most recent journal paper [23].
This dissertation is organized into ten chapters. First, Chapter 2 introduces back-
ground information and terminology used throughout this dissertation. Section 2.1 intro-
duces the two most important parallel architectures: MIMD in Section 2.1.1 and SIMD
in Section 2.1.2. Section 2.2 introduces terminology of real-time processing. Section 2.3
describes the air traffic control (ATC) problem including worst case assumptions, con-
straints of the system, and examples of ATC tasks. Section 2.4 gives an overview of the
associative processor (AP) and its properties, such as constant time associative operations. The
basics of OpenMP and the techniques of how to improve performance of OpenMP pro-
grams are introduced in Section 2.5. Second, Chapter 3 gives a survey of previous related
research on ATC, and comparison of their pros and cons. Chapter 4 gives an overview
of AP solution to ATC, discusses the advantages of AP over MIMD, and discusses some
issues concerning the properties of the AP. Chapter 5 gives an overview of the CSX600
architecture and programming concepts, and discusses emulating the AP on the CSX600.
In Chapter 6, the overall system design and static scheduling is illustrated. Chapter 7
describes the algorithms for 8 key ATC real-time tasks implemented on CSX600, and
how they can be scaled up to realistic size ATC. Experimental results are presented in
Chapter 8: comparison of worst case times of some most important ATC tasks for 8-core
MIMD, CSX600 and AP, comparison of predictability and missing deadlines, and timings
for 8 tasks. The observations and consequences of the preceding results are in Chapter 9.
Chapter 10 gives the final conclusions and discusses some work planned for the future.
CHAPTER 2
Background Information
This chapter introduces and discusses some terminology and background information
used in this dissertation. Basic knowledge of parallel algorithms is introduced
in [25–27].
2.1 Flynn’s Taxonomy and Classification
Flynn’s taxonomy is the best-known classification scheme for parallel computers, in
which a computer’s category depends on the parallelism it exhibits in its instruction stream
and data stream. In this scheme, a sequence of instructions (the instruction stream, denoted
I) manipulates a sequence of operands (the data stream, denoted D), and each stream
can be either single (S) or multiple (M). Hence Flynn’s classification results in four
categories: SISD, MISD, MIMD and SIMD [26, 28]. Because SISD and MISD are not
relevant to this dissertation, they are not discussed further.
2.1.1 Multiple Instruction Stream Multiple Data Stream (MIMD)
MIMD is generally considered to be the most important class and includes most com-
puters currently being built. Processors are asynchronous, since they can independently
execute different instructions on different data sets simultaneously. Communications
are handled either through shared memory (such machines are called multiprocessors) or
by message passing (multicomputers) [25, 26, 29, 30]. Most researchers have considered
MIMDs to be the most important and least restricted class of computers compared with
SIMD computers. However, there are numerous NP-hard problems for MIMDs in real-time
applications [6, 8, 9], including dynamic scheduling, load balancing, race conditions,
data dependency, non-determinism, deadlocks, etc. See Garey and Johnson, “Computers
and Intractability: A Guide to the Theory of NP-Completeness” [8], and related
references [31–34]. MIMD parallel programming is normally optimized for average case
performance; typically the worst case is not considered [8, 9].
2.1.2 Single Instruction Stream Multiple Data Stream (SIMD)
Most early parallel computers had a SIMD-style design. Figure 1 shows the abstract
model of SIMD. A SIMD has an instruction stream (IS) unit, also called the control unit,
which stores the program to be executed. Each PE (processing element) contains an
arithmetic logic unit (ALU) to perform arithmetic and logical operations, and a local
memory that holds only data, not the program. The IS unit broadcasts the instructions
of the program to the PEs. The PEs are very simple and are essentially ALUs. All
active PEs execute the same instructions synchronously, but on different data. [25, 26, 29, 30]
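To make the model above concrete, here is a minimal sketch (not from the dissertation) that emulates one broadcast SIMD step in C: a single instruction is applied synchronously by every active PE to its own local data, with an activity mask standing in for PE activation. The `PE` struct and function name are illustrative.

```c
#include <stddef.h>

/* Each PE holds only data in its local memory; the mask models
 * whether the PE participates in the current broadcast step. */
typedef struct {
    int local;   /* the PE's local data */
    int active;  /* 1 if the PE executes the broadcast instruction */
} PE;

/* One SIMD step: the control unit broadcasts a single instruction
 * ("add k to your local value") and every active PE applies it to
 * its own data. The loop stands in for hardware steps that are
 * conceptually simultaneous. */
void broadcast_add(PE pes[], size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        if (pes[i].active)
            pes[i].local += k;
}
```

Note that there is only one instruction per step; the data differ per PE, which is exactly the "single instruction stream, multiple data stream" idea.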
There are three basic types of parallel systems that have been called SIMD [26, 29].
The first is the traditional SIMD and includes the Goodyear Aerospace MPP, Thinking
Machines CM-2, and the MasPar. These parallel computers perform the same operation
on multiple data values simultaneously. The second type is called vector machines or
vector processing machines. They involve the use of pipelined processors and include
the large CRAY vector machines.

Figure 1: SIMD Model

The third and most recent SIMD type are systems that are sometimes collectively called
short vector machines. They evolved from desktop
computers as they became more powerful and capable of supporting real-time gaming
and video programming. Our focus here is strictly on the traditional SIMD type. Of
these three types, only the associative SIMDs can efficiently process sequences of both
data-parallel and scalar operations and switch between them with no conversion or
low-level synchronization overhead.
MIMD computers are generally believed to be more powerful and more performance/cost
effective compared with SIMD computers. Therefore, current research in various appli-
cation areas including real-time systems has put a great deal of emphasis on use of
MIMDs [3, 9, 35–37]. As discussed in Section 2.1.1 above, MIMDs present many
difficulties in applications. However, use of SIMD computation can avoid essentially all of
these problems [3, 6, 7, 12, 30]. Since only one control unit exists in a SIMD, there is no
need for synchronization. The major advantage of SIMDs is their simplicity. Their
programs are easy to write, debug, and optimize. A common complaint by programmers
is that parallel computers are difficult to program; this is not true of SIMDs.
Because of the synchronous nature of SIMDs, the flow control of SIMDs is sequential and
at exactly one place in the program text. There is no need to consider numerous process
interactions. Additionally, there is only one program to write and debug and data are
usually distributed among the PEs naturally. This dissertation will use an enhanced SIMD
to solve a large-scale real-time system and show its advantages over MIMD.
2.2 Real-Time Systems
A real-time system is distinguished from a traditional computing system in that its
computations must meet specified timing requirements. A real-time system executes
tasks to ensure not only their logical correctness but also their temporal correctness.
Otherwise, the system fails no matter how accurate a computation result is. [3,9,34–37]
A real-time task is “an executable entity of work that, at minimum, is characterized by
a worst-case computation time and a time constraint or deadline” [35]. A job is “an
instance of a task or an invocation of a task” [35]. When a task is invoked, a job is
generated to execute this task under the given conditions. The release time of a task
is the time at which the task (or a job of the task) is ready to be executed. A real-time
task can be either periodic, meaning it is activated (released) at a regular interval (its
period), or aperiodic, meaning it is activated irregularly at some unknown and possibly
unbounded interval. Aperiodic tasks have either soft or no deadlines, and sporadic tasks have hard
deadlines. Typically, real-time scheduling can be static or dynamic. Static scheduling
refers to the case that the scheduling algorithm has complete knowledge a priori about
all incoming tasks and their constraints such as deadlines, computation times, shared
resource access, and future release times. In contrast, in dynamic scheduling, the system
has knowledge about the currently active tasks but it does not have knowledge about
new task arrivals. Thus the scheduling algorithm has to be designed to change its task
schedule over time.
A system period P is the time during which the set of all tasks must be completed
[9, 35–37]. The system deadline D is the constraint time for a system period. A task
deadline d is the time constraint for an individual task. The lateness of a task is the time
elapsed between the task’s release and its execution. Deadlines can be hard, firm, or soft. A
hard deadline is one whose violation will result in disastrous consequences. A firm deadline
is one whose violation produces useless results but causes no severe consequences. A soft
deadline is one after which the result produced is degraded but may still be useful for a
limited time period. Because missing deadlines is not tolerable
in ATC systems, we consider only hard deadlines in our work. According to [9], a static
schedule can be used if the summation of task times for the system period is less than
the system deadline D, otherwise a static schedule is not feasible.
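The feasibility condition above can be made concrete with a short C sketch. The function below is our illustration, not code from [9]; any task times and deadline passed to it are placeholders rather than measured ATC values.

```c
#include <stddef.h>

/* Feasibility check for a static (cyclic-executive) schedule, per the
   condition above: a static schedule is feasible if the sum of all task
   times for the system period is less than the system deadline D. */
int static_schedule_feasible(const double *task_times, size_t n, double deadline)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += task_times[i];             /* sum of task times */
    return total < deadline;                /* 1 = feasible, 0 = not */
}
```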
A primary concern for a real-time system is how to schedule these tasks to ensure they
meet various constraints, including deadlines. Depending on the criterion, such as
minimizing the sum of all task computation times and minimizing the maximum lateness,
different scheduling algorithms can be designed [3, 9, 34–37]. An optimal scheduling
algorithm is one that may fail to meet a deadline only if no other scheduling algorithm can
meet the deadline [34, 38, 39]. There are three scheduling approaches. The first is the
cyclic executive, which uses static analysis and static scheduling [9, 35]. All
tasks, times and priorities are given before system startup, and it is time-driven, i.e.,
schedules are computed and hardcoded before system startup. We use this approach for
our AP solution to ATC. The second approach is fixed-priority scheduling, which uses static
analysis and dynamic scheduling. All tasks, times and priorities are given before system
startup, and it is priority-driven, i.e., the schedule is constructed
by the OS scheduler at run time; Rate Monotonic (RM) scheduling is an example [38, 39]. The
third is dynamic-priority scheduling, which assigns priorities based on the current
state of the system. Examples include Least Completion Time (LCT), Earliest Deadline
First (EDF), and Least Laxity First (LLF), where laxity is the deadline minus the
computation time minus the current time [9, 35–37].
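As a concrete illustration of the laxity computation just defined, the following C sketch selects the least-laxity ready task. The function name and sample values are illustrative only; they are not drawn from any of the cited schedulers.

```c
#include <stddef.h>

/* Least Laxity First selection: laxity = deadline - computation time -
   current time, as defined above. Returns the index of the ready task
   with the smallest laxity. Illustrative sketch only. */
size_t llf_pick(const double *deadline, const double *comp_time,
                size_t n, double now)
{
    size_t best = 0;
    double best_laxity = deadline[0] - comp_time[0] - now;
    for (size_t i = 1; i < n; i++) {
        double laxity = deadline[i] - comp_time[i] - now;
        if (laxity < best_laxity) {
            best_laxity = laxity;
            best = i;
        }
    }
    return best;
}
```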
Although the LCT, EDF, LLF and RM scheduling algorithms can be optimal [9, 35, 38, 39],
as real-time systems become larger and tasks become more sophisticated, it is
impossible for a single processor to execute them while guaranteeing real-time properties. The
current trend is to use MIMD systems to schedule real-time tasks [40]. Unfortunately,
it has long been established that almost all real-time scheduling
problems on MIMDs are NP-complete [8, 9, 32, 39]. No optimal scheduling algorithm
has been discovered for almost any case on MIMD systems. Researchers
have therefore been driven to heuristics when designing scheduling algorithms for MIMD systems.
However, these heuristics are usually expensive in that they consume a lot of computation
time, and sometimes additional hardware support such as a scheduling chip is needed.
Moreover, use of heuristics will cause systems to be inherently unpredictable [3, 8, 39].
2.3 Air Traffic Control (ATC)
The Air Traffic Control (ATC) system is a real-time system that continuously moni-
tors, examines, and manages space conditions for thousands of flights by processing large
volumes of data that are dynamically changing due to reports by sensors, pilots, and
controllers, and gives the best estimate of position, speed and heading of every aircraft
in the environment at all times [1–3]. The ATC software consists of multiple real-time
tasks that must be completed in time to meet their individual deadlines. By performing
various tasks, the ATC system maintains timely, correct information concerning positions, ve-
locities, contiguous airspace conditions, etc., for all flights under control. Since any lateness
or failure of task completion could cause disastrous consequences, time constraints for
tasks are critical.
Data inputs come mainly from radar reports and are stored in a real-time database.
In our working prototype of the ATC system, a set of radar reports arrives every 0.5
seconds. Multiple reports may return redundant data on some aircraft. All tasks must be
completed before new report data arrive for the next cycle.
In order to present our work more clearly, we briefly describe our analytical prototype
of the ATC system and ATC tasks in this section. ATC task characteristics and the worst-
case ATC environment, which is based on a real-world application, are listed first; then ATC
tasks and task examples are explained. Some details have been discussed in [4–6].
2.3.1 Task Characteristics
Besides time constraints, there are other assumptions for ATC tasks in our ATC
system. Before specific ATC tasks are described in Section 2.3.3, we give general char-
acteristics of an ATC task [3, 4].
• All tasks are periodic. Although each task has its own deadline, all of them must
be completed by the system deadline D, or the system period P.
• All deadlines are known at the task release time.
• Aperiodic jobs are handled by a special task in a particular time slot in every cycle.
• Tasks are independent in that there is no synchronization between them nor shared
resources. The static schedule fixes release times and deadlines for each task.
• Overhead costs for interrupt handling are included in each task cost.
• Task execution is non-preemptive.
• Task deadlines are all hard and critical. None can be preempted or deleted without
possible adverse effects.
2.3.2 The worst-case environment of ATC
The US national airspace (NAS) is divided into 20 airspace regions called ATC en route
centers in the lower (or contiguous) U.S. states, plus one each for Hawaii and Alaska.
Each center is divided into sectors. A controller may control one or more sectors in a
center. In the ATC center we consider, there are 600 controllers. The assumed maximum
number of aircraft being tracked by one air traffic control center is 4,000 controlled IFR
(Instrument Flight Rules) aircraft in the current controller center and 10,000 uncontrolled
VFR (Visual Flight Rules) aircraft plus adjacent-center controlled IFR aircraft, for a total of
14,000 aircraft [3–5]. Because of the multiplicity of sensors, we assume there are 12,000
radar reports coming into the system per second.
2.3.3 ATC Tasks
An ATC system has to process a large amount of data coming from radar sensors
and provide accurate flight information in a highly constrained time. The various processing
procedures can be grouped into the following eight major tasks [4, 5]: report correlation and
tracking, cockpit display, controller display update, sporadic requests, automatic voice
advisory, terrain avoidance, conflict detection and resolution (CD&R), and final approach
(runway optimization). These eight tasks represent our ATC benchmark, as can
be seen in references: FAA Grants for Aviation Research Program Solicitation [16], FAA's
NextGen Implementation Plan [17], and the FAA 1963 ATC Specifications [18]. Refer-
ence [18] indicates that the tasks we selected were similar to the ones selected in 1963 for
implementation by industries interested in competing for the job of implementing ATC
for FAA. Reference [16] describes the FAA's effort to improve the capacity of the airspace
while maintaining high safety standards, including aircraft safety technology for conflict detection
and resolution. These indicate that tasks of the type that we have selected are key
to ATC implementation. Further, the NextGen plan [17] discusses the new standards
for ATC including developing capabilities in traffic flow management, dynamic airspace
configuration, separation assurance, super density operations, and airport surface opera-
tions. The purpose is to achieve a safe, efficient and high-capacity airspace system. The
following subtopics in [17] address current research that needs innovation: comprehen-
sive analysis of uncertainties in the National Airspace System; air traffic management
functional allocation using advanced computing and networking; and guaranteeing the safety of
air-to-air and air-to-ground operations. The type of activities discussed there will require tasks
similar to the ones that we have selected for implementation, which shows that our eight
ATC tasks capture most of the ATC workload and that nothing major is
missing that might impact ATC performance. In our careful search of the literature for
publications on ATC, the tasks that kept recurring in the ATC literature
were all included in the tasks we chose for our benchmark.
2.4 An Associative Processor for ATC
An associative processor (AP) [4,13] is a SIMD system with several additional useful
hardware enhancements that simplify supporting the real-time ATC system. The name
"associative" is due to the computer's ability to locate items in the memory of PEs by content
rather than by location. Figure 2 shows the architecture of an AP. It has one control unit,
also called the instruction stream, and thousands of PEs (cells). Each PE has its own arithmetic
logical unit (ALU) and memory. The first AP system was designed by Kenneth Batcher
and implemented in the Goodyear Aerospace STARAN computer, which was specifically
designed to support ATC [11–13]. A second generation STARAN called the ASPRO [14],
also an AP designed at Goodyear Aerospace, was used extensively by the Navy for
Airborne Air Defense Systems applications for over ten years [15].
Figure 2: AP Architecture
The hardware enhancement that is required for an AP is a broadcast/reduction net-
work such as that described in [10]. This network supports the execution of the following opera-
tions, often called the associative operations, in constant time. A list of the associative
operations follows [10, 15, 41, 42]:
• Global MAX and MIN: finds all instances where a maximal (or minimal) value
is stored by a processor at a fixed memory location. The processors whose value
is maximal (respectively minimal) are active at the end of this operation and are
called responders. The remaining processors participating in this activity are called
non-responders.
• AND and OR: finds the global reduction of Boolean values stored in the same
memory location of each active processor.
• Associative search: finds all instances where a search pattern is matched by the
content stored by a processor at a fixed memory location. The active processors
whose content matches the search pattern are the responders and the processors
which did not match the search pattern are the non-responders.
• Any-Responder: the ANY operation determines if there is at least one responder
after an associative search.
• Pick-One: selects one responder from the set of the responding processors. It is
implemented on ClearSpeed using GET and NEXT operations.
• Broadcast data or instructions from the control unit to PEs.
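To make the semantics of these operations concrete, the following C sketch emulates two of them, associative search and Pick-One, with a sequential loop standing in for the broadcast/reduction network that performs them in constant time on a real AP. The function and field names are our own illustrations, not APIs of STARAN, ASPRO, or ClearSpeed.

```c
#include <stddef.h>

#define NO_RESPONDER ((size_t)-1)

/* Associative search: mark as responders the active PEs whose field
   matches the broadcast pattern. Pick-One: select one responder.
   On an AP both steps take constant time across all PEs; the loop
   here is only a sequential emulation. */
size_t associative_search_pick_one(const int *field, const int *active,
                                   size_t n_pe, int pattern,
                                   int *responder /* out, length n_pe */)
{
    size_t picked = NO_RESPONDER;
    for (size_t pe = 0; pe < n_pe; pe++) {
        responder[pe] = active[pe] && field[pe] == pattern;
        if (responder[pe] && picked == NO_RESPONDER)
            picked = pe;            /* Pick-One: one responder */
    }
    return picked;                  /* NO_RESPONDER if ANY fails */
}
```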
We introduce a new term that is called jobset for the AP [3,4]. Observe that an AP is a
set processor, as each of its instructions operates on a set of data. As a result, the AP can
execute a set of jobs involving the same basic operations on different data simultaneously.
This set of jobs is called a jobset. Compared with the common understanding of a job
as an instance of a task in the MP, in the AP, a task is considered to be a sequence of
jobsets.
2.5 OpenMP
OpenMP is a parallel programming model for shared-memory
multiprocessors [43–46]. It consists of a set of compiler directives and library routines,
and it is relatively easy to create multi-threaded applications in Fortran, C and C++. In
this dissertation, we focus on the fork-join programming model employed by the OpenMP
system. To the best of our knowledge, prior work has not seriously studied the fork-join
task model in the context of real-time systems [47]. Fork-Join is a popular parallel
programming paradigm employed in systems such as Java and OpenMP. The basic fork-
join tasks are shown in Figure 3.
Each basic fork-join task begins as a single master thread that executes sequentially
Figure 3: Fork-Join Task Model
until it encounters the first fork construct, where it splits into multiple parallel threads
which execute the parallelizable part of the computation. After the parallel execution
region, a join construct is used to synchronize and terminate the parallel threads, and
resume the master execution thread. This structure of fork and join can be repeated mul-
tiple times within a job execution. In this dissertation, we use the basic fork-join task
model as described above. Many real-time systems, such as radar tracking, autonomous
driving, and video surveillance, exhibit a data parallel nature that lends itself easily to the
fork-join model. As the problem sizes scale and processor speeds saturate, the only way
to meet task deadlines in such systems would be to parallelize the computation. We focus
on preemptive fixed-priority scheduling algorithms, and dynamic priority scheduling for
fork-join tasks is beyond the current scope of this work. We also restrict our attention to
tasks whose number of threads does not exceed the number of cores of our MIMD system.
Although the number of threads in a parallel region can be dynamically adjusted by the
OMP_NUM_THREADS environment variable in specific implementations of OpenMP, we
do not consider such a task model in this work.
While it is relatively easy to quickly write a correctly functioning program in OpenMP,
it is more difficult to write one with good performance. There are some tech-
niques to improve performance [43, 44].
2.5.1 Avoid False Sharing
An important efficiency concern is false sharing, which can also severely restrict scal-
ability. It is a side effect of the cache-line granularity of cache coherence implemented
in shared memory systems. The cache coherency mechanism keeps track of the status
of cache lines by appending "state bits" that indicate whether the data on the cache line
is still valid or outdated. Any time a cache line is modified, the cache coherence mechanism
notifies other caches holding a copy of the same line that the line is invalid. If data from
that line is needed, a new updated copy must be fetched. The problem is that the state bits
do not track which part of the line is outdated; they mark the whole line as
outdated. As a result, a processor cannot detect which part of the line is still valid and
must instead request a new copy of the entire line. Consequently, when two threads update
different data elements in the same cache line, they interfere with each other. This effect
is known as false sharing.
False sharing is likely to significantly impact performance under the following con-
ditions: shared data is modified by multiple threads; the access pattern is such that
multiple threads modify the same cache lines; these modifications occur in rapid succes-
sion. An extreme case of false sharing is illustrated by the example shown in Algorithm 1.
Suppose each thread holds a copy of the cache line containing the 8-element vector a, and thread 0
modifies a[0]. Everything in the cache line with a[0] is invalidated, so a[1], ..., a[7] are
not accessible until the cache is updated in all threads, even though their data has not
changed and is still valid. Thus a thread cannot access a[i] for i > 0 until its cache
line containing a[0], ..., a[7] has been updated.
Algorithm 1 Example of false sharing
1: pragma omp parallel for shared(Nthreads, a) schedule(static, 1)
2: for int i = 0; i < Nthreads; i++ do
3:     a[i] += i;
4: end for
In this case, we can solve the problem by padding the array by dimensioning it
as a[n][8] and changing the indexing from a[i] to a[i][0]. Access to different elements
a[i][0] are now separated by a full cache line, and updating an array element a[i][0] does not
invalidate a[j][0] for i ≠ j. Although this array padding works well, it is a low-level
solution that depends on the size of the cache line and may not be portable. In a more
efficient implementation, the variable a is declared and used as a private variable for
each thread instead of having thread i access the global array position a[i]. In general,
using private data instead of shared data significantly reduces the risk of false sharing.
Unlike padding, this is a portable solution.
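The private-data alternative can be sketched in C/OpenMP. Here a reduction clause gives every thread a private partial sum, so no two threads ever write neighboring elements of one shared cache line; the function is an illustrative sketch and runs correctly (serially) even if the pragma is ignored.

```c
#include <stddef.h>

/* Portable fix for the false sharing in Algorithm 1: the reduction
   clause gives each thread a private partial result, so threads never
   update adjacent elements of a shared array in the same cache line. */
double sum_private(const double *x, size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += x[i];      /* each thread accumulates privately */
    return total;
}
```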
2.5.2 Optimize Barrier Use
Even efficiently implemented barriers are expensive. The nowait clause can be used
to eliminate the barrier that results from several constructs, including the loop construct.
We must do this safely, considering in particular the order of reads and writes to the
same portion of memory. A recommended strategy is to first ensure that the OpenMP
program works correctly, and then insert the nowait clause where it is safe, carefully
retaining explicit barriers at the points in the program where they are needed.
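The following C/OpenMP sketch illustrates a safe use of nowait: two independent loops share one parallel region, and the implicit barrier after the first loop is dropped because the second loop touches a different array. The function is illustrative only.

```c
/* Two independent loops in one parallel region: the nowait clause
   removes the implicit barrier after the first loop, which is safe
   here because the second loop reads and writes only b. */
void scale_two_arrays(double *a, double *b, long n)
{
    #pragma omp parallel
    {
        #pragma omp for nowait   /* safe: b does not depend on a */
        for (long i = 0; i < n; i++)
            a[i] *= 2.0;
        #pragma omp for
        for (long i = 0; i < n; i++)
            b[i] += 1.0;
    }
}
```

If the second loop read a, the barrier would have to stay.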
2.5.3 Avoid the Ordered Construct
The ordered construct ensures that the corresponding blocks of code within a parallel
loop are executed in the order of the loop iterations. This construct is expensive to
implement. The runtime system has to keep track of which iterations have finished and
possibly keep threads in a wait state until their results are needed, which slows program
execution. The ordered construct can often be avoided, e.g., it may be better to wait
and perform I/O outside of the parallelized loop.
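The suggested alternative, performing the ordered work outside the parallelized loop, can be sketched as follows; f is a placeholder for the per-iteration computation, and the function is our illustration rather than code from the cited sources.

```c
#include <stdio.h>

static double f(long i)          /* placeholder per-iteration work */
{
    return (double)(i * i);
}

/* Avoids the ordered construct: results are buffered by iteration
   index inside the parallel loop, then emitted sequentially, in
   order, after the loop. */
void compute_then_emit(double *out, long n)
{
    #pragma omp parallel for     /* no ordered clause needed */
    for (long i = 0; i < n; i++)
        out[i] = f(i);
    for (long i = 0; i < n; i++) /* sequential, ordered I/O */
        printf("%ld %.1f\n", i, out[i]);
}
```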
2.5.4 Avoid Large Critical Regions
A critical region is used to ensure that no two threads execute a piece of code simul-
taneously. It can be used when the order in which threads perform a computation is
not important. However, the more code in the critical region, the longer
threads needing that region will have to wait. Therefore the programmer should
minimize the amount of code enclosed within a critical region; if a critical region is very
large, program performance becomes poor. An alternative is to rewrite the piece of
code and separate, as far as possible, those computations that cannot lead to data races
from those operations that do need to be protected.
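The rewrite described above can be sketched in C/OpenMP: the (placeholder) expensive computation runs outside the critical region, and only the short update of the shared maximum is protected.

```c
static double expensive_score(double x)   /* placeholder computation */
{
    return x * x;
}

/* The race-free computation runs outside the critical region; only
   the short update of the shared maximum is protected. */
double best_score(const double *x, long n)
{
    double best = -1.0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        double s = expensive_score(x[i]); /* outside critical region */
        #pragma omp critical
        {
            if (s > best)                 /* short protected update */
                best = s;
        }
    }
    return best;
}
```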
2.5.5 Maximize Parallel Regions
Indiscriminate use of parallel regions may lead to suboptimal performance. There are
significant overheads associated with starting and terminating parallel regions. Large
parallel regions offer more opportunities for using data in cache and provide a bigger
context for other compiler optimizations. For example, if multiple parallel loops exist,
we must choose whether to enclose each loop in an individual parallel region or create
one parallel region encompassing all of them.
2.5.6 Avoid Parallel Regions in Inner Loops
Another common technique to improve the performance is to move parallel regions
out of the innermost loops. Otherwise, we repeatedly incur the overheads of the parallel
construct. For example, in the loop nest shown in Algorithm 2, the overhead of the
parallel region is incurred n² times. A more efficient solution is indicated in Algorithm 3.
The pragma omp parallel for construct is split into its constituent directives and the
pragma omp parallel construct has been moved to enclose the entire loop nest. The
pragma omp for construct remains at the inner loop level. Depending on the amount of
work performed in the innermost loop, the parallel construct overheads are minimized.
Algorithm 2 Parallel region embedded in a loop nest
1: for int i = 0; i < n; i++ do
2:     for int j = 0; j < n; j++ do
3:         pragma omp parallel for
4:         for int k = 0; k < n; k++ do
5:             {...}
6:         end for
7:     end for
8: end for
Algorithm 3 Parallel region moved outside of the loop nest
1: pragma omp parallel
2: for int i = 0; i < n; i++ do
3:     for int j = 0; j < n; j++ do
4:         pragma omp for
5:         for int k = 0; k < n; k++ do
6:             {...}
7:         end for
8:     end for
9: end for
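A compilable C version of the Algorithm 3 restructuring is sketched below; the loop body simply increments an array so that the example is self-contained, and the function name is our own.

```c
/* Algorithm 3 in compilable form: one parallel region encloses the
   whole nest, so its overhead is paid once rather than n*n times;
   only the innermost loop is work-shared. After the call, each c[k]
   has been incremented n*n times. */
void nest_outside(double *c, long n)
{
    #pragma omp parallel
    for (long i = 0; i < n; i++) {
        for (long j = 0; j < n; j++) {
            #pragma omp for      /* work-share only the k loop */
            for (long k = 0; k < n; k++)
                c[k] += 1.0;
        }
    }
}
```

The implicit barrier at the end of each omp for keeps the redundant outer iterations in step across threads.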
2.5.7 Improve Load Balance
In some parallel algorithms, there is a wide variation in the amount of work threads
have to do. When this happens, threads wait at the next synchronization point until the
slowest thread arrives. One way to avoid this problem is to use the schedule clause with a
non-static schedule. The drawback is that the dynamic and guided workload distribution
schedules have higher overheads than the static scheme. However, if the load imbalance is
severe enough, this cost is offset by the more flexible allocation of work to the threads.
It is a good idea to experiment with these schemes, as well as with various values for the
chunk size. In general, when possible, a good rule for MIMD parallelism is to overlap
computation and communication so that the total time taken is less than the sum of the
times to do each separately. A dynamic schedule in OpenMP can exploit this overlap
and lead to a significant speedup: the thread performing I/O joins the others once it has
finished reading data and shares in any computation that remains at that time, but it
does not force the others to wait; ideally, they have performed most of the work by the
time the I/O is done. Afterwards, one thread writes out the results and another starts
reading, while the remaining threads can immediately move on to the computation.
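The schedule-clause experiment suggested above can be sketched as follows; the per-iteration cost deliberately grows with i, so a dynamic schedule with chunk size 1 hands out fresh iterations to whichever thread finishes first. The workload shape is illustrative.

```c
/* Unbalanced loop: the cost of iteration i grows with i. A dynamic
   schedule with chunk size 1 improves load balance at the price of
   some scheduling overhead; a static schedule would leave the
   threads holding the large iterations as stragglers. */
double uneven_sum(long n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (long i = 0; i < n; i++) {
        double w = 0.0;
        for (long j = 0; j <= i; j++)   /* work grows with i */
            w += 1.0;
        total += w;
    }
    return total;
}
```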
2.5.8 Using Compiler Features to Improve Performance
Compilers use a variety of analyses to determine which optimization techniques can be applied.
They also apply a variety of techniques to reduce the number of operations performed and to
reorder code to better exploit the hardware. Once the correctness of a numerical result is
assured, it is worth experimenting with compiler options to squeeze out improved performance. We
used the compiler optimization flags '-mfpmath=387,sse -msse3 -ffast-math -O3' to enable
Intel's Streaming SIMD Extensions (SSE), a SIMD instruction-set extension to
the Intel x86 architecture.
CHAPTER 3
Survey of Literature
3.1 Previous Work on ATC
The current airspace has a rigid, centralized structure, and aircraft are required to
follow predefined airways or instructions given by controllers [1, 2, 48]. The Federal Avi-
ation Administration (FAA) has put tremendous effort into finding a predictable and
reliable system to achieve free flight, which would allow pilots to choose the best path to
minimize fuel consumption and time delay rather than following pre-selected flight cor-
ridors [1,2,49]. This requires clear and unambiguous methods for maintaining safe sepa-
ration between aircraft. Therefore, conflict detection and resolution (CD&R) emerges as
a critical issue for the implementation of free flight. In the past, solutions to this prob-
lem have been implemented on multicomputer systems with records for various aircraft
distributed over the memory of the processors in this system. The distributed nature of
this dynamic ATC system and the necessity of maintaining data integrity while providing
rapid access to this data by multiple instruction streams (IS) increases the difficulty and
complexity of handling air traffic control. It is difficult for these MIMD implementations
to satisfy reasonable predictability standards that are critical for meeting the strict certi-
fication standards needed to ensure safety critical software components. Massive efforts
have been devoted to finding an efficient MIMD solution to the ATC problems for more
than forty years [4–6].
The performance of all CD&R algorithms available depends on aircraft state esti-
mation according to the comprehensive survey of Kuchar and Yang [50]. The current
ATC system is based on a surveillance system that uses data from radar measurements
to track aircraft. In the early versions of this system, human operators manually fol-
lowed the blips on the video displays, but the increasing number of aircraft necessitated
the development of automated tracking algorithms. The current algorithms in use for
ATC tracking are based on constant-gain Kalman filters, known as α-β and α-β-γ
filters [51, 52]. The major problem with a single Kalman filter is that it does not pre-
dict well when the aircraft makes an unanticipated change of flight mode such as making
a maneuver, accelerating, etc [53–56]. Many adaptive state estimation algorithms have
been proposed [57–60]. The Interacting Multiple Model (IMM) algorithm [61, 62] runs
two or more Kalman filters that are matched to different modes of the system in parallel.
It uses a weighted sum of the estimates from the bank of Kalman filters to compute the
state estimate. IMM and its variants have been applied to single and multiple aircraft
tracking problems in [57]. However, it becomes inaccurate for tracking multiple aircraft
as the number of aircraft increases. Current MIMD implementations of this algorithm
are computationally very intensive. Hwang et al. [54] proposed that the flight-mode
likelihood function can be used to improve the estimation results of the IMM algorithm.
The likelihood function uses the mean of the residual produced by each Kalman filter.
A heuristic algorithm in [56] that evaluates correlation error values has been shown to
provide better results than the Kalman filter [56].
To the best of our knowledge, all existing conflict detection algorithms are based on
the continuous state information of the aircraft (see Kuchar and Yang [50] for a compre-
hensive survey of the CD&R algorithms). Menon et al. [63] formulate conflict resolution
as a multi-participant optimal control problem. Using parameter optimization and state
constrained trajectory optimization, they compute a conflict resolution trajectory for
two different cost functions: deviation from the original trajectory and a linear combi-
nation of total flight time and fuel consumption. Their method results in 3D optimal
multiple-aircraft conflict resolution. In general, the optimization process is computation-
ally intensive and is difficult to implement in real time. In [64], Krozel et al. propose
one centralized strategy that is controller-oriented and two decentralized strategies that
are user-oriented. In the centralized approach, a central agent analyzes the trajectories of
the aircraft and determines resolutions. In the two decentralized strategies, each aircraft
resolves its own conflicts as they are detected. In [49], Yang and Kuchar propose a con-
flict alerting logic based on sensor and trajectory uncertainties, with conflict probability
based on Monte Carlo simulation. Chiang et al. [65] propose CD&R algorithms from
the perspective of computational geometry. Paielli and Erzberger [66] and Prandini et
al. [67] propose analytic algorithms for computing the probability of conflict. Many of these
algorithms consider only two aircraft. For example, Krozel et al. [64] show that neither
their centralized nor decentralized CD&R algorithms can guarantee safety for multiple
aircraft as the number of aircraft grows. Furthermore, many algorithms propose
optimization schemes that are not guaranteed to complete within real-time deadlines.
Due to these growing problems, the FAA is inviting proposals for new and
efficient conflict detection and resolution technologies [68].
3.2 Classical MIMD Real-Time Scheduling Theory
There is a body of classical multiprocessor real-time scheduling theory [9, 37, 40]. John
Stankovic et al. [9]: "...complexity results show that most real-time multiprocessing schedul-
ing is NP-hard." Mark Klein et al. [38]: "One guiding principle in real-time system
resource management is predictability: the ability to determine for a given set of tasks
whether the system will be able to meet all the timing requirements of those tasks";
"...most realistic problems incorporating practical issues... are NP-hard." Garey, Gra-
ham and Johnson [31]: "...all but a few schedule optimization problems are considered
insoluble.... For these scheduling problems, no efficient optimization algorithm has been
found, and indeed, none is expected."
John Stankovic et al. [9, 32, 34] and Garey et al. [8, 31] have identified a number of
difficulties in scheduling periodic real-time tasks on multiprocessors; examples
include dynamic scheduling, load balancing, race conditions, data dependencies, shared
resource management, sorting, indexing, and cache and memory coherency problems.
Garey et al. [8, 31] have shown these to be NP-hard problems on multiprocessors. In [8], a
multiprocessor is defined as a parallel computer that uses message passing or shared
memory, so it is essentially a MIMD. To avoid confusion, in this dissertation we will
treat multiprocessor and MIMD as identical terms.
3.3 Other Similar Work
There is also some related work outside ATC. Most results on real-time
scheduling on MIMDs focus on the sequential programming model, where the prob-
lem is to schedule many sequential real-time tasks on multiple processor cores. Parallel
programming models introduce a new dimension to this problem, where jobs may be
split into parallel execution segments at specific points [47]. S. Guy et al. [20] used
SIMD parallel approaches for collision-avoidance algorithms between multiple agents in
real-time simulations. K. Park et al. [69] used CUDA to evaluate the performance of
image processing algorithms. NVIDIA GPUs have many SIMD PE groups on their chips,
and their approach shares many ideas with ours.
CHAPTER 4
AP Solution to ATC
4.1 Overview of AP Solution
Instead of a MIMD, we implement the ATC system on an associative processor (AP), where
interactions are much simpler and more efficiently controlled, due in large part to avoiding the
need to coordinate the interactions of multiple instruction streams. Our previous papers
have detailed the solution [3–6, 19, 21, 70]. Similar SIMD parallel approaches have been
used for collision avoidance algorithms between multiple agents for real-time simulations
in S.Guy et. al. [20].
All records for a given aircraft are stored in a single processor. Initially, assume each
processor stores the records for at most one aircraft, since the memory size and speed
of processors in a large SIMD are typically small due to cost restrictions. For ATC tasks,
an AP with n processors can execute n instances of the same task in essentially the same
time as it takes to execute one instance of that task. As long as there is no more than
one aircraft per processor, the running time on the AP remains essentially the same as the
number of aircraft increases.
4.2 Advantages of AP over MIMD for ATC
Since SIMDs have only one instruction stream (IS) or control unit, control-type com-
munication between processors is completely eliminated. Communication between the
IS and processors occurs in constant time using its broadcast/reduction network. Data
communication between processors is completely deterministic and a tight upper bound
can be calculated for the worst case. As a result, a numerical worst case time required for
communications can be accurately calculated. The AP has some important advantages
over MIMD including low overhead synchronization, deterministic hardware, much faster
communication, predictable worst-case running time, much wider "memory to processor"
bandwidth, and elimination of the following: race conditions, data dependencies, shared
resource management, sorting, indexing, and cache and memory coherency problems, etc.
When solving a problem, MIMD systems must often repeatedly use software to solve additional
problems that do not arise in a sequential solution of the original problem.
Examples of this "additional software" are dynamic scheduling, load bal-
ancing, shared resource management, memory and cache coherency management, pre-
emption, synchronization, priority inversion handling, sorting, indexing, multi-tasking
and multi-threading management software, etc [6]. Several of these types of difficulties
were identified in scheduling periodic real-time tasks on MIMD in [47]. A number of these
are problems that have been shown to be NP-hard on multiprocessors (e.g., [8]).
In [8], a multiprocessor is defined as a parallel computer that uses message passing
or shared memory, so it is essentially a MIMD. To avoid confusion, in this dissertation we will
treat multiprocessor and MIMD as interchangeable terms.
narily need this additional software. A sequential solution to a problem can
usually be used to create a similar AP solution with roughly the same number of lines
of code. AP solutions are often shorter and simpler than the sequential solutions, as the
constant-time associative search AP property can be used to eliminate the use of sorting
and linked lists to organize and locate items. Also, the additional AP properties often
allow simpler and more efficient solutions for problems than with a SIMD that is not an
AP.
4.3 AP properties
We discuss some issues regarding AP properties in this section. One is whether the
data-transfer methods of the ClearSpeed emulator are similar to data transfer
between PEs and control-unit memory in an AP. The issue really comes down to whether
the AP we plan to use can make the necessary data transfers in sufficient time, as there
are no transfer properties assumed for AP. According to [11, 71–78], the AP model is a
general purpose computational model and does not make any assumptions regarding how
the interconnection network is used. Often, the use of the interconnection network in an
algorithm can be avoided by using the above associative operations, leaving the running
time of this algorithm independent of the network used. Also, the AP does not make
any assumptions regarding how data is transferred from the AP to outside buffers or
between the control unit memory and the parallel memory. Clearly, these transfer times
depend on the data size, and this increases as the application size (e.g., the number of
aircraft) increases. While it is very efficient to transfer data between the mono and PE
memory in ClearSpeed, we do not know if this rate of transfer would scale if a much
larger ClearSpeed accelerator is built. We must point out the important fact that most
and probably all of the non-AP SIMDs that have been proposed and built could not
handle the ATC requirement because of I/O limitations. However, the previously built
AP machines STARAN [11–13] and ASPRO [14] overcame the I/O limitation with the
MDA (multidimensional access) memory [4, 11–13, 41] and the flip network (see Figure 4).
Figure 4: A Possible AP Architecture for ATC
The flip network is a corner turning network that provides access to a slice of memory
in a SIMD PE for I/O purposes [11, 41, 73]. The flip network in STARAN and ASPRO
was the key to providing high speed I/O movement. The placement of the flip network
between processors and their memory allowed very fast movement of data between an
outside buffer and the ASPRO/STARAN PE parallel memory. Corner-turned data and
the assignment of one record per processor allows multiple PEs to work together to
transfer large amounts of data rapidly between PEs and memories. The flip network and
MDA allow efficient movement of a record in one PE to an outside buffer. More about
this can be found in [41]. Moreover, Jerry Potter describes an alternate design in [41]
that allows multiple records to be transferred to an outside buffer. So data transfer
speed is not a bottleneck for building an associative processor to accommodate
realistic-size ATC problems. See references [4, 11, 71–75, 78, 79] to find more information about
how to implement AP hardware.
Another issue involving AP properties is how the constant time operations can be
supported in an AP and how we can be sure they actually run in constant time. The ATC
problem is essentially a dynamic database problem. Much of the data in this database
changes rapidly, with many parts being updated every one-half second. In order to be able
to update and process this data, we need to have some functions that allow us to access,
update, change, and add or delete records in this dynamic database very rapidly. In order
to support these rapid database operations, we require that these operations execute in
constant time. These are the additional functions we require a SIMD to possess in order
for it to be called an associative processor. Here "constant time" is defined to be the
time that the sequential RAM model requires for a PE to access its memory, the time for
AND or OR operations, the time for addition or multiplication of word length numbers,
etc. However, there is an additional issue here that does not arise for sequential constant
time operations, namely that these functions may involve reductions of vectors with a
component value from each processor. Clearly, if the number of processors is allowed to
increase without bound, such constant time reductions are impossible. However, if we
limit the number of processors to a practical size (e.g., bound by the number of atoms
in the observable universe), it is shown in [10] that all of the constant time functions
required for an associative processor are possible. However, these functions cannot use a
typical interconnection network, as use of these will require non-constant time. Instead
these constant time functions can be supported by use of a binary (or n-ary for n > 2)
broadcast-reduction tree with nodes that have very limited computational capabilities. In
fact, this is the way these constant time operations are supported in both STARAN and
ASPRO, which use a 4-ary tree for broadcasts and another 4-ary tree for reductions.
The broadcasting/reduction network on the STARAN is constructed using a group of
resolver circuits [11,71,72,75]. It is important to keep in mind that all of the above basic
operations are implemented in hardware. These features make the AP model a unique
parallel computation model that is powerful and feasible. More information about this
can be found in [4, 5, 10, 11].
The key to AP hardware design is [76–81]: 1) high memory-to-PE bandwidth, and
2) low-level synchronous operation, supported by i) the elimination of branches in
low-level loops and ii) the elimination of low-level barrier synchronization. In the STARAN and
the ASPRO, these abilities were supported by corner turning the data and assigning one
record per processor, the multi-dimensional array memory (and flip network) and mask
register hardware. Specifically, the mask register allows additional parallel searching and
processing on subsets of the original search with no branching for special cases. After
the search phase of a search, process, retrieve (or SPR) cycle is complete, associative
SIMDs can either continue data parallel processing or select a single record to process
sequentially with essentially no overhead. These attributes allow associative SIMDs to
process all the records in a file using data parallelism. More information about this
and other issues addressed in this section can be found in the book titled Associative
Computing by Jerry Potter [41].
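The search-process-retrieve (SPR) cycle with a mask register described above can be sketched as follows. This is a hypothetical Python model, assuming one record per PE; the record fields and the altitude query are invented for illustration.

```python
# Hypothetical sketch of the search-process-retrieve (SPR) cycle using
# a mask register, assuming one record per PE. Masked-off PEs simply
# ignore the instruction, so the inner loop contains no branches for
# special cases.

flights = [
    {"id": 1, "alt": 9000, "warned": False},
    {"id": 2, "alt": 35000, "warned": False},
    {"id": 3, "alt": 8000, "warned": False},
]

# Search: every PE compares its record against the query in lock step.
mask = [f["alt"] < 10000 for f in flights]

# Process: the same instruction is issued to all PEs; only PEs whose
# mask bit is set actually store the result.
for f, m in zip(flights, mask):
    if m:
        f["warned"] = True

# Retrieve: PickOne selects a single responder for sequential handling.
first = next(f for f, m in zip(flights, mask) if m)
print(first["id"])  # 1
```

The point of the sketch is that moving from data-parallel processing to sequential handling of one selected record requires no scheduling overhead: the mask already identifies the responders.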
CHAPTER 5
Emulating the AP on the ClearSpeed CSX600
5.1 Overview of ClearSpeed CSX600
The ClearSpeed accelerator board shown in Figure 5 is a PCI-X card equipped with
two CSX600 coprocessors, each with 96 processing elements (PEs) connected in the form
of a one-dimensional array. At present, we are only using one of the two coprocessors
in order to obtain a more SIMD-like environment. This multi-core section is called a
multi-threaded array processor (MTAP) core, and the architecture is shown in Figure 6. The
programmer only has to provide a single instruction stream, and the instructions and
data are dispatched to execution units with two parts: the mono unit, which functions
as a control unit and processes sequential instructions, and the poly unit, which has
96 PEs. At each step, all active PEs execute the same command synchronously
on their individual data. Each PE has its own local memory of 6 Kbytes, a dual 64-bit
FPU, its own ALU, integer MAC, registers and I/O. The PEs operate at a clock speed
of 210 MHz. The aggregate bandwidth of all PEs is specified to be 96 Gbytes/s, which is
for on-chip memory. Since the parallel bandwidth between the PEs and their memory is
(number of PEs) × (memory-PE bandwidth of each PE), this bandwidth increases as the
number of PEs increase. This provides an extremely wide bandwidth for SIMDs with a
large number of PEs. This allows SIMDs to avoid the von Neumann bottleneck, since a
large number of PEs can access their memory in the same step without any slowdown due
Figure 5: CSX600 accelerator board
Figure 6: MTAP architecture of CSX600
to message passing or shared memory access time. Further information on the hardware
architecture can be found in the documentation [82].
The ClearSpeed accelerator provides the Cn language as the programming interface
for the CSX600 processors. It is very similar to the standard C programming language.
The main difference is that it introduces two types of variables, namely mono and poly
variables. The mono variables are equivalent to common C variables and are used by
the control unit. A poly value is a parallel variable, with the ith value stored in the ith
processor; moreover, all these values are stored in the same memory location in their
respective processor. Further information on the associated software can be found in the
documentation [83, 84]. Here we will use three of the library functions for data transfer
on the ClearSpeed. The command memcpym2p copies from mono to poly memory and
the command memcpyp2m copies from poly to mono memory. The third one, swazzle,
exchanges data between adjacent PEs using the swazzle network, which is a ring network
connecting all the processors together. More details on the usage of these library
functions can be found in the documentation [84, 85].
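The behavior of the three transfer primitives can be sketched with a small hypothetical Python model. The function names match the ClearSpeed library calls, but the bodies are stand-ins for illustration only; `NUM_PES` is reduced from 96 to keep the example readable.

```python
# Hypothetical emulation of the three ClearSpeed data-transfer
# primitives. "Mono" memory is a plain list on the control side; "poly"
# memory is modeled as one value slot per PE. The swazzle network is a
# ring, so a shift wraps around from the last PE to the first.

NUM_PES = 4  # 96 on a real CSX600 coprocessor

def memcpym2p(mono_data):
    """Distribute mono data so PE i receives element i."""
    return list(mono_data[:NUM_PES])

def memcpyp2m(poly_data):
    """Collect one value from each PE back into mono memory."""
    return list(poly_data)

def swazzle_shift(poly_data):
    """Pass each PE's value to its neighbor around the ring."""
    return [poly_data[-1]] + poly_data[:-1]

poly = memcpym2p([10, 20, 30, 40])
poly = swazzle_shift(poly)
print(memcpyp2m(poly))  # [40, 10, 20, 30]
```

This ring-shift pattern is exactly what the tracking and conflict detection algorithms in Chapter 7 use to pass report batches from PE to PE.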
5.2 Emulating the AP on the ClearSpeed CSX600
Because the CSX600 coprocessor is a SIMD, only the associative functions introduced
in Section 2.4 need to be emulated. These associative functions have been implemented
efficiently on the CSX600, mostly by Dr. Kevin Schaffer, in both the Cn language
introduced in Section 5.1 and the assembly language supported by the CSX600 [19, 21–23]:
AND and OR reductions across a Boolean poly variable; associative search across a poly
variable; AnyResponder (following an associative search); MAX and MIN reductions across
an integer poly variable; PickOne (following use of AnyResponder). The source codes for
the ASC library for the ClearSpeed Cn language are posted on [86], and [79–81] explain
the library in more detail. To evaluate their running time, we store 30 records
in each PE and perform each associative operation once for each of these 30 records. The
timing results in microseconds (µsec, or 10⁻⁶ second) are shown in Table 1. An NA in the
table means that the function either cannot be implemented in assembly or is itself an
assembler command. Although these functions are not constant time, they are very
efficient, and establish
that we can efficiently emulate an AP using CSX600. The reason that these functions are
not constant time on ClearSpeed is that they are not supported by a broadcast-reduction
network, but instead with the swazzle (or ring) network. Clearly, the time to use this
network increases linearly as the number of processors increases.
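The interfaces of the emulated associative functions can be illustrated with hypothetical Python stand-ins. The function names follow the list above; the bodies are illustrative only, since on a real AP each would be computed through the broadcast-reduction tree rather than by scanning a list.

```python
# Hypothetical Python stand-ins for the emulated associative functions.
# Each takes a "poly" value list (one element per PE); on a real AP the
# reductions would run in constant time through the reduction tree.

def and_reduce(poly_bool):
    return all(poly_bool)

def or_reduce(poly_bool):
    return any(poly_bool)

def any_responder(mask):
    return any(mask)

def max_reduce(poly_int):
    return max(poly_int)

def pick_one(mask):
    """Index of one responder (lowest-numbered PE), or None."""
    return mask.index(True) if any(mask) else None

alts = [31000, 28000, 31000]
mask = [a == max_reduce(alts) for a in alts]  # search for the maximum
assert any_responder(mask)
print(pick_one(mask))  # 0
```

The usage pattern at the bottom (reduce, search, test responders, pick one) is the canonical way the later ATC algorithms combine these primitives.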
Table 1: Timings of Associative Functions

Associative function | Timing in Cn (µsec) | Timing in assembly (µsec)
max                  | 5.257               | 3.654
min                  | 5.257               | 3.654
AND                  | 7.024               | NA
OR                   | 7.364               | NA
associative search   | 29.2                | NA
any                  | 0.281               | NA
nany                 | 15.147              | 8.245
get                  | 13.116              | 7.876
next                 | 13.032              | 7.816
broadcast            | 100                 | NA
CHAPTER 6
ATC System Design
6.1 ATC Data Flow
The overall system design for an ATC system is shown in Figure 25. The executive
box controls the single instruction stream of AP using static scheduling. All control
paths run from the ATC executive box to all of the tasks, e.g., report correlation
and tracking. Controller input simulates sporadic requests, e.g., weather changes,
controller input, etc. Radar report data are simulated by data from data lines to two
modems and transferred from the host to the CSX600 PEs. Tracks are simulated from
flight plans in the PEs. The radar reports and tracks are used for the report correlation
and tracking task. The outputs of the tracking task are used for cockpit display,
controller display update, terrain avoidance, conflict detection and resolution (CD&R)
and final approach optimization. The results of terrain avoidance, CD&R and final
approach are used for cockpit display and controller display update. The results of
terrain avoidance and CD&R are used for the automatic voice advisory task, which
transfers results to the automatic voice advisory driver in the host and produces voice
output. The resolution advisories of the CD&R task are sent to controllers.
6.2 Static Scheduling
The eight ATC real-time tasks can be statically scheduled on the CSX600 using
the static schedule for the AP discussed in [4, 21–23]. The eight tasks and the periods
Slot | 0.5 sec | 1 sec    | 4 sec | 8 sec
1    | T1      | T2,T3,T4 |       |
2    | T1      |          | T5    |
3    | T1      | T2,T3,T4 |       |
4    | T1      |          |       |
5    | T1      | T2,T3,T4 |       |
6    | T1      |          |       | T6
7    | T1      | T2,T3,T4 |       |
8    | T1      |          |       | T7
9    | T1      | T2,T3,T4 |       |
10   | T1      |          | T5    |
11   | T1      | T2,T3,T4 |       |
12   | T1      |          |       |
13   | T1      | T2,T3,T4 |       |
14   | T1      |          |       | T8
15   | T1      | T2,T3,T4 |       |
16   | T1      |          |       |
Table 2: Static Schedule for ATC Tasks
used for each are (1) report correlation and tracking is executed every 0.5 second; (2)
cockpit display, (3) controller display update and (4) sporadic requests are executed every
second; (5) automatic voice advisory is executed every 4 seconds; (6) terrain avoidance,
(7) conflict detection and resolution, and (8) final approach are executed every 8 seconds.
An 8 second period is split into 16 one-half second periods. Task 1 is executed during
each half-second period. Tasks 2, 3 and 4 are executed during the 1st, 3rd, 5th, 7th,
9th, 11th, 13th, and 15th half-second periods. Task 5 is executed in the 2nd and 10th
half-second periods. Task 6 is executed in the 6th, task 7 is executed in the 8th and
task 8 is executed in the 14th half-second period. Table 2 shows the static schedule more
intuitively.
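The static schedule above can be reconstructed mechanically from each task's period and its fixed offset within the major period. The sketch below does this in Python; the offsets are read off Table 2 and are assumptions of this illustration, not part of the original design notation.

```python
# A sketch reconstructing the 16-slot static schedule from each task's
# period (in half-second slots) and its fixed first slot. Offsets are
# assumed from Table 2 for illustration.

SLOTS = 16  # one 8-second major period = 16 half-second slots
tasks = {            # task: (period_in_slots, first_slot)
    "T1": (1, 1), "T2": (2, 1), "T3": (2, 1), "T4": (2, 1),
    "T5": (8, 2), "T6": (16, 6), "T7": (16, 8), "T8": (16, 14),
}

schedule = {slot: [] for slot in range(1, SLOTS + 1)}
for name, (period, first) in tasks.items():
    for slot in range(first, SLOTS + 1, period):
        schedule[slot].append(name)

print(schedule[1])   # ['T1', 'T2', 'T3', 'T4']
print(schedule[2])   # ['T1', 'T5']
print(schedule[14])  # ['T1', 'T8']
```

Because the offsets are fixed at design time, the schedule is fully deterministic: no run-time scheduler is needed, which is exactly the property the AP approach exploits.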
Figure 7 illustrates the static scheduling of ATC tasks. Time slots for individual tasks
within a system major period (8 seconds) are carefully tailored. Each of them provides
Figure 7: Static schedule of ATC tasks implemented in AP
sufficient time for the worst-case execution of the complete set of system tasks in the
ATC system. The periodic tasks are run at their release times and each is completed by
its deadline.
6.3 Scaling up from CSX600 to a Realistic Size AP
We use the present ClearSpeed System to show that our ATC solution is feasible.
However, our purpose is to propose building an AP whose size is appropriate for ATC,
not a larger CSX600 board or multiple CSX600 boards because some of the disadvantages
in MIMD systems mentioned in subsection 4.2 may show up in a system with multiple
CSX600 boards. As described in Chapter 1, our ideal AP has at least 14,000 PEs with
only one aircraft per PE. Moreover, the memory size and speed of the PEs and of the
control unit can be chosen to optimize the ability of this large AP to handle the realistic-size
ATC system. Although we are unable to prove it is possible to build this ideal
AP using our CSX600 simulation, STARAN and ASPRO have already met the goal
of supporting 14,000 aircraft simultaneously [4, 5, 11]. A similar processor, the MPP
[11, 71–75], with 16,384 PEs, was delivered in 1982. While the MPP was not an AP, as it
did not have the hardware reduction network, this feature could easily have been included.
Even larger SIMD machines have been built. In particular, Thinking Machines CM-2
was delivered in 1987 and had 64K processors. Paracel developed a parallel processor
which was generally believed to be a SIMD that had one million processors.
Figure 25 shows the overall system design and data flow based on AP architecture.
Our solutions to ATC tasks for the CSX600 in Chapter 7 are easy to implement on
the AP with 14,000 PEs. References [4, 5, 11] give timing results for a previous AP with
16,000 PEs. For example, Table 3 in [4] shows that the STARAN AP executed a similar
set of ATC procedures in 4.52 seconds, leaving 3.48 seconds unused. A simplified
version of that table appears here as Table 3; it provides worst-case execution times
for statically scheduled ATC tasks with 14,000 total aircraft, 12,000 sensor reports/sec,
6,000 controllers and an 8 second major period. The data transfer of large records in real-
time is not their bottleneck because an AP can bring data in and out efficiently (please
see details in Section 2.4). Additional reasons that the AP model can solve the ATC
problem efficiently are its constant time associative operations, SIMD execution style,
and extremely wide memory bandwidth. These have eliminated a number of difficulties
that have to be handled in MIMD architectures. So an AP with 14,000 PEs using modern
technology can be built and can easily meet all of the deadlines, as this has already been
done in the past with older technology.
Table 3: Statically Scheduled Solution Time for Worst Case Environment of ATC
Tasks                           | Period (sec) | Proc Time (sec)
Report Correlation & Tracking   | 0.5          | 1.44
Cockpit Display                 | 1.0          | 0.72
Controller Display Update       | 1.0          | 0.72
Aperiodic Requests (200/sec)    | 1.0          | 0.4
Automatic Voice Advisory        | 4.0          | 0.36
Terrain Avoidance               | 8.0          | 0.32
Conflict Detection & Resolution | 8.0          | 0.36
Final Approach (100 runways)    | 8.0          | 0.2
Summation of tasks in a period  |              | 4.52
CHAPTER 7
AP Solution for ATC Tasks Implemented on CSX600
This chapter describes the algorithms for each of the eight real-time tasks, report
correlation and tracking, cockpit display, controller display update, sporadic requests,
automatic voice advisory, terrain avoidance, conflict detection and resolution (CD&R)
and final approach (runway optimization). The solutions are implemented on the CSX600
architecture, but it is easy to scale up from 96 PEs to an AP with 14,000 PEs using similar
algorithms. The best algorithm will always depend on the exact architecture of the AP.
For instance, the use of the swazzle network in CSX600 is more efficient here but in
a true AP, ordinarily each radar report would be broadcast from the host to all PEs
simultaneously, as this can be done in constant time on a true AP.
7.1 Report Correlation and Tracking
Report correlation and tracking is a task that correlates aircraft data about position
as reported by radar and the predicted data of established tracks for the aircraft in the
system [3–5]. A track is the best estimate of position and velocity of each aircraft under
observation. This task is present in many command and control real-time systems and is
executed every 0.5 second, so it is a major limitation in ATC performance. If total time
consumed is considered, this is easily the ATC task that consumes the most time, as it
is performed much more frequently than the other tasks. It is challenging because each
report has to be compared with each track, and some aircraft are changing flight mode.
The main idea of this task is from [3–5, 87]. Since all these data are stored in a
shared relational database, the input data are two database relations: aircraft data from
radar reports(Relation R) and the predicted positional data of established tracks for the
aircraft (Relation T ). The correlated radar reports are used to smooth the position and
velocity of the tracks to obtain the next estimate of position and velocity. Given an
unordered set of tracks, each report record must be evaluated with every track record in
the system to assure a match (correlation) is not missed. Multiple matches are treated
differently from unique matches, and any report records that do not match a track record
are entered as new tentative tracks.
We use the SIMD architecture to do this task. First,
we create a box for each report and each track to accommodate uncertainties of report
and track. The two database relations R and T contain information about the x and y
coordinates of points in 3-D space. The objective of the task is to determine the join of
the two relations as the intersection of two boxes, which is a many-to-many join. Each
box from R is evaluated for intersection with every box in T until all boxes from R have
been compared with all boxes in T.
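The box-intersection test at the heart of this many-to-many join can be sketched directly. The function below is a hypothetical illustration using the half-widths r (report box) and j (track box) defined in the surrounding text.

```python
# Hypothetical sketch of the box-intersection test behind the
# many-to-many join: a report box (center rx, ry, half-width r)
# intersects a track box (center tx, ty, half-width j) exactly when
# the centers are close enough in both dimensions.

def boxes_intersect(rx, ry, r, tx, ty, j):
    """True when the report box and track box overlap in x and in y."""
    return abs(rx - tx) <= r + j and abs(ry - ty) <= r + j

# Report at (10.0, 10.0) with r = 0.5 nm; track at (10.7, 9.8), j = 0.5.
print(boxes_intersect(10.0, 10.0, 0.5, 10.7, 9.8, 0.5))  # True
print(boxes_intersect(10.0, 10.0, 0.5, 12.0, 9.8, 0.5))  # False
```

On an AP the test runs for every track box simultaneously against one broadcast report, which is what makes the join linear in the number of reports.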
As presented in [19, 21], the data structures of R and T are shown in Table 4 and
Table 5, in which positions of next period are predicted by smoothed current positions
and velocities. Let T have columns X, Y , j, X1, Y1, and q, and R have columns X, Y , r
and k. Here (X, Y ) is the position of aircraft on the records of T or R; j and r are used
to give the sizes of boxes developed for aircraft in T and R, respectively. (X1, Y1) in T
is the reported position for the track record that is based on the current correlated radar
Table 4: Data Structures for Radar Reports
Attribute Type Comments
report id int Report identity
r float Report box size
X float X position
Y float Y position
Hr(tk) int Altitude
Match count int Number of correlated tracks
Match id int ID of the correlated track
k int Correlation flag
Figure 8: Track/Report correlation
report. Both q and k are flags set during the correlation procedure. As shown in Figure 8,
a box is developed around each point (X, Y) in T with the four corner points (X ± j, Y ± j).
A similar box is developed around each point (X, Y) in R with the four corner points
(X ± r, Y ± r). r > 0 is based on uncertainties in the radar report, and j > 0 is based
on uncertainties of each track to compensate for turning or larger errors in the radar report.
Initially j = 0.5 nautical mile for all tracks, and r = 0.5 nautical mile for all radar reports.
Each box of a radar report in R is compared with each box of a track record in T. If
an intersection is found between one report box and one track box, for example, r2 and t2
in Figure 8, then the report data is entered into X1 and Y1 of the correlated record in T
and a correlation flag is set in column k, the match count of this report is incremented,
Table 5: Data Structures for Tracks

Attribute Type Comments
ID int Flight identity
q int Track state(seven values)
C int Error measures(3 values)
j float Track box size
report id int ID of correlated report
X float Current X position
Y float Current Y position
Ht int Current altitude
Vx(tk) float Current X velocity
Vy(tk) float Current Y velocity
X1 float X position of correlated report
Y1 float Y position of correlated report
its ID is entered into the correlated track’s report id, and the ID of the track is entered
into the radar report’s match id. We calculate the distance between them, which is the
track’s shortest distance. If two tracks correlate to the same radar report, i.e., the radar
report’s match count > 1, then this radar report is discarded. If a second radar report
correlates with the same track, calculate the distance between the track and radar report
and call it the track’s current distance; if it is less than shortest distance, update the
shortest distance to be this distance, and record this report’s position as a candidate
report position for this track. Next, another radar report is broadcast to all PEs to
compare to all of the track records using the same procedure above. After all report
boxes have been compared with all track boxes, set error measure C = 1 for the tracks
that have correlated reports. The details of the smoothing process have been explained
in [4, 5, 19, 21, 87].
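The correlation bookkeeping described above can be sketched as follows. This is a hypothetical Python illustration with assumed field names (`match_count`, `shortest`, `candidate`); it shows only the closest-report selection, not the smoothing step.

```python
# A sketch of the correlation bookkeeping: each correlated pair bumps
# the report's match count; when the same track correlates a second
# report, only the closer one is kept as the candidate report position.

import math

def correlate(track, report):
    """Record a report/track correlation, keeping the closest report."""
    report["match_count"] += 1
    report["match_id"] = track["id"]
    d = math.hypot(track["X"] - report["X"], track["Y"] - report["Y"])
    if d < track.get("shortest", float("inf")):
        track["shortest"] = d
        track["candidate"] = (report["X"], report["Y"])
        track["report_id"] = report["report_id"]

track = {"id": 7, "X": 0.0, "Y": 0.0}
far = {"report_id": 2, "X": 0.0, "Y": 0.4, "match_count": 0}
near = {"report_id": 1, "X": 0.3, "Y": 0.0, "match_count": 0}
correlate(track, far)
correlate(track, near)
print(track["report_id"])  # 1  (the closer report wins)
```

Discarding reports whose `match_count` exceeds 1 (two tracks claiming one report) would then be a single associative search over the report relation.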
When all report boxes have been compared with all track boxes, a track that is not
Figure 9: Search box size for flight maneuver
produced by noise might still fail to correlate with any report, because the aircraft it
corresponds to is accelerating, turning, or maneuvering, or because of greater noise in
the report. We increase
track box size for any track that has not correlated with a radar report to increase its
probability to intersect a radar report box as shown in Figure 9. First we double the box
sizes of tracks that do not correlate with any reports, i.e., j = j × 2. Next, we apply the
same algorithm to compare them with the uncorrelated reports whose match count is 0. The
error measure C is set to 2 for tracks that correlate with reports in this round. The process
is repeated for all unmatched radar reports, which have the "not match" flag of column
q set in R. When no intersections are found, the process is repeated with j = j × 3.
If there still remain unmatched radar reports, new tentative tracks are started for the
reports in order to detect arrival of any new flights. If reports are due to noise, they
usually will be dropped in a few periods (several seconds) based on further evaluation of
other radar reports. The details of this task are shown in Algorithm 4. This task is done
both accurately and within deadline because of the SIMD features of AP.
If a track created for a potentially new aircraft does not correlate with any aircraft
after several passes through Algorithm 4, it is viewed as being due to noise and dropped.
Note that if this algorithm were to be executed on an AP, it would need to be modified so
Algorithm 4 Algorithm for Aircraft Tracking
1: Radar reports are transferred from the host to mono memory, then distributed from mono to the PEs.
2: for i = 1 → 96 in parallel do
3: Boxes are created around each radar report and each track in each PE to accommodate report and track uncertainties.
4: Check intersection of each report box with every track box in each PE.
5: If there is an intersection, the radar report and the track are correlated. The match count of this report is incremented, which indicates that it correlates with one track, and its ID and positions are entered into the correlated track's record.
6: All radar reports in each PE are transferred to the next PE using the ring/swazzle network, and steps 3 to 6 are repeated.
7: end for
8: After the 96 iterations, all reports have been compared with all tracks. A track that is not produced by noise might not correlate to any reports because the aircraft that it corresponds to is maneuvering.
9: Double the box sizes of tracks that have not correlated with any reports to increase their probability of intersecting a report box and repeat the steps of the for loop above to compare them with uncorrelated reports.
10: Triple the original box sizes of tracks that have not correlated yet, and run the algorithm again.
11: After 3 rounds, if there are still any uncorrelated reports, they are used to start new tracks.
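The ring pass at the core of Algorithm 4 can be condensed into a short hypothetical Python sketch, using 4 PEs instead of 96 and a one-dimensional "box" so the intersection test stays trivial.

```python
# A condensed sketch of the ring pass in Algorithm 4, with 4 PEs
# instead of 96. Each PE holds one track and one batch of reports;
# after comparing locally, every report batch moves to the next PE,
# so after NUM_PES iterations every report has met every track.

NUM_PES = 4
tracks = [{"id": i, "X": float(i)} for i in range(NUM_PES)]
reports = [[{"X": float(i)}] for i in range(NUM_PES)]  # one per PE
matches = []

for _ in range(NUM_PES):
    for pe in range(NUM_PES):         # "in parallel" on real hardware
        for rep in reports[pe]:
            if abs(rep["X"] - tracks[pe]["X"]) <= 0.5:  # box overlap
                matches.append((tracks[pe]["id"], rep["X"]))
    reports = [reports[-1]] + reports[:-1]  # swazzle the batches

print(sorted(matches))  # [(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0)]
```

On a true AP the ring pass would disappear entirely: each report would instead be broadcast to all PEs at once, as the text above notes.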
that each radar report would be broadcast in constant time to all PEs and processed prior
to broadcasting the next radar report to the PEs. The modification of the algorithms
in this section to run on an AP will be easy but will depend on the exact architecture
of the AP. It is noted that, in the AP solution, each report record is tested with every
track record in one set operation that requires constant time. That is, the overall AP
time is O(n), where n is the number of reports in this period. On the other hand, the
MIMD process is O(n²) because it has |R| × |T| comparisons, assuming O(1) processors can
access and update the needed data from the dynamic database and work on this problem
simultaneously. In the worst-case situation we anticipate 12,000 reports against 14,000
tracks. For the AP this is 12,000 operations, which is O(n), while on the MIMD it is
|R|(|T| − 1)/2, or about 8.4 × 10⁷, which is O(n²). It is noted that this number does not include
many operations that are needed by MIMD but not by AP: dynamic scheduling, load
balancing, data distribution, processor assignment, mutual exclusion of data access, etc.
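The operation counts quoted above can be checked with a few lines of arithmetic:

```python
# Worked check of the operation counts for the worst-case load of
# 12,000 reports against 14,000 tracks.

R, T = 12_000, 14_000

ap_ops = R                   # one constant-time set operation per report
mimd_ops = R * (T - 1) // 2  # pairwise comparisons on a MIMD

print(ap_ops)    # 12000
print(mimd_ops)  # 83994000, i.e. about 8.4 × 10^7
```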
7.2 Cockpit Display
First the associative operation PickOne is used to select one aircraft. Next, the
broadcast operation is used to broadcast the x, y and altitude coordinates of the plane
picked in the previous step. For each of its aircraft records, each processor computes
the x-distance, y-distance and altitude distance between the location of its aircraft and
the location of the aircraft broadcast. Then the processor identifies the aircraft that are
approaching this aircraft. This is done by using the conflict detection algorithm covered
in Section 7.6.1, find all aircraft that will be within 2 × 2 nm in x and y and within
1000 feet in altitude in 2 minutes. Next, transfer these selected aircraft’s identity, x,
y positions, altitude, velocity, heading, conflict information, etc to the display server.
Finally, use the conflict resolution algorithm in section 7.6 to obtain a conflict advisory
and transfer it to the server. This algorithm uses the SIMD architecture to parallelize
the computation and to improve its efficiency.
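The selection step of the cockpit display task can be sketched in a few lines. This is a hypothetical Python illustration; the aircraft data and thresholds (2 nm in x and y, 1000 ft in altitude) follow the description above.

```python
# A sketch of the cockpit display selection: PickOne chooses an
# aircraft, its coordinates are broadcast, and every PE tests its own
# aircraft against the proximity thresholds.

aircraft = [
    {"id": "A", "x": 0.0, "y": 0.0, "alt": 30000},
    {"id": "B", "x": 1.5, "y": 0.5, "alt": 30400},
    {"id": "C", "x": 9.0, "y": 9.0, "alt": 30000},
]

own = aircraft[0]  # PickOne result (lowest-numbered PE, say)
# Broadcast own.x, own.y, own.alt; each PE computes its distances.
approaching = [
    a["id"] for a in aircraft[1:]
    if abs(a["x"] - own["x"]) <= 2.0
    and abs(a["y"] - own["y"]) <= 2.0
    and abs(a["alt"] - own["alt"]) <= 1000
]
print(approaching)  # ['B']
```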
7.3 Controller Display Update
This task is similar to cockpit display. We transfer the updated flight identity, posi-
tions, altitude, speed, heading, etc from PEs to the ClearSpeed server, which plays the
role of the controller display in this simulation. It uses the SIMD architecture to speed
up the display process.
7.4 Automatic Voice Advisory
Automatic Voice Advisory (AVA) automatically advises an uncontrolled flight (VFR)
of near term conditions of other aircraft and terrain by voice. This task is simulated by
having the ClearSpeed server print advisories of conflict detection and resolution, terrain
avoidance tasks, etc. For example, if there is an aircraft that is approaching the aircraft
called, the message might be "aircraft at 4 miles ahead, 4,500 feet above, in 1 minute"; if
the aircraft called is heading toward terrain, the message might be "terrain, 4 miles, 3,100
feet ahead". We use an AP style of computation to do this efficiently.
7.5 Sporadic Requests
Sporadic requests include information requests or changes in data. For example,
aircraft have to avoid an area that has bad weather, aircraft make maneuvers to avoid bad
weather, or controllers make a request for runway usage, etc. This task is executed once
every second. Although the requests are not processed immediately, they are processed
very quickly. We simulate this task as follows. We first process the next unprocessed
sporadic task (assuming there is at least one). If the task is to divert aircraft so they
will miss a storm area, all affected aircraft are processed using the associative operation
PickOne to select one aircraft at a time to redirect.
7.6 Conflict Detection and Resolution (CD&R)
7.6.1 Conflict Detection
This paper considers a conflict to occur when two aircraft are predicted to be within
a distance of three nautical miles in x and y and within 1, 000 feet in altitude. To
assure timely evaluation we let the detection cycle be eight seconds, and we determine
the possibility of a future conflict between any pair of aircraft within a 5-minute
"look-ahead" period (i.e., 300 seconds) [3–5].
All IFR flights (defined in Section 2.3.2) are evaluated for their future relative space
positions with respect to each other, and each of these IFR flights is also evaluated
against all VFR flights (defined in Section 2.3.2) for their future space positions. In
the worst-case situation we anticipate 4,000 IFR records and 10,000 VFR records according to
Section 2.3.2. This means that every 8 seconds we must evaluate each of the 4,000 IFR
flights against the other 13,999 IFR and VFR flights in the worst environment. The process
is essentially a recursive join on the track relation where the best estimate of each
flight's future position is projected as an envelope into future time. The envelope has
some uncertainty in its future position that is a function of the track state. It is
modified by adding 1.5 miles to each x, y edge of the future position (to provide a
3 mile minimum miss distance) and 1,000 feet in altitude (to provide a 2,000 feet minimum miss
height). For each future space envelope, its data is broadcast to all remaining PEs first.
Then an intersection of this envelope with every other space envelope in the environment
can be checked simultaneously in constant time. First, all future flight envelopes are
generated simultaneously, which takes O(1) steps. Then, the first "trial envelope" of
the 4,000 IFR future flight envelopes is compared for possible conflict with all the other
13,999 flight envelopes for a look-ahead period of 5 minutes. Thus the equivalent of up
to 13,999 comparisons is completed simultaneously in this job set. This occurs because
each of the 14,000 records in the track table is simultaneously available to each of the
14,000 PEs that are active in the AP. The next operation selects the second trial envelope
and repeats the conflict tests against the remaining 13,998 tracks. When a trial envelope
has been tested, it is marked "done", and future trial envelopes will exclude all prior trial
tracks. When the last of the 4,000 trial envelopes is tested, the conflict detection will
have finished.
We implement the algorithm on the CSX600 to emulate the AP solution. The input data
comes from the track records in the PEs. We copy each track's ID, 3D position Xt, Yt and
Ht, and velocity Vxt and Vyt at time t to the variables ID, Xc, Yc, Hc, Vxc and Vyc,
respectively, in the poly structure trial in the PEs. Each trial's time_till, the
aircraft's earliest collision time with any other aircraft, is initialized to 300.00. The
intuitive idea is to compare each trial aircraft with all the others, and we use the
CSX600 architecture to parallelize the computation. The details of the conflict detection
algorithm are shown in Algorithm 5 (see Figure 10), known as Batcher's algorithm.
Algorithm 5 Algorithm for Conflict Detection
1: for i = 1 → 96 in parallel do
2: if for each trial and track record in each PE, their flight IDs differ and their altitudes are within 1,000 feet then
3: Project their positions 5 minutes ahead; add 1.5 to each x and y coordinate to provide a 3.0 nm minimum miss distance in each dimension. The x dimension case is shown in Figure 10.
4: Calculate min_x, max_x, min_y and max_y, the minimum and maximum intersection times in the x and y dimensions, as shown in equations 1, 2, 3 and 4.
5: Find the largest minimum time, time_min, and the smallest maximum time, time_max, across the two dimensions using equations 5 and 6.
6: If time_min is less than time_max, there is a potential conflict between the aircraft whose ID is trial.ID and the aircraft whose ID is track.ID.
7: If time_min is less than trial.time_till, then trial.time_till is updated to time_min.
8: end if
9: All trial records in each PE are passed along the swazzle/ring network to the next PE, and the above steps are repeated to update trial.time_till.
10: end for
11: After 96 iterations, all trial records have been compared with all track records. The time_till of each trial is its soonest collision time with another track.
Figure 10: Conflict detection
The following formulas are used in Algorithm 5:
min_x = (|trial.Xc − track.Xt| − 3) / |trial.Vxc − track.Vxt|    (1)
max_x = (|trial.Xc − track.Xt| + 3) / |trial.Vxc − track.Vxt|    (2)
min_y = (|trial.Yc − track.Yt| − 3) / |trial.Vyc − track.Vyt|    (3)
max_y = (|trial.Yc − track.Yt| + 3) / |trial.Vyc − track.Vyt|    (4)
time_min = max{min_x, min_y}    (5)
time_max = min{max_x, max_y}    (6)
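Equations (1)–(6) translate directly into a short C routine. The following is a sketch with hypothetical struct and function names; it assumes the relative velocity in each dimension is nonzero (a zero relative velocity would need a separate containment check, which the sketch omits):

```c
#include <assert.h>
#include <math.h>

/* Hypothetical track state: x, y position (nm) and velocity (nm/min). */
typedef struct { double x, y, vx, vy; } Track;

/* Earliest potential-conflict time between two aircraft per equations
 * (1)-(6), or -1.0 if their envelopes never overlap within the look-ahead
 * window (minutes).  The 3.0 nm separation comes from adding 1.5 nm to
 * each edge of both envelopes. */
double conflict_time(Track a, Track b, double look_ahead) {
    double min_x = (fabs(a.x - b.x) - 3.0) / fabs(a.vx - b.vx);  /* eq. (1) */
    double max_x = (fabs(a.x - b.x) + 3.0) / fabs(a.vx - b.vx);  /* eq. (2) */
    double min_y = (fabs(a.y - b.y) - 3.0) / fabs(a.vy - b.vy);  /* eq. (3) */
    double max_y = (fabs(a.y - b.y) + 3.0) / fabs(a.vy - b.vy);  /* eq. (4) */
    double time_min = min_x > min_y ? min_x : min_y;             /* eq. (5) */
    double time_max = max_x < max_y ? max_x : max_y;             /* eq. (6) */
    if (time_min < time_max && time_min < look_ahead)
        return time_min;             /* potential conflict at time_min */
    return -1.0;                     /* the x and y windows never overlap */
}
```

On the CSX600, this test runs once per PE per rotation step: the trial record moves around the ring while each PE's track record stays in place.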
In the AP solution, each IFR record is tested against every other IFR record and every
VFR record in one set operation that requires constant time, according to the description
above. So the overall AP process requires at most 4,000 operations, which is O(n)
steps, where n is the number of IFR flights in this period. On the other hand, the MIMD
process is O(n^2) because it performs IFR × (IFR − 1)/2 + IFR × VFR operations, assuming
the processors can access and update the needed data from the dynamic database in O(1)
time and work on this problem simultaneously. For the MIMD this is 4,000 × 3,999/2 + 4,000 × 10,000 =
47,998,000 operations, which is O(n^2). The AP will complete the entire set of jobs in at most
4,000 steps due to the simultaneous execution of all jobs in each jobset. Note
that this count does not include many operations that are needed by the MIMD but not by the
AP: dynamic scheduling, load balancing, data distribution, processor assignment, mutual
exclusion of data access, etc.
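As a sanity check on the arithmetic, the worst-case MIMD comparison count can be computed with a small hypothetical helper:

```c
#include <assert.h>

/* Worst-case pairwise comparison count for the MIMD solution:
 * every IFR pair once, plus every IFR-VFR pair. */
long mimd_ops(long ifr, long vfr) {
    return ifr * (ifr - 1) / 2 + ifr * vfr;
}
```

With the Section 2.3.2 worst case of 4,000 IFR and 10,000 VFR flights, mimd_ops(4000, 10000) evaluates to 47,998,000.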
7.6.2 Conflict Resolution
We use the SIMD architecture of the CSX600 to perform conflict resolution accurately and
efficiently. Conflict resolution occurs as quickly as possible after conflict detection has
been executed, so that each aircraft has an updated time_till value. In practice,
conflicts should be reasonably rare. First, we find the aircraft with the shortest conflict
time. This is the trial aircraft that will make the initial heading change. Next, the PEs
evaluate in parallel different trial trajectories for the trial aircraft. These trajectories
are created by changing the direction of the current trajectory of the trial aircraft both
clockwise and counterclockwise in increments of 5 degrees, up to a maximum change
of 30 degrees. The various trajectories are rotated through all processors using the
ring (or swazzle) network, and the conflict detection algorithm is used to determine the
longest time before each trajectory collides with another aircraft. Finally, the maximum
of the collision times of all trajectories is determined and used as the best heading change
that the trial aircraft can make. If more than one best heading change is found,
the one with the minimal degree change is used. If two have the same degree change (e.g., +10
and −10 degrees), one of them is chosen arbitrarily. Although it cannot be proven
theoretically that all conflicts will be resolved, the method works very well in our
simulation: during our tests, all conflicts were resolved after several rounds. In an actual
implementation, any unresolved conflicts could be resolved by changing the altitude of a
plane that still has a conflict after Algorithm 6 is executed.
The algorithm for conflict resolution is described in Algorithm 6. The input data are
the track and trial records in the PEs. Each PE can hold 1 to 17 track and trial records.
Algorithm 6 is based on the CSX600, not an AP. The AP could handle collision avoidance
in a much simpler manner, due to its ability to do a constant-time broadcast [3–6]. For an
AP, conflict resolution can be included as part of conflict detection. Proceeding one
aircraft at a time, each selected aircraft's trajectory information is broadcast (in constant
time) to all other aircraft, and any potential conflict is immediately corrected as follows.
The heading of the initial aircraft is altered by an increment of perhaps 10 degrees and
immediately rechecked for a conflict. This procedure continues, with the heading altered
by at most 30 degrees, until a conflict-free path with respect to all other aircraft is found.
This method is more efficient on an AP because the broadcast of one aircraft's trajectory
information can be done in constant time.
7.7 Terrain Avoidance
Terrain records are boxes that enclose a terrain height; e.g., a TV
tower is a 1.0 by 1.0 nm box with a height equal to 3,100 feet. All terrain records and tracks
are entered in each PE. Terrain avoidance is as challenging as the report correlation and
Algorithm 6 Algorithm for Conflict Resolution
1: In parallel, find the minimum time_till of all trial records sequentially in each PE.
2: Compute the global minimum time_till by taking the minimum of the local time_till values in each PE.
3: Use the pick-one command to select a processor whose local time_till is equal to the global minimum, and transfer the trajectory record of the aircraft in this PE whose time_till is minimal to the mono memory.
4: Create the array projectedpath[0], ..., projectedpath[11] in mono memory, copy the best trial aircraft's ID and positions to each of its records, and initialize collision_time, the soonest collision time with other aircraft, to ∞.
5: Each projectedpath[i] represents a path in which the trial aircraft makes a different heading change, from 5 to 30 degrees in increments of 5 degrees, alternating between left and right (e.g., 5, −5, 10, −10 degrees, etc.). This allows numerous possible paths for the trial aircraft to be evaluated in parallel.
6: Transfer projectedpath[0], ..., projectedpath[11] from mono memory to the PEs in parallel, with each PE receiving one projectedpath[i].
7: for i = 1 → 96 do
8: Each projectedpath is compared to all aircraft records in each PE with a different ID (so that the trial aircraft will not be compared to itself), unless their altitudes differ by more than 1,000 feet.
9: Calculate the minimum and maximum intersection times in the x and y dimensions using the earlier conflict detection equations 1, 2, 3 and 4.
10: Use equations 5 and 6 to get time_min and time_max. If time_min is less than time_max, there is a potential conflict between the trial aircraft and another aircraft along this path.
11: Check whether time_min is less than the collision_time of this projectedpath; if so, projectedpath.collision_time is updated to time_min. If no conflict is found, nothing needs to be done.
12: The projectedpath records in each PE are then passed to the next PE along the ring network to be compared with the records in the neighboring PE.
13: end for
14: After 96 iterations, all projectedpath records have been compared to all the other aircraft.
15: Find the maximum projectedpath.collision_time across all PEs. This path is the best scenario.
16: Change the trial aircraft's x and y velocity according to the best-scenario path, display the resolution advisory, and change the flight plan in the host.
tracking task. The terrain avoidance algorithm, Algorithm 7, is similar to the conflict
detection algorithm. The challenge is its computational intensity, and we again use the
CSX600 architecture to speed up this computation.
Algorithm 7 Algorithm for Terrain Avoidance
1: for i = 1 → 96 do
2: if for each terrain and track record in each PE, the track record's height is lower than the terrain record's then
3: Project the track record's position 2 minutes into the future; add 1.5 to each x and y edge of the future position to provide a 3.0 nm minimum miss distance. The terrain records are 1.0 by 1.0 nm boxes.
4: Calculate the minimum and maximum intersection times in both the x and y dimensions.
5: Set time_min to be the larger of the two minimum intersection times from step 4. Similarly, set time_max to be the smaller of the two maximum intersection times from step 4.
6: if time_min < time_max then
7: There is a potential conflict between the track and the terrain.
8: end if
9: end if
10: All track records in each PE are passed to the next PE, and steps 2 to 8 are repeated.
11: end for
12: After 96 iterations, all track records have been compared with all terrain records for terrain avoidance.
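A single terrain check can be sketched in the same style as the conflict detection window test. The combined half-separation sep below is a hypothetical parameter standing in for the 1.5 nm envelope expansion plus the box's own extent; since the box is fixed, the relative velocity is simply the track's own velocity:

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, y, h; } Box;            /* terrain box center (nm), height (ft) */
typedef struct { double x, y, vx, vy, h; } Plane;  /* position, velocity, altitude */

/* One terrain check of Algorithm 7: returns 1 if the track is below the
 * terrain height and its envelope reaches the box within `horizon` minutes. */
int terrain_conflict(Plane t, Box b, double sep, double horizon) {
    if (t.h >= b.h) return 0;                      /* step 2: high enough, skip */
    double min_x = (fabs(t.x - b.x) - sep) / fabs(t.vx);
    double max_x = (fabs(t.x - b.x) + sep) / fabs(t.vx);
    double min_y = (fabs(t.y - b.y) - sep) / fabs(t.vy);
    double max_y = (fabs(t.y - b.y) + sep) / fabs(t.vy);
    double tmin = min_x > min_y ? min_x : min_y;   /* step 5 */
    double tmax = max_x < max_y ? max_x : max_y;
    return tmin < tmax && tmin < horizon;          /* step 6 */
}
```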
7.8 Final Approach (Runways)
The final approach task is to optimize runway usage. Each flight has a flight plan
that specifies its departure terminal, planned departure time, its destination terminal,
and planned arrival time. The runways that occur in the region being managed by an
ATC system could be distributed among the processors. Each processor manages the
information for the runways assigned to it. Here, we assume that there are 96 runways in
the sector being managed by this ATC system and assign one runway to each processor.
The PE assigned to a runway collects aircraft departure times from this runway and
arrival times to this runway and sorts the times. The PEs will instruct the aircraft to
increase or decrease their speed to optimize runway usage and fuel cost.
This last step is currently done manually by controllers.
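The per-runway bookkeeping can be sketched as below; the min_gap spacing parameter and both helper names are hypothetical illustrations, not values or routines from the implementation:

```c
#include <assert.h>

/* Each PE sorts the departure/arrival times (seconds) for its runway. */
void sort_times(double *t, int n) {              /* simple insertion sort */
    for (int i = 1; i < n; i++) {
        double v = t[i];
        int j = i - 1;
        while (j >= 0 && t[j] > v) { t[j + 1] = t[j]; j--; }
        t[j + 1] = v;
    }
}

/* After sorting, any pair of times closer than min_gap marks an aircraft
 * that should be told to speed up or slow down.  Returns the index of the
 * first crowded slot, or -1 if the schedule is feasible. */
int first_too_close(const double *t, int n, double min_gap) {
    for (int i = 1; i < n; i++)
        if (t[i] - t[i - 1] < min_gap) return i;
    return -1;
}
```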
7.9 Comparison of AP and MIMD solutions for ATC Tasks
We assume that there is a different PE for each flight and that the data for each of
the n flights is stored in its PE. We apply static scheduling in our AP solution to the
ATC problem, which is fundamentally different from the heuristic scheduling normally
used in MIMD solutions [3–5]. The reasons that static scheduling is possible are: the
simultaneous execution of thousands of jobs in each jobset, the wider data access bandwidth
(a 200 to 500 times increase), and the ability to predict the worst-case execution
time for the AP, as explained in Section 2.4. However, like most other
real-time systems, the ATC problem cannot be efficiently scheduled on a MIMD. One of
the main reasons is that a MIMD approach uses dynamic scheduling to schedule all the
tasks. The data keeps changing, so the MIMD system cannot determine, a priori, how
many records will actually participate in the computation of a task or how much actual
computation time a task in a cycle will require. In the AP, since tasks are executed as
jobsets, there is no need to differentiate between a jobset with one record and a jobset
with thousands of records, since both require the same amount of time. Thus, the
number of records in a jobset is simply a "don't care" parameter. All set operation times
are based on the worst-case assumption. This makes it possible to schedule ATC tasks
statically in the AP solution.
Table 6 compares the complexities of the AP and MIMD solutions for the
ATC tasks [4, 5]. Note that we do not include the MIMD software for dynamic
Table 6: Comparison of Required ATC Operations

Tasks                                 MIMD     AP
Report Correlation & Tracking         O(n^2)   O(n)
Track Smooth and Predict              O(n)     O(1)
Flight Plan Update and Conformance    O(n)     O(1)
Cockpit Display                       O(n^2)   O(n)
Controller Display Update             O(n^2)   O(n)
Aperiodic Requests (200/sec)          O(n)     O(1)
Automatic Voice Advisory              O(n)     O(1)
Terrain Avoidance                     O(n^2)   O(n)
Conflict Detection                    O(n^2)   O(n)
Conflict Resolution                   O(n^2)   O(n)
Final Approach                        O(n^2)   O(n)
Coordinate Transform                  O(n)     O(1)
scheduling, load balancing, data management overhead, etc., and we assume that the
number of processors in the MIMD is much smaller than the number of PEs in the AP.
Furthermore, the MIMD solutions include the costs of running a dynamic scheduling
algorithm, resource sharing due to concurrency, processor assignment, data distribution,
concurrency control, mutual exclusion of data access, indexing and sorting, etc. [6]. These
costs not only complicate ATC software and algorithm design, but also dramatically
increase the size, complexity, and processing time of the system. On the other hand, our
AP approach provides a much simpler solution to the ATC problem: the static scheduling
assures the predictability of system performance, and the design procedure has been
greatly simplified by the table-driven scheduler shown in Table 2 and Figure 7. The
static schedule is both reliable and adaptable and can easily be modified to incorporate
system changes as needed.
7.10 Implementation of Algorithm Issues
One issue that needs attention is the implementation of these algorithms on real
machines. Why does conflict detection have 96 iterations? Why are there 96 runways in
final approach? If we replace 96 runways with 100, will our approach still outperform?
Since our ClearSpeed accelerator has 96 processors on a SIMD chip, the execution of
ATC tasks is faster if at most one record of each type is stored on each processor. For
example, if there are at most 96 aircraft being tracked at any time, then these can be
processed simultaneously in data parallel fashion. However, if there are 100 planes being
tracked, then at least 4 processors will have to manage the records for two planes each.
For example, when executing the tracking algorithm, all planes with two records will first
check their first plane’s location against a radar location and then the four PEs managing
two aircraft records will check their second plane’s location against the radar location.
During the time the 4 processors check their second plane’s location against the radar
location, the other 92 processors without a second aircraft record are inactive. The result
is that having even one processor contain two aircraft records will about double the time
required to process the aircraft records. Likewise, having two runway records stored on
one processor will typically double the amount of time to process the runway records. It
is more efficient to have at most one record of each type on each processor, so that all
records of the same type can be processed simultaneously. Of course,
if the processor speeds are fast enough and they have sufficient memory, then it may be
reasonable to assign more than one record of each type to each processor. More generally, if
the maximum number of aircraft records in the processors is k, then the time to execute
the tracking algorithm will be about k times larger than if there were at most 1 record
per processor. As a result, when a new aircraft record is created, this new record should
be assigned to a processor with a minimum number of current records of that same type.
This can be easily and efficiently accomplished using the AP associative function MIN:
each processor keeps a count of the records of each type it stores, and the constant-time
associative functions can quickly locate a processor with a minimum number of records
of that type, where the new record can be stored.
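Sketched as ordinary sequential code, the assignment rule is a minimum scan over the per-processor record counts; on an AP the same selection is the constant-time MIN function, and the helper name here is hypothetical:

```c
#include <assert.h>

/* Return the index of the processor currently holding the fewest records
 * of the given type; a new record is then stored there. */
int least_loaded_pe(const int *count, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (count[i] < count[best]) best = i;
    return best;
}
```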
CHAPTER 8
Experimental Results
This chapter describes a set of preliminary experiments that
were conducted to achieve four different goals. First, the experiments provide a proof-of-
concept for the proposed ATC system implementation based on the CSX600. Second, the
performance and scalability of the proposed approach are evaluated by comparing the
CSX600-based implementation against the fastest host-only version
using OpenMP [24] on a state-of-the-art multiprocessor server system with a total of
8 cores. Similar experiments using OpenMP have been used in image processing
algorithms on GPUs [69]. Third, these experiments provide some initial evidence
for the claim that the proposed AP-based ATC system implementation exhibits greater
efficiency and a large increase in the degree of predictability when compared to the
MIMD-based solution. Fourth, we show that our model of ATC, which consists of eight
selected tasks, can meet the deadlines for the hard real-time ATC tasks.
8.1 Experimental Setup
We create a solution for a prototype of ATC, since our implementation cannot
manage the number of aircraft in a real-world situation. In order to have information
about flights that we can control, we simulate the real-world situation by generating
aircraft flights in a three-dimensional airspace of 1,024 by 1,024 nautical miles (nm)
and 1,000 to 10,000 feet in altitude. The initial positions and realistic velocities of the
aircraft are generated randomly, and the trajectories of the aircraft consist of a constant-velocity
mode and a coordinated-turn mode. A random amount of error is added to the location
of each plane to create the radar reports. Since we control this entire process, we can
generate different numbers of aircraft to test the algorithms and to test the limits on the
number of aircraft that can be processed fast enough to meet deadlines. Unlike with live
FAA flight data, when two aircraft are on a collision course, we can alter the flight path
of one aircraft to eliminate this problem.
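A minimal generator for such simulated flights might look like the following; the struct layout and the velocity range are assumptions for illustration, not values taken from the experiments:

```c
#include <assert.h>
#include <stdlib.h>

/* Simulated airspace: 1,024 x 1,024 nm, 1,000-10,000 ft altitude. */
typedef struct { double x, y, alt, vx, vy; } Flight;

static double frand(double lo, double hi) {
    return lo + (hi - lo) * (rand() / (double)RAND_MAX);
}

Flight random_flight(void) {
    Flight f;
    f.x   = frand(0.0, 1024.0);      /* position in nm */
    f.y   = frand(0.0, 1024.0);
    f.alt = frand(1000.0, 10000.0);  /* altitude in feet */
    f.vx  = frand(-8.0, 8.0);        /* nm/min, roughly up to 480 knots */
    f.vy  = frand(-8.0, 8.0);
    return f;
}
```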
The emulation of the AP solution of ATC uses the ClearSpeed CSX600 accelerator
board and is implemented using the Cn language, as described in Section 5.1. The
alternative implementation ran on an Intel dual-processor Xeon E5410 quad-core 2.33
GHz system with 32 GB of main memory and 2×6 MB of L2 cache (for each CPU). This
system has a total of 8 cores. The implementation was done in C using the gcc compiler,
version 4.1.3. The two machines' computing power is similar because their peak
performance on some embarrassingly parallel computations [46], e.g., the Mandelbrot
set [43, 83], is almost the same, so our comparison is fair. In this section, MIMD will
usually be used to refer to this specific machine. Occasionally, it will refer to a generic
MIMD, but these instances should be clear from the context. A single-threaded
implementation on a single core of the multicore is called STI. We denote the
multi-threaded implementation on the multicore by MTI. We used the compiler
optimization flags '-mfpmath=387,sse -msse3 -ffast-math -O3' to enable Intel's
Streaming SIMD Extensions (SSE) for both the single-threaded and the multi-threaded
versions of the code. We have carefully tuned the OpenMP code to achieve good
performance, e.g., avoiding false sharing and cache ping-pong, maximizing parallel
regions, ensuring efficient use of the cache, and avoiding poor load balance and
high lock contention problems, etc. The threads were specifically designed to avoid
locking whenever possible. Additional information about the techniques used to obtain
highly efficient OpenMP programs is given in Section 2.5.
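A minimal sketch of the shape of such an OpenMP loop, assuming a simple independent per-aircraft update (the real MTI code is considerably more involved): each iteration writes only its own element, so no locking is needed, and the static schedule gives each thread one contiguous chunk, which also helps avoid false sharing:

```c
#include <assert.h>

/* Advance each aircraft's x position by vx * dt; iterations are
 * independent, so the loop parallelizes without locks.  Without
 * -fopenmp, the pragma is ignored and the loop runs sequentially. */
void update_positions(double *x, const double *vx, int n, double dt) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        x[i] += vx[i] * dt;
}
```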
8.2 Experimental Results
8.2.1 Comparison of Performance on STI, MIMD and CSX600
We scheduled the tasks of report correlation and tracking, terrain avoidance, and
CD&R on both machines. The report correlation and tracking task is executed every 0.5
seconds, the terrain avoidance task every 1 second, and the CD&R task every 2 seconds. The
deadline for each task occurs at the end of each of its periods. No task can
start executing before its release time, and all tasks must complete their
executions before their deadlines. Each approach was executed for a varying number
of aircraft, ranging from 96 to 1824 in increments of 96. For each number of aircraft
in the given range, each approach was executed 100 times. The major period is 2
seconds and contains one CD&R period, two terrain avoidance periods, and four report
correlation and tracking periods. If any of the three tasks misses its deadline during a
major period, we say that the execution has missed its deadline during this period.
We execute the three tasks under all three approaches: CSX600, MTI and STI.
The comparisons of maximum execution times are shown in Figures 11, 12 and
13. From these results we can see that MTI takes more time than the CSX600 and
that the additional time required for the MTI implementation increases more rapidly than
for the CSX600 as the number of aircraft increases. This is because the MIMD solution has
more overhead, with additional costs for consistency, dynamic scheduling, load balancing, etc.
Figure 11: Timing of tracking task.
Figure 12: Timing of terrain avoidance task.
Because the STI data differ so much from the other two, it is very
hard to distinguish the bottom two curves in these three figures. All of these graphs show
a significant speedup of the MTI implementation over the sequential
STI implementation, so the parallel code for MTI is well written. Section 8.2.2 shows the
difference between MTI and SIMD more clearly in its figures.
8.2.2 Comparison of Performance on MIMD, CSX600 and STARAN
Timing results for the tracking and correlation, CD&R, terrain avoidance and display
processing tasks for the STARAN are given in [4, 5, 11]. It takes only 0.11 seconds to
execute the tracking task for 2,000 aircraft on the STARAN. The AP in the following
four figures stands specifically for the STARAN. Figure 14 compares timings of the tracking
task for the MIMD, CSX600 and STARAN. It takes only 0.1 seconds to execute the terrain
avoidance task for 2,000 aircraft on the STARAN. Figure 15 compares timings of the terrain
Figure 13: Timing of CD&R task.
Figure 14: Comparison of tracking task of MTI, CSX600 and STARAN.
avoidance task for the MIMD, CSX600 and STARAN. It takes only 0.28 seconds to execute
the CD&R task for 2,000 aircraft on the STARAN. Figure 16 compares timings of the CD&R
task for the MIMD, CSX600 and STARAN. It takes only 0.16 seconds to execute the display
processing (controller display update) task for 2,000 aircraft on the STARAN. Figure 17
compares timings of this task for the MIMD, CSX600 and STARAN. Although we do not
have a modern AP system to test, we believe that with today's advanced architecture
it would execute the ATC tasks more quickly and accurately than, and at least as
predictably as, the STARAN.
The comparisons between MTI and the CSX600 in Figures 14, 15, 16 and 17 are fair, as
both are small systems and both do the entire job. However, the comparisons between
the CSX600 and the STARAN are not fair, as the STARAN has only one aircraft per PE,
while the CSX600 has an increasing number of aircraft per PE. The CSX600 curves in the
graphs above make the system appear more like a MIMD than the STARAN, because
the number of records per processor is increasing. Section 8.2.3
Figure 15: Comparison of terrain avoidance task of MTI, CSX600 and STARAN.
Figure 16: Comparison of CD&R task of MTI, CSX600 and STARAN.
will explain how to compare the STARAN with the CSX600 fairly and demonstrate how
closely our CSX600 emulates the STARAN.
8.2.3 How Close Our CSX600 Emulation Is to the AP
This section illustrates how close our CSX600 emulation of the AP is to actual AP
performance. Figures 18 and 19 show the timings of the tracking and CD&R
tasks as the total number of aircraft increases from 10 to 384, i.e., as the number of aircraft
per PE on the CSX600 goes from 1 to 4. We can see that the execution time is basically linear
while the maximum number of aircraft per PE does not change, and that there is a jump when
the maximum number of aircraft per PE increases. Moreover, the slope of each of the
line segments is reasonably close to the slope of the STARAN over that interval. These
results show that our CSX600 emulation closely tracks the STARAN, and that it emulates the
STARAN solution best when the maximum number of aircraft in each PE is 1.
As explained in Section 8.2.2, the comparison between the STARAN and the
CSX600 is unfair, so we provide a fairer way to compare them. We will use the CSX600
Figure 17: Comparison of display processing task of MTI, CSX600 and STARAN.
Figure 18: Time of tracking for number of aircraft from 10 to 384.
to emulate a Super-CSX600 with enough processors to match the upper bound on the
number of planes, so that there is at most one aircraft per PE. The CSX600 still
pays a price for having to do extra moves between mono memory and parallel
memory when emulating the Super-CSX600. The CD&R execution time for the Super-
CSX600 is obtained by adding the non-parallel computation time to the quotient of the
parallel execution time divided by the maximum number of aircraft per PE. The results
in Figure 20 show that the Super-CSX600 is basically linear and reasonably similar
to the STARAN, but the cost increases faster for the former due to greater overhead,
such as data transfer between mono memory and the processors, the non-constant cost of
the associative functions, and hardware disadvantages compared to the STARAN.
There is a significant difference between the AP and the Super-CSX600, but it is dwarfed by the
difference between their curves and the MTI curves.
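The timing model just described reduces to a one-line helper (the name is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Estimated Super-CSX600 task time: the serial portion is unchanged, while
 * the parallel portion shrinks by the maximum number of aircraft per PE,
 * since the larger machine would give each aircraft its own PE. */
double super_time(double serial, double parallel, int max_per_pe) {
    return serial + parallel / max_per_pe;
}
```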
Figure 19: Time of CD&R for number of aircraft from 10 to 384.
Figure 20: Comparison of MTI, STARAN and Super-CSX600 with at most one aircraft per PE.
8.2.4 Comparison of Predictability on CSX600 and MIMD
The preceding results demonstrate that the performance of the CSX600 is better than
that of the STI and MTI. We next focus on the predictability of the execution times, which
is an important factor in guaranteeing that all the ATC processing can be performed
within the specified time bounds. We measure the Coefficient of Variation (COV),
a common normalized measure of dispersion, defined as the ratio of
the standard deviation to the mean. The smaller the value, the lower the variance
and thus the higher the predictability. Unlike the standard deviation, the COV is
dimensionless, so we consider it a better tool than the standard deviation for comparing
predictability. The COVs of the three tasks are shown in Figures 21, 22 and 23. Note
that the y-axis in these figures uses a logarithmic scale. The results clearly show that
the COV values for the CSX600 are several orders of magnitude below those for MTI. In
addition, the increased variability of the MIMD is not compensated for by the running
Figure 21: Predictability of execution times for report correlation and tracking task.
Figure 22: Predictability of execution times for terrain avoidance task.
time. As shown in Figures 14, 15 and 16, the MIMD running time increases much more
rapidly than the CSX600 and AP times. Again, this is because of the advantages of the AP
over MIMD mentioned in Section 4.2 above. These observations hold not only for fewer than
8,000 aircraft, but also for large scales such as 14,000 aircraft.
8.2.5 Comparison of Missing Deadlines on CSX600 and MIMD
Last, we compare the number of major periods that missed their deadlines for
both the CSX600 and MTI. The scheduling was described at the beginning of this section.
The results are shown in Figure 24. As the number of aircraft increases, the number
of deadlines missed by the MTI execution increases dramatically. However, the CSX600
does not miss any deadline in any major period.
8.2.6 Timings for 8 Tasks
In this section we show that the ClearSpeed system can perform all the required tasks in
our ATC prototype and meet all deadlines for each major period. Table 7 shows the
Figure 23: Predictability of execution times for CD&R task.
Figure 24: Number of iterations missing deadlines when scheduling tasks.
performance with one flight per PE, which is the scenario closest to the AP. The execution
time (secs) is the time spent executing the task once. The processing time (secs)
is the total time spent on the task in an 8-second period. We can see that all
tasks complete within their deadlines. The total used time is 0.25508 seconds, which
is only 3.19% of the available 8 seconds.
Table 8 lists the worst case timings of eight tasks for 10 aircraft per PE running on
the CSX600, which is the maximum number that can be scheduled because each PE only
Table 7: Performance of One Flight/PE

Tasks                              Exec Time (Seconds)   Proc Time (Seconds)
Report Correlation & Tracking      0.00552               0.08832
Cockpit Display                    0.00272               0.02177
Controller Display Update          0.00276               0.02209
Sporadic Requests                  0.00155               0.01244
Automatic Voice Advisory           0.00544               0.01088
Terrain Avoidance                  0.00782               0.0782
Conflict Detection & Resolution    0.01301               0.01301
Final Approach (96 runways)        0.00837               0.00837
Total                                                    0.25508
Table 8: Performance of Eight Tasks for Ten Flights/PE

Tasks                              Exec Time (Seconds)   Proc Time (Seconds)
Report Correlation & Tracking      0.3409                5.4544
Cockpit Display                    0.1705                1.364
Controller Display Update          0.1825                1.46
Sporadic Requests                  0.0987                0.7896
Automatic Voice Advisory           0.1586                0.3172
Terrain Avoidance                  0.2386                0.2386
Conflict Detection & Resolution    0.5182                0.5182
Final Approach (96 runways)        0.2598                0.2598
Total                                                    10.4018
has 6K of memory. The total processing time is 10.4018 seconds, while the maximum
cycle time typically required by ATC is 10 seconds [2, 4, 5, 17, 68]. Since a 10.4 second cycle is
only slightly larger than a 10 second cycle, the performance results are relatively good
and demonstrate the scalability of this implementation on the CSX600. We can conclude
that the CSX600 implementation can meet all deadlines that can be statically scheduled.
The ClearSpeed CSX700 provides a larger SIMD system with 192 processors and should
be able to double the number of aircraft supported by the CSX600. With a faster 250
MHz clock speed, the CSX700 should either meet or come close to meeting a 10 second
deadline for execution of the major period.
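The CSX700 projection can be sanity-checked with simple scaling arithmetic. This is a rough sketch, not a measurement: it assumes execution time scales inversely with clock rate, uses the "roughly 20% faster clock" figure, and assumes doubling the PE count (96 to 192) doubles the aircraft handled without changing the per-PE time:

```python
# Rough projection of the CSX700 major-period time for the same
# 10-aircraft/PE workload.  Assumptions (not measurements): execution
# time scales inversely with clock rate, the CSX700 clock is ~20% faster
# than the CSX600's, and doubling the PE count doubles the aircraft
# handled without changing the per-PE time.
csx600_cycle = 10.4018      # seconds, the Table 8 total on the CSX600
clock_speedup = 1.20        # CSX700 vs CSX600, approximate
csx700_cycle = csx600_cycle / clock_speedup
print(f"estimated CSX700 major period: {csx700_cycle:.2f} s")  # ~8.67 s
```

Under these assumptions the CSX700 would finish twice the aircraft load in under 10 seconds, consistent with the claim above.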
The reason that we use 96 in Table 3 is that 96 is the maximum we can use without
increasing the CSX600 running time of this task by a factor of about 2. Also, this
larger number demonstrates the advantage of using a SIMD for a more computationally
intensive problem and should be enough to include all of the runways in airfields under
the control of an ATC center. The bound of 100 runways is used in both [4] and [5].
We have also tested executions for 10 runways, which is more practical for an airport
size application (as opposed to an entire ATC sector). The result is almost the same.
However, since the CSX600 has low power consumption, small size, and low weight, this
is not a significant waste of resources.
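The observation that 10 runways take essentially the same time as 96 follows from lockstep execution: every instruction is broadcast to the whole array, and masked-off PEs simply idle, so elapsed time depends on the instruction count, not on how many PEs hold live data. A toy timing model (illustrative only; the instruction count is hypothetical, and this is not ClearSpeed code):

```python
def simd_time(work_per_pe):
    # In lockstep execution every instruction is issued to the whole
    # array, so elapsed time is the *maximum* per-PE work, not the sum:
    # masked-off PEs contribute nothing extra.
    return max(work_per_pe, default=0)

INSTR_PER_RUNWAY = 500   # hypothetical instruction count per runway record
work_96 = [INSTR_PER_RUNWAY] * 96              # all 96 PEs hold a runway
work_10 = [INSTR_PER_RUNWAY] * 10 + [0] * 86   # only 10 PEs hold runways

assert simd_time(work_96) == simd_time(work_10) == INSTR_PER_RUNWAY
```

The same model explains why the SIMD advantage grows with problem size: filling the remaining PEs with data is free in time.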
8.2.7 Video and Actual STARAN Demos
Under a contract with the FAA, Goodyear Aerospace gave a demonstration in 1971 in Knoxville,
TN. This unit used a content-addressable memory built with magnetic wire to demonstrate
automatic tracking, conflict detection and resolution, terrain avoidance, and automatic
voice advisory using only about 4500 instructions. This unit was a predecessor to
the STARAN.
As a result of the successful performance of the experimental magnetic SIMD at
Knoxville, Goodyear developed a production AP named STARAN. The STARAN sys-
tem had array modules containing 256 PEs and 1028 bits of storage per PE. The system
was programmed by five people in five months and consisted of 4,017 AP assembly
instructions, and about 1,600 instructions for the control processor [13]. It was delivered
for demonstration in 1972 at the International Air Exposition at Dulles Field, and per-
formed as a control center processor. The Edwards AFB radar was used as radar input.
All radar data was tracked automatically.
A video made from the 16 mm film taken in 1972 at the International Air Exposition
at Dulles Field is available for viewing at [88]. The video shows automatic, primary and
beacon, radar tracking, conflict detection and resolution, and display processing. The
demonstration uses 256 flight plans to develop the primary and beacon radar reports.
Initially, a ten second major period was used, but later the system clock was sped up to
process major periods in one tenth of a second. The speedup of the system clock by 100
times real time is the equivalent of 25,600 tracks in real time. Both the real-time and
the 100 times speedup are captured on the video. What was demonstrated by STARAN
in 1972 cannot be done in any ATC system in the world today.
CHAPTER 9
Observations and Consequences of the Preceding Results
This section focuses on the theoretical aspects of this work, i.e., the contributions
of our AP approach to the bigger picture of high performance computing and parallel and
distributed computing. We will show that the idea of using an AP to solve ATC
problems carries over to many other data-intensive problems, and that computational
methods in many other areas can benefit from it.
A recent paper in IEEE Computer [89] observes that the widely accepted RAM model
for sequential computation is the reason for the huge progress that has occurred in se-
quential computation over the past 60 years and observes that a single widely acceptable
model for parallel computation could result in similar successes in parallel computation.
It describes several features that this parallel model and its target architectures should
satisfy, including encouraging parallel solutions to applications that are easy to use,
energy efficient, scalable, and highly portable. Moreover, it identifies pitfalls that
software production should avoid, namely data dependences, race conditions, load balancing,
nondeterminism, false sharing, deadlocks, and non-predictability, while favoring
paradigms that support construction of easy-to-use, productive, energy-efficient,
scalable, and portable parallel applications. Unfortunately, most programmers
do not have a good understanding of the latest developments in hardware design,
which is an essential skill for multicore and particularly many-core programming.
Sequential programming is a deterministic and predictable process that arises intu-
itively from the way programmers solve problems using algorithms. Parallel programming
should be as simple and intuitive as sequential programming. General programmers
should not be required to have an in-depth understanding of the latest developments in
hardware design in order to write efficient programs. The current complexity of designing
high quality parallel programs makes the process of redesigning and rewriting applica-
tions both time-consuming and expensive. The result is that most legacy applications
are still waiting to be parallelized. These and related issues are discussed further in [89].
AP programming has all of the advantages mentioned above; moreover, it is
easy for programmers because there is only a single instruction stream at any time. It is
unnecessary to worry about the extra work required by MIMDs, so AP programming is not as error
prone as multicore and particularly many-core programming. Since not many SIMDs
have been built since the 1980’s, most of today’s computer professionals have very limited
knowledge about how to design software, algorithms, or programs for SIMDs. Essentially
none of today’s computer professionals even know what an AP is. They do not think in
terms of SIMD or AP solutions to problems, and are unaware of how to solve problems
using these machines.
The associative computational model ASC introduced in [15] was designed to support
algorithm development for the AP and satisfies all of the above desired characteristics.
While the name ASC in this paper originally applied to both APs and multi-APs (which
allowed multiple instruction streams), the name ASC was later restricted to the single
instruction stream AP parallel computer and the model for multi-APs was called MASC.
Comparisons between the power of these models and other well-known computational
models are given in [42,90,91]. A possible implementation of a multi-threaded associative
processor is also investigated in [79].
Next, we consider whether the AP is primarily useful in solving only ATC-type
problems, or whether it can also solve a wide range of other problems. A language (also called
ASC) with a primer, compiler, and Windows emulator has been designed for ASC [41,92].
Many algorithms have been designed for the ASC model, and software projects have been
developed for the AP platform (often using the ASC language), to demonstrate the versatility
of the ASC model and the AP system. These include graph algorithms [15,41,93], convex hull
[94–98], string matching [99,100], sequence alignment [101,102], NP-complete problems [103,104],
databases [105, 106], parallel compilers [107–109], supporting functional and logic-based
languages [110–112], and supporting expert system languages [113–115] and computer ar-
chitecture [76–81]. In [11], Batcher lists fast Fourier transforming, sonar post-processing,
string search, file processing, air traffic control, image processing, data management, posi-
tion locating and reporting, bulk filtering, and radar processing as STARAN applications,
and for the first five of these, provides a comparison of the STARAN’s performance with
several other systems. Additional applications are discussed in [41]; the MPP, for example, was
specifically designed for and used in image processing. This wide range of AP algorithms and
software demonstrates the versatility of the AP across a wide range of applications.
The extra problems that MIMDs must solve as part of their solution to the original
problem require a great deal of extra work. In addition to the problems discussed earlier,
supporting the integrity and the ACID properties of a database requires substantial
additional software, typically involving many of the same types of support software discussed
earlier such as synchronization, dynamic scheduling, critical sections of code, sorting,
indexing, etc. Also this extra work may involve a substantial increase in communications.
All of these problems become much more difficult to handle when the database is dynamic,
as records typically have to be added or transferred frequently and massive updates occur
often, e.g., every 0.5 seconds. In contrast, the AP (and SIMDs in general) stores all
information about an object, such as an aircraft, in the local memory of the processor to
which that object is assigned. This processor executes essentially all of the computations regarding
that object, so updates to the database involving this object only require changes to the
local memory of that processor.
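This per-object locality can be sketched as follows. The dict-per-PE representation and the field names are illustrative only, not ClearSpeed or STARAN code; the point is that a track update is a purely local write, with no shared table, locks, or index maintenance:

```python
# Sketch of one-aircraft-per-PE storage (field names are hypothetical):
# each PE's local memory is modeled as a dict, and an update from a radar
# report touches exactly one PE's memory -- no shared structure is locked
# or re-indexed.
pe_memory = [
    {"flight_id": f"FL{i:03d}", "x": 0.0, "y": 0.0, "alt": 30000.0}
    for i in range(96)  # 96 PEs, as on the CSX600
]

def apply_radar_report(pe_memory, pe_index, x, y, alt):
    record = pe_memory[pe_index]   # the local memory of one PE only
    record["x"], record["y"], record["alt"] = x, y, alt

apply_radar_report(pe_memory, 7, 12.5, -3.2, 31000.0)
assert pe_memory[7]["x"] == 12.5 and pe_memory[6]["x"] == 0.0
```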
The graphs in Chapter 8 of this paper show that the magnitude of this extra work
dramatically slows down the time required for MIMDs on ATC type problems. However,
there is nothing highly unusual about the solution to ATC problems, except the hard
deadlines. While the heavy real-time nature of the ATC problem and the extremely
dynamic database that must be maintained exacerbate the extra work that MIMDs are
required to do, they primarily increase the magnitude of that work rather than its nature. It
is still the case that for most non-trivial applications, MIMDs have to use
a huge amount of extra software and do a huge amount of extra work when compared to
SIMDs.
The hardware cost of building a large AP should be much cheaper than building a
MIMD that has the capability of solving the same problems due to the simplicity of
the AP hardware. While many applications may be reasonable to solve on a MIMD,
for important large-scale problems for which a MIMD solution would be expensive and
challenging to develop, it may be considerably cheaper and much more satisfactory to have
an appropriate size AP built to handle the problem. The history of the ATC problem
provides an excellent justification for considering this alternate approach. The ongoing
MIMD ATC solution is widely known for its repeated shortcomings, in spite of the fact
that a very large scale project to redevelop this system has been launched about once
every ten years for the past 50 years.
CHAPTER 10
Conclusions and Future Work
In this dissertation, we provided an efficient and scalable solution for Air Traffic
Control (ATC) problems on the associative processor (AP) system that is emulated on
the ClearSpeed CSX600 system. We established the feasibility of this solution by showing
that it works when mapped onto the ClearSpeed CSX600 emulation of the AP. Without
many of the details of the earlier AP solution, we were able to recreate a large portion
of that implementation.
The contributions of this paper are summarized as follows. First, the AP imple-
mentation has much better scalability and efficiency than the MIMD implementation.
This implementation used static scheduling, which is possible due to the deterministic
SIMD architecture. This AP ATC software also avoids the inclusion of solutions to the
following types of problems (or calls to library functions which solve these problems):
load balancing, shared resource management, synchronization, preemption management,
sorting, indexing, cache and memory coherency management, false sharing, priority in-
version handling, race conditions, data dependencies, and deadlocks, etc. In particular,
this solution avoids use of solutions or approximate solutions to any of the numerous
multiprocessor NP-hard problems of the type given in [8]. The solution used is very sim-
ilar to a sequential solution for this problem, both in style and in code size. As a result,
the size of this software solution is only a very small fraction of the size of the MIMD
solutions to the ATC problem. This results in a dramatic drop in the cost of both
creating and maintaining this software when compared to the MIMD solutions
that have been given to the ATC problem.
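The static scheduling mentioned above can be pictured as a cyclic schedule fixed entirely offline: every task's slot is decided before run time, so no runtime scheduler, priorities, or preemption are needed. The sketch below uses hypothetical task names and frame periods (not the dissertation's actual task set) purely to show the idea:

```python
# Sketch of a static cyclic schedule of the kind a deterministic SIMD
# permits.  Task names and periods here are hypothetical, not the
# dissertation's actual task set.
MAJOR = 8.0   # major period, seconds
MINOR = 0.5   # minor frame, seconds
tasks = [("tracking", 1), ("display", 2), ("conflict", 16)]  # (name, run every n frames)

schedule = []  # (frame index, task name) pairs, fixed before run time
for frame in range(int(MAJOR / MINOR)):       # 16 minor frames per major period
    for name, every in tasks:
        if frame % every == 0:
            schedule.append((frame, name))

# The schedule is identical in every major period, so worst-case timing
# can be verified offline rather than managed by a runtime scheduler.
assert sum(1 for _, n in schedule if n == "tracking") == 16
assert sum(1 for _, n in schedule if n == "conflict") == 1
```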
Second, the proposed AP solution will support accurate and meaningful predictions of
worst case execution times and will guarantee all deadlines are met. In contrast, MIMD
systems optimize average case running time and have highly unpredictable worst case
running time. These contributions can provide major help in meeting the goals of FAA’s
NextGen Plan: fly more aircraft, more safely, more precisely, more efficiently and use
less fuel [17].
Third, the AP hardware easily scales to handle larger problem sizes by either increas-
ing the maximum number of records stored in each processor (which will slow down the
run-time) or by building a larger AP computer with more processors, which is easy to do.
Based on the simplicity of the architecture of the STARAN and ASPRO, which are the
only APs that have been built, building a larger AP system should be both easy to do and
much cheaper to build than a MIMD system of comparable size. The CM-2 Connection
Machine was a SIMD computer built by Thinking Machines in the late 1980’s that had
over 64K processors. Paracel developed a parallel processor which was generally believed
to be a SIMD that had one million processors. It would seem reasonable to expect that
an AP with a million processors could easily be built currently. Fourth, validation and
verification (V&V) of this software is simpler than for current MIMD software.
The ATC problem has similar requirements to most embedded real-time problems
with periodic tasks and hard deadlines, such as military command and control,
autonomous driving, video surveillance, image processing, simulations of complex
physical systems (e.g., weather forecasting, molecular modeling), and massive data pro-
cessing, etc. The AP’s ability to easily handle the ATC problem would also enable it
to easily handle many other real-time problems, both large and small. As a result, this
research is relevant not only to the ATC problem, but also to numerous other important
applications that involve real-time problems with hard deadlines. For example, command
and control systems such as an air defense system would be natural candidates. Other
examples may include embedded real-time systems as well.
The ATC problem is a large dynamic database problem. The AP excels in handling
the ATC since it was designed to handle real-time dynamic database activities rapidly.
For example, locating records with a particular property, reading a value from this record,
changing a value in this record, and determining whether a record with a certain property
exists are actions that can be done in constant time. By use of the flip network, large
pieces of records can be moved into parallel memory or copied from parallel memory in
constant time, making it possible to enter and to ship these records elsewhere rapidly.
The AP’s capability of handling dynamic databases makes it very useful for numerous
other applications.
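The constant-time claims can be illustrated by modeling the AP's tabular search: a broadcast comparison produces a responder mask across all PEs at once, and reads and writes are then applied through that mask. In this sketch the Python loops stand in for what the hardware does in parallel in a fixed number of steps; the record fields and values are hypothetical:

```python
# Model of associative (content-addressable) operations: a broadcast
# compare yields a responder mask over all PEs simultaneously; on the
# hardware each operation takes time independent of the number of
# records (the Python loops stand in for the parallel hardware).
records = [{"flight_id": i, "alt": 28000 + 500 * i, "emergency": i == 5}
           for i in range(96)]

# 1. Locate records with a property: broadcast compare -> responder mask.
mask = [r["emergency"] for r in records]

# 2. Determine whether a record with the property exists (any-responder).
any_responder = any(mask)

# 3. Read a value from the responding record(s).
altitudes = [r["alt"] for r, m in zip(records, mask) if m]

# 4. Change a value in all responding records.
for r, m in zip(records, mask):
    if m:
        r["alt"] = 25000  # e.g., clear the emergency flight to descend

assert any_responder and records[5]["alt"] == 25000
```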
A natural application area is controlling fleets of Unmanned Aircraft Systems (UASs).
In fact, the ATC system developed here with the ClearSpeed CSX600 could be expanded
to manage flight control for a small fleet of 480 to 960 UASs (depending on the number
of tasks performed and the size of the records stored, etc) in applications such as: (1)
patrolling areas such as international borders and reporting unusual activities; (2) early
identification of forest fires and, during actual fires, maintaining updated information
about regions that are burnt, threatened, etc.; and (3) surveillance of agricultural crops and
performing functions like spraying when needed, turning water on at various locations
when plants need water, etc. The ClearSpeed CSX700, a larger SIMD system
with 192 processors and a roughly 20% faster processor clock, should more than
double the capabilities of the CSX600. As ClearSpeed SIMD accelerators are well-known
for their very small power consumption, size, and weight, it should be possible to build
an AP that also has similar characteristics. For example, it should be possible to build
an AP that requires less power than a light bulb and have a sufficiently low weight that
would enable it to be easily deployed in the field and still be able to control 1000 UASs in
the immediate vicinity. Since UASs will soon be permitted to enter airspace controlled
by FAA, many avionics experts expect the total aircraft in the skies to rapidly increase
in numbers as the anticipated civilian use of UASs explodes. Since the demands on our
ATC system are likely to increase at a much more rapid pace than in the past, the FAA
should quickly initiate an investigation into the use of APs for ATC and how to best
integrate the APs into the FAA system.
From Sections 9 and 10 above, we know that the AP approach offers novelty and contributions
for many other applications. However, it also has a limitation: the AP is ideal
for solutions with a great deal of data parallelism, especially at large
scale, while MIMDs fit distributed applications better. For example, at the end of Section 8.2.6,
processing 10 runways takes the same time as processing 96. A natural way to overcome
this limitation is to combine AP and MIMD in the same system. The ClearSpeed has
been used extensively as an accelerator to MIMD systems. In several cases, supercom-
puters increased their ranking in the top 500 by adding multiple ClearSpeed accelerators
to their system. MIMD processors could hand off problems that the SIMD accelerator
could compute more efficiently and perform other work while waiting for ClearSpeed to
return the solution. It may be useful to investigate whether the use of both AP and
MIMD hardware in the same system as co-equal partners could also prove to be benefi-
cial. Either system could serve as the main system, but would have the option of handing
off problems to the other system that it could not handle efficiently. Such a system would
need to be able to convert from one mode to the other efficiently or else transfer data
from one system to the other efficiently. Perhaps such a combination might be useful in
building a system that could handle a wider range of problems efficiently. It also might
be useful in large systems (e.g., exascale) which would be able to handle a variety of very
large applications.
A possibly important extension to our current research would be to consider an im-
plementation of ATC on NVIDIA GPU hardware using CUDA, which has many SIMD
PE groups on its chips. The NVIDIA technology, including the latest Fermi chip, has
a lot in common with the MTAP approach of ClearSpeed. Implementing the CSX600
ATC algorithms on this architecture may provide another useful platform for this
project and would show whether it can also serve the types of real-time
applications mentioned earlier. Furthermore,
using CUDA GPU, we can investigate whether the shortcomings or bottlenecks of AP
can be improved. For example, GPUs have evolved into highly parallel multicore sys-
tems allowing very efficient manipulation of large blocks of data. This design is more
effective than general-purpose CPUs for algorithms where processing of large blocks of
data is done in parallel. We will investigate whether ATC can run well using CUDA
and whether the bottlenecks can be improved. Another potential project is to investigate
implementing our prototype on other parallel systems, e.g., Cray systems such
as their vector processors, IBM's Cell processor, and Convex systems with FPGA reconfigurable
hardware.
BIBLIOGRAPHY
[1] S. Kahne and I. Frolow, “Air traffic management: Evolution with technology,” IEEE Control Systems Magazine, vol. 16, no. 4, pp. 12–21, November 1996.
[2] M. Nolan, Fundamentals of Air Traffic Control, 3rd ed. Wadsworth: Brooks/Cole, 1998.
[3] M. Jin, “Evaluating the power of the parallel MASC model using simulations and real-time applications,” Ph.D. dissertation, Department of Computer Science, Kent State University, August 2004.
[4] W. Meilander, M. Jin, and J. Baker, “Tractable real-time air traffic control automation,” in Proc. of the 14th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2002, pp. 483–488.
[5] W. Meilander, J. Baker, and M. Jin, “Predictable real-time scheduling for air traffic control,” in Fifteenth International Conference on Systems Engineering, August 2002, pp. 533–539.
[6] ——, “Importance of SIMD computation reconsidered,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), Nice, France, April 2003.
[7] W. Meilander, J. Potter, K. Liszka, and J. Baker, “Real-time scheduling in command and control,” in Proc. of the 1999 Midwest Workshop on Parallel Processing, August 1999.
[8] M. Garey and D. Johnson, Computers and Intractability: a Guide to the Theory of NP-completeness. New York: W.H. Freeman, 1979.
[9] J. A. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo, “Implications of classical scheduling results for real-time systems,” IEEE Computer, pp. 16–25, June 1995.
[10] M. Jin, J. Baker, and K. Batcher, “Timings for associative operations on the MASC model,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, pp. 193–200.
[11] K. Batcher, “STARAN parallel processor system hardware,” in National Computer Conference and Exposition (AFIPS ’74), New York, NY, May 1974, pp. 405–410.
[12] W. Meilander, “STARAN, an associative approach to multiprocessing,” in Multiprocessor Systems, Infotech State of the Art Reports, Infotech International, 1976, pp. 347–372.
[13] J. A. Rudolph, “A production implementation of an associative array processor - STARAN,” in The Fall Joint Computer Conference (FJCC), Los Angeles, CA, December 1972.
[14] W. Meilander, “ASPRO-VME hardware/architecture,” June 1992, ER3418-5, LORAL Defense Systems.
[15] J. Potter, J. Baker, S. Scott, A. Bansal, C. Leangsuksun, and C. Asthagiri, “ASC: An associative-computing paradigm,” Computer, vol. 27, no. 11, pp. 19–25, 1994.
[16] “FAA grants for aviation research program solicitation no. FAA-06-01,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/solicitation/97solict.doc
[17] “FAA’s NextGen implementation plan,” 2011. [Online]. Available: http://www.faa.gov/nextgen/media/ng2011\ implementation\ plan.pdf
[18] “Federal Aviation Agency 1963 ATC specifications,” 1963. [Online]. Available: http://www.cs.kent.edu/~parallel/papers/FAA1963ATCSpecifications.pdf
[19] M. Yuan, J. Baker, F. Drews, and W. Meilander, “Efficient implementation of air traffic control (ATC) using the ClearSpeed CSX620 system,” in Proc. of the 21st IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), Cambridge, MA, November 2009, pp. 353–360.
[20] S. Guy, J. Chhugani, C. Kim, N. Satish, M. Lin, D. Manocha, and P. Dubey, “ClearPath: highly parallel collision avoidance for multi-agent simulation,” in ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA). ACM, August 2009, pp. 177–187.
[21] M. Yuan, J. Baker, F. Drews, L. Neiman, and W. Meilander, “An efficient associative processor solution to an air traffic control problem,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Atlanta, GA, April 2010.
[22] M. Yuan, J. Baker, W. Meilander, and K. Schaffer, “Scalable and efficient associative processor solution to guarantee real-time requirements for air traffic control systems,” in Large Scale Parallel Processing (LSPP) IEEE Workshop at the International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 2012, pp. 1682–1689.
[23] M. Yuan, J. Baker, and W. Meilander, “Comparisons of air traffic control implementations on an associative processor with a MIMD and consequences for parallel computing,” Journal of Parallel and Distributed Computing (JPDC), 2012, accepted, to appear.
[24] “OpenMP website,” 2010. [Online]. Available: http://openmp.org/wp/
[25] S. Akl, Parallel Computing: Models and Methods. New York: Prentice Hall, 1997.
[26] M. Quinn, Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2004.
[27] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, 1st ed. McGraw-Hill and MIT Press, 1990, chapter 30 on parallel algorithms.
[28] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.
[29] M. Quinn, Parallel Computing: Theory and Practice. McGraw-Hill, 1994.
[30] W. Chantamas, “A multiple associative computing model to support the execution of data parallel branches using the manager-worker paradigm,” Ph.D. dissertation, Department of Computer Science, Kent State University, December 2009.
[31] M. Garey, R. Graham, and D. Johnson, “Performance guarantees for scheduling algorithms,” Operations Research, vol. 26, no. 1, pp. 3–21, Jan–Feb 1978.
[32] J. Stankovic, “Misconceptions about real-time computing,” IEEE Computer, vol. 21, no. 10, pp. 17–25, Oct. 1988.
[33] J. Stankovic, S. Son, and J. Hansson, “Misconceptions about real-time databases,” Computer, June 1999.
[34] J. Stankovic, M. Spuri, K. Ramamritham, and G. Buttazzo, Deadline Scheduling for Real-Time Systems. Kluwer Academic Publishers, 1998.
[35] J. W. Liu, Real-Time Systems. Prentice Hall, 2000.
[36] G. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, 2nd ed. New York: Springer Science, 2005.
[37] C. Murthy and G. Manimaran, Resource Management in Real-Time Systems and Networks. MIT Press, 2001.
[38] M. Klein, J. Lehoczky, and R. Rajkumar, “Rate-monotonic analysis for real-time industrial computing,” IEEE Computer, pp. 24–33, 1994.
[39] C. L. Liu and J. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, vol. 20, no. 1, 1973.
[40] J. Anderson, J. Calandrino, and U. Devi, “Real-time scheduling on multicore platforms,” in Proc. of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, San Jose, CA, April 2006, pp. 179–190.
[41] J. Potter, Associative Computing: A Programming Paradigm for Massively Parallel Computers. New York: Plenum Press, 1992.
[42] J. Trahan, M. Jin, W. Chantamas, and J. Baker, “Relating the power of the multiple associative computing model (MASC) to that of reconfigurable bus-based models,” Journal of Parallel and Distributed Computing, vol. 70, pp. 458–466, May 2010.
[44] B. Chapman, G. Jost, and R. Pas, Using OpenMP Portable Shared Memory ParallelProgramming. MIT Press, 2007.
[45] H. Casanova, A. Legrand, and Y. Roberts, Parallel Algorithms. CRC Press, 2009.
[46] B. Wilkinson and M. Allen, Parallel Programming: Techniques and ApplicationsUsing Networked Workstations and Parallel Computers. Prentice Hall, 1999.
[47] K. Lakshmanan, S. Kato, and R. Rajkumer, “Scheduling parallel real-time tasks onmulti-core processors,” in Proc. of the 31st IEEE Real-Time Systems Symposium(RTSS’10), San Diego, CA, December 2010, pp. 259–268.
[48] I. Hwang, “Air traffic surveillance and control using hybrid estimation and protocol-based conflict resolution,” Ph.D. dissertation, Stanford University, 2003.
[49] L. Yang and J. Kuchar, “Prototype conflict alerting logic for free flight,” AIAAJournal of Guidance, Control, and Dynamics, vol. 20, no. 4, pp. 768–773, July-August 1997.
[50] J. Kuchar and L. Yang, “A review of conflict detection and resolution modelingmethods,” IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 4,pp. 179–189, 2000.
[51] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press,1988.
[52] H. Blom, R. Hogendoorn, and B. vanDoorn, “Design of a multisensor trackingsystem for advanced air traffic control,” in Multitarget-Multisensor Tracking: Ap-plication and Advances, Y.Bar-Shalom, Ed., vol. 2. Artech House, 1990, pp. 31–63.
[53] I. Hwang, H. Balakrishnan, K. Roy, and C. Tomlin, “Multiple-target tracking andidentity management in clutter for air traffic control,” in Proceedings of the AACCAmerican Control Conference, Boston, MA, June 2004.
[54] I. Hwang, J. Hwang, and C. Tomlin, “Flight-mode-based aircraft conflict detectionusing a residual-mean interacting multiple model algorithm,” in Proceedings of theAIAA Guidance, Navigation, and Control Conference, Austin Texas, August 2003.
[55] I. Hwang and C. Tomlin, “Protocol-based conflict resolution for finite informationhorizon,” in Proceedings of the AACC American Control Conference, Anchorage,Alaska, May 2002.
[56] K. Liu, “Composition of kalman and heuristic tracking algorithms for air trafficcontrol,” Master’s thesis, Department of Computer Science, Kent State University,August 1999.
[57] E. Mazor, A. Averbuch, Y. Bar-Shalom, and J. Dayan, “Interacting multiple modelmethods in tracking: A survey,” in IEEE Transactions on Aerospace and ElectronicSystems, vol. 34(1), 1998, pp. 103–123.
[58] D. Lainiotis, “Partitioning: A unifying framework for adaptive systems i: Estima-tion,” in Proceedings of the IEEE, vol. 64, August 1976, pp. 1126–1142.
[59] Y. Bar-Shalom and X. Li, Estimation and Tracking: Principles, Techniques andSoftware. Boston, Massachusetts: Artech House, 1993.
[60] D. Sworder and J. Boyd, “Estimation problems in hybrid systems,” in CambridgeUniversity Press, 1999.
[61] H. Blom and Y. Bar-Sharlom, “The interacting multiple model algorithm for sys-tems with markovian switching coefficients,” IEEE Transactions on AutomaticControl, vol. 33, no. 8, pp. 780–783, August 1988.
[62] X. Li and Y. Bar-Shalom, “Design of an interacting multiple model algorithm forair traffic control tracking,” IEEE Transactions on Control Systems Technology,vol. 1, no. 3, pp. 186–194, September 1993.
[63] P. Menon, G. Sweriduk, and B. Sridhar, “Optimal strategies for free flight air trafficconflict resolution,” Journal of Guidance, Control, and Dynamics, vol. 22, no. 2,pp. 202–211, 1999.
[64] J. Krozel, M. Peters, K. Bilimoria, C. Lee, and J. Mitchell, “System performancecharacteristics of centralized and decentralized air traffic separation strategies,” inFourth USA/Europe Air Traffic Management Research and Development Seminar,2001.
[65] Y.-J. Chiang, J. Klosowski, C. Lee, and J. Mitchell, “Geometric algorithms forconflict detection/resolution in air traffic management,” in 36th IEEE Conferenceon Decision and Control, San Diego, CA, December 1997, pp. 1835–1840.
[66] R. Paielli and H. Erzberger, “Conflict probability estimation for free flight,” Journalof Guidance, Control, and Dynamics, vol. 20, no. 3, pp. 588–596, 1997.
[67] M. Prandini, J. Hu, J. Lygeros, and S. Sastry, “A probabilistic approach to air-craft conflict detection,” IEEE Transactions on Intelligent Transportation Systems,vol. 1, no. 4, pp. 199–219, 2000.
[68] “FAA grants for aviation research program solicitation,” 2011. [Online]. Available: http://www.tc.faa.gov/logistics/grants/
[69] K. Park, N. Singhal, M. Lee, S. Cho, and C. Kim, “Design and performance evaluation of image processing algorithms on GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 91–104, January 2011.
[70] S. Reddaway, W. Meilander, J. Baker, and J. Kidman, “Overview of air traffic control using an SIMD COTS system,” in Proc. of the International Parallel and Distributed Processing Symposium (IPDPS’05), Denver, CO, April 2005.
[71] K. Batcher, “STARAN/RADCAP hardware architecture,” in Sagamore Computer Conf. on Parallel Processing, 1973, pp. 147–152.
[72] ——, “The multi-dimensional access memory in STARAN,” in Sagamore Computer Conf. on Parallel Processing, 1975, pp. 167–168.
[73] ——, “The flip network in STARAN,” in International Conf. on Parallel Processing, 1976, pp. 65–71.
[74] ——, “STARAN series E,” in International Conf. on Parallel Processing, 1977, pp. 140–143.
[75] ——, “The multi-dimensional access memory in STARAN,” IEEE Transactions on Computers, vol. C-26, no. 2, pp. 174–177, Feb 1977.
[76] H. Wang and R. Walker, “Implementing a scalable ASC processor,” in Proc. of the 17th International Parallel and Distributed Processing Symposium (Workshop in Massively Parallel Processing), Nice, France, April 2003, abstract on page 267, full text on CDROM.
[77] ——, “Implementing a multiple-instruction stream associative MASC processor,” in Proc. of the 18th International Conference on Parallel and Distributed Computing and Systems (PDCS), Dallas, Texas, November 2006, pp. 460–465.
[78] R. Walker, J. Potter, Y. Wang, and M. Wu, “Implementing associative processing: Rethinking earlier architectural decisions,” in Proceedings of the 15th International Parallel and Distributed Processing Symposium (Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, abstract on p. 195, full text on accompanying CDROM.
[79] K. Schaffer and R. Walker, “A prototype multithreaded associative SIMD processor,” in Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS)-Workshop on Advances in Parallel and Distributed Computing Models (APDCM), Long Beach, CA, March 2007, p. 228.
[80] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative SIMD processor (extended version),” Parallel Processing Letters, vol. 18, no. 4, pp. 491–509, Dec 2008.
[81] ——, “Using hardware multithreading to overcome broadcast/reduction latency in an associative SIMD processor,” in Proc. 22nd Int’l Parallel and Distributed Processing Symp. (IPDPS)-Workshop on Large-Scale Parallel Processing (LSPP), Miami, FL, April 2008, p. 289.
[82] ClearSpeed Technology plc, “ClearSpeed whitepaper: CSX processor architecture,” 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/
[83] ClearSpeed Technology plc, “ClearSpeed whitepaper: ClearSpeed software description,” 2007. [Online]. Available: https://support.clearspeed.com/documents/
[84] ClearSpeed Technology plc, “CSX600 runtime software user guide,” 2007. [Online]. Available: https://support.clearspeed.com/documents/
[85] J. Gustafson and B. Greer, “ClearSpeed whitepaper: Accelerating the Intel Math Kernel Library,” Tech. Rep., 2007. [Online]. Available: http://www.clearspeed.com/docs/resources/ClearSpeedIntelWhitepaperFeb07.pdf
[86] K. Schaffer, “ASC library for ClearSpeed,” 2012. [Online]. Available: http://www.cs.kent.edu/~kschaffe/asc/
[87] E. Eddey and W. Meilander, “Application of an associative processor to aircraft tracking,” in Proceedings of the Sagamore Computer Conference on Parallel Processing. Springer-Verlag, Aug 1974, pp. 417–428.
[88] “The video of the performance of the STARAN at Dulles,” 2012. This is a reproduction of part of the original 16mm film made in the 1980s; if a complete professional restoration is feasible, it will also be posted here. [Online]. Available: http://www.cs.kent.edu/~jbaker/ATC/ and XXXX at Elsevier site.
[89] A. Marowka, “Back to thin-core massively parallel processors,” IEEE Computer, vol. 44, no. 12, pp. 49–54, December 2011.
[90] W. Chantamas, J. Baker, and M. Scherger, “An extension of the ASC language compiler to support multiple instruction streams in the MASC model using the manager-worker paradigm,” in Proc. of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2006), June 2006, pp. 521–527.
[91] W. Chantamas and J. Baker, “A multiple associative model to support branches in data parallel applications using the manager-worker paradigm,” in Proc. of the 19th International Parallel and Distributed Processing Symposium (WMPP Workshop), April 2005, pp. 266–273.
[92] J. Potter, ASC Software, 1992. Includes a primer, Windows compiler, and Windows emulator. [Online]. Available: http://www.cs.kent.edu/~parallel/
[93] M. Jin and J. Baker, “Two graph algorithms on an associative computing model,” in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, June 2007, 7 pages.
[94] M. Atwah and J. Baker, “An associative dynamic convex hull algorithm,” in Proc. of the Tenth IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, NV, October 1998, pp. 250–254.
[95] ——, “An associative implementation of a parallel convex hull algorithm,” in Proc. of the 15th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), San Francisco, CA, April 2001, abstract on page 64, full text on CDROM.
[96] ——, “An associative static and dynamic convex hull algorithm,” in Proc. of the 16th International Parallel and Distributed Processing Symposium (IEEE Workshop on Massively Parallel Processing), Ft. Lauderdale, FL, April 2002, abstract on page 249, full text on CDROM.
[97] M. Atwah, J. Baker, and S. Akl, “An associative implementation of Graham’s convex hull algorithm,” in Proc. of the Seventh IASTED International Conference on Parallel and Distributed Computing and Systems, Washington, D.C., October 1995, pp. 273–276.
[98] ——, “An associative implementation of classical convex hull algorithm,” in Proc. of the Eighth IASTED International Conference on Parallel and Distributed Computing and Systems, Chicago, IL, October 1996, pp. 435–438.
[99] M. Esenwein, “String matching algorithms for an associative computer,” Master’s thesis, Department of Computer Science, Kent State University, 1995.
[100] M. Esenwein and J. Baker, “VLCD string matching for associative computing and multiple broadcast mesh,” in Proc. of the IASTED International Conference on Parallel and Distributed Computing and Systems, October 1997, pp. 69–74.
[101] S. Steinfadt and J. Baker, “SWAMP: Smith-Waterman using associative massive parallelism,” in IEEE Workshop on Parallel and Distributed Scientific and Engineering Computing, 2008 International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 2008.
[102] S. Steinfadt, M. Scherger, and J. Baker, “A local sequence alignment algorithm using an associative model of parallel computation,” in Proc. of IASTED Computational and Systems Biology (CASB 2006), Dallas, TX, Nov 2006, pp. 38–43.
[103] D. Ulm and J. Baker, “Solving a 2D knapsack problem on an associative computer augmented with a linear network,” in Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications, Sunnyvale, CA, Aug 1996, pp. 29–32.
[104] J. Lee, “Developing parallel SIMD algorithms for the traveling salesman problem,” Master’s thesis, Department of Computer Science, Kent State University, November 1989.
[105] P. Berra, “Some problems in associative processor applications to database management,” in Proceedings of the National Computer Conference and Exposition, May 1974, pp. 1–5.
[106] P. Berra and E. Oliver, “The role of associative array processors in database machine architecture,” IEEE Transactions on Computers, no. 4, pp. 53–61, 1979.
[107] C. Asthagiri and J. Potter, “Associative parallel lexing,” in Proceedings of the 6th International Parallel Processing Symposium (IPPS), March 1992, pp. 466–469.
[108] ——, “Parallel compilation on associative processors,” in Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT), North-Holland Publishing Co., The Netherlands, 1994, pp. 315–318.
[109] ——, “Parallel context-sensitive compilation,” Software: Practice and Experience, vol. 24, no. 9, pp. 801–822, Sept 1994.
[110] A. Bansal and J. Potter, “Exploiting data parallelism for efficient execution of logic programs with large knowledge bases,” in Proceedings of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 674–681.
[111] B. Reed, “An implementation of Lisp on a SIMD parallel processor,” in First Annual Aerospace Applications of AI, Dayton, OH, 1985, pp. 81–90.
[112] G. Steele and W. Hillis, “Connection Machine Lisp: Fine-grained parallel symbolic processing,” in Proceedings of the 1986 ACM Conference on LISP and Functional Programming (LFP). New York, NY: ACM, 1986, pp. 279–297.
[113] T. Hasten, “An OPS5 implementation on a SIMD computer,” Master’s thesis, Department of Computer Science, Kent State University, 1987.
[114] J. Potter, M. Rivett, and T. Hasten, “Rule-based systems on SIMD computers,” in Proceedings of ROBEXS, 1987, pp. 198–204.
[115] B. Reed, “The ASPRO parallel inference engine (P.I.E.): A real time production rule system,” Tech. Rep. 85-6048, 1985.
Figure 25: Overall ATC System Design.