
PARALLEL ALGORITHMS FOR HIGH PERFORMANCE SWITCHING

IN COMMUNICATION NETWORKS

APPROVED BY SUPERVISORY COMMITTEE:

Dr. S. Q. Zheng, Chair

Dr. I. Hal Sudborough

Dr. Jason P. Jue

Dr. R. N. Uma

Dr. Yuke Wang

Dr. Ashwin Gumaste

Copyright 2004

Enyue Lu

All Rights Reserved

To my grandparents,

My parents,

and

My husband.



by

ENYUE LU, B.S., M.S., M.S.

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS

August 2004

ACKNOWLEDGEMENTS

I would like to express my greatest gratitude towards my research advisor, Professor S. Q. Zheng. Without his support, guidance, and constant encouragement, this dissertation would not have been possible.

I am grateful to Professor I. Hal Sudborough, Professor Jason P. Jue, Professor R. N. Uma, Professor Yuke Wang, and Dr. Ashwin Gumaste for serving on my supervisory committee. I am also grateful to Professor Kemin Zhang, Professor Weifan Wang, Professor Yuehua Bu, and Professor Tianxing Yao for their guidance and encouragement at the beginning of my research work in combinatorics and graph theory. I would like to thank Professor Edwin Sha, Professor Kang Zhang, and Professor I-Ling Yen for their suggestions and kind help. I thank Charles Jackson, Guanyun Zou, and other colleagues for their support when I worked at Nortel Networks as a co-op in spring 2001. I also thank my labmates, Dr. Mei Yang, Yi Zhang, Bing Yang, Chuanjun Li, Priya Shetty, and Rohit Raut, for their helpful comments, suggestions, and friendship.

I am especially grateful to my grandparents, parents, and husband for their

love, support, and encouragement over the years. I dedicate this work to them.




Publication No.

Enyue Lu, Ph.D.

The University of Texas at Dallas, 2004

Supervising Professor: Dr. S. Q. Zheng

The explosive growth of the Internet is driving increased demand for faster transmission rates and faster switching technologies. On one hand, switching algorithms, including routing for establishing connections between inputs and outputs and scheduling for resolving packet contentions, play a fundamental role in the performance of switching networks. On the other hand, low-cost, high-speed, and large-capacity switching architectures are very attractive for high-performance switches and routers.

The main contributions of this dissertation can be categorized into three aspects:

Routing: Designing fast parallel routing algorithms for electronic or optical multistage interconnection networks using time, space, and wavelength approaches.

• Time Dilation: By modeling the permutation decomposition problem as the problem of edge colorings of bipartite graphs, we simplify the existing proof for the decomposability of a permutation and reduce the decomposition time to logarithmic. Using equitable coloring techniques, we further improve the routing time complexity for optical Benes networks.


• Space Dilation: We study the connection capacity of a class of multistage nonblocking switching networks constructed from Banyan networks by horizontal concatenation of extra stages and/or vertical stacking of multiple copies, and develop sublinear-time routing algorithms by modeling the routing problems for these networks as weak and strong edge colorings of bipartite graphs.

• Wavelength Dilation: We model the wavelength routing problem as the vertex coloring problem, show the maximum number of wavelengths needed, and develop polylogarithmic-time routing algorithms for WRSS Banyan networks and WRSR Benes networks.

Scheduling: Developing efficient parallel stable matching and acyclic stable matching algorithms for switch scheduling.

• Stable Matching: We propose a new approach, parallel iterative improvement, to solving the stable matching problem using randomization and greedy selection. Simulation shows that our algorithm has good average performance and converges in a small number of iterations with high probability.

• Acyclic Stable Matching: We model the acyclic stable matching problem as the dominating set problem for a rooted dependency graph. The scheduling algorithms based on our acyclic stable matching have low time complexity and are feasible for high-speed implementation.

Architecture: We propose rearrangeable nonblocking Benes group connectors and Clos group connectors to serve as the major switching matrix in the design of ingress edge routers of a burst-switched DWDM network. Based on our routing algorithms, the hardware of Benes group connectors can be reduced further.


TABLE OF CONTENTS

Acknowledgements
Abstract
List of Tables
List of Figures

CHAPTER 1 INTRODUCTION
1.1 Overview of Switching
1.2 Background
1.2.1 Switch Architecture
1.2.2 Output Contention
1.2.3 Internal Blocking
1.2.4 Crosstalk Problem in Photonic Switching
1.3 Previous Related Work on Switching Algorithms
1.3.1 Parallel Computing
1.3.2 Switch Scheduling
1.3.3 Switch Routing
1.3.4 Crosstalk-Free Routing
1.4 Motivations and Contributions of Dissertation
1.5 Outline of Dissertation

CHAPTER 2 A PARALLEL ITERATIVE IMPROVEMENT STABLE MATCHING ALGORITHM
2.1 Introduction
2.2 Definitions and Properties
2.3 Parallel Iterative Improvement Matching Algorithm
2.3.1 Constructing an Initial Matching
2.3.2 Construct a New Matching from an Existing Matching
2.3.3 PII Algorithm
2.4 Implementations of PII Algorithm on Parallel Computing Machine Models
2.5 Simulation Results
2.6 Summary

CHAPTER 3 DESIGN AND IMPLEMENTATION OF AN ACYCLIC STABLE MATCHING SCHEDULER
3.1 Introduction
3.2 A Parallel Stable Matching Algorithm for Rooted Dependency Graph
3.2.1 Dominating Set for Dependency Graph
3.2.2 The Algorithm
3.2.3 Comparison with GS Algorithm
3.3 Implementing the Scheduler
3.4 Summary

CHAPTER 4 PARALLEL ROUTING ALGORITHMS FOR NONBLOCKING ELECTRONIC AND PHOTONIC SWITCHING NETWORKS
4.1 Introduction
4.2 Nonblocking Networks Based on Banyan-type Networks
4.3 Graph Model
4.3.1 I/O Mapping Graphs
4.3.2 Graph Coloring and Nonblockingness
4.4 Routing in Rearrangeable Nonblocking Networks
4.4.1 Rearrangeable Nonblockingness
4.4.2 Algorithm for Balanced 2-Coloring of G(N, K, g)
4.4.3 Algorithm for g-Edge Coloring of G(N, K, g)
4.4.4 Parallel Routing in a Plane
4.4.5 Overall Routing Performance
4.5 Routing in Strictly Nonblocking Networks
4.5.1 Strict Nonblockingness
4.5.2 Algorithm for Strong (2g − 1)-Edge Coloring of G(N, K, g)
4.5.3 Performance Analysis
4.6 Self-Routing Nonblocking Networks
4.6.1 Connection Capacity of BL(N)
4.6.2 Constructing T(N, α)
4.6.3 Comparison
4.7 Summary

CHAPTER 5 PARALLEL CROSSTALK-FREE ROUTING FOR OPTICAL BENES NETWORKS
5.1 Introduction
5.2 Parallel Permutation Decomposition
5.2.1 Decomposability
5.2.2 Decomposing a Permutation into Two Semi-Permutations
5.2.3 Parallel Decomposition Algorithm for Partial Permutations
5.3 Routing a Semi-Permutation in an Optical Benes Network
5.3.1 A Routing Algorithm Based on Parallel Decomposition
5.3.2 The Improved Routing of Partial Semi-Permutation by Equitable Coloring
5.4 Comparisons of Three Dilation Approaches for Optical Benes Networks
5.5 Summary

CHAPTER 6 PARALLEL ROUTING AND WAVELENGTH ASSIGNMENTS FOR OPTICAL INTERCONNECTION NETWORKS
6.1 Introduction
6.2 Parallel Routing and Wavelength Assignment in WRSS Banyan Networks
6.3 Parallel Routing and Wavelength Assignment in WRSR Benes Networks
6.3.1 Upper Bound for the Number of Wavelengths
6.3.2 Routing and Wavelength Assignment Algorithm
6.4 Implementation on Realistic Multiprocessor Systems
6.5 Summary

CHAPTER 7 PARALLEL ROUTING ALGORITHMS FOR GROUP CONNECTORS
7.1 Introduction
7.2 Preliminaries
7.3 Parallel Routing for Benes Group Connectors
7.3.1 Structure of GB(N, n)
7.3.2 Graph Model of GB(N, n)
7.3.3 Algorithm for GB(N, n)
7.3.4 Analysis and Example
7.4 Parallel Routing for Clos Group Connectors
7.5 Generalizations and Hardware Redundancy
7.6 Summary

CHAPTER 8 CONCLUDING REMARKS
8.1 Contributions
8.2 Future Work

Bibliography

Vita

LIST OF TABLES

1.1 Nonblockingness of 3-stage Clos networks
1.2 Hardware costs of networks
1.3 Routing algorithms for 3-stage Clos networks and Benes networks
2.1 Time complexity for implementations of PII algorithm on three parallel computing machine models
3.1 Comparison of algorithms for finding a stable matching
3.2 Timing and area results of the scheduler design
4.1 Comparison of self-routing strictly nonblocking photonic switching networks

LIST OF FIGURES

1.1 Switching in the telecommunication networks
1.2 Developments in switching
1.3 Design of a telephone exchange
1.4 Third-generation packet switch
1.5 Processor system
1.6 Output contention
1.7 An OQ switch
1.8 HOL blocking in an IQ switch with FIFO buffers
1.9 A VOQ switch
1.10 Internal blocking in a Baseline network
1.11 Crossbar switch: (a) architecture; (b) states of crosspoint
1.12 An SE and its two states
1.13 Self-routing of Baseline network BL(16)
1.14 3-stage Clos network
1.15 Benes network B(8)
1.16 Electro-optic SE: (a) two states; (b) crosstalk
1.17 The relationship of baseline network, butterfly network and hypercube: (a) BL(8); (b) BF(8); (c) H(4)
1.18 Model of switch scheduling: (a) a VOQ switch; (b) bipartite graph; (c) maximum size matching; (d) maximal size matching
1.19 Scheduling based on stable matching in a VOQ switch
1.20 Matrix decomposition: $M = \sum_{i=1}^{2} M_i$
1.21 Graph representation: (a) edge coloring; (b) matching
1.22 Main work of the dissertation
2.1 Parallel random matching generation: (a) initial lists; (b) lists obtained after randomization
2.2 Finding a new matching from an existing matching
2.3 Parallel computing models: (a) a 16-processor hypercube; (b) a 4 × 4 mesh of trees; (c) a 4 × 4 array with multiple broadcasting buses
2.4 Performance comparisons: (a) average number of iterations for algorithms to find a stable matching; (b) frequencies for algorithms to find a stable matching within n iterations

3.1 Finding stable matching in a rooted dependency graph: (a) a rooted dependency graph $\vec{G}$ and its reduced subgraphs $\vec{G}'$ and $\vec{G}''$; (b) stable matching is found by GS algorithm in 5 iterations
3.2 Finding stable matching in an acyclic dependency graph: (a) an acyclic dependency graph $\vec{G}$ and its reduced subgraphs $\vec{G}'$ and $\vec{G}''$; (b) stable matching is found by GS algorithm in 6 iterations
3.3 A 4 × 4 scheduler design: (a) scheduler block diagram; (b) circuit structure; (c) node block diagram
4.1 A network B(16, 2, 3, α)
4.2 Finding a balanced 2-coloring: (a) an I/O mapping; (b) a balanced 2-coloring of an I/O mapping graph G(32, 25, 8); (c) a set of components; (d) pointer initialization for pointer jumping
4.3 Number of connection paths: (a) 1 path in B(16, 0, 1, α); (b) 2 paths in B(16, 1, 1, α); (c) 4 paths in B(16, 2, 1, α); (d) 8 paths in B(16, 3, 1, α)
4.4 Edge coloring: (a) a (weak) edge coloring; (b) a strong edge coloring
4.5 Construction of networks: (a) T(8, 0) based on BL(32); (b) T(8, 1) based on BL(64)

5.1 A 2-edge coloring of bipartite graph G
5.2 A decomposition example
5.3 Decomposition of a partial permutation based on 2-edge coloring: (a) 5 different types of paths; (b) directed paths formed by pointer initialization and a 2-edge coloring
5.4 An equitable 2-edge coloring of graph: (a) 3 odd paths and primary edges; (b) directed paths formed by pointer initialization and an equitable 2-edge coloring
5.5 A space dilated Benes network DB(8)
6.1 A 2 × 2 multi-wavelength SE: (a) two states; (b) signal transmission
6.2 A crosstalk-free routing and wavelength assignment for B(4): (a) a WRSR B(4) contains only basic SEs; (b) a WRSR B(4) contains non-basic SEs
6.3 Routing a permutation in WRSR B(8): (a) finding a wavelength assignment; (b) crosstalk-free routing in B(8)
7.1 Block diagrams of a group connector: (a) G(8, 4); (b) G'(8, 4)
7.2 Block diagram of an ingress edge router
7.3 A Benes group connector GB(16, 4) with k = 2
7.4 Hardware redundancy of P1 and control bit selection of P2 in G(16, 4)
7.5 The settings of SEs in the first stage of GB(16, 4) according to the equitable 2-edge coloring
7.6 Construction of a Clos group connector: (a) a 3-stage Clos group connector GC(m, n, r); (b) a 2-stage Clos group connector GC(m, m, r)

CHAPTER 1

INTRODUCTION

1.1 Overview of Switching

The ITU-T, the telecommunication standardization sector of the ITU (International Telecommunication Union), defines switching as:

"The establishing, on demand, of an individual connection from a desired inlet to a desired outlet within a set of inlets and outlets for as long as is required for the transfer of information."

Today, the information not only denotes the speech we hear in our telephone receiver, but also incorporates all types of information from several telecommunication services, as shown in Figure 1.1.

[Figure 1.1 labels: Voice, Data, Video]

Figure 1.1. Switching in the telecommunication networks

One hundred and twenty years ago, switching meant an operator interconnecting two subscribers with each other. Today we view the concept of switching differently. Present-day switching equipment must be capable of handling more services than before, including high-quality audio, video of different quality standards, LAN-to-LAN communication, the transfer of large data files, and new interactive services based on the cable-TV network. But there is more to it than the switching of information related to the service user. Information used by the network (signaling information, for example) must also be switched.

Partly as a consequence of this, the number of switching techniques in the public network has increased in recent years. In the beginning we had only circuit switching, which is very suitable for telephone services. Since then, subscribers have demanded better utilization of transmission capacity and larger bandwidth, and other techniques have emerged. As a result of the requirements imposed by data communication, circuit switching was supplemented in the 1970s with the packet switching technique.

Today we also have frame relay and two types of cell switching: asynchronous

transfer mode (ATM) and distributed queue dual bus (DQDB). The origin of frame

relay and the techniques for cell switching can be traced to packet switching.

Business networks use still other techniques, such as distributed packet switching by means of buses and rings (for example, Ethernet and token ring) and the fiber distributed data interface (FDDI) standard.

The service explosion and the tendency to transmit very large amounts of information through the network have brought the requirement for performance into focus in recent years. Good performance means that delays through the switching equipment are minimized, that the flow of information is not distorted in any way, and that the switched bandwidth can match service requirements.

It is the switching equipment that primarily limits the bandwidth of a connection. Today, we can make use of very high bit rates, up to tens of billions of bits per second (tens of Gbit/s) in optical transmission systems. However, in switching equipment, we must change over to electrical signals and considerably lower bit rates.


The next step is to use optical switching with electronic switch control. And in time, we will most assuredly have fully optical switching systems. Indeed, in view of the intensive research and development that are being carried out in this area, it should not be long before the first optical space switches are commercially available.

Figure 1.2 describes technical developments in the field of switching (public switching only). A detailed survey is given in [65].

Figure 1.2. Developments in switching

1.2 Background

Switching is the process by which a network element, called a switch, forwards data arriving at one of its inputs to one of its outputs. There are three kinds of switches: (1) telephone switches, which support the telephone network; (2) datagram switches (also called routers), which tie the Internet together; and (3) ATM switches, optimized to deal with small, fixed-size packets called cells.


1.2.1 Switch Architecture

Generally speaking, a switch contains the following parts: inputs, outputs, a switching fabric, and a switch controller. The connections will be established between inputs and outputs through the switching fabric. The switch controller is responsible for configuring the switching fabric to establish connections.

We can divide switches into two categories: circuit switches and packet switches. Circuit switches switch voice samples, while packet switches switch packets that contain both data and descriptive meta-data [39].

Circuit Switches

In a telephone switch, as shown in Figure 1.3, the switching fabric carries voice and the switch controller handles setting up and tearing down circuits. A switch transfers information from an input to an output. This can be complicated because a large central office switch may have more than 150,000 inputs and outputs.

Figure 1.3. Design of a telephone exchange

The two basic ways to connect inputs to outputs are time division switching and space division switching. In time division switching, the switch has only one input and one output; the incoming voice samples are stored in N time slots in sequence, and the switch controller determines the order in which the time slots are read out. In space division switching, the switch ordinarily consists of N inputs and N outputs, and each voice sample takes a connection path through the switch from its input to its output, depending on its source and destination. In this dissertation, we focus on space division switching.
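Although the rest of this dissertation deals with space division switching, the time division idea described above can be made concrete with a minimal sketch. The function name `time_slot_interchange` and the representation of a frame as a Python list are assumptions made only for illustration; they do not come from any particular switch design.

```python
def time_slot_interchange(frame, read_order):
    """Sketch of time division switching on one frame of N samples.

    frame      -- the N incoming samples, in arrival order
    read_order -- permutation of 0..N-1 chosen by the switch controller;
                  read_order[k] is the slot read out in the k-th position
    """
    n = len(frame)
    if sorted(read_order) != list(range(n)):
        raise ValueError("read_order must be a permutation of the slot indices")
    # Sequential write: sample i is stored in time slot i.
    slots = list(frame)
    # Controlled read: the controller decides the order of the slots.
    return [slots[i] for i in read_order]


if __name__ == "__main__":
    print(time_slot_interchange(["a", "b", "c", "d"], [2, 0, 3, 1]))
```

Here the single input and single output are modeled by one list each, and the only state the controller needs is the read order for the current frame.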

Packet Switches

There are two types of packet switches: virtual circuit ATM switches and datagram routers. An ATM switch handles fixed-size packets, called cells, while a datagram router handles variable-size packets. In this dissertation, we call them both "switches" and use "cells" to refer to both fixed-size and variable-size packets.

Packet switches have evolved through three generations. Details about each generation and comparisons of the characteristics of the three generations of switches can be found in [39]. Figure 1.4 shows the block diagram of the third-generation switch architecture, where packets arriving at the inputs simultaneously enter the switch fabric, through which they are routed in parallel to the outputs.

[Figure 1.4 block labels: Inputs, Switch fabric, Outputs, Control processor]

Figure 1.4. Third-generation packet switch


Processor System

Although processor control in both circuit and packet switches can be implemented

in several ways, two main divisions have been made:

(1) Centralized control, where all work to set up connections is controlled from

a central processor system; and

(2) Distributed control, where the control functions are shared by a number

of processors that are more or less independent of one another.

In centralized control, if only one processor is used to perform both routine work and advanced operations, the system is called a single-processor system. In this system, the processor must be dimensioned according to the most difficult tasks. At the same time, however, because the routine tasks are the most time-consuming, the processor may have difficulty getting everything done. One solution to this problem is to let several processors share the work load; such a system is called a multiprocessor system.

In distributed control systems there is no central processor for the overall functions. Instead, the switching equipment is divided into a number of switching parts, each of which has its own processor. In this case, the processors may have complete control over all the work in their respective switching parts, or certain functions that connect different switching parts may, to a lesser degree, remain under centralized control.

Figure 1.5 gives an overall view of different processor systems.

1.2.2 Output Contention

Output contention happens when connections from different inputs are requested to be established to the same output simultaneously, as shown in Figure 1.6. In each switching slot, only one connection can be established to a given output. Thus, for circuit switching, only one connection request can be accepted and the others will be blocked. For packet switching, only one of the contending cells can be transmitted across the switching fabric and each output can only send one cell, and thus the other cells must be either discarded or buffered [11].

[Figure 1.5 diagram: Processor system divides into Centralized (Single-processor; Multi-processor) and Distributed (Several switching parts with some centralized functions; Several independent switching parts with own processors)]

Figure 1.5. Processor system


[Figure 1.6 labels: inputs 1, 2, ..., N; outputs 1, 2, ..., N; output contention]

Figure 1.6. Output contention

For packet switching, switches can be categorized according to where the cells are buffered. In this dissertation, we only consider switches with buffers at the inputs and/or outputs.

Output Queueing Switch

In an output queueing (OQ) switch, all cells destined for the same output are allowed to arrive at the output at the same time. Since only one cell can be transmitted via the output link at a time, the remaining cells are buffered at the output, as shown in Figure 1.7.

The price to pay for resolving output contention with such a scheme is the need to operate the switch fabric and the memory at each output port at N times the line speed if there are N inputs. As the line speed or the number of switch ports increases, this requirement becomes a bottleneck.


Figure 1.7. An OQ switch
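The N-fold speedup of an OQ switch can be illustrated with a small calculation. The sketch below assumes the worst case in which all N inputs send a cell to the same output in the same slot; the function name `oq_memory_rate` and the example port counts and line rate are hypothetical values chosen only for illustration.

```python
def oq_memory_rate(n_inputs: int, line_rate_gbps: float) -> float:
    """Worst-case rate (Gbit/s) at which one OQ output buffer must accept cells.

    Assumes all n_inputs may deliver a cell to the same output in one slot,
    so the fabric and output memory must run at n_inputs times the line rate.
    """
    if n_inputs < 1 or line_rate_gbps <= 0:
        raise ValueError("need at least one input and a positive line rate")
    return n_inputs * line_rate_gbps


if __name__ == "__main__":
    # Hypothetical port counts at an assumed 10 Gbit/s line rate.
    for n in (8, 32, 128):
        print(n, "ports:", oq_memory_rate(n, 10.0), "Gbit/s")
```

The required rate grows linearly in both the port count and the line rate, which is why the scheme stops scaling as either one increases.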

Input Queueing Switch

Another way to resolve output contention is to place a buffer in each input port. Only one cell is allowed to go to the same output port at one time, and the cells that lose contention will need to wait at the input buffer. A switch with such architecture is called an input queueing (IQ) switch. An arbiter is needed to decide which cells should be chosen and which cells should be rejected. This decision can be based on cell priority or timestamp, or be random.

For an IQ switch with first-in-first-out (FIFO) input buffers, a blocked cell at the head of an input queue can prevent cells behind it that are destined for idle outputs from being forwarded. This is called the head-of-line (HOL) blocking problem. As shown in Figure 1.8, the cell in input N destined for output 1 is blocked while output 1 is idle. Due to HOL blocking, the throughput of the input-buffered switch is at most 58.6% for uniform random traffic [38].
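The saturation limit quoted above can be illustrated with a small Monte Carlo sketch. It is not the analysis of [38]; it assumes a saturated switch (every input always holds a HOL cell), uniformly random destinations, and random arbitration among contending inputs. The function name and its parameters are illustrative.

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate the saturation throughput of a FIFO input-queued switch.

    Assumptions (illustrative): every input always holds a HOL cell,
    destinations are uniform, and each requested output serves one
    contending input chosen at random in every cell slot.
    """
    rng = random.Random(seed)
    # hol[i] is the output requested by the cell at the head of input i.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group inputs by the output that their HOL cell requests.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender; losers stay blocked.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # the next cell moves to the head
            served += 1
    return served / (n_ports * n_slots)
```

For a moderately large switch, for example `hol_saturation_throughput(32, 5000)`, the estimate comes out near 0.6, while a switch with a single port trivially reaches 1.0 because no contention is possible.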

Figure 1.8. HOL blocking in an IQ switch with FIFO buffers

Input Queueing with Virtual Output Queues

The HOL blocking problem of the IQ switch can be overcome by providing, at each input, a separate FIFO queue for each output. Each input buffer of the switch is logically divided into N queues. Such a FIFO queue is called a virtual output queue (VOQ), introduced in [89], and such a switch architecture is called a virtual-output-queueing (VOQ) switch, as shown in Figure 1.9. All N VOQs in an input buffer share the same physical memory, and each holds the cells destined for a unique output port. Hence, HOL blocking is avoided, and the throughput is increased. However, a VOQ switch requires a fast and intelligent arbitration mechanism. Since there are $N^2$ VOQs, up to $N^2$ (instead of N) HOL cells compete for switching in each cell slot. A complex arbitration is needed to decide which cells (at most N) are switched in each cell slot. This becomes the bottleneck of the switch.
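As a minimal sketch of the queue organization described above, one input port can be modeled as a list of N FIFO queues, one per output. The class and method names are hypothetical and chosen only for illustration; the shared physical memory is not modeled.

```python
from collections import deque


class VOQInput:
    """One input port whose buffer is logically split into one FIFO per output."""

    def __init__(self, n_outputs):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, output):
        """Append an arriving cell to the VOQ of its destination output."""
        self.queues[output].append(cell)

    def hol_requests(self):
        """Outputs for which this input currently has a HOL cell."""
        return [out for out, queue in enumerate(self.queues) if queue]

    def dequeue(self, output):
        """Remove and return the HOL cell of the VOQ for the given output."""
        return self.queues[output].popleft()
```

With cells queued for outputs 0 and 2, `hol_requests()` reports both outputs, so a cell for output 2 can be served even while output 0 is busy. With N such inputs, the arbiter sees up to $N^2$ requests per cell slot.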

Combined Input and Output Queueing Switches

Combined input and output queueing (CIOQ) switches have buffers in both input and output ports. This kind of switch architecture is intended to combine the advantages of both input buffering and output buffering. In an IQ switch, the input buffer speed is comparable to the input line rate. In a CIOQ switch, each output port can accept up to L (1 < L < N) cells in each time slot. If more than L cells are destined for the same output port, the excess cells are stored in the input buffers instead of being discarded.

Figure 1.9. A VOQ switch

To achieve a desired throughput, the speedup factor L can be engineered based on the input traffic distribution. Since the output buffer memory only needs to operate at L times the line rate, a large-scale switch can be built by using input and output buffering. However, this type of switch requires a complicated arbitration mechanism to determine which L cells among the N HOL cells may go to the same output port.

1.2.3 Internal Blocking

A switch fabric is a set of links and switching elements for establishing connections

between inputs and outputs. There are many implementations for switch fabrics. In

this dissertation, we focus on the switch fabric that is implemented by an intercon-

nection network.

While a connection is being established in an interconnection network, it can face another contention problem, called internal link blocking, which occurs when multiple connections contend for a link inside the switch fabric at the same time. As shown in Figure 1.10, an internal physical link is shared by two connections: the connection from 0 to 4 and the connection from 2 to 5.

Figure 1.10. Internal blocking in a Baseline network

According to their internal blocking properties, switches are classified as blocking or nonblocking. In a nonblocking switch, a connection path is always available to connect any idle input to any idle output, while in a blocking switch a connection path may not be found between an idle input and an idle output.

Nonblocking networks have been favored in switching systems since they can set up any one-to-one I/O mapping. There are three types of nonblocking networks: strictly nonblocking (SNB), wide-sense nonblocking (WSNB), and rearrangeable nonblocking (RNB) [7], [31]. In both SNB and WSNB networks, a connection can be established from any idle input to any idle output without disturbing existing connections. In an SNB network, any of the available paths for a connection can be chosen, whereas in a WSNB network a rule must be followed to choose one. In an RNB network, a path for the connection from any idle input to any idle output is available if the rearrangement of existing connections is allowed. In the following, we introduce several interconnection networks that will be discussed throughout the dissertation.

Crossbar

Basically, an $N \times N$ crossbar, as shown in Figure 1.11, consists of an array of $N \times N$ individually operated crosspoints, which control connections. Each crosspoint has two logical states, cross and bar, where the cross state is the default state.

Figure 1.11. Crossbar switch: (a) architecture; (b) states of crosspoint

The crossbar has three attractive properties: it is strictly nonblocking, simple in architecture, and modular. However, the hardware cost in terms of the number of crosspoints grows as $O(N^2)$, which is prohibitively high for large N.

A connection between input i and output j is established by setting the $(i, j)$-th crosspoint to the bar state while letting the other crosspoints along the connection remain in the cross state. The bar state of a crosspoint can be triggered individually by the destination of each incoming connection. That is, a connection in a crossbar is set up using only the addresses of its source and destination, regardless of other connections. This property is called the self-routing property, and a network with this property is called a self-routing network. Thus, the crossbar switch is a self-routing network. A self-routing network can be either nonblocking, such as a crossbar, or blocking, such as the Banyan network introduced in the following.

Banyan-type network

The Banyan-type network is a multistage interconnection network (MIN), which usually comprises a number of switching elements (SEs) grouped into several stages interconnected by a set of links. Each SE can be implemented by a $2 \times 2$ crossbar. Depending on whether the upper/lower input is connected with the upper/lower output of an SE, it has two logical states, namely, straight and cross (see Figure 1.12).

Figure 1.12. An SE and its two states

A class of Banyan-type networks has received considerable attention. A network belonging to this class satisfies the following basic properties:

i. It has N inputs, N outputs, $\log N$ stages, and $N/2$ SEs in each stage.

ii. There is a unique path between each input and each output.

iii. Let u and v be two SEs in stage i, and let $S_j(u)$ and $S_j(v)$ be the two sets of SEs that u and v can reach in stage j, $0 < i + 1 = j \le n$. Then $S_j(u) \cap S_j(v) = \emptyset$ or $S_j(u) = S_j(v)$ for any u and v.

Because of the above properties (short connection diameter, unique connection path, uniform modularity, etc.), Banyan-type networks are very attractive for constructing switching networks. Several well-known networks, such as Banyan, Omega, Shuffle, and Baseline, belong to this class. It has been shown that these networks are topologically equivalent [2, 96]. In this dissertation, we use the Baseline network as the representative of Banyan-type networks.

An $N \times N$ Baseline network, denoted by BL(N), is constructed recursively. A BL(2) is a $2 \times 2$ SE. A BL(N), $N = 2^n$ and $n > 1$, consists of a switching stage of $N/2$ SEs and a shuffle connection, followed by a stack of two BL($N/2$)'s. Thus, a BL(N) has $\log N$ stages labeled $0, \ldots, \log N - 1$ from left to right, and each stage has $N/2$ SEs labeled $0, \ldots, N/2 - 1$ from top to bottom. The upper and lower outputs of each SE in stage i are connected with two BL($N/2^{i+1}$)'s, named the upper subnetwork and the lower subnetwork, respectively. The N links interconnecting two adjacent stages i and i+1 are called output links of stage i and input links of stage i+1. The input (resp. output) links in the first (resp. last) stage of BL(N) are connected with the N inputs (resp. outputs) of BL(N). To facilitate our discussions, the label of each stage, link, and SE is represented by a binary number. Let $a_l a_{l-1} \cdots a_1 a_0$ be the binary representation of a. We use $\bar{a}$ to denote the integer that has the binary representation $a_l a_{l-1} \cdots a_1 (1 - a_0)$. An example is shown in Figure 1.13.

Figure 1.13. Self-routing of Baseline network BL(16) (stages 0 to 3; upper and lower subnetworks BL(8); connection paths P0 and P1)

Self-routing in BL(N) is decided by the destination, $d_{n-1} d_{n-2} \cdots d_0$, of each connection. If the $(n - i)$-th bit, $d_{n-i-1}$, of the destination equals 0, the input of the SE on the connection path in stage i is connected to the SE's upper output, and to the lower output otherwise (i.e., $d_{n-i-1} = 1$). For example, in Figure 1.13, the connection path P0 from 0010 to 1011 in BL(16) is set up by self-routing. Since the destination is 1011, the connection path passes the lower, upper, lower, and lower outputs of SEs 1, 4, 4, and 5 in stages 0, 1, 2, and 3, respectively. More specifically, for a connection with destination $d_{n-1} \cdots d_0$ at stage i, if it arrives at input link $b_{n-1} \cdots b_1 b_0$, then it leaves from output link $b_{n-1} \cdots b_1 d_{n-i-1}$. Since two adjacent stages are connected by a shuffle connection (i.e., output link $b_{n-1} b_{n-2} \cdots b_1 b_0$ in stage i is connected to input link $b_{n-1} \cdots b_{n-i} b_0 b_{n-i-1} \cdots b_1$ in stage i+1), the unique path for each connection can be derived as follows. We show the labels of the input link, SE, and output link at stage i on a connection from source $s_{n-1} \cdots s_0$ to destination $d_{n-1} \cdots d_0$.

Input link: $d_{n-1} \cdots d_{n-i} \, s_{n-1} \cdots s_{i+1} s_i$

SE: $d_{n-1} \cdots d_{n-i} \, s_{n-1} \cdots s_{i+1}$

Output link: $d_{n-1} \cdots d_{n-i} \, s_{n-1} \cdots s_{i+1} \, d_{n-i-1}$
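The label formulas above translate directly into a few string operations. The following sketch is only an illustration (the function name and the tuple layout are assumptions); it builds the input link, SE, and output link labels at every stage of BL($2^n$) from the bits of the source and destination.

```python
def baseline_path(src, dst, n):
    """Return [(stage, input_link, se, output_link)] for a connection in BL(2**n).

    Link labels are n-bit strings; SE labels are integers.
    """
    if not (0 <= src < 2 ** n and 0 <= dst < 2 ** n):
        raise ValueError("src and dst must be n-bit labels")
    s = format(src, f"0{n}b")  # s[0] is s_{n-1}, s[-1] is s_0
    d = format(dst, f"0{n}b")  # d[0] is d_{n-1}, d[-1] is d_0
    hops = []
    for i in range(n):
        se_bits = d[:i] + s[:n - 1 - i]   # d_{n-1}..d_{n-i} s_{n-1}..s_{i+1}
        in_link = se_bits + s[n - 1 - i]  # append s_i
        out_link = se_bits + d[i]         # append the routing bit d_{n-i-1}
        se = int(se_bits, 2) if se_bits else 0
        hops.append((i, in_link, se, out_link))
    return hops
```

For the path P0 of Figure 1.13, `baseline_path(0b0010, 0b1011, 4)` visits SEs 1, 4, 4, and 5, and the last bits of the output links are 1, 0, 1, 1 (lower, upper, lower, lower), in agreement with the example in the text.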

The Banyan-type switch offers several advantages. First, its cost in terms of the number of SEs is $O(N \log N)$, which makes it much more suitable than the crossbar for the construction of large switches. Second, self-routing is an attractive feature in that no control mechanism is needed for establishing connections. Third, due to their modular and recursive structure, large-scale switches can be built from smaller switches without modifying their structures.

The main drawback of the Banyan-type switch is that it is a blocking network. Its performance degrades rapidly as the size of the switch increases.

Three-Stage Clos Network

The three-stage Clos network [13], shown in Figure 1.14, consists of three stages of switch modules (SMs), which can be implemented by crossbars. In the first stage, a set of N inputs is broken up into $r_1$ subsets of $n_1$ inputs. Each subset of inputs goes into a unique first-stage SM. Each of the first-stage SMs has m outputs connecting to all m middle-stage SMs. Similarly, each of the middle-stage SMs has $r_2$ outputs connecting to all $r_2$ third-stage SMs. In the third stage, N output lines are provided by $r_2$ subsets of $n_2$ lines. A 3-stage Clos network is denoted by $C(n_1, r_1, m, n_2, r_2)$, and, in the symmetric case where $n_1 = n_2 = n$ and $r_1 = r_2 = r$, by $C(n, m, r)$.

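Figure 1.14. 3-stage Clos network ($r_1$ first-stage SMs of size $n_1 \times m$, m middle-stage SMs of size $r_1 \times r_2$, and $r_2$ third-stage SMs of size $m \times n_2$)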

Depending on the number of SMs in the middle stage, the 3-stage Clos network

can be blocking, rearrangeable nonblocking, or strictly nonblocking. That is, by

increasing the value of m, the probability of blocking is reduced. The following table

shows the relation between the value of m and the blockingness of C(n;m; r).

m            | $m < n$  | $n \le m < 2n - 1$ | $m \ge 2n - 1$
$C(n, m, r)$ | blocking | RNB                | SNB

Table 1.1. Nonblockingness of 3-stage Clos networks
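Table 1.1 amounts to two comparisons on m. A minimal sketch follows; the function name is illustrative, and r is omitted because the table does not depend on it.

```python
def clos_blockingness(n, m):
    """Classify a symmetric 3-stage Clos network C(n, m, r) following Table 1.1."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    if m < n:
        return "blocking"
    if m < 2 * n - 1:
        return "RNB"  # rearrangeable nonblocking
    return "SNB"      # strictly nonblocking
```

For n = 4, the network is blocking for m = 3, rearrangeable nonblocking for m = 4 through 6, and strictly nonblocking from m = 7 on.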

The 3-stage Clos network provides an advantage in that it reduces the hardware complexity from $O(N^2)$ in the case of the crossbar switch to $O(N^{3/2})$, and the switch can be designed to be nonblocking. Furthermore, it also provides more reliability since there is more than one possible path through the switch to connect any input port and any output port.
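The $O(N^{3/2})$ figure can be checked with a short calculation. The sketch below assumes a symmetric, strictly nonblocking $C(n, m, r)$ with $m = 2n - 1$, $N = nr$, and SMs built from crossbars; the choice $n \approx \sqrt{N/2}$ is the usual large-N approximation and is an assumption not stated in the text above. $X(n)$ denotes the total number of crosspoints.

```latex
\begin{align*}
X(n) &= \underbrace{2\,r\,n\,m}_{\text{first and third stages}}
      + \underbrace{m\,r^{2}}_{\text{middle stage}}
      = (2n-1)\left(2N + \frac{N^{2}}{n^{2}}\right), \\
X\!\left(\sqrt{N/2}\right) &= \left(2\sqrt{N/2}-1\right)\cdot 4N
      = 4\sqrt{2}\,N^{3/2} - 4N = O\!\left(N^{3/2}\right).
\end{align*}
```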


Benes network

The Benes network [6] is a well-known rearrangeable nonblocking MIN. The $N \times N$ Benes network, denoted by B(N), can be constructed in the following three ways:

(1) Based on crossbar

B(N) can be constructed recursively based on an SE (i.e., a $2 \times 2$ crossbar). A B(2) is an SE. A B(N) consists of a switching stage of $N/2$ SEs, an $N/2 \times 2$ shuffle connection (i.e., $O_i$ is connected to $I_j$ with $j \equiv (N/2) \cdot i + \lfloor i/2 \rfloor \pmod{N}$ in two adjacent stages [70]), followed by a stack of two B($N/2$)s, a $2 \times N/2$ shuffle connection, and another switching stage of $N/2$ SEs. Thus, a B(N) consists of $2 \log N - 1$ stages labeled $0, 1, \ldots, 2 \log N - 2$ from left to right (in this dissertation, all logarithms are in base 2), and each stage consists of $N/2$ SEs labeled $0, 1, \ldots, N/2 - 1$ from top to bottom. A pair of SEs i and $\bar{i}$ in the same stage is called a pair of dual SEs. The two inputs (outputs) of a $2 \times 2$ SE are called dual inputs (outputs). Each B(N) contains 2 B($N/2$)s, alternately named upper subnetwork and lower subnetwork from top to bottom, and 4 B($N/4$)s, alternately named upper subnetwork, lower subnetwork, upper subnetwork, and lower subnetwork from top to bottom, and so on. As shown in Figure 1.15, a B(8) contains 2 B(4)s within dashed boxes, each containing 2 B(2)s within dotted boxes.

Figure 1.15. Benes network B(8)

(2) Based on Banyan network

A Benes network can also be constructed by concatenating a Baseline network and a reverse Baseline network with the center stages overlapped.

(3) Based on 3-stage Clos network

A Benes network B(N) can also be constructed recursively by replacing each SM in the middle stage of a 3-stage Clos network with a smaller 3-stage Clos network, $\log N - 2$ times. More specifically, in order to construct B(N), in the first iteration, we replace each of the two SMs in the middle stage of $C(2, 2, N/2)$ with a $C(2, 2, N/2^2)$; in the second iteration, we replace each of the $2^2$ SMs in the middle stages of the 2 $C(2, 2, N/2^2)$s with a $C(2, 2, N/2^3)$; and so on. In the $(\log N - 2)$-th iteration, we replace each of the $2^{\log N - 2}$ SMs in the middle stages of the $2^{\log N - 3}$ $C(2, 2, 4)$s with a $C(2, 2, 2)$.

The Benes network is a rearrangeable nonblocking permutation network and one of the most efficient switching architectures in terms of the number of crosspoints used. In general, a Benes network B(N) requires $2N(2 \log N - 1)$ crosspoints, N being a power of two. The main disadvantage of this type of network is that a fast and intelligent mechanism is needed to rearrange the connections to avoid internal blocking when establishing new connections.

The high degree of connection capability in SNB and WSNB networks comes at a high hardware cost, while RNB networks are usually constructed with lower hardware cost. For networks of the same size, with N inputs and N outputs, Table 1.2 shows the hardware cost in terms of the number of crosspoints for different interconnection networks.

Network              | Blockingness | Cost
Crossbar             | SNB          | $N^2$
3-stage Clos network | WSNB, RNB    | $\Theta(N^{1.5})$
Benes network        | RNB          | $2N(2 \log N - 1)$
Banyan network       | Blocking     | $2N \log N$

Table 1.2. Hardware costs of networks
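To give a feeling for how the closed-form entries of Table 1.2 compare, the sketch below evaluates them for a given N. It is an illustration only: the function name is hypothetical, N is assumed to be a power of two, and the Clos entry is left out because the table gives only its order of growth.

```python
def crosspoint_costs(n_ports):
    """Evaluate the closed-form crosspoint counts of Table 1.2 for N = n_ports."""
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("N must be a power of two and at least 2")
    log_n = n_ports.bit_length() - 1  # log2(N) for a power of two
    return {
        "crossbar": n_ports ** 2,
        "benes": 2 * n_ports * (2 * log_n - 1),
        "banyan": 2 * n_ports * log_n,
    }
```

For N = 8 this gives 64, 80, and 48 crosspoints for the crossbar, Benes, and Banyan networks, respectively; for N = 1024 it gives 1048576, 38912, and 20480, which shows that the savings of the multistage networks appear only for larger N.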


1.2.4 Crosstalk Problem in Photonic Switching

To build a large IP router with a capacity of 1 Tb/s and beyond, either electronic or optical switching can be used. The deployment of optical fibers as a transmission medium has prompted the search for a solution to the speed mismatch between transmission and switching. Optical routers have better scalability than electronic routers in terms of switching capacity. However, the required optical technologies are too immature for all-optical switching to happen anytime soon. A hybrid approach, in which optical signals are switched but both switch control and routing decisions are carried out electronically, is more practical. Advances in electro-optic technologies provide a promising choice to meet the increasing demands for high channel bandwidth and low communication latency in optical communication.

A hybrid optical MIN (OMIN) can be built from $2 \times 2$ electro-optic switching elements (SEs), such as the common lithium-niobate (LiNbO$_3$) SE (e.g. [27, 30, 84]). Such an SE is a directional coupler with two inputs and two outputs. Depending on the amount of voltage at the junction of the two waveguides, optical signals carried on either of the two inputs can be coupled to either of the two outputs. An electronically controlled optical SE can have a switching speed ranging from hundreds of picoseconds to tens of nanoseconds [78], which is much faster than a Micro-Electro-Mechanical System (MEMS) switch [78]. However, large OMINs built by integrating these electro-optic SEs have the problem of crosstalk, which is caused by undesired coupling between signals carried in two waveguides, so that the two signal channels interfere with each other. Crosstalk can also be generated by intersecting interconnection links (crossovers) in OMINs. It has been shown that crossover crosstalk can be made negligible by a careful physical design of the interconnection patterns [40].

The crosstalk generated by signals with different wavelengths can be quite easily eliminated by a wavelength filter, and thus, this type of crosstalk is called filterable crosstalk. If the crosstalk is generated by signals with the same wavelength, it is called non-filterable crosstalk [56]. The crosstalk originating from an SE is called first-order crosstalk, which may result in higher-order crosstalk when it interferes with signals in other SEs. In this dissertation, we only consider non-filterable first-order crosstalk. An optical switching network is considered crosstalk-free if the connections passing through the same SE at the same time have different wavelengths.

Each SE has two logic states, namely, straight and cross (see Figure 1.16 (a)).

Figure 1.16 shows an example of crosstalk in an SE. For the straight state, a small

fraction of the input signal injected at the upper input may be detected at the lower

output (see Figure 1.16 (b)). Crosstalk can also occur when an SE is in the cross

state. Consequently, the input signal will be distorted at output due to loss and

crosstalk accumulated along a connection path.

Figure 1.16. Electro-optic SE: (a) two states; (b) crosstalk

Let us look at blocking properties more closely. In an SE, two inputs (resp. outputs) intending to be connected with the same output (resp. input) cause an output link conflict (resp. input link conflict). The existence of crosstalk in photonic switching networks adds a new dimension of blocking, called node conflict, which happens when more than one connection with the same wavelength passes through the same SE at the same time. Figure 1.13 shows two connection paths, P0 from 0010 to 1011 and P1 from 0100 to 1010. P0 and P1 have an output link conflict in stage 2 and an input link conflict in stage 3 because both inputs of SE 4 in stage 2 intend to be connected with its lower output and both outputs of SE 5 in stage 3 intend to be connected with its upper input. The two paths have node conflicts at SEs 4 and 5 in stages 2 and 3, respectively.
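The node conflicts in this example can be recomputed from the SE label $d_{n-1} \cdots d_{n-i} \, s_{n-1} \cdots s_{i+1}$ of a path at stage i, as given in Section 1.2.3. The sketch below is only illustrative: the function names are hypothetical, and both connections are assumed to use the same wavelength.

```python
def se_on_path(src, dst, n, stage):
    """SE label d_{n-1}..d_{n-i} s_{n-1}..s_{i+1} at stage i of BL(2**n)."""
    high = dst >> (n - stage)  # top `stage` bits of the destination
    low = src >> (stage + 1)   # top n-1-stage bits of the source
    return (high << (n - 1 - stage)) | low


def node_conflicts(conn_a, conn_b, n):
    """Return [(stage, se)] where two (src, dst) connections share an SE.

    Assumes both connections carry the same wavelength.
    """
    shared = []
    for stage in range(n):
        se_a = se_on_path(conn_a[0], conn_a[1], n, stage)
        se_b = se_on_path(conn_b[0], conn_b[1], n, stage)
        if se_a == se_b:
            shared.append((stage, se_a))
    return shared
```

Calling `node_conflicts((0b0010, 0b1011), (0b0100, 0b1010), 4)` for P0 and P1 reports SE 4 in stage 2 and SE 5 in stage 3, matching the conflicts described above.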

1.3 Previous Related Work on Switching Algorithms

1.3.1 Parallel Computing

Before discussing the existing algorithms for switch scheduling and routing, we first introduce the computing models for these algorithms.

A commonly accepted model for designing and analyzing sequential algo-

rithms consists of a central processing unit with a random-access memory attached

to it. The typical instruction set for this model includes reading from and writing

into the memory, and basic logic and arithmetic operations.

The main purpose of parallel processing is to perform computations faster than can be done with a single processor by using a number of processors concurrently. A parallel computer system is simply a collection of processors, typically of the same type, interconnected in a certain fashion to allow the coordination of their activities and the exchange of data. Parallel computer systems can be classified according to a variety of architectural features and modes of operation. In particular, these criteria include the type and the number of processors, the interconnections among processors and the corresponding communication schemes, the overall control and synchronization, and the input/output schemes. Parallel algorithms are evaluated by several important criteria, such as the number of processors, time performance, space utilization, and communication scheme.

The bounds on the resources (for example, processors, time, and space) required by an algorithm are measured as a function of the input size, which reflects the amount of data to be processed. We are primarily interested in the worst-case analysis of algorithms. Given an input size N, each resource bound represents the maximum amount of that resource required by any instance of size N. These bounds are expressed using the following standard notation:

• $f(N) = O(g(N))$ if there exist positive constants c and $N_0$ such that $f(N) \le c\,g(N)$ for all $N \ge N_0$.

• $f(N) = \Omega(g(N))$ if there exist positive constants c and $N_0$ such that $f(N) \ge c\,g(N)$ for all $N \ge N_0$.

• $f(N) = \Theta(g(N))$ if $f(N) = O(g(N))$ and $f(N) = \Omega(g(N))$.

For two functions f(N) and g(N), $f(N) = O(g(N))$, $\Theta(g(N))$, or $\Omega(g(N))$ means that f(N) is asymptotically no greater than, equal to, or no less than g(N), respectively [15].

The running time of an algorithm is estimated by the number of basic opera-

tions required by the algorithm as a function of the input size. In this dissertation,

we assume the basic operations include reading from and writing into the memory,

sending to and receiving from processors, and the basic arithmetic and logic op-

erations such as adding, subtracting, comparing, or multiplying two numbers, and

computing the bitwise logic OR or AND of two words. The cost of an operation does

not depend on the word size.

Parallel computing models can be used as general frameworks for describing and analyzing parallel algorithms. Here we introduce only the two models that are used in this dissertation: the shared-memory model and the network model.

The shared-memory model consists of a number of processors, each of which has its own local memory and can execute its own local program, and all of which communicate by exchanging data through a shared memory unit. Each processor is uniquely identified by an index, called a processor number or processor id, which is available locally. If all the processors operate synchronously under the control of a common clock, this synchronous shared-memory model is called the parallel random-access machine (PRAM) model. The key assumptions of the PRAM model are shared memory and a synchronous mode of operation. There are several variations of the PRAM model, based on how simultaneous accesses by several processors to the same location of the global memory are handled. The exclusive read exclusive write (EREW) PRAM does not allow any simultaneous access to a single memory location. The concurrent read exclusive write (CREW) PRAM allows simultaneous access for a read instruction only. The concurrent read concurrent write (CRCW) PRAM allows simultaneous access to a location for either a read or a write instruction. These three models do not differ substantially in their computational power, although the CREW is more powerful than the EREW, and the CRCW is the most powerful.

In the network model, a network can be viewed as a graph $G(V, E)$, where each node $i \in V$ represents a processor, and each edge $(i, j) \in E$ represents a two-way communication link between processors i and j. Each processor is assumed to have its own local memory, and no shared memory is available. The operation of a network may be either synchronous or asynchronous. Processors can communicate with each other by sending and receiving data over the two-way communication links among them.

The network model incorporates the topology of the interconnection between the processors into the model itself. In the following, we introduce several topologies that are often used in the dissertation.

A completely connected multiprocessor system of size N consists of a set of processing elements (PEs) $PE_i$, $0 \le i \le N - 1$, connected in such a way that there is a connection between every pair of PEs. In this dissertation, unless otherwise specified, we assume that each PE can communicate with at most one PE during a communication step. With this restriction, any algorithm for such a system is equivalent to an algorithm under the EREW PRAM abstract model for parallel computing.

A linear array processor system of size N consists of N processors $P_1, P_2, \ldots, P_N$ connected in a linear array; that is, processor $P_i$ is connected to $P_{i-1}$ and to $P_{i+1}$, whenever they exist. A two-dimensional array is the two-dimensional version of the linear array. It consists of $N^2$ processors arranged into an $N \times N$ grid such that processor $P_{i,j}$ is connected to processors $P_{i \pm 1, j}$ and $P_{i, j \pm 1}$, whenever they exist.

A hypercube multiprocessor system of size $N = 2^d$, denoted by $H(2^d)$, consists of N processors, indexed from 0 to $N - 1$, interconnected into a d-dimensional Boolean cube that can be defined as follows. Let the binary representation of i be $i_{d-1} i_{d-2} \cdots i_0$, where $0 \le i \le N - 1$. Then processor $P_i$ is connected to processors $P_{i^{(j)}}$, where $i^{(j)} = i_{d-1} \cdots \bar{i}_j \cdots i_0$ and $\bar{i}_j = 1 - i_j$, for $0 \le j \le d - 1$. In other words, two processors are connected if and only if their indices differ in exactly one bit position. The hypercube has a recursive structure. We can extend a d-dimensional cube to a (d+1)-dimensional cube by connecting corresponding processors of two d-dimensional cubes. One cube has the most significant address bit equal to 0; the other cube has the most significant address bit equal to 1. Thus an $H(2^d)$ is constructed from two $H(2^{d-1})$'s by adding $2^{d-1}$ edges, named d-dimension edges, that connect the corresponding $2^{d-1}$ nodes in the two $H(2^{d-1})$'s. H(2) is an edge with two nodes. The hypercube is popular because of its regularity, small diameter, many interesting graph-theoretic properties, and ability to handle many computations quickly and simply.
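The adjacency rule, flipping exactly one bit of the index, can be written with a single XOR per dimension. A minimal sketch, with an illustrative function name:

```python
def hypercube_neighbors(i, d):
    """Indices of the processors adjacent to P_i in a d-dimensional hypercube.

    Entry j of the result is i with bit j complemented.
    """
    if not 0 <= i < 2 ** d:
        raise ValueError("i must be a d-bit index")
    return [i ^ (1 << j) for j in range(d)]
```

In $H(2^3)$, for instance, processor 5 (binary 101) is adjacent to processors 4, 7, and 1, obtained by flipping bits 0, 1, and 2 in turn.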

A butterfly multiprocessor system of size $(N/2) \log N$, denoted by BF(N), consists of $(N/2) \log N$ processors. The structure of the butterfly is isomorphic to the Banyan-type network discussed in Section 1.2.3. Butterfly networks are also in the family of the hypercube [47] because $H(N/2)$ can be obtained from BF(N) by merging all SEs in row i of BF(N) into node i of $H(N/2)$ and merging all links connecting SEs contained in two different nodes into an edge of $H(N/2)$. Figure 1.17 shows the relationship of the baseline network, butterfly network, and hypercube. In Figure 1.17 (c), the d-dimension edges are labeled by $d^*$.

Figure 1.17. The relationship of baseline network, butterfly network and hypercube: (a) BL(8); (b) BF(8); (c) H(4)

1.3.2 Switch Scheduling

For packet switching, only one cell can be transmitted across the switching fabric to each output at a time. Due to output contention, an arbitration process is needed to decide which cells are transferred. The arbitration scheme, named switch scheduling, is essentially a service discipline that arranges the service order among cells. An algorithm that implements the arbitration scheme is called a scheduling algorithm.

Mathematical Models for Switch Scheduling

The scheduling problem for packet switches can be modeled as a matching problem on a bipartite graph $G = (V, E)$, where $V = V_1 \cup V_2$, $V_1 = \{\text{inputs}\}$, $V_2 = \{\text{outputs}\}$, and $E = \{\text{connections for packets/cells at the head of queues in inputs/outputs}\}$. Figure 1.18 shows an example of the graph model for a VOQ switch.

In a graph G, a set of independent edges (no two edges in the set are adjacent to each other) is called a matching of G. A maximum size matching of G is one with the largest number of edges among all matchings of G. A maximal size matching is one that is not contained in any other matching. That is, for a maximal matching, any edge that is not in the matching must be adjacent to some edge in the matching. Figure 1.18 shows an example of a maximum size matching and a maximal size matching of the bipartite graph for a VOQ switch.

Figure 1.18. Model of switch scheduling: (a) a VOQ switch; (b) bipartite graph; (c) maximum size matching; (d) maximal size matching
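The difference between the two notions can be made concrete with a small sketch. It is an illustration under stated assumptions, not one of the scheduling algorithms surveyed in this chapter: the maximal matching is built greedily in edge order, the maximum matching uses standard augmenting paths, and the function names and the three-edge request graph are hypothetical.

```python
def greedy_maximal_matching(edges):
    """Scan edges in order and keep an edge if both endpoints are still free."""
    used_inputs, used_outputs, matching = set(), set(), []
    for inp, out in edges:
        if inp not in used_inputs and out not in used_outputs:
            matching.append((inp, out))
            used_inputs.add(inp)
            used_outputs.add(out)
    return matching


def maximum_matching(edges):
    """Maximum size bipartite matching found by repeated augmenting paths."""
    adjacency = {}
    for inp, out in edges:
        adjacency.setdefault(inp, []).append(out)
    input_of = {}  # output -> input currently matched to it

    def augment(inp, visited):
        for out in adjacency.get(inp, []):
            if out in visited:
                continue
            visited.add(out)
            # Take a free output, or re-route the input that holds this one.
            if out not in input_of or augment(input_of[out], visited):
                input_of[out] = inp
                return True
        return False

    for inp in adjacency:
        augment(inp, set())
    return sorted((inp, out) for out, inp in input_of.items())


requests = [(1, 1), (1, 2), (2, 1)]  # hypothetical (input, output) requests
```

On `requests`, the greedy scan keeps only the edge (1, 1), which is maximal because both remaining edges are adjacent to it, whereas the maximum size matching contains the two edges (1, 2) and (2, 1).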

Due to performance requirements, we can associate a weight with each connection/edge. For example, connections with longer waiting times have larger weights. Similar to maximum and maximal matchings, we define a maximum weight matching as one with the largest weight among all weighted matchings of G and a maximal weight matching as one that is not contained in any other weighted matching. Clearly, a maximum/maximal size matching is a special case of a maximum/maximal weight matching in which all edge weights equal 1.

The existing maximum size and maximum weight matching algorithms are too complex both in time complexity and in hardware implementation, and therefore, they are not practical for high speed switches. Researchers have turned their attention to heuristic algorithms that find a maximal size or weight matching quickly. These heuristic algorithms can be classified into three categories: sequential, parallel, and neural algorithms. A comprehensive survey of these algorithms can be found in [60]. In this dissertation, we focus on a special matching problem, named the stable matching problem.

Stable Matching

The stable matching problem (or stable marriage problem) was first introduced by Gale and Shapley (GS) in 1962 [20]. Given n men, n women, and 2n ranking lists in which each person ranks all members of the opposite sex in order of preference, a matching is a set of n pairs of a man and a woman with each man/woman in exactly one pair. A matching is stable if there does not exist one man and one woman who are not matched to each other, but each of whom strictly prefers the other to his/her current partner in the matching; otherwise, the matching is unstable. Gale and Shapley showed that every instance of the stable matching problem admits at least one stable matching, which can be computed in $O(n^2)$ iterations. The paper of Gale and Shapley sparked much interest in many aspects and variants of the classical stable matching problem. For a good survey on this subject, refer to [24].
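As a minimal sketch of the definitions above, the following code implements the classical man-proposing Gale-Shapley procedure together with a stability check. It assumes complete ranking lists and equally many men and women; the function names and the data layout (dictionaries of preference lists, most preferred first) are illustrative choices, not notation from [20].

```python
def gale_shapley(men_prefs, women_prefs):
    """Man-proposing Gale-Shapley; returns a dict mapping each man to his partner."""
    # rank[w][m] is the position of man m in woman w's list (0 = most preferred).
    rank = {w: {m: r for r, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of the next woman to propose to
    partner_of_woman = {}
    free_men = list(men_prefs)
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = partner_of_woman.get(w)
        if current is None:
            partner_of_woman[w] = m        # w was free and accepts
        elif rank[w][m] < rank[w][current]:
            partner_of_woman[w] = m        # w prefers m and drops her partner
            free_men.append(current)
        else:
            free_men.append(m)             # w rejects m
    return {m: w for w, m in partner_of_woman.items()}


def is_stable(matching, men_prefs, women_prefs):
    """Check that no man and woman prefer each other to their current partners."""
    husband = {w: m for m, w in matching.items()}
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == matching[m]:
                break  # the remaining women rank below m's own partner
            if women_prefs[w].index(m) < women_prefs[w].index(husband[w]):
                return False  # m and w form a blocking pair
    return True
```

Each man proposes to each woman at most once, so the loop performs at most $n^2$ proposals, in line with the $O(n^2)$ bound quoted above.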

Recently, solutions to the stable matching problem have been applied to switch scheduling for packet switches. Many scheduling algorithms based on stable matchings have been proposed for both input queued (IQ) switches and combined input and output queued (CIOQ) switches (e.g. [12, 35, 36, 58, 63, 64, 72, 85]). It has been shown that scheduling algorithms based on stable matchings can provide QoS guarantees. In these algorithms, the man set and the woman set consist of all input ports and all output ports, respectively, and the ranking list for each input/output is defined differently according to different performance requirements. For example, in [58], McKeown proposed two scheduling algorithms, GS longest queue first (GS-LQF) and GS oldest cell first (GS-OCF), with ranking lists based on the occupancy of the input queues and the waiting time of the cells at the head of the input queues, respectively. The GS-LQF and GS-OCF algorithms were shown to achieve asymptotically 100% throughput under both uniform and non-uniform traffic for IQ switches.

Figure 1.19 shows an example of how scheduling based on stable matching works, where the ranking lists are defined by the occupancy of the input queues. There are three inputs and three outputs. In each input, cells destined to different outputs are queued in different queues. The man set and woman set consist of the three inputs and the three outputs, respectively. The ranking list for each input is defined by the lengths of its three queues destined to different outputs, and the ranking list for each output is defined by the lengths of the three queues destined to it in different inputs, where the longest queue has the highest ranking, which is 1, and the shortest queue has the lowest ranking, which is 3. The scheduling is based on the stable matching found, which is shown as dotted lines.

[Figure 1.19: a switch with inputs 1 to 3 and outputs 1 to 3, a scheduler, and a switch fabric]

Ranking lists:
input 1: {1,3,2}    output 1: {2,1,3}
input 2: {1,2,3}    output 2: {3,2,1}
input 3: {3,1,2}    output 3: {2,3,1}

Stable matching: (1,3), (2,1), (3,2)

Figure 1.19. Scheduling based on stable matching in a VOQ switch
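The example of Figure 1.19 can be checked mechanically. The Python sketch below assumes that the k-th entry of each ranking list is the rank given to port k (the convention of the ranking lists defined in the next subsection) and that the port labels are as printed in the figure; the function name is hypothetical.

```python
# Rank vectors copied from Figure 1.19 (1 = most preferred).
# input_rank[i][j]: rank that input i+1 gives to output j+1.
input_rank = [[1, 3, 2], [1, 2, 3], [3, 1, 2]]
# output_rank[j][i]: rank that output j+1 gives to input i+1.
output_rank = [[2, 1, 3], [3, 2, 1], [2, 3, 1]]
# Matching from the figure as input -> output, converted to 0-based indices.
matching = {0: 2, 1: 0, 2: 1}


def blocking_pairs(matching, input_rank, output_rank):
    """List (input, output) pairs, 1-based, that prefer each other
    to their partners in the given matching."""
    partner_of_output = {j: i for i, j in matching.items()}
    found = []
    for i, ranks in enumerate(input_rank):
        for j in range(len(ranks)):
            i_prefers = ranks[j] < ranks[matching[i]]
            j_prefers = output_rank[j][i] < output_rank[j][partner_of_output[j]]
            if i_prefers and j_prefers:
                found.append((i + 1, j + 1))
    return found


print(blocking_pairs(matching, input_rank, output_rank))
print(blocking_pairs({0: 0, 1: 1, 2: 2}, input_rank, output_rank))
```

Under these assumptions the matching of the figure has no blocking pair, whereas the identity matching (input k to output k) does, so the latter would not be a valid schedule under this scheme.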


Acyclic Stable Matching

Applications of stable matching in switch scheduling have been proposed. However, the classical GS stable matching algorithm is infeasible for high-speed implementation due to its high complexity. Instead, a special stable matching, called acyclic stable matching, has been shown to be useful in implementing scheduling for high-speed switches/routers.

Let $M = \{m_1, m_2, \cdots, m_n\}$ and $W = \{w_1, w_2, \cdots, w_n\}$ be the sets of n men and n women respectively. Let $mL_i = \{wr_{i,1}, wr_{i,2}, \cdots, wr_{i,n}\}$ and $wL_i = \{mr_{i,1}, mr_{i,2}, \cdots, mr_{i,n}\}$ be the ranking lists for man $m_i$ and woman $w_i$ respectively, where $wr_{i,j}$ (resp. $mr_{i,j}$) is the rank of woman $w_j$ (resp. man $m_j$) by man $m_i$ (resp. woman $w_i$). Let A be a ranking matrix of size $n \times n$, where each entry of A is a pair $a_{i,j} = (wr_{i,j}, mr_{j,i})$.

Given a ranking matrix A, we define the dependency graph as a directed graph $\tilde{G}$ constructed as follows: each entry $a_{i,j}$ of A is represented by a vertex $v_{i,j}$ of $\tilde{G}$; for any two vertices $v_{i,j}$ and $v_{i,k}$, if $a^h_{i,j} < a^h_{i,k}$, then there is an edge from $v_{i,j}$ to $v_{i,k}$; for any two vertices $v_{i,j}$ and $v_{l,j}$, if $a^v_{i,j} < a^v_{l,j}$, then there is an edge from $v_{i,j}$ to $v_{l,j}$. Thus, for any instance of the stable matching problem, there is a corresponding dependency graph. If the dependency graph is acyclic, the solution is called an acyclic stable matching.
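The construction of the dependency graph can be sketched in Python as follows. The text does not define the two compared values at this point, so the sketch assumes that the value compared along a row is the rank man i gives woman j (the first component of the entry) and the value compared along a column is the rank woman j gives man i (the second component). The function name and the two 2 × 2 instances are hypothetical.

```python
from collections import deque


def dependency_graph_is_acyclic(h, v):
    """h[i][j]: value compared along row i; v[i][j]: value compared along column j.

    Assumption (not stated in the text): h[i][j] is the rank man i gives
    woman j, and v[i][j] is the rank woman j gives man i.
    """
    n = len(h)
    succ = {(i, j): [] for i in range(n) for j in range(n)}
    indeg = {node: 0 for node in succ}

    def add_edge(u, w):
        succ[u].append(w)
        indeg[w] += 1

    for i in range(n):
        for j in range(n):
            for k in range(n):
                if h[i][j] < h[i][k]:      # row edge v_{i,j} -> v_{i,k}
                    add_edge((i, j), (i, k))
    for j in range(n):
        for i in range(n):
            for l in range(n):
                if v[i][j] < v[l][j]:      # column edge v_{i,j} -> v_{l,j}
                    add_edge((i, j), (l, j))

    # Kahn's algorithm: the graph is acyclic iff every vertex can be removed.
    queue = deque(node for node, d in indeg.items() if d == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return removed == n * n


# Both men prefer woman 1 and both women prefer man 1: no cycle.
acyclic = dependency_graph_is_acyclic([[1, 2], [1, 2]], [[1, 1], [2, 2]])
# Each man's first choice ranks him last: the four vertices form a cycle.
cyclic = dependency_graph_is_acyclic([[1, 2], [2, 1]], [[2, 1], [1, 2]])
print(acyclic, cyclic)
```

The check uses a topological sort rather than an explicit cycle search, which keeps the sketch short; any standard cycle-detection method would serve equally well.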

Scheduling algorithms based on acyclic stable matching have been proposed for combined input and output queued (CIOQ) switches [12, 63, 72, 85], and it has been shown that, with some speedup, an acyclic stable matching scheduling algorithm can provide QoS guarantees for both unicast and multicast traffic with fixed-length and variable-length packets.


1.3.3 Switch Routing

In a switching network, when more than one input requests to be connected with the same output, output contention occurs. Output contentions can be resolved by switch scheduling. For a set of connection requests without output contention, the process of establishing conflict-free connection paths to satisfy these requests is called switch routing. An algorithm, called a routing algorithm, is needed to find these paths. Once a set of conflict-free paths is found, the connections can be properly set up.

Let I and O be the sets of N inputs, denoted by $I_0, \cdots, I_{N-1}$, and N outputs, denoted by $O_0, \cdots, O_{N-1}$, of an interconnection network respectively. Let $\pi : I \mapsto O$ be an I/O mapping that indicates connections from I to O. If there is a connection from $I_i$ to $O_j$, then we set $\pi(i) = j$ and $\pi^{-1}(j) = i$, and we call $I_i$ ($O_j$) an active input (output). If $j \neq \pi(i)$ for any active $I_i$, we call $O_j$ an idle output. We say that an input (resp. output, link, SE) is active if it is on a connection path, and idle otherwise. An I/O mapping from I to O is one-to-one if each $I_i$ is mapped to at most one $O_j$ and $\pi(i) \neq \pi(j)$ for any $i \neq j$. In this dissertation, all I/O mappings are one-to-one and all connections belong to a one-to-one I/O mapping.

A one-to-one I/O mapping involving $K (\le N)$ active inputs is called a partial permutation, or a non-maximum I/O mapping. A partial permutation with $K = N$ active inputs is also called a permutation, or a maximum I/O mapping. Clearly, a permutation realizes the maximum number of connections that can be established in a single pass in an interconnection network.
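The definitions above translate directly into a small check. In this Python sketch an I/O mapping is represented as a dictionary from active input indices to output indices; this representation and the function names are assumptions made for illustration.

```python
def classify_mapping(pi, n_ports):
    """pi: dict from active input index to output index, all in range(n_ports)."""
    outputs = list(pi.values())
    if len(set(outputs)) != len(outputs):
        return "not one-to-one"        # two inputs request the same output
    if len(pi) == n_ports:
        return "permutation"           # K = N: maximum I/O mapping
    return "partial permutation"       # K < N


def idle_outputs(pi, n_ports):
    """Outputs that no active input is mapped to."""
    return sorted(set(range(n_ports)) - set(pi.values()))


full = {0: 3, 1: 2, 2: 5, 3: 0, 4: 4, 5: 6, 6: 7, 7: 1}
partial = {1: 2, 3: 0, 4: 4, 6: 7}
print(classify_mapping(full, 8), classify_mapping(partial, 8))
print(idle_outputs(partial, 8))
```

A mapping classified as "not one-to-one" contains output contention and would first have to be resolved by switch scheduling before routing applies.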

Since crossbar and Banyan-type networks are self-routing networks, routing in these networks simply follows the self-routing rules discussed in Subsection 1.2.3. In the following, we discuss previous routing work for Benes networks and 3-stage Clos networks.


Mathematical Models for Switch Routing

There are three general mathematical models for designing a routing algorithm for Benes networks and 3-stage Clos networks.

(i) Matrix Decomposition:

We represent a 3-stage Clos network $C(n, m, r)$ as a matrix M in which each row corresponds to one input SM (i.e. an $n \times m$ crossbar in the first stage), each column corresponds to one output SM (i.e. an $m \times n$ crossbar in the third stage), and the entry $(i, j)$ of M indicates the number of connection requests from input SM i to output SM j. The problem is to partition M into m permutation matrices, where each row and each column of a permutation matrix has at most one entry of 1. All requests in a permutation matrix can be routed through one middle SM (i.e. an $r \times r$ crossbar in the second stage).

For the special case of a Benes network $B(N)$, we represent its permutation as a matrix M of size $N/2 \times N/2$, where each row corresponds to one input SE (i.e. a $2 \times 2$ crossbar in the first stage), each column corresponds to one output SE (i.e. a $2 \times 2$ crossbar in the last stage), and the entry $(i, j)$ of M indicates the number of connection requests from input SE i to output SE j. The problem is to partition M into 2 permutation matrices, so that all connections in a permutation matrix can be routed through the same subnetwork $B(N/2)$. Figure 1.20 shows an example of the decomposition of a matrix into two permutation matrices for a permutation $\pi$ of $B(8)$, where

$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 2 & 5 & 0 & 4 & 6 & 7 & 1 \end{pmatrix}$$

After the matrix decomposition in Figure 1.20, we can set the SEs in the first and last stages, and route the sub-permutation

$$\pi_1 = \begin{pmatrix} 1 & 3 & 4 & 6 \\ 2 & 0 & 4 & 7 \end{pmatrix}$$

and the sub-permutation

$$\pi_2 = \begin{pmatrix} 0 & 2 & 5 & 7 \\ 3 & 5 & 6 & 1 \end{pmatrix}$$

in the upper subnetwork $B(4)$ and the lower subnetwork $B(4)$, respectively.
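The decomposition for this permutation of B(8) can be reproduced with a few lines of Python. The sketch assumes the usual numbering in which inputs 2k and 2k+1 attach to input SE k, and outputs 2k and 2k+1 to output SE k; the helper names are hypothetical.

```python
def se_matrix(connections, n_ports):
    """Entry (i, j) counts the connections from input SE i to output SE j."""
    size = n_ports // 2
    m = [[0] * size for _ in range(size)]
    for inp, out in connections.items():
        m[inp // 2][out // 2] += 1
    return m


def is_permutation_matrix(m):
    """At most one 1 in every row and column (entries are non-negative counts)."""
    rows_ok = all(sum(row) <= 1 for row in m)
    cols_ok = all(sum(col) <= 1 for col in zip(*m))
    return rows_ok and cols_ok


pi = {0: 3, 1: 2, 2: 5, 3: 0, 4: 4, 5: 6, 6: 7, 7: 1}
pi1 = {i: pi[i] for i in (1, 3, 4, 6)}   # sub-permutation for the upper B(4)
pi2 = {i: pi[i] for i in (0, 2, 5, 7)}   # sub-permutation for the lower B(4)
M, M1, M2 = (se_matrix(p, 8) for p in (pi, pi1, pi2))

for row in M:
    print(row)
print(is_permutation_matrix(M1), is_permutation_matrix(M2))
print(all(M[i][j] == M1[i][j] + M2[i][j] for i in range(4) for j in range(4)))
```

With this numbering, the entry 2 in row 0 of M reflects that both inputs of input SE 0 are destined to output SE 1, so the two connections must be split between the two subnetworks.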

[Figure 1.20: the matrix M for the permutation $\pi$ and the two permutation matrices $M_1$ and $M_2$]

Figure 1.20. Matrix decomposition: $M = \sum_{i=1}^{2} M_i$

The other two models are related to graph theory. We first introduce some definitions and notations. Let G be a graph, $V(G)$ be the set of vertices and $E(G)$ be the set of edges of G. We use $|V(G)|$ and $|E(G)|$ to denote the total number of vertices and edges in $V(G)$ and $E(G)$ respectively. A graph G is called a bipartite graph if $V(G)$ can be partitioned into two parts so that no two vertices in the same part are adjacent to each other. The degree of a vertex is the total number of edges incident to the vertex, and the degree of a graph G, denoted by $\Delta(G)$, is the maximum vertex degree of G. If each vertex in G has the same degree d, then G is called a d-regular graph. As defined earlier, a set of independent edges in a graph is a matching. If the edges in a matching cover all vertices of G, this matching is called a perfect matching of G. If all edges of a graph G can be colored with c different colors so that edges incident to the same vertex have different colors, G is called c-edge colorable and this coloring is called a c-edge coloring of G.

We can represent a Clos network $C(n, m, r)$ with a permutation $\pi$ by a graph G, where $V(G) = \{$input SMs and output SMs$\}$ and $E(G) = \{$connections between input SMs and output SMs$\}$. It is clear that G is a bipartite graph with all input SMs as one part and all output SMs as the other part, and $\Delta(G) \le n$ since each input SM has n inputs and each output SM has n outputs. G may have more than one edge between two vertices; however, there is a one-to-one correspondence between a connection in $\pi$ and an edge in G. Thus, we can label each edge in the graph by the input of its corresponding connection.

(ii) Edge Coloring:

König's theorem [8] states that every bipartite graph G is $\Delta(G)$-edge colorable. Thus, we can color the edges of G using $\Delta(G)$ colors so that the edges incident to the same vertex have distinct colors. Edges of the same color can be routed through the same middle SM of $C(n, m, r)$, and edges of different colors through different middle SMs.

(iii) Matching: We can also set up a Clos network by repeatedly finding a matching and removing it from G until all edges in G have been considered, and letting the edges in the same matching be routed through the same SM in the middle stage.

For a Benes network, the corresponding graph G is a bipartite graph with degree at most 2. Hence, G is 2-edge colorable and its edges can be partitioned into at most 2 perfect matchings [8]. Figure 1.21 shows the graph model G representing the permutation $\pi$ of $B(8)$. In Figure 1.21 (a), we color G with two different colors, one denoted by solid lines and the other by dashed lines. In Figure 1.21 (b), two perfect matchings, $M_1$ and $M_2$, of G are found. Clearly, the edges with the same color (or the edges belonging to the same matching) correspond to the connections in the same sub-permutation $\pi_1$ or $\pi_2$.

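[Figure 1.21: bipartite graph G whose vertices are the input SEs and output SEs with ports {0,1}, {2,3}, {4,5}, {6,7}; (a) G with its edges colored with two colors; (b) the perfect matchings $M_1$ and $M_2$]

Figure 1.21. Graph representation: (a) edge coloring; (b) matching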


Since the edges of the same color are independent, the edges with the same color form a matching. Also, all entries in each permutation matrix form a matching of G by the definition of a permutation matrix. Therefore, matrix decomposition, edge coloring and matching are essentially equivalent, but each has its own implementation techniques [31].
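For the Benes case, the 2-edge coloring can be obtained by walking around each cycle of G and alternating colors, which is the idea behind the looping approach. The Python sketch below is a sequential illustration for full permutations only, assuming that ports 2k and 2k+1 share an SE; the function name is hypothetical and the code is not the algorithm of any cited reference.

```python
def loop_two_coloring(pi):
    """Split a full permutation pi (pi[i] = output of input i) into two colors.

    Inputs 2k and 2k+1 share an input SE, outputs 2k and 2k+1 share an
    output SE; the two connections through any SE get different colors.
    """
    n = len(pi)
    inv = [0] * n
    for i, o in enumerate(pi):
        inv[o] = i
    color = [None] * n          # color[i] in {0, 1} for the connection from input i
    for start in range(n):
        i = start
        while color[i] is None:
            color[i] = 0
            # The connection sharing i's output SE must take the other color.
            j = inv[pi[i] ^ 1]
            color[j] = 1
            # The connection sharing j's input SE goes back to color 0.
            i = j ^ 1
    return color


pi = [3, 2, 5, 0, 4, 6, 7, 1]
color = loop_two_coloring(pi)
print([i for i in range(8) if color[i] == 0])
print([i for i in range(8) if color[i] == 1])
```

For the permutation used in the text, the two color classes are the input sets {0, 2, 5, 7} and {1, 3, 4, 6}, which agree with the sub-permutations given earlier. Which class is sent to the upper subnetwork is a free choice for each cycle.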

Routing Algorithms

The routing algorithms for a Benes network $B(N)$ can be obtained by running a routing algorithm for the 3-stage Clos network $\log N - 1$ times, since a Benes network is recursively constructed by replacing the middle-stage SMs with 3-stage Clos networks. Therefore, its time complexity is simply the time complexity for the 3-stage Clos network multiplied by a factor of $\log N$. We first introduce the routing algorithms based on the above three models.

Routing algorithms based on matrix decomposition techniques have been developed for 3-stage Clos networks $C(n, m, r)$ [9, 34, 78]. It is known that matrix decomposition does not always work [9]. However, it works for Benes networks. Waksman [93] proposed a routing algorithm for $n = 2$, which was elaborated by Opferman and Tsao-Wu [66] and named the looping algorithm. The time complexity of the looping algorithm is $O(N \log N)$ for sequential control and $O(N)$ for parallel control.

Sequential matching algorithms for bipartite graphs are available in the graph theory literature. One, due to Hopcroft and Karp [28], runs in $O(|V|^{2.5})$ time. Another, by Gabow [19], runs in $O(|V|^{0.5}(|E| + |V|))$ time. A third algorithm, due to Cole and Hopcroft, finds matchings in $O(|E| \log |V|)$ steps.

Based on these matching algorithms, sequential routing algorithms for Clos networks $C(n, m, r)$ can be derived. The application of the Hopcroft and Karp matching algorithm to $C(n, m, r)$ leads to an $O(mr^{2.5})$-time routing procedure, the application of Gabow's algorithm results in an $O(mr^{0.5}(N + r))$-time routing procedure, and the application of Cole and Hopcroft's algorithm leads to an $O(mN \log r)$-time routing procedure.

There are two primary methods used in edge coloring: König's method of alternating paths and Euler partitions. Gabow and Kariv [19] formalized König's proof into a procedure that performs a coloring in $O(|V| \cdot |E|)$ time. This algorithm leads to an $O(Nr)$-time routing procedure. The edge-coloring algorithm based on Euler partitions takes $O(|V|^{0.5}|E| \log \Delta)$ time. For $C(n, m, r)$, this leads to an $O(r^{0.5} N \log m)$-time algorithm, assuming that N is a power of 2.

A detailed survey of the routing algorithms for Clos networks based on matching and edge coloring can be found in [10]. Table 1.3 lists routing algorithms for 3-stage Clos networks and Benes networks based on these three approaches, with their time complexities in sequential and parallel implementations. These time complexities are based on PRAM models.

The primary routing algorithms for setting up the Benes network, besides the algorithms derived from the Clos network, include the parallel algorithm of [62], the self-routing algorithms [61, 77] for some permutations, and the non-recursive algorithm of [45].

Since a Benes network $B(N)$ has $O(N \log N)$ SEs, one cannot set up the network in less than $O(N \log N)$ time using a single processor. Parallel routing algorithms are an alternative. Nassimi and Sahni [62] gave a parallel routing algorithm for Benes networks and analyzed its complexity on a completely connected multiprocessor system and on various non-completely connected multiprocessor systems such as the mesh-connected computer, the perfect shuffle computer, and the cube-connected computer. The idea of this algorithm is to implement the looping algorithm in parallel. As shown in Table 1.3, the time complexity of the parallel routing algorithms in [49] and [62] for $B(N)$ is $O(\log^2 N)$, which is the best time complexity among the known routing algorithms for Benes networks.


Algorithm                  Network               Sequential Time       Parallel Time
Matrix Decomposition [66]  3-stage Clos network  x                     x
                           Benes                 O(N log N)            O(N)
Matching [28]              3-stage Clos network  O(m r^2.5)            x
                           Benes                 O(N^2.5)              x
Matching                   3-stage Clos network  O(m r^0.5 (N + r))    x
Matching [14]              3-stage Clos network  O(mN log r)           O(mN)
                           Benes                 O(N log N)            O(N)
Edge Coloring [18]         3-stage Clos network  O(N log m)            O(N)
                           Benes                 O(N)                  x
Edge Coloring [19]         3-stage Clos network  O(Nr)                 x
                           Benes                 O(N^2)                x
Edge Coloring              3-stage Clos network  O(r^0.5 N log m)      O(r^0.5 N)
Edge Coloring [49]         3-stage Clos network  O(N log m)            O(log^2 N)
                           Benes                 O(N)                  O(log^2 N)
Parallel Looping [62]      3-stage Clos network  x                     x
                           Benes                 x                     O(log^2 N)

Table 1.3. Routing algorithms for 3-stage Clos networks and Benes networks


To reduce the time complexity of routing in Benes networks, another alternative is self-routing, in which the setting of every switch in the interconnection network is controlled by the routing tag bits attached to the input packets. Although the Benes network is not a self-routing network, Lenfant [48] first showed that the Benes network can self-route all five families of frequently used permutations, each with a different routing algorithm. Nassimi and Sahni [61] presented a unified self-routing algorithm for one class of permutations that contains the five families of permutations Lenfant considered. In [77], Raghavendra and Boppana gave a self-routing algorithm, different from that of Nassimi and Sahni, to route linear permutations in Benes networks. The time complexity of a self-routing algorithm is $O(\log N)$ for an $N \times N$ Benes network since it contains $2 \log N - 1$ stages.

Although the self-routing algorithms are faster than the known parallel algorithms, they cannot route all permutations. K.Y. Lee [45] presented a new Benes network routing algorithm which sets half of the Benes network by self-routing and realizes all permutations. This algorithm does not view the Benes network as a recursive network, but rather as a concatenation of two Banyan networks, SN1 and SN2, where SN1 corresponds to the first $(\log N - 1)$ stages, and SN2 corresponds to the remaining $\log N$ stages of the Benes network. The basic idea of this non-recursive algorithm is as follows: set up the SEs in SN1 by a full binary tree using set partitioning functions, and set up the SEs in SN2 by bit control. Thus, this routing algorithm sets SEs one stage at a time, starting from the leftmost stage and heading for the rightmost stage. Although the time complexity of this non-recursive algorithm is the same as that of the looping algorithm, it has two advantages over the looping algorithm. First, because it sets switches stage by stage, one stage at a time, this algorithm eliminates the information exchange among different stages, which makes the pipelining of switch setting feasible for Multiple Instruction Multiple Data (MIMD) environments. Second, SN2 is bit controlled in real time, and the bottleneck remains only within SN1.

1.3.4 Crosstalk-Free Routing

In order to reduce the internal blocking effect in optical switching networks, three approaches, space dilation, time dilation and wavelength dilation, have been proposed. The idea is to ensure that the connections in the same SE have different wavelengths. In this dissertation, different wavelengths refer to wavelengths with enough wavelength spacing so that no crosstalk is generated when such wavelengths pass through the same SE/link, and crosstalk refers to the first-order SE crosstalk [99]. In space and time dilation, node conflicts are eliminated by ensuring that at most one connection passes through an SE in an OMIN. More specifically, in space dilation node conflicts are avoided by increasing the number of SEs in an OMIN (e.g. [41, 42, 68, 86, 91, 92, 94]), while in time dilation a set of conflicting connections is partitioned into subsets so that the connections in each subset can be established simultaneously without conflicts (e.g. [69, 73, 74, 83, 99]). Clearly, space dilation trades hardware cost while time dilation trades time. In wavelength dilation, the crosstalk between two signals passing through the same SE is suppressed by routing so that the two wavelengths are different (e.g. [81, 82]), or by using wavelength converters (e.g. [22, 75]).

1.4 Motivations and Contributions of Dissertation

Nonblocking networks are favored for switching whenever possible. The crosstalk-free requirement in photonic networks adds a new dimension of constraints for nonblockingness. Switching algorithms, including routing for establishing connections between inputs and outputs and scheduling for resolving packet contentions, play a fundamental role in the performance of switching networks. Any algorithm that requires more than linear time would be considered too slow for real-time applications. One remedy is to use multiple processors to establish connections in parallel, and the other is to construct low-cost, high-speed, large-capacity nonblocking switching architectures.

The contributions of this dissertation mainly include developing parallel algorithms for routing and scheduling and proposing cost-effective high-speed switching architectures, as shown in Figure 1.22. We tackle the challenging switching problems by using combinatorics and graph theory approaches, applying parallel processing and computing techniques, and adopting implementation and experimental evaluations.

[Figure 1.22: diagram relating the topics Switching, Architecture, Routing and Scheduling to the chapters.
Architecture: Parallel Routing Algorithms for Group Connectors (Chapter 7).
Routing: Parallel Routing Algorithms for Nonblocking Electronic and Photonic Switching Networks (Chapter 4); Parallel Crosstalk-Free Routing for Optical Benes Networks (Chapter 5); Parallel Routing and Wavelength Assignment for Optical Interconnection Networks (Chapter 6).
Scheduling: A Parallel Iterative Improvement Stable Matching Algorithm (Chapter 2); Design and Implementation of an Acyclic Stable Matching Scheduler (Chapter 3).]

Figure 1.22. Main work of the dissertation

1.5 Outline of Dissertation

This dissertation is organized as follows.

In Chapter 2, we propose a new approach, parallel iterative improvement (PII), to solving the stable matching problem. This approach treats the stable matching problem as an optimization problem with all possible matchings forming its solution space. Since a stable matching always exists for any stable matching problem instance, finding a stable matching is equivalent to finding a matching with the minimum number (which is always zero) of unstable pairs. A particular PII algorithm is presented to show the effectiveness of this approach; it constructs a new matching from an existing matching and uses techniques such as randomization and greedy selection to speed up the convergence process. Simulation results show that the PII algorithm has better average performance than the classical stable matching algorithms and converges in a linear number of iterations with high probability. We also discuss implementations on the hypercube, the mesh of trees, and the array with multiple broadcasting buses.

In Chapter 3, we model the acyclic stable matching problem as the dominating set problem for a rooted dependency graph, and then propose a parallel algorithm for finding the dominating set. One advantage of scheduling algorithms based on our acyclic stable matching is their low time complexity. For any instance of the acyclic stable matching problem, our acyclic stable matching algorithm can find a stable matching in $O(N \log N)$ time, while the classical stable matching algorithm needs $O(N^2)$ time. Another advantage is its feasibility for high-speed implementation. We design and implement a scheduler based on our acyclic stable matching algorithm in hardware. Simulation results show that the number of 2-input NAND gates and the timing of our design are proportional to $N^2$ and N respectively, making it feasible to implement at high speed with current CMOS technologies.

In Chapter 4, we study a class of multistage nonblocking switching networks $B(N, x, p, a)$, which is constructed by horizontally concatenating $x (\le \log N - 1)$ extra stages to an $N \times N$ Banyan-type network and vertically stacking p copies of the extended Banyan, with the crosstalk-free constraint ($a = 1$) or without the crosstalk-free constraint ($a = 0$). This class of networks contains the Banyan network, the Benes network, and the Cantor network as special cases. By modeling the routing problems for this class of networks as weak and strong edge colorings of bipartite graphs, we develop fast parallel routing algorithms that can route an arbitrary partial permutation with $K (\le N)$ connections in a rearrangeable nonblocking network $B(N, x, p, a)$ in $O((x + \log p) \log K + \log N)$ time and in a strictly nonblocking network $B(N, 0, p^*, a)$ in $O(\log p^* \log K + p^* \log p^*)$ time.

In Chapter 5, we model the permutation decomposition problem as an edge coloring problem on a bipartite graph, and simplify the existing proof of the decomposability of a permutation into two crosstalk-free (CF) partial permutations. By applying parallel processing techniques, we develop a fast parallel decomposition algorithm that decomposes a permutation into two CF partial permutations using a linear number of processors, which improves the time complexity of the existing permutation decomposition algorithms from linear time to logarithmic time. Using equitable coloring techniques, we further improve the time complexity for establishing a set of K connections in a time-dilated optical Benes network from $O(N \log N)$ to $O(\log^2 K + \log N)$.

In Chapter 6, we extend the concept of nonblocking in space division switching to wavelength division switching. We model the wavelength routing problem as a vertex coloring problem and develop fast parallel routing algorithms for realizing an arbitrary permutation in wavelength-rearrangeable space-strict-sense Banyan networks and wavelength-rearrangeable space-rearrangeable Benes networks in $O(\log^2 N)$ time and $O(\log^3 N)$ time respectively, and discuss implementations of both algorithms on a hypercube.

In Chapter 7, we consider Benes group connectors and Clos group connectors, which are based on Benes networks and 3-stage Clos networks. We develop fast parallel routing algorithms for both group connectors to realize connections from N inputs to n output groups. Benes group connectors and Clos group connectors have lower hardware cost than the corresponding Benes networks and Clos networks, respectively. We show that, with the proposed routing algorithms, the hardware of Benes group connectors can be reduced further.

In Chapter 8, we conclude our research work and briefly discuss future work.

CHAPTER 2

A PARALLEL ITERATIVE IMPROVEMENT STABLE

MATCHING ALGORITHM

2.1 Introduction

Recently, the application of stable matching to switch scheduling has been proposed in the literature. To design efficient switch scheduling, we must improve the time complexity of stable matching algorithms. One possible solution is to use parallel processing.

For real-time applications, the algorithm proposed by Gale and Shapley (simply the GS algorithm), with time complexity $O(n^2 \log n)$ using $O(n)$ processors, is not fast enough. Attempts to find parallel stable matching algorithms with low complexity have been made by many researchers (e.g. [1, 23, 26, 29, 59, 87, 90]). To date, the best known parallel algorithm for the stable matching problem takes $O(\sqrt{n} \cdot \log^3 n)$ time [17]. This algorithm runs on a CRCW PRAM (concurrent-read concurrent-write parallel random access machine) with $n^4$ processors, which makes it infeasible for applications in packet switching networks.

The parallelizability of the stable matching problem is far from being fully understood. It is widely believed that this problem is not in NC. The parallel version of the stable matching algorithm by Gale and Shapley needs $O(n \log n)$ iterations on average [24]. It has been suggested that parallel stable matching algorithms cannot be expected to provide high speedup on average [37, 76]. Thus, designing efficient parallel algorithms that perform well in most cases is a challenging endeavor.

In this chapter, we propose a new approach, parallel iterative improvement (PII), to solving the stable matching problem. Since a stable matching always exists for any instance of the stable matching problem, finding a stable matching is equivalent to finding a feasible matching with the minimum number (which is always zero) of unstable pairs. The PII algorithm consists of two alternating phases, the Initiation Phase and the Iteration Phase. An Initiation Phase is a procedure that randomly generates a matching. An Iteration Phase consists of multiple improvement iterations. We try to speed up the convergence process by exploiting parallelism in identifying a subset of unmatched pairs to replace matched pairs in an existing matching so that the number of unstable pairs in the newly obtained matching can be reduced. Due to the greedy selection of new matching pairs, the PII algorithm may not converge in one Iteration Phase. However, we observed that the PII algorithm tends to find a stable matching in $o(n)$ iterations on average, and in n iterations with high probability. We show that an Initiation Phase and an iteration of an Iteration Phase take $O(\log n)$ time on both a completely connected multiprocessor system and an array with multiple broadcasting buses, and $O(\log^2 n)$ time on both a hypercube and a mesh of trees (MOT), all assumed to have $n^2$ processing elements (PEs). Simulations show that the PII algorithm has better average performance than the classical stable matching algorithms and converges in n iterations with high probability. For real-time applications with hard time constraints, the proposed algorithm can be terminated at any time during its execution, and the matching with the minimum number of unstable pairs found so far can be used as an approximation of a stable matching.

The rest of the chapter is organized as follows. In Section 2.2, we study the properties of stable matching. In Section 2.3, we present our PII algorithm. We show the implementations of the PII algorithm on parallel computing machine models in Section 2.4. Section 2.5 compares simulation results. Section 2.6 summarizes the chapter.


2.2 Definitions and Properties

For a ranking matrix $A = \{a_{i,j}\}$, we call $wr_{i,j}$ (resp. $mr_{j,i}$) the left value (resp. right value) of $a_{i,j}$, and denote it by $a^L_{i,j}$ (resp. $a^R_{i,j}$). For convenience, we use $(a^x_{i,j}, a^y_{i,j})$ to denote the indices $(i, j)$ of pair $a_{i,j}$. Clearly, the ordered list of left values of all pairs in row i of A is the man ranking list $mL_i$, and the ordered list of right values of all pairs in column j is the woman ranking list $wL_j$. Example 1 shows the ranking matrix obtained from the given ranking lists.

Example 1 An instance of the stable matching problem:

Man ranking lists:        Woman ranking lists:      Ranking matrix:
$mL_1$: {4, 2, 3, 1},     $wL_1$: {1, 4, 2, 3},     (4,1)  (2,1)  (3,4)  (1,3)
$mL_2$: {3, 1, 2, 4},     $wL_2$: {1, 2, 3, 4},     (3,4)  (1,2)  (2,2)  (4,1)
$mL_3$: {2, 4, 1, 3},     $wL_3$: {4, 2, 3, 1},     (2,2)  (4,3)  (1,3)  (3,4)
$mL_4$: {1, 4, 3, 2}.     $wL_4$: {3, 1, 4, 2}.     (1,3)  (4,4)  (3,1)  (2,2)

□
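The ranking matrix of Example 1 follows mechanically from the definition $a_{i,j} = (wr_{i,j}, mr_{j,i})$. The short Python sketch below rebuilds it from the two sets of ranking lists; the function and variable names are chosen for this illustration only, and indices are 0-based in the code.

```python
def ranking_matrix(man_lists, woman_lists):
    """a[i][j] = (rank of woman j by man i, rank of man i by woman j)."""
    n = len(man_lists)
    return [[(man_lists[i][j], woman_lists[j][i]) for j in range(n)]
            for i in range(n)]


# Ranking lists of Example 1 (entry k is the rank given to person k+1).
mL = [[4, 2, 3, 1], [3, 1, 2, 4], [2, 4, 1, 3], [1, 4, 3, 2]]
wL = [[1, 4, 2, 3], [1, 2, 3, 4], [4, 2, 3, 1], [3, 1, 4, 2]]
A = ranking_matrix(mL, wL)
for row in A:
    print(row)
```

Note the transposed index in the second component: the right value of the entry in row i and column j is read from woman j's list at position i.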

A pair $a_{i,j}$ in A corresponds to a man-woman pair $(m_i, w_j)$. A matching, denoted by $\mathcal{M}$, corresponds to n pairs of A with no two pairs in the same row/column. If a pair of A is in $\mathcal{M}$, it is called a matching pair of $\mathcal{M}$, and a non-matching pair otherwise. For any matching $\mathcal{M}$ of ranking matrix A, we define the marked ranking matrix, $A_{\mathcal{M}}$, as the ranking matrix with all matching pairs marked. Thus for any matching $\mathcal{M}$, each row i (resp. column j) of $A_{\mathcal{M}}$ has exactly one matching pair, which is denoted by $\mathcal{M}(R_i)$ (resp. $\mathcal{M}(C_j)$). A pair $a_{i,j}$ is an unstable pair if $a^L_{i,j} < \mathcal{M}(R_i)^L$ and $a^R_{i,j} < \mathcal{M}(C_j)^R$. By the definition of stable matching, we have:

Property 1 A matching $\mathcal{M}$ is stable if and only if there is no unstable pair in $A_{\mathcal{M}}$.
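Property 1 gives a direct stability test on the marked ranking matrix. The Python sketch below applies it to the ranking matrix of Example 1; representing a matching as a list whose i-th entry is the column matched to row i (0-based) is an assumption of this illustration, as is the function name.

```python
# Ranking matrix of Example 1: A[i][j] = (left value, right value).
A = [
    [(4, 1), (2, 1), (3, 4), (1, 3)],
    [(3, 4), (1, 2), (2, 2), (4, 1)],
    [(2, 2), (4, 3), (1, 3), (3, 4)],
    [(1, 3), (4, 4), (3, 1), (2, 2)],
]


def unstable_pairs(a, matching):
    """matching[i] = column of the matching pair in row i (0-based)."""
    n = len(a)
    row_of = {j: i for i, j in enumerate(matching)}
    found = []
    for i in range(n):
        for j in range(n):
            # Left value smaller than that of the matching pair in row i.
            left_ok = a[i][j][0] < a[i][matching[i]][0]
            # Right value smaller than that of the matching pair in column j.
            right_ok = a[i][j][1] < a[row_of[j]][j][1]
            if left_ok and right_ok:
                found.append((i, j))
    return found


print(unstable_pairs(A, [0, 1, 2, 3]))   # the diagonal matching
print(unstable_pairs(A, [3, 1, 2, 0]))   # every man gets his first choice
```

For this instance the diagonal matching has exactly one unstable pair, while the matching in which every man is paired with his first choice has none and is therefore stable by Property 1.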

With respect to $A_{\mathcal{M}}$, we define a set $NM_1$ of type-1 new matching pairs (simply nm1-pairs) as follows. If there is no unstable pair in $A_{\mathcal{M}}$, $NM_1 = \emptyset$. Otherwise, for each row with at least one unstable pair, select the one with the minimum left value among all unstable pairs in this row as an nm1-generating pair; for each column with at least one nm1-generating pair, select the one with the minimum right value as an nm1-pair.

Based on $NM_1$, we define a set $NM_2$ of type-2 new matching pairs (simply nm2-pairs) by a procedure that first identifies nm2-generating pairs and then identifies nm2-pairs using an nm2-generating graph.

For any nm1-pair $a_{i,j}$ in $A_{\mathcal{M}}$, the pair $a_{l,k}$ with $l = \mathcal{M}(C_j)^x$ and $k = \mathcal{M}(R_i)^y$ is called the nm2-generating pair corresponding to $a_{i,j}$. We say that nm1-pair $a_{i,j}$ and its corresponding nm2-generating pair $a_{l,k}$ are associated with matching pairs $a_{i,k}$ and $a_{l,j}$. We define an nm2-generating graph $G_{\mathcal{M}}$ as follows: $V(G_{\mathcal{M}}) = \{$nm2-generating pairs$\}$, and $E(G_{\mathcal{M}}) = \{e = (u, v) \mid$ two nm2-generating pairs u and v are associated with a common matching pair$\}$. Since each nm2-generating pair is associated with two matching pairs, we have:

Property 2 Given any $A_{\mathcal{M}}$, the degree of the nm2-generating graph $G_{\mathcal{M}}$ is at most 2.

By Property 2, each connected component of $G_{\mathcal{M}}$ is a cycle or a chain, called an nm2-generating cycle or an nm2-generating chain (an isolated node is a chain of length 0). If a node in $G_{\mathcal{M}}$ has degree 2, it is called an internal node; otherwise, it is called an end node. Clearly, if an nm2-generating pair $a_{i,j}$ is an internal node in $G_{\mathcal{M}}$, there are two nm1-pairs, one in row i and the other in column j; if an nm2-generating pair $a_{i,j}$ is an end node in $G_{\mathcal{M}}$, there is at most one nm1-pair in row i or column j. We call an end node $a_{i,j}$ a row end (resp. column end) of an nm2-generating chain if there is no nm1-pair in row i (resp. column j) of $A_{\mathcal{M}}$. An isolated node is both a row end and a column end.

Using the nm2-generating graph, we can generate the set $NM_2$ of nm2-pairs as follows. For each nm2-generating chain with row end $a_{i_1,j_1}$ and column end $a_{i_2,j_2}$, we generate an nm2-pair $a_{i_1,j_2}$. No nm2-pair is generated from any nm2-generating cycle. Hence, there is a one-to-one correspondence between nm2-generating chains and nm2-pairs. Let $NM = NM_1 \cup NM_2$. We call $NM$ the set of new matching pairs (simply nm-pairs). Based on the way that $NM$ is generated, we know that $NM_1$ and $NM_2$ are disjoint, and each row/column of $A_M$ contains at most one nm-pair.

A matching pair $a_{i,j}$ in $A_M$ is called a replaced matching pair (simply rm-pair) if it is in the same row/column as an nm-pair. We denote the set of rm-pairs by $RM$. Based on the way that $RM$ is constructed, we have:

Lemma 1 If there is at least one unstable pair in $A_M$, then $M' = (M - RM) \cup NM$ is a matching different from $M$.
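
The whole construction of $M'$ from $M$ can be sketched as a sequential Python function. This is a sketch under stated assumptions rather than the parallel procedure of the text: `pii_step`, `A` and `match` are illustrative names, indices are 0-based, `A[i][j]` is the (left, right) pair of $a_{i,j}$, `match[i]` is the matched column of row `i`, and chains are walked one node at a time instead of by pointer jumping. The matrix is the one of Example 2.

```python
def pii_step(A, match):
    """Return M' = (M - RM) + NM as a list, or None if `match` is already stable."""
    n = len(A)
    row_of = [0] * n                      # row_of[j] = row matched to column j
    for i, j in enumerate(match):
        row_of[j] = i

    def unstable(i, j):
        return (A[i][j][0] < A[i][match[i]][0]
                and A[i][j][1] < A[row_of[j]][j][1])

    # nm1-generating pairs: per row, the unstable pair with minimum left value.
    gen = {}
    for i in range(n):
        cols = [j for j in range(n) if unstable(i, j)]
        if cols:
            gen[i] = min(cols, key=lambda j: A[i][j][0])
    if not gen:
        return None
    # nm1-pairs: per column, the nm1-generating pair with minimum right value.
    nm1_row = {}                          # column -> row of its nm1-pair
    for i, j in gen.items():
        if j not in nm1_row or A[i][j][1] < A[nm1_row[j]][j][1]:
            nm1_row[j] = i
    nm1_col = {i: j for j, i in nm1_row.items()}   # row -> column of its nm1-pair

    # The nm2-generating pair of nm1-pair (i, j) sits at (row_of[j], match[i]).
    gen2_col = {row_of[j]: match[i] for i, j in nm1_col.items()}  # row -> column

    new = list(match)
    for i, j in nm1_col.items():          # nm1-pairs replace the old pairs in their rows
        new[i] = j
    for l, k in gen2_col.items():
        if l in nm1_col:                  # an nm1-pair in row l: not a row end
            continue
        # Follow c-pointers from the row end until the column end is reached.
        while k in nm1_row:
            k = gen2_col[row_of[k]]
        new[l] = k                        # nm2-pair (row of row end, column of column end)
    return new


A = [
    [(4, 1), (2, 1), (3, 4), (1, 3)],
    [(3, 4), (1, 2), (2, 2), (4, 1)],
    [(2, 2), (4, 3), (1, 3), (3, 4)],
    [(1, 3), (4, 4), (3, 1), (2, 2)],
]
print(pii_step(A, [2, 1, 3, 0]))  # [3, 1, 2, 0]
```

In this run the two nm2-generating pairs form a cycle, so no nm2-pair is produced and the two nm1-pairs alone turn the matching into a stable one.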

2.3 Parallel Iterative Improvement Matching Algorithm

In this section, we present our main result, a parallel iterative improvement algorithm (PII algorithm) for a completely connected multiprocessor system, which consists of a set of PEs connected in such a way that there is a direct connection between every pair of PEs. We assume that each PE can communicate with at most one adjacent PE during every communication step. The PII algorithm uses $n^2$ PEs. To facilitate our discussion, these $n^2$ PEs are placed as an $n \times n$ array. As input, $PE_{i,j}$ ($1 \le i, j \le n$) contains $a_{i,j}$ of ranking matrix $A$. When the algorithm terminates, a stable matching has been found, with each $PE_{i,j}$ indicating whether pair $(m_i, w_j)$ is in the matching. The key idea of the PII algorithm is to construct a new matching $M'$ from an existing matching $M$ in the hope that $M'$ is "closer" to a stable matching than $M$.

2.3.1 Constructing an Initial Matching

Randomly generating an initial matching can be reduced to generating a random permutation, which can be done by a sequential algorithm proposed in [16]. We present a parallel implementation of this algorithm.

Let each PE maintain a pointer. Initially, every $PE_{i,j}$ ($1 \le i \le n-1$) sets its pointer to point to $PE_{i+1,j}$, and as a result, there are $n$ disjoint lists. Then, each $PE_{i,i}$ randomly chooses a $j$ ($i \le j \le n$) and swaps pointers with $PE_{i,j}$, i.e. $PE_{i,i}$ points to $PE_{i+1,j}$ and $PE_{i,j}$ points to $PE_{i+1,i}$. Consequently, $n$ new disjoint lists originating from the PEs $PE_{1,j}$ are formed. After performing $\log n$ rounds of pointer jumping [34], each $PE_{1,j}$ finds the other end $PE_{n,p(1,j)}$ of its list, where $p(1,j)$ is the column position of the PE pointed to by $PE_{1,j}$. Hence, a matching $\{(j, p(1,j)) \mid 1 \le j \le n\}$ is formed. Figure 2.1 shows an example of generating a matching of size 4, where the matching obtained from (b) consists of the pairs $(1,4)$, $(2,3)$, $(3,1)$, $(4,2)$. Clearly, this parallel implementation takes $O(\log n)$ time since each list has length $n$.
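
The pointer setup, the random swaps and the pointer jumping can be simulated sequentially as follows. This is a sketch under assumptions: `random_matching` and `nxt` are illustrative names, indices are 0-based, PEs in the last row point to themselves so that jumping saturates at the list end, and each synchronous jumping round is modeled by rebuilding the pointer table from the previous one.

```python
import random


def random_matching(n, rng=random):
    """Return p with p[j] = column of the list end reached from PE(0, j)."""
    # nxt[(i, j)] is the PE that PE(i, j) points to.
    nxt = {}
    for i in range(n):
        for j in range(n):
            nxt[(i, j)] = (i + 1, j) if i < n - 1 else (i, j)
    # Randomization: PE(i, i) swaps pointers with PE(i, j) for a random j >= i.
    for i in range(n - 1):
        j = rng.randrange(i, n)
        nxt[(i, i)], nxt[(i, j)] = nxt[(i, j)], nxt[(i, i)]
    # Pointer jumping: ceil(log2 n) doubling rounds cover lists of n nodes.
    for _ in range(max(1, (n - 1).bit_length())):
        nxt = {v: nxt[nxt[v]] for v in nxt}   # all PEs jump simultaneously
    return [nxt[(0, j)][1] for j in range(n)]
```

Every row maps columns to columns by the identity with at most one transposition, so the composed map from the first row to the last is a permutation, which is why the returned list is always a valid matching.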

[Figure 2.1: two $4 \times 4$ arrays of PEs, $PE_{1,1}$ through $PE_{4,4}$, shown as panels (a) and (b).]

Figure 2.1. Parallel random matching generation: (a) initial lists; (b) lists obtained after randomization.

2.3.2 Construct a New Matching from an Existing Matching

A basic operation of the PII algorithm is to construct a new matching $M' = (M - RM) \cup NM$ from an existing matching $M$ if $M$ is unstable. In the following, we describe six steps to carry out this operation.

Step 1: Recognize unstable pairs. Every PE with a matching pair in $M$ broadcasts its column position together with its left value to the other PEs in the same row; every PE with a matching pair in $M$ broadcasts its row position together with its right value to the other PEs in the same column. Let $u_{i,j}$ be a Boolean variable indicating whether pair $a_{i,j}$ is unstable. If both values of $PE_{i,j}$ are smaller than the corresponding values it receives, set $u_{i,j}$ := true; otherwise set $u_{i,j}$ := false. The broadcasting in rows/columns takes $O(\log n)$ time.

Step 2: Stability checking. Determine whether there exists a $PE_{i,j}$ with $u_{i,j}$ = true by parallel searching in binary tree fashion in rows/columns. Since each row/column has $n$ PEs, the searching takes $O(\log n)$ time. If $u_{i,j}$ = false for every $PE_{i,j}$, then the current matching $M$ is stable, and the algorithm terminates. Otherwise, go to the next step.

Step 3: Find $NM_1$. For each row with at least one unstable pair, find the unstable pair with the minimum left value, and mark this pair as an nm1-generating pair. For each column with at least one nm1-generating pair, find the nm1-generating pair with the minimum right value, and mark this pair as an nm1-pair. The find-minimum operation in rows/columns takes $O(\log n)$ time.

Step 4: Find nm2-generating pairs. For each $PE_{i,j}$ containing an nm1-pair, mark the pair in $PE_{l,k}$ as an nm2-generating pair, where $l = M(C_j)^x$ and $k = M(R_i)^y$. Clearly, this step only takes $O(1)$ time.

Step 5: Find $NM_2$. This step has two major objectives: (1) each nm2-generating node that is both a row end and a column end in $G_M$ recognizes itself as an isolated node, and (2) the row end of each nm2-generating chain in $G_M$ finds its column end. Let each $PE_{i,j}$ containing an nm2-generating pair maintain two pointers, an r-pointer and a c-pointer. The r-pointer (resp. c-pointer) of $PE_{i,j}$ points to the PE containing an nm2-generating pair in column $M(R_i)^y$ (resp. row $M(C_j)^x$) if there is an nm1-pair in row $i$ (resp. column $j$), and otherwise to itself. If both the r-pointer and the c-pointer of $PE_{i,j}$ point to itself, then it corresponds to an isolated node in $G_M$; if the r-pointer (resp. c-pointer) of $PE_{i,j}$ points to itself but the other pointer points to some other PE, then $PE_{i,j}$ contains an nm2-generating pair that is the row (resp. column) end of an nm2-generating chain; if both the r-pointer and the c-pointer of $PE_{i,j}$ point to other PEs, its nm2-generating pair corresponds to an internal node of $G_M$.

Figure 2.2 shows an example of finding a new matching $M'$ from an existing matching $M$, where $M = \{a_{0,0}, a_{1,9}, a_{2,10}, a_{3,7}, a_{4,8}, a_{5,1}, a_{6,6}, a_{7,5}, a_{8,4}, a_{9,3}, a_{10,2}\}$, $NM_1 = \{a_{1,7}, a_{3,1}, a_{4,6}, a_{5,9}, a_{6,5}, a_{7,4}, a_{8,3}, a_{10,10}\}$, $NM_2 = \{a_{2,2}, a_{9,8}\}$, $RM = \{a_{1,9}, a_{2,10}, a_{3,7}, a_{4,8}, a_{5,1}, a_{6,6}, a_{7,5}, a_{8,4}, a_{9,3}, a_{10,2}\}$ and $M' = (M - RM) \cup NM = \{a_{0,0}\} \cup NM_1 \cup NM_2$.

On a completely connected multiprocessor with $n^2$ PEs, objective (1) can be easily achieved in $O(1)$ time and objective (2) can be achieved by performing $\lceil \log n \rceil$ rounds of pointer jumping [34] since the length of each nm2-generating chain is at most $n$. Once objectives (1) and (2) are accomplished, the nm2-pairs can be easily computed in $O(1)$ time.

[Figure 2.2: a marked ranking matrix with rows and columns labeled 1 to 10. The legend distinguishes matching pairs, nm1-pairs, nm2-generating pairs, nm2-pairs, r-pointers and c-pointers. nm2-generating path: $a_{9,4}$, $a_{8,5}$, $a_{7,6}$, $a_{6,8}$; nm2-generating isolated node: $a_{2,2}$; nm2-generating cycle: $a_{1,1}$, $a_{3,9}$, $a_{5,7}$.]

Figure 2.2. Finding a new matching from an existing matching.

Step 6: Construct a New Matching. Each $PE_{i,j}$ containing an nm-pair marks the matching pairs in row $i$ and in column $j$ as replaced pairs, and marks itself as a matching pair. This step takes $O(1)$ time.

Given an initial matching, the above procedure can be applied repeatedly, and a stable matching may be found. Example 2 shows how the PII algorithm finds a new matching from a given unstable matching for the instance of Example 1.

Example 2 A stable matching is found after an iteration of the Iteration Phase. The ranking matrix is

    (4,1) (2,1) (3,4) (1,3)
    (3,4) (1,2) (2,2) (4,1)
    (2,2) (4,3) (1,3) (3,4)
    (1,3) (4,4) (3,1) (2,2)

The eight panels of the example show this matrix with, in turn, the initial matching pairs, the unstable pairs, the nm1-generating pairs, the nm1-pairs, the nm2-generating pairs, the nm2-pairs, the replaced pairs, and the new matching pairs marked.

It is not difficult to verify that the new matching is a stable matching. □

2.3.3 PII Algorithm

Now we are ready to present our PII algorithm. Conceptually, the PII algorithm has two alternating phases: the Initiation Phase and the Iteration Phase. The Initiation Phase finds an initial matching arbitrarily. The Iteration Phase contains at most $c \cdot n$ iterations, where $c$ is a constant for controlling the number of iterations in the Iteration Phase. Each iteration of the Iteration Phase checks whether an existing matching $M$ is stable. If $M$ is stable, the algorithm terminates; otherwise, a new matching $M'$ is constructed. Then, $M'$ is used as $M$ for the next iteration. After $c \cdot n$ iterations in an Iteration Phase, the PII algorithm goes back to the Initiation Phase to generate a new initial matching randomly, and a new Iteration Phase is carried out based on this newly generated matching. As we analyzed, an Initiation Phase and an iteration of an Iteration Phase each take $O(\log n)$ time on a completely connected multiprocessor system with $n^2$ PEs.
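
The alternation of the two phases can be sketched as a small sequential driver. The names here are assumptions made for illustration: `pii` is not from the original text, and `step` stands for any hypothetical callable that returns the next matching, or `None` when the matching it is given is already stable. The initial matching is drawn with the standard library shuffle rather than the parallel procedure of Section 2.3.1.

```python
import random


def pii(n, step, c=2, rng=random):
    """Alternate Initiation and Iteration Phases until `step` reports stability.

    `step(match)` is a caller-supplied (hypothetical) function returning the
    next matching, or None if `match` is stable.
    """
    while True:
        match = list(range(n))
        rng.shuffle(match)                 # Initiation Phase: random initial matching
        for _ in range(c * n):             # Iteration Phase: at most c * n iterations
            nxt = step(match)
            if nxt is None:
                return match               # the current matching is stable
            match = nxt
        # No stable matching within c * n iterations: restart with a fresh matching.
```

Bounding each Iteration Phase by `c * n` iterations is what lets the driver escape a cyclic sequence of matchings without detecting the cycle explicitly.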

In an iteration of an Iteration Phase, a new matching $M' = (M - RM) \cup NM_1 \cup NM_2$ is constructed from an existing matching $M$. It is easy to verify that the pairs in $NM_1$ were unstable for $M$, but become stable for $M'$; the pairs in $NM_2$ are stable for $M'$, regardless of whether they were stable for $M$; and the pairs in $M$ that were stable for $M$ remain stable for $M'$. Intuitively, the number of unstable pairs for $M'$ is smaller than the number of unstable pairs for $M$. In most cases, this is true. This is the heuristic behind the PII algorithm.

However, new unstable pairs may be generated for $M'$. Let the initial matching be $M_0$ and the matching generated in the $i$-th iteration be $M_i$. Since the set of nm1-pairs, nm2-pairs and rm-pairs with respect to $M_{i-1}$ is unique, the matching $M_i$ is constructed uniquely from $M_{i-1}$. Hence, if $M_i \in \{M_j \mid j \in \{0, 1, \dots, i-1\}\}$, i.e. the newly generated matching is the same as a previously generated matching, no stable matching can be found. Example 3 shows a case that will not converge to a stable matching. It is possible to include a procedure for detecting this cyclic situation. Such a procedure, however, is too time-consuming. This is why we decided to start a new round after $c \cdot n$ iterations of an Iteration Phase, where $c$ is a carefully selected constant. The random permutation generating algorithm we used generates random matchings with uniform distribution according to [3]. Therefore, by the existence of a stable matching, the PII algorithm can always find one for any instance of the stable matching problem.

Example 3 The matching $M_i$ ($i \ge 4$) is the same as matching $M_j$ with $j = i \bmod 4$. The ranking matrix is

    (4,1) (2,1) (3,4) (1,3)
    (3,4) (1,2) (2,2) (4,1)
    (2,2) (4,3) (1,3) (3,4)
    (1,3) (4,4) (3,1) (2,2)

The four panels of the example show this matrix with the pairs of the initial matching $M_0$ and of the matchings $M_1$, $M_2$ and $M_3$ marked. □

Our simulation results (see Section 2.5) indicate that the PII algorithm performs better than the GS algorithm. However, we are unable to theoretically exclude the possibility that the total number of iterations is very large. In order to enforce a bound on the number of iterations, we propose to run the PII algorithm and the parallel GS algorithm simultaneously in a time-sharing fashion. We denote this modified PII algorithm as PII-GS, which terminates once one of the algorithms generates a stable matching. Clearly, the PII-GS algorithm converges to a stable matching within $O(n^2)$ iterations in the worst case.

2.4 Implementations of PII Algorithm on Parallel Computing Machine Models

In this section, we consider implementing the PII algorithm on three well-known parallel computing systems: hypercube, mesh of trees (MOT) and array with multiple broadcasting buses. Without loss of generality, assume $n = 2^k$. If $2^k < n < 2^{k+1}$, the PII algorithm can be implemented on a $2^{2k}$-processor system with a constant slow-down factor. The $n^2$ PEs in each system are placed as an $n \times n$ array, and the $n$ PEs in each row/column form a row/column connection (see Figure 2.3). We assume that our parallel computing systems operate in a synchronous fashion. Basic $O(1)$-time parallel operations of a hypercube and a MOT can be found in [47]. For an array with multiple broadcasting buses, we assume that each bus has a controller. A processor can request to communicate with the controller or any other processor on the bus. At any time, a constant number of processors on a bus may send requests to the bus controller, and the controller arbitrarily selects one request (if any) to grant bus access. The controller of a bus can broadcast a message to all the processors on the bus. We assume that each processor-to-processor, processor-to-controller and broadcasting operation takes $O(1)$ time.

It is easy to see that multiple-broadcasting, finding minimum, and pointer jumping are the most time-consuming operations in the PII algorithm. Pointer jumping can be carried out by sorting. Let $C$ be a parallel computing machine with $n^2$ processors, and let $T_B(n)$, $T_M(n)$ and $T_S(n)$ be the time required for multiple-broadcasting, finding minimum and sorting on $C$, respectively. Then, an Initiation Phase of the PII algorithm can be implemented on $C$ in $O(T_S(n) \cdot \log n)$ time, and each iteration of an Iteration Phase of the PII algorithm can be implemented on $C$ in $O(\max\{T_B(n), T_M(n), T_S(n) \cdot \log n\})$ time.

Figure 2.3. Parallel computing models: (a) a 16-processor hypercube; (b) a $4 \times 4$ mesh of trees; (c) a $4 \times 4$ array with multiple broadcasting buses.

For a hypercube and a MOT, the operations of broadcasting and finding-minimum in PII are performed in parallel row-wise or column-wise, resulting in $T_B(n) = T_M(n) = O(\log n)$. For an array with multiple broadcasting buses, $T_B(n) = O(1)$. The finding-minimum operation can be carried out on a bus in $O(\log n)$ time using a binary searching method. For an $n^2$-processor hypercube, $T_S(n) = O(\log^2 n)$, while $T_S(n) = \Omega(n)$ for a MOT and an array with multiple broadcasting buses since the bisection width of each is $n$. If we use sorting to implement pointer jumping operations, both an Initiation Phase and an iteration in an Iteration Phase of the PII algorithm require $O(\log^3 n)$ time on a hypercube and $\Omega(n \log n)$ time on a MOT and an array with multiple broadcasting buses. In the following, however, we show that sorting can be avoided on these parallel computing models using special features of the PII algorithm.

First, we show how to implement an Initiation Phase without pointer jumping. This can be done by adopting the parallel implementation in [25] of the algorithm of [16]. Let $\pi_i$ ($1 \le i \le n-1$) be the permutation interchanging $i$ and $r_i$, where $r_i$ is chosen randomly from the set $\{i, \dots, n\}$, while leaving the other elements of $\{1, 2, \dots, n\}$ fixed. Let $\pi_n$ be the identity permutation. Initially, we use row $i$ to represent $\pi_i$. The computation $\pi = \pi_1 \circ \pi_2 \circ \cdots \circ \pi_{n-1} \circ \pi_n$ is organized in a complete binary tree of height $\log n$. For example, for $n = 8$, $\pi = ((\pi_1 \circ \pi_2) \circ (\pi_3 \circ \pi_4)) \circ ((\pi_5 \circ \pi_6) \circ (\pi_7 \circ \pi_8))$. Hence, all that remains is to consider the composition of two permutations. Given a permutation $\pi'$, let $D(\pi') = \{i \mid 1 \le i \le n \text{ and } \pi'(i) \ne i\}$. The algorithm of [25] associates $|D(\pi')|$ processors with $\pi'$. In our implementation, we mimic the operations of one processor in [25] using a set of processors and their connections. More specifically, we associate each row/column $i$ with $\pi_i$ at the beginning. Let $\pi_{(i,j)} = \pi_i \circ \pi_{i+1} \circ \cdots \circ \pi_j = \pi_{(i,m)} \circ \pi_{(m+1,j)}$, where $m = (i + j - 1)/2$ and $j - i + 1$ is a power of 2. Note that $|D(\pi_{(i,j)})| \le j - i + 1$. Thus, we can use row $i$ through row $j$ and column $i$ through column $j$ to perform the operations assigned to the processors for computing $\pi_{(i,j)}$ in the algorithm of [25]. The communication paths for computing the compositions of permutations at the same level of the binary computation tree are disjoint, because they use disjoint sets of row and column connections. Since the height of the binary computation tree is $O(\log n)$, an Initiation Phase of the PII algorithm takes $O(\log^2 n)$ time on a hypercube and a MOT. If an array with multiple broadcasting buses is used, an Initiation Phase of the PII algorithm takes $O(\log n)$ time.
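
The binary-tree organization of the composition can be sketched sequentially as below. This illustrates only the order in which pairs of permutations are combined, not the processor assignment of [25]; the function names are assumptions, indices are 0-based, and a permutation is stored as a list `p` with `p[x]` the image of `x`.

```python
import random


def compose(p, q):
    """Return p o q, i.e. the permutation x -> p[q[x]]."""
    return [p[x] for x in q]


def tree_compose(perms):
    """Combine perms[0] o perms[1] o ... level by level, as in a binary tree."""
    level = list(perms)
    while len(level) > 1:
        # All compositions of one level are independent of each other.
        nxt = [compose(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:                # an unpaired permutation moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def random_transpositions(n, rng=random):
    """Build the n permutations: the i-th swaps i with a random r_i in {i, ..., n-1}."""
    perms = []
    for i in range(n):
        p = list(range(n))
        r = rng.randrange(i, n)           # for i = n - 1 this is the identity
        p[i], p[r] = p[r], p[i]
        perms.append(p)
    return perms
```

Since composition is associative, the tree order gives the same permutation as composing from left to right, while the number of levels is only about $\log n$.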

We now show how to implement an iteration of the Iteration Phase without using pointer jumping by sorting. Since each row/column contains at most one nm2-generating pair, each pointer jumping of Step 5 can be decomposed into disjoint parallel 1-to-1 row communications followed by disjoint parallel 1-to-1 column communications without conflicts. Thus, every pointer jumping step can be implemented in $O(\log^2 n)$ time on a hypercube and a MOT, and in $O(\log n)$ time on an array with multiple broadcasting buses. We also note that simulating an $n \times n$ MOT by an $n/2 \times n/2$ MOT (which has $3n^2/4 - n < n^2$ processors) results in a constant slowdown factor. To summarize, we show the improvement of the time complexity of the PII algorithm on three parallel computing systems in Table 2.1.

    Machine model      Initiation Phase                    An iteration in Iteration Phase
                       with sorting    without sorting     with sorting    without sorting
    Hypercube          O(log^3 n)      O(log^2 n)          O(log^3 n)      O(log^2 n)
    MOT                O(n log n)      O(log^2 n)          O(n log n)      O(log^2 n)
    Array with buses   O(n log n)      O(log n)            O(n log n)      O(log n)

Table 2.1. Time complexity for implementations of PII algorithm on three parallel computing machine models

[Figure 2.4: two plots comparing PII, PII-GS and GS. Horizontal axis in both panels: size of stable matching (0 to 100). Vertical axis: (a) average iterations (0 to 350); (b) cumulative frequency (%) (0 to 100).]

Figure 2.4. Performance comparisons: (a) average number of iterations for algorithms to find a stable matching; (b) frequencies for algorithms to find a stable matching within $n$ iterations

2.5 Simulation Results

We have simulated the PII, PII-GS, and parallel GS algorithms for different sizes $n \in \{10, 20, \dots, 100\}$ of stable matching, with 10000 runs each. The ranking lists and initial matchings are generated by the random permutation algorithm of [16]. Each Iteration Phase contains $n$ iterations. The performance comparisons are based on the average number of parallel iterations for each algorithm to generate a stable matching and the frequency for each algorithm to converge in $n$ iterations. From the simulation, we notice that the PII and PII-GS algorithms significantly outperform the GS algorithm. Figure 2.4 shows that the PII and PII-GS algorithms converge in $n$ iterations with very high probabilities, while the probability for the GS algorithm to converge with the same number of iterations decreases quickly as the problem size increases.

2.6 Summary

In this chapter, we proposed a new approach, parallel iterative improvement, to solving the stable matching problem. The PII algorithm requires $n^2$ PEs, among which $n$ PEs are required to perform arithmetic operations (for random number generation) and the other PEs can be simple comparators. The classical GS algorithm and most existing stable matching algorithms can only find the man-optimal or woman-optimal stable matching. By [24], the man (resp. woman)-optimal stable matching is woman (resp. man)-pessimal, i.e. every man (resp. woman) gets the best partner while every woman (resp. man) gets the worst partner over all stable matchings. In contrast, due to its randomness, the PII algorithm may construct any stable matching in the set of stable matchings rather than only one of these two extremes. Therefore, this algorithm can generate stable matchings with more fairness.

In some applications, such as real-time packet/cell scheduling for a switch, a stable matching is desirable, but may not be found quickly enough under a tight time constraint. Thus, finding a "near-stable" matching by relaxing solution quality to satisfy the time constraint is more important for such applications. Most existing parallel stable matching algorithms cannot guarantee a matching with a small number of unstable pairs within a given time interval. Interrupting the computation of such an algorithm does not result in any matching. However, the PII algorithm can be stopped at any time. By maintaining the matching with the minimum number of unstable pairs found so far, a matching that is close to a stable matching can be computed quickly.

CHAPTER 3

DESIGN AND IMPLEMENTATION OF AN ACYCLIC

STABLE MATCHING SCHEDULER

3.1 Introduction

The scheduling algorithms based on general stable matchings are too complex for high-speed implementation. It turns out that for stable matching instances with acyclic dependency graphs, finding stable matchings takes less time. Researchers have proposed several scheduling algorithms for CIOQ switches based on acyclic stable matchings. In [72], Prabhakar and McKeown proposed the most urgent cell first algorithm (MUCFA) for a CIOQ switch with a speedup of 4 to emulate the performance of an output queued (OQ) switch. Chuang and Stoica independently improved the result to a speedup of 2 by the critical cell first (CCF) algorithm [12] and the joined preferred matching (JPM) algorithm [85]. In [63], Nong et al. proved that with some speedup, an acyclic stable matching scheduling algorithm can provide QoS guarantees for both unicast and multicast traffic with fixed-length and variable-length packets.

The advantage of acyclic stable matching scheduling algorithms is their feasibility for high-speed implementation. However, there is no hardware design and implementation of acyclic stable matching scheduling algorithms in the literature. In this chapter, we propose a parallel algorithm for the acyclic stable matching problem, and present its hardware implementation. We first model the acyclic stable matching problem as the dominating set problem for rooted dependency graphs. We show that the root set and the dominating set of a rooted dependency graph are identical. We then propose a parallel algorithm, FIND_ROOTS, to find the root set of a rooted dependency graph in $O(n \log n)$ time with $n^2$ simple processing elements (PEs). We further present the hardware design and implementation of the proposed algorithm. Simulation results show that the number of 2-input NAND gates and the timing of our design are proportional to $n^2$ and $n \log n$ respectively. The proposed design can be used to implement schedulers based on acyclic stable matching algorithms, such as those in [72, 12, 85, 63].

The rest of the chapter is organized as follows. In Section 3.2, we propose our parallel algorithm FIND_ROOTS. In Section 3.3, we focus on the design and implementation of FIND_ROOTS in hardware. Section 3.4 summarizes the chapter.

3.2 A Parallel Stable Matching Algorithm for Rooted Dependency Graph

In this chapter, for a ranking matrix $A = \{a_{i,j}\}$, we call $wr_{i,j}$ (resp. $mr_{j,i}$) the horizontal value (resp. vertical value) of $a_{i,j}$, and denote it by $a^h_{i,j}$ (resp. $a^v_{i,j}$). By definition, given an $n \times n$ ranking matrix $A$, a set of man-woman pairs is a matching $M$ if any two pairs $(m_{i_1}, w_{j_1})$ and $(m_{i_2}, w_{j_2})$ in $M$ correspond to two entries $a_{i_1,j_1}$ and $a_{i_2,j_2}$ in different rows and columns of $A$; $M$ is a stable matching if there does not exist a pair $(m_i, w_j) \notin M$ such that $a^h_{i,j} < a^h_{i,k}$ and $a^v_{i,j} < a^v_{l,j}$, where $(m_i, w_k), (m_l, w_j) \in M$.

3.2.1 Dominating Set for Dependency Graph

A dominating set of dependency graph $\vec{G}$ is a set of vertices, denoted by $V_d$, such that the following two conditions are satisfied: (1) any two vertices in $V_d$ correspond to two entries in different rows and columns of the ranking matrix; (2) for any vertex $v \in V(\vec{G}) - V_d$, there is a directed edge from a vertex in $V_d$ to $v$.

Since each vertex $v_{i,j}$ in $\vec{G}$ corresponds to a man-woman pair $(m_i, w_j)$, by the definitions of stable matching and dominating set, we have the following fact.

Fact 1 Let $\vec{G}$ be a dependency graph. $V_d$ is the vertex subset corresponding to a stable matching if and only if $V_d$ is a dominating set of $\vec{G}$.

By Fact 1, the problem of finding a stable matching is reduced to the problem of finding a dominating set. In general, the dominating set for a dependency graph may not be unique, and finding one is time-consuming. However, we find that the problem of finding dominating sets for a special class of dependency graphs, named rooted dependency graphs, is much easier. A rooted dependency graph is defined recursively as follows: an empty graph is a rooted dependency graph; a non-empty dependency graph $\vec{G}$ is a rooted dependency graph if (1) it contains one or more roots, each being a vertex without any incoming edge; (2) the reduced subgraph, which is obtained from $\vec{G}$ by removing all vertices in the rows and columns in which the roots are located and all outgoing edges from these removed vertices, is also a rooted dependency graph. The root set of a rooted dependency graph $\vec{G}$ is the set that consists of all roots of $\vec{G}$ and of its reduced subgraphs recursively generated from $\vec{G}$. The following fact is obvious.

Fact 2 Let $\vec{G}$ be the dependency graph of a ranking matrix $A$ where each entry $a_{i,j} = (wr_{i,j}, mr_{j,i})$. For any vertex $v_{i,j}$, the number of incoming edges coming from the vertices in row $i$ is equal to $wr_{i,j} - 1$ and the number of incoming edges coming from the vertices in column $j$ is equal to $mr_{j,i} - 1$.

By Fact 2, we know that a vertex with corresponding entry $(1, 1)$ is a root since it has no incoming edge. By Facts 1 and 2, we have the following theorem.

Theorem 1 For a rooted dependency graph $\vec{G}$, the root set is the same as the dominating set, which is unique for $\vec{G}$.

[Figure 3.1: (a) three grids of vertices labeled $\vec{G}$, $\vec{G}'$ and $\vec{G}''$, with rows and columns numbered 1 to 4 and each vertex labeled with its (horizontal, vertical) values; (b) five panels, iteration 1 to iteration 5, each connecting men $m_1$ to $m_4$ with women $w_1$ to $w_4$.]

Figure 3.1. Finding stable matching in a rooted dependency graph: (a) a rooted dependency graph $\vec{G}$ and its reduced subgraphs $\vec{G}'$ and $\vec{G}''$; (b) stable matching is found by GS algorithm in 5 iterations.

Example 4 Figure 3.1 (a) shows an example of a rooted dependency graph $\vec{G}$ and its reduced subgraphs $\vec{G}'$ and $\vec{G}''$, where each root is marked as a dark circle. The dependency graph $\vec{G}$ corresponds to the following ranking matrix $A_1$:

    (3,3) (4,1) (1,1) (2,3)
    (1,2) (2,4) (3,2) (4,2)
    (1,1) (2,3) (4,3) (3,1)
    (2,4) (3,2) (1,4) (4,4)

The horizontal value and vertical value of each entry in the ranking matrix are shown in each corresponding vertex. From the figure, clearly, neither of the two vertices $v_{1,3}$ and $v_{3,1}$, which are marked as dark circles in $\vec{G}$, has an incoming edge since each of them corresponds to an entry $(1, 1)$ in the ranking matrix. Hence, $\vec{G}$ has two roots, $v_{1,3}$ and $v_{3,1}$. After removing all vertices in rows 1, 3 and columns 1, 3 and their outgoing edges in $\vec{G}$, we get the reduced subgraph $\vec{G}'$, which has root $v_{4,2}$ marked as a dark circle in $\vec{G}'$. After removing all vertices in row 4 and column 2 and their outgoing edges in $\vec{G}'$, we get the reduced subgraph $\vec{G}''$, which contains only one vertex $v_{2,4}$ that is also a root of $\vec{G}''$. By the definition, we know $\vec{G}$ is a rooted dependency graph. It is easy to verify that the root set, $\{v_{1,3}, v_{3,1}, v_{4,2}, v_{2,4}\}$, is the dominating set of $\vec{G}$. By Theorem 1, the dominating set corresponds to the stable matching of ranking matrix $A_1$, which is $\{(1, 3), (3, 1), (4, 2), (2, 4)\}$. □

A rooted dependency graph may not be acyclic (i.e. the graph may have a directed cycle). In Figure 3.1 (a), $\vec{G}$ contains a cycle $(v_{1,1}, v_{4,1}, v_{4,2}, v_{3,2}, v_{2,2}, v_{2,3}, v_{2,4}, v_{1,4})$, whose edges are marked as dark edges in the figure. However, an acyclic graph always has at least one root, and its reduced subgraph is also acyclic. Thus, we have the following fact.

Fact 3 An acyclic dependency graph is a rooted dependency graph, but a rooted dependency graph may not be an acyclic dependency graph.

In the following, we propose a parallel algorithm for finding the root set (i.e. the stable matching) in a rooted dependency graph.

3.2.2 The Algorithm

Given a rooted dependency graph $\vec{G}$ constructed from an $n \times n$ ranking matrix $A$, we first find the roots of $\vec{G}$. If the reduced subgraph $\vec{G}'$ of $\vec{G}$ is not empty, we continue to find the remaining vertices in the root set of $\vec{G}$ recursively until the total number of found roots equals $n$. The algorithm for finding the root set of a rooted dependency graph, FIND_ROOTS, is described in the following.

Algorithm FIND_ROOTS
begin
    $G := \vec{G}$;          /* $\vec{G}$ is the dependency graph */
    $V_r := \emptyset$;      /* $V_r$ is the root set */
    while there exists a root in $G$ do
        Step 1: find the set of roots $V'_r$ of $G$ and let $V_r := V_r \cup V'_r$;
        Step 2: find the reduced subgraph $G'$ of $G$ and let $G := G'$.
end

Based on Theorem 1 and Fact 1, the set of roots obtained from FIND_ROOTS corresponds to the set of man-woman pairs in the stable matching. We analyze the time complexity of FIND_ROOTS using $n^2$ PEs as follows. The $n^2$ PEs are placed as an $n \times n$ array, and the $n$ PEs in each row or column are fully connected.

Each $PE_{i,j}$ corresponds to a vertex $v_{i,j}$ of $\vec{G}$ and has a pair of horizontal ($h$ for short) and vertical ($v$ for short) values set as $(wr_{i,j}, mr_{j,i})$ initially. Since the total number of roots in the root set of $\vec{G}$ is equal to $n$, FIND_ROOTS runs in at most $n$ iterations. Each iteration of FIND_ROOTS consists of two steps. Based on Fact 2, we know step 1 can be done in $O(1)$ time by each $PE_{i,j}$ checking whether its $(h, v) = (1, 1)$. Conceptually, step 2 contains 2 substeps. In substep 1, each root vertex $v_{i,j}$ found in step 1 sets its $(h, v) = (0, 0)$ and marks all vertices in row $i$ and column $j$ as vertices to be deleted. Since all PEs in the same row or column are fully connected, this substep takes $O(1)$ time. In substep 2, each undeleted vertex $v_{i,j}$ decreases its $h$ (resp. $v$) value by $k$ if its $h$ (resp. $v$) value is greater than that of $k$ deleted vertices in row $i$ (resp. column $j$). Since there are at most $n$ deleted vertices in each row/column, this substep can be done in $O(\log n)$ time. Therefore, based on the above discussion, we have the following theorem.

Theorem 2 Given any instance of the stable matching problem, if its corresponding dependency graph is a rooted dependency graph (including an acyclic dependency graph), we can find the stable matching in $O(n \log n)$ time on $n^2$ PEs.
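
A sequential Python sketch of FIND_ROOTS follows. It is an illustration under assumptions, not the hardware design: the name `find_roots` and the 0-based indices are not from the original text, and instead of decrementing the $h$ and $v$ values in substep 2 it tests directly whether an entry has the smallest $h$ in its remaining row and the smallest $v$ in its remaining column, which is the same condition as $(h, v) = (1, 1)$ after the decrements. The matrix is $A_1$ of Example 4.

```python
def find_roots(A):
    """Return (root list, number of iterations) for ranking matrix A[i][j] = (h, v)."""
    n = len(A)
    rows, cols = set(range(n)), set(range(n))   # rows and columns not yet deleted
    roots, iterations = [], 0
    while rows:
        # Step 1: a remaining vertex is a root iff it has no incoming edge, i.e. its
        # h is minimal in its remaining row and its v is minimal in its remaining column.
        found = sorted(
            (i, j)
            for i in rows
            for j in cols
            if A[i][j][0] == min(A[i][k][0] for k in cols)
            and A[i][j][1] == min(A[l][j][1] for l in rows)
        )
        if not found:
            break                               # the graph is not a rooted dependency graph
        iterations += 1
        roots.extend(found)
        # Step 2: delete the rows and columns of the roots (the reduced subgraph).
        rows -= {i for i, _ in found}
        cols -= {j for _, j in found}
    return roots, iterations


A1 = [
    [(3, 3), (4, 1), (1, 1), (2, 3)],
    [(1, 2), (2, 4), (3, 2), (4, 2)],
    [(1, 1), (2, 3), (4, 3), (3, 1)],
    [(2, 4), (3, 2), (1, 4), (4, 4)],
]
print(find_roots(A1))  # ([(0, 2), (2, 0), (3, 1), (1, 3)], 3)
```

With 1-based indices the roots are $(1,3)$, $(3,1)$, $(4,2)$ and $(2,4)$, found in three iterations, in agreement with Example 4.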

3.2.3 Comparison with GS Algorithm

Gale and Shapley proposed an algorithm for solving the stable matching problem in [20]. The GS algorithm works as follows. Each man first proposes to his favorite woman; each woman keeps the proposal from the man who ranks highest in her ranking list among those who have proposed to her, and rejects all the others. Each rejected man then proposes to the next favorite woman on his ranking list. The GS algorithm continues this process until every woman has received a proposal. When the GS algorithm stops, each woman and the man whose

Figure 3.2. Finding stable matching in an acyclic dependency graph: (a) an acyclic dependency graph $\tilde{G}$ and its reduced subgraphs $\tilde{G}'$ and $\tilde{G}''$; (b) stable matching is found by GS algorithm in 6 iterations

proposal the woman keeps become a pair of partners, and all pairs form a stable matching. Gale and Shapley showed that a stable matching always exists and can be found in $O(n^2)$ iterations. Because of the dependencies among proposals in the GS algorithm, the number of iterations cannot easily be reduced by parallelism, regardless of the number of PEs used. The running time of the parallel GS algorithm is $O(n^2 \log n)$ on $n$ PEs, since each iteration takes $O(\log n)$ time to find the minimum of at most $n$ distinct numbers.

For stable matching problems with rooted dependency graphs, the GS algorithm is not as fast as FIND ROOTS. Figure 3.1 (b) shows an example of finding the stable matching using the GS algorithm, where new proposals are marked by light lines and kept proposals are marked by dark lines in each iteration. As shown in Figure 3.1 (b), to find the stable matching for ranking matrix $A_1$, the GS algorithm needs 5 iterations while FIND ROOTS needs only 3 iterations. This means that $O(n)$ iterations are not sufficient for the GS algorithm to find the stable matching for rooted dependency graphs. Furthermore, $O(n)$ iterations are not sufficient for the GS algorithm to find the stable matching for acyclic dependency graphs. Figure 3.2 shows an example of an acyclic dependency graph. Consider finding a stable matching of the ranking matrix

$$A_2 = \begin{pmatrix} 1,1 & 3,1 & 2,3 & 4,3 \\ 1,2 & 4,2 & 2,2 & 3,1 \\ 2,3 & 4,3 & 1,1 & 3,4 \\ 2,4 & 4,4 & 3,4 & 1,2 \end{pmatrix}.$$

According to Algorithm FIND ROOTS, we first find the roots of $\tilde{G}$, which are $v_{1,1}$ and $v_{3,3}$; then find the root of $\tilde{G}'$, which is $v_{2,4}$; finally find the root of $\tilde{G}''$, which is $v_{4,2}$. Thus, the GS algorithm needs 6 iterations and FIND ROOTS needs 3 iterations.
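The iteration count quoted for the GS algorithm on $A_2$ can be reproduced with a round-based sketch. It rests on assumptions that the text does not state explicitly: the row parties of $A_2$ are the proposers, the first number of each entry is the rank the row party gives the column party, and the second number is the rank the column party gives the row party. The name `gale_shapley_rounds` is illustrative.

```python
def gale_shapley_rounds(h, v):
    """Round-based Gale-Shapley sketch.

    h[i][j]: rank that proposer i gives acceptor j (1 = favorite).
    v[i][j]: rank that acceptor j gives proposer i (1 = favorite).
    In every round all currently rejected proposers propose at once.
    Returns (number of rounds, sorted 1-based (proposer, acceptor) pairs).
    """
    n = len(h)
    prefs = [sorted(range(n), key=lambda j, i=i: h[i][j]) for i in range(n)]
    next_choice = [0] * n   # index of the next acceptor each proposer will try
    holder = [None] * n     # holder[j]: proposer whose proposal acceptor j keeps
    free = list(range(n))
    rounds = 0
    while free:
        rounds += 1
        proposals = {}
        for i in free:
            j = prefs[i][next_choice[i]]
            next_choice[i] += 1
            proposals.setdefault(j, []).append(i)
        free = []
        for j, suitors in proposals.items():
            candidates = suitors + ([holder[j]] if holder[j] is not None else [])
            best = min(candidates, key=lambda i: v[i][j])
            free.extend(i for i in candidates if i != best)
            holder[j] = best
    return rounds, sorted((holder[j] + 1, j + 1) for j in range(n))


# Ranking matrix A_2, split into first (H) and second (V) numbers of each entry.
H = [[1, 3, 2, 4], [1, 4, 2, 3], [2, 4, 1, 3], [2, 4, 3, 1]]
V = [[1, 1, 3, 3], [2, 2, 2, 1], [3, 3, 1, 4], [4, 4, 4, 2]]
gs_rounds, gs_matching = gale_shapley_rounds(H, V)
print(gs_rounds, gs_matching)
```

Under these assumptions the sketch takes six rounds and ends with the same four pairs that FIND ROOTS reports as roots.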

In the worst case, the parallel GS algorithm finds the stable matching for a rooted dependency graph or an acyclic dependency graph in $O(n^2 \log n)$ time. FIND ROOTS, however, finds the stable matching for such graphs in at most $n$ iterations, each taking $O(\log n)$ time. Thus, the worst-case speedup of FIND ROOTS over the GS algorithm is $O(n)$. Both FIND ROOTS and the GS algorithm take the man and woman ranking lists as inputs, and every list contains $n$ numbers. Thus, the space needed by both algorithms is the same. Table 3.1 compares the parallel GS algorithm and the parallel FIND ROOTS algorithm for finding the stable matching in any rooted dependency graph or acyclic dependency graph with respect to time, the number of PEs, and memory space.

Algorithm     Time               PEs      Space
GS            $O(n^2 \log n)$    $n$      $O(n^2)$
FIND ROOTS    $O(n \log n)$      $n^2$    $O(n^2)$

Table 3.1. Comparison of algorithms for finding a stable matching

3.3 Implementing the Scheduler

One of the objectives of our work is to design a scheduler that is feasible to implement. In this section, we present the hardware design and implementation of a scheduler based on the FIND ROOTS algorithm. An $n \times n$ scheduler has $n^2$ pairs of inputs $(wr_{1,1}, mr_{1,1}), \ldots, (wr_{n,n}, mr_{n,n})$, and $n$ pairs of outputs, which are the indices of the $n$ roots, $s_1, s_2, \ldots, s_n$. The circuit consists of $n^2$ nodes arranged as an $n \times n$ array. Each node corresponds to an entry in the ranking matrix $A$ and a vertex of $A$'s dependency graph. We use $2n$ buses to interconnect the $n^2$ nodes such that node $n_{i,j}$, where $1 \le i, j \le n$, is connected to the $i$th row bus, $r_i$, and the $j$th column bus, $c_j$. Each bus is $\log n$ bits wide. The first bit line of each of the $n$ row buses is connected to a controller, which selects one out of possibly multiple bus requests (when multiple root nodes exist in a graph). Each node $n_{i,j}$ has two inputs for reading its $(h, v)$ pair and one output for sending out its index. Figure 3.3 shows the scheduler block diagram, circuit structure, and node block diagram of a $4 \times 4$ scheduler.

Figure 3.3. A $4 \times 4$ scheduler design: (a) scheduler block diagram; (b) circuit structure; (c) node block diagram.

The operation of an $n \times n$ scheduler takes $n$ iterations. Initially, each node $n_{i,j}$ sets its $(h, v) = (wr_{i,j}, mr_{j,i})$. An iteration operates as follows. If a node $n_{i,j}$ finds that its $(h, v) = (1, 1)$ (i.e., it is a root node), it sends a `request signal' on its row bus. If the controller detects that more than one bus is requesting, it selects the bus with the minimum row index and sends a `grant signal' back on that bus. Once a root node $n_{i,j}$ gets the `grant signal' from its row bus, it sends a `mask signal' on row bus $r_i$ and column bus $c_j$ to ``eliminate'' all nodes on row $i$ and column $j$; meanwhile, it updates its $(h, v) = (0, 0)$ and sends out its index. Once a node on row $i$ (resp. column $j$) receives a `mask signal', it broadcasts its $v$ (resp. $h$) value on its column (resp. row) bus. If the $h$ (resp. $v$) value of a node is greater than the $h$ (resp. $v$) value received from its row bus (resp. column bus), the node decreases its $h$ (resp. $v$) value by 1.
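The bus protocol described above can be mimicked in software to see the order in which the outputs $s_1, \ldots, s_n$ are produced. The sketch below is a behavioral model under assumptions of its own: the function name `scheduler_run`, the `active` flags, and the 0-based indexing are not part of the design, and the signal timing of the real circuit is ignored. The input is the ranking matrix $A_2$ of Section 3.2.3.

```python
def scheduler_run(h, v):
    """Behavioral sketch of the n x n scheduler: one granted root per iteration."""
    n = len(h)
    h = [row[:] for row in h]
    v = [row[:] for row in v]
    active = [[True] * n for _ in range(n)]
    outputs = []
    for _ in range(n):
        # Root nodes raise a request on their row bus.
        requesting = [(i, j) for i in range(n) for j in range(n)
                      if active[i][j] and (h[i][j], v[i][j]) == (1, 1)]
        if not requesting:
            break
        # The controller grants the requesting bus with the minimum row index.
        i0, j0 = min(requesting)
        outputs.append((i0 + 1, j0 + 1))
        # Masked nodes on row i0 broadcast v on their column bus; nodes in
        # that column holding a larger v decrease it by 1.
        for j in range(n):
            if j != j0 and active[i0][j]:
                for i in range(n):
                    if i != i0 and active[i][j] and v[i][j] > v[i0][j]:
                        v[i][j] -= 1
        # Masked nodes on column j0 broadcast h on their row bus; nodes in
        # that row holding a larger h decrease it by 1.
        for i in range(n):
            if i != i0 and active[i][j0]:
                for j in range(n):
                    if j != j0 and active[i][j] and h[i][j] > h[i][j0]:
                        h[i][j] -= 1
        # Row i0 and column j0 are eliminated; the root resets to (0, 0).
        for k in range(n):
            active[i0][k] = False
            active[k][j0] = False
        h[i0][j0] = v[i0][j0] = 0
    return outputs


H = [[1, 3, 2, 4], [1, 4, 2, 3], [2, 4, 1, 3], [2, 4, 3, 1]]
V = [[1, 1, 3, 3], [2, 2, 2, 1], [3, 3, 1, 4], [4, 4, 4, 2]]
schedule = scheduler_run(H, V)
print(schedule)
```

Because only one bus is granted per iteration, the two simultaneous roots of $A_2$ are output in consecutive iterations, and the model uses all $n = 4$ iterations.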

Size           N=2      N=4       N=6       N=8       N=10      N=12
Timing (ns)    62.24    137.28    239.52    315.84    399.6     479.52
Area (gates)   1166     5410      12263     29283     41002     59342

Table 3.2. Timing and area results of the scheduler design.

The major advantage of this design is its simplicity. We use only $2n$ buses, each $\log n$ bits wide, to broadcast signals to nodes in the same row or the same column, and one $\log n$-bit priority encoder functioning as a controller for bus arbitration. Although $n^2$ nodes are used, the logic of each node is simple: it mainly comprises two $\log n$-bit registers that store its $h$ and $v$ values, one $\log n$-bit comparator, and one $\log n$-bit adder.

We conducted simulations of the scheduler design with Synopsys's design tools. We wrote the VHDL [32] code, then compiled and synthesized it with Synopsys's design analyzer [88] using its library lsi_10k. The design analyzer was directed to minimize the area cost of the design. Table 3.2 gives the timing results (in ns) and the area results (in number of 2-input NAND gates) of the scheduler design for $n = 2, 4, 6, 8, 10, 12$. The timing and the number of 2-input NAND gates are approximately proportional to $n$ and $n^2$, respectively, making the design feasible to implement with current CMOS technologies.
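The growth trend can be checked directly from the numbers in Table 3.2 by dividing the timing by $n$ and the area by $n^2$. The short script below does only this arithmetic; the variable names are illustrative.

```python
sizes = [2, 4, 6, 8, 10, 12]
timing_ns = [62.24, 137.28, 239.52, 315.84, 399.6, 479.52]
area_gates = [1166, 5410, 12263, 29283, 41002, 59342]

# If timing grows like n and area like n^2, these ratios should stay
# within a narrow band as n increases.
timing_per_n = [t / n for t, n in zip(timing_ns, sizes)]
area_per_n2 = [a / n ** 2 for a, n in zip(area_gates, sizes)]

for n, t, a in zip(sizes, timing_per_n, area_per_n2):
    print(f"n={n:2d}  timing/n={t:6.2f} ns  area/n^2={a:7.1f} gates")
```

The timing ratio stays between roughly 31 and 40 ns per unit of $n$, and the area ratio between roughly 290 and 460 gates per unit of $n^2$, so the stated proportionality holds approximately over the measured range.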

Another advantage of the design is its flexibility. Our scheduler design works well for real applications, including the case in which ranks in some ranking lists are not distinct (e.g., cells with the same priority), the case in which the lengths of some ranking lists are not equal to $n$ (e.g., in some input queue, there is no cell destined for some output port), and the case in which the sizes of the man set and the woman set are not equal (e.g., the number of input queues is not equal to the number of output queues).

3.4 Summary

In this chapter, we addressed the acyclic stable matching problem and proposed a parallel algorithm to solve the stable matching problem for rooted dependency graphs, which contain all acyclic dependency graphs as special cases. We designed a hardware scheduler based on the proposed algorithm. Simulation results show that the proposed scheduler design is feasible with current CMOS technologies. To the best of our knowledge, the scheduler design is the first hardware design for acyclic stable matching. It is very useful in constructing high-speed switches/routers.

CHAPTER 4

PARALLEL ROUTING ALGORITHMS FOR NONBLOCKING ELECTRONIC

AND PHOTONIC SWITCHING NETWORKS

4.1 Introduction

Recently, a class of multistage nonblocking switching networks has been proposed. Each network in this class, denoted by $B(N, x, p, \alpha)$, has relatively low hardware cost and short connection diameter, $O(N^{1.5} \log N)$ and $O(\log N)$ respectively, in terms of the number of SEs. A $B(N, x, p, \alpha)$, $\alpha \in \{0, 1\}$, is constructed by horizontally concatenating $x (\le \log N - 1)$ extra stages to an $N \times N$ Banyan-type network and vertically stacking $p$ copies of the extended Banyan. Networks $B(N, x, p, 0)$ and $B(N, x, p, 1)$ are similar in structure, but the latter does not allow any two connection paths to pass through the same SE while the former does. $B(N, x, p, 0)$ and $B(N, x, p, 1)$ are suitable for electronic and optical implementation, respectively. It has been shown that $B(N, x, p, \alpha)$ can be SNB, WNB and RNB with certain values of $x$ and $p$ for given $N$ and $\alpha$ [41, 42, 57, 91, 92].

A trivial lower bound on the time for routing $K$ ($0 \le K \le N$) connections sequentially in $B(N, x, p, \alpha)$ is $\Omega(K \log N)$. This lower bound is obtained by assuming that, for each connection, it takes $O(1)$ time to correctly guess which plane to use without causing conflict and $O(\log N)$ time to compute the connection path in that plane. Clearly, when $x \neq 0$ and $p > 1$, correctly assigning connections to planes and routing connections in each plane are not easy. When the number of connection requests is large, the routing time complexity is greater than $O(N)$. Parallel processing techniques should be used to meet the stringent real-time timing requirement [31]. To the best of our knowledge, except for some special cases such as the Banyan network (i.e., $B(N, 0, 1, \alpha)$) and the Benes network (i.e., $B(N, \log N - 1, 1, \alpha)$), no effort to investigate faster routing for the whole class of these networks has been reported in the literature. For $B(N, x, p, \alpha)$ to be useful, a fast switch control mechanism must be devised.

The focus of this chapter is the control aspect of the class of $B(N, x, p, \alpha)$ networks in the context of their use as electrical and optical switching networks. In particular, our objective is to speed up the routing process by using parallel processing techniques. A completely connected multiprocessor system of $N$ processing elements (PEs) is used as the parallel computation model. Such a model is by no means practical; it is used as a general abstract model to derive parallel algorithms. Efficient algorithms on more realistic models, such as a hypercube whose architectural complexity is the same as that of a single plane of $B(N, x, p, \alpha)$, can be easily obtained from our algorithms.

There are three basic approaches for designing routing algorithms, namely matrix decomposition, matching and graph edge-coloring, and they are essentially equivalent [31]. In this chapter, we assume that $N = 2^n$. By examining the connection capacity of $B(N, x, p, \alpha)$, we first model the routing problems for this class of networks as weak and strong edge-colorings of bipartite graphs, which unifies and extends previous models for RNB and SNB networks. Based on our model, we propose fast routing algorithms for $B(N, x, p, \alpha)$ using parallel processing techniques. We show that the presented parallel routing algorithms can route $K$ connections in $O(\log K \log N)$ time for an RNB $B(N, x, p, \alpha)$ and in $O(d^* \log N)$ time for an SNB $B(N, 0, p^*, \alpha)$, where $d^*$ is the degree of the I/O mapping graph of the new connections. Since $K = N$ and $d^* = O(\sqrt{N})$ in the worst case, the proposed algorithms can always route $O(N)$ connections in an RNB $B(N, x, p, \alpha)$ in $O(\log^2 N)$ time and in an SNB $B(N, x, p^*, \alpha)$ in $O(\sqrt{N} \log N)$ time. As a by-product of our analysis, we also propose a class of simple self-routing nonblocking networks, $T(N, \alpha)$. Compared with a crossbar, the presented new networks have lower hardware cost, shorter connection diameter, and fewer required wavelengths.

The remainder of the chapter is organized as follows. In Section 4.2, we discuss the topology of $B(N, x, p, \alpha)$. In Section 4.3, we model routing in $B(N, x, p, \alpha)$ as two coloring problems of an I/O mapping graph $G(N, K, g)$. In Section 4.4, we propose a fast parallel routing algorithm for RNB $B(N, x, p, \alpha)$ based on a weak $g$-edge coloring of $G(N, K, g)$. In Section 4.5, we extend this parallel routing algorithm to SNB $B(N, x, p, \alpha)$ based on a strong $(2g - 1)$-edge coloring of $G(N, K, g)$. In Section 4.6, we present a new structure $T(N, \alpha)$ for self-routing nonblocking networks. Finally, we summarize this chapter in Section 4.7.

4.2 Nonblocking Networks Based on Banyan-type Networks

If the Baseline network is used for photonic switching, it is a blocking network since two connections may pass through the same SE, which causes node conflict. Even if the Baseline network is used for electronic switching, it is still a blocking network since two connections may try to pass through the same input (resp. output) link, which causes input (resp. output) link conflict.

Although the Baseline network is a blocking network, a nonblocking network can be built by extending it in three ways: horizontal concatenation of extra stages to the back of a Baseline network, vertical stacking of multiple copies of a Baseline network, and the combination of both horizontal concatenation and vertical stacking [41, 42, 91, 92]. Such an extended network is constructed by concatenating the mirror image of the first $x (< n)$ stages of $BL(N)$ to the back of a $BL(N)$, then vertically making $p$ copies of the extended $BL(N)$ (each copy is called a plane), and finally connecting the inputs (resp. outputs) in the first (resp. last) stage to $N$ $1 \times p$ splitters (resp. $p \times 1$ combiners). Specifically, the $i$-th input (resp. output) of the $j$-th plane is connected with the $j$-th output (resp. input) of the $i$-th $1 \times p$ splitter (resp. $p \times 1$ combiner), which is connected with the $i$-th input (resp. output) of this network. We denote a network constructed in this way by $B(N, x, p, \alpha)$, where $\alpha$ is the crosstalk factor. That is, $\alpha = 0$ if the network has no crosstalk-free constraint and $\alpha = 1$ if the network has the crosstalk-free constraint. Clearly, $B(N, 0, 1, \alpha)$ is a Baseline network and $B(N, n - 1, 1, \alpha)$ is a Benes network [6]. In $B(N, x, 1, \alpha)$, a subnetwork, denoted by $B(N, x, 1/2^l, \alpha)$ ($0 \le l \le n - 1$), is defined as a $B(N/2^l, \max\{x - l, 0\}, 1, \alpha)$ from stage $l$ to stage $n + \max\{x - l, 0\} - 1$. Figure 4.1 shows an example of $B(16, 2, 3, \alpha)$, which contains three planes of $B(16, 2, 1, \alpha)$; each $B(16, 2, 1, \alpha)$ is constructed from $B(16, 0, 1, \alpha)$ by adding two extra stages.

Figure 4.1. A network $B(16, 2, 3, \alpha)$

4.3 Graph Model

In this section, we model routing problems in $B(N, x, p, \alpha)$ networks as two edge coloring problems of bipartite graphs.

4.3.1 I/O Mapping Graphs

For $B(N, x, p, \alpha)$, the set of $N$ inputs (resp. outputs) is divided into $N/g$ modulo-$g$ input groups (resp. modulo-$g$ output groups). Let $g = 2^i$, $0 \le i \le n$. Then, the $k$-th modulo-$g$ input group comprises inputs $I_{(k-1)g}, I_{(k-1)g+1}, \ldots, I_{kg-1}$, and the $k$-th modulo-$g$ output group comprises outputs $O_{(k-1)g}, O_{(k-1)g+1}, \ldots, O_{kg-1}$, where $1 \le k \le N/g$.

If a connection path does not have any link (resp. node) conflict with other connection paths, it is called a link conflict-free (resp. node conflict-free) path. Clearly, a node conflict-free path is also link conflict-free, but the converse is not true. If a set of connections can be set up by conflict-free paths in $B(N, x, 1, \alpha)$, these connections are called feasible connections of $B(N, x, 1, \alpha)$. Our goal is to quickly set up $K$ link (resp. node) conflict-free paths for $K$ connections of any I/O mapping in $B(N, x, p, 0)$ (resp. $B(N, x, p, 1)$). To achieve this goal, we decompose a set of connections into disjoint subsets, and route each subset in one plane of $B(N, x, p, \alpha)$ so that each subset is feasible for its assigned plane.

Given any I/O mapping with $K$ connections for $B(N, x, p, \alpha)$, we construct a graph $G(N, K, g)$, named the I/O mapping graph, as follows. The vertex set consists of two parts, $V_1$ and $V_2$. Each part has $N/g$ vertices, i.e., each modulo-$g$ input (resp. output) group is represented by a vertex in $V_1$ (resp. $V_2$). There is an edge between vertex $\lfloor i/g \rfloor$ in $V_1$ and vertex $\lfloor j/g \rfloor$ in $V_2$ if $j = \pi(i)$. Thus, $G(N, K, g)$ is a bipartite graph with $N/g$ vertices in each of $V_1$ and $V_2$ and $K$ edges, where at most $g$ edges are incident at any vertex; that is, the degree of $G(N, K, g)$ is at most $g$. Since there may be more than one connection from a modulo-$g$ input group to the same modulo-$g$ output group, $G(N, K, g)$ may have parallel edges between two vertices, i.e., it may be a multigraph. However, there is a one-to-one correspondence between active inputs/outputs in an I/O mapping and the edges in the I/O mapping graph, and thus, we can label each edge by its corresponding input. An edge $e$ is called the left edge (resp. right edge) of edge $f$ if $e = \bar{f}$ (resp. $\pi(e) = \overline{\pi(f)}$). Any edge has at most one left edge and at most one right edge in $G(N, K, g)$. Two edges $e$ and $f$ are called neighboring edges if $e$ is the left or right edge of $f$. We define a linear component (or simply, a component) of $G(N, K, g)$ as follows: two edges $e$ and $f$ belong to the same component if and only if there is a sequence of edges $e = e_1, \ldots, e_j = f$ such that $e_i$ and $e_{i+1}$, $1 \le i \le j - 1$, are neighboring edges. If every edge in a component has two neighboring edges, the component is called a closed component; otherwise it is called an open component. By extending the ``neighboring edge'' relation to an equivalence relation, each edge is in exactly one component, and thus, components are edge disjoint in $G(N, K, g)$.
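The construction of the I/O mapping graph and its components can be sketched as follows. The sketch assumes that the bar operation pairs each label with the one that differs only in the least significant bit (written `e ^ 1` in the code); this reading is an assumption made for illustration. Idle inputs are marked with $-1$, and the 8-input mapping `pi` as well as the function names are hypothetical.

```python
def io_mapping_graph(pi, g):
    """Edges of G(N, K, g), labeled by input: i -> (vertex in V1, vertex in V2)."""
    return {i: (i // g, o // g) for i, o in enumerate(pi) if o != -1}


def components(pi):
    """Split the active inputs (edge labels) into linear components.

    Assumption of this sketch: bar(i) is i with its lowest bit flipped.
    Returns a list of (sorted edge labels, "closed" or "open").
    """
    n = len(pi)
    inv = [-1] * n                      # inverse mapping, -1 for idle outputs
    for i, o in enumerate(pi):
        if o != -1:
            inv[o] = i

    def neighbors(e):
        left = e ^ 1 if pi[e ^ 1] != -1 else -1
        right = inv[pi[e] ^ 1]
        # Left and right edge are listed separately, even if they coincide.
        return [f for f in (left, right) if f != -1]

    seen, result = set(), []
    for start in range(n):
        if pi[start] == -1 or start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            e = stack.pop()
            comp.append(e)
            for f in neighbors(e):
                if f not in seen:
                    seen.add(f)
                    stack.append(f)
        closed = all(len(neighbors(e)) == 2 for e in comp)
        result.append((sorted(comp), "closed" if closed else "open"))
    return result


pi = [4, 6, 5, 7, 0, -1, 1, -1]         # hypothetical mapping, N = 8, K = 6
graph = io_mapping_graph(pi, 4)
comps = components(pi)
print(graph)
print(comps)
```

For this mapping, inputs 0 to 3 form one closed component and inputs 4 and 6 form one open component; with $g = 4$ the vertex for input group 0 has degree 4, which equals $g$.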

Example 5 In Figure 4.2, (a) shows an I/O mapping with 32 inputs, 25 of which are active; (b) shows the I/O mapping graph $G(32, 25, 8)$ of (a), where $V_1$ (resp. $V_2$) of $G(32, 25, 8)$ has 4 vertices and each vertex in $V_1$ (resp. $V_2$) includes 8 inputs (resp. outputs) belonging to the same modulo-8 input (resp. output) group; (c) shows all components of $G(32, 25, 8)$ in (b). $\Box$

Figure 4.2. Finding a balanced 2-coloring: (a) an I/O mapping; (b) a balanced 2-coloring of an I/O mapping graph $G(32, 25, 8)$; (c) a set of components, (i) 1 closed component and (ii) 5 open components; (d) pointer initialization for pointer jumping

4.3.2 Graph Coloring and Nonblockingness

If we route connections in $B(N, x, p, \alpha)$ one by one using sequential algorithms, the time complexity for establishing $K$ connections is $\Omega(K \cdot (\log N + x))$ since it takes $\Omega(\log N + x)$ time to set up one connection. For a large number of connections, the time required is more than $O(N)$, which is not acceptable for real-time applications. Parallel processing techniques can be used to speed up routing in $B(N, x, p, \alpha)$. We say that two connections share a modulo-$g$ input (resp. output) group if their sources (resp. destinations) are in the same modulo-$g$ input (resp. output) group. Let us study the connection capability of $B(N, x, p, \alpha)$ first.

Lemma 2 For any connection set $C$ of $B(N, 0, 1, \alpha)$, if no two connections in $C$ share any modulo-$g$ input (resp. output) group, then the connection paths for $C$ satisfy the following conditions:

(i) they are node conflict-free in the first (resp. last) $\log g$ stages;

(ii) they are input link conflict-free in the first $\log g + 1$ (resp. last $\log g$) stages and output link conflict-free in the first $\log g$ (resp. last $\log g + 1$) stages.

Lemma 3 For any pair of input and output in $B(N, x, 1, \alpha)$, there are $2^x$ paths connecting them.

It is easy to verify that Lemmas 2 and 3 are true according to the topology of

BL(N) (refer to [57] for formal proofs and see Figure 4.3 for examples). Using the

above two lemmas, the following claim can be easily derived from the results of [57].

Lemma 4 Given a connection set $C$ of $B(N, x, 1, \alpha)$, if any two connections in $C$ do not share any modulo-$2^{\lfloor (n-x+\alpha)/2 \rfloor}$ input group and also do not share any modulo-$2^{\lfloor (n-x+\alpha)/2 \rfloor}$ output group, then $C$ is feasible for $B(N, x, 1, \alpha)$.

By Lemma 4, if we assign the connections in $B(N, x, p, \alpha)$ whose sources (resp. destinations) are in the same modulo-$g$ input (resp. output) group to different planes, then we can route connections in $B(N, x, p, \alpha)$ without conflict. Thus, in order to route conflict-free connections in $B(N, x, p, \alpha)$, we first need to

Figure 4.3. Number of connection paths: (a) 1 path in $B(16, 0, 1, \alpha)$; (b) 2 paths in $B(16, 1, 1, \alpha)$; (c) 4 paths in $B(16, 2, 1, \alpha)$; (d) 8 paths in $B(16, 3, 1, \alpha)$

determine which plane to use for each connection. By constructing an I/O mapping graph $G(N, K, g)$ with $g = 2^{\lfloor (n-x+\alpha)/2 \rfloor}$, we can reduce the problem of routing $K$ connections in $B(N, x, p, \alpha)$ to the following two graph coloring problems:

Weak Edge Coloring Problem (WEC problem): Given an I/O mapping graph $G(N, K, g)$ with $K' (< K)$ colored edges, color all $K$ edges with a set of colors such that no two edges with the same color are incident at the same vertex of $G(N, K, g)$, where changing the colors of the $K'$ colored edges is allowed. If we can find a weak edge coloring of $G(N, K, g)$ using at most $c_1$ different colors, we call this coloring a (weak) $c_1$-edge coloring of $G(N, K, g)$. The definition of weak edge coloring is the same as the definition of edge coloring in graph theory. Thus, we omit ``weak'' in the rest of this chapter.

Strong Edge Coloring Problem (SEC problem): Given an I/O mapping graph $G(N, K, g)$ with $K' (< K)$ colored edges, color the $K - K'$ uncolored edges with a set of colors such that no two edges with the same color are incident at the same vertex of $G(N, K, g)$, without changing the colors of the $K'$ colored edges. If we can find a strong edge coloring of $G(N, K, g)$ using at most $c_2$ different colors, we call this coloring a strong $c_2$-edge coloring of $G(N, K, g)$.

If we consider the colored (resp. uncolored) edges in $G(N, K, g)$ as the existing (resp. new) connections in $B(N, x, p, \alpha)$, a solution to the WEC problem is a plane assignment for routing in an RNB network, since existing connections can be rerouted, and a solution to the SEC problem is a plane assignment for routing in an SNB network, since rerouting existing connections is prohibited. Clearly, for the same $G(N, K, g)$, $c_1 \le c_2$.
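The difference between the two problems can be seen on a small hypothetical instance: three edges forming a path, two of which are already colored. The greedy routine below is only one simple way to obtain a strong edge coloring and is not the algorithm developed in this chapter; the names `strong_edge_coloring` and `is_proper` are illustrative.

```python
def is_proper(edges, color):
    """True if no two edges of the same color meet at a vertex of V1 or V2."""
    labels = list(edges)
    for k, e in enumerate(labels):
        for f in labels[k + 1:]:
            share = edges[e][0] == edges[f][0] or edges[e][1] == edges[f][1]
            if share and color[e] == color[f]:
                return False
    return True


def strong_edge_coloring(edges, colored):
    """Greedy sketch: keep existing colors, give each new edge the smallest
    color that is unused at both of its endpoints."""
    color = dict(colored)
    for e in sorted(edges):
        if e in color:
            continue
        u, w = edges[e]
        used = {color[f] for f, (a, b) in edges.items()
                if f in color and (a == u or b == w)}
        c = 1
        while c in used:
            c += 1
        color[e] = c
    return color


# Hypothetical instance: edge label -> (vertex in V1, vertex in V2).
edges = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
existing = {"a": 1, "c": 2}             # colors that must not change in SEC

sec = strong_edge_coloring(edges, existing)
wec = {"a": 1, "b": 2, "c": 1}          # recoloring c is allowed in WEC
print(sec, is_proper(edges, sec))
print(wec, is_proper(edges, wec))
```

Here the strong coloring needs a third color for the new edge, while two colors suffice once recoloring is allowed. In general the greedy rule never needs more than $2g - 1$ colors, because at most $g - 1$ other edges meet the new edge at each of its two endpoints.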

Example 6 In Figure 4.4, there are three edges labeled $a$, $b$, $c$, respectively. Edges $a$ and $c$ have already been colored using colors 1 and 2, respectively. A WEC solution is given in (a), and an SEC solution is given in (b). Note that, in (b), an additional color is needed for edge $b$ because the colors of the existing colored edges $a$ and $c$ cannot be changed. $\Box$

Figure 4.4. Edge coloring: (a) a (weak) edge coloring; (b) a strong edge coloring

In the following two sections, we show how to speed up the routing for RNB networks and SNB networks using WEC and SEC of I/O mapping graphs, respectively.

4.4 Routing in Rearrangeable Nonblocking Networks

In this section, we present a fast parallel routing algorithm for RNB $B(N, x, p, \alpha)$ based on a weak $g$-edge coloring of $G(N, K, g)$.

4.4.1 Rearrangeable Nonblockingness

The following claim is implied by the results of [57].

Lemma 5 If $p \ge 2^{\lfloor (n-x+\alpha)/2 \rfloor}$, then $B(N, x, p, \alpha)$ is rearrangeable nonblocking.

It is important to note that the minimum value of $p$ in Lemma 5 equals the value of $g$ in Lemma 4, where $p$ is the number of $B(N, x, 1, \alpha)$ planes required for $B(N, x, p, \alpha)$ to be rearrangeable nonblocking.
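The bound of Lemma 5 is easy to evaluate for concrete networks. The helper below simply computes $2^{\lfloor (n-x+\alpha)/2 \rfloor}$ for $N = 2^n$; the function name `rnb_min_planes` is illustrative.

```python
def rnb_min_planes(n, x, alpha):
    """Smallest p allowed by Lemma 5 for N = 2**n, x extra stages and
    crosstalk factor alpha: 2**floor((n - x + alpha) / 2)."""
    if not 0 <= x < n or alpha not in (0, 1):
        raise ValueError("expected 0 <= x < n and alpha in {0, 1}")
    return 2 ** ((n - x + alpha) // 2)


# N = 16 (n = 4): planes needed for every number of extra stages.
for x in range(4):
    print(x, rnb_min_planes(4, x, 0), rnb_min_planes(4, x, 1))
```

For $N = 16$ and $x = 2$ the bound is 2 for both values of $\alpha$, so the three-plane network $B(16, 2, 3, \alpha)$ of Figure 4.1 satisfies Lemma 5. For $x = n - 1$ and $\alpha = 0$ the bound is 1, which matches the single-plane Benes network.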

By Lemmas 4 and 5, if we assign the connections (including existing and new connections) sharing the same modulo-$g$ input or output group to different planes, the connections are feasible for every assigned plane. Then, the routing can be completed by setting up conflict-free connection paths within each plane.

Lemma 6 Every bipartite multigraph $G$ has a $\Delta(G)$-edge coloring, where $\Delta(G)$ is the degree of $G$.

By Lemma 6 (see a proof in [8]), if we set $g = 2^{\lfloor (n-x+\alpha)/2 \rfloor}$ in $G(N, K, g)$, the plane assignment for a set of connections in an RNB $B(N, x, p, \alpha)$ can be solved by finding a $g$-edge coloring of $G(N, K, g)$.

4.4.2 Algorithm for Balanced 2-Coloring of G(N;K; g)

In order to solve the WEC problem efficiently, we first present an algorithm for a problem named the balanced 2-coloring problem: given an I/O mapping graph $G(N, K, g)$, color its edges with 2 colors so that every vertex is incident to at most $g/2$ edges of one color and at most $g/2$ edges of the other.

We choose to present our parallel algorithms for a completely connected multiprocessor system with $N$ PEs. Initially, each PE$_i$, $0 \le i \le N - 1$, reads $\pi(i)$ from input $i$, sets the value of $\pi^{-1}$ in PE$_{\pi(i)}$ to $i$, and then performs the following two steps.

Step 1. Divide the I/O mapping graph $G(N, K, g)$ into a set of components. This step can be done by each edge $i$ finding its left edge $\bar{i}$ and right edge $\pi^{-1}(\overline{\pi(i)})$.

Step 2. Color components with two colors, red and blue, so that neighboring edges in each component have different colors.

Each component has two specific representatives, referred to simply as Reps. (There is an exception: a component of length 1 has only one Rep, which is the edge itself.) For closed and open components, the Reps are defined differently. For a closed component, we define two edges with the minimum labels as two Reps; for an open component, if an edge $e$ has no left edge or $e$'s left edge has no right edge, $e$ is defined as a Rep. Figure 4.2(c) shows the Reps of all possible types of components, where the Reps of each component are marked as dark lines and edges are labeled by their corresponding inputs. Step 2 can be done by coloring edges with the Reps as references using the pointer jumping technique in [33]. At the beginning, each edge sets its pointer to point to the right edge of its left edge if it exists and to itself otherwise. By doing so, two disjoint directed cycles are formed for a closed component, and two disjoint directed paths are formed for an open component with more than one edge, each containing a Rep. For an open component, furthermore, the last pointer of every directed path points to one of the Reps. For example, Figure 4.2(d) shows the directed cycles and paths formed from the components of Figure 4.2(c). Then, by performing $\lceil \log(K/2) \rceil$ rounds of parallel pointer jumping, each edge finds the Rep belonging to the same directed cycle or path. Finally, each edge can be colored by comparing the Rep found by itself with that found by its neighbor. That is, if the Rep found by an edge is no larger than its neighbor's, the edge is colored red; otherwise it is colored blue. Figure 4.2(b) shows a balanced 2-coloring of the I/O mapping graph whose components are given in Figure 4.2(c), where solid lines are colored red and dashed lines are colored blue.

The detailed implementation of the balanced 2-coloring algorithm is given in Algorithm 1, where we use the operator ``:='' to denote an assignment local to a PE or to the control unit, and the operator ``$\leftarrow$'' to denote an assignment requiring some interprocessor communication.

The correctness and time complexity of Algorithm 1 are given in the following

theorem.

Theorem 3 A balanced 2-coloring of any $G(N, K, g)$ can be found in $O(\log K)$ time using a completely connected multiprocessor system of $N$ PEs.

Proof: Given an I/O mapping graph $G(N, K, g)$, Step 1 can be done in $O(1)$ time using a completely connected multiprocessor system of $N$ PEs. In Step 2, since the length of each directed cycle or path is at most $\lceil K/2 \rceil$, each edge can find a Rep by $\lceil \log(K/2) \rceil$ rounds of pointer jumping. Clearly, all edges in the same directed cycle or path are colored with the same color since they find the same Rep. The pointer initialization implies that each edge and its neighboring edges are in different directed cycles or paths, and thus, they have different colors. By the definition of left/right edge, there are no more than $g/2$ pairs of neighboring edges incident at any vertex of $G(N, K, g)$. Thus, the colorings of all components compose a balanced 2-coloring of $G(N, K, g)$. Therefore, a balanced 2-coloring of any $G(N, K, g)$ can be found in $O(\log K)$ time. $\Box$
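The pointer-jumping procedure of Algorithm 1 can be simulated sequentially by updating all PEs in lockstep. The sketch below is such a simulation under assumptions of its own: the bar operation is taken to flip the least significant bit of a label (`i ^ 1`), idle inputs are marked with $-1$, dictionaries stand in for the per-PE registers, and the 8-input mapping `pi` is hypothetical.

```python
def balanced_2_coloring(pi):
    """Lockstep simulation of pointer jumping; returns {edge label: 0 or 1}."""
    n = len(pi)
    inv = [-1] * n
    for i, o in enumerate(pi):
        if o != -1:
            inv[o] = i
    active = [i for i in range(n) if pi[i] != -1]
    left = {i: (i ^ 1 if pi[i ^ 1] != -1 else -1) for i in active}
    right = {i: inv[pi[i] ^ 1] for i in active}

    q, p = {}, {}                       # pointer and "ends in a path" flag
    for i in active:
        if left[i] != -1 and right[left[i]] != -1:
            q[i], p[i] = right[left[i]], 0
        else:
            q[i], p[i] = i, 1
    m = {i: i for i in active}          # smallest label seen so far

    half = (len(active) + 1) // 2       # ceil(K / 2)
    rounds = (half - 1).bit_length() if half > 0 else 0
    for _ in range(rounds):
        # All three updates read the values of the previous round.
        m = {i: min(m[i], m[q[i]]) for i in active}
        p = {i: p[q[i]] for i in active}
        q = {i: q[q[i]] for i in active}

    for i in active:
        if p[i] == 1:                   # path: the Rep is the edge pointed to
            m[i] = q[i]

    color = {}
    for i in active:
        if left[i] != -1:
            other = m[left[i]]
        elif right[i] != -1:
            other = m[right[i]]
        else:
            other = i                   # isolated edge
        color[i] = 0 if m[i] <= other else 1
    return color


pi = [4, 6, 5, 7, 0, -1, 1, -1]         # hypothetical mapping, N = 8, K = 6
coloring = balanced_2_coloring(pi)
print(coloring)
```

With $g = 4$, each input group and each output group of this example receives at most two edges of each color, as the definition of a balanced 2-coloring requires.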

4.4.3 Algorithm for g-Edge Coloring of G(N;K; g)

Based on the balanced 2-coloring algorithm, a WEC solution to any I/O mapping graph $G(N, K, g)$ with no more than $g$ colors can be found as follows. Let $d$ be the degree of $G(N, K, g)$. Clearly, $d \le g$. First, remove the colors of the $K'$ colored edges. Then, perform at most $\lceil \log d \rceil$ iterations as follows. In the initial iteration (i.e., iteration 0), we find a balanced 2-coloring of $G(N, K, g)$ using colors 0 and 1 if $d > 1$, and let $G_0$ and $G_1$ be the graphs induced by the edges with colors 0 and 1, respectively. If

Algorithm 1 A Balanced 2-Coloring of an I/O Mapping Graph

Input: $G(N, K, g)$
Output: a balanced 2-coloring of $G(N, K, g)$

for all PE$_i$, $0 \le i \le N - 1$, do
    $l(i) := r(i) := -1$; /* $l(i)$, $r(i)$ are the left edge and right edge of edge $i$ respectively */
    if $\pi(i) \neq -1$ then
        if $\pi(\bar{i}) \neq -1$ then
            $l(i) := \bar{i}$;
        end if
        if $\pi^{-1}(\overline{\pi(i)}) \neq -1$ then
            $r(i) \leftarrow \pi^{-1}(\overline{\pi(i)})$;
        end if
        if $l(i) \neq -1$ and $r(l(i)) \neq -1$ then
            $q(i) \leftarrow r(l(i))$; /* $q(i)$ is a pointer */
            $p(i) := 0$; /* $p(i)$ is used to find edge $i$ in a path or cycle */
        else
            $q(i) := i$;
            $p(i) := 1$; /* the edge $i$ in a path */
        end if
        $m(i) := m'(i) := i$; /* $m(i)$ (resp. $m'(i)$) is used to find the Rep found by edge $i$ (resp. $i$'s neighbor) */
        for $t := 1$ to $\lceil \log(K/2) \rceil$ do
            $m(i) \leftarrow \min\{m(i), m(q(i))\}$;
            $p(i) \leftarrow p(q(i))$;
            $q(i) \leftarrow q(q(i))$;
        end for
        if $p(i) = 1$ then
            $m(i) := q(i)$; /* the Rep found by the edge $i$ is the label of PE to which $i$ is pointing */
        end if
        if $l(i) \neq -1$ /* $i$ has left edge */ then
            $m'(i) \leftarrow m(l(i))$;
        else if $r(i) \neq -1$ /* $i$ has right edge */ then
            $m'(i) \leftarrow m(r(i))$;
        end if
        if $m(i) \le m'(i)$ then
            $c(i) := 0$; /* color $i$ as red */
        else
            $c(i) := 1$; /* color $i$ as blue */
        end if
    end if
end for

$\Delta(G_0) > 1$ (resp. $\Delta(G_1) > 1$), we execute iteration 1 to find a balanced 2-coloring for $G_0$ (resp. $G_1$) using colors 00 and 01 (resp. 10 and 11). This process continues recursively in a binary tree fashion until a solution to WEC is reached. More formally, in each recursive iteration $i$, $1 \le i \le \lceil \log d \rceil - 1$, we find a balanced 2-coloring for each graph $G_z$ using colors $z0$ and $z1$ (i.e., concatenate 0 or 1 with $z$) if $\Delta(G_z) > 1$, where $z$ is the binary representation of an integer in $\{0, 1, \ldots, 2^i - 1\}$ denoting the color of the edges in $G_z$ in iteration $i - 1$.

Theorem 4 For any I/O mapping graph $G(N, K, g)$, a $g$-edge coloring can be found in $O(\log d \cdot \log K)$ time using a completely connected multiprocessor system of $N$ PEs, where $d$ is the degree of $G(N, K, g)$.

Proof: Let $d' = 2^k$ such that $k$ is the smallest integer satisfying $d \le 2^k$. We prove the theorem by induction on $k$. If $k = 1$, it is true since a balanced 2-coloring is a 2-edge coloring by Theorem 3. Assume that the theorem holds for any $k < m \le n$. Now, we prove that the theorem holds for $k = m$. First, we find a balanced 2-coloring of $G(N, K, g)$, which can be done in $O(\log K)$ time by Theorem 3. Let $G_0$ and $G_1$ be the graphs induced by the edges of the two different colors from this balanced 2-coloring. By the definition of balanced 2-coloring, we know that $\Delta(G_0) \le d'/2$ and $\Delta(G_1) \le d'/2$. By the hypothesis, we can find a $(d'/2)$-edge coloring for each of $G_0$ and $G_1$ in $O((k - 1) \cdot \log K)$ time on a completely connected multiprocessor subsystem of $|E(G_0)|$ and $|E(G_1)|$ PEs, respectively. These two colorings can be carried out simultaneously since $E(G_0) \cap E(G_1) = \emptyset$. The $(d'/2)$-edge colorings of $G_0$ and $G_1$ compose a $d'$-edge coloring of $G(N, K, g)$, which takes $O(k \cdot \log K)$ time in total using a completely connected multiprocessor system of $N$ PEs. Since $d'/2 < d \le d' \le g$, this theorem holds. $\Box$


4.4.4 Parallel Routing in a Plane

We have shown how to assign each connection to a plane in an RNB $B(N, x, p, \alpha)$.
In this section, we show how connections are routed within each plane.

Lemma 7 Let $C$ be a set of feasible connections for $B(N, x, 1, \alpha)$. If each connection
in $C$ is set up in the first and last $x$ stages such that the output link in stage $i$ and
the input link in stage $\log N - i$ on each connection are connected with the same
subnetwork $B(N, x, 1/2^{i+1}, \alpha)$, $0 \le i \le x - 1$, then $C$ can be routed by self-routing in
the middle $\log N - x$ stages.

Proof: By the topology of $B(N, x, 1, \alpha)$, we know that each connection must pass
through the same subnetwork $B(N, x, 1/2^i, \alpha)$, $0 \le i \le \log N - 1$. Since the middle
$\log N - x$ stages of $B(N, x, 1, \alpha)$ consist of $2^x$ Baseline networks $BL(N/2^x)$, this lemma
is true. □

Theorem 5 Let $C$ be a set of $K$ feasible connections of $B(N, x, 1, \alpha)$. Then $C$ can
be correctly routed in $O(x \log K + \log N)$ time using a completely connected
multiprocessor system of $N$ PEs.

Proof: By Lemma 7, we only need to route $C$ correctly in the first
and last $x$ stages for $x \ge 1$. By the topology of $B(N, x, 1, \alpha)$, we know that the
output link in stage $i$ and the input link in stage $\log N - i$ on each connection are
connected with the same subnetwork $B(N, x, 1/2^{i+1}, \alpha)$, $0 \le i \le x - 1$. Thus, we
need to decide which subnetwork is to be used for each connection, since there are $2^i$
subnetworks $B(N, x, 1/2^i, \alpha)$. This can be reduced to a 2-edge coloring of a bipartite graph with
degree 2. For each subnetwork $B(N, x, 1/2^i, \alpha)$, $0 \le i \le x - 1$, we construct an
I/O mapping graph $G(N/2^i, K_i, 2)$, where $K_i$ is the number of connections passing
through it. We color the edges of $G(N/2^i, K_i, 2)$ with two different colors and assign
the connections (edges) with the same color to pass through the same subnetwork
$B(N, x, 1/2^{i+1}, \alpha)$. Specifically, in each iteration $i$, $0 \le i \le x - 1$, we run the $g$-edge
coloring algorithm with $g = 2$ on the $2^i$ graphs $G(N/2^i, K_i, 2)$. By Theorem 4, each iteration
can be done in $O(\log K)$ time. Thus, the time to set up $K$ feasible connections in the
first and last $x$ stages is $O(x \log K)$. By Lemma 7, we can set up the connections in
the middle $\log N - x$ stages by self-routing, which takes $\log N - x$ time. Therefore,
the total time to route $K$ feasible connections of $B(N, x, 1, \alpha)$ is $O(x \log K + \log N)$
using a completely connected multiprocessor system of $N$ PEs. □

4.4.5 Overall Routing Performance

Theorem 6 For any RNB $B(N, x, p, \alpha)$ such that $p \ge 2^{\lfloor (n-x+\alpha)/2 \rfloor}$, $K$ connections
(including existing and new connections) can be correctly routed in $O(\log K \log N)$
time using a completely connected multiprocessor system of $N$ PEs.

Proof: Let $g = 2^{\lfloor (n-x+\alpha)/2 \rfloor}$. By Theorem 4, we can find a $g$-edge coloring of
the I/O mapping graph $G(N, K, g)$ in $O(\log d \log K)$ time, where $d$ is the degree of
$G(N, K, g)$. By Lemma 4, we assign the connections with the same color to the
same plane. In each plane $B(N, x, 1, \alpha)$, by Theorem 5, we can route the connections
in $O(x \log K + \log N)$ time. Since $x < \log N$ and $d \le g = 2^{\lfloor (n-x+\alpha)/2 \rfloor}$, the total time is
$O((x + \log d) \log K + \log N) = O(\log K \log N)$. □
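The last step of the proof can be unpacked as follows. This is a sketch that assumes $n = \log N$ and $K \ge 2$ (so that $\log K \ge 1$), neither of which is stated explicitly in the proof.

```latex
\begin{align*}
\log d &\le \log g = \left\lfloor \frac{n - x + \alpha}{2} \right\rfloor \le \frac{n + 1}{2},
  \qquad x < n = \log N, \\
(x + \log d)\log K + \log N
  &\le \left( n + \frac{n + 1}{2} \right) \log K + n \\
  &= O(\log K \log N).
\end{align*}
```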

By Lemma 5, for the special cases of an RNB $B(N, 0, p, \alpha)$ and an RNB
$B(N, n-1, p, \alpha)$, the minimum number $p$ of planes of Baseline networks and Benes networks
equals $2^{\lfloor (n+\alpha)/2 \rfloor}$ and $2^{\lfloor (1+\alpha)/2 \rfloor}$, respectively. Consequently, we can route $N$ connections
in $O(\log^2 N)$ time for both $B(N, 0, p, \alpha)$ and $B(N, n-1, p, \alpha)$. For the RNB
$B(N, n-1, p, 0)$, which is the electronic Benes network, this performance is the same as the
best known results reported in [49, 62].

4.5 Routing in Strictly Nonblocking Networks

In this section, we present a fast parallel routing algorithm for SNB $B(N, x, p, \alpha)$
based on a strong $(2g - 1)$-edge coloring of $G(N, K, g)$.

4.5.1 Strict Nonblockingness

The following lemma can be easily derived from the results of [92].

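Lemma 8 If
$$p \ge \begin{cases} (1 + \alpha)x + 2^{\frac{n-x}{2}} \left( \frac{3}{2} + \frac{1}{2}\alpha \right) - 1, & \text{for even } n - x \\ (1 + \alpha)x + 2^{\frac{n-x+1}{2}} \left( 1 + \frac{1}{2}\alpha \right) - 1, & \text{for odd } n - x \end{cases}$$
then $B(N, x, p, \alpha)$ is strictly nonblocking.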

For an SNB network, we can route new connections (as long as these
connections form an I/O mapping from idle inputs to idle outputs) without disturbing
the existing ones; however, this routing problem is harder than that in an RNB
network when we need to route the new connections simultaneously. In this section, we
present a parallel algorithm based on graph coloring to reduce the routing time.

Based on the discussions in Section 4.3, we know that the routing problem
for an SNB $B(N, x, p, \alpha)$ can be solved by finding a strong edge coloring of the I/O
mapping graph $G(N, K, g)$ with $g = 2^{\lfloor (n-x+\alpha)/2 \rfloor}$.

Lemma 9 Any multigraph $G$ has a strong $(2\Delta(G) - 1)$-edge coloring, where $\Delta(G)$
is the degree of $G$.

Proof: Consider coloring edges in an arbitrary order. Since each edge in $G$ is adjacent
to at most $2\Delta(G) - 2$ edges, any uncolored edge in $G$ can always be assigned a color
so that the total number of colors used is no more than $2\Delta(G) - 1$. □

We consider a subclass of SNB networks, $B(N, 0, p^*, \alpha)$ with $p^* = 2^{\lfloor (n+\alpha)/2 \rfloor + 1} - 1$.
By Lemma 8, we know that $B(N, 0, p^*, \alpha)$ is an SNB network. Since each plane of
$B(N, 0, p^*, \alpha)$ is a Baseline network, the routing of connections in any plane can be
done by self-routing. Thus, the problem of setting up connections in $B(N, 0, p^*, \alpha)$ is
reduced to finding a plane for each new connection so that all connections, including
existing ones, are conflict-free. By Lemmas 4 and 9, this can be done by finding a
strong $(2g - 1)$-edge coloring for $G(N, K, g)$ of $B(N, 0, p^*, \alpha)$ with $K'$ existing
connections and $K - K'$ new connections, where $g = 2^{\lfloor (n+\alpha)/2 \rfloor} = \frac{p^* + 1}{2}$. In the next subsection,
we present an algorithm to find a strong $(2g - 1)$-edge coloring of $G(N, K, g)$.

4.5.2 Algorithm for Strong $(2g - 1)$-Edge Coloring of $G(N, K, g)$

Let $G(N, K', g)$ and $G(N, K - K', g)$ denote the graphs obtained from $G(N, K, g)$ by
only keeping the $K'$ colored edges and by removing the $K'$ colored edges, respectively.
Let $d^*$ be the degree of $G(N, K - K', g)$, and let $d' = 2^k$ such that $k$ is the smallest
integer satisfying $d^* \le 2^k$. Conceptually, a strong $(2g - 1)$-edge coloring of $G(N, K, g)$
with $K'$ ($< K$) colored edges can be done in the following two steps.

Step 1: find a set of matchings $\{M_1, M_2, \ldots, M_{d'}\}$ of $G(N, K - K', g)$;

Step 2: for $i$ from 1 to $d'$ do the following: color the edges in $M_i$ without
changing the colors of the edges in $G(N, K', g) \cup (\bigcup_{j<i} M_j)$.

Finding a set of $d'$ matchings in a graph is equivalent to coloring the edges
in the graph with $d'$ different colors, because edges with the same color are not
adjacent to each other. Thus, Step 1 can be done by finding a $d'$-edge coloring of
$G(N, K - K', g)$ using the algorithm described in Section 4.4. This $d'$-edge coloring
divides the $K - K'$ uncolored edges (corresponding to new connections) into $d'$ matchings.
By Theorem 4, Step 1 takes $O(\log d' \cdot \log(K - K')) = O(\log d^* \cdot \log(K - K'))$ time
using a completely connected multiprocessor system of $N$ PEs.

In $G(N, K, g)$, each edge is adjacent to at most $2g - 2$ edges, and hence there
are at most $2g - 2$ colored edges adjacent to each edge in a matching $M_i$. Since edges
with the same color cannot be adjacent, we can color every edge in a matching with one
of the unused colors. This can be done by searching in parallel for a free color among
the $2g - 1$ colors as follows. Associate a Boolean array $C[1..2g-1]$ of $2g - 1$ elements
with each vertex in $G(N, K, g)$, with $C[r] = 0$ if and only if an edge incident to
the vertex has been colored with color $r$. Consider an edge $e$ in $M_i$ that connects
vertices $u$ and $v$ of $G(N, K, g)$, and let $C_u$ and $C_v$ be the $C$ arrays associated with
vertices $u$ and $v$, respectively. Performing a bitwise AND operation on $C_u$ and $C_v$, we
obtain a Boolean array $D_{u,v}$ such that $D_{u,v}[s] = C_u[s] \wedge C_v[s]$, $1 \le s \le 2g - 1$. Then,
$D_{u,v}[t] = 1$ if and only if color $t$ is available for edge $e$. We can assign $g/2$ PEs to
each vertex $w$ of $G(N, K, g)$, and these PEs collectively maintain $C_w$. Then, using
$g$ PEs, $D_{u,v}$ can be computed in $O(1)$ time, and some $t$ such that $D_{u,v}[t] = 1$ can be found
by performing a parallel binary prefix sums operation on $D_{u,v}$, which takes $O(\log g)$
time. Since no two edges in a matching are adjacent, the uncolored edges in the matching
can be colored simultaneously by their assigned PEs in $O(\log g)$ time, and Step 2
takes $O(d' \log g)$ time. Since $d'/2 < d^* \le d'$, $O(d' \log g) = O(d^* \log g)$. Therefore, we
have the following claim.

Theorem 7 For any I/O mapping graph $G(N, K, g)$ with $K'$ ($< K$) colored edges, a
strong $(2g - 1)$-edge coloring can be found in $O(\log d^* \log(K - K') + d^* \log g)$ time
using a completely connected multiprocessor system of $N$ PEs, where $d^*$ is the degree
of $G(N, K - K', g)$.
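The free-color search of Step 2 can be illustrated with integer bitmasks. This is a sequential sketch under stated assumptions: each Boolean array is packed into one Python integer, colors are numbered from 0 instead of 1, the lowest set bit stands in for the parallel prefix-sums search, and the name `color_matching` is hypothetical.

```python
def color_matching(matching, free):
    """Give every edge (u, v) of a matching a color that is free at both ends.

    free[w] is an integer bitmask for vertex w: bit t is 1 if and only if
    color t is still unused at w (the role of the Boolean array C_w).
    The masks in `free` are updated in place.
    """
    assigned = {}
    for u, v in matching:
        both = free[u] & free[v]               # D_{u,v} = C_u AND C_v
        if both == 0:
            raise ValueError("no color is free at both end vertices")
        t = (both & -both).bit_length() - 1    # index of the lowest set bit
        assigned[(u, v)] = t
        free[u] &= ~(1 << t)                   # color t is now used at u
        free[v] &= ~(1 << t)                   # ... and at v
    return assigned
```

Because the edges of a matching share no vertex, the loop iterations touch disjoint masks, which is what allows them to run simultaneously on separate groups of PEs.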

4.5.3 Performance Analysis

We summarize the overall performance of our routing algorithm for SNB network
$B(N, 0, p^*, \alpha)$ by the following theorem.

Theorem 8 For an SNB network $B(N, 0, p, \alpha)$ with $p \ge p^* = 2^{\lfloor (n+\alpha)/2 \rfloor + 1} - 1$,
connections from any $K - K'$ idle inputs to any $K - K'$ idle outputs, with $K'$ existing
connections, can be correctly routed in $O(d^* \log N)$ time using a completely connected
multiprocessor system of $N$ PEs, where $d^*$ is the degree of $G(N, K - K', g)$.

Proof: By Theorem 7, we can find a strong $(2g - 1)$-edge coloring of $G(N, K, g)$ in
$O(\log d^* \log(K - K') + d^* \log g)$ time using a completely connected multiprocessor
system of $N$ PEs. We assign each new connection with color $i$ to the $i$-th plane of
$B(N, 0, p, \alpha)$. By Lemma 4, these new connections can be routed by self-routing in
$O(\log N)$ time. Thus, the total time is $O(\log d^* \log(K - K') + d^* \log g + \log N)$. Since
$g = 2^{\lfloor (n+\alpha)/2 \rfloor}$ and $K - K' \le N$, this time complexity is $O(d^* \log N)$. □

Since $d^* \le g = 2^{\lfloor (n+\alpha)/2 \rfloor}$, $d^* = O(\sqrt{N})$ in the worst case. Assuming that
the edges in $G(N, K - K', g)$ are uniformly distributed, $d^* = \lceil (K - K')g/N \rceil =
O\left(\frac{K - K'}{\sqrt{N}}\right)$ on average. Therefore, the performance of our algorithm is summarized by
the following claim.

Corollary 1 Under the same conditions as Theorem 8, the worst-case and average-case
time complexities of our routing algorithm are $O(\sqrt{N} \log N)$ and $O\left(\frac{K - K'}{\sqrt{N}} \log N\right)$,
respectively.

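By Lemma 8, we derive the minimum number of planes, $p_{min}$, in $B(N, 0, p, \alpha)$
as follows.

i. If there is no crosstalk-free constraint (i.e., $\alpha = 0$),
$$p_{min} = \begin{cases} \frac{3}{2} 2^{\frac{n}{2}} - 1, & \text{for even } n \\ 2^{\frac{n+1}{2}} - 1, & \text{for odd } n \end{cases}$$
and

ii. If there is a crosstalk-free constraint (i.e., $\alpha = 1$),
$$p_{min} = \begin{cases} 2^{\frac{n}{2} + 1} - 1, & \text{for even } n \\ \frac{3}{2} 2^{\frac{n+1}{2}} - 1, & \text{for odd } n \end{cases}$$

Compared with $B(N, 0, p_{min}, \alpha)$, the hardware redundancy of $B(N, 0, p^*, \alpha)$ is
shown as follows.
$$p^* - p_{min} = \begin{cases} 0, & \text{if } \alpha = 0 \text{ and } n \text{ is odd} \\ \sqrt{N}/2, & \text{if } \alpha = 0 \text{ and } n \text{ is even} \\ 0, & \text{if } \alpha = 1 \text{ and } n \text{ is even} \\ \sqrt{2N}/2, & \text{if } \alpha = 1 \text{ and } n \text{ is odd} \end{cases}$$

The hardware cost of $B(N, 0, p^*, \alpha)$, in terms of the number of SEs, is higher
than that of $B(N, 0, p_{min}, \alpha)$ in half of the cases, but both have the same hardware
complexity of $\Theta(N^{1.5} \log N)$. The time for routing $O(N)$ connections, however, is
improved from the order of $N \log N$ to sublinear $O(\sqrt{N} \log N)$ in the worst case.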
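The redundancy figures above can be checked numerically. The following sketch evaluates $p^*$ and $p_{min}$ in integer arithmetic for $x = 0$; the function names are hypothetical, and `alpha` stands for the crosstalk-free flag (0 or 1).

```python
def p_star(n, alpha):
    """Number of planes p* = 2^(floor((n + alpha) / 2) + 1) - 1."""
    return 2 ** ((n + alpha) // 2 + 1) - 1


def p_min(n, alpha):
    """Minimum number of planes from Lemma 8 with x = 0 (n >= 1)."""
    if alpha == 0:
        if n % 2 == 0:
            return 3 * 2 ** (n // 2 - 1) - 1        # (3/2) 2^(n/2) - 1
        return 2 ** ((n + 1) // 2) - 1
    if n % 2 == 0:
        return 2 ** (n // 2 + 1) - 1
    return 3 * 2 ** ((n + 1) // 2 - 1) - 1          # (3/2) 2^((n+1)/2) - 1


def redundancy(n, alpha):
    """Extra planes used by B(N, 0, p*, alpha) over the minimum."""
    return p_star(n, alpha) - p_min(n, alpha)
```

For example, with $n = 4$ ($N = 16$) and no crosstalk-free constraint, `redundancy(4, 0)` evaluates to 2, which is $\sqrt{16}/2$.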

4.6 Self-Routing Nonblocking Networks

The attenuation of light passing through the switch has several components such
as fiber-to-switch and switch-to-fiber coupling loss, propagation loss in the medium,
loss at waveguide bends, loss at the couplers, etc. In a large switch, a substantial
part of this attenuation is directly proportional to the number of couplers that the
optical path passes through. Thus, the connection diameter is used to characterize
the signal loss [68]. Although $B(N, x, p, \alpha)$ built from Banyan-type networks by
horizontal concatenation and/or vertical stacking has connection diameter $O(\log N)$ and
can be strictly nonblocking, finding a plane for each new connection relies on global
information (i.e., the knowledge of other connections), which increases the time for
setting up connections as discussed in previous sections. In this section, we
propose a self-routing strictly nonblocking switching network with $O(\log N)$ connection
diameter.

4.6.1 Connection Capacity of BL(N)

Lemma 10 For any connection set $C$ of $BL(N)$, if no two connections in $C$ share
any modulo-$g$ input group, then the connection paths for $C$ are node conflict-free in
the first $\log g$ stages; if no two connections in $C$ share any modulo-$g$ output group,
then the connection paths for $C$ are node conflict-free in the last $\log g$ stages,
$2 \le g \le 2^n$.

It is easy to verify that Lemma 10 is true according to the topology of $BL(N)$.
For example, in Figure 1.13 in Chapter 1, two connections along paths $P_0$ and $P_1$ do
not share any modulo-4 input group, and thus, there is no node conflict in the first
two stages. But they share the first modulo-8 input group and the sixth modulo-2
output group, and thus, there are node conflicts in stages 2 and 3. By Lemma 10,
the following claim can be derived.

Lemma 11 Given a connection set $C$ of $BL(N)$, if any two connections in $C$ do
not share any modulo-$2^{\lfloor (n+\alpha)/2 \rfloor}$ input group and also do not share any
modulo-$2^{\lfloor (n+\alpha)/2 \rfloor}$ output group, then
(i) for $\alpha = 0$, there is no link conflict in $BL(N)$;
(ii) for $\alpha = 1$, there is no node conflict in $BL(N)$.

Proof: We prove the lemma by considering the following two cases.

(1) $n$ is even.
We have $2^{\lfloor (n+\alpha)/2 \rfloor} = 2^{n/2}$. Since there are no two connections sharing any
modulo-$2^{n/2}$ input and output groups, by Lemma 10, there is no node conflict in the
first $\frac{n}{2}$ and last $\frac{n}{2}$ stages. Since $\frac{n}{2} + \frac{n}{2} = n$, there is no node conflict in
all $n$ stages of $BL(N)$. Since no node conflict in stage $i$ implies no link conflict in
stage $i$, there is neither link conflict nor node conflict in $BL(N)$.

(2) $n$ is odd. There are two subcases.

(2.1) For $\alpha = 0$, we have $2^{\lfloor (n+\alpha)/2 \rfloor} = 2^{\frac{n-1}{2}}$. Since there are no two connections
sharing any modulo-$2^{\frac{n-1}{2}}$ input and output groups, by Lemma 10, there is no node
conflict in the first $\frac{n-1}{2}$ stages, stage 0 to stage $\frac{n-3}{2}$, and the last $\frac{n-1}{2}$ stages,
stage $\frac{n+1}{2}$ to stage $n - 1$. Thus, there is no node conflict in all stages except the
central stage, stage $\frac{n-1}{2}$, of $BL(N)$. Since the output links of stage $\frac{n-3}{2}$ are the
input links of stage $\frac{n-1}{2}$ and the input links of stage $\frac{n+1}{2}$ are the output links of
stage $\frac{n-1}{2}$, there is no link conflict in all stages of $BL(N)$.

(2.2) For $\alpha = 1$, we have $2^{\lfloor (n+\alpha)/2 \rfloor} = 2^{\frac{n+1}{2}}$. By Lemma 10, there is no node
conflict in the first $\frac{n+1}{2}$ and last $\frac{n+1}{2}$ stages. Since $\frac{n+1}{2} + \frac{n+1}{2} > n$, there is no
node conflict in $BL(N)$. □

By Lemma 11, if we only allow one connection to pass through each
modulo-$2^{\lfloor n/2 \rfloor}$ input and output group at any time, then we can route connections in $BL(N)$
without link conflict; if we only allow one connection to pass through each
modulo-$2^{\lfloor (n+1)/2 \rfloor}$ input and output group at any time, then we can route connections in $BL(N)$
without node conflict. The new class of self-routing strictly nonblocking networks
will be built based on this idea.

4.6.2 Constructing $T(N, \alpha)$

In this subsection, we assume that $M = 2^m = N^2/2^{1-\alpha}$ and $g = N/2^{1-\alpha} = 2^{n-1+\alpha}$.

Lemma 12 Given a connection set $C$ of $BL(M)$, if no two connections in $C$ share
any modulo-$g$ input group or any modulo-$g$ output group, then $C$ can be set up
without conflict in $BL(M)$.

Proof: By $M = 2^m = N^2/2^{1-\alpha} = (2^n)^2 \cdot 2^{-1+\alpha} = 2^{2n-1+\alpha}$, we have $m = 2n - 1 + \alpha$.
According to Lemma 11, if any two connections in $C$ do not share any
modulo-$2^{\lfloor (m+\alpha)/2 \rfloor} = 2^{\lfloor (2n-1+2\alpha)/2 \rfloor} = 2^{n-1+\alpha}$ input and output groups at any time, then we can
route the connections of $C$ in $BL(M)$ with the link conflict-free constraint (i.e., $\alpha = 0$)
or with the node conflict-free constraint (i.e., $\alpha = 1$). □

We select the first input in each modulo-$g$ input group of $BL(M)$ as a useful
input of $BL(M)$, and the first output in each modulo-$g$ output group of $BL(M)$ as a
useful output of $BL(M)$. Clearly, $M/g = N$. Thus, restricted to these useful inputs
and outputs, $BL(M)$ can be used as an $N \times N$ self-routing switching network with
the link or node conflict-free constraint, depending on the value of $\alpha$, by Lemma 12. In
the following we show how to construct an $N \times N$ self-routing strictly nonblocking
network, denoted by $T(N, \alpha)$, from $BL(M)$.

We first give some definitions. A link (resp. SE) is called a redundant link
(resp. SE) if its removal will not affect the switching functionality of $BL(M)$ for
establishing connections from $N$ useful inputs to $N$ useful outputs; otherwise it is
called an essential link (resp. SE). $T(N, \alpha)$ is constructed from $BL(M)$ by performing
the following two steps to remove all redundant links and SEs.

Step 1. Because $BL(M)$ has $m = 2n - 1 + \alpha = n + \log g$ stages, the subnetworks
of $BL(M)$ induced by the SEs from stage $n$ to the last stage form a set of $2^n$ $BL(g)$s.
Since each of these $BL(g)$s is connected with exactly one useful output of $BL(M)$,
at most one of any given set of connections from useful inputs to useful outputs is
routed through each $BL(g)$. We replace each of these $BL(g)$s by a $g \times 1$ combiner,
and set the output of this combiner as an output of $T(N, \alpha)$.

Step 2. To complete the construction of $T(N, \alpha)$, we need to remove additional
redundant SEs and links in the first $n$ stages of $BL(M)$. This can be done by proceeding
from stage 0 to stage $n - 1$ as follows. Initially, the $N$ useful inputs are considered to
be connected with $N$ essential links in stage 0. In stage $i$, $0 \le i \le n - 1$, do the
following operations. First, we identify all essential SEs and links: if one of the inputs
of an SE is connected to an essential link, the SE is marked as an essential SE and its two
output links are marked as essential links. Second, we remove all redundant SEs
and links: if a link is not an essential link, it is removed; if both input links of an SE
have been removed, this SE and its two output links are considered redundant and
removed.

Example 7 Figure 4.5 (a)(i) and (b)(i) show $BL(32)$ and $BL(64)$, respectively,
where essential links and SEs are highlighted with dark color and redundant links
and SEs are colored gray. Figure 4.5 (a)(ii) and (b)(ii) show $T(8, 0)$ and $T(8, 1)$
constructed from $BL(32)$ and $BL(64)$, respectively. □

In $BL(M)$, we know that the two outputs of each SE in one stage are connected
with two SEs of the next stage, one in the upper subnetwork and the other in the lower
subnetwork. Thus, the number of essential SEs in stage $i$ ($0 \le i \le n - 1$) equals
$\min\{2^i N, M/2\} = \min\{2^{n+i}, 2^{2n-2+\alpha}\}$. Let $s(N, \alpha)$ denote the number of SEs
in $T(N, \alpha)$. It is easy to verify that there are $2^{n+i}$ essential $1 \times 2$ SEs in stage $i$
($0 \le i \le n - 2$), $2^{2n-2+\alpha}$ essential $(2 - \alpha) \times 2$ SEs in stage $n - 1$, and zero essential
SEs in the remaining stages of $BL(M)$. Therefore, by a simple calculation, the total
number of SEs in $T(N, \alpha)$ is
$$s(N, \alpha) = \sum_{i=0}^{n-2} 2^{n+i} + 2^{2n-2+\alpha} = \frac{3 + \alpha}{4} N^2 - N = \begin{cases} \frac{3N^2}{4} - N, & \text{if } \alpha = 0 \\ N^2 - N, & \text{if } \alpha = 1 \end{cases}$$
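The stage-by-stage count can be checked against the closed form with a few lines of code. This is a small sketch; the function names are hypothetical, and `alpha` stands for the constraint flag (0 for link conflict-free, 1 for node conflict-free).

```python
def se_count(n, alpha):
    """Sum the essential SEs of T(N, alpha) stage by stage, with N = 2**n."""
    stages = [2 ** (n + i) for i in range(n - 1)]   # stages 0 .. n-2
    stages.append(2 ** (2 * n - 2 + alpha))         # stage n-1
    return sum(stages)


def se_closed_form(n, alpha):
    """Closed form (3 + alpha) N^2 / 4 - N, with N = 2**n."""
    N = 2 ** n
    return (3 + alpha) * N * N // 4 - N
```

For $N = 8$ the two functions agree on 40 SEs for $T(8, 0)$ and 56 SEs for $T(8, 1)$.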

In $T(N, \alpha)$, input (resp. output) $i$ corresponds to input (resp. output)
$i'$ of $BL(M)$, where the binary representation of $i'$ is the binary representation of $i$
concatenated with $\log g$ 0s at the end. This means that the first $\log M - \log g = n$ bits
of $i$ and $i'$ are the same. Therefore, the routing process in $T(N, \alpha)$ is the same as
that in $BL(N)$, which is self-routing.

We summarize the above discussions by the following claim.

Theorem 9 $T(N, \alpha)$ is an $N \times N$ self-routing strictly nonblocking network of $\log N$
stages. For $\alpha = 0$, it consists of $\frac{3N^2}{4} - N$ SEs, among which $\frac{N^2}{2} - N$ SEs are of size
$1 \times 2$ and $\frac{N^2}{4}$ SEs are of size $2 \times 2$; for $\alpha = 1$, it consists of $N^2 - N$ SEs, all of size
$1 \times 2$.

[Figure 4.5. Construction of networks: (a) $T(8, 0)$ based on $BL(32)$; (b) $T(8, 1)$ based on $BL(64)$]

In an optical switching network, for practical reasons, the number of
wavelengths used must be small. Clearly, if two connection paths are allowed to pass
through an SE, then at least two wavelengths are required. In general, two
wavelengths are not sufficient for an optical switching network. For example, for an
$N \times N$ crossbar, establishing an identity permutation, in which input $i$
is mapped to output $i$, requires $N$ wavelengths for crosstalk-free routing.
In this aspect, $T(N, \alpha)$ is superior, as indicated in the following claim.

Corollary 2 $T(N, 1)$ is crosstalk-free with one wavelength and $T(N, 0)$ is
crosstalk-free with two wavelengths.

Proof: Since all SEs in $T(N, 1)$ are of size $1 \times 2$, only one connection can
pass through an SE at a time. Thus, one wavelength is sufficient for crosstalk-free
routing in $T(N, 1)$. All SEs in $T(N, 0)$ are of size $1 \times 2$ except the ones in the last
stage. Thus, a total of two wavelengths are sufficient to ensure that the connections
passing through the same SEs use different wavelengths. □

4.6.3 Comparison

Compared with self-routing Banyan-type networks, $T(N, \alpha)$ is strictly nonblocking,
which is promising for high performance switching.

Compared with an $N \times N$ crossbar for photonic switching, $T(N, 1)$ requires
slightly fewer SEs and only one wavelength; $T(N, 0)$ requires far fewer
SEs with two wavelengths available. The difference between the $N \times N$
crossbar and $T(N, \alpha)$ for photonic switching is much more noticeable, as shown in
Table 4.1.

Table 4.1. Comparison of self-routing strictly nonblocking photonic switching networks

Networks      Number of SEs       Diameter     Number of wavelengths
Crossbar      $N^2$               $2N - 1$     $N$
$T(N, 0)$     $3N^2/4 - N$        $\log N$     2
$T(N, 1)$     $N^2 - N$           $\log N$     1

4.7 Summary

One major contribution of this chapter is the design and analysis of parallel routing
algorithms for a class of nonblocking switching networks, $B(N, x, p, \alpha)$s. Although
the assumed parallel machine model is a completely connected multiprocessor system
of $N$ PEs, the proposed algorithms can be transformed to algorithms for more
realistic parallel computing models. The pointer jumping and binary searching, which
dominate the complexity of the proposed algorithms, can be reduced to sorting on
realistic parallel computing structures. It is interesting to note that sorting
can be implemented in a Banyan-type network in $O(\log^2 N)$ time [47]. Thus the
proposed algorithms can set up connections in $B(N, x, p, \alpha)$ with a slow-down factor of
$O(\log^2 N)$ on a Banyan-type network, whose complexity is no larger than one plane
of $B(N, x, p, \alpha)$.

The approach of applying edge-coloring techniques to investigate the capacity
and routability of RNB switching networks has been widely used (refer to [10, 31]).
We extended this approach to SNB by defining strong edge-coloring. For a class of
RNB and SNB banyan-based switching networks obtained by horizontal expansion
and vertical replication, we proposed a unified mathematical formulation for
designing parallel routing algorithms using this approach. For the SNB case, if there is no
existing connection, our routing algorithms for SNB and RNB are the same since we
only need to run the first step of the SNB routing algorithm. But when there are existing
connections, we have to run the second step of the SNB routing algorithm, whose
running time is essentially proportional to $d^*$, which can be as large as $O(\sqrt{N})$, making our
algorithm impractical. An open problem is how to design an efficient polylogarithmic
time routing algorithm for the SNB case.

The results of this chapter have valuable architectural implications for the design
and implementation of future large-scale electronic and optical switching networks.
Scalable nonblocking switching networks tend to have no self-routing capability. For
example, for a nonblocking switching network $B(N, x, p, \alpha)$, though self-routing
capabilities exist in a portion of it, its routing is still computation intensive. Therefore,
for the design of a switching network, in addition to its hardware cost in terms of
the cost of SEs and interconnection links (and wavelengths), we must take the
routing complexity into consideration. Finding low-cost high-speed nonblocking
switching networks remains a great challenge.

CHAPTER 5

PARALLEL CROSSTALK-FREE ROUTING FOR OPTICAL BENES

NETWORKS

5.1 Introduction

Benes networks are rearrangeable nonblocking permutation networks and are among
the most efficient switching architectures in terms of the number of $2 \times 2$ switching
elements (SEs) used. In optical Benes networks, if two I/O connecting paths with
the same (close) wavelength(s) share a common SE, crosstalk will occur. In order to
reduce the crosstalk effect, three approaches, time, space, and wavelength dilations,
have been proposed. In the time dilation approach (e.g., [69, 74, 83, 99]), crosstalk can
be avoided by using the reconfiguration with time division multiplexing
(RTDM) paradigm proposed by C. Qiao et al. in [73]. More specifically, a set of
permutation connections is partitioned into subsets so that the connections in each
subset can be established simultaneously without crosstalk and the subsets can be
used to form a sequence of configurations for the set of connections. Such a subset is
called a crosstalk-free (CF) partial permutation. Since the paths realizing a CF partial
permutation for a given OMIN do not share any SE, the time dilation approach is
also useful for establishing a set of connections that would normally cause conflicts
in blocking OMINs such as Banyan networks [69, 74, 83].

In this chapter, we focus on how to quickly configure an optical Benes
network for realizing an arbitrary (partial) permutation using the time dilation approach.
It has been shown in [99, 100] that for an optical Benes network, a special type
of partial permutation, named semi-permutation, can be realized in one pass and
any permutation can be decomposed into two semi-permutations. However, the
existing permutation decomposition algorithms have $O(N)$ time complexity and the
existing crosstalk-free routing algorithms for an $N \times N$ optical Benes network take
$O(N \log N)$ time, which is slow even for circuit switching in the optical domain.
Using parallel processing techniques, we give a permutation decomposition algorithm
to decompose a partial permutation with $K$ connections in $O(\log K)$ time. By
applying our permutation decomposition algorithm and equitable coloring techniques,
we present a routing algorithm for realizing an arbitrary partial permutation with
$K$ ($\le N$) connections in an $N \times N$ optical Benes network in $O(\log^2 K + \log N)$ time.
In addition, we show that the time dilation approach is the most cost-effective for
optical Benes networks provided that the costs in time, in space and in wavelength
are interchangeable.

The rest of this chapter is organized as follows. In Section 5.2, we present a

logarithmic parallel decomposition algorithm to decompose a (partial) permutation

into two (partial) semi-permutations. Section 5.3 shows a parallel routing algorithm

for realizing semi-permutations in optical Benes networks without crosstalk. In Sec-

tion 5.4, we compare three dilation approaches to avoiding crosstalk in optical Benes

networks. Section 5.5 summarizes this chapter.

5.2 Parallel Permutation Decomposition

In this section, we first give a simple proof to show that any permutation can be
decomposed into two semi-permutations, then present a parallel permutation
decomposition algorithm, and finally extend this algorithm to the decomposition of
an arbitrary partial permutation. The presented decomposition algorithm is based
on finding a 2-edge coloring of a bipartite graph $G$ with $\Delta(G) \le 2$.

5.2.1 Decomposability

Clearly, a permutation cannot be realized in a single pass in an $N \times N$ OMIN
without crosstalk. Hence, we are interested in a type of partial permutation that
can be passed through an OMIN simultaneously without crosstalk. Y. Yang et al. [99]
introduced a concept called semi-permutation, which is a partial permutation that
ensures only one active input in each SE of the first and last stages of an OMIN at
the same time. Formally, we have the following definition.

Definition 1 For any permutation $\pi$ of $\{0, 1, \ldots, N - 1\}$, a partial permutation with
$N/2$ active inputs, $x_0, x_1, \ldots, x_{N/2-1}$, is called a semi-permutation of $\pi$ if it satisfies:
$$\{\lfloor x_0/2 \rfloor, \lfloor x_1/2 \rfloor, \ldots, \lfloor x_{N/2-1}/2 \rfloor\} = \{\lfloor \pi(x_0)/2 \rfloor, \lfloor \pi(x_1)/2 \rfloor, \ldots, \lfloor \pi(x_{N/2-1})/2 \rfloor\} = \{0, 1, \ldots, N/2 - 1\}.$$
□

A partial semi-permutation is a partial permutation such that its input (resp.
output) set is a subset of the input (resp. output) set of a semi-permutation.
Clearly, a semi-permutation is a maximum potential partial permutation that can
be realized in one pass of an $N \times N$ OMIN built with $2 \times 2$ SEs.

For any (partial) permutation $\pi$, we can construct a bipartite graph, named
I/O mapping graph, as follows. The vertex set consists of two parts, $V_1$ and $V_2$, and
each part has $N/2$ vertices corresponding to $I$ and $O$, respectively. That is, a pair
of two inputs (resp. outputs) $2i$ and $2i + 1$ with $i \in \{0, 1, \ldots, N/2 - 1\}$, called dual
inputs (resp. dual outputs), is represented by a vertex in $V_1$ (resp. $V_2$). There is an
edge between vertex $\lfloor i/2 \rfloor$ in $V_1$ and vertex $\lfloor j/2 \rfloor$ in $V_2$ if and only if $j = \pi(i)$. An I/O
mapping graph may contain parallel edges between two vertices. Because there is a
one-to-one correspondence between active inputs and outputs in a permutation and
edges in an I/O mapping graph, we can label each edge by its corresponding input
or output. The following theorem was proved in [99]. We provide a much simpler
proof, which serves as the foundation of our parallel algorithms.

Theorem 10 Any (partial) permutation can be decomposed into two (partial)
semi-permutations.

Proof: Let $\Delta(G)$ be the maximum degree of vertices in the I/O mapping graph $G$.
Clearly, $\Delta(G) \le 2$ since each vertex represents two inputs or two outputs of an OMIN.
It is known that every bipartite graph $G$ is $\Delta(G)$-edge colorable. That is, we can
color the edge set $E(G)$ with $\Delta(G)$ colors so that adjacent edges have different colors
(see a proof in [8]). Thus the I/O mapping graph $G$ constructed from a (partial)
permutation is 2-edge colorable. If we color $G$ with a 2-edge coloring, then two edges
that correspond to dual inputs or dual outputs and are incident at the same vertex of $G$ must
have different colors. Thus, each of the subgraphs induced by the edges of the same
color corresponds to a (partial) semi-permutation. □

Example 8 For the permutation
$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 6 & 0 & 5 & 3 & 2 & 7 & 4 \end{pmatrix}$$
a 2-edge coloring of its corresponding I/O mapping graph is shown in Figure 5.1,
where each edge is labeled by its corresponding input, and solid and dashed edges are
colored with different colors. The solid and dashed edges correspond to two
semi-permutations
$$\pi_1 = \begin{pmatrix} 0 & 3 & 4 & 6 \\ 1 & 5 & 3 & 7 \end{pmatrix} \quad \text{and} \quad \pi_2 = \begin{pmatrix} 1 & 2 & 5 & 7 \\ 6 & 0 & 2 & 4 \end{pmatrix}$$
respectively. Clearly, $\pi = \pi_1 \circ \pi_2$. □

[Figure 5.1. A 2-edge coloring of bipartite graph $G$]

Each connected component in the I/O mapping graph G is a cycle for a permutation, while it is a cycle, a path, or an isolated vertex for a partial permutation. If we interchange the colors of the edges in any component of G, the result is still a 2-edge coloring of G. Thus, the decomposition of a (partial) permutation may not be unique: we can assign one of two dual inputs/outputs to either of the two semi-permutations. In fact, given a (partial) permutation, if the number of connected components (excluding isolated vertices) in its corresponding I/O mapping graph is c, $1 \le c \le N/2$, there are $2^{c-1}$ ways to decompose the permutation into a pair of (partial) semi-permutations.
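
The construction in the proof of Theorem 10 can be sketched sequentially as follows. This is a minimal illustration for full permutations only, not the parallel algorithm of this chapter; the function name `two_color_decompose` and the convention that the dual of an input or output a is `a ^ 1` are assumptions made for the example.

```python
def two_color_decompose(perm):
    """Split a permutation of 0..N-1 (N even) into two semi-permutations.

    Edge i of the I/O mapping graph joins input vertex i // 2 to output
    vertex perm[i] // 2. Every component is an even cycle, which is walked
    while alternating two colors.
    """
    n = len(perm)
    inv = [0] * n
    for i, out in enumerate(perm):
        inv[out] = i
    color = [None] * n
    components = 0
    for start in range(n):
        if color[start] is not None:
            continue
        components += 1
        i = start
        while color[i] is None:
            color[i] = 0
            dual = i ^ 1                 # shares the input vertex with i
            color[dual] = 1
            i = inv[perm[dual] ^ 1]      # shares the output vertex with dual
    parts = ({}, {})
    for i in range(n):
        parts[color[i]][i] = perm[i]
    # Swapping the two colors inside any one component gives another valid
    # decomposition, hence 2^(c-1) unordered pairs.
    return parts[0], parts[1], 2 ** (components - 1)


pi1, pi2, ways = two_color_decompose([1, 6, 0, 5, 3, 2, 7, 4])
print(pi1)
print(pi2)
print(ways)
```

Called on the permutation of Example 8, the two returned mappings can be compared directly with the semi-permutations listed there.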

5.2.2 Decomposing a Permutation into Two Semi-Permutations

We choose to present our parallel algorithms for a completely connected multipro-

cessor system since any algorithm on this parallel computing model can be easily

transformed to algorithms on more realistic multiprocessor systems. A completely

connected multiprocessor system of size N consists of a set of N processor elements

(PEs) connected in such a way that there is a direct connection between every pair

of PEs. The PEs are labeled beginning with 0 and placed as an array according to

their labels in nondecreasing order. We assume that each PE can communicate with

at most one processor during a communication step.


To facilitate the description of our algorithms, we introduce some notation. Let $a_v a_{v-1} \cdots a_1 a_0$ be the binary representation of a. We use $\bar{a}$ to denote the integer that has the binary representation $a_v a_{v-1} \cdots a_1 (1 - a_0)$. We use the operator ":=" to denote an assignment local to a PE or to the control unit, and the operator "←" to denote an assignment requiring some interprocessor communication.

Initially, each $PE_i$ reads $\pi(i)$ from the inputs, assigns value i to m(i), and sets the value of $\pi^{-1}$ in $PE_{\pi(i)}$ to i. The pointer p(i) of $PE_i$ is then set to point to the PE with index $\pi^{-1}(\overline{\pi(\bar{i})})$, which is done in two steps. In the first step, $PE_i$ computes $\bar{i}$ and reads the value of $\pi(\bar{i})$ from the PE with index $\bar{i}$. In the second step, $PE_i$ computes $\overline{\pi(\bar{i})}$ and reads the value of $\pi^{-1}(\overline{\pi(\bar{i})})$ from the PE with index $\overline{\pi(\bar{i})}$. Then, by $\lceil \log(N/2) \rceil$ rounds of pointer jumping [33], each $PE_i$ sets the value of m(i) to the minimum index of the PEs it ever points to. Finally, the parity of m(i) decides which semi-permutation input $I_i$ belongs to; i.e., all inputs whose values of m have the same parity are in the same semi-permutation. The detailed implementation is given in Algorithm 2.

Algorithm 2 A Parallel Decomposition

Input: A permutation
Output: Two semi-permutations

for all $PE_i$, $0 \le i \le N - 1$, do
    m(i) := i;
    $\pi^{-1}(\pi(i))$ ← i;
    p(i) ← $\pi^{-1}(\overline{\pi(\bar{i})})$; /* pointer initialization */
    for t := 1 to $\lceil \log(N/2) \rceil$ do
        m(i) ← min{m(i), m(p(i))}; /* comparison */
        p(i) ← p(p(i)); /* pointer jumping */
    end for
    if m(i) is even then
        $I_i$ is in the first semi-permutation;
    else
        $I_i$ is in the second semi-permutation;
    end if
end for
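
Algorithm 2 can be simulated sequentially to check small cases. The sketch below is an illustration under stated assumptions, not the PE implementation: the lists `m` and `p` stand in for the per-PE registers, each round reads only values from the previous round to mimic synchronous execution, the dual of index a is taken to be `a ^ 1`, and the function name `simulate_algorithm2` is hypothetical.

```python
def simulate_algorithm2(perm):
    """Sequentially simulate the pointer-jumping decomposition.

    perm[i] is the output of input i; len(perm) is assumed to be a power of 2.
    """
    n = len(perm)
    inv = [0] * n
    for i in range(n):
        inv[perm[i]] = i                              # pi^{-1}(pi(i)) <- i
    m = list(range(n))                                # m(i) := i
    p = [inv[perm[i ^ 1] ^ 1] for i in range(n)]      # pointer initialization
    rounds = (n // 2 - 1).bit_length()                # ceil(log2(N/2))
    for _ in range(rounds):
        # Both lists are rebuilt from the old m and p, as in a synchronous step.
        m = [min(m[i], m[p[i]]) for i in range(n)]    # comparison
        p = [p[p[i]] for i in range(n)]               # pointer jumping
    first = {i: perm[i] for i in range(n) if m[i] % 2 == 0}
    second = {i: perm[i] for i in range(n) if m[i] % 2 == 1}
    return first, second


first, second = simulate_algorithm2([1, 6, 0, 5, 3, 2, 7, 4])
print(first)
print(second)
```

For the permutation of Example 8, the two mappings returned here can be compared with the semi-permutations given in that example.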


Theorem 11 Algorithm 2 correctly computes two semi-permutations for any permutation in $O(\log N)$ time on a completely connected multiprocessor system of N PEs.

Proof: After initialization of the N pointers, a set of directed cycles (including loops) is formed. It is easy to see that two dual inputs, as well as two inputs mapped to a pair of dual outputs, are in different directed cycles. Since the length of each directed cycle is at most N/2, after $\lceil \log(N/2) \rceil$ rounds of pointer jumping, each $PE_i$ maintains the minimum index of the inputs in the directed cycle/loop to which $I_i$ belongs. Hence, each $PE_i$ has $m(i) \equiv 0$ or $1 \pmod 2$. Moreover, the smallest input index in a connected component of G is even, and its dual is the smallest index in the other directed cycle formed from that component, so the two minima have different parities. Therefore, two dual inputs/outputs are in different semi-permutations. Clearly, the algorithm takes $O(\log N)$ time since pointer jumping dominates the time complexity. □

Example 9 Consider the permutation in Example 8. After initializing the N pointers, two directed cycles, (0 → 6 → 3) and (1 → 2 → 7), and two loops, 4 and 5, are formed, where each edge is represented by a circle, two dual inputs are connected by a dotted line, and two inputs mapped to two dual outputs are connected by a dashed line. □

Figure 5.2. A decomposition example

5.2.3 Parallel Decomposition Algorithm for Partial Permutations

The decomposition algorithm presented in the last subsection can be generalized

to decompose any partial permutation with K(< N) active inputs into two partial

semi-permutations.


Initially, each $PE_i$ is associated with edge i. Let p(i) be a pointer of $PE_i$, which is initially set to point to the PE with index $\pi^{-1}(\overline{\pi(\bar{i})})$ if $\bar{i}$ is active and $\pi^{-1}(\overline{\pi(\bar{i})})$ exists (i.e., there is an active input j such that $\pi(j) = \overline{\pi(\bar{i})}$), and is set to point to itself otherwise. For a partial permutation with K active inputs, its corresponding I/O mapping graph G is the union of a set of paths and cycles. For cycles, the case is the same as in Algorithm 2. For paths, two directed paths are formed from each path by pointer initialization. Each directed path ends at an edge i, labeled by its corresponding input, such that input $\bar{i}$ is idle or output $\overline{\pi(\bar{i})}$ is idle. By pointer jumping, the two directed paths formed from a path can be colored with two different colors by comparing the indices of their end edges. That is, for an edge corresponding to input i, if the label of the end edge found by i is less than the one found by an input j with $j = \bar{i}$ or $j = \pi^{-1}(\overline{\pi(i)})$, then color i with one color; otherwise color i with the other color. Clearly, the edges corresponding to the vertices in the same directed path are colored with the same color, and two dual inputs and outputs are colored with different colors.

Example 10 Figure 5.3 shows how to decompose a partial permutation into two partial semi-permutations based on a 2-edge coloring. In Figure 5.3 (a), each edge of the 5 paths is labeled by its corresponding input. In Figure 5.3 (b), the directed paths are formed by pointer initialization, where each edge is represented by a circle. The edges represented by solid circles are colored with one color, and the edges represented by dashed circles are colored with the other color. The connections corresponding to the edges with the same color form a partial semi-permutation. □

Since pointer jumping dominates the time complexity of the algorithm and each connected component in G has at most K edges, the extended parallel decomposition can be done in $O(\log K)$ time on a completely connected multiprocessor system of N PEs. In summary, we have the following theorem.

Figure 5.3. Decomposition of a partial permutation based on 2-edge coloring: (a) 5 different types of paths; (b) directed paths formed by pointer initialization and a 2-edge coloring

Theorem 12 For any partial permutation with K active inputs, two partial semi-permutations can be computed in $O(\log K)$ time on a completely connected multiprocessor system of N PEs.

Every (partial) semi-permutation can be passed through the SEs in the first and last stages of an $N \times N$ OMIN without crosstalk at one time. In order to route a semi-permutation in a single pass without crosstalk, we need to ensure that there is only one active input of each SE in every stage of the OMIN. In the next section, we present a fast routing algorithm for realizing a (partial) semi-permutation in an optical Benes network so that no two connections pass through the same SE at the same time.


5.3 Routing a Semi-Permutation in an Optical Benes Network

In this section, we first present a routing algorithm for realizing an arbitrary semi-permutation in an optical Benes network based on our parallel decomposition algorithm, and then improve the time complexity of the routing algorithm using an equitable coloring technique.

5.3.1 A Routing Algorithm Based on Parallel Decomposition

The algorithm for routing a semi-permutation in an optical Benes network is given

as Algorithm 3.

Algorithm 3 A Semi-Permutation Routing Algorithm in Optical Benes Networks

Input: A semi-permutation
Output: A setting of SEs of B(N) without crosstalk

Step 1. If the size of the semi-permutation is 1, then set up B(2) according to the connection request, and exit.

Step 2. Decompose the semi-permutation into 2 parts, named the upper and lower semi-permutations, such that two active inputs/outputs in a pair of dual SEs in the first/last stage are in different parts.

Step 3. Set the SEs in the first and last stages so that the active inputs and outputs in the upper/lower semi-permutation are connected with the upper/lower subnetwork.

Step 4. Recursively call this algorithm in the upper/lower subnetwork with the upper/lower semi-permutation as input.

Theorem 13 For any semi-permutation of an optical B(N), Algorithm 3 correctly sets the SEs of B(N) without crosstalk in $O(\log^2 N)$ time on a completely connected multiprocessor system of N PEs.

Proof: By the topology of B(N), we know that every pair of dual SEs in stage i (resp. $2\log N - 2 - i$), $0 \le i \le \log N - 2$, is connected with two SEs in stage i + 1 (resp. $2\log N - 3 - i$), and these two SEs are in different subnetworks $B(N/2^{i+1})$. To ensure that no crosstalk occurs in any stage of B(N), two active inputs (resp. outputs) belonging to a pair of dual SEs of stage i (resp. $2\log N - 2 - i$) must be connected with SEs in different subnetworks $B(N/2^{i+1})$. This is equivalent to assigning a 2-edge coloring to a bipartite graph G, where the 2 active inputs (resp. outputs) belonging to a pair of dual SEs of stage i (resp. $2\log N - 2 - i$) compose a vertex and each connection corresponds to an edge. Thus, by using the parallel decomposition algorithm recursively, the SEs are set without crosstalk for any given semi-permutation. By Theorem 11, the time complexity of Step 2 in Algorithm 3 is $O(\log N)$. Since there are $2\log N - 1$ stages and every parallel decomposition step decides the setting of the SEs of two stages (i.e., the first and last stages of a subnetwork) in B(N), the time complexity of Algorithm 3 is $O(\log^2 N)$. □
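
The bound can also be obtained by summing the cost per recursion level. The following is a sketch that assumes, by Theorem 11, that the decomposition in a subnetwork $B(N/2^i)$ costs at most $c \log(N/2^i)$ for some constant c (a symbol introduced here for illustration), and that all subnetworks at the same level are handled in parallel:

```latex
T(N) \le \sum_{i=0}^{\log N - 1} c \,\log\frac{N}{2^{i}}
     = c \sum_{i=0}^{\log N - 1} (\log N - i)
     = c \,\frac{\log N \,(\log N + 1)}{2}
     = O(\log^2 N).
```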

5.3.2 The Improved Routing of Partial Semi-Permutation by Equitable Coloring

Since a partial semi-permutation is a subset of a semi-permutation, it can be routed in an optical Benes network in one pass without crosstalk. By applying the extended parallel decomposition in Step 2 of Algorithm 3, the total time for routing any partial permutation with K active inputs in an optical B(N) is $O(\log N \log K)$.

In order to further improve the routing time to $O(\log^2 K + \log N)$, we introduce the concept of equitable edge coloring. A graph G is equitable c-edge colorable if E(G) can be colored with c colors so that adjacent edges are colored with different colors and the difference between the sizes of any two color classes is at most one, where a color class is the subset of E(G) with the same color in the coloring. Clearly, both cycles and paths are 2-edge colorable. For any 2-edge coloring of a path or a cycle, the sizes of the two color classes are equal for a cycle and for an even path (a path with an even number of edges), while they differ by one for an odd path (a path with an odd number of edges). The color with which more than half of the edges in an odd path are colored is called the primary color. Thus, given a partial permutation, if the I/O mapping graph G has x odd paths, we color the paths and cycles in G with two different colors $c_1$ and $c_2$ so that $\lceil x/2 \rceil$ odd paths have $c_1$ as primary color and the remaining $\lfloor x/2 \rfloor$ odd paths have $c_2$ as primary color. These 2-edge colorings of cycles and paths compose an equitable 2-edge coloring of G.
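
The counting argument above can be illustrated with a short sequential sketch. It assumes a hypothetical input format in which each component of G is already given as a list of edge labels in walk order (a path, or a cycle of even length), and it alternates the primary color among the odd paths; it does not model the PE-level steps of the parallel algorithm.

```python
def equitable_two_edge_coloring(components):
    """Color edges with 0/1 so that the two color classes differ by at most one.

    components: list of lists of edge labels, each list in walk order along
    a path or an even cycle of the I/O mapping graph (assumed input format).
    """
    coloring = {}
    odd_paths_seen = 0
    for edges in components:
        first = 0
        if len(edges) % 2 == 1:
            # An odd path starts and ends with its primary color; alternate
            # the primary color from one odd path to the next.
            first = odd_paths_seen % 2
            odd_paths_seen += 1
        for pos, e in enumerate(edges):
            coloring[e] = first if pos % 2 == 0 else 1 - first
    return coloring


comps = [[0, 1, 2], [3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
coloring = equitable_two_edge_coloring(comps)
sizes = [sum(1 for c in coloring.values() if c == k) for k in (0, 1)]
print(sizes)
```

With three odd paths in the sample input, two of them take color 0 as primary and one takes color 1, so the class sizes differ by exactly one.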

To route a partial semi-permutation in an optical B(N) without crosstalk in $O(\log^2 K + \log N)$ time, we need a preprocessing step, and we apply the equitable 2-edge coloring technique in Step 2 of Algorithm 3. The preprocessing links the K PEs corresponding to the K active inputs. It can be done by a parallel prefix sums operation [33], which takes $O(\log N)$ time on a completely connected multiprocessor system of N PEs. In the following, we show how to color the x odd paths of G with 2 colors so that the sizes of the 2 color classes differ by at most 1. It is easy to see that for any odd path, the edge whose dual input is not active will be colored with the primary color. We call such an edge a primary edge. We concatenate all primary edges by a parallel prefix sums operation on the K linked PEs and alternately color the primary edges with the two colors. Thus there are $\lceil x/2 \rceil$ primary edges with one color and $\lfloor x/2 \rfloor$ primary edges with the other color. The edges in an odd path are colored using the primary edge as reference. That is, if an edge e and a primary edge f are in the same directed path, then e and f have the same color; otherwise they have different colors. Therefore, an equitable 2-edge coloring of G is found. Since the operations of pointer jumping and parallel prefix sums dominate the time complexity, an equitable 2-edge coloring of G can be found in $O(\log K)$ time using a completely connected multiprocessor system of N PEs.

Example 11 Figure 5.4 shows how to find an equitable 2-edge coloring. The primary edges are marked as dark lines in Figure 5.4 (a). □

Figure 5.4. An equitable 2-edge coloring of graph: (a) 3 odd paths and primary edges; (b) directed paths formed by pointer initialization and an equitable 2-edge coloring

Using the equitable 2-edge coloring technique, we can decompose a partial permutation into two partial semi-permutations whose sizes differ by at most one. When we route a partial semi-permutation in an optical Benes network, by applying the equitable 2-edge coloring technique in Step 2 of Algorithm 3, the size of the partial permutation entering each subnetwork is reduced by half. Thus after $\log K$ iterations, there is at most one active input entering any subnetwork. Consequently, the time for setting up a partial semi-permutation with K active inputs in an optical B(N) is $O(\log K)$ in each of the first $\log K$ iterations and O(1) in each of the remaining iterations. Therefore, we have the following claim.

Theorem 14 For any partial permutation with K(< N) active inputs of an optical B(N), it can be routed without crosstalk in $O(\log^2 K + \log N)$ time using a completely connected multiprocessor system of N PEs.
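
The bound in Theorem 14 follows by adding the three contributions described above. As a sketch, with constants $c_0$, $c_1$, $c_2$ introduced here only for illustration (for the preprocessing, the first $\log K$ iterations, and the remaining $\log N - \log K$ iterations, respectively):

```latex
T(N, K) \le c_0 \log N
        + \sum_{i=0}^{\log K - 1} c_1 \log K
        + \sum_{i=\log K}^{\log N - 1} c_2
        = c_0 \log N + c_1 \log^2 K + c_2 (\log N - \log K)
        = O(\log^2 K + \log N).
```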

5.4 Comparisons of Three Dilation Approaches for Optical Benes Networks

Three approaches, time dilation, space dilation and wavelength dilation, can be used to avoid crosstalk in OMINs.


In the time dilation approach, given any (partial) permutation, we first use the parallel decomposition algorithm to decompose the (partial) permutation into 2 (partial) semi-permutations, and then use Algorithm 3 twice to route the two (partial) semi-permutations without crosstalk.

In space dilation, a dilated Benes network, denoted as DB(N), consists of 2 copies of B(N), with each pair of corresponding inputs and outputs connected to a $1 \times 2$ SE and a $2 \times 1$ combiner, respectively [42, 57] (see Figure 5.5 for an example).

Figure 5.5. A space dilated Benes network DB(8)

For routing a permutation in a DB(N), we first decompose the permutation into 2 semi-permutations, and then route the two semi-permutations simultaneously, each in one of the two copies of B(N) in DB(N). By Theorems 11-14, we have the following corollary.

Corollary 3 For any (partial) permutation with $K (\le N)$ active inputs of an optical Benes network B(N), it can be routed without crosstalk in $O(\log^2 K + \log N)$ time on a completely connected multiprocessor system of N PEs by either time or space dilation.

By Corollary 3, the time complexity to route a (partial) permutation in an

optical B(N) is the same as the time complexity of the best known parallel routing

algorithms for realizing a (partial) permutation in an electronic B(N) [43, 49, 62].


Compared with the time dilation approach, the space dilation approach uses more than double the hardware, i.e., twice the SEs and links plus splitters and combiners, and more than half of the time to route a permutation, i.e., the time for the decomposition and the routing of one semi-permutation.

In wavelength dilation, if there is a wavelength converter available in each SE, we can convert two input signals with the same wavelength entering the same SE to different wavelengths. Thus, two wavelengths are necessary plus the costs of the wavelength converters. If there is no wavelength converter available, i.e., each connection keeps a single wavelength along its whole path, then two wavelengths are not sufficient. An example is given as follows.

Example 12 Routing the permutation

$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 2 & 1 & 3 \end{pmatrix}$$

in an optical B(4).

In order to route the permutation $\pi$ in B(4), by the topology of B(4), we know that inputs 0 and 1 (outputs 2 and 3) are connected with different subnetworks B(2), which are the two SEs in the second stage of B(4). Since $\pi(1) = 2$ and $\pi(3) = 3$, we know that inputs 1 and 3 must be connected with different SEs in the second stage. Consequently, inputs 0 and 3 must be connected with the same SE in the second stage, which contains only 2 SEs. In order to avoid crosstalk, we must use different wavelengths for connections 0 → 0 and 3 → 3. We also know that the connections 0 → 0 and 1 → 2 must be carried on signals with different wavelengths since they pass through the same SE in the first stage. Thus, connections 3 → 3 and 1 → 2 must have the same wavelength if there are only two available wavelengths. However, the connections 3 → 3 and 1 → 2 pass through the same SE in the last stage of B(4), which will cause crosstalk. In fact, we need 4 wavelengths to route the above permutation $\pi$ in B(4). □


From the above discussion, we know that the time dilation approach is the most cost-effective provided that the costs in both space and wavelength are at least as high as the cost in time.

5.5 Summary

In this chapter, we proposed a fast parallel decomposition algorithm with time complexity $O(\log N)$, which can decompose any permutation of size N into two semi-permutations, ensuring no crosstalk in the SEs of the first and last stages of OMINs. Based on this parallel decomposition, we further presented a fast crosstalk-free parallel routing algorithm, which can set up any permutation in $O(\log^2 N)$ time in an optical B(N). The proposed decomposition algorithm can be generalized to any partial permutation. Using the equitable 2-edge coloring technique, any partial permutation with K(< N) active inputs can be routed in $O(\log^2 K + \log N)$ time in an optical B(N).

In addition, the proposed algorithms, which run on a completely connected multiprocessor system, can be easily translated to algorithms on more realistic multiprocessor systems. For example, the time complexities of our routing algorithms for time and space dilated B(N) depend on the parallel permutation decomposition algorithm. Pointer jumping is the most time-consuming operation in the decomposition algorithm. Each pointer jumping step on a completely connected multiprocessor system can be implemented on a hypercube by a sorting operation, which takes $O(\log^2 N)$ time. Consequently, the decomposition algorithm and the routing algorithms can be implemented in $O(\log^3 N)$ time and in $O(\log^4 N)$ time, respectively, on a hypercube.

CHAPTER 6

PARALLEL ROUTING AND WAVELENGTH ASSIGNMENTS FOR OPTICAL

INTERCONNECTION NETWORKS

6.1 Introduction

Networks that use optical transmission and maintain optical data paths can be used to remove the expensive optic-electro and electro-optic conversions. Electronic parallel processing for controlling such networks is capable, in principle, of meeting future high data rate requirements. A nonblocking space-division-multiplexing network can be strictly nonblocking (SNB) or rearrangeable nonblocking (RNB). In SNB networks, a connection can be established from any idle input to any idle output without disturbing existing connections, while in RNB networks the connection can be established if the rearrangement of existing connections is allowed. With wavelength-division-multiplexing (WDM) technology, the concepts of SNB and RNB in space division switching can be extended to wavelength division switching. Depending on whether wavelengths can be reassigned, this extension results in four combinations: wavelength-rearrangeable space-rearrangeable (WRSR), wavelength-rearrangeable space-strict-sense (WRSS), wavelength-strict-sense space-rearrangeable (WSSR), and wavelength-strict-sense space-strict-sense (WSSS). It has been shown that by using both the wavelength and space multiplexing techniques in a fully dynamic manner, networks can achieve higher bandwidth and higher connectivity [80].

The crosstalk in photonic switching networks adds a new type of blocking, node blocking, also called wavelength conflict, in addition to the link blocking found in electronic switching networks. Clearly, if a photonic switching network is free of wavelength conflict, it must be free of link conflict since connections with different wavelengths can share the same link in such networks.

In order to minimize wavelength conflicts in photonic switching networks, three approaches, space dilation, time dilation and wavelength dilation, have been proposed. Since the connections with neighboring wavelengths do not share any SE, the wavelength dilation approach is also useful for establishing a set of connections that would normally cause link conflicts in blocking space-division-multiplexing OMINs such as Banyan networks.

In this chapter, we focus on the wavelength dilation approach to quickly configuring an OMIN and assigning each connection a wavelength for realizing a permutation without crosstalk. In wavelength dilation, if there are wavelength converters available, we can convert input signals with the same wavelength entering the same SE to different ones. Thus, two wavelengths are necessary plus the costs of the wavelength converters. The use of wavelength converters increases hardware cost and configuration time. If there is no wavelength converter available, i.e., each connection uses a single wavelength, then we need to find a wavelength assignment for the connections plus a setting of SEs so that there is no crosstalk in the OMIN. In this chapter, we assume that no wavelength converter is available in OMINs and ensure by routing that the wavelengths in the same SE are different.

The switch model used in this chapter follows [71, 82]. The OMINs under this switch model can be built using $2 \times 2$ multi-wavelength SEs, in which each input (resp. output) is capable of receiving (resp. transmitting) optical signals of a set of wavelengths and each wavelength is switched independently in SEs [82]. Such a multi-wavelength SE has an independently controllable state, straight or cross as shown in Figure 6.1 (a), for each wavelength. Figure 6.1 (b) shows a signal transmission in a multi-wavelength SE, where the connections for the wavelength $\lambda_2$ in the upper input and the wavelength $\lambda'_2$ in the lower input are in the cross state and all other connections are in the straight state. If an SE can only receive/transmit one wavelength for each input/output, it is called a basic SE. The OMINs considered in this chapter are WRSS Banyan networks and WRSR Benes networks, where the WRSR Benes networks only contain basic SEs.

Figure 6.1. A $2 \times 2$ multi-wavelength SE: (a) two states; (b) signal transmission

For a permutation of an OMIN, if there is a setting of SEs to realize the permutation and a wavelength assignment of connections so that no two connections with the same wavelength share any SE, we call this setting and wavelength assignment a crosstalk-free configuration of the OMIN for the permutation. An algorithm that can find a crosstalk-free configuration for any permutation of an OMIN is called a crosstalk-free routing and wavelength assignment algorithm for the OMIN.

In order to design crosstalk-free routing and wavelength assignment algorithms, we first study the permutation capacity of these OMINs, and then show how to partition a set of connections into subsets so that the connections in each subset can be established simultaneously, and assign a wavelength to each connection so that the connections in different subsets have different wavelengths. By applying graph edge and vertex coloring techniques, we show that our algorithms can route any permutation without crosstalk in $O(\log^2 N)$ time for a WRSS Banyan network using at most $2^{\lfloor (\log N + 1)/2 \rfloor}$ wavelengths, and in $O(\log^3 N)$ time for a WRSR Benes network using at most $2\log N$ wavelengths, on a completely connected multiprocessor system of N PEs. Finally, we show that both routing and wavelength assignment algorithms can be implemented on a hypercube with N/2 PEs in $O(\log^4 N)$ time.


The rest of this chapter is organized as follows. In Section 6.2, a parallel crosstalk-free routing and wavelength assignment algorithm for WRSS Banyan networks is given. In Section 6.3, we develop a parallel crosstalk-free routing and wavelength assignment algorithm for WRSR Benes networks. Section 6.4 shows how to implement our algorithms on a hypercube. Section 6.5 summarizes the chapter.

6.2 Parallel Routing and Wavelength Assignment in WRSS Banyan Networks

The idea behind our crosstalk-free routing and wavelength assignment algorithm for WRSS Banyan networks is as follows. We partition a set of connections into subsets so that the connections in the same subset do not share any SE, and then assign different wavelengths to the connections in different subsets and the same wavelength to the connections in the same subset. Each of these subsets is called a crosstalk-free (CF) subset. Clearly, this wavelength assignment will not cause any crosstalk in SEs. Since BL(N) is a self-routing network, the routing for each connection can be easily done following the self-routing rule. We only need to consider how to partition a set of connections into CF subsets and assign different wavelengths to the connections in different subsets.

By Lemma 4 in Chapter 4, we have the following lemma.

Lemma 13 Given a partial permutation $\pi$ of BL(N), if any two connections in $\pi$ do not share any modulo-$2^{\lfloor (n+1)/2 \rfloor}$ input group and also do not share any modulo-$2^{\lfloor (n+1)/2 \rfloor}$ output group, then $\pi$ can be routed in BL(N) simultaneously without crosstalk.

We assume $g = 2^{\lfloor (n+1)/2 \rfloor}$ in the rest of this section. According to Lemma 13, if we assign different wavelengths to the connections in $\pi$ with sources (resp. destinations) sharing the same modulo-g input (resp. output) group, then we can route $\pi$ in BL(N) without crosstalk. This wavelength assignment problem can be reduced to a $\Delta(G(\pi, g))$-edge coloring of a bipartite graph $G(\pi, g)$. In $G(\pi, g)$, the vertex set consists of two parts, $V_1$ and $V_2$. Each part has N/g vertices, i.e., each modulo-g input (resp. output) group is represented by a vertex in $V_1$ (resp. $V_2$). There is an edge between vertex $\lfloor i/g \rfloor$ in $V_1$ and vertex $\lfloor j/g \rfloor$ in $V_2$ if $j = \pi(i)$. Thus, $G(\pi, g)$ is a bipartite graph with N/g vertices in each of $V_1$ and $V_2$ and K edges, where at most g edges are incident at any vertex, so the maximum degree of $G(\pi, g)$ is at most g. It has been proved that any bipartite graph G has a $\Delta(G)$-edge coloring [8]. Hence, $G(\pi, g)$ has a g-edge coloring since $G(\pi, g)$ is bipartite and $\Delta(G(\pi, g)) \le g$. Thus, if we can find a g-edge coloring of $G(\pi, g)$, then we can assign wavelength i to the connections corresponding to the edges with color i, $0 \le i \le g - 1$.
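
The reduction above can be sketched sequentially as follows. The coloring routine uses the classical alternating-path argument for bipartite graphs, which is a stand-in chosen for illustration and is not the parallel edge-coloring algorithm of Chapter 4; the function names and the dictionary representation of the partial permutation are assumptions.

```python
def edge_color_bipartite(edges, n_left, n_right, num_colors):
    """Properly color the edges of a bipartite multigraph.

    edges: list of (u, v) pairs, u a left vertex and v a right vertex.
    num_colors must be at least the maximum degree.
    """
    left = [[None] * num_colors for _ in range(n_left)]    # left[u][c]: edge index
    right = [[None] * num_colors for _ in range(n_right)]  # right[v][c]: edge index
    color = [None] * len(edges)
    for e, (u, v) in enumerate(edges):
        a = next(c for c in range(num_colors) if left[u][c] is None)
        b = next(c for c in range(num_colors) if right[v][c] is None)
        # Collect the path that starts at v and alternates colors a, b, a, ...
        path, on_right, w, c = [], True, v, a
        while True:
            f = (right if on_right else left)[w][c]
            if f is None:
                break
            path.append(f)
            w = edges[f][0] if on_right else edges[f][1]
            on_right = not on_right
            c = b if c == a else a
        # Swap a and b along the path so that color a becomes free at v.
        for f in path:
            left[edges[f][0]][color[f]] = None
            right[edges[f][1]][color[f]] = None
        for f in path:
            color[f] = b if color[f] == a else a
            left[edges[f][0]][color[f]] = f
            right[edges[f][1]][color[f]] = f
        color[e] = a
        left[u][a] = e
        right[v][a] = e
    return color


def assign_wavelengths(perm, n_ports, g):
    """perm: dict mapping active inputs to outputs; returns input -> wavelength."""
    inputs = sorted(perm)
    # One edge per connection, between its input group and its output group.
    edges = [(i // g, perm[i] // g) for i in inputs]
    colors = edge_color_bipartite(edges, n_ports // g, n_ports // g, g)
    return dict(zip(inputs, colors))


pi = {0: 1, 1: 6, 2: 0, 3: 5, 4: 3, 5: 2, 6: 7, 7: 4}
wavelength = assign_wavelengths(pi, n_ports=8, g=4)
print(wavelength)
```

In the usage example, N = 8 and g = 4, so each of the two input groups and two output groups must see four distinct wavelengths.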

From Section 4.4 of Chapter 4, we know that there is an efficient algorithm for finding a $\Delta(G)$-edge coloring of a bipartite graph G. By Theorem 4, we have the following corollary.

Corollary 4 For any partial permutation $\pi$ with $K (\le N)$ active inputs, a crosstalk-free routing and wavelength assignment of $\pi$ for a WRSS BL(N) can be found in $O(\log N \cdot \log K)$ time using at most $2^{\lfloor (n+1)/2 \rfloor}$ wavelengths on a completely connected multiprocessor system of N PEs.

It is easy to verify that $2^{\lfloor (n+1)/2 \rfloor}$ wavelengths are also necessary for a WRSS BL(N) since there exist permutations with $2^{\lfloor (n+1)/2 \rfloor}$ connections sharing a common SE.

6.3 Parallel Routing and Wavelength Assignment in WRSR Benes Networks

For space-division multiplexing, Benes networks are rearrangeable nonblocking. By [53, 99], we know that each permutation can be decomposed into two crosstalk-free partial permutations so that each CF partial permutation can be routed in an optical Benes network simultaneously. Hence, if we assign the same wavelength to the connections in the same CF partial permutation and assign different wavelengths to the connections in different CF partial permutations, two wavelengths are sufficient for a WRSR B(N) in which SEs may contain non-basic states. (Figure 6.2 shows an example, where different line styles denote different wavelengths.) In the following two subsections, we consider the case in which WRSR Benes networks only contain basic SEs, leading to reduced hardware complexity [71].

6.3.1 Upper Bound for the Number of Wavelengths

In order to find an upper bound for the number of wavelengths needed for crosstalk-free routing, we need to consider routing a permutation in an OMIN. We model the wavelength assignment for a permutation in an OMIN as the vertex coloring of a graph $G_\omega$, where the vertex set $V(G_\omega) = \{\text{connections}\}$ and the edge set $E(G_\omega) = \{\{u, v\} \mid \text{two connections } u \text{ and } v \text{ conflict with each other}\}$. We call $G_\omega$ a wavelength conflict graph. Although finding the minimum number of wavelengths and assigning the wavelengths to the connections are equivalent to finding the minimum number of colors and assigning the colors to the vertices, respectively, which are both NP-complete for general graphs, we can find an upper bound for the number of wavelengths needed for realizing any permutation in WRSR Benes networks.

Theorem 15 For any permutation of a WRSR B(N),

$$\omega \le \begin{cases} 2\log N, & \text{if } N \le 4 \\ 2\log N - 1, & \text{otherwise} \end{cases}$$

where $\omega$ is the number of wavelengths needed for the crosstalk-free routing of a permutation in B(N).

Proof: Each connection conflicts with at most $2\log N - 1$ connections since it passes through a total of $2\log N - 1$ basic SEs. Thus $\Delta(G_\omega) \le 2\log N - 1$. By Brooks' theorem (see a proof in [8]), if $G_\omega$ is neither a complete graph nor an odd cycle, then we need at most $\Delta(G_\omega)$ colors to color $V(G_\omega)$ such that any two adjacent vertices have different colors; otherwise $\Delta(G_\omega) + 1$ colors are sufficient. Clearly, for any permutation of an OMIN with $N > \Delta(G_\omega) + 1$, $G_\omega$ is neither a complete graph nor an odd cycle since $\Delta(G_\omega) < N - 1$ and N is even. Therefore, the theorem is true. □

The following example shows that 4 wavelengths are necessary for any crosstalk-free routing of a permutation in B(4) that only contains basic SEs.

Example 13 Routing the permutation

$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 2 & 1 & 3 \end{pmatrix}$$

in an optical B(4).

By the topology of B(4), it is easy to verify that each connection conflicts with all other three connections, and thus, 4 wavelengths are necessary for routing this permutation in B(4) without crosstalk. A wavelength assignment for $\pi$ is shown in Figure 6.2 (a). □
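
The claim in Example 13 can be checked by brute force. The sketch below assumes the usual three-stage layout of B(4) (first-stage SE $\lfloor i/2 \rfloor$ for input i, two middle SEs, last-stage SE $\lfloor j/2 \rfloor$ for output j) and assumes that, with basic SEs, two connections sharing a first-stage or last-stage SE must pass through different middle SEs; the function name `conflict_sets_b4` is hypothetical.

```python
from itertools import combinations, product


def conflict_sets_b4(perm):
    """Return the set of conflicting connection pairs for every feasible
    routing of a 4-input permutation in B(4) built from basic SEs."""
    pairs = list(combinations(range(4), 2))

    def share_outer_se(a, b):
        # Same first-stage SE or same last-stage SE.
        return a // 2 == b // 2 or perm[a] // 2 == perm[b] // 2

    results = []
    for middle in product((0, 1), repeat=4):   # middle[i]: middle SE used by input i
        feasible = all(middle[a] != middle[b]
                       for a, b in pairs if share_outer_se(a, b))
        if not feasible:
            continue
        conflicts = {(a, b) for a, b in pairs
                     if share_outer_se(a, b) or middle[a] == middle[b]}
        results.append(conflicts)
    return results


routings = conflict_sets_b4([0, 2, 1, 3])
print(len(routings), [len(c) for c in routings])
```

If every feasible routing yields all six possible pairs, the wavelength conflict graph is the complete graph on four vertices and four wavelengths are required, which is what the example asserts.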

The simple proof of an upper bound on the number of required wavelengths

as in Theorem 15 does not directly lead to a wavelength assignment algorithm. In

the next subsection, we utilize the properties of our permutation decomposition and

the structure of Benes network to obtain a fast parallel crosstalk-free routing and

wavelength assignment algorithm for a WRSR B(N) using no more than 2 logN

wavelengths.

6.3.2 Routing and Wavelength Assignment Algorithm

Our routing and wavelength assignment algorithm uses the permutation decomposition algorithm of [53] as a subalgorithm and a vertex coloring technique similar to that of [21]. Conceptually, this algorithm has $\log N$ iterations. In each iteration i, if $0 \le i < \log N - 1$, the algorithm decides the setting of SEs in stage i and stage $2\log N - 2 - i$ and uses at most $2(i+1)+1$ wavelengths to ensure that there is no wavelength conflict in stage j for any $j \in \{0, \cdots, i\} \cup \{2\log N - 2 - i, \cdots, 2\log N - 2\}$; if $i = \log N - 1$, the algorithm decides the setting of SEs in stage $\log N - 1$ and uses at most $2\log N$ wavelengths to ensure that there is no wavelength conflict in B(N).

Figure 6.2. A crosstalk-free routing and wavelength assignment for B(4): (a) a WRSR B(4) contains only basic SEs; (b) a WRSR B(4) contains non-basic SEs

We define a wavelength class as the set of connections assigned the same wavelength. A wavelength $\lambda$ is called a free wavelength for a connection $c$ if $\lambda$ is not assigned to any connection conflicting with $c$.

Each $PE_i$ is associated with connection $i$, and maintains one variable $\lambda(i)$ and two arrays $C_i$ and $W_i$, $0 \le i \le N - 1$. For any $0 \le i \le N - 1$, $C_i$ consists of $2\log N - 1$ entries $C_i[j]$, $0 \le j \le 2\log N - 2$, and $W_i$ consists of $2\log N$ entries $W_i[k]$, $0 \le k \le 2\log N - 1$. $\lambda(i)$, $C_i[j]$, and $W_i[k]$ are used to record the assigned wavelength, the new conflicting connections generated in iteration $\lfloor j/2 \rfloor$, and the number of conflicting connections with wavelength $k$, respectively, for connection $i$. We call $C_i$ and $W_i$ the connection conflict array and the wavelength conflict array of connection $i$, respectively. The other variables are all working variables. Initially, let $\lambda(i) := 0$, $C_i[j] := \infty$, and $W_i[k] := 0$, for $i \in \{0, \dots, N-1\}$, $j \in \{0, \dots, 2\log N - 2\}$, and $k \in \{0, \dots, 2\log N - 1\}$, respectively. We use the operator ":=" to denote an assignment local to a PE or to the control unit, and the operator "$\leftarrow$" to denote an assignment requiring some interprocessor communication. In our parallel routing and wavelength assignment algorithm, each iteration $i$ consists of the following steps:

Step 1-Permutation Decomposition: decompose a (partial) permutation of each subnetwork $B(N/2^i)$ into two parts, named the upper and lower partial permutations, such that two active inputs (resp. outputs) in an SE in the first (resp. last) stage of $B(N/2^i)$ are in different parts.

Step 2-Setting SEs: set the SEs in the first and last stages of each $B(N/2^i)$ in such a way that (i) if $i \ne \log N - 1$, the active inputs and outputs in the upper (resp. lower) partial permutation are connected with an upper (resp. lower) subnetwork $B(N/2^{i+1})$; (ii) if $i = \log N - 1$, each active input is connected with its mapped output.

The above two steps decide the routing for the given permutation. The following steps are used to find a wavelength assignment for the routing solution. For all $PE_c$, $0 \le c \le N - 1$, do in parallel:

Step 3-Recording Conflicting Connections: (i) if there is a connection $c'$ such that $c$ and $c'$ pass through the same SE in stage $i$ and $c' \ne C_c[j]$ for all $0 \le j < 2i$, then $C_c[2i] := c'$; (ii) if $i \ne \log N - 1$ and there is a connection $c''$ such that $c$ and $c''$ pass through the same SE in stage $2\log N - 2 - i$ and $c'' \ne C_c[j]$ for all $0 \le j < 2i + 1$, then $C_c[2i+1] := c''$.

Step 4-Reassigning Wavelengths: if connection $c$ is in a lower partial permutation, $\lambda'(c) := \lambda(c)$ and $\lambda(c) := \lambda(c) + (2i+1)$.


Step 5-Updating Conflicting Wavelengths: update wavelength conflicts by (i) adding new conflicts and (ii) updating existing conflicts, where (ii) consists of two substeps: (ii-1) clearing old wavelengths and (ii-2) adding updated wavelengths. The detailed implementation of this step is given in Algorithm 4.

Algorithm 4 Updating Conflicting Wavelengths

if $i \ne \log N - 1$, $j' := 2i + 1$; otherwise, $j' := 2i$;
for all $PE_c$, $0 \le c \le N - 1$, do
    $t(c) := \infty$;
    for $j = 2i$ to $j'$ do
        if $C_c[j] \ne \infty$ and $\lambda(c) \le 2\log N - 1$ then
            $t(C_c[j]) \leftarrow \lambda(c)$;
        end if
        if $t(c) \ne \infty$ then
            $W_c[t(c)] := W_c[t(c)] + 1$; /* (i): adding new conflicts */
            $t(c) := \infty$;
        end if
    end for
    if connection $c$ is in a lower partial permutation and $i \ne 0$ then
        for $j = 0$ to $2i - 1$ do
            if $C_c[j] \ne \infty$ then
                $t(C_c[j]) \leftarrow \lambda'(c)$;
            end if
            if $t(c) \ne \infty$ then
                $W_c[t(c)] := W_c[t(c)] - 1$; /* (ii-1): clearing old wavelengths */
                $t(c) := \infty$;
            end if
            if $C_c[j] \ne \infty$ and $\lambda(c) \le 2\log N - 1$ then
                $t(C_c[j]) \leftarrow \lambda(c)$;
            end if
            if $t(c) \ne \infty$ then
                $W_c[t(c)] := W_c[t(c)] + 1$; /* (ii-2): adding updated wavelengths */
                $t(c) := \infty$;
            end if
        end for
    end if
end for
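The net effect of Algorithm 4 on the wavelength conflict arrays can be sketched sequentially in Python. This is a simulation under stated assumptions, not the parallel implementation: the "$\leftarrow$" messages are replaced by direct updates of the receiver's array, it is assumed that a receiver applies the update whether or not it is itself in a lower partial permutation, and the names `update_conflicting_wavelengths`, `lam`, `lam_old` and `in_lower` are introduced here for illustration only.

```python
INF = float("inf")  # marks an unused entry of a connection conflict array


def update_conflicting_wavelengths(i, log_n, lam, lam_old, C, W, in_lower):
    """Sequentially apply the updates of Algorithm 4 for iteration i.

    lam[c], lam_old[c]: wavelength of connection c after / before Step 4.
    C[c][j]: connection conflict array; W[c][k]: wavelength conflict array.
    in_lower[c]: True if c is in a lower partial permutation in this iteration.
    """
    max_w = 2 * log_n - 1                                  # largest index of W[c]
    j_last = 2 * i + 1 if i != log_n - 1 else 2 * i        # j' in Algorithm 4
    for c in range(len(lam)):
        # (i) conflicts recorded in this iteration
        for j in range(2 * i, j_last + 1):
            d = C[c][j]
            if d != INF and lam[c] <= max_w:
                W[d][lam[c]] += 1
        # (ii) conflicts from earlier iterations, only if c changed wavelength
        if in_lower[c] and i != 0:
            for j in range(2 * i):
                d = C[c][j]
                if d == INF:
                    continue
                W[d][lam_old[c]] -= 1                      # (ii-1) clear old wavelength
                if lam[c] <= max_w:
                    W[d][lam[c]] += 1                      # (ii-2) add updated wavelength
    return W


# Iteration 0 of a hypothetical B(4) instance (log N = 2): connections 1 and 2 are
# in the lower partial permutation and were moved from wavelength 0 to 1 by Step 4.
C = [[1, 2, INF], [0, 3, INF], [3, 0, INF], [2, 1, INF]]
W = [[0] * 4 for _ in range(4)]
update_conflicting_wavelengths(0, 2, [0, 1, 1, 0], [0, 0, 0, 0], C, W,
                               [False, True, True, False])
```

After the call, `W[0]` records two conflicting connections on wavelength 1, and `W[1]` records two on wavelength 0.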

By the above five steps, it is easy to see that the wavelength assignment in each iteration will not result in any conflict in the SEs that have been set up so far. However, we can reduce the number of wavelengths by reassigning new wavelengths in $\{0, \dots, 2(i+1)\}$ to the connections with wavelengths in $\{2(i+1)+1, \dots, 2(2i+1)-1 = 4i+1\}$ without resulting in any wavelength conflict. (The correctness of this reassignment of wavelengths will be proved in Lemma 14.) This is done as follows: for $\lambda^{*} = 2(i+1)+1$ to $4i+1$, if $\lambda(c) = \lambda^{*}$, then perform the following two steps:

Step 6-Adjusting Wavelengths: find a free wavelength $j \in \{0, 1, \dots, j'+1\}$ such that $W_c[j] = 0$ by checking the values in $\{W_c[0], \dots, W_c[j'+1]\}$, and set $\lambda'(c) := \lambda(c)$ and $\lambda(c) := j$. (The value of $j'$ in this step and the next step is the same as that in Algorithm 4.)

Step 7-Updating Conflicting Wavelengths: for $k = 0$ to $j'$, do (i) if $C_c[k] \ne \infty$ and $\lambda'(c) \le 2\log N - 1$, then decrease $W_{C_c[k]}[\lambda'(c)]$ by 1; and (ii) if $C_c[k] \ne \infty$, then increase $W_{C_c[k]}[\lambda(c)]$ by 1. (The detailed implementation is similar to Algorithm 4.)
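Steps 6 and 7 for a single connection can be sketched as follows. This is a minimal sequential illustration; the function name `adjust_wavelength` and the choice of the smallest free index are assumptions made here (the text only requires some free wavelength), and the neighbors' arrays are updated directly instead of by interprocessor communication.

```python
INF = float("inf")  # marks an unused entry of a connection conflict array


def adjust_wavelength(c, j_last, log_n, lam, C, W):
    """Apply Step 6 and Step 7 to connection c and return its new wavelength.

    j_last plays the role of j' from Algorithm 4.
    """
    # Step 6: a free wavelength is an index j in 0..j'+1 with W[c][j] == 0.
    new = next(j for j in range(j_last + 2) if W[c][j] == 0)
    old, lam[c] = lam[c], new
    # Step 7: update the wavelength conflict arrays of all recorded conflicts.
    for k in range(j_last + 1):
        d = C[c][k]
        if d == INF:
            continue
        if old <= 2 * log_n - 1:      # old wavelength was recorded only if in range
            W[d][old] -= 1
        W[d][new] += 1
    return new


# Hypothetical state in the last iteration of B(8) (log N = 3, j' = 4):
# connection 0 holds wavelength 8 and conflicts with connections 1 and 2.
lam = [8, 0, 2]
C = [[1, 2, INF, INF, INF], [0, INF, INF, INF, INF], [0, INF, INF, INF, INF]]
W = [[1, 0, 1, 0, 0, 0], [0] * 6, [0] * 6]
new_wavelength = adjust_wavelength(0, 4, 3, lam, C, W)
```

Here wavelengths 0 and 2 are taken by conflicting connections, so connection 0 moves to wavelength 1 and both neighbors record the change.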

Lemma 14 After iteration $i$, $0 \le i \le \log N - 1$, of our parallel routing and wavelength assignment algorithm, there is no wavelength conflict in stage $j$, for any $j \in \{0, \dots, i\} \cup \{2\log N - 2 - i, \dots, 2\log N - 2\}$, and at most $\omega_i$ wavelengths are used, where

$$\omega_i \le \begin{cases} 2(i+1), & \text{if } i = 0 \text{ or } i = \log N - 1, \\ 2(i+1)+1, & \text{otherwise.} \end{cases}$$

Proof: The proof is by induction on the iteration $i$. If $i = 0$, the claim is true since two connections passing through the same SE in the first and last stages are assigned different wavelengths and $\omega_0 = 2$. Now assume that the claim is true for any $i < k \le \log N - 1$. In iteration $k$, by the induction hypothesis, there is no wavelength conflict in stage $j$, for any $j \in \{0, \dots, k-1\} \cup \{2\log N - 1 - k, \dots, 2\log N - 2\}$, using $\omega_{k-1}$ wavelengths. By Step 4, two connections passing through the same SE in stage $k$ and stage $2\log N - 2 - k$ are assigned different wavelengths using $2 \cdot \omega_{k-1}$ wavelengths. Hence, there is no wavelength conflict in stage $j$ for any $j \in \{0, \dots, k\} \cup \{2\log N - 2 - k, \dots, 2\log N - 2\}$, using $2 \cdot \omega_{k-1}$ wavelengths. In the following, we show that $2 \cdot \omega_{k-1}$ wavelengths are more than necessary when $2 \cdot \omega_{k-1} > 2(k+1)+1$ and $k \ne \log N - 1$, or when $2 \cdot \omega_{k-1} > 2\log N$ and $k = \log N - 1$. For iteration $k$, each connection conflicts with at most $2(k+1)$ connections if $k \ne \log N - 1$ and with at most $2\log N - 1$ connections if $k = \log N - 1$. This is because for iteration $j$, if $j \le k < \log N - 1$, we need to consider wavelength conflicts in two stages, stages $j$ and $2\log N - 2 - j$; if $j = k = \log N - 1$, we only need to consider wavelength conflicts in stage $\log N - 1$ since stage $j$ and stage $2\log N - 2 - j$ are the same. Thus, in Step 6, a free wavelength of index no greater than $2(k+1)$ for $k < \log N - 1$, and no greater than $2\log N - 1$ for $k = \log N - 1$, can always be found. Furthermore, the connections in the same wavelength class have no wavelength conflict with one another, so we can adjust the wavelengths of these connections at the same time without creating any new conflict. □
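The bound of Lemma 14 is easy to tabulate. The short Python sketch below (the name `omega_bound` is an assumption introduced here) evaluates the bound per iteration and illustrates that the last iteration always yields $2\log N$.

```python
def omega_bound(i, log_n):
    """Upper bound on the number of wavelengths in use after iteration i (Lemma 14)."""
    if i == 0 or i == log_n - 1:
        return 2 * (i + 1)
    return 2 * (i + 1) + 1


# Bounds for the three iterations of B(8), where log N = 3.
bounds = [omega_bound(i, 3) for i in range(3)]
```

For $B(8)$ the bounds are 2, 5 and 6 wavelengths after iterations 0, 1 and 2, respectively.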

Theorem 16 For any (partial) permutation, a routing and wavelength assignment for a WRSR $B(N)$ can be found in $O(\log^3 N)$ time using at most $2\log N$ wavelengths on a completely connected multiprocessor system of $N$ PEs.

Proof: By the recursive structure of $B(N)$ and by applying our permutation decomposition algorithm recursively, we can find a setting of SEs in $B(N)$ so that any permutation can be realized. By Lemma 14, we know that the wavelength assignment assures no wavelength conflict for the routing solution. Now, we analyze the time complexity. It is easy to see that in each iteration, Steps 2 and 4 take $O(1)$ time and each of the other steps takes $O(\log N)$ time. Iteration $i$ has at most $\omega_{i-1}$ $(\le 2i+1)$ wavelength classes to be adjusted, and thus Steps 6 and 7 in iteration $i$ are executed at most $\omega_{i-1}$ $(\le 2i+1) = O(\log N)$ times. Since there are $\log N$ iterations, the total time complexity of our routing and wavelength assignment algorithm is $O(\log^3 N)$. □
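The tally behind this bound can be written out explicitly. The following is a sketch under the per-step costs stated in the proof; the constant $c$ is introduced here only for bookkeeping, and the number of wavelength classes adjusted in iteration $i$ is bounded by $2i+1$ (for $i = 0$ this term is simply taken as at most 1).

```latex
\begin{align*}
T(N) &\le \sum_{i=0}^{\log N - 1} \Bigl( \underbrace{c \log N}_{\text{Steps 1--5}}
          + \underbrace{(2i+1)\, c \log N}_{\text{Steps 6--7}} \Bigr) \\
     &= c \log N \sum_{i=0}^{\log N - 1} (2i + 2) \\
     &= c \log N \cdot \log N \,(\log N + 1) \\
     &= O(\log^3 N).
\end{align*}
```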

Example 14 Figure 6.3 shows the process for routing the permutation

$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 0 & 2 & 1 & 4 & 3 & 7 & 5 & 6 \end{pmatrix}$$

in a WRSR $B(8)$.

Figure 6.3. Routing a permutation in WRSR $B(8)$: (a) finding a wavelength assignment (Steps 1 to 7 of iterations 0, 1 and 2, with the arrays $C_i$ and $W_i$ and the wavelengths $\lambda(i)$ of every connection $i$); (b) crosstalk-free routing in $B(8)$

In iteration $i$, by Step 1, each (partial) permutation is divided into one upper partial permutation and one lower partial permutation by applying the parallel decomposition algorithm. Step 2 sets up the SEs in stage $i$ and stage $2\log N - 2 - i$ according to the decomposition. That is, if a connection is in the upper partial permutation, it is connected with the upper subnetwork; otherwise, it is connected with the lower subnetwork. In Step 3, if two connections conflict with each other, possibly in two SEs, they record each other in their respective connection conflict arrays only once. Step 4 assigns to each connection in the lower partial permutations a new wavelength that is $2i+1$ larger than the original one (an offset of at least $\omega_{i-1}$ for $i \ge 1$), so that there is no wavelength conflict in any SE that has been set up so far. After the wavelength reassignment, in Step 5, each connection $c$ updates $W_d$ for each connection $d$ in $C_c$ by doing the following: (i) if $d$ is a new conflicting connection of $c$ generated in this iteration, the entry of $W_d$ indexed by $c$'s updated wavelength is increased by 1; and (ii) if $d$ is an existing conflicting connection of $c$, the entry of $W_d$ indexed by $c$'s original wavelength is decreased by 1 and the entry of $W_d$ indexed by $c$'s updated wavelength is increased by 1. By Lemma 14, we know that the number of wavelengths needed is no more than $\omega_i$. Hence, any wavelength greater than $\omega_i$ can be adjusted to a wavelength with label less than or equal to $\omega_i$, which is done by Step 6. Step 6 finds a free wavelength by looking at the wavelength conflict array and assigns the index of an entry with value 0 as the new wavelength to each adjusted connection. Step 6 is immediately followed by Step 7, which updates the wavelength conflict arrays for each adjusted connection so that every connection always maintains up-to-date wavelength conflict information. Figure 6.3 (a) shows how our routing and wavelength assignment algorithm works step by step and also shows the corresponding wavelength conflict graph $G_\omega$ generated in each iteration, where every connection is represented by a circle and labeled by its corresponding input. In $G_\omega$, conflicting connections are connected by edges and the assigned wavelengths in each iteration are shown beside the circles. Figure 6.3 (b) shows the final routing and wavelength assignment for $\pi$ in $B(8)$, where 6 wavelengths are used. □
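The decomposition constraint of Step 1 can be checked on this example with a short sequential sketch. It is not the parallel decomposition algorithm of [53]; it is a plain breadth-first 2-coloring, used here as a stand-in, and the name `decompose` and the encoding of the permutation as a Python list are assumptions. Two inputs are forced into different parts if they share a first-stage SE (inputs $2s$ and $2s+1$) or if their outputs share a last-stage SE.

```python
from collections import deque


def decompose(perm):
    """Split a (partial) permutation into two parts labelled 0 and 1.

    perm[i] is the output of input i, or None if input i is idle.
    len(perm) is assumed to be even. Returns part[i] in {0, 1}, or None for idle inputs.
    """
    inv = {o: i for i, o in enumerate(perm) if o is not None}

    def neighbours(i):
        dual_input = i ^ 1                      # other input of the same first-stage SE
        if perm[dual_input] is not None:
            yield dual_input
        j = inv.get(perm[i] ^ 1)                # input routed to the dual output
        if j is not None:
            yield j

    part = [None] * len(perm)
    for start in range(len(perm)):
        if perm[start] is None or part[start] is not None:
            continue
        part[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours(u):
                if part[v] is None:
                    part[v] = 1 - part[u]       # neighbours go to the other part
                    queue.append(v)
    return part


# The permutation of Example 14.
part = decompose([0, 2, 1, 4, 3, 7, 5, 6])
```

For this permutation the constraint graph is a single cycle through all eight inputs, so the split is unique up to swapping the two labels: inputs 0, 3, 4, 7 form one part and inputs 1, 2, 5, 6 the other, which agrees with the partial permutations shown for iteration 0 in Figure 6.3 (a).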

6.4 Implementation on Realistic Multiprocessor Systems

The presented algorithms, which run on a completely connected multiprocessor system, can be easily transformed into algorithms for more realistic multiprocessor systems. As an example, in this section we show how to implement our algorithms on a hypercube of $N/2$ PEs such that any (partial) permutation can be routed without crosstalk in a WRSS $BL(N)$ and a WRSR $B(N)$ in $O(\log^4 N)$ time.

In our presentation, the Benes network $B(N)$ is the back-to-back concatenation of two $BL(N)$'s. Butterfly networks are in the family of the hypercube, as discussed in Subsection 1.3.1 of Chapter 1. Since each PE communicates with at most one other PE in every communication step of our algorithms, in the following we show how to implement one communication step of a completely connected multiprocessor system of $N$ PEs by a set of one-to-one communications on a hypercube $H(N/2)$, in which each PE is responsible for a pair of connections $i$ and $\bar{i}$.

The time complexity of our routing and wavelength assignment algorithm for a WRSS $BL(N)$ depends on the edge coloring algorithm, which can be implemented in $O(\log^3 N)$ time on $H(N/2)$ [54]. Thus, the routing and wavelength assignment algorithm for a WRSS $BL(N)$ takes $O(\log^4 N)$ time on $H(N/2)$.

Considering our routing and wavelength assignment algorithm for a WRSR $B(N)$, we can see that the total time for routing on $H(N/2)$ depends only on the decomposition algorithm [53], which can be implemented in $O(\log^3 N)$ time on $H(N/2)$ since each pointer jumping step on a completely connected multiprocessor system can be implemented on $H(N/2)$ by a sorting operation, which takes $O(\log^2 N)$ time. Consequently, the routing on $H(N/2)$ takes $O(\log^4 N)$ time. For wavelength assignment, communications among PEs only occur in Step 5 and Step 7, in which $PE_c$ needs to talk to $PE_d$ if $d$ is recorded in $C_c$ (see the "$\leftarrow$" operations in Algorithm 4). Fortunately, all conflicting connections of $c$ are recorded in the connection conflict array $C_c$ in the order of the SEs through which $c$ passes from both sides, i.e., from a pair of outside stages $i$ and $2\log N - 2 - i$ towards the center stage, stage $\log N - 1$. Thus, these conflicting connections can be located using this ordering via the interstage connections in $B(N)$. Since the interstage interconnection pattern between stage $i$ (resp. $2\log N - 2 - i$) and stage $i+1$ (resp. $2\log N - 3 - i$) of $B(N)$ corresponds to the $(\log\frac{N}{2} - i)$-dimension edges of $H(N/2)$, the communication ordering defined by the connection conflict arrays directly corresponds to a classic hypercube communication technique called dimension ordering. Thus, the total time for wavelength assignment on $H(N/2)$ remains unchanged. Therefore, when our routing and wavelength assignment algorithm for a WRSR $B(N)$ is implemented on $H(N/2)$, it has a slowdown factor of $O(\log N)$ and its time complexity is $O(\log^4 N)$.
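Dimension ordering itself is simple to illustrate. The sketch below shows the generic technique only (a message crosses hypercube dimensions in one fixed order); it is not the specific mapping of conflict arrays onto $H(N/2)$ described above, and the function name `dimension_order_path` is an assumption.

```python
def dimension_order_path(src, dst, dims):
    """Route from node src to node dst on a hypercube.

    Differing address bits are corrected in the fixed dimension order given by
    dims, which must list every dimension in which src and dst may differ.
    """
    path = [src]
    node = src
    for d in dims:
        if ((node ^ dst) >> d) & 1:     # addresses differ in dimension d
            node ^= 1 << d              # cross the dimension-d edge
            path.append(node)
    return path


# On a 3-dimensional hypercube, go from node 000 to node 101, highest dimension first.
path = dimension_order_path(0b000, 0b101, dims=[2, 1, 0])
```

The resulting path visits nodes 000, 100 and 101; because every message uses the dimensions in the same order, all messages that need a given dimension cross it in the same communication step.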

6.5 Summary

In this chapter, we studied the crosstalk problem in OMINs using the wavelength dilation approach. We proposed parallel routing and wavelength assignment algorithms to route a partial permutation in optical WRSS Banyan networks and WRSR Benes networks so that there is no crosstalk in these networks. An arbitrary partial permutation can be routed without crosstalk in a WRSS $BL(N)$ in $O(\log^2 N)$ time using at most $2^{\lfloor (\log N + 1)/2 \rfloor}$ wavelengths, and in a WRSR $B(N)$ with only basic SEs in $O(\log^3 N)$ time using at most $2\log N$ wavelengths, on a completely connected multiprocessor system with $N$ PEs. The proposed algorithms, which run on a completely connected multiprocessor system, can be easily transformed into algorithms for more realistic multiprocessor systems. For example, our routing and wavelength assignment algorithms for a WRSS $BL(N)$ and a WRSR $B(N)$ take $O(\log^4 N)$ time on a hypercube with $N/2$ PEs.
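To give a feeling for the two wavelength bounds, the following Python sketch tabulates them for several network sizes. The function names are assumptions introduced here, and the two bounds apply to different networks (a WRSS $BL(N)$ and a WRSR $B(N)$), so the table is illustrative rather than a like-for-like comparison.

```python
def wrss_banyan_bound(log_n):
    """2^floor((log N + 1) / 2): wavelength bound for a WRSS BL(N)."""
    return 2 ** ((log_n + 1) // 2)


def wrsr_benes_bound(log_n):
    """2 log N: wavelength bound for a WRSR B(N) with only basic SEs."""
    return 2 * log_n


# Rows of (N, bound for WRSS BL(N), bound for WRSR B(N)) for N = 4, 8, ..., 1024.
table = [(2 ** k, wrss_banyan_bound(k), wrsr_benes_bound(k)) for k in range(2, 11)]
```

The first bound grows like $\sqrt{N}$ while the second grows logarithmically, so the second is the smaller one for large $N$.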

CHAPTER 7

PARALLEL ROUTING ALGORITHMS FOR GROUP CONNECTORS

7.1 Introduction

Recently, a new class of interconnection networks called group connectors was proposed in [101]. A group connector $G(N, g)$ is defined as an interconnection network that consists of $N$ inputs and $N$ outputs such that (1) its $N$ outputs are divided into $g$ output groups with $N/g$ functionally equivalent outputs in each group; and (2) it can provide any simultaneous $(N/g)$-to-one connections from the $N$ inputs to the $g$ output groups, possibly without the ability to distinguish the order of outputs within each group. Another type of $N \times N$ group connector, $G'(N, g)$, can be defined by dividing its $N$ inputs and $N$ outputs into $g$ equal-size groups, respectively. For $G'(N, g)$, if the inputs in the same input group are allowed to be connected to the outputs in different output groups, $G'(N, g)$ and $G(N, g)$ are the same in functionality; otherwise $g$ separate planes of $N/g \times N/g$ networks can be used to implement $G'(N, g)$. Figure 7.1 (a) and (b) illustrate the block diagrams of $G(8, 4)$ and $G'(8, 4)$, respectively.

Figure 7.1. Block diagrams of a group connector: (a) $G(8, 4)$; (b) $G'(8, 4)$

Group connectors have many applications. In general, a group connector $G(N, g)$ captures the simultaneous connections between $N$ clients and $N$ servers, which are divided into $g$ equal-size server groups such that the $N/g$ servers in each group are functionally equivalent. A group connector $G(N, g)$ can also be viewed as a $g \times g$ permutation network with an internal speedup of factor $N/g$ achieved by space-division multiplexing [98].

Group connectors are particularly useful in dense wavelength-division multiplexing (DWDM) networks. With DWDM, it is now possible to transmit different wavelengths of light over the same fiber, which provides another dimension for increasing bandwidth capacity. A group connector can be used as a switching network in a DWDM router. For example, if some inputs and one or more groups of outputs are connected to a local node, a group connector can be used as an add-drop cross-connect switching matrix. Group connectors can also be used in the construction of ingress edge routers of DWDM networks. An ingress edge router in a DWDM optical network has a set of $N$ electrical or optical input links and a set of $g$ optical output links. Each optical output link $i$ consists of a set of $N/g$ data channels $Ch_{i,1}, \dots, Ch_{i,N/g}$, each using a different wavelength. Associated with each input link there is an input line card (ILC), and associated with each output link there is an output line card (OLC). A switching matrix $M$ lies between the ILCs and the OLCs, and $N/g$ connections run from the output of $M$ to each OLC. The main function of each ILC is to route input packets to appropriate OLCs by routing table lookup. Each OLC transmits the received packets over the $N/g$ optical channels of the link it controls. The block diagram of a DWDM ingress edge router is shown in Figure 7.2. A group connector $G(N, g)$ serves as the major switching matrix $M$ in the design of ingress edge routers of a burst-switched DWDM network [102].

Figure 7.2. Block diagram of an ingress edge router

As discussed in Subsection 1.2.3 of Chapter 1, an interconnection network is rearrangeable nonblocking if it can realize all possible permutations between inputs and outputs when the rearrangement of existing connections is permitted. Similarly, a group connector is rearrangeable nonblocking if it can realize all possible connections between the inputs and output groups when the rearrangement of existing connections is permitted. It has been shown [101] that the group connectors based on the Benes network, called Benes group connectors and denoted by $GB(N, n)$, and the group connectors based on the 3-stage Clos network, called Clos group connectors and denoted by $GC(m, n, r)$, are both rearrangeable nonblocking. Rearrangeable nonblocking networks, including Benes networks, Benes group connectors, 3-stage Clos networks and Clos group connectors, are very attractive for fixed-size cell switching architectures. In such a switch, variable-length packets are segmented into cells upon arrival, transferred across the switch matrix, and then reassembled before they depart. Using fixed-size cells allows for slotted switching, which makes it easier for the scheduler to configure the switch matrix for high throughput.

When a group connector is used as a switching matrix in a high-speed packet

router/switch, packet-forwarding speed is crucial. There are several factors that af-

fect packet-forwarding speed: routing (label) table lookup, switch scheduling, switch

routing and switch internal transmission. For group connectors, the implementations

of switch scheduling and switch routing are of particular importance.

In this chapter, we address the issue of how to quickly set up SEs so that $K$ $(\le N)$ conflict-free paths between inputs and output groups can be established in a group connector $G(N, g)$. In particular, we present a parallel algorithm, named ROUTE, and its variations, for the setup of a group connector with $K$ connection requests. In this context, the Benes network $B(N)$ is a special case of the Benes group connectors $GB(N, n)$ with $n = N$. Thus, our algorithms can be applied to Benes networks directly. Given any permutation, the best known sequential algorithms for setting up $B(N)$ take $O(N \log N)$ time [45, 66, 93], and the best time complexity of parallel algorithms is $O(\log^2 N)$ [49, 62]. Given any partial permutation with $O(K)$ connection requests, parallel algorithms that set up the Benes network in $O(\log^2 N)$ time and in $O(\log^2 K + \log N)$ time were proposed in [46] and [43], respectively. Our main algorithm ROUTE extends the algorithm of [43] to set up a Benes group connector for a non-maximum mapping between inputs and output groups. Like the algorithm of [45], our algorithm sets up the SEs in the first $\log N - 1$ stages of $GB(N, n)$ so that the SEs in the remaining stages can be set up by self-routing [44]. Moreover, given any non-maximum mapping with $O(K)$ connection requests, our algorithm runs in $O(\log^2 K + \log N)$ time on a completely connected computer or the EREW PRAM model with $N$ processing elements (PEs), like the algorithm of [43]. When it is implemented on a perfect shuffle computer and a hypercube of $N$ PEs, $O(\log^4 K + \log^2 K \cdot \log N)$ time is sufficient. Our algorithm ROUTE thus combines the advantages of the algorithms of [43] and [45, 62]. For a Clos group connector $GC(m, n, r)$ with $O(K)$ busy inputs, by using the decomposition technique in [49], our algorithm ROUTE CLOS can determine the switch setting in $O(\log K \log m)$ time if $m$ is an integral power of two and in $O(\log^2 K \log m)$ time otherwise, on a completely connected computer or the EREW PRAM model with $N$ PEs.

The rest of this chapter is organized as follows. Section 7.2 introduces definitions and notations. In Section 7.3, we develop a parallel routing algorithm ROUTE for Benes group connectors. Section 7.4 extends algorithm ROUTE to ROUTE Clos for setting up the connections in Clos group connectors. Section 7.5 shows the implementation of our algorithms on more realistic parallel machine models and discusses the hardware redundancy of group connectors. Section 7.6 summarizes the chapter.

7.2 Preliminaries

Let $I$, $O$ and $G$ be the sets of $N$ inputs, $N$ outputs and $g$ output groups of $G(N, g)$, respectively. Let $\pi : I \longmapsto G$ be an I/G mapping that indicates connection requests from inputs to output groups. If there is a connection request from $I_i$ to $G_j$, set $\pi(i) = j$ and call $I_i$ a busy input; otherwise set $\pi(i) = -1$ and call $I_i$ an idle input. An I/G mapping from $I$ to $G$ is legal if each input is mapped to at most one output group and at most $N/g$ different inputs are mapped to the same output group. When a group connector is used as a switching matrix, legal mappings can be enforced by using arbitration hardware [97, 103]. A legal I/G mapping is maximum if all inputs are busy, and non-maximum otherwise. We denote a legal I/G mapping involving $K$ busy inputs as $\pi|_K$. Clearly, if $K = N$, $\pi|_N$ is a legal maximum I/G mapping. A group connector $G(N, g)$ has a feasible configuration for a given $\pi|_K$ if all SEs can be set up so that there are $K$ conflict-free paths connecting the busy inputs to their output groups. Thus, $G(N, g)$ is rearrangeable nonblocking if it has a feasible configuration for any $\pi|_K$.
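The definitions of legal, maximum and non-maximum mappings translate directly into a small check. The following Python sketch is illustrative only; the function name `classify_mapping`, the constant `IDLE` and the list encoding of the mapping are assumptions made here.

```python
from collections import Counter

IDLE = -1  # value of the mapping for an idle input, as in the text


def classify_mapping(mapping, g):
    """Classify an I/G mapping of G(N, g) as 'illegal', 'maximum' or 'non-maximum'.

    mapping[i] is the output group requested by input i, or IDLE.
    Representing the mapping as a list already ensures that each input
    requests at most one output group.
    """
    n = len(mapping)
    busy = [grp for grp in mapping if grp != IDLE]
    if any(not 0 <= grp < g for grp in busy):
        return "illegal"                       # request for a non-existent group
    if any(count > n // g for count in Counter(busy).values()):
        return "illegal"                       # more than N/g inputs for one group
    return "maximum" if len(busy) == n else "non-maximum"
```

For $G(8, 4)$, where each output group accepts at most two inputs, `classify_mapping([0, 0, 1, 1, 2, 2, 3, 3], 4)` is a maximum mapping, while a mapping that sends three inputs to group 0 is illegal.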

7.3 Parallel Routing for Benes Group Connectors

In this section, we develop a fast parallel routing algorithm ROUTE for Benes group connectors.

7.3.1 Structure of $GB(N, n)$

A Benes group connector $GB(N, n)$ with $N$ inputs and $n = N/2^k$ output groups is constructed from a Benes network $B(N)$ by permanently setting all inputs in its last $k$ stages straight, which allows the SEs in these stages to be eliminated (see Figure 7.3 for an example).

Figure 7.3. A Benes group connector $GB(16, 4)$ with $k = 2$

An $L$-level subnetwork ($0 \le L \le \log N - 1$) of a Benes group connector $GB(N, n)$ is defined as a Benes group connector $GB(H, h)$ with $H = N/2^L$ and $h = \min\{H, n\}$. Thus a $GB(N, n)$ contains $2^L$ $L$-level subnetworks. Clearly, the 0-level subnetwork of $GB(N, n)$ is itself. Figure 7.3 shows a $GB(16, 4)$, which contains two 1-level subnetworks $GB(8, 4)$, four 2-level subnetworks $GB(4, 4)$, and eight 3-level subnetworks $GB(2, 2)$. All SEs in the first (resp. last) stage of a subnetwork $GB(H, h)$ are called input (resp. output) SEs of $GB(H, h)$.
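The parameters of the subnetworks follow mechanically from this definition. The Python sketch below (the name `subnetwork_levels` is an assumption) lists, for every level $L$, the number of $L$-level subnetworks and their parameters $H$ and $h$.

```python
def subnetwork_levels(n_inputs, n_groups):
    """Return (L, number of L-level subnetworks, H, h) for each level of GB(N, n).

    n_inputs (N) is assumed to be a power of two.
    """
    log_n = n_inputs.bit_length() - 1
    levels = []
    for level in range(log_n):                  # 0 <= L <= log N - 1
        h_inputs = n_inputs >> level            # H = N / 2^L
        levels.append((level, 2 ** level, h_inputs, min(h_inputs, n_groups)))
    return levels


# The GB(16, 4) of Figure 7.3.
levels = subnetwork_levels(16, 4)
```

For $GB(16, 4)$ this gives one $GB(16, 4)$, two $GB(8, 4)$, four $GB(4, 4)$ and eight $GB(2, 2)$, as stated above.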

For a $GB(N, n)$, we label the $N$ inputs/outputs as $0, 1, \dots, N-1$, the $n$ output groups as $0, \dots, n-1$, and the $N/2$ SEs in each stage as $0, \dots, N/2 - 1$ from top to bottom; the $2\log N - 1$ stages are indexed 0 through $2\log N - 2$ from left to right. We denote every input, output and output group of $GB(N, n)$ by $I_i$, $O_i$, and $G_j$, respectively, where $0 \le i \le N - 1$ and $0 \le j \le n - 1$. Thus $O_i$ connects to $G_j$ where $j = i \bmod n$ in the last stage of $GB(N, n)$, and every $G_j$ consists of $O_{jN/n + l}$, $0 \le l \le N/n - 1$. If an input SE of $GB(N, n)$ has 2 (resp. 1, 0) busy inputs, it is called a busy (resp. semi-busy, idle) input SE.

7.3.2 Graph Model of $GB(N, n)$

Each $GB(N, n)$ with a legal mapping $\pi|_K$ can be represented as a graph $G$ as follows:

Case 1: $N = n$. The vertex set is $V(G) = \{v \mid v \text{ is an input SE or an output SE}\}$ and the edge set is $E(G) = \{(v, w, i) \mid \text{there is a busy input } i \text{ of input SE } v \text{ with } \pi(i) \text{ being an output of output SE } w\}$.

Case 2: $N > n$. The vertex set is $V(G) = \{v \mid v \text{ is an input SE or an output group}\}$ and the edge set is $E(G) = \{(v, w, i) \mid \text{there is a busy input } i \text{ of input SE } v \text{ with } \pi(i) = w\}$.

It is clear that $G$ is a bipartite graph in both cases, with all input SEs as one part $A$ and all output SEs (Case 1) or all output groups (Case 2) as the other part $B$. Each edge in $G$ corresponds to a pair consisting of an input and its mapped output group. There is a one-to-one correspondence between the busy inputs of $GB(N, n)$ and the edges of $G$. We label each edge by its corresponding busy input. Hence, we can use an edge and its corresponding input interchangeably. If an edge is the end edge of a path, we say its labeled input is the end input of the path; if two edges are adjacent, we say their labeled inputs are adjacent; if an edge is colored with some color, we say its labeled input is colored with that color.

In the following theorem, Theorem 17, we show that $GB(N, n)$ with a legal I/G mapping $\pi|_K$ has a feasible configuration, which is done by showing that $G$ has an equitable 2-edge coloring. The proof of Theorem 17 not only shows that the Benes group connector is rearrangeable nonblocking, but also implies a sequential algorithm, which we will implement in parallel in the next section, to set up SEs for any legal I/G mapping of a Benes group connector.

Theorem 17 Given any legal I/G mapping $\pi|_K$ of a Benes group connector $GB(N, n)$, $GB(N, n)$ has a feasible configuration.

Proof. Let $GB(H, h)$ be an $L$-level subnetwork of $GB(N, n)$. The proof is by induction on $L$. If $L = \log N - 1$, $GB(H, h)$ is $GB(2, 2)$ or $GB(2, 1)$, which consists of a single node (i.e., a single $2 \times 2$ SE), and the claim is obviously true. Assume that the claim is true for any $L$-level ($0 < L \le \log N - 1$) subnetwork $GB(H, h)$. For any legal I/G mapping of $GB(N, n)$, we know that $GB(N, n)$ (i.e., the 0-level subnetwork) can be represented as a bipartite graph $G$. We first prove that $G$ has an equitable 2-edge coloring. Since $G$ is bipartite, $G$ does not contain any odd cycle, i.e., any cycle in $G$ has an even number of edges [8]. So $E(G)$ is the union of a set of even cycles and paths. Thus we can alternately color each edge with one of two different colors, beginning with any busy input, along each even cycle or path so that adjacent edges on the same cycle or path have different colors. We know that every vertex in part $A$ has degree $\le 2$ since each input SE has at most 2 busy inputs, and every vertex in part $B$ has degree $\le 2$ for Case 1 or $\le 2^k$ for Case 2 since each output SE has at most 2 outputs or each output group is mapped to by at most $2^k$ busy inputs, respectively. Thus, if a vertex has degree $d$, then $\lceil d/2 \rceil$ of its incident edges are colored with one color and $\lfloor d/2 \rfloor$ are colored with the other color. Therefore $G$ has an equitable 2-edge coloring.

Then, we show that there is a feasible configuration of the Benes group connector $GB(N, n)$ if its graph model $G$ has an equitable 2-edge coloring. We let the two ends (i.e., an input and its mapped output group) of the edges with the same color connect with the same 1-level subnetwork. By the definition of the equitable 2-edge coloring, this setting of SEs satisfies the mapping constraints for the 0-level subnetwork $GB(H, h)$. Since every pair of an input and its mapped output group is connected with the same 1-level subnetwork, it generates two legal I/G mappings for the two 1-level subnetworks of $GB(N, n)$. By induction, each of the 1-level subnetworks has a feasible configuration. Therefore, $GB(N, n)$ has a feasible configuration. □
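The coloring argument in the proof can be made concrete with a small sequential sketch. It is one possible realization, written under assumptions: edges incident to the same vertex are first paired up arbitrarily, which decomposes the edge set into paths and even cycles, and colors are then alternated along each of them. The function name `equitable_2_edge_coloring` and the edge-list encoding are introduced here for illustration, and this is not the parallel algorithm ROUTE.

```python
def equitable_2_edge_coloring(edges):
    """Color the edges of a bipartite multigraph with colors 0 and 1 so that, at
    every vertex, the two color classes differ in size by at most one.

    edges: list of (a, b) pairs, a being a vertex of part A and b a vertex of part B.
    """
    m = len(edges)
    partner = [[None, None] for _ in range(m)]   # partner[e][side]: edge paired with e
    for side in (0, 1):
        incident = {}
        for e, ends in enumerate(edges):
            incident.setdefault(ends[side], []).append(e)
        for es in incident.values():             # pair up edges meeting at a vertex
            for x, y in zip(es[0::2], es[1::2]):
                partner[x][side] = y
                partner[y][side] = x
    color = [None] * m

    def walk(e, side):
        # Color e, move to its partner at `side`, and keep alternating sides and colors.
        c = 0
        while e is not None and color[e] is None:
            color[e] = c
            e = partner[e][side]
            side, c = 1 - side, 1 - c

    for e in range(m):                           # paths: start from an unpaired end
        if color[e] is None:
            for side in (0, 1):
                if partner[e][side] is None:
                    walk(e, 1 - side)
                    break
    for e in range(m):                           # what remains lies on even cycles
        if color[e] is None:
            walk(e, 0)
    return color


# A legal mapping of GB(8, 2): inputs 2s and 2s+1 share input SE s, and each of
# the two output groups accepts up to four inputs.
groups = [0, 0, 1, 1, 0, 1, 0, 1]
edges = [(i // 2, grp) for i, grp in enumerate(groups)]
colors = equitable_2_edge_coloring(edges)
```

Inputs whose edges receive color 0 would be sent to one 1-level subnetwork and those with color 1 to the other, so that each output group splits its requests evenly between the two subnetworks.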

We define mapping constraints as follows: for any $L$-level subnetwork $GB(H, h)$ of $GB(N, n)$ ($0 \le L \le \log N - 2$),

(1) Every busy input of the input SEs and its mapped output group are connected with the same $(L+1)$-level subnetwork;

(2) Two dual inputs (outputs) are connected with two different $(L+1)$-level subnetworks; and

(3) If $H > h$, the busy inputs of $GB(H, h)$ mapped to the same output group are partitioned into two parts, each of size $\le \frac{H}{2h}$, which are connected with different $(L+1)$-level subnetworks $GB(H/2, h)$; otherwise (i.e., $H = h$), two inputs mapped to two dual outputs must be connected through its two different $GB(H/2, H/2)$ subnetworks.

By Theorem 17 and the topology of $GB(N, n)$, we have the following corollary.

Corollary 5 Given any legal I/G mapping $\pi|_K$, $G$ is a feasible configuration of $GB(N, n)$ if and only if $G$ satisfies the mapping constraints.

7.3.3 Algorithm for $G_B(N, n)$

For a legal I/G mapping $\pi|_K$, each busy input $I_i$ is to be connected to a unique output group $G_j$. We regard the operation of setting up $K$ link-disjoint paths from busy inputs to their mapped output groups as a routing process, and an algorithm for establishing I/G connections as a routing algorithm. Our routing algorithm is based on the sequential algorithm of [45] and the parallel algorithms of [43, 62] for routing a permutation in a Benes network.

A $G_B(N, n)$ consists of $2\log N - 1 - k$ stages. It can be regarded as a concatenation of two parts, $P_1$ being the first $\log N - 1$ stages, and $P_2$ being the remaining stages. For each busy input $I_i$ of $P_2$ in stage $s$ of $G_B(N, n)$, we define its control bit to be $\pi(i)_{2\log N - 2 - k - s}$. The control bit is used for self-routing: if the control bit for a busy upper input is 0 (resp. 1), then this busy input is set straight (resp. cross), and if the control bit for a busy lower input is 0 (resp. 1), then this busy input is set cross (resp. straight).

Lemma 15 If a Benes group connector $G_B(N, n)$ has a feasible configuration for a legal I/G mapping $\pi|_K$, then all busy inputs of $P_2$ can be set up by self-routing.

Proof. Since $G_B(N, n)$ has a feasible configuration, the two dual outputs of an output SE in the last stage differ only in the first bit (see Figure 7.4 for an example, where $C, x \in \{0, 1\}$). Thus, if the control bit is 0 for a busy upper input or 1 for a busy lower input, this input is set straight, and if the control bit is 1 for a busy upper input or 0 for a busy lower input, this input is set cross. In general, if $G_B(N, n)$ has a feasible configuration, then, for the two dual outputs of an SE in stage $s$ ($\log N - 1 \le s \le 2\log N - k - 2$), the $(2\log N - 1 - k - s)$-th bit must equal 0 for the upper one and 1 for the lower one. Therefore, according to the control bit $\pi(i)_{2\log N - 2 - k - s}$ of busy input $i$, we can set up the SEs in $P_2$ by self-routing. □

[Figure 7.4: diagram not recoverable from the extracted text. It shows a five-stage network (Stage 0 through Stage 4) with inputs 0-15 on the left and output groups on the right; links carry bit-pattern labels such as 000C, 00Cx, 000x and 001x.]

Figure 7.4. Hardware redundancy of $P_1$ and control bit selection of $P_2$ in $G_B(16, 4)$

If we define the $P_2$-passable condition of $G_B(N, n)$ as (2) and (3) of the mapping constraints, then by Lemma 5 and Lemma 15 we have the following claim:

Theorem 18 For any legal I/G mapping $\pi|_K$ of $G_B(N, n)$, if $P_1$ is set up to satisfy the $P_2$-passable condition and $P_2$ is set up by self-routing, then the setting of $G_B(N, n)$ is a feasible configuration.

Our algorithm, named ROUTE, is presented for a parallel computer with $N$ PEs that are completely connected. The $N$ PEs are labeled $0, 1, \ldots, N-1$. Algorithm ROUTE consists of two phases: PHASE I and PHASE II. In PHASE I, the settings of all SEs in the first $\log N - 1$ stages are determined. In PHASE II, self-routing is performed for the SEs in the remaining stages. Conceptually, PHASE I consists of $\log N - 1$ iterations and each iteration contains 4 steps. In the $i$-th iteration, $1 \le i \le k$, the settings of the busy inputs of stage $i - 1$ are determined in the following way: we consider the $2^{i-1}$ independent $(i-1)$-level subnetworks so that the $P_2$-passable condition is satisfied in each subnetwork. In the $(k+1)$-th iteration, we encounter $2^k$ independent $B(n)$ routing problems. Then, for each such problem, our algorithm degenerates to a parallel routing algorithm based on the sequential algorithm of [45]. PHASE II is a self-routing process. Since the $P_2$-passable condition is satisfied after PHASE I, self-routing for $P_2$ is always possible by Theorem 18. The basic structure of algorithm ROUTE is as follows:

Algorithm 5 ROUTE

Input: A legal mapping $\pi|_K$ for $G_B(N, n)$

Output: A feasible configuration of $G_B(N, n)$

PHASE I: set up busy inputs in $P_1$ in $\log N - 1$ iterations

PHASE II: set up busy inputs in $P_2$ by self-routing.

In PHASE I, in order to satisfy the $P_2$-passable condition of $G_B(N, n)$, by the proof of Theorem 17, we need to give the graph model $G$ an equitable 2-edge coloring and let the inputs with the same color connect to the same subnetwork of the next level. A parallel algorithm for finding an equitable 2-edge coloring of a bipartite graph can be found in Section 5.3 of Chapter 5.

Our algorithm in PHASE II is presented in such a way that all PEs participate in the routing process for $P_2$ of $G_B(N, n)$. Actually, once the settings of the SEs in $P_1$ are determined, cells can be injected into $G_B(N, n)$. When the cells reach $P_2$, the SEs

can determine their settings by inspecting the control bits. We use $x_l$ and $\bar{x}_l$ to denote, respectively, the $(l+1)$-th significant bit $b_l$ of the binary representation $b_v b_{v-1} \cdots b_1 b_0$ of $x$ and the integer that has the binary representation $b_v b_{v-1} \cdots (1 - b_l) \cdots b_1 b_0$.

Algorithm 6 SelfRout

Input: $N$, $k$, $s$, $\pi(i)$

Output: setting of busy inputs in $P_2$

for $s = \log N - 1$ to $2\log N - k - 2$ do

    $\bar{s} := 2\log N - 2 - s$;

    for all PE$_i$, $0 \le i \le N - 1$ do

        if ($i$ is even and $(\pi(i))_{\bar{s}-k} = 0$) or ($i$ is odd and $(\pi(i))_{\bar{s}-k} = 1$) then

            set input $i$ as straight;

        else

            set input $i$ as cross.

        end if

    end for

end for
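The self-routing rule used in PHASE II can be illustrated with a small Python sketch. It covers only the per-input decision, not the parallel loop over all PEs. The function names `control_bit` and `se_input_setting` are hypothetical, and the sketch assumes that bit index 0 is the least significant bit of the destination output group number, in line with the $b_v \cdots b_1 b_0$ notation above.

```python
def control_bit(dest_group: int, log_n: int, k: int, s: int) -> int:
    """Control bit of a busy input in stage s of P2: bit number
    2*log_n - 2 - k - s of its destination group (bit 0 = least significant)."""
    return (dest_group >> (2 * log_n - 2 - k - s)) & 1


def se_input_setting(is_upper: bool, bit: int) -> str:
    """Upper input: 0 -> straight, 1 -> cross.
    Lower input: 1 -> straight, 0 -> cross."""
    return "straight" if bit == (0 if is_upper else 1) else "cross"


# Assumed example: G_B(16, 4), so log N = 4 and k = 2; P2 is stages 3 and 4.
# A cell destined for output group 2 (binary 10) is examined in both stages.
settings = [
    (s, control_bit(2, 4, 2, s), se_input_setting(True, control_bit(2, 4, 2, s)))
    for s in (3, 4)
]
```

Because the decision depends only on the cell's own destination and its input position, each SE in $P_2$ can set itself as cells arrive, without any global computation.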

7.3.4 Analysis and Example

In PHASE I, since the length of a cycle or path is at most $K$, we need $O(\log K)$ time to find an equitable 2-edge coloring in each of the first $\log K$ iterations. Because the number of busy inputs connecting to the same subnetwork of the next level is reduced by half after each iteration, each iteration in PHASE I takes only $O(1)$ time after $O(\log K)$ iterations. Thus, the total time for PHASE I is $O(\log^2 K + \log N)$. There are $\log N - k$ iterations in PHASE II, one for each stage of $P_2$, and each takes $O(1)$ time. Therefore, the total time complexity of algorithm ROUTE is $O(\log^2 K + \log N)$, and we have the following claim:

Theorem 19 For any legal I/G mapping $\pi|_K$, algorithm ROUTE correctly computes a feasible configuration of $G_B(N, n)$ in $O(\log^2 K + \log N)$ time on a completely connected parallel computer or the EREW PRAM model using $N$ PEs.

Example 15 The parallel algorithm ROUTE sets up busy inputs in the first stage of $G_B(16, 4)$ for a legal I/G mapping

$$\begin{pmatrix} i: & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \pi(i): & 1 & -1 & 1 & 0 & 2 & 3 & -1 & 1 & 3 & 2 & 0 & 3 & 2 & 0 & -1 & 2 \end{pmatrix}$$

using the equitable 2-edge coloring technique.

After PHASE I of our algorithm, the SEs in the first stage of $G_B(16, 4)$ are set up as shown in Figure 7.5, and two new mappings are given to the inputs of the input SEs in the two subnetworks of the next level as follows:

$$\begin{pmatrix} i: & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ \pi(i): & 1 & 0 & 2 & -1 & 3 & 3 & 0 & 2 \end{pmatrix}$$

and

$$\begin{pmatrix} i: & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \pi(i): & -1 & 1 & 3 & 1 & 2 & 0 & 2 & -1 \end{pmatrix}$$

[Figure 7.5: diagram not recoverable from the extracted text. It shows inputs 0-15 on the left, the upper and lower 1-level subnetworks in the middle, and the output groups on the right.]

Figure 7.5. The settings of SEs in the first stage of $G_B(16, 4)$ according to the equitable 2-edge coloring

7.4 Parallel Routing for Clos Group Connectors

A Clos group connector $G_C(m, n, r)$ is constructed from a three-stage Clos network by replacing the third stage with the fat-and-slim concentrators proposed in [67]. As shown in Figure 7.6 (a), a Clos group connector $G_C(m, n, r)$ consists of $r$ $m \times n$ SMs in the first stage, $n$ $r \times r$ SMs in the second stage, and $r$ $n \times m$ SMs in the third stage, with $N = mr$ inputs and $r$ output groups. The SMs in the first 2 stages are implemented by crossbar networks and the SMs in the last stage are implemented by concentrators.

[Figure 7.6: diagram not recoverable from the extracted text. Panel (a) shows the inputs feeding $r$ crossbars of size $m \times n$, followed by $n$ crossbars of size $r \times r$, followed by $r$ concentrators of size $n \times m$ that drive output groups $g_1, g_2, \ldots, g_r$. Panel (b) shows the inputs feeding $r$ crossbars of size $m \times m$, followed by $m$ crossbars of size $r \times r$ that drive output groups $g_1, g_2, \ldots, g_r$.]

Figure 7.6. Construction of a Clos group connector: (a) a 3-stage Clos group connector $G_C(m, n, r)$; (b) a 2-stage Clos group connector $G_C(m, m, r)$

Similar to the Benes group connector, we model a Clos group connector $G_C(m, n, r)$ with a mapping $\pi|_K$, which maps $K$ ($\le N$) busy inputs to $r$ output groups, as a graph $\tilde{G}$ where the vertex set is $V(\tilde{G}) = \{v \mid v$ is an SM in the first stage or an output group$\}$ and the edge set is $E(\tilde{G}) = \{e = vw \mid$ there is an input $i$ of an input SM $v$ with $\pi(i) = w\}$. It is clear that $\tilde{G}$ is a bipartite graph with all SMs in the first stage as one part and all output groups as the other part. Let $\Delta(\tilde{G})$ denote the maximum degree of $\tilde{G}$. Clearly, $\Delta(\tilde{G}) \le m$ since each SM in the first stage has at most $m$ busy inputs and each output group has at most $m$ busy inputs mapped to it. By the topology of Clos group connectors, the outputs of each SM in the first stage are connected to different SMs in the middle stage, and the outputs of each SM in the second stage are connected to different output groups. Because each SM that is a crossbar is nonblocking, all we need to do to set up $G_C(m, n, r)$ is to route the outputs of every SM in the first stage to different SMs in the second stage so that the inputs connected with the same SM in the second stage are mapped to different output groups. Hence, if we can color $E(\tilde{G})$ with $\Delta(\tilde{G})$ colors, then we can set up the SMs of $G_C(m, n, r)$ by connecting the busy inputs corresponding to edges of different colors to different SMs in the second stage. Since the number of SMs in the second stage is $n \ge m$, there is a set of edge-disjoint paths from the inputs to the outputs connecting input $i$ to output group $\pi(i)$ for $1 \le i \le m \times r$.

Therefore, we can apply the routing algorithm of the Benes group connector to the Clos group connector $G_C(m, n, r)$. The basic strategy of the algorithm for the Clos group connector, denoted by ROUTE_CLOS, is similar to [49]: first reduce the case of arbitrary $m$, the number of inputs of each SM in the first stage, to the case in which $m$ is an integral power of two, and then recursively decompose the original $d$-edge coloring into two $d/2$-edge colorings ($4 \le d \le m$) so that we can apply algorithms similar to the one in Section 7.3.3 to an $N/2$-vertex bipartite graph with maximum degree 2. For brevity, the detailed implementation is omitted. In summary, we have the following claim:

Theorem 20 For any legal I/G mapping $\pi|_K$, the algorithm ROUTE_CLOS correctly computes a feasible configuration of $G_C(m, n, r)$ in $O(\log K \log m)$ time if $m$ is an integral power of two and in $O(\log^2 K \log m)$ time otherwise, on a completely connected parallel computer or the EREW PRAM model using $N$ PEs.
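The reduction of Clos routing to edge coloring described in this section can be made concrete with a short sequential sketch. It is not the parallel ROUTE_CLOS algorithm, whose implementation the text omits. Instead it uses the classical alternating-path method for coloring a bipartite multigraph with $\Delta$ colors, swapped in here purely for illustration. The function name `color_bipartite_edges` and the edge format (first-stage SM index, output group index) are assumptions; each color is read as the index of a second-stage SM.

```python
from collections import Counter, defaultdict


def color_bipartite_edges(edges):
    """Properly color the edges of a bipartite multigraph with max-degree many
    colors. edges: list of (first_stage_sm, output_group) pairs (hypothetical
    format). Returns one color per edge, in input order."""
    deg = Counter()
    for u, v in edges:
        deg["L", u] += 1
        deg["R", v] += 1
    delta = max(deg.values(), default=0)

    left, right = defaultdict(dict), defaultdict(dict)  # vertex -> {color: edge}
    color = [None] * len(edges)
    for e, (u, v) in enumerate(edges):
        a = next(c for c in range(delta) if c not in left[u])   # free at u
        b = next(c for c in range(delta) if c not in right[v])  # free at v
        if a in right[v]:
            # Collect the a/b alternating path that starts at v; in a
            # bipartite graph it can never reach u.
            path, on_right, x, c = [], True, v, a
            while True:
                f = (right if on_right else left)[x].get(c)
                if f is None:
                    break
                path.append(f)
                x = edges[f][0] if on_right else edges[f][1]
                on_right = not on_right
                c = b if c == a else a
            # Swap colors a and b along the path, which frees a at v.
            for f in path:
                del left[edges[f][0]][color[f]]
                del right[edges[f][1]][color[f]]
            for f in path:
                color[f] = b if color[f] == a else a
                left[edges[f][0]][color[f]] = f
                right[edges[f][1]][color[f]] = f
        color[e] = a
        left[u][a] = e
        right[v][a] = e
    return color


# Assumed toy instance: 3 first-stage SMs, 3 output groups, maximum degree 3.
demo_edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 2), (0, 2)]
demo_colors = color_bipartite_edges(demo_edges)
```

With such a coloring, two busy inputs of the same first-stage SM never share a second-stage SM, and no second-stage SM receives two connections for the same output group, which is the condition stated above.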

7.5 Generalizations and Hardware Redundancy

The parallel machine model we used is not realistic. However, our algorithms can be converted to fit any realistic machine model. A parallel operation involving interprocessor communication can be achieved by sorting. Let $S(N)$ be the time for sorting $N$ elements on a parallel machine $M$ with $N$ processors. Then, as with the algorithm for routing in the Benes network $B(N)$ of [62], the algorithms ROUTE and ROUTE_CLOS can be implemented on a machine with $N$ PEs in no more than $O(\log^2 K \cdot S(N) + \log N \cdot S(N))$ and $O(\log^2 K \log m \cdot S(N))$ running time, respectively, where there are $O(\log K)$ busy inputs. For example, when implemented on parallel computers whose PEs are connected by perfect shuffle and hypercube networks, our algorithm ROUTE takes $O(\log^4 K + \log^2 K \cdot \log N)$ time.

Also, it is not difficult to see that the proposed parallel algorithm for $G_B(N, n)$ can set the dual inputs in the SEs indexed by $\{0, \cdots, (N/2^{s+1}) \cdot j\}$, where $0 \le s \le \log N - 2$ and $j \in \{0, \cdots, 2^s - 1\}$, in stage $s$ to straight. Thus, these SEs can be eliminated. So the hardware redundancy in the first $\log N - 1$ stages is $\sum_{i=0}^{\log N - 2} 2^i = 2^{\log N - 1} - 1 = N/2 - 1$ SEs. Translated into crosspoints, the number of saved crosspoints in the Benes group connector $G_B(N, N/2^k)$ is increased to $2N \cdot k + 2N - 4$, compared with the Benes network $B(N)$. For example, we can remove $1 + 2 + 4 = 7$ SEs in the first 3 stages of $G_B(16, 4)$. Compared with the network in Figure 7.3, the Benes group connector in Figure 7.4 has much lower hardware cost. When $k = 0$, in which case $G_B(N, n)$ is the Benes network $B(N)$, the hardware redundancy achieved by our algorithm is the same as the number given in [93].
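The savings quoted in the preceding paragraph can be checked with a few lines of Python. This is a sketch under two assumptions that are not spelled out at this point in the text: every stage holds $N/2$ SEs, and each $2 \times 2$ SE counts as 4 crosspoints. The function names are hypothetical.

```python
def eliminated_ses(n_inputs: int) -> int:
    """SEs that can be fixed to 'straight' and removed from the first
    log N - 1 stages: the sum of 2**i for i = 0 .. log N - 2."""
    log_n = n_inputs.bit_length() - 1      # assumes n_inputs is a power of two
    return sum(2 ** i for i in range(log_n - 1))


def saved_crosspoints(n_inputs: int, k: int) -> int:
    """Crosspoints saved relative to B(N): k whole stages of N/2 SEs are
    dropped, plus the eliminated SEs, at an assumed 4 crosspoints per SE."""
    return 4 * (n_inputs // 2) * k + 4 * eliminated_ses(n_inputs)


# G_B(16, 4): N = 16 and k = 2.
ses_16 = eliminated_ses(16)
saved_16 = saved_crosspoints(16, 2)
```

For $G_B(16, 4)$ this gives 7 eliminated SEs and $2 \cdot 16 \cdot 2 + 2 \cdot 16 - 4 = 92$ saved crosspoints, matching the closed forms $N/2 - 1$ and $2N \cdot k + 2N - 4$.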

For the three-stage Clos group connector, it has been shown [101] that the sufficient condition for $G_C(m, n, r)$ to be rearrangeably nonblocking is $n \ge m$, which is also the necessary condition. If we choose the minimum value of $n$, i.e. $n = m$, each concentrator in the last stage has size $m \times m$. Thus, we can obtain a rearrangeably nonblocking two-stage group connector by removing all concentrators of the last stage, as in Figure 7.6 (b).

7.6 Summary

We have introduced a class of interconnection networks, Benes group connectors and Clos group connectors, based on the Benes network and the 3-stage Clos network with reduced hardware redundancy, and designed fast parallel algorithms for configuring Benes group connectors and Clos group connectors based on graph coloring. Our algorithms can also be implemented on various realistic parallel machine models, with the running time multiplied by the time for sorting $N$ elements on the model. To our knowledge, no known algorithm for setting up Benes networks $B(N)$ can be directly applied to set up Benes group connectors; however, by letting $n = N$, our algorithm for $G_B(N, n)$ can be directly applied to set up the Benes network $B(N)$ with the same time complexity. All known algorithms for setting up Clos networks consider only full permutations [4, 10, 31, 49], and so cannot be directly applied to set up Clos group connectors for non-maximum mappings. By letting $K = N$, our algorithm for $G_C(m, n, r)$ can be directly applied to set up 3-stage Clos networks.

CHAPTER 8

CONCLUDING REMARKS

A switching network plays a key role in communication networks. Nonblocking switching networks are favored whenever possible. The crosstalk-free requirement in photonic networks adds a new dimension of constraints for nonblockingness. Switching algorithms play a fundamental role in nonblocking networks, and any algorithm that requires more than linear time would be considered too slow for real-time applications. One remedy is to use multiple processors to route connections in parallel.

The design and analysis of efficient switching algorithms is one of the most active research areas in communication networks. Using parallel computing and processing techniques to improve the time complexity of switching algorithms is a great challenge. In this chapter, we summarize the major contributions of this dissertation and discuss further research that extends the dissertation in the switching area.

8.1 Contributions

One major contribution of this dissertation is the design and analysis of fast parallel routing and wavelength assignment algorithms for establishing connections in switching networks.

We studied a class of multistage nonblocking switching networks B(N;x; p; a),

which contain Banyan network, Benes network, and Cantor network as special cases.

By modeling the routing problems for this class of networks as weak and strong edge colorings of bipartite graphs, we developed fast parallel routing algorithms that can route an arbitrary partial permutation with $K$ ($\le N$) connections in a rearrangeable

nonblocking networkB(N;x; p; a) inO((x+log p) logK+logN) time and in a strictly

nonblocking network B(N; 0; p�; a) in O(log p� logK + p� log p�) time.

The crosstalk problem in photonic switching adds a new dimension of blocking for switching networks. We presented fast parallel routing and wavelength assignment algorithms to establish connections in photonic switching networks using time, space and wavelength dilation to avoid crosstalk.

We modeled the routing and wavelength assignment problems as graph coloring problems using combinatorial and graph-theoretic approaches. Using various parallel graph coloring techniques such as edge coloring, vertex coloring, equitable coloring, and balance coloring, we presented fast parallel routing algorithms for photonic switching.

Using the time dilation approach, we proposed a fast parallel decomposition algorithm with time complexity $O(\log N)$ to decompose a permutation into two semi-permutations that can be routed separately through an optical Benes network without crosstalk. The presented parallel crosstalk-free routing algorithm can set up any permutation in $O(\log^2 N)$ time in an optical $B(N)$. The decomposition algorithm can be extended to set up any partial permutation with $K$ ($< N$) connections in $O(\log^2 K + \log N)$ time in an optical $B(N)$.

Using the space dilation approach, crosstalk can be avoided by increasing the number of SEs in photonic switching networks. We developed sublinear-time parallel routing algorithms for the class of networks constructed from Banyan-type networks by horizontal concatenation of extra stages and/or vertical stacking of multiple copies.

Using the wavelength dilation approach, we presented fast parallel routing and wavelength assignment algorithms to route connections in optical WRSS Banyan networks and WRSR Benes networks so that the connections passing through the same SE have different wavelengths. An arbitrary partial permutation can be routed without crosstalk in a WRSS $BL(N)$ in $O(\log^2 N)$ time using at most $2^{\lfloor (\log N + 1)/2 \rfloor}$ wavelengths, and in a WRSR $B(N)$ with only basic SEs in $O(\log^3 N)$ time using at most $2\log N$ wavelengths.

The presented algorithms run on a completely connected multiprocessor sys-

tem can be easily transformed to algorithms on more realistic multiprocessor systems.

For example, the proposed algorithms used to set up connections inB(N;x; p; �) have

a slow-down factor O(log2N) on a Banyan-type multiprocessor system, whose com-

plexity is no larger than one plane of B(N;x; p; �); the decomposition algorithm and

routing algorithms for the optical Benes network can be implemented in $O(\log^3 N)$ time and in $O(\log^4 N)$ time, respectively, on a hypercube; the routing and wavelength assignment algorithms for a WRSS $BL(N)$ and a WRSR $B(N)$ take $O(\log^4 N)$ time on a hypercube with $N/2$ PEs.

Another major contribution of this dissertation is the design and analysis of fast parallel stable matching and acyclic stable matching algorithms for switch scheduling.

For the stable matching problem, we proposed a new approach, parallel iterative improvement (PII), which treats the problem as an optimization problem. A particular PII algorithm based on this approach is presented. It uses techniques such as randomization and greedy selection, and experimental evaluations show that the PII algorithm has better average performance than the classical stable matching algorithms and converges in a linear number of iterations with high probability. Because random selection does not yield a unique result, the stable matchings generated by the PII algorithm provide more fairness. The PII algorithm can also be stopped at any time with a "near-stable" matching as output to satisfy time constraints in real-time applications. In addition, the PII algorithm can be easily implemented on realistic parallel computing models such as the hypercube, the mesh of trees, and the array with multiple broadcasting buses, with either no slow-down or a logarithmic-time slow-down factor.

For the acyclic stable matching problem, we modeled it as the dominating set problem on a rooted dependency graph, and then proposed a parallel algorithm for finding the dominating set. For any instance of the acyclic stable matching problem, our acyclic stable matching algorithm can find a stable matching in $O(N \log N)$ time, while the classical stable matching algorithm needs $O(N^2)$ time. Simulation results show that a scheduler based on our acyclic stable matching algorithm can feasibly be implemented at high speed using current CMOS technologies.

Design of low cost, high speed, and large capacity nonblocking switching

architectures is also a contribution of this dissertation work.

Scalable nonblocking switching networks tend to have no self-routing capa-

bility. For example, for a nonblocking switching network B(N;x; p; �), though self-

routing capabilities exist in a portion of it, its routing is still computation intensive.

By studying the connection capacity of Banyan-type networks, we proposed a new

class of self-routing strictly nonblocking networks T (N;�). Compared with existing

strictly nonblocking self-routing networks, T (N;�) has lower hardware cost, shorter

connection diameter, and much smaller number of required wavelengths. Conse-

quently, they are more feasible for implementation with reduced optical signal atten-

uation and crosstalk.

We have introduced a class of interconnection networks, Benes group connectors and Clos group connectors, based on the Benes network and the 3-stage Clos network with reduced hardware redundancy, and designed fast parallel routing algorithms for these group connectors. We also showed that, with our routing algorithms, the hardware of Benes group connectors can be reduced further.

Most of the results discussed in this dissertation have been reported to the research community through publications [50]-[55].

8.2 Future Work

Efficient switching algorithms and switching architectures directly affect the performance of communication networks. More work needs to be done in this area. In the following, we suggest some directions for possible further research that extend this dissertation.

Multicasting is an important feature for any switching network intended to support broadband integrated services digital networks (B-ISDN). We expect to extend our routing algorithms, with some modifications, to support multicasting in switching networks.

With wavelength-division multiplexing (WDM) technology, the concepts of SNB and RNB in space division switching can be extended to wavelength division switching. Depending on whether wavelengths can be reassigned, this extension results in four combinations: wavelength-rearrangeable space-rearrangeable (WRSR), wavelength-rearrangeable space-strict-sense (WRSS), wavelength-strict-sense space-rearrangeable (WSSR), and wavelength-strict-sense space-strict-sense (WSSS). It has been shown that by using both wavelength and space multiplexing techniques in a fully dynamic manner, networks can achieve higher bandwidth and higher connectivity. It is worthwhile to investigate the connection capacities of these networks and to design efficient routing and wavelength assignment algorithms for them.

The scheduling algorithms based on stable matchings have been shown to

provide QoS guarantees. It is desirable to seek new approaches to further reduce the

time complexities for the solutions of stable matching and acyclic stable matching

problems.

For the design of a switching network, in addition to its hardware cost in terms of the cost of SEs and interconnection links (and wavelengths), we must take the routing complexity into consideration. It remains a great challenge to find low-cost, high-speed nonblocking switching networks.

BIBLIOGRAPHY

[1] H. Abeledo and U. G. Rothblum, "Paths to marriage stability", Discrete Applied Mathematics, vol. 63, pp. 1-12, 1995.

[2] D. P. Agrawal, "Graph theoretical analysis and design of multistage interconnection networks", IEEE Transactions on Computers, vol. C-32, no. 7, pp. 637-648, July 1983.

[3] R. Anderson, "Parallel algorithms for generating random permutations on a shared memory machine", Proceedings of the 2nd ACM Symposium on Parallel Algorithms and Architectures, pp. 95-102, 1990.

[4] S. Andersen, "The looping algorithm extended to base $2^t$ rearrangeable switching network", IEEE Transactions on Communications, vol. 25, pp. 1057-1063, 1977.

[5] V. E. Benes, "On rearrangeable three-stage connecting networks", Bell System Technical Journal, vol. 41, no. 5, pp. 1481-1492, Sep. 1962.

[6] V. E. Benes, "Permutation groups, complexes, and rearrangeable connecting networks", The Bell System Technical Journal, vol. 43, pp. 1619-1640, July 1964.

[7] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.

[8] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, Elsevier North-Holland, 1976.

[9] J. Carpinelli, "Interconnection networks: Improved routing methods for Clos and Benes networks", Ph.D. Thesis, Rensselaer Polytechnic Institute, Troy, NY, Aug. 1987.

[10] J. Carpinelli and A. Y. Oruç, "Applications of matching and edge-coloring algorithms to routing in Clos networks", Networks, vol. 24, pp. 319-326, Sep. 1994.

[11] H. J. Chao, C. H. Lam, and E. Oki, Broadband Packet Switching Technologies, John Wiley & Sons, Inc., 2001.

[12] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queuing with a combined input/output-queued switch", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1030-1039, 1999.

[13] C. Clos, "A study of non-blocking switching networks", Bell System Technical Journal, vol. 32, pp. 406-424, Mar. 1953.

[14] R. Cole and J. Hopcroft, "On edge coloring bipartite graphs", SIAM Journal on Computing, vol. 11, pp. 540-546, 1982.

[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, The MIT Press and McGraw-Hill Book Company, second edition, 2001.

[16] R. Durstenfeld, "Random permutation (Algorithm 235)", Communications of the ACM, vol. 7, no. 7, pp. 420, 1964.

[17] T. Feder, N. Megiddo, and S. Plotkin, "A sublinear parallel algorithm for stable matching", Theoretical Computer Science, vol. 233, pp. 297-308, 2000.

[18] H. Gabow, "Using Euler partitions to edge color bipartite multigraphs", International Journal of Computer and Information Sciences, vol. 5, pp. 345-355, 1976.

[19] H. Gabow and O. Kariv, "Algorithms for edge coloring bipartite graphs and multigraphs", SIAM Journal on Computing, vol. 11, pp. 117-129, 1982.

[20] D. Gale and L. S. Shapley, "College admissions and the stability of marriage", American Mathematical Monthly, vol. 69, pp. 9-15, 1962.

[21] A. V. Goldberg, S. A. Plotkin, and G. E. Shannon, "Parallel symmetry-breaking in sparse graphs", Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp. 315-323, 1987.

[22] Q. P. Gu and S. Peng, "Wavelengths requirement for permutation routing in all-optical multistage interconnection networks", Proceedings of 14th International Parallel and Distributed Processing Symposium (IPDPS), pp. 761-768, May 2000.

[23] D. Gusfield, "Three fast algorithms for four problems in stable marriage", SIAM Journal on Computing, vol. 16, no. 1, pp. 111-128, 1987.

[24] D. Gusfield and R. W. Irving, The Stable Marriage Problem: Structure and Algorithms, MIT Press, 1989.

[25] T. Hagerup, "Fast parallel generation of random permutations", Proceedings of the 18th Annual International Colloquium on Automata, Languages and Programming, pp. 405-416, 1991.

[26] T. Hattori, T. Yamasaki, and M. Kumano, "New fast iteration algorithm for the solution of generalized stable marriage problem", Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, vol. 6, pp. 1051-1056, 1999.

[27] H. Hinton, "A non-blocking optical interconnection network using directional couplers", Proceedings of IEEE Global Telecommunications Conference, pp. 885-889, Nov. 1984.

[28] J. E. Hopcroft and R. M. Karp, "An $n^{2.5}$ algorithm for maximum matching in bipartite graphs", SIAM Journal on Computing, vol. 2, pp. 225-231, 1973.

[29] M. E. C. Hull, "A parallel view of stable marriages", Information Processing Letters, vol. 18, no. 1, pp. 63-66, 1984.

[30] D. K. Hunter, P. J. Legg, and I. Andonovic, "Architecture for large dilated optical TDM switching networks", IEE Proceedings on Optoelectronics, vol. 140, no. 5, pp. 337-343, Oct. 1993.

[31] F. K. Hwang, The Mathematical Theory of Nonblocking Switching Networks, World Scientific, 1998.

[32] IEEE Standards Board, IEEE Standard VHDL Language Reference Manual, 2002.

[33] J. Jaja, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.

[34] A. Jajszczyk, "A simple algorithm for the control of rearrangeable switching networks", IEEE Transactions on Computers, vol. 33, pp. 169-171, 1985.

[35] A. C. Kam, K. Y. Siu, R. A. Barry, and E. C. Swanson, "A cell switch WDM broadcast LAN with bandwidth guarantee and fair access", IEEE Journal of Lightwave Technology, vol. 16, no. 12, pp. 2265-2280, Dec. 1998.

[36] A. Kam and K.-Y. Siu, "Linear complexity algorithms for QoS support in input-queued switches with no speedup", IEEE Journal on Selected Areas in Communications, vol. 17, no. 6, pp. 1040-1056, June 1999.

[37] D. Kapur and M. S. Krishnamoorthy, "Worst-case choice for the stable marriage problem", Information Processing Letters, vol. 21, pp. 27-30, 1985.

[38] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input vs. output queueing on a space-division packet switch", IEEE Transactions on Communications, vol. 35, no. 12, pp. 110-115, May 1987.

[39] S. Keshav, An Engineering Approach to Computer Networking, Addison-Wesley Inc., 1997.

[40] C. T. Lea, "Crossover minimization in directional-coupler-based photonic switching systems", IEEE Transactions on Communications, vol. 36, no. 3, pp. 355-363, Mar. 1988.

[41] C. T. Lea, "Multi-$\log_2 N$ networks and their applications in high-speed electronic and photonic switching systems", IEEE Transactions on Communications, vol. 38, no. 10, pp. 1740-1749, Oct. 1990.

[42] C. T. Lea and D. J. Shyy, "Tradeoff of horizontal decomposition versus vertical stacking in rearrangeable nonblocking networks", IEEE Transactions on Communications, vol. 39, no. 6, pp. 899-904, June 1991.

[43] C. Y. Lee and A. Y. Oruç, "A fast parallel algorithm for routing unicast assignments in Benes networks", IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 3, pp. 329-334, Mar. 1995.

[44] K. Y. Lee, "On the rearrangeability of a $(2\log N - 1)$ stage permutation network", IEEE Transactions on Computers, vol. 34, no. 5, pp. 412-425, May 1985.

[45] K. Y. Lee, "A new Benes network control algorithm", IEEE Transactions on Computers, vol. 36, no. 6, pp. 768-772, June 1987.

[46] T. T. Lee and S. Y. Liew, "Parallel routing algorithms in Benes-Clos networks", IEEE Transactions on Communications, vol. 50, no. 11, pp. 1841-1847, Nov. 2002.

[47] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays �

Trees � Hypercubes, Morgan Kaufmann Publishers, 1992.

[48] J. Lenfant, \Parallel permutations of data: a Benes network control algorithm

for frequently used permutations", IEEE Transactions on computers, vol. 27,

pp. 637-647, July 1978.

[49] G. F. Lev, N. Pippenger, and L. G. Valiant, \A fast parallel algorithm for

routing in permutation networks", IEEE Transactions on Computers, vol. 30,

pp. 93-100, Feb. 1981.

[50] E. Lu and S. Q. Zheng, "Parallel routing algorithms for nonblocking electronic

and photonic multistage switching networks", Proceedings of IEEE International

Parallel and Distributed Processing Symposium (IPDPS 2004), Workshop on

Advances in Parallel and Distributed Computing Models, April, 2004.

[51] E. Lu and S. Q. Zheng, "A parallel iterative improvement stable matching

algorithm", Proceedings of International Conference on High Performance

Computing (HiPC), Lecture Notes in Computer Science, Springer-Verlag, pp. 55-65,

Dec. 2003.

[52] E. Lu, M. Yang, Y. Zhang, and S. Q. Zheng, "Design and implementation of an

acyclic stable matching scheduler", Proceedings of IEEE Global Communications

Conference (GlobeCom), pp. 3938-3942, Dec. 2003.

[53] E. Lu and S. Q. Zheng, "High-speed crosstalk-free routing for optical multistage

interconnection networks", Proceedings of the 12th IEEE International Confer-

ence on Computer Communications and Networks (ICCCN), pp. 249-254, Oct.

2003.

[54] E. Lu and S. Q. Zheng, "A fast parallel routing algorithm for Benes group

switches", Proceedings of the 14th IASTED International Conference on Parallel

and Distributed Computing and Systems, pp. 67-72, Nov. 2002.

[55] E. Lu and S. Q. Zheng, "Parallel algorithms for controlling group switches",

Proceedings of the 15th ISCA International Conference on Parallel and Distributed

Computing Systems, pp. 84-89, Sep. 2002.

[56] G. Maier, A. Pattavina, and S. G. Colombo, "Control of non-filterable crosstalk

in optical-cross-connect banyan architectures", in Proceedings of IEEE Global

Telecommunications Conference GLOBECOM, vol. 2, pp. 1228-1232, Nov.-Dec.

2000.

[57] G. Maier and A. Pattavina, "Design of photonic rearrangeable networks with

zero first-order switching-element-crosstalk", IEEE Transactions on

Communications, vol. 49, no. 7, pp. 1268-1279, Jul. 2001.

[58] N. McKeown, "Scheduling algorithms for input-buffered cell switches", Ph.D.

Thesis, University of California at Berkeley, 1995.

[59] D. G. McVitie and L. B. Wilson, "The stable marriage problem",

Communications of the ACM, vol. 14, no. 7, pp. 486-490, 1971.

[60] C. Minkenberg, "On packet switch design", Ph.D. dissertation, Eindhoven

University of Technology, 2001.

[61] N. Nassimi and S. Sahni, "A self-routing Benes network and parallel permutation

algorithms", IEEE Transactions on Computers, vol. 30, no. 5, pp. 148-154, May

1981.

[62] N. Nassimi and S. Sahni, "Parallel algorithms to set up the Benes permutation

network", IEEE Transactions on Computers, vol. 31, no. 2, pp. 148-154, Feb.

1982.

[63] G. Nong and M. Hamdi, "On the provision of integrated QoS guarantees of

unicast and multicast traffic in input-queued switches", Proceedings of IEEE

Globecom 1999, vol. 3, pp. 1742-1746, 1999.

[64] G. Nong and M. Hamdi, "On the provision of quality-of-service guarantees for

input queued switches", IEEE Communications Magazine, vol. 38, no. 12, pp.

62-69, 2000.

[65] A. Olsson, Understanding Telecommunications, Ericsson, 2002.

[66] D. C. Opferman and N. T. Tsao-Wu, "On a class of rearrangeable switching

networks, Part I: Control algorithm", Bell System Technical Journal, vol. 50,

pp. 1579-1600, 1971.

[67] A. Y. Oruc and H. M. Huang, "Crosspoint complexity of sparse crossbar

concentrators", IEEE Transactions on Information Theory, vol. 42, no. 9,

pp. 1466-1471, Sep. 1996.

[68] K. Padmanabhan and A. Netravali, "Dilated network for photonic switching",

IEEE Transactions on Communications, vol. COM-35, no. 12, pp. 1357-1365,

Dec. 1987.

[69] Y. Pan, C. Qiao, and Y. Yang, "Optical multistage interconnection networks:

new challenges and approaches", IEEE Communications Magazine, vol. 37, no.

2, pp. 50-56, Feb. 1999.

[70] J. H. Patel, "Performance of processor-memory interconnections for

multiprocessors", IEEE Transactions on Computers, vol. 30, no. 10, pp. 771-780, Oct.

1981.

[71] G. Pieris and G. Sasaki, "A linear lightwave Benes network", IEEE/ACM

Transactions on Networking, vol. 1, no. 4, pp. 441-445, Aug. 1993.

[72] B. Prabhakar and N. McKeown, "On the speedup required for combined input-

and output-queued switching", Automatica, vol. 35, no. 12, pp. 1909-1920, 1999.

[73] C. Qiao, R. Melhem, D. Chiarulli, and S. Levitan, "A time domain approach

for avoiding crosstalk in optical blocking multistage interconnection networks",

IEEE Journal of Lightwave Technology, vol. 12, no. 10, pp. 1854-1862, Oct. 1994.

[74] C. Qiao, "Analysis of space-time tradeoffs in photonic switching networks",

Proceedings of IEEE INFOCOM, vol. 2, pp. 822-829, Mar. 1996.

[75] X. Qin and Y. Yang, "Nonblocking WDM switching networks with full and

limited wavelength conversion", IEEE Transactions on Communications, vol.

50, no. 12, pp. 2032-2041, Dec. 2002.

[76] M. J. Quinn, "A note on two parallel algorithms to solve the stable marriage

problem", BIT, vol. 25, pp. 473-476, 1985.

[77] C. S. Raghavendra and R. V. Boppana, "On self-routing in Benes and

shuffle-exchange networks", IEEE Transactions on Computers, vol. 40, no. 9, pp. 1057-1064, Sep.

1991.

[78] R. Ramaswami and K. Sivarajan, Optical Networks: A Practical Perspective,

second edition, Morgan Kaufmann, 2001.

[79] H. Ramanujam, "Decomposition of permutation networks", IEEE Transactions

on Computers, vol. 22, pp. 639-643, 1973.

[80] J. Sharony, S. Jiang, T. E. Stern, and K. W. Cheung, "Wavelength rearrangeable

and strictly nonblocking networks", IEEE Electronics Letters, vol. 28, no. 6, pp.

536-537, Mar. 1992.

[81] J. Sharony, K. W. Cheung, and T. E. Stern, "Wavelength Dilated Switches

(WDS) - a new class of high density, suppressed crosstalk, dynamic

wavelength-routing crossconnects", IEEE Photonics Technology Letters, vol. 4, no. 8, pp.

933-935, Aug. 1992.

[82] J. Sharony, K. W. Cheung, and T. E. Stern, "The wavelength dilation concept in

lightwave networks - implementation and system considerations", IEEE Journal

of Lightwave Technology, vol. 11, no. 5/6, pp. 900-907, May-Jun. 1993.

[83] X. Shen, F. Yang, and Y. Pan, "Equivalent permutation capabilities between

time-division optical Omega networks and non-optical extra-stage Omega

networks", IEEE/ACM Transactions on Networking, vol. 9, no. 4, Aug. 2001.

[84] G. H. Song and M. Goodman, "Asymmetrically-dilated cross-connect switches

for low-crosstalk WDM optical networks", Proceedings of IEEE 8th Annual

Meeting Conference on Lasers and Electro-Optics Society Annual Meeting, vol.

1, pp. 212-213, Oct. 1995.

[85] I. Stoica and H. Zhang, "Exact emulation of an output queueing switch by a

combined input output queueing switch", in Proceedings of the 6th IEEE/IFIP

IWQoS'98, Napa Valley, CA, pp. 218-224, May 1998.

[86] F. M. Suliman, A. B. Mohammad, and K. Seman, "A space dilated lightwave

network - a new approach", Proceedings of IEEE 10th International Conference

on Telecommunications (ICT 2003), vol. 2, pp. 1675-1679, 2003.

[87] A. Subramanian, "A new approach to stable matching problems", SIAM Journal

on Computing, vol. 23, no. 4, pp. 671-700, 1994.

[88] Synopsys Design Analyzer Datasheet, available at

http://www.synopsys.com/products/logic/deanalyzer_ds.html, 1997.

[89] Y. Tamir and G. L. Frazier, "High-performance multiqueue buffers for VLSI

communication switches", Proceedings of the IEEE 15th Annual International

Symposium on Computer Architecture, pp. 343-354, 1988.

[90] S. S. Tseng and R. C. T. Lee, "A parallel algorithm to solve the stable marriage

problem", BIT, vol. 24, pp. 308-316, 1984.

[91] M. Vaez and C. T. Lea, "Wide-sense nonblocking Banyan-type switching

systems based on directional couplers", IEEE Journal on Selected Areas in

Communications, vol. 16, no. 7, pp. 1327-1332, Sep. 1998.

[92] M. Vaez and C. T. Lea, "Strictly nonblocking directional-coupler-based

switching networks under crosstalk constraint", IEEE Transactions on

Communications, vol. 48, no. 2, pp. 316-323, Feb. 2000.

[93] A. Waksman, "A permutation network", Journal of the ACM, vol. 15, no. 1, pp.

159-163, Jan. 1968.

[94] J. E. Watson et al., "A low-voltage 8×8 Ti:LiNbO3 switch with a dilated Benes

architecture", IEEE Journal of Lightwave Technology, vol. 8, pp. 794-800, May

1990.

[95] T. S. Wong and C. T. Lea, "Crosstalk reduction through wavelength

assignment in WDM photonic switching networks", IEEE Transactions on

Communications, vol. 49, no. 7, pp. 1280-1287, Jul. 2001.

[96] C. L. Wu and T. Y. Feng, "On a class of multistage interconnection networks",

IEEE Transactions on Computers, vol. C-29, no. 8, pp. 694-702, Aug. 1980.

[97] M. Yang and S. Q. Zheng, "The kDDR scheduling algorithms for multi-server

packet switches", Proceedings of the ISCA 15th International Conference on

Parallel and Distributed Computing Systems, pp. 78-83, 2002.

[98] M. Yang and S. Q. Zheng, "Efficient scheduling for CIOQ switches with

space-division multiplexing speedup", Proceedings of IEEE Infocom, 2003.

[99] Y. Yang, J. Wang, and Y. Pan, "Permutation capability of optical multistage

interconnection networks", Journal of Parallel and Distributed Computing, vol.

60, no. 1, pp. 72-91, Jan. 2000.

[100] Y. Yang and J. Wang, "Optimal all-to-all personalized exchange in a class of

optical multistage networks", IEEE Transactions on Parallel and Distributed

Systems, vol. 12, no. 6, pp. 567-582, June 2001.

[101] Y. Yang, S. Q. Zheng, and D. Verchere, "Group switching for DWDM networks",

submitted for publication.

[102] S. Q. Zheng and Y. Xiong, "Ingress edge router architecture and related

channel scheduling algorithms for OBS networks", Alcatel internal technical report,

2000.

[103] S. Q. Zheng, M. Yang, and F. Masetti, "Hardware switch scheduling in

high-speed, high-capacity IP routers", Proceedings of the 14th IASTED International

Conference on Parallel and Distributed Computing and Systems, pp. 636-641,

2002.

VITA

Enyue Lu received the B.S. degree in mathematics from Zhejiang Normal University,

China, in 1996, the M.S. degree in mathematics from Nanjing University, China, in

1999, and the M.S. degree in computer science from the University of Texas at Dallas

in 2001. Currently, she is a Ph.D. candidate in the Department of Computer Science

at the University of Texas at Dallas.

Enyue Lu's current research interests include parallel processing and computing,

computer and communication networks, algorithm design and analysis, computer

architectures, software engineering, databases, and combinatorics and graph theory.

Her Ph.D. dissertation focuses on the design and analysis of efficient switching

algorithms for high-performance switches and routers. She has published several

refereed papers in these areas and earned a Best Paper Award at the 14th IASTED

International Conference on Parallel and Distributed Computing and Systems in 2002.

From Jan. 2001 to May 2001, she worked as a co-op in the UMTS/GSM Services

Development Group at Nortel Networks, Richardson, Texas.
