An Efficient Parallel Gauss-Seidel Algorithm on a 3D Torus Network-on-Chip

Abstract: Network-on-chip (NoC) multi-core architectures with a large number of processing elements are becoming a reality with recent developments in technology. In these modern systems the processing elements are interconnected with regular NoC topologies such as meshes and tori. In this paper we propose a parallel Gauss-Seidel (GS) iterative algorithm for solving large systems of linear equations on a 3-dimensional torus NoC architecture. The proposed parallel algorithm has O(Nn²/k³) time complexity for solving a system with a matrix of order n on a k × k × k 3D torus NoC architecture with N iterations, assuming n and N are large compared to k. We show that under these conditions the proposed parallel GS algorithm has near optimal speedup.


Introduction
In this paper we propose a parallel Gauss-Seidel algorithm based on message passing for solving a system of linear equations Ax = b, where A is an n by n dense matrix, b is a known n-vector and x is an n-vector to be determined. Systems of linear equations are of immense importance in mathematics and in its applications to the physical sciences, economics, engineering, the social sciences and the biological sciences, among many others. Even complicated situations are frequently approximated by a linear model as a first step. The solution of a system of nonlinear equations is achieved by an iterative procedure involving the solution of a series of linear equations. Similarly, the solution of ordinary differential equations, partial differential equations and integral equations using the finite difference method leads to a system of linear or nonlinear equations. Linear equations also arise frequently in numerical analysis. There are two classes of methods for solving linear systems of equations: direct and iterative methods. A direct method carries out a fixed number of operations once, at the end of which the solution is produced. Gauss elimination and related strategies applied to a linear system are an example of such methods. Direct methods are often too expensive in terms of computation time, memory requirements, or both. As an alternative, linear systems are usually solved with iterative methods. A method is called iterative when it consists of a basic series of operations which are carried out over and over again until the answer is produced, some exception error occurs, or a limit on the number of iterations is exceeded [1].
For early parallel computers such as the CM-2 and the Intel iPSC/860 [2], it was observed that the single iteration steps of most iterative methods offered too little opportunity for parallelism in comparison with, for instance, direct methods for dense matrices. In particular, the inner products required per iteration by many iterative methods were identified as obstacles because of communication. This has led to attempts to combine iteration steps, or to combine the message passing for different inner products.
The Gauss-Seidel method is one of the most efficient iterative methods for solving linear systems that arise in solving partial differential equations. Many parallel implementations have been proposed, including those reported in [3][4][5][6][7][8]. Some implementations have been developed for regular problems such as the Laplace equation [9,10], circuit simulation problems [11], power load-flow problems [12], and for many applications with inter-dependent constraints, or as a relaxation step in multi-grid methods [3]. In [13] several parallelization strategies for the dense Gauss-Seidel method are presented. These strategies are compared and evaluated through performance measurements on a large range of hardware architectures. The authors found that these new architectures do not offer the same trade-off in terms of computation power versus communication and synchronization overheads as do traditional high-performance platforms. In 1999, Wang and Xu [14] presented a specific technique for solving convection-dominated problems. Their algorithm uses crosswind thin blocks in a block Gauss-Seidel method. Their method is based on a special partitioning technique for a block iterative method for solving the linear system derived from a monotone discretization scheme for convection diffusion problems. They conclude that crosswind grouping is essential for the rapid convergence of the method. In 2005, Grabel et al. [15] presented two simple techniques for improving the performance of the parallel Gauss-Seidel method for the 3D Poisson equation by optimizing cache usage as well as reducing the number of communication steps.
In 2006, Nobuhiko et al. [16] presented a novel parallel algorithm for the block Gauss-Seidel method. The algorithm is devised by focusing on Reitzinger's coarsening scheme for those linear systems derived from the finite element discretization with first order tetrahedral elements.
The time consumed by communication between processors limits the speed of parallel computation. With advances in technology, chips with a large number of cores (processing elements) are becoming a reality. Communication between processing elements in such multi-core systems was initially based on buses. When the number of cores increased, the bus became a performance bottleneck. In recent years, networks-on-chip (NoCs) have been used instead of buses for interconnecting the on-chip processing elements, which has resulted in faster inter-processor communication. The topology of the network-on-chip has a major impact on the communication performance of the multi-core system [17].
Several topologies have been proposed and studied for NoCs, including mesh-based and tree-based topologies [17]. The emerging three-dimensional (3D) integration and process technologies allow the design of multi-level Integrated Circuits (ICs). This creates new design opportunities in NoC design. For example, a considerable reduction can be achieved in the number and length of global interconnections using three-dimensional integration.
Motivated by these new developments in technology and by the resulting improved performance of inter-processor communication on modern 3D network-on-chip systems, we propose, and analyze the complexity of, a new parallel implementation of the Gauss-Seidel algorithm on 3D torus NoC architectures. The proposed algorithm uses message passing for inter-processor communication. It is an extension of our previous similar algorithm on a 2D torus NoC [18]. A shorter version of this paper has been presented in [19].
The rest of the paper is structured as follows: in section 2 an introduction is presented including a description of the sequential Gauss-Seidel algorithm to be parallelized, as well as an introduction to the 3D torus NoC architecture.We describe the proposed parallel algorithm in section 3, and we evaluate its performance in section 4. The paper is concluded in section 5.

Preliminaries

a. The Gauss-Seidel sequential algorithm
The Gauss-Seidel (GS) algorithm is an improvement of the Jacobi algorithm. GS corrects the i-th component x_i^(m) of the vector x^(m) in the order i = 0, 1, ..., n-1. The approximate solution is updated immediately after the new component is determined. The newly computed component can be changed within a working vector which is redefined at each relaxation step, and this results in the following iterative formula [10]:

    x_i^(m+1) = ( b_i - Σ_{j<i} a_ij x_j^(m+1) - Σ_{j>i} a_ij x_j^(m) ) / a_ii ,   i = 0, 1, ..., n-1     (1)

In matrix notation, equation (1) becomes:

    (D + L) x^(m+1) = b - U x^(m)     (2)

In (2), L, D, and U are the lower triangular, diagonal, and upper triangular parts of matrix A, respectively. Figure 1 outlines the sequential GS algorithm. It is well known that the GS algorithm always converges if the matrix A is strictly or irreducibly diagonally dominant. The sequential GS algorithm has time complexity O(Nn²) for N iterations and therefore requires a large execution time for large problem sizes n, hence the need for faster parallel implementations.
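As a concrete illustration of equation (1) and of the O(Nn²) cost per sweep, here is a minimal sequential GS sketch in Python/NumPy; the function name, the max_iter cap and the sample system are our own illustrative choices and are not taken from the paper.

    import numpy as np

    def gauss_seidel(A, b, x, tolerance, max_iter=1000):
        # Sequential GS: sweep i = 0..n-1, immediately reusing the updated x_j (j < i).
        n = len(b)
        for _ in range(max_iter):                              # at most N iterations
            oldx = x.copy()
            for i in range(n):
                s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]    # sum of a_ij * x_j, j != i
                x[i] = (b[i] - s) / A[i, i]                    # equation (1)
            if np.linalg.norm(x - oldx) < tolerance:           # convergence test
                break
        return x

    # Example: a small strictly diagonally dominant system, for which GS converges.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(gauss_seidel(A, b, np.zeros(3), 1e-10))

Each inner sweep performs O(n) work per component, i.e. O(n²) per iteration, which is the quantity the parallel algorithm distributes over the k³ processors.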

b. The 3D torus network-on-chip architecture
An n-dimensional torus network, also called a wrap-around mesh or toroidal network, is a Cartesian product of n cycle networks. Two-dimensional mesh-based topologies (such as the 2D mesh and 2D torus) have been the most popular among the known NoC topologies. Their popularity is due to their modularity (they are easily expandable by adding new nodes and links without modifying the existing structure), their ability to be partitioned into smaller meshes, their simple XY routing strategy, and their ease of implementation. They also have a regular structure and short inter-switch wires. They have been used in several chip multiprocessors such as the RAW processor [20], the TRIPS processor [21], the Intel 80-core Terascale processor [22], the 100-core TILE-Gx100 processor from Tilera [23] and the Single-Chip Cloud Computer (SCC) of Intel [24].
The emerging three-dimensional (3D) integration and process technologies allow the design of multi-level Integrated Circuits (ICs) [25]. This creates new opportunities in NoC design [26,27]. For example, a considerable reduction can be achieved in the number and length of global interconnections using three-dimensional integration. 3D NoCs are more advantageous than 2D NoCs in providing better performance for large multi-core systems [28]. Long horizontal wires in 2D NoCs can be replaced by very short vertical links in 3D NoCs.
Despite having been available for quite a while, the 3D torus architecture now has the potential to be one of the most attractive interconnection topologies for future large NoC systems. This is because severe challenges are now faced due to the rising number of cores (nodes). Petascale and exascale installations require, and will require, hundreds or thousands of cores to work together efficiently. The 3D torus topology offers the ability to add nodes without affecting performance and reliability. It is also important for future large multi-core systems to consume less energy. Connecting nodes using a 3D torus topology means that each node is connected to the adjacent ones via short cabling (except for the wrap-around links) in 6 different "directions": X+, X-, Y+, Y-, Z+ and Z-. The pair-wise connectivity between nearest-neighbor nodes of a 3D torus helps to reduce energy consumption and communication latency. A 3-dimensional torus interconnection topology is illustrated in Figure 2.

The proposed parallel GS algorithm
In our proposed parallel GS algorithm, we partition the n×n matrix A into blocks of n/k³ columns each, and scatter them to the k³ processors of a k×k×k 3D torus. The k³ processors are identified by (x, y, z) coordinates, 0 ≤ x, y, z ≤ k-1, as illustrated in Figure 2. Processor (x, y, z) is connected to the six neighboring processors (x-1, y, z), (x+1, y, z), (x, y-1, z), (x, y+1, z), (x, y, z-1) and (x, y, z+1). The +1 and -1 operations in these expressions are modulo k in order to include the wrap-around links.
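For illustration (our own sketch, not part of the paper), the six wrap-around neighbors of a node can be computed with modulo-k arithmetic:

    def torus_neighbors(x, y, z, k):
        # Six neighbors of node (x, y, z) in a k x k x k torus; the modulo-k
        # arithmetic realizes the wrap-around links on each dimension.
        return [((x - 1) % k, y, z), ((x + 1) % k, y, z),
                (x, (y - 1) % k, z), (x, (y + 1) % k, z),
                (x, y, (z - 1) % k), (x, y, (z + 1) % k)]

    # Example: the corner node (0, 0, 0) of a 4x4x4 torus wraps around to (3, 0, 0), etc.
    print(torus_neighbors(0, 0, 0, 4))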
We also assign sequential processor numbers (ids) to the k³ processors as follows: the processor whose coordinates are (x, y, z) is assigned the sequential id id(x, y, z) = x + ky + k²z. In this way, the k³ processors are also identified with sequential ids in the range 0 ... k³-1, as illustrated in Figure 3.

Figure 4 outlines the steps of the proposed parallel GS algorithm. Given the problem inputs A, b, x and tolerance, processor 0 (the master processor) partitions matrix A into blocks of n/k³ columns each, and scatters them to the processors. Processor r receives the r-th block of the matrix, containing the a_ij elements for i and j in the ranges 0 ≤ i < n and r(n/k³) ≤ j ≤ (r+1)(n/k³)-1, respectively. Then processor 0 scatters the elements of the vector x to the processors. Processor P_r receives the r-th segment of n/k³ elements of the vector x, i.e. the x_j elements for j in the range r(n/k³) ≤ j ≤ (r+1)(n/k³)-1. After scattering A and x, processor 0 broadcasts the value of tolerance to all processors. The rest of the algorithm is similar to the sequential algorithm, except that the loop for calculating and summing the a_ij x_j terms is done in parallel by the k³ processors. Processor r calculates the partial sum of the a_ij x_j terms, j = r(n/k³), ..., (r+1)(n/k³)-1, corresponding to the r-th block of matrix A and the r-th segment of vector x received by this processor. The partial sums are then collected and summed (reduce-sum) at processor 0, which completes the calculation of the new x_i element. Processor 0 then sends the new x_i to the processor in charge of x_i, that is, processor ⌊i/(n/k³)⌋.

Notice that this parallel GS algorithm is based on parallelizing the calculation of sum inside the inner loop of the sequential algorithm (Figure 1). Each processor is in charge of a distinct block of columns of matrix A and of a distinct segment of vector x elements, allowing it to contribute to summing the a_ij x_j terms in a distinct range of j. More precisely, processor P_r (i.e. with sequential id r) is in charge of the j range r(n/k³) ≤ j ≤ (r+1)(n/k³)-1. The partial sums calculated in parallel by the different processors are gathered by the master, forcing all processors to synchronize at this point before proceeding to the next i iteration. This yields a correct parallelization of the calculation of sum in the sequential algorithm.
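The following is a minimal message-passing sketch of this scheme, written with Python and mpi4py rather than the NoC primitives analyzed below; the assumptions that the master keeps full copies of A and b, that the number of ranks divides n evenly, and that convergence is broadcast as a termination flag are ours, made only to keep the sketch self-contained and runnable.

    # Run with, e.g.:  mpiexec -n 8 python parallel_gs_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    r, p = comm.Get_rank(), comm.Get_size()        # p plays the role of k^3

    n, tol, max_iter = 512, 1e-8, 1000             # illustrative sizes (assumes p divides n)
    if r == 0:
        rng = np.random.default_rng(0)
        A = rng.random((n, n)) + n * np.eye(n)     # diagonally dominant, so GS converges
        b = rng.random(n)
        x = np.zeros(n)
        A_blocks = np.hsplit(A, p)                 # p column blocks of n/p columns each
        x_segs = np.split(x, p)
    else:
        A_blocks = x_segs = None

    tol = comm.bcast(tol, root=0)                  # mirrors the broadcast of tolerance
    A_loc = comm.scatter(A_blocks, root=0)         # this rank's block of columns of A
    x_loc = comm.scatter(x_segs, root=0)           # this rank's segment of x
    w = n // p                                     # columns (and x elements) per rank
    lo = r * w                                     # global index of the first local column

    done, it = False, 0
    while not done and it < max_iter:
        if r == 0:
            oldx = x.copy()
        for i in range(n):
            # Partial sum S_r over this rank's columns, excluding the diagonal term.
            mask = np.ones(w, dtype=bool)
            if lo <= i < lo + w:
                mask[i - lo] = False
            S_r = float(A_loc[i, mask] @ x_loc[mask])
            total = comm.reduce(S_r, op=MPI.SUM, root=0)    # reduce-sum at the master
            if r == 0:
                x[i] = (b[i] - total) / A[i, i]             # new x_i (equation (1))
                owner = i // w                              # rank in charge of x_i
                if owner == 0:
                    x_loc[i - lo] = x[i]
                else:
                    comm.send(x[i], dest=owner, tag=i)
            elif r == i // w:
                x_loc[i - lo] = comm.recv(source=0, tag=i)  # receive the updated x_i
        if r == 0:
            done = bool(np.linalg.norm(x - oldx) < tol)     # convergence test at master
        done = comm.bcast(done, root=0)                     # let every rank terminate
        it += 1

    if r == 0:
        print("iterations:", it, "residual:", np.linalg.norm(A @ x - b))

On the 3D torus NoC, the scatter, reduce and point-to-point calls above map onto the group communication operations whose costs are analyzed in the next section.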

Analysis of the parallel GS algorithm
Table 1 outlines timing expressions for the various steps of the parallel GS algorithm of Figure 4. We assume it takes an amount of time t_copy to copy the value of a real number from one memory location to another, t_multiply to multiply two real numbers, t_add to add two real numbers, and t_sqrt to calculate the square root of a real number. Expressions for t_broadcast, t_scatter, and t_reduce-sum, which correspond to the time required for the group communication operations (broadcast, scatter, reduce-sum) in the k×k×k torus, will be derived later in this section.

Parallel_GS Algorithm
{
    // Let r be the sequential id of the local processor (0 ≤ r < k³)
    if (r = 0)    // master processor
    {
        Input A, b, x, tolerance
        Partition A into k³ blocks of n/k³ columns each
        Scatter the blocks of columns of A to the k³ processors
        Partition x into k³ segments of n/k³ elements each
        Scatter the segments of elements of x to the processors
        Broadcast tolerance to all processors
    }
    else
    {
        Receive the r-th block of n/k³ columns of A
        Receive the r-th segment of n/k³ elements of x
        Receive tolerance
    }
    loop    // main iterations
    {
        oldx = x
        for i = 0 to n-1
        {
            Calculate the partial sum S_r = Σ a_ij x_j over the local columns (j ≠ i)
            if (r = 0)
            {
                Gather and sum the partial sums: sum = Σ S_r (reduce-sum)
                Calculate the new x_i from sum, b_i and a_ii (equation (1))
                Send the new x_i to processor ⌊i/(n/k³)⌋
            }
            else
            {
                Send partial sum S_r to processor 0 (contribute to gather)
                if (r = ⌊i/(n/k³)⌋) receive x_i
            }
        }
        if (r = 0) and (||x - oldx|| < tolerance) terminate computation
    }
}

Table 1. Complexity of the parallel GS algorithm.

    Algorithm step                              Time complexity
    1. Scatter the blocks of columns of A       ...
    ...                                         ...
    5. Calculate the S_r partial sums           T_5 = (n/k³)(t_multiply + t_add) = O(n/k³)
    6. Reduce-sum the S_r partial sums          ...

It can be seen from Figure 4 that the total time required by the parallel GS algorithm is given by combining the step times T_1 through T_9 (expression (3)). It remains to derive expressions for t_broadcast, t_scatter, and t_reduce-sum, which correspond to the timing on the 3D torus of the group communication operations broadcast, scatter, and reduce-sum, respectively.

a. The cost of broadcasting on the 3D torus
The broadcasting of a message of size s from a source node to all other nodes in the k×k×k torus can be done in time t_broadcast = 3⌈k/2⌉ s t_com, where t_com is the time needed to send a single number from one processor to a neighboring processor in the k×k×k torus. This expression is justified as follows: broadcasting in the k×k×k torus can be done in three phases, as illustrated in Figure 5. In the first phase, the message is propagated on the X dimension in both directions (X+ and X-), starting at the source node and making use of the wrap-around links on the X dimension if needed. This X broadcasting phase requires ⌈k/2⌉ single-hop communication steps. In the second phase, all the nodes which received the message during the first phase (the nodes located on the source row) initiate parallel propagations of the message across the Y dimension in both directions (Y+ and Y-), making use of the wrap-around links on the Y dimension if needed. This phase also requires ⌈k/2⌉ single-hop communication steps. In the third phase, all the nodes which received the message during the first and second phases (i.e. all the nodes located on the source plane) initiate parallel propagations of the message across the Z dimension in both directions (Z+ and Z-), making use of the wrap-around links on the Z dimension if needed. This third phase also requires ⌈k/2⌉ single-hop communication steps. Figure 6 shows an implementation of this broadcasting algorithm on the k×k×k torus.

A reduce-sum operation is needed in the parallel GS algorithm to collect and sum up the S_r partial sums. It can be done by reversing the three phases of the broadcasting operation. During the first phase, partial sums across dimension Z are calculated in parallel by a sequence of summing and sending in the Z+ or Z- direction (whichever is closer) to the sink plane. This yields a set of partial sums stored at the processors of the sink plane. During the second phase, the processors on the sink plane calculate partial sums across dimension Y in parallel by a sequence of summing and sending in the Y+ or Y- direction (whichever is closer) to the sink row. This yields a set of partial sums stored at the processors of the sink row. The processors on the sink row then calculate the final sum by a sequence of summing and sending in the X+ or X- direction (whichever is closer) to the sink processor. The final sum is stored at the sink processor. Each of the three phases requires ⌈k/2⌉ steps of summing and sending. The resulting total time of the reduce-sum operation is therefore t_reduce-sum = 3⌈k/2⌉ (t_add + t_com).
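As a quick sanity check (our own addition, not from the paper), the following Python snippet simulates per-dimension propagation on a ring of k nodes with wrap-around links and verifies that three such phases stay within the 3⌈k/2⌉ single-hop bound used above for the broadcast and reduce-sum costs.

    from math import ceil

    def ring_cover_steps(k, src=0):
        # Steps needed for a message starting at src to reach all k nodes of one
        # ring when it spreads one hop in both directions per step (wrap-around).
        reached, steps = {src}, 0
        while len(reached) < k:
            reached |= {(v + 1) % k for v in reached} | {(v - 1) % k for v in reached}
            steps += 1
        return steps

    for k in (2, 3, 4, 5, 8):
        per_phase = ring_cover_steps(k)          # phase along one dimension (X, Y or Z)
        total = 3 * per_phase                    # the three phases run one after another
        assert total <= 3 * ceil(k / 2)          # within the 3*ceil(k/2) bound
        print(f"k={k}: {per_phase} steps per phase, {total} steps in total")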

d. Overall cost of the parallel GS algorithm
Notice that T_1, T_2, T_3, T_6, and T_8 in expression (3) of the total execution time of the parallel GS algorithm correspond to communication steps, while T_4, T_5, T_7 and T_9 correspond to computation steps. Using the obtained timing expressions for the broadcasting, scattering and reduce-sum operations, the total communication time T_comm can be derived. Notice that T_comm is proportional to t_com (the single-hop communication cost), which is much lower on NoC networks than on cluster or multiprocessor networks. Therefore the proposed parallel GS algorithm runs faster on a NoC than on a loosely coupled cluster or a tightly coupled multiprocessor.
The total computation time T_comp is given by the sum of the computation step times, and combining T_comm and T_comp gives the total execution time T_total of the parallel GS algorithm (expression (4)). When the size of the linear system n and the number of iterations N are large compared to the number of processors k³ (in the order of k⁴ or larger), the total execution time T_total of the parallel GS algorithm in expression (4) is dominated by the term Nn²/k³, which is k³ times smaller than the time of the sequential algorithm. We can therefore conclude that, when the problem size n and the number of iterations N are large compared to k, the proposed parallel GS algorithm has a near optimal speedup (nearly equal to the number of processors k³) and hence a near optimal efficiency. Recall that the speedup of a parallel algorithm is defined as the time of the sequential algorithm divided by the time of the parallel algorithm, while the efficiency is defined as the speedup divided by the number of processors. Figures 7 and 8 illustrate how the speedup and efficiency increase as n and N increase.
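To give a feel for this behavior, the following toy model (our own simplification, not expression (4) itself) keeps only the dominant Nn²/k³ computation term and a per-step communication term proportional to t_com, and prints the resulting speedup and efficiency as n grows; all constants are illustrative.

    def speedup_model(n, N, k, t_flop=1.0, t_com=1.0):
        # Toy model: sequential time vs. parallel time on k^3 processors.
        # Each sweep does n^2 multiply-adds split k^3 ways; each of the n inner
        # steps also pays roughly one 3*ceil(k/2)-hop reduce-sum in t_com units.
        p = k ** 3
        t_seq = N * n * n * t_flop
        t_par = N * n * (n / p) * t_flop + N * n * 3 * ((k + 1) // 2) * t_com
        speedup = t_seq / t_par
        return speedup, speedup / p                      # (speedup, efficiency)

    k, N = 4, 100
    for n in (256, 1024, 4096, 16384):
        s, e = speedup_model(n, N, k)
        print(f"n={n:6d}: speedup={s:7.1f} (of {k**3}), efficiency={e:.2f}")

Under this simplified model the speedup approaches k³ (and the efficiency approaches 1) as n grows large relative to k, which is the trend shown in Figures 7 and 8.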

Conclusion
We have proposed a parallel Gauss-Seidel iterative algorithm for solving large systems of linear equations on a 3D torus network-on-chip architecture. The proposed algorithm makes use of the X, Y, Z interconnects with wrap-around links of the 3D torus for efficient group communication operations between the processors, including broadcasting, scattering and reduce-sum operations. These efficient group communication operations are at the heart of the proposed algorithm. We have shown that the proposed parallel algorithm has near optimal speedup when solving large linear systems that require a large number of iterations. This work can be extended by an experimental or simulation-based performance evaluation of the proposed parallel algorithm.

Figure 3. Sequential ids of the processors in plane z = 0.

Figure 5. Broadcasting on the 3D torus in 3 phases.

Figure 7. Speedup and efficiency of the parallel GS algorithm as a function of n.

Figure 8. Speedup and efficiency of the parallel GS algorithm as a function of N.