A Multi-Step Approach for Scheduling Tasks with Synchronization on Clusters of Computers

In this work, a two-step approach is adopted for scheduling tasks with synchronous inter-task communication. To that end, an efficient algorithm, called GLB-Synch, is introduced for mapping clusters and ordering tasks on processors. The algorithm uses the information obtained during the clustering step to select a cluster to be mapped on the least loaded processor. A performance study of the GLB-Synch algorithm has been conducted by simulation. The multi-step scheduling setup is based on a previously developed algorithm for clustering DAGs with synchronous communication, called NLC-SynchCom, and on synthesized DAGs. We show by analysis and experimentation that the GLB-Synch algorithm retains the low complexity of the first (clustering) step. The performance results highlight the drawback of synchronization on speedup scalability.


Introduction
With the era of low-cost commodity hardware, clusters of workstations (COW) are emerging as platforms for parallel and distributed computing environments. Computers connected through a LAN or WAN form an infrastructure for Grid Computing (Foster and Kesselman, 1997). They provide high computing power for massively parallel processing that can be accessed by a wide spectrum of application programmers. However, the communication cost over the available networking facilities is still very high compared to the computing cost. Besides, such an infrastructure lacks the reliability found in typical multi-processor interconnection networks. To overcome this deficiency, synchronous message-passing communication may need to be enforced for many parallel applications or middleware software. In general, synchronous communication adds overhead to the already high cost of communication. Furthermore, it may cause deadlocks among the communicating tasks. Regardless of any constraints, parallel programs must be efficiently partitioned and scheduled on a COW to achieve any perceivable gain in performance.

*Corresponding author. E-mail: arafeh@squ.edu.om
The main contribution of this work is an algorithm, called Guided Load Balancing with Synchronization (GLB-Synch). It is a basis for mapping and ordering tasks on processors with synchronous communication. The GLB-Synch algorithm is based on the GLB algorithm for cluster mapping that was introduced by Radulescu (2001). However, the GLB-Synch algorithm performs both cluster-mapping and task-ordering, and retains the low complexity of the unbounded number of processors (UNC) schedule applied in the first step.
The rest of the paper is organized as follows. The next section introduces background, definitions and related work. Section 3 presents the GLB-Synch algorithm, while Section 4 presents the performance study. Finally, Section 5 concludes the paper.

Preliminaries
The execution behavior of the program DAG follows the macro-dataflow model. However, the execution of each task consists of three phases: receive, compute and send. The receive phase includes receiving all messages required by the task for its execution to start. The compute phase is the phase in which the instructions of the task are executed without interruption. We assume a synchronous communication protocol. The send phase includes sending all messages to all dependent tasks in parallel. However, the sender task is blocked, waiting for acknowledgements, until all receiving tasks actually receive the messages.

Multi-Step Scheduling
The multi-step scheduling process can be carried out in two or three steps. First, tasks are clustered without duplication, assuming an unbounded number of processors. Second, clusters are mapped onto the available processors. Third, the tasks of the mapped clusters are ordered for execution on the processors. The clustering problem has been shown to be NP-complete (Papadimitriou and Yannakakis, 1999; Sarkar, 1989). Polynomial-time heuristic algorithms have been proposed for the clustering problem based on critical path analysis (Sarkar, 1989; Wu and Gajski, 1990; Gerasoulis and Yang, 1993; Kwok and Ahmad, 1999; Kadamuddi and Tsai, 2000; Shirazi et al., 1990; Lee et al., 2003). An overview of cluster-mapping is given next.

Mapping Clusters to Processors
Cluster-mapping is needed when the number of available processors is less than the number of clusters. A number of approaches have been proposed in the literature. However, the issue of mapping clusters to processors has not been given enough attention, and there is much room to explore on this topic. In the next paragraphs, we discuss some of the approaches reported in the literature.

Sarkar (1989) used a list-scheduling based method to map the clusters to processors, called List Cluster Assignment (LCA). It is an incremental algorithm that performs both cluster-mapping and task-ordering in a single step.

Kim and Browne (1988) proposed a mapping scheme for clusters based on their linear clustering algorithm. The clusters are first merged in order to reduce their number to be at most equal to the number of processors. Then the process is followed by heuristics to optimize the mapping step. Mainly, the heuristics choose the processor that has the most appropriate number of channels among the currently unallocated processors.

Wu and Gajski (1990) proposed a mapping scheme for clusters based on a dedicated traffic scheduling algorithm that balances the network traffic. The algorithm generates an initial assignment by a constructive method; then, the assignment is iteratively improved to obtain a better mapping. The heuristic is based on minimizing the total communication traffic.

Yang and Gerasoulis (1994) employed a work profiling method for merging clusters, called the Wrap Cluster Merging (WCM) algorithm. First, clusters are sorted in increasing order of aggregate computational load. Then, a load balancing algorithm is invoked to map the clusters to the processors, so that every processor has about the same load.
The work of Liou and Palis (1997) investigated the problem of mapping clusters to processors. They have shown the effectiveness of the two-phase scheduling approach, in which task clustering is followed by a cluster-mapping step, over one-phase scheduling. They proposed a clustering algorithm, called CASS-II (Clustering and Scheduling System II), and introduced three cluster-mapping algorithms, namely, the LB (Load-Balancing) algorithm, the CTM (Communication Traffic Minimizing) algorithm, and the RAND (Random) algorithm. They applied randomly generated task graphs in an experimental study using their clustering algorithm and cluster-mapping schemes. Their work shows that, when task clustering is performed before cluster-mapping, load balancing is the preferred approach for merging clusters. Compared to CTM, LB is fast, easy to implement and produces significantly better schedules.

Radulescu (2001) proposed two algorithms for mapping clusters to processors in a multi-step scheduling approach. Both algorithms aim at achieving a better cost-performance ratio. The first algorithm, called Guided Load Balancing (GLB), exploits knowledge about the task start times that were computed in the clustering step. Accordingly, clusters are mapped in the order of their start times to the least loaded processor at that time. The second algorithm, called List Load Balancing (LLB), aims at improving the load balancing throughout the program execution time by performing cluster-mapping and task-ordering in one step. There are two benefits reported for this integration. First, it allows dynamic load balancing through the execution of the mapping process, because only the ready tasks are considered in the mapping process. Second, it considers communication costs when selecting tasks for mapping, as opposed to other cluster-mapping algorithms, such as WCM and GLB, which do not.

In this work, a parallel program is modeled as a weighted directed acyclic graph (DAG), $G = (V, E, \omega, \lambda)$, where $V$ is the set of task nodes, $E$ is the set of communication edges, $\omega$ is the set of task computation weights, and $\lambda$ is the set of edge communication costs.

An edge $e_{ij} = (n_i, n_j) \in E$ represents a data dependence constraint between the two tasks $n_i$ and $n_j$, where the execution of $n_j$ must start after receiving all input from $n_i$. The communication cost of message passing along an edge $e_{ij}$ is denoted by $c_{ij} = \lambda(e_{ij})$, and the computation weight of a task $n_i$ is denoted by $\omega(n_i)$. We will refer to the source and destination nodes of an edge as the parent node and the child node, respectively. A node that does not have any parent is called an entry node, while a node that does not have any child is called an exit node. $Pred(n_i)$ is the set of immediate predecessors of $n_i$, and $Succ(n_i)$ is the set of immediate successors of $n_i$. The length of a path is defined as the sum of all computation weights of nodes and all communication costs of edges along the path. The critical path of a DAG is the path from an entry node to an exit node that has the maximum length. The computation to communication ratio of a parallel program (PCCR) is defined as its average computation weight divided by its average communication cost.
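The DAG model used in this paper, $G = (V, E, \omega, \lambda)$, together with the definitions of path length, critical path and PCCR, can be illustrated with a short Python sketch. The function names and the dictionary-based graph representation are assumptions made for this illustration only; they are not taken from the original work.

```python
from collections import defaultdict


def critical_path_length(weights, edges):
    """Length of the longest entry-to-exit path of a weighted DAG.

    weights: {node: computation weight}          (omega)
    edges:   {(parent, child): communication cost} (lambda)
    """
    succ = defaultdict(list)
    indegree = {n: 0 for n in weights}
    for parent, child in edges:
        succ[parent].append(child)
        indegree[child] += 1

    # Kahn's algorithm: the list grows while it is being traversed.
    order = [n for n in weights if indegree[n] == 0]
    for n in order:
        for child in succ[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                order.append(child)

    # longest[n] = longest path length ending at n, including omega(n).
    longest = dict(weights)
    for n in order:
        for child in succ[n]:
            candidate = longest[n] + edges[(n, child)] + weights[child]
            if candidate > longest[child]:
                longest[child] = candidate
    return max(longest.values())


def pccr(weights, edges):
    """Average computation weight divided by average communication cost."""
    mean_weight = sum(weights.values()) / len(weights)
    mean_cost = sum(edges.values()) / len(edges)
    return mean_weight / mean_cost


# Small hypothetical task graph: a -> b -> d and a -> c -> d.
weights = {"a": 2, "b": 3, "c": 4, "d": 1}
edges = {("a", "b"): 1, ("a", "c"): 5, ("b", "d"): 2, ("c", "d"): 1}
print(critical_path_length(weights, edges), pccr(weights, edges))
```

In this toy graph the path through c dominates because of the heavier edge from a to c, which shows how communication costs, and not only computation weights, determine the critical path.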
The work by Lee et al. (2003) introduced a multi-step scheduling approach using a block dependency DAG that represents the execution behavior of the block sparse Cholesky factorization operation in a distributed-memory system. The proposed scheduling algorithm consists of two stages. In the first stage, a clustering algorithm, called Early-Start Clustering (ESC), is used to cluster tasks while preserving the earliest start time of a task without limiting the potential degree of parallelism, and without considering the number of available processors. In the second stage, a cluster mapping algorithm, called Affine Cluster Mapping (ACM), is used to allocate clusters to a given number of processors. The ACM algorithm attempts to reduce the communication overhead and balance the workloads among the processors based on two criteria. These are the affinity of a cluster with respect to a processor, in terms of the sum of communication costs required when the cluster is mapped to other processors, and the amount of workload required for a cluster. The work by Lee et al. (2003) shows by experiments the effectiveness of applying the proposed scheduling algorithm, compared to other processor mapping methods that are used for parallelizing the sparse Cholesky factorization operation. The experiments were conducted on a Myrinet cluster system using benchmark sparse matrices.

Scheduling with Synchronous Communication
Most scheduling algorithms for distributed-memory parallel architectures assume the use of an asynchronous communication protocol for a message-passing system. However, parallel computing on a network of workstations or over the Internet is not as reliable as that performed on parallel machines. Therefore, the requirement for synchronization at the application level becomes evident for many software systems.
In synchronous communication, the sender is blocked until an acknowledgement is received from the receiver. This waiting time is called the blocking delay. A deadlock occurs when a sender gets blocked indefinitely, waiting for an acknowledgment from a receiver task in another cluster. At the same time, the receiver task cannot start execution, because one of its predecessor tasks has been blocked indefinitely, waiting for an acknowledgment from a task in some other cluster. A direct deadlock situation between two clusters occurs due to a cyclic dependency relation between them. In general, a deadlock situation may arise due to a chain of dependency relations among a subset of tasks. In this work, we consider direct deadlock situations only. Based on the task execution phases, the following definitions of time parameters characterize the scheduling of a task node in a scheduled DAG with synchronous communication.
The issue of scheduling on distributed-memory parallel architectures with synchronous communication has not been given enough attention in the literature. However, the works of Kadamuddi and Tsai (2000) and Arafeh (2003) address this issue, assuming a multi-step scheduling approach. Both propose clustering algorithms for tasks with synchronous communication, in which deadlocks are detected and avoided as part of the task clustering step.
The work presented in this paper uses the clustering algorithm, called NLC-SynchCom, by Arafeh (2003) for the task-clustering step. The algorithm proceeds in one pass in the forward direction from entry nodes to exit nodes, one level at a time. The algorithm starts assuming each task node is in a cluster by itself. Therefore, there are |V| clusters at the beginning of the algorithm. Each node in a cluster is designated by its status as a Head, Tail, Regular or Singleton. A node in a cluster by itself is given the status of a Singleton node. The Head task of a cluster is the one that must be scheduled first due to its precedence with respect to all other tasks in the cluster.
Similarly, the Tail task of a cluster is the one that must be scheduled last due to its precedence with respect to all other tasks in the cluster. A Regular task in a cluster is one that is not a Singleton, Head or Tail. The selection of a parent node, $n_i$, at level l that will be merged with one of its child nodes is determined using two priority schemes.
The first scheme is used to determine the priority of a parent node at level l for merging. It is defined by the parent's completion time, e_send(parent), in descending order. The second scheme is used to determine the priority of merging a child node $n_j$ for a parent $n_i$. This priority depends on the maximum remaining time left to the completion of execution from a parent node, $n_i$, to an exit node, excluding $n_i$. Since a merging step zeroes a (parent, child) edge, the parent node and all its descendant nodes may have their completion times changed accordingly. Thus, the priorities of the parent nodes at each level are found dynamically before the nodes of that level are scanned for merging. On the other hand, the remaining time to the completion of execution for each node is computed at initialization time only.
Priority is given to merging the parent node with the child node that leads to the highest remaining time to completion. A selected child node $n_j$ has to pass two tests before merging can be finally applied. These are the deadlock detection test and the merging check test. First, the deadlock detection test ensures that merging a child with its parent's cluster would not cause a deadlock between the parent's cluster and any other existing cluster in the DAG. Second, the merging check test ensures that merging a child node with its parent's cluster would not cause an increase in the application's execution time. This execution time is referred to in this paper as the DAG Parallel Time, PT. The (parent, child) edge is zeroed and merging is performed only if both tests are passed successfully. The complexity cost of the NLC-SynchCom algorithm is $O(v(\log v + e^2))$, where v is the number of nodes and e is the number of edges in a DAG. For further details, see the paper by Arafeh (2003).
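As an illustration of the direct deadlock case considered in this work (a cyclic dependency relation between two clusters), the following Python sketch flags pairs of clusters that depend on each other in both directions, and checks whether a tentative merge would create such a pair. It is a conservative stand-in written for this illustration; the actual deadlock detection test of NLC-SynchCom is described in Arafeh (2003) and is not reproduced here. The function names and data layout are hypothetical.

```python
def mutually_dependent_clusters(cluster_of, edges):
    """Return sorted cluster pairs (a, b) with edges a -> b and b -> a.

    cluster_of: {task: cluster id}
    edges:      iterable of (parent task, child task)
    """
    links = set()
    for parent, child in edges:
        a, b = cluster_of[parent], cluster_of[child]
        if a != b:  # intra-cluster edges do not link two clusters
            links.add((a, b))
    return sorted((a, b) for (a, b) in links if a < b and (b, a) in links)


def merge_creates_mutual_dependency(cluster_of, edges, child, target_cluster):
    """Check a tentative merge of `child` into `target_cluster`.

    The clustering is copied, so the caller's mapping is left untouched.
    """
    trial = dict(cluster_of)
    trial[child] = target_cluster
    pairs = mutually_dependent_clusters(trial, edges)
    return any(target_cluster in pair for pair in pairs)


# Hypothetical chain t1 -> t2 -> t3, each task initially in its own cluster.
cluster_of = {"t1": "A", "t2": "B", "t3": "C"}
edges = [("t1", "t2"), ("t2", "t3")]

# Merging t3 into A would make A and B depend on each other.
print(merge_creates_mutual_dependency(cluster_of, edges, "t3", "A"))
# Merging t2 into A only leaves the one-way dependency A -> C.
print(merge_creates_mutual_dependency(cluster_of, edges, "t2", "A"))
```

A check of this kind is stricter than necessary, since not every mutual dependency leads to an actual deadlock once the tasks inside each cluster are ordered; it is meant only to make the notion of a cyclic dependency between two clusters concrete.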

Description of the GLB-Synch Algorithm
The GLB-Synch algorithm is an extension of Radulescu's GLB algorithm for mapping clusters to processors (Radulescu, 2001). However, the GLB-Synch algorithm performs both cluster-mapping and task-ordering in the context of synchronous communication. The algorithm uses the information obtained during the clustering step, based on the NLC-SynchCom algorithm, for mapping clusters to a distributed-memory system of a bounded number of homogeneous processors that are fully connected. Eventually, the NLC-SynchCom algorithm schedules the tasks on the virtual processors, assuming each cluster is allocated to one virtual processor and there is an Unbounded Number of Processors or Clusters (UNC). Therefore, we will refer to the DAG and schedule generated by the NLC-SynchCom algorithm as the clustered DAG and the UNC schedule, respectively. Each node of the clustered DAG represents a cluster or virtual processor, and each directed edge between two clusters represents a communication link connecting them. The GLB-Synch algorithm uses the cluster start time, $T_s(C)$, to represent the priority of a cluster, C, for mapping. The start time of a cluster is the start time of the computation phase of the Head task in the cluster, and it is given by

$$T_s(C) = \min_{t \in C} s\_compute(t) \qquad (1)$$

where t is a task of cluster C.
Similar to GLB, the cluster that has the earliest start time is mapped first. In case of a tie, the cluster with the highest workload is mapped first. If there is still a tie, a cluster is selected randomly. Based on the execution phases of a task, the UNC schedule allocates time slots for all the phases, with no consideration for overlapping computation with communication. At this stage, it is more natural to consider the existence of some overlapping between the computation and the communication phases, as clusters are mapped to physical processors. In this work, we assume that the duration of the receive phase is implicitly handled by a communication processor, and the acknowledgment of a received message is handled by the message-passing system. Each scheduled task should have all its expected messages declared to the message-passing system ahead of the start time of its receive phase. Only when all the needed messages have arrived can the task start the computation phase. However, the duration of the sending phase is still considered explicitly as part of the task's schedule on a processor to enforce synchronization. The GLB-Synch algorithm maps clusters to a distributed-memory system of a bounded number of homogeneous processors that are fully connected. Since a complete interconnection network is assumed, there is no consideration for bandwidth contention. Furthermore, each processor is assumed to have an unlimited number of communication ports and unlimited memory space.
Since the workload of a task, t, covers the duration from s_compute(t) to e_send(t), the workload of a cluster is defined as

$$T_w(C) = \sum_{t \in C} \left( e\_send(t) - s\_compute(t) \right) \qquad (2)$$

Because a cluster is not a schedulable unit, the GLB-Synch algorithm maps a cluster to the least loaded processor, as in the GLB algorithm. Clusters mapped to the same processor are expected to be interleaved by the task-ordering step. The workload of a processor, p, is defined as

$$T_w(p) = \sum_{C \in \psi(p)} T_w(C)$$

where $\psi(p)$ is the subset of clusters mapped to processor p. As a consequence, all inter-cluster communication costs among the mapped subset of clusters must become zero.
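The cluster-mapping rule described above (earliest cluster start time first, ties broken by the highest workload and then randomly, each cluster going to the least loaded processor) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `map_clusters`, the tuple layout and the example values are hypothetical, and the zeroing of inter-cluster communication costs is not modeled.

```python
import random


def map_clusters(clusters, m, rng=None):
    """Map clusters to m processors by start time and processor workload.

    clusters: {cluster id: (start time, workload)}
    Returns (psi, load): psi[p] lists the clusters mapped to processor p,
    load[p] is the accumulated workload of processor p.
    """
    rng = rng or random.Random(0)  # seeded only to make the sketch repeatable
    order = sorted(
        clusters,
        # Earliest start first, then highest workload, then a random tie-break.
        key=lambda c: (clusters[c][0], -clusters[c][1], rng.random()),
    )
    load = [0.0] * m
    psi = [[] for _ in range(m)]
    for c in order:
        # Least loaded processor; the lowest index wins when loads are equal.
        p = min(range(m), key=load.__getitem__)
        psi[p].append(c)
        load[p] += clusters[c][1]
    return psi, load


# Hypothetical clusters: (start time of the Head task, cluster workload).
clusters = {"C1": (0, 10), "C2": (0, 30), "C3": (5, 20), "C4": (8, 5)}
psi, load = map_clusters(clusters, m=2)
print(psi, load)
```

Here C2 is mapped before C1 because both start at time 0 and C2 carries the larger workload; the remaining clusters follow in start-time order and fill whichever processor is lighter at that moment.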
The algorithm achieves the objective of scheduling tasks on the processors in three steps. In the first step, it assumes that the process of cluster-mapping generates super-clusters, constructed as aggregates of the clusters mapped to the target processors. In this step, all the time parameters of all tasks are recomputed, due to the zeroing of the inter-cluster communication costs, without performing the task-ordering (i.e. sequentialization) process. It may look as if each aggregate of clusters (i.e. virtual processors) is mapped to a shared-memory multiprocessor.
In the second step, the GLB-Synch algorithm performs task-ordering, based on the start times of tasks for the computation phase. All tasks allocated to the same processor are sorted topologically in increasing order of their start computation time, s_compute. If two tasks have the same s_compute time, the task with the highest blocking delay is scheduled first. The blocking delay of a task, t, is defined as the time it waits for acknowledgments for all messages sent by that task, and it is computed by (e_send(t) - s_send(t)). If there is still a tie, a task is selected randomly. In the third step, the tasks mapped to the same processor are scheduled using their precedence order. Let $T_r^i(p)$ denote the processor ready time on a partial schedule. It is initialized to zero, and it is defined as the end time of the last task, $t_i$, scheduled on that processor. Accordingly, a task $t_{i+1}$ is scheduled for execution at $T_r^i(p)$ if its current start time for the computation phase, s_compute($t_{i+1}$), is less than or equal to $T_r^i(p)$. Otherwise, the task $t_{i+1}$ is scheduled to start execution at its designated s_compute time. The GLB-Synch algorithm is described next.
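The task-ordering and scheduling steps on a single processor can be sketched in Python as shown below. The `Task` record and the function `order_and_schedule` are hypothetical names introduced for this illustration. The sketch assumes that each task occupies the processor for e_send(t) - s_compute(t), in line with the task workload used earlier, and it does not propagate the resulting shifts to tasks on other processors.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    s_compute: float  # start of the compute phase
    s_send: float     # start of the send phase
    e_send: float     # end of the send phase (all acknowledgments received)

    @property
    def blocking_delay(self):
        return self.e_send - self.s_send


def order_and_schedule(tasks):
    """Order the tasks of one processor and place them on its timeline.

    Returns a list of (name, start, end) tuples.
    """
    # Increasing s_compute; on ties the highest blocking delay goes first.
    ordered = sorted(tasks, key=lambda t: (t.s_compute, -t.blocking_delay))
    ready = 0.0  # processor ready time, initialized to zero
    schedule = []
    for t in ordered:
        # Start at the ready time if the task was due earlier, else at s_compute.
        start = ready if t.s_compute <= ready else t.s_compute
        end = start + (t.e_send - t.s_compute)
        schedule.append((t.name, start, end))
        ready = end
    return schedule


# Hypothetical tasks mapped to the same processor.
tasks = [
    Task("a", s_compute=0.0, s_send=4.0, e_send=6.0),
    Task("b", s_compute=0.0, s_send=3.0, e_send=8.0),
    Task("c", s_compute=20.0, s_send=22.0, e_send=25.0),
]
print(order_and_schedule(tasks))
```

Tasks a and b share the same s_compute, so b, which has the larger blocking delay, is placed first; task c keeps its designated start time because the processor is already free by then.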

Complexity Analysis
The following notation is used to characterize the time complexity of the GLB-Synch algorithm: v is the number of task nodes and e is the number of edges in the DAG, c is the number of clusters produced by the clustering step, and m is the number of processors. The GLB-Synch algorithm performs cluster-mapping, task-ordering and scheduling on the processors, based on the clustering step for synchronous communication. The GLB-Synch algorithm assumes that the formation of clusters, along with the computation of their workloads, has been performed in the clustering step. Besides, it assumes that task clusters are deadlock-free, since all clusters generated by the clustering step have passed the deadlock detection test successfully.

Theorem 1. The time complexity of the GLB-Synch algorithm is $O(mc + mv \log v + ev)$.
Proof. The start time of a cluster, $T_s(C)$, is the s_compute time of the Head task of C. Hence, the cost of the first step is $O(c)$. In step 2, clusters are sorted in $O(c \log c)$ time.
The determination of the least loaded processor, $p_j$, takes $O(m)$ steps, while zeroing the inter-cluster communication costs between the clusters already mapped to $p_j$ and the one currently being mapped needs $O(e)$ steps. Accordingly, step 3 of the algorithm takes $O(c(m + e))$ time.
Updating the UNC schedule at step 4 takes $O(ev)$ steps, due to the need to find the e_receive($t_i$) and e_send($t_i$) times for each task $t_i$. Finally, step 5 of the algorithm takes $O(mv \log v)$ steps, because the algorithm orders all tasks mapped to a processor in $O(v \log v)$ steps and then schedules the tasks on that processor in $O(v)$ steps. Since m < c and c < v, the total time complexity of the GLB-Synch algorithm is $O(mc + mv \log v + ev)$.
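For reference, the per-step costs stated in the proof can be summed explicitly. The following derivation only restates those costs and uses $c < v$ and $m \ge 1$ to absorb the lower-order terms:

```latex
\begin{align*}
T_{\text{GLB-Synch}}
  &= \underbrace{O(c)}_{\text{step 1}}
   + \underbrace{O(c \log c)}_{\text{step 2}}
   + \underbrace{O(c(m + e))}_{\text{step 3}}
   + \underbrace{O(ev)}_{\text{step 4}}
   + \underbrace{O(mv \log v)}_{\text{step 5}} \\
  &= O(mc + ce + c \log c + ev + mv \log v) \\
  &= O(mc + mv \log v + ev),
  \qquad \text{since } c < v \Rightarrow ce \le ev
  \text{ and } c \log c \le mv \log v .
\end{align*}
```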

Performance Study
A performance study has been conducted on the multi-step approach for scheduling tasks with synchronous inter-task communication. The study is based on the simulation of the multi-step scheduling approach using the NLC-SynchCom algorithm for the clustering step, and the GLB-Synch algorithm for the cluster-mapping and task-ordering steps. The performance study adopted randomly generated DAGs for experimentation. Synthesized random DAGs are generated so that the results are not biased towards regular graph structures or certain graph shapes, allowing various DAG characteristics to be considered. The objectives of the performance study include assessing the cost of the multi-step scheduling approach in the context of synchronous communication, evaluating the outcome of the multi-step scheduler, and discovering its deficiencies and limitations. In this section, the definitions of the chosen performance metrics are given first. Then, the simulation set-up for experimentation is described. Finally, the performance results are presented and discussed.

Performance Metrics
The performance metric considered for assessing the cost of each step of the multi-step scheduler is the execution time. The performance metrics considered for evaluating the outcome of the scheduling steps are based on the Schedule Length (SL). We will refer to $SL_o$ as the original schedule length of an application DAG. It is the parallel time for executing a DAG on an unbounded number of processors, and it is equal to the length of the critical path of the DAG. The schedule length obtained from executing a DAG on a uniprocessor is referred to by $SL_1$, and it is defined by Eq. (7). The schedule length obtained by the clustering step (i.e. the UNC schedule) is referred to by $SL_c$. It is also based on an unbounded number of processors. The schedule length obtained by mapping and scheduling the tasks on a bounded number of processors, m, is referred to by $SL_m$. The speedup factor, $SP_m$, is defined as the ratio of the execution time of a parallel program on a uniprocessor to its execution time on m processors, and it is given by

$$SP_m = \frac{SL_1}{SL_m} \qquad (8)$$

Definition 1 Normalized Schedule Length NSL(m). In this work, we define the Normalized Schedule Length, NSL(m), as the ratio of the schedule length of a DAG on m processors, $SL_m$, to its original schedule length, $SL_o$. That is,

$$NSL(m) = \frac{SL_m}{SL_o} \qquad (9)$$

Hwang (1993). Let $b(v_i, p_j)$ be the blocking time of task $v_i$ due to synchronization, when it is executed on $p_j$. Hence, (10) where $x(v_i, p_j)$ is a 0-1 function defined as follows:

Definition 2 Utilization U(m). The utilization of a system of m processors, U(m), is defined as the percentage of the m processors' time that is kept busy during the execution of a parallel program due to a certain allocation
(11) and (12) (Hwang, 1993). It is given by (13)
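The two schedule-length ratios stated in words above, the speedup factor and the Normalized Schedule Length, can be computed as in the following Python sketch. The function names and the sample schedule lengths are made up for illustration and are not results from this study.

```python
def speedup(sl_1, sl_m):
    """Speedup factor: uniprocessor schedule length over the m-processor one."""
    return sl_1 / sl_m


def normalized_schedule_length(sl_m, sl_o):
    """NSL(m): schedule length on m processors over the original schedule length."""
    return sl_m / sl_o


# Made-up schedule lengths for one DAG, indexed by the number of processors.
sl_1, sl_o = 1000.0, 200.0
for m, sl_m in {2: 520.0, 4: 290.0, 8: 210.0}.items():
    print(m, round(speedup(sl_1, sl_m), 2),
          round(normalized_schedule_length(sl_m, sl_o), 2))
```

An NSL(m) value of 1 means that the schedule on m processors is as short as the critical path allows, which is the sense in which NSL values are called optimal in the results.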

Definition 4 Synchronization Overhead Ratio SOR(m).
Let $B_m$ be the average blocking (i.e. synchronization) time for scheduling the tasks of a parallel program on a bounded number of processors, m. Accordingly, the Synchronization Overhead Ratio, SOR(m), is defined as the ratio of the average blocking time per processor to the schedule length, $SL_m$, that is, $SOR(m) = B_m / SL_m$.

Definition 5 Load Imbalance Factor LIF(p).
Let $T_w(p_j)$ refer to the workload of processor $p_j$. Therefore, the Load Imbalance Factor, LIF(p), is defined as the percentage of the average processor's time that is kept idle during the execution of a parallel program. Actually, it indicates the percentage of non-utilized processor's time, and it is given by (16). From Definitions 2, 3 and 4, the utilization is characterized by the following Lemma.

Lemma 1. The utilization U(m) is characterized in terms of the efficiency, E(m), and the synchronization overhead ratio, SOR(m), for scheduling a parallel program on m processors as (17)
Proof. Let $b_{ave}$ be the average blocking time in a DAG, and $B_m$ the average blocking time per processor for scheduling the DAG on m processors. Accordingly, the utilization can be rewritten as

Simulation
The use of randomly generated Directed Acyclic Graphs (DAGs) to model parallel applications is a common practice in the evaluation of proposed scheduling heuristics for parallel and distributed computing systems. The use of simulation provides a basis to evaluate the scheduling algorithm independent of the hardware implementation and its organization. Many approaches have been proposed in the literature on how to generate synthesized DAGs randomly (Kasahara Laboratory, Japan, 2004). In this work, a random graph generator is implemented to generate weighted DAGs, as defined in this paper, with various characteristics based on a method that uses the following factors:

1. The number of tasks, v.
2. The shape factor, α, of a DAG: We assume the height (i.e. number of levels) of a DAG is randomly generated from a uniform distribution with a mean value, $L_{mean}$, equal to $\sqrt{v}/\alpha$. Similarly, the width of each level in the DAG is randomly generated from a uniform distribution with a mean value, $W_{mean}$, equal to $\alpha\sqrt{v}$.
3. The maximum out-degree of a node in a DAG.
4. The maximum span of an edge in a DAG.
5. The Computation to Communication Ratio, CCR: It is taken as the ratio of the average computation weight to the average communication cost. Values of CCR in the range 0.1-0.7 represent fine granularity, values in the range 0.8-1.4 represent medium granularity, and values greater than 1.4 represent coarse granularity.
6. The mean computation weight, $\omega_{mean}$: The computation weight of each node is determined randomly from a uniform distribution with a mean $\omega_{mean}$.
7. The mean communication cost, $\lambda_{mean}$: The mean communication cost of a DAG is equal to $\omega_{mean}$/CCR. The communication cost of each edge is determined randomly from a uniform distribution with a mean $\lambda_{mean}$.
For the purpose of generating random DAGs in this study, we have arbitrarily chosen $\omega_{mean}$ to be 20. Three values of CCR are considered: 0.5, 1.0 and 5.0, to represent fine, medium and coarse granularity, respectively. To control the structure of the DAG, we have limited the out-degree of a node to be within the range 1 to $W_{mean}$. Also, the span of an edge is limited to be within 1 to $\lceil 0.25 \times height \rceil$. Three values for the shape factor, α, are considered. These are 0.5, 1.0, and 2.0. An α < 1.0 represents a DAG with a long height and a low degree of parallelism, while an α > 1.0 represents a shorter DAG with a high degree of parallelism. The width of a DAG level l indicates the degree of parallelism at that level. Therefore, the mean width of a DAG, $W_{mean}$, is adopted as a measure of the potential degree of parallelism in the DAG. The relationship between the mean width, $W_{mean}$, of a DAG and the number of nodes, v, of the DAG for constant values of α is shown in Fig. 1.

Simulation experiments were conducted to measure the cost of each step of the multi-step scheduling techniques used in this work. Ten groups of synthesized DAG sizes were generated. These range from 50 to 500 nodes with an increment of 50. For each DAG size, v, and shape factor, α, we have generated 100 random DAGs. The NLC-SynchCom algorithm is applied on each generated DAG for clustering tasks with synchronous communication. The resultant clustered DAG is taken as an input to the GLB-Synch algorithm for mapping and ordering the tasks on the processors.
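A layered random DAG generator along the lines of the factors listed in this section can be sketched in Python as follows. This is not the generator used in the study: the function name `generate_dag`, the bounds of the uniform distributions (chosen so that their means match $W_{mean}$, $\omega_{mean}$ and $\lambda_{mean}$) and the way levels are filled until v tasks are placed are all assumptions made for illustration. In particular, the height is not drawn explicitly; it results from the level widths.

```python
import math
import random


def generate_dag(v, alpha, ccr, w_mean=20.0, seed=None):
    """Generate a layered weighted DAG with v tasks (illustrative sketch).

    Returns (levels, weights, edges): levels is a list of lists of node ids,
    weights maps node -> computation weight, edges maps (parent, child) ->
    communication cost.
    """
    rng = random.Random(seed)
    width_mean = alpha * math.sqrt(v)
    # Assumption: integer widths uniform on [1, 2 * W_mean - 1], mean W_mean.
    max_width = max(1, round(2 * width_mean) - 1)

    levels, placed = [], 0
    while placed < v:
        width = min(rng.randint(1, max_width), v - placed)
        levels.append(list(range(placed, placed + width)))
        placed += width

    height = len(levels)
    max_out = max(1, round(width_mean))   # out-degree within 1..W_mean
    max_span = math.ceil(0.25 * height)   # edge span within 1..ceil(0.25 * height)
    lam_mean = w_mean / ccr

    # Assumption: weights uniform on [1, 2 * w_mean - 1], mean w_mean.
    weights = {n: rng.uniform(1.0, 2.0 * w_mean - 1.0) for n in range(v)}
    edges = {}
    for l, level in enumerate(levels[:-1]):
        last = min(height - 1, l + max_span)
        candidates = [n for lv in levels[l + 1:last + 1] for n in lv]
        for parent in level:
            degree = min(rng.randint(1, max_out), len(candidates))
            for child in rng.sample(candidates, degree):
                # Assumption: costs uniform on [0, 2 * lam_mean], mean lam_mean.
                edges[(parent, child)] = rng.uniform(0.0, 2.0 * lam_mean)
    return levels, weights, edges


levels, weights, edges = generate_dag(v=100, alpha=1.0, ccr=1.0, seed=1)
print(len(levels), len(edges))
```

Because node ids increase from one level to the next and every edge points to a later level, the generated graph is acyclic by construction. Nodes in later levels may end up without a parent in this sketch, which makes them additional entry nodes.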

Performance Results
The cost of each step in our multi-step scheduling scheme is measured against the number of nodes in a DAG. Figures 2 and 3 show the average execution time for the clustering step, and the mapping and ordering step, respectively, versus the number of nodes. The average execution time for the mapping and ordering step is taken over all the numbers of processors considered in the simulation runs. Three cases are considered for each scheduling step, based on the shape factor of the DAG. For both steps, the execution time increases as the DAG size and the degree of parallelism (i.e. α) in the DAG increase.
From the plots, it can be deduced that the cost of the mapping and ordering step is, in general, much less than the cost of clustering. The cost of mapping and ordering does not exceed 25% of the cost of clustering for DAG sizes greater than 100 nodes.
In order to assess the performance of the GLB-Synch algorithm, we have to focus briefly on the main performance features of the NLC-SynchCom algorithm. The main objective of performing the clustering step is to reduce the communication cost in the DAG by merging tasks into clusters, where each cluster can be assigned to a processor. In this way, the clustered DAG would have a better PCCR relative to the initial value of the DAG's PCCR. Both measures tend to degrade at higher DAG sizes and shape factors, as shown in Figs. 4 and 5. However, the NLC-SynchCom algorithm imposes restrictions on merging steps that may cause an increase in the DAG's PT. Therefore, the UNC schedule length, $SL_c$, is always less than or equal to $SL_o$. For example, the parallel time is reduced by about 9% for DAGs with a size of 50 nodes, PCCR = 1.0 and α = 0.5. The ability of the algorithm to reduce PT generally degrades with higher DAG sizes. For example, the PT reduction is about 1% for DAGs with a size of 500 nodes, PCCR = 1.0 and α = 0.5. This brief characterization of the clustering step should support understanding the outcomes of the mapping and ordering step (for more details see Arafeh, 2003).
The performance results are shown in Figs.6-10.They are taken for DAGs with α =1.0 and sizes of 50, 250 and 500 nodes.This should provide us with bases of uniformity in our comparisons, analyses and assessments.This does not mean that the effect of the shape factor has been ignored, or that it does not have a role on the type of results generated.On the contrary, all performance results in this work scale proportionally with the value of α.Figures 6-10 show the relationships of the normalized scheduling length, NSL, the speedup, SP, the utilization, U, the efficiency, E, and the load imbalance factor, LIF, against the number of processors, respectively.In particular, simulation results were collected for number of processors equal to 1, 2, 4, 8, 16 and 32.However, curve fitting techniques were applied in order to obtain smooth curves in those figures.
The results in Fig. 6(a), (b) and (c) depict NSL(m) versus the number of processors, m, for DAGs with PCCR values of 0.5, 1.0 and 5.0, respectively. The NSL values are high with a small number of processors, but they approach optimal values at a high number of processors. The NSL values are optimal for DAGs of 50 nodes, and close to optimal for larger sizes, when m > 16. The next set of results is for the average speedup factor against the number of processors, shown in Fig. 7(a), (b) and (c) for PCCR values of 0.5, 1.0 and 5.0, respectively. Before the crossover points, the speedup factor is higher for smaller DAG sizes, whereas after the crossover points it becomes higher for larger DAG sizes.
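As a rough illustration of how the quantities plotted in Figs. 6 and 7 relate to each other, the sketch below computes a normalized schedule length, a speedup and an efficiency from schedule lengths. The definitions are assumptions, since this section does not state them: NSL is taken as SL_m divided by a reference length (for instance SL_o), speedup as the serial execution time divided by SL_m, and efficiency as speedup divided by m. The function names and the numbers are hypothetical.

```python
def normalized_schedule_length(sl_m: float, sl_ref: float) -> float:
    """Assumed definition: NSL(m) = SL_m / SL_ref, so 1.0 means SL_m matches the reference."""
    return sl_m / sl_ref


def speedup(serial_time: float, sl_m: float) -> float:
    """Assumed definition: SP(m) = serial execution time / SL_m."""
    return serial_time / sl_m


def efficiency(serial_time: float, sl_m: float, m: int) -> float:
    """Assumed definition: E(m) = SP(m) / m."""
    return speedup(serial_time, sl_m) / m


# Hypothetical numbers, not taken from the reported experiments.
nsl = normalized_schedule_length(sl_m=120.0, sl_ref=100.0)
sp = speedup(serial_time=400.0, sl_m=100.0)
eff = efficiency(serial_time=400.0, sl_m=100.0, m=8)
```

Under these assumed definitions, a speedup of 4 on 8 processors corresponds to an efficiency of 0.5, which is the kind of sub-linear scaling the speedup curves exhibit.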
The crossover points in the speedup curves are expected, because the potential degree of parallelism (represented by the average DAG width) of smaller DAGs is more likely to be matched by the available number of processors, before the crossover points, than that of larger DAGs. Accordingly, SL_m of smaller DAGs on a low number of processors is close to SL_o, as shown in Fig. 6 by the relationship between the NSL and the number of processors, m. However, the high potential degree of parallelism found in large DAGs cannot be exploited with a limited number of processors. Accordingly, their SL_m values are much larger than SL_o, because tasks allocated to the same processor are processed sequentially. The situation changes after the crossover points: as more processors become available, they can match the potential degree of parallelism found in large DAGs. The discrepancies in the crossover points among the three curves are attributed to the average PCCR of the DAGs.
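The average DAG width used above as a proxy for the potential degree of parallelism can be illustrated with a small sketch. How the width is computed is not stated in this section, so the following is an assumption: each node is placed on a level equal to the length of the longest path from an entry node, and the mean width is the number of nodes divided by the number of levels. The function names and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def level_widths(num_nodes, edges):
    """Return the node count per level; a node's level is the longest
    path length (in edges) from any entry node (assumed convention)."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    succ = defaultdict(list)
    indeg = [0] * num_nodes
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [0] * num_nodes
    ready = deque(n for n in range(num_nodes) if indeg[n] == 0)
    visited = 0
    while ready:  # topological traversal (Kahn's algorithm)
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != num_nodes:
        raise ValueError("graph contains a cycle")
    widths = [0] * (max(level) + 1)
    for lv in level:
        widths[lv] += 1
    return widths


def mean_width(num_nodes, edges):
    """Assumed definition: number of nodes / number of levels."""
    return num_nodes / len(level_widths(num_nodes, edges))


# Hypothetical 6-node DAG: one entry node fanning out to three tasks.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 5)]
example_widths = level_widths(6, example_edges)
example_mean = mean_width(6, example_edges)
```

In this example the widest level holds three nodes while the mean width is 2, so under this reading more than about three processors could not shorten the schedule.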
The speedup clearly does not scale linearly with the number of processors. Synchronous communication has a serious drawback in limiting the

Conclusions
This work has introduced a low cost algorithm, called GLB-Synch. It is intended to perform task mapping and ordering in the context of synchronous communication, as part of a multi-step scheduling approach. We have shown by analysis that the complexity of the GLB-Synch algorithm is O(mc + mv log v + ev). The simulation results show that multi-step scheduling using the NLC-SynchCom algorithm for clustering and the GLB-Synch algorithm for mapping and ordering retains the same low complexity cost for both steps. The performance study has shown limited speedup gain over different DAG shape factors. The limitations in the achieved speedups are mainly attributed to the synchronization overhead. Further improvements in clustering and mapping techniques are needed to achieve high performance, unless high performance is not the objective of synchronous parallel and distributed computing.

c: The number of clusters.
e: The number of DAG edges, |E|.
v: The number of DAG vertices, |V|.
m: The number of processors.
The plots shown in Fig. 1 indicate the potential speedup expected by scheduling a DAG on m processors. Accordingly, values of m >> W_mean are not expected to bring any significant speedup improvement in the PT of a DAG.

Figure 1. Relationship between the mean width and the number of nodes in a DAG.

Figure 10. The Load Imbalance Factor versus the number of processors.

Table of time parameters (i.e. UNC schedule).
3. Number of processors.

1. Compute the start time, T_s(C), and the workload, T_w(C), for each cluster, C.
2. Sort the clusters in increasing order of T_s, breaking ties by choosing the cluster with the highest workload. If there is still a tie, select one randomly.
3. For each cluster, C, do
   * Map C to a processor, p, with the least workload.
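The recoverable mapping steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GLB-Synch implementation: T_s(C) is taken as the earliest task start time in the cluster and T_w(C) as the sum of its task execution times, ties among equally loaded processors go to the lowest index, and the ordering of tasks on each processor is left out because it does not appear in the steps listed here. All names (`cluster_times`, `map_clusters`, the example clusters) are hypothetical.

```python
import random


def cluster_times(tasks):
    """Step 1 (assumed definitions): tasks is a list of (start_time, exec_time)
    pairs taken from the UNC schedule. Returns (T_s, T_w)."""
    t_s = min(start for start, _ in tasks)
    t_w = sum(exec_time for _, exec_time in tasks)
    return t_s, t_w


def map_clusters(clusters, num_processors, seed=0):
    """Map each cluster to the currently least loaded processor.

    clusters: dict of cluster name -> list of (start_time, exec_time).
    Returns (assignment dict, per-processor workloads).
    """
    rng = random.Random(seed)  # seeded so the random tie-break is repeatable
    times = {name: cluster_times(tasks) for name, tasks in clusters.items()}
    # Step 2: increasing T_s; ties -> highest T_w; remaining ties -> random.
    ordered = sorted(
        times, key=lambda n: (times[n][0], -times[n][1], rng.random())
    )
    loads = [0.0] * num_processors
    assignment = {}
    # Step 3: greedy placement on the least loaded processor
    # (lowest index wins among equal loads, an assumption).
    for name in ordered:
        p = min(range(num_processors), key=loads.__getitem__)
        assignment[name] = p
        loads[p] += times[name][1]
    return assignment, loads


# Hypothetical clusters with (start_time, exec_time) per task.
example_clusters = {
    "A": [(0, 4), (4, 2)],
    "B": [(0, 3)],
    "C": [(2, 5)],
    "D": [(5, 1)],
}
example_assignment, example_loads = map_clusters(example_clusters, num_processors=2)
```

Sorting dominates the cost of this sketch, and a linear scan is used to find the least loaded processor for each cluster; a priority queue over processor loads would be a natural alternative when the number of processors is large.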