Nous étudions dans cet article une parallélisation efficace de l'algorithme de branch and bound dans sa version dite du meilleur d'abord. Cette parallélisation consiste en une alternance de phases de calcul et de phases dites de synchronisation qui servent à équilibrer les charges et à réaliser une répartition optimale des données. Des résultats expérimentaux obtenus sur un ordinateur massivement parallèle Cray T3D illustrent l'efficacité de notre approche.
EPFL Supercomputing Review - n. 6 - nov. 94
In this paper we study an efficient parallel synchronized branch and bound (PSBB) algorithm. The parallelization of the sequential best-first branch and bound algorithm is based on the concept of alternating computation and so called synchronization steps. The computational steps simplify the problem to solve whereas the synchronization phases are devoted to the problem of load balancing and data distribution. Experimental results show the efficiency of the proposed PSBB algorithm when executed on a massively parallel Cray T3D machine.
We have chosen the approach of exactly solving minimization problems belonging to the class NP-hard by developing a parallel algorithm based on the sequential branch and bound method. The idea behind our parallel synchronized branch and bound algorithm, also called PSBB algorithm, is to alternate computation and communication phases so as to minimize useless work and to maximize computation versus communication time.
In order to study the efficiency of the PSBB algorithm, we implemented a general combinatorial optimization problem, the mixed integer linear programming problem (MIP) [2].
Figure 1 - The branch and bound tree for a 0-1 integer programming problem
The sequential branch and bound algorithm
The branch and bound algorithm is a method for solving combinatorial optimization problems exactly. It proceeds by traversing a tree in which each node is a subproblem of the initial problem in order to find a feasible leaf node with minimal value. In Figure 1 we illustrate such a branch and bound tree for solving a 0-1 integer programming problem[13]. The solution on each non-leaf node corresponds to the solution of at least one of its sons. To each node P in the tree, we associate a lower bound l(P) (denoted by l in Figure 1) and an upper bound u(P) (not shown). The nodes in the search tree are explored in increasing order according to their lower bound l(P). The order in which nodes are considered is indicated on Figure 1 by a blue number in the upper right corner. This strategy has as a consequence that the first feasible solution considered for branching is the optimal solution to the initial problem. Such a strategy is called best-first. A more in depth description of the sequential branch and bound algorithm and its axiomatization may be found in [6, 9, 11]. It may be described by the following pseudo code, where OPEN is a list of unexpanded problems sorted in increasing order by their lower bound l(P).
The Parallelization of the Algorithm
The parallelization of the best-first branch and bound algorithm on distributed memory machines is relatively difficult. The difficulties come first from the dynamic data generated by the algorithm not allowing any compile time scheduling; and second from the dependencies between the problems generated, the best first strategy implying that they must be visited in non decreasing order of their lower bounds.
The most immediate parallelization on distributed memory machines consists in attributing one list of unexpanded problems to each processor and letting processors work exclusively on their private list. But as the shape of the search tree is very irregular, this will lead to unbalanced load, and hence a bad speedup. Thus there is a need for a load sharing mechanism in order to avoid starvation.
But, this is not sufficient because there is no guarantee that all processors work on equally interesting problems. Indeed, if processors consider only their private list of problems executing a local best first search, they cannot be aware of the fact that, maybe, all the problems they own are uninteresting when compared to the problems held by another processor. Such a parallelization may lead to a situation where only a small number of processors work on problems belonging to the best first search tree while the others consider problems that are of no use for finding the best solution.
Thus, there is beside load sharing also a need for balancing the different open lists regarding the lower bounds of the problems. A trivial solution to this problem consists in having one single open list managed by a master which hands out problems to the slaves. This approach is currently used on shared memory parallel computers, the concept of a master being realized by an exclusive access to the open list. On distributed memory machines, the master-slave parallelization falls short essentially for two reasons. First of all, the master is likely to become a bottleneck as soon as the number of slaves increases. Moreover, the entire open list is kept in the memory of the master processor, yielding a very bad utilization of the distributed memory and not allowing the resolution of any problem that cannot fit into the memory of one single processor.
The use of many private open lists is thus the way to go provided that some additional mechanisms keep the open lists balanced concerning the lower bounds of the different problems. Such a control of the parallel execution can be realized efficiently using very different schemes. Lüling and Monien in [10] use a local load balancing scheme where processors only exchange problems with their neighbors. They showed that keeping the loads of neighboring processors close one to the other leads to a globally well balanced execution if the diameter of the underlying processor graph is small. The time spent for balancing the load is acceptable and linear speedups can be obtained for large optimization problems using hundreds of processors.
Our approach, on the contrary, is based on global synchronizations and perfect load balancing. It is called the Fringe Enumerating Algorithm or Parallel Synchronized Branch and Bound Algorithm and is derived from an axiomatic formulation of the branch and bound algorithm [6] which defines the notion of fringes within a branch and bound search tree.
Figure 2 - Representation of a fringe in a branch and bound tree
In [6] we showed that a parallelization based on fringes is optimal in the sense that it avoids all useless computations and uses all the parallelism available. At any time of the computation, a fringe consists simply in the subset of unexpanded problems that have the smallest lower bound of all problems (see Figure 2). All problems belonging to a fringe can be processed in parallel and independently one from the other. This processing corresponds one parallel computation step in our parallelization and is followed by a synchronization step which determines the next fringe and equally distributes all the problems belonging to this fringe on the processors. The following pseudo-code accurately describes the overall control structure of our parallel algorithm.
This algorithm is only efficient if the number of synchronization phases is small and if computations of almost all fringes are coarse-grained. Practical experience using different optimization problems show that this is indeed the case. More importantly, one may show using modelizations of the execution of the branch and bound algorithm that the above conditions hold for almost all optimization problems and almost all problem instances [5, 7, 14].
The last important topic in our parallelization concerns the load balancing realized during a synchronization phase. We compute the lower bound of the next fringe and distribute the problems of this fringe on all the processors. The distribution of the problems is known in the literature as the Token Distribution Problem (TDP) which consists in redistributing a set of identical tokens over a graph of processors. As all the problems belonging to a given fringe are completely independent from each other and can be processed in any order, the token distribution problem is a general formulation of the load balancing mechanism we need in our parallelization. We discuss the TDP in detail in the next section and propose a novel algorithm to solve this problem on k-ary d-cube networks (tori of any dimension) using an approach that minimizes the information exchanges and thus provides a scalable solution for the TDP over a broad class of networks.
The Token Distribution Problem
The token distribution problem (TDP), introduced by Peleg and Upfal in 1989 [12], considers a set of p processors interconnected using a given network topology. Two processors can exchange one or more messages or tokens at a time. At any time, a processor can at most send and receive tokens from one of its neighbors. Initially, each processor Pi has a given number m(Pi) tokens. The TDP is the problem of finding a sequence of token exchanges between processors such that at the end of the sequence every processor m(Pi) owns m'(Pi) tokens. Token exchanges between two processors means in this context that either processor may send token to the other processor or receive tokens from it. More formally, the TDP can be stated as follows.
Let G=(V,E) be a graph representing a MIMD-DM machine. Let
and
be two functions associating an initial respectively a final number of identical token to each node in the graph. A communication stage is a function
which indicates how many tokens, are sent at stage i over a given edge. By convention, we have
. Any communication stage has to verify the two restrictions
. This means that a node can at most send/receive tokens to/from exactly one neighbor node in one communication stage. The token distribution problem consists in finding a sequence c1,...,cs of communication stages of minimal length s such that for all v and all j such that 1 ¾ j ¾ s .


Auf der Heide et al. [1] have developed various token distribution algorithms for k-ary d-cube networks. They also proved that a solution to the TDP can be found in polynomial time by solving a sequence of maximum flow problems. But this solution assumes that every processor knows the initial distribution entirely. This requires an information exchange under the form of a total exchange (gossiping) which is very expensive. The option we take consists in using only a partial information exchanges and develop a TDP algorithm that, during the different steps of its execution, sends tokens only along fixed dimensions. We thus reduce the volume of information exchanged but this is at the expense of the loss of the token routing optimality. The compromise is in our favor because one can show that our approximate approach has the same worst case complexity as the optimal routing defined in [1]. Our TDP algorithm, called Dimension Order Token Distribution algorithm (DOTD), proceeds in three steps:
Figure 3 - Representation of the token distribution known to one
The information exchanges we use can easily be understood from Figure 3.. The blue processor (which could be any other processor as a k-ary d-cube is a regular graph) knows the number of tokens owned by its neighbor processors along dimension one, shown in orange and white on Figure 3.
Along dimension two, the blue processor does not know any detailed information about the number of tokens owned by an individual processor. It only knows the sum of tokens owned collectively by all the processors that belong to a same ring of dimension one, shown as red blocks. The decomposition continues this way and all processors of the upper plane see the sums of tokens owned collectively by the processors of the lower planes (green). In order to obtain these information, the processor shown in blue in Figure 3, or any other processor, needs only to communicate with processors along the different rings it belongs to (Figure 4). The number of messages to send is thus O(d.k) on a k-ary d-cube while the total exchange algorithm needs O(kd) messages.
Figure 4 - Representation of the neighbors with which the black processor communicates
Let us now look at the way tokens are exchanged so as to determine the distribution information computed in step 2 of the algorithm. Tokens are successively routed through given dimensions, the dimensions being visited in some fixed order. For simplicity, we visit dimensions in increasing order. Consider again Figure 3. Processors first distribute tokens in dimensions one and two, i.e. on every plane, before distributing tokens along the third dimension. Suppose now that the upper plane has one extra token while the plane in the middle lacks one token and the lower plane is balanced in the number of tokens. Knowing both the sum of tokens available on the lower plane and the final distribution wished by these processors, the middle and upper plane conclude that they will not exchange any tokens with the lower plane.
With the same information, they also know that they must themselves exchange exactly one token. There will thus be one individual processor of the highest plane that has to send one token to its neighbor processor on the middle plane. The difficulty comes from the fact that the processors of the middle plane distribute their local tokens and that there will be one processor that gets the hole, i.e. that lacks one token. The location of this hole must be known to the above processors so they can put their extra token on their corresponding processor. Hence, the distribution strategy used by the processors of the middle plane must be known to the above processors.
Concerning the middle processors, things are much easier. They distribute their local tokens according to the distribution strategy and thereby put the hole on just any processor knowing that the exceeding plane will manage to send the missing token to the right place.
Thus, inspecting the token exchanges needed along the highest dimension gives the distribution of tokens that is needed on every plane before communication is done in this dimension. This information is then used to determine how tokens must be sent in the lower dimensions in order to achieve the correct distribution on the different planes. The computation of the token messages is a recursive procedure that computes the successively needed local distributions and communications starting with the highest dimension. Its result will be the information of how many tokens are to be sent from a given processor to one of its neighbors in a given dimension.
The effective exchange of tokens is done during step 3 of the algorithm and proceeds in increasing order of dimensions. Figure 5 shows an example of the token routing. The distribution on the left gives the initial state while the final state is on the right. The intermediate distributions are those obtained after tokens have been routed through the first respectively second dimension.
Solving mixed integer programming problems on the Cray T3D
(1)
where c is a cost vector of size n, A a matrix of size m x n, b a vector of size m and
, a set of distinct indices which identify integer variables (see also[13]). In order to solve a MIP problem using the branch and bound algorithm, we need to compute lower and upper bounds for subproblems in order to be able to subdivide a problem into zero or more subproblems and to test if a given subproblem has a feasible solution.
Table 1 - Summary of test problems from the MIPLIB
To compute a lower bound of a MIP subproblem, we solve the relaxed linear programming problem associated, i.e. the MIP subproblem without the integrity constraints. Depending on the MIP problem structure, the gap between the lower bound and the optimal solution may be large.
Although knowing good upper bounds does not diminish the overall execution time , they are important in order to reduce memory consumption, especially when using a best-first branch and bound strategy. We implemented a simplified strategy which is based on rounding the value of integer variables having non integer values in the solution of the relaxed problem and iteratively removing inadmissibilities occurring due to these roundings.
Typically a branching strategy subdivides a given feasible MIP problem into one or more feasible and disjoint subproblems. The branching strategy hopefully increases the lower bound for each of the newly generated subproblems, but an increase cannot be guaranteed.
In the case of a MIP subproblem P, we chose a variable xj which has a non integer value
in the solution of the relaxed problem of P and which must be integer, i.e. j ‘ B. Now the branching strategy subdivides the problem P into two disjoint subproblems P1 and P2 such that
.
The selection of the variable used in the branching of a subproblem is based on the estimated degradation of the objective function, the notion of pseudo-costs and the previously branched subproblems. A detailed description of the selection of the branching variable may be found in [4].
Before studying the efficiency of the PSBB algorithm, we would like to stress the fact that on the Cray T3D, different executions give very similar results considering the execution time, the number of nodes branched and the number of synchronizations. The only non determinism in the PSBB algorithm is due to the variable arrival time of signals. On the Cray T3D, we implemented signal handling using the very fast virtual shared memory primitives. Therefore signals usually arrive approximately at the same instant of execution during various runs. From this, we conclude that the data obtained by a single run are representative.
As we can see from Figure 6, we obtain an efficiency of 73% on 128 processors when solving the problem bell3a. About 15% more nodes were explored by the parallel algorithm than by the sequential one. From this we can deduce that about 23.3% of the overall execution time is lost in synchronization and waiting. There are two major reasons for this behavior.
First, at the beginning and at the end of the execution, there are no enough subproblems to form large enough macro fringes. This is due to the structure of the problem solved and is independent of any particular parallel branch and bound algorithm.
Second, if one processor has no more work, it has to wait until the last processor has received its notification of starvation and has finished branching its current subproblem. This problem may be efficiently solved by using an interrupt based message passing system. Such a mechanism is not available on the Cray T3D machine. Generally, 15% of the total execution time is spent in synchronization and waiting and we estimate that about 30% of this time is effective synchronization time.
Interestingly, for the problem egout, we measure a superlinear speedup. This is due to the fact that the size of the tree explored decreases with the number of processors used to solve it. This consists a rather special case as the tree size explored tends to increase with the number of processors used (see Figure 6).
The p0033 problem is completely different from egout. The number of subproblems branched considerably increases with the number of processors. On a 128 processor execution, 1.6 times the number of nodes branched by the sequential algorithm are expanded. This has a direct impact on the efficiency, which is in this case about 30%, as can be seen on figure 6: Speedup and figure 6: Overhead(in %).
Solving easy problems
On figure 7: Speedup and figure 7: Overhead(in %)
, we have represented the results obtained from solving the three easy MIP problems fluglp, pipex and p0201. As it can be seen, the speedup obtained is nearly linear up to a given number of processors (32, 16 and 8). At that point, the problem size becomes too small and there is not enough work to occupy all the processors. This situation is typical when using massively parallel machines to solve small problems. Nevertheless, the efficiency could in our case be increased by parallelizing the branching and bounding functions. But, this is a major change in the parallelization and makes the parallelization dependent on the problem considered.
Conclusion
In this paper we presented a parallelization of the best-first branch and bound algorithm designed for massively parallel super-computers. We showed that the branch and bound algorithm can be expressed as a sequence of independent computations. This is the base for the design of our PSBB algorithm which uses a very simple control structure consisting in alternations of parallel computations and synchronizations. The synchronizations serve two purposes: first, they control the dynamic scheduling of the PSBB algorithm. And second, they equilibrate the load of the machine by using an algorithm to solve the token distribution problem. This algorithm, called DOTD algorithm, is designed for a broad class of interconnection networks. As the DOTD algorithm minimizes the information exchange needed to compute the token routing, it is a scalable approach.
We showed the performances obtained when using a Cray T3D massively parallel super-computer and applying our PSBB algorithm to realistic mixed integer programming problems. This shows that a global approach based on synchronizations is well adapted to control the dynamic behavior of the program.
In further work, we will apply the PSBB algorithm to other problems, like the traveling salesman problem and use other parallel machines, like the Intel Paragon. We will also try to apply these principles to more structured trees, like game trees.
Bibliography
article paru EPFL Supercomputing Review - n. 6 - nov. 94