O(N) Tight-Binding Molecular Dynamics on the T3D

by Andrew Canning, Cray Research Switzerland, Giulia Galli, Francesco Mauri, Alessandro De Vita and Roberto Car, Institut Romand de Recherche Numérique en Physique des Matériaux, EPFL, CH-Lausanne

We present a parallel implementation of an O(N) tight-binding molecular dynamics algorithm on the T3D parallel computer. The localisation of the orbitals in the O(N) algorithm introduces a sparse nature to the orbital data and Hamiltonian which greatly changes the coding on parallel machines compared to a non-localised system.
The data distribution, communication routines and dynamic load balancing scheme used in the program are presented in detail along with the speed and scaling of the code on various homogeneous and inhomogeneous physical systems. Performance results will be presented for systems of 2048 to 32768 carbon atoms on 32 to 512 processors. We discuss the relevance to quantum molecular dynamics simulations with localised orbitals, of techniques used for programming short-range classical molecular dynamics simulations on parallel machines.
The linear nature of the scaling of the communications with system size and the localised nature of the orbitals makes these algorithms extremely scalable in terms of memory and speed on parallel systems with fast communications. .

Summary


Introduction

Figure 1: Fullerenes (C28) deposited on a reconstructed diamond (111) surface. The system shown has a supercell containing a 3072 atom slab of diamond with 52 C28's deposited on the surface.

The evaluation of the Energy in Density Functional Theory (DFT) calculations using the Local Density Approximation (LDA), such as those used in Car-Parrinello type Molecular Dynamics (MD) simulations [1], requires of order N**3 steps where N is the number of atoms in the system. This limits the size of systems that can be treated with these first-principles approaches to a few hundred, even with the most powerful modern computers. Tight-Binding (TB) models, while still presenting a quantum mechanical approach to molecular dynamics, have a much simpler basis set and Hamiltonian, greatly reducing the computational cost of an MD step. However, the scaling of the computational cost is still of O(N**3) even though the prefactor is much smaller than for first-principles calculations. This limits the number of atoms that can be studied in TB MD simulations to about a thousand on the most powerful modern supercomputers. In order to extend the application of quantum MD to larger systems, many new, so-called O(N) (since their computational cost grows linearly with the system size), techniques have been introduced in recent years. The basic ingredients of orbital based O(N) approaches are the spatial localisation of the orbitals and a novel energy functional which does not involve explicit orthogonalisation of the orbitals or calculation of the inverse of the overlap matrix. It is well known that for certain systems e.g. insulators, the wavefunctions can be represented by exponentially localised orbitals (Wannier type functions), therefore, it should be possible to exploit this localisation in MD calculations.

In this paper we will discuss our approach to the parallel programming of the O(N) energy functional proposed independently in references[2, 3] and[4] and extended to a non-orthogonal representation of the orbitals in [5]. However, a lot of the techniques presented in this paper are applicable to other O(N) algorithms in the context of a TB model and to LDA implementations of O(N) methods. We will not present in detail the calculation of all the different terms for our MD simulation but will rather present an overview illustrated with a few typical calculations such as the overlap matrix. The TB O(N) scheme has already been used with a serial code to study fullerene impacts on a semiconducting surface[7] and one of the target problems for our parallel code is an inhomogeneous system of fullerenes deposited on a diamond surface (see figure 1). In the section which follows we will discuss the implementation of the O(N) TB MD in a serial code. In particular we will present how the Energy functional and its derivative are calculated in the context of sparse matrices. In the next section we will discuss the data distribution and communication routines used in the parallel code. In the following section we will look at techniques used in programming classical MD simulations on parallel machines that are relevant to our quantum simulations. In the section on dynamic load balancing, we will discuss its application to inhomogeneous systems such as fullerenes arriving on a diamond surface. In the last section we will give results for the speed and the scaling of the code for various systems.

go to the summary

O(N) Tight-binding MD code

Our code uses the conjugate gradient (CG) method to minimise the energy functional for the electronic degrees of freedom and the Verlet algorithm [8] to update the atomic positions in an MD simulation. The basic structure of the code is two large nested loops with the innermost loop performing the minimisation to the ground state of the wavefunctions and the outer loop the atomic dynamics. The minimisation of the energy functional in O(N) calculations typically takes more steps than in non-localised methods and therefore a significant percentage of the calculation time for an MD run is spent in the CG minimisation step. Therefore, in this paper, we will devote most of the discussion of parallelisation and optimisation to the CG part of the code rather than the atomic update. In order to perform the CG minimisation it is necessary to calculate the energy functional and its derivative.

In the TB model the wavefunctions are expanded on the basis set of atomic orbitals at all the atomic sites. In the context of the O(N) TB algorithm a localisation region for the orbital representation of the wavefunctions can then be identified with a set of atomic sites centered around a given atom. In our code atoms are included in a localised region (LR) if they belong to the th nearest neighbour of the center atom where we have a cut-off distance to define, whether or not, an atom is a nearest neighbour. In this way, by varying the value of , we vary the localisation region of the orbital.

The localised orbitals can then be represented by a set of coefficients such that

(1)

where i is the atomic site on which the orbital is localised; j is summed over all the atomic sites in the localisation region of site i i.e. is summed over the set of basis functions atomic orbitals in our case) where is the k'th basis function at site j. Outside the localisation region the values of the coefficients are zero. In a non-localised TB formulation, would be non-zero at all atomic sites. In the program the orbitals are stored as sparse coded matrices in the sense that we only store the non-zero values of : the values in the localisation region LRi, and a second matrix tells us which atomic sites each of these values corresponds to. In the program the complete set of orbitals for the system is stored as a four dimensional sparse coded matrix of the form C(k,m,j,i) where m is the index for the orbitals having the same localisation region and a two dimensional matrix NEIG(j,i) lists the true site numbers for each C value. A further vector NNEIG(i), contains the number of sites in each localisation region. It would be prohibitive, in terms of memory, and computationally inefficient to work with the full C matrix containing all the zero values of the C's outside the localisation regions. It should also be noted that while for non-localised problems the overlap matrix, Hamiltonian etc. are usually calculated with Basic Linear Algebra Subprograms (BLAS), typically written in assembly language and highly machine-optimised, we now have to hand code the sparse matrix multiplications in Fortran. This leads to codes which typically have a MFLOP speed lower than non-localised codes but the "time to solution" for large systems is, of course much lower than standard O(N**3) non localised codes.

In this paper we will discuss results for the code using the energy functional for non-orthogonal localised orbitals presented in reference [5]. This is essentially the same as the energy functional for localised orbitals presented in [2, 4] except that the number of orbitals is chosen to be larger than the number of electronic states. All the coding techniques we will discuss apply to both functionals. For more discussion on the physics behind these energy functionals the reader should consult these papers and the references therein. The energy functional used in our program is given by

, (2)

where N is the number of electrons in the system. M is the number of orbitals which is chosen to be larger than N/2; the number of occupied states. The number of occupied states is half the number of electrons due to the spin degeneracy of the electrons. In all our calculations we choose corresponding to three orbitals for each of the localisation regions. is a scalar parameter (chemical potential) chosen to give the correct total charge and Q is the first order expansion of the inverse of the overlap matrix,

Q = 2I-S (3)

where S is the overlap matrix and I is the identity matrix. As pointed out in reference [6], in order to control unphysical charge transfer for certain systems studied with TB models e.g. systems with dangling bonds, it is often useful to add a Hubbard like term to the energy functional. This term is present in our program and can be switched on and off depending on what type of system is under study. It is given by where qi is the total charge at atomic site i and U is a scalar parameter typically chosen to be 8eV for carbon systems.

The Hamiltonian in equation 2 is the tight-binding Hamiltonian of Xu et al.[6]. This is an empirical tight-binding model where the Hamiltonian matrix elements are parametrised as a function of distance between the atomic sites and atomic orbital type. Thus the different terms in the Hamiltonian can be calculated, knowing only the inter-atomic distances and direction cosines of the vectors between the atoms connected by hopping terms. The same site orbital products are trivial to calculate. The hopping terms lead to products of orbitals on neighbouring sites so to keep track of these terms we must have an enlarged neighbourhood list NEIGH(i,j) which contains all the sites in a given localisation region plus the ones connected to it via the hopping terms. If the Hamiltonian operates on an orbital localised at site i it will produce a new orbital which is localised in the larger region defined by the sites NEIGH(i,j).

A third site index matrix NEIGS(i,j) is required for the sparse coding of S and tells us which pairs of orbitals have non-zero overlap i.e. the element S(i,j) is the overlap between the orbital centered at site i and the orbital centered at site NEIGS(i,j). In fact, the element S(i,j) is itself a small 3 x 3 matrix since we have 3 orbitals centered at each site so that S is a four dimensional matrix in our program. Thus we have three index matrices NEIG(i,j), NEIGH(i,j) and NEIGS(i,j) which tell us the structure of all the sparse data matrices in the program. The minimisation of the energy functional of equation 2 was performed using the conjugate gradient method. In this method at each step t, the gradient, at the current position, is calculated from

(4)

The minimisation direction D**t , which is conjugate to all the former directions of search, is then determined and the energy is minimised along the line where C**t is the current position. The energy functional is, in fact, a quartic in (8th power if the Hubbard term is included) which can be calculated explicitly by evaluating the energy functional in equation 2 at . Once has been determined the next point in the conjugate gradient process can then be calculated . This whole process is repeated until the required tolerance on the energy is reached. The majority of calculations to evaluate the quartic polynomial and the derivative of the energy functional involve sparse matrix products. We will take as a typical example the calculation of S. As previously noted S is stored as a four dimensional array where the first two indices correspond to the orbitals centered on the same site so that if there was no sparse coding S would be given by,

(5)

where m = 1,3 and n= 1,3 are the indexes for the 3 orbitals on each site; k = 1,4 is the index for the basis set of four orbitals and l,i and j are site indices. The sum over l is only non-zero on the sites where the two orbitals overlap. The type style C(k,m,l,i) is used to denote the way these variable are stored in the program. C(k,m,l,i) is sparse coded only on the l index while S(m,n,j,i) is sparse coded only on the j index. Thus in the calculation of S the indices m,n,k and i are direct indices in the sense that they run over all their possible values while j and l are indirect indices running over only the non-zero values of the matrices. The values of these indirect indices are given through our index matrices NEIG(l,i) and NEIGS(j,i). This makes our calculation different from standard sparse matrix multiplications which are typically of the form where only i is a direct index. In our calculation we are essentially pulling in two 3 x 4 matrices (C(k,m,l,i); m = 1,3; k = 1,4) from indirectly indexed memory locations and multiplying them together to give a 3 x 3 matrix (S(m,n,j,i); m = 1,3; n = 1,3). Therefore in sparse notation S is given by, S(m,n,j,i) =

(6)

where at each MD step we precalculate index lists giving the values of a and b for which NEIG(a,j) = NEIG(b,i) required for the outermost summation. Since we pull in small matrices rather than single values for each indirect memory reference these calculations run much faster on RISC type architectures with cache (such as the T3D) than a standard sparse matrix multiply. Using a data layout which gives the best memory access for the most common loops in our program we achieved speeds of about 30 MFLOPS for the calculation steps e.g. the calculation of S.

The calculation of , once we have constructed , is of the same form as the calculation of S given in equation 6. The calculation of the quartic polynomial for the line minimisation requires the further calculation of where D is the conjugate gradient direction. All these calculations are of the same form as equation 6, the calculation of the overlap matrix. In total, for a typical run, about 65% of the time in the CG step is taken up with sparse matrix multiplications while the remaining 35% is mainly vector matrix products (to calculate the derivative) which are also sparse coded. To perform the MD step the spatial derivative of the total energy E (including the ionic potential and kinetic energy ) must be calculated. E is given by

(7)

where is the band structure energy which takes the form ; is the interatomic potential of the TB model [6] and RI is the atomic position of atom I. The forces on the ions are then given by

(8)

and the ionic positions are updated using the Verlet algorithm [8]. Calculating the contribution to the forces from the electronic band structure term takes the largest percentage of the time in the MD step and, as in the CG step, this calculation mainly involves sparse matrix multiplications. As part of the MD step we must also update the index lists NEIG(i,j), NEIGH(i,j) and NEIGS(i,j) as the sites within the different localisation regions may change. The direction cosines which occur in the matrix elements of the Hamiltonian must also be updated. In a typical MD run where 15 CG steps were performed for each MD step [4,7] the CG steps took 85 to 90% of the time.

go to the summary

Molecular Dynamics on Parallel Computers

It may be thought that the localisation of the orbitals in our Quantum MD calculation would lead to a system which is more similar in nature, from a programming point of view, to short-range classical MD than standard Quantum MD simulations. In this section we will discuss some of the techniques used in programming classical MD with short-range interactions which are relevant to our localised quantum MD calculations and we will also point out some of the major differences between these two types of simulations. In short-ranged classical MD simulations on parallel machines there are considered to be two main ways to distribute the data among the processors often referred to as particle and spatial distribution. In the case of a spatial distribution the cell in which we carry out the simulation is divided spatially into blocks which are then allocated to different processors. Each processor then stores all the data associated with the particles within its block (which may change during the simulation) and calculates the trajectories for these particles. In the case of a particle type distribution all the particles are divided among the processors (typically each processor dealing with the same number of particles) and throughout the simulation each processor will store the data associated with its set of particles (which do not change) and calculate the trajectories for its set of particles. With this type of distribution, if the particles are moving rapidly during the simulation (say in the case of a liquid or gas for example), there is no spatial correlation between the particles on the processors and their physical positions in the simulation. This means that when the data of a physically neighbouring particle is required, there is a high probability it will not be on the same processor and not even on a processor which is physically close. The particle data distribution can therefore lead to a high cost in communication between processors although it has the advantage that it is easy to load balance the calculations by giving roughly the same number of particles to each processor. On the other hand the spatial decomposition method can lead to large load imbalances in the calculations on each processor if the relative number of particles in each spatial region varies greatly. However, due to the spatial correlation between the particles on the processors and the true physical positions of the particles in the simulation it is highly probable that the data of neighbouring particles required to update a given particle will be on the same processor or at worst a physically close neighbouring processor. This means that a spatial distribution of the particles tends to be very efficient from a communications point of view but bad from a load balancing point of view. The best algorithms typically perform a combination of these two types of distribution giving roughly equal numbers of particles to each processor which have spatial locality and redistributing the particles to the processors, every few MD steps, if the particles move too much and lose there spatial locality. In this way both the constraints of load balancing and minimising communication can be partially satisfied. It is an approach similar to this that we have followed for our Quantum MD program.

While we have drawn some analogies between classical short-range MD and quantum MD with localised orbital there are several major differences which will make our parallel implementation rather different from techniques used in classical MD. In particular, due to the minimisation of the electronic degrees of freedom, the amount of calculation required to update an atom at each MD step in quantum simulations is very much larger than the update of an atomic position for classical MD. This has implications for the loadbalancing and also the techniques used to calculate the neighbourhood lists.

go to the summary

Parallel tight-binding code

In our program we used a particle/orbital distribution of the data among the processors so that each processor has complete orbitals. In the section on load balancing we will discuss how we implement spatial locality in the context of a particle like distribution for an inhomogeneous system. The reader may assume, in this section, that the code is written for a solid with periodic boundary conditions (such as bulk diamond) with the system being divided up into identical blocks, each processor dealing with the orbitals centered within this block and the atomic motion of the atoms in that block. We also assume the atoms are moving so little that they do not cross the boundaries of any of these blocks. For this system we have no load balancing problems and a spatial locality of the mapping of the orbitals onto the processors. We chose a particle/orbital distribution (processors having complete orbitals) rather than a spatial type distribution (single orbitals being distributed between processors) for the following reasons. It is the simplest to program as the parallel code loop structures remain essentially the same as the serial code with a lot of the subroutines remaining almost identical. It gives faster calculation speeds on RISC type chips with cache than other methods. The communication cost of passing the orbital data between processors is modest due to the localised nature of the orbitals which means we only need to pass the orbital data for the overlapping portions of the orbitals which are not on the same processor. In our localised TB formulation the center of the localisation region is an atomic site so we can associate an orbital, as well as the atoms, to atomic sites. Therefore, each processor has a subset of atomic sites allocated to it and it will have stored in its memory all the data for the complete orbitals centered on that site plus all the atomic data for its sites. At each conjugate gradient step we require the values of the orbitals on other processors which are overlapping with the orbitals on our processor. We used a technique similar to ghost cells used in solving differential equations on parallel machines in that at each CG step each processor copies into a dummy array all the off-PE orbital elements that are required to update its own orbitals. Since during the CG steps the particles are not moving we can at the start of each MD step make a list of the processor numbers and memory locations of all the required off-PE orbitals. This list can then be used at each CG step to copy in the required data. It should be noted that during the conjugate gradient step, plus the conjugate gradient directions must also be communicated between the processors. Our program is written in a message passing format using Fortran77 and PVM while the communication routines for the orbitals were written using the faster SHared MEMory (SHMEM) library routines which are native to the T3D. In addition to carrying all the local information for its set of particles and orbitals, each processor has global lists running over all the particles in the system which tell it which processor deals with a given particle and what is the local address of that particle. Using these global lists and its local neighbourhood lists each processor can determine the location of all its neighbouring orbitals. In addition each processor has a complete list of all the atomic sites so that the new neighbourhood lists can be calculated in parallel when the particles move.

go to the summary

Dynamic Load Balancing

There are now many articles on different load balancing schemes in short-range classical molecular dynamics applied to a range of different physical problems [10,13,14]. Load balancing becomes an issue for systems where the amount of time to calculate the new position varies from particle to particle. Giving out the same number of particles to each processor would then cause processors to idle while waiting for the processor with the most work to finish. As discussed in previous sections, the main problem in developing a loadbalancing algorithm to divide the calculations among the processors is to satisfy the (often conflicting) constraints of spatial locality of the data on the processors and loadbalancing of the calculations between processors. If particles/orbitals are rapidly moving during the MD simulation any loadbalancing algorithm which does not take into account spatial locality will lead to increased communications during the run. The most sophisticated load balancing algorithms for classical MD simulations typically involve division of the system into a number of different sized regions equal to the number of processors such that the calculations, for all the particles in each region, take roughly the same time [12]. These regions should then have a surface area to volume ratio as low as possible in order to minimise the amount of communications. Algorithms to perform this type of loadbalancing are often difficult to implement in three dimensions and can have numerical instabilities and bad convergence. In most algorithms the region that each processor deals with is topologically connected. In our load balancing algorithm we will weakly relax this constraint while strongly satisfying the load balancing constraint on the calculations. In our typical simulations the communications only take about 10% of the time so that even if the spatial locality was strongly satisfied we would only expect a few percent difference in the program run time between our solution and the optimal solution. On the other hand the calculations take about 90% of the run time therefore good load balancing is extremely important.

The first step in our load balancing algorithm is to spatially divide the system into three dimensional blocks. Typically the number of blocks is chosen to be equal to the number of processors but this is not a necessary condition for our load balancing algorithm. In our program we read in three parameters which correspond to the number of blocks in the x,y and z directions. We then construct a one dimensional list of all the sites by ordering the blocks as a one dimensional list with x,y,z ordering and each block we write as a one dimensional list with z,y,x ordering. It is important that the ordering of the sites for each block is the reverse of the block ordering to give good spatial locality although it is not necessary to order the blocks as x,y,z. There may be systems where better spatial locality is obtained by ordering with the y or z coordinate being the most rapidly varying. In this way we construct a one dimensional list that has three dimensional spatial locality. We now wish to load balance on this list giving out the sites to each processor such that each processor does the same amount of work. In essentially all the subroutines in the program the outermost loop is over the sites allocated to each processor. In a typical subroutine, such as the calculation of the overlap matrix, we explicitly time the calculations associated with each site by adding a subroutine timing call within the loop over the sites. We now have a time ti associated with the calculations for each site where i = 1,2, ..., N, the order of sites being the one dimensional list we have constructed with spatial locality. We now calculate the time each processor would take to do the calculations if the system was perfectly load balanced i.e.

(9)

where is the number of processors. We then hand out contiguous strips of sites to each processor, from our one dimensional list, such that the sum of the times (ti) on each processor are as close as possible to . In this way we achieve an essentially optimal load balancing while still having a high degree of spatial locality of the mapping of the sites onto the processors. This means m sites, j,j + 1,j + 2, ..., j + m are allocated to a given processor such that,

(10)

Providing the the ratio of the number of atoms to the number of processors is not too small, we have found this load balancing algorithm to work extremely well. In a typical simulation where we had ~ 40 sites per processor we achieved load balancing down to a few percent. We perform the loadbalancing dynamically during our MD run every 10 to 20 MD steps depending on how rapidly the atoms are moving in our simulation. All the data for the orbitals and the atoms is reorganised in memory to correspond to the new mapping of sites to processors. Due to the computationally intensive nature of quantum MD the load balancing typically takes less than one percent of the run time if it is carried out every 20 MD steps. This should be contrasted with classical MD where the load balancing can easily become a significant percentage of the run time. It should be noted that our load balancing algorithm is very general in that the same program can run any geometry of system, even systems of different dimensions, on any configuration of processors.

go to the summary

Performance

In this section we will present performance results for an idealised bulk diamond systems and for the inhomogeneous system of fullerenes arriving on a reconstructed diamond surface which was discussed in the previous section on load balancing. The inhomogeneous system used for these tests consists of 52 fullerenes, each of 28 carbon atoms, deposited on a diamond slab (3072 atoms) of 12 double layers, with periodic conditions in the x and y directions (see figure 1). The performance figures were taken for a typical dynamical run, with dynamic loadbalancing every 20 MD steps, where 5 fullerenes are deposited on top of 47 fullerenes already bonded to the diamond slab. The bulk diamond system used in the tests has periodic boundary conditions in the x, y and z directions. As mentioned in the section on the parallel tight-binding code each processor deals with a geometrically equivalent sub-block of the diamond system. During the MD runs on bulk diamond the atomic motion was constrained so that the neighbourhoods defining the localisation regions remained unchanged throughout the calculation. The calculation times for constructing the index lists at each MD step are still included in the performance figures to make them more representative of an MD run on a real system in a research environment. Due to the static and homogeneous nature of this problem no load balancing was required or performed during these runs.

Figure 2: - " Weak scaling " of the code for a bulk diamond system of 2048 to 32768 atoms running on 32 to 512 processors where the localisation region for each orbital contains 45 sites. The curve of circles shows the code performance for the conjugate gradient minimisation and the curve of squares shows the performance for a full molecular dynamics run.

In figure 2 we show the results for the " weak scaling " of our code where the size of the system is increased proportionally with the number of processors The results shown are for a 2048 atom bulk diamond system on 32 processors up to a 32768 atom system on 512 processors. The localisation region is taken up to the 3rd nearest neighbour giving 45 atomic sites in each localisation region. This is a localisation region which is sufficiently large to give good convergence and very small errors (compared to a non-localised algorithm) for carbon based insulators. The results for the MD runs are given with 15 CG steps for each MD step which is representative of the amount of CG steps required for convergence in a typical run [4,7] for this type of system. As can be seen from the graph the speed up is extremely linear and very close to ideal. The full MD simulation MFLOP speed is slightly lower than the CG step due to the index lists that are constructed at each MD step. Also some of the calculations for the atomic update, run at a slightly lower MFLOP speed than the CG step. We present the results in GFLOPs rather than seconds for an MD step as the smaller systems have wraparound effects on the localisation regions which means that proportionally less time is required for the MD step than larger systems. It should be noted that the algorithm used is strictly O(N), in terms of the number of calculations per MD step, so that provided there are no wraparound effects on the localisation regions the speed up in GFLOPS translates directly into an identical increase in the number of atoms updated per second. All of these simulations spent about 14% of the time in communications and 86% doing calculations. An MD step, including 15 CG minimisation steps of the orbitals, took about 50s for the 32768 atom system on 512 processors, running at above 11 GFLOPS. The weakscaling curves for other ratios of particles to processors and different localisation regions are also essentially linear with nearly ideal speedup. Due to the localised nature of the problem the ratio of communications to calculations remains the same as we scale up the problem and the communications remain local. The GFLOP speed for the inhomogeneous system on 128 processors is ~ 8% lower than the bulk diamond system. The lower GFLOP speed is caused by two main factors the first of which is the load imbalance introduced into the problem by the inhomogeneous nature of the system and the fact the particles are moving. The second most important factor is the increase in communications due to the less spatially structured nature of the problem which makes the spatial mapping of the system to the processors less direct than the bulk diamond system. A third less important factor is the extra time required to run the loadbalancing algorithm every 20 MD steps although this is only about 1% of the program time. Without dynamic loadbalancing our test run is about twice as slow, which is due to the fact that when the fullerenes arrive at the surface the number of sites in the localisation regions of their atoms can increase by factors of 2 to 3.

These results should be compared with the vector code that runs at about 270 MFLOPS on one processor of a C90 for the bulk diamond system. The number of floating point operations per MD step is, to within a few percent, the same for the vector and serial code. It should be noted that the structure of the loops in the vector code is very different from that of the parallel code. The innermost loop in the vector code must be over sites since this is the only loop long enough to efficiently exploit the vector processing while in the parallel code we parallelise over the site loop which must therefore be the outermost loop.

Figure 3: - " Strong scaling " of the code for two bulk diamond systems of 2048 and 4096 atoms, and the inhomogeneous system shown in figure 1. All orbital localisation regions are taken up to the third nearest neighbour.

Figure 3 shows the " strong scaling " performance for the code where the size of the system is kept constant and the number of processors is increased. Results are shown for 32 to 256 processors for the 2048 and 4096 atom bulk diamond system and also the inhomogeneous system. The choices of systems are meant to be typical of the size of problem that would be run in a research environment on a system of 32 to 128 processors. The speed up is very linear and in the case of 4096 atoms of diamond there is a 7.3 speed up in going from 32 to 256 processors. The speed up is lower for the smaller 2048 atom system since when we are in the regime where the number of atoms on each processor is very small (8 atoms per processor for 256 processors), the amount of time in communications becomes more important. This is because the fewer atomic sites we have on each processor the more probable it is that the orbitals, required to perform the calculation of the overlap matrix elements etc. for these sites, will not be on that processor and will have to be communicated. Typically the communicated data will also be less local on the processors when we have a few atoms on each processor thus, the communication times will be larger. In order to reduce the communication cost of studying small systems on a large number of processors it may be more efficient to use a different data distribution where the data for each orbital is spatially distributed over the processors. However, it is not clear that the overall speed would be faster as the MFLOP speed for the calculations may decrease. This type of data distribution was implemented in reference [15] on a CM5 but the single PE performance is very much lower that what we find for our implementation. The load balancing also becomes more difficult as we cannot simply load balance over sites since the calculations associated with the MD step for each individual atom are distributed over processors. All our production runs of the code were never in the regime where there was a few sites on each processor which would only be useful for studying small systems over very long timescales.

go to the summary

Conclusions

In this paper we have presented a particle type data distribution scheme of an O(N) TB molecular dynamics scheme. We have shown that the local nature of orbitals translates directly into a parallel algorithm that has only local communications between processors. While the amount of interprocessor communications in our program is significant the locality of these communications leads to an algorithm whose ratio of communications to particle number remains constant as we scale up the processor number and system size. We have also presented a simple and robust load balancing scheme for load imbalance found in inhomogeneous systems or systems with rapidly moving atoms. Parallel machines with fast communications, like the T3D, are very cost-effective for these types of simulation as well as allowing the possibility to scale to larger systems than could be studied on conventional supercomputing platforms. Our program has now been used to study many different problems e.g. fullerene deposition, in a variety of carbon based systems. The results of these simulations and a more detailed description of the code will be published elsewhere.

Acknowledgements

This work was done on the T3D at EPFL as part of the PATP (Parallel Applications Technology Program) joint project between the EPFL and Cray Research. Support is also acknowledged from the Swiss National Science Foundation. In the course of this work we have benefited from useful discussions with other PATP collaborators and staff at Cray Research, Eagan, USA.

References

[1] R. Car and M. Parrinello, Phys. Rev. 55, 2471 (1985).

[2] F. Mauri, G. Galli, and R. Car, Phys. Rev. B 47, 9973 (1993).

[3] F. Mauri, G. Galli, Phys. Rev . B 50, 4316 (1994).

[4] P. Ordejón, D. Drabold, M. Grunbach, and R. Martin, Phys. Rev. B 48. 14646 (1993).

[5] J. Kim, F. Mauri, G. Galli, Phys. Rev. B 52, 1640 (1995).

[6] C. Xu, C. Wang, C. Chan, and K. Ho, J. Phys. Condens. Matter 4, 6047 (1992).

[7] G. Galli and F. Mauri, Phys. Rev. Lett. (in press).

[8] L. Verlet, Phys. Rev. 159, 98 (1967).

[9] L. J. Clarke, I. Stich and M. C. Payne, Comp. Phys. Commun. 72, 14 (1992)

[10] S. Plimpton, pre-print SAND91-1144, Sandia National Laboratories, (1993).

[11] A. Canning, A. De Vita, G. Galli, F. Gygi, F. Mauri and R. Car, Procs. of the 94 Fall, Cray User Group Conf. (edited by B. and K. Winget), 18 (1995).

[12] Y. Deng, R. A. McCoy, R. B. Marr and R. F. Peierls, Procs. of the Seventh SIAM Conf., (edited by D. H. Bailey et al), 605, (1995).

[13] D. C. Rapaport, Comp. Phys. Commun. 62, 198 (1991), D. C. Rapaport, Comp. Phys. Commun. 62, 217 (1991).

[14] W. Smith, Comp. Phys. Commun. 62, 229 (1991).

[15] S. Itoh, P. Ordejón and R. Martin, University of Illinois preprint, (1995).


REFER TO CONTENTS