

We present a parallel implementation of an O(N) tight-binding molecular dynamics algorithm on the T3D parallel computer. The localisation of the orbitals in the O(N) algorithm introduces a sparse nature to the orbital data and Hamiltonian which greatly changes the coding on parallel machines compared to a non-localised system.
The data distribution, communication routines and dynamic load balancing scheme used in the program are presented in detail along with the speed and scaling of the code on various homogeneous and inhomogeneous physical systems. Performance results will be presented for systems of 2048 to 32768 carbon atoms on 32 to 512 processors. We discuss the relevance to quantum molecular dynamics simulations with localised orbitals, of techniques used for programming short-range classical molecular dynamics simulations on parallel machines.
The linear nature of the scaling of the communications with system size and the localised nature of the orbitals makes these algorithms extremely scalable in terms of memory and speed on parallel systems with fast communications.
.

Figure 1: Fullerenes (C28) deposited on a reconstructed diamond (111) surface. The system shown has a supercell containing a 3072 atom slab of diamond with 52 C28's deposited on the surface.
The evaluation of the Energy in Density Functional Theory (DFT) calculations using the Local Density Approximation (LDA), such as those used in Car-Parrinello type Molecular Dynamics (MD) simulations [1], requires of order N**3 steps where N is the number of atoms in the system. This limits the size of systems that can be treated with these first-principles approaches to a few hundred, even with the most powerful modern computers. Tight-Binding (TB) models, while still presenting a quantum mechanical approach to molecular dynamics, have a much simpler basis set and Hamiltonian, greatly reducing the computational cost of an MD step. However, the scaling of the computational cost is still of O(N**3) even though the prefactor is much smaller than for first-principles calculations. This limits the number of atoms that can be studied in TB MD simulations to about a thousand on the most powerful modern supercomputers. In order to extend the application of quantum MD to larger systems, many new, so-called O(N) (since their computational cost grows linearly with the system size), techniques have been introduced in recent years. The basic ingredients of orbital based O(N) approaches are the spatial localisation of the orbitals and a novel energy functional which does not involve explicit orthogonalisation of the orbitals or calculation of the inverse of the overlap matrix. It is well known that for certain systems e.g. insulators, the wavefunctions can be represented by exponentially localised orbitals (Wannier type functions), therefore, it should be possible to exploit this localisation in MD calculations.
In this paper we will discuss our approach to the parallel programming of the O(N) energy functional proposed independently in references[2, 3] and[4] and extended to a non-orthogonal representation of the orbitals in [5]. However, a lot of the techniques presented in this paper are applicable to other O(N) algorithms in the context of a TB model and to LDA implementations of O(N) methods. We will not present in detail the calculation of all the different terms for our MD simulation but will rather present an overview illustrated with a few typical calculations such as the overlap matrix. The TB O(N) scheme has already been used with a serial code to study fullerene impacts on a semiconducting surface[7] and one of the target problems for our parallel code is an inhomogeneous system of fullerenes deposited on a diamond surface (see figure 1). In the section which follows we will discuss the implementation of the O(N) TB MD in a serial code. In particular we will present how the Energy functional and its derivative are calculated in the context of sparse matrices. In the next section we will discuss the data distribution and communication routines used in the parallel code. In the following section we will look at techniques used in programming classical MD simulations on parallel machines that are relevant to our quantum simulations. In the section on dynamic load balancing, we will discuss its application to inhomogeneous systems such as fullerenes arriving on a diamond surface. In the last section we will give results for the speed and the scaling of the code for various systems.
In the TB model the wavefunctions are expanded on the basis set of atomic orbitals at all the atomic sites. In the context of the O(N) TB algorithm a localisation region for the orbital representation of the wavefunctions can then be identified with a set of atomic sites centered around a given atom. In our code atoms are included in a localised region (LR) if they belong to the
th nearest neighbour of the center atom where we have a cut-off distance to define, whether or not, an atom is a nearest neighbour. In this way, by varying the value of
, we vary the localisation region of the orbital.
The localised orbitals can then be represented by a set of coefficients
such that
(1)
where i is the atomic site on which the orbital is localised; j is summed over all the atomic sites in the localisation region of site i i.e.
is summed over the set of basis functions
atomic orbitals in our case) where
is the k'th basis function at site j. Outside the localisation region the values of the coefficients
are zero. In a non-localised TB formulation,
would be non-zero at all atomic sites. In the program the orbitals are stored as sparse coded matrices in the sense that we only store the non-zero values of
: the values in the localisation region LRi, and a second matrix tells us which atomic sites each of these values corresponds to. In the program the complete set of orbitals for the system is stored as a four dimensional sparse coded matrix of the form C(k,m,j,i) where m is the index for the orbitals having the same localisation region and a two dimensional matrix NEIG(j,i) lists the true site numbers for each C value. A further vector NNEIG(i), contains the number of sites in each localisation region. It would be prohibitive, in terms of memory, and computationally inefficient to work with the full C matrix containing all the zero values of the C's outside the localisation regions. It should also be noted that while for non-localised problems the overlap matrix, Hamiltonian etc. are usually calculated with Basic Linear Algebra Subprograms (BLAS), typically written in assembly language and highly machine-optimised, we now have to hand code the sparse matrix multiplications in Fortran. This leads to codes which typically have a MFLOP speed lower than non-localised codes but the "time to solution" for large systems is, of course much lower than standard O(N**3) non localised codes.
In this paper we will discuss results for the code using the energy functional for non-orthogonal localised orbitals presented in reference [5]. This is essentially the same as the energy functional for localised orbitals presented in [2, 4] except that the number of orbitals is chosen to be larger than the number of electronic states. All the coding techniques we will discuss apply to both functionals. For more discussion on the physics behind these energy functionals the reader should consult these papers and the references therein. The energy functional used in our program is given by
, (2)
where N is the number of electrons in the system. M is the number of orbitals which is chosen to be larger than N/2; the number of occupied states. The number of occupied states is half the number of electrons due to the spin degeneracy of the electrons. In all our calculations we choose
corresponding to three orbitals for each of the localisation regions.
is a scalar parameter (chemical potential) chosen to give the correct total charge and Q is the first order expansion of the inverse of the overlap matrix,
Q = 2I-S (3)
where S is the overlap matrix
and I is the identity matrix. As pointed out in reference [6], in order to control unphysical charge transfer for certain systems studied with TB models e.g. systems with dangling bonds, it is often useful to add a Hubbard like term to the energy functional. This term is present in our program and can be switched on and off depending on what type of system is under study. It is given by
where qi is the total charge at atomic site i
and U is a scalar parameter typically chosen to be 8eV for carbon systems.
The Hamiltonian in equation 2 is the tight-binding Hamiltonian of Xu et al.[6]. This is an empirical tight-binding model where the Hamiltonian matrix elements are parametrised as a function of distance between the atomic sites and atomic orbital type. Thus the different terms in the Hamiltonian can be calculated, knowing only the inter-atomic distances and direction cosines of the vectors between the atoms connected by hopping terms. The same site orbital products are trivial to calculate. The hopping terms lead to products of orbitals on neighbouring sites so to keep track of these terms we must have an enlarged neighbourhood list NEIGH(i,j) which contains all the sites in a given localisation region plus the ones connected to it via the hopping terms. If the Hamiltonian operates on an orbital localised at site i it will produce a new orbital which is localised in the larger region defined by the sites NEIGH(i,j).
A third site index matrix NEIGS(i,j) is required for the sparse coding of S and tells us which pairs of orbitals have non-zero overlap i.e. the element S(i,j) is the overlap between the orbital centered at site i and the orbital centered at site NEIGS(i,j). In fact, the element S(i,j) is itself a small 3 x 3 matrix since we have 3 orbitals centered at each site so that S is a four dimensional matrix in our program. Thus we have three index matrices NEIG(i,j), NEIGH(i,j) and NEIGS(i,j) which tell us the structure of all the sparse data matrices in the program. The minimisation of the energy functional of equation 2 was performed using the conjugate gradient method. In this method at each step t, the gradient, at the current position, is calculated from
(4)
The minimisation direction D**t , which is conjugate to all the former directions of search, is then determined and the energy is minimised along the line
where C**t is the current position. The energy functional is, in fact, a quartic in
(8th power if the Hubbard term is included) which can be calculated explicitly by evaluating the energy functional in equation 2 at
. Once
has been determined the next point in the conjugate gradient process can then be calculated
. This whole process is repeated until the required tolerance on the energy is reached. The majority of calculations to evaluate the quartic polynomial and the derivative of the energy functional involve sparse matrix products. We will take as a typical example the calculation of S. As previously noted S is stored as a four dimensional array where the first two indices correspond to the orbitals centered on the same site so that if there was no sparse coding S would be given by,
(5)
where m = 1,3 and n= 1,3 are the indexes for the 3 orbitals on each site; k = 1,4 is the index for the basis set of four orbitals and l,i and j are site indices. The sum over l is only non-zero on the sites where the two orbitals overlap. The type style C(k,m,l,i) is used to denote the way these variable are stored in the program. C(k,m,l,i) is sparse coded only on the l index while S(m,n,j,i) is sparse coded only on the j index. Thus in the calculation of S the indices m,n,k and i are direct indices in the sense that they run over all their possible values while j and l are indirect indices running over only the non-zero values of the matrices. The values of these indirect indices are given through our index matrices NEIG(l,i) and NEIGS(j,i). This makes our calculation different from standard sparse matrix multiplications which are typically of the form
where only i is a direct index. In our calculation we are essentially pulling in two 3 x 4 matrices (C(k,m,l,i); m = 1,3; k = 1,4) from indirectly indexed memory locations and multiplying them together to give a 3 x 3 matrix (S(m,n,j,i); m = 1,3; n = 1,3). Therefore in sparse notation S is given by, S(m,n,j,i) =
(6)
where at each MD step we precalculate index lists giving the values of a and b for which NEIG(a,j) = NEIG(b,i) required for the outermost summation. Since we pull in small matrices rather than single values for each indirect memory reference these calculations run much faster on RISC type architectures with cache (such as the T3D) than a standard sparse matrix multiply. Using a data layout which gives the best memory access for the most common loops in our program we achieved speeds of about 30 MFLOPS for the calculation steps e.g. the calculation of S.
The calculation of
, once we have constructed
, is of the same form as the calculation of S given in equation 6. The calculation of the quartic polynomial for the line minimisation requires the further calculation of
where D is the conjugate gradient direction. All these calculations are of the same form as equation 6, the calculation of the overlap matrix. In total, for a typical run, about 65% of the time in the CG step is taken up with sparse matrix multiplications while the remaining 35% is mainly vector matrix products (to calculate the derivative) which are also sparse coded. To perform the MD step the spatial derivative of the total energy E (including the ionic potential and kinetic energy ) must be calculated. E is given by
(7)
where is the band structure energy which takes the form
; is the interatomic potential of the TB model [6] and RI is the atomic position of atom I. The forces on the ions are then given by
(8)
and the ionic positions are updated using the Verlet algorithm [8]. Calculating the contribution to the forces from the electronic band structure term
takes the largest percentage of the time in the MD step and, as in the CG step, this calculation mainly involves sparse matrix multiplications. As part of the MD step we must also update the index lists NEIG(i,j), NEIGH(i,j) and NEIGS(i,j) as the sites within the different localisation regions may change. The direction cosines which occur in the matrix elements of the Hamiltonian must also be updated. In a typical MD run where 15 CG steps were performed for each MD step [4,7] the CG steps took 85 to 90% of the time.
While we have drawn some analogies between classical short-range MD and quantum MD with localised orbital there are several major differences which will make our parallel implementation rather different from techniques used in classical MD. In particular, due to the minimisation of the electronic degrees of freedom, the amount of calculation required to update an atom at each MD step in quantum simulations is very much larger than the update of an atomic position for classical MD. This has implications for the loadbalancing and also the techniques used to calculate the neighbourhood lists.
plus the conjugate gradient directions
must also be communicated between the processors. Our program is written in a message passing format using Fortran77 and PVM while the communication routines for the orbitals were written using the faster SHared MEMory (SHMEM) library routines which are native to the T3D. In addition to carrying all the local information for its set of particles and orbitals, each processor has global lists running over all the particles in the system which tell it which processor deals with a given particle and what is the local address of that particle. Using these global lists and its local neighbourhood lists each processor can determine the location of all its neighbouring orbitals. In addition each processor has a complete list of all the atomic sites so that the new neighbourhood lists can be calculated in parallel when the particles move.
The first step in our load balancing algorithm is to spatially divide the system into three dimensional blocks. Typically the number of blocks is chosen to be equal to the number of processors but this is not a necessary condition for our load balancing algorithm. In our program we read in three parameters which correspond to the number of blocks in the x,y and z directions. We then construct a one dimensional list of all the sites by ordering the blocks as a one dimensional list with x,y,z ordering and each block we write as a one dimensional list with z,y,x ordering. It is important that the ordering of the sites for each block is the reverse of the block ordering to give good spatial locality although it is not necessary to order the blocks as x,y,z. There may be systems where better spatial locality is obtained by ordering with the y or z coordinate being the most rapidly varying. In this way we construct a one dimensional list that has three dimensional spatial locality. We now wish to load balance on this list giving out the sites to each processor such that each processor does the same amount of work. In essentially all the subroutines in the program the outermost loop is over the sites allocated to each processor. In a typical subroutine, such as the calculation of the overlap matrix, we explicitly time the calculations associated with each site by adding a subroutine timing call within the loop over the sites. We now have a time ti associated with the calculations for each site where i = 1,2, ..., N, the order of sites being the one dimensional list we have constructed with spatial locality. We now calculate the time each processor would take to do the calculations if the system was perfectly load balanced i.e.
(9)
where
is the number of processors. We then hand out contiguous strips of sites to each processor, from our one dimensional list, such that the sum of the times (ti) on each processor are as close as possible to
. In this way we achieve an essentially optimal load balancing while still having a high degree of spatial locality of the mapping of the sites onto the processors. This means m sites, j,j + 1,j + 2, ..., j + m are allocated to a given processor such that,
(10)
Providing the the ratio of the number of atoms to the number of processors is not too small, we have found this load balancing algorithm to work extremely well. In a typical simulation where we had ~ 40 sites per processor we achieved load balancing down to a few percent. We perform the loadbalancing dynamically during our MD run every 10 to 20 MD steps depending on how rapidly the atoms are moving in our simulation. All the data for the orbitals and the atoms is reorganised in memory to correspond to the new mapping of sites to processors. Due to the computationally intensive nature of quantum MD the load balancing typically takes less than one percent of the run time if it is carried out every 20 MD steps. This should be contrasted with classical MD where the load balancing can easily become a significant percentage of the run time. It should be noted that our load balancing algorithm is very general in that the same program can run any geometry of system, even systems of different dimensions, on any configuration of processors.

Figure 2: - " Weak scaling " of the code for a bulk diamond system of 2048 to 32768 atoms running on 32 to 512 processors where the localisation region for each orbital contains 45 sites. The curve of circles shows the code performance for the conjugate gradient minimisation and the curve of squares shows the performance for a full molecular dynamics run.
In figure 2 we show the results for the " weak scaling " of our code where the size of the system is increased proportionally with the number of processors The results shown are for a 2048 atom bulk diamond system on 32 processors up to a 32768 atom system on 512 processors. The localisation region is taken up to the 3rd nearest neighbour giving 45 atomic sites in each localisation region. This is a localisation region which is sufficiently large to give good convergence and very small errors (compared to a non-localised algorithm) for carbon based insulators. The results for the MD runs are given with 15 CG steps for each MD step which is representative of the amount of CG steps required for convergence in a typical run [4,7] for this type of system. As can be seen from the graph the speed up is extremely linear and very close to ideal. The full MD simulation MFLOP speed is slightly lower than the CG step due to the index lists that are constructed at each MD step. Also some of the calculations for the atomic update, run at a slightly lower MFLOP speed than the CG step. We present the results in GFLOPs rather than seconds for an MD step as the smaller systems have wraparound effects on the localisation regions which means that proportionally less time is required for the MD step than larger systems. It should be noted that the algorithm used is strictly O(N), in terms of the number of calculations per MD step, so that provided there are no wraparound effects on the localisation regions the speed up in GFLOPS translates directly into an identical increase in the number of atoms updated per second. All of these simulations spent about 14% of the time in communications and 86% doing calculations. An MD step, including 15 CG minimisation steps of the orbitals, took about 50s for the 32768 atom system on 512 processors, running at above 11 GFLOPS. The weakscaling curves for other ratios of particles to processors and different localisation regions are also essentially linear with nearly ideal speedup. Due to the localised nature of the problem the ratio of communications to calculations remains the same as we scale up the problem and the communications remain local. The GFLOP speed for the inhomogeneous system on 128 processors is ~ 8% lower than the bulk diamond system. The lower GFLOP speed is caused by two main factors the first of which is the load imbalance introduced into the problem by the inhomogeneous nature of the system and the fact the particles are moving. The second most important factor is the increase in communications due to the less spatially structured nature of the problem which makes the spatial mapping of the system to the processors less direct than the bulk diamond system. A third less important factor is the extra time required to run the loadbalancing algorithm every 20 MD steps although this is only about 1% of the program time. Without dynamic loadbalancing our test run is about twice as slow, which is due to the fact that when the fullerenes arrive at the surface the number of sites in the localisation regions of their atoms can increase by factors of 2 to 3.
These results should be compared with the vector code that runs at about 270 MFLOPS on one processor of a C90 for the bulk diamond system. The number of floating point operations per MD step is, to within a few percent, the same for the vector and serial code. It should be noted that the structure of the loops in the vector code is very different from that of the parallel code. The innermost loop in the vector code must be over sites since this is the only loop long enough to efficiently exploit the vector processing while in the parallel code we parallelise over the site loop which must therefore be the outermost loop.

Figure 3: - " Strong scaling " of the code for two bulk diamond systems of 2048 and 4096 atoms, and the inhomogeneous system shown in figure 1. All orbital localisation regions are taken up to the third nearest neighbour.
Figure 3 shows the " strong scaling " performance for the code where the size of the system is kept constant and the number of processors is increased. Results are shown for 32 to 256 processors for the 2048 and 4096 atom bulk diamond system and also the inhomogeneous system. The choices of systems are meant to be typical of the size of problem that would be run in a research environment on a system of 32 to 128 processors. The speed up is very linear and in the case of 4096 atoms of diamond there is a 7.3 speed up in going from 32 to 256 processors. The speed up is lower for the smaller 2048 atom system since when we are in the regime where the number of atoms on each processor is very small (8 atoms per processor for 256 processors), the amount of time in communications becomes more important. This is because the fewer atomic sites we have on each processor the more probable it is that the orbitals, required to perform the calculation of the overlap matrix elements etc. for these sites, will not be on that processor and will have to be communicated. Typically the communicated data will also be less local on the processors when we have a few atoms on each processor thus, the communication times will be larger. In order to reduce the communication cost of studying small systems on a large number of processors it may be more efficient to use a different data distribution where the data for each orbital is spatially distributed over the processors. However, it is not clear that the overall speed would be faster as the MFLOP speed for the calculations may decrease. This type of data distribution was implemented in reference [15] on a CM5 but the single PE performance is very much lower that what we find for our implementation. The load balancing also becomes more difficult as we cannot simply load balance over sites since the calculations associated with the MD step for each individual atom are distributed over processors. All our production runs of the code were never in the regime where there was a few sites on each processor which would only be useful for studying small systems over very long timescales.
[2] F. Mauri, G. Galli, and R. Car, Phys. Rev. B 47, 9973 (1993).
[3] F. Mauri, G. Galli, Phys. Rev . B 50, 4316 (1994).
[4] P. Ordejón, D. Drabold, M. Grunbach, and R. Martin, Phys. Rev. B 48. 14646 (1993).
[5] J. Kim, F. Mauri, G. Galli, Phys. Rev. B 52, 1640 (1995).
[6] C. Xu, C. Wang, C. Chan, and K. Ho, J. Phys. Condens. Matter 4, 6047 (1992).
[7] G. Galli and F. Mauri, Phys. Rev. Lett. (in press).
[8] L. Verlet, Phys. Rev. 159, 98 (1967).
[9] L. J. Clarke, I. Stich and M. C. Payne, Comp. Phys. Commun. 72, 14 (1992)
[10] S. Plimpton, pre-print SAND91-1144, Sandia National Laboratories, (1993).
[11] A. Canning, A. De Vita, G. Galli, F. Gygi, F. Mauri and R. Car, Procs. of the 94 Fall, Cray User Group Conf. (edited by B. and K. Winget), 18 (1995).
[12] Y. Deng, R. A. McCoy, R. B. Marr and R. F. Peierls, Procs. of the Seventh SIAM Conf., (edited by D. H. Bailey et al), 605, (1995).
[13] D. C. Rapaport, Comp. Phys. Commun. 62, 198 (1991), D. C. Rapaport, Comp. Phys. Commun. 62, 217 (1991).
[14] W. Smith, Comp. Phys. Commun. 62, 229 (1991).
[15] S. Itoh, P. Ordejón and R. Martin, University of Illinois preprint, (1995).
| ![]() REFER TO CONTENTS |
| ![]() |
|---|