# «Abstract We present Sapporo, a library for performing high-precision gravitational N -body simulations on NVIDIA arXiv:0902.4463v2 [astro-ph.IM] 5 ...»

SAPPORO: A way to turn your graphics cards into a GRAPE-6

Evghenii Gaburov∗,c,a,b, Stefan Harfsta,b, Simon Portegies Zwarta,b,c

a Astronomical Institute “Anton Pannekoek”, University of Amsterdam

b Section Computational Science, University of Amsterdam

c Leiden Observatory, University of Leiden

the Netherlands

Abstract

We present Sapporo, a library for performing high-precision gravitational N -body simulations on NVIDIA

arXiv:0902.4463v2 [astro-ph.IM] 5 Mar 2009

Graphical Processing Units (GPUs). Our library mimics the GRAPE-6 library, and N -body codes cur- rently running on GRAPE-6 can switch to Sapporo by a simple relinking of the library. The precision of our library is comparable to that of GRAPE-6, even though internally the GPU hardware is limited to single precision arithmetics. This limitation is eﬀectively overcome by emulating double precision for calculating the distance between particles. The performance loss of this operation is small ( ∼ 20%) compared to the advantage of being able to run at high precision. We tested the library using several GRAPE-6-enabled N-body codes, in particular with Starlab and phiGRAPE. We measured peak per- formance of 800 Gﬂop/s for running with 106 particles on a PC with four commercial G92 architecture GPUs (two GeForce 9800GX2). As a production test, we simulated a 32k Plummer model with equal mass stars well beyond core collapse. The simulation took 41 days, during which the mean performance was 113 Gﬂop/s. The GPU did not show any problems from running in a production environment for such an extended period of time.

1. Introduction Graphical processing units (GPUs) are qui

Preprint submitted to Elsevier March 5, 2009 In this paper we introduce Sapporo2, a library, which is written in CUDA, for running gravitational N -body simulations on NVIDIA GPUs. The Sapporo library closely matches the calling sequence of the GRAPE-6 library (Makino et al., 2003), and codes which are already running on a GRAPE-6 can be immediately be run on GPUs without any modiﬁcations to their source code.

Here, we describe the implementation of our library, and present both accuracy and performance measurements using Sapporo. For the latter, we use two direct N -body simulations environment, one being Starlab (Portegies Zwart et al., 2001) and the other is phiGRAPE (Harfst et al., 2007) as it is implemented in the Multi Scale Software Environment (MUSE, Portegies Zwart et al., 2009).

Our new implementation of the N -body force calculation in CUDA is an important step towards high-precision direct N -body simulations using GPUs, as the library is more ﬂexible and more general than previous implementations. In addition we identify the operations in the code where double precision accuracy is most important and implement double precision arithmetic in those locations. The cost for doing this is limited to a ∼ 20% loss in performance (but overall performance is still very high).

**2. Implementation**

We have designed a library to calculate gravitational forces on a GPU in order to accelerate N body simulations. The library can be used in combination with a standard 4th -order predictor-corrector Hermite integration scheme (Makino & Aarseth 1992), either with shared or block time steps. Such a scheme consists of three essential steps: predictor, force calculation, and corrector. In the predictor step, the positions and velocities of all particles are predicted for the next time step. Then, the gravitational accelerations and their ﬁrst derivatives (jerks) are calculated using the predicted positions and velocities.

Finally, the predicted positions and velocities are corrected using the newly computed accelerations and jerks. In the case of a block time step scheme, the last two steps are only executed on a block of active particles. In the following, we will focus on block time steps, since shared time step scheme can be considered a variant of the block time step scheme.

In the case of the block time steps, the system is divided into i-particles and j-particles, which are the sinks and sources of the gravitational forces3, respectively. Within the Hermite scheme, we take the following subsequent steps: we predict the positions and velocities of the i- and j-particles (predictor step); calculate the gravitational forces exerted by the j-particles on the i-particles (calculation step);

correct the positions and velocities of the i-particles (corrector step).

The actual parallelisation of the force calculation is not too diﬃcult and many examples for parallel N -body algorithms exist (Gualandris et al., 2007; Dorband et al., 2003). Implementing the algorithm on a GPU is a little more challenging due to the speciﬁc design of GPUs. For example, a G80/G92 NVIDIA GPU consist of 16 multi-processors (MPs) each being able to execute up to 768 threads in parallel. A parallel, SIMD (single instruction multiple data), element on such a GPU is called a warp. A warp is a set of 32 threads which execute the same instruction but operate on diﬀerent data, and up to 24 warps can be executed in parallel on each MP. In total, a single G80/G92 GPU is able to execute up to 12288 threads in parallel, which means that a program which runs on a GPU, called a kernel, should be able to eﬃciently exploit such a high degree of parallelism.

In addition, the GPU design impose a strict memory access pattern on such a kernel. For example, a GPU has two types of memory: global memory, which is an equivalent of RAM on the CPU, and the shared memory, which is equivalent to L1-cache on the CPU. The global memory is further subdivided into local memory which is allocated on a per thread basis, and both texture and constant cached readonly memory. The major diﬀerence between the shared and the global memory is the access latency.

Access to an element of the shared memory is as fast as access to a register, whereas access to an element in the global memory has a latency of 400-600 cycles. Since the data initially reside in the global memory, each of the parallel thread should cooperatively load the data to the shared memory in a speciﬁc pattern in order to reduce the access latency. The threads can then eﬃciently operate on the data in the shared memory.

2 The Sapporo library is free and can be downloaded from http://modesta.science.uva.nl.

3 In the following, by gravitational forces we imply gravitational accelerations, potentials and jerks.

2.1. Task decomposition between the CPU and the GPU At ﬁrst, we estimate the complexity of each of the steps in the Hermite scheme. We assume that the number of j-particles is equal to n, and the number of i-particles is equal to m. The time complexities of the predictor, the calculation, and the corrector steps are O(n), O(nm), and O(m), respectively.

It is important to note, that the n2 -scaling of the force calculation has been reduced to nm by the introduction of block time steps. As a result, the time contribution of the calculation and of the corrector steps decreases with block size, whereas that of the prediction step remains constant. This is important as block sizes are typically small on average in direct N -body simulations.

Motivated by this, we implemented both the prediction of j-particles and the force calculation on the GPU. The prediction of i-particles and the corrector step are carried out on the CPU (Fig. 1); in fact, GRAPE-enabled N -body codes have exactly the same decomposition. Another motivation for this decomposition is the communication overhead: if the predictor step would be carried out on the CPU, all n particles would have to be communicated to the GPU at every time-step. In this case, communication would become a bottle-neck for the calculation. In our chosen implementation, only i-particles need to be communicated, and this therefore decreases the communication overhead.

2.2. Predictor step In this step, the GPU’s task is to predict positions and velocities of the j-particles. It is a rather trivial step to parallelise since a particle does not require information from other particles. Hence, we assign a particle to each of the parallel threads on the GPU. Each thread reads for its particle position (x), velocity (v), acceleration (a), jerk (j), and time step (dt) from global memory into registers. This

**data is required to execute the predictor step:**

3 Figure 2: An illustration of the decomposition between multiprocessors (MP) on the GPU. The n j-particles are equally distributed between all P MPs, where P is the number of MPs. An identical copy of all m i-particles is distributed to each of the MPs. As a result, every MP computes gravitational forces from n/P bodies on the same set of i-particles in parallel.

digits, the other stores the least signiﬁcant digits. Such a representation of DP is known as a doublesingle (DS)4. In this representation, the number of signiﬁcant digits is 14 compared to 16 in DP5, but nevertheless it is a factor of two larger than in SP.

In the following, we will use DS to emulate DP where it is required. In Eq. 1, this only needs to be done for the positions x0 and xP. All multiplications and divisions are carried out and have their results stored in SP. These SP numbers are then added together and the resulting number is added to the DS number6 x0. The result is stored again in DS as xP. Therefore, the ﬂoating point operation (FLOP) count for Eq. 1 is 1 DS + 10 SP operations or 11 FLOP per particle; the Eq. 2 requires only 6 FLOP per particle.

2.3. The calculation step We use predicted positions and velocities of the j-particles to calculate gravitational forces that the j-particles exert on the i-particles. We schematically depict the parallelisation strategy for such problem in Fig. 2. As before, we assume that the total number of the j-particles is equal to n and the total number of the i-particles is equal to m.

We split the problem in P parallel blocks, where P is the number of parallel multiprocessors (MP) on the GPU. The j-particles are distributed evenly among the P MPs. Each of the MPs then computes the 4 http://crd.lbl.gov/∼dhbailey/mpdist/ 5 The mantissa of DP consists of 53 bits, compared to 48 bits in DS because a SP number has a mantissa of 24 bits.

6 Addition of DS with SP operation requires 10 SP ﬂoating point operations (FLOP) 4 Figure 3: Illustration of the decomposition on one multiprocessor. 1) Each thread reads in parallel one of the n/P particles to shared memory; the total number of shared-particles is therefore equals to m. 2) These shared particles are processed sequentially. Steps 1) and 2) are repeated until all the n/P particles have been processed.

Figure 4: Illustration of the decomposition on a multiprocessor in the case when the number of i-particles is equal to half the number of parallel threads that can be executed simultaneously. In this case, every particle is processed by two threads in order to guarantee that all threads are active. However, every thread processes half of the j-particles that are associated with this MP.

partial gravitational forces, exerted by the n/P j-particles assigned to that MPs, on all the i-particles in parallel. This is accomplished by assigning each of the m parallel threads one of the i-particles in the block. Here, we assume that the number of i-particles m, is smaller than or equal to the maximal number of parallel threads that a block is able to execute. Should m be larger, we split the i-particles in segments such that each of the segments contains the maximal number of i-particles that can be processed in parallel; the segments themselves are processed in a serial manner.

The P parallel blocks have no means of communication with each other, whereas the threads within a multiprocessor are able to exchange information via shared memory, which can be considered an equivalent of an on-chip cache memory. Therefore, once all the i-particles are processed, partial accelerations from each of the blocks must be accumulated.

The parallelisation method that we use in each block is illustrated in Fig. 3. Each thread in a block loads one particle from the global memory to the shared memory, and therefore the total number of particles stored in shared memory (shared particles) equals to the number of threads in a block.

Afterwards, each thread sequentially calculates and sums the partial gravitational forces from the shared particles. Once completed, this loop is repeated until all of n/P -particles in the block have been processed.

If the number of the i-particles is smaller than the maximal number of threads that a block can execute, it is desirable to further parallelise the algorithm; we show this schematically in Fig. 4. Here, we assume that the number of the i-particles equals to half the number of threads that can be executed in parallel. Thus, we split the n/P j-particles in two equal parts, such that each thread processes n/(2P ) particles. In this way, the load per thread decreases by a factor of two, but since the total number of the i-particles that are processed concurrently doubles, the performance is unaﬀected. If we were to refrain from this parallelisation step, the performance would suﬀer since there would not be enough i-particles to fully occupy the multiprocessor, and half of the parallel threads would remain idle.

The ﬁnal step is to sum up the partial forces computed by each of the P parallel blocks. A simple approach, in which all the data from the GPU memory is ﬁrst copied to the host and then reduced using the CPU, is suﬃciently eﬃcient (Nitadori, 2009). Here, we have chosen to reduce the partial forces directly in the GPU memory and only copy back the ﬁnal result.