
SAPPORO: A way to turn your graphics cards into a GRAPE-6

Evghenii Gaburov (*,c,a,b), Stefan Harfst (a,b), Simon Portegies Zwart (a,b,c)

(a) Astronomical Institute “Anton Pannekoek”, University of Amsterdam
(b) Section Computational Science, University of Amsterdam
(c) Leiden Observatory, University of Leiden
the Netherlands

arXiv:0902.4463v2 [astro-ph.IM] 5 Mar 2009

Abstract

We present Sapporo, a library for performing high-precision gravitational N-body simulations on NVIDIA Graphical Processing Units (GPUs). Our library mimics the GRAPE-6 library, and N-body codes currently running on GRAPE-6 can switch to Sapporo by a simple relinking of the library. The precision of our library is comparable to that of GRAPE-6, even though internally the GPU hardware is limited to single-precision arithmetic. This limitation is effectively overcome by emulating double precision for calculating the distance between particles. The performance loss of this operation is small (~20%) compared to the advantage of being able to run at high precision. We tested the library using several GRAPE-6-enabled N-body codes, in particular Starlab and phiGRAPE. We measured a peak performance of 800 Gflop/s for runs with 10^6 particles on a PC with four commercial G92-architecture GPUs (two GeForce 9800GX2 cards). As a production test, we simulated a 32k Plummer model with equal-mass stars well beyond core collapse. The simulation took 41 days, during which the mean performance was 113 Gflop/s. The GPU did not show any problems from running in a production environment for such an extended period of time.

1. Introduction

Graphical processing units (GPUs) are qui…

In this paper we introduce Sapporo[2], a library written in CUDA for running gravitational N-body simulations on NVIDIA GPUs. The Sapporo library closely matches the calling sequence of the GRAPE-6 library (Makino et al., 2003), and codes that already run on a GRAPE-6 can immediately be run on GPUs without any modifications to their source code.

Here, we describe the implementation of our library and present both accuracy and performance measurements obtained with Sapporo. For the latter, we use two direct N-body simulation environments: Starlab (Portegies Zwart et al., 2001) and phiGRAPE (Harfst et al., 2007), the latter as implemented in the Multi Scale Software Environment (MUSE, Portegies Zwart et al., 2009).

Our new implementation of the N-body force calculation in CUDA is an important step towards high-precision direct N-body simulations using GPUs, as the library is more flexible and more general than previous implementations. In addition, we identify the operations in the code where double-precision accuracy is most important and implement double-precision arithmetic in those locations. The cost of doing this is limited to a ~20% loss in performance (but overall performance is still very high).

2. Implementation

We have designed a library to calculate gravitational forces on a GPU in order to accelerate N-body simulations. The library can be used in combination with a standard fourth-order predictor-corrector Hermite integration scheme (Makino & Aarseth, 1992), either with shared or block time steps. Such a scheme consists of three essential steps: predictor, force calculation, and corrector. In the predictor step, the positions and velocities of all particles are predicted for the next time step. Then, the gravitational accelerations and their first derivatives (jerks) are calculated using the predicted positions and velocities.

Finally, the predicted positions and velocities are corrected using the newly computed accelerations and jerks. In the case of a block time step scheme, the last two steps are executed only on the block of active particles. In the following, we focus on block time steps, since the shared time step scheme can be considered a variant of the block time step scheme.

In the case of block time steps, the system is divided into i-particles and j-particles, which are the sinks and sources of the gravitational forces[3], respectively. Within the Hermite scheme, we take the following subsequent steps: we predict the positions and velocities of the i- and j-particles (predictor step); calculate the gravitational forces exerted by the j-particles on the i-particles (calculation step); and correct the positions and velocities of the i-particles (corrector step).

The actual parallelisation of the force calculation is not too difficult, and many examples of parallel N-body algorithms exist (Gualandris et al., 2007; Dorband et al., 2003). Implementing the algorithm on a GPU is a little more challenging due to the specific design of GPUs. For example, a G80/G92 NVIDIA GPU consists of 16 multiprocessors (MPs), each able to execute up to 768 threads in parallel. The parallel SIMD (single instruction, multiple data) element on such a GPU is called a warp. A warp is a set of 32 threads which execute the same instruction but operate on different data, and up to 24 warps can be executed in parallel on each MP. In total, a single G80/G92 GPU is able to execute up to 12288 threads in parallel, which means that a program running on the GPU, called a kernel, should be able to efficiently exploit such a high degree of parallelism.
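(These figures combine as 32 threads per warp × 24 warps per MP = 768 threads per MP, and 16 MPs × 768 threads = 12,288 threads for the whole GPU.)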





In addition, the GPU design imposes a strict memory access pattern on such a kernel. For example, a GPU has two types of memory: global memory, which is the equivalent of RAM on the CPU, and shared memory, which is the equivalent of the L1 cache on the CPU. The global memory is further subdivided into local memory, which is allocated on a per-thread basis, and cached read-only texture and constant memory. The major difference between shared and global memory is the access latency.

Access to an element of shared memory is as fast as access to a register, whereas access to an element in global memory has a latency of 400-600 cycles. Since the data initially reside in global memory, the parallel threads should cooperatively load the data into shared memory in a specific pattern in order to reduce the access latency. The threads can then efficiently operate on the data in shared memory.
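As an illustration of this pattern, consider the minimal CUDA sketch below (ours, not taken from the Sapporo source; the kernel name and the trivial per-element work are placeholders): each thread stages one element from global into shared memory, the block synchronises, and every thread then reuses the whole staged tile.

    #define NTHREADS 256                       // threads per block = tile size

    __global__ void staged_sum(const float4 *g_in, float *g_out, int n)
    {
        __shared__ float4 s_tile[NTHREADS];    // on-chip shared memory (register-speed)
        float acc = 0.0f;

        // walk through the data in tiles of NTHREADS elements
        for (int tile = 0; tile < n; tile += NTHREADS) {
            int j = tile + threadIdx.x;
            // 1) each thread loads one element from high-latency global memory
            s_tile[threadIdx.x] = (j < n) ? g_in[j] : make_float4(0.f, 0.f, 0.f, 0.f);
            __syncthreads();                   // wait until the whole tile is resident

            // 2) every thread then reuses all NTHREADS staged elements
            for (int k = 0; k < NTHREADS; k++)
                acc += s_tile[k].w;            // placeholder work on the shared copy
            __syncthreads();                   // tile may be overwritten only after use
        }
        g_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // one result per thread
    }

Each element is thus fetched from global memory once per block but consumed NTHREADS times, which is what hides the 400-600 cycle latency.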

[2] The Sapporo library is free and can be downloaded from http://modesta.science.uva.nl.

[3] In the following, by gravitational forces we mean gravitational accelerations, potentials, and jerks.


2.1. Task decomposition between the CPU and the GPU

First, we estimate the complexity of each of the steps in the Hermite scheme. We assume that the number of j-particles is equal to n and the number of i-particles is equal to m. The time complexities of the predictor, the calculation, and the corrector steps are O(n), O(nm), and O(m), respectively.

It is important to note that the n^2 scaling of the force calculation has been reduced to nm by the introduction of block time steps. As a result, the time contribution of the calculation and corrector steps decreases with block size, whereas that of the prediction step remains constant. This is important, as block sizes are typically small on average in direct N-body simulations.

Motivated by this, we implemented both the prediction of the j-particles and the force calculation on the GPU. The prediction of the i-particles and the corrector step are carried out on the CPU (Fig. 1); in fact, GRAPE-enabled N-body codes use exactly the same decomposition. Another motivation for this decomposition is the communication overhead: if the predictor step were carried out on the CPU, all n particles would have to be communicated to the GPU at every time step. In that case, communication would become a bottleneck for the calculation. In our implementation, only the i-particles need to be communicated, which decreases the communication overhead.
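For illustration (numbers ours, not measurements from this paper): with n = 10^6 j-particles and of order 10^2 bytes of predicted data per particle, predicting on the CPU would mean shipping on the order of 100 MB to the GPU at every time step, whereas a typical active block of m ~ 10^3 i-particles amounts to ~0.1 MB, so the O(nm) force calculation rather than the transfer remains the dominant cost.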

2.2. Predictor step

In this step, the GPU's task is to predict the positions and velocities of the j-particles. It is a rather trivial step to parallelise, since a particle does not require information from other particles. Hence, we assign a particle to each of the parallel threads on the GPU. Each thread reads, for its particle, the position (x), velocity (v), acceleration (a), jerk (j), and time step (dt) from global memory into registers. These data are required to execute the predictor step:

x_p = x_0 + v_0 dt + (1/2) a_0 dt^2 + (1/6) j_0 dt^3,    (1)

v_p = v_0 + a_0 dt + (1/2) j_0 dt^2.    (2)
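In CUDA this maps naturally onto one thread per j-particle. Below is a minimal sketch of such a predictor kernel (ours, for illustration; it is not the Sapporo source, and the double-single treatment of the positions described in the following paragraphs is omitted):

    // Minimal predictor-kernel sketch: one thread per j-particle (illustration only;
    // Sapporo keeps the positions in double-single form, which is omitted here).
    __global__ void predict_j(const float3 *x0, const float3 *v0,
                              const float3 *a0, const float3 *j0,
                              const float  *dt,              // time since last correction
                              float3 *x_p, float3 *v_p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float h  = dt[i];
        float h2 = 0.5f * h * h;              // dt^2 / 2
        float h3 = h2 * h * (1.0f / 3.0f);    // dt^3 / 6

        // Eq. (1): predicted position
        x_p[i].x = x0[i].x + h * v0[i].x + h2 * a0[i].x + h3 * j0[i].x;
        x_p[i].y = x0[i].y + h * v0[i].y + h2 * a0[i].y + h3 * j0[i].y;
        x_p[i].z = x0[i].z + h * v0[i].z + h2 * a0[i].z + h3 * j0[i].z;

        // Eq. (2): predicted velocity
        v_p[i].x = v0[i].x + h * a0[i].x + h2 * j0[i].x;
        v_p[i].y = v0[i].y + h * a0[i].y + h2 * j0[i].y;
        v_p[i].z = v0[i].z + h * a0[i].z + h2 * j0[i].z;
    }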

Figure 2: An illustration of the decomposition between multiprocessors (MPs) on the GPU. The n j-particles are equally distributed among the P MPs, where P is the number of MPs. An identical copy of all m i-particles is distributed to each of the MPs. As a result, every MP computes the gravitational forces from n/P bodies on the same set of i-particles in parallel.

To emulate double precision (DP), a number is represented by two single-precision (SP) numbers: one stores the most significant digits, the other stores the least significant digits. Such a representation is known as double-single (DS)[4]. In this representation the number of significant digits is 14, compared to 16 in DP[5], but it is nevertheless a factor of two larger than in SP.
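(In terms of mantissa bits, cf. footnote [5]: DS carries 2 × 24 = 48 bits, i.e. 48 log10(2) ≈ 14.4 decimal digits, versus 53 bits ≈ 16 digits for DP and 24 bits ≈ 7 digits for SP.)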

In the following, we will use DS to emulate DP where it is required. In Eq. 1, this only needs to be done for the positions x_0 and x_p. All multiplications and divisions are carried out and have their results stored in SP. These SP numbers are then added together, and the resulting number is added to the DS number[6] x_0. The result is again stored in DS as x_p. Therefore, the floating point operation (FLOP) count for Eq. 1 is 1 DS + 10 SP operations, or 11 FLOP per particle; Eq. 2 requires only 6 FLOP per particle.
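For reference, the standard way to add an SP value to a DS value (a two-float error-free sum in the style of the DSFUN routines of footnote [4]; this is our sketch of the technique, not necessarily the exact Sapporo code) looks as follows in CUDA:

    // Double-single value: 'hi' carries the most significant digits, 'lo' the
    // least significant ones; the represented value is hi + lo.
    struct ds { float hi, lo; };

    // Add a single-precision value b to a DS value a. This is the classic
    // two-float sum (DSFUN style); it costs 10 SP operations, cf. footnote [6].
    __host__ __device__ inline ds ds_add_sp(ds a, float b)
    {
        float t1 = a.hi + b;                               // leading-order sum
        float e  = t1 - a.hi;                              // recover the part lost to rounding ...
        float t2 = ((b - e) + (a.hi - (t1 - e))) + a.lo;   // ... and fold it into the tail

        ds c;                                              // renormalise so that |c.lo| << |c.hi|
        c.hi = t1 + t2;
        c.lo = t2 - (c.hi - t1);
        return c;
    }

The correctness of this trick relies on the IEEE single-precision additions being performed in exactly this order, so the compiler must not be allowed to re-associate or otherwise "simplify" these expressions.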

2.3. The calculation step

We use the predicted positions and velocities of the j-particles to calculate the gravitational forces that the j-particles exert on the i-particles. We schematically depict the parallelisation strategy for this problem in Fig. 2. As before, we assume that the total number of j-particles is equal to n and the total number of i-particles is equal to m.

[4] http://crd.lbl.gov/~dhbailey/mpdist/

[5] The mantissa of a DP number consists of 53 bits, compared to 48 bits in DS, because an SP number has a mantissa of 24 bits.

[6] Adding an SP number to a DS number requires 10 SP floating point operations (FLOP).

Figure 3: Illustration of the decomposition on one multiprocessor. 1) Each thread reads in parallel one of the n/P particles into shared memory; the total number of shared particles is therefore equal to m. 2) These shared particles are processed sequentially. Steps 1) and 2) are repeated until all of the n/P particles have been processed.

Figure 4: Illustration of the decomposition on a multiprocessor in the case where the number of i-particles is equal to half the number of parallel threads that can be executed simultaneously. In this case, every particle is processed by two threads in order to guarantee that all threads are active, while every thread processes half of the j-particles that are associated with this MP.

We split the problem into P parallel blocks, where P is the number of parallel multiprocessors (MPs) on the GPU. The j-particles are distributed evenly among the P MPs. Each of the MPs then computes the partial gravitational forces, exerted by the n/P j-particles assigned to that MP, on all the i-particles in parallel. This is accomplished by assigning each of the m parallel threads one of the i-particles in the block. Here, we assume that the number of i-particles, m, is smaller than or equal to the maximal number of parallel threads that a block is able to execute. Should m be larger, we split the i-particles into segments such that each segment contains the maximal number of i-particles that can be processed in parallel; the segments themselves are processed serially.

The P parallel blocks have no means of communicating with each other, whereas the threads within a multiprocessor are able to exchange information via shared memory, which can be considered the equivalent of an on-chip cache. Therefore, once all the i-particles have been processed, the partial accelerations from each of the blocks must be accumulated.

The parallelisation method that we use within each block is illustrated in Fig. 3. Each thread in a block loads one particle from global memory into shared memory, and therefore the total number of particles stored in shared memory (shared particles) equals the number of threads in a block.

Afterwards, each thread sequentially calculates and sums the partial gravitational forces from the shared particles. Once completed, this loop is repeated until all of the n/P particles in the block have been processed.
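A minimal CUDA sketch of this per-block loop is given below (our illustration of the scheme of Fig. 3, not the Sapporo kernel itself; only the softened acceleration is accumulated, the potential and jerk are omitted, and the double-single positions are replaced by plain floats for brevity):

    #define NTHREADS 256   // threads per block = number of shared j-particles per tile

    // Partial forces from the n/P j-particles owned by block blockIdx.x on the
    // i-particles; one thread per i-particle. Assumes nj_per_block is a multiple
    // of NTHREADS and eps2 > 0 (which also removes the self-interaction singularity).
    __global__ void partial_forces(const float4 *xj,       // predicted j-positions, mass in .w
                                   const float4 *xi,       // predicted i-positions
                                   float4 *acc_part,       // P x m partial accelerations
                                   int nj_per_block, int m, float eps2)
    {
        __shared__ float4 s_j[NTHREADS];

        int i = threadIdx.x;                                // "my" i-particle
        float4 xi_i = (i < m) ? xi[i] : make_float4(0.f, 0.f, 0.f, 0.f);
        float3 acc  = make_float3(0.0f, 0.0f, 0.0f);

        int j0 = blockIdx.x * nj_per_block;                 // j-particles owned by this block

        for (int tile = 0; tile < nj_per_block; tile += NTHREADS) {
            // 1) cooperatively stage NTHREADS j-particles into shared memory
            //    (threads with i >= m still help with the staging)
            s_j[threadIdx.x] = xj[j0 + tile + threadIdx.x];
            __syncthreads();

            // 2) each thread accumulates the force of the staged tile on its i-particle
            for (int k = 0; k < NTHREADS; k++) {
                float4 p  = s_j[k];
                float dx  = p.x - xi_i.x;
                float dy  = p.y - xi_i.y;
                float dz  = p.z - xi_i.z;
                float r2    = dx * dx + dy * dy + dz * dz + eps2;   // softened distance^2
                float rinv  = rsqrtf(r2);
                float mr3   = p.w * rinv * rinv * rinv;             // m_j / r^3
                acc.x += mr3 * dx;
                acc.y += mr3 * dy;
                acc.z += mr3 * dz;
            }
            __syncthreads();                                // tile is reused only after everyone is done
        }

        // partial result of this block; the P partial forces are summed afterwards
        if (i < m)
            acc_part[blockIdx.x * m + i] = make_float4(acc.x, acc.y, acc.z, 0.0f);
    }

In the library itself the position differences are computed from the double-single positions described above, which is precisely where the emulated precision matters; the sketch uses plain floats only to stay short.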

If the number of i-particles is smaller than the maximal number of threads that a block can execute, it is desirable to parallelise the algorithm further; we show this schematically in Fig. 4. Here, we assume that the number of i-particles equals half the number of threads that can be executed in parallel. Thus, we split the n/P j-particles into two equal parts, such that each thread processes n/(2P) particles. In this way, the load per thread decreases by a factor of two, but since the total number of i-particles that are processed concurrently doubles, the performance is unaffected. If we were to refrain from this parallelisation step, the performance would suffer, since there would not be enough i-particles to fully occupy the multiprocessor and half of the parallel threads would remain idle.

The final step is to sum the partial forces computed by each of the P parallel blocks. A simple approach, in which all the data are first copied from GPU memory to the host and then reduced on the CPU, is sufficiently efficient (Nitadori, 2009). Here, we have instead chosen to reduce the partial forces directly in GPU memory and only copy back the final result.
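A minimal sketch of such an in-device reduction (ours, for illustration; it assumes the P x m layout of the partial results used in the sketch above) is:

    // Sum the P per-block partial accelerations of every i-particle directly in
    // GPU memory; only the m reduced values are then copied back to the host.
    __global__ void reduce_partials(const float4 *acc_part,   // P x m partial results
                                    float4 *acc,              // m final accelerations
                                    int P, int m)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        // one thread per i-particle
        if (i >= m) return;

        float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        for (int b = 0; b < P; b++) {                         // P is small: the number of MPs
            float4 p = acc_part[b * m + i];
            sum.x += p.x;  sum.y += p.y;  sum.z += p.z;
        }
        acc[i] = sum;
    }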


