
DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Bin Fu, Kai Ren, Julio López, Eugene Fink, Garth Gibson
{binf, kair, jclopez, efink, garth}@cs.cmu.edu


September 2010

Parallel Data Laboratory

Carnegie Mellon University

Pittsburgh, PA 15213-3890

Copyright © 2010: The authors and Carnegie Mellon University

The work in this paper is based on research supported in part by the Betty and Gordon Moore Foundation, by the National Science Foundation under award SCI-0430781, by the Department of Energy under award number DE-FC02-06ER25767, and by a Google research award. We also thank The McWilliams Center for Cosmology and the member companies of the PDL Consortium (including APC, Data Domain, EMC, Facebook, Google, Hewlett-Packard, Hitachi, IBM, Intel, LSI, Microsoft, NEC, NetApp, Oracle, Seagate, Sun, Symantec, and VMware) for their interest, insights, feedback, and support.

Keywords: Astrophysics, Data Analytics, Out-of-Core Algorithms, Data-Intensive Scalable Computing, DISC, eScience

Abstract

DiscFinder is a scalable, distributed, data-intensive group finder for analyzing observational and simulation astrophysics datasets. Group finding is a form of clustering used in astrophysics to identify large-scale structures such as clusters and superclusters of galaxies. DiscFinder runs on commodity compute clusters and scales to datasets with billions of particles. It is designed to operate on datasets that are much larger than the aggregate memory available in the computers where it executes. As a proof of concept, we have implemented DiscFinder as an application on top of the Hadoop framework. DiscFinder has been used to cluster the largest open-science cosmology simulation datasets, containing as many as 14.7 billion particles. We evaluate its performance and scaling properties and describe the optimizations we performed.

1 Introduction

Today, the generation of new knowledge in data-driven sciences, such as astrophysics, seismology and bioinformatics, is enabled by the processing of massive simulation-generated and sensor-collected datasets. The advance of many fields of science increasingly depends on the analysis of these ever-larger datasets. For example, high-resolution cosmological simulations and large sky surveys are essential in astrophysics for answering questions about the nature of dark energy and dark matter (DE&DM) and the formation of large-scale structures in the universe.

The analysis of these datasets becomes extremely challenging as their size grows. They become too large to fit in the aggregate memory of the computers available to process them. Enabling the scaling of science analytics to these much larger datasets requires new data-intensive approaches.

Frameworks, such as Hadoop and Dryad, are commonly used for data-intensive applications on Internet-scale datasets. These applications include text analysis, natural language processing, indexing the web graph, mining social networks and other machine learning applications. The natural question is then: Can we leverage these data-intensive frameworks for science analytics?

In this work, we want to understand the requirements for developing new algorithms that enable science analytics on these systems. In the process, we aim to understand the advantages and limitations of these frameworks and how they should be enhanced or extended to better support analytics for science.

As part of our collaboration with domain scientists, we have developed DiscFinder: a new data-intensive distributed approach for finding clusters in large particle datasets. Group finding is a technique commonly used in astrophysics for studying the distribution of mass in the universe and its relation to DE&DM. DiscFinder is complementary to other existing group finding methods and is particularly useful for processing very large datasets on commodity compute clusters, even when the available aggregate memory cannot hold the entire dataset. Although the mechanisms described here are specific to group finders for astrophysics, the principles are generally applicable to clustering problems in other fields of science.

DiscFinder is scalable and flexible. DiscFinder scales up to process datasets with tens of billions of particles. It has been used to find clusters in the largest open-science simulation datasets.

The DiscFinder design is compact and conceptually simple. DiscFinder leverages sequential group finder implementations, which it employs to cluster relatively small subsets of the input particles. The main idea is to partition the input particle dataset into regions of space, execute a sequential group finder on each partition, and finally merge the results in order to join the clusters that span partitions. The sequential finders are executed independently on relatively small subsets of the input in order to keep their running time low. The approach is implemented as a series of jobs on top of the Hadoop framework. DiscFinder relies on Hadoop for splitting the input data, managing and coordinating the execution of tasks, and handling task failures. Finally, the DiscFinder strategy is flexible in the sense that it allows multiple implementations of the sequential group finder to be used. DiscFinder distributes the execution of the sequential finder in order to scale to large datasets. This strategy reduces the overall implementation complexity and benefits from the effort invested in the development of the sequential finders.
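The merge phase of this partition-then-join strategy can be sketched in a few lines. The sketch below is illustrative only; the function name, the dict-based input format, and the replicated-boundary-particle convention are our assumptions, not the paper's Hadoop implementation. Each partition reports its local group labels, and a union-find joins local groups that share a particle seen by more than one partition.

```python
# Illustrative sketch of a DiscFinder-style merge step (not the
# actual Hadoop implementation): local groups from different
# partitions that share a replicated boundary particle are joined
# with a union-find into global groups.

def merge_partitions(labels_per_partition):
    """labels_per_partition: list of {pid: local_group_id} dicts,
    one per spatial partition. Returns {pid: global_group_tag}."""
    parent = {}

    def find(tag):
        # Path-halving union-find lookup.
        while parent.setdefault(tag, tag) != tag:
            parent[tag] = parent[parent[tag]]
            tag = parent[tag]
        return tag

    seen = {}  # pid -> one (partition, local_group_id) tag
    for part, labels in enumerate(labels_per_partition):
        for pid, gid in labels.items():
            tag = (part, gid)
            if pid in seen:
                # Same particle labeled in two partitions: its two
                # local groups are really one global group.
                parent[find(seen[pid])] = find(tag)
            else:
                seen[pid] = tag
    return {pid: find(tag) for pid, tag in seen.items()}
```

For example, if particle 2 is replicated into two partitions, its two local groups collapse into one global group: `merge_partitions([{1: 0, 2: 0}, {2: 5, 3: 5}])` assigns particles 1, 2 and 3 the same global tag.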

We are interested in determining the feasibility of the DiscFinder design and understanding its performance characteristics. In our evaluation, we first characterize the DiscFinder scalability with respect to the data size and find that it is possible to cluster very large datasets with this approach.


The main benefits of DiscFinder and future analysis applications implemented using similar approaches are their simplicity and potentially shorter development time. However, the simplicity of the implementation comes at a performance cost. We characterize the DiscFinder running time and apply a set of modifications to the implementation to improve its performance. There is clearly room for improvement both at the application and framework levels. We discuss various potential optimizations that can provide additional performance benefits. As the performance of these frameworks improves, they will be more widely used in data analytics for science.

The rest of this paper is structured as follows: Section 2 describes the motivating application;

Section 3 summarizes previous related work; the DiscFinder design is explained in Section 4; the implementation details are presented in Section 5; Section 6 presents the performance evaluation of the approach.

2 Motivating Application

Analysis of Astronomical Datasets. Domain scientists are now commonly faced with the challenge of analyzing massive amounts of data in order to conduct their research. In particular, ever-larger observation and simulation datasets abound in astrophysics. The advent of digital sky surveys, beginning with the Sloan Digital Sky Survey (SDSS), dramatically increased the scope and size of astronomical observational datasets [1]. The latest SDSS data release was over 30 TB in size. The field is moving towards large time-domain astronomy surveys, such as Pan-STARRS [29, 23] and LSST [22], which will store many time slices of data. Pan-STARRS is producing on the order of 1 TB/day of imagery. Both Pan-STARRS and LSST are projected to produce multi-TB datasets over the lifetime of the surveys. Similarly, state-of-the-art N-body cosmological simulations, such as the Bigben BHCosmo and Kraken DM simulations [9], produce multi-billion particle datasets with sizes on the order of tens of terabytes, even after aggressive temporal sub-sampling. This down-sampling is needed to cope with the bandwidth and capacity limits of the storage systems available in the supercomputers where these simulations execute.

The analysis of these datasets is essential for tackling key problems in astrophysics, such as understanding the nature of dark energy (DE) and dark matter (DM) and how DE&DM control the formation of large-scale structures in the universe [5], such as clusters and superclusters of galaxies. In order to better understand the role of DE&DM in the evolution of the universe, theoretical and observational astrophysicists analyze the aggregate properties of large-scale structures, such as the distribution of their size, volume and mass [40]. Finding the groups of particles that make up the large-scale structures is the first step in this process. Once the groups have been identified, their properties can be calculated and the groups can be decomposed into sub-halos for further analysis [33, 6, 10, 32, 19].

Group Finding. In astrophysics, group finding refers to a family of physics-based spatial clustering problems applied to both observation and simulation datasets. The input is a set of celestial objects such as stars, galaxies, gas, dark matter, etc. We will refer to these as particles, points or objects. A group finder separates the particles into groups that make up larger structures such as galaxies and clusters of galaxies. In very loose terms, we will refer to these structures as groups or clusters interchangeably. In slightly more formal terms, a group finder takes an input set of particles P = {p0, p1, p2, ..., pn−1} and produces a set of m disjoint groups G = {G0, ..., Gm−1} such that each group Gj comprises a subset of the points from P (i.e., ∀Gj ∈ G: Gj ⊂ P, and ∀Gi, Gj ∈ G with i ≠ j: Gi ∩ Gj = ∅). The criteria for determining the membership of a particle in a group may use physics-based computations that take into account particle properties such as position, mass and velocity. There are many variants of group finding approaches, both in the algorithms used and in the membership criteria applied.

Friends-of-Friends (FOF). FOF is a simple algorithm proposed by Huchra and Geller to study the properties of groups of galaxies in observation surveys [7, 21]. Its group membership criterion is based solely on the Euclidean inter-particle distance. FOF is widely used and works well in practice for many analysis scenarios. A large number of more sophisticated group finding approaches are based on FOF.

FOF is driven by two parameters: a linking length (τ) and a minimum group size (minGz). The input is the set of particles P; each particle is a tuple of the form ⟨pidi, (xi, yi, zi)⟩, where pidi is the particle identifier and xi, yi, zi are the particle's position coordinates in 3D space.

The only group membership criterion is the following: two particles pi and pj belong to the same group Gl (are friends) if the distance between them (dij) is less than τ. Any other particle pk also belongs to the group Gl if the distance between pk and any of the particles in Gl is less than τ. After applying this criterion, all the friends of pi become friends of pj and vice versa, hence the name of the approach. This procedure is illustrated in two dimensions in Figure 1. The output set G comprises the groups that contain a number of points larger than the minGz parameter. The output is represented as a list of tuples of the form ⟨pid, groupId⟩, where groupId is the group identifier.
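The FOF criterion maps directly onto a union-find over all close pairs. The sketch below is a naive O(n²) illustration under our own naming (the function `fof` and its input format are assumptions, not the paper's code); it links every pair of particles closer than τ and then, as the text specifies, keeps only groups with more than minGz members.

```python
# Naive O(n^2) Friends-of-Friends sketch (illustrative; production
# finders use a spatial index instead of all-pairs distances).
import math
from collections import defaultdict

def fof(particles, tau, min_gz):
    """particles: list of (pid, (x, y, z)) tuples.
    Returns (pid, group_id) tuples for every particle belonging to
    a group with more than min_gz members."""
    pos = dict(particles)
    pids = list(pos)
    parent = {pid: pid for pid in pids}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Link every pair of "friends": distance strictly below tau.
    for i, a in enumerate(pids):
        for b in pids[i + 1:]:
            if math.dist(pos[a], pos[b]) < tau:
                parent[find(a)] = find(b)

    # Collect connected components and filter by minimum group size.
    groups = defaultdict(list)
    for pid in pids:
        groups[find(pid)].append(pid)
    return [(pid, root)
            for root, members in groups.items() if len(members) > min_gz
            for pid in members]
```

With τ = 0.6, particles at x = 0, 0.5 and 1.0 end up in one group even though the outer two are 1.0 apart: they are linked transitively through the middle particle, mirroring the p3/p6 situation in Figure 1.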


Figure 1: Friends of Friends (FOF) clustering. This dataset contains 8 particles p1,..., p8. Shaded regions denote groups of particles. Notice that particles p3 and p6 belong to the same group although they are not close enough to each other. The reason is that they are indirectly linked through their friends, particles p4 and p5. In this example, minGz = 2, thus particles p7 and p8 do not belong to any group.

There are publicly available sequential implementations of the FOF algorithm, such as the one from the “N-body shop” at the University of Washington [37]. These implementations rely on building a spatial indexing structure, such as a kd-tree, to speed up lookups of nearby particles [25]. However, sequential solutions are too slow for datasets with billions of galaxies. There are also various parallel group finders for distributed-memory machines with low-latency networks, which are often used in simulations (see Section 3).
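The kd-tree lookup these sequential finders rely on can be sketched with SciPy (our choice for illustration; the University of Washington finder uses its own kd-tree code): `query_pairs` returns every pair of points within the linking length without comparing all pairs.

```python
# Sketch of kd-tree accelerated friend lookup using SciPy
# (illustrative; not the N-body shop implementation).
import numpy as np
from scipy.spatial import cKDTree

positions = np.array([[0.0, 0.0, 0.0],
                      [0.1, 0.0, 0.0],
                      [5.0, 5.0, 5.0]])
tau = 0.2  # linking length

tree = cKDTree(positions)
# Set of (i, j) index pairs whose separation is within tau.
friend_pairs = tree.query_pairs(r=tau)
# Only the first two particles are close enough to be friends.
```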

Group finding implementations, whether sequential or parallel, are in-core and thus require enough available memory to hold the dataset and internal data structures. This means that finding groups in and analyzing large simulation datasets requires supercomputers of capacity comparable to the ones used to generate them. For the very large datasets, these resources are not readily available. Even in the cases where there is enough aggregate memory to execute the group finding process, the end-to-end running time is dominated by the I/O required to load the input dataset and produce the results. We have developed an alternative data-intensive group finding approach named DiscFinder, which complements existing approaches. DiscFinder is useful for finding groups in datasets that need to be loaded from external storage, as is often the case in the analysis of astronomy surveys and in post-simulation analysis of synthetic datasets. DiscFinder makes it possible to find groups in datasets that are much larger than the memory of the available compute resources.

3 Related Work

Group finders used in astrophysics refer to a class of clustering approaches. The group membership criterion used in the basic FOF approach may not be appropriate for certain applications.
