
DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Bin Fu, Kai Ren, Julio López, Eugene Fink, Garth Gibson

{binf, kair, jclopez, efink, garth}@cs.cmu.edu


Parallel Data Laboratory

Carnegie Mellon University

Pittsburgh, PA 15213-3890


DiscFinder is a scalable, distributed, data-intensive group finder for analyzing observation and simulation astrophysics datasets. Group finding is a form of clustering used in astrophysics for identifying large-scale structures such as clusters and superclusters of galaxies. DiscFinder runs on commodity compute clusters and scales to large datasets with billions of particles. It is designed to operate on datasets that are much larger than the aggregate memory available in the computers where it executes. As a proof of concept, we have implemented DiscFinder as an application on top of the Hadoop framework. DiscFinder has been used to cluster the largest open-science cosmology simulation datasets, containing as many as 14.7 billion particles. We evaluate its performance and scaling properties and describe the optimizations performed.

Copyright © 2010: The authors and Carnegie Mellon University

Keywords: Astrophysics, Data Analytics, Out-of-Core Algorithms, Data-Intensive Scalable Computing, DISC, eScience

1 Introduction

Today, the generation of new knowledge in data-driven sciences, such as astrophysics, seismology and bio-informatics, is enabled by the processing of massive simulation-generated and sensor-collected datasets. The advance of many fields of science is increasingly dependent on the analysis of these much larger datasets. For example, high-resolution cosmological simulations and large sky surveys are essential in astrophysics for answering questions about the nature of dark energy and dark matter (DE&DM) and the formation of large-scale structures in the universe.

The analysis of these datasets becomes extremely challenging as their size grows. They become too large to fit in the aggregate memory of the computers available to process them. Enabling the scaling of science analytics to these much larger datasets requires new data-intensive approaches.

Frameworks, such as Hadoop and Dryad, are commonly used for data-intensive applications on Internet-scale datasets. These applications include text analysis, natural language processing, indexing the web graph, mining social networks and other machine learning applications. The natural question is then: Can we leverage these data-intensive frameworks for science analytics?

In this work, we want to understand the requirements for developing new algorithms to enable science analytics using these systems. In the process, we should understand the advantages and limitations of these frameworks and how they should be enhanced or extended to better support analytics for science.

As part of our collaboration with domain scientists, we have developed DiscFinder: a new data-intensive distributed approach for finding clusters in large particle datasets. Group finding is a technique commonly used in astrophysics for studying the distribution of mass in the universe and its relation to DE&DM. DiscFinder is complementary to existing group finding methods and is particularly useful for processing very large datasets on commodity compute clusters, even when the available aggregate memory cannot hold the entire dataset. Although the mechanisms described here are specific to group finders for astrophysics, the principles are generally applicable to clustering problems in other fields of science.

DiscFinder is scalable, conceptually simple and flexible. It scales up to process datasets with tens of billions of particles and has been used to find clusters in the largest open-science simulation datasets. The DiscFinder design is compact and conceptually simple: it employs a direct approach that leverages sequential group finder implementations to cluster relatively small subsets of the input particles. The main idea is to group and partition the input particle dataset into regions of space, then execute a sequential group finder for each partition, and finally merge the results together in order to join the clusters that span across partitions. The sequential finders are independently executed on relatively small subsets of the input in order to keep their running time low. The approach is implemented as a series of jobs on top of the Hadoop framework. DiscFinder relies on Hadoop for splitting the input data, managing and coordinating the execution of tasks, and handling task failures. Finally, the DiscFinder strategy is flexible in the sense that it allows multiple implementations of the sequential group finder to be used. DiscFinder distributes the execution of the sequential finder in order to scale to large datasets. This strategy reduces the overall implementation complexity and benefits from the effort invested in the development of the sequential finders.

The work in this paper is based on research supported in part by the Betty and Gordon Moore Foundation, by the National Science Foundation under award SCI-0430781, by the Department of Energy under award number DE-FC02-06ER25767, and by a Google research award. We also thank the McWilliams Center for Cosmology and the member companies of the PDL Consortium (including APC, Data Domain, EMC, Facebook, Google, Hewlett-Packard, Hitachi, IBM, Intel, LSI, Microsoft, NEC, NetApp, Oracle, Seagate, Sun, Symantec, and VMware) for their interest, insights, feedback, and support.
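The partition-then-merge strategy described above can be sketched in code. The following is an illustrative sketch, not DiscFinder's actual API: it assumes each partition's sequential finder has already produced local (particle id → group label) assignments, with boundary particles appearing in more than one partition, and joins local groups that share a particle using a union-find structure. The function and variable names here are hypothetical.

```python
# Sketch of the merge step: local groups that share a boundary particle
# are joined into one global group via union-find. Illustrative only.

def merge_partitions(partition_results):
    """partition_results: list of dicts mapping pid -> local group label.
    Labels must be globally unique across partitions, e.g. tuples of
    (partition_id, local_group_id). Returns pid -> global group label."""
    parent = {}

    def find(g):
        # Union-find lookup with path compression.
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # A particle seen in two partitions links its two local groups.
    seen = {}  # pid -> one local group label already recorded for that pid
    for result in partition_results:
        for pid, group in result.items():
            if pid in seen:
                union(seen[pid], group)
            else:
                seen[pid] = group

    # Resolve every particle to its final global group label.
    return {pid: find(group) for pid, group in seen.items()}
```

For example, if partition 1 assigns particle 3 to local group "B" and partition 2 assigns the same boundary particle to local group "C", the merge step unifies "B" and "C" into a single global group.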

We are interested in determining the feasibility of the DiscFinder design and understanding its performance characteristics. In our evaluation, we first characterize the DiscFinder scalability with respect to the data size and find that it is possible to cluster very large datasets with this approach.

The main benefits of DiscFinder and future analysis applications implemented using similar approaches are their simplicity and potentially shorter development time. However, the simplicity of the implementation comes at a performance cost. We characterize the DiscFinder running time and apply a set of modifications to the implementation to improve its performance. There is clearly room for improvement both at the application and framework levels. We discuss various potential optimizations that can provide additional performance benefits. As the performance of these frameworks improves, they will be more widely used in data analytics for science.

The rest of this paper is structured as follows: Section 2 describes the motivating application; Section 3 summarizes previous related work; the DiscFinder design is explained in Section 4; the implementation details are presented in Section 5; Section 6 presents the performance evaluation of the approach.

2 Motivating Application

Analysis of Astronomical Datasets. Domain scientists are now commonly faced with the challenge of analyzing massive amounts of data in order to conduct their research. In particular, ever-larger observation and simulation datasets abound in astrophysics. The advent of digital sky surveys, beginning with the Sloan Digital Sky Survey (SDSS), dramatically increased the scope and size of astronomical observational datasets [1]. The latest SDSS data release was over 30 TB in size. The field is moving towards large time-domain astronomy surveys, such as Pan-STARRS [29, 23] and LSST [22], which will store many time slices of data. Pan-STARRS is producing on the order of 1 TB/day of imagery. Both Pan-STARRS and LSST are projected to produce multi-TB datasets over the lifetime of the surveys. Similarly, state-of-the-art N-body cosmological simulations, such as the Bigben BHCosmo and Kraken DM simulations [9], produce multi-billion-particle datasets with sizes on the order of tens of terabytes, even after aggressive temporal sub-sampling. This down-sampling is needed to deal with the bandwidth and capacity limits of the storage system available in the supercomputers where these simulations execute.

The analysis of these datasets is essential for tackling key problems in astrophysics, such as understanding the nature of dark energy (DE) and dark matter (DM), and how DE&DM control the formation of large-scale structures in the universe [5], such as clusters and superclusters of galaxies. In order to better understand the role of DE&DM in the evolution of the universe, theoretical and observational astrophysicists analyze the aggregate properties of large-scale structures, such as the distribution of their size, volume and mass [40]. Finding the groups of particles that make up the large-scale structures is the first step towards carrying out this process. Once the groups have been identified, their properties can be calculated and the groups can be decomposed into sub-halos for further analysis [33, 6, 10, 32, 19].

Group Finding. In astrophysics, group finding refers to a family of physics-based spatial clustering problems applied to both observation and simulation datasets. Their input is a set of celestial objects such as stars, galaxies, gas, dark matter, etc. We will refer to those as particles, points or objects. A group finder separates the particles into groups that make up larger structures such as galaxies and clusters of galaxies. In very loose terms, we will refer to those structures as groups or clusters interchangeably. In slightly more formal terms, a group finder takes an input set of particles P = {p0, p1, p2, ..., pn−1} and produces a set of m disjoint groups G = {G0, ..., Gm−1} such that each group Gj comprises a subset of the points from P (i.e., ∀Gi ∈ G : Gi ⊂ P, and ∀Gi, Gj ∈ G, i ≠ j : Gi ∩ Gj = ∅). The criteria for determining the membership of a particle in a group may use physics-based computations that take into account particle properties such as position, mass and velocity. There are many variants of group finding approaches, both from the algorithmic point of view as well as the criteria used for the selection.

Friends-of-Friends (FOF). FOF is a simple algorithm proposed by Huchra and Geller to study the properties of groups of galaxies in observation surveys [7, 21]. Its group membership criterion is based solely on the Euclidean inter-particle distance. FOF is widely used and works in practice for many analysis scenarios. A large number of more sophisticated group finding approaches are based on FOF.

FOF is driven by two parameters: a linking length (τ) and a minimum group size (minGz). The input is the set of particles P; each particle is a tuple of the form ⟨pidi, (xi, yi, zi)⟩, where pidi is the particle identifier and xi, yi, zi are the particle's position coordinates in 3D space.

The only group membership criterion is the following: two particles pi and pj belong to the same group Gl (are friends) if the distance between them (dij) is less than τ. This procedure is illustrated in two dimensions in Figure 1. Any other particle pk also belongs to the group Gl if the distance between pk and any of the particles in Gl is less than τ. After applying this criterion, all the friends of pi become friends of pj and vice versa, hence the name of the approach. The output set G comprises the groups that contain a number of points larger than the minGz parameter. The output is represented as a list of tuples of the form ⟨pid, groupId⟩, where groupId is the group identifier.


Figure 1: Friends of Friends (FOF) clustering. This dataset contains 8 particles p1,..., p8. Shaded regions denote groups of particles. Notice that particles p3 and p6 belong to the same group although they are not close enough to each other. The reason is that they are indirectly linked through their friends, particles p4 and p5. In this example, minGz = 2, thus particles p7 and p8 do not belong to any group.
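The FOF rule described above can be sketched with a union-find structure. The following is a minimal illustrative implementation, not the "N-body shop" code: for clarity it uses a brute-force O(n²) distance check rather than the kd-tree index mentioned below, and the names tau and min_gz stand for the parameters τ and minGz.

```python
# Minimal Friends-of-Friends (FOF) sketch. Two particles closer than the
# linking length tau are "friends"; friendship is made transitive with a
# union-find structure; groups smaller than min_gz are discarded.
import math
from collections import defaultdict

def fof(particles, tau, min_gz):
    """particles: list of (pid, (x, y, z)) tuples.
    Returns a dict mapping a group id to the list of member pids."""
    n = len(particles)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    # Link every pair of particles closer than the linking length tau.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(particles[i][1], particles[j][1]) < tau:
                union(i, j)

    # Collect groups and drop those smaller than min_gz.
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(particles[i][0])
    return {gid: pids for gid, pids in groups.items() if len(pids) >= min_gz}
```

As in Figure 1, two particles end up in the same group even when they are farther apart than τ, as long as a chain of friends connects them; isolated particles fall below min_gz and are dropped.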

Sequential implementations of the FOF algorithm are available, such as the one from the "N-body shop" at the University of Washington [37]. These implementations rely on building a spatial indexing structure, such as a kd-tree, to speed up lookups of nearby particles [25]. However, sequential solutions are relatively slow at processing billions of galaxies. There are also various parallel group finders for distributed-memory machines with low-latency networks that are often used in simulations (see Section 3).

Group finding implementations, whether sequential or parallel, are in-core and thus require enough available memory to fit the dataset and internal data structures. This means that finding groups in, and analyzing, large simulation datasets requires supercomputers of capacity comparable to the ones used to generate them. For the very largest datasets, these resources are not readily available. Even in cases where there is enough aggregate memory to execute the group finding process, the end-to-end running time is dominated by the I/O required to load the input dataset and produce the results. We have developed an alternative data-intensive group finding approach named DiscFinder, which complements existing approaches. DiscFinder is useful for finding groups in datasets that need to be loaded from external storage, as is often the case in the analysis of astronomy surveys and in post-simulation analysis of synthetic datasets. DiscFinder makes it possible to find groups in datasets that are much larger than the memory of the available compute resources.

3 Related Work

Group finders used in astrophysics refer to a class of clustering approaches. The group membership criterion used in the basic FOF approach may not be appropriate for certain applications.
