Astro2010 State of the Profession Position Paper (March 2009)
Astroinformatics: A 21st Century Approach to Astronomy
Primary Author: Kirk D. Borne, Dept. of Computational and Data Sciences, 4400 University
Drive MS 6A2, George Mason University, Fairfax, VA 22030 USA (firstname.lastname@example.org).
Data volumes from multiple sky surveys have grown from gigabytes into terabytes during the
past decade, and will grow from terabytes into tens (or hundreds) of petabytes in the next decade.
This exponential growth of new data both enables and challenges effective astronomical research, requiring new approaches. Thus far, astronomy has tended to address these challenges in an informal and ad hoc manner, with the necessary special expertise being assigned to e-Science or survey science. However, we see an even wider scope and therefore promote a broader vision of this data-driven revolution in astronomical research. For astronomy to cope effectively with, and reap the maximum scientific return from, existing and future large sky surveys, facilities, and data-producing projects, we need our own information science specialists.
We therefore recommend the formal creation, recognition, and support of a major new discipline, which we call Astroinformatics. Astroinformatics comprises a set of naturally related specialties, including data organization, data description, astronomical classification taxonomies, astronomical concept ontologies, data mining, machine learning, visualization, and astrostatistics. By virtue of its new stature, we propose that astronomy now needs to integrate Astroinformatics as a formal sub-discipline within agency funding plans, university departments, research programs, graduate training, and undergraduate education. Now is the time for the recognition of Astroinformatics as an essential methodology of astronomical research. The future of astronomy depends on it.
Preamble
New modes of discovery are enabled by the growth of data and computational resources in the sciences. This cyberinfrastructure includes databases, virtual observatories (distributed data), high-performance computing (clusters and petascale machines), distributed computing (the Grid, the Cloud, and peer-to-peer networks), intelligent search and discovery tools, and innovative visualization environments. Data volumes from multiple sky surveys have grown from gigabytes into terabytes during the past decade, and will grow from terabytes into tens (or hundreds) of petabytes in the next decade. This plethora of new data both enables and challenges effective astronomical research, requiring new approaches. Thus far, astronomy has tended to address these challenges in an informal and ad hoc manner, with the necessary special expertise being assigned to e-Science or survey science. However, we see an even wider scope and therefore promote a broader vision of this data-driven revolution in astronomical research. The solutions to many of the problems posed by massive astronomical databases exist within disciplines that are far removed from astronomy, whose practitioners don’t normally interface with astronomy. For astronomy to cope effectively with, and reap the maximum scientific return from, existing and future large sky surveys, facilities, and data-producing projects, we need our own information science specialists. We therefore recommend the formal creation, recognition, and support of a major new discipline, which we call Astroinformatics. Astroinformatics comprises a set of naturally related specialties, including data organization, data description, astronomical classification taxonomies, astronomical concept ontologies, data mining, visualization, and statistics.
By virtue of its new stature, we propose that astronomy now needs to integrate Astroinformatics as a formal sub-discipline within agency funding plans, university departments, research programs, graduate training, and undergraduate education. Now is the time for the recognition of Astroinformatics as an essential methodology of astronomical research. The future of astronomy depends on it.
The Revolution in Astronomy and Other Sciences
The development of models to describe and understand scientific phenomena has historically proceeded at a pace driven by new data. The more we know, the more we are driven to enhance or to change our models, thereby advancing scientific understanding. This data-driven modeling and discovery linkage has entered a new paradigm, as illustrated in the accompanying graphic. The emerging confluence of new technologies and approaches to science has produced a new Data-Sensor-Computing-Model synergism. This has been driven by numerous developments, including the information explosion, the development of dynamic intelligent sensor networks [http://www.thinkingtelescopes.lanl.gov/], the acceleration in high performance computing (HPC) power, and advances in algorithms, models, and theories. Among these, the most extreme is the growth in new data. The acquisition of data in all scientific disciplines is rapidly accelerating and causing a nearly insurmountable data avalanche. Computing power doubles every 18 months (Moore’s Law), corresponding to a factor of 100 in ten years. The I/O bandwidth (into and out of our systems, including data systems) increases by 10% each year, which is roughly a factor of 3 in ten years. By comparison, data volumes appear to double every year (a factor of 1,000 in ten years).
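The ten-year factors quoted above follow from simple compounding, and a short calculation makes the gap between them concrete. The following is a minimal sketch; the function names are illustrative and not taken from any cited source, and the rates are the ones stated in the text.

```python
def doubling_growth(doubling_time_years: float, span_years: float) -> float:
    """Growth factor for a quantity that doubles every `doubling_time_years`."""
    return 2.0 ** (span_years / doubling_time_years)


def compound_growth(annual_rate: float, span_years: float) -> float:
    """Growth factor for a quantity that grows by `annual_rate` each year."""
    return (1.0 + annual_rate) ** span_years


SPAN_YEARS = 10

computing = doubling_growth(1.5, SPAN_YEARS)   # doubles every 18 months
bandwidth = compound_growth(0.10, SPAN_YEARS)  # grows 10% per year
data = doubling_growth(1.0, SPAN_YEARS)        # doubles every year

print(f"computing power: x{computing:.0f}")
print(f"I/O bandwidth:   x{bandwidth:.1f}")
print(f"data volume:     x{data:.0f}")
print(f"data outgrows computing by x{data / computing:.0f}")
```

Under these rates the factors come out to about 102, 2.6, and 1024, so over a decade data volume outpaces computing power by roughly a factor of 10 and I/O bandwidth by several hundred.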
Consequently, as growth in data volume accelerates, especially in the natural sciences (where funding certainly does not grow commensurate with data volumes), we will fall further and further behind in our ability to access, analyze, assimilate, and assemble knowledge from our data collections, unless we develop and apply increasingly powerful algorithms, methodologies, and approaches. This requires a new generation of scientists and technologists trained in the discipline of data science.
In astronomy in particular, rapid advances in three technology areas (telescopes, detectors, and computation) have continued unabated, all leading to more data. With this accelerating advance in data generation capabilities over the coming years, we will require an increasingly skilled workforce in the areas of computational and data sciences in order to confront these challenges. Such skills are more critical than ever since modern science, which has always been data-driven, will become even more data-intensive in the coming decade [6, 7]. Increasingly sophisticated computational and data science approaches will be required to discover the wealth of new scientific knowledge hidden within these new massive scientific data collections [8, 9].
The growth of data volumes in nearly all scientific disciplines, business sectors, and federal agencies is reaching historic proportions. It has been said that “while data doubles every year, useful information seems to be decreasing”, and “there is a growing gap between the generation of data and our understanding of it”. In an information society with an increasingly knowledge-based economy, it is imperative that the workforce of today and especially tomorrow be equipped to understand data and to apply methods for effective data usage. Required understandings include knowing how to access, retrieve, interpret, analyze, mine, and integrate data from disparate sources. In the sciences, the scale of data-capturing capabilities grows at least as fast as the underlying microprocessor-based measurement system. For example, in astronomy, the fast growth in CCD detector size and sensitivity has seen the average dataset size of a typical large astronomy sky survey project grow from hundreds of gigabytes 10 years ago (e.g., the MACHO survey), to tens of terabytes today (e.g., 2MASS and Sloan Digital Sky Survey), up to a projected size of tens of petabytes 10 years from now (e.g., LSST, the Large Synoptic Survey Telescope). In survey astronomy, LSST will produce one 56Kx56K (3-Gigapixel) image of the sky every 20 seconds, generating nearly 30 TB of data daily for 10 years. In solar physics, NASA announced in 2008 a science data center specifically for the Solar Dynamics Observatory, which will obtain one 4Kx4K image every 10 seconds, generating one TB of data per day. NASA recognizes that previous approaches to scientific data management and analysis will simply not work. We see the data flood in all sciences (e.g., numerical simulations, high-energy physics, bioinformatics, drug discovery, medical research, geosciences, climate monitoring and modeling) and outside of the sciences (e.g., banking, healthcare, homeland security, retail marketing, e-mail).
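To give a feel for how image size and cadence translate into daily volume, the sketch below estimates the raw pixel stream for the LSST figures quoted above. Everything beyond the image dimensions and the 20-second cadence is an assumption made for illustration: "K" is taken as 1000, pixels are taken as 2 bytes each, and images are assumed to arrive around the clock. The text does not state how its daily totals were derived, so the estimate is an order-of-magnitude check rather than a reconstruction of the quoted numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pixels_per_image(side_k: int, k: int = 1000) -> int:
    """Pixel count of a square image whose side is `side_k` thousand pixels.

    Treating "K" as 1000 is an assumption; the text does not define it.
    """
    side = side_k * k
    return side * side


def images_per_day(cadence_seconds: float) -> float:
    """Images taken in 24 hours at one image per `cadence_seconds`
    (assumes uninterrupted, round-the-clock imaging)."""
    return SECONDS_PER_DAY / cadence_seconds


def raw_tb_per_day(side_k: int, cadence_seconds: float,
                   bytes_per_pixel: int = 2) -> float:
    """Raw pixel volume per day in terabytes (1 TB = 1e12 bytes).

    The 2-byte pixel depth is a hypothetical default, not a figure
    from the text.
    """
    total_bytes = (pixels_per_image(side_k) * bytes_per_pixel
                   * images_per_day(cadence_seconds))
    return total_bytes / 1e12


lsst_pixels = pixels_per_image(56)   # 56Kx56K image
lsst_raw_tb = raw_tb_per_day(56, 20) # one image every 20 seconds

print(f"pixels per image: {lsst_pixels:,}")
print(f"raw volume per day: {lsst_raw_tb:.1f} TB")
```

With these assumptions a 56Kx56K image holds about 3.1 billion pixels, consistent with the "3-Gigapixel" label, and the raw stream comes to a few tens of terabytes per day. Changing the assumed pixel depth or observing time shifts the estimate proportionally.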
The application of data mining, knowledge discovery, and e-discovery tools to these growing data repositories is essential to the success of our social, financial, medical, government, and scientific enterprises. An informatics approach is required. What is informatics? Informatics has recently been defined as “the use of digital data, information, and related services for research and knowledge generation”, which complements the usual definition: informatics is the discipline of organizing, accessing, integrating, and mining data from multiple sources for discovery and decision support.
A National Imperative
Our science education programs have always included the principles of evidence-based reasoning, fact-based induction, and data-oriented science. In this age of the data flood, greater emphasis on and enhancement of such data science competencies is now imperative. In particular, we must muster educational resources to train a skilled data-savvy workforce: one that knows how to find facts (i.e., data, or evidence), access them, assess them, organize them, synthesize them, look at them critically, mine them, and analyze them.
The Nature article “Agencies Join Forces to Share Data” calls for more training in data skills. This article describes a new Interagency Working Group on Digital Data representing 22 federal agencies in the U.S., including the NSF, NASA, DOE, and more. The group plans to set up a robust public infrastructure so that all researchers have a permanent home for their data.
One option is to create a national network of online data repositories funded by the government and staffed by dedicated computing and data science professionals with science discipline expertise. Who will these computing and archiving professionals be? They will be a professional workforce trained in the disciplines of computational and data sciences and who collaborate with computer science and statistics professionals in these areas, including machine learning, visualization, statistics, algorithm design, efficient data structures, scalable architectures, effective programming techniques, information retrieval methods, and data query languages.
Within the scientific domain, data science is becoming a recognized academic discipline. F. J. Smith argues that now is the time for data science curricula in undergraduate education. Others promote data science as a rigorous academic discipline. Another states that “without the productivity of new disciplines based on data, we cannot solve important problems of the world”. The 2007 NSF workshop on data repositories included a track on data-centric scholarship; the workshop report explicitly states our key message: “Data-driven science is becoming a new scientific paradigm – ranking with theory, experimentation, and computational science”. Consequently, astronomy and other scientific disciplines are developing subdisciplines that are information-rich and data-intensive to such an extent that these are now becoming (or have already become) recognized stand-alone research disciplines and full-fledged academic programs on their own merits. The latter include bioinformatics and geoinformatics, but will soon include astroinformatics, health informatics, and data science.
National Study Groups Face the Data Flood
Several national study groups have issued reports on the urgency of establishing scientific and educational programs to face the data flood challenges:
1. NAS report: “Bits of Power: Issues in Global Access to Scientific Data” (1997);
2. NSF report: “Knowledge Lost in Information: Report of the NSF Workshop on Research Directions for Digital Libraries” (2003);
3. NSB (National Science Board) report: “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century” (2005);
4. NSF report with the Computing Research Association: “Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda” (2005);
5. NSF “Atkins Report”: “Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure” (2005);
6. NSF report: “The Role of Academic Libraries in the Digital Data Universe” (2006);
7. NSF report: “Cyberinfrastructure Vision for 21st Century Discovery” (2007);
8. JISC/NSF Workshop on Data-Driven Science & Repositories (2007).
Each of these reports has issued a call to action in response to the data avalanche in science, engineering, and the global scholarly environment. For example, the NAS “Bits of Power” report lists five major recommendations, one of which includes: “Improve science education in the area of scientific data management”. The Atkins NSF Report stated that skills in digital libraries, metadata standards, digital classification, and data mining are critical. In particular, that report states: “The importance of data in science and engineering continues on a path of exponential growth; some even assert that the leading science driver of high-end computing will soon be data rather than processing cycles. Thus it is crucial to provide major new resources for handling and understanding data.” The core and most basic resource is the human expert, trained in key data science skills. As stated in the 2003 NSF “Knowledge Lost in Information” report, human cognition and human capabilities are fundamental to successful leveraging of cyberinfrastructure, digital libraries, and national data resources.