HOW TO MAKE THE DREAM COME TRUE:
THE ASTRONOMERS’ DATA MANIFESTO
Ray P. Norris
CSIRO Australia Telescope, PO Box 76, Epping, NSW 1710, Australia
Astronomy is one of the most data-intensive of the sciences. Data technology is accelerating the quality
and effectiveness of its research, and the rate of astronomical discovery is higher than ever. As a
result, many view astronomy as being in a “Golden Age”, and projects such as the Virtual Observatory are amongst the most ambitious data projects in any field of science. But these powerful tools will be impotent unless the data on which they operate are of matching quality. Astronomy, like other fields of science, therefore needs to establish and agree on a set of guiding principles for the management of astronomical data. To focus this process, we are constructing a “data manifesto”, which proposes guidelines to maximise the rate and cost-effectiveness of scientific discovery.
Keywords: astronomy, data management, virtual observatory
1. INTRODUCTION
The last few years have seen a revolution in the way astronomers use data. An astronomer can type the name of an object into a web page, and instantly view a wide range of observed data on that object, obtain references to all publications that mention it, and even produce plots of the spectral energy distribution (SED: Fig 1).
Figure 1: A typical Spectral Energy Distribution (SED) generated automatically by the NASA/IPAC Extragalactic Database (NED) using data collected by many different authors and instruments.
This SED includes data obtained from many different instruments using different technologies, calibration processes, and data formats, brought together in data centres that understand the instrument-specific metadata. All papers written in the astronomical journals are available on-line through a powerful engine that searches the entire body of astronomical literature. Many such papers will contain links to other publications and data.
The Virtual Observatory (VO) promises to place even more power at the hands of the astronomer, and with it the capability of accelerating the rate of scientific discovery. The VO will enable the astronomer to search all available databases in a region of sky, and superimpose or combine the images, or produce a plot comparing measurements on different instruments. Some of us dream even further. For example, I look forward to the day when I can move my mouse over an image I have just produced, and the VO dynamically gives me all available information, in the form of graphs, images, and literature, about the position underneath my cursor.
How do we turn these dreams and promises into reality? One requisite is obviously to build the necessary tools, services, and data structures, and the VO is doing just that. However, these tools will be ineffective without high-quality data on which to operate. While data from major international observatories, such as the European Southern Observatory and NASA’s Great Observatories, are now freely available and managed in a way that is difficult to fault, much, perhaps most, of the remaining astronomical data and information are still relatively inaccessible. Even worse, some of these data are so poorly managed that they will be lost.
One of the reasons for poor data management is that many astronomers and observatory directors are unaware that good data management can generate good science, and that bad data management can inhibit the process of scientific discovery. Furthermore, in many areas there is not even a consensus on what constitutes good data management (Norris, 2005; Norris et al. 2006).
In an attempt to stimulate a discussion that might lead to such a consensus, and to promote awareness of these issues, a group of astronomers recently established “An Astronomers’ Data Manifesto”.
In this paper, I discuss the successes and challenges of astronomical data management, and describe the manifesto and its purpose.
2 THE ASTRONOMICAL LITERATURE AND DATA CENTRES
2.1 The Astronomical Literature
Virtually all papers in the fields of astronomy, astrophysics, and related areas are referenced by the Smithsonian/NASA Astrophysics Data System (http://www.adsabs.harvard.edu/), known colloquially as the ADS. It references not only the mainstream journals, but conference reports, theses, preprint servers, and even institutional technical reports, where they are made public. It includes links to the papers in all their published forms, so that, for astronomers with institutional access to the journals, this provides transparent access to the entire astronomical literature. Even authors who publish papers (such as this) in non-astronomical journals can request to have their paper listed by the ADS. Powerful facilities enable a search to be made by author, title, keyword, or even text contained within the paper.
As a result, the ADS has probably become the primary entry point to the published literature for most astronomers.
To disseminate their new research results, most authors now submit preprints (usually after acceptance by a journal) to arXiv.org (http://www.arxiv.org/list/astro-ph/new). This has become the primary means of accessing new research results, and many astronomers check it daily. Basic search facilities are also available, although ADS probably remains the most flexible way of accessing the arXiv.org contents.
The principal commercial astronomical journals have responded positively to these changes, and their electronic editions have become the main journals of record. It is likely that paper editions will be phased out within a few years. However, it is unclear how the astronomical publishing paradigm will change, given a number of conflicting forces:
1. There is a growing demand for open access, or free, journals, particularly as a solution to the “Digital Divide” discussed below. However, it is not yet clear how an open-access journal can afford to maintain the editorial quality and peer-review processes currently offered by mainstream journals.
2. Since most astronomers now access the literature via ADS or arXiv, the “title” of a journal has become less important. While there is still prestige associated with publishing in a high-impact journal, the actual visibility is similar regardless of where the paper is published, and so in time the impact factor may cease to differentiate journals. The commercial journals will therefore need to offer additional value, compared to open access journals, if they are to retain their authors and readers.
3. Some of the main journals have an excellent track record of responding to the changing demands of the astronomical community, and of promoting initiatives such as electronic access to associated data and tables, and linkages to other data centres. As a result, there is a significant groundswell of support for such journals from within the astronomical community.
2.2 Astronomical Data Centres
Astronomy enjoys a number of first-class data centres, the best known of which are CDS and NED.
NED is the NASA/IPAC Extragalactic Database (http://nedwww.ipac.caltech.edu/), which offers access to data taken from the literature and from major astronomical surveys. Its search engine provides all available data on an object or position in the sky, for which it will list measured data, images, references to the literature, and even some interpretation by comparing measurements made at different wavelengths by different authors and instruments (Fig 1). The difficulty of accomplishing this latter feat should not be underestimated, as authors use different metadata, jargon, and (even if they don’t know the word) ontologies. As well as providing access to data, NED also provides a number of tools and innovative facilities such as its knowledgebase. Its key constraint is that it is designed to include only extragalactic objects (i.e. objects lying outside the Milky Way) and so does not include, for example, stars within the Milky Way, or solar-system objects. Nevertheless, for those who focus on extragalactic astronomy, NED has become the primary tool for accessing data.
CDS is the Centre de Données Astronomiques de Strasbourg (http://cdsweb.u-strasbg.fr/). Like NED, it offers access to data from the literature and from major surveys, and provides tools and search engines to access and interpret those data. It differs from NED in that it aims to include data on all astronomical objects outside the solar system, whether extragalactic or galactic. The main databases at CDS are Vizier, which includes nearly all major published surveys and tables, and Simbad, which provides search tools to access data taken both from the literature and from surveys. A number of other powerful tools are also provided, such as Aladin, which enables a user to superimpose images from several data sources, including personal files. Just as important are the CDS research and development programs, which have been influential in shaping the way in which astronomers use data, and continue to be important drivers in the development of the VO.
A number of other major data centres around the world, such as those in Canada, China, Japan, and Russia, offer significant features for particular purposes. In addition, a number of specialised data centres exist to serve data for particular instruments or classes of instrument, such as NASA’s High Energy Astrophysics Science Archive Research Center (http://heasarc.gsfc.nasa.gov/). Furthermore, the electronic data provided by the journals themselves effectively constitute a data centre, a blurring which is increasing as journals explore innovative projects such as those which offer to store authors’ source data. Finally, it is important to acknowledge the regrettable closure of a major data centre (NASA’s Astrophysical Data Center) in 2002, which serves as a warning against any complacency that high-performing data centres are immune to threats of closure.
2.3 Linkages between Literature and Data Centres
Many astronomers assume that data centres such as NED and CDS give them access to essentially all the published data. However, Andernach (2006), who has conducted a case study of over 2000 published articles, finds that typically only about 50% of results published in journals ever appear in the data centres, and lists some surprising and significant omissions.
It is not hard to understand why. At present, when authors submit data such as tables, spectra, or images, to journals, they do so in a variety of formats. The meaning of the axes or columns is often only apparent after reading the captions or the body of the paper, and authors continue to use jargon which is opaque to anyone outside the immediate field. Even worse, formatting errors still occur in published data tables, further impairing attempts at machine-readability.
To incorporate these data into a data centre requires a knowledgeable expert to interpret the words of the author, so that the results can be translated into a standard form. Given the finite resources of the data centres, and the expanding volume of astronomical literature, the availability of such knowledgeable experts then becomes a bandwidth bottleneck between the literature and the data centres.
Naturally, users would like to see all peer-reviewed results appear in the data centres. This will become increasingly important as VO tools become more widely used. At the moment, an astronomical publication that does not appear in ADS is effectively invisible to the astronomical community and will probably never be cited. In the same way, in a few years’ time a result or image that is not accessible by the VO may never be used, generating wasteful repeated observations and slowing down the rate of scientific discovery.
How can we ensure that all validated astronomical data appear in the data centres? One solution would be to increase funding to the data centres so that they can employ enough knowledgeable experts to interpret all published data, but the finite available resources make this option unlikely.
An alternative is to find ways of automatically transferring published data from journals to data centres.
This would probably require that the authors provide data in standard formats and that they provide the necessary metadata to interpret them. Then, if an author chooses to supply these metadata, and certifies that the data have been checked using appropriate tools, they could be imported automatically into the data centres.
This effectively redistributes the transcription workload from the data centres to the authors, and necessarily entails more work for authors. However, they benefit from the greater scientific impact and the higher citation rate that will result from their data being in the data centres. In many cases the paper itself will benefit from this further level of checking.
There is a potential disadvantage to such a system, in that it increases the likelihood that simple formatting errors in published papers might ultimately reduce the quality of the data in the centres. It remains to be seen whether automated checking procedures can reduce this possibility to the level where the disadvantage is outweighed by the advantage of doubling the quantity of high-quality information offered by the data centres.
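As a rough illustration of what such an automated check might look like, the following minimal Python sketch validates an author-supplied table against the metadata supplied with it. Everything in it is an assumption made for illustration: the Column structure, the check_table function, and the sample sources and values are hypothetical, and do not correspond to any format or tool actually used by the journals or data centres.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    """Hypothetical per-column metadata that an author would supply."""
    name: str         # column label as printed in the paper
    unit: str         # physical unit, e.g. "mJy"; empty if dimensionless
    description: str  # plain-language meaning of the column
    numeric: bool     # True if every value must parse as a number


def check_table(columns: List[Column], rows: List[List[str]]) -> List[str]:
    """Return a list of problems found; an empty list means the table passed."""
    problems = []
    # Every column needs a description, so that no caption reading is required.
    for col in columns:
        if not col.description.strip():
            problems.append(f"column '{col.name}' has no description")
    for i, row in enumerate(rows, start=1):
        # A row with the wrong number of fields is a typical formatting error.
        if len(row) != len(columns):
            problems.append(
                f"row {i}: expected {len(columns)} fields, found {len(row)}"
            )
            continue
        # Values in numeric columns must be machine-readable numbers.
        for col, value in zip(columns, row):
            if col.numeric:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"row {i}: '{value}' in column '{col.name}' is not a number"
                    )
    return problems


# Invented example table: one typo (letter O for zero) and one short row.
columns = [
    Column("Name", "", "Source designation", False),
    Column("S_1.4", "mJy", "Flux density at 1.4 GHz", True),
    Column("z", "", "Redshift", True),
]
rows = [
    ["SRC-A", "12.5", "0.031"],
    ["SRC-B", "7.1O", "0.12"],
    ["SRC-C", "3.3"],
]
problems = check_table(columns, rows)
for problem in problems:
    print(problem)
```

A check of this kind catches only mechanical errors, such as a mistyped digit or a missing field. It cannot tell whether a value is scientifically correct, which is why the author's certification would still be needed.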
3 OPEN ACCESS
Most astronomical data are unfettered by intellectual property or confidentiality issues, other than widely supported exceptions such as initial protection of observers’ data by major facilities. As a result, astronomical archive data are generally available to all astronomers at no charge. It is this tradition which has enabled the success of astronomical data centres, and which will be vital for the success of the VO. The adoption of an open-access policy is not just for public good. For example, the Hubble archive results in roughly three times as many papers as those based on the original data (Beckwith, 2004). Similarly, the International Ultraviolet Explorer (IUE) archive increased the usage of IUE data by a factor of 5 (Wamsteker & Griffin, 1995). So, in principle, observatories might multiply their scientific output by making their archive data public. Since the funding for most major observatories depends on performance indicators such as publications and citations, it may be an expensive decision for an observatory not to adopt an open-access policy.