Project STORE: Astronomy Report
Sayeed Choudhury (Johns Hopkins), Robert Hanisch (Space Telescope Institute) and
Rowena Stewart (University of Edinburgh)
1.0 Summary
2.0 Survey
2.1 The Need for Linking Repositories
2.2 Research Data and Source Repositories
2.3 The Accessibility and Sharing of Primary Research Data
2.4 Output Repositories
2.5 And Finally…
3.0 Interviews and Workshop
3.1 UK-based Astronomers
3.2 US-based Astronomers
Summary
In many ways, digital astronomy is at the forefront of issues related to data curation, given its existing experience with generating large amounts of data in raw form and significant quantities of derived data in processed form. Additionally, astronomers have agreed upon a set of standards and web services for accessing, organizing and disseminating data. In the United States, the international Virtual Observatory effort is often cited as the archetypal example in cyberinfrastructure-related discussions. Astronomy data is “unconstrained” in the sense that it is not subject to the same privacy, legal, or commercial constraints as data in other scientific disciplines. This characteristic enables astronomers, and librarians, to build systems in an open manner.
1) Citing source data in publications is a condition of use of source repositories, and it is also a strong part of the culture in astronomy. Links from output to source repositories may be more useful than vice versa. The main value of accessing data in this manner would be to the research community: validating results, identifying specific astronomical objects of interest, or identifying collaborative opportunities.
2) Researchers are happy for their (source) data to be used as long as it is credited. Where the data is publicly funded, there is in any case an obligation to make it available after a proprietary period, usually 6 to 18 months, during which the data is restricted to project team members.
3) ArXiv.org and NASA-ADS are the main A&I databases and output repositories used. The Virtual Observatory team and Sheridan Libraries at Johns Hopkins are working with the University of Chicago Press to consider output repository support at the time of article submission, especially as it supports preservation of derived data cited within publications.
4) Facilities that link source data to output data are already in operation (e.g. CDS's SIMBAD), but they are not comprehensive; one interviewee mentioned his work on improving the linking.
5) Source repositories like being able to monitor how much they are used, especially if metrics for use might help gather additional funding or support.
6) Astronomers should define standard methods to refer to the same objects when viewed through different spectra, including the provenance or annotations with which certain data (or analyses of data) are deposited into output repositories. Additional metadata captured through automated mechanisms (e.g., a telescope directly recording weather conditions) would also be useful.
7) Astronomers would not seek help from librarians or information professionals with information seeking or navigating, but rather with metadata and preservation matters related to datasets.
Survey
This report provides an overview and details of the survey of, and interviews with, astronomers conducted through Project STORE. The first information collected in the survey related to institutional affiliation, professional identity, and discipline. Given the connection to Johns Hopkins University, astronomers from both the UK and US responded to the survey. The survey included responses from sixty-four astronomers at the following thirty-one institutions:
There were multiple respondents from the University of Edinburgh, Johns Hopkins University, Open University, Space Telescope Institute, University of Leicester, University of Sheffield, University of Nottingham, and the University of Sussex. Given that astronomers at the University of Edinburgh and Johns Hopkins University sent specific email messages to their colleagues, it is not surprising that these institutions were especially well represented in the survey. That is, the larger number of respondents most probably reflects the impact of personal communication rather than a special interest in the survey topics. The topics examined by Project STORE almost certainly have widespread relevance for the astronomy community throughout the UK and US.
Survey respondents identified themselves with the following distribution of roles:
Within the overall discipline of astronomy, the respondents identified the following (unique) main fields of interest:
• Astronomy Stellar Evolution
• Astronomy - in particular planetary systems formation and evolution
• Astronomy & astrophysics
• Astronomy Astrophysics Galaxies
• Astronomy Astrophysics Galaxy evolution Star formation
• Astronomy Astrophysics Scientific Databases
• Astronomy Cosmology Numerical Simulation
• Astronomy, astrophysics, astrobiology
• Astronomy, large databases, astronomical instrumentation, survey astronomy, galaxies, cosmology
• Astronomy: interstellar medium; star formation
• Astrophysics
• Astrophysics (Observational and computational)
• Astrophysics and Space Science
• Clusters of galaxies radio astronomy
• Computers, Astronomy, Physics
• Cosmology
• Cosmology - large scale structure - statistical descriptions of large datasets
• Cosmology; galaxy formation and large-scale clustering
• Data Curation, Astronomical Archiving
• Data management data discovery data access multi-wavelength data integration
• Dust, ISM, Star formation
• Extragalactic astronomy
• Galactic astrophysics
• Galaxy formation, AGN
• High energy astrophysics
• Interstellar Medium Stellar Populations Cosmology Supernovae and Supernova Remnants
• Large Databases
• Observational Astronomy
• Observational astrophysics Space instrumentation
• Physics
• Physics (astronomy)
• Physics, Applied Mathematics
• Polymer physics
• Population synthesis, stellar winds, galaxy evolution
• Proto-planetary disks stellar atmospheres planet formation chemistry atomic data
• Solar Physics Plasma Physics
• Solar System astronomy. Comets, asteroids.
• Solar Terrestrial Physics
• Spectroscopy in the FUV: Hot Stars in Globular Clusters Emission from the Diffuse Interstellar Medium Star formation in spiral galaxies. Supernovae.
• Stellar astrophysics binary stars Telluric ozone
• Stellar spectroscopy Stellar Evolution
• Wide field astronomical surveys Databases Information technologies
Two of the respondents noted that it would have been helpful to include the RAE categories on the survey itself.
The Need for Linking Repositories
This section of the survey featured brief definitions of source and output repositories. Respondents considered the following two questions, and provided their responses as follows:
“Source repositories contain primary research data. If a standard feature of such repositories was the ability to identify and link to the publications that had been developed from these data, how advantageous would you find it?”
Among the free-form comments, one respondent stated, “I would find it a fairly dangerous development.” This assertion may relate to another respondent’s comment that “Data are objective, whereas interpretations are subjective and the two should only be collated with great caution since some people (especially students) tend to give as much credence to a fashionable finding as to the actual data.” More than one respondent indicated that such a service exists through the ADS at Harvard and SIMBAD at Strasbourg. These individuals stated that the service is helpful in tracking down literature, but that the process for generating such links could be improved and automated. One respondent whose primary field of work relates to storing and archiving source repositories indicated (perhaps unsurprisingly) that s/he has less interest in output repositories, but that it might still be useful to know how the data is being used in publications.
The next question related to links from publications to the primary source data:
“How advantageous to you would it be if it were possible to go directly from within an online publication (electronic journal article or other text) to the primary source data from which that publication was developed?”
The responses were classified as follows:
Several respondents indicated again that such a service exists through ADS at Harvard. One respondent stated that this service “would be useful only in cases (e.g. optical astronomy) where the primary data is not currently kept in public archives.” Another respondent raised the important point that such a facility allows astronomers to examine and (perhaps) verify “controversial” results.
Research Data and Source Repositories
This section of the survey addressed data formats, source repositories, metadata and reasons for modes of access to others’ research datasets. The first question in this section addressed electronic source data:
“What kinds of electronic source data do you produce? (select all that apply)”
From the list of choices, the respondents identified the following subset:
The next question focused on the formats for source data:
“In what formats are these source data held? (select all that apply)”
From the list of choices, the respondents identified the following subset:
One respondent pointed out that FITS is not a flat file. Three respondents identified formats outside the set of choices provided: FORTRAN binary files, HDF proprietary format, and IDL database format. One respondent offered a strong opinion about proprietary formats, stating “God preserve us from idiots who archive data in proprietary commercial formats (excel spreadsheets and MS-word documents)!”
The next question focused on the idea of combinations of data formats:
“Are the data you generate sometimes a combination or group of different data formats (see MoreInfo)?”
The respondents offered the following distributions of responses:
One respondent pointed out that one of the standard publishing formats is TeX/LaTeX with embedded PostScript for graphics.
The next question raised the topic of source repositories:
“To which source repositories do you submit your data? (select all that apply)”
The responses reflect a diversity of approaches to this question, perhaps due in part to the different practices of UK-based and US-based astronomers. The responses included:
It is worth noting that some respondents may not have been aware that they are using the Virtual Observatory (VO). For example, MAST is part of the VO, yet more than one respondent mentioned it without citing the VO.
The next question built upon the previous one by asking how often respondents submitted data to these aforementioned repositories:
“How often have you submitted data to any of these source repositories? (Tick any that are applicable)”
Respondents (understandably) stated “never” for repositories that they did not choose from the previous list, so the responses below cover only the cases that include positive responses:
The next three questions related to metadata:
“By selecting the following options, please would you indicate what types of metadata you consider it important to assign to your data. The metadata given in the following list are generic and you can use the ‘Other’ option to enter more discipline-specific terms if that is appropriate.”
The fifteen respondents who cited “other” metadata types mentioned the following items:
• Description of the instrument operating mode and a detailed format description such that others could process the data
• Description of data structure
• Detailed format information for binary files
• In astronomical images it is essential to have positional information, calibration information, instrument set up information
• Lots of astronomical metadata, e.g. celestial object, position, observation date, data reduction software versions, etc.
• Over three hundred and fifty other details containing data detailing the instrument used, instrument operating conditions, atmospheric conditions, light conditions, error margins, data pipeline used, data pipeline operating conditions, filter and reprocessing information. More metadata is created but stored in the database system rather than with the individual files (though there are links from the file to the database and vice versa).
• Processing method and version
• Reference of published paper connected to data
• Relevant field/sub-fields
• Specifics of the data processing steps used in creating the product
• Summary of input parameters of run which produced the data
• Telescope, instrument
• Various astronomical parameters
Two of the other respondents also made these additional comments:
- “I do not think one should include publications under ‘data’. It is important to recognize AND PRESERVE the fundamental difference between them.”
- “I think there should be a core mandatory list and then an optional one. The latter could be as long and comprehensive as people like to be e.g. link to ADS paper, where data has been published; data source; software + version used in the project.”
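The second comment's idea of a core mandatory list plus an open-ended optional list could be sketched roughly as follows. This is only an illustration, not a schema proposed by Project STORE or any respondent: the field names in `CORE_FIELDS`, the `DatasetMetadata` class, and the sample values are all hypothetical, loosely based on the metadata items that respondents listed above.

```python
from dataclasses import dataclass, field

# Hypothetical core (mandatory) fields, chosen only for illustration from
# items respondents mentioned; a real list would be agreed by the community.
CORE_FIELDS = ("object_name", "position", "observation_date", "telescope", "instrument")


@dataclass
class DatasetMetadata:
    """A metadata record split into a mandatory core and a free-form optional part."""
    core: dict
    optional: dict = field(default_factory=dict)

    def missing_core_fields(self):
        """Return the core field names that are absent or empty, in CORE_FIELDS order."""
        return [name for name in CORE_FIELDS if not self.core.get(name)]

    def is_complete(self):
        """A record is complete when every core field has a non-empty value."""
        return not self.missing_core_fields()


# Placeholder values for illustration only; they do not describe a real dataset.
record = DatasetMetadata(
    core={
        "object_name": "Example Object",
        "position": "example coordinates",
        "observation_date": "2008-01-01",
        "telescope": "Example Telescope",
        "instrument": "Example Imager",
    },
    # The optional part can be as long as the depositor likes.
    optional={"software_version": "example pipeline 1.0"},
)

incomplete = DatasetMetadata(core={"object_name": "Example Object"})
print(record.is_complete())              # every core field is present
print(incomplete.missing_core_fields())  # lists the core fields still needed
```

Keeping the optional part as an unconstrained dictionary reflects the respondent's remark that it "could be as long and comprehensive as people like", while only the short core list is checked at deposit time.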
The next question addressed the stage of metadata assignment:
“At what stage are metadata assigned to your data? (select all that apply)”