Accessing HEASARC and LAMBDA data in the Cloud
Introduction
Since 2023, the Year of Open Science, HEASARC data have been available in the cloud as part of NASA's Open Science Initiative and in collaboration with the Amazon Web Services (AWS) Open Data project. This effort is motivated by the need to make these data more accessible to the broader community and to enable science that requires the significant resources of cloud computing.
HEASARC data are now on AWS and registered in the AWS Open Data Registry. HEASARC is building a next-generation platform that will be similar to HEASARC@SciServer but will run in AWS; in the meantime, the data are already available. The tutorial notebook below shows how to find the cloud locations of data files using Python. These locations can then be used with cloud-compatible client software, such as the Astropy-affiliated packages Astroquery and PyVO, to provide seamless access to data in the cloud. Our Xamin data portal offers results in various formats, including a list of cloud URIs.
NASA Astrophysics, including HEASARC, is building cloud analysis capabilities with the Fornax Initiative. See details ....
Pythonic Data Access Tutorial
For a quick tutorial on accessing HEASARC or LAMBDA data in the cloud using Python, we have prepared a Python notebook that you can download or view rendered as HTML.
Some software, such as Astropy's FITS I/O routines, can read data directly from the S3 bucket, with options to read only a subset of a FITS file. Tools like HEASoft that are based on cfitsio can also read a file directly from a URL. See below.
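As a rough sketch of the direct read described above, the helpers below open a FITS file in place on S3 using Astropy's fsspec support. This assumes astropy 5.2 or later with the optional fsspec and s3fs packages installed. The function names are illustrative only and are not part of any HEASARC tool.

```python
def require_s3_uri(uri):
    """Return `uri` unchanged if it is an s3:// URI, else raise ValueError."""
    if not uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return uri


def read_primary_header(uri):
    """Open a FITS file in place on S3 and return a copy of its primary header."""
    require_s3_uri(uri)
    # Imported lazily so this module loads even where astropy is absent.
    from astropy.io import fits

    # anon=True asks s3fs for anonymous access, which suffices for a public bucket.
    with fits.open(uri, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
        return hdul[0].header.copy()


def read_image_section(uri, rows, cols, hdu=0):
    """Read only a rectangular part of an image HDU.

    `rows` and `cols` are slice objects. The .section attribute lets astropy
    fetch just the bytes it needs; it applies to uncompressed image HDUs,
    so gzipped files and binary tables need a different approach.
    """
    require_s3_uri(uri)
    from astropy.io import fits

    with fits.open(uri, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
        return hdul[hdu].section[rows, cols]
```

For example, `read_primary_header("s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits")` would return the primary header of that file without saving a local copy.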
Note that some HEASoft tools that rely on knowing the directory structure of an input dataset might require you to copy the data out of the S3 object store and into a file system they can access.
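One possible way to make such a copy is sketched below with boto3 and unsigned (anonymous) requests. It assumes boto3 is installed and that the bucket permits anonymous listing. The function names and the destination directory are hypothetical, chosen only for illustration.

```python
import os


def local_path_for(key, prefix, dest_dir):
    """Map an object key under `prefix` to a path under `dest_dir`,
    preserving the directory structure below the prefix."""
    relative = key[len(prefix):].lstrip("/")
    if not relative:
        # The prefix named a single object: keep just its file name.
        relative = key.rsplit("/", 1)[-1]
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(bucket, prefix, dest_dir):
    """Copy every object under `prefix` into `dest_dir`; return the local paths."""
    # Imported lazily so the path helper above works without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests need no AWS account or credentials.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3.get_paginator("list_objects_v2")

    downloaded = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "directory" placeholder objects
            target = local_path_for(key, prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded
```

Calling `download_prefix("nasa-heasarc", "chandra/data/byobsid/5/4475/", "4475")` would then reproduce that observation's directory tree under a local `4475` directory, which a HEASoft tool can be pointed at.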
Direct Bucket Access
These data can currently be accessed by using the HEASARC or LAMBDA web tools to browse the archive and retrieve a list of observations or files to download, or by doing the same with one of our APIs. (See our archive pages for the HEASARC options or the LAMBDA data portal.) If the given tool does not return cloud URIs, they can be inferred from the on-premises URL: replace the beginning of the traditional access URL with the AWS S3 bucket address. For example, a Chandra image located at
http://heasarc.gsfc.nasa.gov/FTP/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz
can also be found at
s3://nasa-heasarc/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz
or
http://nasa-heasarc.s3.amazonaws.com/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz
For LAMBDA data, similar URLs can be turned into URIs starting with "s3://nasa-lambda/". Note that for WMAP, there is one small change to the path, from "map" to "wmap", to make clear that it is the mission name. For example,
http://lambda.gsfc.nasa.gov/data/map/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
can also be found at
s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
or
http://nasa-lambda.s3.amazonaws.com/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
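The substitution shown in the examples above can be sketched as a small Python helper. The host-to-bucket table and the stripped path prefixes ("/FTP/" and "/data/") are inferred from those examples alone, so treat them as assumptions. The function names are illustrative.

```python
from urllib.parse import urlparse

# Archive host -> (bucket name, leading path segment dropped in the bucket).
# Inferred from the example URLs in the text; not an official mapping.
BUCKETS = {
    "heasarc.gsfc.nasa.gov": ("nasa-heasarc", "/FTP/"),
    "lambda.gsfc.nasa.gov": ("nasa-lambda", "/data/"),
}


def to_bucket_and_key(url):
    """Return (bucket, key) for an on-premises HEASARC or LAMBDA URL."""
    parsed = urlparse(url)
    if parsed.hostname not in BUCKETS:
        raise ValueError(f"unrecognized archive host: {parsed.hostname!r}")
    bucket, prefix = BUCKETS[parsed.hostname]
    if not parsed.path.startswith(prefix):
        raise ValueError(f"expected the path to start with {prefix!r}")
    key = parsed.path[len(prefix):]
    # LAMBDA only: the WMAP directory "map" is named "wmap" in the bucket.
    if bucket == "nasa-lambda" and key.startswith("map/"):
        key = "w" + key
    return bucket, key


def to_s3_uri(url):
    """Build the s3:// URI for an on-premises URL."""
    bucket, key = to_bucket_and_key(url)
    return f"s3://{bucket}/{key}"


def to_s3_http_url(url):
    """Build the HTTP URL of the same object on the S3 endpoint."""
    bucket, key = to_bucket_and_key(url)
    return f"http://{bucket}.s3.amazonaws.com/{key}"
```

Passing the Chandra URL from the example above to `to_s3_uri` yields the `s3://nasa-heasarc/...` form, and `to_s3_http_url` yields the `nasa-heasarc.s3.amazonaws.com` form.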
Thanks to Amazon's Open Data project, these data are free to access from anywhere, not subject to cloud data egress costs. As described on HEASARC's data policy web page, these data are available freely for your use.
Datasets
The datasets currently available include:
- High-energy astrophysics datasets
- Ariel5
- ASCA
- BBXRT
- Chandra
- Compton
- Copernicus
- COS-B
- DXS
- EXOSAT
- Fermi (subset)
- Ginga
- HaloSat
- HEAO-1
- Hitomi
- IXPE
- NICER
- NuSTAR
- OSO-8
- ROSAT
- SAS-2
- BeppoSAX
- Suzaku
- Swift
- VELA 5B
- WASS
- XQC
- Rossi XTE
- XMM-Newton
- CMB datasets
- WMAP
- COBE
Please also see the HEASARC and LAMBDA entries in the AWS Open Data Registry.
Caveats
We have selected which datasets to put in the cloud, leaving out data that we do not believe will be useful to access this way, such as older mission data in non-standard file formats. We will also keep the nasa-heasarc bucket in sync with the on-prem archive on a best-effort basis for the ongoing missions. The most recent data products may therefore be available only from the HEASARC on-prem archive for a few days until the next sync.