Spatial models require spatial data, and getting spatial data into model-ready formats is a task that is always more faffy and nuanced than I envisage. I am interested to hear how other people go about this!
Thinking about gridded data, and working on the presumption that we want all our input data to share the same CRS, resolution, bounds, etc., the first question is: is processing the data to meet this the responsibility of the model, or of a pre-model data pipeline?
Other issues that I frequently encounter include:
- What do you do with missing values in grid cells for a certain variable, when all other variables have values? Do you interpolate? (And is it the model or the pre-model data compilation that deals with this?)
- Similarly, if your data also has a time dimension, what if certain timesteps are missing - do you interpolate, or exclude them from the model run?
- How do you set the spatial bounds for your model run? Do you use a vector shape to clip the data, or use one of the variables as a reference?
- What format do you take input data in? Lots of raster files? And for variables with a time dimension, multiple bands or multiple files?
- Do you do any unit conversion, or presume the data are in the correct units (requiring extra pre-processing if not)?
- Do you do anything to try to minimise data loss when resampling data? (I've put a rough sketch covering a few of these steps after this list.)
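For concreteness, here is a minimal sketch of how I currently handle a few of these steps (clipping to a vector boundary, snapping variables onto a reference grid, and filling small gaps) with xarray/rioxarray. The file names and the choice of resampling method are placeholders, not recommendations:

```python
import geopandas as gpd
import rioxarray  # registers the .rio accessor on xarray objects
from rasterio.enums import Resampling

# One variable acts as the reference defining CRS/resolution/bounds.
reference = rioxarray.open_rasterio("reference_variable.tif", masked=True)

# Clip to the model domain using a vector shape.
boundary = gpd.read_file("model_domain.gpkg")
reference = reference.rio.clip(boundary.geometry, boundary.crs)

# Snap another variable onto the reference grid; averaging loses
# less information than nearest-neighbour when downsampling
# continuous data.
other = rioxarray.open_rasterio("other_variable.tif", masked=True)
other = other.rio.reproject_match(reference, resampling=Resampling.average)

# Fill isolated gaps by linear interpolation along one dimension
# (interpolate_na works along a single dimension at a time).
other = other.interpolate_na(dim="x", method="linear")
```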
A lot of the answers are, I suspect, specific to the data and model in question. This feels like a common enough task across all spatial models, but one that is mostly done in an ad hoc manner. Is there room for suggesting best practices and conventions to ease the process?
Great questions! I agree there are a lot of devils in the details. A couple of things we’ve tried in the CSDMS Integration Facility are:
- Data components: Python packages that fetch data from a particular dataset, offering a BMI or a more Pythonic API. Tian Gan recently published a paper about this concept. It doesn’t solve all problems, of course, but at least it provides a programmatic way to get the data into the form of NumPy arrays. Having programmatic access makes it easier to script analyses (as opposed to faffing about with point-and-click on websites).
- The PyMT tool includes some utilities for grid mapping/interpolation (using the Python interface for the ESMF mappers), and a unit conversion tool that Eric wrote (Eric, please add/correct me!). A generic unit-conversion sketch follows this list.
- Landlab has utilities to read input raster data in either ESRI ASCII or NetCDF formats. The ESRI ASCII reader has an optional argument for specifying a “no data” value, and when the raster is used to initialize a grid, the no-data nodes will be assigned closed-boundary status. We use this, for example, to set up a watershed as a model domain (inset into a rectangular grid and surrounded by no-data nodes). A minimal example also follows this list.
- I find xarray is great for multi-temporal raster data, and its close relationship with NetCDF makes it relatively easy to import/export (quick example below).
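To illustrate the idea of programmatic unit conversion (using pint here purely as a stand-in; it is not the tool Eric wrote):

```python
import pint

ureg = pint.UnitRegistry()

# Convert a precipitation rate from mm/day to m/s before handing
# it to a model that expects SI units.
rate = 3.5 * ureg("mm/day")
print(rate.to("m/s").magnitude)
```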
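For the Landlab case, a minimal example of closing off no-data nodes (this uses the grid method rather than the reader argument; exact names may vary between Landlab versions):

```python
from landlab.io import read_esri_ascii

# Read a DEM into a RasterModelGrid, attaching the values as a field.
grid, z = read_esri_ascii("watershed_dem.asc", name="topographic__elevation")

# Close off the no-data nodes so the watershed inset becomes the
# active model domain.
grid.set_nodata_nodes_to_closed(z, -9999.0)
```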
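And a quick xarray example for multi-temporal data, combining per-timestep NetCDF files and filling a missing timestep (file names are placeholders):

```python
import xarray as xr

# Combine a stack of single-timestep NetCDF files along their
# coordinates (including time).
ds = xr.open_mfdataset("precip_*.nc", combine="by_coords")

# Rechunk so time is a single chunk, then fill a missing timestep
# by linear interpolation along it.
ds = ds.chunk({"time": -1}).interpolate_na(dim="time", method="linear")
ds.to_netcdf("precip_filled.nc")
```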
It’s great to see CSDMS tools tackling these issues! I should invest some time in learning them a bit better. It would be neat to write some data components for data that I commonly use.
Xarray is definitely great; it's what I use for most of these data wrangling tasks. There's a neat extension, rioxarray, that makes it easier to work with raster files via rasterio in xarray.
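For anyone who hasn't tried it, opening a GeoTIFF this way gives you an xarray object with the CRS and transform carried along (the file name here is a placeholder):

```python
import rioxarray

# Open a GeoTIFF as a DataArray; geospatial metadata is exposed
# through the .rio accessor.
da = rioxarray.open_rasterio("elevation.tif", masked=True)
print(da.rio.crs, da.rio.resolution())
```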