Acquiring and Pre-Processing the WSIM-GLDAS Dataset

Authors

Josh Brinks

Elaine Famutimi

Published

April 6, 2024

Overview

In this lesson, you will acquire the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) dataset from the NASA Socioeconomic Data and Applications Center (SEDAC) website, and retrieve a global administrative boundary dataset, geoBoundaries, directly from an application programming interface (API). You will learn how to subset the data for a region of interest and save the resulting data as a new file.

Learning Objectives

After completing this lesson, you should be able to:

  • Retrieve WSIM-GLDAS data from SEDAC.
  • Retrieve administrative boundary data using the geoBoundaries API.
  • Subset WSIM-GLDAS data for a region and time period of interest.
  • Visualize geospatial data to highlight precipitation deficit patterns.
  • Write a pre-processed NetCDF-formatted file to disk.

Introduction

The water cycle is the process of circulation of water on, above, and under the Earth’s surface (NOAA 2019). Human activities such as greenhouse gas emissions, land use changes, dam and reservoir development, and groundwater extraction have affected the natural water cycle in recent decades (IPCC 2023). The influence of these activities on the water cycle has consequential impacts on oceanic, groundwater, and land processes, influencing phenomena such as droughts and floods (Zhou 2016).

Precipitation deficits, or periods of below-average rainfall, can lead to drought: prolonged periods of little to no rainfall and the resulting water shortages. Droughts often trigger environmental stresses and can create reinforcing cycles that impact ecosystems and people (Rodgers 2023). For example, California frequently experiences drought, but the combination of prolonged dry spells and sustained high temperatures prevented the replenishment of cool fresh water to the Klamath River, leading to severe water shortages in 2003 and again from 2012 to 2014. These shortages affect agricultural areas like the Central Valley, which grows almonds, one of California’s most important crops; the state produces 80% of the world’s almonds. These severe droughts, coupled with competition for limited freshwater resources, resulted in declining populations of Chinook salmon due to heat stress and gill rot disease, disrupting the food supply for Klamath basin tribal groups (Guillen 2002; Bland 2014).


Data Science Review

A raster dataset is a type of geographic data in digital image format with numerical information stored in each pixel. (Rasters are often called grids because of their regularly-shaped matrix data structure.) Rasters can store many types of information and can have dimensions that include latitude, longitude, and time. NetCDF is one format for raster data; others include GeoTIFF, ASCII, and many more. Several raster formats like NetCDF can store multiple raster layers, or a “raster stack,” which can be useful for storing and analyzing a series of rasters.

The Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948–2014) dataset “identifies and characterizes surpluses and deficits of freshwater, and the parameters determining these anomalies, at monthly intervals over the period January 1948 to December 2014” (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022b). The dataset can be downloaded from the NASA SEDAC website. Downloads of the WSIM-GLDAS data are organized by a combination of thematic variables (composite surplus/deficit, temperature, PETmE, runoff, soil moisture, precipitation) and integration periods (a temporal aggregation of 1, 3, 6, or 12 months). Each variable-integration combination consists of a NetCDF raster (.nc) file with a time dimension that contains a raster layer for each of the 804 months between January 1948 and December 2014. Some variables also contain multiple attributes, each with its own time series. Hence, this is a large file that can take a long time to download and may cause memory issues on some systems. This is considered BIG data.

Knowledge Check
  1. How would you best describe the water cycle?
    1. A prolonged period of little to no rainfall.
    2. Low precipitation combined with record temperatures.
    3. The circulation of water on and above Earth’s surface.
    4. A cycle that happens due to drought.
  2. What human interventions affect the water cycle? (select all that apply)
    1. Greenhouse gas emissions
    2. Land use changes
    3. Dam and reservoir development
    4. Groundwater overexploitation
  3. What is a precipitation deficit?
    1. A period of rainfall below the average.
    2. A prolonged period of little to no rainfall.
    3. A period of chain reactions.
    4. A period of rainfall above the average.

Acquiring the Data

Data Science Review

The Water Security (WSIM-GLDAS) Monthly Grids dataset used in this lesson is hosted by NASA’s Socioeconomic Data and Applications Center (SEDAC), one of several Distributed Active Archive Centers (DAACs). SEDAC hosts a variety of data products including geospatial population data, human settlements and infrastructure, exposure and vulnerability to climate change, and satellite-based data on land use, air, and water quality. In order to download data hosted by SEDAC, you are required to have a free NASA EarthData account. You can create an account here: NASA EarthData.

For this lesson, we will work with the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period NetCDF file. This file contains the variable “Composite Anomaly” for the twelve-month integration period; in this instance, an anomaly is defined as a rare observation. We begin with the twelve-month aggregation because it provides a snapshot of anomalies for an entire year, which is useful for an initial understanding of that year; once we identify outliers in the data, we can examine the 3-month or 1-month return periods for a more in-depth analysis. The dataset documentation describes the composite variables as key features of WSIM-GLDAS which combine “the return periods of multiple water parameters into composite indices of overall water surpluses and deficits” (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022a). The composite anomaly files represent these model outputs in terms of the rarity of their return period, or how often they occur. Please go ahead and download the file directly from the SEDAC website.

  • First, go to the SEDAC website at https://sedac.ciesin.columbia.edu/. You can explore the website by themes, data sets, or collections. We will use the search bar at the top to search for “water security wsim”. Find and click on the Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948–2014) data set. Take a moment to review the dataset’s overview and documentation pages.

  • When you’re ready, click on the Data Download tab. You will be asked to sign in using your NASA EarthData account.

  • Find the Composite Class, and find and click on the Variable Composite Anomaly Twelve-Month Return Period.

Reading the Data

Data Science Review

This lesson uses the stars, sf, lubridate, and cubelyr packages.

The stars package in R helps you work with large and complex spatial data, making it easier to analyze and visualize maps and satellite images. The sf package lets you handle and analyze spatial data in a simple way, allowing you to work with maps and geographic information seamlessly. The lubridate package makes it really easy to handle dates and times in R, so you can effortlessly convert, manipulate, and perform calculations with them. The cubelyr package helps you create and analyze multidimensional data cubes, making it easier to explore complex datasets and discover patterns.

Make sure they are installed before you begin working with the code in this document. If you’d like to learn more about the functions used in this lesson you can use the help guides on their package websites.

Once you have downloaded the file to your local computer, install and load the R packages required for this exercise. This is accomplished by defining the list of packages and assigning it to a new variable called “packages_to_check”. Next we loop (iterate) through each package in the list to see if it is already installed. If it is, we simply load it; if it isn’t, we install it first and then load it.

packages_to_check <- c("stars", "sf", "lubridate", "cubelyr")

# Check and install packages
for (package_name in packages_to_check) {
  if (!package_name %in% rownames(installed.packages())) {
    install.packages(package_name)
    cat(paste("Package", package_name, "installed.\n"))
  } else {
    cat(paste("Package", package_name, "is already installed.\n"))
  }
  library(package_name, character.only = TRUE)
}

Once you’ve completed the download and placed the .nc file into your working directory, read the file with the stars::read_stars() function.

Reading the file with the argument proxy = TRUE allows you to inspect the basic elements of the file without reading the whole file into memory. Multidimensional raster datasets can be extremely large and can bring your computing environment to a halt if you have memory limitations.
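As a minimal sketch (assuming the downloaded file was saved as data/composite_12mo.nc, the same path used later in this lesson), the lazy read and inspection look like this:

```r
# read the file lazily; proxy = TRUE inspects the structure without
# loading the full array into memory
wsim_gldas_anoms <- stars::read_stars("data/composite_12mo.nc", proxy = TRUE)

# view basic information: attributes and dimensions
print(wsim_gldas_anoms)
```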

Now we can use the print command to view basic information. The output tells us we have 5 attributes (deficit, deficit_cause, surplus, surplus_cause, both) and 3 dimensions. The first 2 dimensions are the spatial extents (x/y–longitude/latitude) and time is the 3rd dimension.

This means that the total number of individual raster layers in this NetCDF is 4020 (5 attributes x 804 time steps/months). Again, BIG data.

Attribute Selection

The WSIM-GLDAS data is quite large, with many variables available. We can manage this large file by selecting a single variable, in this case “deficit” (drought). Read the data back in, this time with proxy = FALSE, selecting only the deficit layer.

# read in just the deficit layer using proxy = FALSE
wsim_gldas_anoms <- stars::read_stars("data/composite_12mo.nc", sub = 'deficit', proxy = FALSE)

Time Selection

Specifying a temporal range of interest will make the file smaller and therefore more manageable. We’ll select one month per year for the range 2000-2014. This can be accomplished by generating a sequence of dates for every December between 2000 and 2014, and then passing that vector of dates to the dplyr::filter() function.

# generate a vector of dates for subsetting
keeps <- seq(lubridate::ymd("2000-12-01"),
           lubridate::ymd("2014-12-01"), 
           by = "year")
#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter the dataset using the vector of dates
wsim_gldas_anoms <- dplyr::filter(wsim_gldas_anoms, time %in% keeps)
# re-check the basic info
print(wsim_gldas_anoms)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.  1st Qu.    Median      Mean   3rd Qu.     Max.  NA's
deficit   -60 -14.7116 -7.280078 -12.64074 -3.486546 4.991005 94083
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1   15     NA    NA POSIXct 2000-12-01,...,2014-12-01    

Now we’re down to a single attribute (“deficit”) with 15 time steps. We can take a look at the first 6 years by passing the object through slice and then into plot.

wsim_gldas_anoms |>
  # slice out the first 6 time steps
  dplyr::slice(index = 1:6, along = "time") |>
  # plot them with some test breaks for the legend
  plot(key.pos = 1, breaks = c(0, -5, -10, -20, -30, -50), key.lab = "Deficit")

Although we have now reduced the data to a single attribute with a restricted time of interest, we can take it a step further and limit the spatial extent to a country or state of interest.

Knowledge Check
  1. Which of these best describe a raster dataset?
    1. A type of geographic data in digital image format.
    2. A table or list of numbers.
    3. A geographic region of interest.
    4. An attribute with a time period of interest.
  2. Which of the following is true about the information that rasters can store? (select all that apply)
    1. Attributes (thematic content)
    2. Dimensions (information expressing spatial or temporal extent information)
    3. Geographic coordinates
    4. A list of numbers
  3. In the R programming language, what does the term vector refer to?
    1. A grid of geographic data.
    2. A collection or list of numbers.
    3. A geographic region of interest.
    4. An attribute with a time period of interest.

Spatial Selection

Data Science Review

GeoJSON is a format for encoding, storing and sharing geographic data as vector data, i.e., points, lines and polygons. It’s commonly used in web mapping applications to represent geographic features and their associated attributes. GeoJSON files are easily readable by both humans and computers, making them a popular choice for sharing geographic data over the internet.

We can spatially crop the raster stack with a few different methods. Options include using a bounding box in which the outer geographic coordinates (xmin, ymin, xmax, ymax) are specified, or using another raster object, or a vector boundary like a shapefile or GeoJSON to set the clipping extent.
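As a quick illustration of the bounding-box approach, a crop can be sketched as follows; the coordinates here are hypothetical values that roughly cover the south-central United States:

```r
# define a bounding box (xmin, ymin, xmax, ymax) in the raster's CRS
bbox <- sf::st_bbox(c(xmin = -107, ymin = 25, xmax = -93, ymax = 37),
                    crs = sf::st_crs(wsim_gldas_anoms))
# crop the raster stack to the bounding box
wsim_gldas_bbox <- sf::st_crop(wsim_gldas_anoms, bbox)
```

In this lesson we take the third route, a vector boundary, as demonstrated below.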

Data Science Review

Built by the community and William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries is an online, open license (CC BY 4.0 / ODbL) resource of information on administrative boundaries (i.e., state, county) for every country in the world. Since 2016, this project has tracked approximately 1 million spatial units within more than 200 entities, including all UN member states.

In this example we use a vector boundary to accomplish the geoprocessing task of clipping the data to an administrative or political unit. First we acquire the data in GeoJSON format for the United States from the geoBoundaries API. (Note it is also possible to download the vectorized boundaries directly from https://www.geoboundaries.org/ in lieu of using the API).

To use the geoBoundaries API, the root URL below is modified to include a three-letter country code from the International Organization for Standardization (ISO3) and an administrative level for the data request. Administrative levels correspond to geographic units such as the Country (administrative level 0), the State/Province (administrative level 1), the County/District (administrative level 2), and so on:

“https://www.geoboundaries.org/api/current/gbOpen/ISO3/LEVEL/”

For this example we adjust the ISO3 and LEVEL components of the sample URL above to specify the United States (ISO3 Character Country Code USA) and the desired Administrative Level of State (ADM1).
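The URL substitution can also be done programmatically; the small helper function below is our own illustration, not part of the geoBoundaries API:

```r
# hypothetical helper: build a geoBoundaries request URL
# from an ISO3 country code and an administrative level
gb_url <- function(iso3, level) {
  sprintf("https://www.geoboundaries.org/api/current/gbOpen/%s/%s/", iso3, level)
}

gb_url("USA", "ADM1")
# "https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/"
```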

# make the request to the geoBoundaries website for the USA boundaries
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")

In the line of code above, we used the function httr::GET to obtain metadata from the URL. We assign the result to a new variable called “usa”. Next we will examine the content.

# parse the content into a readable format
usa <- httr::content(usa)
# look at the labels for available information
names(usa)
 [1] "boundaryID"                "boundaryName"             
 [3] "boundaryISO"               "boundaryYearRepresented"  
 [5] "boundaryType"              "boundaryCanonical"        
 [7] "boundarySource"            "boundaryLicense"          
 [9] "licenseDetail"             "licenseSource"            
[11] "boundarySourceURL"         "sourceDataUpdateDate"     
[13] "buildDate"                 "Continent"                
[15] "UNSDG-region"              "UNSDG-subregion"          
[17] "worldBankIncomeGroup"      "admUnitCount"             
[19] "meanVertices"              "minVertices"              
[21] "maxVertices"               "meanPerimeterLengthKM"    
[23] "minPerimeterLengthKM"      "maxPerimeterLengthKM"     
[25] "meanAreaSqKM"              "minAreaSqKM"              
[27] "maxAreaSqKM"               "staticDownloadLink"       
[29] "gjDownloadURL"             "tjDownloadURL"            
[31] "imagePreview"              "simplifiedGeometryGeoJSON"

The parsed content contains 32 components. Item 29 is a direct link to the GeoJSON file (gjDownloadURL) representing the vector boundary data. Next we will obtain those vectors and visualize the results.

# directly read in the geojson with sf from the geoboundaries server
usa <- sf::st_read(usa$gjDownloadURL)
Reading layer `geoBoundaries-USA-ADM1' from data source 
  `https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM1/geoBoundaries-USA-ADM1.geojson' 
  using driver `GeoJSON'
Simple feature collection with 56 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1489 ymin: -14.54869 xmax: 179.7785 ymax: 71.36516
Geodetic CRS:  WGS 84
# check the visuals
plot(sf::st_geometry(usa))

Upon examination of the plot above, one sees that the boundary includes all US states and overseas territories. For this demonstration, we can simplify it to the contiguous United States. (Of course, it could also be limited to other areas of interest simply by adapting the code below.)

We first create a list of the geographies we wish to remove and assign them to a variable called “drops”. Next, we reassign our “usa” variable to include only those geographies in the continental US and finally, we plot the results.

# create a vector of territories we don't want in our CONUSA boundary
drops <-
  c("Alaska", "Hawaii", 
    "American Samoa",
    "Puerto Rico",
    "Commonwealth of the Northern Mariana Islands", 
    "Guam", 
    "United States Virgin Islands")

# select all the states and territories not in the above list
usa <- usa[!(usa$shapeName %in% drops), ]
# check the visuals
plot(sf::st_geometry(usa))

We can take this a step further and select a single state for analysis. Here we use a slightly different method by creating a new variable called “texas” by calling the state out by name.

# extract just texas from the CONUSA boundary
texas <- usa[usa$shapeName == "Texas",]
# check the visuals
plot(sf::st_geometry(texas))

From here we can clip the WSIM-GLDAS raster stack by using the stored boundary of Texas.

Drought in the News

Texas experienced a severe drought in 2011 that caused rivers to dry up and lakes to reach historic low levels (StateImpact 2014). The drought was further exacerbated by high temperatures related to climate change in February of 2013. Climate experts discovered that the drought was produced by “La Niña”, a weather pattern that causes the surface temperature of the Pacific Ocean to be cooler than normal. This, in turn, creates drier and warmer weather in the southern United States. La Niña can occur for a year or more, and returns once every few years (NOAA 2023).

It is estimated that the drought cost farmers and ranchers about $8 billion in losses (Roeseler 2011). Furthermore, the dry conditions fueled a series of wildfires across the state in early September of 2011, the most devastating of which occurred in Bastrop County, where 34,000 acres and 1,300 homes were destroyed (Roeseler 2011).

# crop the wsim-gldas file to the extent of the texas boundary
wsim_gldas_anoms_tex <- wsim_gldas_anoms[texas]

Finally, we visualize the last time step in the WSIM-GLDAS dataset (the 15th, December 2014) and render it with an overlay of the Texas boundary, using the following steps.

# slice out the 15th time step of wsim-gldas and plot it
wsim_gldas_anoms_tex |>
  dplyr::slice(index = 15, along = "time") |>
  plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
# add the texas boundary on top (col = NA leaves the polygon unfilled)
plot(sf::st_geometry(texas),
     add = TRUE,
     lwd = 3,
     col = NA,
     border = 'purple')

At this point, you may want to ask: does the data look plausible? That is, are sensible values rendered over your region of interest? This simple check helps make sure your subsetting worked as expected. (You will want to use other methods to systematically evaluate the data.) If the results are acceptable, the subsetted dataset may be written to disk as a NetCDF file and saved for future modules.

If you are running this script locally, you can write the subset to disk with the chunk below. If you are working in 2i2c, instead click on the output folder and download the provided wsim_gldas_tex.nc file:

## Use this chunk to download locally. Not for 2i2c
## write the processed wsim-gldas file to disk as a netcdf
# stars::write_mdim(wsim_gldas_anoms_tex, "output/wsim_gldas_tex.nc")

The size of the pre-processed dataset is 1.6 MB compared to the original dataset of 1.7 GB. This is much more manageable in cloud environments, workshops, and git platforms.

If you want to share an image of the plot you created, you can save it to your computer as a .png file by opening the png() device, producing the plot, and closing the device with dev.off().

If you are running this script locally, you can save the image with the chunk below. If you are working in 2i2c, click on the output folder and download the provided wsim_map_plot.png image:

# ## Use this chunk to download locally. Not for 2i2c
# # Save the map plot as a PNG file
# # Specify file name and dimensions 
# png("output/wsim_map_plot.png", width = 4, height = 4, units="in", res=300)  
# wsim_gldas_anoms_tex |>
#   dplyr::slice(index = 15, along = "time") |>
#   plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
# plot(sf::st_geometry(texas),
#      add = TRUE,
#      lwd = 3,
#      col = NA,
#      border = 'purple')
# #close png() device
# dev.off() 

Once you run this code, you can find the file in the location you specified in the call to png(). This allows you to share your findings.

Knowledge Check
  1. There are several options for spatially subsetting (or clipping) a raster/raster stack to a region of interest. What method was used in this lesson?
    1. Using a vector of dates.
    2. Using another raster object.
    3. Specifying a bounding box.
    4. Using a vector boundary dataset.
  2. When running into memory issues, what is something you can do to reduce the computational load?
    1. Work with one time frame or region at a time.
    2. Save it as a new file.
    3. Subset the data to a region of interest/time frame.
    4. Find other data to work with.
  3. What is the importance of subsetting data?
    1. Freeing up space.
    2. Analyzing a certain time or area of interest.
    3. Making code run faster.
    4. All of the above.

In this Lesson, You Learned…

Congratulations! Now you should be able to:

  • Navigate the SEDAC website to find and download datasets.
  • Access administrative boundaries from geoBoundaries using its API.
  • Temporally subset a NetCDF raster stack using R packages such as dplyr and lubridate.
  • Crop a NetCDF raster stack with a spatial boundary.
  • Write a subsetted dataset to disk and create an image to share results.

Lesson 1b

In the next lesson, we will create more advanced visualizations and extract data of interest.

Lesson 1b: WSIM-GLDAS Visualizations and Data Extraction

References

Bland, Alastair. 2014. “California Drought Has Wild Salmon Competing with Almonds for Water.” National Public Radio. https://www.npr.org/sections/thesalt/2014/08/21/342167846/california-drought-has-wild-salmon-competing-with-almonds-for-water.
Guillen, George. 2002. “Klamath River Fish Die-Off.” Mortality Report AFWO-01-03. Arcata, CA: U.S. Fish & Wildlife Service. https://www.trrp.net/DataPort/doc.php?id=301.
IPCC. 2023. “Climate Change 2021: The Physical Science Basis,” June. https://doi.org/10.1017/9781009157896.
ISciences, and Center For International Earth Science Information Network-CIESIN-Columbia University. 2022a. “Documentation for the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/X7FJ-JJ41.
———. 2022b. “Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/Z1FN-KF73.
NOAA. 2019. “Water Cycle.” National Oceanic and Atmospheric Administration. https://www.noaa.gov/education/resource-collections/freshwater/water-cycle.
———. 2023. “What Are El Nino and La Nina?” National Oceanic and Atmospheric Administration. https://oceanservice.noaa.gov/facts/ninonina.html.
Rodgers, Alison Ince. 2023. “Understanding Droughts.” National Geographic Society. https://education.nationalgeographic.org/resource/understanding-droughts/.
Roeseler, Charles. 2011. “Exceptional Drought Continues Across Southeast Texas.” National Weather Service. https://www.weather.gov/media/hgx/stormsignals/vol87.pdf.
StateImpact. 2014. “Everything You Need to Know about the Texas Drought.” National Public Radio. https://stateimpact.npr.org/texas/tag/drought/.
Zhou, Haddeland, T. 2016. “Human-Induced Changes in the Global Water Cycle.” Geophysical Monograph Series. https://doi.org/10.1002/9781118971772.ch4.

Footnotes

  1. Photo Credit, NASA/JPL.