WSIM-GLDAS: Acquisition, Exploration, and Integration

Authors

Josh Brinks

Elaine Famutimi

Published

June 3, 2024

Overview

In this lesson, you will acquire the dataset called Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) from the NASA Socioeconomic Data and Applications Center (SEDAC) website, perform various pre-processing tasks, and explore the dataset with advanced visualizations. We will also integrate WSIM-GLDAS with a global administrative boundary dataset called geoBoundaries and a population dataset called the Gridded Population of the World. You will learn how to perform numerous data manipulations and visualizations throughout this lesson.

Learning Objectives

After completing this lesson, you should be able to:

Retrieve WSIM-GLDAS data from SEDAC.
Retrieve Administrative Boundary using the geoBoundaries API.
Subset WSIM-GLDAS data for a region and time period of interest.
Visualize geospatial data to highlight precipitation deficit patterns.
Write a pre-processed NetCDF-formatted file to disk.
Perform visual exploration with histograms, choropleths, and time series maps.
Integrate gridded population data with WSIM-GLDAS data to perform analyses and construct visualizations to understand how people are impacted.
Summarize WSIM-GLDAS and population raster data using zonal statistics.

Programming Review

This is the R version of this lesson. If you would like to work through the Python version please visit WSIM-GLDAS: Acquisition, Exploration, and Integration (Python)

Introduction

The water cycle is the process of circulation of water on, above, and under the Earth’s surface (NOAA 2019). Human activities produce greenhouse gas emissions, land use changes, dam and reservoir development, and groundwater extraction have affected the natural water cycle in recent decades (IPCC 2023). The influence of these human activities on the water cycle has consequential impacts on oceanic, groundwater, and land processes, influencing phenomena such as droughts and floods (Zhou 2016).

Precipitation deficits, or periods of below-average rainfall, can lead to drought, characterized by prolonged periods of little to no rainfall and resulting water shortages. Droughts often trigger environmental stresses and can create cycles of reinforcement impacting ecosystems and people (Rodgers 2023). For example, California frequently experiences drought but the combination of prolonged dry spells and sustained high temperatures prevented the replenishment of cool fresh water to the Klamath River, which led to severe water shortages in 2003 and again from 2012 to 2014. These shortages affect agricultural areas like the Central Valley, which grows almonds, one of California’s most important crops, with the state producing 80% of the world’s almonds. These severe droughts, coupled with competition for limited freshwater resources, resulted in declining populations of Chinook salmon due to heat stress and gill rot disease disrupting the food supply for Klamath basin tribal groups (Guillen 2002; Bland 2014).

Data Science Review

A raster dataset is a type of geographic data in digital image format with numerical information stored in each pixel. (Rasters are often called grids because of their regularly-shaped matrix data structure.) Rasters can store many types of information and can have dimensions that include latitude, longitude, and time. NetCDF is one format for raster data; others include Geotiff, ASCII, and many more. Several raster formats like NetCDF can store multiple raster layers, or a “raster stack,” which can be useful for storing and analyzing a series of rasters.

The Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948 - 2014) The Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948 - 2014) dataset “identifies and characterizes surpluses and deficits of freshwater, and the parameters determining these anomalies, at monthly intervals over the period January 1948 to December 2014” (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022b). The dataset can be downloaded from the NASA SEDAC website. Downloads of the WSIM-GLDAS data are organized by a combination of thematic variables (composite surplus/deficit, temperature, PETmE, runoff, soil moisture, precipitation) and integration periods (a temporal aggregation) (1, 3, 6, 12 months). Each variable-integration combination consists of a NetCDF raster (.nc) file ( with a time dimension that contains a raster layer for each of the 804 months between January, 1948 and December, 2014. Some variables also contain multiple attributes each with their own time series. Hence, this is a large file that can take a lot of time to download and may cause computer memory issues on certain systems. This is considered BIG data.

Knowledge Check

How would you best describe the water cycle?
1. A prolonged period of little to no rainfall.
2. Low precipitation combined with record temperatures.
3. The circulation of water on and above Earth’s surface.
4. A cycle that happens due to drought.
What human interventions affect the water cycle? (select all that apply)
1. Greenhouse gas emissions
2. Land use changes
3. Dam and reservoir development
4. Groundwater overexploitation
What is a precipitation deficit?
1. A period of rainfall below the average.
2. A prolonged period of little to no rainfall.
3. A period of chain reactions.
4. A period of rainfall above the average.

Acquiring the Data

Data Science Review

The Water Security (WSIM-GLDAS) Monthly Grids dataset used in this lesson is hosted by NASA’s Socioeconomic Data and Applications Center (SEDAC), one of several Distributed Active Archive Centers (DAACs). SEDAC hosts a variety of data products including geospatial population data, human settlements and infrastructure, exposure and vulnerability to climate change, and satellite-based land use, air, and water quality data. In order to download data hosted by SEDAC, you are required to have a free NASA EarthData account. You can create an account here: NASA EarthData.

For this lesson, we will work with the WSIM-GLDAS dataset Composite Anomaly Twelve-Month Return Period NetCDF file. This contains water deficit, surplus, and combined “Composite Anomaly” variables with an integration period of 12 months. Integration period represents the time period at which the anomaly values are averaged over. The 12-month integration averages water deficits (droughts), surpluses (floods), and combined (presence of droughts and floods) over a 12-month period ending with the month specified. We begin with the 12-month aggregation because this is a snapshot of anomalies for the entire year making it useful to get an understanding of a whole year; once we identify time periods of interest in the data, we can take a closer look using the 3-month or 1-month integration periods.

We’ll start by downloading the file directly from the SEDAC website. The dataset documentation describes the composite variables as key features of WSIM-GLDAS that combine “the return periods of multiple water parameters into composite indices of overall water surpluses and deficits (ISciences and Center For International Earth Science Information Network-CIESIN-Columbia University 2022a)”. The composite anomaly files present the data in terms of how often they occur; or a “return period.” For example, a deficit return period of 25 signifies a drought so severe that we would only expect it to happen once every 25 years. Please go ahead and download the file.

First, go to the SEDAC website at https://sedac.ciesin.columbia.edu/. You can explore the website by themes, datasets, or collections. We will use the search bar at the top to search for “water security wsim”. Find and click on the Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948–2014) dataset. Take a moment to review the dataset’s overview and documentation pages.
When you’re ready, click on the Data Download tab. You will be asked to sign in using your NASA EarthData account.
Find the Composite Class, and click on the Variable Composite Anomaly Twelve-Month Return Period.

Reading the Data

Data Science Review

This lesson uses the stars, terra, sf, lubridate, httr, cubelyr, exactextractr, ggplot2, data.table, and kableExtra packages. If you’d like to learn more about the functions used in this lesson you can use the help guides on their package websites.

The stars and terra packages are designed to work with large and complex raster spatial data while the sf package lets you handle and analyze vector spatial data (more on that later). The cubelyr package helps you create and analyze multidimensional data cubes, making it easier to explore complex datasets and discover patterns. The lubridate package makes it easy to work with dates and times in R, and httr provides methods for users to interact with online databases and websites to directly download data. exactextractr performs fast and efficient zonal statistics that summarizes raster and vector data. data.table makes standard data processing much easier, and lastly, ggplot2 and kableExtra are used for creating plots, maps, and tables.

Make sure they are installed before you begin working with the code in this document. If you’d like to learn more about the functions used in this lesson you can use the help guides on their package websites.

Once you have downloaded the WSIM-GLDAS file to your local computer, install and load the R packages required for this exercise. You can do this by defining the list of packages and assigning them to the new variable called “packages_to_check”. Next we loop (iterate) through each of the packages in the list to see if they are already installed. If they are we continue to the next item, and if they aren’t then we go ahead and install them.

install.packages('stars')
install.packages('terra')
install.packages('sf')
install.packages('cubelayer')
install.packages('lubridate')
install.packages('httr')
install.packages('data.table')
install.packages('exactextractr')
install.packages('ggplot2')
install.packages('kableExtra')

Once you’ve completed the download and placed the composite_12mo.nc file into your working directory and read the file with the stars::read_stars() function.

Reading the file with the argument proxy = TRUE allows you to inspect the basic elements of the file without reading the whole file into memory. Multidimensional raster datasets can be extremely large and bring your computing environment to a halt if you have memory limitations.

# read in the 12 month integration WSIM-GLDAS file with stars
wsim_gldas_anoms <- stars::read_stars("data/composite_12mo.nc", proxy = TRUE)

deficit, deficit_cause, surplus, surplus_cause, both,

Now we can use the print() command to view basic information. The output tells us we have 5 attributes (deficit, deficit_cause, surplus, surplus_cause, both) and 3 dimensions. The first 2 dimensions are the spatial extents (x/y–longitude/latitude) and time is the 3rd dimension.

This means that the total number of individual raster layers in this NetCDF is 4020 (5 attributes x 804 time steps/months). Again, BIG data.

# check the basic info
print(wsim_gldas_anoms)

stars_proxy object with 5 attributes in 5 file(s):
$deficit
[1] "[...]/composite_12mo.nc:deficit"

$deficit_cause
[1] "[...]/composite_12mo.nc:deficit_cause"

$surplus
[1] "[...]/composite_12mo.nc:surplus"

$surplus_cause
[1] "[...]/composite_12mo.nc:surplus_cause"

$both
[1] "[...]/composite_12mo.nc:both"

dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1  793     NA    NA POSIXct 1948-12-01,...,2014-12-01

Attribute Selection

We can manage this large file by selecting a single variable; in this case “deficit” (drought). Read the data back in; this time with proxy = FALSE to read the data into actual memory, but only selecting the deficit layer to make it more manageable.

# read in just the deficit layer using proxy = false
wsim_gldas_anoms <- stars::read_stars("data/composite_12mo.nc", sub = 'deficit', proxy = FALSE)

deficit,

Time Selection

Specifying a temporal range of interest will make the file size smaller and even more manageable. We’ll select every year for the range 2000-2014. This can be accomplished by generating a sequence for every year between December 2000 and December 2014, and then passing that list of dates, in the form of a vector, using the function filter. Remember, we’re using the 12-month integration of WSIM-GLDAS. This means that each time step listed averages the deficit over the 12 prior months. Therefore, if we only select a sequence of December months spanning 2000-2014, each resulting layer will represent the average deficit for that year.

# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2000-12-01"),
           lubridate::ymd("2014-12-01"), 
           by = "year")
#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter the dataset using the vector of dates
wsim_gldas_anoms <- dplyr::filter(wsim_gldas_anoms, time %in% keeps)
# re-check the basic info
print(wsim_gldas_anoms)

stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.  1st Qu.    Median      Mean   3rd Qu.     Max.  NA's
deficit   -60 -14.7116 -7.280078 -12.64074 -3.486546 4.991005 94083
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1   15     NA    NA POSIXct 2000-12-01,...,2014-12-01

Now we’re down to a single attribute (“deficit”) with 15 time steps. We can take a look at the first 6 years by passing the object through slice and then into plot.

wsim_gldas_anoms |>
  # slice out the first 6 time steps
  dplyr::slice(index = 1:6, along = "time") |>
  # plot them with some test breaks for the legend
  plot(key.pos = 1, breaks = c(0, -5, -10, -20, -30, -50), key.lab = "Deficit")

Although we have now reduced the data to a single attribute with a restricted time of interest, we can take it a step further and limit the spatial extent to a country or state of interest.

Knowledge Check

Which of these best describe a raster dataset?
1. A type of geographic data made up of square cells.
2. A table or list of geographic numbers.
3. A geographic region of interest.
4. A type of geographic data made up of lines and points.
Which of the following is true about the information that rasters can store? (select all that apply)
1. Attributes (thematic content)
2. Dimensions (information expressing spatial or temporal extent information)
3. Geographic coordinates
4. A list of numbers

Spatial Selection

Data Science Review

GeoJSON is a format for encoding, storing and sharing geographic data as vector data, i.e., points, lines and polygons. It’s commonly used in web mapping applications to represent geographic features and their associated attributes. GeoJSON files are easily readable by both humans and computers, making them a popular choice for sharing geographic data over the internet.

We can spatially crop the raster stack with a few different methods. Options include using a bounding box in which the outer geographic coordinates are specified (xmin, ymin, xmax, ymax), using another raster object, or using a vector boundary like a shapefile or GeoJSON to crop the extent of the original raster data.

In this example we use a vector boundary to accomplish the geoprocessing task of cropping the data to an administrative or political unit. First, we acquire the data in GeoJSON format for the United States from the geoBoundaries API. (Note it is also possible to download the vectorized boundaries directly from https://www.geoboundaries.org/ in lieu of using the API).

To use the geoBoundaries’ API, the root URL below is modified to include a 3 letter code from the International Standards Organization used to identify countries (ISO3), and an administrative level for the data request. Administrative levels correspond to geographic units such as the Country (administrative level 0), the State/Province (administrative level 1), the County/District (administrative level 2), and so on:

https://www.geoboundaries.org/api/current/gbOpen/ISO3/LEVEL/

Data Science Review

Built by the community and William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries is an online, open license (CC BY 4.0 / ODbL) resource for administrative boundaries (i.e., state, county, province) for every country in the world. Since 2016, this project has tracked approximately 1 million spatial units within more than 200 entities; including all UN member states.

For this example we adjust the bolded components of the sample URL address below to specify the country we want using the ISO3 Character Country Code for the United States (USA) and the desired Administrative Level of State (ADM1).

# make the request to geoboundarie's website for the USA boundaries
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")

In the line of code above, we used httr:GET to obtain metadata from the URL. We assign the result to a new variable called “usa”. Next we will examine the content.

# parse the content into a readable format
usa <- httr::content(usa)
# look at the labels for available information
names(usa)

 [1] "boundaryID"                "boundaryName"             
 [3] "boundaryISO"               "boundaryYearRepresented"  
 [5] "boundaryType"              "boundaryCanonical"        
 [7] "boundarySource"            "boundaryLicense"          
 [9] "licenseDetail"             "licenseSource"            
[11] "boundarySourceURL"         "sourceDataUpdateDate"     
[13] "buildDate"                 "Continent"                
[15] "UNSDG-region"              "UNSDG-subregion"          
[17] "worldBankIncomeGroup"      "admUnitCount"             
[19] "meanVertices"              "minVertices"              
[21] "maxVertices"               "meanPerimeterLengthKM"    
[23] "minPerimeterLengthKM"      "maxPerimeterLengthKM"     
[25] "meanAreaSqKM"              "minAreaSqKM"              
[27] "maxAreaSqKM"               "staticDownloadLink"       
[29] "gjDownloadURL"             "tjDownloadURL"            
[31] "imagePreview"              "simplifiedGeometryGeoJSON"

The parsed content contains 32 components. Item 29 is a direct link to the GeoJSON file (gjDownloadURL) where the vector boundary data is located. Next we will obtain the GeoJSon and check the results.

# directly read in the geojson with sf from the geoboundaries server
usa <- sf::st_read(usa$gjDownloadURL)

Reading layer `geoBoundaries-USA-ADM1' from data source 
  `https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM1/geoBoundaries-USA-ADM1.geojson' 
  using driver `GeoJSON'
Simple feature collection with 56 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1489 ymin: -14.54869 xmax: 179.7785 ymax: 71.36516
Geodetic CRS:  WGS 84

# check the visuals
plot(sf::st_geometry(usa))

Upon examination, we see that this GeoJSon includes all US states and overseas territories. For this demonstration, we can simplify it to the contiguous United States. (Of course, it could also be simplified to other areas of interest simply by adapting the code below.)

We first create a list of the geographies we wish to remove and assign them to a variable called “drops”. Next, we reassign our “usa” variable to include only the entries in the continental US and finally we plot the results.

# create a vector of territories we don't want in our CONUSA boundary
drops<-
  c("Alaska", "Hawaii", 
    "American Samoa",
    "Puerto Rico",
    "Commonwealth of the Northern Mariana Islands", 
    "Guam", 
    "United States Virgin Islands")

# select all the states and territories not in the above list
# in R the "!" signifies NOT or the OPPOSITE
usa<-usa[!(usa$shapeName %in% drops),]
# check the visuals
plot(sf::st_geometry(usa))

We can take this a step further and select a single state for analysis. Here we use a slightly different method by creating a new object called “texas” by subsetting the state out by name.

# extract just texas from the CONUSA boundary
texas <- usa[usa$shapeName == "Texas",]
# check the visuals
plot(sf::st_geometry(texas))

From here we can clip the WSIM-GLDAS raster stack by using the stored boundary of Texas. You can call the sf::st_crop() function to crop the WSIM-GLDAS layer, but as you see below, more simply, you can just use bracket indexing to crop a stars object with a sf object.

Drought in the News

Texas experienced a severe drought in 2011 that caused rivers to dry up and lakes to reach historic low levels (StateImpact 2014). The drought was further exacerbated by high temperatures related to climate change in February of 2013. Climate experts discovered that the drought was produced by “La Niña”, a weather pattern that causes the surface temperature of the Pacific Ocean to be cooler than normal. This, in turn, creates drier and warmer weather in the southern United States. La Niña can occur for a year or more, and returns once every few years (NOAA 2023).

It is estimated that the drought cost farmers and ranchers about $8 billion in losses.(Roeseler 2011) Furthermore, the dry conditions fueled a series of wildfires across the state in early September of 2011, the most devastating of which occurred in Bastrop County, where 34,000 acres and 1,300 homes were destroyed (Roeseler 2011).

# crop the wsim-gldas file to the extent of te texas boundary
wsim_gldas_anoms_tex <- wsim_gldas_anoms[texas]

Finally, we visualize the last time-step in the WSIM-GLDAS dataset (15/December, 2014) and render it with an overlay of the Texas boundary to perform a visual check of our processing.

# slive out the first 15 timesteps of wsim-gldas and plot them
wsim_gldas_anoms_tex |>
  dplyr::slice(index = 15, along = "time") |>
  # plot with 
  plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
# add the texas boundary on top
plot(sf::st_geometry(texas),
     add = TRUE,
     lwd = 3,
     fill = NA,
     border = 'purple')

Voila! We successfully cropped the WSIM-GLDAS to the extent of Texas and created an overlay map with both dat sets to check the results. If you were carrying out further analysis or wanted to share your work with colleagues, you may want to save the processed WSIM-GLDAS to disk.

Multidimensional (deficit, time, latitude, longitude) raster files can be saved with the stars::write_mdim function and vector data can be saved using sf::st_write.

# write the processed wsim-gldas file to disk as a netcdf
stars::write_mdim(wsim_gldas_anoms_tex, "output/wsim_gldas_tex.nc")
# write the Texas boundary to disk
sf::st_write(texas, "output/texas.geojson")

The size of the pre-processed dataset is 1.6 MB compared to the original dataset of 1.7 GB. This is much more manageable in cloud environments, workshops, and git platforms. T

Advanced Visualizations and Data Integrations

Now that we’ve introduced the basics of manipulating and visualizing WSIM-GLDAS, we can explore more advanced visualizations and data integrations. Let’s clear the workspace and start over again with the same WSIM-GLDAS Composite Anomaly Twelve-Month Return Period we used earlier. We will spatially subset the data to cover only the Continental United States (CONUSA) which will help to minimize our memory footprint. We can further reduce our memory overhead by reading in just the variable we want to analyze. In this instance we can read in just the deficit attribute from the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period file, rather than reading the entire NetCDF with all of its attributes.

Coding Review

Random Access Memory (RAM) is where data and programs that are currently in use are stored temporarily. It allows the computer to quickly access data, making everything you do on the computer faster and more efficient. Unlike the hard drive, which stores data permanently, RAM loses all its data when the computer is turned off. RAM is like a computer’s short-term memory, helping it to handle multiple tasks at once.

For this exercise, we can quickly walk through similar pre-processing steps we performed earlier in this lesson and then move on to more advanced visualizations and integrations. Read the original 12-month integration data back in, filter with a list of dates for each December spanning 2000-2014, and then crop the raster data with the boundary of the contiguous United States using our geoBoundaries object.

# read in the wsim-gldas layer from SEDAC
wsim_gldas <- stars::read_stars("data/composite_12mo.nc", proxy = FALSE, sub = 'deficit')

deficit,

# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2000-12-01"),
           lubridate::ymd("2014-12-01"), 
           by = "year")

#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter using that vector
wsim_gldas <- dplyr::filter(wsim_gldas, time %in% keeps)
# crop to conusa
wsim_gldas<-wsim_gldas[usa]
#rename the time dimension to just the years each time-step represents
wsim_gldas <-
  wsim_gldas |> stars::st_set_dimensions("time", values = as.character(seq(2000,2014)))

Double check the object information.

# check the basic information again
print(wsim_gldas)

stars object with 3 dimensions and 1 attribute
attribute(s):
         Min.   1st Qu.    Median      Mean   3rd Qu. Max.   NA's
deficit   -60 -9.821823 -4.206245 -7.941395 -2.086784   60 153810
dimension(s):
     from  to offset delta refsys        values x/y
x     221 453   -180  0.25 WGS 84          NULL [x]
y     163 262     90 -0.25 WGS 84          NULL [y]
time    1  15     NA    NA     NA 2000,...,2014

You will want to review the printout to make sure it looks okay.

Does it contain the variables you were expecting?
Do the values for the variables seem plausible?

Other basic descriptive analyses are useful to verify and understand your data. One of these is to produce a frequency distribution (also known as a histogram), which is reviewed below.

Knowledge Check

What are the two ways we reduced our memory footprint when we loaded the WSIM-GLDAS dataset for this lesson? (select all that apply)
1. Spatially subsetting the data to the CONUSA
2. Reading in only one year of data
3. Reading in only the attribute of interest (i.e., ‘deficit’)
4. All of the above.

Annual CONUSA Time Series

The basic data properties reviewed in the previous step are useful for exploratory data analysis, but we should perform further inspection. We can start our visual exploration of annual drought in the CONUSA by creating a map illustrating the deficit return period for each of the years in the WSIM-GLDAS object.

# load the base data of the usa boundary
ggplot2::ggplot(usa)+
  # plot the stars wsim object
  stars::geom_stars(data = wsim_gldas)+
  # set equal coordinates for axes
  ggplot2::coord_equal()+
  # create multiple panels for each time step
  ggplot2::facet_wrap(~time)+
  # plot the usa boundary with just the outline
  ggplot2::geom_sf(fill = NA)+
  # add the palette for wsim-gldas 
  ggplot2::scale_fill_stepsn(
    colors = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the breaks on the palette
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add plot labels
  ggplot2::labs(
    title="Annual Mean Deficit Anomalies for the CONUSA",
    subtitle = "Using Observed 12 Month Integrated WSIM-GLDAS Data for 2000-2014",
    fill = "Deficit Return Period"
  )+
  # set the minimal theme
  ggplot2::theme_minimal()+
  # turn off some extra graphical elements
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())

This visualization shows that there were several significant drought events (as indicated by dark red deficit return-period values) throughout 2000-2014. Significant drought events included the southeast in 2000, the southwest in 2002, the majority of the western 3rd in 2007, Texas-Oklahoma in 2011, Montana-Wyoming-Colorado in 2012, and the entirety of the California coast in 2014. The droughts of 2012 and 2011 are particularly severe and widespread with return periods greater than 50 years covering multiple states. Based on historical norms, we should only expect droughts this strong every 50-60 years!

Knowledge Check

The output maps of Annual Mean Deficit Anomalies for the CONUSA indicate that…
1. The most significant deficit in 2004 is located in the western United States.
2. The least significant deficit 2003 is located in the Midwest.
3. The most significant deficit in 2011 is located around Texas and neighboring states.
4. The least significant deficit in 2000 is located in the southeast.

Monthly Time Series

We can get a more detailed look at these drought events by using the 1-month composite WSIM-GLDAS dataset and cropping the data to a smaller spatial extent matching one of the events we’ve noted in the previous plot. Let’s take a closer look at the 2014 California drought.

Drought in the News

The California drought of 2012-2014 was the worst in 1,200 years (“Evidence Suggests California’s Drought Is the Worst in 1,200 Years” 2014). This drought caused problems for homeowners, and even conflicts between farmers and wild salmon! Governor Jerry Brown declared a drought emergency and called on residents to reduce water intake by 20%. Water use went up by 8% in May of 2014 compared to 2013, in places like coastal California and Los Angeles. Due to the water shortages, the state voted to fine water-wasters up to $500 dollars. The drought also affected residents differently based on economic status. For example, in El Dorado County, located in a rural area east of Sacramento, residents were taking bucket showers and rural residents reported wells, which they rely on for fresh water, were drying up. The federal government eventually announced a $9.7 million emergency drought aid for those areas (Sanders 2014).

Additionally, there were thousands of adult salmon struggling to survive in the Klamath River in Northern California, where water was running low and warm due to the diversion of river flow into the Central Valley, an agricultural area that grows almond trees. Almonds are one of California’s most important crops, with the state producing 80% of the world’s almonds. However, salmon, which migrate upstream, could get a disease called gill rot, which flourishes in warm water and already killed tens of thousands of Chinook in 2002. This disease was spreading through the salmon population again due to this water allocation, affecting local Native American tribes that rely on the fish (Bland 2014).

In order to limit the amount of computing memory required for the operation, we will first clear items from the in-memory workspace and then reload a smaller composite file, we’ll start by removing the 12-month composite object.

# remove the large wsim object from the environment
rm(wsim_gldas)

Now let’s load the composite 1-month file from SEDAC into the workspace. The attributes and dimensions will be the same as the 12-month integration so we’ll skip ahead to directly loading in the deficit variable without performing a check on the structure using proxy = TRUE.

# clear the memory; R has an automated garbage collector to clear unused memory
# but you can invoke it by using the gc() command
gc()

           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  1403795  75.0    2192305  117.1   2192305  117.1
Vcells 17407142 132.9  539326463 4114.8 748265085 5708.9

# read in the 1 month wsim-gldas data
wsim_gldas_1mo <- stars::read_stars("data/composite_1mo.nc", sub = 'deficit', proxy = FALSE)

deficit,

# check the basic info
print(wsim_gldas_1mo)

stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.   1st Qu.    Median     Mean   3rd Qu.     Max.  NA's
deficit   -60 -29.83644 -2.461974 -16.9758 -2.461974 34.54536 87340
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1  804     NA    NA POSIXct 1948-01-01,...,2014-12-01

Once again, we’ll subset the time dimension for our period of interest. However, this time we want every month for 2014 so we can take a closer look at the California drought.

# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2014-01-01"),
           lubridate::ymd("2014-12-01"), 
           by = "month")
#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter using that vector
wsim_gldas_1mo <- dplyr::filter(wsim_gldas_1mo, time %in% keeps)
# check the info
print(wsim_gldas_1mo)

stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.   1st Qu.    Median      Mean   3rd Qu.     Max.  NA's
deficit   -60 -12.92903 -6.746436 -14.71285 -6.746436 15.66717 87340
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1   12     NA    NA POSIXct 2014-01-01,...,2014-12-01

Now we have 12 rasters with monthly data for 2014. Let’s zoom in on California and see how this drought progressed over the course of the year.

# isolate only the california border
california<-usa[usa$shapeName=="California",]
# subset wsim-gldas to the extent of the california boundary
wsim_gldas_california <- wsim_gldas_1mo[california]
# give the time dimension pretty labels
wsim_gldas_california <-
  wsim_gldas_california |>
    stars::st_set_dimensions("time", values = month.name)

# monthly plots of california
# load the base california data
ggplot2::ggplot(california)+
  # load the wsim stars object
  stars::geom_stars(data = wsim_gldas_california)+
  # equal aspect ratio for lat/long axe
  ggplot2::coord_equal()+
  # create a subplot for each time step
  ggplot2::facet_wrap(~time)+
  # only plot the outline of california
  ggplot2::geom_sf(fill = NA)+
  # set the wsim palette
  ggplot2::scale_fill_stepsn(
    colors = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the wsim breaks
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add labels
  ggplot2::labs(
    title="Deficit Anomalies for California",
    subtitle = "Using Observed Monthly WSIM-GLDAS Data for 2014",
    fill = "Deficit Return Period"
  )+
  # start with a base minimal theme
  ggplot2::theme_minimal()+
  # remove additional elements to make a cleaner map
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())

This series of maps shows a startling picture. California faced massive water deficits throughout the state in January and February. This was followed by water deficits in the western half of the state in May-August. Although northern and eastern California saw some relief by September, southwest California continued to see deficits through December.

Monthly Histograms

Data Science Review

A data frame is a data structure used for storing tabular data. It organizes data in rows and columns. Each column can have a different type of data (numeric, character, factor, etc.), and rows represent individual observations or cases. Data frames provide a convenient way to work with structured data, making them essential for data analysis and statistics projects.

We can explore the data further by creating a frequency distribution (also called a histogram) of the deficit anomalies for any given spatial extent; here we are still looking at the deficit anomalies in California. We start by extracting the data from the raster time series and then created a data frame of values that are easier to manipulate into a histogram. R data frames are data displayed in table format, which can be plotted on graphs or charts.

# extract the raster values into a dataframe
deficit_hist <-  
  wsim_gldas_california |>
  as.data.frame(wsim_gldas_california$deficit)

# remove the NA values
deficit_hist<-stats::na.omit(deficit_hist)

# create the histogram
ggplot2::ggplot(deficit_hist, ggplot2::aes(deficit))+
  # style the bars
  ggplot2::geom_histogram(binwidth = 6, fill = "#325d88")+
  # subplot for each timestep
  ggplot2::facet_wrap(~time)+
  # use the minimal theme
  ggplot2::theme_minimal()

The histograms start to quantify what we saw in the time series maps. Whereas the map shows where the deficits occur, the frequency distribution indicates the number of raster cells for each return period of the deficit range. The number of raster cells under a 60-year deficit (return period) is very high in most months, far exceeding any other value in the range.

Knowledge Check

What’s another term for ‘histogram’?
1. Choropleth
2. Frequency Distribution
3. Data array
4. Box plot

Zonal Summaries

To this point we’ve described the 2014 California drought by examining the state as a whole. Although we have a sense of what’s happening in different cities or counties by looking at the maps, they do not provide quantitative summaries of local areas.

Zonal statistics are one way to summarize the cells of a raster layer that lie within the boundary of another data layer (which may be in either raster or vector format). For example, aggregating deficit return periods with another raster depicting land cover type or a vector boundary (shapefile) of countries, states, or counties, will produce descriptive statistics by that new layer. These statistics could include the sum, mean, median, standard deviation, and range.

For this section, we begin by calculating the mean deficit return period within California counties. First, we retrieve a vector dataset of California counties from the geoBoundaries API. Since geoBoundaries does not attribute which counties belong to which states, we utilize a spatial operation called intersect to select only those counties in California.

# get the adm2 (county) level data for the USA
# request the adm2 usa information
cali_counties <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM2/")
# parse the content to find the direct download link
cali_counties <- httr::content(cali_counties)
# read in the geojson directly from geoboundaries
cali_counties <- sf::st_read(cali_counties$gjDownloadURL)

Reading layer `geoBoundaries-USA-ADM2' from data source 
  `https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM2/geoBoundaries-USA-ADM2.geojson' 
  using driver `GeoJSON'
Simple feature collection with 3233 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1474 ymin: -14.55254 xmax: 179.7785 ymax: 71.35256
Geodetic CRS:  WGS 84

# geoBoundaries does not list which counties belong to which state so you need to run an intersection
# intersect the usa adm2 data with just the california boundary
cali_counties<-sf::st_intersection(cali_counties, california)
# plot the result
plot(sf::st_geometry(cali_counties))

The output of that intersection looks as expected. As noted above, in general a visual and/or tabular check on your data layers is always a good idea. If you expect 50 counties in a given state, you should see 50 counties resulting from your intersection of your two layers, etc. You may want to be on the look out for too few (such as an island area that may be in one layer but not the other) or too many counties (such as those that intersect with a neighboring state).

Coding Review

The exactextractr (Daniel Baston 2023) R package summarizes raster values over groupings, or zones, also known as zonal statistics. Zonal statistics help in assessing the statistical characteristics of a certain region.

The terra R package processes raster geospatial data, offering functionalities such as data manipulation, spatial analysis, modeling, and visualization, with a focus on efficiency and scalability.

We will perform our zonal statistics using the exactextractr package (Daniel Baston 2023). It is the fastest, most accurate, and most flexible zonal statistics tool for the R programming language, but it currently has no default methods for the stars package, so we’ll switch to terra for this portion of the lesson. You’ll notice that there are slight differences in syntax to perform the same operations in terra as stars.

Read the data back in with terra.

# remove the previous 1 month object
rm(wsim_gldas_1mo)

# create a terra collection with the wsim-gldas 1 month file from SEDAC
wsim_gldas_1mo<-terra::sds("data/composite_1mo.nc")

We’ll check the basic info with print.

print(wsim_gldas_1mo)

class       : SpatRasterDataset 
subdatasets : 5 
dimensions  : 600, 1440 (nrow, ncol)
nlyr        : 804, 804, 804, 804, 804 
resolution  : 0.25, 0.25  (x, y)
extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source(s)   : composite_1mo.nc 
names       : deficit, deficit_cause, surplus, surplus_cause, both

You can see that the information from terra is arranged differently but it’s all still there. The attributes are numbered in the “subdatasets” (5) and listed by name in the “names” column. The time dimension is not listed by name, but “nlyr” tells us there are 804 layers for each attribute/subdataset. You can access more detailed information for the time dimensions using terra::time.

Subset the data along the time dimension.

# pull out just the deficit layer
wsim_gldas_1mo<-wsim_gldas_1mo["deficit"]
# create a sequence of dates to keep
keeps<-seq(lubridate::ymd("2014-01-01"), lubridate::ymd("2014-12-01"), by = "month")
# subset the terra object for only those time steps
wsim_gldas_1mo<-wsim_gldas_1mo[[terra::time(wsim_gldas_1mo) %in% keeps]]
# label the time steps
names(wsim_gldas_1mo) <- keeps
# check the info
print(wsim_gldas_1mo)

class       : SpatRaster 
dimensions  : 600, 1440, 12  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : composite_1mo.nc:deficit 
varname     : deficit (Composite Deficit Index) 
names       : 2014-01-01, 2014-02-01, 2014-03-01, 2014-04-01, 2014-05-01, 2014-06-01, ... 
time (days) : 2014-01-01 to 2014-12-01

Now that we’ve selected just the “deficit” variable terra lists the “varname” as deficit and shows t he time-steps in the “names” column. A little confusing but terra changes the display once the object goes from multiple variables to just one.

Time to perform the zonal summary.

# run the extraction
cali_county_summaries<-
  exactextractr::exact_extract(
    # the raster data
    wsim_gldas_1mo, 
    # the boundary data
    cali_counties, 
    # the calculation
    'mean', 
    # don't show progress
    progress = FALSE)
# give the timestep prettier labels
names(cali_county_summaries)<-lubridate::month(keeps, label = TRUE, abbr = FALSE)

exactextractr will return summary statistics in the same order of the input boundary file, therefore we can join the California county names to the exactextract summary statistics output for visualization. We will also make a version to view as a table to inspect the raw data. We can take a quick look at the first 10 counties to see their mean deficit return period for January-June.

# bind the extracted means with the california boundary
cali_counties<-cbind(cali_counties, cali_county_summaries)
# prepare a version to create a table
cali_county_table<-cbind(County=cali_counties$shapeName,
                         round(cali_county_summaries))
# create the table with only the first 10 rows and 7 columns
kableExtra::kbl(cali_county_table[c(1:10),c(1:7)]) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"))

County	January	February	March	April	May	June
Alpine	-42	-16	-14	-8	-28	-29
Mohave	-38	-10	-10	-13	-60	-21
Tuolumne	-46	-19	-16	-8	-42	-24
Tehama	-58	-50	-24	-21	-28	-47
Trinity	-60	-46	-46	-26	-44	-60
Curry	-60	-56	-26	-4	-9	-55
San Benito	-60	-60	-56	-23	-60	-60
Josephine	-60	-25	-33	-14	-18	-35
Sierra	-19	-19	-11	-11	-18	-28
El Dorado	-44	-17	-12	-9	-34	-29

This confirms the widespread distribution of high deficit values (all the bright red) in our exploratory maps. The data is currently in wide format, which makes for easy viewing of a time series, but more advanced programmatic visualization typically requires data to be in a long format (more on that later).

Knowledge Check

What can zonal statistics be used for?
1. Subsetting data to a time period of interest.
2. Creating a time series.
3. Summarizing the values of cells that lie within a boundary.
4. All of the above.

County Choropleths

Now that we’ve inspected the raw data we can make a choropleth out of the mean deficit return period data. We previously demonstrated more complex maps using ggplot2, sf, and stars, but you can also make quick plots of sf objects with the base plotting function. By default sf will make a map for every column listed in the dataset. In this case we only want to look at the monthly means so we will just plot columns 11 through 23. You can make simple alterations to the color palette and position of the legend, but custom map titles and legend titles are not easily accomplished with multi-panel maps (multiple maps in one) like the one pictured below.

# plot the data for a check using only the 12 monthly summaries in columns 11 through 23
plot(cali_counties[c(11:23)],
     # the palette we've been using
     pal = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # the breaks we want
    breaks = c(-61, -50, -40, -20, -10,-5,-3, 5), 
    # where we're placing the legend
    key.pos = 1, 
    # how thick we want the legend to be
    key.width = 0.3)

Due to the widespread water deficits in the raw data, the mean values do not appear much different from the raw deficit raster layer, however, choropleth maps, also called thematic maps, can make it easier for users to survey the landscape by visualizing familiar places (like counties) that place themselves and their lived experiences alongside the data.

While this paints a striking picture of widespread water deficits, how many people are affected by this drought? Although the land area appears rather large, if one is not familiar with the distribution of population and urban centers in California it can be difficult to get a sense of the direct human impact. (This is partly because more populous locations are usually represented by smaller land areas and the less populous locations are usually represented by large administrative boundaries containing much more land area). Normalizing a given theme by land area may be something an analyst wants to do but we cover another approach below.

Knowledge Check

Choropleth maps, aka thematic maps, are useful for
1. Visualizing data in familiar geographies like counties or states
2. Finding directions from one place to another
3. Visualizing data in uniform geographic units like raster grid cells.
4. Calculating return periods.

Integrating Population Data

Gridded Population of the World (GPW) is a data collection from SEDAC that models the distribution of the global human population as counts and densities in a raster format (Center For International Earth Science Information Network-CIESIN-Columbia University 2018).We will take full advantage of exactextractr to integrate across WSIM-GLDAS, geoBoundaries, and GPW. To begin, we need to download the 15-minute resolution (roughly 30 square kilometer at the equator) population density data for the year 2015 from GPW. This version of GPW most closely matches our time period (2014) and the resolution of WSIM-GLDAS (0.25 degrees). Although in many applications one might choose to use GPW’s population count data layers, because we are using exactextractr we can achieve more accurate results (especially along coastlines) by using population density in conjunction with land area estimates from the exactextractr package.

Data Review

The Gridded Population of the World Version 4 is available in multiple target metrics (e.g. counts, density), periods (2000, 2005, 2010, 2015, 2020), and spatial resolutions (30 sec, 2.5 min, 15 min, 30 min, 60 min). Read more about GPW at the collection home page on SEDAC. GPW is one of four global gridded population datasets available in raster format. These datasets vary in the degree to which they use additional information as ancillary variables to model the spatial distribution of the population from the administrative units (vector polygons) in which they originate. A review of these datasets and their underlying models is found in a paper by Leyk and colleagues (Leyk et al. 2019). You can learn more about gridded population datasets at POPGRID.org. Fitness-for-use is an important principle in determining the best dataset to use for a specific analysis. The question we ask here is — what is the population exposure to different levels of water deficit in California? — uses spatially coarse inputs and is for a place with high-quality data inputs, GPW is a good choice for this analysis. Users with vector-format census data (at the county or sub-county level) could also adopt this approach for those data. In the case of California, the US Census data and GPW will produce nearly identical estimates because GPW is based on the census inputs.

Load in the population density layers.

# read in GPW with terra
pop_dens<-terra::rast("data/gpw_v4_population_density_rev11_2015_15_min.tif")
# check the basic information
print(pop_dens)

class       : SpatRaster 
dimensions  : 720, 1440, 1  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : gpw_v4_population_density_rev11_2015_15_min.tif 
name        : gpw_v4_population_density_rev11_2015_15_min

For this example we’ll classify the WSIM-GLDAS deficit return period raster layer into eight categories. Binning the data will make it easier to manage the output and interpret the results.

# set the class breaks row-wise (from, to, new label)
# e.g. first row states: "all return periods from 0 to 5 will now be labeled 0
m <- c(0, 5, 0,
       -3, 0, -3,
       -5, -3, -5,
       -10, -5, -10,
       -20, -10, -20,
       -40, -20, -40,
       -50, -40, -50,
       -65, -50, -60)
# convert the vector into a matrix
rclmat <- matrix(m, ncol=3, byrow=TRUE)
# classify the data
wsim_gldas_1mo_class <-
  terra::classify(wsim_gldas_1mo, rclmat, include.lowest = TRUE)

In our previous example, we used exactextractr’s built-in 'mean' function, but we can also pass custom functions to exactextractr that will carry out several operations at once as well. The following code could be combined into a single function passed to exactextractr, but it is presented here as multiple functions in order to follow along more easily. You can read more about exactextractr arguments in the package help guide. The key arguments to be aware of are the calls to:

weights = pop_dens: summarizes each WSIM-GLDAS cell’s deficit return period with the corresponding population density value.
coverage_area = TRUE: calculates the corresponding area of the WSIM-GLDAS raster cell that is covered by the California boundary.

# run the extraction
pop_by_rp <-
  exactextractr::exact_extract(wsim_gldas_1mo_class, california, function(df) {
    df <- data.table::setDT(df)}, 
  # convert output to data frame
  summarize_df = TRUE, 
  # specify the weights we're using
  weights = pop_dens, 
  # return the coverage area (m^2)
  coverage_area = TRUE,
  # return the county name with the output data
  include_cols = 'shapeISO', 
  # don't show progress
  progress = FALSE)

This returns a data.frame with a row for every raster cell in the WSIM-GLDAS layer that is overlapped by the California boundary. Let’s take a look at the first 6 rows.

head(pop_by_rp)

   shapeISO 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
1:    US-CA        -60        -40        -40        -20        -40        -50
2:    US-CA        -60        -40        -40        -20        -20        -40
3:    US-CA        -60        -40        -40        -40        -40        -60
4:    US-CA        -60        -60        -40        -40        -60        -60
5:    US-CA        -60        -60        -20        -40        -50        -60
6:    US-CA        -60        -60        -40        -40        -40        -60
   2014-07-01 2014-08-01 2014-09-01 2014-10-01 2014-11-01 2014-12-01     weight
1:        -60        -60         -5         -3        -10        -10 14.1623726
2:        -60        -50         -3         -3        -10        -20  3.1425104
3:        -60        -60         -3         -3         -5        -20  4.3180256
4:        -60        -60         -5         -3         -5        -20 15.4701538
5:        -60        -60        -10          0         -5        -10 44.1462708
6:        -60        -60        -20          0         -5        -10  0.6994646
   coverage_area
1:        569715
2:       1624997
3:      12378942
4:       8855739
5:      13233592
6:      19485454

shapeISO: The label of the polygon boundary where the cell is located. This was passed on from the California geojson boundary as specified in the include_cols = 'shapeISO' argument. In this instance, it’s not very helpful because we used the state-level California boundary, but if we passed the ADM2 boundary with counties it would provide the name of the county where the cell is located.
2014-01-01 to 2014-12-01: The next 12 columns list the deficit return period classification value for the cell in each of the 12 months corresponding to the time dimension of the wsim_gldas_1mo raster layer.
weight: The weight column lists the corresponding population density value (persons per km^2) for that WSIM-GLDAS cell. The WSIM-GLDAS and GPW raster layers have the same projection and resolution. Therefore, because they are perfectly aligned, each WSIM return period cell has a corresponding GPW population weight that can be layered exactly on it.
coverage_area: The total area (m^2) of the WSIM-GLDAS cell that is covered by the California boundary layer. Given the total area of the WSIM cell that is covered, and the GPW persons per unit area weight, we can calculate the number of people estimated to be living within this cell under this WSIM deficit return period.

We will need to perform a few more processing steps to prepare this data.frame for a time series visualization integrating all of the data. We will use the melt function to transform the data from wide format to long format in order to produce a visualization in ggplot2. Specifically, we need to use melt to make the 12 month columns (2014-01-01 to 2014-12-01) into 2 new columns: 1) specifying the WSIM-GLDAS deficit return period value and 2) the month it came from.

Coding Review

Converting data from wide to long or long to wide formats is a key component to data processing, however, it can be confusing. To read more about melting/pivoting longer (wide to long) and casting/pivoting wider (long to wide) check out the data.table vignette Efficient reshaping using data.tables and the dplyr pivot_longer and pivot_wider reference pages.

# convert the dataset from wide to long (melt)
pop_by_rp <-
  data.table::melt(
    # data we're melting
    pop_by_rp,
    # the columns that need to stay columns/ids
    id.vars = c("shapeISO", "coverage_area", "weight"),
    # name for the month columns we're melting into a single column
    variable.name = "month",
    # name for the values being converted into a single column
    value.name = "return_period")

# check the first 5 rows of melted data
head(pop_by_rp)

   shapeISO coverage_area     weight      month return_period
1:    US-CA        569715 14.1623726 2014-01-01           -60
2:    US-CA       1624997  3.1425104 2014-01-01           -60
3:    US-CA      12378942  4.3180256 2014-01-01           -60
4:    US-CA       8855739 15.4701538 2014-01-01           -60
5:    US-CA      13233592 44.1462708 2014-01-01           -60
6:    US-CA      19485454  0.6994646 2014-01-01           -60

Each row lists the land area (coverage_area) covered by the zone, the population density value (weight) for the zone, the month the deficit return period corresponds to, and the actual deficit return period class for the zone.

Next, we’ll summarize the data by return-period class.

# create new column with population totals for each month and return period combination
# divide by 1000000
pop_by_rp <-
  pop_by_rp[, .(pop_rp = round(sum(coverage_area * weight) / 1e6)), by = .(month, return_period)]
# some cells do not have deficit return period values and result in NaN--just remove
pop_by_rp <- na.omit(pop_by_rp)
# create a new column with the total population for that month (will be the same for each month but needed to create wsim class fraction)
pop_by_rp[, total_pop := sum(pop_rp), by = month]
# check the first rows
head(pop_by_rp)

        month return_period   pop_rp total_pop
1: 2014-01-01           -60 37556929  39547064
2: 2014-01-01           -40   718545  39547064
3: 2014-01-01           -20   551332  39547064
4: 2014-01-01           -50   708659  39547064
5: 2014-01-01           -10    11554  39547064
6: 2014-01-01            -5       45  39547064

Now we have a row for every unique combination of month, return period class, and the total population. We can calculate the percent of the total population represented by this return period class with 2 more lines.

# calculate the fraction of that month-wsim class combination relative to the total population
pop_by_rp[, pop_frac := pop_rp / total_pop][, total_pop := NULL]
# check the first rows
head(pop_by_rp)

        month return_period   pop_rp     pop_frac
1: 2014-01-01           -60 37556929 9.496768e-01
2: 2014-01-01           -40   718545 1.816936e-02
3: 2014-01-01           -20   551332 1.394116e-02
4: 2014-01-01           -50   708659 1.791938e-02
5: 2014-01-01           -10    11554 2.921582e-04
6: 2014-01-01            -5       45 1.137885e-06

Before plotting we’ll make the month labels more legible for plotting, convert the WSIM-GLDAS return period class into a factor, and set the WSIM-GLDAS class palette.

Coding Review

Factors are the most common way to handle categorical data in R. Although converting your categorical variables into factors is not not always the best choice, in many instances (especially plotting with ggplot2) the benefits will out way any annoyances. To learn more about factors and R check out Hadley Wickham’s chapter on factors in R for Data Science 2nd Edition.

# ggplot is easier with factors
pop_by_rp$return_period<-
  factor(pop_by_rp$return_period, 
         levels = c("0", "-3", "-5", "-10", "-20", "-40", "-50", "-60"))
# create the palette to pass to ggplot
leg_colors<-c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#fffdc7',
    # -3 to 0
    '#FFF4C7',
    # 0-3
    "#FFFFFF")

# pretty month labels
pop_by_rp[,month:=lubridate::month(pop_by_rp$month, label = TRUE)]

Now we can put it all together into a visualization.

# add the base data
ggplot2::ggplot(pop_by_rp, 
                # set plot wide aesthetics
                ggplot2::aes(x = month, 
                             y = pop_frac,
                             group = return_period, 
                             fill = return_period))+
  # determine the overlay and positioning of each wsim-class' bar
  ggplot2::geom_bar(stat = "identity", 
                    position = "stack", 
                    color = "darkgrey")+
  # set palettes on the wsim class groups
  # rev() the order of the order of the colors so we can have -60 on the bottom
  ggplot2::scale_fill_manual(values = rev(leg_colors))+
  # limit y axis to 0/1
  ggplot2::ylim(0,1)+
  # labels
  ggplot2::labs(title = "Monthly Fraction of Population Under Water Deficits in California During 2014",
                subtitle = "Categorized by Intensity of Deficit Return Period",
                x = "",
                y = "Fraction of Population*",
                caption = "*Population derived from Gridded Population of the World (2015)",
                color = "Return Period", fill = "Return Period", group = "Return Period", alpha = "Return Period")+
  # use basic theme
  ggplot2::theme_minimal()

This figure really illustrates the human impact of the 2014 drought. Nearly 100% of the population was under a 60+ year deficit in January followed by 66% in May and approximately 40% for the remainder of the summer. That is a devastating drought!

Knowledge Review

There are several options for spatially subsetting (or clipping) a raster object to a region of interest. What method was used in this lesson?
1. Using a vector of dates.
2. Using another raster object.
3. Specifying a bounding box.
4. Using a vector boundary dataset.
When running into memory issues, what is something you can do to reduce the computational load?
1. Work with one time frame or region at a time.
2. Save it as a new file.
3. Subset the data to a region of interest/time frame.
4. Find other data to work with.
What is the importance of subsetting data?
1. Freeing up space.
2. Analyzing a certain time or area of interest.
3. Making code run faster.
4. All of the above.
Gridded population datasets…(select all that apply)
1. Show where river tributaries are.
2. Model the distribution of the global human population in raster grid cells.
3. Allow analyses of the number of persons impacted by a hazard such as drought.
4. Vary in the extent to which ancillary data are used in their production.

In this Lesson, You Learned…

Congratulations! Now you should be able to:

Navigate the SEDAC website to find and download datasets.
Access administrative boundaries from geoBoundaries data using API.
Temporally subset a NetCDF raster stack using R packages such as dplyr and lubridate.
Crop a NetCDF raster stack with a spatial boundary.
Write a subsetted dataset to disk and create an image to share results.
Identify areas of severe drought and select these areas for further analysis.
Summarize data by county using the exactextractr tool.
Integrate WSIM-GLDAS deficit, GPW population, and geoBoundaries administrative boundary data to create complex time series visualizations.

Lesson 2

In this lesson we explored the California drought of 2014. In our next lesson, we will examine near real-time flood data in California using the MODIS data product.

Lesson 2: Moderate Resolution Imaging Spectroradiometer (MODIS) Near-Real Time (NRT) flood data

References

Bland, Alastair. 2014. “California Drought Has Wild Salmon Competing with Almonds for Water.” National Public Radio. https://www.npr.org/sections/thesalt/2014/08/21/342167846/california-drought-has-wild-salmon-competing-with-almonds-for-water.

Center For International Earth Science Information Network-CIESIN-Columbia University. 2018. “Gridded Population of the World, Version 4 (GPWv4): Population Count, Revision 11.” Palisades, NY: NASA Socioeconomic Data; Applications Center (SEDAC). https://doi.org/10.7927/H4JW8BX5.

Daniel Baston. 2023. Exactextractr: Fast Extraction from Raster Datasets Using Polygons. https://isciences.gitlab.io/exactextractr/.

“Evidence Suggests California’s Drought Is the Worst in 1,200 Years.” 2014. Woods Hole Oceanographic Institution. https://phys.org/news/2014-12-evidence-california-drought-worst-years.html.

Guillen, George. 2002. “Klamath River Fish Die-Off.” Mortality Report AFWO-01-03. Arcata, CA: U.S. Fish & Wildlife Service. https://www.trrp.net/DataPort/doc.php?id=301.

IPCC. 2023. “Climate Change 2021 the Physical Science Basis,” June. https://doi.org/10.1017/9781009157896.

ISciences, and Center For International Earth Science Information Network-CIESIN-Columbia University. 2022a. “Documentation for the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data; Applications Center (SEDAC). https://doi.org/10.7927/X7FJ-JJ41.

———. 2022b. “Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Monthly Grids, Version 1.” Palisades, NY: NASA Socioeconomic Data; Applications Center (SEDAC). https://doi.org/10.7927/Z1FN-KF73.

Leyk, Stefan, Andrea E. Gaughan, Susana B. Adamo, Alex de Sherbinin, Deborah Balk, Sergio Freire, Amy Rose, et al. 2019. “The Spatial Allocation of Population: A Review of Large-Scale Gridded Population Data Products and Their Fitness for Use.” Earth System Science Data 11 (3): 1385–1409. https://doi.org/10.5194/essd-11-1385-2019.

NOAA. 2019. “Water Cycle.” National Oceanic; Atmospheric Administration. https://www.noaa.gov/education/resource-collections/freshwater/water-cycle.

———. 2023. “What Are El Nino and La Nina?” National Oceanic; Atmospheric Administration. https://oceanservice.noaa.gov/facts/ninonina.html.

Rodgers, Alison Ince. 2023. “Understanding Droughts.” National Geographic Society. https://education.nationalgeographic.org/resource/understanding-droughts/.

Roeseler, Charles. 2011. “Exceptional Drought Continues Across Southeast Texas.” National Weather Service. https://www.weather.gov/media/hgx/stormsignals/vol87.pdf.

Sanders, Sam. 2014. “Despite California’s Drought, Taps Still Flowing in LA County.” National Public Radio. https://www.npr.org/2014/07/20/333019977/despite-californias-drought-taps-still-flowing-in-la-county.

StateImpact. 2014. “Everything You Need to Know about the Texas Drought.” National Public Radio. https://stateimpact.npr.org/texas/tag/drought/.

Zhou, Haddeland, T. 2016. “Human-Induced Changes in the Global Water Cycle.” Geophysical Monograph Series. https://doi.org/10.1002/9781118971772.ch4.

Footnotes

Photo Credit, NASA/JPL.↩︎