Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Dataset Exploration and Visualizations

Authors

Joshua Brinks

Elaine Famutimi

Published

May 14, 2024

Overview

In our previous lesson, Acquiring and Pre-Processing the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Dataset, we downloaded components of WSIM-GLDAS from SEDAC, subset the data using vector boundaries from geoBoundaries, performed a visual check, and wrote the file to disk. In this lesson, we will extend our work with WSIM-GLDAS by introducing an additional integration (temporal aggregation) period, calculating simple summary statistics, integrating WSIM-GLDAS with the Gridded Population of the World (GPW) data, and developing more complex visualizations.

Learning Objectives

After completing this lesson, you should be able to:

  • Subset the WSIM-GLDAS raster data for a region and time period of interest.
  • Perform visual exploration with histograms.
  • Integrate gridded population data with WSIM-GLDAS data to perform analyses and construct visualizations to understand how people are impacted.
    • Make choropleth maps visualizing WSIM-GLDAS data by administrative vector boundaries.
    • Summarize WSIM-GLDAS and population raster data using zonal statistics.

Introduction

Coding Review

This lesson uses the stars, sf, dplyr, lubridate, exactextractr, ggplot2, terra, data.table, and kableExtra packages. If you’d like to learn more about the functions used in this lesson, you can use the help guides on their package websites.

Load Data

We’ll begin with the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period file from SEDAC. We will spatially subset the data to cover only the Continental United States (CONUSA), which will help to minimize our memory footprint. We can further reduce our memory overhead by reading in just the variable we want to analyze. In this instance, we can read in just the deficit attribute from the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period file, rather than reading the entire NetCDF with all of its attributes.

Coding Review

Random Access Memory (RAM) is where data and programs that are currently in use are stored temporarily. It allows the computer to quickly access data, making everything you do on the computer faster and more efficient. Unlike the hard drive, which stores data permanently, RAM loses all its data when the computer is turned off. RAM is like a computer’s short-term memory, helping it to handle multiple tasks at once.

# read in the wsim-gldas layer from SEDAC
wsim_gldas <- stars::read_stars("data/composite_12mo.nc", proxy = FALSE, sub = 'deficit')
# check the basic info.
print(wsim_gldas)

For this exercise, we want to explore droughts in the continental United States for the years 2000-2014. In the 12-month return period dataset, each monthly time step is a moving average of the previous 12 months. Therefore, if we wish to view a snapshot of drought for a given calendar year, we need to use the December timestep, which averages exactly the 12 months of that calendar year. We can create a data set of annualized monthly averages starting with December 2000 and ending in December 2014 using the ymd() function of the lubridate package and the filter() function of the dplyr package, as demonstrated below.

# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2000-12-01"),
           lubridate::ymd("2014-12-01"), 
           by = "year")

#change data type to POSIXct
keeps <- as.POSIXct(keeps)

# filter using that vector
wsim_gldas <- dplyr::filter(wsim_gldas, time %in% keeps)
print(wsim_gldas)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.  1st Qu.    Median      Mean   3rd Qu.     Max.  NA's
deficit   -60 -14.7116 -7.280078 -12.64074 -3.486546 4.991005 94083
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1   15     NA    NA POSIXct 2000-12-01,...,2014-12-01    

Next, we can clip the WSIM-GLDAS dataset using the USA country boundary from geoBoundaries. As in lesson 1, we acquire the vector boundary data using the geoBoundaries API.

#directly acquire the boundary from geoBoundaries API
# request the usa data and metadata
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
# parse the content of the request to find the geojson download link
usa <- httr::content(usa)
# read in the geojson directly from geoBoundaries 
usa <- sf::st_read(usa$gjDownloadURL)
Reading layer `geoBoundaries-USA-ADM1' from data source 
  `https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM1/geoBoundaries-USA-ADM1.geojson' 
  using driver `GeoJSON'
Simple feature collection with 56 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1489 ymin: -14.54869 xmax: 179.7785 ymax: 71.36516
Geodetic CRS:  WGS 84
# remove everything not part of CONUSA
drops<-
  c("Alaska", "Hawaii", 
    "American Samoa",
    "Puerto Rico",
    "Commonwealth of the Northern Mariana Islands", 
    "Guam", 
    "United States Virgin Islands")
usa<-usa[!(usa$shapeName %in% drops),]

# crop to the CONUSA boundary and relabel the time dimension with friendlier year values
wsim_gldas<-wsim_gldas[usa] |>
  stars::st_set_dimensions("time", values = as.character(seq(2000,2014)))

In the preprocessing step of the data science lifecycle, it is important to periodically check the results. Now we’ll verify the pre-processing steps with the print() function. Note that the raster object now contains 15 timesteps from December 2000 to December 2014.

# check the basic information again
print(wsim_gldas)
stars object with 3 dimensions and 1 attribute
attribute(s):
         Min.   1st Qu.    Median      Mean   3rd Qu. Max.   NA's
deficit   -60 -9.821823 -4.206245 -7.941395 -2.086784   60 153810
dimension(s):
     from  to offset delta refsys        values x/y
x     221 453   -180  0.25 WGS 84          NULL [x]
y     163 262     90 -0.25 WGS 84          NULL [y]
time    1  15     NA    NA     NA 2000,...,2014    

You will want to review the printout to make sure it looks okay.

  • Does it contain the variables you were expecting?

  • Do the values for the variables seem plausible?
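
We can also make these checks programmatically; here is a minimal sketch using the stars object already in memory:

# list the time steps retained after filtering
stars::st_get_dimension_values(wsim_gldas, "time")
# range of values for the first (and only) attribute, ignoring NAs
range(wsim_gldas[[1]], na.rm = TRUE)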

Other basic descriptive analyses are useful to verify and understand your data. One of these is to produce a frequency distribution (also known as a histogram), which is reviewed below.

Knowledge Check
  1. What are the two ways we reduced our memory footprint when we loaded the WSIM-GLDAS data set for this lesson? (select all that apply)
    1. Spatially subsetting the data to the CONUSA
    2. Reading in only one year of data
    3. Reading in only the attribute of interest (i.e., ‘deficit’)
    4. All of the above.

Annual CONUSA Time Series

The statistical properties reviewed in the previous step are useful for exploratory data analysis, but we should also inspect the data’s spatial characteristics. We can start our visual exploration of annual drought in the CONUSA by creating a map visualization depicting the deficit return period for each of the years in the subset dataset we loaded in the previous step.

# load the base data of the usa boundary
ggplot2::ggplot(usa)+
  # plot the stars wsim object
  stars::geom_stars(data = wsim_gldas)+
  # set equal coordinates for axes
  ggplot2::coord_equal()+
  # create multiple panels for each time step
  ggplot2::facet_wrap(~time)+
  # plot the usa boundary with just the outline
  ggplot2::geom_sf(fill = NA)+
  # add the palette for wsim-gldas 
  ggplot2::scale_fill_stepsn(
    colors = c(
    # -60 to -50
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the breaks on the palette
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add plot labels
  ggplot2::labs(
    title="Annual Mean Deficit Anomalies for the CONUSA",
    subtitle = "Using Observed 12 Month Integrated WSIM-GLDAS Data for 2000-2014",
    fill = "Deficit Return Period"
  )+
  # set the minimal theme
  ggplot2::theme_minimal()+
  # turn off some extra graphical elements
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())

This visualization shows that there were several significant drought events (as indicated by return-period values) throughout 2000-2014. Significant drought events included the southeast in 2000, the southwest in 2002, the majority of the western third of the country in 2007, Texas-Oklahoma in 2011, Montana-Wyoming-Colorado in 2012, and the entirety of the California coast in 2014. The droughts of 2011 and 2012 were particularly severe and widespread, with return periods greater than 50 years covering multiple states. Based on historical norms, we should only expect droughts this strong every 50-60 years!

Knowledge Check
  1. The output maps of Annual Mean Deficit Anomalies for the CONUSA indicate that…
    1. The most significant deficit in 2004 is located in the western United States.
    2. The least significant deficit 2003 is located in the Midwest.
    3. The most significant deficit in 2011 is located around Texas and neighboring states.
    4. The least significant deficit in 2000 is located in the southeast.

Monthly Time Series

We can get a more detailed look at these drought events by using the 1-month composite WSIM-GLDAS dataset and clipping the data to a smaller spatial extent. Let’s examine the 2014 California drought.

Drought in the News

The California drought of 2012-2014 was the worst in 1,200 years (“Evidence Suggests California’s Drought Is the Worst in 1,200 Years” 2014). This drought caused problems for homeowners, and even conflicts between farmers and wild salmon! Governor Jerry Brown declared a drought emergency and called on residents to reduce water use by 20%. Even so, water use went up by 8% in May of 2014 compared to 2013 in places like coastal California and Los Angeles. Due to the water shortages, the state voted to fine water-wasters up to $500. The drought also affected residents differently based on economic status. For example, in El Dorado County, located in a rural area east of Sacramento, residents took bucket showers, and rural residents reported that the wells they rely on for fresh water were drying up. The federal government eventually announced $9.7 million in emergency drought aid for those areas (Sanders 2014).

Additionally, there were thousands of adult salmon struggling to survive in the Klamath River in Northern California, where water was running low and warm due to the diversion of river flow into the Central Valley, an agricultural area that grows almond trees. Almonds are one of California’s most important crops, with the state producing 80% of the world’s almonds. However, salmon, which migrate upstream, could get a disease called gill rot, which flourishes in warm water and already killed tens of thousands of Chinook in 2002. This disease was spreading through the salmon population again due to this water allocation, affecting local Native American tribes that rely on the fish (Bland 2014).

In order to limit the amount of computing memory required for the operation, we will first clear items from the in-memory workspace and then reload a smaller composite file. We’ll start by removing the 12-month composite object.

# remove the large wsim object from the environment
rm(wsim_gldas)

Now let’s load the composite 1-month file from SEDAC into the workspace.

# optionally run garbage collection to free memory
# gc()
# read in the 1 month wsim-gldas data
wsim_gldas_1mo <- stars::read_stars("data/composite_1mo.nc", sub = 'deficit', proxy = FALSE)
# check the basic info
print(wsim_gldas_1mo)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.   1st Qu.    Median     Mean   3rd Qu.     Max.  NA's
deficit   -60 -29.83644 -2.461974 -16.9758 -2.461974 34.54536 87340
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1  804     NA    NA POSIXct 1948-01-01,...,2014-12-01    

Once again, we’ll subset the time dimension for our period of interest. This time we want every month for 2014.

# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2014-01-01"),
           lubridate::ymd("2014-12-01"), 
           by = "month")
#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter using that vector
wsim_gldas_1mo <- dplyr::filter(wsim_gldas_1mo, time %in% keeps)
# check the info
print(wsim_gldas_1mo)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
         Min.   1st Qu.    Median      Mean   3rd Qu.     Max.  NA's
deficit   -60 -12.92903 -6.746436 -14.71285 -6.746436 15.66717 87340
dimension(s):
     from   to offset delta  refsys                    values x/y
x       1 1440   -180  0.25  WGS 84                      NULL [x]
y       1  600     90 -0.25  WGS 84                      NULL [y]
time    1   12     NA    NA POSIXct 2014-01-01,...,2014-12-01    

Now we have 12 rasters with monthly data for 2014. Let’s zoom in on California and see how this drought progressed over the course of the year.

# isolate only the california border
california<-usa[usa$shapeName=="California",]
# subset wsim-gldas to the extent of the california boundary
wsim_gldas_california <- wsim_gldas_1mo[california]
# give the time dimension pretty labels
wsim_gldas_california <-
  wsim_gldas_california |>
    stars::st_set_dimensions("time", values = month.name)

# monthly plots of california
# load the base california data
ggplot2::ggplot(california)+
  # load the wsim stars object
  stars::geom_stars(data = wsim_gldas_california)+
  # equal aspect ratio for lat/long axes
  ggplot2::coord_equal()+
  # create a subplot for each time step
  ggplot2::facet_wrap(~time)+
  # only plot the outline of california
  ggplot2::geom_sf(fill = NA)+
  # set the wsim palette
  ggplot2::scale_fill_stepsn(
    colors = c(
    # -60 to -50
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the wsim breaks
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add labels
  ggplot2::labs(
    title="Deficit Anomalies for California",
    subtitle = "Using Observed Monthly WSIM-GLDAS Data for 2014",
    fill = "Deficit Return Period"
  )+
  # start with a base minimal theme
  ggplot2::theme_minimal()+
  # remove additional elements to make a cleaner map
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())

This series of maps shows a startling picture. California faced massive water deficits throughout the state in January and February. This was followed by water deficits in the western half of the state in May-August. Although northern and eastern California saw some relief by September, southwest California continued to see deficits through December.

Monthly Histograms

Tip

A data frame is a data structure used for storing tabular data. It organizes data in rows and columns. Each column can have a different type of data (numeric, character, factor, etc.), and rows represent individual observations or cases. Data frames provide a convenient way to work with structured data, making them essential for data analysis and statistics projects.

We can explore the data further by creating a frequency distribution (also called a histogram) of the deficit anomalies for any given spatial extent; here we are still looking at the deficit anomalies in California. We start by extracting the values from the raster time series into a data frame, a tabular structure that is easier to manipulate into a histogram.

# extract the raster values into a data frame
deficit_hist <-
  wsim_gldas_california |>
  as.data.frame()

# remove the NA values
deficit_hist<-stats::na.omit(deficit_hist)

# create the histogram
ggplot2::ggplot(deficit_hist, ggplot2::aes(deficit))+
  # style the bars
  ggplot2::geom_histogram(binwidth = 6, fill = "#325d88")+
  # subplot for each timestep
  ggplot2::facet_wrap(~time)+
  # use the minimal theme
  ggplot2::theme_minimal()

The histograms start to quantify what we saw in the time series maps. Whereas the map shows where the deficits occur, the frequency distribution indicates the number of raster cells for each return period of the deficit range. The number of raster cells under a 60-year deficit (return period) is very high in most months, far exceeding any other value in the range.
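
We can put a number on that observation with a quick sketch using the deficit_hist data frame created above, computing the share of cells at the -60 return-period cap for each month:

# fraction of California cells at or below a 60-year deficit, by month
deficit_hist |>
  dplyr::group_by(time) |>
  dplyr::summarise(frac_60yr = mean(deficit <= -60))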

Knowledge Check
  1. What’s another term for ‘histogram’?
    1. Choropleth
    2. Frequency distribution
    3. Data array
    4. Box plot

Zonal Summaries

The previous section describes the 2014 California drought, examining the state as a whole. Although we have a sense of what’s happening in different cities or counties by looking at the maps, the maps do not provide quantitative summaries of those local areas.

Zonal statistics are one way to summarize the cells of a raster layer that lie within the boundary of another data layer (which may be in either raster or vector format). For example, aggregating deficit return periods with another raster depicting land cover type or a vector boundary (shapefile) of countries, states, or counties, will produce descriptive statistics by that new layer. These statistics include: sum, mean, median, standard deviation, and range.
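
In code, a zonal summary is typically a single call. Here is a minimal sketch (with hypothetical placeholders r for a raster layer and zones for a polygon boundary layer) using the exactextractr package we introduce properly below, computing several statistics at once:

# hypothetical sketch: summarize raster `r` within each polygon of `zones`
exactextractr::exact_extract(r, zones, c("sum", "mean", "median", "stdev"))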

For this lesson, we begin by calculating the mean deficit return period by California counties. First, we retrieve a vector data set of California counties from the geoBoundaries API. Since geoBoundaries does not attribute which counties belong to which states, we utilize a spatial operation called intersect in order to select only those counties in California.

# get the adm2 (county) level data for the USA
# request the adm2 usa information
cali_counties <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM2/")
# parse the content to find the direct download link
cali_counties <- httr::content(cali_counties)
# read in the geojson directly from geoboundaries
cali_counties <- sf::st_read(cali_counties$gjDownloadURL)
Reading layer `geoBoundaries-USA-ADM2' from data source 
  `https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM2/geoBoundaries-USA-ADM2.geojson' 
  using driver `GeoJSON'
Simple feature collection with 3233 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.1474 ymin: -14.55254 xmax: 179.7785 ymax: 71.35256
Geodetic CRS:  WGS 84
# geoBoundaries does not list which counties belong to which state so you need to run an intersection
# intersect the usa adm2 data with just the california boundary
cali_counties<-sf::st_intersection(cali_counties, california)
# plot the result
plot(sf::st_geometry(cali_counties))

The output of that intersection looks as expected. As noted above, a visual and/or tabular check on your data layers is always a good idea. If you expect 50 counties in a given state, you should see 50 counties resulting from the intersection of the two layers. You may want to be on the lookout for too few counties (such as an island area that may be in one layer but not the other) or too many counties (such as those that intersect with a neighboring state).
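
In this case, California has 58 counties, so a quick count of the intersection result is a worthwhile check; a larger number would flag slivers from neighboring states’ counties that overlap the boundary (note Mohave and Curry in the table further below).

# California has 58 counties; a larger count suggests border slivers
nrow(cali_counties)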

Coding Review

The exactextractr R package (Baston 2023) summarizes raster values over groupings, or zones, also known as zonal statistics. Zonal statistics help in assessing the statistical characteristics of a certain region.

The terra R package processes raster geospatial data, offering functionalities such as data manipulation, spatial analysis, modeling, and visualization, with a focus on efficiency and scalability.

We will perform our zonal statistics using the exactextractr package (Baston 2023). It is the fastest, most accurate, and most flexible zonal statistics tool for the R programming language, but it currently has no default methods for the stars package, so we’ll switch to terra for this portion of the lesson.

Read the data back in with terra.

# remove the previous 1 month object
rm(wsim_gldas_1mo)

# create a terra collection with the wsim-gldas 1 month file from SEDAC
wsim_gldas_1mo<-terra::sds("data/composite_1mo.nc")

Subset the data along the time dimension.

# pull out just the deficit layer
wsim_gldas_1mo<-wsim_gldas_1mo["deficit"]
# create a sequence of dates to keep
keeps<-seq(lubridate::ymd("2014-01-01"), lubridate::ymd("2014-12-01"), by = "month")
# subset the terra object for only those time steps
wsim_gldas_1mo<-wsim_gldas_1mo[[terra::time(wsim_gldas_1mo) %in% keeps]]
# label the time steps
names(wsim_gldas_1mo) <- keeps
# check the info
print(wsim_gldas_1mo)
class       : SpatRaster 
dimensions  : 600, 1440, 12  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : composite_1mo.nc:deficit 
varname     : deficit (Composite Deficit Index) 
names       : 2014-01-01, 2014-02-01, 2014-03-01, 2014-04-01, 2014-05-01, 2014-06-01, ... 
time (days) : 2014-01-01 to 2014-12-01 
# run the extraction
cali_county_summaries<-
  exactextractr::exact_extract(
    # the raster data
    wsim_gldas_1mo, 
    # the boundary data
    cali_counties, 
    # the calculation
    'mean', 
    # don't show progress
    progress = FALSE)
# give the timestep prettier labels
names(cali_county_summaries)<-lubridate::month(keeps, label = TRUE, abbr = FALSE)

exactextractr will return summary statistics in the same order as the input boundary file; therefore, we can join the California county names to the exactextractr summary statistics output for visualization. We will also make a version to view as a table to inspect the raw data. We can take a quick look at the first 10 counties to see their mean deficit return period for January-June.

# bind the extracted means with the california boundary
cali_counties<-cbind(cali_counties, cali_county_summaries)
# prepare a version to create a table
cali_county_table<-cbind(County=cali_counties$shapeName,
                         round(cali_county_summaries))
# create the table with only the first 10 rows and 7 columns
kableExtra::kbl(cali_county_table[c(1:10),c(1:7)]) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"))
County       January  February  March  April  May  June
Alpine           -42       -16    -14     -8  -28   -29
Mohave           -38       -10    -10    -13  -60   -21
Tuolumne         -46       -19    -16     -8  -42   -24
Tehama           -58       -50    -24    -21  -28   -47
Trinity          -60       -46    -46    -26  -44   -60
Curry            -60       -56    -26     -4   -9   -55
San Benito       -60       -60    -56    -23  -60   -60
Josephine        -60       -25    -33    -14  -18   -35
Sierra           -19       -19    -11    -11  -18   -28
El Dorado        -44       -17    -12     -9  -34   -29

This confirms the widespread distribution of high deficit values (all the bright red) in our exploratory maps. The data is currently in wide format, which makes for easy viewing of a time series, but more advanced programmatic visualization typically requires data to be in a normalized, or long, format (more on that later).

Knowledge Check
  1. What can zonal statistics be used for?
    1. Subsetting data to a time period of interest.
    2. Creating a time series.
    3. Creating summary statistics from the values of cells that lie within a boundary.
    4. All of the above.

County Choropleths

Now that we’ve inspected the raw data we can make a choropleth out of the mean deficit return period data. We previously demonstrated more complex maps using ggplot2, sf, and stars, but you can also make quick plots of sf objects with the base plotting function. By default sf will make a map for every column listed in the dataset. In this case we only want to look at the monthly means so we will just plot columns 11 through 23. You can make simple alterations to the color palette and position of the legend, but custom map titles and legend titles are not easily accomplished with multi-panel maps (multiple maps in one) like the one pictured below.

# plot the data for a check using only the 12 monthly summaries in columns 11 through 23
plot(cali_counties[c(11:23)],
     # the palette we've been using
     pal = c(
    # -60 to -50
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # the breaks we want
    breaks = c(-61, -50, -40, -20, -10,-5,-3, 5), 
    # where we're placing the legend
    key.pos = 1, 
    # how thick we want the legend to be
    key.width = 0.3)

Due to the widespread water deficits in the raw data, the mean values do not appear much different from the raw deficit raster layer. However, choropleth maps (also called thematic maps) can make it easier for users to survey the landscape by visualizing familiar geographies (like counties) that place themselves and their lived experiences alongside the data.

While this paints a striking picture of widespread water deficits, how many people are affected by this drought? Although the land area appears rather large, if one is not familiar with the distribution of population and urban centers in California it can be difficult to get a sense of the direct human impact. (This is partly because more populous locations are usually represented by smaller land areas and the less populous locations are usually represented by large administrative boundaries containing much more land area. Normalizing a given theme by land area may be something an analyst wants to do but we cover another approach below.)
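
If you did want to normalize a theme by land area, sf can compute polygon areas directly; a minimal sketch (the county_km2 name is our own):

# land area of each county in km^2 (sf::st_area() returns m^2 for this CRS)
county_km2 <- as.numeric(sf::st_area(cali_counties)) / 1e6
head(county_km2)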

Knowledge Check
  1. Choropleth maps, aka thematic maps, are useful for
    1. Visualizing data in familiar geographies like counties or states
    2. Finding directions from one place to another
    3. Visualizing data in uniform geographic units like raster grid cells.
    4. Calculating return periods.

Integrating Population Data

Gridded Population of the World (GPW) is a data collection from SEDAC that models the distribution of the global human population as counts and densities in a raster format (Center For International Earth Science Information Network-CIESIN-Columbia University 2018). We will take full advantage of exactextractr to integrate across WSIM-GLDAS, geoBoundaries, and GPW. To begin, we need to download the 15-minute resolution (roughly 30 kilometers at the equator) population density data for the year 2015 from GPWv4. This most closely matches our time period (2014) and the resolution of WSIM-GLDAS. Although in many applications one might choose to use GPW’s population count data layers, because we are using exactextractr we can achieve more accurate results (especially along coastlines) by using population density in conjunction with land area estimates from the exactextractr package.

Data Review

The Gridded Population of the World Version 4 is available in multiple target metrics (e.g., counts, density), periods (2000, 2005, 2010, 2015, 2020), and spatial resolutions (30 sec, 2.5 min, 15 min, 30 min, 60 min). Read more about GPW at the collection home page on SEDAC. GPW is one of four global gridded population datasets available in raster format. These data sets vary in the degree to which they use additional information as ancillary variables to model the spatial distribution of the population from the administrative units (vector polygons) in which they originate. A review of these data sets and their underlying models is found in a paper by Leyk and colleagues (Leyk et al. 2019). You can learn more about gridded population data sets at POPGRID.org. Fitness-for-use is an important principle in determining the best dataset to use for a specific analysis. The question we ask here (what is the population exposure to different levels of water deficit in California?) uses spatially coarse inputs and concerns a place with high-quality data inputs, so GPW is a good choice for this analysis. Users with vector-format census data (at the county or sub-county level) could also adopt this approach for those data. In the case of California, the US Census data and GPW will produce nearly identical estimates because GPW is based on the census inputs.

Load in the population density layers.

# read in GPW with terra
pop_dens<-terra::rast("data/gpw_v4_population_density_rev11_2015_15_min.tif")
# check the basic information
print(pop_dens)
class       : SpatRaster 
dimensions  : 720, 1440, 1  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : gpw_v4_population_density_rev11_2015_15_min.tif 
name        : gpw_v4_population_density_rev11_2015_15_min 

For this example we’ll classify the WSIM-GLDAS deficit return period raster layer into eight categories. Binning the data will make it easier to manage the output and interpret the results.

# set the class breaks row-wise (from, to, new label)
# e.g. the first row states: "all return periods from 0 to 5 will now be labeled 0"
m <- c(0, 5, 0,
       -3, 0, -3,
       -5, -3, -5,
       -10, -5, -10,
       -20, -10, -20,
       -40, -20, -40,
       -50, -40, -50,
       -65, -50, -60)
# convert the vector into a matrix
rclmat <- matrix(m, ncol=3, byrow=TRUE)
# classify the data
wsim_gldas_1mo_class <-
  terra::classify(wsim_gldas_1mo, rclmat, include.lowest = TRUE)
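
As a quick check on the reclassification, terra can tabulate how many cells fall into each class; a sketch for the January layer:

# tabulate classified values for January; only the 8 class labels should appear
terra::freq(wsim_gldas_1mo_class[[1]])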

In our previous example, we used exactextractr’s built-in 'mean' function, but we can also pass custom functions to exactextractr that carry out several operations at once. The following code could be combined into a single function passed to exactextractr, but it is presented here in multiple steps to make it easier to follow along. You can read more about exactextractr arguments in the package help guide. The key arguments to be aware of are the calls to:

  1. weights = pop_dens: summarizes each WSIM-GLDAS cell’s deficit return period with the corresponding population density value.
  2. coverage_area = TRUE: calculates the corresponding area of the WSIM-GLDAS raster cell that is covered by the California boundary.
# run the extraction
pop_by_rp <-
  exactextractr::exact_extract(wsim_gldas_1mo_class, california, function(df) {
    df <- data.table::setDT(df)
    }, 
  # convert output to data frame
  summarize_df = TRUE, 
  # specify the weights we're using
  weights = pop_dens, 
  # return the coverage area (m^2)
  coverage_area = TRUE,
  # return the county name with the output data
  include_cols = 'shapeISO', 
  # don't show progress
  progress = FALSE)

This returns a data.frame with a row for every raster cell in the WSIM-GLDAS layer that is overlapped by the California boundary. Let’s take a look at the first 6 rows.

head(pop_by_rp)
   shapeISO 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
     <char>      <num>      <num>      <num>      <num>      <num>      <num>
1:    US-CA        -60        -40        -40        -20        -40        -50
2:    US-CA        -60        -40        -40        -20        -20        -40
3:    US-CA        -60        -40        -40        -40        -40        -60
4:    US-CA        -60        -60        -40        -40        -60        -60
5:    US-CA        -60        -60        -20        -40        -50        -60
6:    US-CA        -60        -60        -40        -40        -40        -60
   2014-07-01 2014-08-01 2014-09-01 2014-10-01 2014-11-01 2014-12-01     weight
        <num>      <num>      <num>      <num>      <num>      <num>      <num>
1:        -60        -60         -5         -3        -10        -10 14.1623726
2:        -60        -50         -3         -3        -10        -20  3.1425104
3:        -60        -60         -3         -3         -5        -20  4.3180256
4:        -60        -60         -5         -3         -5        -20 15.4701538
5:        -60        -60        -10          0         -5        -10 44.1462708
6:        -60        -60        -20          0         -5        -10  0.6994646
   coverage_area
           <num>
1:        569715
2:       1624997
3:      12378942
4:       8855739
5:      13233592
6:      19485454
  • shapeISO: The label of the polygon boundary where the cell is located. This was passed on from the California geojson boundary as specified in the include_cols = 'shapeISO' argument. In this instance, it’s not very helpful because we used the state-level California boundary, but if we passed the ADM2 boundary with counties it would provide the name of the county where the cell is located.
  • 2014-01-01 to 2014-12-01: The next 12 columns list the deficit return period classification value for the cell in each of the 12 months corresponding to the time dimension of the wsim_gldas_1mo raster layer.
  • weight: The weight column lists the corresponding population density value (persons per km^2) for that WSIM-GLDAS cell. The WSIM-GLDAS and GPW raster layers have the same projection and resolution. Therefore, because they are perfectly aligned, each WSIM return period cell has a corresponding GPW population weight that can be layered exactly on it.
  • coverage_area: The total area (m^2) of the WSIM-GLDAS cell that is covered by the California boundary layer. Given the total area of the WSIM cell that is covered, and the GPW persons per unit area weight, we can calculate the number of people estimated to be living within this cell under this WSIM deficit return period.
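
To make that last point concrete, the per-cell population estimate is just density times covered area, with a unit conversion; a quick sketch that leaves the table unchanged:

# estimated persons per cell: density (per km^2) * coverage_area (m^2 -> km^2)
head(pop_by_rp$weight * pop_by_rp$coverage_area / 1e6)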

We will need to perform a few more processing steps to prepare this data.frame for a time series visualization integrating all of the data. We will use the melt function to transform the data from wide format to long format in order to produce a visualization in ggplot2. Specifically, we need melt to collapse the 12 month columns (2014-01-01 to 2014-12-01) into 2 new columns: one identifying the month and one holding the WSIM-GLDAS deficit return period value for that month.

Coding Review

Converting data from wide to long or long to wide formats is a key component to data processing, however, it can be confusing. To read more about melting/pivoting longer (wide to long) and casting/pivoting wider (long to wide) check out the data.table vignette Efficient reshaping using data.tables and the dplyr pivot_longer and pivot_wider reference pages.

# convert the dataset from wide to long (melt)
pop_by_rp <-
  data.table::melt(
    # data we're melting
    pop_by_rp,
    # the columns that need to stay columns/ids
    id.vars = c("shapeISO", "coverage_area", "weight"),
    # name for the month columns we're melting into a single column
    variable.name = "month",
    # name for the values being converted into a single column
    value.name = "return_period")

# check the first 5 rows of melted data
head(pop_by_rp)
   shapeISO coverage_area     weight      month return_period
     <char>         <num>      <num>     <fctr>         <num>
1:    US-CA        569715 14.1623726 2014-01-01           -60
2:    US-CA       1624997  3.1425104 2014-01-01           -60
3:    US-CA      12378942  4.3180256 2014-01-01           -60
4:    US-CA       8855739 15.4701538 2014-01-01           -60
5:    US-CA      13233592 44.1462708 2014-01-01           -60
6:    US-CA      19485454  0.6994646 2014-01-01           -60
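
As an aside, the same wide-to-long reshape can be expressed with tidyr; a sketch assuming a copy of the pre-melt table (pop_by_rp_wide is a hypothetical name for that copy):

# equivalent reshape using tidyr::pivot_longer on the pre-melt (wide) table
pop_long <- tidyr::pivot_longer(
  pop_by_rp_wide,
  cols = -c(shapeISO, coverage_area, weight),
  names_to = "month",
  values_to = "return_period")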

Each row lists the land area (coverage_area) covered by the zone, the population density value (weight) for the zone, the month the deficit return period corresponds to, and the actual deficit return period class for the zone.

Next, we’ll summarize the data by return-period class.

# create new column with population totals for each month and return period combination
# weight is persons per km^2 and coverage_area is in m^2, so divide by 1e6
pop_by_rp <-
  pop_by_rp[, .(pop_rp = round(sum(coverage_area * weight) / 1e6)), by = .(month, return_period)]
# some cells do not have deficit return period values and result in NaN--just remove
pop_by_rp <- na.omit(pop_by_rp)
# check the first 5 rows
head(pop_by_rp)
        month return_period   pop_rp
       <fctr>         <num>    <num>
1: 2014-01-01           -60 37556929
2: 2014-01-01           -40   718545
3: 2014-01-01           -20   551332
4: 2014-01-01           -50   708659
5: 2014-01-01           -10    11554
6: 2014-01-01            -5       45

Now we have a row for every unique combination of month, return period class, and the total population. We can calculate the percent of the total population represented by this return period class with 2 more lines.

# create a new column with the total population for that month (will be the same for each month but needed to create wsim class fraction)
pop_by_rp[, total_pop := sum(pop_rp), by = month]
# calculate the fraction of that month-wsim class combination relative to the total population
pop_by_rp[, pop_frac := pop_rp / total_pop][, total_pop := NULL]
# check the first rows
head(pop_by_rp)
        month return_period   pop_rp     pop_frac
       <fctr>         <num>    <num>        <num>
1: 2014-01-01           -60 37556929 9.496768e-01
2: 2014-01-01           -40   718545 1.816936e-02
3: 2014-01-01           -20   551332 1.394116e-02
4: 2014-01-01           -50   708659 1.791938e-02
5: 2014-01-01           -10    11554 2.921582e-04
6: 2014-01-01            -5       45 1.137885e-06
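
Before moving on, a quick sanity check: within each month, the population fractions should sum to 1.

# pop_frac should sum to 1 within each month
pop_by_rp[, .(check = sum(pop_frac)), by = month]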

Before plotting, we’ll make the month labels more legible, convert the WSIM-GLDAS return period class into a factor, and set the WSIM-GLDAS class palette.

Coding Review

Factors are the most common way to handle categorical data in R. Although converting your categorical variables into factors is not always the best choice, in many instances (especially plotting with ggplot2) the benefits will outweigh any annoyances. To learn more about factors in R, check out Hadley Wickham’s chapter on factors in R for Data Science (2nd Edition).

# ggplot is easier with factors
pop_by_rp$return_period<-
  factor(pop_by_rp$return_period, 
         levels = c("0", "-3", "-5", "-10", "-20", "-40", "-50", "-60"))
# create the palette to pass to ggplot
leg_colors<-c(
    # -60 to -50
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#fffdc7',
    # -3 to 0
    '#FFF4C7',
    # 0-3
    "#FFFFFF")

# pretty month labels
pop_by_rp[,month:=lubridate::month(pop_by_rp$month, label = TRUE)]

Now we can put it all together into a visualization.

# add the base data
ggplot2::ggplot(pop_by_rp, 
                # set plot wide aesthetics
                ggplot2::aes(x = month, 
                             y = pop_frac,
                             group = return_period, 
                             fill = return_period))+
  # determine the overlay and positioning of each wsim-class' bar
  ggplot2::geom_bar(stat = "identity", 
                    position = "stack", 
                    color = "darkgrey")+
  # set palettes on the wsim class groups
  # rev() the order of the colors so we can have -60 on the bottom
  ggplot2::scale_fill_manual(values = rev(leg_colors))+
  # limit y axis to 0/1
  ggplot2::ylim(0,1)+
  # labels
  ggplot2::labs(title = "Monthly Fraction of Population Under Water Deficits in California During 2014",
                subtitle = "Categorized by Intensity of Deficit Return Period",
                x = "",
                y = "Fraction of Population*",
                caption = "*Population derived from Gridded Population of the World (2015)",
                color = "Return Period", fill = "Return Period", group = "Return Period", alpha = "Return Period")+
  # use basic theme
  ggplot2::theme_minimal()

This figure really illustrates the human impact of the 2014 drought. Nearly 100% of the population was under a 60+ year deficit in January followed by 66% in May and approximately 40% for the remainder of the summer. That is a devastating drought!

Knowledge Check
  1. Gridded population datasets… (select all that apply)
    1. Show where river tributaries are.
    2. Model the distribution of the global human population in raster grid cells.
    3. Allow analyses of the number of persons impacted by a hazard such as drought.
    4. Vary in the extent to which ancillary data are used in their production.

Congratulations! In this Lesson You Learned How To…

  • Identify areas of severe drought and select these areas for further analysis.
  • Summarize data by county using the exactextractr tool.
  • Integrate WSIM-GLDAS deficit, GPW population, and geoBoundaries administrative boundary data to create complex time series visualizations.

Lesson 2

In this lesson we explored the California drought of 2014. In our next lesson, we will examine near real-time flood data in California using the MODIS data product.

Lesson 2: Moderate Resolution Imaging Spectroradiometer (MODIS) Near-Real Time (NRT) flood data

References

Bland, Alastair. 2014. “California Drought Has Wild Salmon Competing with Almonds for Water.” National Public Radio. https://www.npr.org/sections/thesalt/2014/08/21/342167846/california-drought-has-wild-salmon-competing-with-almonds-for-water.
Center For International Earth Science Information Network-CIESIN-Columbia University. 2018. “Gridded Population of the World, Version 4 (GPWv4): Population Count, Revision 11.” Palisades, NY: NASA Socioeconomic Data; Applications Center (SEDAC). https://doi.org/10.7927/H4JW8BX5.
Baston, Daniel. 2023. Exactextractr: Fast Extraction from Raster Datasets Using Polygons. https://isciences.gitlab.io/exactextractr/.
“Evidence Suggests California’s Drought Is the Worst in 1,200 Years.” 2014. Woods Hole Oceanographic Institution. https://phys.org/news/2014-12-evidence-california-drought-worst-years.html.
Leyk, Stefan, Andrea E. Gaughan, Susana B. Adamo, Alex de Sherbinin, Deborah Balk, Sergio Freire, Amy Rose, et al. 2019. “The Spatial Allocation of Population: A Review of Large-Scale Gridded Population Data Products and Their Fitness for Use.” Earth System Science Data 11 (3): 1385–1409. https://doi.org/10.5194/essd-11-1385-2019.
Sanders, Sam. 2014. “Despite California’s Drought, Taps Still Flowing in LA County.” National Public Radio. https://www.npr.org/2014/07/20/333019977/despite-californias-drought-taps-still-flowing-in-la-county.