# read in the wsim-gldas layer from SEDAC
<- stars::read_stars("data/composite_12mo.nc", proxy = FALSE, sub = 'deficit')
wsim_gldas # check the basic info.
print(wsim_gldas)
Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Dataset Exploration and Visualizations
Overview
In our previous lesson, Acquiring and Pre-Processing the Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Dataset, we downloaded components of WSIM-GLDAS from SEDAC, subset the data using vector boundaries from geoBoundaries, performed a visual check, and wrote the file to disk. In this lesson, we will extend our work with WSIM-GLDAS by introducing an additional integration (temporal aggregation) period, calculating simple summary statistics, integrating WSIM-GLDAS with the Gridded Population of the World (GPW) data, and developing more complex visualizations.
Learning Objectives
After completing this lesson, you should be able to:
- Subset the WSIM-GLDAS raster data for a region and time period of interest.
- Perform visual exploration with histograms.
- Integrate gridded population data with WSIM-GLDAS data to perform analyses and construct visualizations to understand how people are impacted.
- Make choropleth maps visualizing WSIM-GLDAS data by administrative vector boundaries.
- Summarize WSIM-GLDAS and population raster data using zonal statistics.
Introduction
This lesson uses the stars
, sf
, dplyr
, lubridate
exactextractr
, ggplot2
, terra
, data.table
, and kableExtra packages. If you’d like to learn more about the functions used in this lesson you can use the help guides on their package websites.
Load Data
We’ll begin with the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period Composite Anomaly Twelve-Month Return Period file from SEDAC. We will spatially subset the data to cover only the Continental United States (CONUSA) which will help to minimize our memory footprint. We can further reduce our memory overhead by reading in just the variable we want to analyze. In this instance we can read in just the deficit
attribute from the WSIM-GLDAS Composite Anomaly Twelve-Month Return Period file, rather than reading the entire NetCDF with all of its attributes.
::: column-margin ::: {.callout-tip style=“color: #5a7a2b;”} ## Coding Review Random Access Memory (RAM) is where data and programs that are currently in use are stored temporarily. It allows the computer to quickly access data, making everything you do on the computer faster and more efficient. Unlike the hard drive, which stores data permanently, RAM loses all its data when the computer is turned off. RAM is like a computer’s short-term memory, helping it to handle multiple tasks at once.
:::
For this exercise, we want to explore droughts in the continental United States for the years 2000-2014. In the 12-month return period dataset, each monthly time step is a moving average, or an average of the previous 12 months. Therefore, if we wish to view a snapshot of drought for a given calendar year we need to use the December timestep which would include the 12 months of only that calendar year. We can create a data set of annualized monthly averages starting with December 2000 and ending in December 2014 using the ymd()
function of the lubridate package and the filter()
function of the dplyr package as demonstrated below.
# generate a vector of dates for subsetting
<-seq(lubridate::ymd("2000-12-01"),
keeps::ymd("2014-12-01"),
lubridateby = "year")
#change data type to POSIXct
<- as.POSIXct(keeps)
keeps
# filter using that vector
<- dplyr::filter(wsim_gldas, time %in% keeps)
wsim_gldas print(wsim_gldas)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
deficit -60 -14.7116 -7.280078 -12.64074 -3.486546 4.991005 94083
dimension(s):
from to offset delta refsys values x/y
x 1 1440 -180 0.25 WGS 84 NULL [x]
y 1 600 90 -0.25 WGS 84 NULL [y]
time 1 15 NA NA POSIXct 2000-12-01,...,2014-12-01
Next, we can clip the WSIM-GLDAS dataset using the USA country boundary from geoBoundaries. As in lesson 1, we acquire the vector boundary data using the geoBoundaries API.
#directly acquire the boundary from geoBoundaries API
# request the usa data and metadata
<- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
usa # parse the content of the request to find the geojson download link
<- httr::content(usa)
usa # read in the geojson directly from geoBoundaries
<- sf::st_read(usa$gjDownloadURL) usa
Reading layer `geoBoundaries-USA-ADM1' from data source
`https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM1/geoBoundaries-USA-ADM1.geojson'
using driver `GeoJSON'
Simple feature collection with 56 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1489 ymin: -14.54869 xmax: 179.7785 ymax: 71.36516
Geodetic CRS: WGS 84
# remove everything not part of CONUSA
<-
dropsc("Alaska", "Hawaii",
"American Samoa",
"Puerto Rico",
"Commonwealth of the Northern Mariana Islands",
"Guam",
"United States Virgin Islands")
<-usa[!(usa$shapeName %in% drops),]
usa
# rename te time dimension to something more friendly
<-wsim_gldas[usa] |>
wsim_gldas::st_set_dimensions("time", values = as.character(seq(2000,2014))) stars
In the preprocessing step of the data science lifecycle, it is important to periodically check the results. Now we’ll verify the pre-processing steps with the print()
function. Note that the raster object now contains 15 timesteps from December 2000 to December 2014.
# check the basic information again
print(wsim_gldas)
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
deficit -60 -9.821823 -4.206245 -7.941395 -2.086784 60 153810
dimension(s):
from to offset delta refsys values x/y
x 221 453 -180 0.25 WGS 84 NULL [x]
y 163 262 90 -0.25 WGS 84 NULL [y]
time 1 15 NA NA NA 2000,...,2014
You will want to review the printout to make sure it looks okay.
Does it contain the variables you were expecting?
Do the values for the variables seem plausible?
Other basic descriptive analyses are useful to verify and understand your data. One of these is to produce a frequency distribution (also known as a histogram), which is reviewed below.
- What are the two ways we reduced our memory footprint when we loaded the WSIM-GLDAS data set for this lesson? (select all that apply)
- Spatially subsetting the data to the CONUSA
- Reading in only one year of data
- Reading in only the attribute of interest (i.e., ‘deficit’)
- All of the above.
Annual CONUSA Time Series
The statistical properties reviewed in the previous step are useful for exploratory data analysis, but we should also inspect the data’s spatial characteristics. We can start our visual exploration of annual drought in the CONUSA by creating a map visualization depicting the deficit return period for each of the years in the subset dataset we loaded in the previous step.
# load the base data of the usa boundary
::ggplot(usa)+
ggplot2# plot the stars wsim object
::geom_stars(data = wsim_gldas)+
stars# set equal coordinates for axes
::coord_equal()+
ggplot2# create multiple panels for each time step
::facet_wrap(~time)+
ggplot2# plot the usa boundary with just the outline
::geom_sf(fill = NA)+
ggplot2# add the palette for wsim-gldas
::scale_fill_stepsn(
ggplot2colors = c(
'#9B0039',
# -50 to -40
'#D44135',
# -40 to -20
'#FF8D43',
# -20 to -10
'#FFC754',
# -10 to -5
'#FFEDA3',
# -5 to -3
'#FFF4C7',
# -3 to 0
'#FFFFFF'),
# set the breaks on the palette
breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
# add plot labels
::labs(
ggplot2title="Annual Mean Deficit Anomalies for the CONUSA",
subtitle = "Using Observed 12 Month Integrated WSIM-GLDAS Data for 2000-2014",
fill = "Deficit Return Period"
+
)# set the minimal theme
::theme_minimal()+
ggplot2# turn off some extra graphical elements
::theme(
ggplot2axis.title.x=ggplot2::element_blank(),
axis.text.x=ggplot2::element_blank(),
axis.ticks.x=ggplot2::element_blank(),
axis.title.y=ggplot2::element_blank(),
axis.text.y=ggplot2::element_blank(),
axis.ticks.y=ggplot2::element_blank())
This visualization shows that there were several significant drought events (as indicated by return-period values) throughout 2000-2014. Significant drought events included the southeast in 2000, the southwest in 2002, the majority of the western 3rd in 2007, Texas-Oklahoma in 2011, Montana-Wyoming-Colorado in 2012, and the entirety of the California coast in 2014. The droughts of 2012 and 2011 are particularly severe and widespread with return periods greater than 50 years covering multiple states. Based on historical norms, we should only expect droughts this strong every 50-60 years!
- The output maps of Annual Mean Deficit Anomalies for the CONUSA indicate that…
- The most significant deficit in 2004 is located in the western United States.
- The least significant deficit 2003 is located in the Midwest.
- The most significant deficit in 2011 is located around Texas and neighboring states.
- The least significant deficit in 2000 is located in the southeast.
Monthly Time Series
We can get a more detailed look at these drought events by using the 1-month composite WSIM-GLDAS dataset and clipping the data to a smaller spatial extent. Let’s examine the 2014 California drought.
The California drought of 2012-2014 was the worst in 1,200 years (“Evidence Suggests California’s Drought Is the Worst in 1,200 Years” 2014). This drought caused problems for homeowners, and even conflicts between farmers and wild salmon! Governor Jerry Brown declared a drought emergency and called on residents to reduce water intake by 20%. Water use went up by 8% in May of 2014 compared to 2013, in places like coastal California and Los Angeles. Due to the water shortages, the state voted to fine water-wasters up to $500 dollars. The drought also affected residents differently based on economic status. For example, in El Dorado County, located in a rural area east of Sacramento, residents were taking bucket showers and rural residents reported wells, which they rely on for fresh water, were drying up. The federal government eventually announced a $9.7 million emergency drought aid for those areas (Sanders 2014).
Additionally, there were thousands of adult salmon struggling to survive in the Klamath River in Northern California, where water was running low and warm due to the diversion of river flow into the Central Valley, an agricultural area that grows almond trees. Almonds are one of California’s most important crops, with the state producing 80% of the world’s almonds. However, salmon, which migrate upstream, could get a disease called gill rot, which flourishes in warm water and already killed tens of thousands of Chinook in 2002. This disease was spreading through the salmon population again due to this water allocation, affecting local Native American tribes that rely on the fish (Bland 2014).
In order to limit the amount of computing memory required for the operation, we will first clear items from the in-memory workspace and then reload a smaller composite file, we’ll start by removing the 12-month composite object.
# remove the large wsim object from the environment
rm(wsim_gldas)
Now let’s load the composite 1-month file from SEDAC into the workspace.
# clear the memory
#gc()
# read in the 1 month wsim-gldas data
<- stars::read_stars("data/composite_1mo.nc", sub = 'deficit', proxy = FALSE) wsim_gldas_1mo
deficit,
# check the basic info
print(wsim_gldas_1mo)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
deficit -60 -29.83644 -2.461974 -16.9758 -2.461974 34.54536 87340
dimension(s):
from to offset delta refsys values x/y
x 1 1440 -180 0.25 WGS 84 NULL [x]
y 1 600 90 -0.25 WGS 84 NULL [y]
time 1 804 NA NA POSIXct 1948-01-01,...,2014-12-01
Once again, we’ll subset the time dimension for our period of interest. This time we want every month for 2014.
# generate a vector of dates for subsetting
<-seq(lubridate::ymd("2014-01-01"),
keeps::ymd("2014-12-01"),
lubridateby = "month")
#change data type to POSIXct
<- as.POSIXct(keeps)
keeps # filter using that vector
<- dplyr::filter(wsim_gldas_1mo, time %in% keeps)
wsim_gldas_1mo # check the info
print(wsim_gldas_1mo)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
deficit -60 -12.92903 -6.746436 -14.71285 -6.746436 15.66717 87340
dimension(s):
from to offset delta refsys values x/y
x 1 1440 -180 0.25 WGS 84 NULL [x]
y 1 600 90 -0.25 WGS 84 NULL [y]
time 1 12 NA NA POSIXct 2014-01-01,...,2014-12-01
Now we have 12 rasters with monthly data for 2014. Let’s zoom in on California and see how this drought progressed over the course of the year.
# isolate only the california border
<-usa[usa$shapeName=="California",]
california# subset wsim-gldas to the extent of the california boundary
<- wsim_gldas_1mo[california]
wsim_gldas_california # give the time dimension pretty labels
<-
wsim_gldas_california |>
wsim_gldas_california ::st_set_dimensions("time", values = month.name)
stars
# monthly plots of california
# load the base california data
::ggplot(california)+
ggplot2# load the wsim stars object
::geom_stars(data = wsim_gldas_california)+
stars# equal aspect ratio for lat/long axe
::coord_equal()+
ggplot2# create a subplot for each time step
::facet_wrap(~time)+
ggplot2# only plot the outline of california
::geom_sf(fill = NA)+
ggplot2# set the wsim palette
::scale_fill_stepsn(
ggplot2colors = c(
'#9B0039',
# -50 to -40
'#D44135',
# -40 to -20
'#FF8D43',
# -20 to -10
'#FFC754',
# -10 to -5
'#FFEDA3',
# -5 to -3
'#FFF4C7',
# -3 to 0
'#FFFFFF'),
# set the wsim breaks
breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
# add labels
::labs(
ggplot2title="Deficit Anomalies for California",
subtitle = "Using Observed Monthly WSIM-GLDAS Data for 2014",
fill = "Deficit Return Period"
+
)# start with a base minimal theme
::theme_minimal()+
ggplot2# remove additional elements to make a cleaner map
::theme(
ggplot2axis.title.x=ggplot2::element_blank(),
axis.text.x=ggplot2::element_blank(),
axis.ticks.x=ggplot2::element_blank(),
axis.title.y=ggplot2::element_blank(),
axis.text.y=ggplot2::element_blank(),
axis.ticks.y=ggplot2::element_blank())
This series of maps shows a startling picture. California faced massive water deficits throughout the state in January and February. This was followed by water deficits in the western half of the state in May-August. Although northern and eastern California saw some relief by September, southwest California continued to see deficits through December.
Monthly Histograms
A data frame is a data structure used for storing tabular data. It organizes data in rows and columns. Each column can have a different type of data (numeric, character, factor, etc.), and rows represent individual observations or cases. Data frames provide a convenient way to work with structured data, making them essential for data analysis and statistics projects.
We can explore the data further by creating a frequency distribution (also called a histogram) of the deficit anomalies for any given spatial extent; here we are still looking at the deficit anomalies in California. We start by extracting the data from the raster time series and then create a data frame of values that are easier to manipulate into a histogram. R data frames are data displayed in table format, which can be plotted on graphs or charts.
# extract the raster values into a dataframe
<-
deficit_hist |>
wsim_gldas_california as.data.frame(wsim_gldas_california$deficit)
# remove the NA values
<-stats::na.omit(deficit_hist)
deficit_hist
# create the histogram
::ggplot(deficit_hist, ggplot2::aes(deficit))+
ggplot2# style the bars
::geom_histogram(binwidth = 6, fill = "#325d88")+
ggplot2# subplot for each timestep
::facet_wrap(~time)+
ggplot2# use the minimal theme
::theme_minimal() ggplot2
The histograms start to quantify what we saw in the time series maps. Whereas the map shows where the deficits occur, the frequency distribution indicates the number of raster cells for each return period of the deficit range. The number of raster cells under a 60-year deficit (return period) is very high in most months, far exceeding any other value in the range.
:::{.callout-note} ## Knowledge Check 5. What’s another term for ‘histogram’? a. Choropleth b. Frequency Distribution c. Data array d. Box plot
Zonal Summaries
The previous section describes the 2014 California drought, examining the state as a whole. Although we have a sense of what’s happening in different cities or counties by looking at the maps, the maps do not provide quantitative summaries of those local areas.
Zonal statistics are one way to summarize the cells of a raster layer that lie within the boundary of another data layer (which may be in either raster or vector format). For example, aggregating deficit return periods with another raster depicting land cover type or a vector boundary (shapefile) of countries, states, or counties, will produce descriptive statistics by that new layer. These statistics include: sum, mean, median, standard deviation, and range.
For this lesson, we begin by calculating the mean deficit return period by California counties. First, we retrieve a vector data set of California counties from the geoBoundaries API. Since geoBoundaries does not attribute which counties belong to which states, we utilize a spatial operation called intersect in order to select only those counties in California.
# get the adm2 (county) level data for the USA
# request the adm2 usa information
<- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM2/")
cali_counties # parse the content to find the direct download link
<- httr::content(cali_counties)
cali_counties # read in the geojson directly from geoboundaries
<- sf::st_read(cali_counties$gjDownloadURL) cali_counties
Reading layer `geoBoundaries-USA-ADM2' from data source
`https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/USA/ADM2/geoBoundaries-USA-ADM2.geojson'
using driver `GeoJSON'
Simple feature collection with 3233 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1474 ymin: -14.55254 xmax: 179.7785 ymax: 71.35256
Geodetic CRS: WGS 84
# geoBoundaries does not list which counties belong to which state so you need to run an intersection
# intersect the usa adm2 data with just the california boundary
<-sf::st_intersection(cali_counties, california)
cali_counties# plot the result
plot(sf::st_geometry(cali_counties))
The output of that intersection looks as expected. As noted above, in general a visual and/or tabular check on your data layers is always a good idea. If you expect 50 counties in a given state, you should see 50 counties resulting from your intersection of your two layers, etc. You may want to be on the look out for too few (such as an island area that may be in one layer but not the other) or too many counties (such as those that intersect with a neighboring state).
The exactextractr (Daniel Baston 2023) R package summarizes raster values over groupings, or zones, also known as zonal statistics. Zonal statistics help in assessing the statistical characteristics of a certain region.
The terra R package processes raster geospatial data, offering functionalities such as data manipulation, spatial analysis, modeling, and visualization, with a focus on efficiency and scalability.
We will perform our zonal statistics using the exactextractr
package (Daniel Baston 2023). It is the fastest, most accurate, and most flexible zonal statistics tool for the R programming language, but it currently has no default methods for the stars
package, so we’ll switch to terra
for this portion of the lesson.
Read the data back in with terra.
# remove the previous 1 month object
rm(wsim_gldas_1mo)
# create a terra collection with the wsim-gldas 1 month file from SEDAC
<-terra::sds("data/composite_1mo.nc") wsim_gldas_1mo
Subset the data along the time dimension.
# pull out just the deficit layer
<-wsim_gldas_1mo["deficit"]
wsim_gldas_1mo# create a sequence of dates to keep
<-seq(lubridate::ymd("2014-01-01"), lubridate::ymd("2014-12-01"), by = "month")
keeps# subset the terra object for only those time steps
<-wsim_gldas_1mo[[terra::time(wsim_gldas_1mo) %in% keeps]]
wsim_gldas_1mo# label the time steps
names(wsim_gldas_1mo) <- keeps
# check the info
print(wsim_gldas_1mo)
class : SpatRaster
dimensions : 600, 1440, 12 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : composite_1mo.nc:deficit
varname : deficit (Composite Deficit Index)
names : 2014-01-01, 2014-02-01, 2014-03-01, 2014-04-01, 2014-05-01, 2014-06-01, ...
time (days) : 2014-01-01 to 2014-12-01
# run the extraction
<-
cali_county_summaries::exact_extract(
exactextractr# the raster data
wsim_gldas_1mo, # the boundary data
cali_counties, # the calculation
'mean',
# don't show progress
progress = FALSE)
# give the timestep prettier labels
names(cali_county_summaries)<-lubridate::month(keeps, label = TRUE, abbr = FALSE)
exactextractr
will return summary statistics in the same order of the input boundary file, therefore we can join the California county names to the exactextract summary statistics output for visualization. We will also make a version to view as a table to inspect the raw data. We can take a quick look at the first 10 counties to see their mean deficit return period for January-June.
# bind the extracted means with the california boundary
<-cbind(cali_counties, cali_county_summaries)
cali_counties# prepare a version to create a table
<-cbind(County=cali_counties$shapeName,
cali_county_tableround(cali_county_summaries))
# create the table with only the first 10 rows and 7 columns
::kbl(cali_county_table[c(1:10),c(1:7)]) |>
kableExtra::kable_styling(
kableExtrabootstrap_options = c("striped", "hover", "condensed"))
County | January | February | March | April | May | June |
---|---|---|---|---|---|---|
Alpine | -42 | -16 | -14 | -8 | -28 | -29 |
Mohave | -38 | -10 | -10 | -13 | -60 | -21 |
Tuolumne | -46 | -19 | -16 | -8 | -42 | -24 |
Tehama | -58 | -50 | -24 | -21 | -28 | -47 |
Trinity | -60 | -46 | -46 | -26 | -44 | -60 |
Curry | -60 | -56 | -26 | -4 | -9 | -55 |
San Benito | -60 | -60 | -56 | -23 | -60 | -60 |
Josephine | -60 | -25 | -33 | -14 | -18 | -35 |
Sierra | -19 | -19 | -11 | -11 | -18 | -28 |
El Dorado | -44 | -17 | -12 | -9 | -34 | -29 |
This confirms the widespread distribution of high deficit values (all the bright red) in our exploratory maps. The data is currently in wide format, which makes for easy viewing of a time series, but more advanced programmatic visualization typically requires data to be in a normalized, or long, format (more on that later).
- What can zonal statistics be used for?
- Subsetting data to a time period of interest.
- Creating a time series.
- Creating summary statistics from the values of cells that lie within a boundary.
- All of the above.
County Choropleths
Now that we’ve inspected the raw data we can make a choropleth out of the mean deficit return period data. We previously demonstrated more complex maps using ggplot2, sf, and stars, but you can also make quick plots of sf objects with the base plotting function. By default sf will make a map for every column listed in the dataset. In this case we only want to look at the monthly means so we will just plot columns 11 through 23. You can make simple alterations to the color palette and position of the legend, but custom map titles and legend titles are not easily accomplished with multi-panel maps (multiple maps in one) like the one pictured below.
# plot the data for a check using only the 12 monthly summaries in columns 11 through 23
plot(cali_counties[c(11:23)],
# the palette we've been using
pal = c(
'#9B0039',
# -50 to -40
'#D44135',
# -40 to -20
'#FF8D43',
# -20 to -10
'#FFC754',
# -10 to -5
'#FFEDA3',
# -5 to -3
'#FFF4C7',
# -3 to 0
'#FFFFFF'),
# the breaks we want
breaks = c(-61, -50, -40, -20, -10,-5,-3, 5),
# where we're placing the legend
key.pos = 1,
# how thick we want the legend to be
key.width = 0.3)
Due to the widespread water deficits in the raw data, the mean values do not appear much different from the raw deficit raster layer, however, choropleth maps, also called thematic maps, can make it easier for users to survey the landscape by visualizing familiar geographies (like counties) that place themselves and their lived experiences alongside the data.
While this paints a striking picture of widespread water deficits, how many people are affected by this drought? Although the land area appears rather large, if one is not familiar with the distribution of population and urban centers in California it can be difficult to get a sense of the direct human impact. (This is partly because more populous locations are usually represented by smaller land areas and the less populous locations are usually represented by large administrative boundaries containing much more land area. Normalizing a given theme by land area may be something an analyst wants to do but we cover another approach below.)
- Choropleth maps, aka thematic maps, are useful for
- Visualizing data in familiar geographies like counties or states
- Finding directions from one place to another
- Visualizing data in uniform geographic units like raster grid cells.
- Calculating return periods.
Integrating Population Data
Gridded Population of the World (GPW) is a data collection from SEDAC that models the distribution of the global human population as counts and densities in a raster format (Center For International Earth Science Information Network-CIESIN-Columbia University 2018).We will take full advantage of exactextractr to integrate across WSIM-GLDAS, geoBoundaries, and GPW. To begin, we need to download the 15-minute resolution (roughly 30 square kilometer at the equator) population density data for the year 2015 from GPWv4. This most closely matches our time period (2014) and the resolution of WSIM-GLDAS. Although in many applications one might choose to use GPW’s population count data layers, because we are using exactextractr we can achieve more accurate results (especially along coastlines) by using population density in conjunction with land area estimates from the exactextractr package.
The Gridded Population of the World Version 4 is available in multiple target metrics (e.g. counts, density), periods (2000, 2005, 2010, 2015, 2020), and spatial resolutions (30 sec, 2.5 min, 15 min, 30 min, 60 min). Read more about GPW at the collection home page on SEDAC. GPW is one of four global gridded population datasets available in raster format. These data sets vary in the degree to which they use additional information as ancillary variables to model the spatial distribution of the population from the administrative units (vector polygons) in which they originate. A review of these data sets and their underlying models is found in a paper by Leyk and colleagues (Leyk et al. 2019). You can learn more about gridded population data sets at POPGRID.org. Fitness-for-use is an important principle in determining the best dataset to use for a specific analysis. The question we ask here is — what is the population exposure to different levels of water deficit in California? — uses spatially coarse inputs and is for a place with high-quality data inputs, GPW is a good choice for this analysis. Users with vector-format census data (at the county or sub-county level) could also adopt this approach for those data. In the case of California, the US Census data and GPW will produce nearly identical estimates because GPW is based on the census inputs.
Load in the population density layers.
# read in GPW with terra
<-terra::rast("data/gpw_v4_population_density_rev11_2015_15_min.tif")
pop_dens# check the basic information
print(pop_dens)
class : SpatRaster
dimensions : 720, 1440, 1 (nrow, ncol, nlyr)
resolution : 0.25, 0.25 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : gpw_v4_population_density_rev11_2015_15_min.tif
name : gpw_v4_population_density_rev11_2015_15_min
For this example we’ll classify the WSIM-GLDAS deficit return period raster layer into eight categories. Binning the data will make it easier to manage the output and interpret the results.
# set the class breaks row-wise (from, to, new label)
# e.g. first row states: "all return periods from 0 to 5 will now be labeled 0
<- c(0, 5, 0,
m -3, 0, -3,
-5, -3, -5,
-10, -5, -10,
-20, -10, -20,
-40, -20, -40,
-50, -40, -50,
-65, -50, -60)
# convert the vector into a matrix
<- matrix(m, ncol=3, byrow=TRUE)
rclmat # classify the data
<-
wsim_gldas_1mo_class ::classify(wsim_gldas_1mo, rclmat, include.lowest = TRUE) terra
In our previous example, we used exactextractr’s built-in 'mean'
function, but we can pass other custom functions to exactextractr that will carry out several operations at once as well. The following code could be combined into a single function passed to exactextractr, but it is presented here as multiple functions in order to follow along more easily. You can read more about exactextractr arguments in the package help guide. The key arguments to be aware of are the calls to:
weights = pop_dens
: summarizes each WSIM-GLDAS cell’s deficit return period with the corresponding population density value.coverage_area = TRUE
: calculates the corresponding area of the WSIM-GLDAS raster cell that is covered by the California boundary.
# run the extraction
<-
pop_by_rp ::exact_extract(wsim_gldas_1mo_class, california, function(df) {
exactextractr<- data.table::setDT(df)
df
}, # convert output to data frame
summarize_df = TRUE,
# specify the weights we're using
weights = pop_dens,
# return the coverage area (m^2)
coverage_area = TRUE,
# return the county name with the output data
include_cols = 'shapeISO',
# don't show progress
progress = FALSE)
This returns a data.frame
with a row for every raster cell in the WSIM-GLDAS layer that is overlapped by the California boundary. Let’s take a look at the first 6 rows.
head(pop_by_rp)
shapeISO 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
<char> <num> <num> <num> <num> <num> <num>
1: US-CA -60 -40 -40 -20 -40 -50
2: US-CA -60 -40 -40 -20 -20 -40
3: US-CA -60 -40 -40 -40 -40 -60
4: US-CA -60 -60 -40 -40 -60 -60
5: US-CA -60 -60 -20 -40 -50 -60
6: US-CA -60 -60 -40 -40 -40 -60
2014-07-01 2014-08-01 2014-09-01 2014-10-01 2014-11-01 2014-12-01 weight
<num> <num> <num> <num> <num> <num> <num>
1: -60 -60 -5 -3 -10 -10 14.1623726
2: -60 -50 -3 -3 -10 -20 3.1425104
3: -60 -60 -3 -3 -5 -20 4.3180256
4: -60 -60 -5 -3 -5 -20 15.4701538
5: -60 -60 -10 0 -5 -10 44.1462708
6: -60 -60 -20 0 -5 -10 0.6994646
coverage_area
<num>
1: 569715
2: 1624997
3: 12378942
4: 8855739
5: 13233592
6: 19485454
shapeISO
: The label of the polygon boundary where the cell is located. This was passed on from the California geojson boundary as specified in theinclude_cols = 'shapeISO'
argument. In this instance, it’s not very helpful because we used the state-level California boundary, but if we passed the ADM2 boundary with counties it would provide the name of the county where the cell is located.2014-01-01
to2014-12-01
: The next 12 columns list the deficit return period classification value for the cell in each of the 12 months corresponding to the time dimension of thewsim_gldas_1mo
raster layer.weight
: Theweight
column lists the corresponding population density value (persons per km^2) for that WSIM-GLDAS cell. The WSIM-GLDAS and GPW raster layers have the same projection and resolution. Therefore, because they are perfectly aligned, each WSIM return period cell has a corresponding GPW population weight that can be layered exactly on it.coverage_area
: The total area (m^2) of the WSIM-GLDAS cell that is covered by the California boundary layer. Given the total area of the WSIM cell that is covered, and the GPW persons per unit area weight, we can calculate the number of people estimated to be living within this cell under this WSIM deficit return period.
We will need to perform a few more processing steps to prepare this data.frame
for a time series visualization integrating all of the data. We will use the melt
function to transform the data from wide format to long format in order to produce a visualization in ggplot2
. Specifically, we need to use melt
to make the 12 month columns (2014-01-01
to 2014-12-01
) into 2 new columns: 1) specifying the WSIM-GLDAS deficit return period value and 2) the month it came from.
Converting data from wide to long or long to wide formats is a key component to data processing, however, it can be confusing. To read more about melting/pivoting longer (wide to long) and casting/pivoting wider (long to wide) check out the data.table vignette Efficient reshaping using data.tables and the dplyr pivot_longer
and pivot_wider
reference pages.
# convert the dataset from wide to long (melt)
<-
pop_by_rp ::melt(
data.table# data we're melting
pop_by_rp,# the columns that need to stay columns/ids
id.vars = c("shapeISO", "coverage_area", "weight"),
# name for the month columns we're melting into a single column
variable.name = "month",
# name for the values being converted into a single column
value.name = "return_period")
# check the first 5 rows of melted data
head(pop_by_rp)
shapeISO coverage_area weight month return_period
<char> <num> <num> <fctr> <num>
1: US-CA 569715 14.1623726 2014-01-01 -60
2: US-CA 1624997 3.1425104 2014-01-01 -60
3: US-CA 12378942 4.3180256 2014-01-01 -60
4: US-CA 8855739 15.4701538 2014-01-01 -60
5: US-CA 13233592 44.1462708 2014-01-01 -60
6: US-CA 19485454 0.6994646 2014-01-01 -60
Each row lists the land area (coverage_area) covered by the zone, the population density value (weight) for the zone, the month the deficit return period corresponds to, and the actual deficit return period class for the zone.
Next, we’ll summarize the data by return-period class.
# create new column with population totals for each month and return period combination
# divide by 1000000
<-
pop_by_rp pop_rp = round(sum(coverage_area * weight) / 1e6)), by = .(month, return_period)]
pop_by_rp[, .(# some cells do not have deficit return period values and result in NaN--just remove
<- na.omit(pop_by_rp)
pop_by_rp # check the first 5 rows
head(pop_by_rp)
month return_period pop_rp
<fctr> <num> <num>
1: 2014-01-01 -60 37556929
2: 2014-01-01 -40 718545
3: 2014-01-01 -20 551332
4: 2014-01-01 -50 708659
5: 2014-01-01 -10 11554
6: 2014-01-01 -5 45
Now we have a row for every unique combination of month, return period class, and the total population. We can calculate the percent of the total population represented by this return period class with 2 more lines.
# create a new column with the total population for that month (will be the same for each month but needed to create wsim class fraction)
:= sum(pop_rp), by = month]
pop_by_rp[, total_pop # calculate the fraction of that month-wsim class combination relative to the total population
:= pop_rp / total_pop][, total_pop := NULL]
pop_by_rp[, pop_frac # check the first rows
head(pop_by_rp)
month return_period pop_rp pop_frac
<fctr> <num> <num> <num>
1: 2014-01-01 -60 37556929 9.496768e-01
2: 2014-01-01 -40 718545 1.816936e-02
3: 2014-01-01 -20 551332 1.394116e-02
4: 2014-01-01 -50 708659 1.791938e-02
5: 2014-01-01 -10 11554 2.921582e-04
6: 2014-01-01 -5 45 1.137885e-06
Before plotting we’ll make the month labels more legible for plotting, convert the WSIM-GLDAS return period class into a factor, and set the WSIM-GLDAS class palette.
Factors are the most common way to handle categorical data in R. Although converting your categorical variables into factors is not not always the best choice, in many instances (especially plotting with ggplot2) the benefits will out way any annoyances. To learn more about factors and R check out Hadley Wickham’s chapter on factors in R for Data Science 2nd Edition.
# ggplot is easier with factors
$return_period<-
pop_by_rpfactor(pop_by_rp$return_period,
levels = c("0", "-3", "-5", "-10", "-20", "-40", "-50", "-60"))
# create the palette to pass to ggplot
<-c(
leg_colors'#9B0039',
# -50 to -40
'#D44135',
# -40 to -20
'#FF8D43',
# -20 to -10
'#FFC754',
# -10 to -5
'#FFEDA3',
# -5 to -3
'#fffdc7',
# -3 to 0
'#FFF4C7',
# 0-3
"#FFFFFF")
# pretty month labels
:=lubridate::month(pop_by_rp$month, label = TRUE)] pop_by_rp[,month
Now we can put it all together into a visualization.
# add the base data
::ggplot(pop_by_rp,
ggplot2# set plot wide aesthetics
::aes(x = month,
ggplot2y = pop_frac,
group = return_period,
fill = return_period))+
# determine the overlay and positioning of each wsim-class' bar
::geom_bar(stat = "identity",
ggplot2position = "stack",
color = "darkgrey")+
# set palettes on the wsim class groups
# rev() the order of the order of the colors so we can have -60 on the bottom
::scale_fill_manual(values = rev(leg_colors))+
ggplot2# limit y axis to 0/1
::ylim(0,1)+
ggplot2# labels
::labs(title = "Monthly Fraction of Population Under Water Deficits in California During 2014",
ggplot2subtitle = "Categorized by Intensity of Deficit Return Period",
x = "",
y = "Fraction of Population*",
caption = "*Population derived from Gridded Population of the World (2015)",
color = "Return Period", fill = "Return Period", group = "Return Period", alpha = "Return Period")+
# use basic theme
::theme_minimal() ggplot2
This figure really illustrates the human impact of the 2014 drought. Nearly 100% of the population was under a 60+ year deficit in January followed by 66% in May and approximately 40% for the remainder of the summer. That is a devastating drought!
:::{.callout-note} ## Knowledge Check 1. Gridded population datasets…(select all that apply) a. Show where river tributaries are. b. Model the distribution of the global human population in raster grid cells. c. Allow analyses of the number of persons impacted by a hazard such as drought. d. Vary in the extent to which ancillary data are used in their production.
Congratulations! In this Lesson You Learned How To…
- Identify areas of severe drought and select these areas for further analysis.
- Summarize data by county using the exactextractr tool.
- Integrate WSIM-GLDAS deficit, GPW population, and geoBoundaries administrative boundary data to create complex time series visualizations.
Lesson 2
In this lesson we explored the California drought of 2014. In our next lesson, we will examine near real-time flood data in California using the MODIS data product.
Lesson 2: Moderate Resolution Imaging Spectroradiometer (MODIS) Near-Real Time (NRT) flood data