Exploring SVI, TRI, and Health Outcomes

Author

Joshua Brinks

Published

September 4, 2024

Overview

In this lesson, we will introduce you to three public health and air quality datasets. These include the EPA’s Toxic Release Inventory (TRI), the EPA’s Integrated Compliance Information System for Air (ICIS-AIR), and the CDC’s PLACES health outcomes dataset. You will learn how to form queries and retrieve each dataset from its respective API, process the returned object into a usable format, create visualizations, and perform simple analysis. You will also use NASA’s Social Vulnerability Index (SVI) to examine socioeconomic patterns in the greater Detroit metropolitan area and explore relationships with public health.

Programming Reminder

This lesson uses the Python programming environment.

Learning Objectives

After completing this lesson, you should be able to:

  • Retrieve multiple public health datasets through the EPA and CDC’s APIs.
  • Create multipanel raster maps from multilayered SVI data.
  • Geolocate tabular data using latitude and longitude.
  • Create maps joining geolocated points and basemaps.
  • Create maps with graduated point symbols for point source pollution releases.
  • Interpolate a raster surface from point spatial data.
  • Perform raster math to create specialized indexes.
  • Perform spatial correlation analysis.
Preface: A Note on Data Interpretation and Ethical Considerations

This lesson is designed to introduce you to various environmental, public health, and socioeconomic datasets, and to provide you with the tools to analyze and visualize this information. However, it’s crucial to approach this material with an understanding of its sensitive nature and potential implications.

The topics of environmental justice, institutional racism, socioeconomic conditions, and pollution are complex and multifaceted. The data and analyses presented in this lesson are not intended to draw definitive conclusions or suggest scientific evidence of cause and effect relationships. Rather, they are meant to equip you with the skills to investigate data and perform analyses that could be applied to your local communities.

As you work through this lesson, remember that correlation does not imply causation. The patterns you may observe in the data could be influenced by numerous factors not captured in these datasets. It’s essential to approach your findings with caution and to consider the broader historical, social, and economic contexts that shape environmental and health outcomes in different communities.

This lesson aims to empower you with data literacy and analytical skills. We encourage you to use these tools responsibly, always considering the ethical implications of your analyses and the potential impact on the communities represented in the data. When drawing insights or making decisions based on such analyses, it’s crucial to involve community stakeholders, consider multiple perspectives, and seek additional expertise when necessary.

Introduction

In many urban areas, air quality issues disproportionately affect low-income communities and communities of color. This disparity is a key focus of environmental justice efforts. In cities like Detroit, Chicago, and Cleveland, industrial facilities, highways, and other pollution sources are often concentrated in disadvantaged neighborhoods. Detroit has a history of industrial pollution and is working to address air quality issues in areas like Southwest Detroit, where residents face higher exposure to pollutants from nearby industries and heavy truck traffic. Similarly, Chicago has seen community efforts to address air pollution in areas like the Southeast Side, where residents have fought against polluting industries and advocated for stricter regulations.

The EPA is the primary federal agency responsible for environmental protection in the United States. It sets and enforces air quality standards, including the National Ambient Air Quality Standards (NAAQS) for six criteria pollutants. The EPA also maintains the AirNow system, which provides real-time air quality data to the public. The CDC (Centers for Disease Control and Prevention), while primarily focused on public health, plays a crucial role in understanding the health impacts of air pollution. It conducts research on the relationships between air quality and various health outcomes and provides guidance on protecting public health from environmental hazards.

In response to these challenges, community-driven science initiatives have emerged. These efforts involve local residents in collecting data on air quality and other environmental factors, often using low-cost sensors and mobile monitoring techniques. This approach helps fill gaps in official monitoring networks and empowers communities to advocate for themselves. Open data is crucial for community-driven solutions in several ways:

  • Transparency: Open data allows communities to verify official reports and hold authorities accountable.
  • Accessibility: When air quality data is freely available, communities can use it to inform local decision-making and advocacy efforts.
  • Innovation: Open data enables researchers, activists, and tech developers to create new tools and analyses that can benefit communities.
  • Collaboration: Open data facilitates collaboration between communities, scientists, and policymakers, leading to more effective solutions.

While the EPA and CDC provide federal networks of open-access datasets like the Air Quality System (AQS), TRI, ICIS-AIR, and PLACES collections, community-driven data collection is a large part of driving change in traditionally underrepresented communities. In Chicago, the Array of Things project has installed sensors throughout the city, providing open data on various environmental factors including air quality, while Detroit’s Community Air Monitoring Project uses low-cost sensors to collect and share air quality data in areas underserved by official monitoring stations.

With access to open data, communities can:

  • Identify local air quality hotspots that may be missed by sparse official monitoring networks.
  • Correlate air quality data with health outcomes to strengthen advocacy efforts.
  • Develop targeted interventions, such as promoting indoor air filtration on high-pollution days.
  • Create custom alerts and information systems tailored to local needs.

Although open data provides many benefits, challenges still remain. These include: 1) ensuring data quality and consistency, especially when integrating data from various sources; 2) addressing the “digital divide” to ensure all community members can access and use the data; 3) balancing the need for detailed local data with privacy concerns; and 4) building capacity within communities to effectively use and interpret complex environmental data.

EJ in the News

The Clear the Air Coalition, a new environmental justice advocacy group in Michigan, is calling for stronger protection of vulnerable communities from pollution. Launched in Detroit, the coalition argues that state regulators focus too much on technical compliance with environmental laws rather than public health. They claim the current permitting process favors polluters and fails to consider the cumulative impacts of pollution on overburdened communities. The group seeks changes in state law and a shift in mindset from environmental regulators.

Clear the Air Coalition

While the Michigan Department of Environment, Great Lakes, and Energy (EGLE) states its commitment to protecting underserved communities and notes improvements in air quality over recent decades, coalition members argue that the current approach is insufficient. They call for regulators to consider the combined effects of multiple pollution sources in marginalized areas and to give local governments more power to reject new polluting facilities in already burdened communities. The coalition plans to raise awareness and push for these changes during the upcoming Air Quality Awareness Week.

The Michigan Department of Environment, Great Lakes, and Energy (EGLE) and its Office of the Environmental Justice Public Advocate are among the most active state agencies addressing issues at the intersection of the environment, public health, and social justice. Here are some of the projects enacted in the greater Detroit area:

  • Detroit Environmental Agenda: This community-led initiative works closely with EGLE to address environmental concerns in Detroit. It focuses on issues like air quality, water quality, and waste management.
  • 48217 Community Air Monitoring Project: Named after the zip code of a heavily industrialized area in Southwest Detroit, this project involves community members working with EGLE to monitor air quality using low-cost sensors.
  • Detroit Climate Action Plan: Developed in partnership with EGLE, this plan addresses climate change impacts on the city, with a focus on vulnerable communities.
  • Delray Neighborhood Initiatives: EGLE has been involved in efforts to address air quality concerns in the Delray neighborhood, which is impacted by industrial emissions and heavy truck traffic.
  • Green Door Initiative: This Detroit-based organization collaborates with EGLE on various environmental justice projects, including lead abatement and air quality improvement efforts.
  • Detroit River Sediment Cleanup: EGLE has been involved in efforts to clean up contaminated sediments in the Detroit River, which disproportionately affects nearby low-income communities.
  • Asthma Prevention Programs: EGLE supports community-based asthma prevention programs in Detroit, recognizing the link between air quality and asthma rates in disadvantaged neighborhoods.

These are just a few examples that demonstrate the ongoing collaboration between community groups, local government, and EGLE to address environmental justice concerns in Detroit.

Air quality issues in urban areas disproportionately affect low-income communities and communities of color, making it a critical focus for environmental justice efforts. Federal agencies like the EPA and CDC play crucial roles in setting standards and conducting research, but community-driven science initiatives have emerged as powerful tools for local action. Open data is key to these efforts, enabling transparency, accessibility, innovation, and collaboration. Cities like Detroit and Chicago are at the forefront of these initiatives, with projects that empower residents to monitor and advocate for better air quality. While challenges remain, particularly in data management and accessibility, the collaboration between community groups, local governments, and state agencies like Michigan’s EGLE demonstrates a promising path forward.

Data Analysis and Exercises

To better understand and address these complex issues, we’ll work through a series of coding and data science exercises. These hands-on activities explore some of the open datasets and tools available to community members and stakeholders, offering practical insights into air quality, public health, and social vulnerability.

We’ll begin with the ICIS-AIR dataset, then move on to the TRI Facility dataset, and finally work our way down to TRI Form R. After exploring these three air pollution datasets, we will take a look at the CDC PLACES health outcomes data and perform some simple analysis integrating the pollution and health outcomes data.

Data Science Review

This lesson uses the following Python modules:

  • pandas: Essential for data manipulation and analysis.
  • geopandas: Extends pandas functionality to handle geospatial data.
  • matplotlib: Used for creating static, animated, and interactive visualizations.
  • numpy: Provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.
  • requests: Allows you to send HTTP requests and interact with APIs easily.
  • contextily: Adds basemaps to your plots, enhancing the visual context of your geospatial data.
  • pygris: Simplifies the process of working with US Census Bureau TIGER/Line shapefiles.
  • rasterio: Reads and writes geospatial raster datasets.
  • xarray: Introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.
  • shapely: Manipulation and analysis of geometric objects in the Cartesian plane.
  • scipy: Used for scientific and technical computing, particularly the stats and interpolate submodules.
  • rioxarray: Extends xarray to make it easier to handle geospatial raster data.
  • time: Provides various time-related functions (part of Python’s standard library).
  • tabulate: Used for creating nicely formatted tables in various output formats.
  • pysal: A library of spatial analysis functions.
  • splot: Provides statistical plots for spatial analysis.

If you’d like to learn more about the functions used in this lesson, you can refer to the documentation on their respective websites.

The pandas module is essential for data manipulation and analysis, while geopandas extends its functionality to handle geospatial data. matplotlib is used for creating static, animated, and interactive visualizations. numpy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

The requests module allows you to send HTTP requests and interact with APIs easily. contextily adds basemaps to your plots, enhancing the visual context of your geospatial data. pygris simplifies the process of working with US Census Bureau TIGER/Line shapefiles.

rasterio and xarray are used for working with geospatial raster data. rasterio reads and writes geospatial raster datasets, while xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making it easier to work with labeled multi-dimensional arrays.

shapely is used for manipulation and analysis of geometric objects. scipy provides additional tools for scientific computing, including statistical functions and interpolation methods. rioxarray combines the functionality of rasterio and xarray for easier handling of geospatial raster data.

pysal is a library for spatial analysis, providing tools for exploratory spatial data analysis. splot is a visualization module that works with pysal to create statistical plots for spatial analysis.

Make sure these modules are installed before you begin working with the code in this document.
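
If you want to confirm that everything is available before diving in, a quick sketch like the one below tries to import each module and reports anything missing (these are the import names; time is part of the standard library and is omitted):

import importlib

# Required import names for this lesson
required = [
    "pandas", "geopandas", "matplotlib", "numpy", "requests", "contextily",
    "pygris", "rasterio", "xarray", "shapely", "scipy", "rioxarray",
    "tabulate", "pysal", "splot",
]

missing = []
for name in required:
    try:
        importlib.import_module(name)
    except ImportError:
        missing.append(name)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")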

ICIS Air

ICIS-AIR (Integrated Compliance Information System for Air) is a comprehensive database maintained by the U.S. Environmental Protection Agency (EPA) that focuses specifically on air quality compliance and enforcement data. It tracks information related to stationary sources of air pollution, including their compliance with various Clean Air Act regulations, permit data, and enforcement actions. While ICIS-AIR and the Toxic Release Inventory (TRI) are separate systems, they have significant overlap in their coverage of industrial facilities and air emissions. Many facilities that report to TRI also have data in ICIS-AIR, providing complementary information.

Where TRI focuses on the quantities of specific toxic chemicals released into the air, ICIS-AIR offers a broader picture of a facility’s overall air quality compliance status, including non-toxic pollutants regulated under the Clean Air Act. Together, these systems provide a more comprehensive view of industrial air pollution sources, enabling researchers, regulators, and the public to assess both the types and quantities of air pollutants emitted (from TRI) and the regulatory compliance status of the emitting facilities (from ICIS-AIR).

Data Processing

Defining Our Study Area

Next we’ll query the API for ICIS-AIR regulated facilities in the greater Detroit area, but first we need to create a spatial object defining our study area that we can use throughout this analysis. The U.S. Census Bureau defines the Detroit, MI Metropolitan Statistical Area (MSA) as including 6 counties (Wayne, Macomb, Oakland, Lapeer, Livingston, and St. Clair), but for this lesson we will focus on the core 3 counties (Wayne, Macomb, and Oakland); both population and air emissions related facilities are much more sparse in the outer counties.

We can use pygris to get vector boundaries for the counties and dissolve them into a single boundary we can use for cropping data and restricting API searches.

import geopandas as gpd
import pandas as pd
import pygris
from shapely.geometry import box


counties = ['Wayne', 'Oakland', 'Macomb']

# Fetch Detroit metro area counties using pygris
metro_counties = pygris.counties(state="MI", year=2022)
detroit_metro = metro_counties[metro_counties['NAME'].isin(counties)]

# Dissolve the counties into a single polygon
detroit_metro = detroit_metro.dissolve(by='STATEFP')

# Convert to GeoDataFrame
detroit_metro = gpd.GeoDataFrame(detroit_metro, geometry='geometry', crs='EPSG:4269')

# Get the bounding box
bbox = detroit_metro.total_bounds

# Create the bounding box as a polygon
bbox_polygon = gpd.GeoDataFrame(
    geometry=[box(*bbox)],
    crs=detroit_metro.crs
)
Using FIPS code '26' for input 'MI'

Let’s verify this by overlaying it on a basemap.

import matplotlib.pyplot as plt
import contextily as ctx

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Reproject data to Web Mercator to match the basemap
detroit_metro_bm = detroit_metro.to_crs(epsg=3857)
bbox_polygon_bm = bbox_polygon.to_crs(epsg=3857)

# Plot the metro area and bounding box
detroit_metro_bm.plot(ax=ax, color='lightblue', edgecolor='darkblue', alpha=0.3)
bbox_polygon_bm.boundary.plot(ax=ax, color='grey', linewidth=2)

# Add the basemap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)

# Set the extent of the map to the bounding box
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])

# Remove axes
ax.set_axis_off()

plt.title("Detroit Metro Study Area", fontsize=16)
plt.tight_layout()
plt.show()

This looks correct, and we’ll use this boundary object repeatedly throughout our analysis.

Querying the API

The EPA’s Enforcement and Compliance History Online (ECHO) website serves as a comprehensive portal for environmental compliance and enforcement data. It includes the ECHO Data Service, which provides programmatic access to EPA data through RESTful APIs. Among these is the ICIS-AIR API, which specifically offers access to air quality compliance and enforcement data. This API allows users to retrieve detailed information about stationary sources of air pollution, including facility details, permit data, emissions reports, and compliance status.

The ECHO API requires that users follow a multi-step workflow to acquire data via the Facility Air API:

  1. Use get_facilities to validate parameters, get summary statistics, and obtain a query_id (QID), valid for about 30 minutes.
  2. Use get_qid with the obtained QID to paginate through facility results.
  3. Use get_map with the QID to visualize and navigate facility locations.
  4. Use get_download with the QID to generate a CSV file of facility information.

The get_facility_info endpoint operates independently, returning either clustered summary statistics or an array of individual facilities. This API structure allows for efficient querying, visualization, and extraction of air quality compliance data, making it a valuable resource for environmental analysis and research.
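
We’ll only need the first two steps in this lesson. For reference, a minimal sketch of step 4 (get_download) might look something like the snippet below; the qid value is a placeholder, and the exact parameters should be confirmed against the ECHO Data Service documentation:

import requests

base_url = "https://echodata.epa.gov/echo/air_rest_services"
qid = "113"  # placeholder: use a QueryID returned by get_facilities within the last ~30 minutes

# Request the CSV download for the facilities matched by the QID and save it to disk
response = requests.get(f"{base_url}.get_download", params={"qid": qid})
if response.status_code == 200:
    with open("icis_air_facilities.csv", "wb") as f:
        f.write(response.content)
    print("Saved facility download to icis_air_facilities.csv")
else:
    print(f"Download failed with status code {response.status_code}")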

We’ll begin by querying this database for just the Michigan facilities, following the workflow detailed above. From there we’ll subset to our list of counties.

We start with the base url for the ECHO API and our parameter list.

# Base URL for ECHO ICIS-AIR API
base_url = "https://echodata.epa.gov/echo/air_rest_services"

# Parameters for the initial API call
params = {
    "output": "JSON",
    "p_st": "MI"
}
Data Science Review

An Application Programming Interface (API) is a set of protocols, tools, and definitions that enable different software applications to communicate with each other. It acts as an intermediary, allowing one application to request specific data or actions from another application or service without needing access to its entire codebase. APIs define the methods and data formats that applications can use to request and exchange information, typically through specific endpoints (URLs) that accept requests and return responses in standardized formats like JSON. They are crucial for integrating external services, accessing databases, and enabling communication between different parts of larger open science software systems.
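
To make the idea of an endpoint concrete, the short sketch below uses the requests library to build (but not send) the request defined above, so you can see how the base URL and parameter dictionary combine into the final query URL:

import requests

# Prepare the request without sending it, just to inspect the resulting URL
prepared = requests.Request(
    "GET",
    f"{base_url}.get_facilities",
    params=params,
).prepare()

print(prepared.url)
# Something like: https://echodata.epa.gov/echo/air_rest_services.get_facilities?output=JSON&p_st=MI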

Now we define two functions to carry out the first two steps: 1) identify the facilities in our search, and 2) retrieve the facility data.

import requests

# Define the function to query the API for the facilities of interest
def get_facilities():
    response = requests.get(f"{base_url}.get_facilities", params=params)
    if response.status_code == 200:
        data = response.json()
        if 'Results' in data:
            qid = data['Results']['QueryID']
            print(f"Query ID: {qid}")
            print(f"Total Facilities: {data['Results']['QueryRows']}")
            return qid
    print("Failed to get facilities and QID")
    return None

# Define the function to retrieve the relevant data from the facilities of interest
def get_facility_data(qid):
    all_facilities = []
    page = 1
    while True:
        params = {"qid": qid, "pageno": page, "output": "JSON"}
        response = requests.get(f"{base_url}.get_qid", params=params)
        if response.status_code == 200:
            data = response.json()
            if 'Results' in data and 'Facilities' in data['Results']:
                facilities = data['Results']['Facilities']
                if not facilities:  # No more facilities to retrieve
                    break
                all_facilities.extend(facilities)
                print(f"Retrieved page {page}")
                page += 1
            else:
                break
        else:
            print(f"Failed to retrieve page {page}")
            break
    return all_facilities

Now that we’ve defined the functions, we can request the Michigan facilities from the API; we’ll subset to the Detroit metro counties afterward.

# Step 1: Get the Query ID
qid = get_facilities()

if qid:
    # Step 2: Use get_qid to retrieve all facility data
    print("Retrieving facility data...")
    facilities = get_facility_data(qid)
    
    # Convert to DataFrame
    df_icis_air = pd.DataFrame(facilities)
    
    print(f"\nSuccessfully retrieved {len(df_icis_air)} ICIS-AIR facilities for Michigan")
    print("\nColumns in the dataset:")
    print(df_icis_air.columns)
    
else:
    print("Failed to retrieve facility data")
Query ID: 113
Total Facilities: 3395
Retrieving facility data...
Retrieved page 1

Successfully retrieved 3395 ICIS-AIR facilities for Michigan

Columns in the dataset:
Index(['AIRName', 'SourceID', 'AIRStreet', 'AIRCity', 'AIRState',
       'LocalControlRegionCode', 'AIRZip', 'RegistryID', 'AIRCounty',
       'AIREPARegion', 'FacFederalAgencyCode', 'FacFederalAgencyName',
       'FacDerivedHuc', 'FacFIPSCode', 'FacIndianCntryFlg',
       'AIRIndianCntryFlg', 'FacIndianSpatialFlg', 'FacDerivedTribes',
       'FacUsMexBorderFlg', 'FacSICCodes', 'AIRNAICS', 'FacLat', 'FacLong',
       'AIRPrograms', 'AIRMacts', 'AIRStatus', 'AIRUniverse',
       'AIRClassification', 'AIRCmsCategoryCode', 'AIRCmsCategoryDesc',
       'FacDerivedWBD', 'FacDerivedWBDName', 'ChesapeakeBayFlag', 'AIRIDs',
       'CWAIDs', 'RCRAIDs', 'RmpIDs', 'SDWAIDs', 'TRIIDs', 'GHGIDs', 'EisIDs',
       'CamdIDs', 'AIRComplStatus', 'AIRHpvStatus', 'AIRMnthsWithHpv',
       'AIRQtrsWithHpv', 'AIRQtrsWithViol', 'AIRPollRecentViol',
       'AIRRecentViolCnt', 'AIRLastViolDate', 'AIREvalCnt', 'AIRDaysLastEval',
       'AIRLastEvalDate', 'AIRLastEvalDateEPA', 'AIRLastEvalDateState',
       'AIRFceCnt', 'AIRDaysLastFce', 'AIRLastFceDate', 'AIRLastFceDateEPA',
       'AIRLastFceDateState'],
      dtype='object')

There are over three thousand facilities in the ICIS-AIR Michigan dataset. Here are the descriptions for some key columns.

Key Columns in the ICIS-AIR ECHO API Dataset

  • AIRName: The name of the facility
  • SourceID: A unique identifier for the air pollution source
  • AIRStreet, AIRCity, AIRState, AIRZip: Address components of the facility
  • RegistryID: A unique identifier for the facility in the EPA’s registry
  • AIRCounty: The county where the facility is located
  • AIREPARegion: The EPA region responsible for the facility
  • AIRNAICS: North American Industry Classification System code(s) for the facility
  • FacLat, FacLong: Latitude and longitude coordinates of the facility
  • AIRPrograms: Air quality programs applicable to the facility
  • AIRStatus: Current operational status of the facility
  • AIRUniverse: Categorization of the facility within air quality regulation
  • AIRComplStatus: Current compliance status under the Clean Air Act
  • AIRHpvStatus: High Priority Violator status
  • AIRQtrsWithViol: Number of quarters with violations
  • AIRLastViolDate: Date of the most recent violation
  • AIREvalCnt: Count of evaluations conducted
  • AIRLastEvalDate: Date of the most recent evaluation
  • AIRFceCnt: Count of Full Compliance Evaluations
  • AIRLastFceDate: Date of the last Full Compliance Evaluation

These columns provide information about each facility’s location, operational characteristics, compliance history, and regulatory oversight under the Clean Air Act. They include details on violations, evaluations, and compliance status that can be used to assess a facility’s environmental performance and regulatory compliance.
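
As a quick illustration of how these columns can be used, you could tabulate the compliance and High Priority Violator status fields before doing anything spatial (a small sketch; the exact status labels depend on what the API returns):

# Distribution of Clean Air Act compliance and High Priority Violator statuses statewide
print(df_icis_air['AIRComplStatus'].value_counts(dropna=False))
print(df_icis_air['AIRHpvStatus'].value_counts(dropna=False))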

We can check some basic information like how many facilities are in Detroit metro and the breakdown by our 3 counties.

# Subset the dataframe to include only the Detroit metro counties
icis_air_detroit = df_icis_air[df_icis_air['AIRCounty'].isin(counties)]

# Print information about the subset
print(f"Total ICIS-AIR facilities in Michigan: {len(df_icis_air)}")
print(f"ICIS-AIR facilities in Detroit metro area: {len(icis_air_detroit)}")

# Display the count of facilities in each metro county
print("\nFacilities per county:")
print(icis_air_detroit['AIRCounty'].value_counts())
Total ICIS-AIR facilities in Michigan: 3395
ICIS-AIR facilities in Detroit metro area: 903

Facilities per county:
AIRCounty
Wayne      431
Oakland    272
Macomb     200
Name: count, dtype: int64

Next we should check for facilities with missing latitude or longitude coordinates.

# Count records with missing coordinate values
missing_coords = icis_air_detroit[(icis_air_detroit['FacLat'].isnull()) | (icis_air_detroit['FacLong'].isnull())]
print(f"Number of ICIS-AIR records with missing coordinates: {len(missing_coords)}")

# Remove records with missing coordinates
icis_air_detroit = icis_air_detroit.dropna(subset=['FacLat', 'FacLong'])
Number of ICIS-AIR records with missing coordinates: 0

No missing records! That’s great. As you’ll find out later, you won’t always be so lucky.

Now we can create a spatial object, reproject it so it matches the basemap of Detroit, and make a map showing the location of all the ICIS-AIR facilities in Detroit metro.

# Create a GeoDataFrame for ICIS-AIR facilities
gdf_icis_air = gpd.GeoDataFrame(
    icis_air_detroit, 
    geometry=gpd.points_from_xy(icis_air_detroit.FacLong, icis_air_detroit.FacLat),
    crs="EPSG:4326"
)

# Reproject ICIS-AIR data to Web Mercator so it matches the basemap
gdf_icis_air_bm = gdf_icis_air.to_crs(epsg=3857)

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the metro area and bounding box (reusing objects from earlier)
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Plot ICIS-AIR facilities
gdf_icis_air_bm.plot(ax=ax, color='cyan', markersize=50, alpha=0.5, label='ICIS-AIR Facilities', edgecolor = "grey")

# Add the basemap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)

# Set the extent of the map to the bounding box
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])

# Remove axes
ax.set_axis_off()

# Add legend
ax.legend()

plt.title("Detroit Metro Area ICIS-AIR Facilities", fontsize=16)
plt.tight_layout()
plt.show()

print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")

Number of ICIS-AIR facilities plotted: 903

As expected, these facilities are concentrated along business and industrial areas, but we can’t infer much else from this map.

Which regulated facilities have been in violation status? We can plot all the facilities with at least one violation during a reporting quarter and use graduated symbols that reflect the total number of quarters with violations. This information is in the AIRQtrsWithViol column.

# Convert to numeric, replacing non-numeric values with NaN
gdf_icis_air_bm['AIRQtrsWithViol'] = pd.to_numeric(gdf_icis_air_bm['AIRQtrsWithViol'], errors='coerce')

# Subset to facilities with at least one quarter in violation
gdf_icis_air_violators = gdf_icis_air_bm[gdf_icis_air_bm['AIRQtrsWithViol'] > 0]

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the metro area and bounding box (reusing objects from earlier)
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Plot ICIS-AIR facilities with graduated symbols based on quarters in violation
scatter = ax.scatter(gdf_icis_air_violators.geometry.x,
                     gdf_icis_air_violators.geometry.y, 
                     s=gdf_icis_air_violators['AIRQtrsWithViol']*10,  # Adjust the multiplier as needed
                     c='orangered', 
                     edgecolor='yellow', 
                     linewidth=1, 
                     alpha=0.7)

# Add the basemap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)

# Set the extent of the map to the bounding box
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])

# Remove axes
ax.set_axis_off()

plt.title("Detroit Metro Area ICIS-AIR Facilities w/ Violations", fontsize=16)
plt.tight_layout()
plt.show()

We can also create a table showing which businesses have been in violation status, with their name, address, and the date of their most recent violation, using the tabulate module. We’ll limit the table to facilities with at least 10 quarters in violation.

from tabulate import tabulate

# Narrow the violators down to facilities with at least 10 quarters in violation
gdf_icis_air_violators = gdf_icis_air_violators[gdf_icis_air_violators['AIRQtrsWithViol'] >= 10]

# List the columns we want to use
table_columns = ['AIRName', 'AIRStreet', 'AIRCity', 'AIRQtrsWithViol', 'AIRLastViolDate']

# Create a new DataFrame with only the columns we want
table_data = gdf_icis_air_violators[table_columns].copy()

# Ensure AIRLastViolDate is in datetime format
table_data['AIRLastViolDate'] = pd.to_datetime(table_data['AIRLastViolDate'])

# Sort the DataFrame by AIRLastViolDate in descending order
table_data = table_data.sort_values('AIRLastViolDate', ascending=False)

# Create a dictionary to map old column names to pretty column names
pretty_names = {
    'AIRName': 'Name',
    'AIRStreet': 'Street',
    'AIRCity': 'City',
    'AIRQtrsWithViol': 'Violations',
    'AIRLastViolDate': 'Violation Date'
}

# Rename the columns
table_data.rename(columns=pretty_names, inplace=True)

# Format the 'Violation Date' column to a specific date format (optional)
table_data['Violation Date'] = table_data['Violation Date'].dt.strftime('%Y-%m-%d')

# Create the table
table = tabulate(table_data, headers='keys', tablefmt='html', showindex=False)
table 
Name | Street | City | Violations | Violation Date
EES COKE BATTERY L.L.C. | 1400 ZUG ISLAND ROAD | RIVER ROUGE | 12 | 2024-04-23
ROSATI SPECIALTIES | 24200 CAPITAL BLVD. | CLINTON TWP | 12 | 2024-03-21
CARMEUSE LIME INC, RIVER ROUGE OPERATION | 25 MARION AVE | RIVER ROUGE | 12 | 2023-08-17
BASF CORPORATION - CHEMICAL PLANTS | 1609 BIDDLE AVE | WYANDOTTE | 12 | 2023-03-24
MARATHON PETROLEUM COMPANY LP | 1001 S OAKWOOD | DETROIT | 12 | 2023-03-15
FCA US LLC - DETROIT ASSEMBLY COMPLEX | 2101 CONNER AVE | DETROIT | 12 | 2023-01-25
U S STEEL GREAT LAKES WORKS | 1 QUALITY DR | ECORSE | 12 | 2022-11-07
FCA US LLC WARREN TRUCK ASSEMBLY PLANT | 21500 MOUND ROAD | WARREN | 12 | 2022-10-21
NYLOK LLC | 15260 HALLMARK COURT | MACOMB | 12 | 2022-08-29
AJAX METAL PROCESSING INC. | 4651 BELLEVUE AVE | DETROIT | 10 | 2022-06-07
STERLING ENGINE HOLDINGS LLC | 54420 PONTIAC TRAIL | MILFORD | 12 | 2021-06-10
DETROIT RENEWABLE POWER, LLC | 5700 RUSSELL ST | DETROIT | 12 | 2019-08-29
JVISFH, LLC | 23944 FREEWAY PARK DRIVE | FARMINGTN HLS | 12 | 2019-03-22
LEAR CORPORATION DBA EAGLE OTTAWA | 2930 WEST AUBURN RD | ROCHESTER HLS | 11 | 2014-11-10
BASF CORPORATION - PLASTIC PLANTS | 1609 BIDDLE AVE. | WYANDOTTE | 12 | 2014-06-24
WYANDOTTE DEPT MUNI POWER PLANT | 2555 VAN ALSTYNE | WYANDOTTE | 12 | 2014-04-21
POWER SOLUTIONS INTERNATIONAL | 32505 INDUSTRIAL DRIVE | MADISON HTS | 12 | 2013-12-09
HENRY FORD WEST BLOOMFIELD HOSPITAL | 6777 WEST MAPLE ROAD | W BLOOMFIELD | 12 | 2012-05-03
HD INDUSTRIES | 19455 GLENDALE | DETROIT | 12 | 2012-04-18

ICIS-AIR is a powerful tool for communities and policy-makers to identify regulated facilities and their compliance status, but the dataset does have limitations. Next, we’ll take a look at facilities required to report to the Toxic Release Inventory (TRI).

Knowledge Check

Which of the following best describes the ICIS-AIR database?

  1. A database of air quality measurements from monitoring stations
  2. A comprehensive database of air quality compliance and enforcement data
  3. A public health database tracking respiratory illnesses
  4. A database of weather patterns affecting air quality

Toxic Release Inventory

The Toxic Release Inventory (TRI) is a crucial resource in understanding and addressing air quality issues in the United States. Established under the Emergency Planning and Community Right-to-Know Act of 1986, the TRI is a publicly accessible database maintained by the Environmental Protection Agency (EPA). It requires certain industrial facilities to report annually on their releases of toxic chemicals into the environment, including air emissions. The TRI provides valuable data on over 770 chemicals and chemical categories, offering insights into the types and quantities of pollutants released into the air by various industries. This information is vital to researchers, policymakers, and community advocates in assessing local air quality, identifying potential health risks, and developing targeted strategies to reduce toxic air emissions. By making this data publicly available, the TRI plays an important role in promoting transparency and supporting environmental justice initiatives focused on improving air quality in communities across the nation.

Data Review

The Toxics Release Inventory (TRI) and the Integrated Compliance Information System for Air (ICIS-AIR) are two important but distinct environmental reporting systems maintained by the U.S. Environmental Protection Agency (EPA). They have several key differences:

  • Regulatory Basis
    • TRI: Established under the Emergency Planning and Community Right-to-Know Act (EPCRA) of 1986
    • ICIS-AIR: Part of the Clean Air Act (CAA) compliance and enforcement program
  • Focus
    • TRI: Tracks the management of certain toxic chemicals that may pose a threat to human health and the environment
    • ICIS-AIR: Focuses specifically on air quality and emissions from facilities regulated under the Clean Air Act
  • Reported Information
    • TRI: Facilities report on releases, waste management, and pollution prevention activities for specific toxic chemicals
    • ICIS-AIR: Tracks emissions data, compliance status, and enforcement actions related to air quality regulations
  • Facility Coverage
    • TRI: Covers facilities in specific industries that manufacture, process, or use TRI-listed chemicals above certain thresholds
    • ICIS-AIR: Includes a broader range of facilities that emit air pollutants, regardless of the specific chemicals involved
  • Reporting Thresholds
    • TRI: Has specific chemical thresholds that trigger reporting requirements
    • ICIS-AIR: Generally doesn’t have chemical-specific thresholds; requirements are based on overall emissions and facility type
  • Public Accessibility
    • TRI: Designed with a strong focus on public right-to-know, with data easily accessible to the public
    • ICIS-AIR: While public, it’s primarily designed for regulatory and enforcement purposes
  • Data Frequency
    • TRI: Annual reporting is required for covered facilities
    • ICIS-AIR: May involve more frequent reporting, depending on permit requirements and compliance status
  • Scope of Pollutants
    • TRI: Focuses on a specific list of toxic chemicals and chemical categories
    • ICIS-AIR: Covers a wider range of air pollutants, including criteria air pollutants and hazardous air pollutants
  • Use in Environmental Management
    • TRI: Often used for assessing long-term trends in toxic chemical releases and waste management practices
    • ICIS-AIR: More commonly used for day-to-day air quality management and enforcement activities
  • Geographic Coverage
    • TRI: Nationwide program with consistent reporting across states
    • ICIS-AIR: While national, implementation can vary more by state or local air quality management district

Envirofacts API

EnviroFacts is a comprehensive online database and information system maintained by the U.S. Environmental Protection Agency (EPA). It serves as a centralized hub for accessing a wide range of environmental data collected by the EPA and other federal agencies. EnviroFacts integrates information from multiple EPA databases, covering various aspects of environmental health, including air quality, water quality, hazardous waste, and toxic releases. One of the key components of EnviroFacts is the Toxic Release Inventory (TRI) data. Through EnviroFacts, users can easily access and query TRI information, allowing them to investigate toxic chemical releases and waste management activities in their local areas. The integration of TRI data within the broader EnviroFacts system enables researchers, policymakers, and community members to contextualize toxic release information alongside other environmental indicators.

Envirofacts is available as a web-based search and data platform and also as a programmatic API. The web platform is a great way to familiarize yourself with the available datasets and create simple downloads; however, for analytical purposes we recommend learning to navigate the API so you can create repeatable and reliable analyses.

Environmental Justice In the News

The Michigan Department of Environment, Great Lakes and Energy (EGLE) has settled a civil rights complaint regarding a hazardous waste facility in Detroit. The complaint, filed in 2020 by environmental groups and local residents, challenged the renewal and expansion of U.S. Ecology North’s license, arguing it was unjust to increase hazardous waste storage in a predominantly low-income and minority neighborhood. As part of the settlement, EGLE will now consider environmental justice concerns in future licensing decisions for hazardous waste facilities.

The agreement, described as “groundbreaking” by the Sierra Club, introduces several new measures. EGLE will conduct environmental justice and cumulative impact analyses for future licensing decisions, potentially denying licenses that would have an unlawful impact on human health and the environment. The settlement also includes provisions for improved community engagement, such as better translation services, public input processes, and the installation of air monitors around U.S. Ecology North. Additionally, the state will work with local residents to conduct a community health assessment. While some details remain to be clarified, environmental advocates view this as a significant step towards advancing environmental justice in Michigan.

Querying the API

We’ll continue to use the Detroit metro boundary we created earlier. We assigned it the names detroit_metro and detroit_metro_bm (the latter projected to Web Mercator to match basemaps for plotting).

# Print the bounding box coordinates
print("Bounding Box:")
print(f"Minimum X (Longitude): {bbox[0]}")
print(f"Minimum Y (Latitude): {bbox[1]}")
print(f"Maximum X (Longitude): {bbox[2]}")
print(f"Maximum Y (Latitude): {bbox[3]}")
Bounding Box:
Minimum X (Longitude): -83.689438
Minimum Y (Latitude): 42.02793
Maximum X (Longitude): -82.705966
Maximum Y (Latitude): 42.897541

Now we’ll create an empty list to store the TRI data, loop through each county making an API query, and place the retrieved data in the empty container.

# Fetch TRI facility data from EPA API for each county
tri_data = []

# Loop through each county to make a separate API inquiry (was having trouble attempting all at once)
for county in counties:
    api_url = f"https://data.epa.gov/efservice/tri_facility/state_abbr/MI/county_name/{county}/JSON"
    response = requests.get(api_url)
    # Is our response good
    if response.status_code == 200:
        county_data = response.json()
        # Add the county to the list
        tri_data.extend(county_data)
    else:
        print(f"Failed to fetch data for {county} County. Status code: {response.status_code}")

Processing the Data

The TRI data comes in JSON format. It can be a little confusing to interpret at first, because each record (traditionally a row in a table) is displayed with its columns structured vertically. Let’s look at a single record.

tri_data[1]
{'tri_facility_id': '48007GRWGR3155W',
 'facility_name': 'PPG GROW DETROIT',
 'street_address': '14000 STANSBURY',
 'city_name': 'DETROIT',
 'county_name': 'WAYNE',
 'state_county_fips_code': '26163',
 'state_abbr': 'MI',
 'zip_code': '48227',
 'region': '5',
 'fac_closed_ind': '0',
 'mail_name': 'PPG INDS.',
 'mail_street_address': '1330 PIEDMONT ATTN:  RICH MCCURDY',
 'mail_city': 'TROY',
 'mail_state_abbr': 'MI',
 'mail_province': None,
 'mail_country': None,
 'mail_zip_code': '48083',
 'asgn_federal_ind': 'C',
 'asgn_agency': None,
 'frs_id': None,
 'parent_co_db_num': '001344803',
 'parent_co_name': 'PPG INDS INC',
 'fac_latitude': 422146,
 'fac_longitude': 831255,
 'pref_latitude': 42.3903,
 'pref_longitude': 83.182,
 'pref_accuracy': 150,
 'pref_collect_meth': 'A1',
 'pref_desc_category': 'PG',
 'pref_horizontal_datum': '1',
 'pref_source_scale': 'J',
 'pref_qa_code': '1000',
 'asgn_partial_ind': '0',
 'asgn_public_contact': None,
 'asgn_public_phone': None,
 'asgn_public_contact_email': None,
 'bia_code': None,
 'standardized_parent_company': None,
 'asgn_public_phone_ext': None,
 'epa_registry_id': '110041981807',
 'asgn_technical_contact': None,
 'asgn_technical_phone': None,
 'asgn_technical_phone_ext': None,
 'mail': None,
 'asgn_technical_contact_email': None,
 'foreign_parent_co_name': None,
 'foreign_parent_co_db_num': None,
 'standardized_foreign_parent_company': None}

This record shows all the information for PPG GROW DETROIT with columns and values listed side-by-side (‘city_name’: ‘DETROIT’). This may seem difficult to deal with, but most programming languages have tools to easily parse JSON data into traditional tables; pd.DataFrame does it automatically when you create a pandas DataFrame from the list of records.

# Convert TRI data to a DataFrame
tri_df = pd.DataFrame(tri_data)

print(f"Number of facilities fetched: {len(tri_df)}")

# Check the first few rows (too large to print in this document)
# tri_df.head
Number of facilities fetched: 791

Checking the first few rows, everything looks as it should. We want to geocode the facility locations into a point layer using the latitude and longitude values, so we should start by making sure all the facilities have those values.

We’ll check and remove those without.

# Create a copy of the dataframe to avoid SettingWithCopyWarning
tri_df_clean = tri_df.copy()

# Remove facilities with empty latitude or longitude values
tri_df_clean = tri_df_clean.dropna(subset=['pref_latitude', 'pref_longitude'])

print(f"Number of facilities after removing empty coordinates: {len(tri_df_clean)}")

# Convert latitude and longitude to numeric type
tri_df_clean['pref_latitude'] = pd.to_numeric(tri_df_clean['pref_latitude'], errors='coerce')
tri_df_clean['pref_longitude'] = pd.to_numeric(tri_df_clean['pref_longitude'], errors='coerce')
Number of facilities after removing empty coordinates: 478

Ouch! We lost roughly 300 records due to missing coordinates (791 vs. 478). If this were an important analysis for policy-making or scientific research, we would take the time to geocode the listed addresses into lat/long coordinates, but for the purposes of this exercise we’ll move on.
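
(For reference, a geocoding pass over those dropped records might look like the sketch below. It uses the geopy package and OpenStreetMap’s Nominatim service, neither of which is part of this lesson’s module list, and a real workflow would need to respect the service’s usage policy and verify the returned coordinates.)

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

# Assumption: geopy is installed; Nominatim requires a descriptive user_agent string
geolocator = Nominatim(user_agent="tri_lesson_example")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # throttle requests politely

# Geocode a single example address assembled from the TRI columns
record = tri_df.iloc[0]
address = f"{record['street_address']}, {record['city_name']}, MI {record['zip_code']}"
location = geocode(address)

if location is not None:
    print(address, "->", location.latitude, location.longitude)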

In my personal investigations of this data, I noticed that some longitude coordinates were not negative but had an absolute value (e.g., 83.25) that made sense for the Detroit metro area, so we will flip the sign on these records.

# Function to correct longitude by making it negative if positive
def correct_longitude(lon):
    if lon > 0:
        return -lon
    return lon

# Apply longitude correction
tri_df_clean['pref_longitude'] = tri_df_clean['pref_longitude'].apply(correct_longitude)

Additionally, there are some records with wildly incorrect longitude values that are not simply missing a negative sign. We’ll identify these outliers using the interquartile range (IQR); anything outside the bounds will be tossed. One of the downsides of the TRI database is that many values are provided directly by the facility (and therefore subject to errors). Again, if this were a more important analysis we would carefully explore all of these records and possibly geocode them using the address.

Toss the quartile outliers.

# Calculate IQR for longitude
Q1 = tri_df_clean['pref_longitude'].quantile(0.25)
Q3 = tri_df_clean['pref_longitude'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

lower_bound, upper_bound

# Remove outliers
tri_df_clean = tri_df_clean[(tri_df_clean['pref_longitude'] >= lower_bound) & 
                            (tri_df_clean['pref_longitude'] <= upper_bound)]

print(f"Number of facilities after removing longitude outliers: {len(tri_df_clean)}")
Number of facilities after removing longitude outliers: 473

Thankfully we only lost 5 more records to outliers.

Visualizing the Data

The data has been cleaned up a bit, so now let’s create a spatial point object from the data with geopandas and use matplotlib to plot the values on top of a basemap of Detroit, which we’ll acquire using contextily and our original Wayne, Macomb, and Oakland counties boundary.

# Create a GeoDataFrame from the cleaned TRI data
detroit_tri = gpd.GeoDataFrame(
    tri_df_clean, 
    geometry=gpd.points_from_xy(tri_df_clean.pref_longitude, tri_df_clean.pref_latitude),
    crs="EPSG:4326"
)

# Reproject data to Web Mercator for contextily
# detroit_metro = detroit_metro.to_crs(epsg=3857)
# bbox_polygon = bbox_polygon.to_crs(epsg=3857)
detroit_tri = detroit_tri.to_crs(epsg=3857)

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the metro area and bounding box (reusing objects from earlier)
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Plot TRI facilities (reusing the detroit_tri object from earlier)
detroit_tri.plot(ax=ax, color='purple', edgecolor='grey', markersize=50, alpha=0.5, label='TRI Facilities')
# Plot ICIS-AIR facilities
gdf_icis_air_bm.plot(ax=ax, color='cyan', edgecolor='grey', markersize=50, alpha=0.5, label='ICIS-AIR Facilities')

# Add the basemap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)

# Set the extent of the map to the bounding box
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])

# Remove axes
ax.set_axis_off()

# Add legend
ax.legend()

plt.title("Detroit Metro Area TRI and ICIS-AIR Facilities", fontsize=16)
plt.tight_layout()
plt.show()

print(f"Number of TRI facilities plotted: {len(detroit_tri)}")
print(f"Number of ICIS-AIR facilities plotted: {len(gdf_icis_air)}")

Number of TRI facilities plotted: 473
Number of ICIS-AIR facilities plotted: 903

We can see that there are far fewer facilities tracked in TRI compared to ICIS-AIR. Some of this can be attributed to the records we tossed when checking for valid geographic coordinates and outliers, but generally speaking ICIS-AIR will contain far more facilities because it tracks every facility regulated under the Clean Air Act, as opposed to only those that meet TRI reporting requirements.

You may have also noticed that this dataset does not contain any actual pollution data; it only includes facility information for those regulated under TRI.

TRI is made up of numerous tables containing a wealth of information regarding the facilities themselves and all the regulated chemicals handled at each site. The TRI_FACILITY table we requested is the starter table used in most of the examples in the Envirofacts API documentation; however, it does not contain any chemical release data.

Navigating the TRI Envirofacts interface can be overwhelming at times. Multiple tables can be queried with a single URL; however, for a proper join to be carried out, both tables must share at least one column. In most cases this can be accomplished using TRI_FACILITY_ID. That said, air pollution release data (column name AIR_TOTAL_RELEASE) is only found in the TRI_FORM_R table, which does not contain the FACILITY_ID column.

You can query the V_TRI_FORM_R_EZ table using the API even though it’s not a documented table listed on the API website. This view contains facility information and AIR_TOTAL_RELEASE reporting, but it only covers data through 2021. To get the most recent data (2023), you would have to download the individual tables with separate calls and perform the joins yourself, or use the custom form search web portal. We’ll demonstrate a quick API call for V_TRI_FORM_R_EZ, then move forward with the most recent data available from the custom form search web portal.

Knowledge Check

What is the primary purpose of the Toxic Release Inventory (TRI)?

  1. To track emissions of greenhouse gases
  2. To monitor compliance with the Clean Air Act
  3. To provide data on toxic chemical releases from industrial facilities
  4. To record air quality index values for major cities

Toxic Release Inventory: Form R

We can make the query to Envirofacts. Let’s break down the call.

# URL of the CSV file
url = "https://data.epa.gov/efservice/V_TRI_FORM_R_EZ/REPORTING_YEAR/2021/AIR_TOTAL_RELEASE/>/0/STATE_ABBR/MI/COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB/CSV"
  • Base url: https://data.epa.gov/efservice/
  • The table we’re requesting: V_TRI_FORM_R_EZ/
  • For just 2021: REPORTING_YEAR/2021/
  • Only facilities that report some air release: AIR_TOTAL_RELEASE/>/0/
  • Just Michigan: STATE_ABBR/MI/
  • Our counties: COUNTY_NAME/CONTAINING/WAYNE/CONTAINING/OAKLAND/CONTAINING/MACOMB/
  • We want it as a csv: CSV
# Read the CSV file directly with pandas
tri_form_r = pd.read_csv(url)
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")

# Display information about the dataset
print("\nColumns in the dataset:")
print(tri_form_r.columns)
Successfully read CSV. Number of records: 413

Columns in the dataset:
Index(['tri_facility_id', 'facility_name', 'street_address', 'city_name',
       'county_name', 'state_county_fips_code', 'state_abbr', 'zip_code',
       'region', 'fac_closed_ind',
       ...
       'additional_text_9_1', 'srs_id', 'media_type', 'prod_ratio_or_activity',
       'industry_code', 'industry_description', 'source', 'method', 'adjusted',
       'covered_naics'],
      dtype='object', length=568)

We have roughly the same number of facilities as before, but this table has a lot of columns (568). We should make it more manageable.

keeps = [
    'facility_name', 'street_address', 'city_name',
    'reporting_year', 'air_total_release']

tri_form_r = tri_form_r[keeps]

This works well, but we can get more recent data that is already filtered using the web portal. You select the location, years of interest, and columns of interest, and the portal serves up your requested dataset in Amazon S3 cloud storage. We can import the dataset directly from the address it provides.

tri_form_r = pd.read_csv("https://dmap-epa-enviro-prod-export.s3.amazonaws.com/396975438.CSV")

# Display information about the dataset
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
print("\nColumns in the dataset:")
print(tri_form_r.columns)
Successfully read CSV. Number of records: 994

Columns in the dataset:
Index(['AIR_TOTAL_RELEASE', 'CHEM_NAME', 'FACILITY_NAME', 'LATITUDE',
       'LONGITUDE', 'STREET_ADDRESS', 'TRI_FACILITY_ID'],
      dtype='object')

This is a bit more manageable, but we have multiple records per facility because each chemical release has a separate record. This allows the user to investigate specific chemicals, but for the purposes of this exercise we will aggregate AIR_TOTAL_RELEASE for each facility.

# Check how it looks (too large to print on the web)
# tri_form_r.head

Let’s aggregate it.

# Sum across individual facilities
tri_form_r = tri_form_r.groupby(['FACILITY_NAME', 'LATITUDE','LONGITUDE','STREET_ADDRESS'], as_index=False).agg({
    'AIR_TOTAL_RELEASE': 'sum'
})
# How many records are there now
print(f"Successfully read CSV. Number of records: {len(tri_form_r)}")
Successfully read CSV. Number of records: 211

There were only 211 unique listings in the 2023 data.

As before, we should check for valid coordinates and release data.

# Assuming the latitude and longitude columns are named 'LATITUDE' and 'LONGITUDE'
# Adjust these names if they're different in your CSV
lat_col = 'LATITUDE'
lon_col = 'LONGITUDE'
release_col = 'AIR_TOTAL_RELEASE'  # Adjust this to the actual column name for air releases

# Remove records with missing coordinates or air release data
df_tri_clean = tri_form_r.dropna(subset=[lat_col, lon_col, release_col])

print(f"\nNumber of records after removing missing data: {len(df_tri_clean)}")

Number of records after removing missing data: 211

There were no missing coordinates or release data.

Now we can create a spatial object with GeoPandas using the latitude and longitude coordinates in the table.

# Create a GeoDataFrame
gdf_tri_form_r = gpd.GeoDataFrame(
    df_tri_clean, 
    geometry=gpd.points_from_xy(df_tri_clean[lon_col], df_tri_clean[lat_col]),
    crs="EPSG:4326"
)

Let’s check the distribution of the air release data. I suspect there will be outliers.

# Create the histogram
plt.figure(figsize=(8, 8))
plt.hist(gdf_tri_form_r['AIR_TOTAL_RELEASE'], bins=10, edgecolor='black')
plt.title('Histogram of Air Total Release Sum')
plt.xlabel('Air Sum')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

# Show the plot
plt.show()

Looks like there are outliers around 100,000 lbs and 400,000 lbs. Before we map the data, it would be a good idea to perform a log transformation of AIR_TOTAL_RELEASE to smooth out the visuals. You can’t take the log of negative or zero values, so we’ll use the log1p function, which adds 1 to every value before taking the log.

import numpy as np

# Add a log-transformed column; log1p handles the zero values gracefully
gdf_tri_form_r['LOG_AIR_RELEASE'] = np.log1p(gdf_tri_form_r['AIR_TOTAL_RELEASE'])

# Create the histogram
plt.figure(figsize=(8, 8))
plt.hist(gdf_tri_form_r['LOG_AIR_RELEASE'], bins=10, edgecolor='black')
plt.title('Histogram of the Log of Air Total Release Sum')
plt.xlabel('Log of Air Sum')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

# Show the plot
plt.show()

That looks much better.

Now we can visualize the data by overlaying the points on a basemap of Detroit, using graduated symbols to illustrate the amount of air releases for a given facility.

# Reproject to Web Mercator to match the basemap
gdf_tri_form_r_bm = gdf_tri_form_r.to_crs(epsg=3857)

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the metro area and bounding box (reusing objects from earlier)
detroit_metro_bm.plot(ax=ax, facecolor='none', edgecolor='blue', linewidth=2)
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=2)

# Plot TRI facilities with graduated symbols based on air releases
scatter = ax.scatter(gdf_tri_form_r_bm.geometry.x, gdf_tri_form_r_bm.geometry.y, 
                     s=gdf_tri_form_r_bm['LOG_AIR_RELEASE']*20,  # Adjust the scaling factor as needed
                     c='orangered',  # Static fill color
                     edgecolor='yellow',  # Outline color
                     linewidth=1,  # Adjust the outline width as needed
                     alpha=0.7)

# Add the basemap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)

# Set the extent of the map to the bounding box
ax.set_xlim(bbox_polygon_bm.total_bounds[0], bbox_polygon_bm.total_bounds[2])
ax.set_ylim(bbox_polygon_bm.total_bounds[1], bbox_polygon_bm.total_bounds[3])

# Remove axes
ax.set_axis_off()

# Add a legend for symbol sizes
legend_sizes = [0,4,8,12]  # Example sizes, adjust based on your data
legend_elements = [plt.scatter([], [], s=size*20, c='orangered', edgecolor='yellow', 
                               linewidth=1, alpha=1, label=f'{size:,}') 
                   for size in legend_sizes]
ax.legend(handles=legend_elements, title='Log Total Air Releases (lbs)', 
          loc='lower right', title_fontsize=12, fontsize=10)

plt.title("Detroit Metro Area TRI Facilities - Total Air Releases (Custom Data)", fontsize=16)
plt.tight_layout()
plt.show()

print(f"\nNumber of TRI facilities plotted: {len(gdf_tri_form_r)}")
print(f"Total air releases: {gdf_tri_form_r[release_col].sum():,.2f} lbs")
print(f"Average air release per facility: {gdf_tri_form_r[release_col].mean():,.2f} lbs")


Number of TRI facilities plotted: 211
Total air releases: 2,117,167.80 lbs
Average air release per facility: 10,033.97 lbs

The point source air release data provides a good snapshot of the locations and relative amounts of chemical releases. With some basic understanding of the demographic dynamics of Detroit, one might theorize about who is being subjected to high levels of pollution; however, we’ll need additional data to develop any type of systematic analysis.

Knowledge Check

What key information does Form R of the Toxic Release Inventory provide?

  1. Facility names and locations only
  2. Chemical storage capacities of facilities
  3. Detailed data on chemical releases, including air releases
  4. Employee health records related to chemical exposure

Social Vulnerability Index

NASA’s Social Vulnerability Index (SVI) raster dataset, available through the Socioeconomic Data and Applications Center (SEDAC), is a high-resolution geospatial resource that quantifies social vulnerability across the United States. This dataset is based on the CDC’s Social Vulnerability Index but is presented in a raster format, providing continuous coverage at a 250-meter resolution. The SVI incorporates various socioeconomic and demographic factors such as poverty, lack of vehicle access, crowded housing, and minority status to assess communities’ capacity to prepare for, respond to, and recover from hazards, including environmental threats like poor air quality.

The raster format allows for more detailed spatial analysis and integration with other environmental datasets. This makes it particularly valuable for researchers and policymakers studying the intersection of social vulnerability and environmental risks, such as air pollution exposure. By overlaying this SVI data with air quality information, for instance, analysts can identify areas where socially vulnerable populations may be disproportionately affected by poor air quality, supporting environmental justice initiatives and targeted intervention strategies.

SEDAC, as part of NASA’s Earth Observing System Data and Information System (EOSDIS), hosts this dataset along with other socioeconomic and environmental data, facilitating interdisciplinary research on human-environment interactions. The SVI raster dataset’s high resolution and comprehensive coverage make it a powerful tool for assessing environmental equity and informing policy decisions at various geographic scales.

Although the SVI dataset is free to acquire, it does require an Earth Data account. If you don’t already have an Earth Data account, you can follow these steps to download the SVI dataset to your local computer:

  1. Visit the NASA Earthdata website: go to https://urs.earthdata.nasa.gov/ and click the “Register” button in the top right corner of the page.
  2. Fill out the registration form: Provide the required information, including your name, email address, and a password. You’ll also need to create a username.
  3. Verify your email: NASA will send a verification email to the address you provided. Click on the link in this email to confirm your account.
  4. Log in to Earth Data: Once your account is verified, you can log in using your username and password.
  5. Access SEDAC: Visit the SEDAC website (https://sedac.ciesin.columbia.edu/) and use your Earth Data credentials to log in when prompted.
  6. Download the data: Once you’ve found the SVI dataset, you can use your Earth Data account to download it.

Remember, your Earth Data account gives you access not just to SEDAC, but to a wide range of NASA Earth science data. It’s a valuable resource for researchers, students, and anyone interested in environmental and socioeconomic data.

Data Processing

Once you’ve downloaded the dataset to your working directory, you can proceed with the analysis. We’ll be using the 2020 census tract version of SVI.

The different layers of SVI are provided as individual files, but sometimes it’s easier to work with a multilayer object. We can create one using xarray. To begin, we’ll read in each file individually, clip it to our Detroit metro boundary, and create an individual data array.

import xarray as xr
import rasterio
import rasterio.mask

xr.set_options(
    keep_attrs=True,
    display_expand_attrs=True,
    display_expand_coords=True,
    display_expand_data=False,
    display_expand_data_vars=True
)

# Specify the TIF files
tif_files = [
    "data/svi/svi_2020_tract_overall_wgs84.tif",
    "data/svi/svi_2020_tract_minority_wgs84.tif",
    "data/svi/svi_2020_tract_socioeconomic_wgs84.tif",
    "data/svi/svi_2020_tract_housing_wgs84.tif",
    "data/svi/svi_2020_tract_household_wgs84.tif"
]

# Create an empty list to store the individual DataArrays
data_arrays = []

# Read each TIF file, clip it to Detroit metro's extent, and append it to the list
for file in tif_files:
    with rasterio.open(file) as src:
        # Reproject Detroit metro boundary to match the raster CRS
        metro_reprojected = detroit_metro.to_crs(src.crs)
        
        # Clip the raster to Detroit metro's geometry
        out_image, out_transform = rasterio.mask.mask(src, metro_reprojected.geometry, crop=True)
        out_meta = src.meta.copy()
        
        # Update the metadata
        out_meta.update({"driver": "GTiff",
                         "height": out_image.shape[1],
                         "width": out_image.shape[2],
                         "transform": out_transform})
        
        # Create coordinates
        height = out_meta['height']
        width = out_meta['width']
        cols, rows = np.meshgrid(np.arange(width), np.arange(height))
        xs, ys = rasterio.transform.xy(out_transform, rows, cols)
        
        # Convert lists to numpy arrays
        xs = np.array(xs)
        ys = np.array(ys)
        
        # Reshape coordinates to match dimensions of the raster
        xs = xs.reshape(height, width)
        ys = ys.reshape(height, width)
        
        # Create a DataArray from the clipped data
        da = xr.DataArray(out_image[0],  # Use the first band
                          coords={'y': ('y', ys[:, 0]),
                                  'x': ('x', xs[0, :])},
                          dims=['y', 'x'])
        da.attrs['crs'] = str(src.crs)  # Convert CRS to string
        da.attrs['transform'] = out_transform
        data_arrays.append(da)

Now we can combine them together and give the layers “pretty” names.

# Combine all DataArrays into a single DataSet
svi_detroit = xr.concat(data_arrays, dim='layer')

# Rename the layers
layer_names = ['Overall', 'Minority', 'Socioeconomic', 'Housing', 'Household']
svi_detroit = svi_detroit.assign_coords(layer=('layer', layer_names))
svi_detroit
<xarray.DataArray (layer: 5, y: 105, x: 119)>
-3.4e+38 -3.4e+38 -3.4e+38 -3.4e+38 ... -3.4e+38 -3.4e+38 -3.4e+38 -3.4e+38
Coordinates:
  * y        (y) float64 42.9 42.89 42.88 42.87 ... 42.05 42.05 42.04 42.03
  * x        (x) float64 -83.69 -83.68 -83.67 -83.66 ... -82.72 -82.71 -82.7
  * layer    (layer) <U13 'Overall' 'Minority' ... 'Housing' 'Household'
Attributes:
    crs:        EPSG:4326
    transform:  | 0.01, 0.00,-83.69|\n| 0.00,-0.01, 42.90|\n| 0.00, 0.00, 1.00|

Now we can plot each layer.

# Define the colorbar limits (SVI is a 0-1 scale)
vmin, vmax = 0, 1

# Create a multipanel plot
fig, axes = plt.subplots(3, 2, figsize=(8, 10))
axes = axes.flatten()

# Plot each layer
for i, layer in enumerate(svi_detroit.layer.values):
    # Plot with custom color limits
    im = svi_detroit.sel(layer=layer).plot(ax=axes[i], add_colorbar=False, vmin=vmin, vmax=vmax, cmap='plasma')
    axes[i].set_title(layer)
    # Remove axes and labels
    axes[i].axis('off')
    
    # Plot Detroit metro boundary
    metro_reprojected.boundary.plot(ax=axes[i], color='red', linewidth=1)

# Remove the extra subplot
fig.delaxes(axes[5])

# Add a single colorbar
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
cbar = fig.colorbar(im, cax=cbar_ax, label='SVI Score', 
                    fraction=0.047, pad=0.04, aspect=20)

plt.tight_layout()
plt.show()

This provides an excellent look at demographic trends in the Detroit metro area. Overall vulnerability is highest in the inner city. The most striking drivers are the concentration of minorities and socioeconomic vulnerability in the downtown area. The housing and household components are slightly more varied throughout the region.

Knowledge Check

What does the Social Vulnerability Index (SVI) primarily measure?

  1. Air pollution levels in different communities
  2. Economic growth rates of different regions
  3. Communities’ capacity to prepare for and respond to hazards
  4. Population density in urban areas

Integrating TRI and SVI

There are numerous ways to assess the relationships between the SVI and TRI data. For this lesson we’ll identify areas where air releases and vulnerability are highest together by creating a new rasterized index that combines both layers.

To begin we need to rasterize the air release point data. We will create an empty raster grid and “fill” the cells with the sum of the air release values from any points that fall within each cell.

Start by getting the bounds from our boundary object, specify the desired resolution, and set up the transform.

from rasterio.transform import from_origin

# Get the bounds of the Detroit metro area
minx, miny, maxx, maxy = detroit_metro_bm.total_bounds

# Define the raster resolution in meters (5 km)
resolution = 5000

# Calculate the number of cells
nx = int((maxx - minx) / resolution)
ny = int((maxy - miny) / resolution)

# Create the transform for the raster
transform = from_origin(minx, maxy, resolution, resolution)

Now we have to prepare the point values and rasterize them. A key thing to note here is the merge_alg=rasterio.enums.MergeAlg.add argument passed to the rasterize function. This makes sure that cells with more than 1 point will add the values together (the default is to simply replace).

from rasterio import features

# Prepare geometries and values for rasterization
shapes = ((geom, value) for geom, value in zip(gdf_tri_form_r_bm.geometry, gdf_tri_form_r_bm.AIR_TOTAL_RELEASE))

# Rasterize the point data
air_release_raster = features.rasterize(shapes=shapes, 
                            out_shape=(ny, nx), 
                            transform=transform, 
                            fill=0, 
                            all_touched=True, 
                            merge_alg=rasterio.enums.MergeAlg.add)

We rasterized with rasterio, but we need to convert the result to an xarray object to continue, and we need rioxarray to write the CRS information for the xarray object.

import rioxarray

# Convert the raster to an xarray DataArray
# Note: We use ny and nx here to ensure the coordinates match the raster shape
air_release_raster_da = xr.DataArray(air_release_raster, 
                         coords={'y': np.linspace(maxy, miny, ny),
                                 'x': np.linspace(minx, maxx, nx)},
                         dims=['y', 'x'])
air_release_raster_da.rio.write_crs(detroit_metro_bm.crs, inplace=True)
<xarray.DataArray (y: 26, x: 21)>
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.97 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Coordinates:
  * y            (y) float64 5.296e+06 5.291e+06 ... 5.17e+06 5.165e+06
  * x            (x) float64 -9.316e+06 -9.311e+06 ... -9.212e+06 -9.207e+06
    spatial_ref  int32 0

The raster grid covers the full rectangular extent, so cells outside the county boundaries can hold odd values even though no facilities are nearby. We can clip the raster to our Detroit metro boundary to fix this.

# Clip the raster with the Detroit metro boundary
air_release_raster_da = air_release_raster_da.rio.clip(
    detroit_metro_bm.geometry.values, 
    detroit_metro_bm.crs, 
    drop=False, all_touched=True)

Now we can see how our rasterized air release values look.

from matplotlib.colors import BoundaryNorm, ListedColormap

# Define the breaks for the discrete scale
breaks = [0, 1, 10, 100, 1000, 10000, 100000, 250000, 500000]

# Create a custom colormap
colors = ['#fffff3', '#FFFFCC', '#FFEDA0', '#FED976', '#FEB24C', '#FD8D3C', '#FC4E2A', '#E31A1C', '#B10026']
cmap = ListedColormap(colors)

# Create a normalization based on the breaks
norm = BoundaryNorm(breaks, cmap.N)

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the TRI facility points
gdf_tri_form_r_bm.plot(ax=ax, color='blue', markersize=10, alpha=0.7)

# Plot the clipped raster with the custom colormap and norm
im = ax.imshow(air_release_raster_da, extent=[minx, maxx, miny, maxy], origin='upper', 
               cmap=cmap, norm=norm)

# Add colorbar with discrete labels
cbar = plt.colorbar(im, ax=ax, extend='max', 
                    label='Total Air Releases (pounds)', 
                    ticks=breaks, 
                    fraction=0.047, pad=0.04, aspect=20)
cbar.ax.set_yticklabels([f'{b:,}' for b in breaks])

# Plot the Detroit metro boundary
detroit_metro_bm.boundary.plot(ax=ax, color='black', linewidth=2)

# Set the extent to match the Detroit metro area
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)

# Add title and labels
ax.set_title('TRI Air Total Release (5 km resolution sum) with Facility Locations', fontsize=16)
ax.set_xlabel('X Coordinate')
ax.set_ylabel('Y Coordinate')

plt.tight_layout()
plt.show()

print(f"Number of TRI facilities plotted: {len(gdf_tri_form_r_bm)}")
print(f"Total air releases: {gdf_tri_form_r_bm['AIR_TOTAL_RELEASE'].sum():,.2f}")
print(f"Maximum cell value in raster: {air_release_raster_da.max().values:,.2f}")

Number of TRI facilities plotted: 211
Total air releases: 2,117,167.80
Maximum cell value in raster: 434,344.84

The raster looks as expected, with the highest values in locations where our previous map had the largest circles. We overlaid the points to show the sources of the cell values. The cell resolution is set to 5000 m. This was selected so we could evaluate the map and the process we used to create the raster, but any value could be set depending on the intended use.
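If you want to experiment with other resolutions, one option is to wrap the steps above in a small helper function. The sketch below is ours, not part of any library; it simply parameterizes the workflow we just walked through (bounds, transform, rasterize, convert to xarray, clip):

# A small helper (hypothetical) that rebuilds the summed air release raster
# at any resolution, in the units of the boundary's CRS (meters here)
def rasterize_air_releases(points_gdf, boundary_gdf, resolution):
    minx, miny, maxx, maxy = boundary_gdf.total_bounds
    nx = int((maxx - minx) / resolution)
    ny = int((maxy - miny) / resolution)
    transform = from_origin(minx, maxy, resolution, resolution)
    # Pair each point geometry with its air release value
    shapes = zip(points_gdf.geometry, points_gdf['AIR_TOTAL_RELEASE'])
    raster = features.rasterize(shapes=shapes, out_shape=(ny, nx),
                                transform=transform, fill=0, all_touched=True,
                                merge_alg=rasterio.enums.MergeAlg.add)
    da = xr.DataArray(raster,
                      coords={'y': np.linspace(maxy, miny, ny),
                              'x': np.linspace(minx, maxx, nx)},
                      dims=['y', 'x'])
    da.rio.write_crs(boundary_gdf.crs, inplace=True)
    # Mask cells outside the boundary, as we did above
    return da.rio.clip(boundary_gdf.geometry.values, boundary_gdf.crs,
                       drop=False, all_touched=True)

# For example, a coarser 10 km grid:
# air_release_10km = rasterize_air_releases(gdf_tri_form_r_bm, detroit_metro_bm, 10000)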

Air Release Vulnerability Index

Now that we’ve created our air release raster layer we can combine it with the SVI raster to create a new index identifying areas with the highest combination of air releases and vulnerability.

Start by selecting just the SVI Overall layer and converting it to a rioxarray object so we can perform raster calculations. We’ll also ensure the 2 raster layers “line up” by reprojecting into the same coordinate reference system and matching their resolutions.

from rasterio.enums import Resampling

# Select the 'Overall' layer
svi_overall = svi_detroit.sel(layer='Overall')

# Convert to rioxarray for geospatial operations
svi_overall = svi_overall.rio.write_crs("EPSG:4326")

# Reproject SVI to match the CRS of the air release raster
svi_reprojected = svi_overall.rio.reproject_match(air_release_raster_da)

# Align the air release data with the reprojected SVI grid
air_release_disaggregated = air_release_raster_da.rio.reproject_match(
    svi_reprojected,
    resampling=Resampling.bilinear
)

There are a few more steps that will aid in interpretation.

  1. We will take the log of the air release data to reduce the impact of major outliers.
  2. We will scale the logged air release data to 0-1 to match the SVI data. Now the resulting index we create will have equal contributions from both datasets.
# Log1p transform the air release data and scale to 0-1
air_release_log = np.log1p(air_release_disaggregated)
air_release_scaled = (air_release_log - air_release_log.min()) / (air_release_log.max() - air_release_log.min())

# Multiply scaled air release data with SVI data
vulnerability_indicator = air_release_scaled * svi_reprojected

Now let’s visualize our index along with the inputs.

# Create the plots
fig, axs = plt.subplots(3, 1, figsize=(8, 20))

# Plot SVI Overall
im1 = svi_reprojected.plot(ax=axs[0], cmap='viridis', vmin=0, vmax=1, add_colorbar=False)
plt.colorbar(im1, ax=axs[0], label='SVI Overall', 
                    fraction=0.047, pad=0.04, aspect=20)
axs[0].set_title('Social Vulnerability Index (Overall)', fontsize=16)
detroit_metro_bm.boundary.plot(ax=axs[0], color='black', linewidth=2)

# Plot Original Air Release (log-transformed for better visualization)
im2 = np.log1p(air_release_disaggregated).plot(ax=axs[1], cmap='YlOrRd', add_colorbar=False)
plt.colorbar(im2, ax=axs[1], label='Log(Air Release + 1)', 
                    fraction=0.047, pad=0.04, aspect=20)
axs[1].set_title('Air Release (Log-transformed)', fontsize=16)
detroit_metro_bm.boundary.plot(ax=axs[1], color='black', linewidth=2)

# Plot Air Release Vulnerability Indicator
im3 = vulnerability_indicator.plot(ax=axs[2], cmap='YlOrRd', vmin=0, vmax=1, add_colorbar=False)
plt.colorbar(im3, ax=axs[2], label='Air Release Vulnerability Indicator', 
                    fraction=0.047, pad=0.04, aspect=20)
axs[2].set_title('Air Release Vulnerability Indicator\n(Scaled Air Release * SVI)', fontsize=16)
detroit_metro_bm.boundary.plot(ax=axs[2], color='black', linewidth=2)

for ax in axs:
    ax.set_xlabel('X Coordinate')
    ax.set_ylabel('Y Coordinate')
    ax.set_xlim(svi_reprojected.x.min(), svi_reprojected.x.max())
    ax.set_ylim(svi_reprojected.y.min(), svi_reprojected.y.max())

plt.tight_layout()
plt.show()

# Print some statistics
print(f"Maximum vulnerability indicator: {vulnerability_indicator.max().values:.4f}")

Maximum vulnerability indicator: 0.9773

As one would expect, there are several areas in the downtown river corridor with very high combinations of SVI and TRI releases. The maximum score for our index is nearly 0.98! That is an extremely vulnerable community facing extraordinary levels of air pollution.

We can get a better sense of where these communities are by extracting the location of the cells with the highest scores and placing them on a basemap of Detroit.

We’ll convert the raster into a data frame, sort the values in descending order, and then extract the coordinates.

# Convert the vulnerability indicator to a pandas DataFrame
vulnerability_df = vulnerability_indicator.to_dataframe(name='index').reset_index()

# Sort by index value and get the top 10
top_10 = vulnerability_df.sort_values('index', ascending=False).head(10)

# Create points from the coordinates
top_10['geometry'] = gpd.points_from_xy(top_10.x, top_10.y)
top_10_gdf = gpd.GeoDataFrame(top_10, geometry='geometry', crs=vulnerability_indicator.rio.crs)

Now let’s make a map.

# Create the final map
fig, ax = plt.subplots(figsize=(8, 8))

# Plot the Detroit metro boundary
detroit_metro_bm.boundary.plot(ax=ax, color='black', linewidth=2)

# Plot the top 10 points
top_10_gdf.plot(ax=ax, color='blue', markersize=100, alpha=0.7)

# Add rank labels to the points (1 = highest index value)
for rank, (idx, row) in enumerate(top_10_gdf.iterrows(), 1):
    ax.annotate(f"#{rank}", (row.geometry.x, row.geometry.y), 
                xytext=(3, 3), textcoords="offset points", 
                color='black', fontweight='bold')

# Add a basemap
ctx.add_basemap(ax, crs=vulnerability_indicator.rio.crs.to_string(), source=ctx.providers.OpenStreetMap.Mapnik)

# Add a border to the map
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Set the extent to match the Detroit metro area
ax.set_xlim(vulnerability_indicator.x.min(), vulnerability_indicator.x.max())
ax.set_ylim(vulnerability_indicator.y.min(), vulnerability_indicator.y.max())

ax.set_title('Top 10 Areas with Highest Air Release Vulnerability Index', fontsize=16)
ax.set_axis_off()

plt.tight_layout()
plt.show()

# Print the row index and index value of the top 10 points
print("Row index and index value of the top 10 points:")
for i, (idx, row) in enumerate(top_10_gdf.iterrows(), 1):
    print(f"Row index = {idx}, Index value = {round(row['index'], 2)}")

Row index and index value of the top 10 points:
Row index = 347, Index value = 0.98
Row index = 134, Index value = 0.97
Row index = 155, Index value = 0.97
Row index = 306, Index value = 0.96
Row index = 369, Index value = 0.96
Row index = 305, Index value = 0.94
Row index = 285, Index value = 0.94
Row index = 330, Index value = 0.93
Row index = 348, Index value = 0.92
Row index = 324, Index value = 0.9

As you would expect from looking at the input layers, the most vulnerable and polluted areas are along the river corridor in Lincoln Park, Melvindale, Mexicantown, and River Rouge. There are multiple locations with index values greater than 0.90. Information like this can be valuable for targeted relief or mitigation programs.

Knowledge Check

What was the main purpose of combining the TRI and SVI data in this analysis?

  1. To calculate total pollution levels for each county
  2. To identify areas with both high air releases and high social vulnerability
  3. To determine the most populous areas in Detroit
  4. To predict future industrial development zones

CDC PLACES

The CDC PLACES (Population Level Analysis and Community Estimates) dataset is a collaboration between the Centers for Disease Control and Prevention (CDC), the Robert Wood Johnson Foundation, and the CDC Foundation. It provides model-based population-level analysis and community estimates of health indicators for all counties, places (incorporated and census designated places), census tracts, and ZIP Code Tabulation Areas (ZCTAs) across the United States. Some key points to consider when working with CDC PLACES:

  1. Spatial Extent: Entire United States, including all 50 states, the District of Columbia, and Puerto Rico.
  2. Spatial Resolution: Multiple levels including counties, cities/towns, census tracts, and ZIP codes.
  3. Indicators: Wide range of chronic disease measures related to health outcomes, prevention, and health risk behaviors.
  4. Data Sources:
    • Behavioral Risk Factor Surveillance System (BRFSS)
    • U.S. Census Bureau’s American Community Survey (ACS)
  5. Methodology: Uses small area estimation methods for small geographic areas.
  6. Health Measures Include:
    • Chronic diseases: e.g., asthma, COPD, heart disease, diabetes
    • Health risk behaviors: e.g., smoking, physical inactivity, binge drinking
    • Prevention practices: e.g., health insurance coverage, dental visits, cholesterol screening
  7. Socioeconomic Data: Includes some socioeconomic and demographic variables.
  8. Annual Updates: Providing recent estimates for local areas.

This dataset is valuable for public health researchers, policymakers, and community organizations. It provides a standardized way to compare health indicators across different geographic areas and can be used to inform targeted interventions and policy decisions, especially in addressing health disparities at a local level.

As with ICIS-AIR and TRI, PLACES is available through a web search interface and a programmatic API. We will focus on the API in this example.

Processing

We’ll start by accessing the CDC PLACES data through their API and processing it for our analysis. First we set up our API request. We specify the endpoint URL, define our target counties, and create a filter to select data only for these counties in Michigan.

# Define the GeoJSON API endpoint
url = "https://data.cdc.gov/resource/cwsq-ngmh.geojson"

# Define the Detroit metro area counties
detroit_counties = ['Wayne', 'Oakland', 'Macomb']

# Create the county filter string
county_filter = " OR ".join([f"countyname = '{county}'" for county in detroit_counties])

# Define the query parameters
params = {
    "$where": f"stateabbr = 'MI' AND ({county_filter})",
    "$limit": 50000  # Adjust if necessary
}

Next, let’s make the API request and process the response:

# Make the API request
response = requests.get(url, params=params)

if response.status_code == 200:
    data = response.json()
    print(f"Successfully retrieved data")
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")
    print(response.text)

# Convert to GeoDataFrame
gdf = gpd.read_file(response.text)
Successfully retrieved data

You can inspect the gdf return object on your own; it’s too large to print here. Let’s just take a look at the available health measures in CDC PLACES.

# Print available health measures
print("\nAvailable health measures:")
print(gdf['measure'].unique())

# Print basic information about the GeoDataFrame
print("\nGeoDataFrame Info:")
print(gdf.info())

Available health measures:
['Frequent physical distress among adults'
 'Cancer (non-skin) or melanoma among adults'
 'Visited dentist or dental clinic in the past year among adults'
 'Cognitive disability among adults' 'Coronary heart disease among adults'
 'Current asthma among adults'
 'Current lack of health insurance among adults aged 18-64 years'
 'Lack of social and emotional support among adults'
 'Arthritis among adults' 'Obesity among adults'
 'Self-care disability among adults' 'Mobility disability among adults'
 'Short sleep duration among adults'
 'Frequent mental distress among adults'
 'Fair or poor self-rated health status among adults'
 'Lack of reliable transportation in the past 12 months among adults'
 'All teeth lost among adults aged >=65 years'
 'Taking medicine to control high blood pressure among adults with high blood pressure'
 'Colorectal cancer screening among adults aged 45–75 years'
 'Vision disability among adults'
 'Visits to doctor for routine checkup within the past year among adults'
 'Diagnosed diabetes among adults'
 'Mammography use among women aged 50-74 years'
 'Independent living disability among adults'
 'Hearing disability among adults' 'Stroke among adults'
 'Chronic obstructive pulmonary disease among adults'
 'Food insecurity in the past 12 months among adults'
 'Feeling socially isolated among adults'
 'High cholesterol among adults who have ever been screened'
 'High blood pressure among adults'
 'No leisure-time physical activity among adults'
 'Cholesterol screening among adults'
 'Housing insecurity in the past 12 months among adults'
 'Utility services shut-off threat in the past 12 months among adults'
 'Binge drinking among adults'
 'Received food stamps in the past 12 months among adults'
 'Current cigarette smoking among adults' 'Any disability among adults'
 'Depression among adults']

GeoDataFrame Info:
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 46560 entries, 0 to 46559
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   measure                     46560 non-null  object  
 1   low_confidence_limit        46560 non-null  object  
 2   data_value_unit             46560 non-null  object  
 3   data_value                  46560 non-null  object  
 4   short_question_text         46560 non-null  object  
 5   statedesc                   46560 non-null  object  
 6   totalpop18plus              46560 non-null  object  
 7   locationid                  46560 non-null  object  
 8   countyname                  46560 non-null  object  
 9   year                        46560 non-null  object  
 10  high_confidence_limit       46560 non-null  object  
 11  categoryid                  46560 non-null  object  
 12  stateabbr                   46560 non-null  object  
 13  data_value_footnote         0 non-null      object  
 14  data_value_type             46560 non-null  object  
 15  data_value_footnote_symbol  0 non-null      object  
 16  locationname                46560 non-null  object  
 17  category                    46560 non-null  object  
 18  datavaluetypeid             46560 non-null  object  
 19  measureid                   46560 non-null  object  
 20  countyfips                  46560 non-null  object  
 21  datasource                  46560 non-null  object  
 22  totalpopulation             46560 non-null  object  
 23  geometry                    46560 non-null  geometry
dtypes: geometry(1), object(23)
memory usage: 8.5+ MB
None

There’s a lot of great data here considering the rarity of health outcomes data, especially at this resolution (census tracts). Because we’re dealing with air quality in this lesson, we’ll use “Current asthma among adults.”

Let’s make a quick map to see how it looks.

# Create a sample map for one health measure (e.g., Current asthma)
fig, ax = plt.subplots(figsize=(7, 7))

# Filter for the specific measure and ensure data_value is numeric
gdf_asthma = gdf[gdf['measure'] == 'Current asthma among adults'].copy()
gdf_asthma['data_value'] = pd.to_numeric(gdf_asthma['data_value'], errors='coerce')

# Plot the asthma data
gdf_asthma.plot(column='data_value', 
                ax=ax, 
                legend=True, 
                legend_kwds={'label': 'Asthma Prevalence (%)', 'orientation': 'horizontal'},
                cmap='YlOrRd',
                missing_kwds={'color': 'lightgrey'})

# Add basemap
ctx.add_basemap(ax, crs=gdf_asthma.crs.to_string(), source=ctx.providers.OpenStreetMap.Mapnik)

# Add a border to the map
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Set the extent to match the Detroit metro area
ax.set_xlim(gdf_asthma.total_bounds[0], gdf_asthma.total_bounds[2])
ax.set_ylim(gdf_asthma.total_bounds[1], gdf_asthma.total_bounds[3])

plt.title('Asthma Prevalence in Detroit Metro Area', fontsize=16)
ax.axis('off')

plt.tight_layout()
plt.show()

If you hadn’t already noticed while inspecting the object we returned from the API, the census tract data are actually centroid points–not census tract boundaries.

The map shows asthma prevalence at census tract centroid points across the Detroit metro area. With the YlOrRd colormap, darker red indicates higher asthma prevalence.
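If you prefer a true choropleth, one option is to join the PLACES values onto census tract polygons. The sketch below is an assumption-laden example: it assumes you have separately obtained 2020 Michigan tract boundaries (for instance, from the Census Bureau’s TIGER/Line files) as a GeoDataFrame named tracts_gdf with a string GEOID column, and that the PLACES locationid field holds the tract FIPS code at this geography level.

# Hypothetical: tracts_gdf holds 2020 Michigan census tract polygons with a 'GEOID' column
asthma_polys = tracts_gdf.merge(
    gdf_asthma[['locationid', 'data_value']].astype({'locationid': str}),
    left_on='GEOID', right_on='locationid', how='inner'
)

# Plot the joined polygons as a choropleth
ax = asthma_polys.plot(column='data_value', cmap='YlOrRd', legend=True, figsize=(7, 7))
ax.set_title('Asthma Prevalence by Census Tract')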

Finally, let’s print some statistics about the asthma data:

# Print some statistics for the asthma data
print("\nAsthma Statistics:")
print(f"Average asthma prevalence: {gdf_asthma['data_value'].mean():.2f}%")
print(f"Minimum asthma prevalence: {gdf_asthma['data_value'].min():.2f}%")
print(f"Maximum asthma prevalence: {gdf_asthma['data_value'].max():.2f}%")

# Print the number of census tracts per county
print("\nNumber of census tracts per county:")
print(gdf_asthma['countyname'].value_counts())

Asthma Statistics:
Average asthma prevalence: 12.06%
Minimum asthma prevalence: 7.20%
Maximum asthma prevalence: 17.20%

Number of census tracts per county:
countyname
Wayne      583
Oakland    345
Macomb     236
Name: count, dtype: int64

Wow. Some census tracts along the riverfront and central downtown area have an adult asthma prevalence of over 17%! I’m not an epidemiologist, but that seems extremely high, especially when compared to the outer suburbs and more affluent areas around Royal Oak and Ferndale. However, in light of our exploration of TRI and ICIS-AIR data, it might not be surprising to see elevated asthma rates in these areas.

Given that we’ve mostly been working with and creating raster data in the previous sections, it might be easier to draw systematic comparisons between pollution and health outcomes if the CDC data is also in a raster format.

Interpolate a Surface w/ IDW

While the choropleth map provides a good overview, it can be affected by the varying sizes of census tracts. To get a smoother representation of the data that resembles our air quality data, we can use Inverse Distance Weighting (IDW) interpolation to create a continuous surface.

First, let’s prepare our data for interpolation by reprojecting, extracting the coordinates and the asthma value, and masking the data for any asthma values that may be NA.

from scipy.interpolate import griddata
from rasterio.transform import from_origin
from rasterio.warp import transform_bounds

# Ensure gdf_asthma is in EPSG:4326
gdf_asthma = gdf_asthma.to_crs(epsg=4326)

# Extract coordinates and values
X = gdf_asthma.geometry.x.values
Y = gdf_asthma.geometry.y.values
Z = gdf_asthma['data_value'].values

# Remove any NaN values
mask = ~np.isnan(Z)
X, Y, Z = X[mask], Y[mask], Z[mask]
Data Science Review

Inverse Distance Weighting (IDW) is a spatial interpolation method used to estimate values at unknown locations based on known values at nearby points. The fundamental principle of IDW is that points closer to the location being estimated have more influence than those farther away. IDW calculates the estimated value as a weighted average of nearby known values, with weights inversely proportional to the distance between the known point and the estimation location. IDW is popular in geographic information systems (GIS) and environmental sciences due to its simplicity and effectiveness in creating continuous surfaces from point data, such as elevation, temperature, or pollution levels. While IDW has advantages like preserving local variation and working well with evenly distributed points, it also has limitations, including potential “bull’s-eye” patterns around data points, inability to account for spatial trends or barriers, and sensitivity to outliers and clustering of input points.

Now we define the structure of the raster we’re about to create. This is based on our desired resolution and the extent of our data.

# Create a grid to interpolate over
# Resolution (degrees)
grid_resolution = 0.025
# Extent
x_min, y_min, x_max, y_max = gdf_asthma.total_bounds
# Arrange the locations of the x and y dimensions
grid_x = np.arange(x_min, x_max, grid_resolution)
grid_y = np.arange(y_min, y_max, grid_resolution)
# Mesh them to a grid
grid_xx, grid_yy = np.meshgrid(grid_x, grid_y)

Now we’re ready to interpolate the surface.

# Perform IDW interpolation
points = np.column_stack((X, Y))
grid_z = griddata(points, Z, (grid_xx, grid_yy), method='linear')

This created a regular grid over our study area and then used the griddata function to interpolate the asthma prevalence data onto it. Note that griddata’s 'linear' method performs triangulation-based linear interpolation rather than true IDW, but it gives us a quick, built-in way to produce a smooth surface from the tract points.
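If you want true inverse-distance weighting, a minimal sketch using SciPy’s cKDTree might look like the following. It reuses the X, Y, Z arrays and the grid defined above; the function name and parameter choices (k nearest neighbors, a power of 2) are our own assumptions, not part of any library.

from scipy.spatial import cKDTree

def idw_interpolate(x, y, z, grid_xx, grid_yy, k=8, power=2):
    # Estimate each grid cell as a distance-weighted average of the k nearest points
    tree = cKDTree(np.column_stack((x, y)))
    targets = np.column_stack((grid_xx.ravel(), grid_yy.ravel()))
    dist, idx = tree.query(targets, k=k)
    dist = np.maximum(dist, 1e-12)          # avoid division by zero at coincident points
    w = 1.0 / dist**power                   # inverse-distance weights
    estimates = np.sum(w * z[idx], axis=1) / np.sum(w, axis=1)
    return estimates.reshape(grid_xx.shape)

# grid_z_idw = idw_interpolate(X, Y, Z, grid_xx, grid_yy)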

Now, let’s convert our interpolated data into an xarray Dataset so it’s easier to manipulate.

# Create an xarray Dataset from the interpolated data
ds = xr.Dataset({
    'asthma': (['y', 'x'], grid_z),
    'x': grid_x,
    'y': grid_y
})

# Set the correct CRS
ds.rio.write_crs("EPSG:4326", inplace=True)
<xarray.Dataset>
Dimensions:      (y: 33, x: 38)
Coordinates:
  * x            (x) float64 -83.67 -83.64 -83.62 ... -82.79 -82.77 -82.74
  * y            (y) float64 42.06 42.09 42.11 42.14 ... 42.79 42.81 42.84 42.86
    spatial_ref  int32 0
Data variables:
    asthma       (y, x) float64 nan nan nan nan nan nan ... nan nan nan nan nan

This step allows us to work with our interpolated data using xarray, which provides powerful tools for working with multi-dimensional arrays.

Next, we’ll clip our interpolated surface to the Detroit metro area. This ensures our data matches our other rasters. Then we’ll project it to match basemaps so we can create a map of the data.

# Clip the dataset using the detroit_metro polygon
ds_clipped = ds.rio.clip(detroit_metro.geometry, drop=False)

# Reproject to Web Mercator (EPSG:3857)
ds_3857 = ds_clipped.rio.reproject("EPSG:3857")

Finally, let’s create a map of our interpolated asthma prevalence:

# Create the plot
fig, ax = plt.subplots(figsize=(7, 7))

# Plot the interpolated and clipped data
im = ds_3857.asthma.plot(ax=ax, cmap='YlOrRd', 
                         alpha=0.7, add_colorbar=False)

# Add colorbar with adjusted height
cbar = plt.colorbar(im, ax=ax, label='Asthma Prevalence', 
                    fraction=0.047, pad=0.04, aspect=20)

# Add basemap
ctx.add_basemap(ax, crs=ds_3857.rio.crs, 
                source=ctx.providers.OpenStreetMap.Mapnik)

# Add a border to the map
bbox_polygon_bm.boundary.plot(ax=ax, color='#315c86', linewidth=3)

# Set the extent to match the Detroit metro area
bounds_3857 = transform_bounds("EPSG:4326", "EPSG:3857", 
                               x_min, y_min, x_max, y_max)
ax.set_xlim(bounds_3857[0], bounds_3857[2])
ax.set_ylim(bounds_3857[1], bounds_3857[3])

plt.title('IDW Interpolated Asthma Prevalence in Detroit Metro Area', fontsize=16)
ax.axis('off')

plt.tight_layout()
plt.show()

This code creates a map of the interpolated asthma prevalence, providing a smooth surface that shows the spatial variation in asthma rates across the Detroit metro area. Interpolating a continuous surface from the discrete census tract points can reveal patterns that might be less apparent in the point map.

By comparing this interpolated surface with our earlier maps of air pollution and social vulnerability, we can start to explore potential relationships between environmental factors, social conditions, and health outcomes in the Detroit metro area.

We can start with a simple correlation analysis and then visualize it with a scatterplot. The simplest way to perform the analysis is to extract the raw value pairs and calculate the Pearson correlation.

from scipy import stats

# Ensure both rasters have the same shape and are aligned
air_release_aligned = air_release_raster_da.rio.reproject_match(ds_clipped)

# Flatten the arrays and remove NaN values
air_release_flat = air_release_aligned.values.flatten()
# Take the log of the release values to account for extreme outliers
air_release_flat = np.log1p(air_release_flat)
asthma_flat = ds_clipped.asthma.values.flatten()
mask = ~np.isnan(air_release_flat) & ~np.isnan(asthma_flat)

correlation, p_value = stats.pearsonr(air_release_flat[mask], asthma_flat[mask])
print(f"Correlation coefficient: {correlation}")
print(f"P-value: {p_value}")
Correlation coefficient: 0.32952588325679505
P-value: 4.714216462025532e-23

Creating a scatterplot of the paired values can provide a better look at what’s going on.

# Create the empty figure
plt.figure(figsize=(7, 7))
# Create the scatter plot of the values
plt.scatter(air_release_flat[mask], asthma_flat[mask], alpha=0.5)
# Create Axis Labels
plt.xlabel('TRI Air Releases')
plt.ylabel('Asthma Prevalence')
# Title
plt.title('TRI Air Releases vs Asthma Prevalence')
# Let's see it
plt.show()

This paints a clearer picture. Although there is a modest positive correlation, there are numerous cells with 0 TRI air releases paired with a wide variety of asthma prevalence values, which weakens the correlation we might otherwise see at higher release levels. This is a simple pixel-by-pixel comparison of matching values. We can perform a more robust analysis that accounts for the spatial structure using a global autocorrelation analysis and Moran’s I.
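Before moving on to the spatial approach, a quick check (reusing the flattened arrays defined above) is to drop the zero-release cells and recompute the Pearson correlation, to see how much those cells dampen the relationship:

# Recompute the correlation using only cells with a nonzero (logged) air release value
nonzero = mask & (air_release_flat > 0)
corr_nz, p_nz = stats.pearsonr(air_release_flat[nonzero], asthma_flat[nonzero])
print(f"Correlation among nonzero-release cells: {corr_nz:.3f} (p = {p_nz:.3g})")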

Pearson’s correlation and Moran’s I are both measures of association, but they serve different purposes and account for different aspects of data. Pearson’s correlation measures the linear relationship between two variables without considering their spatial context. It simply looks at how two variables change together, regardless of where the data points are located in space. In contrast, Moran’s I is specifically designed for spatial data and measures spatial autocorrelation – the degree to which a variable is correlated with itself across geographic space. Moran’s I takes into account both the values of observations and their spatial relationships, typically using a spatial weights matrix. While Pearson’s correlation can tell you if two variables are related overall, Moran’s I can reveal whether high or low values of a variable tend to cluster together in space, or if they’re randomly distributed. This makes Moran’s I particularly useful for analyzing geographical patterns and spatial dependencies, which are crucial in fields like environmental science, epidemiology, and urban planning.
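To make Moran’s I concrete, here is a minimal NumPy sketch of the statistic for a dense spatial weights matrix W. It is purely illustrative; esda computes this for us (along with a permutation-based p-value) in the steps that follow.

# Illustrative only: Moran's I = (n / S0) * (sum_ij w_ij * z_i * z_j) / (sum_i z_i^2)
def morans_i(values, W):
    z = values - values.mean()                # deviations from the mean
    n = len(values)
    s0 = W.sum()                              # total weight
    numerator = (W * np.outer(z, z)).sum()    # spatially weighted cross-products
    denominator = (z ** 2).sum()
    return (n / s0) * (numerator / denominator)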

We can perform this analysis in Python using PySAL. We already know our data is spatially aligned from the previous code chunks, so we can create a spatial data frame and continue on.

from pysal.explore import esda
from pysal.lib import weights
import shapely

# Convert rasters to a GeoDataFrame
df = gpd.GeoDataFrame(
    {
        'air_release': air_release_aligned.values.flatten(),
        'asthma': ds_clipped.asthma.values.flatten(),
        'geometry': [
            # Pair coordinates in the same row-major (y, x) order used by .flatten()
            shapely.geometry.Point(x, y)
            for x, y in zip(
                np.tile(air_release_aligned.x.values, len(air_release_aligned.y)),
                np.repeat(air_release_aligned.y.values, len(air_release_aligned.x))
            )
        ]
    }
)
Statistics Review

A p-value, or probability value, is a statistical concept used to determine the significance of results in hypothesis testing. It represents the probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is true. In simpler terms, it measures the strength of evidence against the null hypothesis. P-values range from 0 to 1, with lower values indicating stronger evidence against the null hypothesis. Typically, a p-value below a predetermined threshold (often 0.05) is considered statistically significant, suggesting that the observed results are unlikely to have occurred by chance alone. However, it’s important to note that while p-values can indicate the strength of evidence against the null hypothesis, they do not prove or disprove hypotheses, nor do they measure the size or importance of an effect. They should be interpreted in conjunction with other factors such as effect size, sample size, and practical significance.
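The Moran’s I results below report a simulated (permutation) p-value, p_sim. To make that idea concrete, here is a small sketch of a permutation p-value for the Pearson correlation we computed earlier: shuffle one variable to break any real association, recompute the statistic many times, and count how often a value at least as large appears by chance. The 999 permutations and the random seed are arbitrary choices.

# Permutation p-value sketch: how often does a shuffled correlation match or beat the observed one?
rng = np.random.default_rng(42)
obs_r, _ = stats.pearsonr(air_release_flat[mask], asthma_flat[mask])
perm_r = np.array([
    stats.pearsonr(rng.permutation(air_release_flat[mask]), asthma_flat[mask])[0]
    for _ in range(999)
])
p_perm = (1 + np.sum(perm_r >= obs_r)) / (1 + len(perm_r))
print(f"Observed r: {obs_r:.3f}, permutation p-value: {p_perm:.3f}")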

Check for missing values and create the spatial weights matrix, which determines how wide of a net we want to cast as we search for a local dependence structure (you could change these values and see how they impact the results; see the short sensitivity sketch after the next code chunk). You can read more about spatial weights and settings for Moran’s I analysis in the PySAL documentation.

# Remove any rows with NaN values
df = df.dropna()

# Create a spatial weights matrix
w = weights.distance.KNN.from_dataframe(df, k=8)  # Using 8 nearest neighbors
w.transform = 'r'  # Row-standardize the weights
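As noted above, the choice of k is a judgment call. A quick, optional sensitivity sketch recomputes Moran’s I for asthma prevalence with a few different neighborhood sizes:

# Optional sensitivity check: how does Moran's I for asthma respond to the choice of k?
for k in [4, 8, 16]:
    w_k = weights.distance.KNN.from_dataframe(df, k=k)
    w_k.transform = 'r'
    mi = esda.Moran(df['asthma'], w_k)
    print(f"k = {k}: Moran's I = {mi.I:.3f}, pseudo p-value = {mi.p_sim:.3f}")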

Now we’ll cycle through the Moran function for air release, current asthma among adults, and the two combined (bivariate). First, air releases.

# Create a Moran scatter plot for air release
from splot.esda import moran_scatterplot

# Calculate Moran's I for air release
moran_air = esda.Moran(df['air_release'], w)

print("Moran's I for Air Release:")
print(f"I: {moran_air.I}")
print(f"p-value: {moran_air.p_sim}")

fig, ax = plt.subplots(figsize=(7, 7))
moran_scatterplot(moran_air, ax=ax)
# Get the current limits
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# Find the min and max of both axes
vmin = min(xlim[0], ylim[0])
vmax = max(xlim[1], ylim[1])
# Set both axes to the same limits
ax.set_xlim(vmin, vmax)
ax.set_ylim(vmin, vmax)
# Ensure the aspect ratio is 1:1
ax.set_aspect('equal', adjustable='box')
ax.set_title("Moran Scatter Plot: Air Release")
plt.show()
Moran's I for Air Release:
I: 0.11614142149447003
p-value: 0.001

When evaluating Moran’s I for a single variable we are assessing how clustered like values are; i.e., do high or low values of air release tend to appear next to each other? On a scale of -1 to 1, a value of 0.12 suggests slight positive spatial autocorrelation: high values of air release tend to cluster in similar spots. That said, the relationship is weak, and we can see the same feature as before in that a large number of 0 values are mixed in throughout the data. You could attempt to address this issue by using a coarser resolution when creating the raster layer depicting air release sums (we used 5000 m), or by including different sources of air pollution.
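One quick (and admittedly rough) way to gauge how much those zero cells dampen the statistic is to drop them and recompute Moran’s I for air release. Keep in mind that removing cells also changes the neighbor structure, so treat this only as an exploratory check:

# Exploratory check: Moran's I for air release using only cells with nonzero releases
df_nz = df[df['air_release'] > 0]
w_nz = weights.distance.KNN.from_dataframe(df_nz, k=8)
w_nz.transform = 'r'
moran_air_nz = esda.Moran(df_nz['air_release'], w_nz)
print(f"Moran's I (nonzero cells only): {moran_air_nz.I:.3f}, pseudo p-value = {moran_air_nz.p_sim:.3f}")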

# Calculate Moran's I for asthma prevalence
moran_asthma = esda.Moran(df['asthma'], w)

print("\nMoran's I for Asthma Prevalence:")
print(f"I: {moran_asthma.I}")
print(f"p-value: {moran_asthma.p_sim}")

# Create a Moran scatter plot for asthma prevalence
fig, ax = plt.subplots(figsize=(7, 7))
moran_scatterplot(moran_asthma, ax=ax)
# Get the current limits
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# Find the min and max of both axes
vmin = min(xlim[0], ylim[0])
vmax = max(xlim[1], ylim[1])
# Set both axes to the same limits
ax.set_xlim(vmin, vmax)
ax.set_ylim(vmin, vmax)
# Ensure the aspect ratio is 1:1
ax.set_aspect('equal', adjustable='box')
ax.set_title("Moran Scatter Plot: Asthma Prevalence")
plt.show()

Moran's I for Asthma Prevalence:
I: 0.5179466646931635
p-value: 0.001

Asthma prevalence has a much stronger spatial correlation structure. High asthma levels are clustered together spatially, with a Moran’s I of 0.52. Also note that the scatterplot shows a nice fit without any odd artifacts or outliers.

# Calculate bivariate Moran's I
moran_bv = esda.Moran_BV(df['air_release'], df['asthma'], w)

print("\nBivariate Moran's I (Air Release vs Asthma):")
print(f"I: {moran_bv.I}")
print(f"p-value: {moran_bv.p_sim}")

# Create a bivariate Moran scatter plot
fig, ax = plt.subplots(figsize=(7, 7))
# make the plot
moran_scatterplot(moran_bv, ax=ax)
# Get the current limits
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# Find the min and max of both axes
vmin = min(xlim[0], ylim[0])
vmax = max(xlim[1], ylim[1])
# Set both axes to the same limits
ax.set_xlim(vmin, vmax)
ax.set_ylim(vmin, vmax)
# Ensure the aspect ratio is 1:1
ax.set_aspect('equal', adjustable='box')
# Set the title
ax.set_title("Bivariate Moran Scatter Plot: Air Release vs Asthma")
plt.show()

Bivariate Moran's I (Air Release vs Asthma):
I: 0.10554510298823498
p-value: 0.001

Similar to the Pearson analysis, when we analyze their combined spatial dependence the relationship is not as strong, due in large part to the high number of 0 air release values. That said, the relationship is positive and statistically significant (p-value 0.001).

This could be due to several factors that went unaccounted for in our simple analysis.

  • Our choice of size/resolution when creating our summed air release raster.
  • The impacts of TRI air releases could be much more widespread than the 5km resolution we specified.
  • We only analyzed TRI-regulated facilities and their emissions.
  • We did not account for traffic density or other sources of particulate matter emissions.
  • We did not control for socioeconomic status, preventative healthcare access, and a whole host of other confounding variables.
  • There is no underlying relationship. This is unlikely given the large body of peer-reviewed literature, but a possibility nonetheless.
Knowledge Check

What type of data does the CDC PLACES dataset provide?

  1. Real-time air quality measurements
  2. Estimates of various health indicators at local levels
  3. Hospital admission rates for air pollution-related illnesses
  4. Locations of healthcare facilities in each county

Conclusion

This lesson provided a comprehensive exploration of various environmental and public health datasets, focusing on the Detroit metropolitan area. By analyzing data from ICIS-AIR, the Toxic Release Inventory (TRI), the Social Vulnerability Index (SVI), and CDC PLACES, we gained insights into the complex relationships between industrial air pollution, social vulnerability, and health outcomes. The integration of these datasets, combined with geospatial analysis techniques, allowed us to identify areas where environmental hazards and social vulnerabilities intersect, highlighting potential environmental justice concerns. This approach demonstrates the power of combining diverse datasets and utilizing geospatial tools to inform policy decisions and target interventions in areas of greatest need.

Key Learning Points

Congratulations! In this lesson you:

  • Accessed and processed data from multiple EPA and CDC APIs, including ICIS-AIR, TRI, and PLACES datasets
  • Created and manipulated geospatial objects using libraries such as GeoPandas and rasterio
  • Visualized point data on maps using Matplotlib and Contextily
  • Performed data cleaning and preprocessing, including handling missing coordinates and outliers
  • Created choropleth maps to visualize health outcomes data at the census tract level
  • Utilized Inverse Distance Weighting (IDW) interpolation to create continuous surfaces from point data
  • Integrated multiple datasets to create a composite index (Air Release Vulnerability Index)
  • Analyzed spatial patterns of air pollution, social vulnerability, and health outcomes
  • Applied raster math and resampling techniques to align and compare different spatial datasets
  • Interpreted results in the context of environmental justice and public health concerns
  • Performed Pearson correlation and Moran’s I spatial autocorrelation analyses

Lesson 3

In this lesson, we explored ….
