Aki Shiroshita (Epidemiology PhD student) developed a tailored version of DeGAUSS specifically for the EV project.

About Original DeGAUSS

DeGAUSS is designed to derive environmental variables while preserving the privacy of protected health information (PHI). It uses Docker images to process address data. Users upload a CSV file containing address information and receive an output file with various environmental variables.

Limitations of Original DeGAUSS

The original DeGAUSS offers limited flexibility.

Improvements in the Modified DeGAUSS

Modified DeGAUSS provides clean, processed output files with all PHI removed.

What can we get through Modified DeGAUSS?

| Category | Variable name | Data source | Description |
| --- | --- | --- | --- |
| Parsing and normalizing | address | libpostal | cleaned address |
| Geocoding | lon | TIGER/Line Street Range Address | longitude and latitude |
| | lat | TIGER/Line Street Range Address | longitude and latitude |
| Road proximity | dist_to_1100 | U.S. Census Bureau | distance (meters) to the nearest S1100 road |
| | dist_to_1200 | U.S. Census Bureau | distance (meters) to the nearest S1200 road |
| | length_1100 | U.S. Census Bureau | length (meters) of S1100 roads within a 500 m buffer |
| | length_1200 | U.S. Census Bureau | length (meters) of S1200 roads within a 500 m buffer |
| Traffic density | length_moving | U.S. Department of Transportation Federal Highway Administration | total length of interstates, expressways, and freeways (meters) |
| | length_stop_go | U.S. Department of Transportation Federal Highway Administration | total length of arterial roads (meters) |
| | vehicle_meters_moving | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles multiplied by the length of interstates, expressways, and freeways (vehicle-meters) |
| | vehicle_meters_stop_go | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles multiplied by the length of arterial roads (vehicle-meters) |
| | truck_meters_moving | U.S. Department of Transportation Federal Highway Administration | average daily number of trucks multiplied by the length of interstates, expressways, and freeways (truck-meters) |
| | truck_meters_stop_go | U.S. Department of Transportation Federal Highway Administration | average daily number of trucks multiplied by the length of arterial roads (truck-meters) |
| New road proximity and traffic density | dist_near | U.S. Department of Transportation Federal Highway Administration | distance (meters) to the nearest interstates, expressways, or freeways |
| | aadt_near | U.S. Department of Transportation Federal Highway Administration | average daily number of vehicles of the nearest interstates, expressways, or freeways |
| Redlining categories | redlining | Mapping Inequality | Historic HOLC classifications (A, B, C, and D) |
| Greenspace | evi_500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 500 meter buffer radius |
| | evi_1500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 1500 meter buffer radius |
| | evi_2500 | LP DAAC MOD13Q1 | average enhanced vegetation index within a 2500 meter buffer radius |
| Deprivation score | fraction_assisted_incom | 2015 American Community Survey | fraction of households receiving public assistance income or food stamps or SNAP in the past 12 months |
| | fraction_high_school_edu | 2015 American Community Survey | fraction of population 25 and older with educational attainment of at least high school graduation (includes GED equivalency) |
| | median_income | 2015 American Community Survey | median household income in the past 12 months in 2015 inflation-adjusted dollars |
| | fraction_no_health_ins | 2015 American Community Survey | fraction of population with no health insurance coverage |
| | fraction_poverty | 2015 American Community Survey | fraction of population with income in past 12 months below poverty level |
| | fraction_vacant_housing | 2015 American Community Survey | fraction of houses that are vacant |
| | dep_index | 2015 American Community Survey | composite measure of the 6 variables above |
| Air pollutants | average_no2_infancy | Original Schwartz model | Average monthly NO₂ levels during the exposure period |
| | average_bc_infancy | Provided by Kai Zhang | Average monthly black carbon levels during the exposure period |

In addition, you can obtain the following information if needed:

Brief description of what Modified DeGAUSS is doing.

  1. modified_degauss_run_xxx.R

This is the main script (index file) that specifies the folder structure and file paths for input and output data. All subsequent scripts are called or sourced from here.

  1. R/initial_set_up.R

This script ensures that Podman is properly configured and running. It also loads all required R packages and custom functions used.

  1. R/xxx_cohort_preparation_simple.R

This script prepares the child cohort data before address standardization and geocoding.

It keeps only the address records that overlap with the exposure period, such as infancy (from birth to 1 year).
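The overlap rule can be sketched in a few lines. This is a minimal illustration in Python, not the pipeline's R code; the record fields, the `infancy_window` helper, and the choice of a 365-day window starting on the date of birth are assumptions made for the example.

```python
from datetime import date, timedelta


def infancy_window(dob, days=365):
    """Assumed exposure window: date of birth plus the following 364 days."""
    return dob, dob + timedelta(days=days - 1)


def overlaps_exposure(addr_start, addr_end, window_start, window_end):
    """True if the address period shares at least one day with the window."""
    return addr_start <= window_end and addr_end >= window_start


# Hypothetical address history for one child.
records = [
    {"child_id": 1, "start": date(2009, 6, 1), "end": date(2010, 2, 28)},
    {"child_id": 1, "start": date(2010, 3, 1), "end": date(2010, 9, 30)},
    {"child_id": 1, "start": date(2011, 6, 1), "end": date(2012, 1, 31)},
]
dob = date(2010, 1, 15)
win_start, win_end = infancy_window(dob)

# Keep only the records that touch the infancy window.
kept = [
    r for r in records
    if overlaps_exposure(r["start"], r["end"], win_start, win_end)
]
```

In this example the third record starts after the window has closed, so only the first two are kept.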

  1. R/parsing.R

This script parses and normalizes raw addresses using libpostal to improve match rates in geocoding. libpostal is a machine learning–based address parser trained on OpenStreetMap data.

The environment for libpostal is created using Podman, based on ghcr.io/degauss-org/postal:0.1.4.

Each raw address should end with a 5-digit ZIP code, as the geocoding algorithm uses it as the initial search key.
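A quick pre-check for the ZIP code requirement might look like the following. It is a Python sketch rather than part of the pipeline, and the function name `ends_with_zip5` is hypothetical. ZIP+4 endings are treated as failing here because the requirement is a 5-digit ZIP at the very end of the string.

```python
import re

# Exactly five digits at the end of the string, not preceded by another digit.
ZIP_AT_END = re.compile(r"(?<!\d)\d{5}$")


def ends_with_zip5(address):
    """True if the address string ends with a standalone 5-digit ZIP code."""
    return bool(ZIP_AT_END.search(address.strip()))


# Hypothetical raw addresses.
addresses = [
    "123 Main St Nashville TN 37203",
    "123 Main St Nashville TN 37203-1234",
    "123 Main St Nashville TN",
]
needs_review = [a for a in addresses if not ends_with_zip5(a)]
```

Addresses collected in `needs_review` could then be corrected by hand before parsing.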

  1. R/geocoding_xxx.R

This is the most time-consuming part of the pipeline. It performs address geocoding using the Ruby gem Geocoder-US-2.0.4.

The algorithm applies fuzzy matching, and low-confidence results are removed. The Ruby and SQL geocoding environment is built using degauss/geocoder:3.0.

The address reference data are the 2021 TIGER/Line Address Range files from the U.S. Census Bureau.

Post-geocoding cleanup

The script then adjusts address timelines to match each child’s exposure period. It creates a single continuous time span, with a total of about 365 days for each child.

The start date of the first address is trimmed so it doesn’t precede the child’s date of birth.

The end date of the last address is trimmed so it doesn’t extend beyond the first birthday.

If a gap exists between consecutive address periods, the end date of the earlier record is set to one day before the next record starts.

If an overlap exists, the end date of the earlier record is adjusted similarly.
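The four rules above can be sketched as follows. This is an illustrative Python version, not the pipeline's R code. It assumes that each child's records have already been limited to those overlapping the exposure period, that start dates are distinct, and that the window ends 364 days after the date of birth; `adjust_timeline` is a hypothetical name.

```python
from datetime import date, timedelta


def adjust_timeline(periods, dob, window_days=365):
    """Turn one child's (start, end) address periods into a continuous span."""
    window_end = dob + timedelta(days=window_days - 1)
    ordered = sorted(periods)
    adjusted = []
    for i, (start, end) in enumerate(ordered):
        if i + 1 < len(ordered):
            # Close gaps and remove overlaps: end one day before the next start.
            end = ordered[i + 1][0] - timedelta(days=1)
        adjusted.append([start, end])
    # Trim the first start to the date of birth and the last end to the window end.
    adjusted[0][0] = max(adjusted[0][0], dob)
    adjusted[-1][1] = min(adjusted[-1][1], window_end)
    return [(start, end) for start, end in adjusted]


# Hypothetical history: an overlap in March and a gap in September.
dob = date(2010, 1, 15)
periods = [
    (date(2009, 6, 1), date(2010, 3, 10)),
    (date(2010, 3, 1), date(2010, 8, 31)),
    (date(2010, 9, 15), date(2011, 6, 30)),
]
adjusted = adjust_timeline(periods, dob)
total_days = sum((end - start).days + 1 for start, end in adjusted)
```

With these inputs the three adjusted periods are contiguous and `total_days` is 365.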

  1. R/census.R

This script appends census and redlining variables to each address.

It adds FIPS codes (census tract and block group IDs) for both 2010 and 2020, using TIGER boundary files restricted to Tennessee.

It adds Home Owners' Loan Corporation (HOLC) redlining grades (A, B, C, D), using data from the University of Richmond.
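Attaching an area attribute to a geocoded point is a point-in-polygon lookup. The sketch below shows the idea with a plain ray-casting test in Python. It is not the pipeline's R code, the area labels are made up, and polygons are simplified to single rings without holes.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(x, y, areas):
    """Return the label of the first area containing the point, or None."""
    for label, ring in areas.items():
        if point_in_polygon(x, y, ring):
            return label
    return None


# Hypothetical areas: two adjacent unit squares.
areas = {
    "area_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "area_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
label_1 = assign_area(0.5, 0.5, areas)  # "area_A"
label_2 = assign_area(1.5, 0.2, areas)  # "area_B"
label_3 = assign_area(3.0, 3.0, areas)  # None
```

The same lookup applies whether the polygons are census tracts, block groups, or HOLC-graded neighborhoods; an address outside every polygon simply gets no value.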

  1. R/road.R

This script calculates the shortest distance to the nearest primary and secondary roads (separately).

Road data are from TIGER 2011, accessed via the R tigris package. The data source is U.S. Census Bureau TIGER/Line Files.

I chose the 2011 data because it is the earliest year for which primary and secondary road shapefiles are available through the R tigris package.
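The shortest-distance calculation can be illustrated with a small Python sketch. It assumes that addresses and roads are already in a projected coordinate system measured in meters and that each road is a polyline of vertices. The function names are hypothetical and this is not the pipeline's R code.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Distance from point P to the segment AB in planar coordinates."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project P onto the line through A and B, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_nearest_road(px, py, roads):
    """Smallest distance from the point to any segment of any polyline."""
    return min(
        point_segment_distance(px, py, *a, *b)
        for line in roads
        for a, b in zip(line, line[1:])
    )


# Hypothetical roads around a home at the origin (coordinates in meters).
primary = [[(-100.0, 300.0), (100.0, 300.0)]]
secondary = [[(50.0, -50.0), (50.0, 50.0), (200.0, 50.0)]]
dist_primary = distance_to_nearest_road(0.0, 0.0, primary)      # 300.0
dist_secondary = distance_to_nearest_road(0.0, 0.0, secondary)  # 50.0
```

Running the function once per road class gives the separate primary and secondary distances.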

  1. R/green.R

This script calculates the average Enhanced Vegetation Index (EVI) around each residential address, which serves as a measure of surrounding greenness.

EVI data are derived from LP DAAC MOD13Q1, a product of NASA's Land Processes Distributed Active Archive Center (LP DAAC). Professor Cole Brokamp created a cloud-free composite EVI raster at 250 × 250 m resolution by assembling individual remote-sensing images collected between June 10 and June 25, 2018.
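As a rough illustration of a buffer average, the Python sketch below takes the mean of all raster cells whose centers fall within a given radius of a point. The cell-center inclusion rule, the toy raster, and the function name are assumptions; the actual R script may weight cells differently.

```python
import math


def mean_within_buffer(raster, cell_size, x, y, radius):
    """Mean of raster cells whose centers lie within `radius` of (x, y).

    raster[row][col] holds the value of the cell whose center is at
    ((col + 0.5) * cell_size, (row + 0.5) * cell_size); None marks missing data.
    """
    values = []
    for row, line in enumerate(raster):
        for col, value in enumerate(line):
            if value is None:
                continue
            cx = (col + 0.5) * cell_size
            cy = (row + 0.5) * cell_size
            if math.hypot(cx - x, cy - y) <= radius:
                values.append(value)
    return sum(values) / len(values) if values else None


# Toy 3 x 3 raster with 250 m cells; the home sits at the center cell.
raster = [
    [0.2, 0.2, 0.2],
    [0.4, 0.6, 0.4],
    [0.2, 0.2, 0.2],
]
evi_small = mean_within_buffer(raster, 250, 375, 375, 200)  # center cell only
evi_large = mean_within_buffer(raster, 250, 375, 375, 500)  # all nine cells
```

Larger radii pull in more cells, which is why the pipeline reports separate values for the 500 m, 1500 m, and 2500 m buffers.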

  1. R/dep_index.R

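This script computes census tract-level material deprivation indices created by Professor Cole Brokamp. The following six census tract-level variables were derived from the 2015 American Community Survey: fraction_assisted_incom, fraction_high_school_edu, median_income, fraction_no_health_ins, fraction_poverty, and fraction_vacant_housing. The composite dep_index summarizes these six variables.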

For a more comprehensive set of indices, use dep_index_britt.R, which generates a separate dataset containing: ADI, SVI, COI, EJI, CRE, DCI, NDI, NSES, SDI, and EQI.

  1. R/aadt.R

This script calculates traffic density within a 500-meter buffer radius of each residential address: average annual daily traffic (AADT) multiplied by road length, for interstates, expressways, and freeways.

Traffic and road data were obtained from the U.S. Department of Transportation, Federal Highway Administration, using the Highway Performance Monitoring System (HPMS) dataset for 2014, the earliest year available for both vehicle and truck traffic volumes.
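The arithmetic behind a vehicle-meters style measure is a weighted sum over the road segments inside the buffer. The Python sketch below assumes the segments have already been clipped to the 500 m buffer. The field names (`length_m`, `aadt`, `truck_aadt`) and the numbers are made up for illustration.

```python
def traffic_density(segments):
    """Sum road length and traffic-weighted length over clipped segments."""
    return {
        "length": sum(s["length_m"] for s in segments),
        "vehicle_meters": sum(s["aadt"] * s["length_m"] for s in segments),
        "truck_meters": sum(s["truck_aadt"] * s["length_m"] for s in segments),
    }


# Two hypothetical freeway segments inside one address's buffer.
segments = [
    {"length_m": 400, "aadt": 50_000, "truck_aadt": 4_000},
    {"length_m": 250, "aadt": 30_000, "truck_aadt": 1_500},
]
density = traffic_density(segments)
# density["length"] == 650
# density["vehicle_meters"] == 27_500_000
# density["truck_meters"] == 1_975_000
```

Applying the same function to a different road class, such as arterial roads, would give that class's totals.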

It calculates the following metrics:

  1. R/road_and_aadt.R

Because road proximity (from TIGER) and traffic density (from HPMS) are from different data sources, this script integrates both to define a combined exposure metric. It calculates road proximity and Average Annual Daily Traffic (AADT) for interstates, expressways, and freeways using the HPMS dataset, allowing evaluation of the interaction between proximity to major roads and the traffic volume along those roads.
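Once the distance from an address to each candidate road segment is known, the combined metric reduces to picking the closest segment and reading off its traffic volume. A minimal Python sketch, with hypothetical input fields and made-up values, reusing the output names `dist_near` and `aadt_near`:

```python
def nearest_road_exposure(candidates):
    """Return distance and AADT of the closest candidate segment, or None."""
    if not candidates:
        return None
    nearest = min(candidates, key=lambda c: c["distance_m"])
    return {"dist_near": nearest["distance_m"], "aadt_near": nearest["aadt"]}


# Hypothetical distances from one address to three highway segments.
candidates = [
    {"distance_m": 820.0, "aadt": 95_000},
    {"distance_m": 310.0, "aadt": 42_000},
    {"distance_m": 1_450.0, "aadt": 120_000},
]
exposure = nearest_road_exposure(candidates)
# exposure == {"dist_near": 310.0, "aadt_near": 42000}
```

Keeping both values side by side is what allows proximity and traffic volume to be analyzed jointly, for example through an interaction term.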

  1. R/road_density.R

This script calculates the length (meters) of primary, secondary and minor roads (separately) within a 500 m buffer radius of each residential address.

Road data are from TIGER 2011, accessed via the R tigris package. The data source is U.S. Census Bureau TIGER/Line Files.

I chose the 2011 data because it is the earliest year for which primary and secondary road shapefiles are available through the R tigris package.
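Measuring road length inside a circular buffer comes down to clipping each road segment to the circle. The Python sketch below does this analytically for straight segments in projected coordinates (meters). It is an illustration under those assumptions, not the pipeline's R code, and the function names are hypothetical.

```python
import math


def segment_length_in_circle(ax, ay, bx, by, cx, cy, radius):
    """Length of the part of segment AB that lies inside the circle."""
    dx, dy = bx - ax, by - ay
    fx, fy = ax - cx, ay - cy
    a = dx * dx + dy * dy
    if a == 0:
        return 0.0
    # Solve |A + t * (B - A) - C|^2 = radius^2 for the parameter t.
    b = 2 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - 4 * a * c
    if disc <= 0:
        return 0.0  # the line misses the circle or only touches it
    root = math.sqrt(disc)
    t_in = max(0.0, (-b - root) / (2 * a))
    t_out = min(1.0, (-b + root) / (2 * a))
    if t_out <= t_in:
        return 0.0
    return (t_out - t_in) * math.sqrt(a)


def road_length_in_buffer(roads, cx, cy, radius=500.0):
    """Total length of all polyline segments inside the buffer."""
    return sum(
        segment_length_in_circle(*p, *q, cx, cy, radius)
        for line in roads
        for p, q in zip(line, line[1:])
    )


# Hypothetical roads around a home at the origin.
roads = [
    [(-1000.0, 0.0), (1000.0, 0.0)],   # crosses the buffer: 1000 m inside
    [(0.0, 0.0), (0.0, 300.0)],        # fully inside: 300 m
    [(600.0, -100.0), (600.0, 100.0)], # fully outside: 0 m
]
total_length = road_length_in_buffer(roads, 0.0, 0.0)
```

Here `total_length` comes to 1300 m; running the function per road class gives the separate lengths.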

  1. R/no2_monthly_xxx.R

This script estimates average NO₂ exposure during the defined exposure period. NO₂ estimates provided by Professor Joel Schwartz are available for 2000-01 through 2016-12.

First, it identifies the 1 km² grid cell in which each residential address is located. Then, it pulls the monthly NO₂ values for that grid cell during the exposure period. Finally, it computes the mean NO₂ concentration across all months within the exposure window.
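The monthly averaging step can be sketched in Python as follows. The sketch assumes that every calendar month touched by the exposure window counts equally and that monthly values are stored in a dictionary keyed by grid cell and month. Both are simplifications for illustration, and the names are hypothetical.

```python
from datetime import date


def months_in_window(start, end):
    """List the 'YYYY-MM' labels of every calendar month from start to end."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def mean_exposure(monthly_values, cell_id, start, end):
    """Unweighted mean of the available monthly values for one grid cell."""
    values = [
        monthly_values[(cell_id, m)]
        for m in months_in_window(start, end)
        if (cell_id, m) in monthly_values
    ]
    return sum(values) / len(values) if values else None


# Hypothetical monthly values for one grid cell: 0.0, 1.0, ..., 12.0.
window = months_in_window(date(2010, 1, 15), date(2011, 1, 14))
monthly_values = {("cell_7", m): float(i) for i, m in enumerate(window)}
avg = mean_exposure(monthly_values, "cell_7", date(2010, 1, 15), date(2011, 1, 14))
```

A window from 2010-01-15 to 2011-01-14 touches 13 calendar months, so `avg` is the mean of 13 values (6.0 here). Months outside the period covered by the estimates are skipped.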

  1. R/bc_monthly_xxx.R

This script estimates average black carbon (BC) exposure during the defined exposure period. BC estimates provided by Professor Kai Zhang are available for 2000-01 through 2017-12.

First, it identifies the 0.01° × 0.01° grid cell in which each residential address is located. Then, it pulls the monthly BC values for that grid cell during the exposure period. Finally, it computes the mean BC concentration across all months within the exposure window.
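Locating the grid cell for an address can be as simple as integer division of the coordinates by the cell size. The Python sketch below assumes that cell edges fall on exact multiples of 0.01°; the actual alignment of the BC grid is not specified here, so treat the origin as an assumption. The coordinates are made up.

```python
import math


def grid_cell(lon, lat, cell_deg=0.01):
    """Return the (column, row) index of the grid cell containing the point.

    Assumes cell edges fall on exact multiples of `cell_deg`.
    """
    return math.floor(lon / cell_deg), math.floor(lat / cell_deg)


# Two hypothetical addresses a few hundred meters apart, and one further west.
cell_a = grid_cell(-86.7816, 36.1627)
cell_b = grid_cell(-86.7849, 36.1601)
cell_c = grid_cell(-86.7951, 36.1627)
same_cell = cell_a == cell_b  # True: both fall in cell (-8679, 3616)
```

Addresses that share a cell index receive identical monthly values, so the index can serve directly as the join key to the gridded estimates.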

  1. R/tabulation.R

This script calculates the number of children per census block group. Block groups with fewer than 11 children are masked as "<11".
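Small-cell masking of this kind can be sketched in Python as below. The block group labels are made up, and counts are returned as strings so that masked and unmasked cells share one column type.

```python
from collections import Counter


def tabulate_with_masking(block_group_ids, threshold=11):
    """Count children per block group, masking counts below the threshold."""
    counts = Counter(block_group_ids)
    return {
        bg: (str(n) if n >= threshold else f"<{threshold}")
        for bg, n in counts.items()
    }


# Hypothetical cohort: 12 children in one block group, 3 in another.
ids = ["bg_A"] * 12 + ["bg_B"] * 3
table = tabulate_with_masking(ids)
# table == {"bg_A": "12", "bg_B": "<11"}
```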

  1. R/final_clean_xxx.R

This is the final data-cleaning step. It removes all PHI and retains only the necessary environmental and contextual variables.

Specifically, address, latitude, and longitude are removed.
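A minimal Python sketch of the column removal, with a check that no identifying field survives. The column names `address`, `lat`, and `lon` are assumed from the variable list, and the example row is fabricated.

```python
PHI_COLUMNS = {"address", "lat", "lon"}  # assumed names of identifying columns


def remove_phi(rows, phi_columns=PHI_COLUMNS):
    """Return copies of the rows without the identifying columns."""
    return [
        {key: value for key, value in row.items() if key not in phi_columns}
        for row in rows
    ]


# Hypothetical geocoded row with two derived variables.
rows = [
    {
        "child_id": 1,
        "address": "123 Main St Nashville TN 37203",
        "lat": 36.16,
        "lon": -86.78,
        "dist_to_1100": 300.0,
        "evi_500": 0.36,
    }
]
clean = remove_phi(rows)

# Safety check before export: no identifying column may remain.
assert all(PHI_COLUMNS.isdisjoint(row) for row in clean)
```

An explicit allow-list of output variables would be a stricter alternative, since a newly added identifying column would then be dropped by default.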