Project Prometheus

The purpose of Project Prometheus is to use machine learning tools to predict the areas of the U.S. (and, in the future, the globe) that will be most affected by Coronavirus. Forecasting future severity rates (via our "Hotspot index") allows us to better balance government intervention and economic activity to minimize the negative impact on communities.

Interactive Models




Interactive Map of Hotspot Index from LSTM Model Prediction




The above map shows the model's prediction of the hotspot index (a metric we created to describe how much of a hotspot a county is) on 05/27/2020, using data from 05/06/2020-05/20/2020.




Interactive 3D Model of Coronavirus



Interactive 3D model of Coronavirus. Note: This model was sourced from Biodigital.

Factor Importance - Perturbation Results
This pie chart shows the perturbation effect for each feature as a percentage of the total sum of effects. The perturbation is done by adding to each feature a random value drawn from a distribution between 0 and 0.2. The higher the percentage, the more the error increases when the feature is perturbed, which might imply that the feature is more important to the model's prediction.
Factor Importance - Permutation Results
This pie chart shows the permutation effect for each feature as a percentage of the total sum of effects. The permutation is done by randomly shuffling each feature. The higher the percentage, the more the error increases when the feature is permuted, which might imply that the feature is more important to the model's prediction.
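
The two procedures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the noise is assumed to be uniform on [0, 0.2], features are assumed to be scaled to roughly [0, 1], the error metric is assumed to be mean absolute error, and toy_predict is a hypothetical stand-in for the trained LSTM.

```python
import random

def mean_abs_error(predict, rows, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(predict(r) - t) for r, t in zip(rows, targets)) / len(rows)

def perturb_column(rows, col, rng, high=0.2):
    """Copy rows, adding uniform noise between 0 and `high` to one feature."""
    return [r[:col] + [r[col] + rng.uniform(0.0, high)] + r[col + 1:] for r in rows]

def permute_column(rows, col, rng):
    """Copy rows, randomly shuffling one feature across samples."""
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    return [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]

def importance_shares(predict, rows, targets, alter, rng):
    """Error increase per feature, as a percentage of the total increase."""
    base = mean_abs_error(predict, rows, targets)
    effects = []
    for col in range(len(rows[0])):
        err = mean_abs_error(predict, alter(rows, col, rng), targets)
        effects.append(max(err - base, 0.0))
    total = sum(effects)
    return [100.0 * e / total if total else 0.0 for e in effects]

# Hypothetical stand-in model: strong dependence on feature 0,
# weak dependence on feature 1, none on feature 2.
def toy_predict(row):
    return 3.0 * row[0] + 0.5 * row[1]

rng = random.Random(0)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
targets = [toy_predict(r) for r in rows]

perturbation_shares = importance_shares(toy_predict, rows, targets, perturb_column, rng)
permutation_shares = importance_shares(toy_predict, rows, targets, permute_column, rng)
```

Each list of shares sums to 100 and corresponds to the slices of one pie chart. In this toy setup the unused third feature receives a share of zero under both procedures.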

Methodology



A flowchart depicting data sources and our process of creating the model and the interactive choropleth. Steps containing data from NASA/JAXA and other space agencies are filled in green, while boxes using data only from non-NASA data sources are filled in blue.



Our first major task was to process NASA, JAXA, and other open-source data containing measurements of temperature, precipitation, night light radiance, etc. across a collection of portals and file formats (h5, hd5, he5, tiff, csv). Once the data was read in with Python, we merged the different data frames into a single, standardized Pandas DataFrame containing 40+ features.
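
As a rough sketch of the merging step, assume each source has already been reduced to a table with one row per county (FIPS) and date. The "Date" column name, the merge_sources helper, and all values below are made up for illustration; the feature column names follow the project's feature list.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-source tables, one row per county (FIPS) and date.
cases = pd.DataFrame({
    "FIPS": [1001, 1001, 1003],
    "Date": ["2020-05-06", "2020-05-07", "2020-05-06"],
    "Confirmed": [61, 67, 194],
})
temperature = pd.DataFrame({
    "FIPS": [1001, 1003],
    "Date": ["2020-05-06", "2020-05-06"],
    "LST_Day": [301.2, 299.8],
})
precipitation = pd.DataFrame({
    "FIPS": [1001, 1001],
    "Date": ["2020-05-06", "2020-05-07"],
    "Precipitation": [0.0, 4.1],
})

def merge_sources(frames, keys=("FIPS", "Date")):
    """Outer-join all frames on the shared keys so no county-day is dropped."""
    merged = reduce(
        lambda left, right: left.merge(right, on=list(keys), how="outer"),
        frames,
    )
    return merged.sort_values(list(keys)).reset_index(drop=True)

merged = merge_sources([cases, temperature, precipitation])
```

An outer join is one possible choice here: it keeps every county-day that appears in any source and leaves gaps as missing values, which a later cleaning step can fill.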

After merging, the data was "cleaned" by filling in all missing values with the "next available data." The cleaned data was then split into training and evaluation/testing subsets (80% and 20%, respectively). Next, hyperparameters for the model, such as the training batch size and number of epochs (the number of times the model iterates through the dataset), were set. The model was then specified to use the previous two weeks' worth of input to predict a day's "hot spot index" a week (seven days) into the future. This "hot spot index" is a number that we, as a team, defined, aiming to characterize the infectivity, lethality, and overall spread of the virus. The hotspot index for a county is defined as follows:

HotSpot Index = ln(IR * 0.003 + MR * 0.003 + GP * 0.002 + ln(C + 1) * 0.002 + 1)

IR = Incidence rate (cases per 100,000 people)
MR = Mortality Rate (Deaths in county / Population in county)
GP = Growth Percentage (New cases / Past total cases)
C = Total Confirmed Cases up to Yesterday
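
The definition can be written as a small Python function. This sketch assumes that all five terms, including the trailing 1, sit inside the outer logarithm; the function and argument names are illustrative, and the example values are made up.

```python
import math

def hotspot_index(incidence_rate, mortality_rate, growth_percentage, confirmed_cases):
    """Hotspot index for one county, using the weights from the definition above."""
    weighted_sum = (
        incidence_rate * 0.003                     # IR: cases per 100,000 people
        + mortality_rate * 0.003                   # MR: deaths / population
        + growth_percentage * 0.002                # GP: new cases / past total cases
        + math.log(confirmed_cases + 1) * 0.002    # C: confirmed cases up to yesterday
        + 1                                        # keeps the outer log non-negative
    )
    return math.log(weighted_sum)

# Made-up county values, for illustration only.
example = hotspot_index(
    incidence_rate=250.0,
    mortality_rate=0.0004,
    growth_percentage=0.05,
    confirmed_cases=1200,
)
```

With every input at zero the index is exactly 0, and it grows as any of the four inputs grows.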

In this formula, 0.003, 0.002, and 1 are hard-coded constants. We then trained and tested a Long Short Term Memory (LSTM) model built with TensorFlow. Below is a list of all the features used to train our model:

['FIPS', 'Confirmed', 'Deaths', 'Population', 'IncidenceRate', 'NewCases', 'MortalityRate', 'LST_Day', 'LST_Night', 'Mask Mandates', 'Stay-at-Home Orders', 'Travel Restrictions', 'Precipitation', 'UV Index', 'retail_and_recreation_percent_change_from_baseline', 'grocery_and_pharmacy_percent_change_from_baseline', 'parks_percent_change_from_baseline', 'transit_stations_percent_change_from_baseline', 'workplaces_percent_change_from_baseline', 'residential_percent_change_from_baseline', 'Life Expectancy', '% Adults with Diabetes', '% Uninsured.1', 'Median Household Income', '% 65 and over', '% Black', '% Asian', '% Native Hawaiian/Other Pacific Islander', '% Hispanic', '% Non-Hispanic White', '% Rural', '% Fair or Poor Health', '% Smokers', '% Adults with Obesity', 'Food Environment Index', 'Primary Care Physicians Ratio', '% Vaccinated', '% Some College', '% Severe Housing Problems', 'ColumnAmountNO2', 'ColumnAmountNO2CloudScreened', 'Night Light', 'Hotspot Index']
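
The windowing and the 80/20 split described earlier might look like the sketch below. Several details are assumptions rather than facts from the project: the input window is taken as 14 daily rows, the hotspot index is taken to be the last column of each row (as in the list above), and the split is chronological rather than random. The helper names are hypothetical.

```python
def make_windows(series, input_days=14, horizon_days=7):
    """Pair each run of `input_days` daily rows with the hotspot index
    observed `horizon_days` after the last day of that run."""
    inputs, targets = [], []
    last_start = len(series) - input_days - horizon_days
    for start in range(last_start + 1):
        window = series[start:start + input_days]
        target_row = series[start + input_days - 1 + horizon_days]
        inputs.append(window)
        targets.append(target_row[-1])  # assumes the hotspot index is the last column
    return inputs, targets

def split_train_eval(inputs, targets, train_fraction=0.8):
    """Chronological split: the earliest windows train, the latest evaluate."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])

# Made-up series for one county: 30 days, each row is [feature, hotspot index].
series = [[day, 100 + day] for day in range(30)]

inputs, targets = make_windows(series)
train, evaluation = split_train_eval(inputs, targets)
```

Each element of inputs has the shape an LSTM layer expects for one sample (time steps by features), so the lists can be converted to arrays and passed to a model.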

After generating the model's predictions, we plotted the values with LeafletJS to generate the interactive choropleth map.

Future Work

In the future, we will explore more complex models besides LSTM, collect more data to reduce the amount of missing data, and perhaps design and deploy our website in a more sophisticated manner, such as using React.js to make it dynamic. We might also create a pipeline to automatically collect satellite and coronavirus data each day, update the database, and improve the model’s accuracy. For the time being, we hope that our map is a valuable resource for the government and communities, as they can use it as a guide to determine how resources should best be deployed to save lives while maintaining normalcy where possible.

Environment

Research by scientists from China and Harvard University suggests that, contrary to popular belief, warmer weather does not halt transmission of the virus (Powell, 2020). Using meteorological data from 122 cities, they concluded that there is no evidence that Coronavirus infection rates decline when the temperature increases (Xie, 2020). These results are particularly concerning, as many governments plan to ease restrictions based on this false assumption.

Furthermore, one may hypothesize that people may rely on this false assumption to guide their decision-making, resulting in "warmer areas" having more relaxed policies/behaviors and less social distancing.

Precipitation data was also included in our model, as seasonal variations in temperature and precipitation (e.g., rainfall), as well as rapidly changing weather variability, can exert strong pressures on infectious population dynamics (Chiyomaru and Takemoto, 2020).

Global temperature and UV data were gathered from NASA's MODIS (Moderate Resolution Imaging Spectroradiometer) instrument web data portal. Global precipitation data was sourced from the Japan Aerospace Exploration Agency (JAXA) GCOM-W1 (Global Change Observation Mission) Shizuku's AMSR2 (Advanced Microwave Scanning Radiometer), made available on JAXA's G-Portal.

In the event of missing data, we filled missing values with data from the next available time interval.
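
With Pandas, filling from the next available time interval corresponds to a backward fill. The sketch below assumes the fill is done separately for each county and uses made-up readings; the "Date" column name is illustrative.

```python
import pandas as pd

# Made-up daily readings for two counties; None marks a missing measurement.
readings = pd.DataFrame({
    "FIPS": [1001, 1001, 1001, 1003, 1003, 1003],
    "Date": pd.to_datetime(["2020-05-06", "2020-05-07", "2020-05-08"] * 2),
    "Precipitation": [None, 4.1, 0.3, 1.2, None, None],
})

readings = readings.sort_values(["FIPS", "Date"])
# Backward fill within each county: a gap takes the value from the
# next day that has one, and never borrows from another county.
readings["Precipitation"] = readings.groupby("FIPS")["Precipitation"].bfill()
```

A gap at the end of a county's record has no later value to borrow, so it stays missing and would need separate handling.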

A weather map depicting a day's worth of global temperature data from NASA on 2020-02. Note: The uncolored spots indicate missing data.

A weather map visualizing a day's worth of global precipitation data gathered from JAXA on 2020-01-26. Note: The uncolored spots indicate missing data.

Pollution

It is a worldwide observation that pollution correlates positively with economic activity (due to factories and transportation). NO2, or nitrogen dioxide, is a common byproduct of industrial activity that is also harmful to human health. Areas with higher levels of pollution often also have a greater percentage of the population with preexisting conditions, increasing Coronavirus' severity/lethality. As a result, NO2 is a good metric for estimating economic activity and measuring the susceptibility of populations in certain regions to Coronavirus.

We sourced the NO2 data from OMINO2 (Ozone Monitoring Instrument) and acquired it from the EarthDATA portal.

A bar chart depicting the counties with the highest levels of processed NO2, from NASA data on 03-04-2020.

Health

So far, the inputs to the machine learning model have been based on scientific observations and measurements. This section covers statistics directly tied to health records, including the historical number of cases, deaths, incidence rate, new cases, and population per county, which were also fed into the model. This data is important because it provides a historic/near-present trajectory of the number of cases and deaths. This information was collected from Johns Hopkins University's COVID-19 "USCounties_time" records.

Additional health information, such as life expectancy and poor mental health days, was also included in our model. It was gathered from the Robert Wood Johnson Foundation program, a large philanthropic foundation.

Ethnicity and Communities

It is a widely accepted fact that Coronavirus does not affect every community the same way. Minorities, especially African-Americans, have an on-average higher death rate compared to other races. One source reports death rates among “African American persons (92.3 deaths per 100,000 population) and Hispanic/Latino persons (74.3) that were substantially higher than that of white (45.2) or Asian (34.5) persons” (IHME, 2020). This trend may be attributed to living conditions and neighborhoods: members of these communities tend to reside in more densely populated areas, where social distancing is harder, and thus may be more susceptible to transmission. In addition, most of the jobs held by members of minority communities do not include health insurance or paid time off, making it very hard for them to seek treatment (CDC). As a result, social demographics play an important role in identifying hot spots.

In addition to race, we also sourced data for % rural, % severe housing problems, % ethnicities, etc. and fed it into our model. This information came from the Robert Wood Johnson Foundation program.

To better understand population concentrations, we gathered night light satellite data from NASA's Suomi-NPP's (National Polar-orbiting Partnership) VIIRS (Visible Infrared Imaging Radiometer Suite) instrument from the month of January 2020. This data was also fed into the final model. We also used mobility data from Google to supplement the information to better understand human movements and activities.



A time lapse generated from daily night light captures (2020-02-18 to 2020-03-26), created from NASA's Worldview.

Government Action

In the U.S., states are slowly reopening the economy, with Georgia being one of the first. However, health officials are afraid that reopening too quickly may result in a much-feared second or third wave of cases. A good case study is Japan: its initial rigorous restrictions kept the number of cases low, but after lifting the restrictions too quickly, it was forced to shut down a second time due to a second wave of coronavirus cases (Leonard, 2020). To minimize economic damage, many governors "overlook" some of the guidelines for reopening. Thirty-one of the 50 states have already begun to partially reopen the economy, causing a resurgence in deaths and infections from Coronavirus. We sourced government action data from multistate.us, an organization that provides COVID-19 policy data at the state level.

Thus, government action data are an important input to the model.

Figure depicting states open/closed/restrictions.

Citations




Adams, M. (2020, August 3). Early Release - Population-Based Estimates of Chronic Conditions Affecting Risk for Complications from Coronavirus Disease, United States - Volume 26, Number 8-August 2020 - Emerging Infectious Diseases journal - CDC. Retrieved May 30, 2020, from https://wwwnc.cdc.gov/eid/article/26/8/20-0679_article

Berrick, S. (2004, October 1). Earthdata Search. Retrieved May 30, 2020, from https://search.earthdata.nasa.gov/search/granules?p=C1266136111-GES_DISC

Berrick, S. (2020, May 30). Find Environmental Impacts Data. Retrieved May 30, 2020, from https://earthdata.nasa.gov/learn/pathfinders/covid-19/environmental-impacts

Berrick, S. (2020, May 12). Find Seasonality Data. Retrieved May 30, 2020, from https://earthdata.nasa.gov/learn/pathfinders/covid-19/seasonality

E. (Ed.). (2020, May 18). JHU Centers for Civic Impact Covid-19 County Cases (Daily Update). Retrieved May 30, 2020, from https://coronavirus-resources.esri.com/datasets/4cb598ae041348fb92270f102a6783cb/data?layer=1

G. (2020). COVID-19 Community Mobility Report. Retrieved May 30, 2020, from https://www.google.com/covid19/mobility/

IHME: COVID-19 Projections. (2020, May 26). Retrieved May 29, 2020, from https://covid19.healthdata.org/united-states-of-america

J. (Ed.). (2020, May 27). Mortality Analyses. Retrieved May 30, 2020, from https://coronavirus.jhu.edu/data/mortality

Jari Hovila, Antii Arola, and Johanna Tamminen (2014), OMI/Aura Surface UVB Irradiance and Erythemal Dose Daily L2 Global Gridded 0.25 degree x 0.25 degree V3, NASA Goddard Space Flight Center, Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed: [Data Access Date], 10.5067/Aura/OMI/DATA2028

Levy, R. (2019, May 16). Earth at Night (Black Marble) 2016 Grayscale Maps. Retrieved May 30, 2020, from https://visibleearth.nasa.gov/images/144897/earth-at-night-black-marble-2016-grayscale-maps

Leonard, A. (2020, April 24). Hokkaido Forced to Reinstate Lockdown After Coronavirus Returned. Retrieved May 29, 2020, from https://time.com/5826918/hokkaido-coronavirus-lockdown/

Lusk, J. (2020, May 01). NGF: Only four states remain closed to golf with no announced dates to reopen. Retrieved May 31, 2020, from https://golfweek.usatoday.com/2020/05/01/ngf-only-four-states-closed-golf/

N. (Ed.). (2020, April 22). COVID-19 in Racial and Ethnic Minority Groups. Retrieved May 29, 2020, from https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/racial-ethnic-minorities.html

T. (2020, April 24). UPDATE State officials say reopening will be regional, data-driven. Retrieved June 01, 2020, from https://www.dailyitem.com/news/local_news/update-state-officials-say-reopening-will-be-regional-data-driven/article_26f4cc1c-8580-11ea-a7ae-f3cb259dd4ac.html

Ogden, C. (2017, August 01). Overweight & Obesity Statistics. Retrieved May 30, 2020, from https://www.niddk.nih.gov/health-information/health-statistics/overweight-obesity

Pender, J. (2020). Download Data. Retrieved May 30, 2020, from https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/

Powell, A. (2020, April 14). Warm weather may have no impact on COVID-19. Retrieved May 29, 2020, from https://news.harvard.edu/gazette/story/2020/04/covid-19-may-not-go-away-in-warmer-weather-as-do-colds/

U. (2020). Explore Health Rankings: Rankings Data & Documentation. Retrieved May 30, 2020, from https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation?fbclid=IwAR0sUOv2BU7j4UH8I5EL-g-vt1peVY37ZKQeQ6z94l7DQn6cNNIBw_zaZ00

Wan, Z., Hook, S., Hulley, G. (2015). MYD11C1 MODIS/Aqua Land Surface Temperature/Emissivity Daily L3 Global 0.05Deg CMG V006 [Data set]. NASA EOSDIS Land Processes DAAC. Accessed 2020-05-28 from https://doi.org/10.5067/MODIS/MYD11C1.006

Xie, J., & Zhu, Y. (2020, July 1). Association between ambient temperature and COVID-19 infection in 122 cities from China. Retrieved May 29, 2020, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7142675/

44 Features
1 Long Short Term Memory Machine Learning Model
6.275% Training Error using the past two weeks' data to predict a week in advance
9.923% Evaluation Error using the past two weeks' data to predict a week in advance

The Team




Sky



Solomon



Gabe



Alex



Eric