The purpose of Project Prometheus is to use machine learning tools to predict which areas of the U.S. (and, in the future, the world) will be most affected by Coronavirus. Forecasting future severity rates (via our "Hotspot index") allows us to better balance government intervention and economic activity and to minimize the negative impact on communities.
The above graph shows the model's prediction of hotspot index (a metric we created to describe how much of a hotspot a county is) on 05/27/2020, using data from 05/06/2020-05/20/2020.
Interactive 3D model of Coronavirus. Note: This model was sourced from Biodigital.
Our first major task was to process NASA, JAXA, and other open source data containing measurements regarding temperature, precipitation, night light radiance, etc. across a collection of portals and file formats (h5, hd5, he5, tiff, csv).
Once the data was read in with Python, we merged the different data frames into a single, standardized pandas DataFrame containing 40+ features.
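The merge step can be sketched as follows. This is a minimal illustration, assuming each source has already been reduced to a table keyed by county FIPS code and, for time-varying sources, a date. The `Date` column, the `merge_sources` helper, and every sample value are hypothetical; the column names `FIPS`, `Confirmed`, `Precipitation`, and `% Rural` come from the feature list in this write-up.

```python
import pandas as pd

# Hypothetical per-source tables (made-up values), one row per county per day.
cases = pd.DataFrame({
    "FIPS": [1001, 1001, 1003],
    "Date": pd.to_datetime(["2020-05-06", "2020-05-07", "2020-05-06"]),
    "Confirmed": [61, 67, 216],
})
weather = pd.DataFrame({
    "FIPS": [1001, 1003],
    "Date": pd.to_datetime(["2020-05-06", "2020-05-06"]),
    "Precipitation": [0.0, 1.2],
})
# Static county attributes carry no date, so they join on FIPS alone.
demographics = pd.DataFrame({
    "FIPS": [1001, 1003],
    "% Rural": [42.0, 42.3],
})

def merge_sources(cases, weather, demographics):
    """Left-join every source onto the case records so no county-day is dropped."""
    merged = cases.merge(weather, on=["FIPS", "Date"], how="left")
    merged = merged.merge(demographics, on="FIPS", how="left")
    return merged.sort_values(["FIPS", "Date"]).reset_index(drop=True)

merged = merge_sources(cases, weather, demographics)
print(merged)
```

A left join keeps every county-day from the case records; where a source has no matching row, the value becomes NaN and is left for the cleaning step to handle.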
After merging, we "cleaned" the data by filling in every missing value with the "next available data."
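One plausible reading of "next available data" is a backward fill, where each gap takes the value from the next time step that has one. The sketch below assumes the fill is applied within each county so that values never leak across FIPS codes. That grouping, the `Date` column, the helper name, and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical merged table with gaps (made-up values).
df = pd.DataFrame({
    "FIPS": [1001, 1001, 1001, 1003, 1003],
    "Date": pd.to_datetime([
        "2020-05-06", "2020-05-07", "2020-05-08",
        "2020-05-06", "2020-05-07",
    ]),
    "Precipitation": [np.nan, 0.4, np.nan, 1.2, np.nan],
})

def fill_with_next_available(df, value_cols):
    """Backward-fill each column within a county, in date order."""
    out = df.sort_values(["FIPS", "Date"]).copy()
    # bfill copies the next valid observation backward into each gap.
    out[value_cols] = out.groupby("FIPS")[value_cols].bfill()
    return out

filled = fill_with_next_available(df, ["Precipitation"])
print(filled)
```

Note that a gap at the end of a county's series has no "next" value and stays NaN under this approach, so a fallback (for example a forward fill) would still be needed for trailing gaps.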
Following the cleaning process, the data was split into training and evaluation/testing subsets (80% and 20%, respectively).
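The text does not say how the 80/20 split was drawn. The sketch below assumes a chronological split, a common choice for time series because it keeps future days out of the training set. The `chronological_split` helper, the `Date` column, and the placeholder values are hypothetical.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_fraction=0.8):
    """Put the earliest `train_fraction` of dates in train, the rest in test."""
    unique_dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_train_dates = int(len(unique_dates) * train_fraction)
    cutoff = unique_dates.iloc[n_train_dates]  # first date that belongs to the test set
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Hypothetical table: 2 counties x 10 days, with placeholder target values.
dates = pd.date_range("2020-05-06", periods=10, freq="D")
df = pd.DataFrame({
    "FIPS": [1001] * 10 + [1003] * 10,
    "Date": list(dates) * 2,
    "Hotspot Index": [0.1] * 20,
})

train, test = chronological_split(df)
print(len(train), len(test))
```

Splitting on dates rather than on rows keeps all counties for a given day on the same side of the cutoff.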
Next, hyperparameters for the model, such as the training batch size and the number of epochs (the number of times the model iterates through the dataset), were set.
The model was then specified to use the previous two weeks of input to predict a day's "hot spot index" one week (seven days) into the future.
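The input/target framing can be illustrated with a sliding window. This sketch assumes a 14-day lookback and a 7-day horizon, in line with the example earlier in this write-up (data from 05/06/2020-05/20/2020 used to predict 05/27/2020). The function name `make_windows`, the exact window length, and the synthetic data are assumptions.

```python
import numpy as np

def make_windows(features, target, lookback=14, horizon=7):
    """Build (input window, future target) pairs for one county.

    features: array of shape (n_days, n_features)
    target:   array of shape (n_days,)
    Each sample uses `lookback` consecutive days as input and the target
    value `horizon` days after the last input day as the label.
    """
    X, y = [], []
    last_start = len(features) - lookback - horizon
    for start in range(last_start + 1):
        end = start + lookback                 # exclusive end of the input window
        X.append(features[start:end])
        y.append(target[end - 1 + horizon])    # `horizon` days after the last input day
    return np.array(X), np.array(y)

# Synthetic example: 30 days, 3 features, target equal to the day index.
features = np.arange(30 * 3, dtype=float).reshape(30, 3)
target = np.arange(30, dtype=float)

X, y = make_windows(features, target)
print(X.shape, y.shape)
```

The resulting `X` has the three-dimensional shape (samples, time steps, features) that recurrent layers such as an LSTM typically expect.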
This "hot spot index" is a number that we, as a team, defined, aiming to characterize the infectivity, lethality, and overall spread of the virus.
The hotspot index for a county is defined as follows:
HotSpot Index = ln(IR * 0.003 + MR * 0.003 + GP * 0.002 + ln(C + 1) * 0.002 + 1)
IR = Incidence rate (cases per 100,000 people)
MR = Mortality Rate (Deaths in county / Population in county)
GP = Growth Percentage (New cases / Past total cases)
C = Total Confirmed Cases up to Yesterday
Here, 0.003, 0.002, and 1 are hard-coded constants.
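The definition above translates directly into code. In the sketch below, `hotspot_index` follows the formula term by term. The second helper, `hotspot_from_counts`, is hypothetical: it assumes that "past total cases" equals C, that the incidence rate uses the cumulative count including today's new cases, and that the growth percentage is 0 when there are no past cases.

```python
import math

def hotspot_index(ir, mr, gp, c):
    """Hotspot index from incidence rate, mortality rate, growth percentage,
    and total confirmed cases up to yesterday, using the stated constants."""
    return math.log(
        ir * 0.003 + mr * 0.003 + gp * 0.002 + math.log(c + 1) * 0.002 + 1
    )

def hotspot_from_counts(confirmed_yesterday, new_cases, deaths, population):
    """Hypothetical helper: derive IR, MR, GP from raw counts (see assumptions)."""
    confirmed = confirmed_yesterday + new_cases
    ir = confirmed / population * 100_000           # cases per 100,000 people
    mr = deaths / population                        # deaths / population
    # Assumption: growth is 0 when there are no past cases to divide by.
    gp = new_cases / confirmed_yesterday if confirmed_yesterday > 0 else 0.0
    return hotspot_index(ir, mr, gp, confirmed_yesterday)

# Made-up county: 500 past cases, 25 new cases, 10 deaths, 50,000 residents.
print(hotspot_from_counts(500, 25, 10, 50_000))
```

The trailing + 1 inside the outer logarithm means that a county with no cases and no deaths scores exactly 0, and the index grows slowly as any of the terms increase.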
A Long Short-Term Memory (LSTM) network from TensorFlow was then trained and tested on this data.
Below is a list of all the features used to train our model:
['FIPS', 'Confirmed', 'Deaths', 'Population', 'IncidenceRate',
'NewCases', 'MortalityRate', 'LST_Day', 'LST_Night',
'Mask Mandates', 'Stay-at-Home Orders', 'Travel Restrictions',
'Precipitation', 'UV Index',
'retail_and_recreation_percent_change_from_baseline',
'grocery_and_pharmacy_percent_change_from_baseline',
'parks_percent_change_from_baseline',
'transit_stations_percent_change_from_baseline',
'workplaces_percent_change_from_baseline',
'residential_percent_change_from_baseline', 'Life Expectancy',
'% Adults with Diabetes', '% Uninsured.1',
'Median Household Income', '% 65 and over', '% Black', '% Asian',
'% Native Hawaiian/Other Pacific Islander', '% Hispanic',
'% Non-Hispanic White', '% Rural', '% Fair or Poor Health',
'% Smokers', '% Adults with Obesity', 'Food Environment Index',
'Primary Care Physicians Ratio', '% Vaccinated', '% Some College',
'% Severe Housing Problems', 'ColumnAmountNO2',
'ColumnAmountNO2CloudScreened', 'Night Light', 'Hotspot Index']
After generating the model's predictions, we plotted the values with LeafletJS to generate the interactive choropleth map.
Research done by scientists from China and Harvard University suggests that, contrary to popular belief, warmer weather does not halt transmission of the virus (Powell, 2020).
Using meteorological data from 122 cities, researchers concluded that there is no evidence that Coronavirus infection rates decline when the temperature increases (Xie, 2020).
These results are particularly concerning, as many governments plan to ease restrictions based on this false assumption.
Furthermore, one may hypothesize that people may rely on this false assumption to guide their decision-making, resulting in "warmer areas" having more relaxed policies/behaviors and less social distancing.
Precipitation data was also included in our model, as seasonal variations in temperature and precipitation (e.g., rainfall), as well as rapidly changing weather, can exert strong pressures on infectious population dynamics (Chiyomaru and Takemoto, 2020).
Global temperature and UV data was gathered from NASA's MODIS (Moderate Resolution Imaging Spectroradiometer) instrument web data portal.
Global precipitation data was sourced from the AMSR2 (Advanced Microwave Scanning Radiometer 2) instrument aboard the Japan Aerospace Exploration Agency (JAXA) GCOM-W1 (Global Change Observation Mission) satellite "Shizuku," made available on JAXA's G-Portal.
In the event of missing data, we filled missing values with data from the next available time interval.
Worldwide, pollution correlates positively with economic activity (due to factories and transportation). NO2, or nitrogen dioxide, is a common byproduct of industrial activity and is also harmful to human health.
Areas with higher levels of pollution often also have a greater percentage of the population with preexisting conditions, increasing Coronavirus' severity/lethality.
As a result, NO2 is a good metric for estimating economic activity and measuring the susceptibility of populations in certain regions to Coronavirus.
We sourced the NO2 data from the Ozone Monitoring Instrument (OMINO2) and acquired it through the Earthdata portal.
So far, the inputs to the machine learning model have been based on scientific observations/measurements. This section adds statistics directly tied to health records to the model, including the historical number of cases, deaths, incidence rate, new cases, and population per county. This data is important because it provides a historic/near-present trajectory of the number of cases and deaths.
This information was collected from Johns Hopkins University's COVID-19 "USCounties_time" records.
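As an illustration of how the per-county case features relate to one another, the sketch below derives `NewCases`, `IncidenceRate`, and `MortalityRate` from cumulative counts. It is an assumption that these columns were computed this way rather than taken directly from the source records. The `Date` column, the helper name, the handling of the first day, and the clipping of negative corrections are all hypothetical.

```python
import pandas as pd

def add_case_features(df):
    """Derive per-day case features from cumulative county counts."""
    out = df.sort_values(["FIPS", "Date"]).copy()
    # New cases are the day-over-day change in the cumulative count.
    # Assumptions: the first day has 0 new cases, and downward corrections
    # in the cumulative series are clipped to 0.
    out["NewCases"] = out.groupby("FIPS")["Confirmed"].diff().fillna(0).clip(lower=0)
    out["IncidenceRate"] = out["Confirmed"] / out["Population"] * 100_000
    out["MortalityRate"] = out["Deaths"] / out["Population"]
    return out

# Made-up county time series, including one downward correction on day 3.
df = pd.DataFrame({
    "FIPS": [1001, 1001, 1001],
    "Date": pd.to_datetime(["2020-05-06", "2020-05-07", "2020-05-08"]),
    "Confirmed": [10, 15, 14],
    "Deaths": [0, 1, 1],
    "Population": [50_000, 50_000, 50_000],
})

out = add_case_features(df)
print(out)
```

Grouping by `FIPS` before differencing ensures the first day of one county is never compared with the last day of another.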
Additional health information, such as life expectancy and poor mental health days, was also included in our model. It was gathered from a program of the Robert Wood Johnson Foundation, a large philanthropic foundation.
It is a widely accepted fact that Coronavirus does not affect every community the same way.
Minorities, especially African-Americans, have a higher death rate on average than other groups.
One report describes death rates among “African American persons (92.3 deaths per 100,000 population) and Hispanic/Latino persons (74.3) that were substantially higher than that of white (45.2) or Asian (34.5) persons” (IHME, 2020).
This trend may be attributed to living conditions and neighborhoods: members of these communities tend to reside in more densely populated areas, where social distancing is harder, and thus may be more susceptible to transmission.
In addition, most of the jobs held by members of minority communities do not include health insurance or paid time off, making it very hard for workers to seek treatment (CDC).
As a result, social demographics play an important role in identifying hot spots.
In addition to race and ethnicity percentages, we also sourced data such as % rural and % severe housing problems and fed it into our model. We sourced this information from the Robert Wood Johnson Foundation program.
To better understand population concentrations, we gathered night light satellite data for January 2020 from the VIIRS (Visible Infrared Imaging Radiometer Suite) instrument aboard NASA's Suomi NPP (National Polar-orbiting Partnership) satellite. This data was also fed into the final model.
We also used mobility data from Google to better understand human movements and activities.
In the U.S., states are slowly reopening the economy, with Georgia being one of the first.
However, health officials are afraid that opening up the states too quickly may result in a much-feared second or third wave of cases.
A good case study is Japan. Its initial rigorous restrictions kept the number of cases low. However, after lifting the restrictions too quickly, the country was forced to shut down a second time due to a second wave of coronavirus cases (Leonard, 2020).
To minimize economic damage, many governors "overlook" some of the guidelines for reopening. 31 of the 50 states have already begun to partially reopen the economy, causing a resurgence in deaths and infections from Coronavirus.
We sourced government action data from multistate.us, an organization that provides COVID-19 policy data at the state level.
Thus, government action data are an important input to the model.