Lesson_05_wotking_with_polygon.png

Spatial Data Processing

Data

Age Adjusted Lung & Bronchus Cancer Mortality Rates by County

Age Adjusted Lung & Bronchus Cancer Mortality Rates from 1980 t0 2014 by county was obtained from Institute for Health Metrics and Evaluation (IHME). The rate was estimated using spatially explicit Bayesian mixed effects regression models for both males and females (Mokdad et al., 2016). The model incorporated 7 covariates, such as proportion of the adult population that graduated high school, the proportion of the population that is Hispanic, the proportion of the population that is black, the proportion of the population that is a race other than black or white, The proportion of a county that is contained within a state or federal Native American reservation, the median household income, and the population density. To reduce the noise due to low mortality rate in a small population size, we smoothed data with the Empirical Bayesian smoother, a tool for dealing with rate instability associated with small sample sizes. It takes the population in the region as an indicator of confidence in the data, with higher population providing a higher confidence in the value at that location. We used SpaceStat (Evaluation version) Software for smoothing data.

Age-standardized Cigarette Smoking (%) Prevalence by County

Age-standardized cigarette smoking (%) prevalence by County estimated from the Behavioral Risk Factor Surveillance System (BRFSS) survey data (Dwyer-Lindgren et al., 2014). The prevalence (%) were estimated by logistic hierarchical mixed effects regression models, stratified by sex. All estimates were age-standardized following the age structure of the 2000 census. The uncertainty of the prevalence estimates was assessed using simulation methods.

Age-standardized Poverty Rate by County

County-level poverty data (% population below poverty level) are from US Census Small Area Income and Poverty Estimates (SAIPE) Program. Income and poverty rate were estimated by combining survey data with population estimates and administrative records.

Particulate Matter (PM2.5)

Annual mean PM2.5 from 1998 to 2012 were modeled from aerosol optical depth from multiple satellite (Multiangle Imaging Spectra Radiometer, MODIS Dark Target, MODIS and SeaWiFS Deep Blue, MODIS MAIAC) products and validated by ground-based PM2.5 monitoring sites (Aaron van Donkelaar et al., 2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Nitrogen dioxide (NO2)

Annual mean ambient NO2 Concentrations estimated from three satellite instruments, such as Global Ozone Monitoring Experiment (GOME), Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY) and GOME-2 satellite in combination with chemical transport model (Gedders et al., (2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Sulfer dioxide (SO2)

Annual mean ambient SO2 Concentrations estimated were obtained from time series Multi-source SO2 emission retrievals and consistency of satellite and surface measurements of (Fioletov et al,(2017). All raster grid of PM2.5 were re-sampled at 2.5 km x 2.5 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Load Libraries

In [1]:
import os
import sys

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import
from mpl_toolkits.axes_grid1 import make_axes_locatable
import mapclassify

import plotly.graph_objects as go
import plotly.express as px

import seaborn as sns

from glob import glob
from functools import reduce
from osgeo import ogr, gdal
from osgeo import gdalconst
import geopandas as gpd
import pycrs
import fiona
from fiona.crs import from_epsg
from shapely.geometry import box
from shapely.geometry import Point
import shapely.geometry as geom

import json

from scipy.stats import sem
from scipy.stats import zscore
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.api import abline_plot
from statsmodels import graphics

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

%matplotlib inline
In [2]:
# What version of Python do you have?
print(f"Python {sys.version}")
print(f"geopandas {gpd.__version__}")
print(f"osgeo {gdal.__version__}")

import tensorflow.keras

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")

print()
print(f"Python {sys.version}")
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
geopandas 0.6.2
osgeo 3.0.2
Tensor Flow Version: 2.0.0
Keras Version: 2.2.4-tf

Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
GPU is available

Setting consistent plotting style throughout notebook

In [3]:
sns.set_style("white")
sns.set(font_scale=1.5)

Set working directory

In [4]:
path= "E:/GitHub/geospatial-python-github.io/Lesson_05_working_with_polygon_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())
Current Working Directory  E:\GitHub\geospatial-python-github.io\Lesson_05_working_with_polygon_data

Import all CSV

In [5]:
cancer=pd.read_csv("Data_05/Lung_cancer_1998_2012.csv")
smoking=pd.read_csv("Data_05/SMOKING_1998_2012.csv")
poverty=pd.read_csv("Data_05/POVERTY_1998_2012.csv")
PM25=pd.read_csv("Data_05/PM25_1998_2012.csv")
NO2=pd.read_csv("Data_05/NO2_1998_2012.csv")
SO2=pd.read_csv("Data_05/SO2_1998_2012.csv")

Load County shape file and Extract Centriod

In [6]:
county_gpd=gpd.read_file('Data_05/county_map.shp')
county_gpd.plot()
county_gpd.head()
Out[6]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908...
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874...
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404...
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529...
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604...

Now, we will get centriod of each county

Creat Centriods of all counties

In [7]:
# copy poly to new GeoDataFrame
cen_points = county_gpd.copy()
# change the geometry
cen_points.geometry = cen_points['geometry'].centroid
# same crs
cen_points.crs =county_gpd.crs
cen_points.head()
Out[7]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry
0 58 13111.0 South South Atlantic Georgia Fannin County GA POINT (1056523.936 1376613.190)
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POINT (1653442.064 2267301.476)
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POINT (1633708.275 2096683.059)
3 593 51683.0 South South Atlantic Virginia Manassas city VA POINT (1584048.852 1901443.401)
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POINT (1735811.102 2409536.493)
In [8]:
fig, ay = plt.subplots()
county_gpd.plot(ax=ay, color="white", edgecolor="gray")
cen_points.plot(ax=ay, color='red', markersize=1)
ay.set_title('County Centriods')
Out[8]:
Text(0.5, 1, 'County Centriods')
In [9]:
# Get X (m) and Y (m) Values from geometry column
cen_points['X'] = cen_points.geometry.x
cen_points['Y'] = cen_points.geometry.y
cen_points.head()
Out[9]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry X Y
0 58 13111.0 South South Atlantic Georgia Fannin County GA POINT (1056523.936 1376613.190) 1.056524e+06 1.376613e+06
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POINT (1653442.064 2267301.476) 1.653442e+06 2.267301e+06
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POINT (1633708.275 2096683.059) 1.633708e+06 2.096683e+06
3 593 51683.0 South South Atlantic Virginia Manassas city VA POINT (1584048.852 1901443.401) 1.584049e+06 1.901443e+06
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POINT (1735811.102 2409536.493) 1.735811e+06 2.409536e+06
In [10]:
### Convert geopandas data frame to pandas data frame without geometry column
columns =['STATE','COUNTY','REGION','DIVISION','geometry' ]
cen_df = pd.DataFrame(cen_points.drop(columns=columns))
cen_df.head()
Out[10]:
FIDs FIPS code X Y
0 58 13111.0 GA 1.056524e+06 1.376613e+06
1 427 42115.0 PA 1.653442e+06 2.267301e+06
2 407 42075.0 PA 1.633708e+06 2.096683e+06
3 593 51683.0 VA 1.584049e+06 1.901443e+06
4 236 36057.0 NY 1.735811e+06 2.409536e+06

Calculate 15 years Mean

We will create a data frame of 15 years (1998-2012) mean all variables, and will use this data in do some statistical analysis.

In [11]:
# copy cen_df as "avg" pd-dataframe
avg_df = cen_df.copy()
In [12]:
avg_df["cancer"]= cancer.iloc[:,2:16].mean(axis=1)
avg_df["poverty"]= poverty.iloc[:,2:16].mean(axis=1)
avg_df["smoking"]= smoking.iloc[:,2:16].mean(axis=1)
avg_df["PM25"]= PM25.iloc[:,2:16].mean(axis=1)
avg_df["NO2"]= NO2.iloc[:,2:16].mean(axis=1)
avg_df["SO2"]= SO2.iloc[:,2:16].mean(axis=1)
In [13]:
avg_df.head()
Out[13]:
FIDs FIPS code X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 GA 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 PA 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 PA 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 VA 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 NY 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293
In [14]:
# Write as CSV file
avg_df.to_csv('Data_05/county_mean.csv',index=False)

Join the average data with county shape file

In [15]:
# Drop code coloun for pd-data frame
avg_mf=avg_df.drop("code", axis=1)
county_gpd_data = county_gpd.merge(avg_mf, on='FIPS')
county_gpd_data.head()
Out[15]:
FIDs_x FIPS REGION DIVISION STATE COUNTY code geometry FIDs_y X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908... 58 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874... 427 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404... 407 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529... 593 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604... 236 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293

Write as a ESRI shape file

In [16]:
county_gpd_data.to_file("Data_05/county_mean_data.shp")

Visualization of Spatial Data

Map County Mean Data

In [17]:
# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot Cancer 
county_gpd_data.plot(column='cancer', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5
                  )
bx1.set_title("LBC Mortality Rate");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot Poverty
county_gpd_data.plot(column='poverty',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx2.set_title("Povert (%)");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot Smoking
county_gpd_data.plot(column='smoking', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5
                 );
bx3.set_title("Smoking (%)");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)
Out[17]:
<matplotlib.colorbar.Colorbar at 0x232a180e148>
In [18]:
# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot PM25 
county_gpd_data.plot(column='PM25', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5,
                  )
bx1.set_title("PM25");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot NO2
county_gpd_data.plot(column='NO2',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5,
                   );
bx2.set_title("NO2");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot SO2
county_gpd_data.plot(column='SO2', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5,
                 );
bx3.set_title("SO2");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)
Out[18]:
<matplotlib.colorbar.Colorbar at 0x232a1c8a0c8>

Save as PNG files

Once you start using them, for() loops are fairly simple. The syntax of our for() loop will say this: for each year in the list_of_years list, run the following code. And when all years in our list have gone through the code, stop the loop.

In [19]:
county_gpd=gpd.read_file('Data_05/county_map.shp')
In [20]:
cancer_df=pd.read_csv("Data_05/Lung_cancer_1998_2012.csv")
In [21]:
cancer_df.describe()
Out[21]:
FIDs FIPS 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
count 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000
mean 332.500000 35727.693694 73.857273 74.097648 74.101438 73.939181 73.529323 72.603300 70.713445 70.485016 69.530583 68.137611 67.341979 66.486614 65.285216 65.135519 64.668501
std 192.401923 14669.711655 11.419725 11.573146 11.754784 11.922416 12.070827 12.112833 12.039270 12.223738 12.322275 12.305433 12.360195 12.547609 12.526113 12.801249 13.006163
min 0.000000 10001.000000 45.750330 45.171786 44.385688 43.523156 41.827329 40.854616 40.274638 38.811441 38.484895 37.206145 35.430022 35.070739 34.025360 32.956657 32.300568
25% 166.250000 24009.500000 66.585316 66.668914 66.580404 66.251332 65.815276 64.757661 62.755216 62.725070 61.713092 60.189047 59.463569 58.726896 57.608005 57.072282 56.591102
50% 332.500000 37126.000000 73.422482 73.593816 73.683809 73.514081 73.121945 72.123658 70.146444 69.996469 69.015290 67.717531 66.922792 66.007622 64.885836 64.735454 64.086434
75% 498.750000 51032.500000 80.168129 80.658646 80.742217 80.819539 80.453967 79.659168 77.966000 77.622660 76.661568 75.409592 74.807386 73.938938 72.472790 72.528113 72.158332
max 665.000000 54109.000000 125.371782 126.672925 124.912023 125.890516 125.275492 123.720335 120.625564 122.255614 120.007166 117.200991 116.837902 116.410885 113.512619 115.013336 115.277577
In [22]:
cancer_gpd = county_gpd.merge(cancer_df, on='FIPS')
In [23]:
# save all the maps in the charts folder
output_path = 'Data_05/Maps'
# counter for the for loop
i = 0

# list of years (which are the column names at the moment)
list_of_years = ['1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', 
                 '2006', '2007', '2008','2010','2011', '2012']

# set the min and max range for the choropleth map
vmin, vmax = 30, 130
In [24]:
# start the for loop to create one map per year
for year in list_of_years:
    
    # create map, UDPATE: added plt.Normalize to keep the legend range the same for all maps
    fig = cancer_gpd.plot(column=year, 
                       cmap='Blues', 
                       figsize=(6,5), 
                       linewidth=0.5,
                       edgecolor='0.8', 
                       vmin=vmin, 
                       vmax=vmax,
    legend=True, norm=plt.Normalize(vmin=vmin, vmax=vmax))
    
    # remove axis of chart
    fig.axis('off')
    
    # add a title
    fig.set_title('LBC Mortality Rate', \
              fontdict={'fontsize': '20',
                         'fontweight' : '3'})
    
    # this will save the figure as a high-res png in the output path. you can also save as svg if you prefer.
    filepath = os.path.join(output_path, year+'_LBC.png')
    chart = fig.get_figure()
    chart.savefig(filepath, dpi=300)
    
       # position the annotation to the bottom left
    fig.annotate(year,
            xy=(0.1, .225), xycoords='figure fraction',
            horizontalalignment='left', verticalalignment='top',
            fontsize=15)

Claculate State Mean and Standard Deviation

We use will county mean data (avg) to calculate state mean and standard deviation of LBC mortality rates and all other covariates.

In [25]:
# Mean
state_mean=avg_df.groupby(['code'])[['cancer','smoking','poverty','PM25']].mean()
state_mean.head()
Out[25]:
cancer smoking poverty PM25
code
DC 70.430824 28.307143 14.985714 13.100000
DE 65.833982 24.073810 12.930952 10.702381
GA 70.467286 26.336074 15.127089 11.670368
MD 69.042681 25.288988 13.492560 10.906875
NC 70.425814 26.308929 15.002214 11.563850
In [26]:
# Standard deviation
state_std=avg_df.groupby(['code'])[['cancer','smoking','poverty','PM25']].std()
state_std.head()
Out[26]:
cancer smoking poverty PM25
code
DC NaN NaN NaN NaN
DE 5.155995 2.291626 8.771336 0.663413
GA 11.391757 3.016408 5.487087 1.249014
MD 13.861489 4.107085 4.554613 1.747535
NC 10.780429 2.866285 4.571188 1.485551

Load state shape file and join the mean and standard deviation

In [27]:
state_gpd= gpd.read_file('Data_05/state_map.shp')
state_gpd_mean = state_gpd.merge(state_mean, on='code')
state_gpd_mean.head()
Out[27]:
REGION DIVISION STATE code geometry cancer smoking poverty PM25
0 South South Atlantic Georgia GA POLYGON ((951036.762 1377628.230, 951703.064 1... 70.467286 26.336074 15.127089 11.670368
1 Northeast Middle Atlantic New Jersey NJ POLYGON ((1725424.656 2032375.054, 1725818.270... 69.216233 25.633333 14.975510 11.496429
2 Northeast Middle Atlantic New York NY MULTIPOLYGON (((1811323.700 2157305.764, 18121... 71.095635 26.680415 15.429263 11.705783
3 Northeast Middle Atlantic Pennsylvania PA POLYGON ((1274617.959 2216152.533, 1276592.656... 74.066155 26.507889 15.460661 11.641652
4 South South Atlantic South Carolina SC POLYGON ((1145671.120 1373771.391, 1146510.223... 69.523421 26.728106 16.127329 11.328245
In [28]:
state_gpd=gpd.read_file('Data_05/state_map.shp')
state_gpd_std = state_gpd.merge(state_std, on='code')
state_gpd_std.head()
Out[28]:
REGION DIVISION STATE code geometry cancer smoking poverty PM25
0 South South Atlantic Georgia GA POLYGON ((951036.762 1377628.230, 951703.064 1... 11.391757 3.016408 5.487087 1.249014
1 Northeast Middle Atlantic New Jersey NJ POLYGON ((1725424.656 2032375.054, 1725818.270... 16.031692 4.748389 7.884255 1.253335
2 Northeast Middle Atlantic New York NY MULTIPOLYGON (((1811323.700 2157305.764, 18121... 10.639096 2.871818 5.602243 1.240578
3 Northeast Middle Atlantic Pennsylvania PA POLYGON ((1274617.959 2216152.533, 1276592.656... 11.561251 3.672934 5.631242 1.415402
4 South South Atlantic South Carolina SC POLYGON ((1145671.120 1373771.391, 1146510.223... 12.081050 2.856309 5.175032 1.442773
In [29]:
# Make subplots that are next to each other
fig, (bx1, bx2) = plt.subplots(nrows=1, ncols=2,  figsize=(12, 5))

# Plot mean data
state_gpd_mean.plot(column='cancer',
                  edgecolor="gray", 
                  ax=bx1,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx1.set_title("Mean LBC Rate");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)


# Plot Standard deviation
state_gpd_std.plot(column='cancer',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx2.set_title("STDEV LBC Rate");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)
C:\anaconda3\envs\tensorflow\lib\site-packages\matplotlib\colors.py:527: RuntimeWarning:

invalid value encountered in less

Out[29]:
<matplotlib.colorbar.Colorbar at 0x232a3375948>

Interactive Map with Plotly

Plotly's Python graphing library makes interactive, publication-quality graphs. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

In [30]:
fig = go.Figure(data=go.Choropleth(
    locations=state_gpd_mean['code'], # Spatial coordinates
    z = state_gpd_mean['cancer'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Death (per 100K)",
))

fig.update_layout(
    title_text = 'Mean LBC Mortality Rate by Selected State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Making choropleth maps from your gespatial data requires two main types of input: GeoJSON-formatted geometry information where each feature has an id and a list of values indexed by feature id. The GeoJSON data is passed to the geojson attribute, and the data is passed into the z (color for px.choropleth_mapbox) attribute, in the same order as the IDs are passed into the location attribute.

In [31]:
mf = pd.read_csv("Data_05/county_mean.csv", dtype={"FIPS": str})
mf.head()
Out[31]:
FIDs FIPS code X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 GA 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 PA 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 PA 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 VA 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 NY 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293
In [32]:
with open('Data_05/county_gcs.json') as read_file:
   counties = json.load(read_file)
counties["features"][0]
Out[32]:
{'type': 'Feature',
 'id': 1,
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-84.50989030699998, 34.98801669100004],
    [-84.62148733899994, 34.988335687000074],
    [-84.62275633199994, 34.857137660000035],
    [-84.61864133099994, 34.855404659000044],
    [-84.42259727399994, 34.85746266800004],
    [-84.41742527199995, 34.84982266600008],
    [-84.35422225199994, 34.825025663000076],
    [-84.34366524899997, 34.82478766400004],
    [-84.32806924399995, 34.80640966100003],
    [-84.31465723999997, 34.80733466100003],
    [-84.25217121699995, 34.71816764500005],
    [-84.25229621399995, 34.664616634000026],
    [-84.24262821099995, 34.67091463500003],
    [-84.22647520599998, 34.653884633000075],
    [-84.19389119499994, 34.63884063100005],
    [-84.19675819499997, 34.61793062600003],
    [-84.18856119199995, 34.60269862300004],
    [-84.15803918499995, 34.648249634000024],
    [-84.10363417399998, 34.727843653000036],
    [-84.09317417099999, 34.72778565300007],
    [-84.09319717499994, 34.801417669000045],
    [-84.12399118399998, 34.79604666600005],
    [-84.14006718899998, 34.80689166800005],
    [-84.14751119399995, 34.844649676000074],
    [-84.14247119299995, 34.85405067800008],
    [-84.11374418499997, 34.87037568200003],
    [-84.10742618399996, 34.88690268600004],
    [-84.13354919399995, 34.91647969100006],
    [-84.14912719799997, 34.919612691000054],
    [-84.17168820599994, 34.93791069400004],
    [-84.17883820899993, 34.95209269700007],
    [-84.12945899099998, 34.987510522000036],
    [-84.12955919699994, 34.98751070600008],
    [-84.32187325299998, 34.98841469900003],
    [-84.39393927299994, 34.98807469600007],
    [-84.39490727399999, 34.98803669600005],
    [-84.50905630699998, 34.98803969100004],
    [-84.50989030699998, 34.98801669100004]]]},
 'properties': {'OBJECTID': 1,
  'FIPS': 13111,
  'REGION': 'South',
  'DIVISION': 'South Atlantic',
  'STATE': 'Georgia',
  'COUNTY': 'Fannin County',
  'FIDs': 58,
  'Shape_Length': 1.7676421068026271,
  'Shape_Area': 0.09990750532844837}}
In [33]:
fig = px.choropleth_mapbox(mf, geojson=counties, locations='FIDs', color='cancer',
                           color_continuous_scale="Viridis", 
                           range_color=(0, 120),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 45.0902, "lon": -75.7129},
                           opacity=0.4,
                           labels={'cancer':'LBC mortality rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Exploratory Data Analysis

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task (Wikipedia, 2020).

Create a Panda data-frame from GeoPandas data-frame for EDA

In [34]:
# convert to panad data frame
df=pd.DataFrame(county_gpd_data)
df.head()
Out[34]:
FIDs_x FIPS REGION DIVISION STATE COUNTY code geometry FIDs_y X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908... 58 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874... 427 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404... 407 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529... 593 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604... 236 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293

Descriptive Statistics

In [35]:
mf=df[['cancer', 'poverty','smoking', 'PM25', 'NO2', 'SO2']]
In [36]:
mf.describe()
Out[36]:
cancer poverty smoking PM25 NO2 SO2
count 666.000000 666.000000 666.000000 666.000000 666.000000 666.000000
mean 70.374582 15.000332 26.229151 11.582681 2.262804 0.076624
std 12.026311 5.610755 3.303682 1.365394 1.625918 0.075046
min 39.651539 3.207143 13.657143 7.391429 0.612143 0.001287
25% 62.717918 11.194643 24.773214 10.675179 1.299411 0.029596
50% 69.952792 14.296429 26.671429 11.852857 1.826357 0.052136
75% 77.430017 18.651786 28.326786 12.596964 2.732679 0.097188
max 120.869885 34.842857 35.035714 14.329286 15.050000 0.445191
In [37]:
ax = sns.boxplot(x="code", y="cancer", data=df)

Add jitter over boxplot | seaborn

In [38]:
# Usual boxplot
ax = sns.boxplot(x='REGION', y='cancer', data=df)
 
# Add jitter with the swarmplot function.
ax = sns.swarmplot(x='REGION', y='cancer', data=df, color="grey")

Interactive Box-plot with Plotly

Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on "tidy" data. In a box plot created by px.box, the distribution of the column given as y argument is represented.

In [39]:
fig = px.box(df, x="code", y="cancer")
fig.show()

Correlation Analysis

In [40]:
mf.corr()
Out[40]:
cancer poverty smoking PM25 NO2 SO2
cancer 1.000000 0.479222 0.666088 0.075867 -0.354428 -0.013786
poverty 0.479222 1.000000 0.563021 0.045417 -0.400164 -0.201537
smoking 0.666088 0.563021 1.000000 -0.109916 -0.399236 0.059811
PM25 0.075867 0.045417 -0.109916 1.000000 0.213000 0.334208
NO2 -0.354428 -0.400164 -0.399236 0.213000 1.000000 0.354689
SO2 -0.013786 -0.201537 0.059811 0.334208 0.354689 1.000000

Correlation matrix heatmap

In [41]:
corrmat = mf.corr() 
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap ="YlGnBu", linewidths = 0.1) 
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x232a4b68188>

Grid Correlation Matrix

In [42]:
cg = sns.clustermap(corrmat, cmap ="YlGnBu", linewidths = 0.1); 
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation = 0) 
  
cg 
Out[42]:
<seaborn.matrix.ClusterGrid at 0x232a4b42148>

Correlation Matrix Plot With Pandas

In [43]:
pd.plotting.scatter_matrix(df[['cancer','poverty','smoking','PM25']], alpha=0.2, figsize=(10, 10))
plt.show()

Correlation Matrix Plot With Seaborn

Seaborn allows to make a correlogram or correlation matrix really easily. Correlogram are awesome for exploratory analysis: it allows to quickly observe the relationship between every variable of your matrix. It is easy to do it with seaborn: just call the pairplot function

In [44]:
sns.pairplot(df[['cancer','poverty','smoking','PM25']], diag_kind="kde")
Out[44]:
<seaborn.axisgrid.PairGrid at 0x232a6234208>

Sactter Plot for Visulizatioin the Univariate Relationship

In [45]:
df.plot(x='smoking', y='cancer', style ='o')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x232a69781c8>

Interactive plot with Plotly

In [46]:
px.scatter_3d(df, x='smoking', y='poverty', z='cancer',
                    color='cancer',
                    size='cancer', size_max=18,
                    opacity=0.7)

Mariginal plot with Seaborn

A marginal plot allows to study the relationship between 2 numeric variables. The central chart display their correlation. It is usually a scatterplot, a hexbin plot, a 2D histogram or a 2D density plot. The marginal charts, usually at the top and at the right, show the distribution of the 2 variables using histogram or density plot.

In [47]:
sns.jointplot(x=df["smoking"], y=df["cancer"], kind='scatter')
sns.jointplot(x=df["smoking"], y=df["cancer"], kind='hex')
sns.jointplot(x=df["smoking"], y=df["cancer"], kind='kde')
Out[47]:
<seaborn.axisgrid.JointGrid at 0x232a4fa3108>

Regression Analysis

Regression analysis with "scikit-learn"

First we will use "scikit-learn", a machine learning library for the Python programming language.It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy (Wikipedia).

Simple Linear Regression

Define X (predictors) and y (response) variables

In [48]:
X = df['smoking'].values.reshape(-1,1)
y = df['cancer'].values.reshape(-1,1)

Split data train and test data sets (80% training and 20% test)

In [49]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [50]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(532, 1)
(134, 1)
(532, 1)
(134, 1)
In [51]:
ols = LinearRegression()  
ols=ols.fit(X_train, y_train) #training the algorithm

Intercept:

In [52]:
print(ols.intercept_)
[6.58617474]

Slope

In [53]:
print(ols.coef_)
[[2.43535183]]

Residuals

In [54]:
y_train_pred=ols.predict(X_train)
train_pred= pd.DataFrame({'Actual': y_train.flatten(), 'Predicted': y_train_pred.flatten()})
train_pred['residuals']=train_pred['Actual'] - train_pred['Predicted']
train_pred
Out[54]:
Actual Predicted residuals
0 64.430637 59.729031 4.701607
1 72.694270 68.983368 3.710903
2 70.500054 71.436115 -0.936061
3 70.807277 70.496765 0.310512
4 42.016316 60.720567 -18.704251
... ... ... ...
527 60.011697 75.106538 -15.094841
528 60.404273 61.051079 -0.646806
529 70.430824 75.524027 -5.093203
530 80.127531 69.435647 10.691883
531 54.504111 60.320473 -5.816363

532 rows × 3 columns

Predicted values at test counties

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the percentage score.

In [55]:
y_test_pred = ols.predict(X_test)

Now compare the actual output values for X_test with the predicted values, execute the following script:

In [56]:
test_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_test_pred.flatten()})
test_pred.head()
Out[56]:
Observed Predicted
0 48.491048 46.839061
1 61.438730 63.799547
2 75.027515 68.774623
3 73.848832 75.471841
4 64.101773 72.445046

1:1 Plot

In [57]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=test_pred )
ax.set_ylim(30,120)
ax.set_xlim(30,120)
ax.plot([30, 120], [30, 120], 'k--', lw=2)
plt.title('OLS Predicted LBC Mortality Rate',fontsize=18)
Out[57]:
Text(0.5, 1.0, 'OLS Predicted LBC Mortality Rate')

Now we evaluate the performance of the algorithm. or regression algorithms, three evaluation metrics are commonly used:

error.png

In [58]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_test_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_test_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
print('R2:', metrics.r2_score(y_test, y_test_pred))
Mean Absolute Error: 6.684817351481164
Mean Squared Error: 75.41759216055492
Root Mean Squared Error: 8.684330265515868
R2: 0.4738025790063758

Multiple Linear Regression (MLR)

Create a data-frame

In [59]:
Xm=df[['poverty','smoking','PM25','NO2','SO2']].values
y = df[['cancer']].values

Split-data

In [60]:
Xm_train, Xm_test, y_train, y_test = train_test_split(Xm, y, test_size=0.2, random_state=0)

Fit MLR model

In [61]:
mlr = LinearRegression()  
mlr.fit(Xm_train, y_train)
Out[61]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Co-efficients MLR model

In [62]:
pred_mlr = pd.DataFrame(['poverty','smoking','PM25','NO2','SO2'])
coeff = pd.DataFrame(mlr.coef_, index=['Co-efficient']).transpose() 
In [63]:
pd.concat([pred_mlr,coeff], axis=1, join='inner')
Out[63]:
0 Co-efficient
0 poverty 0.102037
1 smoking 2.267721
2 PM25 1.912677
3 NO2 -0.886578
4 SO2 -14.391893

Now let's do prediction on test data.

In [64]:
y_mlr_pred = mlr.predict(Xm_test)
In [65]:
mlr_test_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_mlr_pred.flatten()})
mlr_test_pred.head()
Out[65]:
Observed Predicted
0 48.491048 47.903524
1 61.438730 64.971542
2 75.027515 72.385058
3 73.848832 75.984092
4 64.101773 76.804447

1:1 Plot

In [66]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=mlr_test_pred)
ax.set_ylim(30,120)
ax.set_xlim(30,120)
ax.plot([30, 120], [30, 120], 'k--', lw=2)
plt.title('MLR Predicted LBC Mortality Rate',fontsize=18)
Out[66]:
Text(0.5, 1.0, 'MLR Predicted LBC Mortality Rate')
In [67]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_mlr_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_mlr_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_mlr_pred)))
Mean Absolute Error: 6.65003501388036
Mean Squared Error: 77.941807072644
Root Mean Squared Error: 8.828465725857694

K-Fold Cross-Validation¶

In K-Folds Cross Validation data is splitted into k different subsets (or folds). We use k-1 subsets to train our data and leave the last subset (or the last fold) as test data. We then average the model against each of the folds and then finalize our model. After that we test it against the test set. We will use entire data set (n=666) in this tutorial.

In [68]:
cv_Kfold = KFold(n_splits=10, random_state=1, shuffle=True)

for i in range(1,11):
    mlr = LinearRegression()  
    mlr.fit(Xm, y)
    cv_scores_Kfold = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_Kfold, n_jobs=1)
In [69]:
all_scores = pd.DataFrame(cross_validate(mlr, Xm, y, scoring=['r2', 'neg_mean_absolute_error'], n_jobs=-1, verbose=0))
print(all_scores)
   fit_time  score_time   test_r2  test_neg_mean_absolute_error
0  0.005030    0.000503  0.284481                     -7.366213
1  0.005029    0.001006  0.467482                     -6.447810
2  0.005532    0.001006  0.054937                     -6.453194
3  0.005030    0.000503  0.288957                     -7.148549
4  0.005028    0.001007  0.261092                     -9.242557
In [70]:
# mean and STD
print("Folds: " + str(len(cv_scores_Kfold)) + ", MSE: " + str(np.mean(np.abs(cv_scores_Kfold))) + ", STD: " + str(np.std(cv_scores_Kfold)))
Folds: 10, MSE: 75.34118551678769, STD: 10.944009884410121
In [71]:
y_pred_KFold = cross_val_predict(mlr, Xm, y, cv=10)
In [72]:
Kfold_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_KFold.flatten()})
Kfold_pred.head()
Out[72]:
Observed Predicted
0 74.637249 71.776141
1 66.342705 65.403375
2 69.716354 73.201570
3 62.711264 62.072841
4 77.569145 76.212677
In [73]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=Kfold_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('K-Fold Cross-Validation',fontsize=18)
Out[73]:
Text(0.5, 1.0, 'K-Fold Cross-Validation')
In [74]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_KFold))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_KFold))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_KFold)))
print('R2:', metrics.r2_score(y, y_pred_KFold))
Mean Absolute Error: 7.150566331130477
Mean Squared Error: 84.11509081863407
Root Mean Squared Error: 9.171427959627337
R2: 0.41754595928015426

Leave-One-Out Cross-Validation

In Leave One Out Cross Validation, the number of folds (subsets) equals to the number of observations. w

In [75]:
cv_LOV = KFold(n_splits=666, random_state=None, shuffle=False)
cv_scores_LOV = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_LOV, n_jobs=1)
In [76]:
print("Folds: " + str(len(cv_scores_LOV)) + ", MSE: " + str(np.mean(np.abs(cv_scores_LOV))) + ", STD: " + str(np.std(cv_scores_LOV)))
Folds: 666, MSE: 74.91907483633577, STD: 126.95364260812033
In [77]:
y_pred_LOV = cross_val_predict(mlr, Xm, y, cv=10)
In [78]:
LOV_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_LOV.flatten()})
LOV_pred.head()
Out[78]:
Observed Predicted
0 74.637249 71.776141
1 66.342705 65.403375
2 69.716354 73.201570
3 62.711264 62.072841
4 77.569145 76.212677
In [79]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=LOV_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('LOV Cross-Validation',fontsize=18)
Out[79]:
Text(0.5, 1.0, 'LOV Cross-Validation')
In [80]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_LOV))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_LOV))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_LOV)))
print('R2:', metrics.r2_score(y, y_pred_LOV))
Mean Absolute Error: 7.150566331130477
Mean Squared Error: 84.11509081863407
Root Mean Squared Error: 9.171427959627337
R2: 0.41754595928015426

Ordinary Least Squares (OLS)

You can use R-style formulas together with pandas data frames to fit your models. Here is a simple example using ordinary least squares:

In [81]:
ols_results = smf.ols('cancer ~ smoking', data=df).fit()
In [82]:
print(ols_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 cancer   R-squared:                       0.444
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                     529.5
Date:                Tue, 28 Jan 2020   Prob (F-statistic):           1.31e-86
Time:                        18:20:17   Log-Likelihood:                -2405.6
No. Observations:                 666   AIC:                             4815.
Df Residuals:                     664   BIC:                             4824.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      6.7757      2.786      2.432      0.015       1.306      12.245
smoking        2.4247      0.105     23.012      0.000       2.218       2.632
==============================================================================
Omnibus:                       24.125   Durbin-Watson:                   1.511
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.180
Skew:                           0.372   Prob(JB):                     2.80e-07
Kurtosis:                       3.731   Cond. No.                         212.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Linear Regression (MLR)

In [83]:
 mlr_results = smf.ols('cancer ~ smoking + poverty + PM25 + NO2 + SO2', data=df).fit()
In [84]:
print(mlr_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 cancer   R-squared:                       0.491
Model:                            OLS   Adj. R-squared:                  0.487
Method:                 Least Squares   F-statistic:                     127.1
Date:                Tue, 28 Jan 2020   Prob (F-statistic):           3.49e-94
Time:                        18:20:17   Log-Likelihood:                -2376.3
No. Observations:                 666   AIC:                             4765.
Df Residuals:                     660   BIC:                             4792.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.9577      4.724     -1.261      0.208     -15.235       3.319
smoking        2.2055      0.136     16.270      0.000       1.939       2.472
poverty        0.1686      0.079      2.145      0.032       0.014       0.323
PM25           1.5824      0.272      5.816      0.000       1.048       2.117
NO2           -0.7270      0.249     -2.917      0.004      -1.216      -0.238
SO2           -9.5115      5.355     -1.776      0.076     -20.027       1.004
==============================================================================
Omnibus:                       33.347   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.673
Skew:                           0.455   Prob(JB):                     3.28e-10
Kurtosis:                       3.864   Cond. No.                         587.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Generalized Linear Models (GLM)

In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (Wikipedia, 2019).

The distribution families currently implemented in statsmodels are:

glm_family.png

In this excercise I will show you how to fit GML models with Gaussian and Poisson famlies.

Gaussian Family

If outcome is continuous and unbounded, then the most "default" choice is the Gaussian distribution (a.k.a. normal distribution), i.e. the standard linear regression (unless you use other link function then the default identity link)(Source).

In [85]:
glm_gaussian = sm.formula.glm('cancer ~ smoking + poverty + PM25 + NO2 + SO2',
                       family=sm.families.Gaussian(), data=df).fit()
In [86]:
print(glm_gaussian.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                 cancer   No. Observations:                  666
Model:                            GLM   Df Residuals:                      660
Model Family:                Gaussian   Df Model:                            5
Link Function:               identity   Scale:                          74.243
Method:                          IRLS   Log-Likelihood:                -2376.3
Date:                Tue, 28 Jan 2020   Deviance:                       49001.
Time:                        18:20:17   Pearson chi2:                 4.90e+04
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.9577      4.724     -1.261      0.207     -15.218       3.302
smoking        2.2055      0.136     16.270      0.000       1.940       2.471
poverty        0.1686      0.079      2.145      0.032       0.015       0.323
PM25           1.5824      0.272      5.816      0.000       1.049       2.116
NO2           -0.7270      0.249     -2.917      0.004      -1.216      -0.238
SO2           -9.5115      5.355     -1.776      0.076     -20.008       0.985
==============================================================================
In [87]:
print('Parameters: ', glm_gaussian.params)
print('T-values: ', glm_gaussian.tvalues)
Parameters:  Intercept   -5.957675
smoking      2.205495
poverty      0.168599
PM25         1.582433
NO2         -0.726989
SO2         -9.511550
dtype: float64
T-values:  Intercept    -1.261019
smoking      16.270107
poverty       2.145241
PM25          5.816038
NO2          -2.916634
SO2          -1.776089
dtype: float64

Plot Fitted vs Observed Values

In [88]:
y = df['cancer']
y_glm_gaussian =glm_gaussian.predict()
In [89]:
GLM_gaussian_pred = pd.DataFrame({'Observed': y, 'Predicted': y_glm_gaussian})
GLM_gaussian_pred.head()
Out[89]:
Observed Predicted
0 74.637249 72.133463
1 66.342705 65.680028
2 69.716354 73.340126
3 62.711264 62.237514
4 77.569145 76.829999
In [90]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=GLM_gaussian_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('GLM-Gaussian',fontsize=18)
Out[90]:
Text(0.5, 1.0, 'GLM-Gaussian')

Plot yhat vs. Pearson residuals:

In [91]:
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(y_glm_gaussian, glm_gaussian.resid_pearson)
ax.hlines(0, 30, 100)
ax.set_xlim(30, 100)

ax.set_title('Residual Dependence Plot')
ax.set_ylabel('Pearson Residuals')
ax.set_xlabel('Fitted values')
Out[91]:
Text(0.5, 0, 'Fitted values')

Histogram of standardized deviance residuals:

In [92]:
from scipy import stats

fig, ax = plt.subplots(figsize=(5,5))

resid = glm_gaussian.resid_deviance.copy()  # residuals
resid_std = stats.zscore(resid)
ax.hist(resid_std, bins=25)
ax.set_title('Histogram of standardized deviance residuals');

QQ Plot of Deviance Residuals:

In [93]:
graphics.gofplots.qqplot(resid, line='r')
Out[93]:

Poisson Family

I outcome is discrete, or more precisely, you are dealing with counts (how many times something happen in given time interval), then the most common choice of the distribution to start with is Poisson distribution. The problem with Poisson distribution is that it is rather inflexible in the fact that it assumes that mean is equal to variance, if this assumption is not met, you may consider using quasi-Poisson family, or negative binomial distribution (see also Definition of dispersion parameter for quasipoisson family)(Source).

Before fiiting Poisson model we convert caner (float64) values to discrite.

In [94]:
df['cancer_int'] = df['cancer'].astype('int64')
df.head()
Out[94]:
FIDs_x FIPS REGION DIVISION STATE COUNTY code geometry FIDs_y X Y cancer poverty smoking PM25 NO2 SO2 cancer_int
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908... 58 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450 74
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874... 427 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547 66
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404... 407 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116 69
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529... 593 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739 62
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604... 236 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293 77
In [95]:
glm_poisson = sm.formula.glm('cancer_int ~ smoking + poverty + PM25 + NO2 + SO2',
                       family=sm.families.Poisson(), data=df).fit()
In [96]:
print(glm_poisson.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:             cancer_int   No. Observations:                  666
Model:                            GLM   Df Residuals:                      660
Model Family:                 Poisson   Df Model:                            5
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2354.9
Date:                Tue, 28 Jan 2020   Deviance:                       665.52
Time:                        18:20:19   Pearson chi2:                     670.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.1018      0.068     45.685      0.000       2.969       3.235
smoking        0.0335      0.002     17.117      0.000       0.030       0.037
poverty        0.0019      0.001      1.739      0.082      -0.000       0.004
PM25           0.0233      0.004      6.111      0.000       0.016       0.031
NO2           -0.0133      0.004     -3.506      0.000      -0.021      -0.006
SO2           -0.1336      0.075     -1.782      0.075      -0.281       0.013
==============================================================================
In [97]:
y = df['cancer']
y_glm_poisson =glm_gaussian.predict()
In [98]:
GLM_poisson_pred = pd.DataFrame({'Observed': y, 'Predicted': y_glm_poisson})
GLM_poisson_pred.head()
Out[98]:
Observed Predicted
0 74.637249 72.133463
1 66.342705 65.680028
2 69.716354 73.340126
3 62.711264 62.237514
4 77.569145 76.829999
In [99]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=GLM_gaussian_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('GLM-Gaussian',fontsize=18)
Out[99]:
Text(0.5, 1.0, 'GLM-Gaussian')

Neural Network Regression With Keras-Tensorflow

As egression is a type of supervised machine learning algorithm, it usually use to predict a continuous data and produce a model that represents the ‘best fit’ to some observed data, according to an evaluation criterion.

The Keras library is a high-level API that runs on top of TensorFlow for building deep learning models. Often, building a very complex deep learning network with Keras can be achieved with only a few lines of code.

In this exercise I will use Tensorflow 2.0 GUP to develop a simple regression model neural network architecture. For installtion of TensorFlow GUP in Python 3.7, please see Jeff Heaton very use full github site. I follow mostly his class website for regression analysis with Keras.

In [100]:
mf=pd.read_csv("Data_05/county_mean.csv")
df=mf[["cancer",'smoking','poverty','PM25','NO2','SO2']]
df.head()
Out[100]:
cancer smoking poverty PM25 NO2 SO2
0 74.637249 26.192857 11.828571 13.635714 3.508071 0.073450
1 66.342705 23.478571 9.250000 14.329286 5.074071 0.072547
2 69.716354 27.621429 11.335714 12.487857 3.730714 0.061116
3 62.711264 22.257143 17.728571 13.710000 5.594500 0.158739
4 77.569145 27.557143 19.278571 12.247143 0.731357 0.009293

# Standardize ranges

It is good practice to normalize or standardize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

In [101]:
df['poverty'] = zscore(df['poverty'])
df['smoking'] = zscore(df['smoking'])
df['PM25'] = zscore(df['PM25'])
df['NO2'] = zscore(df['NO2'])
df['SO2'] = zscore(df['SO2'])
C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [102]:
df.head()
Out[102]:
cancer smoking poverty PM25 NO2 SO2
0 74.637249 -0.010994 -0.565725 1.504750 0.766461 -0.042326
1 66.342705 -0.833206 -1.025647 2.013096 1.730333 -0.054364
2 69.716354 0.421749 -0.653633 0.663440 0.903498 -0.206805
3 62.711264 -1.203201 0.486617 1.559197 2.050656 1.095013
4 77.569145 0.402276 0.763080 0.487010 -0.942605 -0.897868

Separate the target value, or "y", from the features. This label is the value that you will train the model to predict.

In [103]:
x_columns=['smoking','poverty','PM25','NO2','SO2']
x = df[x_columns].values
y = df['cancer'].values

Split the data into train and test

In [104]:
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

Build the simple neural network regression model

from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Activation from tensorflow.keras.callbacks import EarlyStopping

In [105]:
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
                        patience=5, verbose=1, mode='auto', restore_best_weights=True)
In [106]:
model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=1000)
Train on 499 samples, validate on 167 samples
Epoch 1/1000
499/499 - 1s - loss: 5030.0191 - val_loss: 5223.3031
Epoch 2/1000
499/499 - 0s - loss: 5004.0484 - val_loss: 5195.7016
Epoch 3/1000
499/499 - 0s - loss: 4972.4720 - val_loss: 5158.9264
Epoch 4/1000
499/499 - 0s - loss: 4931.1928 - val_loss: 5109.5059
Epoch 5/1000
499/499 - 0s - loss: 4875.9907 - val_loss: 5041.7775
Epoch 6/1000
499/499 - 0s - loss: 4801.7776 - val_loss: 4953.4651
Epoch 7/1000
499/499 - 0s - loss: 4706.6579 - val_loss: 4843.6359
Epoch 8/1000
499/499 - 0s - loss: 4586.4081 - val_loss: 4706.9905
Epoch 9/1000
499/499 - 0s - loss: 4436.0353 - val_loss: 4537.2626
Epoch 10/1000
499/499 - 0s - loss: 4253.2882 - val_loss: 4337.0617
Epoch 11/1000
499/499 - 0s - loss: 4041.2454 - val_loss: 4107.1814
Epoch 12/1000
499/499 - 0s - loss: 3801.5837 - val_loss: 3846.1060
Epoch 13/1000
499/499 - 0s - loss: 3535.2790 - val_loss: 3559.1074
Epoch 14/1000
499/499 - 0s - loss: 3246.7781 - val_loss: 3250.9497
Epoch 15/1000
499/499 - 0s - loss: 2942.6934 - val_loss: 2923.4498
Epoch 16/1000
499/499 - 0s - loss: 2624.7026 - val_loss: 2591.2643
Epoch 17/1000
499/499 - 0s - loss: 2305.7347 - val_loss: 2261.9431
Epoch 18/1000
499/499 - 0s - loss: 1994.0463 - val_loss: 1939.3507
Epoch 19/1000
499/499 - 0s - loss: 1699.7378 - val_loss: 1634.6658
Epoch 20/1000
499/499 - 0s - loss: 1424.5252 - val_loss: 1362.1127
Epoch 21/1000
499/499 - 0s - loss: 1184.3892 - val_loss: 1120.8432
Epoch 22/1000
499/499 - 0s - loss: 973.8621 - val_loss: 922.2362
Epoch 23/1000
499/499 - 0s - loss: 805.6702 - val_loss: 757.1460
Epoch 24/1000
499/499 - 0s - loss: 668.8225 - val_loss: 632.9689
Epoch 25/1000
499/499 - 0s - loss: 567.2330 - val_loss: 539.4201
Epoch 26/1000
499/499 - 0s - loss: 492.5119 - val_loss: 471.6162
Epoch 27/1000
499/499 - 0s - loss: 437.5449 - val_loss: 423.5838
Epoch 28/1000
499/499 - 0s - loss: 400.3311 - val_loss: 388.6066
Epoch 29/1000
499/499 - 0s - loss: 372.8378 - val_loss: 362.3941
Epoch 30/1000
499/499 - 0s - loss: 350.6929 - val_loss: 341.5434
Epoch 31/1000
499/499 - 0s - loss: 332.7332 - val_loss: 325.1744
Epoch 32/1000
499/499 - 0s - loss: 318.0953 - val_loss: 309.5420
Epoch 33/1000
499/499 - 0s - loss: 303.7277 - val_loss: 296.8100
Epoch 34/1000
499/499 - 0s - loss: 291.7164 - val_loss: 284.4444
Epoch 35/1000
499/499 - 0s - loss: 281.1250 - val_loss: 274.4868
Epoch 36/1000
499/499 - 0s - loss: 270.4059 - val_loss: 264.5194
Epoch 37/1000
499/499 - 0s - loss: 261.0711 - val_loss: 255.2101
Epoch 38/1000
499/499 - 0s - loss: 252.3095 - val_loss: 247.2098
Epoch 39/1000
499/499 - 0s - loss: 244.1004 - val_loss: 239.1465
Epoch 40/1000
499/499 - 0s - loss: 236.4712 - val_loss: 231.9464
Epoch 41/1000
499/499 - 0s - loss: 229.4432 - val_loss: 225.4155
Epoch 42/1000
499/499 - 0s - loss: 222.5012 - val_loss: 219.0567
Epoch 43/1000
499/499 - 0s - loss: 216.2139 - val_loss: 213.1134
Epoch 44/1000
499/499 - 0s - loss: 210.0941 - val_loss: 208.1580
Epoch 45/1000
499/499 - 0s - loss: 204.4516 - val_loss: 203.2601
Epoch 46/1000
499/499 - 0s - loss: 199.4356 - val_loss: 198.8452
Epoch 47/1000
499/499 - 0s - loss: 194.6082 - val_loss: 194.3803
Epoch 48/1000
499/499 - 0s - loss: 190.1889 - val_loss: 190.4429
Epoch 49/1000
499/499 - 0s - loss: 185.3460 - val_loss: 186.7611
Epoch 50/1000
499/499 - 0s - loss: 181.5050 - val_loss: 182.9122
Epoch 51/1000
499/499 - 0s - loss: 177.3154 - val_loss: 179.5360
Epoch 52/1000
499/499 - 0s - loss: 173.6135 - val_loss: 176.1447
Epoch 53/1000
499/499 - 0s - loss: 170.0629 - val_loss: 173.2930
Epoch 54/1000
499/499 - 0s - loss: 166.5477 - val_loss: 170.4162
Epoch 55/1000
499/499 - 0s - loss: 163.1434 - val_loss: 167.3251
Epoch 56/1000
499/499 - 0s - loss: 160.0338 - val_loss: 164.4205
Epoch 57/1000
499/499 - 0s - loss: 156.7302 - val_loss: 161.9347
Epoch 58/1000
499/499 - 0s - loss: 153.7147 - val_loss: 159.4944
Epoch 59/1000
499/499 - 0s - loss: 151.0502 - val_loss: 157.1874
Epoch 60/1000
499/499 - 0s - loss: 147.8093 - val_loss: 154.4053
Epoch 61/1000
499/499 - 0s - loss: 145.1071 - val_loss: 151.9735
Epoch 62/1000
499/499 - 0s - loss: 142.6898 - val_loss: 149.8235
Epoch 63/1000
499/499 - 0s - loss: 139.9279 - val_loss: 147.7120
Epoch 64/1000
499/499 - 0s - loss: 137.3534 - val_loss: 145.5566
Epoch 65/1000
499/499 - 0s - loss: 134.9476 - val_loss: 143.3084
Epoch 66/1000
499/499 - 0s - loss: 132.5360 - val_loss: 141.8005
Epoch 67/1000
499/499 - 0s - loss: 130.0432 - val_loss: 139.9553
Epoch 68/1000
499/499 - 0s - loss: 127.6186 - val_loss: 137.8816
Epoch 69/1000
499/499 - 0s - loss: 125.3788 - val_loss: 135.9532
Epoch 70/1000
499/499 - 0s - loss: 123.2286 - val_loss: 134.1922
Epoch 71/1000
499/499 - 0s - loss: 121.1641 - val_loss: 132.5855
Epoch 72/1000
499/499 - 0s - loss: 119.1543 - val_loss: 130.9323
Epoch 73/1000
499/499 - 0s - loss: 116.9992 - val_loss: 129.0766
Epoch 74/1000
499/499 - 0s - loss: 115.1257 - val_loss: 127.3562
Epoch 75/1000
499/499 - 0s - loss: 113.3687 - val_loss: 125.9137
Epoch 76/1000
499/499 - 0s - loss: 111.7458 - val_loss: 124.4295
Epoch 77/1000
499/499 - 0s - loss: 109.8572 - val_loss: 123.0842
Epoch 78/1000
499/499 - 0s - loss: 108.1107 - val_loss: 121.2468
Epoch 79/1000
499/499 - 0s - loss: 106.3305 - val_loss: 119.8691
Epoch 80/1000
499/499 - 0s - loss: 104.6940 - val_loss: 118.7897
Epoch 81/1000
499/499 - 0s - loss: 103.0645 - val_loss: 117.9227
Epoch 82/1000
499/499 - 0s - loss: 101.8022 - val_loss: 116.3150
Epoch 83/1000
499/499 - 0s - loss: 99.9814 - val_loss: 115.3590
Epoch 84/1000
499/499 - 0s - loss: 98.5474 - val_loss: 113.9241
Epoch 85/1000
499/499 - 0s - loss: 97.0612 - val_loss: 112.9436
Epoch 86/1000
499/499 - 0s - loss: 95.7127 - val_loss: 111.7425
Epoch 87/1000
499/499 - 0s - loss: 94.4309 - val_loss: 110.5621
Epoch 88/1000
499/499 - 0s - loss: 93.0527 - val_loss: 109.6551
Epoch 89/1000
499/499 - 0s - loss: 91.7648 - val_loss: 108.8160
Epoch 90/1000
499/499 - 0s - loss: 90.3736 - val_loss: 107.6076
Epoch 91/1000
499/499 - 0s - loss: 89.1354 - val_loss: 106.8731
Epoch 92/1000
499/499 - 0s - loss: 88.0118 - val_loss: 106.1022
Epoch 93/1000
499/499 - 0s - loss: 86.8651 - val_loss: 105.1758
Epoch 94/1000
499/499 - 0s - loss: 85.8022 - val_loss: 104.2739
Epoch 95/1000
499/499 - 0s - loss: 84.8601 - val_loss: 103.3801
Epoch 96/1000
499/499 - 0s - loss: 83.8367 - val_loss: 102.5894
Epoch 97/1000
499/499 - 0s - loss: 83.0437 - val_loss: 102.2155
Epoch 98/1000
499/499 - 0s - loss: 82.0461 - val_loss: 101.2797
Epoch 99/1000
499/499 - 0s - loss: 81.3415 - val_loss: 100.7646
Epoch 100/1000
499/499 - 0s - loss: 80.4422 - val_loss: 99.6111
Epoch 101/1000
499/499 - 0s - loss: 79.4928 - val_loss: 98.9013
Epoch 102/1000
499/499 - 0s - loss: 78.7362 - val_loss: 98.4231
Epoch 103/1000
499/499 - 0s - loss: 78.1641 - val_loss: 97.5397
Epoch 104/1000
499/499 - 0s - loss: 77.3626 - val_loss: 97.1556
Epoch 105/1000
499/499 - 0s - loss: 76.6925 - val_loss: 96.6937
Epoch 106/1000
499/499 - 0s - loss: 75.9871 - val_loss: 95.7494
Epoch 107/1000
499/499 - 0s - loss: 75.3999 - val_loss: 95.4863
Epoch 108/1000
499/499 - 0s - loss: 74.8740 - val_loss: 94.5682
Epoch 109/1000
499/499 - 0s - loss: 74.2041 - val_loss: 94.3677
Epoch 110/1000
499/499 - 0s - loss: 73.6869 - val_loss: 93.9608
Epoch 111/1000
499/499 - 0s - loss: 73.0612 - val_loss: 93.5197
Epoch 112/1000
499/499 - 0s - loss: 72.6484 - val_loss: 93.3492
Epoch 113/1000
499/499 - 0s - loss: 72.0418 - val_loss: 92.7242
Epoch 114/1000
499/499 - 0s - loss: 71.4637 - val_loss: 92.3911
Epoch 115/1000
499/499 - 0s - loss: 70.9461 - val_loss: 91.6524
Epoch 116/1000
499/499 - 0s - loss: 70.6545 - val_loss: 91.2833
Epoch 117/1000
499/499 - 0s - loss: 70.1810 - val_loss: 91.2537
Epoch 118/1000
499/499 - 0s - loss: 69.8015 - val_loss: 90.5026
Epoch 119/1000
499/499 - 0s - loss: 69.3477 - val_loss: 90.5470
Epoch 120/1000
499/499 - 0s - loss: 68.9092 - val_loss: 90.0183
Epoch 121/1000
499/499 - 0s - loss: 68.4640 - val_loss: 89.6189
Epoch 122/1000
499/499 - 0s - loss: 68.4646 - val_loss: 89.0421
Epoch 123/1000
499/499 - 0s - loss: 67.7057 - val_loss: 89.3307
Epoch 124/1000
499/499 - 0s - loss: 67.2614 - val_loss: 88.6480
Epoch 125/1000
499/499 - 0s - loss: 66.9459 - val_loss: 88.1975
Epoch 126/1000
499/499 - 0s - loss: 66.5785 - val_loss: 88.3700
Epoch 127/1000
499/499 - 0s - loss: 66.2813 - val_loss: 87.8610
Epoch 128/1000
499/499 - 0s - loss: 65.9832 - val_loss: 87.7646
Epoch 129/1000
499/499 - 0s - loss: 65.7175 - val_loss: 87.4528
Epoch 130/1000
499/499 - 0s - loss: 65.4907 - val_loss: 87.1368
Epoch 131/1000
499/499 - 0s - loss: 65.1226 - val_loss: 86.9709
Epoch 132/1000
499/499 - 0s - loss: 64.8238 - val_loss: 86.6110
Epoch 133/1000
499/499 - 0s - loss: 64.5407 - val_loss: 86.4795
Epoch 134/1000
499/499 - 0s - loss: 64.2474 - val_loss: 86.1044
Epoch 135/1000
499/499 - 0s - loss: 64.0535 - val_loss: 85.5533
Epoch 136/1000
499/499 - 0s - loss: 63.7625 - val_loss: 85.7003
Epoch 137/1000
499/499 - 0s - loss: 63.5603 - val_loss: 85.7084
Epoch 138/1000
499/499 - 0s - loss: 63.4333 - val_loss: 85.6098
Epoch 139/1000
499/499 - 0s - loss: 63.1535 - val_loss: 85.1266
Epoch 140/1000
499/499 - 0s - loss: 62.9204 - val_loss: 84.8733
Epoch 141/1000
499/499 - 0s - loss: 62.6599 - val_loss: 84.7609
Epoch 142/1000
499/499 - 0s - loss: 62.5522 - val_loss: 84.7530
Epoch 143/1000
499/499 - 0s - loss: 62.3766 - val_loss: 84.2242
Epoch 144/1000
499/499 - 0s - loss: 62.1227 - val_loss: 84.4252
Epoch 145/1000
499/499 - 0s - loss: 61.9619 - val_loss: 84.2641
Epoch 146/1000
499/499 - 0s - loss: 61.6908 - val_loss: 83.8904
Epoch 147/1000
499/499 - 0s - loss: 61.4879 - val_loss: 84.1105
Epoch 148/1000
499/499 - 0s - loss: 61.4617 - val_loss: 83.4429
Epoch 149/1000
499/499 - 0s - loss: 61.1412 - val_loss: 83.9584
Epoch 150/1000
499/499 - 0s - loss: 61.0609 - val_loss: 83.4306
Epoch 151/1000
499/499 - 0s - loss: 60.9523 - val_loss: 83.5165
Epoch 152/1000
499/499 - 0s - loss: 60.7702 - val_loss: 83.6570
Epoch 153/1000
499/499 - 0s - loss: 60.5556 - val_loss: 83.0338
Epoch 154/1000
499/499 - 0s - loss: 60.3996 - val_loss: 82.9038
Epoch 155/1000
499/499 - 0s - loss: 60.3771 - val_loss: 82.7995
Epoch 156/1000
499/499 - 0s - loss: 60.2896 - val_loss: 83.0718
Epoch 157/1000
499/499 - 0s - loss: 60.0742 - val_loss: 82.8409
Epoch 158/1000
499/499 - 0s - loss: 59.9331 - val_loss: 82.6099
Epoch 159/1000
499/499 - 0s - loss: 59.8569 - val_loss: 82.6510
Epoch 160/1000
499/499 - 0s - loss: 59.6743 - val_loss: 82.3828
Epoch 161/1000
499/499 - 0s - loss: 59.8127 - val_loss: 82.5876
Epoch 162/1000
499/499 - 0s - loss: 59.4116 - val_loss: 82.0586
Epoch 163/1000
499/499 - 0s - loss: 59.4425 - val_loss: 81.5923
Epoch 164/1000
499/499 - 0s - loss: 59.2851 - val_loss: 82.1925
Epoch 165/1000
499/499 - 0s - loss: 59.2325 - val_loss: 82.2223
Epoch 166/1000
499/499 - 0s - loss: 58.9970 - val_loss: 82.1419
Epoch 167/1000
499/499 - 0s - loss: 59.1057 - val_loss: 81.4538
Epoch 168/1000
499/499 - 0s - loss: 58.9928 - val_loss: 82.2987
Epoch 169/1000
499/499 - 0s - loss: 58.8156 - val_loss: 81.5353
Epoch 170/1000
499/499 - 0s - loss: 58.6791 - val_loss: 81.4181
Epoch 171/1000
499/499 - 0s - loss: 58.4919 - val_loss: 81.6024
Epoch 172/1000
499/499 - 0s - loss: 58.5232 - val_loss: 81.3024
Epoch 173/1000
499/499 - 0s - loss: 58.3418 - val_loss: 81.2076
Epoch 174/1000
499/499 - 0s - loss: 58.2399 - val_loss: 81.5028
Epoch 175/1000
499/499 - 0s - loss: 58.3105 - val_loss: 81.1209
Epoch 176/1000
499/499 - 0s - loss: 58.0963 - val_loss: 81.3494
Epoch 177/1000
499/499 - 0s - loss: 57.9547 - val_loss: 81.2058
Epoch 178/1000
499/499 - 0s - loss: 58.4793 - val_loss: 80.3826
Epoch 179/1000
499/499 - 0s - loss: 58.0150 - val_loss: 81.0182
Epoch 180/1000
499/499 - 0s - loss: 57.9223 - val_loss: 81.0128
Epoch 181/1000
499/499 - 0s - loss: 57.6255 - val_loss: 80.9819
Epoch 182/1000
499/499 - 0s - loss: 57.5275 - val_loss: 80.6515
Epoch 183/1000
499/499 - 0s - loss: 57.5397 - val_loss: 80.3804
Epoch 184/1000
499/499 - 0s - loss: 57.4662 - val_loss: 80.5955
Epoch 185/1000
499/499 - 0s - loss: 57.4871 - val_loss: 79.9678
Epoch 186/1000
499/499 - 0s - loss: 57.3449 - val_loss: 80.7497
Epoch 187/1000
499/499 - 0s - loss: 57.2811 - val_loss: 80.4219
Epoch 188/1000
499/499 - 0s - loss: 57.1583 - val_loss: 80.2822
Epoch 189/1000
499/499 - 0s - loss: 57.0440 - val_loss: 79.7988
Epoch 190/1000
499/499 - 0s - loss: 56.9008 - val_loss: 79.8574
Epoch 191/1000
499/499 - 0s - loss: 56.8738 - val_loss: 79.7838
Epoch 192/1000
499/499 - 0s - loss: 56.9489 - val_loss: 80.2717
Epoch 193/1000
499/499 - 0s - loss: 56.7133 - val_loss: 79.6393
Epoch 194/1000
499/499 - 0s - loss: 56.8657 - val_loss: 79.4950
Epoch 195/1000
499/499 - 0s - loss: 56.6360 - val_loss: 80.0609
Epoch 196/1000
499/499 - 0s - loss: 56.7251 - val_loss: 79.8181
Epoch 197/1000
499/499 - 0s - loss: 56.5656 - val_loss: 79.2396
Epoch 198/1000
499/499 - 0s - loss: 56.5395 - val_loss: 79.8286
Epoch 199/1000
499/499 - 0s - loss: 56.6755 - val_loss: 79.0418
Epoch 200/1000
499/499 - 0s - loss: 56.4393 - val_loss: 79.9280
Epoch 201/1000
499/499 - 0s - loss: 56.2988 - val_loss: 78.9504
Epoch 202/1000
499/499 - 0s - loss: 56.0926 - val_loss: 79.3417
Epoch 203/1000
499/499 - 0s - loss: 56.1220 - val_loss: 79.2693
Epoch 204/1000
499/499 - 0s - loss: 56.1866 - val_loss: 79.5717
Epoch 205/1000
499/499 - 0s - loss: 56.0052 - val_loss: 78.9281
Epoch 206/1000
499/499 - 0s - loss: 56.1670 - val_loss: 78.6948
Epoch 207/1000
499/499 - 0s - loss: 55.7890 - val_loss: 79.6967
Epoch 208/1000
499/499 - 0s - loss: 55.7076 - val_loss: 79.1465
Epoch 209/1000
499/499 - 0s - loss: 55.6582 - val_loss: 79.0073
Epoch 210/1000
499/499 - 0s - loss: 55.5713 - val_loss: 78.9188
Epoch 211/1000
499/499 - 0s - loss: 55.6971 - val_loss: 78.5461
Epoch 212/1000
499/499 - 0s - loss: 55.5479 - val_loss: 78.8847
Epoch 213/1000
499/499 - 0s - loss: 55.3749 - val_loss: 78.5271
Epoch 214/1000
499/499 - 0s - loss: 55.3728 - val_loss: 78.7058
Epoch 215/1000
499/499 - 0s - loss: 55.3117 - val_loss: 78.4494
Epoch 216/1000
499/499 - 0s - loss: 55.2599 - val_loss: 78.6799
Epoch 217/1000
499/499 - 0s - loss: 55.1932 - val_loss: 78.5591
Epoch 218/1000
499/499 - 0s - loss: 55.0902 - val_loss: 78.7721
Epoch 219/1000
499/499 - 0s - loss: 55.0537 - val_loss: 78.6258
Epoch 220/1000
Restoring model weights from the end of the best epoch.
499/499 - 0s - loss: 55.0798 - val_loss: 78.6914
Epoch 00220: early stopping
Out[106]:
<tensorflow.python.keras.callbacks.History at 0x232a55df488>

Prediction at Test data

In [107]:
DNN_pred = model.predict(x_test)
In [108]:
DNN_Pred_df = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': DNN_pred.flatten()})
DNN_Pred_df.head()
Out[108]:
Observed Predicted
0 117.223061 91.188492
1 79.362922 79.236511
2 120.869885 96.198563
3 58.128106 58.300262
4 77.942287 74.023018
In [109]:
plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=DNN_Pred_df)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('Neural Network Regression',fontsize=18)
Out[109]:
Text(0.5, 1.0, 'Neural Network Regression')

Mean Square Error

In [110]:
# Measure MSE error.  
score_mse = metrics.mean_squared_error(DNN_pred,y_test)
print("Final score (MSE): {}".format(score_mse))
Final score (MSE): 78.44942031770589

Root Mean Square Error

In [111]:
import numpy as np

# Measure RMSE error.  RMSE is common for regression.
score_rmse = np.sqrt(metrics.mean_squared_error(DNN_pred,y_test))
print("Final score (RMSE): {}".format(score_rmse))
Final score (RMSE): 8.857167736794075

Lift Chart

(source)

To generate a lift chart, perform the following activities:

  • Sort the data by expected output. Plot the blue line above.
  • For every point on the x-axis plot the predicted value for that same data point. This is the green line above.
  • The x-axis is just 0 to 100% of the dataset. The expected always starts low and ends high.
  • The y-axis is ranged according to the values predicted.

Reading a lift chart:

  • The expected and predict lines should be close. Notice where one is above the ot other.
  • The below chart is the most accurate on lower age.
In [112]:
# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
In [113]:
# Plot the chart
chart_regression(DNN_pred.flatten(),y_test)