This lesson we will cover following:

Descriptive Statistics

Correlation Analysis

Regression Analysis

The will use following data set fo US Atlantic regions (expcept Florida) which could be avilable fro download from here.

Spatial Data Processing¶

Data¶

Age Adjusted Lung & Bronchus Cancer Mortality Rates by County¶

Age Adjusted Lung & Bronchus Cancer Mortality Rates from 1980 t0 2014 by county was obtained from Institute for Health Metrics and Evaluation (IHME). The rate was estimated using spatially explicit Bayesian mixed effects regression models for both males and females (Mokdad et al., 2016). The model incorporated 7 covariates, such as proportion of the adult population that graduated high school, the proportion of the population that is Hispanic, the proportion of the population that is black, the proportion of the population that is a race other than black or white, The proportion of a county that is contained within a state or federal Native American reservation, the median household income, and the population density. To reduce the noise due to low mortality rate in a small population size, we smoothed data with the Empirical Bayesian smoother, a tool for dealing with rate instability associated with small sample sizes. It takes the population in the region as an indicator of confidence in the data, with higher population providing a higher confidence in the value at that location. We used SpaceStat (Evaluation version) Software for smoothing data.

Age-standardized Cigarette Smoking (%) Prevalence by County¶

Age-standardized cigarette smoking (%) prevalence by County estimated from the Behavioral Risk Factor Surveillance System (BRFSS) survey data (Dwyer-Lindgren et al., 2014). The prevalence (%) were estimated by logistic hierarchical mixed effects regression models, stratified by sex. All estimates were age-standardized following the age structure of the 2000 census. The uncertainty of the prevalence estimates was assessed using simulation methods.

Age-standardized Poverty Rate by County¶

County-level poverty data (% population below poverty level) are from US Census Small Area Income and Poverty Estimates (SAIPE) Program. Income and poverty rate were estimated by combining survey data with population estimates and administrative records.

Particulate Matter (PM2.5)¶

Annual mean PM2.5 from 1998 to 2012 were modeled from aerosol optical depth from multiple satellite (Multiangle Imaging Spectra Radiometer, MODIS Dark Target, MODIS and SeaWiFS Deep Blue, MODIS MAIAC) products and validated by ground-based PM2.5 monitoring sites (Aaron van Donkelaar et al., 2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Nitrogen dioxide (NO2)¶

Annual mean ambient NO2 Concentrations estimated from three satellite instruments, such as Global Ozone Monitoring Experiment (GOME), Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY) and GOME-2 satellite in combination with chemical transport model (Gedders et al., (2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Sulfer dioxide (SO2)¶

Annual mean ambient SO2 Concentrations estimated were obtained from time series Multi-source SO2 emission retrievals and consistency of satellite and surface measurements of (Fioletov et al,(2017). All raster grid of PM2.5 were re-sampled at 2.5 km x 2.5 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Load Libraries¶

import os
import sys

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import
from mpl_toolkits.axes_grid1 import make_axes_locatable
import mapclassify

import plotly.graph_objects as go
import plotly.express as px

import seaborn as sns

from glob import glob
from functools import reduce
from osgeo import ogr, gdal
from osgeo import gdalconst
import geopandas as gpd
import pycrs
import fiona
from fiona.crs import from_epsg
from shapely.geometry import box
from shapely.geometry import Point
import shapely.geometry as geom

import json

from scipy.stats import sem
from scipy.stats import zscore
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.api import abline_plot
from statsmodels import graphics

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

%matplotlib inline

# What version of Python do you have?
print(f"Python {sys.version}")
print(f"geopandas {gpd.__version__}")
print(f"osgeo {gdal.__version__}")

import tensorflow.keras

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")

print()
print(f"Python {sys.version}")
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")

Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
geopandas 0.6.2
osgeo 3.0.2
Tensor Flow Version: 2.0.0
Keras Version: 2.2.4-tf

Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
GPU is available

Setting consistent plotting style throughout notebook¶

sns.set_style("white")
sns.set(font_scale=1.5)

Set working directory¶

path= "E:/GitHub/geospatial-python-github.io/Lesson_05_working_with_polygon_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())

Current Working Directory  E:\GitHub\geospatial-python-github.io\Lesson_05_working_with_polygon_data

Import all CSV¶

cancer=pd.read_csv("Data_05/Lung_cancer_1998_2012.csv")
smoking=pd.read_csv("Data_05/SMOKING_1998_2012.csv")
poverty=pd.read_csv("Data_05/POVERTY_1998_2012.csv")
PM25=pd.read_csv("Data_05/PM25_1998_2012.csv")
NO2=pd.read_csv("Data_05/NO2_1998_2012.csv")
SO2=pd.read_csv("Data_05/SO2_1998_2012.csv")

Load County shape file and Extract Centriod

county_gpd=gpd.read_file('Data_05/county_map.shp')
county_gpd.plot()
county_gpd.head()

Now, we will get centriod of each county

Creat Centriods of all counties¶

# copy poly to new GeoDataFrame
cen_points = county_gpd.copy()
# change the geometry
cen_points.geometry = cen_points['geometry'].centroid
# same crs
cen_points.crs =county_gpd.crs
cen_points.head()

fig, ay = plt.subplots()
county_gpd.plot(ax=ay, color="white", edgecolor="gray")
cen_points.plot(ax=ay, color='red', markersize=1)
ay.set_title('County Centriods')

Text(0.5, 1, 'County Centriods')

# Get X (m) and Y (m) Values from geometry column
cen_points['X'] = cen_points.geometry.x
cen_points['Y'] = cen_points.geometry.y
cen_points.head()

### Convert geopandas data frame to pandas data frame without geometry column
columns =['STATE','COUNTY','REGION','DIVISION','geometry' ]
cen_df = pd.DataFrame(cen_points.drop(columns=columns))
cen_df.head()

Calculate 15 years Mean¶

We will create a data frame of 15 years (1998-2012) mean all variables, and will use this data in do some statistical analysis.

# copy cen_df as "avg" pd-dataframe
avg_df = cen_df.copy()

avg_df["cancer"]= cancer.iloc[:,2:16].mean(axis=1)
avg_df["poverty"]= poverty.iloc[:,2:16].mean(axis=1)
avg_df["smoking"]= smoking.iloc[:,2:16].mean(axis=1)
avg_df["PM25"]= PM25.iloc[:,2:16].mean(axis=1)
avg_df["NO2"]= NO2.iloc[:,2:16].mean(axis=1)
avg_df["SO2"]= SO2.iloc[:,2:16].mean(axis=1)

avg_df.head()

# Write as CSV file
avg_df.to_csv('Data_05/county_mean.csv',index=False)

Join the average data with county shape file¶

# Drop code coloun for pd-data frame
avg_mf=avg_df.drop("code", axis=1)
county_gpd_data = county_gpd.merge(avg_mf, on='FIPS')
county_gpd_data.head()

Write as a ESRI shape file¶

county_gpd_data.to_file("Data_05/county_mean_data.shp")

Visualization of Spatial Data¶

Map County Mean Data¶

# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot Cancer 
county_gpd_data.plot(column='cancer', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5
                  )
bx1.set_title("LBC Mortality Rate");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot Poverty
county_gpd_data.plot(column='poverty',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx2.set_title("Povert (%)");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot Smoking
county_gpd_data.plot(column='smoking', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5
                 );
bx3.set_title("Smoking (%)");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)

<matplotlib.colorbar.Colorbar at 0x232a180e148>

# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot PM25 
county_gpd_data.plot(column='PM25', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5,
                  )
bx1.set_title("PM25");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot NO2
county_gpd_data.plot(column='NO2',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5,
                   );
bx2.set_title("NO2");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot SO2
county_gpd_data.plot(column='SO2', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5,
                 );
bx3.set_title("SO2");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)

<matplotlib.colorbar.Colorbar at 0x232a1c8a0c8>

Save as PNG files¶

Once you start using them, for() loops are fairly simple. The syntax of our for() loop will say this: for each year in the list_of_years list, run the following code. And when all years in our list have gone through the code, stop the loop.

county_gpd=gpd.read_file('Data_05/county_map.shp')

cancer_df=pd.read_csv("Data_05/Lung_cancer_1998_2012.csv")

cancer_df.describe()

cancer_gpd = county_gpd.merge(cancer_df, on='FIPS')

# save all the maps in the charts folder
output_path = 'Data_05/Maps'
# counter for the for loop
i = 0

# list of years (which are the column names at the moment)
list_of_years = ['1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', 
                 '2006', '2007', '2008','2010','2011', '2012']

# set the min and max range for the choropleth map
vmin, vmax = 30, 130

# start the for loop to create one map per year
for year in list_of_years:
    
    # create map, UDPATE: added plt.Normalize to keep the legend range the same for all maps
    fig = cancer_gpd.plot(column=year, 
                       cmap='Blues', 
                       figsize=(6,5), 
                       linewidth=0.5,
                       edgecolor='0.8', 
                       vmin=vmin, 
                       vmax=vmax,
    legend=True, norm=plt.Normalize(vmin=vmin, vmax=vmax))
    
    # remove axis of chart
    fig.axis('off')
    
    # add a title
    fig.set_title('LBC Mortality Rate', \
              fontdict={'fontsize': '20',
                         'fontweight' : '3'})
    
    # this will save the figure as a high-res png in the output path. you can also save as svg if you prefer.
    filepath = os.path.join(output_path, year+'_LBC.png')
    chart = fig.get_figure()
    chart.savefig(filepath, dpi=300)
    
       # position the annotation to the bottom left
    fig.annotate(year,
            xy=(0.1, .225), xycoords='figure fraction',
            horizontalalignment='left', verticalalignment='top',
            fontsize=15)

Claculate State Mean and Standard Deviation¶

We use will county mean data (avg) to calculate state mean and standard deviation of LBC mortality rates and all other covariates.

# Mean
state_mean=avg_df.groupby(['code'])[['cancer','smoking','poverty','PM25']].mean()
state_mean.head()

# Standard deviation
state_std=avg_df.groupby(['code'])[['cancer','smoking','poverty','PM25']].std()
state_std.head()

Load state shape file and join the mean and standard deviation

state_gpd= gpd.read_file('Data_05/state_map.shp')
state_gpd_mean = state_gpd.merge(state_mean, on='code')
state_gpd_mean.head()

state_gpd=gpd.read_file('Data_05/state_map.shp')
state_gpd_std = state_gpd.merge(state_std, on='code')
state_gpd_std.head()

# Make subplots that are next to each other
fig, (bx1, bx2) = plt.subplots(nrows=1, ncols=2,  figsize=(12, 5))

# Plot mean data
state_gpd_mean.plot(column='cancer',
                  edgecolor="gray", 
                  ax=bx1,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx1.set_title("Mean LBC Rate");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)


# Plot Standard deviation
state_gpd_std.plot(column='cancer',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx2.set_title("STDEV LBC Rate");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

C:\anaconda3\envs\tensorflow\lib\site-packages\matplotlib\colors.py:527: RuntimeWarning:

invalid value encountered in less

<matplotlib.colorbar.Colorbar at 0x232a3375948>

Interactive Map with Plotly¶

Plotly's Python graphing library makes interactive, publication-quality graphs. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.

fig = go.Figure(data=go.Choropleth(
    locations=state_gpd_mean['code'], # Spatial coordinates
    z = state_gpd_mean['cancer'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Death (per 100K)",
))

fig.update_layout(
    title_text = 'Mean LBC Mortality Rate by Selected State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Making choropleth maps from your gespatial data requires two main types of input: GeoJSON-formatted geometry information where each feature has an id and a list of values indexed by feature id. The GeoJSON data is passed to the geojson attribute, and the data is passed into the z (color for px.choropleth_mapbox) attribute, in the same order as the IDs are passed into the location attribute.

mf = pd.read_csv("Data_05/county_mean.csv", dtype={"FIPS": str})
mf.head()

with open('Data_05/county_gcs.json') as read_file:
   counties = json.load(read_file)
counties["features"][0]

{'type': 'Feature',
 'id': 1,
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-84.50989030699998, 34.98801669100004],
    [-84.62148733899994, 34.988335687000074],
    [-84.62275633199994, 34.857137660000035],
    [-84.61864133099994, 34.855404659000044],
    [-84.42259727399994, 34.85746266800004],
    [-84.41742527199995, 34.84982266600008],
    [-84.35422225199994, 34.825025663000076],
    [-84.34366524899997, 34.82478766400004],
    [-84.32806924399995, 34.80640966100003],
    [-84.31465723999997, 34.80733466100003],
    [-84.25217121699995, 34.71816764500005],
    [-84.25229621399995, 34.664616634000026],
    [-84.24262821099995, 34.67091463500003],
    [-84.22647520599998, 34.653884633000075],
    [-84.19389119499994, 34.63884063100005],
    [-84.19675819499997, 34.61793062600003],
    [-84.18856119199995, 34.60269862300004],
    [-84.15803918499995, 34.648249634000024],
    [-84.10363417399998, 34.727843653000036],
    [-84.09317417099999, 34.72778565300007],
    [-84.09319717499994, 34.801417669000045],
    [-84.12399118399998, 34.79604666600005],
    [-84.14006718899998, 34.80689166800005],
    [-84.14751119399995, 34.844649676000074],
    [-84.14247119299995, 34.85405067800008],
    [-84.11374418499997, 34.87037568200003],
    [-84.10742618399996, 34.88690268600004],
    [-84.13354919399995, 34.91647969100006],
    [-84.14912719799997, 34.919612691000054],
    [-84.17168820599994, 34.93791069400004],
    [-84.17883820899993, 34.95209269700007],
    [-84.12945899099998, 34.987510522000036],
    [-84.12955919699994, 34.98751070600008],
    [-84.32187325299998, 34.98841469900003],
    [-84.39393927299994, 34.98807469600007],
    [-84.39490727399999, 34.98803669600005],
    [-84.50905630699998, 34.98803969100004],
    [-84.50989030699998, 34.98801669100004]]]},
 'properties': {'OBJECTID': 1,
  'FIPS': 13111,
  'REGION': 'South',
  'DIVISION': 'South Atlantic',
  'STATE': 'Georgia',
  'COUNTY': 'Fannin County',
  'FIDs': 58,
  'Shape_Length': 1.7676421068026271,
  'Shape_Area': 0.09990750532844837}}

fig = px.choropleth_mapbox(mf, geojson=counties, locations='FIDs', color='cancer',
                           color_continuous_scale="Viridis", 
                           range_color=(0, 120),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 45.0902, "lon": -75.7129},
                           opacity=0.4,
                           labels={'cancer':'LBC mortality rate'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Exploratory Data Analysis¶

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task (Wikipedia, 2020).

Create a Panda data-frame from GeoPandas data-frame for EDA¶

# convert to panad data frame
df=pd.DataFrame(county_gpd_data)
df.head()

Descriptive Statistics¶

mf=df[['cancer', 'poverty','smoking', 'PM25', 'NO2', 'SO2']]

mf.describe()

Box-plot¶

A box plot is a statistical representation of numerical data through their quartiles. The ends of the box represent the lower and upper quartiles, while the median (second quartile) is marked by a line inside the box. For other statistical representations of numerical data, see other statistical charts.

Seaborn Boxplot¶

ax = sns.boxplot(x="code", y="cancer", data=df)

Add jitter over boxplot | seaborn¶

# Usual boxplot
ax = sns.boxplot(x='REGION', y='cancer', data=df)
 
# Add jitter with the swarmplot function.
ax = sns.swarmplot(x='REGION', y='cancer', data=df, color="grey")

Interactive Box-plot with Plotly¶

Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on "tidy" data. In a box plot created by px.box, the distribution of the column given as y argument is represented.

fig = px.box(df, x="code", y="cancer")
fig.show()

Correlation Analysis¶

mf.corr()

Correlation matrix heatmap¶

corrmat = mf.corr() 
f, ax = plt.subplots(figsize =(9, 8)) 
sns.heatmap(corrmat, ax = ax, cmap ="YlGnBu", linewidths = 0.1)

<matplotlib.axes._subplots.AxesSubplot at 0x232a4b68188>

Grid Correlation Matrix¶

cg = sns.clustermap(corrmat, cmap ="YlGnBu", linewidths = 0.1); 
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation = 0) 
  
cg

<seaborn.matrix.ClusterGrid at 0x232a4b42148>

Correlation Matrix Plot With Pandas¶

pd.plotting.scatter_matrix(df[['cancer','poverty','smoking','PM25']], alpha=0.2, figsize=(10, 10))
plt.show()

Correlation Matrix Plot With Seaborn¶

Seaborn allows to make a correlogram or correlation matrix really easily. Correlogram are awesome for exploratory analysis: it allows to quickly observe the relationship between every variable of your matrix. It is easy to do it with seaborn: just call the pairplot function

sns.pairplot(df[['cancer','poverty','smoking','PM25']], diag_kind="kde")

<seaborn.axisgrid.PairGrid at 0x232a6234208>

Sactter Plot for Visulizatioin the Univariate Relationship¶

df.plot(x='smoking', y='cancer', style ='o')

<matplotlib.axes._subplots.AxesSubplot at 0x232a69781c8>

Interactive plot with Plotly¶

px.scatter_3d(df, x='smoking', y='poverty', z='cancer',
                    color='cancer',
                    size='cancer', size_max=18,
                    opacity=0.7)

Mariginal plot with Seaborn¶

A marginal plot allows to study the relationship between 2 numeric variables. The central chart display their correlation. It is usually a scatterplot, a hexbin plot, a 2D histogram or a 2D density plot. The marginal charts, usually at the top and at the right, show the distribution of the 2 variables using histogram or density plot.

sns.jointplot(x=df["smoking"], y=df["cancer"], kind='scatter')
sns.jointplot(x=df["smoking"], y=df["cancer"], kind='hex')
sns.jointplot(x=df["smoking"], y=df["cancer"], kind='kde')

<seaborn.axisgrid.JointGrid at 0x232a4fa3108>

Regression Analysis¶

Regression analysis with "scikit-learn"¶

First we will use "scikit-learn", a machine learning library for the Python programming language.It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy (Wikipedia).

Simple Linear Regression¶

Define X (predictors) and y (response) variables¶

X = df['smoking'].values.reshape(-1,1)
y = df['cancer'].values.reshape(-1,1)

Split data train and test data sets (80% training and 20% test)¶

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(532, 1)
(134, 1)
(532, 1)
(134, 1)

ols = LinearRegression()  
ols=ols.fit(X_train, y_train) #training the algorithm

Intercept:¶

print(ols.intercept_)

[6.58617474]

Slope¶

print(ols.coef_)

[[2.43535183]]

Residuals¶

y_train_pred=ols.predict(X_train)
train_pred= pd.DataFrame({'Actual': y_train.flatten(), 'Predicted': y_train_pred.flatten()})
train_pred['residuals']=train_pred['Actual'] - train_pred['Predicted']
train_pred

Predicted values at test counties¶

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the percentage score.

y_test_pred = ols.predict(X_test)

Now compare the actual output values for X_test with the predicted values, execute the following script:

test_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_test_pred.flatten()})
test_pred.head()

1:1 Plot¶

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=test_pred )
ax.set_ylim(30,120)
ax.set_xlim(30,120)
ax.plot([30, 120], [30, 120], 'k--', lw=2)
plt.title('OLS Predicted LBC Mortality Rate',fontsize=18)

Text(0.5, 1.0, 'OLS Predicted LBC Mortality Rate')

Now we evaluate the performance of the algorithm. or regression algorithms, three evaluation metrics are commonly used:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_test_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_test_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
print('R2:', metrics.r2_score(y_test, y_test_pred))

Mean Absolute Error: 6.684817351481164
Mean Squared Error: 75.41759216055492
Root Mean Squared Error: 8.684330265515868
R2: 0.4738025790063758

Multiple Linear Regression (MLR)¶

Create a data-frame¶

Xm=df[['poverty','smoking','PM25','NO2','SO2']].values
y = df[['cancer']].values

Split-data¶

Xm_train, Xm_test, y_train, y_test = train_test_split(Xm, y, test_size=0.2, random_state=0)

Fit MLR model¶

mlr = LinearRegression()  
mlr.fit(Xm_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Co-efficients MLR model¶

pred_mlr = pd.DataFrame(['poverty','smoking','PM25','NO2','SO2'])
coeff = pd.DataFrame(mlr.coef_, index=['Co-efficient']).transpose()

pd.concat([pred_mlr,coeff], axis=1, join='inner')

Now let's do prediction on test data.

y_mlr_pred = mlr.predict(Xm_test)

mlr_test_pred = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': y_mlr_pred.flatten()})
mlr_test_pred.head()

1:1 Plot¶

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=mlr_test_pred)
ax.set_ylim(30,120)
ax.set_xlim(30,120)
ax.plot([30, 120], [30, 120], 'k--', lw=2)
plt.title('MLR Predicted LBC Mortality Rate',fontsize=18)

Text(0.5, 1.0, 'MLR Predicted LBC Mortality Rate')

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_mlr_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_mlr_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_mlr_pred)))

Mean Absolute Error: 6.65003501388036
Mean Squared Error: 77.941807072644
Root Mean Squared Error: 8.828465725857694

Cross-validation: evaluating estimator performance ¶

K-Fold Cross-Validation¶¶

In K-Folds Cross Validation data is splitted into k different subsets (or folds). We use k-1 subsets to train our data and leave the last subset (or the last fold) as test data. We then average the model against each of the folds and then finalize our model. After that we test it against the test set. We will use entire data set (n=666) in this tutorial.

cv_Kfold = KFold(n_splits=10, random_state=1, shuffle=True)

for i in range(1,11):
    mlr = LinearRegression()  
    mlr.fit(Xm, y)
    cv_scores_Kfold = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_Kfold, n_jobs=1)

all_scores = pd.DataFrame(cross_validate(mlr, Xm, y, scoring=['r2', 'neg_mean_absolute_error'], n_jobs=-1, verbose=0))
print(all_scores)

   fit_time  score_time   test_r2  test_neg_mean_absolute_error
0  0.005030    0.000503  0.284481                     -7.366213
1  0.005029    0.001006  0.467482                     -6.447810
2  0.005532    0.001006  0.054937                     -6.453194
3  0.005030    0.000503  0.288957                     -7.148549
4  0.005028    0.001007  0.261092                     -9.242557

# mean and STD
print("Folds: " + str(len(cv_scores_Kfold)) + ", MSE: " + str(np.mean(np.abs(cv_scores_Kfold))) + ", STD: " + str(np.std(cv_scores_Kfold)))

Folds: 10, MSE: 75.34118551678769, STD: 10.944009884410121

y_pred_KFold = cross_val_predict(mlr, Xm, y, cv=10)

Kfold_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_KFold.flatten()})
Kfold_pred.head()

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=Kfold_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('K-Fold Cross-Validation',fontsize=18)

Text(0.5, 1.0, 'K-Fold Cross-Validation')

print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_KFold))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_KFold))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_KFold)))
print('R2:', metrics.r2_score(y, y_pred_KFold))

Mean Absolute Error: 7.150566331130477
Mean Squared Error: 84.11509081863407
Root Mean Squared Error: 9.171427959627337
R2: 0.41754595928015426

Leave-One-Out Cross-Validation¶

In Leave One Out Cross Validation, the number of folds (subsets) equals to the number of observations. w

cv_LOV = KFold(n_splits=666, random_state=None, shuffle=False)
cv_scores_LOV = cross_val_score(mlr, Xm, y, scoring="neg_mean_squared_error", cv=cv_LOV, n_jobs=1)

print("Folds: " + str(len(cv_scores_LOV)) + ", MSE: " + str(np.mean(np.abs(cv_scores_LOV))) + ", STD: " + str(np.std(cv_scores_LOV)))

Folds: 666, MSE: 74.91907483633577, STD: 126.95364260812033

y_pred_LOV = cross_val_predict(mlr, Xm, y, cv=10)

LOV_pred = pd.DataFrame({'Observed': y.flatten(), 'Predicted': y_pred_LOV.flatten()})
LOV_pred.head()

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=LOV_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('LOV Cross-Validation',fontsize=18)

Text(0.5, 1.0, 'LOV Cross-Validation')

print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred_LOV))  
print('Mean Squared Error:', metrics.mean_squared_error(y, y_pred_LOV))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y, y_pred_LOV)))
print('R2:', metrics.r2_score(y, y_pred_LOV))

Mean Absolute Error: 7.150566331130477
Mean Squared Error: 84.11509081863407
Root Mean Squared Error: 9.171427959627337
R2: 0.41754595928015426

Regression analysis with "statsmodels"¶

"statsmodels" is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics are available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Modified BSD (3-clause) license. The online documentation is hosted at statsmodels.org.

Ordinary Least Squares (OLS)¶

You can use R-style formulas together with pandas data frames to fit your models. Here is a simple example using ordinary least squares:

ols_results = smf.ols('cancer ~ smoking', data=df).fit()

print(ols_results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 cancer   R-squared:                       0.444
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                     529.5
Date:                Tue, 28 Jan 2020   Prob (F-statistic):           1.31e-86
Time:                        18:20:17   Log-Likelihood:                -2405.6
No. Observations:                 666   AIC:                             4815.
Df Residuals:                     664   BIC:                             4824.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      6.7757      2.786      2.432      0.015       1.306      12.245
smoking        2.4247      0.105     23.012      0.000       2.218       2.632
==============================================================================
Omnibus:                       24.125   Durbin-Watson:                   1.511
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.180
Skew:                           0.372   Prob(JB):                     2.80e-07
Kurtosis:                       3.731   Cond. No.                         212.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Multiple Linear Regression (MLR)¶

 mlr_results = smf.ols('cancer ~ smoking + poverty + PM25 + NO2 + SO2', data=df).fit()

print(mlr_results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 cancer   R-squared:                       0.491
Model:                            OLS   Adj. R-squared:                  0.487
Method:                 Least Squares   F-statistic:                     127.1
Date:                Tue, 28 Jan 2020   Prob (F-statistic):           3.49e-94
Time:                        18:20:17   Log-Likelihood:                -2376.3
No. Observations:                 666   AIC:                             4765.
Df Residuals:                     660   BIC:                             4792.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.9577      4.724     -1.261      0.208     -15.235       3.319
smoking        2.2055      0.136     16.270      0.000       1.939       2.472
poverty        0.1686      0.079      2.145      0.032       0.014       0.323
PM25           1.5824      0.272      5.816      0.000       1.048       2.117
NO2           -0.7270      0.249     -2.917      0.004      -1.216      -0.238
SO2           -9.5115      5.355     -1.776      0.076     -20.027       1.004
==============================================================================
Omnibus:                       33.347   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.673
Skew:                           0.455   Prob(JB):                     3.28e-10
Kurtosis:                       3.864   Cond. No.                         587.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Generalized Linear Models (GLM)¶

In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (Wikipedia, 2019).

The distribution families currently implemented in statsmodels are:

In this excercise I will show you how to fit GML models with Gaussian and Poisson famlies.

Gaussian Family¶

If outcome is continuous and unbounded, then the most "default" choice is the Gaussian distribution (a.k.a. normal distribution), i.e. the standard linear regression (unless you use other link function then the default identity link)(Source).

glm_gaussian = sm.formula.glm('cancer ~ smoking + poverty + PM25 + NO2 + SO2',
                       family=sm.families.Gaussian(), data=df).fit()

print(glm_gaussian.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                 cancer   No. Observations:                  666
Model:                            GLM   Df Residuals:                      660
Model Family:                Gaussian   Df Model:                            5
Link Function:               identity   Scale:                          74.243
Method:                          IRLS   Log-Likelihood:                -2376.3
Date:                Tue, 28 Jan 2020   Deviance:                       49001.
Time:                        18:20:17   Pearson chi2:                 4.90e+04
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -5.9577      4.724     -1.261      0.207     -15.218       3.302
smoking        2.2055      0.136     16.270      0.000       1.940       2.471
poverty        0.1686      0.079      2.145      0.032       0.015       0.323
PM25           1.5824      0.272      5.816      0.000       1.049       2.116
NO2           -0.7270      0.249     -2.917      0.004      -1.216      -0.238
SO2           -9.5115      5.355     -1.776      0.076     -20.008       0.985
==============================================================================

print('Parameters: ', glm_gaussian.params)
print('T-values: ', glm_gaussian.tvalues)

Parameters:  Intercept   -5.957675
smoking      2.205495
poverty      0.168599
PM25         1.582433
NO2         -0.726989
SO2         -9.511550
dtype: float64
T-values:  Intercept    -1.261019
smoking      16.270107
poverty       2.145241
PM25          5.816038
NO2          -2.916634
SO2          -1.776089
dtype: float64

Plot Fitted vs Observed Values¶

y = df['cancer']
y_glm_gaussian =glm_gaussian.predict()

GLM_gaussian_pred = pd.DataFrame({'Observed': y, 'Predicted': y_glm_gaussian})
GLM_gaussian_pred.head()

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=GLM_gaussian_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('GLM-Gaussian',fontsize=18)

Text(0.5, 1.0, 'GLM-Gaussian')

Plot yhat vs. Pearson residuals:¶

fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(y_glm_gaussian, glm_gaussian.resid_pearson)
ax.hlines(0, 30, 100)
ax.set_xlim(30, 100)

ax.set_title('Residual Dependence Plot')
ax.set_ylabel('Pearson Residuals')
ax.set_xlabel('Fitted values')

Text(0.5, 0, 'Fitted values')

Histogram of standardized deviance residuals:¶

from scipy import stats

fig, ax = plt.subplots(figsize=(5,5))

resid = glm_gaussian.resid_deviance.copy()  # residuals
resid_std = stats.zscore(resid)
ax.hist(resid_std, bins=25)
ax.set_title('Histogram of standardized deviance residuals');

QQ Plot of Deviance Residuals:¶

graphics.gofplots.qqplot(resid, line='r')

Poisson Family¶

I outcome is discrete, or more precisely, you are dealing with counts (how many times something happen in given time interval), then the most common choice of the distribution to start with is Poisson distribution. The problem with Poisson distribution is that it is rather inflexible in the fact that it assumes that mean is equal to variance, if this assumption is not met, you may consider using quasi-Poisson family, or negative binomial distribution (see also Definition of dispersion parameter for quasipoisson family)(Source).

Before fiiting Poisson model we convert caner (float64) values to discrite.

df['cancer_int'] = df['cancer'].astype('int64')
df.head()

glm_poisson = sm.formula.glm('cancer_int ~ smoking + poverty + PM25 + NO2 + SO2',
                       family=sm.families.Poisson(), data=df).fit()

print(glm_poisson.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:             cancer_int   No. Observations:                  666
Model:                            GLM   Df Residuals:                      660
Model Family:                 Poisson   Df Model:                            5
Link Function:                    log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2354.9
Date:                Tue, 28 Jan 2020   Deviance:                       665.52
Time:                        18:20:19   Pearson chi2:                     670.
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.1018      0.068     45.685      0.000       2.969       3.235
smoking        0.0335      0.002     17.117      0.000       0.030       0.037
poverty        0.0019      0.001      1.739      0.082      -0.000       0.004
PM25           0.0233      0.004      6.111      0.000       0.016       0.031
NO2           -0.0133      0.004     -3.506      0.000      -0.021      -0.006
SO2           -0.1336      0.075     -1.782      0.075      -0.281       0.013
==============================================================================

y = df['cancer']
y_glm_poisson =glm_gaussian.predict()

GLM_poisson_pred = pd.DataFrame({'Observed': y, 'Predicted': y_glm_poisson})
GLM_poisson_pred.head()

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=GLM_gaussian_pred)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('GLM-Gaussian',fontsize=18)

Text(0.5, 1.0, 'GLM-Gaussian')

Neural Network Regression With Keras-Tensorflow¶

As egression is a type of supervised machine learning algorithm, it usually use to predict a continuous data and produce a model that represents the ‘best fit’ to some observed data, according to an evaluation criterion.

The Keras library is a high-level API that runs on top of TensorFlow for building deep learning models. Often, building a very complex deep learning network with Keras can be achieved with only a few lines of code.

In this exercise I will use Tensorflow 2.0 GUP to develop a simple regression model neural network architecture. For installtion of TensorFlow GUP in Python 3.7, please see Jeff Heaton very use full github site. I follow mostly his class website for regression analysis with Keras.

mf=pd.read_csv("Data_05/county_mean.csv")
df=mf[["cancer",'smoking','poverty','PM25','NO2','SO2']]
df.head()

# Standardize ranges¶

It is good practice to normalize or standardize features that use different scales and ranges. Although the model might converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

df['poverty'] = zscore(df['poverty'])
df['smoking'] = zscore(df['smoking'])
df['PM25'] = zscore(df['PM25'])
df['NO2'] = zscore(df['NO2'])
df['SO2'] = zscore(df['SO2'])

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

df.head()

Separate the target value, or "y", from the features. This label is the value that you will train the model to predict.

x_columns=['smoking','poverty','PM25','NO2','SO2']
x = df[x_columns].values
y = df['cancer'].values

Split the data into train and test¶

x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

Build the simple neural network regression model¶

from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Activation from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
                        patience=5, verbose=1, mode='auto', restore_best_weights=True)

model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=1000)

Train on 499 samples, validate on 167 samples
Epoch 1/1000
499/499 - 1s - loss: 5030.0191 - val_loss: 5223.3031
Epoch 2/1000
499/499 - 0s - loss: 5004.0484 - val_loss: 5195.7016
Epoch 3/1000
499/499 - 0s - loss: 4972.4720 - val_loss: 5158.9264
Epoch 4/1000
499/499 - 0s - loss: 4931.1928 - val_loss: 5109.5059
Epoch 5/1000
499/499 - 0s - loss: 4875.9907 - val_loss: 5041.7775
Epoch 6/1000
499/499 - 0s - loss: 4801.7776 - val_loss: 4953.4651
Epoch 7/1000
499/499 - 0s - loss: 4706.6579 - val_loss: 4843.6359
Epoch 8/1000
499/499 - 0s - loss: 4586.4081 - val_loss: 4706.9905
Epoch 9/1000
499/499 - 0s - loss: 4436.0353 - val_loss: 4537.2626
Epoch 10/1000
499/499 - 0s - loss: 4253.2882 - val_loss: 4337.0617
Epoch 11/1000
499/499 - 0s - loss: 4041.2454 - val_loss: 4107.1814
Epoch 12/1000
499/499 - 0s - loss: 3801.5837 - val_loss: 3846.1060
Epoch 13/1000
499/499 - 0s - loss: 3535.2790 - val_loss: 3559.1074
Epoch 14/1000
499/499 - 0s - loss: 3246.7781 - val_loss: 3250.9497
Epoch 15/1000
499/499 - 0s - loss: 2942.6934 - val_loss: 2923.4498
Epoch 16/1000
499/499 - 0s - loss: 2624.7026 - val_loss: 2591.2643
Epoch 17/1000
499/499 - 0s - loss: 2305.7347 - val_loss: 2261.9431
Epoch 18/1000
499/499 - 0s - loss: 1994.0463 - val_loss: 1939.3507
Epoch 19/1000
499/499 - 0s - loss: 1699.7378 - val_loss: 1634.6658
Epoch 20/1000
499/499 - 0s - loss: 1424.5252 - val_loss: 1362.1127
Epoch 21/1000
499/499 - 0s - loss: 1184.3892 - val_loss: 1120.8432
Epoch 22/1000
499/499 - 0s - loss: 973.8621 - val_loss: 922.2362
Epoch 23/1000
499/499 - 0s - loss: 805.6702 - val_loss: 757.1460
Epoch 24/1000
499/499 - 0s - loss: 668.8225 - val_loss: 632.9689
Epoch 25/1000
499/499 - 0s - loss: 567.2330 - val_loss: 539.4201
Epoch 26/1000
499/499 - 0s - loss: 492.5119 - val_loss: 471.6162
Epoch 27/1000
499/499 - 0s - loss: 437.5449 - val_loss: 423.5838
Epoch 28/1000
499/499 - 0s - loss: 400.3311 - val_loss: 388.6066
Epoch 29/1000
499/499 - 0s - loss: 372.8378 - val_loss: 362.3941
Epoch 30/1000
499/499 - 0s - loss: 350.6929 - val_loss: 341.5434
Epoch 31/1000
499/499 - 0s - loss: 332.7332 - val_loss: 325.1744
Epoch 32/1000
499/499 - 0s - loss: 318.0953 - val_loss: 309.5420
Epoch 33/1000
499/499 - 0s - loss: 303.7277 - val_loss: 296.8100
Epoch 34/1000
499/499 - 0s - loss: 291.7164 - val_loss: 284.4444
Epoch 35/1000
499/499 - 0s - loss: 281.1250 - val_loss: 274.4868
Epoch 36/1000
499/499 - 0s - loss: 270.4059 - val_loss: 264.5194
Epoch 37/1000
499/499 - 0s - loss: 261.0711 - val_loss: 255.2101
Epoch 38/1000
499/499 - 0s - loss: 252.3095 - val_loss: 247.2098
Epoch 39/1000
499/499 - 0s - loss: 244.1004 - val_loss: 239.1465
Epoch 40/1000
499/499 - 0s - loss: 236.4712 - val_loss: 231.9464
Epoch 41/1000
499/499 - 0s - loss: 229.4432 - val_loss: 225.4155
Epoch 42/1000
499/499 - 0s - loss: 222.5012 - val_loss: 219.0567
Epoch 43/1000
499/499 - 0s - loss: 216.2139 - val_loss: 213.1134
Epoch 44/1000
499/499 - 0s - loss: 210.0941 - val_loss: 208.1580
Epoch 45/1000
499/499 - 0s - loss: 204.4516 - val_loss: 203.2601
Epoch 46/1000
499/499 - 0s - loss: 199.4356 - val_loss: 198.8452
Epoch 47/1000
499/499 - 0s - loss: 194.6082 - val_loss: 194.3803
Epoch 48/1000
499/499 - 0s - loss: 190.1889 - val_loss: 190.4429
Epoch 49/1000
499/499 - 0s - loss: 185.3460 - val_loss: 186.7611
Epoch 50/1000
499/499 - 0s - loss: 181.5050 - val_loss: 182.9122
Epoch 51/1000
499/499 - 0s - loss: 177.3154 - val_loss: 179.5360
Epoch 52/1000
499/499 - 0s - loss: 173.6135 - val_loss: 176.1447
Epoch 53/1000
499/499 - 0s - loss: 170.0629 - val_loss: 173.2930
Epoch 54/1000
499/499 - 0s - loss: 166.5477 - val_loss: 170.4162
Epoch 55/1000
499/499 - 0s - loss: 163.1434 - val_loss: 167.3251
Epoch 56/1000
499/499 - 0s - loss: 160.0338 - val_loss: 164.4205
Epoch 57/1000
499/499 - 0s - loss: 156.7302 - val_loss: 161.9347
Epoch 58/1000
499/499 - 0s - loss: 153.7147 - val_loss: 159.4944
Epoch 59/1000
499/499 - 0s - loss: 151.0502 - val_loss: 157.1874
Epoch 60/1000
499/499 - 0s - loss: 147.8093 - val_loss: 154.4053
Epoch 61/1000
499/499 - 0s - loss: 145.1071 - val_loss: 151.9735
Epoch 62/1000
499/499 - 0s - loss: 142.6898 - val_loss: 149.8235
Epoch 63/1000
499/499 - 0s - loss: 139.9279 - val_loss: 147.7120
Epoch 64/1000
499/499 - 0s - loss: 137.3534 - val_loss: 145.5566
Epoch 65/1000
499/499 - 0s - loss: 134.9476 - val_loss: 143.3084
Epoch 66/1000
499/499 - 0s - loss: 132.5360 - val_loss: 141.8005
Epoch 67/1000
499/499 - 0s - loss: 130.0432 - val_loss: 139.9553
Epoch 68/1000
499/499 - 0s - loss: 127.6186 - val_loss: 137.8816
Epoch 69/1000
499/499 - 0s - loss: 125.3788 - val_loss: 135.9532
Epoch 70/1000
499/499 - 0s - loss: 123.2286 - val_loss: 134.1922
Epoch 71/1000
499/499 - 0s - loss: 121.1641 - val_loss: 132.5855
Epoch 72/1000
499/499 - 0s - loss: 119.1543 - val_loss: 130.9323
Epoch 73/1000
499/499 - 0s - loss: 116.9992 - val_loss: 129.0766
Epoch 74/1000
499/499 - 0s - loss: 115.1257 - val_loss: 127.3562
Epoch 75/1000
499/499 - 0s - loss: 113.3687 - val_loss: 125.9137
Epoch 76/1000
499/499 - 0s - loss: 111.7458 - val_loss: 124.4295
Epoch 77/1000
499/499 - 0s - loss: 109.8572 - val_loss: 123.0842
Epoch 78/1000
499/499 - 0s - loss: 108.1107 - val_loss: 121.2468
Epoch 79/1000
499/499 - 0s - loss: 106.3305 - val_loss: 119.8691
Epoch 80/1000
499/499 - 0s - loss: 104.6940 - val_loss: 118.7897
Epoch 81/1000
499/499 - 0s - loss: 103.0645 - val_loss: 117.9227
Epoch 82/1000
499/499 - 0s - loss: 101.8022 - val_loss: 116.3150
Epoch 83/1000
499/499 - 0s - loss: 99.9814 - val_loss: 115.3590
Epoch 84/1000
499/499 - 0s - loss: 98.5474 - val_loss: 113.9241
Epoch 85/1000
499/499 - 0s - loss: 97.0612 - val_loss: 112.9436
Epoch 86/1000
499/499 - 0s - loss: 95.7127 - val_loss: 111.7425
Epoch 87/1000
499/499 - 0s - loss: 94.4309 - val_loss: 110.5621
Epoch 88/1000
499/499 - 0s - loss: 93.0527 - val_loss: 109.6551
Epoch 89/1000
499/499 - 0s - loss: 91.7648 - val_loss: 108.8160
Epoch 90/1000
499/499 - 0s - loss: 90.3736 - val_loss: 107.6076
Epoch 91/1000
499/499 - 0s - loss: 89.1354 - val_loss: 106.8731
Epoch 92/1000
499/499 - 0s - loss: 88.0118 - val_loss: 106.1022
Epoch 93/1000
499/499 - 0s - loss: 86.8651 - val_loss: 105.1758
Epoch 94/1000
499/499 - 0s - loss: 85.8022 - val_loss: 104.2739
Epoch 95/1000
499/499 - 0s - loss: 84.8601 - val_loss: 103.3801
Epoch 96/1000
499/499 - 0s - loss: 83.8367 - val_loss: 102.5894
Epoch 97/1000
499/499 - 0s - loss: 83.0437 - val_loss: 102.2155
Epoch 98/1000
499/499 - 0s - loss: 82.0461 - val_loss: 101.2797
Epoch 99/1000
499/499 - 0s - loss: 81.3415 - val_loss: 100.7646
Epoch 100/1000
499/499 - 0s - loss: 80.4422 - val_loss: 99.6111
Epoch 101/1000
499/499 - 0s - loss: 79.4928 - val_loss: 98.9013
Epoch 102/1000
499/499 - 0s - loss: 78.7362 - val_loss: 98.4231
Epoch 103/1000
499/499 - 0s - loss: 78.1641 - val_loss: 97.5397
Epoch 104/1000
499/499 - 0s - loss: 77.3626 - val_loss: 97.1556
Epoch 105/1000
499/499 - 0s - loss: 76.6925 - val_loss: 96.6937
Epoch 106/1000
499/499 - 0s - loss: 75.9871 - val_loss: 95.7494
Epoch 107/1000
499/499 - 0s - loss: 75.3999 - val_loss: 95.4863
Epoch 108/1000
499/499 - 0s - loss: 74.8740 - val_loss: 94.5682
Epoch 109/1000
499/499 - 0s - loss: 74.2041 - val_loss: 94.3677
Epoch 110/1000
499/499 - 0s - loss: 73.6869 - val_loss: 93.9608
Epoch 111/1000
499/499 - 0s - loss: 73.0612 - val_loss: 93.5197
Epoch 112/1000
499/499 - 0s - loss: 72.6484 - val_loss: 93.3492
Epoch 113/1000
499/499 - 0s - loss: 72.0418 - val_loss: 92.7242
Epoch 114/1000
499/499 - 0s - loss: 71.4637 - val_loss: 92.3911
Epoch 115/1000
499/499 - 0s - loss: 70.9461 - val_loss: 91.6524
Epoch 116/1000
499/499 - 0s - loss: 70.6545 - val_loss: 91.2833
Epoch 117/1000
499/499 - 0s - loss: 70.1810 - val_loss: 91.2537
Epoch 118/1000
499/499 - 0s - loss: 69.8015 - val_loss: 90.5026
Epoch 119/1000
499/499 - 0s - loss: 69.3477 - val_loss: 90.5470
Epoch 120/1000
499/499 - 0s - loss: 68.9092 - val_loss: 90.0183
Epoch 121/1000
499/499 - 0s - loss: 68.4640 - val_loss: 89.6189
Epoch 122/1000
499/499 - 0s - loss: 68.4646 - val_loss: 89.0421
Epoch 123/1000
499/499 - 0s - loss: 67.7057 - val_loss: 89.3307
Epoch 124/1000
499/499 - 0s - loss: 67.2614 - val_loss: 88.6480
Epoch 125/1000
499/499 - 0s - loss: 66.9459 - val_loss: 88.1975
Epoch 126/1000
499/499 - 0s - loss: 66.5785 - val_loss: 88.3700
Epoch 127/1000
499/499 - 0s - loss: 66.2813 - val_loss: 87.8610
Epoch 128/1000
499/499 - 0s - loss: 65.9832 - val_loss: 87.7646
Epoch 129/1000
499/499 - 0s - loss: 65.7175 - val_loss: 87.4528
Epoch 130/1000
499/499 - 0s - loss: 65.4907 - val_loss: 87.1368
Epoch 131/1000
499/499 - 0s - loss: 65.1226 - val_loss: 86.9709
Epoch 132/1000
499/499 - 0s - loss: 64.8238 - val_loss: 86.6110
Epoch 133/1000
499/499 - 0s - loss: 64.5407 - val_loss: 86.4795
Epoch 134/1000
499/499 - 0s - loss: 64.2474 - val_loss: 86.1044
Epoch 135/1000
499/499 - 0s - loss: 64.0535 - val_loss: 85.5533
Epoch 136/1000
499/499 - 0s - loss: 63.7625 - val_loss: 85.7003
Epoch 137/1000
499/499 - 0s - loss: 63.5603 - val_loss: 85.7084
Epoch 138/1000
499/499 - 0s - loss: 63.4333 - val_loss: 85.6098
Epoch 139/1000
499/499 - 0s - loss: 63.1535 - val_loss: 85.1266
Epoch 140/1000
499/499 - 0s - loss: 62.9204 - val_loss: 84.8733
Epoch 141/1000
499/499 - 0s - loss: 62.6599 - val_loss: 84.7609
Epoch 142/1000
499/499 - 0s - loss: 62.5522 - val_loss: 84.7530
Epoch 143/1000
499/499 - 0s - loss: 62.3766 - val_loss: 84.2242
Epoch 144/1000
499/499 - 0s - loss: 62.1227 - val_loss: 84.4252
Epoch 145/1000
499/499 - 0s - loss: 61.9619 - val_loss: 84.2641
Epoch 146/1000
499/499 - 0s - loss: 61.6908 - val_loss: 83.8904
Epoch 147/1000
499/499 - 0s - loss: 61.4879 - val_loss: 84.1105
Epoch 148/1000
499/499 - 0s - loss: 61.4617 - val_loss: 83.4429
Epoch 149/1000
499/499 - 0s - loss: 61.1412 - val_loss: 83.9584
Epoch 150/1000
499/499 - 0s - loss: 61.0609 - val_loss: 83.4306
Epoch 151/1000
499/499 - 0s - loss: 60.9523 - val_loss: 83.5165
Epoch 152/1000
499/499 - 0s - loss: 60.7702 - val_loss: 83.6570
Epoch 153/1000
499/499 - 0s - loss: 60.5556 - val_loss: 83.0338
Epoch 154/1000
499/499 - 0s - loss: 60.3996 - val_loss: 82.9038
Epoch 155/1000
499/499 - 0s - loss: 60.3771 - val_loss: 82.7995
Epoch 156/1000
499/499 - 0s - loss: 60.2896 - val_loss: 83.0718
Epoch 157/1000
499/499 - 0s - loss: 60.0742 - val_loss: 82.8409
Epoch 158/1000
499/499 - 0s - loss: 59.9331 - val_loss: 82.6099
Epoch 159/1000
499/499 - 0s - loss: 59.8569 - val_loss: 82.6510
Epoch 160/1000
499/499 - 0s - loss: 59.6743 - val_loss: 82.3828
Epoch 161/1000
499/499 - 0s - loss: 59.8127 - val_loss: 82.5876
Epoch 162/1000
499/499 - 0s - loss: 59.4116 - val_loss: 82.0586
Epoch 163/1000
499/499 - 0s - loss: 59.4425 - val_loss: 81.5923
Epoch 164/1000
499/499 - 0s - loss: 59.2851 - val_loss: 82.1925
Epoch 165/1000
499/499 - 0s - loss: 59.2325 - val_loss: 82.2223
Epoch 166/1000
499/499 - 0s - loss: 58.9970 - val_loss: 82.1419
Epoch 167/1000
499/499 - 0s - loss: 59.1057 - val_loss: 81.4538
Epoch 168/1000
499/499 - 0s - loss: 58.9928 - val_loss: 82.2987
Epoch 169/1000
499/499 - 0s - loss: 58.8156 - val_loss: 81.5353
Epoch 170/1000
499/499 - 0s - loss: 58.6791 - val_loss: 81.4181
Epoch 171/1000
499/499 - 0s - loss: 58.4919 - val_loss: 81.6024
Epoch 172/1000
499/499 - 0s - loss: 58.5232 - val_loss: 81.3024
Epoch 173/1000
499/499 - 0s - loss: 58.3418 - val_loss: 81.2076
Epoch 174/1000
499/499 - 0s - loss: 58.2399 - val_loss: 81.5028
Epoch 175/1000
499/499 - 0s - loss: 58.3105 - val_loss: 81.1209
Epoch 176/1000
499/499 - 0s - loss: 58.0963 - val_loss: 81.3494
Epoch 177/1000
499/499 - 0s - loss: 57.9547 - val_loss: 81.2058
Epoch 178/1000
499/499 - 0s - loss: 58.4793 - val_loss: 80.3826
Epoch 179/1000
499/499 - 0s - loss: 58.0150 - val_loss: 81.0182
Epoch 180/1000
499/499 - 0s - loss: 57.9223 - val_loss: 81.0128
Epoch 181/1000
499/499 - 0s - loss: 57.6255 - val_loss: 80.9819
Epoch 182/1000
499/499 - 0s - loss: 57.5275 - val_loss: 80.6515
Epoch 183/1000
499/499 - 0s - loss: 57.5397 - val_loss: 80.3804
Epoch 184/1000
499/499 - 0s - loss: 57.4662 - val_loss: 80.5955
Epoch 185/1000
499/499 - 0s - loss: 57.4871 - val_loss: 79.9678
Epoch 186/1000
499/499 - 0s - loss: 57.3449 - val_loss: 80.7497
Epoch 187/1000
499/499 - 0s - loss: 57.2811 - val_loss: 80.4219
Epoch 188/1000
499/499 - 0s - loss: 57.1583 - val_loss: 80.2822
Epoch 189/1000
499/499 - 0s - loss: 57.0440 - val_loss: 79.7988
Epoch 190/1000
499/499 - 0s - loss: 56.9008 - val_loss: 79.8574
Epoch 191/1000
499/499 - 0s - loss: 56.8738 - val_loss: 79.7838
Epoch 192/1000
499/499 - 0s - loss: 56.9489 - val_loss: 80.2717
Epoch 193/1000
499/499 - 0s - loss: 56.7133 - val_loss: 79.6393
Epoch 194/1000
499/499 - 0s - loss: 56.8657 - val_loss: 79.4950
Epoch 195/1000
499/499 - 0s - loss: 56.6360 - val_loss: 80.0609
Epoch 196/1000
499/499 - 0s - loss: 56.7251 - val_loss: 79.8181
Epoch 197/1000
499/499 - 0s - loss: 56.5656 - val_loss: 79.2396
Epoch 198/1000
499/499 - 0s - loss: 56.5395 - val_loss: 79.8286
Epoch 199/1000
499/499 - 0s - loss: 56.6755 - val_loss: 79.0418
Epoch 200/1000
499/499 - 0s - loss: 56.4393 - val_loss: 79.9280
Epoch 201/1000
499/499 - 0s - loss: 56.2988 - val_loss: 78.9504
Epoch 202/1000
499/499 - 0s - loss: 56.0926 - val_loss: 79.3417
Epoch 203/1000
499/499 - 0s - loss: 56.1220 - val_loss: 79.2693
Epoch 204/1000
499/499 - 0s - loss: 56.1866 - val_loss: 79.5717
Epoch 205/1000
499/499 - 0s - loss: 56.0052 - val_loss: 78.9281
Epoch 206/1000
499/499 - 0s - loss: 56.1670 - val_loss: 78.6948
Epoch 207/1000
499/499 - 0s - loss: 55.7890 - val_loss: 79.6967
Epoch 208/1000
499/499 - 0s - loss: 55.7076 - val_loss: 79.1465
Epoch 209/1000
499/499 - 0s - loss: 55.6582 - val_loss: 79.0073
Epoch 210/1000
499/499 - 0s - loss: 55.5713 - val_loss: 78.9188
Epoch 211/1000
499/499 - 0s - loss: 55.6971 - val_loss: 78.5461
Epoch 212/1000
499/499 - 0s - loss: 55.5479 - val_loss: 78.8847
Epoch 213/1000
499/499 - 0s - loss: 55.3749 - val_loss: 78.5271
Epoch 214/1000
499/499 - 0s - loss: 55.3728 - val_loss: 78.7058
Epoch 215/1000
499/499 - 0s - loss: 55.3117 - val_loss: 78.4494
Epoch 216/1000
499/499 - 0s - loss: 55.2599 - val_loss: 78.6799
Epoch 217/1000
499/499 - 0s - loss: 55.1932 - val_loss: 78.5591
Epoch 218/1000
499/499 - 0s - loss: 55.0902 - val_loss: 78.7721
Epoch 219/1000
499/499 - 0s - loss: 55.0537 - val_loss: 78.6258
Epoch 220/1000
Restoring model weights from the end of the best epoch.
499/499 - 0s - loss: 55.0798 - val_loss: 78.6914
Epoch 00220: early stopping

<tensorflow.python.keras.callbacks.History at 0x232a55df488>

Prediction at Test data¶

DNN_pred = model.predict(x_test)

DNN_Pred_df = pd.DataFrame({'Observed': y_test.flatten(), 'Predicted': DNN_pred.flatten()})
DNN_Pred_df.head()

plt.figure(figsize=(6, 6))
ax = sns.regplot(x="Observed", y="Predicted", data=DNN_Pred_df)
ax.set_ylim(30,130)
ax.set_xlim(30,130)
ax.plot([30, 130], [30, 130], 'k--', lw=2)
plt.title('Neural Network Regression',fontsize=18)

Text(0.5, 1.0, 'Neural Network Regression')

Mean Square Error¶

# Measure MSE error.  
score_mse = metrics.mean_squared_error(DNN_pred,y_test)
print("Final score (MSE): {}".format(score_mse))

Final score (MSE): 78.44942031770589

Root Mean Square Error¶

import numpy as np

# Measure RMSE error.  RMSE is common for regression.
score_rmse = np.sqrt(metrics.mean_squared_error(DNN_pred,y_test))
print("Final score (RMSE): {}".format(score_rmse))

Final score (RMSE): 8.857167736794075

Lift Chart¶

(source)

To generate a lift chart, perform the following activities:

Sort the data by expected output. Plot the blue line above.
For every point on the x-axis plot the predicted value for that same data point. This is the green line above.
The x-axis is just 0 to 100% of the dataset. The expected always starts low and ends high.
The y-axis is ranged according to the values predicted.

Reading a lift chart:

The expected and predict lines should be close. Notice where one is above the ot other.
The below chart is the most accurate on lower age.

# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Plot the chart
chart_regression(DNN_pred.flatten(),y_test)

	FIDs	FIPS	REGION	DIVISION	STATE	COUNTY	code	geometry
0	58	13111.0	South	South Atlantic	Georgia	Fannin County	GA	POLYGON ((1037682.266 1388266.997, 1037756.908...
1	427	42115.0	Northeast	Middle Atlantic	Pennsylvania	Susquehanna County	PA	POLYGON ((1622673.906 2281043.968, 1623277.874...
2	407	42075.0	Northeast	Middle Atlantic	Pennsylvania	Lebanon County	PA	POLYGON ((1623002.490 2115965.841, 1632186.404...
3	593	51683.0	South	South Atlantic	Virginia	Manassas city	VA	POLYGON ((1582645.685 1903767.263, 1584122.529...
4	236	36057.0	Northeast	Middle Atlantic	New York	Montgomery County	NY	POLYGON ((1725612.650 2416731.662, 1729853.604...

	FIDs	FIPS	REGION	DIVISION	STATE	COUNTY	code	geometry
0	58	13111.0	South	South Atlantic	Georgia	Fannin County	GA	POINT (1056523.936 1376613.190)
1	427	42115.0	Northeast	Middle Atlantic	Pennsylvania	Susquehanna County	PA	POINT (1653442.064 2267301.476)
2	407	42075.0	Northeast	Middle Atlantic	Pennsylvania	Lebanon County	PA	POINT (1633708.275 2096683.059)
3	593	51683.0	South	South Atlantic	Virginia	Manassas city	VA	POINT (1584048.852 1901443.401)
4	236	36057.0	Northeast	Middle Atlantic	New York	Montgomery County	NY	POINT (1735811.102 2409536.493)

	FIDs	FIPS	REGION	DIVISION	STATE	COUNTY	code	geometry	X	Y
0	58	13111.0	South	South Atlantic	Georgia	Fannin County	GA	POINT (1056523.936 1376613.190)	1.056524e+06	1.376613e+06
1	427	42115.0	Northeast	Middle Atlantic	Pennsylvania	Susquehanna County	PA	POINT (1653442.064 2267301.476)	1.653442e+06	2.267301e+06
2	407	42075.0	Northeast	Middle Atlantic	Pennsylvania	Lebanon County	PA	POINT (1633708.275 2096683.059)	1.633708e+06	2.096683e+06
3	593	51683.0	South	South Atlantic	Virginia	Manassas city	VA	POINT (1584048.852 1901443.401)	1.584049e+06	1.901443e+06
4	236	36057.0	Northeast	Middle Atlantic	New York	Montgomery County	NY	POINT (1735811.102 2409536.493)	1.735811e+06	2.409536e+06

	FIDs	FIPS	code	X	Y
0	58	13111.0	GA	1.056524e+06	1.376613e+06
1	427	42115.0	PA	1.653442e+06	2.267301e+06
2	407	42075.0	PA	1.633708e+06	2.096683e+06
3	593	51683.0	VA	1.584049e+06	1.901443e+06
4	236	36057.0	NY	1.735811e+06	2.409536e+06

	FIDs	FIPS	code	X	Y	cancer	poverty	smoking	PM25	NO2	SO2
0	58	13111.0	GA	1.056524e+06	1.376613e+06	74.637249	11.828571	26.192857	13.635714	3.508071	0.073450
1	427	42115.0	PA	1.653442e+06	2.267301e+06	66.342705	9.250000	23.478571	14.329286	5.074071	0.072547
2	407	42075.0	PA	1.633708e+06	2.096683e+06	69.716354	11.335714	27.621429	12.487857	3.730714	0.061116
3	593	51683.0	VA	1.584049e+06	1.901443e+06	62.711264	17.728571	22.257143	13.710000	5.594500	0.158739
4	236	36057.0	NY	1.735811e+06	2.409536e+06	77.569145	19.278571	27.557143	12.247143	0.731357	0.009293

	FIDs	FIPS	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
count	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000
mean	332.500000	35727.693694	73.857273	74.097648	74.101438	73.939181	73.529323	72.603300	70.713445	70.485016	69.530583	68.137611	67.341979	66.486614	65.285216	65.135519	64.668501
std	192.401923	14669.711655	11.419725	11.573146	11.754784	11.922416	12.070827	12.112833	12.039270	12.223738	12.322275	12.305433	12.360195	12.547609	12.526113	12.801249	13.006163
min	0.000000	10001.000000	45.750330	45.171786	44.385688	43.523156	41.827329	40.854616	40.274638	38.811441	38.484895	37.206145	35.430022	35.070739	34.025360	32.956657	32.300568
25%	166.250000	24009.500000	66.585316	66.668914	66.580404	66.251332	65.815276	64.757661	62.755216	62.725070	61.713092	60.189047	59.463569	58.726896	57.608005	57.072282	56.591102
50%	332.500000	37126.000000	73.422482	73.593816	73.683809	73.514081	73.121945	72.123658	70.146444	69.996469	69.015290	67.717531	66.922792	66.007622	64.885836	64.735454	64.086434
75%	498.750000	51032.500000	80.168129	80.658646	80.742217	80.819539	80.453967	79.659168	77.966000	77.622660	76.661568	75.409592	74.807386	73.938938	72.472790	72.528113	72.158332
max	665.000000	54109.000000	125.371782	126.672925	124.912023	125.890516	125.275492	123.720335	120.625564	122.255614	120.007166	117.200991	116.837902	116.410885	113.512619	115.013336	115.277577

	cancer	smoking	poverty	PM25
code
DC	70.430824	28.307143	14.985714	13.100000
DE	65.833982	24.073810	12.930952	10.702381
GA	70.467286	26.336074	15.127089	11.670368
MD	69.042681	25.288988	13.492560	10.906875
NC	70.425814	26.308929	15.002214	11.563850

	cancer	smoking	poverty	PM25
code
DC	NaN	NaN	NaN	NaN
DE	5.155995	2.291626	8.771336	0.663413
GA	11.391757	3.016408	5.487087	1.249014
MD	13.861489	4.107085	4.554613	1.747535
NC	10.780429	2.866285	4.571188	1.485551

	cancer	poverty	smoking	PM25	NO2	SO2
cancer	1.000000	0.479222	0.666088	0.075867	-0.354428	-0.013786
poverty	0.479222	1.000000	0.563021	0.045417	-0.400164	-0.201537
smoking	0.666088	0.563021	1.000000	-0.109916	-0.399236	0.059811
PM25	0.075867	0.045417	-0.109916	1.000000	0.213000	0.334208
NO2	-0.354428	-0.400164	-0.399236	0.213000	1.000000	0.354689
SO2	-0.013786	-0.201537	0.059811	0.334208	0.354689	1.000000

	Actual	Predicted	residuals
0	64.430637	59.729031	4.701607
1	72.694270	68.983368	3.710903
2	70.500054	71.436115	-0.936061
3	70.807277	70.496765	0.310512
4	42.016316	60.720567	-18.704251
...	...	...	...
527	60.011697	75.106538	-15.094841
528	60.404273	61.051079	-0.646806
529	70.430824	75.524027	-5.093203
530	80.127531	69.435647	10.691883
531	54.504111	60.320473	-5.816363

	Observed	Predicted
0	48.491048	46.839061
1	61.438730	63.799547
2	75.027515	68.774623
3	73.848832	75.471841
4	64.101773	72.445046

	0	Co-efficient
0	poverty	0.102037
1	smoking	2.267721
2	PM25	1.912677
3	NO2	-0.886578
4	SO2	-14.391893

	Observed	Predicted
0	48.491048	47.903524
1	61.438730	64.971542
2	75.027515	72.385058
3	73.848832	75.984092
4	64.101773	76.804447

	Observed	Predicted
0	74.637249	71.776141
1	66.342705	65.403375
2	69.716354	73.201570
3	62.711264	62.072841
4	77.569145	76.212677

	Observed	Predicted
0	74.637249	72.133463
1	66.342705	65.680028
2	69.716354	73.340126
3	62.711264	62.237514
4	77.569145	76.829999

	cancer	poverty	smoking	PM25	NO2	SO2
count	666.000000	666.000000	666.000000	666.000000	666.000000	666.000000
mean	70.374582	15.000332	26.229151	11.582681	2.262804	0.076624
std	12.026311	5.610755	3.303682	1.365394	1.625918	0.075046
min	39.651539	3.207143	13.657143	7.391429	0.612143	0.001287
25%	62.717918	11.194643	24.773214	10.675179	1.299411	0.029596
50%	69.952792	14.296429	26.671429	11.852857	1.826357	0.052136
75%	77.430017	18.651786	28.326786	12.596964	2.732679	0.097188
max	120.869885	34.842857	35.035714	14.329286	15.050000	0.445191

	cancer	smoking	poverty	PM25	NO2	SO2
0	74.637249	-0.010994	-0.565725	1.504750	0.766461	-0.042326
1	66.342705	-0.833206	-1.025647	2.013096	1.730333	-0.054364
2	69.716354	0.421749	-0.653633	0.663440	0.903498	-0.206805
3	62.711264	-1.203201	0.486617	1.559197	2.050656	1.095013
4	77.569145	0.402276	0.763080	0.487010	-0.942605	-0.897868

	Observed	Predicted
0	117.223061	91.188492
1	79.362922	79.236511
2	120.869885	96.198563
3	58.128106	58.300262
4	77.942287	74.023018

Spatial Data Processing¶

Data¶

Age Adjusted Lung & Bronchus Cancer Mortality Rates by County¶

Age-standardized Cigarette Smoking (%) Prevalence by County¶

Age-standardized Poverty Rate by County¶

Particulate Matter (PM2.5)¶

Nitrogen dioxide (NO2)¶

Sulfer dioxide (SO2)¶

Load Libraries¶

Setting consistent plotting style throughout notebook¶

Set working directory¶

Import all CSV¶

Creat Centriods of all counties¶

Calculate 15 years Mean¶

Join the average data with county shape file¶

Write as a ESRI shape file¶

Visualization of Spatial Data¶

Map County Mean Data¶

Save as PNG files¶

Claculate State Mean and Standard Deviation¶

Interactive Map with Plotly¶

Exploratory Data Analysis¶

Create a Panda data-frame from GeoPandas data-frame for EDA¶

Descriptive Statistics¶

Box-plot¶

Seaborn Boxplot¶

Add jitter over boxplot | seaborn¶

Interactive Box-plot with Plotly¶

Correlation Analysis¶

Correlation matrix heatmap¶

Grid Correlation Matrix¶

Correlation Matrix Plot With Pandas¶

Correlation Matrix Plot With Seaborn¶

Sactter Plot for Visulizatioin the Univariate Relationship¶

Interactive plot with Plotly¶

Mariginal plot with Seaborn¶

Regression Analysis¶

Regression analysis with "scikit-learn"¶

Simple Linear Regression¶

Define X (predictors) and y (response) variables¶

Split data train and test data sets (80% training and 20% test)¶

Intercept:¶

Slope¶

Residuals¶

Predicted values at test counties¶

1:1 Plot¶

Multiple Linear Regression (MLR)¶

Create a data-frame¶

Split-data¶

Fit MLR model¶

Co-efficients MLR model¶

1:1 Plot¶

Cross-validation: evaluating estimator performance¶

K-Fold Cross-Validation¶¶

Leave-One-Out Cross-Validation¶

Regression analysis with "statsmodels"¶

Ordinary Least Squares (OLS)¶

Multiple Linear Regression (MLR)¶

Generalized Linear Models (GLM)¶

Gaussian Family¶

Plot Fitted vs Observed Values¶

Plot yhat vs. Pearson residuals:¶

Histogram of standardized deviance residuals:¶

QQ Plot of Deviance Residuals:¶

Poisson Family¶

Neural Network Regression With Keras-Tensorflow¶

# Standardize ranges¶

Split the data into train and test¶

Build the simple neural network regression model¶

Prediction at Test data¶

Mean Square Error¶

Root Mean Square Error¶

Lift Chart¶

Cross-validation: evaluating estimator performance ¶