Lesson_05_wotking_with_polygon.png

Spatial Data Processing

Data

Age Adjusted Lung & Bronchus Cancer Mortality Rates by County

Age Adjusted Lung & Bronchus Cancer Mortality Rates from 1980 t0 2014 by county was obtained from Institute for Health Metrics and Evaluation (IHME). The rate was estimated using spatially explicit Bayesian mixed effects regression models for both males and females (Mokdad et al., 2016). The model incorporated 7 covariates, such as proportion of the adult population that graduated high school, the proportion of the population that is Hispanic, the proportion of the population that is black, the proportion of the population that is a race other than black or white, The proportion of a county that is contained within a state or federal Native American reservation, the median household income, and the population density. To reduce the noise due to low mortality rate in a small population size, we smoothed data with the Empirical Bayesian smoother, a tool for dealing with rate instability associated with small sample sizes. It takes the population in the region as an indicator of confidence in the data, with higher population providing a higher confidence in the value at that location. We used SpaceStat (Evaluation version) Software for smoothing data.

Age-standardized Cigarette Smoking (%) Prevalence by County

Age-standardized cigarette smoking (%) prevalence by County estimated from the Behavioral Risk Factor Surveillance System (BRFSS) survey data (Dwyer-Lindgren et al., 2014). The prevalence (%) were estimated by logistic hierarchical mixed effects regression models, stratified by sex. All estimates were age-standardized following the age structure of the 2000 census. The uncertainty of the prevalence estimates was assessed using simulation methods.

Age-standardized Poverty Rate by County

County-level poverty data (% population below poverty level) are from US Census Small Area Income and Poverty Estimates (SAIPE) Program. Income and poverty rate were estimated by combining survey data with population estimates and administrative records.

Particulate Matter (PM2.5)

Annual mean PM2.5 from 1998 to 2012 were modeled from aerosol optical depth from multiple satellite (Multiangle Imaging Spectra Radiometer, MODIS Dark Target, MODIS and SeaWiFS Deep Blue, MODIS MAIAC) products and validated by ground-based PM2.5 monitoring sites (Aaron van Donkelaar et al., 2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Nitrogen dioxide (NO2)

Annual mean ambient NO2 Concentrations estimated from three satellite instruments, such as Global Ozone Monitoring Experiment (GOME), Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY) and GOME-2 satellite in combination with chemical transport model (Gedders et al., (2016). All raster grid of PM2.5 were re-sampled at 2.5 km x 2 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Sulfer dioxide (SO2)

Annual mean ambient SO2 Concentrations estimated were obtained from time series Multi-source SO2 emission retrievals and consistency of satellite and surface measurements of (Fioletov et al,(2017). All raster grid of PM2.5 were re-sampled at 2.5 km x 2.5 km grid size using Empirical Bayesian Kriging in ArcPython - Geo-statistical Tool (ESRI, 2017)). County population weighted mean for each year were calculated from the population data.

Load Libraries

In [1]:
import os
import sys

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import
from mpl_toolkits.axes_grid1 import make_axes_locatable
import mapclassify

import plotly.graph_objects as go
import plotly.express as px

import seaborn as sns

from glob import glob
from functools import reduce
from osgeo import ogr, gdal
from osgeo import gdalconst
import geopandas as gpd
import pycrs
import fiona
from fiona.crs import from_epsg
from shapely.geometry import box
from shapely.geometry import Point
import shapely.geometry as geom

import json

from scipy.stats import sem
from scipy.stats import zscore
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.api import abline_plot
from statsmodels import graphics

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

%matplotlib inline
In [2]:
# What version of Python do you have?
print(f"Python {sys.version}")
print(f"geopandas {gpd.__version__}")
print(f"osgeo {gdal.__version__}")

import tensorflow.keras

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")

print()
print(f"Python {sys.version}")
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
geopandas 0.6.2
osgeo 3.0.2
Tensor Flow Version: 2.0.0
Keras Version: 2.2.4-tf

Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
GPU is available

Setting consistent plotting style throughout notebook

In [3]:
sns.set_style("white")
sns.set(font_scale=1.5)

Set working directory

In [4]:
path= "E:/GitHub/geospatial-python-github.io/Lesson_05_working_with_polygon_data"
os.chdir(path)
print("Current Working Directory " , os.getcwd())
Current Working Directory  E:\GitHub\geospatial-python-github.io\Lesson_05_working_with_polygon_data

Import all CSV

In [5]:
cancer=pd.read_csv("Data_05/Lung_cancer_1998_2012.csv")
smoking=pd.read_csv("Data_05/SMOKING_1998_2012.csv")
poverty=pd.read_csv("Data_05/POVERTY_1998_2012.csv")
PM25=pd.read_csv("Data_05/PM25_1998_2012.csv")
NO2=pd.read_csv("Data_05/NO2_1998_2012.csv")
SO2=pd.read_csv("Data_05/SO2_1998_2012.csv")

Load County shape file and Extract Centriod

In [6]:
county_gpd=gpd.read_file('Data_05/county_map.shp')
county_gpd.plot()
county_gpd.head()
Out[6]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908...
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874...
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404...
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529...
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604...

Now, we will get centriod of each county

Creat Centriods of all counties

In [7]:
# copy poly to new GeoDataFrame
cen_points = county_gpd.copy()
# change the geometry
cen_points.geometry = cen_points['geometry'].centroid
# same crs
cen_points.crs =county_gpd.crs
cen_points.head()
Out[7]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry
0 58 13111.0 South South Atlantic Georgia Fannin County GA POINT (1056523.936 1376613.190)
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POINT (1653442.064 2267301.476)
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POINT (1633708.275 2096683.059)
3 593 51683.0 South South Atlantic Virginia Manassas city VA POINT (1584048.852 1901443.401)
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POINT (1735811.102 2409536.493)
In [8]:
fig, ay = plt.subplots()
county_gpd.plot(ax=ay, color="white", edgecolor="gray")
cen_points.plot(ax=ay, color='red', markersize=1)
ay.set_title('County Centriods')
Out[8]:
Text(0.5, 1, 'County Centriods')
In [9]:
# Get X (m) and Y (m) Values from geometry column
cen_points['X'] = cen_points.geometry.x
cen_points['Y'] = cen_points.geometry.y
cen_points.head()
Out[9]:
FIDs FIPS REGION DIVISION STATE COUNTY code geometry X Y
0 58 13111.0 South South Atlantic Georgia Fannin County GA POINT (1056523.936 1376613.190) 1.056524e+06 1.376613e+06
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POINT (1653442.064 2267301.476) 1.653442e+06 2.267301e+06
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POINT (1633708.275 2096683.059) 1.633708e+06 2.096683e+06
3 593 51683.0 South South Atlantic Virginia Manassas city VA POINT (1584048.852 1901443.401) 1.584049e+06 1.901443e+06
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POINT (1735811.102 2409536.493) 1.735811e+06 2.409536e+06
In [10]:
### Convert geopandas data frame to pandas data frame without geometry column
columns =['STATE','COUNTY','REGION','DIVISION','geometry' ]
cen_df = pd.DataFrame(cen_points.drop(columns=columns))
cen_df.head()
Out[10]:
FIDs FIPS code X Y
0 58 13111.0 GA 1.056524e+06 1.376613e+06
1 427 42115.0 PA 1.653442e+06 2.267301e+06
2 407 42075.0 PA 1.633708e+06 2.096683e+06
3 593 51683.0 VA 1.584049e+06 1.901443e+06
4 236 36057.0 NY 1.735811e+06 2.409536e+06

Calculate 15 years Mean

We will create a data frame of 15 years (1998-2012) mean all variables, and will use this data in do some statistical analysis.

In [11]:
# copy cen_df as "avg" pd-dataframe
avg_df = cen_df.copy()
In [12]:
avg_df["cancer"]= cancer.iloc[:,2:16].mean(axis=1)
avg_df["poverty"]= poverty.iloc[:,2:16].mean(axis=1)
avg_df["smoking"]= smoking.iloc[:,2:16].mean(axis=1)
avg_df["PM25"]= PM25.iloc[:,2:16].mean(axis=1)
avg_df["NO2"]= NO2.iloc[:,2:16].mean(axis=1)
avg_df["SO2"]= SO2.iloc[:,2:16].mean(axis=1)
In [13]:
avg_df.head()
Out[13]:
FIDs FIPS code X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 GA 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 PA 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 PA 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 VA 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 NY 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293
In [14]:
# Write as CSV file
avg_df.to_csv('Data_05/county_mean.csv',index=False)

Join the average data with county shape file

In [15]:
# Drop code coloun for pd-data frame
avg_mf=avg_df.drop("code", axis=1)
county_gpd_data = county_gpd.merge(avg_mf, on='FIPS')
county_gpd_data.head()
Out[15]:
FIDs_x FIPS REGION DIVISION STATE COUNTY code geometry FIDs_y X Y cancer poverty smoking PM25 NO2 SO2
0 58 13111.0 South South Atlantic Georgia Fannin County GA POLYGON ((1037682.266 1388266.997, 1037756.908... 58 1.056524e+06 1.376613e+06 74.637249 11.828571 26.192857 13.635714 3.508071 0.073450
1 427 42115.0 Northeast Middle Atlantic Pennsylvania Susquehanna County PA POLYGON ((1622673.906 2281043.968, 1623277.874... 427 1.653442e+06 2.267301e+06 66.342705 9.250000 23.478571 14.329286 5.074071 0.072547
2 407 42075.0 Northeast Middle Atlantic Pennsylvania Lebanon County PA POLYGON ((1623002.490 2115965.841, 1632186.404... 407 1.633708e+06 2.096683e+06 69.716354 11.335714 27.621429 12.487857 3.730714 0.061116
3 593 51683.0 South South Atlantic Virginia Manassas city VA POLYGON ((1582645.685 1903767.263, 1584122.529... 593 1.584049e+06 1.901443e+06 62.711264 17.728571 22.257143 13.710000 5.594500 0.158739
4 236 36057.0 Northeast Middle Atlantic New York Montgomery County NY POLYGON ((1725612.650 2416731.662, 1729853.604... 236 1.735811e+06 2.409536e+06 77.569145 19.278571 27.557143 12.247143 0.731357 0.009293

Write as a ESRI shape file

In [16]:
county_gpd_data.to_file("Data_05/county_mean_data.shp")

Visualization of Spatial Data

Map County Mean Data

In [17]:
# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot Cancer 
county_gpd_data.plot(column='cancer', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5
                  )
bx1.set_title("LBC Mortality Rate");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot Poverty
county_gpd_data.plot(column='poverty',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5
                   );
bx2.set_title("Povert (%)");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot Smoking
county_gpd_data.plot(column='smoking', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5
                 );
bx3.set_title("Smoking (%)");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)
Out[17]:
<matplotlib.colorbar.Colorbar at 0x232a180e148>
In [18]:
# Make subplots that are next to each other
fig, (bx1, bx2, bx3) = plt.subplots(nrows=1, ncols=3,  figsize=(12, 5))

# Plot PM25 
county_gpd_data.plot(column='PM25', 
                 edgecolor="gray", 
                 ax=bx1,
                 cmap='Blues',
                 linewidth=0.5,
                  )
bx1.set_title("PM25");
bx1.axis('off');
fig.colorbar(bx1.collections[0], ax=bx1)

# Plot NO2
county_gpd_data.plot(column='NO2',
                  edgecolor="gray", 
                  ax=bx2,
                  cmap= 'Blues',
                  linewidth=0.5,
                   );
bx2.set_title("NO2");
bx2.axis('off');
fig.colorbar(bx2.collections[0], ax=bx2)

# Plot SO2
county_gpd_data.plot(column='SO2', 
                 edgecolor="gray", 
                 ax=bx3,
                 cmap='Blues',
                 linewidth=0.5,
                 );
bx3.set_title("SO2");
bx3.axis('off');
fig.colorbar(bx3.collections[0], ax=bx3)
Out[18]:
<matplotlib.colorbar.Colorbar at 0x232a1c8a0c8>