Code
library(tidyverse)
library(janitor)
The R package janitor provides a set of tools for cleaning and organizing data in R. The package is designed to help make data cleaning tasks easier and more efficient, with functions that handle common data cleaning tasks.
Some of the functions provided by the janitor package include:
clean_names: This function cleans column names by removing special characters and converting them to lowercase.
remove_empty: This function removes rows or columns that are entirely empty from a data frame.
tabyl: This function creates frequency tables with ease.
get_dupes: This function identifies duplicate rows in a data frame.
factorize: This function converts columns in a data frame to factors.
The janitor package can be installed using the following command in R:
install.packages(“janitor”)
We will create some “bad” data and clean them with janitor. We will apply following functions:
The clean_names() function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.
Column.One Column.Two.. Column.Three.. X.Column.four
1 1 6 11 11
2 2 7 12 12
3 3 8 13 13
4 4 9 14 14
5 5 10 15 15
The remove_empty() function is used to remove rows or columns that contain only missing or empty values.
x y z
1 1 NA NA
2 NA NA NA
3 4 3 NA
The get_dupes() function is used to find duplicate rows in a data frame.
Now we will clean on very messy data using some functions of janitor packages. We will use Lung Cancer Mortality data.
Rows: 3,118
Columns: 26
$ `Lung Cancer Moratlity Rates and Risk in USA, Data Provider: Zia Ahmed` <chr> …
$ ...2 <chr> …
$ ...3 <chr> …
$ ...4 <chr> …
$ ...5 <chr> …
$ ...6 <chr> …
$ ...7 <chr> …
$ ...8 <chr> …
$ ...9 <chr> …
$ ...10 <chr> …
$ ...11 <chr> …
$ ...12 <chr> …
$ ...13 <chr> …
$ ...14 <chr> …
$ ...15 <chr> …
$ ...16 <chr> …
$ ...17 <chr> …
$ ...18 <chr> …
$ ...19 <chr> …
$ ...20 <chr> …
$ ...21 <chr> …
$ ...22 <chr> …
$ ...23 <chr> …
$ ...24 <chr> …
$ ...25 <chr> …
$ ...26 <chr> …
You’ve probably received plenty of data files like this, that have some text at the top of the spreadsheet before the actual data begins.
In this df.lbc data-frame, the column heading describe briefly the data. But we want 1st row as column heading. So we apply we will apply row_to_names(). The row_to_names() function takes the following arguments: the data source, the row number that column names should come from, whether that row should be deleted from the data, and whether the rows above should be deleted from the data:
Rows: 3,117
Columns: 26
$ REGION_ID <chr> "3", "3", "3", "3", NA, NA, "3", "3", "3", "3", "…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", NA, N…
$ County <chr> "Baldwin County", "Butler County", "Butler County…
$ `Empty Column 1` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ X <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips <chr> "1003", "1013", "1013", "1017", NA, NA, "1023", "…
$ `Empty_Column 2` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", NA, NA, "31.8", "…
$ Smoking <chr> "20.8", "26", "26", "25.1", NA, NA, "21.8", "22.6…
$ `PM 25` <chr> "7.89", "8.46", "8.46", "8.87", NA, NA, "8.58", "…
$ NO2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", NA, NA, "…
$ SO2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", NA, N…
$ Ozone <chr> "39.79", "38.31", "38.31", "40.1", NA, NA, "37.07…
$ `Pop 65` <chr> "19.5", "19", "19", "18.9", NA, NA, "22.1", "19",…
$ `Pop Black` <chr> "9.24", "43.94", "43.94", "39.24", NA, NA, "41.94…
$ `Pop Hipanic` <chr> "4.54", "1.26", "1.26", "2.14", NA, NA, "0.86", "…
$ `Pop White` <chr> "83.06", "52.64", "52.64", "56.42", NA, NA, "56.2…
$ Education <chr> "66", "38", "38", "47", NA, NA, "55", "39", "60",…
$ `Poverty %` <chr> "13.14", "26.14", "26.14", "21.52", NA, NA, "23.0…
$ `Income Equality` <chr> "4.5", "5.1", "5.1", "4.7", NA, NA, "5.8", "8.2",…
$ Uninsured <chr> "13.34", "12.74", "12.74", "13.34", NA, NA, "12.8…
$ DEM <chr> "36.78", "111.70", "111.70", "227.03", NA, NA, "6…
$ `Radon Zone Class` <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", NA, NA, "…
$ `Urban Rural` <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production` <chr> "No", "No", "No", "No", NA, NA, "No", "No", "No",…
Still data has some empty columns and and empty rows, we are going to remove these empty columns and rows using remove_empty() function:
Rows: 3,110
Columns: 24
$ REGION_ID <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ County <chr> "Baldwin County", "Butler County", "Butler County…
$ X <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips <chr> "1003", "1013", "1013", "1017", "1023", "1025", "…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53…
$ Smoking <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2…
$ `PM 25` <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "…
$ NO2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934",…
$ SO2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.02…
$ Ozone <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.6…
$ `Pop 65` <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3",…
$ `Pop Black` <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.9…
$ `Pop Hipanic` <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "…
$ `Pop White` <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.…
$ Education <chr> "66", "38", "38", "47", "55", "39", "60", "35", "…
$ `Poverty %` <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.…
$ `Income Equality` <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", …
$ Uninsured <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.…
$ DEM <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "…
$ `Radon Zone Class` <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3",…
$ `Urban Rural` <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "No…
$ `Coal Production` <chr> "No", "No", "No", "No", "No", "No", "No", "No", "…
Now, we are going fix column headings using clean_names(). It converts the column names to lowercase and replaces all spaces and special characters with underscores.
Rows: 3,110
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Butler County",…
$ x <chr> "789777.5039", "877731.5725", "877731.5725", "98421…
$ y <chr> "884557.0795", "1007285.71", "1007285.71", "1148648…
$ fips <chr> "1003", "1013", "1013", "1017", "1023", "1025", "10…
$ lcb_mortality_rate <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53.7…
$ smoking <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2",…
$ pm_25 <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "8.…
$ no2 <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934", "…
$ so2 <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.0239…
$ ozone <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.68"…
$ pop_65 <chr> "19.5", "19", "19", "18.9", "22.1", "19", "16.3", "…
$ pop_black <chr> "9.24", "43.94", "43.94", "39.24", "41.94", "43.96"…
$ pop_hipanic <chr> "4.54", "1.26", "1.26", "2.14", "0.86", "1.34", "6.…
$ pop_white <chr> "83.06", "52.64", "52.64", "56.42", "56.28", "52.98…
$ education <chr> "66", "38", "38", "47", "55", "39", "60", "35", "53…
$ poverty_percent <chr> "13.14", "26.14", "26.14", "21.52", "23.06", "24.6"…
$ income_equality <chr> "4.5", "5.1", "5.1", "4.7", "5.8", "8.2", "4.8", "4…
$ uninsured <chr> "13.34", "12.74", "12.74", "13.34", "12.86", "13.28…
$ dem <chr> "36.78", "111.70", "111.70", "227.03", "68.24", "69…
$ radon_zone_class <chr> "Zone-3", "Zone-3", "Zone-3", "Zone-3", "Zone-3", "…
$ urban_rural <chr> "Medium/small metro", "Nonmetro", "Nonmetro", "Nonm…
$ coal_production <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No…
All data are exported in R as chr. We are going to convert column from 4 to 21 as.numeric and 22 to 23 as.factor. We use dplyr::mutate_at() function:
Rows: 3,110
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Butler County",…
$ x <dbl> 789777.5, 877731.6, 877731.6, 984214.7, 726606.5, 7…
$ y <dbl> 884557.1, 1007285.7, 1007285.7, 1148648.7, 1023615.…
$ fips <dbl> 1003, 1013, 1013, 1017, 1023, 1025, 1031, 1035, 103…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.…
$ smoking <dbl> 20.8, 26.0, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.…
$ pm_25 <dbl> 7.89, 8.46, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.2…
$ no2 <dbl> 0.7939, 0.6344, 0.6344, 0.8442, 0.5934, 0.6432, 0.5…
$ so2 <dbl> 0.035343, 0.013500, 0.013500, 0.048177, 0.023989, 0…
$ ozone <dbl> 39.79, 38.31, 38.31, 40.10, 37.07, 37.68, 38.46, 37…
$ pop_65 <dbl> 19.5, 19.0, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.…
$ pop_black <dbl> 9.24, 43.94, 43.94, 39.24, 41.94, 43.96, 17.26, 45.…
$ pop_hipanic <dbl> 4.54, 1.26, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.6…
$ pop_white <dbl> 83.06, 52.64, 52.64, 56.42, 56.28, 52.98, 70.90, 50…
$ education <dbl> 66, 38, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 38,…
$ poverty_percent <dbl> 13.14, 26.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29…
$ income_equality <dbl> 4.5, 5.1, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5…
$ uninsured <dbl> 13.34, 12.74, 12.74, 13.34, 12.86, 13.28, 13.16, 15…
$ dem <dbl> 36.78, 111.70, 111.70, 227.03, 68.24, 69.29, 99.32,…
$ radon_zone_class <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…
Now will check the duplicates record in the this data:
# A tibble: 6 × 25
fips dupe_co…¹ regio…² state county x y lcb_m…³ smoking pm_25 no2
<dbl> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1013 2 3 Alab… Butle… 8.78e5 1.01e6 38.3 26 8.46 0.634
2 1013 2 3 Alab… Butle… 8.78e5 1.01e6 38.3 26 8.46 0.634
3 1053 2 3 Alab… Escam… 8.39e5 9.34e5 58.3 25.3 8.08 0.574
4 1053 2 3 Alab… Escam… 8.39e5 9.34e5 58.3 25.3 8.08 0.574
5 5011 2 3 Arka… Bradl… 3.54e5 1.16e6 69.9 25 8.32 0.558
6 5011 2 3 Arka… Bradl… 3.54e5 1.16e6 69.9 25 8.32 0.558
# … with 14 more variables: so2 <dbl>, ozone <dbl>, pop_65 <dbl>,
# pop_black <dbl>, pop_hipanic <dbl>, pop_white <dbl>, education <dbl>,
# poverty_percent <dbl>, income_equality <dbl>, uninsured <dbl>, dem <dbl>,
# radon_zone_class <fct>, urban_rural <fct>, coal_production <fct>, and
# abbreviated variable names ¹dupe_count, ²region_id, ³lcb_mortality_rate
As shown above, the data frame is filtered down to those rows with duplicate values in the Fips column. For removing these duplicate rows, we have to use dplyr::distinct(.keep_all = TRUE)
Now will check the duplicates record in the this data:
Rows: 3,107
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Chambers County…
$ x <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25 <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2 <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2 <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65 <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…
Now we run all above function with Pipe (%>%):
Rows: 3,107
Columns: 24
$ region_id <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county <chr> "Baldwin County", "Butler County", "Chambers County…
$ x <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25 <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2 <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2 <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65 <dbl> 19.5, 19.0, 18.9, 22.1, 19.0, 16.3, 21.6, 20.5, 18.…
$ pop_black <dbl> 9.24, 43.94, 39.24, 41.94, 43.96, 17.26, 45.94, 12.…
$ pop_hipanic <dbl> 4.54, 1.26, 2.14, 0.86, 1.34, 6.76, 1.84, 1.62, 1.8…
$ pop_white <dbl> 83.06, 52.64, 56.42, 56.28, 52.98, 70.90, 50.56, 83…
$ education <dbl> 66, 38, 47, 55, 39, 60, 35, 53, 44, 58, 38, 45, 58,…
$ poverty_percent <dbl> 13.14, 26.14, 21.52, 23.06, 24.60, 16.20, 29.76, 20…
$ income_equality <dbl> 4.5, 5.1, 4.7, 5.8, 8.2, 4.8, 4.9, 4.6, 5.8, 5.2, 5…
$ uninsured <dbl> 13.34, 12.74, 13.34, 12.86, 13.28, 13.16, 15.16, 13…
$ dem <dbl> 36.78, 111.70, 227.03, 68.24, 69.29, 99.32, 96.03, …
$ radon_zone_class <fct> Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zone-3, Zon…
$ urban_rural <fct> Medium/small metro, Nonmetro, Nonmetro, Nonmetro, N…
$ coal_production <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,…
Create a R-Markdown Project on exiting Homework directory
Create a R-Markdown documents (name homework_02.rmd) in this project directory and do all Tasks (1 to 6) using the data shown below.
Submit all codes and output as a HTML document (homework_02.html) before class of next week.
tidyverse and janitor
Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:
bf<-read_csv(“bd_arsenic_data_raw.csv”))
Use janitor::row_to_names() remove text from column heading
Remove empty rows and columns using janitor::remove_empty()
Clean column names using janitor::clean_names()
Use as.numeric and as.factor arguments in dplyr::mutate_at() function to convert ‘chr’ columns to numeric and factors accordingly
Find duplicate records and remove them
Run all above functions with pipe.
# Data Wrangling with Janitor {.unnumbered}
![](Image/Janitor.png)
The R package [janitor](https://github.com/sfirke/janitor) provides a set of tools for cleaning and organizing data in R. The package is designed to help make data cleaning tasks easier and more efficient, with functions that handle common data cleaning tasks.
Some of the functions provided by the janitor package include:
**clean_names**: This function cleans column names by removing special characters and converting them to lowercase.
**remove_empty**: This function removes rows or columns that are entirely empty from a data frame.
**tabyl**: This function creates frequency tables with ease.
**get_dupes**: This function identifies duplicate rows in a data frame.
**factorize**: This function converts columns in a data frame to factors.
The janitor package can be installed using the following command in R:
> install.packages("janitor")
### Load Packages
```{r}
#| warning: false
#| error: false
library(tidyverse)
library(janitor)
```
### Some Important functions
We will create some "bad" data and clean them with janitor. We will apply following functions:
- clean_names()
- remove_empty()
- trim_ws()
- get_dupes()
- remove_constant()
#### clean_names()
The **clean_names()** function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.
```{r}
#| warning: false
#| error: false
#|
# Create a data frame with messy column names
df <- data.frame("Column One" = 1:5,
"Column Two!!" = 6:10,
"Column Three $" = 11:15,
"%Column four" = 11:15)
head(df)
```
```{r}
#| warning: false
#| error: false
df %>%
janitor::clean_names()
```
#### remove_empty()
The remove_empty() function is used to remove rows or columns that contain only missing or empty values.
```{r}
#| warning: false
#| error: false
# Create a data frame with empty rows and columns
df <- data.frame(x = c(1,NA,4),
y = c(NA,NA,3),
z = c(NA, NA, NA))
head(df)
```
```{r}
#| warning: false
#| error: false
df %>%
janitor::remove_empty(c("rows","cols"))
```
#### get_dupes()
The get_dupes() function is used to find duplicate rows in a data frame.
```{r}
#| warning: false
#| error: false
#|
df <- data.frame("Column One" = c(1, 2, 3, 1), "Column Two" = c("A", "B", "C", "A"))
get_dupes(df)
```
#### Clean a bad data
Now we will clean on very messy data using some functions of janitor packages. We will use [Lung Cancer Mortality data](https://www.dropbox.com/s/ovm7dc2szax6kcz/USA_LBC_Data.csv?dl=0).
```{r}
#| warning: false
#| error: false
## Create a data folder
dataFolder<-"E:/Dropbox/GitHub/Data/USA/"
df<-read_csv(paste0(dataFolder,"USA_LBC_Data.csv"))
```
```{r}
#| warning: false
#| error: false
glimpse(df)
```
You've probably received plenty of data files like this, that have some text at the top of the spreadsheet before the actual data begins.
In this df.lbc data-frame, the column heading describe briefly the data. But we want 1st row as column heading. So we apply we will apply **row_to_names()**. The row_to_names() function takes the following arguments: the data source, the row number that column names should come from, whether that row should be deleted from the data, and whether the rows above should be deleted from the data:
```{r}
#| warning: false
#| error: false
df.01 = df %>%
janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE) %>%
glimpse()
```
Still data has some empty columns and and empty rows, we are going to remove these empty columns and rows using **remove_empty()** function:
```{r}
#| warning: false
#| error: false
df.02 = df.01 %>%
janitor::remove_empty() %>%
glimpse()
```
Now, we are going fix column headings using **clean_names()**. It converts the column names to lowercase and replaces all spaces and special characters with underscores.
```{r}
#| warning: false
#| error: false
df.03 = df.02 %>%
janitor::clean_names() %>%
glimpse()
```
All data are exported in R as **chr**. We are going to convert column from 4 to 21 **as.numeric** and 22 to 23 **as.factor**. We use **dplyr::mutate_at()** function:
```{r}
#| warning: false
#| error: false
df.04= df.03 %>%
dplyr::mutate_at(4:21, as.numeric) %>%
dplyr::mutate_at(22:24, as.factor) %>%
glimpse()
```
Now will check the duplicates record in the this data:
```{r}
#| warning: false
#| error: false
df.04 %>% janitor::get_dupes(fips)
```
As shown above, the data frame is filtered down to those rows with duplicate values in the Fips column. For removing these duplicate rows, we have to use **dplyr::distinct(.keep_all = TRUE)**
Now will check the duplicates record in the this data:
```{r}
#| warning: false
#| error: false
df.05= df.04 %>%
dplyr::distinct(fips,.keep_all = TRUE) %>%
glimpse()
```
Now we run all above function with Pipe (%\>%):
```{r}
#| warning: false
#| error: false
df_clean = df %>%
janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE) %>%
janitor::remove_empty() %>%
janitor::clean_names() %>%
dplyr::mutate_at(4:21, as.numeric) %>%
dplyr::mutate_at(22:24, as.factor) %>%
dplyr::distinct(fips,.keep_all = TRUE) %>%
glimpse()
```
### Exercise
1. Create a R-Markdown Project on exiting Homework directory
2. Create a R-Markdown documents (name homework_02.rmd) in this project directory and do all Tasks (1 to 6) using the data shown below.
3. Submit all codes and output as a HTML document (homework_02.html) before class of next week.
#### Required R-Package
tidyverse and janitor
#### Data
[bd_arsenic_data_raw.csv](https://www.dropbox.com/s/6kw0wkprpw31kis/bd_arsenic_data_raw.csv?dl=0)
Download the data and save in your project directory. Use read_csv to load the data in your R-session. For example:
> bf\<-read_csv("bd_arsenic_data_raw.csv"))
#### Tasks
1. Use janitor::row_to_names() remove text from column heading
2. Remove empty rows and columns using janitor::remove_empty()
3. Clean column names using janitor::clean_names()
4. Use as.numeric and as.factor arguments in **dplyr::mutate_at()** function to convert 'chr' columns to numeric and factors accordingly
5. Find duplicate records and remove them
6. Run all above functions with pipe.
### Further Reading
1. [Overview of janitor functions](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)
2. [Cleaning and Exploring Data with the "janitor" Package](https://towardsdatascience.com/cleaning-and-exploring-data-with-the-janitor-package-ee4a3edf085e)