Cleaning

Goals of this notebook

The steps we’ll take to prepare our data:

  • Import data into our notebook
  • Clean up data types and columns
  • Export the cleaned data as an .rds file

Setup

Load the tidyverse for importing and wrangling the data, and janitor for cleaning up column names.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

Import Data

Import the Austin Camp Mabry weather data and standardize the column names with clean_names().

weather <- read_csv("data-raw/weather_data.csv") %>% clean_names()
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 31296 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): STATION, NAME
dbl  (6): PRCP, SNOW, SNWD, TAVG, TMAX, TMIN
lgl  (1): TOBS
date (1): DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
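The parsing warning above points to problems(). Here is a minimal sketch of how that check might look, using a small made-up inline CSV instead of the real file. The values, and the choice to force TOBS to logical, are assumptions for illustration only. One plausible cause of the warning, not confirmed by the source, is that readr guessed TOBS as logical from empty leading rows and later hit numeric values.

```r
library(readr)

# Made-up inline CSV; only the column names mirror the real file.
csv_text <- "DATE,TMAX,TOBS\n1938-06-01,91,\n1938-06-02,94,\n1938-06-03,94,75\n"

# Force TOBS to logical to reproduce a type mismatch (assumption for illustration).
bad <- read_csv(
  I(csv_text),
  col_types = cols(DATE = col_date(), TMAX = col_double(), TOBS = col_logical())
)

# List the cells readr could not parse as the requested type.
problems(bad)

# Declaring the column as double lets the value parse instead of becoming NA.
fixed <- read_csv(
  I(csv_text),
  col_types = cols(DATE = col_date(), TMAX = col_double(), TOBS = col_double())
)
fixed$TOBS
```

In this notebook the column is dropped in the next step, so the warning does not affect the exported data. Setting col_types explicitly would matter if TOBS were kept.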
# Drop the two columns this notebook does not keep: tobs and tavg.
weather <- weather %>% 
  select(-tobs, -tavg)

glimpse(weather)
Rows: 31,296
Columns: 8
$ station <chr> "USW00013958", "USW00013958", "USW00013958", "USW00013958", "U…
$ name    <chr> "AUSTIN CAMP MABRY, TX US", "AUSTIN CAMP MABRY, TX US", "AUSTI…
$ date    <date> 1938-06-01, 1938-06-02, 1938-06-03, 1938-06-04, 1938-06-05, 1…
$ prcp    <dbl> 0.00, 0.00, 0.00, 0.40, 0.02, 0.00, 0.00, 0.00, 1.60, 0.01, 0.…
$ snow    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ snwd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ tmax    <dbl> 91, 94, 94, 90, 94, 92, 95, 92, 87, 90, 92, 91, 91, 91, 89, 89…
$ tmin    <dbl> 72, 67, 70, 68, 68, 70, 70, 76, 64, 76, 75, 71, 70, 68, 71, 70…
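To see what the two cleaning steps above do in isolation, here is a small sketch on a toy tibble. The rows are made up; only the column names are borrowed from the real file.

```r
library(tibble)
library(dplyr)
library(janitor)

# Toy rows with made-up values; only the column names mirror the real file.
toy <- tibble(
  STATION = c("A", "A"),
  DATE = as.Date(c("2000-01-01", "2000-01-02")),
  TMAX = c(70, 72),
  TAVG = c(NA_real_, NA_real_),
  TOBS = c(NA, NA)
)

toy_clean <- toy %>%
  clean_names() %>%      # STATION -> station, TMAX -> tmax, and so on
  select(-tobs, -tavg)   # drop the same two columns the notebook removes

names(toy_clean)
```

The result keeps only station, date and tmax, all in lower case. That matches the pattern in the glimpse() output above, where the columns drop from 10 to 8.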

Exports

Export the cleaned data to an .rds file, a format that preserves R column types such as dates.

weather %>% 
  write_rds("data-processed/01-weather.rds")
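As a quick check of why .rds is a convenient export format, the sketch below writes a two-row tibble to a temporary file and reads it back. The two rows are copied from the glimpse() output above. The temporary path is an assumption so the example does not touch data-processed/.

```r
library(readr)
library(tibble)

# Two rows copied from the glimpse() output; the full data is not needed here.
toy <- tibble(
  date = as.Date(c("1938-06-01", "1938-06-02")),
  tmax = c(91, 94)
)

# Write to a temporary file rather than the project's data-processed/ folder.
path <- tempfile(fileext = ".rds")
write_rds(toy, path)

# Read it back and confirm the date column is still a Date, not text.
restored <- read_rds(path)
class(restored$date)
all.equal(toy, restored)
```

A CSV export would need the column types re-specified or re-guessed on import, while the .rds round trip keeps them as they were.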