Goals of this notebook
The steps we’ll take to prepare our data:
- Download the data
- Import it into our notebook
- Clean up data types and columns
- Export the data for the next notebook
Setup
Loading the libraries.
library(tidyverse)
library(janitor)
Downloading Data
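The data is the Billboard Hot 100 chart from Billboard Magazine. The download call is commented out, presumably because the file only needs to be fetched once.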
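As an alternative to commenting the download in and out, the call can be wrapped so it only runs when the file is missing. This is a minimal sketch: `download_if_missing()` is a hypothetical helper name, not part of the original notebook, and it uses only base R.

```r
# Hypothetical helper (not in the original notebook):
# download a file only when it is not already on disk.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    # Nothing to do: the file was downloaded on an earlier run.
    return(invisible(FALSE))
  }
  # Create the target folder if needed, then fetch the file in binary mode.
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}
```

It would be called with the same URL and destination path that appear in the commented-out `download.file()` call.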
# download.file(
# "https://github.com/utdata/rwd-billboard-data/blob/main/data-out/hot100_assignment.csv?raw=true",
# "data-raw/hot100_assignment.csv",
# mode = "wb"
# )
Import Data
Import the Billboard Hot 100 data and standardize the column names with clean_names().
# create the object, then fill it with data from the csv
hot100 <- read_csv("data-raw/hot100_assignment.csv") |> clean_names()
Rows: 341800 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): CHART WEEK, TITLE, PERFORMER
dbl (4): THIS WEEK, LAST WEEK, PEAK POS., WKS ON CHART
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
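The message above suggests declaring the column types explicitly. The sketch below shows one way to do that on a tiny inline sample rather than the real file. The first data row comes from the output in this notebook, the second row is made up, and passing literal text through `I()` assumes readr 2.0 or later.

```r
library(readr)
library(janitor)

# Inline sample with the original column headers. The second data row is a
# placeholder for illustration only.
csv_text <- "CHART WEEK,THIS WEEK,TITLE,PERFORMER,LAST WEEK,PEAK POS.,WKS ON CHART
1/1/2022,1,All I Want For Christmas Is You,Mariah Carey,1,1,50
1/1/2022,2,Placeholder Song,Placeholder Artist,2,2,44
"

# One letter per column: c = character, d = double.
# Declaring the types up front silences the column specification message.
sample_hot100 <- read_csv(I(csv_text), col_types = "cdccddd") |>
  clean_names()

names(sample_hot100)
```

The same `col_types` argument could be added to the `read_csv()` call on the real file, or `show_col_types = FALSE` could be used if only the message needs to go away.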
# peek at the data
hot100 |> glimpse()
Rows: 341,800
Columns: 7
$ chart_week <chr> "1/1/2022", "1/1/2022", "1/1/2022", "1/1/2022", "1/1/2022…
$ this_week <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ title <chr> "All I Want For Christmas Is You", "Rockin' Around The Ch…
$ performer <chr> "Mariah Carey", "Brenda Lee", "Bobby Helms", "Burl Ives",…
$ last_week <dbl> 1, 2, 4, 5, 3, 7, 9, 11, 6, 13, 15, 17, 18, 0, 8, 25, 19,…
$ peak_pos <dbl> 1, 2, 3, 4, 1, 5, 7, 6, 1, 10, 11, 8, 12, 14, 7, 16, 12, …
$ wks_on_chart <dbl> 50, 44, 41, 25, 11, 26, 24, 19, 24, 15, 31, 18, 14, 1, 49…
Fix our dates
We'll use lubridate's mdy() to parse the chart_week text into a real date, stored in a new chart_date column, then sort the rows by date and chart position.
# part we will build upon
hot100_date <- hot100 |>
  mutate(
    chart_date = mdy(chart_week)
  ) |>
  arrange(chart_date, this_week)
# peek at the result
hot100_date |> glimpse()
Rows: 341,800
Columns: 8
$ chart_week <chr> "8/4/1958", "8/4/1958", "8/4/1958", "8/4/1958", "8/4/1958…
$ this_week <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ title <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard He…
$ performer <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby D…
$ last_week <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ peak_pos <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ wks_on_chart <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ chart_date <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-…
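To see why the conversion matters, the sketch below compares sorting the dates as text with sorting them as real dates, using two chart_week values that appear in the output above. It also counts failed parses, since mdy() returns NA for strings it cannot read.

```r
library(lubridate)

# Two chart_week values taken from the glimpse() output in this notebook.
week_text <- c("8/4/1958", "1/1/2022")

# Sorted as text, "1/1/2022" comes first because "1" sorts before "8".
sorted_as_text <- sort(week_text)

# Parsed with mdy(), the values become Date objects and sort chronologically.
week_date <- mdy(week_text)
sorted_as_date <- sort(week_date)

# A quick check that every value parsed.
failed_parses <- sum(is.na(week_date))
```

The same `sum(is.na(...))` check could be run on the chart_date column of hot100_date after the mutate step.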
Looking at the table
View the first 10 rows of the hot100_date table.
hot100_date |> head(10)
Printing
Print summary statistics for hot100_date.
hot100_date |> summary()
  chart_week          this_week        title            performer
 Length:341800      Min.   :  1.0   Length:341800      Length:341800
 Class :character   1st Qu.: 26.0   Class :character   Class :character
 Mode  :character   Median : 51.0   Mode  :character   Mode  :character
                    Mean   : 50.5
                    3rd Qu.: 75.0
                    Max.   :100.0
   last_week         peak_pos       wks_on_chart      chart_date
 Min.   :  0.00   Min.   :  1.00   Min.   : 1.000   Min.   :1958-08-04
 1st Qu.: 23.00   1st Qu.: 13.00   1st Qu.: 4.000   1st Qu.:1974-12-21
 Median : 47.00   Median : 38.00   Median : 7.000   Median :1991-05-07
 Mean   : 47.28   Mean   : 40.67   Mean   : 9.295   Mean   :1991-05-07
 3rd Qu.: 71.00   3rd Qu.: 65.00   3rd Qu.:13.000   3rd Qu.:2007-09-22
 Max.   :100.00   Max.   :100.00   Max.   :91.000   Max.   :2024-02-03
 NA's   :32460
The most recent chart date is 2024-02-03 and the earliest is 1958-08-04.
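The date range and the missing-value count can also be pulled out directly instead of being read off a summary printout. This is a minimal sketch on a made-up three-row table. The `toy` data and the result column names are illustrative, not from the original notebook.

```r
library(dplyr)

# Made-up miniature of hot100_date; values are for illustration only.
toy <- tibble(
  chart_date = as.Date(c("1958-08-04", "1958-08-11", "2024-02-03")),
  last_week = c(NA, 3, 1)
)

# One row holding the earliest date, the latest date, and the NA count.
toy_check <- toy |>
  summarise(
    earliest = min(chart_date),
    latest = max(chart_date),
    missing_last_week = sum(is.na(last_week))
  )
```

Swapping `toy` for hot100_date would give the same three numbers for the full data.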
Selecting columns
Drop the text chart_week column, put chart_date first, and rename the rank columns.
hot100_clean <- hot100_date |>
  select(
    chart_date,
    current_rank = this_week,
    title,
    performer,
    previous_rank = last_week,
    peak_rank = peak_pos,
    wks_on_chart
  )
hot100_clean |> glimpse()
Rows: 341,800
Columns: 7
$ chart_date <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
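The select() call above does three things at once: it keeps only the listed columns, orders them as listed, and renames any column written as `new_name = old_name`. A small sketch on a one-row table, with values taken from the first row of the output above, shows that behavior in isolation. The `toy` objects are illustrative only.

```r
library(dplyr)

# One-row table using values from the first row of the 1958 chart.
toy <- tibble(
  chart_week = "8/4/1958",
  this_week = 1,
  title = "Poor Little Fool",
  peak_pos = 1
)

# chart_week is not listed, so it is dropped.
# this_week and peak_pos are renamed as they are selected.
toy_clean <- toy |>
  select(
    current_rank = this_week,
    title,
    peak_rank = peak_pos
  )

names(toy_clean)
```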
Exports
Export the cleaned data as an .rds file for use in the next notebook.
hot100_clean |>
  write_rds("data-processed/01-hot100.rds")
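One reason to export as .rds is that the format stores R objects as they are, so a Date column comes back as a Date when the file is read again. The round trip below is a minimal sketch that writes to a temporary file rather than the project folder. The `toy` table is illustrative.

```r
library(readr)

# Small stand-in for hot100_clean with one Date column.
toy <- data.frame(
  chart_date = as.Date("1958-08-04"),
  current_rank = 1
)

# Write to a temporary path, then read the file back.
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
restored <- read_rds(path)

# The column is still a Date, with no re-parsing needed.
class(restored$chart_date)
```

write_rds() does not create folders, so the data-processed directory has to exist before the export step runs.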