Goals of this notebook

The steps we’ll take to prepare our data:

  • Download the data
  • Import it into our notebook
  • Clean up data types and columns
  • Export the data for the next notebook

Setup

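Load the libraries. The tidyverse provides readr and dplyr and, as of tidyverse 2.0.0, also attaches lubridate, which we use later to parse dates. The janitor package provides clean_names().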

library(tidyverse)
library(janitor)

Downloading Data

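The data is the Billboard Hot 100 chart from Billboard Magazine. The download call below is commented out; uncomment it to fetch the file into the data-raw folder.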

# download.file(
#   "https://github.com/utdata/rwd-billboard-data/blob/main/data-out/hot100_assignment.csv?raw=true",
#   "data-raw/hot100_assignment.csv",
#   mode = "wb"
# )
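
If you want a download step that is safe to re-run, one option is to fetch the file only when it is missing. This is a minimal sketch, not part of the original notebook: the helper name download_if_missing is hypothetical, and it relies only on base R functions.

```r
# Hypothetical helper (not part of the original notebook):
# download a file only if it is not already on disk.
download_if_missing <- function(url, destfile) {
  # make sure the target folder exists
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)

  if (file.exists(destfile)) {
    return(invisible(FALSE)) # nothing downloaded
  }

  download.file(url, destfile, mode = "wb")
  invisible(TRUE) # file was downloaded
}

# download_if_missing(
#   "https://github.com/utdata/rwd-billboard-data/blob/main/data-out/hot100_assignment.csv?raw=true",
#   "data-raw/hot100_assignment.csv"
# )
```

The call is left commented out, as in the notebook, so rendering does not touch the network unless you choose to.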

Import Data

Import the Billboard Hot 100 data. We create the hot100 object and fill it with data from the CSV, using clean_names() to convert the column names to snake_case.

# create the object, then fill it with data from the csv
hot100 <- read_csv("data-raw/hot100_assignment.csv") |> clean_names()
Rows: 341800 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): CHART WEEK, TITLE, PERFORMER
dbl (4): THIS WEEK, LAST WEEK, PEAK POS., WKS ON CHART

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# peek at the data
hot100 |> glimpse()
Rows: 341,800
Columns: 7
$ chart_week   <chr> "1/1/2022", "1/1/2022", "1/1/2022", "1/1/2022", "1/1/2022…
$ this_week    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ title        <chr> "All I Want For Christmas Is You", "Rockin' Around The Ch…
$ performer    <chr> "Mariah Carey", "Brenda Lee", "Bobby Helms", "Burl Ives",…
$ last_week    <dbl> 1, 2, 4, 5, 3, 7, 9, 11, 6, 13, 15, 17, 18, 0, 8, 25, 19,…
$ peak_pos     <dbl> 1, 2, 3, 4, 1, 5, 7, 6, 1, 10, 11, 8, 12, 14, 7, 16, 12, …
$ wks_on_chart <dbl> 50, 44, 41, 25, 11, 26, 24, 19, 24, 15, 31, 18, 14, 1, 49…
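
The import message suggests specifying the column types. A sketch of what that could look like follows, assuming the raw header names reported in the column specification above. To keep it self-contained, it reads two sample rows based on the glimpse() outputs in this notebook rather than the real file; sample_csv and hot100_sample are illustrative names.

```r
library(readr)
library(janitor)

# Two sample rows based on values shown in this notebook's glimpse() output.
# I() tells read_csv() to treat the string as literal data, not a file path.
sample_csv <- I(
  "CHART WEEK,THIS WEEK,TITLE,PERFORMER,LAST WEEK,PEAK POS.,WKS ON CHART
1/1/2022,1,All I Want For Christmas Is You,Mariah Carey,1,1,50
8/4/1958,1,Poor Little Fool,Ricky Nelson,,1,1
"
)

# Declaring every column type up front silences the import message
# and makes the import fail loudly if the file layout changes.
hot100_sample <- read_csv(
  sample_csv,
  col_types = cols(
    `CHART WEEK` = col_character(),
    `THIS WEEK` = col_double(),
    TITLE = col_character(),
    PERFORMER = col_character(),
    `LAST WEEK` = col_double(),
    `PEAK POS.` = col_double(),
    `WKS ON CHART` = col_double()
  )
) |>
  clean_names()

hot100_sample |> names()
```

For the real import you would pass the file path in place of sample_csv. If you only want to silence the message, show_col_types = FALSE is the lighter option.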

Fix our dates

We'll use lubridate's mdy() to convert the chart_week text into a new chart_date column with a real date, then sort the rows by date and chart position.

# part we will build upon
hot100_date <- hot100 |> 
  mutate(
    chart_date = mdy(chart_week)
  ) |>
  arrange(chart_date, this_week)

# peek at the result
hot100_date |> glimpse()
Rows: 341,800
Columns: 8
$ chart_week   <chr> "8/4/1958", "8/4/1958", "8/4/1958", "8/4/1958", "8/4/1958…
$ this_week    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ title        <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard He…
$ performer    <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby D…
$ last_week    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ peak_pos     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ wks_on_chart <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ chart_date   <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-…
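
After a conversion like this, it can be worth checking that no dates failed to parse, since mdy() returns NA for text it cannot read. The sketch below shows the idea on a small made-up vector; the values in dates_text, including the deliberately bad one, are illustrative and not from the data.

```r
library(lubridate)

# Illustrative input: two valid month/day/year strings and one bad value
dates_text <- c("8/4/1958", "1/1/2022", "not a date")

# quiet = TRUE suppresses the "failed to parse" warning so we can count NAs ourselves
parsed <- mdy(dates_text, quiet = TRUE)

# number of values that could not be parsed
sum(is.na(parsed))
```

On the notebook's data, the equivalent check would be to count the NA values in the chart_date column.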

Looking at Table

View the first 10 rows of the hot100_date table.

hot100_date |> head(10)

Printing

Print summary statistics for hot100_date.

hot100_date |> summary()
  chart_week          this_week        title            performer        
 Length:341800      Min.   :  1.0   Length:341800      Length:341800     
 Class :character   1st Qu.: 26.0   Class :character   Class :character  
 Mode  :character   Median : 51.0   Mode  :character   Mode  :character  
                    Mean   : 50.5                                        
                    3rd Qu.: 75.0                                        
                    Max.   :100.0                                        
                                                                         
   last_week         peak_pos       wks_on_chart      chart_date        
 Min.   :  0.00   Min.   :  1.00   Min.   : 1.000   Min.   :1958-08-04  
 1st Qu.: 23.00   1st Qu.: 13.00   1st Qu.: 4.000   1st Qu.:1974-12-21  
 Median : 47.00   Median : 38.00   Median : 7.000   Median :1991-05-07  
 Mean   : 47.28   Mean   : 40.67   Mean   : 9.295   Mean   :1991-05-07  
 3rd Qu.: 71.00   3rd Qu.: 65.00   3rd Qu.:13.000   3rd Qu.:2007-09-22  
 Max.   :100.00   Max.   :100.00   Max.   :91.000   Max.   :2024-02-03  
 NA's   :32460                                                          

The most recent chart date is 2024-02-03; the earliest is 1958-08-04.
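
If you only need a few of these figures, such as the date range and the number of missing last_week values, summarise() can pull them out directly. This is a sketch on a small stand-in table (sample_dates is an illustrative name, with values taken from the glimpse() outputs above), not the full data.

```r
library(dplyr)

# Small stand-in for hot100_date, using values shown earlier in this notebook
sample_dates <- tibble(
  chart_date = as.Date(c("1958-08-04", "2022-01-01")),
  this_week = c(1, 1),
  last_week = c(NA, 1)
)

# one row with the earliest date, latest date, and count of missing last_week values
date_summary <- sample_dates |>
  summarise(
    first_chart = min(chart_date),
    last_chart = max(chart_date),
    missing_last_week = sum(is.na(last_week))
  )

date_summary
```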

Selecting columns

Drop the text date column (chart_week), move chart_date to the front, and rename the rank columns so their meaning is clearer.

hot100_clean <- hot100_date |>
  select(
    chart_date,
    current_rank = this_week,
    title,
    performer,
    previous_rank = last_week,
    peak_rank = peak_pos,
    wks_on_chart
  )

hot100_clean |> glimpse()
Rows: 341,800
Columns: 7
$ chart_date    <date> 1958-08-04, 1958-08-04, 1958-08-04, 1958-08-04, 1958-08…
$ current_rank  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ title         <chr> "Poor Little Fool", "Patricia", "Splish Splash", "Hard H…
$ performer     <chr> "Ricky Nelson", "Perez Prado And His Orchestra", "Bobby …
$ previous_rank <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ peak_rank     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ wks_on_chart  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Exports

Export the cleaned data as an .rds file, which preserves column types such as the date for the next notebook.

hot100_clean |>
  write_rds("data-processed/01-hot100.rds")
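
To illustrate why .rds is a convenient format here, the sketch below writes a one-row stand-in table to a temporary file and reads it back, showing that the date column is still a Date afterward. The table sample_clean and the temporary path are illustrative; the notebook itself writes to data-processed/01-hot100.rds.

```r
library(readr)
library(tibble)

# One-row stand-in for hot100_clean, using values shown earlier in this notebook
sample_clean <- tibble(
  chart_date = as.Date("1958-08-04"),
  current_rank = 1,
  title = "Poor Little Fool"
)

# write to a temporary file, then read it back
tmp <- tempfile(fileext = ".rds")
write_rds(sample_clean, tmp)
restored <- read_rds(tmp)

# the date column keeps its Date class after the round trip
class(restored$chart_date)
```

A CSV export would store the dates as text again, so they would have to be re-parsed on import.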