Tidy Data

Introduction

In this task the dplyr and tidyr packages from tidyverse are used to work with different datasets. The first sub-task involves using dplyr to transform an manipulate a dataset of flights from New York City. The second sub-task involves using dplyr and tidyr to make a dataset tidy.

Sub-task 1 steps

1. Define a string and print it

This section defines a string containing my name, the unit name and the task name.

description = 'Darrin William Stephens, SIT741: Statistical Data Analysis, Task T02.P2'

Print the string using the cat() function.

cat(description)

Darrin William Stephens, SIT741: Statistical Data Analysis, Task T02.P2

2. Read the dataset

The weather data in the “nycflights13” dataset was downloaded as a CSV file from https://github.com/tidyverse/nycflights13/blob/main/data-raw/weather.csv. The weather data can then be loaded into R.

weather_data = read.csv('weather.csv')

3. Inspect the head of the dataset

To inspect the header and the first n columns of the dataset, I pass the dataframe to the head() function. As I don’t specify the number of rows to show the default number of rows (6) is shown.

head(weather_data)

  origin year month day hour  temp  dewp humid wind_dir wind_speed wind_gust
1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702        NA
2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546        NA
3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780        NA
4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858        NA
5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858        NA
6    EWR 2013     1   1    6 37.94 28.04 67.21      240   11.50780        NA
  precip pressure visib            time_hour
1      0   1012.0    10 2013-01-01T06:00:00Z
2      0   1012.3    10 2013-01-01T07:00:00Z
3      0   1012.5    10 2013-01-01T08:00:00Z
4      0   1012.2    10 2013-01-01T09:00:00Z
5      0   1011.9    10 2013-01-01T10:00:00Z
6      0   1012.4    10 2013-01-01T11:00:00Z

4. Inspect the size of the dataset

To get the number of observations and the number of variables I use the dim() function. This returns the dimensions of the dataframe.

# Get the dimensions of the dataset
dims = dim(weather_data)
obs = dims[1]
vars = dims[2]

# Print the values
cat("Number of observations:", obs, "\n")

Number of observations: 26115

cat("Number of variables:", vars, "\n")

Number of variables: 15

5. Change the “origin” variable to have the factor type

Before changing the class of the “origin” variable, I first check its current class.

class(weather_data$origin)

[1] "character"

Now I use mutate to modify the “origin” column class to be factor. Since dplyr doesn’t modify the existing dataframe, I need to assign the new dataframe to the old to overwrite it.

weather_data = weather_data |> 
  mutate(origin = factor(origin))

Check the mutate function has updated the class of “origin”.

class(weather_data$origin)

[1] "factor"

6. Summarise the mean and median for each level in “origin”

The mean values for the measurements grouped by origin.

weather_data |> 
  group_by(origin) |> 
  summarise(across(temp:visib, ~ mean(.x, na.rm =TRUE)))

# A tibble: 3 × 10
  origin  temp  dewp humid wind_dir wind_speed wind_gust  precip pressure visib
  <fct>  <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>   <dbl>    <dbl> <dbl>
1 EWR     55.5  41.9  63.1     195.       9.46      24.1 0.00504    1018.  9.28
2 JFK     54.5  41.9  65.2     204.      11.5       27.6 0.00398    1018.  9.18
3 LGA     55.8  40.6  59.3     200.      10.6       25.1 0.00438    1018.  9.31

The median values for the measurements grouped by origin.

weather_data |> 
  group_by(origin) |> 
  summarise(across(temp:visib, ~ median(.x, na.rm =TRUE)))

# A tibble: 3 × 10
  origin  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib
  <fct>  <dbl> <dbl> <dbl>    <int>      <dbl>     <dbl>  <dbl>    <dbl> <dbl>
1 EWR     55.9  43.0  62.6      230       9.21      23.0      0    1018.    10
2 JFK     54.0  43.0  65.8      220      10.4       26.5      0    1018.    10
3 LGA     55.9  41    57.3      220      10.4       24.2      0    1017.    10

7. Merge data from the airports dataset with latitude and longitude from flights dataset

Load the datasets.

library(nycflights13)
data(airports)
data(flights)

Perform a left join between flights and airports selecting the faa, lat and lon columns from airport and renaming lat to o_lat and lon to o_lon. The join uses the origin column from flights and the lat and lon columns from airports. A similar left join to performed using the dest column from flights and lat and lon columns from airports.

flights |> 
  # Join for origin
  left_join(
    (airports |> 
      dplyr::select(faa, o_lat = lat, o_lon = lon)),
    by = c("origin" ="faa")
  ) |> 
  # Join for destination
  left_join(
    (airports |> 
      dplyr::select(faa, d_lat = lat, d_lon = lon)),
    by = c("dest" ="faa")
  )

# A tibble: 336,776 × 23
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, o_lat <dbl>, o_lon <dbl>,
#   d_lat <dbl>, d_lon <dbl>

Sub-task 2

For this subtask the dataset is climate statistics for the Melbourne Airport from the Bureau of Meteorology, available from http://www.bom.gov.au/clim_data/cdio/tables/text/IDCJCM0035_086282.csv.

Load the data

The first eleven lines of the CSV file contains meta data and can be skipped.

climate_data = read.csv('IDCJCM0035_086282.csv', skip=11)
head(climate_data)

                                                Statistic.Element      January
1           Mean maximum temperature (°C) for years 1970 to 2025          26.7
2                Highest temperature (°C) for years 1970 to 2025          46.0
3        Date of Highest temperature  °C  for years 1970 to 2025  25 Jan  2019
4         Lowest maximum temperature (°C) for years 1970 to 2025          13.9
5 Date of Lowest maximum temperature  °C  for years 1970 to 2025  05 Jan  1991
6       Decile 1 maximum temperature (°C) for years 1970 to 2022          19.7
      February        March        April          May         June         July
1         26.7         24.3         20.4         16.7         13.7         13.2
2         46.8         40.8         34.5         27.0         21.8         22.7
3 07 Feb  2009 08 Mar  1991 10 Apr  2005 07 May  2002 08 Jun  2005 30 Jul  1975
4         13.5         12.7         11.7          8.0          6.2          5.7
5 02 Feb  2005 29 Mar  1973 07 Apr  1995 31 May  1977 19 Jun  1975 03 Jul  1984
6         19.8         18.2         15.5         13.3         11.1         10.6
        August    September      October     November     December       Annual
1         14.6         16.9         19.5         22.1         24.7         20.0
2         25.6         30.2         36.0         41.6         44.6         46.8
3 29 Aug  1982 23 Sep  2017 12 Oct  2006 21 Nov  2019 20 Dec  2019 07 Feb  2009
4          6.5          8.2         10.4         11.6         13.0          5.7
5 16 Aug  1970 05 Sep  1995 16 Oct  1974 15 Nov  2006 04 Dec  1995 03 Jul  1984
6         11.5         12.8         14.4         15.9         18.2             
  Number.of.Years Start.Year End.Year
1              55       1970     2025
2              55       1970     2025
3             N/A       1970     2025
4              55       1970     2025
5             N/A       1970     2025
6              47       1970     2022

Tidy the data

Tidy data has the following characteristics

Each column is a variable.
Each row is an observation.
Each cell is a single value.
Each type of observational unit forms a table.

The climate data is messy because the columns are not variables and the rows are not individual observations. In this dataset the rows contain some measurements across 12 months.

Pivot longer

To tidy this data, the columns for the months need to be pivoted into rows. The Select operation is used to re-order the columns in the new dataframe after the pivot.

Renaming columns

Rename columns headings to remove spaces and make them the same case.

climate_mod = climate_mod |> 
  rename(
    statistic_element = Statistic.Element,
    number_of_years = Number.of.Years,
    start_year = Start.Year,
    end_year = End.Year,
    annual_value = Annual
  )

head(climate_mod, n=13)

# A tibble: 13 × 7
   statistic_element                month value annual_value start_year end_year
   <chr>                            <chr> <chr> <chr>        <chr>      <chr>   
 1 "Mean maximum temperature (°C) … Janu… 26.7  20.0         1970       2025    
 2 "Mean maximum temperature (°C) … Febr… 26.7  20.0         1970       2025    
 3 "Mean maximum temperature (°C) … March 24.3  20.0         1970       2025    
 4 "Mean maximum temperature (°C) … April 20.4  20.0         1970       2025    
 5 "Mean maximum temperature (°C) … May   16.7  20.0         1970       2025    
 6 "Mean maximum temperature (°C) … June  13.7  20.0         1970       2025    
 7 "Mean maximum temperature (°C) … July  13.2  20.0         1970       2025    
 8 "Mean maximum temperature (°C) … Augu… 14.6  20.0         1970       2025    
 9 "Mean maximum temperature (°C) … Sept… 16.9  20.0         1970       2025    
10 "Mean maximum temperature (°C) … Octo… 19.5  20.0         1970       2025    
11 "Mean maximum temperature (°C) … Nove… 22.1  20.0         1970       2025    
12 "Mean maximum temperature (°C) … Dece… 24.7  20.0         1970       2025    
13 "Highest temperature (°C) for y… Janu… 46.0  46.8         1970       2025    
# ℹ 1 more variable: number_of_years <chr>

Correcting data types for some columns

# Correct the data types for columns
climate_mod = climate_mod |>
  # Cannot change value and annual_value as they contains a mix of dates and numbers
  mutate(across(c(5:7), as.numeric)) |> 
  mutate(month = factor(month))

Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(c(5:7), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

head(climate_mod, n=1)

# A tibble: 1 × 7
  statistic_element month value annual_value start_year end_year number_of_years
  <chr>             <fct> <chr> <chr>             <dbl>    <dbl>           <dbl>
1 "Mean maximum te… Janu… 26.7  20.0               1970     2025              55

Extracting categorical variables from the statistic_element

The values in the statistic_element variable appear to also contain categorical data (temperature, rainfall, mean, highest, lowest, etc) that could be used to group and filter the data. Therefore, I have extracted this information into two separate variables using keywords and units.

quantity_type: temperature, rainfall, wind, sunshine, solar, cloud_cover and humidity.
statistic_type: mean. highest, lowest, date, decile.

# Create the quantity_type column
climate_mod = climate_mod |> 
  mutate(
    # Categorize by looking for key words
    quantity_type = case_when(
      str_detect(statistic_element, pattern = "temperature|°C") ~ "temperature",
      str_detect(statistic_element, pattern = "rainfall|rain|evaporation|mm") ~ "rainfall",
      str_detect(statistic_element, pattern = "wind|km/h") ~ "wind",
      str_detect(statistic_element, pattern = "sunshine|hours") ~ "sunshine",
      str_detect(statistic_element, pattern = "solar|MJ/(m*m)") ~ "solar",
      str_detect(statistic_element, pattern = "cloud|clear|hours|oktas") ~ "cloud_cover",
      str_detect(statistic_element, pattern = "humidity|%") ~ "humidity",
      .default= "other"
    ),
    .before = month
  ) |> 
  mutate(quantity_type = factor(quantity_type))

# Create the statistic_type column
# Need to use startswith() so I can find instances where Maximum is used instead of Highest
climate_mod= climate_mod |> 
  mutate(
    statistic_type = case_when(
      startsWith(statistic_element, "Mean") ~ "mean",
      startsWith(statistic_element, "Highest") ~ "highest",
      startsWith(statistic_element, "Highest") ~ "highest",
      startsWith(statistic_element, "Lowest") ~ "lowest", 
      startsWith(statistic_element, "Date of") ~ "date",
      startsWith(statistic_element, "Decile") ~ "decile",
      .default = "other"
    ),
    .before = month
  ) |> 
  mutate(statistic_type = factor(statistic_type))

head(climate_mod, n=13)

# A tibble: 13 × 9
   statistic_element       quantity_type statistic_type month value annual_value
   <chr>                   <fct>         <fct>          <fct> <chr> <chr>       
 1 "Mean maximum temperat… temperature   mean           Janu… 26.7  20.0        
 2 "Mean maximum temperat… temperature   mean           Febr… 26.7  20.0        
 3 "Mean maximum temperat… temperature   mean           March 24.3  20.0        
 4 "Mean maximum temperat… temperature   mean           April 20.4  20.0        
 5 "Mean maximum temperat… temperature   mean           May   16.7  20.0        
 6 "Mean maximum temperat… temperature   mean           June  13.7  20.0        
 7 "Mean maximum temperat… temperature   mean           July  13.2  20.0        
 8 "Mean maximum temperat… temperature   mean           Augu… 14.6  20.0        
 9 "Mean maximum temperat… temperature   mean           Sept… 16.9  20.0        
10 "Mean maximum temperat… temperature   mean           Octo… 19.5  20.0        
11 "Mean maximum temperat… temperature   mean           Nove… 22.1  20.0        
12 "Mean maximum temperat… temperature   mean           Dece… 24.7  20.0        
13 "Highest temperature (… temperature   highest        Janu… 46.0  46.8        
# ℹ 3 more variables: start_year <dbl>, end_year <dbl>, number_of_years <dbl>

Splitting the table

The value variable currently contains a mix of dates and numbers. The last part of tidying the data is to split the data into two tables. One containing the data for dates and the other containing the numerical values.

# Create the table with only dates
climate_tidy_dates = climate_mod |> 
  filter(statistic_type == "date")

glimpse(climate_tidy_dates)

Rows: 108
Columns: 9
$ statistic_element <chr> "Date of Highest temperature  °C  for years 1970 to …
$ quantity_type     <fct> temperature, temperature, temperature, temperature, …
$ statistic_type    <fct> date, date, date, date, date, date, date, date, date…
$ month             <fct> January, February, March, April, May, June, July, Au…
$ value             <chr> "25 Jan  2019", "07 Feb  2009", "08 Mar  1991", "10 …
$ annual_value      <chr> "07 Feb  2009", "07 Feb  2009", "07 Feb  2009", "07 …
$ start_year        <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ end_year          <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ number_of_years   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# Create the table with only dates
climate_tidy_numeric = climate_mod |> 
  filter(statistic_type != "date") |> 
    mutate(across(c("value","annual_value"), as.numeric))

glimpse(climate_tidy_numeric)

Rows: 576
Columns: 9
$ statistic_element <chr> "Mean maximum temperature (°C) for years 1970 to 202…
$ quantity_type     <fct> temperature, temperature, temperature, temperature, …
$ statistic_type    <fct> mean, mean, mean, mean, mean, mean, mean, mean, mean…
$ month             <fct> January, February, March, April, May, June, July, Au…
$ value             <dbl> 26.7, 26.7, 24.3, 20.4, 16.7, 13.7, 13.2, 14.6, 16.9…
$ annual_value      <dbl> 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0…
$ start_year        <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ end_year          <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ number_of_years   <dbl> 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, …