description = 'Darrin William Stephens, SIT741: Statistical Data Analysis, Task T02.P2'Tidy Data
Introduction
In this task the dplyr and tidyr packages from tidyverse are used to work with different datasets. The first sub-task involves using dplyr to transform an manipulate a dataset of flights from New York City. The second sub-task involves using dplyr and tidyr to make a dataset tidy.
Sub-task 1 steps
1. Define a string and print it
This section defines a string containing my name, the unit name and the task name.
Print the string using the cat() function.
cat(description)Darrin William Stephens, SIT741: Statistical Data Analysis, Task T02.P2
2. Read the dataset
The weather data in the “nycflights13” dataset was downloaded as a CSV file from https://github.com/tidyverse/nycflights13/blob/main/data-raw/weather.csv. The weather data can then be loaded into R.
weather_data = read.csv('weather.csv')3. Inspect the head of the dataset
To inspect the header and the first n columns of the dataset, I pass the dataframe to the head() function. As I don’t specify the number of rows to show the default number of rows (6) is shown.
head(weather_data) origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
1 EWR 2013 1 1 1 39.02 26.06 59.37 270 10.35702 NA
2 EWR 2013 1 1 2 39.02 26.96 61.63 250 8.05546 NA
3 EWR 2013 1 1 3 39.02 28.04 64.43 240 11.50780 NA
4 EWR 2013 1 1 4 39.92 28.04 62.21 250 12.65858 NA
5 EWR 2013 1 1 5 39.02 28.04 64.43 260 12.65858 NA
6 EWR 2013 1 1 6 37.94 28.04 67.21 240 11.50780 NA
precip pressure visib time_hour
1 0 1012.0 10 2013-01-01T06:00:00Z
2 0 1012.3 10 2013-01-01T07:00:00Z
3 0 1012.5 10 2013-01-01T08:00:00Z
4 0 1012.2 10 2013-01-01T09:00:00Z
5 0 1011.9 10 2013-01-01T10:00:00Z
6 0 1012.4 10 2013-01-01T11:00:00Z
4. Inspect the size of the dataset
To get the number of observations and the number of variables I use the dim() function. This returns the dimensions of the dataframe.
5. Change the “origin” variable to have the factor type
Before changing the class of the “origin” variable, I first check its current class.
class(weather_data$origin)[1] "character"
Now I use mutate to modify the “origin” column class to be factor. Since dplyr doesn’t modify the existing dataframe, I need to assign the new dataframe to the old to overwrite it.
weather_data = weather_data |>
mutate(origin = factor(origin))Check the mutate function has updated the class of “origin”.
class(weather_data$origin)[1] "factor"
6. Summarise the mean and median for each level in “origin”
The mean values for the measurements grouped by origin.
weather_data |>
group_by(origin) |>
summarise(across(temp:visib, ~ mean(.x, na.rm =TRUE)))# A tibble: 3 × 10
origin temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 55.5 41.9 63.1 195. 9.46 24.1 0.00504 1018. 9.28
2 JFK 54.5 41.9 65.2 204. 11.5 27.6 0.00398 1018. 9.18
3 LGA 55.8 40.6 59.3 200. 10.6 25.1 0.00438 1018. 9.31
The median values for the measurements grouped by origin.
weather_data |>
group_by(origin) |>
summarise(across(temp:visib, ~ median(.x, na.rm =TRUE)))# A tibble: 3 × 10
origin temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
<fct> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 55.9 43.0 62.6 230 9.21 23.0 0 1018. 10
2 JFK 54.0 43.0 65.8 220 10.4 26.5 0 1018. 10
3 LGA 55.9 41 57.3 220 10.4 24.2 0 1017. 10
7. Merge data from the airports dataset with latitude and longitude from flights dataset
Load the datasets.
library(nycflights13)
data(airports)
data(flights)Perform a left join between flights and airports selecting the faa, lat and lon columns from airport and renaming lat to o_lat and lon to o_lon. The join uses the origin column from flights and the lat and lon columns from airports. A similar left join to performed using the dest column from flights and lat and lon columns from airports.
flights |>
# Join for origin
left_join(
(airports |>
dplyr::select(faa, o_lat = lat, o_lon = lon)),
by = c("origin" ="faa")
) |>
# Join for destination
left_join(
(airports |>
dplyr::select(faa, d_lat = lat, d_lon = lon)),
by = c("dest" ="faa")
) # A tibble: 336,776 × 23
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, o_lat <dbl>, o_lon <dbl>,
# d_lat <dbl>, d_lon <dbl>
Sub-task 2
For this subtask the dataset is climate statistics for the Melbourne Airport from the Bureau of Meteorology, available from http://www.bom.gov.au/clim_data/cdio/tables/text/IDCJCM0035_086282.csv.
Load the data
The first eleven lines of the CSV file contains meta data and can be skipped.
Statistic.Element January
1 Mean maximum temperature (°C) for years 1970 to 2025 26.7
2 Highest temperature (°C) for years 1970 to 2025 46.0
3 Date of Highest temperature °C for years 1970 to 2025 25 Jan 2019
4 Lowest maximum temperature (°C) for years 1970 to 2025 13.9
5 Date of Lowest maximum temperature °C for years 1970 to 2025 05 Jan 1991
6 Decile 1 maximum temperature (°C) for years 1970 to 2022 19.7
February March April May June July
1 26.7 24.3 20.4 16.7 13.7 13.2
2 46.8 40.8 34.5 27.0 21.8 22.7
3 07 Feb 2009 08 Mar 1991 10 Apr 2005 07 May 2002 08 Jun 2005 30 Jul 1975
4 13.5 12.7 11.7 8.0 6.2 5.7
5 02 Feb 2005 29 Mar 1973 07 Apr 1995 31 May 1977 19 Jun 1975 03 Jul 1984
6 19.8 18.2 15.5 13.3 11.1 10.6
August September October November December Annual
1 14.6 16.9 19.5 22.1 24.7 20.0
2 25.6 30.2 36.0 41.6 44.6 46.8
3 29 Aug 1982 23 Sep 2017 12 Oct 2006 21 Nov 2019 20 Dec 2019 07 Feb 2009
4 6.5 8.2 10.4 11.6 13.0 5.7
5 16 Aug 1970 05 Sep 1995 16 Oct 1974 15 Nov 2006 04 Dec 1995 03 Jul 1984
6 11.5 12.8 14.4 15.9 18.2
Number.of.Years Start.Year End.Year
1 55 1970 2025
2 55 1970 2025
3 N/A 1970 2025
4 55 1970 2025
5 N/A 1970 2025
6 47 1970 2022
Tidy the data
Tidy data has the following characteristics
- Each column is a variable.
- Each row is an observation.
- Each cell is a single value.
- Each type of observational unit forms a table.
The climate data is messy because the columns are not variables and the rows are not individual observations. In this dataset the rows contain some measurements across 12 months.
Pivot longer
To tidy this data, the columns for the months need to be pivoted into rows. The Select operation is used to re-order the columns in the new dataframe after the pivot.
Renaming columns
Rename columns headings to remove spaces and make them the same case.
climate_mod = climate_mod |>
rename(
statistic_element = Statistic.Element,
number_of_years = Number.of.Years,
start_year = Start.Year,
end_year = End.Year,
annual_value = Annual
)
head(climate_mod, n=13)# A tibble: 13 × 7
statistic_element month value annual_value start_year end_year
<chr> <chr> <chr> <chr> <chr> <chr>
1 "Mean maximum temperature (°C) … Janu… 26.7 20.0 1970 2025
2 "Mean maximum temperature (°C) … Febr… 26.7 20.0 1970 2025
3 "Mean maximum temperature (°C) … March 24.3 20.0 1970 2025
4 "Mean maximum temperature (°C) … April 20.4 20.0 1970 2025
5 "Mean maximum temperature (°C) … May 16.7 20.0 1970 2025
6 "Mean maximum temperature (°C) … June 13.7 20.0 1970 2025
7 "Mean maximum temperature (°C) … July 13.2 20.0 1970 2025
8 "Mean maximum temperature (°C) … Augu… 14.6 20.0 1970 2025
9 "Mean maximum temperature (°C) … Sept… 16.9 20.0 1970 2025
10 "Mean maximum temperature (°C) … Octo… 19.5 20.0 1970 2025
11 "Mean maximum temperature (°C) … Nove… 22.1 20.0 1970 2025
12 "Mean maximum temperature (°C) … Dece… 24.7 20.0 1970 2025
13 "Highest temperature (°C) for y… Janu… 46.0 46.8 1970 2025
# ℹ 1 more variable: number_of_years <chr>
Correcting data types for some columns
# Correct the data types for columns
climate_mod = climate_mod |>
# Cannot change value and annual_value as they contains a mix of dates and numbers
mutate(across(c(5:7), as.numeric)) |>
mutate(month = factor(month))Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(c(5:7), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
head(climate_mod, n=1)# A tibble: 1 × 7
statistic_element month value annual_value start_year end_year number_of_years
<chr> <fct> <chr> <chr> <dbl> <dbl> <dbl>
1 "Mean maximum te… Janu… 26.7 20.0 1970 2025 55
Extracting categorical variables from the statistic_element
The values in the statistic_element variable appear to also contain categorical data (temperature, rainfall, mean, highest, lowest, etc) that could be used to group and filter the data. Therefore, I have extracted this information into two separate variables using keywords and units.
- quantity_type: temperature, rainfall, wind, sunshine, solar, cloud_cover and humidity.
- statistic_type: mean. highest, lowest, date, decile.
# Create the quantity_type column
climate_mod = climate_mod |>
mutate(
# Categorize by looking for key words
quantity_type = case_when(
str_detect(statistic_element, pattern = "temperature|°C") ~ "temperature",
str_detect(statistic_element, pattern = "rainfall|rain|evaporation|mm") ~ "rainfall",
str_detect(statistic_element, pattern = "wind|km/h") ~ "wind",
str_detect(statistic_element, pattern = "sunshine|hours") ~ "sunshine",
str_detect(statistic_element, pattern = "solar|MJ/(m*m)") ~ "solar",
str_detect(statistic_element, pattern = "cloud|clear|hours|oktas") ~ "cloud_cover",
str_detect(statistic_element, pattern = "humidity|%") ~ "humidity",
.default= "other"
),
.before = month
) |>
mutate(quantity_type = factor(quantity_type))# Create the statistic_type column
# Need to use startswith() so I can find instances where Maximum is used instead of Highest
climate_mod= climate_mod |>
mutate(
statistic_type = case_when(
startsWith(statistic_element, "Mean") ~ "mean",
startsWith(statistic_element, "Highest") ~ "highest",
startsWith(statistic_element, "Highest") ~ "highest",
startsWith(statistic_element, "Lowest") ~ "lowest",
startsWith(statistic_element, "Date of") ~ "date",
startsWith(statistic_element, "Decile") ~ "decile",
.default = "other"
),
.before = month
) |>
mutate(statistic_type = factor(statistic_type))
head(climate_mod, n=13)# A tibble: 13 × 9
statistic_element quantity_type statistic_type month value annual_value
<chr> <fct> <fct> <fct> <chr> <chr>
1 "Mean maximum temperat… temperature mean Janu… 26.7 20.0
2 "Mean maximum temperat… temperature mean Febr… 26.7 20.0
3 "Mean maximum temperat… temperature mean March 24.3 20.0
4 "Mean maximum temperat… temperature mean April 20.4 20.0
5 "Mean maximum temperat… temperature mean May 16.7 20.0
6 "Mean maximum temperat… temperature mean June 13.7 20.0
7 "Mean maximum temperat… temperature mean July 13.2 20.0
8 "Mean maximum temperat… temperature mean Augu… 14.6 20.0
9 "Mean maximum temperat… temperature mean Sept… 16.9 20.0
10 "Mean maximum temperat… temperature mean Octo… 19.5 20.0
11 "Mean maximum temperat… temperature mean Nove… 22.1 20.0
12 "Mean maximum temperat… temperature mean Dece… 24.7 20.0
13 "Highest temperature (… temperature highest Janu… 46.0 46.8
# ℹ 3 more variables: start_year <dbl>, end_year <dbl>, number_of_years <dbl>
Splitting the table
The value variable currently contains a mix of dates and numbers. The last part of tidying the data is to split the data into two tables. One containing the data for dates and the other containing the numerical values.
# Create the table with only dates
climate_tidy_dates = climate_mod |>
filter(statistic_type == "date")
glimpse(climate_tidy_dates)Rows: 108
Columns: 9
$ statistic_element <chr> "Date of Highest temperature °C for years 1970 to …
$ quantity_type <fct> temperature, temperature, temperature, temperature, …
$ statistic_type <fct> date, date, date, date, date, date, date, date, date…
$ month <fct> January, February, March, April, May, June, July, Au…
$ value <chr> "25 Jan 2019", "07 Feb 2009", "08 Mar 1991", "10 …
$ annual_value <chr> "07 Feb 2009", "07 Feb 2009", "07 Feb 2009", "07 …
$ start_year <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ end_year <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ number_of_years <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# Create the table with only dates
climate_tidy_numeric = climate_mod |>
filter(statistic_type != "date") |>
mutate(across(c("value","annual_value"), as.numeric))
glimpse(climate_tidy_numeric)Rows: 576
Columns: 9
$ statistic_element <chr> "Mean maximum temperature (°C) for years 1970 to 202…
$ quantity_type <fct> temperature, temperature, temperature, temperature, …
$ statistic_type <fct> mean, mean, mean, mean, mean, mean, mean, mean, mean…
$ month <fct> January, February, March, April, May, June, July, Au…
$ value <dbl> 26.7, 26.7, 24.3, 20.4, 16.7, 13.7, 13.2, 14.6, 16.9…
$ annual_value <dbl> 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0…
$ start_year <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970…
$ end_year <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ number_of_years <dbl> 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, …