Data types

Introduction

In this task I use some basic R commands for working with a dataset: importing it, inspecting its size, and inspecting the data types and variable types. I also create a function that summarises the categorical and numerical variables in a dataframe.

Task steps

Define a string

This section defines a string containing my name, the unit name and the task name.

description = 'Darrin William Stephens, SIT741: Statistical Data Analysis, Task T01.P1'

Print the string using the cat() function.

cat(description)

Darrin William Stephens, SIT741: Statistical Data Analysis, Task T01.P1
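
As a small aside on why cat() is used here rather than print(), the sketch below contrasts the two. The string msg is a hypothetical example, not the one defined above.

```r
# Hypothetical example string, not the one used in this task
msg <- "Hello, R"

# print() shows the R representation: an index prefix and quotes
print(msg)

# cat() writes the raw text; a trailing newline must be added explicitly
cat(msg, "\n")

# cat() joins several arguments with a single space by default
cat("Task:", "T01.P1", "\n")
```

Because cat() does not add a newline on its own, the later calls in this report pass "\n" as the final argument.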

The dataset

The dataset I chose for this task is the blood pressure measurement data from the US Centers for Disease Control and Prevention National Health and Nutrition Examination Survey. It contains three categorical and nine numerical variables.

Based on 2017-March 2020 Blood Pressure - Oscillometric Measurement (P_BPXO). Published: May 2021
Description: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_BPXO.htm
License: public domain as per https://www.cdc.gov/other/agencymaterials.html

Information about dataset variables is shown in the table below.

Variable name	Data type	Description
SEQN	Numeric	Respondent sequence number
BPAOARM	Character	Arm selected - oscillometric
BPAOCSZ	Numeric	Coded cuff size - oscillometric
BPXOSY1	Numeric	Systolic - 1st oscillometric reading
BPXODI1	Numeric	Diastolic - 1st oscillometric reading
BPXOSY2	Numeric	Systolic - 2nd oscillometric reading
BPXODI2	Numeric	Diastolic - 2nd oscillometric reading
BPXOSY3	Numeric	Systolic - 3rd oscillometric reading
BPXODI3	Numeric	Diastolic - 3rd oscillometric reading
BPXOPLS1	Numeric	Pulse - 1st oscillometric reading
BPXOPLS2	Numeric	Pulse - 2nd oscillometric reading
BPXOPLS3	Numeric	Pulse - 3rd oscillometric reading

I used this dataset in a previous Deakin unit and already had a cleaned CSV file (missing values removed) available.

Import the dataset

The class of the columns is set on import. The first three are specified as categorical (factor) and the next nine are specified as numeric.

data = read.csv('Blood_Pressure_Measurement_cleaned.csv', colClasses=c(rep("factor", 3), rep("numeric", 9)))
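
The effect of colClasses can be checked on a small scale. The sketch below is a minimal illustration that assumes a made-up three-column CSV passed as a string through the text argument of read.csv(); the values and the toy name are hypothetical and are not taken from the real file.

```r
# Toy CSV with made-up values; the column names mimic the real file
csv_text <- "SEQN,BPAOARM,BPXOSY1
1,R,109
2,L,99
3,R,123"

# text = lets read.csv() parse a string instead of a file on disk
toy <- read.csv(text = csv_text,
                colClasses = c(rep("factor", 2), "numeric"))

# class() reports the class that was requested for each column on import
col_classes <- sapply(toy, class)
print(col_classes)
```

Without colClasses, read.csv() would guess the types, and the SEQN column would typically be read as an integer rather than a factor.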

Inspect the head of the dataset

To inspect the header and the first n rows of the dataset, I pass the dataframe to the head() function. As I don’t specify the number of rows to show, the default number of rows (6) is shown.

head(data)

    SEQN BPAOARM BPAOCSZ BPXOSY1 BPXODI1 BPXOSY2 BPXODI2 BPXOSY3 BPXODI3
1 109264       R       3     109      67     109      68     106      66
2 109266       R       4      99      56      99      55      99      52
3 109270       R       3     123      73     124      77     127      70
4 109271       R       4     102      65     108      68     111      68
5 109273       R       3     116      68     110      66     115      68
6 109277       R       3     104      58     102      54     101      53
  BPXOPLS1 BPXOPLS2 BPXOPLS3
1       94       95       91
2       68       66       66
3       95       98       93
4       73       71       70
5       71       70       70
6       67       69       77
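
The number of rows shown can be changed with the n argument of head(). The sketch below uses a made-up ten-row dataframe (toy is a hypothetical name) to show this, along with the companion tail() function.

```r
# Toy dataframe with made-up values (10 rows)
toy <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

first_three <- head(toy, n = 3)        # first 3 rows
last_two <- tail(toy, n = 2)           # last 2 rows
all_but_last_two <- head(toy, n = -2)  # a negative n drops rows from the end

print(first_three)
print(last_two)
```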

Inspect the size of the dataset

To get the number of observations and the number of variables I use the dim() function. This returns the dimensions of the dataframe.

# Get the dimensions of the dataset
dims = dim(data)
obs = dims[1]
vars = dims[2]

# Print the values
cat("Number of observations:", obs, "\n")

Number of observations: 9410

cat("Number of variables:", vars, "\n")

Number of variables: 12
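
As an alternative sketch, nrow() and ncol() return the same two values directly, without indexing into the result of dim(). The dataframe below is made up for illustration.

```r
# Toy dataframe with made-up values: 4 observations, 3 variables
toy <- data.frame(a = 1:4,
                  b = c("w", "x", "y", "z"),
                  c = c(1.5, 2.5, 3.5, 4.5))

# dim() returns rows first, then columns
dims <- dim(toy)

# nrow() and ncol() give the same values individually
cat("Number of observations:", nrow(toy), "\n")
cat("Number of variables:", ncol(toy), "\n")
```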

Inspect data types

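To inspect the data types I use the sapply function to apply the typeof function to each column in the dataframe. The typeof function returns the internal storage type of each column (variable). The three factor columns are reported as "integer" because R stores a factor as integer codes with a set of level labels.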

sapply(data, typeof)

     SEQN   BPAOARM   BPAOCSZ   BPXOSY1   BPXODI1   BPXOSY2   BPXODI2   BPXOSY3 
"integer" "integer" "integer"  "double"  "double"  "double"  "double"  "double" 
  BPXODI3  BPXOPLS1  BPXOPLS2  BPXOPLS3 
 "double"  "double"  "double"  "double"

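The difference between the storage type and the class of a column can be seen on a small example. The sketch below uses made-up vectors (arm and reading are hypothetical names) and compares typeof() with class().

```r
# A factor with made-up values, similar in spirit to the BPAOARM column
arm <- factor(c("R", "R", "L"))

arm_type <- typeof(arm)       # storage type: "integer"
arm_class <- class(arm)       # class: "factor"
arm_levels <- levels(arm)     # labels, sorted alphabetically: "L" "R"
arm_codes <- as.integer(arm)  # underlying integer codes: 2 2 1

# A plain numeric vector is stored as double and has class "numeric"
reading <- c(109, 99, 123)
reading_type <- typeof(reading)
reading_class <- class(reading)

cat("Factor:", arm_type, "/", arm_class, "\n")
cat("Numeric:", reading_type, "/", reading_class, "\n")
```

Applying class instead of typeof in the sapply call is one way to see which columns are factors.
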
Function to summarise

This section defines a function that takes a dataframe and summarises the data within it. The function loops through all columns of the dataframe and, for each column, checks whether the column is categorical or numerical. For categorical columns a frequency table is produced, and for numerical columns the mean is calculated. The summary information for each column is printed.

summarise = function(df){
  # This function loops through all columns of the provided dataframe
  # For each column it checks if the column is categorical or numerical
  # If categorical a frequency table is produced
  # If numerical the mean is calculated
  for (col_name in names(df)){
    # Print column name
    cat("Variable:", col_name)
    
    # Check if categorical 
    if (is.factor(df[[col_name]])){
      cat(" is categorical.\n")
      cat("Frequency table:")
      print(table(df[[col_name]]))
      cat("\n")
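    } else if (is.numeric(df[[col_name]])){
      # Otherwise check if numerical
      cat(" is numerical.\n")
      cat("Mean:", mean(df[[col_name]]), "\n")
      cat("\n")
    } else {
      # Neither factor nor numeric (e.g. character): end the output line cleanly
      cat(" is neither categorical nor numerical.\n\n")
    }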
  }
}
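
The two summaries the function produces can be reproduced by hand on a small example. The sketch below assumes a made-up dataframe (toy, with hypothetical columns arm and systolic) and also shows base R's built-in summary() as a cross-check, which reports level counts for factors and quartiles plus the mean for numeric columns.

```r
# Toy dataframe with made-up values: one factor and one numeric column
toy <- data.frame(
  arm = factor(c("R", "R", "L", "R")),
  systolic = c(109, 99, 123, 102)
)

# Frequency table for the categorical column
arm_counts <- table(toy$arm)
print(arm_counts)

# Mean for the numerical column
systolic_mean <- mean(toy$systolic)
cat("Mean:", systolic_mean, "\n")

# Built-in summary of every column, for comparison
print(summary(toy))
```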

Summarise the imported dataset

The first column (SEQN) of the imported dataset is a nominal categorical variable that holds an identification number for each person in the survey. Since identification numbers are unique, a frequency table for this variable would consist of 9410 ones. That is too much output to print, so I removed the first column from the dataframe for this section. The code below creates a new dataframe that excludes the first column of the original dataframe. Each column (variable) in the new dataframe is then summarised using the summarise function.

new_data = data[-c(1)]
summarise(new_data)

Variable: BPAOARM is categorical.
Frequency table:
   L    R 
  47 9363 

Variable: BPAOCSZ is categorical.
Frequency table:
   2    3    4    5 
 519 4151 4192  548 

Variable: BPXOSY1 is numerical.
Mean: 119.7632 

Variable: BPXODI1 is numerical.
Mean: 71.96801 

Variable: BPXOSY2 is numerical.
Mean: 119.51 

Variable: BPXODI2 is numerical.
Mean: 71.43581 

Variable: BPXOSY3 is numerical.
Mean: 119.4749 

Variable: BPXODI3 is numerical.
Mean: 71.17279 

Variable: BPXOPLS1 is numerical.
Mean: 71.18618 

Variable: BPXOPLS2 is numerical.
Mean: 71.94878 

Variable: BPXOPLS3 is numerical.
Mean: 72.55069
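
The reasoning for dropping SEQN before summarising can be checked with a short sketch. The example below assumes a made-up three-row dataframe (toy and no_id are hypothetical names); it tests that an identifier column has no repeated values and drops the column by name rather than by position.

```r
# Toy dataframe with made-up rows; SEQN plays the role of the identifier
toy <- data.frame(
  SEQN = factor(c(109264, 109266, 109270)),
  BPAOARM = factor(c("R", "R", "L"))
)

# anyDuplicated() returns 0 when every value is unique
dup_index <- anyDuplicated(toy$SEQN)

# With unique identifiers, every count in the frequency table is one
all_ones <- all(table(toy$SEQN) == 1)

# Drop the identifier by name instead of by position
no_id <- toy[names(toy) != "SEQN"]

cat("First duplicate index:", dup_index, "\n")
cat("All frequencies equal one:", all_ones, "\n")
cat("Remaining columns:", names(no_id), "\n")
```

Dropping by name keeps working if the column order of the file changes, whereas a positional index would silently remove a different column.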