Data types
Introduction
In this task I use some basic R commands for working with a dataset: importing it and inspecting its size, data types and variable types. I also create a function that summarises the categorical and numerical variables in a dataframe.
Task steps
Define a string
This section defines a string containing my name, the unit name and the task name.
description = 'Darrin William Stephens, SIT741: Statistical Data Analysis, Task T01.P1'
Print the string using the cat() function.
cat(description)
Darrin William Stephens, SIT741: Statistical Data Analysis, Task T01.P1
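As a small aside on why cat() suits this step, the sketch below contrasts it with print() on a character string. The string `msg` is a hypothetical placeholder, not the one defined above.

```r
# Hypothetical placeholder string used only for this illustration
msg = "Example Student, Example Unit, Example Task"

# cat() writes the raw text; a newline must be added explicitly
cat(msg, "\n", sep = "")

# print() shows the R representation: an index prefix and quotes
print(msg)
```

cat() is the better fit when the text is meant to be read as plain output, while print() is more useful when checking what an object actually is.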
The dataset
The dataset I chose for this task is the blood pressure measurement data from the National Health and Nutrition Examination Survey (NHANES), conducted by the US Centers for Disease Control and Prevention. As imported here, it contains three categorical and nine numerical variables.
Based on 2017-March 2020 Blood Pressure - Oscillometric Measurement (P_BPXO). Published: May 2021
Description: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/P_BPXO.htm
License: public domain as per https://www.cdc.gov/other/agencymaterials.html
Information about dataset variables is shown in the table below.
| Variable name | Data type | Description |
|---|---|---|
| SEQN | Numeric | Respondent sequence number |
| BPAOARM | Character | Arm selected - oscillometric |
| BPAOCSZ | Numeric | Coded cuff size - oscillometric |
| BPXOSY1 | Numeric | Systolic - 1st oscillometric reading |
| BPXODI1 | Numeric | Diastolic - 1st oscillometric reading |
| BPXOSY2 | Numeric | Systolic - 2nd oscillometric reading |
| BPXODI2 | Numeric | Diastolic - 2nd oscillometric reading |
| BPXOSY3 | Numeric | Systolic - 3rd oscillometric reading |
| BPXODI3 | Numeric | Diastolic - 3rd oscillometric reading |
| BPXOPLS1 | Numeric | Pulse - 1st oscillometric reading |
| BPXOPLS2 | Numeric | Pulse - 2nd oscillometric reading |
| BPXOPLS3 | Numeric | Pulse - 3rd oscillometric reading |
I used this dataset in a previous Deakin unit and already had a cleaned CSV file (missing values removed) available.
Import the dataset
The class of each column is set on import. The first three columns are specified as categorical (factor) and the remaining nine as numeric.
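The import code is not shown in this report, so the following is only a minimal sketch of how column classes can be set on import with the colClasses argument of read.csv(). The file path, the cut-down column set and the name `demo` are assumptions made for illustration.

```r
# Hypothetical stand-in file: a tiny CSV written to a temporary path,
# because the real cleaned file name is not given in this report.
csv_path = tempfile(fileext = ".csv")
writeLines(c(
  "SEQN,BPAOARM,BPAOCSZ,BPXOSY1,BPXODI1",
  "109264,R,3,109,67",
  "109266,R,4,99,56"
), csv_path)

# Set the class of every column on import:
# the first three as factors, the rest as numeric.
col_classes = c(rep("factor", 3), rep("numeric", 2))
demo = read.csv(csv_path, colClasses = col_classes)

# Confirm the class assigned to each column
sapply(demo, class)
```

For the full twelve-column file the same pattern would use rep("numeric", 9) for the numeric part.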
Inspect the head of the dataset
To inspect the header and the first n rows of the dataset, I pass the dataframe to the head() function. As I don't specify the number of rows, the default of 6 rows is shown.
head(data)
    SEQN BPAOARM BPAOCSZ BPXOSY1 BPXODI1 BPXOSY2 BPXODI2 BPXOSY3 BPXODI3
1 109264       R       3     109      67     109      68     106      66
2 109266       R       4      99      56      99      55      99      52
3 109270       R       3     123      73     124      77     127      70
4 109271       R       4     102      65     108      68     111      68
5 109273       R       3     116      68     110      66     115      68
6 109277       R       3     104      58     102      54     101      53
  BPXOPLS1 BPXOPLS2 BPXOPLS3
1       94       95       91
2       68       66       66
3       95       98       93
4       73       71       70
5       71       70       70
6       67       69       77
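The number of rows returned by head() is controlled by its n argument. The sketch below shows this on a small made-up dataframe (`toy` is a hypothetical stand-in, not the imported dataset).

```r
# Toy dataframe standing in for the imported dataset
toy = data.frame(id = 1:10, value = seq(10, 100, by = 10))

head(toy)          # default: first 6 rows
head(toy, n = 3)   # first 3 rows
head(toy, n = -8)  # all but the last 8 rows, i.e. the first 2
```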
Inspect the size of the dataset
To get the number of observations and the number of variables I use the dim() function. This returns the dimensions of the dataframe.
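As an illustration of what dim() returns, here is a minimal sketch on a made-up dataframe (`toy` is hypothetical). For the dataset in this report, the frequency table totals and the variable table suggest the result should be 9410 observations and 12 variables.

```r
# Toy dataframe: 3 observations of 2 variables
toy = data.frame(
  arm = factor(c("R", "R", "L")),
  systolic = c(109, 99, 123)
)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only
```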
Inspect data types
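To inspect the data types I use the sapply() function to apply typeof() to each column (variable) in the dataframe. The typeof() function returns the internal storage type of each column. The three factor columns report "integer" because R stores factors internally as integer codes.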
sapply(data, typeof)
     SEQN   BPAOARM   BPAOCSZ   BPXOSY1   BPXODI1   BPXOSY2   BPXODI2   BPXOSY3
"integer" "integer" "integer"  "double"  "double"  "double"  "double"  "double"
  BPXODI3  BPXOPLS1  BPXOPLS2  BPXOPLS3
 "double"  "double"  "double"  "double"
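The "integer" result for the categorical columns can look surprising, so the sketch below compares typeof() with class() on a small made-up factor (`arm` and `systolic` are hypothetical vectors, not columns of the imported dataset).

```r
# A small factor and a numeric vector for comparison
arm = factor(c("R", "R", "L"))
systolic = c(109, 99, 123)

typeof(arm)       # "integer": the internal storage of a factor
class(arm)        # "factor": the variable type seen by most functions
levels(arm)       # "L" "R": labels, sorted alphabetically by default
as.integer(arm)   # 2 2 1: the integer codes pointing into the levels

typeof(systolic)  # "double"
class(systolic)   # "numeric"
```

typeof() answers how a column is stored, while class() answers how it is treated, which is why sapply(data, class) is often the more informative check for variable types.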
Function to summarise
This section defines a function that takes a dataframe and summarises the data within it. The function loops through all columns of the dataframe and checks whether each column is categorical or numerical. For categorical columns a frequency table is produced, and for numerical columns the mean is calculated. The summary information for each column is printed.
summarise = function(df){
  # This function loops through all columns of the provided dataframe
  # For each column it checks if the column is categorical or numerical
  # If categorical a frequency table is produced
  # If numerical the mean is calculated
  for (col_name in names(df)){
    # Print column name
    cat("Variable:", col_name)
    # Check if categorical
    if (is.factor(df[[col_name]])){
      cat(" is categorical.\n")
      cat("Frequency table:")
      print(table(df[[col_name]]))
      cat("\n")
    }
    # Check if numerical
    if (is.numeric(df[[col_name]])){
      cat(" is numerical.\n")
      cat("Mean:", mean(df[[col_name]]), "\n")
      cat("\n")
    }
  }
}
Summarise the imported dataset
The first column (SEQN) of the imported dataset is a nominal categorical variable that represents an identification number for each person in the survey. Since identification numbers are unique, a frequency table for this variable would consist of 9410 ones. That is too much information to print, so I have removed the first column from the dataframe for this section. The code below creates a new dataframe by excluding the first column from the original dataframe. Each column (variable) in the new dataframe is then summarised using the summarise function.
new_data = data[-c(1)]
summarise(new_data)
Variable: BPAOARM is categorical.
Frequency table:
   L    R
  47 9363
Variable: BPAOCSZ is categorical.
Frequency table:
   2    3    4    5
 519 4151 4192  548
Variable: BPXOSY1 is numerical.
Mean: 119.7632
Variable: BPXODI1 is numerical.
Mean: 71.96801
Variable: BPXOSY2 is numerical.
Mean: 119.51
Variable: BPXODI2 is numerical.
Mean: 71.43581
Variable: BPXOSY3 is numerical.
Mean: 119.4749
Variable: BPXODI3 is numerical.
Mean: 71.17279
Variable: BPXOPLS1 is numerical.
Mean: 71.18618
Variable: BPXOPLS2 is numerical.
Mean: 71.94878
Variable: BPXOPLS3 is numerical.
Mean: 72.55069
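The summarise function above prints its results, which suits a report but makes the values hard to reuse. As a possible variation, the sketch below returns the summaries in a named list instead. The name `summarise_list`, the `toy` dataframe and the use of na.rm = TRUE are assumptions for illustration and are not part of the original function.

```r
# Hypothetical variant: return summaries instead of printing them
summarise_list = function(df) {
  lapply(df, function(column) {
    if (is.factor(column)) {
      table(column)                 # frequency table for categorical columns
    } else if (is.numeric(column)) {
      mean(column, na.rm = TRUE)    # mean for numerical columns, skipping NAs
    } else {
      NULL                          # other column types are not summarised
    }
  })
}

# Toy dataframe with one missing value
toy = data.frame(
  arm = factor(c("R", "R", "L")),
  systolic = c(110, 100, NA)
)

result = summarise_list(toy)
result$arm       # frequency table of arm
result$systolic  # mean of the non-missing readings
```

Because the cleaned dataset in this report has no missing values, na.rm = TRUE would not change its means; it only matters if the function is reused on uncleaned data.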