Data Wrangling

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Introduction

This task involves several data wrangling steps: loading the data, identifying missing values, handling inconsistent and missing data, aggregating the data into useful summary statistics, transforming features (encoding), and scaling the data.

Data

The dataset used in this notebook is the Microclimate sensors data from the City of Melbourne. The data is licensed under CC BY, allowing it to be freely used for this exercise. The dataset contains climate readings from sensors located within the city of Melbourne.

Load data

The data is loaded into a pandas dataframe from the downloaded CSV file. After loading the data, the first five rows of the dataframe are inspected using the head() function.

# Load the data file into a dataframe
climate_df = pd.read_csv("microclimate-sensors-data.csv", comment="#")

# Inspect the head of the data
print(climate_df.head())
print()
            Device_id                       Time  \
0  ICTMicroclimate-08  2025-02-09T11:54:37+11:00   
1  ICTMicroclimate-11  2025-02-09T12:02:11+11:00   
2  ICTMicroclimate-05  2025-02-09T12:03:24+11:00   
3  ICTMicroclimate-01  2025-02-09T12:02:43+11:00   
4  ICTMicroclimate-09  2025-02-09T12:17:37+11:00   

                                      SensorLocation  \
0  Swanston St - Tram Stop 13 adjacent Federation...   
1                                   1 Treasury Place   
2                 Enterprize Park - Pole ID: COM1667   
3                    Birrarung Marr Park - Pole 1131   
4  SkyFarm (Jeff's Shed). Rooftop - Melbourne Con...   

                    LatLong  MinimumWindDirection  AverageWindDirection  \
0  -37.8184515, 144.9678474                   0.0                 153.0   
1   -37.812888, 144.9750857                   0.0                 144.0   
2  -37.8204083, 144.9591192                   0.0                  45.0   
3  -37.8185931, 144.9716404                   NaN                 150.0   
4  -37.8223306, 144.9521696                   0.0                 241.0   

   MaximumWindDirection  MinimumWindSpeed  AverageWindSpeed  GustWindSpeed  \
0                 358.0               0.0               3.9            7.9   
1                 356.0               0.0               2.0            7.8   
2                 133.0               0.0               1.5            2.7   
3                   NaN               NaN               1.6            NaN   
4                 359.0               0.0               0.9            4.4   

   AirTemperature  RelativeHumidity  AtmosphericPressure  PM25  PM10  \
0            23.9         57.300000               1009.7   0.0   0.0   
1            24.5         56.200000               1005.3   0.0   0.0   
2            25.0         60.000000               1009.6   1.0   3.0   
3            23.1         61.099998               1009.0   0.0   5.0   
4            25.6         53.700000               1007.9   0.0   0.0   

       Noise  
0  80.500000  
1  62.900000  
2  68.500000  
3  51.700001  
4  60.200000  

Inspect the data for missing values

The isnull() function is used to identify cells in the dataframe with missing or NaN values. These are summed to calculate the total number of missing values for each feature. To understand the distribution of missing data across sensors, the dataframe is also grouped by Device_id before the missing values are summed.

# Check how many null values in the data frame
print("Feature Name           Number of missing entries")
print(climate_df.isnull().sum())
Feature Name           Number of missing entries
Device_id                   0
Time                        0
SensorLocation           6143
LatLong                 11483
MinimumWindDirection    39343
AverageWindDirection      503
MaximumWindDirection    39501
MinimumWindSpeed        39501
AverageWindSpeed          503
GustWindSpeed           39501
AirTemperature            503
RelativeHumidity          503
AtmosphericPressure       503
PM25                    18779
PM10                    18779
Noise                   18779
dtype: int64
# Group the data by the Device_id and then sum the null values to understand the distribution of missing data across devices.
climate_df.groupby("Device_id").agg(lambda x:  x.isnull().sum())
Time SensorLocation LatLong MinimumWindDirection AverageWindDirection MaximumWindDirection MinimumWindSpeed AverageWindSpeed GustWindSpeed AirTemperature RelativeHumidity AtmosphericPressure PM25 PM10 Noise
Device_id
ICTMicroclimate-01 0 107 107 39026 28 39026 39026 28 39026 28 28 28 28 28 28
ICTMicroclimate-02 0 108 108 16 16 16 16 16 16 16 16 16 16 16 16
ICTMicroclimate-03 0 107 107 41 41 41 41 41 41 41 41 41 41 41 41
ICTMicroclimate-04 0 9 9 3 3 3 3 3 3 3 3 3 3 3 3
ICTMicroclimate-05 0 0 0 10 10 10 10 10 10 10 10 10 10 10 10
ICTMicroclimate-06 0 107 107 8 8 8 8 8 8 8 8 8 8 8 8
ICTMicroclimate-07 0 107 107 10 10 10 10 10 10 10 10 10 10 10 10
ICTMicroclimate-08 0 103 103 5 5 5 5 5 5 5 5 5 5 5 5
ICTMicroclimate-09 0 107 107 8 8 8 8 8 8 8 8 8 8 8 8
ICTMicroclimate-10 0 4695 9390 64 70 70 70 70 70 70 70 70 70 70 70
ICTMicroclimate-11 0 645 1290 152 304 304 304 304 304 304 304 304 304 304 304
aws5-0999 0 48 48 0 0 0 0 0 0 0 0 0 18276 18276 18276

Data Imputation

From the above data inspections, it is observed that some features have substantial missing data, e.g. PM25, PM10 and Noise for device aws5-0999. The values of certain features, such as sensor location, latitude and longitude, should be constant for a given device, making the correction of their missing data straightforward. The values of other features are measurements, so correcting their missing data requires some assumptions and computation.

SensorLocation

The number of unique sensor locations for each device is one, confirming the assumption that each sensor is in a fixed location. Therefore, the existing values can be used to correct any missing values in the dataset. To correct the missing sensor locations, I grouped the data by Device_id and took the first SensorLocation entry in each group, then used this per-device mapping with map() to fill in the missing sensor locations.

# Get the number of unique locations for each sensor. 
num_locations = climate_df.groupby("Device_id")[["SensorLocation"]].nunique()
print("Number of unique sensor locations for each device:", num_locations)
Number of unique sensor locations for each device:                     SensorLocation
Device_id                         
ICTMicroclimate-01               1
ICTMicroclimate-02               1
ICTMicroclimate-03               1
ICTMicroclimate-04               1
ICTMicroclimate-05               1
ICTMicroclimate-06               1
ICTMicroclimate-07               1
ICTMicroclimate-08               1
ICTMicroclimate-09               1
ICTMicroclimate-10               1
ICTMicroclimate-11               1
aws5-0999                        1
# Create sensor location mapping for each device
device_location_map = climate_df.groupby("Device_id")[["SensorLocation"]].first()

# Apply the device location map to correct the missing sensor locations 
climate_df["SensorLocation"] = climate_df["Device_id"].map(device_location_map["SensorLocation"])

# Check the number of missing sensor locations after correction
print("Number of missing sensor locations values:", climate_df["SensorLocation"].isnull().sum())
Number of missing sensor locations values: 0

LatLong

The same approach used to correct the SensorLocation is used for the LatLong. First, I check that each device only has one unique value, then I retrieve the valid value for each device and use it to correct the missing values.

# Get the number of unique latitude/longitude locations for each sensor. 
num_latlong = climate_df.groupby("Device_id")[["LatLong"]].nunique()
print("Number of unique sensor latitude/longitude locations for each device:")
print(num_latlong)
Number of unique sensor latitude/longitude locations for each device:
                           LatLong
Device_id                         
ICTMicroclimate-01               1
ICTMicroclimate-02               1
ICTMicroclimate-03               1
ICTMicroclimate-04               1
ICTMicroclimate-05               1
ICTMicroclimate-06               1
ICTMicroclimate-07               1
ICTMicroclimate-08               1
ICTMicroclimate-09               1
ICTMicroclimate-10               1
ICTMicroclimate-11               1
aws5-0999                        1
# Create lat/long mapping for each device
device_latlong_map = climate_df.groupby("Device_id")[["LatLong"]].first()

# Apply the device lat/long map to correct the missing sensor latitudes and longitudes 
climate_df["LatLong"] = climate_df["Device_id"].map(device_latlong_map["LatLong"])

# Check the number of missing LatLong values after correction
print("Number of missing sensor latitude/logitude values:", climate_df["LatLong"].isnull().sum())
Number of missing sensor latitude/logitude values: 0

Continuous features

Inconsistent data

Before making any corrections to the continuous features, it is a good idea to inspect some basic statistics.

climate_df.describe()
MinimumWindDirection AverageWindDirection MaximumWindDirection MinimumWindSpeed AverageWindSpeed GustWindSpeed AirTemperature RelativeHumidity AtmosphericPressure PM25 PM10 Noise
count 352450.000000 391290.000000 352292.000000 352292.000000 391290.000000 352292.000000 391290.000000 391290.000000 391290.000000 373014.000000 373014.000000 373014.000000
mean 20.447147 166.414570 305.508777 4.794662 1.058048 3.441384 16.378137 66.467364 1002.152184 20.116674 8.138810 66.417315
std 57.606792 125.167169 89.833971 38.599842 0.980283 2.607634 6.075031 18.367640 108.405272 118.284124 12.522434 13.052584
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.800000 4.000000 20.900000 0.000000 0.000000 0.000000
25% 0.000000 43.000000 314.000000 0.000000 0.400000 1.500000 12.100000 55.100000 1008.800000 1.000000 3.000000 59.000000
50% 0.000000 159.000000 353.000000 0.000000 0.800000 2.800000 15.900000 68.200000 1014.700000 3.000000 5.000000 68.300000
75% 0.000000 301.000000 358.000000 0.100000 1.500000 4.800000 19.800000 79.600000 1020.100000 7.000000 9.000000 72.400000
max 359.000000 359.000000 360.000000 359.000000 11.100000 52.500000 40.800000 99.800003 1042.900000 1030.700000 308.000000 131.100000

From the above table, there appear to be issues with the MinimumWindSpeed, AtmosphericPressure and PM25 features.

  • MinimumWindSpeed: The maximum value of MinimumWindSpeed is 359 m/s, which is not physically sensible and is beyond the upper bound (60 m/s) of the instrument’s measurement range.
  • AtmosphericPressure: The minimum atmospheric pressure value of 20.9 hPa is implausibly low for a surface measurement.
  • PM25: The maximum value of 1030.7 is beyond the measurement range of the instrument (0-1000 micrograms/m^3).
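
As a quick sanity check before drilling down per device, the number of readings falling outside these bounds can be counted directly. This is a minimal sketch: the wind-speed and PM2.5 limits come from the instrument ranges noted above, while the atmospheric-pressure window (roughly 870-1085 hPa) is an assumed plausible sea-level range.

# Sketch: count readings outside plausible ranges (pressure bounds are assumed, see note above)
plausible_ranges = {
    "MinimumWindSpeed": (0, 60),         # m/s, instrument upper bound
    "AtmosphericPressure": (870, 1085),  # hPa, assumed plausible sea-level window
    "PM25": (0, 1000),                   # micrograms/m^3, instrument range
}
for feature, (low, high) in plausible_ranges.items():
    out_of_range = ((climate_df[feature] < low) | (climate_df[feature] > high)).sum()
    print(f"{feature}: {out_of_range} readings outside [{low}, {high}]")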

To inspect further, I calculated the descriptive statistics for each device.

# Calculate the min, max, mean and standard deviation for each device, only for the MinimumWindSpeed, AtmosphericPressure, and PM25 features
climate_df.groupby("Device_id")[["MinimumWindSpeed", "AtmosphericPressure", "PM25"]].agg(["min", "max", "mean", "std"])
MinimumWindSpeed AtmosphericPressure PM25
min max mean std min max mean std min max mean std
Device_id
ICTMicroclimate-01 NaN NaN NaN NaN 991.400024 1040.300049 1015.682445 8.009809 0.0 235.0 5.278630 11.171503
ICTMicroclimate-02 0.0 11.4 0.577927 0.871374 987.300000 1036.000000 1011.590638 7.957121 1.0 147.0 7.800387 9.512505
ICTMicroclimate-03 0.0 10.1 0.725178 0.824416 986.400000 1035.200000 1010.733407 7.959522 1.0 184.0 6.979975 8.255256
ICTMicroclimate-04 0.0 1.7 0.001496 0.024614 996.700000 1037.900000 1018.512888 7.243488 0.0 81.0 5.085106 7.566532
ICTMicroclimate-05 0.0 7.8 0.197126 0.360950 993.600000 1037.700000 1015.707429 7.856305 1.0 116.0 6.684698 8.098758
ICTMicroclimate-06 0.0 0.8 0.002879 0.027912 992.100000 1040.900000 1016.449209 7.983454 0.0 92.0 5.183268 8.530823
ICTMicroclimate-07 0.0 1.1 0.000896 0.018917 994.300000 1042.900000 1018.497433 8.014416 0.0 62.0 5.542080 7.993104
ICTMicroclimate-08 0.0 1.5 0.004890 0.048268 991.900000 1040.900000 1016.326436 8.032214 0.0 98.0 7.447921 9.736321
ICTMicroclimate-09 0.0 0.4 0.000449 0.009074 990.700000 1039.700000 1015.006325 7.896735 0.0 307.0 7.610188 24.851515
ICTMicroclimate-10 0.0 359.0 56.584262 123.228729 20.900000 1030.700000 838.528264 363.300887 0.0 1030.7 188.700070 389.082039
ICTMicroclimate-11 0.0 359.0 6.172891 46.169169 32.700000 1029.700000 994.319084 125.177193 0.0 1024.1 21.611508 133.095076
aws5-0999 0.0 4.4 0.415397 0.466901 989.700000 1038.700000 1015.150175 8.420255 NaN NaN NaN NaN

Inspection of the statistics for each device shows that the problematic values only occur for the ICTMicroclimate-10 and ICTMicroclimate-11 sensors, which are both at the same location, Treasury Place.

climate_df.groupby("Device_id")[["SensorLocation"]].first().iloc[-3:-1]
SensorLocation
Device_id
ICTMicroclimate-10 1 Treasury Place
ICTMicroclimate-11 1 Treasury Place

The next step is to visualise the temporal distribution of the MinimumWindSpeed, AtmosphericPressure, and PM25 for the ICTMicroclimate-10 and ICTMicroclimate-11 sensors to see if the inconsistent values were random or followed a pattern. For the visualisation, I used a heatmap of Day in the Month versus Month in the Year.

# Ensure Time is datetime
climate_df["Time"] = pd.to_datetime(climate_df["Time"], utc=True)

# Drop timezone information
climate_df["Time"] = climate_df["Time"].dt.tz_localize(None)

# Extract year-month and day of month
climate_df["year_month"] = climate_df["Time"].dt.to_period("M").astype(str)
climate_df["day"] = climate_df["Time"].dt.day

# Extract groups ICTMicroclimate-10 and ICTMicroclimate-11 as DataFrames
d_10_df = climate_df.groupby("Device_id").get_group("ICTMicroclimate-10")
d_11_df = climate_df.groupby("Device_id").get_group("ICTMicroclimate-11")

# Group by day and year_month, and get the maximum of MinimumWindSpeed
g_10 = d_10_df.groupby(["day", "year_month"])["MinimumWindSpeed"].max().unstack()
g_11 = d_11_df.groupby(["day", "year_month"])["MinimumWindSpeed"].max().unstack()

# Set common colour scale limits
vmin = 0
vmax = 360

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)

# Copy the colormap before modifying it (modifying a registered colormap in place is deprecated)
cmap = plt.cm.YlGnBu.copy()
cmap.set_bad(color="grey")

# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 maximum of MinimumWindSpeed")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")

# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 maximum of MinimumWindSpeed")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")

# Add a title to the entire figure
fig.suptitle("Maximum Minimum Windspeed by Day in Month and Month in Year", fontsize=16)

plt.tight_layout()
plt.show()
[Figure: heatmaps of the maximum of MinimumWindSpeed for ICTMicroclimate-10 and ICTMicroclimate-11, by day in month and month in year]

# Group by day and year_month, and get the minimum of AtmosphericPressure
g_10 = d_10_df.groupby(["day", "year_month"])["AtmosphericPressure"].min().unstack()
g_11 = d_11_df.groupby(["day", "year_month"])["AtmosphericPressure"].min().unstack()

# Set common color scale limits
vmin = 0
vmax = 1300

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)

# Copy the reversed colormap before modifying it
cmap = plt.cm.YlGnBu_r.copy()
cmap.set_bad(color="grey")

# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 minimum of AtmosphericPressure")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")

# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 minimum of AtmosphericPressure")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")

# Add a title to the entire figure
fig.suptitle("Minimum AtmosphericPressure by Month-Year and Day of Month", fontsize=16)

plt.tight_layout()
plt.show()
[Figure: heatmaps of the minimum of AtmosphericPressure for ICTMicroclimate-10 and ICTMicroclimate-11, by day in month and month in year]

# Group by day and year_month, and get the maximum of PM25
g_10 = d_10_df.groupby(["day", "year_month"])["PM25"].max().unstack()
g_11 = d_11_df.groupby(["day", "year_month"])["PM25"].max().unstack()

# Set common colour scale limits
vmin = 0
vmax = 1300

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)

# Copy the colormap before modifying it
cmap = plt.cm.YlGnBu.copy()
cmap.set_bad(color="grey")

# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 maximum of PM25")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")

# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 maximum of PM25")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")

# Add a title to the entire figure
fig.suptitle("Maximum PM25 by Month-Year and Day of Month", fontsize=16)

plt.tight_layout()
plt.show()
[Figure: heatmaps of the maximum of PM25 for ICTMicroclimate-10 and ICTMicroclimate-11, by day in month and month in year]

From the above plots, it is clear that the data for sensors ICTMicroclimate-10 and ICTMicroclimate-11 is unreliable from August 2024 until the 2nd of October 2024. For the purposes of this exercise, the data for these sensors during this period will be removed from the dataframe.

# Define the device and time range to filter out
devices_to_remove = ["ICTMicroclimate-10","ICTMicroclimate-11"]
end_time = "2024-10-01 23:59:59"

# Filter out rows for that device in the given time range
df_filtered = climate_df[~((climate_df["Device_id"].isin(devices_to_remove)) & (climate_df["Time"] <= end_time))].copy()

# Drop the columns created for the above plotting
df_filtered.drop(columns=["day","year_month"], inplace=True)

# Calculate the min, max, mean and standard deviation for each device, only for the MinimumWindSpeed, AtmosphericPressure, and PM25 features
df_filtered.groupby("Device_id")[["MinimumWindSpeed", "AtmosphericPressure", "PM25"]].agg(["min", "max", "mean", "std"])
MinimumWindSpeed AtmosphericPressure PM25
min max mean std min max mean std min max mean std
Device_id
ICTMicroclimate-01 NaN NaN NaN NaN 991.400024 1040.300049 1015.682445 8.009809 0.0 235.0 5.278630 11.171503
ICTMicroclimate-02 0.0 11.4 0.577927 0.871374 987.300000 1036.000000 1011.590638 7.957121 1.0 147.0 7.800387 9.512505
ICTMicroclimate-03 0.0 10.1 0.725178 0.824416 986.400000 1035.200000 1010.733407 7.959522 1.0 184.0 6.979975 8.255256
ICTMicroclimate-04 0.0 1.7 0.001496 0.024614 996.700000 1037.900000 1018.512888 7.243488 0.0 81.0 5.085106 7.566532
ICTMicroclimate-05 0.0 7.8 0.197126 0.360950 993.600000 1037.700000 1015.707429 7.856305 1.0 116.0 6.684698 8.098758
ICTMicroclimate-06 0.0 0.8 0.002879 0.027912 992.100000 1040.900000 1016.449209 7.983454 0.0 92.0 5.183268 8.530823
ICTMicroclimate-07 0.0 1.1 0.000896 0.018917 994.300000 1042.900000 1018.497433 8.014416 0.0 62.0 5.542080 7.993104
ICTMicroclimate-08 0.0 1.5 0.004890 0.048268 991.900000 1040.900000 1016.326436 8.032214 0.0 98.0 7.447921 9.736321
ICTMicroclimate-09 0.0 0.4 0.000449 0.009074 990.700000 1039.700000 1015.006325 7.896735 0.0 307.0 7.610188 24.851515
ICTMicroclimate-10 0.0 10.5 0.711077 0.889946 989.700000 1028.900000 1010.033453 7.718326 0.0 85.0 5.132205 7.261089
ICTMicroclimate-11 0.0 1.6 0.008341 0.060927 989.500000 1029.700000 1010.929081 7.414035 0.0 179.0 3.857938 6.942787
aws5-0999 0.0 4.4 0.415397 0.466901 989.700000 1038.700000 1015.150175 8.420255 NaN NaN NaN NaN

The table above shows some descriptive statistics for the MinimumWindSpeed, AtmosphericPressure, and PM25 features after filtering out the corrupted data for sensors ICTMicroclimate-10 and ICTMicroclimate-11. It is observed that the previously identified issues are no longer present.

Missing data

The task description requires the use of either the mean or the median to fill in missing values. The median is the best choice for distributions that are skewed or contain outliers. Therefore, I’ll plot the distributions of all variables to decide if the median or mean should be used.

# Get list of features to plot
float_features = df_filtered.select_dtypes(include='number').columns.tolist()

# Plot histograms using the pandas hist() function
axes = df_filtered.hist(column=float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white",log=True)
for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")
    
plt.tight_layout()
plt.show()

The majority of the above distributions are highly skewed or contain outliers. The distributions for AtmosphericPressure and AirTemperature are the closest to symmetric distributions. Based on this evidence, I decided to use the median value for imputing missing values as it is less affected by the skewness of a distribution.
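
As a numeric cross-check of this visual assessment (not part of the original task), the skewness of each feature can also be computed directly with pandas; values well above zero indicate a strong right skew, supporting the choice of the median.

# Sketch: quantify skewness of each numeric feature (0 = symmetric, >1 = highly skewed)
feature_skewness = df_filtered[float_features].skew(numeric_only=True).sort_values(ascending=False)
print("Skewness of each numeric feature:")
print(feature_skewness)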

# Calculate the median value for each feature
feature_medians = df_filtered.median(numeric_only=True)
print("Meadian values for each feature:")
print(feature_medians)

# Replace missing values in numeric feature columns with the median
df_filled = df_filtered.fillna(feature_medians)
Median values for each feature:
MinimumWindDirection       0.0
AverageWindDirection     159.0
MaximumWindDirection     353.0
MinimumWindSpeed           0.0
AverageWindSpeed           0.8
GustWindSpeed              2.8
AirTemperature            16.1
RelativeHumidity          68.4
AtmosphericPressure     1014.8
PM25                       3.0
PM10                       5.0
Noise                     68.3
dtype: float64
# Check how many null values in the data frame after filling missing values
print("Feature Name           Number of missing entries")
print(df_filled.isnull().sum())
Feature Name           Number of missing entries
Device_id               0
Time                    0
SensorLocation          0
LatLong                 0
MinimumWindDirection    0
AverageWindDirection    0
MaximumWindDirection    0
MinimumWindSpeed        0
AverageWindSpeed        0
GustWindSpeed           0
AirTemperature          0
RelativeHumidity        0
AtmosphericPressure     0
PM25                    0
PM10                    0
Noise                   0
dtype: int64

Distributions of PM10 and PM25

The histograms of the PM10 and PM25 features are shown in the figures below. Both distributions are roughly exponential in shape, with high counts at low particulate matter concentrations and counts decreasing rapidly as the concentration increases.

num_bins = 10

# Create figure with specified size
hist_fig = plt.figure(figsize=(10, 7))

# Plot PM10 
ax1 = plt.subplot(2, 1, 1)  # 2 rows, 1 column, 1st subplot
ax1.hist(df_filled["PM10"], num_bins, edgecolor="white", log=True)
ax1.set_title(f'Histogram of PM10 (n={df_filtered["PM10"].shape[0]})')
ax1.set_ylabel("Count (log scale)")
ax1.set_xlabel(r"Particular matter 10 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")

# Plot PM25
ax2 = plt.subplot(212, sharex=ax1)  # 2 rows, 1 column, 2nd subplot
ax2.hist(df_filled["PM25"], num_bins, edgecolor="white", log=True)
ax2.set_title(f'Histogram of PM25 (n={df_filtered["PM25"].shape[0]})')
ax2.set_ylabel("Count (log scale)")
ax2.set_xlabel(r"Particulate matter 2.5 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")

# Set the x limit for both plots
#plt.xlim(0, 350)

plt.tight_layout()
plt.show()
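
As a rough numeric check of the exponential-like shape noted above (a sketch, not part of the original analysis), the mean and standard deviation of each feature can be compared; for an exponential distribution they are equal, although heavy outliers will pull the ratio away from one.

# Sketch: for an exponential distribution the mean equals the standard deviation
for feature in ["PM10", "PM25"]:
    mean_val = df_filled[feature].mean()
    std_val = df_filled[feature].std()
    print(f"{feature}: mean={mean_val:.2f}, std={std_val:.2f}, mean/std={mean_val/std_val:.2f}")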

To investigate any possible correlation between the PM10 and PM25 features, a scatter plot was created to allow visual inspection. Then the Pearson correlation coefficient was calculated.

# Create figure with specified size
plt.figure(figsize=(10, 7))

# Scatter plot of PM25 versus PM10
plt.scatter(df_filtered["PM25"], df_filtered["PM10"], s=10, edgecolor="lightskyblue", alpha=0.5)
plt.grid(zorder=1, alpha=0.5)

plt.title("Scatter plot of PM25 verus PM10")
plt.xlabel(r"Particulate matter 2.5 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")
plt.ylabel(r"Particulate matter 10 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")

plt.xlim(0, 350)
plt.ylim(0, 350)

plt.tight_layout()
plt.show()

# Pearson correlation coefficient
correlation = df_filled["PM25"].corr(df_filled["PM10"])
print(f"Correlation coefficient between features PM25 and PM10 is: {correlation:.3}")
Correlation coefficient between features PM25 and PM10 is: 0.985

The scatter plot between the features PM25 and PM10 shows a strong positive linear relationship. This is confirmed by the high Pearson correlation coefficient (0.985).
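
To put a rough number on the slope of this relationship, a simple least-squares line can be fitted (a sketch for illustration only; the fit is not part of the wrangling pipeline).

# Sketch: least-squares straight-line fit of PM10 against PM25
import numpy as np

slope, intercept = np.polyfit(df_filled["PM25"], df_filled["PM10"], deg=1)
print(f"Least-squares fit: PM10 ~= {slope:.2f} * PM25 + {intercept:.2f}")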

LatLong encoding

The first step in encoding the LatLong feature was to split it into the latitude and longitude components.

df_filled[['latitude', 'longitude']] = df_filled['LatLong'].str.split(',', expand=True).astype(float)

The second step is to choose an encoding system. Here, I have chosen to encode the latitude and longitude into a 3D Cartesian coordinate system, giving values for x, y and z. This approximates the Earth as a sphere, which is reasonable since the region of interest is a tiny portion of the Earth’s surface. This encoding approach will enable accurate distance measurement between points if required in the machine learning algorithm.

# Converting lat/long to cartesian
import numpy as np

# Convert from degrees to radians
df_filled["lat_rad"] = np.radians(df_filled["latitude"])
df_filled["long_rad"] = np.radians(df_filled["longitude"])

# Calculate the Cartesian values
df_filled["x"] = np.cos(df_filled["lat_rad"]) * np.cos(df_filled["long_rad"])
df_filled["y"] = np.cos(df_filled["lat_rad"]) * np.sin(df_filled["long_rad"])
df_filled["z"] = np.sin(df_filled["lat_rad"])

# Drop the intermediate lat_rad and long_rad columns
df_filled.drop(columns=["lat_rad","long_rad"], inplace=True)
    
# Print out the encoding
print("Table showing the encoded values, x, y and z for each latitude and longitude pair.")
df_filled.groupby(['latitude','longitude'])[['x','y','z']].agg(['unique'])
Table showing the encoded values, x, y and z for each latitude and longitude pair.
x y z
unique unique unique
latitude longitude
-37.822331 144.952170 [-0.6466829154533684] [0.45361725572202244] [-0.6132149640802587]
-37.822234 144.982941 [-0.6469272879344449] [0.453270475112011] [-0.6132136336689813]
-37.822183 144.956222 [-0.6467162961870897] [0.45357241883254756] [-0.6132129264133661]
-37.820408 144.959119 [-0.6467547756329378] [0.4535506263400986] [-0.6131884616840834]
-37.819499 144.978721 [-0.6469178722458754] [0.45333491635230483] [-0.6131759292280787]
-37.818593 144.971640 [-0.6468697848649976] [0.453420426464369] [-0.6131634352222753]
-37.818452 144.967847 [-0.6468410076851618] [0.45346411834148037] [-0.6131614829338421]
-37.814604 144.970299 [-0.6468941255968796] [0.4534600727651171] [-0.61310843468028]
-37.814035 144.967280 [-0.6468752177437501] [0.4534976554876418] [-0.6131005864751623]
-37.812888 144.975086 [-0.6469470430816873] [0.45341656703589844] [-0.6130847740608488]
-37.812860 144.974539 [-0.6469429703681334] [0.45342290938269164] [-0.613084381091373]
-37.795617 144.951901 [-0.6469147812012331] [0.4537844284725181] [-0.6128466026170258]
# Drop the latitude and longitude columns
df_filled.drop(columns=["latitude","longitude","LatLong"], inplace=True)

Feature scaling

Each feature in the dataset has a different scale (range). To normalise the features, min-max scaling is applied to all continuous data.

Feature distributions before scaling

Distributions of the features are plotted before scaling.

# Get list of features to plot
scale_float_features = df_filled.select_dtypes(include='number').columns.tolist()

axes = df_filled.hist(column=scale_float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white", xlabelsize=7, ylabelsize=7, log=True)
for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")

plt.tight_layout()
plt.show()

Scaling operation

# Create a copy of the filled dataframe
df_scaled = df_filled.copy()

# Apply min-max scaling to the selected columns
for feature in scale_float_features:
    min_val = df_scaled[feature].min()
    max_val = df_scaled[feature].max()
    df_scaled[feature] = (df_scaled[feature] - min_val) / (max_val - min_val)
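
The same transformation can also be performed with scikit-learn's MinMaxScaler, which is convenient when the fitted scaling needs to be reused on new data (a sketch, assuming scikit-learn is available; it is not used elsewhere in this notebook).

# Sketch: equivalent min-max scaling using scikit-learn (assumes scikit-learn is installed)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled_sk = df_filled.copy()
df_scaled_sk[scale_float_features] = scaler.fit_transform(df_filled[scale_float_features])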

Feature distributions after scaling

Distributions of the features are plotted after scaling. Inspection of the before and after figures for each feature's distribution shows that the scaling did not affect the relative shape of the distributions (e.g. skewness).

# Loop through all of the continuous features, plotting their distributions
axes = df_scaled.hist(column=scale_float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white", xlabelsize=7, ylabelsize=7, log=True)

for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")
plt.tight_layout()
plt.show()
