import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data Wrangling
Introduction
This task involves several data wrangling steps: loading the data, identifying missing values, handling inconsistent and missing data, aggregating the data into useful summary statistics, transforming features (encoding), and scaling the data.
Data
The dataset used in this notebook is the Microclimate sensors data from the City of Melbourne. The data is licensed under CC BY, allowing it to be freely used for this exercise. The dataset contains climate readings from sensors located within the city of Melbourne.
Load data
The data is loaded into a pandas dataframe from the downloaded CSV file. After loading the data, the first five rows of the dataframe are inspected using the head() function.
# Load the data file into a dataframe
climate_df = pd.read_csv("microclimate-sensors-data.csv", comment="#")
# Inspect the head of the data
print(climate_df.head())
print()
Device_id Time \
0 ICTMicroclimate-08 2025-02-09T11:54:37+11:00
1 ICTMicroclimate-11 2025-02-09T12:02:11+11:00
2 ICTMicroclimate-05 2025-02-09T12:03:24+11:00
3 ICTMicroclimate-01 2025-02-09T12:02:43+11:00
4 ICTMicroclimate-09 2025-02-09T12:17:37+11:00
SensorLocation \
0 Swanston St - Tram Stop 13 adjacent Federation...
1 1 Treasury Place
2 Enterprize Park - Pole ID: COM1667
3 Birrarung Marr Park - Pole 1131
4 SkyFarm (Jeff's Shed). Rooftop - Melbourne Con...
LatLong MinimumWindDirection AverageWindDirection \
0 -37.8184515, 144.9678474 0.0 153.0
1 -37.812888, 144.9750857 0.0 144.0
2 -37.8204083, 144.9591192 0.0 45.0
3 -37.8185931, 144.9716404 NaN 150.0
4 -37.8223306, 144.9521696 0.0 241.0
MaximumWindDirection MinimumWindSpeed AverageWindSpeed GustWindSpeed \
0 358.0 0.0 3.9 7.9
1 356.0 0.0 2.0 7.8
2 133.0 0.0 1.5 2.7
3 NaN NaN 1.6 NaN
4 359.0 0.0 0.9 4.4
AirTemperature RelativeHumidity AtmosphericPressure PM25 PM10 \
0 23.9 57.300000 1009.7 0.0 0.0
1 24.5 56.200000 1005.3 0.0 0.0
2 25.0 60.000000 1009.6 1.0 3.0
3 23.1 61.099998 1009.0 0.0 5.0
4 25.6 53.700000 1007.9 0.0 0.0
Noise
0 80.500000
1 62.900000
2 68.500000
3 51.700001
4 60.200000
Inspect the data for missing values
The isnull() function is used to identify cells in the dataframe with missing or NaN values. These are summed to calculate the total number of missing values for each feature. To understand the distribution of missing data across sensors, the dataframe was also grouped by sensor device ID before calculating the sum of the missing values.
# Check how many null values in the data frame
print("Feature Name Number of missing entries")
print(climate_df.isnull().sum())
Feature Name Number of missing entries
Device_id 0
Time 0
SensorLocation 6143
LatLong 11483
MinimumWindDirection 39343
AverageWindDirection 503
MaximumWindDirection 39501
MinimumWindSpeed 39501
AverageWindSpeed 503
GustWindSpeed 39501
AirTemperature 503
RelativeHumidity 503
AtmosphericPressure 503
PM25 18779
PM10 18779
Noise 18779
dtype: int64
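To put these counts in context, the proportion of missing entries per feature can also be inspected. A minimal sketch using the same dataframe:
# Share of missing entries per feature, as a percentage of all rows
print((climate_df.isnull().mean() * 100).round(2))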
# Group the data by the Device_id and then sum the null values to understand the distribution of missing data across devices.
climate_df.groupby("Device_id").agg(lambda x: x.isnull().sum())| Time | SensorLocation | LatLong | MinimumWindDirection | AverageWindDirection | MaximumWindDirection | MinimumWindSpeed | AverageWindSpeed | GustWindSpeed | AirTemperature | RelativeHumidity | AtmosphericPressure | PM25 | PM10 | Noise | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Device_id | |||||||||||||||
| ICTMicroclimate-01 | 0 | 107 | 107 | 39026 | 28 | 39026 | 39026 | 28 | 39026 | 28 | 28 | 28 | 28 | 28 | 28 |
| ICTMicroclimate-02 | 0 | 108 | 108 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| ICTMicroclimate-03 | 0 | 107 | 107 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 |
| ICTMicroclimate-04 | 0 | 9 | 9 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| ICTMicroclimate-05 | 0 | 0 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| ICTMicroclimate-06 | 0 | 107 | 107 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| ICTMicroclimate-07 | 0 | 107 | 107 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| ICTMicroclimate-08 | 0 | 103 | 103 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| ICTMicroclimate-09 | 0 | 107 | 107 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| ICTMicroclimate-10 | 0 | 4695 | 9390 | 64 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 |
| ICTMicroclimate-11 | 0 | 645 | 1290 | 152 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 |
| aws5-0999 | 0 | 48 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18276 | 18276 | 18276 |
Data Imputation
From the above data inspections, it is observed that some features have significant missing data, e.g. PM25, PM10 and Noise for device aws5-0999. The values of certain features should be constant, such as sensor location, latitude and longitude, which makes correcting their missing data straightforward. Other features are measured values, so correcting their missing data requires assumptions and computation.
SensorLocation
The number of unique sensor locations for each device is one, confirming the assumption that each sensor is in a fixed location. Therefore, the existing values can be used to fill any missing values. To correct the missing sensor locations, I grouped the data by Device_id, took the first SensorLocation entry in each group, and used the resulting mapping with map() to replace the missing locations.
# Get the number of unique locations for each sensor.
num_locations = climate_df.groupby("Device_id")[["SensorLocation"]].nunique()
print("Number of unique sensor locations for each device:", num_locations)Number of unique sensor locations for each device: SensorLocation
Device_id
ICTMicroclimate-01 1
ICTMicroclimate-02 1
ICTMicroclimate-03 1
ICTMicroclimate-04 1
ICTMicroclimate-05 1
ICTMicroclimate-06 1
ICTMicroclimate-07 1
ICTMicroclimate-08 1
ICTMicroclimate-09 1
ICTMicroclimate-10 1
ICTMicroclimate-11 1
aws5-0999 1
# Create sensor location mapping for each device
device_location_map = climate_df.groupby("Device_id")[["SensorLocation"]].first()
# Apply the device location map to correct the missing sensor locations
climate_df["SensorLocation"] = climate_df["Device_id"].map(device_location_map["SensorLocation"])
# Check the number of missing sensor locations after correction
print("Number of missing sensor locations values:", climate_df["SensorLocation"].isnull().sum())Number of missing sensor locations values: 0
LatLong
The same approach used to correct the SensorLocation is used for the LatLong. First, I check that each device only has one unique value, then I retrieve the valid value for each device and use it to correct the missing values.
# Get the number of unique latitude/longitude locations for each sensor.
num_latlong = climate_df.groupby("Device_id")[["LatLong"]].nunique()
print("Number of unique sensor latitude/longitude locations for each device:")
print(num_latlong)
Number of unique sensor latitude/longitude locations for each device:
LatLong
Device_id
ICTMicroclimate-01 1
ICTMicroclimate-02 1
ICTMicroclimate-03 1
ICTMicroclimate-04 1
ICTMicroclimate-05 1
ICTMicroclimate-06 1
ICTMicroclimate-07 1
ICTMicroclimate-08 1
ICTMicroclimate-09 1
ICTMicroclimate-10 1
ICTMicroclimate-11 1
aws5-0999 1
# Create lat/long mapping for each device
device_latlong_map = climate_df.groupby("Device_id")[["LatLong"]].first()
# Apply the device lat/long map to correct the missing sensor latitudes and longitudes
climate_df["LatLong"] = climate_df["Device_id"].map(device_latlong_map["LatLong"])
# Check the number of missing LatLong values after correction
print("Number of missing sensor latitude/logitude values:", climate_df["LatLong"].isnull().sum())Number of missing sensor latitude/logitude values: 0
Continuous features
Inconsistent data
Before making any corrections to the continuous features, it is a good idea to inspect some basic statistics.
climate_df.describe()
| MinimumWindDirection | AverageWindDirection | MaximumWindDirection | MinimumWindSpeed | AverageWindSpeed | GustWindSpeed | AirTemperature | RelativeHumidity | AtmosphericPressure | PM25 | PM10 | Noise | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 352450.000000 | 391290.000000 | 352292.000000 | 352292.000000 | 391290.000000 | 352292.000000 | 391290.000000 | 391290.000000 | 391290.000000 | 373014.000000 | 373014.000000 | 373014.000000 |
| mean | 20.447147 | 166.414570 | 305.508777 | 4.794662 | 1.058048 | 3.441384 | 16.378137 | 66.467364 | 1002.152184 | 20.116674 | 8.138810 | 66.417315 |
| std | 57.606792 | 125.167169 | 89.833971 | 38.599842 | 0.980283 | 2.607634 | 6.075031 | 18.367640 | 108.405272 | 118.284124 | 12.522434 | 13.052584 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.800000 | 4.000000 | 20.900000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 43.000000 | 314.000000 | 0.000000 | 0.400000 | 1.500000 | 12.100000 | 55.100000 | 1008.800000 | 1.000000 | 3.000000 | 59.000000 |
| 50% | 0.000000 | 159.000000 | 353.000000 | 0.000000 | 0.800000 | 2.800000 | 15.900000 | 68.200000 | 1014.700000 | 3.000000 | 5.000000 | 68.300000 |
| 75% | 0.000000 | 301.000000 | 358.000000 | 0.100000 | 1.500000 | 4.800000 | 19.800000 | 79.600000 | 1020.100000 | 7.000000 | 9.000000 | 72.400000 |
| max | 359.000000 | 359.000000 | 360.000000 | 359.000000 | 11.100000 | 52.500000 | 40.800000 | 99.800003 | 1042.900000 | 1030.700000 | 308.000000 | 131.100000 |
From the above table, there appear to be issues with MinimumWindSpeed, AtmosphericPressure and PM25:
- MinimumWindSpeed: the maximum recorded value is 359 m/s, which is not physically sensible and is well beyond the upper bound (60 m/s) of the instrument's measurement range.
- AtmosphericPressure: the minimum value of 20.9 hPa is far too low to be a real surface pressure reading.
- PM25: the maximum value of 1030.7 is beyond the instrument's measurement range (0-1000 micrograms/m^3).
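As a quick numeric check of these observations, the out-of-range readings can be counted directly. The wind speed and PM25 bounds below are the instrument ranges quoted above; the pressure bounds are an assumed plausible surface range:
# Count readings outside plausible ranges (pressure bounds are an assumption)
bounds = {
    "MinimumWindSpeed": (0, 60),          # instrument range, m/s
    "AtmosphericPressure": (850, 1100),   # assumed plausible surface pressure, hPa
    "PM25": (0, 1000),                    # instrument range, micrograms/m^3
}
for feature, (lo, hi) in bounds.items():
    n_bad = ((climate_df[feature] < lo) | (climate_df[feature] > hi)).sum()
    print(f"{feature}: {n_bad} readings outside [{lo}, {hi}]")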
To inspect further, I calculated the descriptive statistics for each device.
# Calculate the min, max, mean and standard deviation for each device, only for the MinimumWindSpeed, AtmosphericPressure, and PM25 features
climate_df.groupby("Device_id")[["MinimumWindSpeed", "AtmosphericPressure", "PM25"]].agg(["min", "max", "mean", "std"])| MinimumWindSpeed | AtmosphericPressure | PM25 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | max | mean | std | min | max | mean | std | min | max | mean | std | |
| Device_id | ||||||||||||
| ICTMicroclimate-01 | NaN | NaN | NaN | NaN | 991.400024 | 1040.300049 | 1015.682445 | 8.009809 | 0.0 | 235.0 | 5.278630 | 11.171503 |
| ICTMicroclimate-02 | 0.0 | 11.4 | 0.577927 | 0.871374 | 987.300000 | 1036.000000 | 1011.590638 | 7.957121 | 1.0 | 147.0 | 7.800387 | 9.512505 |
| ICTMicroclimate-03 | 0.0 | 10.1 | 0.725178 | 0.824416 | 986.400000 | 1035.200000 | 1010.733407 | 7.959522 | 1.0 | 184.0 | 6.979975 | 8.255256 |
| ICTMicroclimate-04 | 0.0 | 1.7 | 0.001496 | 0.024614 | 996.700000 | 1037.900000 | 1018.512888 | 7.243488 | 0.0 | 81.0 | 5.085106 | 7.566532 |
| ICTMicroclimate-05 | 0.0 | 7.8 | 0.197126 | 0.360950 | 993.600000 | 1037.700000 | 1015.707429 | 7.856305 | 1.0 | 116.0 | 6.684698 | 8.098758 |
| ICTMicroclimate-06 | 0.0 | 0.8 | 0.002879 | 0.027912 | 992.100000 | 1040.900000 | 1016.449209 | 7.983454 | 0.0 | 92.0 | 5.183268 | 8.530823 |
| ICTMicroclimate-07 | 0.0 | 1.1 | 0.000896 | 0.018917 | 994.300000 | 1042.900000 | 1018.497433 | 8.014416 | 0.0 | 62.0 | 5.542080 | 7.993104 |
| ICTMicroclimate-08 | 0.0 | 1.5 | 0.004890 | 0.048268 | 991.900000 | 1040.900000 | 1016.326436 | 8.032214 | 0.0 | 98.0 | 7.447921 | 9.736321 |
| ICTMicroclimate-09 | 0.0 | 0.4 | 0.000449 | 0.009074 | 990.700000 | 1039.700000 | 1015.006325 | 7.896735 | 0.0 | 307.0 | 7.610188 | 24.851515 |
| ICTMicroclimate-10 | 0.0 | 359.0 | 56.584262 | 123.228729 | 20.900000 | 1030.700000 | 838.528264 | 363.300887 | 0.0 | 1030.7 | 188.700070 | 389.082039 |
| ICTMicroclimate-11 | 0.0 | 359.0 | 6.172891 | 46.169169 | 32.700000 | 1029.700000 | 994.319084 | 125.177193 | 0.0 | 1024.1 | 21.611508 | 133.095076 |
| aws5-0999 | 0.0 | 4.4 | 0.415397 | 0.466901 | 989.700000 | 1038.700000 | 1015.150175 | 8.420255 | NaN | NaN | NaN | NaN |
Inspection of the statistics for each device shows that the problematic values only occur for the ICTMicroclimate-10 and ICTMicroclimate-11 sensors, which are both at the same location, Treasury Place.
climate_df.groupby("Device_id")[["SensorLocation"]].first().iloc[-3:-1]| SensorLocation | |
|---|---|
| Device_id | |
| ICTMicroclimate-10 | 1 Treasury Place |
| ICTMicroclimate-11 | 1 Treasury Place |
The next step is to visualise the temporal distribution of the MinimumWindSpeed, AtmosphericPressure, and PM25 for the ICTMicroclimate-10 and ICTMicroclimate-11 sensors to see if the inconsistent values were random or followed a pattern. For the visualisation, I used a heatmap of Day in the Month versus Month in the Year.
# Ensure Time is datetime
climate_df["Time"] = pd.to_datetime(climate_df["Time"], utc=True)
# Drop timezone information
climate_df["Time"] = climate_df["Time"].dt.tz_localize(None)
# Extract year-month and hour
climate_df["year_month"] = climate_df["Time"].dt.to_period("M").astype(str)
climate_df["day"] = climate_df["Time"].dt.day
# Extract groups ICTMicroclimate-10 and ICTMicroclimate-11 as DataFrames
d_10_df = climate_df.groupby("Device_id").get_group("ICTMicroclimate-10")
d_11_df = climate_df.groupby("Device_id").get_group("ICTMicroclimate-11")
# Group by day and year-month, and get the max MinimumWindSpeed
g_10 = d_10_df.groupby(["day","year_month", ])["MinimumWindSpeed"].max().unstack()
g_11 = d_11_df.groupby(["day","year_month", ])["MinimumWindSpeed"].max().unstack()
# Set common colour scale limits
vmin = 0
vmax = 360
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)
cmap = plt.cm.YlGnBu.copy()  # copy so the registered colormap is not modified in place
cmap.set_bad(color="grey")
# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 maximum of MinimumWindSpeed")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")
# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 maximum of MinimumWindSpeed")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")
# Add a title to the entire figure
fig.suptitle("Maximum Minimum Windspeed by Day in Month and Month in Year", fontsize=16)
plt.tight_layout()
plt.show()
# Group by day and year-month, and get the min AtmosphericPressure
g_10 = d_10_df.groupby(["day","year_month", ])["AtmosphericPressure"].min().unstack()
g_11 = d_11_df.groupby(["day","year_month", ])["AtmosphericPressure"].min().unstack()
# Set common color scale limits
vmin = 0
vmax = 1300
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)
cmap = plt.cm.YlGnBu_r.copy()  # copy so the registered colormap is not modified in place
cmap.set_bad(color="grey")
# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 minimum of AtmosphericPressure")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")
# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 minimum of AtmosphericPressure")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")
# Add a title to the entire figure
fig.suptitle("Minimum AtmosphericPressure by Month-Year and Day of Month", fontsize=16)
plt.tight_layout()
plt.show()
# Group by day and year-month, and get the max PM25
g_10 = d_10_df.groupby(["day","year_month", ])["PM25"].max().unstack()
g_11 = d_11_df.groupby(["day","year_month", ])["PM25"].max().unstack()
# Set common colour scale limits
vmin = 0
vmax = 1300
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 10), sharey=True)
cmap = plt.cm.YlGnBu.copy()  # copy so the registered colormap is not modified in place
cmap.set_bad(color="grey")
# Plot heatmap
sns.heatmap(g_10, ax=axes[0], annot=True, fmt=".1f", cmap=cmap, vmin=vmin, vmax=vmax, cbar=False)
axes[0].set_title("ICTMicroclimate-10 maximum of PM25")
axes[0].set_ylabel("Day in Month")
axes[0].set_xlabel("Month in Year")
# Plot heatmap
sns.heatmap(g_11, ax=axes[1], vmin=vmin, vmax=vmax, annot=True, fmt=".1f", cmap=cmap, cbar=False)
axes[1].set_title("ICTMicroclimate-11 maximum of PM25")
axes[1].set_ylabel("")
axes[1].tick_params(left=False)
axes[1].set_xlabel("Month in Year")
# Add a title to the entire figure
fig.suptitle("Maximum PM25 by Month-Year and Day of Month", fontsize=16)
plt.tight_layout()
plt.show()
From the above plots, it is clear that the data from sensors ICTMicroclimate-10 and ICTMicroclimate-11 is unreliable from August 2024 through 1 October 2024. For the purposes of this exercise, the data for these sensors during this period is removed from the dataframe.
# Define the device and time range to filter out
devices_to_remove = ["ICTMicroclimate-10","ICTMicroclimate-11"]
end_time = "2024-10-01 23:59:59"
# Filter out rows for that device in the given time range
df_filtered = climate_df[~((climate_df["Device_id"].isin(devices_to_remove)) & (climate_df["Time"] <= end_time))].copy()
# Drop the columns created for the above plotting
df_filtered.drop(columns=["day","year_month"], inplace=True)
# Calculate the min, max, mean and standard deviation for each device, only for the MinimumWindSpeed, AtmosphericPressure, and PM25 features
df_filtered.groupby("Device_id")[["MinimumWindSpeed", "AtmosphericPressure", "PM25"]].agg(["min", "max", "mean", "std"])
| MinimumWindSpeed | AtmosphericPressure | PM25 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | max | mean | std | min | max | mean | std | min | max | mean | std | |
| Device_id | ||||||||||||
| ICTMicroclimate-01 | NaN | NaN | NaN | NaN | 991.400024 | 1040.300049 | 1015.682445 | 8.009809 | 0.0 | 235.0 | 5.278630 | 11.171503 |
| ICTMicroclimate-02 | 0.0 | 11.4 | 0.577927 | 0.871374 | 987.300000 | 1036.000000 | 1011.590638 | 7.957121 | 1.0 | 147.0 | 7.800387 | 9.512505 |
| ICTMicroclimate-03 | 0.0 | 10.1 | 0.725178 | 0.824416 | 986.400000 | 1035.200000 | 1010.733407 | 7.959522 | 1.0 | 184.0 | 6.979975 | 8.255256 |
| ICTMicroclimate-04 | 0.0 | 1.7 | 0.001496 | 0.024614 | 996.700000 | 1037.900000 | 1018.512888 | 7.243488 | 0.0 | 81.0 | 5.085106 | 7.566532 |
| ICTMicroclimate-05 | 0.0 | 7.8 | 0.197126 | 0.360950 | 993.600000 | 1037.700000 | 1015.707429 | 7.856305 | 1.0 | 116.0 | 6.684698 | 8.098758 |
| ICTMicroclimate-06 | 0.0 | 0.8 | 0.002879 | 0.027912 | 992.100000 | 1040.900000 | 1016.449209 | 7.983454 | 0.0 | 92.0 | 5.183268 | 8.530823 |
| ICTMicroclimate-07 | 0.0 | 1.1 | 0.000896 | 0.018917 | 994.300000 | 1042.900000 | 1018.497433 | 8.014416 | 0.0 | 62.0 | 5.542080 | 7.993104 |
| ICTMicroclimate-08 | 0.0 | 1.5 | 0.004890 | 0.048268 | 991.900000 | 1040.900000 | 1016.326436 | 8.032214 | 0.0 | 98.0 | 7.447921 | 9.736321 |
| ICTMicroclimate-09 | 0.0 | 0.4 | 0.000449 | 0.009074 | 990.700000 | 1039.700000 | 1015.006325 | 7.896735 | 0.0 | 307.0 | 7.610188 | 24.851515 |
| ICTMicroclimate-10 | 0.0 | 10.5 | 0.711077 | 0.889946 | 989.700000 | 1028.900000 | 1010.033453 | 7.718326 | 0.0 | 85.0 | 5.132205 | 7.261089 |
| ICTMicroclimate-11 | 0.0 | 1.6 | 0.008341 | 0.060927 | 989.500000 | 1029.700000 | 1010.929081 | 7.414035 | 0.0 | 179.0 | 3.857938 | 6.942787 |
| aws5-0999 | 0.0 | 4.4 | 0.415397 | 0.466901 | 989.700000 | 1038.700000 | 1015.150175 | 8.420255 | NaN | NaN | NaN | NaN |
The table above shows descriptive statistics for the MinimumWindSpeed, AtmosphericPressure, and PM25 features after filtering out the corrupted data for sensors ICTMicroclimate-10 and ICTMicroclimate-11. The previously identified issues are no longer present.
Missing data
The task description requires the use of either the mean or the median to fill in missing values. The median is the better choice for distributions that are skewed or contain outliers. Therefore, I'll plot the distributions of all variables to decide whether the median or the mean should be used.
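A small toy example illustrates why: a single outlier drags the mean far more than it moves the median.
# Toy example: one outlier shifts the mean but barely moves the median
s = pd.Series([1, 2, 2, 3, 100])
print(s.mean())    # 21.6
print(s.median())  # 2.0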
# Get list of features to plot
float_features = df_filtered.select_dtypes(include='number').columns.tolist()
# Plot histograms using the pandas hist() function
axes = df_filtered.hist(column=float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white",log=True)
for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")
plt.tight_layout()
plt.show()
The majority of the above distributions are highly skewed or contain outliers. The distributions for AtmosphericPressure and AirTemperature are the closest to symmetric. Based on this evidence, I decided to use the median value for imputing missing values, as it is less affected by the skewness of a distribution.
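The visual impression can also be quantified with pandas' sample skewness estimate; as a rough rule of thumb, a magnitude well above 1 indicates a strongly skewed feature:
# Quantify skewness of each numeric feature (rough rule of thumb: |skew| > 1 is strong)
print(df_filtered.skew(numeric_only=True).sort_values())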
# Calculate the median value for each feature
feature_medians = df_filtered.median(numeric_only=True)
print("Meadian values for each feature:")
print(feature_medians)
# Replace missing values in numeric feature columns with the median
df_filled = df_filtered.fillna(feature_medians)
Median values for each feature:
MinimumWindDirection 0.0
AverageWindDirection 159.0
MaximumWindDirection 353.0
MinimumWindSpeed 0.0
AverageWindSpeed 0.8
GustWindSpeed 2.8
AirTemperature 16.1
RelativeHumidity 68.4
AtmosphericPressure 1014.8
PM25 3.0
PM10 5.0
Noise 68.3
dtype: float64
# Check how many null values in the data frame after filling missing values
print("Feature Name Number of missing entries")
print(df_filled.isnull().sum())
Feature Name Number of missing entries
Device_id 0
Time 0
SensorLocation 0
LatLong 0
MinimumWindDirection 0
AverageWindDirection 0
MaximumWindDirection 0
MinimumWindSpeed 0
AverageWindSpeed 0
GustWindSpeed 0
AirTemperature 0
RelativeHumidity 0
AtmosphericPressure 0
PM25 0
PM10 0
Noise 0
dtype: int64
Distributions of PM10 and PM25
The histograms of the PM10 and PM25 features are shown in the figures below. Both distributions have the shape of an exponential distribution, with high counts at low particulate matter, and the counts decreasing exponentially as the particulate matter increases.
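As a rough consistency check of this visual impression, an exponential distribution has equal mean and standard deviation, so comparing the two for each feature gives a quick heuristic test (note the medians imputed above will influence these estimates slightly):
# Heuristic check: for an exponential distribution, mean and std are equal
for feature in ["PM10", "PM25"]:
    print(f"{feature}: mean={df_filled[feature].mean():.2f}, std={df_filled[feature].std():.2f}")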
num_bins = 10
# Create figure with specified size
hist_fig = plt.figure(figsize=(10, 7))
# Plot PM10
ax1 = plt.subplot(2, 1, 1) # 2 rows, 1 column, 1st subplot
ax1.hist(df_filled["PM10"], num_bins, edgecolor="white", log=True)
ax1.set_title(f'Histogram of PM10 (n={df_filled["PM10"].shape[0]})')
ax1.set_ylabel("Count (log scale)")
ax1.set_xlabel(r"Particulate matter 10 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")
# Plot PM25
ax2 = plt.subplot(212, sharex=ax1) # 2 rows, 1 column, 2nd subplot
ax2.hist(df_filled["PM25"], num_bins, edgecolor="white", log=True)
ax2.set_title(f'Histogram of PM25 (n={df_filled["PM25"].shape[0]})')
ax2.set_ylabel("Count (log scale)")
ax2.set_xlabel(r"Particulate matter 2.5 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")
# Set the x limit for both plots
#plt.xlim(0, 350)
plt.tight_layout()
plt.show()
To investigate any possible correlation between the PM10 and PM25 features, a scatter plot was created to allow visual inspection. Then the Pearson correlation coefficient was calculated.
# Create figure with specified size
plt.figure(figsize=(10, 7))
# Scatter plot of PM25 versus PM10
plt.scatter(df_filtered["PM25"], df_filtered["PM10"], s=10, edgecolor="lightskyblue", alpha=0.5)
plt.grid(zorder=1, alpha=0.5)
plt.title("Scatter plot of PM25 verus PM10")
plt.xlabel(r"Particulate matter 2.5 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")
plt.ylabel(r"Particulate matter 10 $\mu m$ diameter $\left(\frac{\mu g}{m^3}\right)$")
plt.xlim(0, 350)
plt.ylim(0, 350)
plt.tight_layout()
plt.show()
# Pearson correlation coefficient
correlation = df_filled["PM25"].corr(df_filled["PM10"])
print(f"Correlation coefficient between features PM25 and PM10 is: {correlation:.3}")Correlation coefficient between features PM25 and PM10 is: 0.985
The scatter plot of PM25 versus PM10 shows a strong positive linear relationship, confirmed by the high Pearson correlation coefficient.
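Given the near-linear relationship, a least-squares fit can quantify it. A minimal sketch using numpy (illustrative only, not part of the wrangling pipeline):
import numpy as np
# Least-squares line relating PM10 to PM25 (df_filled has no missing values)
slope, intercept = np.polyfit(df_filled["PM25"], df_filled["PM10"], 1)
print(f"PM10 ~ {slope:.2f} * PM25 + {intercept:.2f}")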
LatLong encoding
The first step in encoding the LatLong feature was to split it into the latitude and longitude components.
df_filled[['latitude', 'longitude']] = df_filled['LatLong'].str.split(',', expand=True).astype(float)
The second step is to choose an encoding system. Here, I have chosen to encode the latitude and longitude into a 3D Cartesian coordinate system, giving values for x, y and z. This approximates the Earth as a sphere, which is reasonable since the region of interest is a tiny portion of the Earth's surface. This encoding enables accurate distance measurement between points, should a machine learning algorithm require it.
# Converting lat/long to cartesian
import numpy as np
# Convert from degrees to radians
df_filled["lat_rad"] = np.radians(df_filled["latitude"])
df_filled["long_rad"] = np.radians(df_filled["longitude"])
# Calculate the Cartesian values
df_filled["x"] = np.cos(df_filled["lat_rad"]) * np.cos(df_filled["long_rad"])
df_filled["y"] = np.cos(df_filled["lat_rad"]) * np.sin(df_filled["long_rad"])
df_filled["z"] = np.sin(df_filled["lat_rad"])
# Drop the intermediate lat_rad and long_rad columns
df_filled.drop(columns=["lat_rad","long_rad"], inplace=True)
# Print out the encoding
print("Table showing the encoded values, x, y and z for each latitude and longitude pair.")
df_filled.groupby(['latitude','longitude'])[['x','y','z']].agg(['unique'])
Table showing the encoded values, x, y and z for each latitude and longitude pair.
| x | y | z | ||
|---|---|---|---|---|
| unique | unique | unique | ||
| latitude | longitude | |||
| -37.822331 | 144.952170 | [-0.6466829154533684] | [0.45361725572202244] | [-0.6132149640802587] |
| -37.822234 | 144.982941 | [-0.6469272879344449] | [0.453270475112011] | [-0.6132136336689813] |
| -37.822183 | 144.956222 | [-0.6467162961870897] | [0.45357241883254756] | [-0.6132129264133661] |
| -37.820408 | 144.959119 | [-0.6467547756329378] | [0.4535506263400986] | [-0.6131884616840834] |
| -37.819499 | 144.978721 | [-0.6469178722458754] | [0.45333491635230483] | [-0.6131759292280787] |
| -37.818593 | 144.971640 | [-0.6468697848649976] | [0.453420426464369] | [-0.6131634352222753] |
| -37.818452 | 144.967847 | [-0.6468410076851618] | [0.45346411834148037] | [-0.6131614829338421] |
| -37.814604 | 144.970299 | [-0.6468941255968796] | [0.4534600727651171] | [-0.61310843468028] |
| -37.814035 | 144.967280 | [-0.6468752177437501] | [0.4534976554876418] | [-0.6131005864751623] |
| -37.812888 | 144.975086 | [-0.6469470430816873] | [0.45341656703589844] | [-0.6130847740608488] |
| -37.812860 | 144.974539 | [-0.6469429703681334] | [0.45342290938269164] | [-0.613084381091373] |
| -37.795617 | 144.951901 | [-0.6469147812012331] | [0.4537844284725181] | [-0.6128466026170258] |
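To illustrate the claimed distance property, the straight-line (chord) distance between two encoded points can be converted to a surface distance by scaling with the Earth's radius. A sketch, using the first two coordinate pairs from the table above and assuming a spherical Earth of mean radius 6371 km:
import numpy as np
R_EARTH_M = 6_371_000  # assumed mean Earth radius in metres
def surface_distance_m(p1, p2):
    # Chord length between two points on the unit sphere ...
    chord = np.linalg.norm(np.asarray(p1) - np.asarray(p2))
    # ... converted to arc length along the sphere's surface
    return R_EARTH_M * 2 * np.arcsin(chord / 2)
# Example: the first two sensor positions in the table above (~2.7 km apart)
p_a = (-0.6466829154533684, 0.45361725572202244, -0.6132149640802587)
p_b = (-0.6469272879344449, 0.453270475112011, -0.6132136336689813)
print(f"{surface_distance_m(p_a, p_b):.0f} m")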
# Drop the latitude and longitude columns
df_filled.drop(columns=["latitude","longitude","LatLong"], inplace=True)Feature scaling
Each feature in the dataset has a different scale (range). To normalise the features, min-max scaling is applied to all continuous data.
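Min-max scaling maps each value into the interval $[0, 1]$ via $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. For reference, the same result as the loop below can be obtained with scikit-learn's MinMaxScaler; a sketch, assuming scikit-learn is installed:
# Equivalent scaling using scikit-learn (assumption: scikit-learn is available)
from sklearn.preprocessing import MinMaxScaler
num_cols = df_filled.select_dtypes(include="number").columns
df_scaled_sk = df_filled.copy()
df_scaled_sk[num_cols] = MinMaxScaler().fit_transform(df_filled[num_cols])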
Feature distributions before scaling
Distributions of the features are plotted before scaling.
# Get list of features to plot
scale_float_features = df_filled.select_dtypes(include='number').columns.tolist()
axes = df_filled.hist(column=scale_float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white", xlabelsize=7, ylabelsize=7, log=True)
for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")
plt.tight_layout()
plt.show()
Scaling operation
# Create a copy of the filled dataframe
df_scaled = df_filled.copy()
# Apply min-max scaling to the selected columns
for feature in scale_float_features:
    min_val = df_scaled[feature].min()
    max_val = df_scaled[feature].max()
    df_scaled[feature] = (df_scaled[feature] - min_val) / (max_val - min_val)
Feature distributions after scaling
Distributions of the features are plotted after scaling. Inspection of the before and after figures for each feature's distribution shows that scaling did not affect the relative shape of the distributions (e.g. skewness).
# Loop through all of the continuous features, plotting their distributions
axes = df_scaled.hist(column=scale_float_features, bins=15,layout=(8,2), figsize=(10,15), edgecolor="white", xlabelsize=7, ylabelsize=7, log=True)
for ax in axes.flatten():
    ax.set_ylabel("Count (log scale)")
plt.tight_layout()
plt.show()






