Pandas/Polars versus SQL

import os
import sqlite3
import pandas as pd
import numpy as np
import polars as pl
import tempfile
import timeit

Introduction

This task requires working with pandas to write code that produces equivalent output to various SQL queries. The nycflights13 dataset is used for this exercise. The dataset gives information about all 336,776 flights that departed from three New York airports in 2013 to destinations in the United States, Puerto Rico, and the American Virgin Islands. Additionally, equivalent queries were also written using the polars library, and the execution time for the pandas and polars libraries was calculated and compared.

Database connection

An SQLite3 database file is created on the local disk for use during these tasks.

#
# Create a database file in a temporary location
dbfile = os.path.join(tempfile.mkdtemp(), "nycflights.db")
print(dbfile)

# Connect to database
conn = sqlite3.connect(dbfile)

C:\Users\darrin\AppData\Local\Temp\tmpll_hbexh\nycflights.db

Load data files and export to database

Compressed csv files containing data about New York City flights, airlines, airports, planes and weather are loaded from the local disk into pandas dataframes. The pandas dataframes are exported as tables to the SQLite3 database file.

Polars dataframes (having _pl suffix on the name) are also created from the pandas dataframes. These will be used for testing the polars library in the tasks below.

flights = pd.read_csv("nycflights13_flights.csv.gz", comment="#")
flights_pl = pl.from_pandas(flights)

airlines = pd.read_csv("nycflights13_airlines.csv.gz", comment="#")
airlines_pl = pl.from_pandas(airlines)

airports = pd.read_csv("nycflights13_airports.csv.gz", comment="#")
airports_pl = pl.from_pandas(airports)

planes = pd.read_csv("nycflights13_planes.csv.gz", comment="#")
planes_pl = pl.from_pandas(planes)

weather = pd.read_csv("nycflights13_weather.csv.gz", comment="#")
weather_pl = pl.from_pandas(weather)

# Export dataframes to the database
flights.to_sql("flights", conn, index=False)
airlines.to_sql("airlines", conn, index=False)
airports.to_sql("Badges", conn, index=False)
planes.to_sql("planes", conn, index=False)
weather.to_sql("weather", conn, index=False)

SQL queries and equivalent Pandas and Polars queries

This section contains 17 different SQL queries and the pandas and polars code to give equivalent output.

# Helper function to testing equality between dataframes
def test_dataframe_equality(df1, df2, name="SQL"):
    """
    Function to test if two dataframes are equal.

    :param df1: First pandas dataframe.
    :param df2: Second pandas dataframe.
    :param name: Name of the method used to generate the first pandas dataframe.
    :return: None.
    """
    try:
        pd.testing.assert_frame_equal(df1, df2)
        print(f"Outcome of comparison: {name} and Pandas results are equal")
    except AssertionError as e:
        print(f"Outcome of comparison: {name} and Pandas results are not equal: {e}")

# Helper function to print execution times
def print_execution_data(pd_dur, pl_dur):
    """
    Function to print a table comparing Pandas to Polars execution time.
    
    :param pd_dur: pandas execution duration.
    :param pl_dur: polars execution duration.
    :return: None.
    """
    print()
    print("Metric                Pandas    Polars")
    print(f"Execution time [s]:  {pd_dur:.4g}  {pl_dur:.4g}")
    print(f"Relative to Pandas:  {1}           {pl_dur/pd_dur:.3g}")

Task 1

This SQL query selects the unique engines in the engines column from the planes table.

My pandas command: - In the planes dataframe, drop the duplicate rows by considering the engines column. Then select only the engines column and reset the index.

# SQL result
task1_sql = pd.read_sql_query("""SELECT DISTINCT engine FROM planes""", conn)

# Pandas result with timing
t1_st = timeit.default_timer()

task1_my = planes.drop_duplicates(subset="engine")[["engine"]].reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task1_my.head(), "\n")

t1_et = timeit.default_timer()
t1_dur = t1_et - t1_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task1_sql, task1_my)

Head of the Pandas data frame resulting from the query:
          engine
0      Turbo-fan
1      Turbo-jet
2  Reciprocating
3        4 Cycle
4    Turbo-shaft 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t1_st_pl = timeit.default_timer()

task1_pl = planes_pl.unique(subset=["engine"]).select(["engine"])

t1_et_pl = timeit.default_timer()
t1_dur_pl = t1_et_pl - t1_st_pl

# Checking the Pandas result == Polars result. We need to covert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task1_my.sort_values(by=["engine"]).reset_index(drop=True),
                        task1_pl.to_pandas().sort_values(by=["engine"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t1_dur, t1_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002907  0.003202
Relative to Pandas:  1           1.1

Task 2

This SQL query selects rows with the unique combination of type and engines from the planes table.

My pandas command: - In the planes dataframe, drop the duplicate rows by considering the type and engine columns. Then select only the type and engine columns and reset the index.

# SQL result
task2_sql = pd.read_sql_query("""SELECT DISTINCT type, engine FROM planes""", conn)

# Pandas result with timing
t2_st = timeit.default_timer()

task2_my = planes.drop_duplicates(subset=["type", "engine"])[["type", "engine"]].reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task2_my.head(), "\n")

t2_et = timeit.default_timer()
t2_dur = t2_et - t2_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task2_sql, task2_my)

Head of the Pandas data frame resulting from the query:
                       type         engine
0   Fixed wing multi engine      Turbo-fan
1   Fixed wing multi engine      Turbo-jet
2  Fixed wing single engine  Reciprocating
3   Fixed wing multi engine  Reciprocating
4  Fixed wing single engine        4 Cycle 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t2_st_pl = timeit.default_timer()

task2_pl = planes_pl.unique(subset=["type", "engine"]).select(["type", "engine"])

t2_et_pl = timeit.default_timer()
t2_dur_pl = t2_et_pl - t2_st_pl

# Checking the Pandas result == Polars result. We need to covert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task2_my.sort_values(by=["type", "engine"]).reset_index(drop=True),
                        task2_pl.to_pandas().sort_values(by=["type", "engine"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t2_dur, t2_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001832  0.001769
Relative to Pandas:  1           0.965

Task 3

This SQL query counts how many planes there are in the planes table for each engine type and returns the count along with the engine type

My pandas command: - In the planes dataframe, group the rows by the engine column and get the size of each group. Then reset the index and rename its column to COUNT(*). Select the columns in order to match the SQL query result.

# SQL result
task3_sql = pd.read_sql_query("""SELECT COUNT(*), engine FROM planes GROUP BY engine""", conn)

# Pandas result with timing
t3_st = timeit.default_timer()

task3_my = planes.groupby("engine").size().reset_index(name="COUNT(*)")[["COUNT(*)", "engine"]]

print("Head of the Pandas data frame resulting from the query:")
print(task3_my.head(), "\n")

t3_et = timeit.default_timer()
t3_dur = t3_et - t3_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task3_sql, task3_my)

Head of the Pandas data frame resulting from the query:
   COUNT(*)         engine
0         2        4 Cycle
1        28  Reciprocating
2      2750      Turbo-fan
3       535      Turbo-jet
4         2     Turbo-prop 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t3_st_pl = timeit.default_timer()

task3_pl = planes_pl.group_by("engine").agg(pl.len().alias("COUNT(*)")).select(["COUNT(*)", "engine"])

t3_et_pl = timeit.default_timer()
t3_dur_pl = t3_et_pl - t3_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task3_pl = task3_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task3_my.sort_values(by=["engine"]).reset_index(drop=True),
                        task3_pl.to_pandas().sort_values(by=["engine"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t3_dur, t3_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001778  0.00439
Relative to Pandas:  1           2.47

Task 4

This SQL query counts the number of planes in the planes table for each engine and plane type group and returns the count along with the engine type and plane type.

My pandas command - In the planes dataframe, group the rows by the engine and type columns and get the size of each group. Then reset the index and rename its column to COUNT(*). Then select the columns in order to match the SQL query result.

# SQL result
task4_sql = pd.read_sql_query("""SELECT COUNT(*), engine, type FROM planes GROUP BY engine, type""", conn)

# Pandas result with timing
t4_st = timeit.default_timer()

task4_my = planes.groupby(["engine", "type"]).size().reset_index(name="COUNT(*)")[["COUNT(*)", "engine", "type"]]

print("Head of the Pandas data frame resulting from the query:")
print(task4_my.head(), "\n")

t4_et = timeit.default_timer()
t4_dur = t4_et - t4_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task4_sql, task4_my)

Head of the Pandas data frame resulting from the query:
   COUNT(*)         engine                      type
0         2        4 Cycle  Fixed wing single engine
1         5  Reciprocating   Fixed wing multi engine
2        23  Reciprocating  Fixed wing single engine
3      2750      Turbo-fan   Fixed wing multi engine
4       535      Turbo-jet   Fixed wing multi engine 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t4_st_pl = timeit.default_timer()

task4_pl = planes_pl.group_by(["engine", "type"]).agg(pl.len().alias("COUNT(*)")).select(["COUNT(*)", "engine", "type"])

t4_et_pl = timeit.default_timer()
t4_dur_pl = t4_et_pl - t4_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task4_pl = task4_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task4_my.sort_values(by=["engine", "type"]).reset_index(drop=True),
                        task4_pl.to_pandas().sort_values(by=["engine", "type"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t4_dur, t4_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002324  0.002474
Relative to Pandas:  1           1.06

Task 5

This SQL query gives the minimum, average and maximum years for each unique combination of engine and manufacturer.

My pandas command: - For the planes dataframe, group the rows by engine and manufacturer. Select the year column and perform aggregation for the minimum, mean and maximum values. Then reset the index and select the min, mean, max, engine and manufacture columns to return. Rename the aggregation columns to match the SQL result.

# SQL result
task5_sql = pd.read_sql_query("""SELECT MIN(year), AVG(year), MAX(year), engine, manufacturer
    FROM planes GROUP BY engine, manufacturer""", conn)

# Pandas result with timing
t5_st = timeit.default_timer()

task5_my = planes.groupby(["engine", "manufacturer"]).year.agg(
    ["min", "mean", "max"]).reset_index()[["min", "mean", "max", "engine", "manufacturer"]].rename(
    columns={"min": "MIN(year)", "mean": "AVG(year)", "max": "MAX(year)"})

print("Head of the Pandas data frame resulting from the query:")
print(task5_my.head(), "\n")

t5_et = timeit.default_timer()
t5_dur = t5_et - t5_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task5_sql, task5_my)

Head of the Pandas data frame resulting from the query:
   MIN(year)  AVG(year)  MAX(year)         engine           manufacturer
0     1975.0     1975.0     1975.0        4 Cycle                 CESSNA
1        NaN        NaN        NaN        4 Cycle            JOHN G HESS
2        NaN        NaN        NaN  Reciprocating  AMERICAN AIRCRAFT INC
3     2007.0     2007.0     2007.0  Reciprocating     AVIAT AIRCRAFT INC
4        NaN        NaN        NaN  Reciprocating          BARKER JACK L 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t5_st_pl = timeit.default_timer()

task5_pl = planes_pl.group_by(["engine", "manufacturer"]).agg(
    [pl.col("year").min().alias("MIN(year)"), pl.col("year").mean().alias("AVG(year)"),
    pl.col("year").max().alias("MAX(year)")]).select(["MIN(year)", "AVG(year)", "MAX(year)", "engine", "manufacturer"])

t5_et_pl = timeit.default_timer()
t5_dur_pl = t5_et_pl - t5_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task5_my.sort_values(by=["engine", "manufacturer"]).reset_index(drop=True),
                        task5_pl.to_pandas().sort_values(by=["engine", "manufacturer"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t5_dur, t5_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.003621  0.006494
Relative to Pandas:  1           1.79

Task 6

This SQL query selects rows from the planes table where the speed column does not have a null value.

My pandas command: - Creates a logical vector for slicing by checking if the speed column in the planes dataframe is not Null. This slice is used with the planes dataframe to select those rows where the speed was not Null. Finally, the index is reset.

# SQL result
task6_sql = pd.read_sql_query("""SELECT * FROM planes WHERE speed IS NOT NULL""", conn)

# Pandas result with timing
t6_st = timeit.default_timer()

task6_my = planes[planes["speed"].notnull()].reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task6_my.head(), "\n")

t6_et = timeit.default_timer()
t6_dur = t6_et - t6_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task6_sql, task6_my)

Head of the Pandas data frame resulting from the query:
  tailnum    year                      type manufacturer      model  engines  \
0  N201AA  1959.0  Fixed wing single engine       CESSNA        150        1   
1  N202AA  1980.0   Fixed wing multi engine       CESSNA       421C        2   
2  N350AA  1980.0   Fixed wing multi engine        PIPER  PA-31-350        2   
3  N364AA  1973.0   Fixed wing multi engine       CESSNA       310Q        2   
4  N378AA  1963.0  Fixed wing single engine       CESSNA       172E        1   

   seats  speed         engine  
0      2   90.0  Reciprocating  
1      8   90.0  Reciprocating  
2      8  162.0  Reciprocating  
3      6  167.0  Reciprocating  
4      4  105.0  Reciprocating   

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t6_st_pl = timeit.default_timer()

task6_pl = planes_pl.filter(pl.col("speed").is_not_null()).sort(by=[])

t6_et_pl = timeit.default_timer()
t6_dur_pl = t6_et_pl - t6_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task6_my.sort_values(by=["speed"]).reset_index(drop=True),
                        task6_pl.to_pandas().sort_values(by=["speed"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t6_dur, t6_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001803  0.0008909
Relative to Pandas:  1           0.494

Task 7

This SQL query selects rows from the tailnum column in the planes table with a number of seats between 150 and 210 and a year greater than or equal to 2011.

My pandas command - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is between 150 and 210, and checking rows where the year column in the planes dataframe has a value equal to or greater than 2011. This logical vector is then used to slice the planes data frame. Then only the tailnum column is returned, and the index is reset.

# SQL result
task7_sql = pd.read_sql_query("""SELECT tailnum FROM planes WHERE seats BETWEEN 150 AND 210 AND year >= 2011""", conn)

# Pandas result with timing
t7_st = timeit.default_timer()

task7_my = planes[(planes["seats"].between(150,210) & (planes["year"] >= 2011))][["tailnum"]].reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task7_my.head(), "\n")

t7_et = timeit.default_timer()
t7_dur = t7_et - t7_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task7_sql, task7_my)

Head of the Pandas data frame resulting from the query:
  tailnum
0  N150UW
1  N151UW
2  N152UW
3  N153UW
4  N154UW 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t7_st_pl = timeit.default_timer()

task7_pl = planes_pl.filter((pl.col("seats").is_between(150,210)) &
                            (pl.col("year") >= 2011)).select(["tailnum"]).sort(by=[])

t7_et_pl = timeit.default_timer()
t7_dur_pl = t7_et_pl - t7_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task7_my.sort_values(by=["tailnum"]).reset_index(drop=True),
                        task7_pl.to_pandas().sort_values(by=["tailnum"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t7_dur, t7_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001393  0.001259
Relative to Pandas:  1           0.904

Task 8

This SQL query selects rows from the tailnum, manufacturer, and seats columns in the planes table where the manufacturer is one of ‘BOEING’, ‘AIRBUS’, or ‘EMBRAER’ and the number of seats is greater than 390.

My pandas command: - Create a logical vector for slicing by checking rows in the manufacturer column in the planes dataframe where the value is one of ‘BOEING’, ‘AIRBUS’, ‘EMBRAER’ and checking rows in the seats column in the planes dataframe where the capacity is greater than 390. This logical vector is then used to slice the planes data frame. Then the tailnum, manufacturer and seats columns are returned, and the index is reset.

# SQL result
task8_sql = pd.read_sql_query("""SELECT tailnum, manufacturer, seats FROM planes
    WHERE manufacturer IN ('BOEING', 'AIRBUS', 'EMBRAER') AND seats>390""", conn)

# Pandas result with timing
t8_st = timeit.default_timer()

task8_my = planes[(planes["manufacturer"].isin(["BOEING", "AIRBUS", "EMBRAER"])) & 
           (planes["seats"] > 390)][["tailnum", "manufacturer", "seats"]].reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task8_my.head(), "\n")

t8_et = timeit.default_timer()
t8_dur = t8_et - t8_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task8_sql, task8_my)

Head of the Pandas data frame resulting from the query:
  tailnum manufacturer  seats
0  N206UA       BOEING    400
1  N228UA       BOEING    400
2  N272AT       BOEING    400
3  N57016       BOEING    400
4  N670US       BOEING    450 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t8_st_pl = timeit.default_timer()

task8_pl = planes_pl.filter((pl.col("manufacturer").is_in(["BOEING", "AIRBUS", "EMBRAER"])) &
                            (pl.col("seats") > 390)).select(["tailnum", "manufacturer", "seats"]).sort(by=[])

t8_et_pl = timeit.default_timer()
t8_dur_pl = t8_et_pl - t8_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task8_my.reset_index(drop=True),
                        task8_pl.to_pandas().reset_index(drop=True), name="Polars")

print_execution_data(t8_dur, t8_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001728  0.001018
Relative to Pandas:  1           0.589

Task 9

This SQL query selects unique rows from the year and seats columns in the planes table where the year is greater than or equal to 2012. The returned data is sorted in ascending order by year and descending order by seats.

My pandas command: - Create a logical vector for slicing by checking rows in the year column in the planes dataframe where the value is greater than or equal to 2012. This logical vector is then used to slice the planes data frame. Drop duplicates corresponding to the year and seat columns in the resulting data frame. Then select the year and seats columns. Sort the dataframe by year in ascending order, followed by seats in descending order, and reset the index.

# SQL result
task9_sql = pd.read_sql_query("""SELECT DISTINCT year, seats FROM planes 
    WHERE year >= 2012 ORDER BY year ASC, seats DESC""", conn)

# Pandas result with timing
t9_st = timeit.default_timer()

task9_my = planes[(planes["year"] >= 2012)].drop_duplicates(
    subset=["year", "seats"])[["year", "seats"]].sort_values(
    by=["year", "seats"], ascending=[True, False]).reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task9_my.head(), "\n")

t9_et = timeit.default_timer()
t9_dur = t9_et - t9_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task9_sql, task9_my)

Head of the Pandas data frame resulting from the query:
     year  seats
0  2012.0    379
1  2012.0    377
2  2012.0    260
3  2012.0    222
4  2012.0    200 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t9_st_pl = timeit.default_timer()

task9_pl = planes_pl.filter(pl.col("year") >= 2012).unique(
    subset=["year", "seats"]).select(["year", "seats"]).sort(by=["year","seats"], descending=[False, True])
t9_et_pl = timeit.default_timer()

t9_dur_pl = t9_et_pl - t9_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task9_my.reset_index(drop=True),
                        task9_pl.to_pandas().reset_index(drop=True), name="Polars")

print_execution_data(t9_dur, t9_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002258  0.003938
Relative to Pandas:  1           1.74

Task 10

This SQL query selects unique rows from the year and seats columns in the planes table where the year is greater than or equal to 2012. The returned data is sorted in descending order by seats, then ascending order by year.

My pandas command: - Create a logical vector for slicing by checking rows in the year column in the planes dataframe where the value is greater than or equal to 2012. This logical vector is then used to slice the planes dataframe. In the resulting dataframe, drop duplicates corresponding to the year and seats columns. Then select the year and seats columns. Then sort the dataframe by seats in descending order, followed by year in ascending order, and reset the index.

# SQL result
task10_sql = pd.read_sql_query("""SELECT DISTINCT year, seats FROM planes WHERE year >= 2012 ORDER BY seats DESC, year ASC""", conn)

# Pandas result with timing
t10_st = timeit.default_timer()

task10_my = planes[(planes['year'] >= 2012)].drop_duplicates(
    subset=["year", "seats"])[["year", "seats"]].sort_values(
    by=["seats", "year"], ascending= [False,True]).reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task10_my.head(), "\n")

t10_et = timeit.default_timer()
t10_dur = t10_et - t10_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task10_sql, task10_my)

Head of the Pandas data frame resulting from the query:
     year  seats
0  2012.0    379
1  2013.0    379
2  2012.0    377
3  2013.0    377
4  2012.0    260 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t10_st_pl = timeit.default_timer()

task10_pl = planes_pl.filter(pl.col("year") >= 2012).unique(
    subset=["year", "seats"]).select(["year", "seats"]).sort(by=["seats", "year"], descending=[True, False])

t10_et_pl = timeit.default_timer()
t10_dur_pl = t10_et_pl - t10_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task10_my.reset_index(drop=True),
                        task10_pl.to_pandas().reset_index(drop=True),
                        name="Polars")

print_execution_data(t10_dur, t10_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002203  0.003303
Relative to Pandas:  1           1.5

Task 11

This SQL query counts the number of planes with the number of seats above 200 and groups them by manufacturer. Showing how many planes each manufacturer has where the seats are greater than 200.

My pandas command - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is greater than 200. This logical vector is then used to slice the planes dataframe. The resulting dataframe is then grouped by the manufacturer, and the size of each group is evaluated. I reset the index at the same time as renaming the count column to COUNT(*).

# SQL result
task11_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes 
    WHERE seats > 200 GROUP BY manufacturer""", conn)

# Pandas result with timing
t11_st = timeit.default_timer()

task11_my = planes[(planes["seats"] > 200)].groupby("manufacturer").size().reset_index(name="COUNT(*)")

print("Head of the Pandas data frame resulting from the query:")
print(task11_my.head(), "\n")

t11_et = timeit.default_timer()
t11_dur = t11_et - t11_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task11_sql, task11_my)

Head of the Pandas data frame resulting from the query:
       manufacturer  COUNT(*)
0            AIRBUS        66
1  AIRBUS INDUSTRIE         4
2            BOEING       225 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t11_st_pl = timeit.default_timer()

task11_pl = planes_pl.filter(pl.col("seats") > 200).group_by("manufacturer").agg(
    pl.len().alias("COUNT(*)"))

t11_et_pl = timeit.default_timer()
t11_dur_pl = t11_et_pl - t11_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task11_pl = task11_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task11_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        task11_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t11_dur, t11_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001559  0.001686
Relative to Pandas:  1           1.08

Task 12

This SQL query counts the number of planes for each manufacturer and selects those manufacturers with more than 10 planes. The manufacture and count are returned.

My pandas command: - Group the rows of the planes dataframe by the manufacturer and retrieve the size of each group. Reset the index and rename the index column to count. Use the query function to return only those rows where the count is greater than 10. Rename the count column to COUNT(*) and reset the index.

# SQL result
task12_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes 
GROUP BY manufacturer HAVING COUNT(*) > 10""", conn)

# Pandas result with timing
t12_st = timeit.default_timer()

task12_my = planes.groupby("manufacturer").size().reset_index(name="count").query(
    "count > 10").rename(columns={"count": "COUNT(*)"}).reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task12_my.head(), "\n")

t12_et = timeit.default_timer()
t12_dur = t12_et - t12_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task12_sql, task12_my)

Head of the Pandas data frame resulting from the query:
       manufacturer  COUNT(*)
0            AIRBUS       336
1  AIRBUS INDUSTRIE       400
2            BOEING      1630
3    BOMBARDIER INC       368
4           EMBRAER       299 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t12_st_pl = timeit.default_timer()

task12_pl = planes_pl.group_by("manufacturer").agg(pl.len().alias("COUNT(*)")).filter(
    pl.col("COUNT(*)") > 10)

t12_et_pl = timeit.default_timer()
t12_dur_pl = t12_et_pl - t12_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task12_pl = task12_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task12_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        task12_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t12_dur, t12_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002903  0.004606
Relative to Pandas:  1           1.59

Task 13

This SQL query will show the number of manufacturers that have more than 10 planes with a seat capacity greater than 200.

My pandas command: - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is greater than 200. Slice the planes dataframe using this logical vector, then group the rows by the manufacturer column. Retrieve the size of each group. Reset the index and rename the index column to count. Use the query function to return only those rows where the count is greater than 10. Rename the count column to COUNT(*) and reset the index

# SQL result
task13_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes 
    WHERE seats > 200 GROUP BY manufacturer HAVING COUNT(*) > 10""", conn)

# Pandas result with timing
t13_st = timeit.default_timer()

task13_my = planes[planes["seats"] > 200].groupby("manufacturer").size().reset_index(
    name="count").query("count >10").rename(columns={"count": "COUNT(*)"}).reset_index(drop=True)

print("Head of the Pandas data frame resulting from the query:")
print(task13_my.head(), "\n")

t13_et = timeit.default_timer()
t13_dur = t13_et - t13_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task13_sql, task13_my)

Head of the Pandas data frame resulting from the query:
  manufacturer  COUNT(*)
0       AIRBUS        66
1       BOEING       225 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t13_st_pl = timeit.default_timer()

task13_pl = planes_pl.filter(pl.col("seats") > 200).group_by("manufacturer").agg(pl.len().alias("COUNT(*)")).filter(
    pl.col("COUNT(*)") > 10)

t13_et_pl = timeit.default_timer()
t13_dur_pl = t13_et_pl - t13_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task13_pl = task13_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task13_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        task13_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t13_dur, t13_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.002486  0.00208
Relative to Pandas:  1           0.837

Task 14

This SQL query shows the top 10 manufacturers by the number of planes they have in the planes table.

My pandas command: - Group the rows in the planes dataframe by the manufacturer column. Retrieve the size of each group. Reset the index and rename the index column to howmany. Sort the dataframe by the howmany column in descending order. Reset the index and display the first 10 rows.

# SQL result
task14_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) AS howmany FROM planes
    GROUP BY manufacturer ORDER BY howmany DESC LIMIT 10""", conn)

# Pandas result with timing
t14_st = timeit.default_timer()

task14_my = planes.groupby("manufacturer").size().reset_index(
    name="howmany").sort_values(by="howmany", ascending=False).reset_index(drop=True).head(10)

print("Head of the Pandas data frame resulting from the query:")
print(task14_my.head(), "\n")

t14_et = timeit.default_timer()
t14_dur = t14_et - t14_st

# Comparing the result from SQL and Pandas
test_dataframe_equality(task14_sql, task14_my)

Head of the Pandas data frame resulting from the query:
       manufacturer  howmany
0            BOEING     1630
1  AIRBUS INDUSTRIE      400
2    BOMBARDIER INC      368
3            AIRBUS      336
4           EMBRAER      299 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t14_st_pl = timeit.default_timer()

task14_pl = planes_pl.group_by("manufacturer").agg(pl.len().alias("howmany")).sort(
    "howmany", descending=True).limit(10)

t14_et_pl = timeit.default_timer()
t14_dur_pl = t14_et_pl - t14_st_pl

# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task14_pl = task14_pl.with_columns(pl.col("howmany").cast(pl.Int64))

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task14_my.sort_values(by=["manufacturer", "howmany"]).reset_index(drop=True),
                        task14_pl.to_pandas().sort_values(by=["manufacturer", "howmany"]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t14_dur, t14_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.001631  0.006682
Relative to Pandas:  1           4.1

Task 15

This SQL query returns all rows and columns from the flights table and columns year, speed, and seats from the planes as plane_year, plane_speed, and plane_seat. Any row where the tailnum from flights matches the tailnum in planes the value from the year, speed and seats columns from planes will be added to those columns for that row in the output. If no matching tailnum is found in the planes table, then NULL is returned for the plane_year, plane_speed, and plane_seat for that row.

My pandas command: - Select the tailnum, year, speed, and seats columns from planes and assign them to temp15. - Perform a left merge of the flights and temp15 dataframes using tailnum column. Disable suffixes for flights and set to _p for the planes columns. This will prevent columns in both flights and planes from being renamed in flights with a suffix. The planes column will be added with the suffix _p. Finally, rename the year, speed, and seats columns with plane_ prefix.

# SQL result
task15_sql = pd.read_sql_query("""SELECT flights.*, planes.year AS plane_year, 
    planes.speed AS plane_speed, planes.seats AS plane_seats FROM flights
    LEFT JOIN planes ON flights.tailnum=planes.tailnum""", conn)

# Pandas result with timing
t15_st = timeit.default_timer()

# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp15 = planes[["tailnum", "year", "speed", "seats"]]
task15_my = pd.merge(flights, temp15,
    on="tailnum", how="left", suffixes=(None, "_p")).rename(
    columns={"year_p":"plane_year","speed":"plane_speed", "seats":"plane_seats"})

print("Head of the Pandas data frame resulting from the query:")
print(task15_my.head(), "\n")

t15_et = timeit.default_timer()
t15_dur = t15_et - t15_st

# Comparing the result from SQL and Pandas
# This command will generate a future warning as seen below.
# This can be resoled by ensuring all Null type (None, NA, NAN) are replaced with np.nan 
print("This feature warning can be and has been disable in subsequent tasks by making the null like values consistent in the dataframes being compared.")
test_dataframe_equality(task15_sql, task15_my)

# Make null type values consistent
task15_sql.fillna(value=np.nan, inplace=True)
task15_my.fillna(value=np.nan, inplace=True)

# Assert after making null type values consistent
pd.testing.assert_frame_equal(task15_sql, task15_my)

Head of the Pandas data frame resulting from the query:
   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
0  2013      1    1     517.0             515        2.0     830.0   
1  2013      1    1     533.0             529        4.0     850.0   
2  2013      1    1     542.0             540        2.0     923.0   
3  2013      1    1     544.0             545       -1.0    1004.0   
4  2013      1    1     554.0             600       -6.0     812.0   

   sched_arr_time  arr_delay carrier  ...  origin dest air_time distance  \
0             819       11.0      UA  ...     EWR  IAH    227.0     1400   
1             830       20.0      UA  ...     LGA  IAH    227.0     1416   
2             850       33.0      AA  ...     JFK  MIA    160.0     1089   
3            1022      -18.0      B6  ...     JFK  BQN    183.0     1576   
4             837      -25.0      DL  ...     LGA  ATL    116.0      762   

   hour  minute            time_hour  plane_year plane_speed  plane_seats  
0     5      15  2013-01-01 05:00:00      1999.0         NaN        149.0  
1     5      29  2013-01-01 05:00:00      1998.0         NaN        149.0  
2     5      40  2013-01-01 05:00:00      1990.0         NaN        178.0  
3     5      45  2013-01-01 05:00:00      2012.0         NaN        200.0  
4     6       0  2013-01-01 06:00:00      1991.0         NaN        178.0  

[5 rows x 22 columns] 

This feature warning can be and has been disable in subsequent tasks by making the null like values consistent in the dataframes being compared.

C:\Users\darrin\AppData\Local\Temp\ipykernel_21080\4104427652.py:12: FutureWarning:

Mismatched null-like values None and nan found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t15_st_pl = timeit.default_timer()

# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp15_pl = planes_pl.select(["tailnum", "year", "speed", "seats"])
task15_pl = flights_pl.join(temp15_pl, on=["tailnum"], how="left").rename({"year_right":"plane_year","speed":"plane_speed", "seats":"plane_seats"})

t15_et_pl = timeit.default_timer()
t15_dur_pl = t15_et_pl - t15_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
# Make null types consistent
task15_pl_pd = task15_pl.to_pandas()
task15_pl_pd.fillna(value=np.nan, inplace=True)
test_dataframe_equality(task15_my.reset_index(drop=True),
                        task15_pl_pd.reset_index(drop=True), name="Polars")

print_execution_data(t15_dur, t15_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.239  0.01319
Relative to Pandas:  1           0.0552

Task 16

This SQL query returns a dataframe that includes all the columns from both planes and airlines, with only the unique carrier and tailnum combinations that exist in the flight dataframe.

My pandas command: - From the flights dataframe select the carrier and tailnum columns and drop duplicate rows. Join the result of the previous step with the planes dataframe on the tailnum column using an inner join and assign to temp16. Then join temp16 with the airlines dataframe on the carrier column using an inner join.

# SQL result
task16_sql = pd.read_sql_query("""SELECT planes.*, airlines.* 
FROM (SELECT DISTINCT carrier, tailnum FROM flights) AS cartail
INNER JOIN planes ON cartail.tailnum=planes.tailnum
INNER JOIN airlines ON cartail.carrier=airlines.carrier""", conn)

# Sort needed for comparison with pandas result
task16_sql = task16_sql.sort_values(by=["tailnum", "carrier"], ascending=[True, True]).reset_index(drop=True)

# Pandas result with timing
t16_st = timeit.default_timer()

# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp16 = pd.merge(flights[["carrier", "tailnum"]].drop_duplicates(), planes, on="tailnum", how="inner")
task16_my = pd.merge(temp16, airlines, on="carrier", how="inner")

print("Head of the Pandas data frame resulting from the query:")
print(task16_my.head(), "\n")

t16_et = timeit.default_timer()
t16_dur = t16_et - t16_st

# Create a copy to use with polars comparison
task16_pd= task16_my.copy()

# Move the carrier column and sort for comparison with SQL result 
task16_my.insert(9, "carrier", task16_my.pop("carrier"))
task16_my=task16_my.sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True)

# Comparing the result from SQL and Pandas
test_dataframe_equality(task16_sql, task16_my)

Head of the Pandas data frame resulting from the query:
  carrier tailnum    year                     type manufacturer     model  \
0      UA  N14228  1999.0  Fixed wing multi engine       BOEING   737-824   
1      UA  N24211  1998.0  Fixed wing multi engine       BOEING   737-824   
2      AA  N619AA  1990.0  Fixed wing multi engine       BOEING   757-223   
3      B6  N804JB  2012.0  Fixed wing multi engine       AIRBUS  A320-232   
4      DL  N668DN  1991.0  Fixed wing multi engine       BOEING   757-232   

   engines  seats  speed     engine                    name  
0        2    149    NaN  Turbo-fan   United Air Lines Inc.  
1        2    149    NaN  Turbo-fan   United Air Lines Inc.  
2        2    178    NaN  Turbo-fan  American Airlines Inc.  
3        2    200    NaN  Turbo-fan         JetBlue Airways  
4        2    178    NaN  Turbo-fan    Delta Air Lines Inc.   

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t16_st_pl = timeit.default_timer()

# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp16_pl = flights_pl.select(["carrier", "tailnum"]).unique().join(planes_pl, on=["tailnum"], how="inner")
task16_pl = temp16_pl.join(airlines_pl, on=["carrier"], how="inner")

t16_et_pl = timeit.default_timer()
t16_dur_pl = t16_et_pl - t16_st_pl

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task16_pd.sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True),
                        task16_pl.to_pandas().sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True),
                        name="Polars")

print_execution_data(t16_dur, t16_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.1513  0.02181
Relative to Pandas:  1           0.144

Task 17

This SQL query retrieves flight data originating from EWR and joins it with averaged daily weather from the same location.

My pandas command: - From the flights dataframe only select rows where the origin column equals ‘EWR’ and assign the result to flights2 - From the weather dataframe, select only rows where the origin column equal ‘EWR’. Then group rows by the year, month, and day. Perform averaging on the temp and humidity columns. In the resulting dataframe, rename the temp and humidity columns to atemp and ahumid. Reset the dataframe index and assign the result to weather2. - Merge the weather2 into the flights2 dataframe from the left on the year, month and days columns. A left merge will result in all the rows of flights2 being copied into the result. If there is no matching atemp or ahumid for any rows in flights2, then Nan will be inserted into these columns for the rows.

# SQL query
task17_sql = pd.read_sql_query("""SELECT flights2.*, atemp, ahumid FROM ( SELECT * FROM flights WHERE origin='EWR') AS flights2
LEFT JOIN (
SELECT year, month, day, AVG(temp) AS atemp, AVG(humid) AS ahumid FROM weather WHERE origin='EWR' GROUP BY year, month, day
) AS weather2
ON flights2.year=weather2.year
AND flights2.month=weather2.month
AND flights2.day=weather2.day""", conn)

# Pandas result with timing
t17_st = timeit.default_timer()
# From the flights dataframe, select only rows where the origin column equals EWR and assign the result to flights2 
flights2 = flights[(flights.origin=='EWR')]

# From the weather dataframe, select only rows where the origin column equals EWR. Then group rows by year, month and day. Perform averaging on the temp and humidity columns.
# Rename the temp and humidity columns in the resulting data frame as atemp and ahumid. Reset the dataframe index and assign the result to weather2.
weather2 = weather[weather.origin=="EWR"].groupby(["year","month","day"])[["temp","humid"]].mean(numeric_only=True).rename(columns={'temp':'atemp','humid':'ahumid'}).reset_index()

# Merge the weather2 dataframe into the flights2 dataframe from the left on the year, month and days columns. Left merge will result in all the rows of flights2 will be in the result.
# If there is no matching atemp or ahumid for any rows in flights2, then a NaN will be inserted into the columns for those rows.
task17_my = pd.merge(flights2, weather2, on=['year', 'month', 'day'], how='left')

print("Head of the Pandas data frame resulting from the query:")
print(task17_my.head(), "\n")

t17_et = timeit.default_timer()
t17_dur = t17_et - t17_st

# Make null type values consistent - see Task 15 for description
task17_sql.fillna(value=np.nan, inplace=True)
task17_my.fillna(value=np.nan, inplace=True)

# Comparing the result from SQL and Pandas
test_dataframe_equality(task17_sql, task17_my)

Head of the Pandas data frame resulting from the query:
   year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
0  2013      1    1     517.0             515        2.0     830.0   
1  2013      1    1     554.0             558       -4.0     740.0   
2  2013      1    1     555.0             600       -5.0     913.0   
3  2013      1    1     558.0             600       -2.0     923.0   
4  2013      1    1     559.0             600       -1.0     854.0   

   sched_arr_time  arr_delay carrier  ...  tailnum origin dest air_time  \
0             819       11.0      UA  ...   N14228    EWR  IAH    227.0   
1             728       12.0      UA  ...   N39463    EWR  ORD    150.0   
2             854       19.0      B6  ...   N516JB    EWR  FLL    158.0   
3             937      -14.0      UA  ...   N53441    EWR  SFO    361.0   
4             902       -8.0      UA  ...   N76515    EWR  LAS    337.0   

   distance  hour  minute            time_hour  atemp     ahumid  
0      1400     5      15  2013-01-01 05:00:00  38.48  58.386087  
1       719     5      58  2013-01-01 05:00:00  38.48  58.386087  
2      1065     6       0  2013-01-01 06:00:00  38.48  58.386087  
3      2565     6       0  2013-01-01 06:00:00  38.48  58.386087  
4      2227     6       0  2013-01-01 06:00:00  38.48  58.386087  

[5 rows x 21 columns] 

Outcome of comparison: SQL and Pandas results are equal

# Polars result with timing
t17_st_pl = timeit.default_timer()

flights2_pl = flights_pl.filter(pl.col("origin") == "EWR")
weather2_pl = (weather_pl.filter(pl.col("origin") == "EWR").group_by(["year", "month", "day"]).agg([pl.col("temp").mean().alias("atemp"),
                pl.col("humid").mean().alias("ahumid")]))
task17_pl=flights2_pl.join(weather2_pl, on=["year", "month", "day"], how="left")

t17_et_pl = timeit.default_timer()
t17_dur_pl = t17_et_pl - t17_st_pl

# Make null type values consistent - see Task 15 for description
task17_pl_pd = task17_pl.to_pandas()
task17_pl_pd.fillna(value=np.nan, inplace=True)

# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task17_my, task17_pl_pd, name="Polars")

print_execution_data(t17_dur, t17_dur_pl)

Outcome of comparison: Polars and Pandas results are equal

Metric                Pandas    Polars
Execution time [s]:  0.1417  0.01958
Relative to Pandas:  1           0.138

# Close database connection
conn.close()

Summary

This Jupyter Notebook demonstrates the use of pandas and polars for performing equivalent SQL queries. For cases where the queries were simple, the pandas queries were faster (up to 2x) than polars. However, for cases where the queries were complex, such as the joins, polars was faster (up tp 11x) than pandas.