import os
import sqlite3
import pandas as pd
import numpy as np
import polars as pl
import tempfile
import timeitPandas/Polars versus SQL
Introduction
This task requires working with pandas to write code that produces equivalent output to various SQL queries. The nycflights13 dataset is used for this exercise. The dataset gives information about all 336,776 flights that departed from three New York airports in 2013 to destinations in the United States, Puerto Rico, and the American Virgin Islands. Additionally, equivalent queries were also written using the polars library, and the execution time for the pandas and polars libraries was calculated and compared.
Database connection
An SQLite3 database file is created on the local disk for use during these tasks.
#
# Create a database file in a temporary location
dbfile = os.path.join(tempfile.mkdtemp(), "nycflights.db")
print(dbfile)
# Connect to database
conn = sqlite3.connect(dbfile)C:\Users\darrin\AppData\Local\Temp\tmpll_hbexh\nycflights.db
Load data files and export to database
Compressed csv files containing data about New York City flights, airlines, airports, planes and weather are loaded from the local disk into pandas dataframes. The pandas dataframes are exported as tables to the SQLite3 database file.
Polars dataframes (having _pl suffix on the name) are also created from the pandas dataframes. These will be used for testing the polars library in the tasks below.
flights = pd.read_csv("nycflights13_flights.csv.gz", comment="#")
flights_pl = pl.from_pandas(flights)airlines = pd.read_csv("nycflights13_airlines.csv.gz", comment="#")
airlines_pl = pl.from_pandas(airlines)airports = pd.read_csv("nycflights13_airports.csv.gz", comment="#")
airports_pl = pl.from_pandas(airports)planes = pd.read_csv("nycflights13_planes.csv.gz", comment="#")
planes_pl = pl.from_pandas(planes)weather = pd.read_csv("nycflights13_weather.csv.gz", comment="#")
weather_pl = pl.from_pandas(weather)# Export dataframes to the database
flights.to_sql("flights", conn, index=False)
airlines.to_sql("airlines", conn, index=False)
airports.to_sql("Badges", conn, index=False)
planes.to_sql("planes", conn, index=False)
weather.to_sql("weather", conn, index=False)26130
SQL queries and equivalent Pandas and Polars queries
This section contains 17 different SQL queries and the pandas and polars code to give equivalent output.
# Helper function to testing equality between dataframes
def test_dataframe_equality(df1, df2, name="SQL"):
"""
Function to test if two dataframes are equal.
:param df1: First pandas dataframe.
:param df2: Second pandas dataframe.
:param name: Name of the method used to generate the first pandas dataframe.
:return: None.
"""
try:
pd.testing.assert_frame_equal(df1, df2)
print(f"Outcome of comparison: {name} and Pandas results are equal")
except AssertionError as e:
print(f"Outcome of comparison: {name} and Pandas results are not equal: {e}")# Helper function to print execution times
def print_execution_data(pd_dur, pl_dur):
"""
Function to print a table comparing Pandas to Polars execution time.
:param pd_dur: pandas execution duration.
:param pl_dur: polars execution duration.
:return: None.
"""
print()
print("Metric Pandas Polars")
print(f"Execution time [s]: {pd_dur:.4g} {pl_dur:.4g}")
print(f"Relative to Pandas: {1} {pl_dur/pd_dur:.3g}")Task 1
This SQL query selects the unique engines in the engines column from the planes table.
My pandas command: - In the planes dataframe, drop the duplicate rows by considering the engines column. Then select only the engines column and reset the index.
# SQL result
task1_sql = pd.read_sql_query("""SELECT DISTINCT engine FROM planes""", conn)
# Pandas result with timing
t1_st = timeit.default_timer()
task1_my = planes.drop_duplicates(subset="engine")[["engine"]].reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task1_my.head(), "\n")
t1_et = timeit.default_timer()
t1_dur = t1_et - t1_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task1_sql, task1_my)Head of the Pandas data frame resulting from the query:
engine
0 Turbo-fan
1 Turbo-jet
2 Reciprocating
3 4 Cycle
4 Turbo-shaft
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t1_st_pl = timeit.default_timer()
task1_pl = planes_pl.unique(subset=["engine"]).select(["engine"])
t1_et_pl = timeit.default_timer()
t1_dur_pl = t1_et_pl - t1_st_pl
# Checking the Pandas result == Polars result. We need to covert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task1_my.sort_values(by=["engine"]).reset_index(drop=True),
task1_pl.to_pandas().sort_values(by=["engine"]).reset_index(drop=True),
name="Polars")
print_execution_data(t1_dur, t1_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002907 0.003202
Relative to Pandas: 1 1.1
Task 2
This SQL query selects rows with the unique combination of type and engines from the planes table.
My pandas command: - In the planes dataframe, drop the duplicate rows by considering the type and engine columns. Then select only the type and engine columns and reset the index.
# SQL result
task2_sql = pd.read_sql_query("""SELECT DISTINCT type, engine FROM planes""", conn)
# Pandas result with timing
t2_st = timeit.default_timer()
task2_my = planes.drop_duplicates(subset=["type", "engine"])[["type", "engine"]].reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task2_my.head(), "\n")
t2_et = timeit.default_timer()
t2_dur = t2_et - t2_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task2_sql, task2_my)Head of the Pandas data frame resulting from the query:
type engine
0 Fixed wing multi engine Turbo-fan
1 Fixed wing multi engine Turbo-jet
2 Fixed wing single engine Reciprocating
3 Fixed wing multi engine Reciprocating
4 Fixed wing single engine 4 Cycle
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t2_st_pl = timeit.default_timer()
task2_pl = planes_pl.unique(subset=["type", "engine"]).select(["type", "engine"])
t2_et_pl = timeit.default_timer()
t2_dur_pl = t2_et_pl - t2_st_pl
# Checking the Pandas result == Polars result. We need to covert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task2_my.sort_values(by=["type", "engine"]).reset_index(drop=True),
task2_pl.to_pandas().sort_values(by=["type", "engine"]).reset_index(drop=True),
name="Polars")
print_execution_data(t2_dur, t2_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001832 0.001769
Relative to Pandas: 1 0.965
Task 3
This SQL query counts how many planes there are in the planes table for each engine type and returns the count along with the engine type
My pandas command: - In the planes dataframe, group the rows by the engine column and get the size of each group. Then reset the index and rename its column to COUNT(*). Select the columns in order to match the SQL query result.
# SQL result
task3_sql = pd.read_sql_query("""SELECT COUNT(*), engine FROM planes GROUP BY engine""", conn)
# Pandas result with timing
t3_st = timeit.default_timer()
task3_my = planes.groupby("engine").size().reset_index(name="COUNT(*)")[["COUNT(*)", "engine"]]
print("Head of the Pandas data frame resulting from the query:")
print(task3_my.head(), "\n")
t3_et = timeit.default_timer()
t3_dur = t3_et - t3_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task3_sql, task3_my)Head of the Pandas data frame resulting from the query:
COUNT(*) engine
0 2 4 Cycle
1 28 Reciprocating
2 2750 Turbo-fan
3 535 Turbo-jet
4 2 Turbo-prop
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t3_st_pl = timeit.default_timer()
task3_pl = planes_pl.group_by("engine").agg(pl.len().alias("COUNT(*)")).select(["COUNT(*)", "engine"])
t3_et_pl = timeit.default_timer()
t3_dur_pl = t3_et_pl - t3_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task3_pl = task3_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task3_my.sort_values(by=["engine"]).reset_index(drop=True),
task3_pl.to_pandas().sort_values(by=["engine"]).reset_index(drop=True),
name="Polars")
print_execution_data(t3_dur, t3_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001778 0.00439
Relative to Pandas: 1 2.47
Task 4
This SQL query counts the number of planes in the planes table for each engine and plane type group and returns the count along with the engine type and plane type.
My pandas command - In the planes dataframe, group the rows by the engine and type columns and get the size of each group. Then reset the index and rename its column to COUNT(*). Then select the columns in order to match the SQL query result.
# SQL result
task4_sql = pd.read_sql_query("""SELECT COUNT(*), engine, type FROM planes GROUP BY engine, type""", conn)
# Pandas result with timing
t4_st = timeit.default_timer()
task4_my = planes.groupby(["engine", "type"]).size().reset_index(name="COUNT(*)")[["COUNT(*)", "engine", "type"]]
print("Head of the Pandas data frame resulting from the query:")
print(task4_my.head(), "\n")
t4_et = timeit.default_timer()
t4_dur = t4_et - t4_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task4_sql, task4_my)Head of the Pandas data frame resulting from the query:
COUNT(*) engine type
0 2 4 Cycle Fixed wing single engine
1 5 Reciprocating Fixed wing multi engine
2 23 Reciprocating Fixed wing single engine
3 2750 Turbo-fan Fixed wing multi engine
4 535 Turbo-jet Fixed wing multi engine
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t4_st_pl = timeit.default_timer()
task4_pl = planes_pl.group_by(["engine", "type"]).agg(pl.len().alias("COUNT(*)")).select(["COUNT(*)", "engine", "type"])
t4_et_pl = timeit.default_timer()
t4_dur_pl = t4_et_pl - t4_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task4_pl = task4_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task4_my.sort_values(by=["engine", "type"]).reset_index(drop=True),
task4_pl.to_pandas().sort_values(by=["engine", "type"]).reset_index(drop=True),
name="Polars")
print_execution_data(t4_dur, t4_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002324 0.002474
Relative to Pandas: 1 1.06
Task 5
This SQL query gives the minimum, average and maximum years for each unique combination of engine and manufacturer.
My pandas command: - For the planes dataframe, group the rows by engine and manufacturer. Select the year column and perform aggregation for the minimum, mean and maximum values. Then reset the index and select the min, mean, max, engine and manufacture columns to return. Rename the aggregation columns to match the SQL result.
# SQL result
task5_sql = pd.read_sql_query("""SELECT MIN(year), AVG(year), MAX(year), engine, manufacturer
FROM planes GROUP BY engine, manufacturer""", conn)
# Pandas result with timing
t5_st = timeit.default_timer()
task5_my = planes.groupby(["engine", "manufacturer"]).year.agg(
["min", "mean", "max"]).reset_index()[["min", "mean", "max", "engine", "manufacturer"]].rename(
columns={"min": "MIN(year)", "mean": "AVG(year)", "max": "MAX(year)"})
print("Head of the Pandas data frame resulting from the query:")
print(task5_my.head(), "\n")
t5_et = timeit.default_timer()
t5_dur = t5_et - t5_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task5_sql, task5_my)Head of the Pandas data frame resulting from the query:
MIN(year) AVG(year) MAX(year) engine manufacturer
0 1975.0 1975.0 1975.0 4 Cycle CESSNA
1 NaN NaN NaN 4 Cycle JOHN G HESS
2 NaN NaN NaN Reciprocating AMERICAN AIRCRAFT INC
3 2007.0 2007.0 2007.0 Reciprocating AVIAT AIRCRAFT INC
4 NaN NaN NaN Reciprocating BARKER JACK L
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t5_st_pl = timeit.default_timer()
task5_pl = planes_pl.group_by(["engine", "manufacturer"]).agg(
[pl.col("year").min().alias("MIN(year)"), pl.col("year").mean().alias("AVG(year)"),
pl.col("year").max().alias("MAX(year)")]).select(["MIN(year)", "AVG(year)", "MAX(year)", "engine", "manufacturer"])
t5_et_pl = timeit.default_timer()
t5_dur_pl = t5_et_pl - t5_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task5_my.sort_values(by=["engine", "manufacturer"]).reset_index(drop=True),
task5_pl.to_pandas().sort_values(by=["engine", "manufacturer"]).reset_index(drop=True),
name="Polars")
print_execution_data(t5_dur, t5_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.003621 0.006494
Relative to Pandas: 1 1.79
Task 6
This SQL query selects rows from the planes table where the speed column does not have a null value.
My pandas command: - Creates a logical vector for slicing by checking if the speed column in the planes dataframe is not Null. This slice is used with the planes dataframe to select those rows where the speed was not Null. Finally, the index is reset.
# SQL result
task6_sql = pd.read_sql_query("""SELECT * FROM planes WHERE speed IS NOT NULL""", conn)
# Pandas result with timing
t6_st = timeit.default_timer()
task6_my = planes[planes["speed"].notnull()].reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task6_my.head(), "\n")
t6_et = timeit.default_timer()
t6_dur = t6_et - t6_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task6_sql, task6_my)Head of the Pandas data frame resulting from the query:
tailnum year type manufacturer model engines \
0 N201AA 1959.0 Fixed wing single engine CESSNA 150 1
1 N202AA 1980.0 Fixed wing multi engine CESSNA 421C 2
2 N350AA 1980.0 Fixed wing multi engine PIPER PA-31-350 2
3 N364AA 1973.0 Fixed wing multi engine CESSNA 310Q 2
4 N378AA 1963.0 Fixed wing single engine CESSNA 172E 1
seats speed engine
0 2 90.0 Reciprocating
1 8 90.0 Reciprocating
2 8 162.0 Reciprocating
3 6 167.0 Reciprocating
4 4 105.0 Reciprocating
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t6_st_pl = timeit.default_timer()
task6_pl = planes_pl.filter(pl.col("speed").is_not_null()).sort(by=[])
t6_et_pl = timeit.default_timer()
t6_dur_pl = t6_et_pl - t6_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task6_my.sort_values(by=["speed"]).reset_index(drop=True),
task6_pl.to_pandas().sort_values(by=["speed"]).reset_index(drop=True),
name="Polars")
print_execution_data(t6_dur, t6_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001803 0.0008909
Relative to Pandas: 1 0.494
Task 7
This SQL query selects rows from the tailnum column in the planes table with a number of seats between 150 and 210 and a year greater than or equal to 2011.
My pandas command - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is between 150 and 210, and checking rows where the year column in the planes dataframe has a value equal to or greater than 2011. This logical vector is then used to slice the planes data frame. Then only the tailnum column is returned, and the index is reset.
# SQL result
task7_sql = pd.read_sql_query("""SELECT tailnum FROM planes WHERE seats BETWEEN 150 AND 210 AND year >= 2011""", conn)
# Pandas result with timing
t7_st = timeit.default_timer()
task7_my = planes[(planes["seats"].between(150,210) & (planes["year"] >= 2011))][["tailnum"]].reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task7_my.head(), "\n")
t7_et = timeit.default_timer()
t7_dur = t7_et - t7_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task7_sql, task7_my)Head of the Pandas data frame resulting from the query:
tailnum
0 N150UW
1 N151UW
2 N152UW
3 N153UW
4 N154UW
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t7_st_pl = timeit.default_timer()
task7_pl = planes_pl.filter((pl.col("seats").is_between(150,210)) &
(pl.col("year") >= 2011)).select(["tailnum"]).sort(by=[])
t7_et_pl = timeit.default_timer()
t7_dur_pl = t7_et_pl - t7_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task7_my.sort_values(by=["tailnum"]).reset_index(drop=True),
task7_pl.to_pandas().sort_values(by=["tailnum"]).reset_index(drop=True),
name="Polars")
print_execution_data(t7_dur, t7_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001393 0.001259
Relative to Pandas: 1 0.904
Task 8
This SQL query selects rows from the tailnum, manufacturer, and seats columns in the planes table where the manufacturer is one of ‘BOEING’, ‘AIRBUS’, or ‘EMBRAER’ and the number of seats is greater than 390.
My pandas command: - Create a logical vector for slicing by checking rows in the manufacturer column in the planes dataframe where the value is one of ‘BOEING’, ‘AIRBUS’, ‘EMBRAER’ and checking rows in the seats column in the planes dataframe where the capacity is greater than 390. This logical vector is then used to slice the planes data frame. Then the tailnum, manufacturer and seats columns are returned, and the index is reset.
# SQL result
task8_sql = pd.read_sql_query("""SELECT tailnum, manufacturer, seats FROM planes
WHERE manufacturer IN ('BOEING', 'AIRBUS', 'EMBRAER') AND seats>390""", conn)
# Pandas result with timing
t8_st = timeit.default_timer()
task8_my = planes[(planes["manufacturer"].isin(["BOEING", "AIRBUS", "EMBRAER"])) &
(planes["seats"] > 390)][["tailnum", "manufacturer", "seats"]].reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task8_my.head(), "\n")
t8_et = timeit.default_timer()
t8_dur = t8_et - t8_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task8_sql, task8_my)Head of the Pandas data frame resulting from the query:
tailnum manufacturer seats
0 N206UA BOEING 400
1 N228UA BOEING 400
2 N272AT BOEING 400
3 N57016 BOEING 400
4 N670US BOEING 450
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t8_st_pl = timeit.default_timer()
task8_pl = planes_pl.filter((pl.col("manufacturer").is_in(["BOEING", "AIRBUS", "EMBRAER"])) &
(pl.col("seats") > 390)).select(["tailnum", "manufacturer", "seats"]).sort(by=[])
t8_et_pl = timeit.default_timer()
t8_dur_pl = t8_et_pl - t8_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task8_my.reset_index(drop=True),
task8_pl.to_pandas().reset_index(drop=True), name="Polars")
print_execution_data(t8_dur, t8_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001728 0.001018
Relative to Pandas: 1 0.589
Task 9
This SQL query selects unique rows from the year and seats columns in the planes table where the year is greater than or equal to 2012. The returned data is sorted in ascending order by year and descending order by seats.
My pandas command: - Create a logical vector for slicing by checking rows in the year column in the planes dataframe where the value is greater than or equal to 2012. This logical vector is then used to slice the planes data frame. Drop duplicates corresponding to the year and seat columns in the resulting data frame. Then select the year and seats columns. Sort the dataframe by year in ascending order, followed by seats in descending order, and reset the index.
# SQL result
task9_sql = pd.read_sql_query("""SELECT DISTINCT year, seats FROM planes
WHERE year >= 2012 ORDER BY year ASC, seats DESC""", conn)
# Pandas result with timing
t9_st = timeit.default_timer()
task9_my = planes[(planes["year"] >= 2012)].drop_duplicates(
subset=["year", "seats"])[["year", "seats"]].sort_values(
by=["year", "seats"], ascending=[True, False]).reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task9_my.head(), "\n")
t9_et = timeit.default_timer()
t9_dur = t9_et - t9_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task9_sql, task9_my)Head of the Pandas data frame resulting from the query:
year seats
0 2012.0 379
1 2012.0 377
2 2012.0 260
3 2012.0 222
4 2012.0 200
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t9_st_pl = timeit.default_timer()
task9_pl = planes_pl.filter(pl.col("year") >= 2012).unique(
subset=["year", "seats"]).select(["year", "seats"]).sort(by=["year","seats"], descending=[False, True])
t9_et_pl = timeit.default_timer()
t9_dur_pl = t9_et_pl - t9_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task9_my.reset_index(drop=True),
task9_pl.to_pandas().reset_index(drop=True), name="Polars")
print_execution_data(t9_dur, t9_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002258 0.003938
Relative to Pandas: 1 1.74
Task 10
This SQL query selects unique rows from the year and seats columns in the planes table where the year is greater than or equal to 2012. The returned data is sorted in descending order by seats, then ascending order by year.
My pandas command: - Create a logical vector for slicing by checking rows in the year column in the planes dataframe where the value is greater than or equal to 2012. This logical vector is then used to slice the planes dataframe. In the resulting dataframe, drop duplicates corresponding to the year and seats columns. Then select the year and seats columns. Then sort the dataframe by seats in descending order, followed by year in ascending order, and reset the index.
# SQL result
task10_sql = pd.read_sql_query("""SELECT DISTINCT year, seats FROM planes WHERE year >= 2012 ORDER BY seats DESC, year ASC""", conn)
# Pandas result with timing
t10_st = timeit.default_timer()
task10_my = planes[(planes['year'] >= 2012)].drop_duplicates(
subset=["year", "seats"])[["year", "seats"]].sort_values(
by=["seats", "year"], ascending= [False,True]).reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task10_my.head(), "\n")
t10_et = timeit.default_timer()
t10_dur = t10_et - t10_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task10_sql, task10_my)Head of the Pandas data frame resulting from the query:
year seats
0 2012.0 379
1 2013.0 379
2 2012.0 377
3 2013.0 377
4 2012.0 260
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t10_st_pl = timeit.default_timer()
task10_pl = planes_pl.filter(pl.col("year") >= 2012).unique(
subset=["year", "seats"]).select(["year", "seats"]).sort(by=["seats", "year"], descending=[True, False])
t10_et_pl = timeit.default_timer()
t10_dur_pl = t10_et_pl - t10_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task10_my.reset_index(drop=True),
task10_pl.to_pandas().reset_index(drop=True),
name="Polars")
print_execution_data(t10_dur, t10_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002203 0.003303
Relative to Pandas: 1 1.5
Task 11
This SQL query counts the number of planes with the number of seats above 200 and groups them by manufacturer. Showing how many planes each manufacturer has where the seats are greater than 200.
My pandas command - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is greater than 200. This logical vector is then used to slice the planes dataframe. The resulting dataframe is then grouped by the manufacturer, and the size of each group is evaluated. I reset the index at the same time as renaming the count column to COUNT(*).
# SQL result
task11_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes
WHERE seats > 200 GROUP BY manufacturer""", conn)
# Pandas result with timing
t11_st = timeit.default_timer()
task11_my = planes[(planes["seats"] > 200)].groupby("manufacturer").size().reset_index(name="COUNT(*)")
print("Head of the Pandas data frame resulting from the query:")
print(task11_my.head(), "\n")
t11_et = timeit.default_timer()
t11_dur = t11_et - t11_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task11_sql, task11_my)Head of the Pandas data frame resulting from the query:
manufacturer COUNT(*)
0 AIRBUS 66
1 AIRBUS INDUSTRIE 4
2 BOEING 225
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t11_st_pl = timeit.default_timer()
task11_pl = planes_pl.filter(pl.col("seats") > 200).group_by("manufacturer").agg(
pl.len().alias("COUNT(*)"))
t11_et_pl = timeit.default_timer()
t11_dur_pl = t11_et_pl - t11_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task11_pl = task11_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task11_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
task11_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
name="Polars")
print_execution_data(t11_dur, t11_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001559 0.001686
Relative to Pandas: 1 1.08
Task 12
This SQL query counts the number of planes for each manufacturer and selects those manufacturers with more than 10 planes. The manufacture and count are returned.
My pandas command: - Group the rows of the planes dataframe by the manufacturer and retrieve the size of each group. Reset the index and rename the index column to count. Use the query function to return only those rows where the count is greater than 10. Rename the count column to COUNT(*) and reset the index.
# SQL result
task12_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes
GROUP BY manufacturer HAVING COUNT(*) > 10""", conn)
# Pandas result with timing
t12_st = timeit.default_timer()
task12_my = planes.groupby("manufacturer").size().reset_index(name="count").query(
"count > 10").rename(columns={"count": "COUNT(*)"}).reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task12_my.head(), "\n")
t12_et = timeit.default_timer()
t12_dur = t12_et - t12_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task12_sql, task12_my)Head of the Pandas data frame resulting from the query:
manufacturer COUNT(*)
0 AIRBUS 336
1 AIRBUS INDUSTRIE 400
2 BOEING 1630
3 BOMBARDIER INC 368
4 EMBRAER 299
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t12_st_pl = timeit.default_timer()
task12_pl = planes_pl.group_by("manufacturer").agg(pl.len().alias("COUNT(*)")).filter(
pl.col("COUNT(*)") > 10)
t12_et_pl = timeit.default_timer()
t12_dur_pl = t12_et_pl - t12_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task12_pl = task12_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task12_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
task12_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
name="Polars")
print_execution_data(t12_dur, t12_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002903 0.004606
Relative to Pandas: 1 1.59
Task 13
This SQL query will show the number of manufacturers that have more than 10 planes with a seat capacity greater than 200.
My pandas command: - Create a logical vector for slicing by checking rows in the seats column in the planes dataframe where the value is greater than 200. Slice the planes dataframe using this logical vector, then group the rows by the manufacturer column. Retrieve the size of each group. Reset the index and rename the index column to count. Use the query function to return only those rows where the count is greater than 10. Rename the count column to COUNT(*) and reset the index
# SQL result
task13_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) FROM planes
WHERE seats > 200 GROUP BY manufacturer HAVING COUNT(*) > 10""", conn)
# Pandas result with timing
t13_st = timeit.default_timer()
task13_my = planes[planes["seats"] > 200].groupby("manufacturer").size().reset_index(
name="count").query("count >10").rename(columns={"count": "COUNT(*)"}).reset_index(drop=True)
print("Head of the Pandas data frame resulting from the query:")
print(task13_my.head(), "\n")
t13_et = timeit.default_timer()
t13_dur = t13_et - t13_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task13_sql, task13_my)Head of the Pandas data frame resulting from the query:
manufacturer COUNT(*)
0 AIRBUS 66
1 BOEING 225
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t13_st_pl = timeit.default_timer()
task13_pl = planes_pl.filter(pl.col("seats") > 200).group_by("manufacturer").agg(pl.len().alias("COUNT(*)")).filter(
pl.col("COUNT(*)") > 10)
t13_et_pl = timeit.default_timer()
t13_dur_pl = t13_et_pl - t13_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task13_pl = task13_pl.with_columns(pl.col("COUNT(*)").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task13_my.sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
task13_pl.to_pandas().sort_values(by=["manufacturer", "COUNT(*)"]).reset_index(drop=True),
name="Polars")
print_execution_data(t13_dur, t13_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.002486 0.00208
Relative to Pandas: 1 0.837
Task 14
This SQL query shows the top 10 manufacturers by the number of planes they have in the planes table.
My pandas command: - Group the rows in the planes dataframe by the manufacturer column. Retrieve the size of each group. Reset the index and rename the index column to howmany. Sort the dataframe by the howmany column in descending order. Reset the index and display the first 10 rows.
# SQL result
task14_sql = pd.read_sql_query("""SELECT manufacturer, COUNT(*) AS howmany FROM planes
GROUP BY manufacturer ORDER BY howmany DESC LIMIT 10""", conn)
# Pandas result with timing
t14_st = timeit.default_timer()
task14_my = planes.groupby("manufacturer").size().reset_index(
name="howmany").sort_values(by="howmany", ascending=False).reset_index(drop=True).head(10)
print("Head of the Pandas data frame resulting from the query:")
print(task14_my.head(), "\n")
t14_et = timeit.default_timer()
t14_dur = t14_et - t14_st
# Comparing the result from SQL and Pandas
test_dataframe_equality(task14_sql, task14_my)Head of the Pandas data frame resulting from the query:
manufacturer howmany
0 BOEING 1630
1 AIRBUS INDUSTRIE 400
2 BOMBARDIER INC 368
3 AIRBUS 336
4 EMBRAER 299
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t14_st_pl = timeit.default_timer()
task14_pl = planes_pl.group_by("manufacturer").agg(pl.len().alias("howmany")).sort(
"howmany", descending=True).limit(10)
t14_et_pl = timeit.default_timer()
t14_dur_pl = t14_et_pl - t14_st_pl
# Polars returns uint32 for len(), so we need to cast to allow comparison with pandas.
task14_pl = task14_pl.with_columns(pl.col("howmany").cast(pl.Int64))
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task14_my.sort_values(by=["manufacturer", "howmany"]).reset_index(drop=True),
task14_pl.to_pandas().sort_values(by=["manufacturer", "howmany"]).reset_index(drop=True),
name="Polars")
print_execution_data(t14_dur, t14_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.001631 0.006682
Relative to Pandas: 1 4.1
Task 15
This SQL query returns all rows and columns from the flights table and columns year, speed, and seats from the planes as plane_year, plane_speed, and plane_seat. Any row where the tailnum from flights matches the tailnum in planes the value from the year, speed and seats columns from planes will be added to those columns for that row in the output. If no matching tailnum is found in the planes table, then NULL is returned for the plane_year, plane_speed, and plane_seat for that row.
My pandas command: - Select the tailnum, year, speed, and seats columns from planes and assign them to temp15. - Perform a left merge of the flights and temp15 dataframes using tailnum column. Disable suffixes for flights and set to _p for the planes columns. This will prevent columns in both flights and planes from being renamed in flights with a suffix. The planes column will be added with the suffix _p. Finally, rename the year, speed, and seats columns with plane_ prefix.
# SQL result
task15_sql = pd.read_sql_query("""SELECT flights.*, planes.year AS plane_year,
planes.speed AS plane_speed, planes.seats AS plane_seats FROM flights
LEFT JOIN planes ON flights.tailnum=planes.tailnum""", conn)
# Pandas result with timing
t15_st = timeit.default_timer()
# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp15 = planes[["tailnum", "year", "speed", "seats"]]
task15_my = pd.merge(flights, temp15,
on="tailnum", how="left", suffixes=(None, "_p")).rename(
columns={"year_p":"plane_year","speed":"plane_speed", "seats":"plane_seats"})
print("Head of the Pandas data frame resulting from the query:")
print(task15_my.head(), "\n")
t15_et = timeit.default_timer()
t15_dur = t15_et - t15_st
# Comparing the result from SQL and Pandas
# This command will generate a future warning as seen below.
# This can be resoled by ensuring all Null type (None, NA, NAN) are replaced with np.nan
print("This feature warning can be and has been disable in subsequent tasks by making the null like values consistent in the dataframes being compared.")
test_dataframe_equality(task15_sql, task15_my)
# Make null type values consistent
task15_sql.fillna(value=np.nan, inplace=True)
task15_my.fillna(value=np.nan, inplace=True)
# Assert after making null type values consistent
pd.testing.assert_frame_equal(task15_sql, task15_my)Head of the Pandas data frame resulting from the query:
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 533.0 529 4.0 850.0
2 2013 1 1 542.0 540 2.0 923.0
3 2013 1 1 544.0 545 -1.0 1004.0
4 2013 1 1 554.0 600 -6.0 812.0
sched_arr_time arr_delay carrier ... origin dest air_time distance \
0 819 11.0 UA ... EWR IAH 227.0 1400
1 830 20.0 UA ... LGA IAH 227.0 1416
2 850 33.0 AA ... JFK MIA 160.0 1089
3 1022 -18.0 B6 ... JFK BQN 183.0 1576
4 837 -25.0 DL ... LGA ATL 116.0 762
hour minute time_hour plane_year plane_speed plane_seats
0 5 15 2013-01-01 05:00:00 1999.0 NaN 149.0
1 5 29 2013-01-01 05:00:00 1998.0 NaN 149.0
2 5 40 2013-01-01 05:00:00 1990.0 NaN 178.0
3 5 45 2013-01-01 05:00:00 2012.0 NaN 200.0
4 6 0 2013-01-01 06:00:00 1991.0 NaN 178.0
[5 rows x 22 columns]
This feature warning can be and has been disable in subsequent tasks by making the null like values consistent in the dataframes being compared.
C:\Users\darrin\AppData\Local\Temp\ipykernel_21080\4104427652.py:12: FutureWarning:
Mismatched null-like values None and nan found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t15_st_pl = timeit.default_timer()
# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp15_pl = planes_pl.select(["tailnum", "year", "speed", "seats"])
task15_pl = flights_pl.join(temp15_pl, on=["tailnum"], how="left").rename({"year_right":"plane_year","speed":"plane_speed", "seats":"plane_seats"})
t15_et_pl = timeit.default_timer()
t15_dur_pl = t15_et_pl - t15_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
# Make null types consistent
task15_pl_pd = task15_pl.to_pandas()
task15_pl_pd.fillna(value=np.nan, inplace=True)
test_dataframe_equality(task15_my.reset_index(drop=True),
task15_pl_pd.reset_index(drop=True), name="Polars")
print_execution_data(t15_dur, t15_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.239 0.01319
Relative to Pandas: 1 0.0552
Task 16
This SQL query returns a dataframe that includes all the columns from both planes and airlines, with only the unique carrier and tailnum combinations that exist in the flight dataframe.
My pandas command: - From the flights dataframe select the carrier and tailnum columns and drop duplicate rows. Join the result of the previous step with the planes dataframe on the tailnum column using an inner join and assign to temp16. Then join temp16 with the airlines dataframe on the carrier column using an inner join.
# SQL result
task16_sql = pd.read_sql_query("""SELECT planes.*, airlines.*
FROM (SELECT DISTINCT carrier, tailnum FROM flights) AS cartail
INNER JOIN planes ON cartail.tailnum=planes.tailnum
INNER JOIN airlines ON cartail.carrier=airlines.carrier""", conn)
# Sort needed for comparison with pandas result
task16_sql = task16_sql.sort_values(by=["tailnum", "carrier"], ascending=[True, True]).reset_index(drop=True)
# Pandas result with timing
t16_st = timeit.default_timer()
# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp16 = pd.merge(flights[["carrier", "tailnum"]].drop_duplicates(), planes, on="tailnum", how="inner")
task16_my = pd.merge(temp16, airlines, on="carrier", how="inner")
print("Head of the Pandas data frame resulting from the query:")
print(task16_my.head(), "\n")
t16_et = timeit.default_timer()
t16_dur = t16_et - t16_st
# Create a copy to use with polars comparison
task16_pd= task16_my.copy()
# Move the carrier column and sort for comparison with SQL result
task16_my.insert(9, "carrier", task16_my.pop("carrier"))
task16_my=task16_my.sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True)
# Comparing the result from SQL and Pandas
test_dataframe_equality(task16_sql, task16_my)Head of the Pandas data frame resulting from the query:
carrier tailnum year type manufacturer model \
0 UA N14228 1999.0 Fixed wing multi engine BOEING 737-824
1 UA N24211 1998.0 Fixed wing multi engine BOEING 737-824
2 AA N619AA 1990.0 Fixed wing multi engine BOEING 757-223
3 B6 N804JB 2012.0 Fixed wing multi engine AIRBUS A320-232
4 DL N668DN 1991.0 Fixed wing multi engine BOEING 757-232
engines seats speed engine name
0 2 149 NaN Turbo-fan United Air Lines Inc.
1 2 149 NaN Turbo-fan United Air Lines Inc.
2 2 178 NaN Turbo-fan American Airlines Inc.
3 2 200 NaN Turbo-fan JetBlue Airways
4 2 178 NaN Turbo-fan Delta Air Lines Inc.
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t16_st_pl = timeit.default_timer()
# The following two lines could be done as one without using a temporary variable. However, it is easier to read this way.
temp16_pl = flights_pl.select(["carrier", "tailnum"]).unique().join(planes_pl, on=["tailnum"], how="inner")
task16_pl = temp16_pl.join(airlines_pl, on=["carrier"], how="inner")
t16_et_pl = timeit.default_timer()
t16_dur_pl = t16_et_pl - t16_st_pl
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task16_pd.sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True),
task16_pl.to_pandas().sort_values(by=["tailnum","carrier"], ascending=[True, True]).reset_index(drop=True),
name="Polars")
print_execution_data(t16_dur, t16_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.1513 0.02181
Relative to Pandas: 1 0.144
Task 17
This SQL query retrieves flight data originating from EWR and joins it with averaged daily weather from the same location.
My pandas command: - From the flights dataframe only select rows where the origin column equals ‘EWR’ and assign the result to flights2 - From the weather dataframe, select only rows where the origin column equal ‘EWR’. Then group rows by the year, month, and day. Perform averaging on the temp and humidity columns. In the resulting dataframe, rename the temp and humidity columns to atemp and ahumid. Reset the dataframe index and assign the result to weather2. - Merge the weather2 into the flights2 dataframe from the left on the year, month and days columns. A left merge will result in all the rows of flights2 being copied into the result. If there is no matching atemp or ahumid for any rows in flights2, then Nan will be inserted into these columns for the rows.
# SQL query
task17_sql = pd.read_sql_query("""SELECT flights2.*, atemp, ahumid FROM ( SELECT * FROM flights WHERE origin='EWR') AS flights2
LEFT JOIN (
SELECT year, month, day, AVG(temp) AS atemp, AVG(humid) AS ahumid FROM weather WHERE origin='EWR' GROUP BY year, month, day
) AS weather2
ON flights2.year=weather2.year
AND flights2.month=weather2.month
AND flights2.day=weather2.day""", conn)
# Pandas result with timing
t17_st = timeit.default_timer()
# From the flights dataframe, select only rows where the origin column equals EWR and assign the result to flights2
flights2 = flights[(flights.origin=='EWR')]
# From the weather dataframe, select only rows where the origin column equals EWR. Then group rows by year, month and day. Perform averaging on the temp and humidity columns.
# Rename the temp and humidity columns in the resulting data frame as atemp and ahumid. Reset the dataframe index and assign the result to weather2.
weather2 = weather[weather.origin=="EWR"].groupby(["year","month","day"])[["temp","humid"]].mean(numeric_only=True).rename(columns={'temp':'atemp','humid':'ahumid'}).reset_index()
# Merge the weather2 dataframe into the flights2 dataframe from the left on the year, month and days columns. Left merge will result in all the rows of flights2 will be in the result.
# If there is no matching atemp or ahumid for any rows in flights2, then a NaN will be inserted into the columns for those rows.
task17_my = pd.merge(flights2, weather2, on=['year', 'month', 'day'], how='left')
print("Head of the Pandas data frame resulting from the query:")
print(task17_my.head(), "\n")
t17_et = timeit.default_timer()
t17_dur = t17_et - t17_st
# Make null type values consistent - see Task 15 for description
task17_sql.fillna(value=np.nan, inplace=True)
task17_my.fillna(value=np.nan, inplace=True)
# Comparing the result from SQL and Pandas
test_dataframe_equality(task17_sql, task17_my)Head of the Pandas data frame resulting from the query:
year month day dep_time sched_dep_time dep_delay arr_time \
0 2013 1 1 517.0 515 2.0 830.0
1 2013 1 1 554.0 558 -4.0 740.0
2 2013 1 1 555.0 600 -5.0 913.0
3 2013 1 1 558.0 600 -2.0 923.0
4 2013 1 1 559.0 600 -1.0 854.0
sched_arr_time arr_delay carrier ... tailnum origin dest air_time \
0 819 11.0 UA ... N14228 EWR IAH 227.0
1 728 12.0 UA ... N39463 EWR ORD 150.0
2 854 19.0 B6 ... N516JB EWR FLL 158.0
3 937 -14.0 UA ... N53441 EWR SFO 361.0
4 902 -8.0 UA ... N76515 EWR LAS 337.0
distance hour minute time_hour atemp ahumid
0 1400 5 15 2013-01-01 05:00:00 38.48 58.386087
1 719 5 58 2013-01-01 05:00:00 38.48 58.386087
2 1065 6 0 2013-01-01 06:00:00 38.48 58.386087
3 2565 6 0 2013-01-01 06:00:00 38.48 58.386087
4 2227 6 0 2013-01-01 06:00:00 38.48 58.386087
[5 rows x 21 columns]
Outcome of comparison: SQL and Pandas results are equal
# Polars result with timing
t17_st_pl = timeit.default_timer()
flights2_pl = flights_pl.filter(pl.col("origin") == "EWR")
weather2_pl = (weather_pl.filter(pl.col("origin") == "EWR").group_by(["year", "month", "day"]).agg([pl.col("temp").mean().alias("atemp"),
pl.col("humid").mean().alias("ahumid")]))
task17_pl=flights2_pl.join(weather2_pl, on=["year", "month", "day"], how="left")
t17_et_pl = timeit.default_timer()
t17_dur_pl = t17_et_pl - t17_st_pl
# Make null type values consistent - see Task 15 for description
task17_pl_pd = task17_pl.to_pandas()
task17_pl_pd.fillna(value=np.nan, inplace=True)
# Checking the Pandas result == Polars result. We need to convert the polars dataframe to a pandas dataframe,
# sort and reset the index for the comparison
test_dataframe_equality(task17_my, task17_pl_pd, name="Polars")
print_execution_data(t17_dur, t17_dur_pl)Outcome of comparison: Polars and Pandas results are equal
Metric Pandas Polars
Execution time [s]: 0.1417 0.01958
Relative to Pandas: 1 0.138
# Close database connection
conn.close()Summary
This Jupyter Notebook demonstrates the use of pandas and polars for performing equivalent SQL queries. For cases where the queries were simple, the pandas queries were faster (up to 2x) than polars. However, for cases where the queries were complex, such as the joins, polars was faster (up tp 11x) than pandas.