Spark exercises#
Start the Spark session#
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.cores.max", "4") \
    .master('spark://b2-120-gra11:7077') \
    .getOrCreate()
spark
Load the data#
Here we read the NYC taxi data files for the year 2018 and select a few columns.
columns = ['tpep_pickup_datetime', 'passenger_count', 'payment_type', 'fare_amount',
           'tip_amount', 'total_amount']
df = (spark.read
      .parquet("hdfs://localhost:54310/data/nyctaxi/2018.parquet/")
      .select(*columns))
Plot the tip_fraction as a function of the hour of day
import matplotlib.pyplot as plt
from pyspark.sql.functions import hour  # import only what we use, not the whole namespace

results = (df.filter(df.fare_amount > 0)
           .withColumn('tip_fraction', df.tip_amount / df.fare_amount)
           .withColumn('hour_of_day', hour(df.tpep_pickup_datetime))
           .groupBy("hour_of_day")
           .agg({'tip_fraction': 'avg'})
           .orderBy("hour_of_day")
           .toPandas())
results.plot(x="hour_of_day", y="avg(tip_fraction)", kind="bar")
Make the same plot for the day of the week.
# ...
Investigate the payment_type column. See how well each of the payment types correlates with the tip_fraction. Did you find anything interesting? Any guesses as to what the different payment types might be? If you're interested, you may be able to find more information on the NYC TLC's website.
Cheat Sheets and documentation#
Use the PySpark API documentation.