Spark exercises#

Start the spark session#

from pyspark.sql import SparkSession

# Connect to the course cluster; the master URL below is specific to this deployment.
spark = SparkSession.builder \
        .config("spark.cores.max", "4") \
        .master('spark://b2-120-gra11:7077') \
        .getOrCreate()
spark

Load the data#

  • Here we read the NYC taxi data files for the year 2018 and select a few variables.

columns = ['tpep_pickup_datetime', 'passenger_count', 'payment_type', 'fare_amount', 
           'tip_amount', 'total_amount']
df = (spark.read.parquet("hdfs://localhost:54310/data/nyctaxi/2018.parquet/").select(*columns))
  1. Plot the tip_fraction as a function of the hour of day.

import matplotlib.pyplot as plt
from pyspark.sql.functions import hour  # SQL function used to extract the pickup hour

results = (df.filter(df.fare_amount > 0)
   .withColumn('tip_fraction', df.tip_amount / df.fare_amount)
   .withColumn('hour_of_day', hour(df.tpep_pickup_datetime))
   .groupBy("hour_of_day")
   .agg({'tip_fraction': 'avg'})
   .orderBy("hour_of_day")
   .toPandas()
)
results.plot(x="hour_of_day", y="avg(tip_fraction)", kind="bar")
  2. Same plot, but as a function of the day of the week.

# ...
  3. Investigate the payment_type column: see how well each of the payment types correlates with the tip_fraction. Did you find anything interesting? Any guesses as to what the different payment types might be? If you are interested, you may be able to find more information on the NYC TLC's website.

Cheat Sheets and documentation#

Use the PySpark API documentation as a reference for the DataFrame methods and SQL functions used above.