Pandas Dataframes

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)

Create a DataFrame

dates = pd.date_range('20130101', periods=6)
pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
pd.DataFrame({'A' : 1.,
              'B' : pd.Timestamp('20130102'),
              'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
              'D' : np.arange(4,dtype='int32'),
              'E' : pd.Categorical(["test","train","test","train"]),
              'F' : 'foo' })

Load Data from CSV File

url = "https://www.fun-mooc.fr/c4x/agrocampusouest/40001S03/asset/AnaDo_JeuDonnees_TemperatFrance.csv"
french_cities = pd.read_csv(url, delimiter=";", encoding="latin1", index_col=0)
french_cities

Viewing Data

french_cities.head()
french_cities.tail()

Index

french_cities.index

We can rename an index by setting its name.

french_cities.index.name = "City"
french_cities.head()
import locale
import calendar
 
locale.setlocale(locale.LC_ALL,'C')
 
months = calendar.month_abbr
print(*months)
 
french_cities.rename(
  columns={ old : new 
           for old, new in zip(french_cities.columns[:12], months[1:])
          if old != new },
  inplace=True)
 
french_cities.rename(columns={'Moye':'Mean'}, inplace=True)
french_cities

Exercise: Rename DataFrame Months in English

From a local or remote HTML file

We can download and extract data about mean sea level stations around the world from the PSMSL website.

# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")
# there is 1 table on that page which contains metadata about the stations where 
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations

Indexing on DataFrames

french_cities['Lati']  # DF [] accesses columns (Series)

.loc and .iloc allow to access individual values, slices or masked selections:

french_cities.loc['Rennes', "Sep"]
french_cities.loc['Rennes', ["Sep", "Dec"]]
french_cities.loc['Rennes', "Sep":"Dec"]

Masking

mask = [True, False] * 6 + 5 * [False]
print(french_cities.iloc[:, mask])
print(french_cities.loc["Rennes", mask])

New column

french_cities["std"] = french_cities.iloc[:,:12].std(axis=1)
french_cities
french_cities = french_cities.drop("std", axis=1) # remove this new column
french_cities

Modifying a dataframe with multiple indexing

# french_cities['Rennes']['Sep'] = 25 # It does not works and breaks the DataFrame
french_cities.loc['Rennes']['Sep'] # = 25 is the right way to do it
french_cities

Transforming datasets

french_cities['Mean'].min(), french_cities['Ampl'].max()

Apply

Let’s convert the temperature mean from Celsius to Fahrenheit degree.

fahrenheit = lambda T: T*9/5+32
french_cities['Mean'].apply(fahrenheit)

Sort

french_cities.sort_values(by='Lati')
french_cities = french_cities.sort_values(by='Lati',ascending=False)
french_cities

Stack and unstack

Instead of seeing the months along the axis 1, and the cities along the axis 0, let’s try to convert these into an outer and an inner axis along only 1 time dimension.

pd.set_option("display.max_rows", 20)
unstacked = french_cities.iloc[:,:12].unstack()
unstacked
type(unstacked)

Transpose

The result is grouped in the wrong order since it sorts first the axis that was unstacked. We need to transpose the dataframe.

city_temp = french_cities.iloc[:,:12].transpose()
city_temp.plot()
city_temp.boxplot(rot=90);

Describing

french_cities['Région'].describe()
french_cities['Région'].unique()
french_cities['Région'].value_counts()
# To save memory, we can convert it to a categorical column:
french_cities["Région"] = french_cities["Région"].astype("category")
french_cities.memory_usage()

Data Aggregation/summarization

groupby

fc_grouped_region = french_cities.groupby("Région")
type(fc_grouped_region)
for group_name, subdf in fc_grouped_region:
    print(group_name)
    print(subdf)
    print("")

Exercise

Consider the following dataset UCI Machine Learning Repository Combined Cycle Power Plant Data Set. This dataset consists of records of measurements relating to peaker power plants of 10000 points over 6 years (2006-2011).

Variables - AT = Atmospheric Temperature in C - V = Exhaust Vaccum Speed - AP = Atmospheric Pressure - RH = Relative Humidity - PE = Power Output

We want to model the power output as a function of the other parameters.

Observations are in 5 excel sheets of about 10000 records in “Folds5x2_pp.xlsx”. These 5 sheets are same data shuffled. - Read this file with the pandas function read_excel. What is the type returned by this function? - Implement a select function to regroup all observations in a pandas serie. - Use select function and corr to compute the maximum correlation. - Parallelize this loop with concurrent.futures.