%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"display.max_rows", 8)
pd.set_option('figure.figsize'] = (9, 6) plt.rcParams[
Pandas Dataframes
Create a DataFrame
= pd.date_range('20130101', periods=6)
dates 6,4), index=dates, columns=list('ABCD')) pd.DataFrame(np.random.randn(
'A' : 1.,
pd.DataFrame({'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.arange(4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
Load Data from CSV File
= "https://www.fun-mooc.fr/c4x/agrocampusouest/40001S03/asset/AnaDo_JeuDonnees_TemperatFrance.csv"
url = pd.read_csv(url, delimiter=";", encoding="latin1", index_col=0)
french_cities french_cities
Viewing Data
french_cities.head()
french_cities.tail()
Index
french_cities.index
We can rename an index by setting its name.
= "City"
french_cities.index.name french_cities.head()
import locale
import calendar
'C')
locale.setlocale(locale.LC_ALL,
= calendar.month_abbr
months print(*months)
french_cities.rename(={ old : new
columnsfor old, new in zip(french_cities.columns[:12], months[1:])
if old != new },
=True)
inplace
={'Moye':'Mean'}, inplace=True)
french_cities.rename(columns french_cities
Exercise: Rename DataFrame Months in English
From a local or remote HTML file
We can download and extract data about mean sea level stations around the world from the PSMSL website.
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
= pd.read_html("http://www.psmsl.org/data/obtaining/") table_list
# there is 1 table on that page which contains metadata about the stations where
# sea levels are recorded
= table_list[0]
local_sea_level_stations local_sea_level_stations
Indexing on DataFrames
'Lati'] # DF [] accesses columns (Series) french_cities[
.loc
and .iloc
allow to access individual values, slices or masked selections:
'Rennes', "Sep"] french_cities.loc[
'Rennes', ["Sep", "Dec"]] french_cities.loc[
'Rennes', "Sep":"Dec"] french_cities.loc[
Masking
= [True, False] * 6 + 5 * [False]
mask print(french_cities.iloc[:, mask])
print(french_cities.loc["Rennes", mask])
New column
"std"] = french_cities.iloc[:,:12].std(axis=1)
french_cities[ french_cities
= french_cities.drop("std", axis=1) # remove this new column french_cities
french_cities
Modifying a dataframe with multiple indexing
# french_cities['Rennes']['Sep'] = 25 # It does not works and breaks the DataFrame
'Rennes']['Sep'] # = 25 is the right way to do it french_cities.loc[
french_cities
Transforming datasets
'Mean'].min(), french_cities['Ampl'].max() french_cities[
Apply
Let’s convert the temperature mean from Celsius to Fahrenheit degree.
= lambda T: T*9/5+32
fahrenheit 'Mean'].apply(fahrenheit) french_cities[
Sort
='Lati') french_cities.sort_values(by
= french_cities.sort_values(by='Lati',ascending=False)
french_cities french_cities
Stack and unstack
Instead of seeing the months along the axis 1, and the cities along the axis 0, let’s try to convert these into an outer and an inner axis along only 1 time dimension.
"display.max_rows", 20)
pd.set_option(= french_cities.iloc[:,:12].unstack()
unstacked unstacked
type(unstacked)
Transpose
The result is grouped in the wrong order since it sorts first the axis that was unstacked. We need to transpose the dataframe.
= french_cities.iloc[:,:12].transpose()
city_temp city_temp.plot()
=90); city_temp.boxplot(rot
Describing
'Région'].describe() french_cities[
'Région'].unique() french_cities[
'Région'].value_counts() french_cities[
# To save memory, we can convert it to a categorical column:
"Région"] = french_cities["Région"].astype("category") french_cities[
french_cities.memory_usage()
Data Aggregation/summarization
groupby
= french_cities.groupby("Région")
fc_grouped_region type(fc_grouped_region)
for group_name, subdf in fc_grouped_region:
print(group_name)
print(subdf)
print("")
Exercise
Consider the following dataset UCI Machine Learning Repository Combined Cycle Power Plant Data Set. This dataset consists of records of measurements relating to peaker power plants of 10000 points over 6 years (2006-2011).
Variables - AT = Atmospheric Temperature in C - V = Exhaust Vaccum Speed - AP = Atmospheric Pressure - RH = Relative Humidity - PE = Power Output
We want to model the power output as a function of the other parameters.
Observations are in 5 excel sheets of about 10000 records in “Folds5x2_pp.xlsx”. These 5 sheets are same data shuffled. - Read this file with the pandas function read_excel. What is the type returned by this function? - Implement a select
function to regroup all observations in a pandas serie. - Use select
function and corr
to compute the maximum correlation. - Parallelize this loop with concurrent.futures
.