%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)Pandas Dataframes
Create a DataFrame
dates = pd.date_range('20130101', periods=6)
pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.arange(4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })Load Data from CSV File
url = "https://www.fun-mooc.fr/c4x/agrocampusouest/40001S03/asset/AnaDo_JeuDonnees_TemperatFrance.csv"
french_cities = pd.read_csv(url, delimiter=";", encoding="latin1", index_col=0)
french_citiesViewing Data
french_cities.head()french_cities.tail()Index
french_cities.indexWe can rename an index by setting its name.
french_cities.index.name = "City"
french_cities.head()import locale
import calendar
locale.setlocale(locale.LC_ALL,'C')
months = calendar.month_abbr
print(*months)
french_cities.rename(
columns={ old : new
for old, new in zip(french_cities.columns[:12], months[1:])
if old != new },
inplace=True)
french_cities.rename(columns={'Moye':'Mean'}, inplace=True)
french_citiesExercise: Rename DataFrame Months in English
From a local or remote HTML file
We can download and extract data about mean sea level stations around the world from the PSMSL website.
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")# there is 1 table on that page which contains metadata about the stations where
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stationsIndexing on DataFrames
french_cities['Lati'] # DF [] accesses columns (Series).loc and .iloc allow to access individual values, slices or masked selections:
french_cities.loc['Rennes', "Sep"]french_cities.loc['Rennes', ["Sep", "Dec"]]french_cities.loc['Rennes', "Sep":"Dec"]Masking
mask = [True, False] * 6 + 5 * [False]
print(french_cities.iloc[:, mask])print(french_cities.loc["Rennes", mask])New column
french_cities["std"] = french_cities.iloc[:,:12].std(axis=1)
french_citiesfrench_cities = french_cities.drop("std", axis=1) # remove this new columnfrench_citiesModifying a dataframe with multiple indexing
# french_cities['Rennes']['Sep'] = 25 # It does not works and breaks the DataFrame
french_cities.loc['Rennes']['Sep'] # = 25 is the right way to do itfrench_citiesTransforming datasets
french_cities['Mean'].min(), french_cities['Ampl'].max()Apply
Let’s convert the temperature mean from Celsius to Fahrenheit degree.
fahrenheit = lambda T: T*9/5+32
french_cities['Mean'].apply(fahrenheit)Sort
french_cities.sort_values(by='Lati')french_cities = french_cities.sort_values(by='Lati',ascending=False)
french_citiesStack and unstack
Instead of seeing the months along the axis 1, and the cities along the axis 0, let’s try to convert these into an outer and an inner axis along only 1 time dimension.
pd.set_option("display.max_rows", 20)
unstacked = french_cities.iloc[:,:12].unstack()
unstackedtype(unstacked)Transpose
The result is grouped in the wrong order since it sorts first the axis that was unstacked. We need to transpose the dataframe.
city_temp = french_cities.iloc[:,:12].transpose()
city_temp.plot()city_temp.boxplot(rot=90);Describing
french_cities['Région'].describe()french_cities['Région'].unique()french_cities['Région'].value_counts()# To save memory, we can convert it to a categorical column:
french_cities["Région"] = french_cities["Région"].astype("category")french_cities.memory_usage()Data Aggregation/summarization
groupby
fc_grouped_region = french_cities.groupby("Région")
type(fc_grouped_region)for group_name, subdf in fc_grouped_region:
print(group_name)
print(subdf)
print("")Exercise
Consider the following dataset UCI Machine Learning Repository Combined Cycle Power Plant Data Set. This dataset consists of records of measurements relating to peaker power plants of 10000 points over 6 years (2006-2011).
Variables - AT = Atmospheric Temperature in C - V = Exhaust Vaccum Speed - AP = Atmospheric Pressure - RH = Relative Humidity - PE = Power Output
We want to model the power output as a function of the other parameters.
Observations are in 5 excel sheets of about 10000 records in “Folds5x2_pp.xlsx”. These 5 sheets are same data shuffled. - Read this file with the pandas function read_excel. What is the type returned by this function? - Implement a select function to regroup all observations in a pandas serie. - Use select function and corr to compute the maximum correlation. - Parallelize this loop with concurrent.futures.