Pandas Dataframes#
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)
Create a DataFrame#
dates = pd.date_range('20130101', periods=6)
pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.arange(4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
Load Data from CSV File#
url = "https://www.fun-mooc.fr/c4x/agrocampusouest/40001S03/asset/AnaDo_JeuDonnees_TemperatFrance.csv"
french_cities = pd.read_csv(url, delimiter=";", encoding="latin1", index_col=0)
french_cities
Viewing Data#
french_cities.head()
french_cities.tail()
Index#
french_cities.index
We can rename an index by setting its name.
french_cities.index.name = "City"
french_cities.head()
import locale
import calendar
locale.setlocale(locale.LC_ALL,'C')
months = calendar.month_abbr
print(*months)
french_cities.rename(
columns={ old : new
for old, new in zip(french_cities.columns[:12], months[1:])
if old != new },
inplace=True)
french_cities.rename(columns={'Moye':'Mean'}, inplace=True)
french_cities
Exercise: Rename DataFrame Months in English#
From a local or remote HTML file#
We can download and extract data about mean sea level stations around the world from the PSMSL website.
# Needs `lxml`, `beautifulSoup4` and `html5lib` python packages
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")
# there is 1 table on that page which contains metadata about the stations where
# sea levels are recorded
local_sea_level_stations = table_list[0]
local_sea_level_stations
Indexing on DataFrames#
french_cities['Lati'] # DF [] accesses columns (Series)
.loc
and .iloc
allow to access individual values, slices or masked selections:
french_cities.loc['Rennes', "Sep"]
french_cities.loc['Rennes', ["Sep", "Dec"]]
french_cities.loc['Rennes', "Sep":"Dec"]
Masking#
mask = [True, False] * 6 + 5 * [False]
print(french_cities.iloc[:, mask])
print(french_cities.loc["Rennes", mask])
New column#
french_cities["std"] = french_cities.iloc[:,:12].std(axis=1)
french_cities
french_cities = french_cities.drop("std", axis=1) # remove this new column
french_cities
Modifying a dataframe with multiple indexing#
# french_cities['Rennes']['Sep'] = 25 # It does not works and breaks the DataFrame
french_cities.loc['Rennes']['Sep'] # = 25 is the right way to do it
french_cities
Transforming datasets#
french_cities['Mean'].min(), french_cities['Ampl'].max()
Apply#
Let’s convert the temperature mean from Celsius to Fahrenheit degree.
fahrenheit = lambda T: T*9/5+32
french_cities['Mean'].apply(fahrenheit)
Sort#
french_cities.sort_values(by='Lati')
french_cities = french_cities.sort_values(by='Lati',ascending=False)
french_cities
Stack and unstack#
Instead of seeing the months along the axis 1, and the cities along the axis 0, let’s try to convert these into an outer and an inner axis along only 1 time dimension.
pd.set_option("display.max_rows", 20)
unstacked = french_cities.iloc[:,:12].unstack()
unstacked
type(unstacked)
Transpose#
The result is grouped in the wrong order since it sorts first the axis that was unstacked. We need to transpose the dataframe.
city_temp = french_cities.iloc[:,:12].transpose()
city_temp.plot()
city_temp.boxplot(rot=90);
Describing#
french_cities['Région'].describe()
french_cities['Région'].unique()
french_cities['Région'].value_counts()
# To save memory, we can convert it to a categorical column:
french_cities["Région"] = french_cities["Région"].astype("category")
french_cities.memory_usage()
Data Aggregation/summarization#
groupby#
fc_grouped_region = french_cities.groupby("Région")
type(fc_grouped_region)
for group_name, subdf in fc_grouped_region:
print(group_name)
print(subdf)
print("")
Exercise#
Consider the following dataset UCI Machine Learning Repository Combined Cycle Power Plant Data Set. This dataset consists of records of measurements relating to peaker power plants of 10000 points over 6 years (2006-2011).
Variables
AT = Atmospheric Temperature in C
V = Exhaust Vaccum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output
We want to model the power output as a function of the other parameters.
Observations are in 5 excel sheets of about 10000 records in “Folds5x2_pp.xlsx”. These 5 sheets are same data shuffled.
Read this file with the pandas function read_excel. What is the type returned by this function?
Implement a
select
function to regroup all observations in a pandas serie.Use
select
function andcorr
to compute the maximum correlation.Parallelize this loop with
concurrent.futures
.