%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)
Pandas Series
- Started by Wes McKinney with a first release in 2011.
- Based on NumPy, it is one of the most widely used Python libraries for data manipulation and analysis.
- Motivated by the toolbox in R for manipulating data easily.
- Many names in Pandas come from the R world.
- It is open source (BSD license).
https://pandas.pydata.org/
import pandas as pd
“Pandas provides high-performance, easy-to-use data structures and data analysis tools in Python”
- Self-describing data structures
- Data loaders to/from common file formats
- Plotting functions
- Basic statistical tools.
Series
- A Series contains a one-dimensional array of data, and an associated sequence of labels called the index.
- The index can contain numeric, string, or date/time values.
- When the index is a time value, the series is a time series.
- The index must be the same length as the data.
- If no index is supplied, it is automatically generated as range(len(data)).
pd.Series([1,3,5,np.nan,6,8], dtype=np.float64)
pd.Series(index=pd.period_range('09/11/2017', '09/18/2017', freq="D"), dtype=np.int8)
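The index rules listed above can be checked on a few small Series. This is a minimal sketch; the values and labels are made up for illustration.

```python
import pandas as pd

# No index supplied: pandas generates 0, 1, 2, ... like range(len(data)).
default = pd.Series([10, 20, 30])

# String labels as index: values are looked up by label.
labelled = pd.Series([1.0, 2.0], index=["a", "b"])

# Date/time index: the Series is a time series.
dated = pd.Series([1, 2, 3], index=pd.date_range("2017-09-11", periods=3, freq="D"))

# The index must have the same length as the data, otherwise pandas raises.
error_message = None
try:
    pd.Series([1, 2, 3], index=["a", "b"])
except ValueError as exc:
    error_message = str(exc)

print(list(default.index))
print(labelled["b"])
print(type(dated.index).__name__)
print(error_message)
```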
Exercise
- Create a text with lorem and count word occurrences with a collections.Counter. Put the result in a dict.
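One possible way to approach the counting step is sketched below. The text here is a short hard-coded stand-in (an assumption for illustration); in the exercise the text comes from lorem.

```python
from collections import Counter

# Hypothetical stand-in text; the exercise generates its text with lorem instead.
text = "lorem ipsum dolor sit amet lorem ipsum dolore magna dolore"

# Split on whitespace and count each word.
words = text.lower().split()
counter = Counter(words)

# A Counter is a dict subclass; dict() gives a plain dictionary.
result = dict(counter)
print(result)
```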
Exercise
- From the results, create a Pandas Series named latin_series with the words in alphabetical order as index.
df = pd.Series(result)
df
Exercise
- Plot the series using ‘bar’ kind.
Exercise
- Pandas provides explicit indexers: loc (label-based) and iloc (position-based).
  - Use loc to display the number of occurrences of ‘dolore’.
  - Use iloc to display the number of occurrences of the last word in the index.
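The difference between label-based and position-based access can be sketched on a small Series. The word counts below are made up for illustration and are not the exercise data.

```python
import pandas as pd

# Hypothetical word counts, sorted alphabetically by word.
counts = {"lorem": 2, "ipsum": 2, "dolore": 3, "amet": 1}
latin_series = pd.Series(counts).sort_index()

# loc selects by index label.
n_dolore = latin_series.loc["dolore"]

# iloc selects by integer position; -1 is the last entry.
n_last = latin_series.iloc[-1]

print(n_dolore, n_last)
```

Because the index is sorted alphabetically, the last position holds the count for "lorem".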
Exercise
- Sort words by number of occurrences.
- Plot the Series.
Full globe temperature between 1901 and 2000.
We read the text file and load the results into a pandas DataFrame. In the cells below, you need to clean the data and convert the DataFrame to a time series.
import os
here = os.getcwd()
filename = os.path.join(here,"data","monthly.land.90S.90N.df_1901-2000mean.dat.txt")
# Raw string so that \s is passed to the regex engine unchanged
df = pd.read_table(filename, sep=r"\s+",
                   names=["year", "month", "mean temp"])
df
Exercise
- Insert a third column with value one named “day” with .insert.
- Convert df index to datetime with the pd.to_datetime function.
- Convert df to a Series containing only the “mean temp” column.
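The three steps can be sketched on a tiny table. The three rows below are hypothetical values with the same column names as the loaded file, not the real data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the temperature table.
toy = pd.DataFrame({
    "year": [1901, 1901, 1901],
    "month": [1, 2, 3],
    "mean temp": [-0.5, -999.0, 0.2],
})

# Insert a constant "day" column at position 2 (the third column).
toy.insert(2, "day", 1)

# pd.to_datetime assembles dates from columns named year, month and day.
toy.index = pd.to_datetime(toy[["year", "month", "day"]])

# Keep only the temperature column: the result is a Series with a datetime index.
ts = toy["mean temp"]
print(ts)
```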
Exercise
- Display the beginning of the file with .head.
Exercise
- Display the end of the file with .tail.
In the dataset, -999.00 was used to indicate that there was no value for that year.
Exercise
- Display values equal to -999 with .values.
- Replace the missing value (-999.000) with np.nan.
Once they have been converted to np.nan, missing values can be removed (dropped).
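A minimal sketch of this sentinel-to-NaN workflow on a made-up three-value Series (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series where -999.0 marks a missing measurement.
ts = pd.Series(
    [-0.5, -999.0, 0.2],
    index=pd.date_range("1901-01-01", periods=3, freq="MS"),
)

# Boolean mask on the underlying array selects the sentinel entries.
missing = ts[ts.values == -999]

# Replace the sentinel with NaN, then drop the NaN entries.
cleaned = ts.replace(-999.0, np.nan)
dropped = cleaned.dropna()

print(missing)
print(dropped)
```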
Exercise
- Remove missing values with .dropna.
Exercise
- Generate a basic visualization using .plot.
Exercise
- Converting the df index from timestamps to periods is more meaningful, since each value was measured and averaged over a month. Use the to_period method.
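As a small illustration of what to_period does, here is a sketch on a made-up Series with month-start timestamps:

```python
import pandas as pd

# Hypothetical values stamped at the first day of each month.
ts = pd.Series(
    [0.1, 0.2, 0.3],
    index=pd.date_range("1901-01-01", periods=3, freq="MS"),
)

# Each timestamp becomes a period covering the whole month.
monthly = ts.to_period("M")

print(monthly.index)
```

The values are unchanged; only the index type changes from DatetimeIndex to PeriodIndex.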
Resampling
Series can be resampled, either downsampled to a lower frequency or upsampled to a higher one.
- Frequencies can be specified as strings: “us”, “ms”, “S”, “T”, “H”, “D”, “B”, “W”, “M”, “A”, “3min”, “2h20min”, …
- More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
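The following sketch shows both directions on a made-up daily Series: downsampling to weekly means and upsampling to 12-hour steps. The data and the choice of mean and forward fill are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily series: 14 days starting Saturday 2000-01-01.
idx = pd.date_range("2000-01-01", periods=14, freq="D")
s = pd.Series(range(14), index=idx, dtype=float)

# Downsample: aggregate days into weeks (weeks end on Sunday by default).
weekly = s.resample("W").mean()

# Upsample: create 12-hour steps and carry the last known value forward.
half_daily = s.resample("12h").ffill()

print(weekly)
print(half_daily.head())
```

Downsampling needs an aggregation (here the mean), while upsampling needs a rule to fill the new, empty slots (here forward fill).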
Exercise
- With the resample method, convert the df Series to 10-year blocks:
Saving Work
HDF5 is a widely used and powerful file format for storing binary data. It can store both Series and DataFrames.
with pd.HDFStore("data/pandas_series.h5") as writer:
    # Pass the key by keyword; recent pandas versions no longer accept it positionally
    df.to_hdf(writer, key="/temperatures/full_globe")
Reloading data
with pd.HDFStore("data/pandas_series.h5") as store:
    df = store["/temperatures/full_globe"]