Pandas Series

Started by Wes MacKinney with a first release in 2011.
Based on NumPy, it is the most used library for all things data.
Motivated by the toolbox in R for manipulating data easily.
A lot of names in Pandas come from R world.
It is Open source (BSD)

https://pandas.pydata.org/

import pandas as pd

“Pandas provides high-performance, easy-to-use data structures and data analysis tools in Python”

Self-describing data structures
Data loaders to/from common file formats
Plotting functions
Basic statistical tools.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)

Series

A Series contains a one-dimensional array of data, and an associated sequence of labels called the index.
The index can contain numeric, string, or date/time values.
When the index is a time value, the series is a time series.
The index must be the same length as the data.
If no index is supplied it is automatically generated as range(len(data)).

pd.Series([1,3,5,np.nan,6,8], dtype=np.float64)

pd.Series(index=pd.period_range('09/11/2017', '09/18/2017', freq="D"), dtype=np.int8)

Exercise

Create a text with lorem and count word occurences with a collection.Counter. Put the result in a dict.

Exercise

From the results create a Pandas series name latin_series with words in alphabetical order as index.

df = pd.Series(result)
df

Exercise

Plot the series using ‘bar’ kind.

Exercise

Pandas provides explicit functions for indexing loc and iloc.
- Use loc to display the number of occurrences of ‘dolore’.
- Use iloc to diplay the number of occurrences of the last word in index.

Exercise

Sort words by number of occurrences.
Plot the Series.

Full globe temperature between 1901 and 2000.

We read the text file and load the results in a pandas dataframe. In cells below you need to clean the data and convert the dataframe to a time series.

import os
here = os.getcwd()

filename = os.path.join(here,"data","monthly.land.90S.90N.df_1901-2000mean.dat.txt")

df = pd.read_table(filename, sep="\s+", 
                   names=["year", "month", "mean temp"])
df

Exercise

Insert a third column with value one named “day” with .insert.
convert df index to datetime with pd.to_datetime function.
convert df to Series containing only “mean temp” column.

Exercise

Display the beginning of the file with .head.

Exercise

Display the end of the file with .tail.

In the dataset, -999.00 was used to indicate that there was no value for that year.

Exercise

Display values equal to -999 with .values.
Replace the missing value (-999.000) by np.nan

Once they have been converted to np.nan, missing values can be removed (dropped).

Exercise

Remove missing values with .dropna.

Exercise

Generate a basic visualization using .plot.

Exercise

Convert df index from timestamp to period is more meaningfull since it was measured and averaged over the month. Use to_period method.

Resampling

Series can be resample, downsample or upsample. - Frequencies can be specified as strings: “us”, “ms”, “S”, “T”, “H”, “D”, “B”, “W”, “M”, “A”, “3min”, “2h20”, … - More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Exercise

With resample method, convert df Series to 10 year blocks:

Saving Work

HDF5 is widely used and one of the most powerful file format to store binary data. It allows to store both Series and DataFrames.

with pd.HDFStore("data/pandas_series.h5") as writer:
    df.to_hdf(writer, "/temperatures/full_globe")

Reloading data

with pd.HDFStore("data/pandas_series.h5") as store:
    df = store["/temperatures/full_globe"]