Pandas Series

pandas

https://pandas.pydata.org/

import pandas as pd

Pandas provides high-performance, easy-to-use data structures and data analysis tools for Python.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
pd.set_option("display.max_rows", 8)
plt.rcParams['figure.figsize'] = (9, 6)

Series

  • A Series contains a one-dimensional array of data, and an associated sequence of labels called the index.
  • The index can contain numeric, string, or date/time values.
  • When the index is a time value, the series is a time series.
  • The index must be the same length as the data.
  • If no index is supplied it is automatically generated as range(len(data)).
pd.Series([1,3,5,np.nan,6,8], dtype=np.float64)
pd.Series(index=pd.period_range('09/11/2017', '09/18/2017', freq="D"), dtype=np.int8)
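
The rules listed above can be checked on a few tiny Series. This is only a sketch: the labels "a", "b", "c" and the numbers are arbitrary illustration values, not data from this notebook.

```python
import numpy as np
import pandas as pd

# No index supplied: pandas generates the labels 0, 1, 2, ...
values = pd.Series([1, 3, 5, np.nan, 6, 8], dtype=np.float64)
default_labels = list(values.index)

# Explicit string labels: one label per data value.
labelled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A date/time index turns the Series into a time series.
days = pd.date_range("2017-09-11", periods=3, freq="D")
ts = pd.Series([10, 12, 11], index=days)

# Data and index of different lengths are rejected with a ValueError.
try:
    pd.Series([1, 2, 3], index=["a", "b"])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```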

Exercise

  • Create a text with lorem and count word occurrences with a collections.Counter. Put the result in a dict.
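
One possible way to do the counting is sketched below. To stay self-contained it uses a short hard-coded sentence instead of text generated by the lorem package, so the variable text and its content are assumptions made for illustration.

```python
from collections import Counter

# Hard-coded sample text standing in for generated lorem ipsum.
text = "lorem ipsum dolor sit amet lorem ipsum dolore magna dolore lorem"

# Lower-case and split on whitespace; real text would also need
# punctuation stripped before counting.
words = text.lower().split()

# Counter maps each word to its number of occurrences; dict() gives a plain dict.
result = dict(Counter(words))
```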

Exercise

  • From the results, create a Pandas Series named latin_series with the words in alphabetical order as index.
latin_series = pd.Series(result).sort_index()
latin_series
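
As a small illustration of this step, a dict passed to pd.Series becomes a Series whose index holds the keys, and sort_index puts the labels in alphabetical order. The dict sample_counts below is a made-up stand-in for the real word counts.

```python
import pandas as pd

# Made-up word counts standing in for the Counter result.
sample_counts = {"lorem": 3, "amet": 1, "dolore": 2}

# Keys become the index, values become the data.
latin_series = pd.Series(sample_counts).sort_index()
ordered_words = list(latin_series.index)
```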

Exercise

  • Plot the series using ‘bar’ kind.

Exercise

  • Pandas provides explicit functions for indexing loc and iloc.
    • Use loc to display the number of occurrences of ‘dolore’.
    • Use iloc to display the number of occurrences of the last word in the index.
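
The difference between the two indexers can be seen on a small Series. This is a sketch with made-up counts: loc selects by label, iloc selects by integer position.

```python
import pandas as pd

# Made-up word counts, used only for illustration.
counts = pd.Series({"amet": 1, "dolore": 2, "lorem": 3})

by_label = counts.loc["dolore"]   # label-based lookup
by_position = counts.iloc[-1]     # position-based lookup: last entry
last_word = counts.index[-1]      # the label stored at that position

# Label slices include both end points; positional slices exclude the stop.
label_slice = counts.loc["amet":"dolore"]
position_slice = counts.iloc[0:1]
```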

Exercise

  • Sort words by number of occurrences.
  • Plot the Series.
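
Sorting by value rather than by label is done with sort_values. A minimal sketch on made-up counts follows; the resulting Series can then be plotted like any other.

```python
import pandas as pd

# Made-up word counts, used only for illustration.
counts = pd.Series({"lorem": 3, "amet": 1, "dolore": 2})

# Most frequent words first; the index labels travel with their values.
by_count = counts.sort_values(ascending=False)
ranking = list(by_count.index)
```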

Full globe temperature between 1901 and 2000.

We read the text file and load its contents into a pandas DataFrame. In the cells below, you need to clean the data and convert the DataFrame to a time series.

import os
here = os.getcwd()

filename = os.path.join(here,"data","monthly.land.90S.90N.df_1901-2000mean.dat.txt")

# Columns are separated by runs of whitespace; the file has no header row.
df = pd.read_table(filename, sep=r"\s+",
                   names=["year", "month", "mean temp"])
df
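
To see what this call does without the data file, the same arguments can be tried on an in-memory string. The three rows below are made up and only mimic the year / month / value layout assumed for the real file.

```python
import io
import pandas as pd

# Made-up rows in a whitespace-separated layout with no header line.
sample = "1901  1   -0.5\n1901  2 -999.0\n1901  3    0.25\n"

# sep=r"\s+" treats any run of whitespace as a single separator;
# names= supplies the column labels the file does not contain.
sample_df = pd.read_table(io.StringIO(sample), sep=r"\s+",
                          names=["year", "month", "mean temp"])
```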

Exercise

  • Insert a third column named “day” with value 1, using .insert.
  • Convert the df index to datetime with the pd.to_datetime function.
  • Convert df to a Series containing only the “mean temp” column.
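
The three steps above could look like the following sketch. It runs on a tiny made-up frame rather than on the real file, and relies on pd.to_datetime being able to assemble dates from columns named year, month and day.

```python
import pandas as pd

# Made-up rows mimicking the year / month / "mean temp" layout.
df = pd.DataFrame({"year": [1901, 1901, 1901],
                   "month": [1, 2, 3],
                   "mean temp": [-0.5, -0.25, 0.0]})

# Insert a constant "day" column at position 2, i.e. as the third column.
df.insert(2, "day", 1)

# Build one timestamp per row from the year, month and day columns.
df.index = pd.to_datetime(df[["year", "month", "day"]])

# Selecting a single column yields a Series that keeps the datetime index.
temps = df["mean temp"]
```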

Exercise

  • Display the beginning of the file with .head.

Exercise

  • Display the end of the file with .tail.

In the dataset, -999.00 was used to indicate that there was no value for that month.

Exercise

  • Display the values equal to -999 with .values.
  • Replace the missing-value marker (-999.00) with np.nan.
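
A minimal sketch of both steps on a made-up Series: a boolean mask built from .values selects the marker entries, and replace swaps them for np.nan.

```python
import numpy as np
import pandas as pd

# Made-up temperatures with the -999 marker in two places.
temps = pd.Series([0.5, -999.0, 0.75, -999.0])

# .values exposes the underlying NumPy array, so the comparison gives a boolean mask.
missing = temps[temps.values == -999]

# replace returns a new Series; the original is left untouched.
cleaned = temps.replace(-999, np.nan)
```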

Once they have been converted to np.nan, missing values can be removed (dropped).

Exercise

  • Remove missing values with .dropna.
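
As a sketch on made-up values, dropna returns a new Series without the NaN entries and keeps the index labels of the remaining ones.

```python
import numpy as np
import pandas as pd

# Made-up monthly values with one gap.
stamps = pd.to_datetime(["1901-01-01", "1901-02-01", "1901-03-01"])
temps = pd.Series([0.5, np.nan, 0.75], index=stamps)

# The result must be assigned; temps itself still contains the NaN.
kept = temps.dropna()
```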

Exercise

  • Generate a basic visualization using .plot.

Exercise

  • Converting the df index from timestamps to periods is more meaningful, since each value was measured and averaged over a month. Use the to_period method.
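
The conversion can be sketched on a made-up Series: to_period("M") replaces each timestamp with the month that contains it.

```python
import pandas as pd

# Made-up monthly values stamped on the first day of each month.
stamps = pd.to_datetime(["1901-01-01", "1901-02-01", "1901-03-01"])
temps = pd.Series([-0.5, -0.25, 0.0], index=stamps)

# Each label now stands for a whole month instead of a single instant.
monthly = temps.to_period("M")
first_label = str(monthly.index[0])
```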

Resampling

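A Series can be resampled, either downsampled or upsampled.

  • Frequencies can be specified as strings: “us”, “ms”, “S”, “T”, “H”, “D”, “B”, “W”, “M”, “A”, “3min”, “2h20min”, …
  • Recent pandas releases (2.2 and later) deprecate some of these spellings in favor of “s”, “min”, “h”, “ME”, and “YE”.
  • More aliases at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases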

Exercise

  • With the resample method, convert the df Series to 10-year blocks:
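
One way to downsample to multi-year blocks is sketched below on made-up data. It passes an offset object, pd.offsets.YearBegin(10), instead of a string alias, because the alias spellings differ between pandas versions; that choice is an assumption of this sketch, not something prescribed here. The exact block boundaries depend on the offset and on the closed and label arguments of resample.

```python
import numpy as np
import pandas as pd

# Made-up series: 20 years of monthly values starting in January 1901.
months = pd.date_range("1901-01-01", periods=240, freq="MS")
temps = pd.Series(np.arange(240, dtype=float), index=months)

# Group the monthly values into blocks of ten years.
blocks = temps.resample(pd.offsets.YearBegin(10))

decade_mean = blocks.mean()    # one averaged value per block
decade_size = blocks.count()   # number of monthly values in each block
```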

Saving Work

HDF5 is a widely used and powerful file format for storing binary data. It can store both Series and DataFrames.

with pd.HDFStore("data/pandas_series.h5") as writer:
    df.to_hdf(writer, key="/temperatures/full_globe")

Reloading data

with pd.HDFStore("data/pandas_series.h5") as store:
    df = store["/temperatures/full_globe"]