Parallel Computing with Python


This course is taught as part of the Master for Smart Data Science at ENSAI Rennes.

Run Jupyter notebooks with Docker#

Get docker app#

You can run these notebooks with Docker. After cloning the repository, the docker run command below starts a container in which the Notebook server listens for HTTP connections on ports 8888 and 4040, with no authentication configured.

git clone https://github.com/pnavaro/big-data.git
docker run --rm -v $PWD/big-data:/home/jovyan/ -p 8888:8888 -p 4040:4040 pnavaro/big-data
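
Once the container is running, it can be useful to confirm from the host that the published ports accept connections. The following is a minimal sketch, not part of the course material: the helper name port_is_open is hypothetical, and it assumes the container was started with the port mappings shown above. It only checks that a TCP connection can be opened. It does not verify that the Notebook server itself is answering.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the `-p 8888:8888 -p 4040:4040` mappings (assumed unchanged)
    for port in (8888, 4040):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"localhost:{port} is {state}")
```

If port 8888 is reported as reachable, the notebooks should be available in a browser at http://localhost:8888.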

References#

Books#

Python for Data Analysis by Wes McKinney.
Python Data Science Handbook by Jake VanderPlas

Software documentation#

Tutorials#

Python
- Analyzing and Manipulating Data with Pandas Beginner: SciPy 2016 Tutorial by Jonathan Rocher.
Dask
- Dask Examples
- Parallel Data Analysis with Dask, Dask tutorial at PyCon 2018 by Tom Augspurger.
- Parallelizing Scientific Python with Dask, SciPy 2018 Tutorial by James Crist and Martin Durant.
- Parallelizing Scientific Python with Dask, SciPy 2017 Tutorial by James Crist.
- Parallel Python: Analyzing Large Datasets Intermediate, SciPy 2016 Tutorial by Matthew Rocklin.
- Parallel Data Analysis in Python, SciPy 2017 Tutorial by Matthew Rocklin, Ben Zaitlen & Aron Ahmadia.
- Matthew Rocklin - Streaming Processing with Dask
- Jacob Tomlinson - Dask Video Tutorial 2020
Hadoop
- Writing an Hadoop MapReduce Program in Python by Michael G. Noll.
Spark
- Getting Started with Apache Spark Tutorial - Databricks
- Hortonworks Data Tutorials

Blog posts#

Why Polars uses less memory than Pandas
Reducing Pandas memory usage #1: lossless compression
Reducing Pandas memory usage #2: lossy compression
Reducing Pandas memory usage #3: Reading in chunks
Don’t use Hadoop - your data isn’t that big
Format Wars: From VHS and Beta to Avro and Parquet, an overview of Hadoop file formats.
Should you replace Hadoop with your laptop? by Vicki Boykis.
Implementing MapReduce with multiprocessing by Doug Hellmann.
Deploying Dask on YARN by Jim Crist.
Native Hadoop file system (HDFS) connectivity in Python by Wes McKinney.
Working Notes from Matthew Rocklin (must read)
Large SVDs with Dask
Machine Learning – 7 astuces pour scaler Python sur de grands datasets
The Best Format to Save Pandas Data

Online courses#

DataCamp Cheat Sheets
Outils pour le Big Data by Pierre Nerzic. 🇫🇷
wikistat - Ateliers Big Data by Philippe Besse. 🇫🇷
Data Science and Big Data with Python by Steve Phelps.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.