Parallel Computing with Python
This course is taught as part of the Master for Smart Data Science at ENSAI, Rennes.
Run Jupyter notebooks with Docker
Get the Docker app
You can run these notebooks with Docker. The following commands clone the repository and start a container with the Notebook server listening for HTTP connections on ports 8888 and 4040, with no authentication configured.
git clone https://github.com/pnavaro/big-data.git
docker run --rm -v $PWD/big-data:/home/jovyan/ -p 8888:8888 -p 4040:4040 pnavaro/big-data
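Once the container is running, a quick way to confirm that the published ports accept connections is a small TCP check. The sketch below is only an illustration: the helper name `port_is_open` and the host `localhost` are assumptions, and the port numbers are taken from the `docker run` command above. It only tests that a TCP connection can be established; it does not confirm that the Notebook server itself is responding.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports come from the docker run command; "localhost" is an assumption
    # that holds when Docker runs on the same machine as this script.
    for port in (8888, 4040):
        status = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {status}")
```

If port 8888 is reported open, the notebooks should be reachable in a browser at http://localhost:8888.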
References
Books
- Python for Data Analysis by Wes McKinney.
- Python Data Science Handbook by Jake VanderPlas.
Software documentation
Tutorials
- Python
- Analyzing and Manipulating Data with Pandas (Beginner), SciPy 2016 Tutorial by Jonathan Rocher.
- Dask
- Dask Examples
- Parallel Data Analysis with Dask, Dask tutorial at PyCon 2018 by Tom Augspurger.
- Parallelizing Scientific Python with Dask, SciPy 2018 Tutorial by James Crist and Martin Durant.
- Parallelizing Scientific Python with Dask, SciPy 2017 Tutorial by James Crist.
- Parallel Python: Analyzing Large Datasets (Intermediate), SciPy 2016 Tutorial by Matthew Rocklin.
- Parallel Data Analysis in Python, SciPy 2017 Tutorial by Matthew Rocklin, Ben Zaitlen & Aron Ahmadia.
- Matthew Rocklin - Streaming Processing with Dask
- Jacob Tomlinson - Dask Video Tutorial 2020
- Hadoop
- Writing an Hadoop MapReduce Program in Python by Michael G. Noll.
- Spark
Blog posts
- Why Polars uses less memory than Pandas
- Reducing Pandas memory usage #1: lossless compression
- Reducing Pandas memory usage #2: lossy compression
- Reducing Pandas memory usage #3: Reading in chunks
- Don’t use Hadoop - your data isn’t that big
- Format Wars: From VHS and Beta to Avro and Parquet, an overview of Hadoop file formats.
- Should you replace Hadoop with your laptop? by Vicki Boykis.
- Implementing MapReduce with multiprocessing by Doug Hellmann.
- Deploying Dask on YARN by Jim Crist.
- Native Hadoop file system (HDFS) connectivity in Python by Wes McKinney.
- Working Notes from Matthew Rocklin (must read)
- Large SVDs with Dask
- Machine Learning – 7 astuces pour scaler Python sur de grands datasets 🇫🇷
- The Best Format to Save Pandas Data
Online courses
- DataCamp Cheat Sheets
- Outils pour le Big Data by Pierre Nerzic. 🇫🇷
- wikistat - Ateliers Big Data by Philippe Besse. 🇫🇷
- Data Science and Big Data with Python by Steve Phelps.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.