Parallel Computing with Python#
This course is taught as part of the Master for Smart Data Science at ENSAI Rennes.
Run Jupyter notebooks with docker#
Get docker app#
You can run these notebooks with Docker. The following commands clone the repository and start a container with the Notebook server listening for HTTP connections on ports 8888 and 4040, without authentication configured.
git clone https://github.com/pnavaro/big-data.git
docker run --rm -v "$PWD/big-data":/home/jovyan/ -p 8888:8888 -p 4040:4040 pnavaro/big-data
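Once the container is running, you may want to confirm that the published ports are reachable before opening a browser. The snippet below is a minimal sketch, not part of the course material: the helper name `port_is_open` is hypothetical, and it assumes Docker runs on the same machine, so the ports are exposed on `localhost`. Port 4040 may well report as closed until something inside the container actually listens on it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (refused, timeout, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports published by the `docker run` command above.
    # "localhost" is an assumption: adjust it if Docker runs on a remote host.
    for port in (8888, 4040):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```

If port 8888 is reported as open, the Notebook server should be reachable at http://localhost:8888 in a browser.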
References#
Books#
Python for Data Analysis by Wes McKinney.
Python Data Science Handbook by Jake VanderPlas
Software documentation#
Tutorials#
Python
Analyzing and Manipulating Data with Pandas, Beginner, SciPy 2016 Tutorial by Jonathan Rocher.
Dask
Parallel Data Analysis with Dask, Dask tutorial at PyCon 2018 by Tom Augspurger.
Parallelizing Scientific Python with Dask, SciPy 2018 Tutorial by James Crist and Martin Durant.
Parallelizing Scientific Python with Dask, SciPy 2017 Tutorial by James Crist.
Parallel Python: Analyzing Large Datasets, Intermediate, SciPy 2016 Tutorial by Matthew Rocklin.
Parallel Data Analysis in Python, SciPy 2017 Tutorial by Matthew Rocklin, Ben Zaitlen & Aron Ahmadia.
Hadoop
Writing an Hadoop MapReduce Program in Python by Michael G. Noll.
Spark
Blog posts#
Format Wars: From VHS and Beta to Avro and Parquet, an overview of Hadoop file formats.
Should you replace Hadoop with your laptop? by Vicki Boykis.
Implementing MapReduce with multiprocessing by Doug Hellmann.
Deploying Dask on YARN by Jim Crist.
Native Hadoop file system (HDFS) connectivity in Python by Wes McKinney.
Working Notes from Matthew Rocklin (must read)
Machine Learning – 7 astuces pour scaler Python sur de grands datasets
Online courses#
Outils pour le Big Data by Pierre Nerzic. 🇫🇷
wikistat - Ateliers Big Data by Philippe Besse. 🇫🇷
Data Science and Big Data with Python by Steve Phelps.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.