Parallel Computation#
The objective of this notebook is to learn how to parallelize an application with Python.
Parallel computers#
Multiprocessor/multicore: several processors work on data stored in shared memory
Cluster: several processor/memory units work together by exchanging data over a network
Co-processor: a general-purpose processor delegates specific tasks to a special-purpose processor (GPU)
Parallel Programming#
Decomposition of the complete task into independent subtasks and the data flow between them.
Distribution of the subtasks over the processors minimizing the total execution time.
For clusters: distribution of the data over the nodes minimizing the communication time.
For multiprocessors: optimization of the memory access patterns minimizing waiting times.
Synchronization of the individual processes.
MapReduce#
from time import sleep

def f(x):
    sleep(1)
    return x*x

L = list(range(8))
L
%time sum(f(x) for x in L)
%time sum(map(f,L))
Multiprocessing#
multiprocessing is a package that supports spawning processes.
We can use it to find out how many CPU cores are available on your computer, which is the number of processes that can truly run at the same time.
from multiprocessing import cpu_count
cpu_count()
Futures#
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
The asynchronous execution can be performed with:
threads, using ThreadPoolExecutor,
separate processes, using ProcessPoolExecutor.
Both implement the same interface, which is defined by the abstract Executor class.
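Besides map, both executors offer submit, which schedules a single call and returns a Future whose value is retrieved later with result(). A minimal sketch, using a toy square function chosen only for illustration:

from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(square, 3)  # schedule the call asynchronously
    print(future.result())               # blocks until the value 9 is available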
From a notebook, concurrent.futures cannot launch worker processes on Windows: the spawn start method requires the executed function to be importable, and functions defined interactively are not. Windows users must install loky.
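A minimal sketch of the loky workaround, assuming loky is installed; its get_reusable_executor returns an executor with the same map/submit interface as the standard ones:

from loky import get_reusable_executor

def g(x):
    return x * x

executor = get_reusable_executor(max_workers=4)
print(list(executor.map(g, range(8))))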
%%file pmap.py
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time

def f(x):
    sleep(1)
    return x*x

L = list(range(8))

if __name__ == '__main__':
    begin = time()
    with ProcessPoolExecutor() as pool:
        result = sum(pool.map(f, L))
    end = time()
    print(f"result = {result} and time = {end-begin}")
import sys
!{sys.executable} pmap.py
ProcessPoolExecutor launches one worker process per core of the computer.
pool.map divides the input list into chunks and puts the tasks (function + chunk) on a queue.
Each worker process takes a task (function + a chunk of data), runs map(function, chunk), and puts the result on a result list.
pool.map on the main process waits until all tasks are handled and returns the concatenation of the result lists.
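The chunk size can also be set explicitly: Executor.map accepts a chunksize argument, which ProcessPoolExecutor uses to reduce serialization overhead (it has no effect with threads). Inside pmap.py, for instance, one could write:

with ProcessPoolExecutor() as pool:
    result = sum(pool.map(f, L, chunksize=2))  # each task now carries 2 items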
%%time
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    results = sum(pool.map(f, L))

print(results)
Thread and Process: Differences#
A process is an instance of a running program.
A process may contain one or more threads, but a thread cannot contain a process.
A process has a self-contained execution environment with its own memory space.
An application running on your computer may be a set of cooperating processes.
Processes do not share memory; communication between processes requires data serialization.
A thread exists within a process; every process has at least one thread.
Multiple threads within a process share resources, which makes communication between threads efficient.
Threads can run in parallel on a multi-core system, with each core executing a separate thread simultaneously.
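A small sketch of the memory-sharing difference, using a toy append_item function chosen only for illustration: threads mutate the very same list as the parent, whereas worker processes would each receive a serialized copy and leave the parent's list untouched.

from concurrent.futures import ThreadPoolExecutor

shared = []

def append_item(x):
    shared.append(x)  # all threads see and mutate the same list object
    return len(shared)

with ThreadPoolExecutor() as e:
    list(e.map(append_item, range(4)))

print(shared)  # the parent's list was modified by the threads
# with ProcessPoolExecutor, each worker would modify its own copy and `shared` would stay empty here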
The Global Interpreter Lock (GIL)#
The Python interpreter is not thread safe.
A few critical internal data structures may only be accessed by one thread at a time. Access to them is protected by the GIL.
Attempts at removing the GIL from CPython have so far failed. The main difficulty is maintaining the C API for extension modules.
Multiprocessing avoids the GIL by having separate processes which each have an independent copy of the interpreter data structures.
The price to pay: serialization of tasks, arguments, and results.
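A sketch that makes the GIL visible: a CPU-bound countdown gains nothing from threads, because only one thread at a time executes Python bytecode (the timings are indicative and machine dependent).

from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

def countdown(n=5_000_000):
    while n > 0:
        n -= 1

begin = perf_counter()
for _ in range(4):
    countdown()
print(f"sequential: {perf_counter() - begin:.2f} s")

begin = perf_counter()
with ThreadPoolExecutor(max_workers=4) as e:
    list(e.map(lambda _: countdown(), range(4)))
print(f"threads:    {perf_counter() - begin:.2f} s  (about the same, because of the GIL)")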
Parallelize text file downloads#
Victor Hugo http://www.gutenberg.org/files/135/135-0.txt
Marcel Proust http://www.gutenberg.org/files/7178/7178-8.txt
Exercise 6.1#
Use ThreadPoolExecutor to parallelize the sequential download code below.
%mkdir books
%%time
import urllib.request as url
source = "https://mmassd.github.io/" # "http://svmass2.mass.uhb.fr/hub/static/datasets/"
url.urlretrieve(source+"books/hugo.txt", filename="books/hugo.txt")
url.urlretrieve(source+"books/proust.txt", filename="books/proust.txt")
url.urlretrieve(source+"books/zola.txt", filename="books/zola.txt")
url.urlretrieve(source+"books/stendhal.txt", filename="books/stendhal.txt")
Wordcount#
from glob import glob
from collections import defaultdict
from operator import itemgetter
from itertools import chain
from concurrent.futures import ThreadPoolExecutor
def mapper(filename):
    """ Split a text file into a list of key/value pairs (word, 1) """
    with open(filename) as f:
        data = f.read()
    data = data.strip().replace(".", "").lower().split()
    return sorted([(w, 1) for w in data])

def partitioner(mapped_values):
    """ Get the lists from mapper and create a dict with
    (word, [1,1,1]) """
    res = defaultdict(list)
    for w, c in mapped_values:
        res[w].append(c)
    return res.items()

def reducer(item):
    """ Compute word occurrences from the pairs produced
    by partitioner
    """
    w, v = item
    return (w, len(v))
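Chained serially on a single file, the three functions already give the word counts; a quick check, assuming books/hugo.txt was downloaded above:

counts = map(reducer, partitioner(mapper("books/hugo.txt")))
print(sorted(counts, key=itemgetter(1), reverse=True)[:10])  # ten most frequent words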
Parallel map#
Let's improve the mapper function by printing the current process name inside the function.
Example
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def process_name(n):
    " prints out the current process name "
    print(f"{mp.current_process().name} ")

with ProcessPoolExecutor() as e:
    _ = e.map(process_name, range(mp.cpu_count()))
Exercise 6.2#
Modify the mapper function by adding this print.
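A possible sketch of the modified mapper, assuming multiprocessing is imported as mp as in the example above:

def mapper(filename):
    """ Split a text file into a list of key/value pairs (word, 1) """
    print(f"{mp.current_process().name} is mapping {filename}")  # added line
    with open(filename) as f:
        data = f.read()
    data = data.strip().replace(".", "").lower().split()
    return sorted([(w, 1) for w in data])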
Parallel reduce#
For a parallel reduce operation, the data must be grouped in a container. We already created the partitioner function that returns this container.
Exercise 6.3#
Write a parallel program with ThreadPoolExecutor that uses the three functions above. It should read all the “sample*.txt” files, with the map and reduce steps executed in parallel.
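A possible sketch, assuming the “sample*.txt” files are present in the working directory: the map step runs in the thread pool, the grouped pairs are then reduced in the same pool.

from glob import glob
from operator import itemgetter
from itertools import chain
from concurrent.futures import ThreadPoolExecutor

def parallel_wordcount(pattern="sample*.txt"):
    files = glob(pattern)
    with ThreadPoolExecutor() as pool:
        mapped = pool.map(mapper, files)           # parallel map step
        partitioned = partitioner(chain(*mapped))  # group the (word, 1) pairs by word
        reduced = pool.map(reducer, partitioned)   # parallel reduce step
        return sorted(reduced, key=itemgetter(1), reverse=True)

# print(parallel_wordcount()[:10])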
Increase volume of data#
Due to the proxy, the code above is not runnable on workstations.
Getting the data#
The Latin Library contains a huge collection of freely accessible Latin texts. We collect the links from the Latin Library's homepage, ignoring some links that are not associated with a particular author.
from bs4 import BeautifulSoup # web scraping library
from urllib.request import *
base_url = "http://www.thelatinlibrary.com/"
home_content = urlopen(base_url)
soup = BeautifulSoup(home_content, "lxml")
author_page_links = soup.find_all("a")
author_pages = [ap["href"] for i, ap in enumerate(author_page_links) if i < 49]
Generate html links#
Create a list of all links pointing to Latin texts. The Latin Library uses a special format which makes it easy to find the relevant links: all of them contain the name of the text's author.
ap_content = list()
for ap in author_pages:
    ap_content.append(urlopen(base_url + ap))

book_links = list()
for path, content in zip(author_pages, ap_content):
    author_name = path.split(".")[0]
    ap_soup = BeautifulSoup(content, "lxml")
    book_links += [link for link in ap_soup.find_all("a", {"href": True}) if author_name in link["href"]]
Download webpages content#
from urllib.error import HTTPError

num_pages = 100

for i, bl in enumerate(book_links[:num_pages]):
    print("Getting content " + str(i + 1) + " of " + str(num_pages), end="\r", flush=True)
    try:
        content = urlopen(base_url + bl["href"]).read()
        with open(f"book-{i:03d}.dat", "wb") as f:
            f.write(content)
    except HTTPError:
        print("Unable to retrieve " + bl["href"] + ".")
        continue
Extract data files#
I already put the content of the pages in files named book-*.dat.
You can extract the data from the archive by running the cell below.
import os       # library to get directory and file paths
import tarfile  # this module makes it possible to read and write tar archives

def extract_data():
    datadir = os.path.join('data', 'latinbooks')
    if not os.path.exists(datadir):
        print("Extracting data...")
        tar_path = os.path.join('data', 'latinbooks.tgz')
        with tarfile.open(tar_path, mode='r:gz') as books:
            books.extractall('data')

extract_data()  # this function call will extract text files in data/latinbooks
Read data files#
from glob import glob
files = glob('book*.dat')
texts = list()
for file in files:
    with open(file, 'rb') as f:
        text = f.read()
        texts.append(text)
Extract the text from html and split the text at periods to convert it into sentences.#
%%time
from bs4 import BeautifulSoup
sentences = list()
for i, text in enumerate(texts):
    print("Document " + str(i + 1) + " of " + str(len(texts)), end="\r", flush=True)
    textSoup = BeautifulSoup(text, "lxml")
    paragraphs = textSoup.find_all("p", attrs={"class": None})
    prepared = "".join([p.text.strip().lower() for p in paragraphs[1:-1]])
    for t in prepared.split("."):
        part = "".join([c for c in t if c.isalpha() or c.isspace()])
        sentences.append(part.strip())
# print first and last sentence to check the results
print(sentences[0])
print(sentences[-1])
Exercise 6.4#
Parallelize this last process using concurrent.futures.
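A possible sketch for Exercise 6.4, written as a script (like pmap.py above) so that ProcessPoolExecutor also works on Windows; the file name parallel_sentences.py and the helper prepare are arbitrary choices. HTML parsing is CPU bound, so processes are preferred over threads here.

%%file parallel_sentences.py
from glob import glob
from itertools import chain
from concurrent.futures import ProcessPoolExecutor
from bs4 import BeautifulSoup

def prepare(raw):
    " extract the text of one HTML document and split it into sentences "
    soup = BeautifulSoup(raw, "lxml")
    paragraphs = soup.find_all("p", attrs={"class": None})
    text = "".join(p.text.strip().lower() for p in paragraphs[1:-1])
    return ["".join(c for c in t if c.isalpha() or c.isspace()).strip()
            for t in text.split(".")]

if __name__ == "__main__":
    texts = [open(f, "rb").read() for f in glob("book*.dat")]
    with ProcessPoolExecutor() as pool:
        sentences = list(chain(*pool.map(prepare, texts)))
    print(sentences[0])
    print(sentences[-1])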