Threading in Python - Cloudera CDP Guide

Python has become the de-facto language framework for writing machine learning code due to its rich modules ecosystem especially in the data science realm.

By default, Python uses a single CPU thread to execute the code. This is largely due to GIL (Global Interpreter Lock) as single thread is safe to prevent corrupted output caused by race condition. While thread-safe is good, it does come with a price - the allocated/available CPU resource is often underutilized and it takes longer time to run the code. Today, there are ways to improve performance by using some modules with the likes of multithreading and multiprocessing but this must be done carefully because there's always a reason for GIL to exist. At first glance, the term multithreading sounds promising to speed up a particular Python program. Let's run some experiments to find out if this is really the case.

The following experiments are carried out using Cloudera Machine Learning (CML) on Kubernetes platform powered by Openshift 4.8 with the hardware specification as described below. CML is embedded with workbench and Jupyterlab notebook IDE for data scientist to do coding, EDA, etc. In this experiment, the CML workbench is used as it is lightweight, quicker to spin up and easy to use.

CPU	Memory	Disk
Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz	DIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)	SSD P4610 1.6TB SFF

Single Thread in Python

Create a CML workbench session with 1 CPU/2 GiB memory profile. Open a Terminal Access box and run the top command to check the total threads spawned by the process ID of this session pod. In this example, the process ID is 94. Note that there are 30 threads being opened by the process for this running session pod.
```
$ top -p 94 -H
```

Create a simple input file with the following Bash script in the terminal console.

$ > input;for i in {1..20};do echo line$i >> input;done; head -20 input
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10
line11
line12
line13
line14
line15
line16
line17
line18
line19
line20

Create a new file and copy this Python script into the workbench and run it. This script is coded to read line by line from the input file and write each line into the output file in a sequential manner.

After the process is completed, check the outcome of the output file. It shows details such as current CPU thread core of the hosting node, current pid, total of threads, which CPU core is scheduled to run the program at specific point of time and finally the read/written line. As expected, the code writes each line number to the output file sequentially without data corruption as Python runs the code linearly with a single Mainthread process.

$ more output
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line1
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line2
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line3
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line4
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line5
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line6
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line7
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line8
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line9
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line10
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line11
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line12
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line13
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line14
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line15
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line16
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line17
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line18
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line19
 
current thread:<_MainThread(MainThread, started 140555361314624)>
current pid:94
total threads:30
CPU number:6
line20

Let's recreate the file with more entries and rerun the abovementioned Python code.

$ > input;for i in {1..10000};do echo line$i >> input;done; head -20 input

Observe the processing time taken by Python to run the code. Run a few times to ensure consistent result. In this case, it takes approximately 32 seconds to run the program with the intended outcome.
```
Job Starts: 2071176.127430943
Job Ends: 2071208.742637604
Totals Execution Time:32.62 seconds.
```

Multithreading in Python

Let's shift gear a bit by executing the similar code with Threading module.

Create a CML workbench session with 1 CPU/2 GiB memory profile.
Run this Python script. Use the same input file with 10000 entries. Similar to the previous script, the script is amended by adding Threading class module.

After the process is completed, check the outcome of the newly generated output file. It shows the process uses multiple threads to run the code as indicated below.

$ more output
current thread:<Thread(Thread-3, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line1
 
current thread:<Thread(Thread-4, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line2
 
current thread:<Thread(Thread-5, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line3
 
current thread:<Thread(Thread-6, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line4
 
current thread:<Thread(Thread-7, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line5
 
current thread:<Thread(Thread-8, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line6
 
current thread:<Thread(Thread-9, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line7
 
current thread:<Thread(Thread-10, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line8
 
current thread:<Thread(Thread-11, started 139772086433536)>
current pid:97
total threads:31
CPU number:1
line9

Next, check the integrity of the output file. To reiterate, the purpose of this code is to read line by line from the input file and write each line into the output file in a sequential manner. Run the following script. In the event of no output, it concludes that the code achieves the intended outcome without data corruption. In this case, the result shows the same outcome as of the previous test (using single thread). It signals that the outcome is safe-thread when using the correct method.
```
$ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
```
However, the processing time taken by Python to run this code with multithreading is longer with approximately 50 seconds!
```
Job Starts: 2071685.346476694
Job Ends: 2071735.755200198
Totals Execution Time:50.41 seconds.
```

concurrent.futures.ThreadPoolExecutor in Python

Now let's add concurrent.futures.ThreadPoolExecutor module to run the same code with the same intended outcome.

Run this script in the CML workbench session.
The processing time taken by Python to run the code with concurrent.futures.ThreadPoolExecutor module is shorter at 24 seconds. Great!
```
Job Starts: 2070084.875582402
Job Ends: 2070109.707474271
Totals Execution Time:24.83 seconds.
```

Explore outcome of the newly generated output file. It shows that the process spawns 10 threads (defined by the number of workers) to run the code. The total threads before the running code was 30 and has increased to 40 when executing the code.

Threadpool1

current thread:<Thread(ThreadPoolExecutor-0_1, started 139664655841024)>
current pid:98
total threads:36
CPU number:2
line1
 
current thread:<Thread(ThreadPoolExecutor-0_3, started 139664639055616)>
current pid:98
total threads:36
CPU number:2
line2
 
current thread:<Thread(ThreadPoolExecutor-0_0, started 139664664233728)>
current pid:98
total threads:36
CPU number:2
line3
 
current thread:<Thread(ThreadPoolExecutor-0_2, started 139664647448320)>
current pid:98
total threads:38
CPU number:2
line4
 
current thread:<Thread(ThreadPoolExecutor-0_5, started 139664283399936)>
current pid:98
total threads:40
CPU number:2
line5
 
current thread:<Thread(ThreadPoolExecutor-0_1, started 139664655841024)>
current pid:98
total threads:40
CPU number:2
line6
 
current thread:<Thread(ThreadPoolExecutor-0_4, started 139664630662912)>
current pid:98
total threads:40
CPU number:2
line7
 
current thread:<Thread(ThreadPoolExecutor-0_6, started 139664275007232)>
current pid:98
total threads:40
CPU number:2
line8
 
current thread:<Thread(ThreadPoolExecutor-0_8, started 139664258221824)>
current pid:98
total threads:40
CPU number:2
line9
 
current thread:<Thread(ThreadPoolExecutor-0_0, started 139664664233728)>
current pid:98
total threads:40
CPU number:2
line10
 
current thread:<Thread(ThreadPoolExecutor-0_4, started 139664630662912)>
current pid:98
total threads:40
CPU number:2
line11
 
current thread:<Thread(ThreadPoolExecutor-0_9, started 139664249829120)>
current pid:98
total threads:40
CPU number:2
line12
 
current thread:<Thread(ThreadPoolExecutor-0_3, started 139664639055616)>
current pid:98
total threads:40
CPU number:2
line13
 
current thread:<Thread(ThreadPoolExecutor-0_7, started 139664266614528)>
current pid:98
total threads:40
CPU number:2
line14

Subsequently, check the integrity of the output. Run the following script to check if the code writes each line number to the output file sequentially. The result is negative.

$ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
line26
line25
line41
line40
line84
line83
line96
line95

Excerpt from the output file:

current thread:<Thread(ThreadPoolExecutor-1_9, started 139664249829120)>
current pid:98
total threads:40
CPU number:2
line26
 
current thread:<Thread(ThreadPoolExecutor-1_8, started 139664283399936)>
current pid:98
total threads:40
CPU number:2
line25
 
current thread:<Thread(ThreadPoolExecutor-1_0, started 139664258221824)>
current pid:98
total threads:40
CPU number:2
line27

The outcome is not in line with the intention of the code. The line number is not written sequentially beginning from line26. Unlike the previous outcome, the line numbers are no longer in sequence, e.g. line25 is written after line26 by ThreadPoolExecutor-1_8 and ThreadPoolExecutor-1_9 respectively. This is the behaviour of race condition as a result of multiple threads writing into the same file in parallel. In other words, concurrent.futures.ThreadPoolExecutor is not thread-safe (without locking mechanism) despite able to complete the program at a faster speed.

concurrent.futures.ThreadPoolExecutor with Lock in Python

Now let's use concurrent.futures.ThreadPoolExecutor module with locking mechanism to run the code with the same intention.

Run this script in the CML workbench session.
Next, check the integrity of the output. Run the following script to check if the code writes each line number to the output file sequentially. The outcome matches the objective with sequential entries in place.
```
$ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
```
The processing time taken by Python to run the code with concurrent.futures.ThreadPoolExecutor with Lock module is now taking longer than the code with single thread despite 10 workers/threads are spawned.
```
Job Starts: 2077114.640889352
Job Ends: 2077154.378118582
Totals Execution Time:39.74 seconds.
```
Let's rerun the same test by adjusting max_workers=1. Ironically, the speed is almost similar to that of running the script using the default single thread process.
```
Job Starts: 2078032.097896568
Job Ends: 2078064.409348964
Totals Execution Time:32.31 seconds.
```
While introducing lock in concurrent.futures.ThreadPoolExecutor module ensures integrity of the outcome, it comes with a price of lower speed because only one thread can obtain the lock at a time. Other threads need to wait for the lock to be released.

Conclusion: At first glance, multithreading sounds promising to enhance the speed to run specific Python code. In reality, the outcome depends heavily on how the code is written and most importantly whether or not it achieves the end goal. While running a Python code with more threads could be faster but it might not necessarily achieve the intended outcome. Single thread could work faster than multithreading for ensuring a high degree of integrity. In addition, creating multiple threads will also creates resource overhead. In a nutshell, applying multithreading will not guarantee highest performance and could lead to a backfire as it might effectuate data corruption as witnessed in the above experiments. Lastly, GIL still exists for a reason :)