CCDP Guide
Machine LearningThreading in Python

Threading in Python

Python has become the de-facto language framework for writing machine learning code due to its rich modules ecosystem especially in the data science realm.

By default, Python uses a single CPU thread to execute the code. This is largely due to GIL (Global Interpreter Lock) as single thread is safe to prevent corrupted output caused by race condition. While thread-safe is good, it does come with a price - the allocated/available CPU resource is often underutilized and it takes longer time to run the code. Today, there are ways to improve performance by using some modules with the likes of multithreading and multiprocessing but this must be done carefully because there's always a reason for GIL to exist. At first glance, the term multithreading sounds promising to speed up a particular Python program. Let's run some experiments to find out if this is really the case.

The following experiments are carried out using Cloudera Machine Learning (CML) on Kubernetes platform powered by Openshift 4.8 with the hardware specification as described below. CML is embedded with workbench and Jupyterlab notebook IDE for data scientist to do coding, EDA, etc. In this experiment, the CML workbench is used as it is lightweight, quicker to spin up and easy to use.

CPUMemoryDisk
Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHzDIMM DDR4 Synchronous Registered (Buffered) 2933 MHz (0.3 ns)SSD P4610 1.6TB SFF

Single Thread in Python

  1. Create a CML workbench session with 1 CPU/2 GiB memory profile. Open a Terminal Access box and run the top command to check the total threads spawned by the process ID of this session pod. In this example, the process ID is 94. Note that there are 30 threads being opened by the process for this running session pod.

    $ top -p 94 -H

    Singlethread1

  2. Create a simple input file with the following Bash script in the terminal console.

    $ > input;for i in {1..20};do echo line$i >> input;done; head -20 input
    line1
    line2
    line3
    line4
    line5
    line6
    line7
    line8
    line9
    line10
    line11
    line12
    line13
    line14
    line15
    line16
    line17
    line18
    line19
    line20
  3. Create a new file and copy this Python script into the workbench and run it. This script is coded to read line by line from the input file and write each line into the output file in a sequential manner.

    Singlethread2

  4. After the process is completed, check the outcome of the output file. It shows details such as current CPU thread core of the hosting node, current pid, total of threads, which CPU core is scheduled to run the program at specific point of time and finally the read/written line. As expected, the code writes each line number to the output file sequentially without data corruption as Python runs the code linearly with a single Mainthread process.

    $ more output
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line1
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line2
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line3
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line4
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line5
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line6
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line7
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line8
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line9
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line10
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line11
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line12
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line13
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line14
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line15
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line16
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line17
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line18
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line19
     
    current thread:<_MainThread(MainThread, started 140555361314624)>
    current pid:94
    total threads:30
    CPU number:6
    line20 
  5. Let's recreate the file with more entries and rerun the abovementioned Python code.

    $ > input;for i in {1..10000};do echo line$i >> input;done; head -20 input
  6. Observe the processing time taken by Python to run the code. Run a few times to ensure consistent result. In this case, it takes approximately 32 seconds to run the program with the intended outcome.

    Job Starts: 2071176.127430943
    Job Ends: 2071208.742637604
    Totals Execution Time:32.62 seconds.

Multithreading in Python

Let's shift gear a bit by executing the similar code with Threading module.

  1. Create a CML workbench session with 1 CPU/2 GiB memory profile.

  2. Run this Python script. Use the same input file with 10000 entries. Similar to the previous script, the script is amended by adding Threading class module.

  3. After the process is completed, check the outcome of the newly generated output file. It shows the process uses multiple threads to run the code as indicated below.

    $ more output
    current thread:<Thread(Thread-3, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line1
     
    current thread:<Thread(Thread-4, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line2
     
    current thread:<Thread(Thread-5, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line3
     
    current thread:<Thread(Thread-6, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line4
     
    current thread:<Thread(Thread-7, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line5
     
    current thread:<Thread(Thread-8, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line6
     
    current thread:<Thread(Thread-9, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line7
     
    current thread:<Thread(Thread-10, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line8
     
    current thread:<Thread(Thread-11, started 139772086433536)>
    current pid:97
    total threads:31
    CPU number:1
    line9
  4. Next, check the integrity of the output file. To reiterate, the purpose of this code is to read line by line from the input file and write each line into the output file in a sequential manner. Run the following script. In the event of no output, it concludes that the code achieves the intended outcome without data corruption. In this case, the result shows the same outcome as of the previous test (using single thread). It signals that the outcome is safe-thread when using the correct method.

    $ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
  5. However, the processing time taken by Python to run this code with multithreading is longer with approximately 50 seconds!

    Job Starts: 2071685.346476694
    Job Ends: 2071735.755200198
    Totals Execution Time:50.41 seconds.

concurrent.futures.ThreadPoolExecutor in Python

Now let's add concurrent.futures.ThreadPoolExecutor module to run the same code with the same intended outcome.

  1. Run this script in the CML workbench session.

  2. The processing time taken by Python to run the code with concurrent.futures.ThreadPoolExecutor module is shorter at 24 seconds. Great!

    Job Starts: 2070084.875582402
    Job Ends: 2070109.707474271
    Totals Execution Time:24.83 seconds.
  3. Explore outcome of the newly generated output file. It shows that the process spawns 10 threads (defined by the number of workers) to run the code. The total threads before the running code was 30 and has increased to 40 when executing the code.

    Threadpool1

    current thread:<Thread(ThreadPoolExecutor-0_1, started 139664655841024)>
    current pid:98
    total threads:36
    CPU number:2
    line1
     
    current thread:<Thread(ThreadPoolExecutor-0_3, started 139664639055616)>
    current pid:98
    total threads:36
    CPU number:2
    line2
     
    current thread:<Thread(ThreadPoolExecutor-0_0, started 139664664233728)>
    current pid:98
    total threads:36
    CPU number:2
    line3
     
    current thread:<Thread(ThreadPoolExecutor-0_2, started 139664647448320)>
    current pid:98
    total threads:38
    CPU number:2
    line4
     
    current thread:<Thread(ThreadPoolExecutor-0_5, started 139664283399936)>
    current pid:98
    total threads:40
    CPU number:2
    line5
     
    current thread:<Thread(ThreadPoolExecutor-0_1, started 139664655841024)>
    current pid:98
    total threads:40
    CPU number:2
    line6
     
    current thread:<Thread(ThreadPoolExecutor-0_4, started 139664630662912)>
    current pid:98
    total threads:40
    CPU number:2
    line7
     
    current thread:<Thread(ThreadPoolExecutor-0_6, started 139664275007232)>
    current pid:98
    total threads:40
    CPU number:2
    line8
     
    current thread:<Thread(ThreadPoolExecutor-0_8, started 139664258221824)>
    current pid:98
    total threads:40
    CPU number:2
    line9
     
    current thread:<Thread(ThreadPoolExecutor-0_0, started 139664664233728)>
    current pid:98
    total threads:40
    CPU number:2
    line10
     
    current thread:<Thread(ThreadPoolExecutor-0_4, started 139664630662912)>
    current pid:98
    total threads:40
    CPU number:2
    line11
     
    current thread:<Thread(ThreadPoolExecutor-0_9, started 139664249829120)>
    current pid:98
    total threads:40
    CPU number:2
    line12
     
    current thread:<Thread(ThreadPoolExecutor-0_3, started 139664639055616)>
    current pid:98
    total threads:40
    CPU number:2
    line13
     
    current thread:<Thread(ThreadPoolExecutor-0_7, started 139664266614528)>
    current pid:98
    total threads:40
    CPU number:2
    line14
  4. Subsequently, check the integrity of the output. Run the following script to check if the code writes each line number to the output file sequentially. The result is negative.

    $ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
    line26
    line25
    line41
    line40
    line84
    line83
    line96
    line95

    Excerpt from the output file:

    current thread:<Thread(ThreadPoolExecutor-1_9, started 139664249829120)>
    current pid:98
    total threads:40
    CPU number:2
    line26
     
    current thread:<Thread(ThreadPoolExecutor-1_8, started 139664283399936)>
    current pid:98
    total threads:40
    CPU number:2
    line25
     
    current thread:<Thread(ThreadPoolExecutor-1_0, started 139664258221824)>
    current pid:98
    total threads:40
    CPU number:2
    line27
  5. The outcome is not in line with the intention of the code. The line number is not written sequentially beginning from line26. Unlike the previous outcome, the line numbers are no longer in sequence, e.g. line25 is written after line26 by ThreadPoolExecutor-1_8 and ThreadPoolExecutor-1_9 respectively. This is the behaviour of race condition as a result of multiple threads writing into the same file in parallel. In other words, concurrent.futures.ThreadPoolExecutor is not thread-safe (without locking mechanism) despite able to complete the program at a faster speed.

concurrent.futures.ThreadPoolExecutor with Lock in Python

Now let's use concurrent.futures.ThreadPoolExecutor module with locking mechanism to run the code with the same intention.

  1. Run this script in the CML workbench session.

  2. Next, check the integrity of the output. Run the following script to check if the code writes each line number to the output file sequentially. The outcome matches the objective with sequential entries in place.

    $ cnt=0;for i in `cat output | grep line`; do cnt=`expr $cnt + 1` ; if [ $i != line$cnt ]; then echo $i;fi ; done
  3. The processing time taken by Python to run the code with concurrent.futures.ThreadPoolExecutor with Lock module is now taking longer than the code with single thread despite 10 workers/threads are spawned.

    Job Starts: 2077114.640889352
    Job Ends: 2077154.378118582
    Totals Execution Time:39.74 seconds.
  4. Let's rerun the same test by adjusting max_workers=1. Ironically, the speed is almost similar to that of running the script using the default single thread process.

    Job Starts: 2078032.097896568
    Job Ends: 2078064.409348964
    Totals Execution Time:32.31 seconds.
  5. While introducing lock in concurrent.futures.ThreadPoolExecutor module ensures integrity of the outcome, it comes with a price of lower speed because only one thread can obtain the lock at a time. Other threads need to wait for the lock to be released.

Conclusion: At first glance, multithreading sounds promising to enhance the speed to run specific Python code. In reality, the outcome depends heavily on how the code is written and most importantly whether or not it achieves the end goal. While running a Python code with more threads could be faster but it might not necessarily achieve the intended outcome. Single thread could work faster than multithreading for ensuring a high degree of integrity. In addition, creating multiple threads will also creates resource overhead. In a nutshell, applying multithreading will not guarantee highest performance and could lead to a backfire as it might effectuate data corruption as witnessed in the above experiments. Lastly, GIL still exists for a reason :)


CCDP Guide
All trademarks, logos, service marks and company names appeared here are the property of their respective owners.