Notes on Parallel and Concurrent Programming
To refresh my knowledge of concurrency and parallel programming, I took the “Parallel and Concurrent Programming with Java 1–2” courses on LinkedIn Learning, and I thought I'd share my notes, hoping they will help you, too.
If you would like to have a look, here is my previous post (Part I):
Synchronization
Condition Variable
- queue of threads waiting for a certain condition associated with a mutex.
- busy waiting / spinning: repeatedly acquiring and releasing the lock to check whether a certain condition holds yet.
- this is a limitation of using a mutex alone.
- solution => condition variable: a queue of threads waiting for a certain condition; a waiting place for threads to get notified.
- a mutex and a condition variable work together to form a higher-level construct called a monitor.
- lock.newCondition() -> to create a condition variable in Java.
Monitor
- condition variables work together with a mutex to serve as a monitor.
- protect section of code with mutual exclusion.
- provide ability for threads to wait until a condition occurs.
- signaling mechanism for notifying threads.
3 Operations: wait, signal, broadcast
- wait
  - automatically release the lock on the mutex
  - go to sleep and enter the waiting queue
  - reacquire the lock when woken up
- signal (notify/wake)
  - wake up one thread from the condition variable queue
- broadcast (notifyAll/signalAll)
  - wake up all threads from the condition variable queue
Shared Queue or Buffer
- for cases requiring multiple condition variables
- example: implementing a shared queue/buffer, which needs
  - a mutex
  - condition variables (BufferNotFull, BufferNotEmpty)
- first signal, then unlock
- if “signal” wakes up the wrong thread, the program can get stuck (so prefer “signalAll” when in doubt).
mutex.lock();
try {
    while (!condition)             // check the condition in a loop (guards against spurious wakeups)
        conditionVariable.await(); // releases the mutex and sleeps until signaled
    // execute the critical section
} finally {
    mutex.unlock();
}
------------------------------------
mutex.lock();
try {
    // change the shared state
    conditionVariable.signal();    // wake one waiting thread (or signalAll() to wake them all)
} finally {
    mutex.unlock();
}
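Putting these pieces together, here is a minimal sketch of a shared buffer protected by a mutex and two condition variables; the class and variable names (SharedBuffer, bufferNotFull, bufferNotEmpty) are illustrative, not from the course:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class SharedBuffer {
    private final Queue<Integer> queue = new ArrayDeque<>();
    private final int capacity = 5;
    private final ReentrantLock mutex = new ReentrantLock();
    private final Condition bufferNotFull = mutex.newCondition();
    private final Condition bufferNotEmpty = mutex.newCondition();

    void put(int item) throws InterruptedException {
        mutex.lock();
        try {
            while (queue.size() == capacity) // wait while the buffer is full
                bufferNotFull.await();
            queue.add(item);
            bufferNotEmpty.signalAll();      // wake consumers waiting for data
        } finally {
            mutex.unlock();
        }
    }

    int take() throws InterruptedException {
        mutex.lock();
        try {
            while (queue.isEmpty())          // wait while the buffer is empty
                bufferNotEmpty.await();
            int item = queue.remove();
            bufferNotFull.signalAll();       // wake producers waiting for space
            return item;
        } finally {
            mutex.unlock();
        }
    }
}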
Producer - Consumer Pattern
- 1 or more threads/processes act as producers: they add elements to a shared data structure.
- 1 or more act as consumers: they remove elements from that shared data structure.
- the shared data structure is typically a queue.
- First In First Out (FIFO): items are removed in the same order that they’re added to the queue.
Challenges:
- (since it’s a shared queue) enforce mutual exclusion of producers and consumers.
- prevent producers from trying to add data to a full queue. (buffer overflow)
- prevent consumers from trying to remove data from an empty queue.
Unbounded Queue: a queue with unlimited capacity, which some programming languages offer; usually implemented using linked lists (still limited by physical memory, though)
- useful because production may come in bursts.
- safe as long as the average rate of production < average rate of consumption.
- pipeline of tasks
Pipeline
- consists of a chain of processing elements arranged so that the output of each element is the input to the next one.
- a chain of producer-consumer pairs, connected with a buffer (like a queue) between each pair of consecutive elements.
- in Java, you do not have to implement the queue-protection part yourself.
- interface: BlockingQueue; class: new ArrayBlockingQueue<String>(5);
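For example, a minimal two-stage pipeline sketch; the item count and strings are made up for illustration:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineDemo {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(5); // bounded, thread-safe buffer

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    queue.put("item " + i);           // blocks while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    System.out.println(queue.take()); // blocks while the queue is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}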
Semaphore
- another synchronization mechanism that can be used to control access to shared resources.
- unlike a mutex, a semaphore can be used by multiple threads at the same time.
- includes a counter to track availability (how many times it’s been acquired/released)
- Semaphore, acquire() / release()
new Semaphore(1) => a semaphore initialized to 1 is binary
- counting semaphore: value ≥ 0; used to track limited resources (a pool of connections for a server or database; items in a queue)
- binary semaphore: value = 0 (locked) or 1 (unlocked); used similarly to a mutex by acquiring/releasing. (The difference: a mutex can only be acquired/released by the same thread; a semaphore can be acquired/released by different threads.)
- Producer-Consumer Semaphore: a pair of semaphores (one tracking free space, one tracking available items) can implement the producer-consumer pattern.
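A small sketch of acquire()/release() guarding a hypothetical pool of 3 connections; the pool size and the sleep are placeholders:

import java.util.concurrent.Semaphore;

public class SemaphoreDemo {
    private static final Semaphore connections = new Semaphore(3); // 3 permits = 3 resources

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                try {
                    connections.acquire();      // blocks until a permit is available
                    try {
                        Thread.sleep(100);      // simulate using the resource
                    } finally {
                        connections.release();  // return the permit
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}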
Barriers
- Race Condition: a flaw in the timing or ordering of a program’s execution that causes incorrect behavior. (Not the same as a data race, but each can cause the other.)
- it can happen even if you use a mutex, because of the order in which the threads run.
- e.g., an “add 3” thread and a “multiply by 2” thread racing on a shared value of 1: (1 + 3) * 2 != (1 * 2) + 3
- adding sleep statements can change the timing enough to expose or hide the problem.
- if you try to debug one, you may meet a Heisenbug: a software bug that disappears when you try to study it.
- Barrier: prevents a group of threads from proceeding until enough threads have reached the barrier. The scheduling may differ between runs, but the result is the same each time.
- CyclicBarrier in Java, can be reused after the waiting threads are released.
- BrokenBarrierException
- int getParties() => total # of threads needed to trip barrier
- int getNumberWaiting() => current number of threads waiting on the barrier
- void reset() => reset barrier to initial state (before reuse)
- boolean isBroken() => has a thread broken out of the barrier since the last reset due to interruption or timeout
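A minimal CyclicBarrier sketch; the party count of 4 and the printed messages are arbitrary:

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        // all 4 parties must reach the barrier before any of them proceeds
        CyclicBarrier barrier = new CyclicBarrier(4, () -> System.out.println("barrier tripped"));

        for (int i = 0; i < 4; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("thread " + id + " working");
                    barrier.await(); // wait for the other threads
                    System.out.println("thread " + id + " continuing");
                } catch (InterruptedException | BrokenBarrierException e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}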
CountDownLatch
- allows 1 or more threads to wait until a set of operations being performed in other threads completes.
- CountDownLatch doesn’t require the threads calling countDown() to wait before proceeding; they’re free to continue. It only prevents threads that call await() from proceeding before the count reaches 0.
- CountDownLatch(value) in Java
- await() => wait for count value to reach 0
- countDown() => decrement count value
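A small CountDownLatch sketch, assuming three worker threads that the main thread waits for:

import java.util.concurrent.CountDownLatch;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(3); // wait for 3 workers

        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                System.out.println("worker " + id + " done");
                latch.countDown(); // decrement and continue; countDown() does not block
            }).start();
        }

        latch.await(); // the main thread blocks here until the count reaches 0
        System.out.println("all workers finished");
    }
}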
CyclicBarrier vs. CountDownLatch
- CyclicBarrier => releases when the set number of threads is waiting; can be reset and reused
- CountDownLatch => releases when the count reaches 0; no reset
mini reminder:
- a CountDownLatch initialized to 1 => multiple threads wait until one thread completes some action before they continue.
- Barriers can be used to control the relative order in which threads execute certain operations.
- the OS execution scheduler is what makes race conditions possible: the order in which the OS schedules threads to execute is non-deterministic.
- a race condition can occur independently of a data race
- the order in which two threads execute their respective operations will change the output => creates a potential for a race condition.
Asynchronous Tasks
- computational graph => Directed Acyclic Graph (DAG)
- work = sum of the time for all task nodes in the graph
- critical path = the longest series of sequential operations through the program
- span = sum of the time along the critical path (the shortest possible execution time)
- Ideal Parallelism = work / span
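For example, if a computational graph has work = 100 time units and the nodes along its critical path take 25 time units (the span), the ideal parallelism is 100 / 25 = 4: at best, 4 processors could be kept usefully busy.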
Thread Pool
- creates and maintains a collection of worker threads.
- reuses existing threads to execute tasks
- like a to-do list
- reusing threads avoids the overhead of creating a new thread for every task
To create a thread pool in Java:
ExecutorService (interface) in java.util.concurrent
- higher-level interface for running tasks
- includes features to manage the executor and task lifecycle
Executors Class
- Executors.newSingleThreadExecutor() = creates an executor that uses a single thread to execute tasks
- Executors.newFixedThreadPool(int nThreads) = creates a thread pool that reuses a fixed number of threads to execute tasks.
- pool.submit()
- the executor keeps waiting for new tasks to be submitted, so you need to shut it down explicitly.
- pool.shutdown() = orderly shutdown. already started tasks will continue but the pool will not accept any new tasks.
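A minimal ExecutorService sketch; the pool size and task count are arbitrary:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // reuse 4 worker threads

        for (int i = 0; i < 10; i++) {
            final int taskId = i;
            pool.submit(() -> System.out.println("task " + taskId
                    + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown(); // started tasks finish; no new tasks accepted
    }
}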
Future
- placeholder for a result that will be available later.
- mechanism to access the result of an asynchronous operation.
- Future is read-only
- think of it as an IOU: “I’ll wait until it finishes.” Writing a value to the future is called resolving or fulfilling it (“fulfilling the promise”).
- Callable<V> Interface: to get the result back from an asynchronous task in Java.
- call() method, throws Exception
- submit() returns Future
- if you call result.get() and the result is not ready yet, the calling thread waits in a blocked state.
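A small Callable/Future sketch; the sleep and the return value just simulate slow work:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        Callable<Integer> task = () -> {
            Thread.sleep(500); // simulate slow work
            return 42;
        };

        Future<Integer> result = pool.submit(task);     // returns a placeholder immediately
        System.out.println("doing other work...");
        System.out.println("result: " + result.get()); // blocks until the value is ready

        pool.shutdown();
    }
}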
Parallel Execution across Multiple Processors, Algorithm Classes
Divide and Conquer Algorithms
1. Divide the problem into subproblems (of roughly equal size)
2. Conquer the subproblems by solving them recursively
3. Combine the solutions to the subproblems.
if “base case” => subdivided into a small enough piece to solve directly
— solve problem
else
— partition problem into “left” and “right” subproblems
— solve “left” problem using divide-and-conquer (solve recursively)
— solve “right” problem using divide-and-conquer (solve recursively)
— combine solutions to “left” and “right” problems
Fork/Join Framework (Java)
ForkJoinPool — distributes tasks to its worker threads.
- ExecutorService that executes ForkJoinTasks
- fork() — asynchronously execute task in ForkJoinPool
- join() — return result of computation when it is done
ForkJoinTask Subclasses
- RecursiveTask<V> — Returns a result
- RecursiveAction — Does not return a result
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class RecursiveSum extends RecursiveTask<Long> {
    private final long lo, hi;
    RecursiveSum(long lo, long hi) { this.lo = lo; this.hi = hi; }

    protected Long compute() {
        if (hi - lo <= 100_000) {  // base case: small enough to sum directly (threshold chosen arbitrarily)
            long total = 0;
            for (long i = lo; i <= hi; i++) total += i;
            return total;
        }
        long mid = (lo + hi) / 2;  // divide the range in half
        RecursiveSum left = new RecursiveSum(lo, mid);
        RecursiveSum right = new RecursiveSum(mid + 1, hi);
        left.fork();               // forked thread computes the left half
        return right.compute() + left.join(); // current thread computes the right half
    }
}

ForkJoinPool pool = ForkJoinPool.commonPool();
Long total = pool.invoke(new RecursiveSum(0, 1_000_000));
pool.shutdown(); // note: shutdown() has no effect on the common pool

mini reminder:
Recursive -> Divide-And-Conquer Algorithm
* ForkJoinPool implementation instead of creating new threads to handle subproblems because ForkJoinPool manages a thread pool to execute its ForkJoinTasks which reduces the overhead of thread creation.
* Callable (call) returns object whereas Runnable (run) does not
* Future: a placeholder to access a result that may not have been computed yet.
* Span: sum of the time for all task nodes along the critical path.
* Critical Path: longest series of sequential operations through the program
* Work: sum of the time for all task nodes in a computational graph.
* Computational Graphs are useful; they help to identify opportunities for parallel execution.
Evaluating Parallel Performance
Weak Scaling = variable number of processors with a fixed problem size per processor (more work in the same amount of time)
Strong Scaling = Variable number of processors with fixed total problem size (same work in less time)
- Throughput = # of tasks completed in a given amount of time (increases under both weak and strong scaling).
Throughput = (# tasks) / time
- Latency = amount of time needed to execute a task
Latency = time / task
- Speedup = how many times faster the parallel version runs
Speedup = (sequential execution time) / (parallel execution time with N workers)
Amdahl’s Law
- calculates an upper limit for the overall speedup that parallelizing a program will achieve.
Overall Speedup = 1 / ((1 - P) + (P / S))
- P = portion of program that’s parallelizable
- S = speedup of the parallelized portion
- answers the question: how much overall speedup can we expect if a portion P of the program is parallelized with speedup S?
- after a point, adding more processors is not of much use; the sequential part limits the overall performance.
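For example (plugging numbers into the formula): with P = 0.9 and S = 8, overall speedup = 1 / (0.1 + 0.9/8) ≈ 4.7; and even as S grows without bound, the speedup can never exceed 1 / (1 - P) = 10.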
Efficiency = how well additional resources are utilized.
Efficiency = speedup / (# of processors)
- If you are planning to run a performance test in Java, do a “dry run” once to warm up the system (in case the algorithm needs any Just-In-Time compilation or leaves the cache in a certain state).
- Take average of multiple runs’ results because the execution time will vary from run to run depending on how the operating system chooses to schedule your program.
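A bare-bones sketch of that advice; the workload and run count here are placeholders:

public class TimingDemo {
    static void doWork() {
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i; // placeholder workload
    }

    public static void main(String[] args) {
        doWork(); // dry run to warm up the JIT compiler and caches

        int runs = 10;
        long totalNs = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            doWork();
            totalNs += System.nanoTime() - start; // accumulate per-run elapsed time
        }
        System.out.printf("average: %.2f ms%n", (totalNs / runs) / 1e6);
    }
}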
Designing Parallel Programs
- Partitioning (discrete parts)
- Communication
- Agglomeration
- Mapping
Partitioning (1)
- Domain (Data) Decomposition
  * Block Decomposition
  * Cyclic Decomposition
- Functional (Computational Work) Decomposition
Communication (2)
- if there’s need to share information between tasks
- point-to-point communication between neighboring tasks (sender -> receiver)
- Collective Communication: Broadcast, Scatter, Gather
- Synchronous Blocking Communication (cannot do any other work before communication is completed) / Asynchronous Nonblocking Communication
- Overhead, Latency, Bandwidth (bytes per second)
Agglomeration (3)
- combine tasks or replicate data
- Tasks >> Processors
Granularity = (time spent on Computation) / (time spent on Communication)
Fine-Grained Parallelism
- large number of small tasks
- advantage -> good distribution of workload (load balancing)
- disadvantage -> low computation-to-communication ratio
Coarse-Grained Parallelism
- small number of large tasks
- advantage -> high computation-to-communication ratio
- disadvantage -> inefficient load balancing
Mapping (4)
- specify where each task will execute.
- does not apply to -> single core processors, automated task scheduling
- the OS automatically handles scheduling threads to execute on each processor core, so the mapping design stage does not apply to common desktop OSes.
- dynamic load balancing techniques, mapping algorithms
Happy Coding!