
Introduction

The clock speed and processing power of a single computing core are no longer increasing at the rate they used to. Instead, additional computing power is obtained by having more computing units running in parallel. This change in the computing environment requires algorithms to be modified to work in such a setting if one wants to use all the resources provided.

The classic figure below (source: H. Sutter, DDJ) shows how the clock speeds of CPUs stopped growing exponentially while the numbers of transistors in them kept increasing.

_images/CPU.png

A more recent figure below (source: Max Roser, OurWorldInData.org) shows that Moore’s law, which states that the number of transistors on a chip increases exponentially, was still holding at least until 2018:

_images/Transistor-Count-over-time-to-2018.png

As mentioned above, the increasing number of transistors has been used to put more computing units on a chip. This means that programmers must change their algorithms and software so that they can use the parallel resources provided; see, for instance, this video by Intel explaining why the software “free lunch” is over, meaning that software developers cannot simply wait for faster chips but have to consider how to parallelize their code.

Multicore processors

In this round, we will focus on multithreaded algorithms running on multicore machines. In such machines, including standard desktop PCs and modern laptops, the processor contains multiple computing units, called CPU “cores”, that run in parallel. The cores can communicate through shared memory. As an example, the figure below shows the architecture of a 12-core Intel processor (source: Intel 64 and IA-32 Architectures Optimization Reference Manual).

_images/Haswell.png

In this setting, code is run in parallel in multiple threads that share memory. Even so, one can work at different levels of abstraction, such as:

  • Programming with threads directly (using locks, semaphores, etc. to synchronize).

  • Using a concurrency platform. For instance, in the fork-join framework, threads in a thread pool execute tasks in parallel.

  • Using a parallel collections library such as Scala parallel collections.

  • Using futures and promises (a small sketch of the last two abstraction levels is shown after this list).
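To make the last two abstraction levels more concrete, here is a minimal Scala sketch. It is not part of the course material, and the imports are an assumption: in Scala 2.13 the parallel collections live in the separate scala-parallel-collections module, while in older Scala versions the .par view is available without the extra import.

  // A minimal sketch of the two highest abstraction levels listed above:
  // parallel collections and futures. Assumes Scala 2.13 with the
  // scala-parallel-collections module on the classpath.
  import scala.collection.parallel.CollectionConverters._ // enables .par
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  object AbstractionLevels extends App {
    val data = (1 to 1000000).toVector

    // Parallel collections: the library splits the map over worker threads.
    val sumOfSquares = data.par.map(x => x.toLong * x).sum

    // Futures: an asynchronous task whose result we wait for explicitly.
    val sumFuture: Future[Long] = Future { data.map(_.toLong).sum }
    val sum = Await.result(sumFuture, 10.seconds)

    println(s"sum of squares = $sumOfSquares, sum = $sum")
  }

In both cases the programmer only describes what can be computed independently; the actual threads are managed by the underlying runtime, which on the JVM defaults to fork-join thread pools.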

In this course, we focus on the fork-join framework and see how it can be used to define methods in a parallel collections library. In fact, the Java fork-join framework, which we will use, is also used to implement the scala.collection.parallel library (documentation). In addition, we will see how some algorithms, such as mergesort and other sorting algorithms, can be parallelized in the fork-join framework.
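To give a first impression of what such code can look like, below is a rough Scala sketch of a parallel mergesort built directly on the JVM fork-join classes. It is only an illustration under assumptions of my own choosing (the class name SortTask, the sequential-sort threshold, and the in-place merge are not the course's reference implementation).

  // A rough sketch of parallel mergesort as JVM fork-join tasks.
  import java.util.concurrent.{ForkJoinPool, ForkJoinTask, RecursiveAction}

  class SortTask(a: Array[Int], lo: Int, hi: Int, threshold: Int) extends RecursiveAction {
    // Sorts the sub-array a[lo, hi) (hi exclusive).
    override def compute(): Unit =
      if (hi - lo <= threshold)
        java.util.Arrays.sort(a, lo, hi)    // small part: fall back to sequential sort
      else {
        val mid = lo + (hi - lo) / 2
        // Run the two halves as parallel fork-join sub-tasks and wait for both.
        ForkJoinTask.invokeAll(new SortTask(a, lo, mid, threshold),
                               new SortTask(a, mid, hi, threshold))
        merge(mid)                          // then merge the two sorted halves
      }

    // Standard sequential merge of a[lo, mid) and a[mid, hi) back into a.
    private def merge(mid: Int): Unit = {
      val tmp = java.util.Arrays.copyOfRange(a, lo, hi)
      var i = 0; var j = mid - lo; var k = lo
      while (i < mid - lo && j < hi - lo) {
        if (tmp(i) <= tmp(j)) { a(k) = tmp(i); i += 1 } else { a(k) = tmp(j); j += 1 }
        k += 1
      }
      while (i < mid - lo) { a(k) = tmp(i); i += 1; k += 1 }
      // Remaining elements of the right half are already in their final positions.
    }
  }

  object ParallelMergesort {
    def sort(a: Array[Int]): Unit =
      ForkJoinPool.commonPool().invoke(new SortTask(a, 0, a.length, 1 << 13))
  }

Each task either sorts a small sub-array sequentially or splits it in half and runs the two halves as parallel sub-tasks; the threshold controls when the recursion stops spawning new tasks and falls back to ordinary sequential sorting.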

Other parallel hardware systems

Multicore processors are only one kind of computing environment exhibiting parallelism.

General-purpose graphics processing units (GPUs) have an even larger number of cores on a chip. As an example, the figure below shows the architecture of the Kepler GK110/GK210 GPU with 192 cores and some other special units (source: NVIDIA Kepler GK110/210 whitepaper). Programming such GPUs is somewhat different from programming multicore processors: they typically use the single instruction, multiple threads (SIMT) paradigm, where a set of threads is executed in a synchronized manner. For low-level programming, one can use the CUDA or OpenCL programming platforms. Of course, one can also use Thrust or other libraries provided by NVIDIA and others.

_images/GPU.png

On the other end of the spectrum, computing clusters and data centers offer a large number of computers connected through high-speed networks. As an example, below is a picture of a Google data center (source: Google gallery). Again, programming such environments differs from programming multicore processors; one can use, for example, the Apache Spark engine, the MapReduce programming model, MPI, and so on. One can see, for instance, the following online book:

_images/Oklahoma.jpg
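As a small taste of this programming style, here is a hedged Scala sketch of the classic word count written against the Spark RDD API. The input path and application name are placeholders, and the snippet assumes the spark-core library and a Spark runtime are available; it is not material from the courses listed below.

  // A minimal word-count sketch in the Spark RDD style.
  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))
      val counts = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
        .flatMap(line => line.split("\\s+"))                  // "map" phase: split lines into words
        .map(word => (word, 1))
        .reduceByKey(_ + _)                                   // "reduce" phase: sum the counts per word
      counts.take(10).foreach(println)
      sc.stop()
    }
  }

The flatMap and map steps correspond to the “map” phase and reduceByKey to the “reduce” phase of the MapReduce model; the engine distributes these operations over the machines of the cluster.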

Different forms of parallelism are discussed further in the following Aalto courses:

  • CS-E4580 Programming Parallel Computers (5cr, period V): more about programming multicore machines and GPUs

  • CS-E4110 Concurrent programming (5cr, periods I and II): memory models, synchronization primitives, concurrency theory, actors, debugging

  • CS-E4640 Big Data Platforms (5cr, periods I and II): cloud computing technologies

  • CS-E4510 Distributed Algorithms (5cr, periods I and II): algorithms that are distributed to several machines and exchange messages in order to compute the final result