Part 2 of week 4's lectures delved into sort algorithms.
Sorting is difficult in parallel programming because things get more complex once we add the following objectives:
Keeping the hardware busy
Limiting branch divergence
Coalescing memory access
Odd-Even Sort (brick sort), parallel version of bubble sort:
Start with an array of elements; on each iteration compare neighbouring pairs, alternating the polarity (odd/even starting index) between iterations. Elements that are out of order are swapped.
Step complexity: O(n), Work Complexity: O(n^2)
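A minimal sketch of one phase of odd-even sort as a CUDA kernel (my own illustration, not code from the lecture); the host launches it n times, alternating the phase between even and odd starting indices:

__global__ void oddEvenPhase(float *d_data, int n, int phase)
{
    // each thread owns one neighbouring pair; phase 0 gives pairs (0,1),(2,3)..., phase 1 gives (1,2),(3,4)...
    int idx = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + phase;
    if (idx + 1 < n && d_data[idx] > d_data[idx + 1]) {
        float tmp = d_data[idx];          // swap out-of-order neighbours
        d_data[idx] = d_data[idx + 1];
        d_data[idx + 1] = tmp;
    }
}

// host side (sketch): n passes of alternating polarity
// for (int pass = 0; pass < n; ++pass)
//     oddEvenPhase<<<(n/2 + 255)/256, 256>>>(d_data, n, pass % 2);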
Merge sort – a divide and conquer approach, ideal for GPUs:
Repeatedly merge sorted lists together: n short sorted lists become n/2 longer ones, then n/4, and so on.
Step complexity: O(log n), Work complexity: O(n log n)
General implementations use a different sort to produce the initial sorted chunks of around 1024 elements.
Merging two sorted lists can itself be done in parallel, recalling the scatter-address idea from compact.
How parallel merge works:
With 2 sorted arrays launch a thread for every element in each array. The goal of each thread is to calculate the position of its element in the final list.
Each thread already knows its position in its own list: its index.
It then calculates its position in the other array: every thread does a binary search on the other array, and the sum of the two positions gives the element's address in the merged output.
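A sketch of this idea, assuming two sorted float arrays A and B (the helper names lowerBound/upperBound are my own); using lower-bound for A's elements and upper-bound for B's keeps equal keys from colliding:

__device__ int lowerBound(const float *arr, int len, float key)   // first index with arr[idx] >= key
{
    int lo = 0, hi = len;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (arr[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

__device__ int upperBound(const float *arr, int len, float key)   // first index with arr[idx] > key
{
    int lo = 0, hi = len;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (arr[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

__global__ void mergeSorted(const float *A, int lenA, const float *B, int lenB, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < lenA)
        out[i + lowerBound(B, lenB, A[i])] = A[i];        // own index + rank in the other array
    else if (i < lenA + lenB) {
        int j = i - lenA;
        out[j + upperBound(A, lenA, B[j])] = B[j];
    }
}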
The different stages of merge sort require different algorithms to get the most out of hardware
When we get to the top of the merge tree, allocating one huge merge to a single streaming multiprocessor [SM] leaves all the other SMs sitting idle. So we want one merge to be spread across multiple SMs.
How do we make sub-tasks? Splitters. The logic does not seem too complex, but I will review that section below:
Moving away from merge sort, a new approach was explored – sorting networks. These are a form of oblivious algorithm, i.e. an algorithm that performs the same operations regardless of its input. This enables a high level of parallelism.
Bitonic sort has work complexity O(n log² n); its step complexity, however, is only O(log² n).
All scenarios take the same amount of time: the algorithm is oblivious! If the input set is small enough to fit into shared memory then sorting networks are generally good performers.
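A common single-kernel formulation of the bitonic sorting network (a sketch, assuming n is a power of two and one thread per element); every thread performs the same compare-exchange pattern regardless of the data, which is exactly what makes it oblivious:

__global__ void bitonicStep(float *d_data, int j, int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ixj = i ^ j;                      // partner index for this compare-exchange
    if (ixj > i) {
        bool ascending = ((i & k) == 0);           // direction depends only on the index, not the data
        if ((ascending && d_data[i] > d_data[ixj]) ||
            (!ascending && d_data[i] < d_data[ixj])) {
            float tmp = d_data[i];
            d_data[i] = d_data[ixj];
            d_data[ixj] = tmp;
        }
    }
}

// host side (sketch): for (k = 2; k <= n; k <<= 1) for (j = k >> 1; j > 0; j >>= 1) launch bitonicStep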
The best performing sort algorithm on GPUs is radix sort – the steps for radix sort on integers are:
Start with the least significant bit of the integer
Split the input into 2 sets based on that bit, the 0s and the 1s (Compact – see part 1)
Proceed to the next least significant bit and repeat
Work complexity is O(kn), where k is the number of bits in the representation and n is the number of elements to sort.
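A serial reference for the loop described above (my own sketch, not the GPU implementation); on the GPU the stable 0s/1s split is exactly the compact pattern, driven by an exclusive scan over the bit predicate:

void radixSortByBits(unsigned int *data, unsigned int *tmp, int n)
{
    for (int bit = 0; bit < 32; ++bit) {                    // start with the least significant bit
        int zeros = 0;
        for (int i = 0; i < n; ++i)                         // count elements whose current bit is 0
            if (((data[i] >> bit) & 1u) == 0u) ++zeros;
        int zeroPos = 0, onePos = zeros;                    // 0s first, 1s after: a stable split
        for (int i = 0; i < n; ++i) {
            if (((data[i] >> bit) & 1u) == 0u) tmp[zeroPos++] = data[i];
            else                               tmp[onePos++]  = data[i];
        }
        for (int i = 0; i < n; ++i) data[i] = tmp[i];       // proceed to the next bit and repeat
    }
}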
The final sort algorithms discussed were quicksort and key-value sort. Quicksort requires recursion, and its control structure is quite difficult to implement on the GPU.
The topic of week 4's lectures and assignment was Sort and Scan. These are slightly more complex due to the many-to-many, all-to-all communication patterns.
Part 1 focused on Scan.
Important properties of Scan, work complexity: O(n), step complexity: O(log n).
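As a reminder of what a sum scan actually computes, here is a minimal serial reference (illustration only; the parallel versions are what achieve the step complexity above):

// input:            [3, 1, 7, 0, 4]
// inclusive scan -> [3, 4, 11, 11, 15]   (element i includes input[i])
// exclusive scan -> [0, 3, 4, 11, 11]    (element i is the sum of everything before i)
void exclusiveSumScan(const int *in, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; ++i) {
        out[i] = running;
        running += in[i];
    }
}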
Variations on the Scan algorithm:
Compact/Filter – gathering the subset of elements that meet a certain criterion (input, predicate, output), i.e. only keep the input elements where the predicate is true. Outputs can be dense or sparse. Sparse is easy, as elements map to the same locations as the input. Dense results in a contiguous array of the filtered elements (much faster for subsequent reading and writing).
Why is using compact more efficient?
In the card deck example where we only want the diamonds as output, using compact will be much faster whenever the computecard() function has more than a minimal computational cost. So compact is very useful when the number of elements filtered out is large and the computation on each surviving element is high.
Summary of how to Compact:
Predicate each input element to produce a scan-in array of 1s and 0s
Run an exclusive sum scan over that array (this produces the scatter addresses for the compacted array)
Scatter each input element into the output array using its scatter address
The predicate and scan parts of Compact will have the same run time regardless of the number of elements filtered. If very few items are to be written to the output then the scatter step will be faster.
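A minimal sketch of the three compact steps as CUDA kernels, assuming a hypothetical "keep the even numbers" predicate; the exclusive sum scan in the middle could be any scan implementation (e.g. thrust::exclusive_scan):

__global__ void mapPredicate(const int *d_in, int *d_flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_flags[i] = (d_in[i] % 2 == 0) ? 1 : 0;     // 1 if the element survives the filter
}

// ...exclusive sum scan d_flags into d_addr (the scatter addresses)...

__global__ void scatterSurvivors(const int *d_in, const int *d_flags,
                                 const int *d_addr, int *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && d_flags[i])
        d_out[d_addr[i]] = d_in[i];                         // dense, contiguous output
}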
A brief side note in the lecture introduced Allocate, a generalised version of Compact which can be used for clipping.
Next came Segmented Scan: we want to keep our kernels busy and independent. Segmented scan uses a second array to keep track of segment heads, which allows independent computation of scan values that can be reconciled when there is less work to do.
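A serial reference for a segmented inclusive sum scan using a head-flag array (flags[i] == 1 marks the start of a segment); this is only meant to pin down the semantics, not the parallel implementation:

void segmentedInclusiveSumScan(const int *in, const int *flags, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) running = 0;   // reset the running sum at each segment head
        running += in[i];
        out[i] = running;
    }
}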
Sparse Matrix/Dense Vector multiplication [SpMv] was a practical application of the techniques we had just learned. SpMv is a good practical example as sparse matrices are common in many problem domains (e.g. page ranking by search engines).
The traditional way to represent a sparse matrix is Compressed Sparse Row [CSR] – Value (the non-zero data element values), Column (the column each element originated from), RowPtr (the element index that starts each new row).
Nice and simple example of CSR calculation
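For example, a hypothetical 3x3 sparse matrix and its CSR arrays might look like this (many implementations also append the total number of non-zeros to RowPtr so each row's extent is easy to read off):

//     | a 0 b |
//     | c d 0 |
//     | 0 0 e |
// Value  = [a, b, c, d, e]       // the non-zero values, stored row by row
// Column = [0, 2, 0, 1, 2]       // the column each value came from
// RowPtr = [0, 2, 4]             // index into Value where each row starts
//                                // (with the appended total: [0, 2, 4, 5])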
Now that we have the CSR representation of the sparse matrix we can conduct the multiplication (a simple kernel sketch follows the steps below):
Segment the Value array using the RowPtr values as demarcation points
Gather the vector values for each element in the array using the column array
Use a Map function to multiply the value array with the gathered vector array.
Conduct an exclusive segmented sum scan to get the final answer
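A simple row-per-thread CSR SpMV kernel as a sketch; note this is the straightforward formulation rather than the segmented-scan one described above, and it assumes RowPtr carries the appended total so row r spans rowPtr[r] to rowPtr[r+1]-1:

__global__ void spmvCsr(const int *rowPtr, const int *col, const float *val,
                        const float *x, float *y, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[col[j]];   // gather the vector value for each stored element, then multiply
        y[row] = sum;                    // the per-row sum is this row's entry of the result
    }
}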
Unit 3 worked on analysing the performance of CUDA programs, along with some key algorithms: Reduce, Scan and Histogram. Parallelism is easiest with one-to-one or many-to-one communication patterns; week 3 looked into how we can still get great performance gains from parallelism when we need all-to-all and many-to-many communication patterns between threads. Even when ideal scaling is not possible for an algorithm, parallelism is still likely to result in more efficient computing.
Step and work complexity are the first steps to establishing the truthfulness of this claim.
A parallel algorithm is considered 'work efficient' if its work complexity is asymptotically the same as that of its serial counterpart.
Comparing the step and work complexity of parallel algorithms is a good start to understanding their efficiency.
If we can reduce the step complexity of our parallel implementation compared to the serial implementation, whilst maintaining work efficiency, we will have a faster execution.
Reduce has 2 inputs:
Set of elements
Reduction operation (ie: +)
Reduction operators must be binary, operate on two inputs and generate one output.
Reduction operators must be associative: (a op b) op c == a op (b op c), so the order in which pairs are combined does not change the result.
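A minimal shared-memory sum reduction over one block (a sketch; a complete reduce launches this repeatedly, or combines the per-block partial sums in a second pass). The tree-shaped combination is only valid because + is a binary, associative operator:

__global__ void reduceSum(const float *d_in, float *d_out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? d_in[i] : 0.0f;          // load one element per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction: O(log n) steps per block
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) d_out[blockIdx.x] = sdata[0];     // one partial sum per block
}

// launch (sketch): reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);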
After learning the GPU programming model and writing a basic CUDA program, week 2 introduced some concepts for efficient algorithms.
Communication is the first issue: how do threads communicate efficiently? This is easier in some problems (e.g. Map) than in others (e.g. Scan/Sort). The memory access and write patterns of some of the key algorithm types were discussed. Input-to-output relationships:
The basic problem of concurrent memory access was illustrated via a scatter example. With each input trying to write a third of its value to its neighbouring elements, we can see that trouble will result from independent thread execution.
one-to-many memory writes will pose a problem with independent threads writing over the same memory locations
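A sketch of that scatter example; one way (of several) to make the concurrent writes safe is to use atomic adds, so the read-modify-write on each shared output location cannot be interleaved:

__global__ void scatterThird(const float *d_in, float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float share = d_in[i] / 3.0f;
        atomicAdd(&d_out[i - 1], share);   // several threads may target the same element,
        atomicAdd(&d_out[i],     share);   // so a plain "+=" here would lose updates
        atomicAdd(&d_out[i + 1], share);
    }
}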
To overcome these hazards threads must communicate in some way. Shared memory and synchronisation points were described as tools for this job.
Some important information about CUDA and what it guarantees (and doesn’t guarantee) about thread execution was also touched on:
No guarantees about when and where thread blocks will run – this is an enabler for massive parallelism. There are also some limitations because of this: no assumptions can be made about which blocks will run on which SM, and there can be no direct communication between blocks.
Does guarantee – all threads in a block are guaranteed to run on the same SM at the same time.
Does guarantee – all blocks in a kernel finish before blocks from the next kernel run
Introduction to Parallel Programming is the second MOOC course that I signed up for. The emergence of parallel and distributed computing is not slowing down, and it seems that most developers are not accustomed to the very different train of thought that parallelism requires. Recent GPUs have on the order of a thousand simple compute units, each of which can run parallel threads. The general course overview by the course instructor, John Owens:
The first week started off pretty simple focussing on why GPUs, parallel programming and CUDA. I found the pace of the videos just right and much more engaging than other courses I have looked at.
The basics of CUDA:
CUDA programs are controlled by the host CPU and its memory, and the CUDA libraries enable interaction with the GPU(s).
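A minimal, self-contained example of that host/device split (my own illustration): the host allocates GPU memory, copies data across, launches a kernel, and copies the result back:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void square(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = d_data[i] * d_data[i];
}

int main()
{
    const int n = 64;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));                              // allocate device memory
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    square<<<1, n>>>(d_data, n);                                         // launch 1 block of n threads

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[5] = %f\n", h_data[5]);                               // prints 25.0
    return 0;
}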
The final week of the course brought together all of the previous concepts and capabilities to leverage lazy evaluation.
Structural induction on trees can be conducted in a number of ways. They are all bound by logic, and I was somewhat confused by the level of detail at which this topic was covered, perhaps because this concept is not readily available in imperative languages.
Streams were the next topic, and one that was used extensively in the week 7 assignment.
((1000 to 10000) filter isPrime)(1)
The expression above is very inefficient as it finds all prime numbers between 1000 and 10,000, whilst only elements 0 and 1 of the result are ever needed.
A good solution that avoids evaluating all of the numbers from 1000 to 10,000 is the .toStream function.
((1000 to 10000).toStream filter isPrime)(1)
Example of a stream implementation in Scala
Lazy evaluation was demonstrated next; its importance parallels that of streams.
Week 5 – Lists investigated a 'fundamental' data structure in functional programming. The recursive nature of lists in Scala makes them quite different from arrays. This, tied in with the power of pattern matching and the recursive nature of functional programming, gives credibility to the 'fundamental' label.
Pattern matching on lists is a powerful tool
Sorting of lists was covered in good detail and in addition to list1.head and list1.tail more functions were revealed: