CUDA Sorting

Pyculib provides routines for sorting arrays on CUDA GPUs.

Sorting Large Arrays

The pyculib.sorting.RadixSort class is recommended for sorting large (approx. more than 1 million items) arrays of numeric types.

class pyculib.sorting.RadixSort(maxcount, dtype, descending=False, stream=0)

Provides radix sort and radix select.

The algorithm implemented here is best for large arrays (N > 1e6) due to the latency introduced by its use of multiple kernel launches. It is recommended to use segmented_sort instead for batches of smaller arrays.

Parameters:
  • maxcount (int) – Maximum number of items to sort
  • dtype (numpy.dtype) – The element type to sort
  • descending (bool) – Sort in descending order?
  • stream – The CUDA stream to run the kernels in
argselect(k, keys, begin_bit=0, end_bit=None)

Similar to RadixSort.select but returns the new sorted indices.

Parameters:
  • keys (numpy.ndarray) – Keys to sort inplace
  • begin_bit (int) – The first bit to sort
  • end_bit (int) – Optional. The last bit to sort
Returns:

The indices indicating the new ordering as an array on the CUDA device or on the host.

argsort(keys, begin_bit=0, end_bit=None)

Similar to RadixSort.sort but returns the new sorted indices.

Parameters:
  • keys (numpy.ndarray) – Keys to sort inplace
  • begin_bit (int) – The first bit to sort
  • end_bit (int) – Optional. The last bit to sort
Returns:

The indices indicating the new ordering as an array on the CUDA device or on the host.

close()

Explicitly release internal resources

Called automatically when the object is deleted.

init_arg(size)

Initialize an empty CUDA ndarray of uint32 with ascending integers starting from zero

Parameters:size (int) – Number of elements for the output array
Returns:An array with values [0, 1, 2, ...m size - 1 ]
select(k, keys, vals=None, begin_bit=0, end_bit=None)

Perform a inplace k-select on keys.

Memory transfer is performed automatically.

Parameters:
  • keys (numpy.ndarray) – Keys to sort inplace
  • vals (numpy.ndarray) – Optional. Additional values to be reordered along the sort. It is modified in place. Only the uint32 dtype is supported in this version.
  • begin_bit (int) – The first bit to sort
  • end_bit (int) – Optional. The last bit to sort
sort(keys, vals=None, begin_bit=0, end_bit=None)

Perform a inplace sort on keys. Memory transfer is performed automatically.

Parameters:
  • keys (numpy.ndarray) – Keys to sort inplace
  • vals (numpy.ndarray) – Optional. Additional values to be reordered along the sort. It is modified in place. Only the uint32 dtype is supported in this version.
  • begin_bit (int) – The first bit to sort
  • end_bit (int) – Optional. The last bit to sort

Sorting Many Small Arrays

Using pyculib.sorting.RadixSort on small (approx. less than 1 million items) arrays has significant overhead due to multiple kernel launches.

A better alternative is to use pyculib.sorting.segmented_sort()-which launches a single kernel for sorting a batch of many small arrays.

pyculib.sorting.segmented_sort(keys, vals, segments, stream=0)

Performs an inplace sort on small segments (N < 1e6).

Parameters:
  • keys (numpy.ndarray) – Keys to sort inplace.
  • vals (numpy.ndarray) – Values to be reordered inplace along the sort. Only the uint32 dtype is supported in this implementation.
  • segments (numpy.ndarray) – Segment separation location. e.g. array([3, 6, 8]) for segments of keys[:3], keys[3:6], keys[6:8], keys[8:].
  • stream – Optional. A cuda stream in which the kernels are executed.