GPU processing with pycuda

Cuda is Nvidia software that allows programming of the the graphics unit on an nvidia card. It has been mentioned in several other posts on this forum, especially for impressive rendering times using Vray.

Firstly: a warning if you don’t know how to fix a broken linux graphic system then you should probably not try this. That said if you are running ubuntu jaunty you should be ok.

Secondly: Nvidia and others are developing a standard called opencl. This will become the standard but it appears from this mail message that it is still some way off[email protected]/msg00582.html, however, opencl and cuda are very similar and the python bindings are being developed by the same person so they may look rather similar in the end.

Instructions below are for Ubuntu jaunty64, jaunty32 should be similar, make changes where necessary.

The latest nvidia graphics drivers are required, fortunately there is a build for ubuntu jaunty on

All that is required is the Cuda toolkit which can be downloaded from here:

Install it:

sudo sh

Append this to bashrc

export PATH=/usr/local/cuda/bin/;$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/;$LD_LIBRARY_PATH

The documentation for cuda, including pdf files is at:


Notes on installation are here:
and for ubuntu here:

Some other software is required including numpy

sudo apt-get install numpy, git-core
sudo apt-get install build-essential python-dev python-setuptools libboost-python1.37-dev
git clone

cd pycuda
# the following all one line
./ --cuda-root=/usr/local/cuda --cudadrv-lib-dir=/usr/lib/ 
    --boost-inc-dir=/usr/include/ --boost-lib-dir=/usr/lib/ 
    --boost-python-libname=boost_python-mt --boost-thread-libname=boost_thread-mt
make -j 4
sudo python install

The example ran ok on my system

Another example supplied with pycuda does matrix multiplication (check the downloaded files). I added some timing to the code and increased the matrix size from 2 to 20. My GPU can run 512 threads, so 22x22 matrix multiplication is the upper bound. You can find how many threads your gfx can run by executing this code:

from import DeviceData
import pycuda.autoinit

dd = DeviceData()
print dd.max_threads

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Multiples two square matrices together using a *single* block of threads and 
global memory only. Each thread computes one element of the resulting matrix.

import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time

# -- initialize the device
import pycuda.autoinit

kernel_code_template = """
__global__ void MatrixMulKernel(float *a, float *b, float *c)
    // 2D Thread ID (assuming that only *one* block will be executed)
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    // Each thread loads one row of M and one column of N, 
    //   to produce one element of P.
    for (int k = 0; k < %(MATRIX_SIZE)s; ++k) {
        float Aelement = a[ty * %(MATRIX_SIZE)s + k];
        float Belement = b[k * %(MATRIX_SIZE)s + tx];
        Pvalue += Aelement * Belement;

    // Write the matrix to device memory;
    // each thread writes one element
    c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;

# define the (square) matrix size
#  note that we'll only use *one* block of threads here
#  as a consequence this number (squared) can't exceed max_threads,
#  see
#  for more information on how to get this number for your device

# create two random square matrices
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)

# compute reference on the CPU to verify GPU computation
t = time.time()
c_cpu =, b_cpu)
print time.time() - t

# transfer host (CPU) memory to device (GPU) memory 
a_gpu = gpuarray.to_gpu(a_cpu) 
b_gpu = gpuarray.to_gpu(b_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

# get the kernel code from the template 
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {

# compile the kernel code 
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")

# call the kernel on the card
t = time.time()
matrixmul(a_gpu, b_gpu, c_gpu, block = (MATRIX_SIZE, MATRIX_SIZE, 1))
print time.time()-t

"""# print the results
print "-" * 80
print "Matrix A (GPU):"
print a_gpu.get()

print "-" * 80
print "Matrix B (GPU):"
print b_gpu.get()

print "-" * 80
print "Matrix C (GPU):"
print c_gpu.get()
print "-" * 80
print "CPU-GPU difference:"
print (c_cpu - c_gpu.get())[1]

np.allclose(c_cpu, c_gpu.get())

The results:
My system specs:
AMD Athlon™ 64 X2 Dual Core Processor 4800+
GeForce 9600 GT
The GPU calcuation was about 0.00017 s, but the CPU calculation was about 0.00003 sec. Adding some looping gave CPU results about 4x faster then GPU. So for this simple example CPU is faster. At this moment, I am not sure about the significance of this result in terms of execution time for something more significant.

There is a tutorial/lecture on pycuda here:



OpenCL are multiplatform ATI/NVIDIA/Apple-Snow Leopard


Very awesome podcast:

OpenCL Tutorial - Introduction to OpenCL
OpenCL Tutorial - OpenCL Fundamentals