Some CUDA programming fun in Julia

Suppose we want to draw a batch of images, where each image is made up of randomly positioned and colored triangles, that are blending. It will look like this:

triangles

and then find the euclidean distance of each such image to a given target image.

Now why on earth am I doing this? Well, this turns into an interesting optimization problem of finding the closest triangle image and also an excuse to practise Julia. The inspiration is from the this repo based on EvoJax.

Towards framing this as an optimization problem, we will represent a triangle as a vector of size 10, made up of floating point numbers between 0 and 1. Four numbers of this vector are for the color of the triangle; r,g,b and alpha, and for the three vertices of the triangle we need 6 numbers, (x1,y1), (x2,y2), (x3,y3). Hence, if we want to draw M images, each image having N triangles, we need a matrix of size (10,N,M), which will be our parameters matrix. I want to randomly create such a matrix and min-max scale it along the triangle dimension, by which I mean, for each image, I first find the minimum and maximum of a triangle parameter among the N triangles, and then subtract from the parameter this minimum and then divide the result by the difference between the maximum and the minimum. I want to end up with an array of size (3,w,h,M) for the images, where w is width and h is height, and an array of size M for the distances. Let’s see how fast we can do this.

First order of business is setting this up, note that I am scaling the numbers to the given width and height:

using Images
using Statistics
using Cairo
using BenchmarkTools
using Random: seed!

function prepare()
    seed!(123)
    w,h=128,128
    num_params=10
    num_triangles=50
    num_images=256
    target = rand(Float32, 3, h, w)
    prms = rand(Float32, num_params, num_triangles, num_images);
    prms .= (prms .- minimum(prms, dims=2)) ./ (maximum(prms, dims=2) .- minimum(prms, dims=2))   
    prms[collect(1:2:6),:,:] .*= w
    prms[collect(2:2:6),:,:] .*= h         
    return prms, target, num_images, num_triangles, w, h
end

With a 2d library

The first thing that comes to mind is to use a 2d graphics library for drawing, and since the Cairo lib is available, let’s try that. The function below is drawing the triangles on a blank white Cairo canvas, and copying it to the img array at the end:

@views function renderCairo(img, prms, num_triangles, w, h)                  
    # blank white canvas
    buffer = ones(UInt32, w, h) * typemax(UInt32)    
    c = Cairo.CairoImageSurface(buffer, Cairo.FORMAT_ARGB32, flipxy=false)
    cr = CairoContext(c);        
    
    for i in 1:num_triangles
        q = prms[:,i]
        set_source_rgba(cr,q[7], q[8], q[9], q[10])        
        move_to(cr, q[1],q[2]);
        line_to(cr, q[3],q[4]);
        line_to(cr, q[5],q[6]);
        close_path(cr);
        fill(cr);            
    end        
            
    resultimg = reinterpret(ARGB32, permutedims(c.data, (2, 1)))
    resultchn = Float32.(channelview(RGB.(resultimg)))                
    img .= resultchn
    Cairo.finish(c)
    Cairo.destroy(c)        
end

Now let’s draw each image in this fashion:

@views function withCairo() 
    prms, target, num_images, num_triangles, w, h = prepare() 

    imgs = Array{Float32}(undef, 3, h, w, num_images)    
    for i in 1:num_images
        img = imgs[:,:,:,i]        
        renderCairo(img, prms[:,:,i], num_triangles, w, h)        
    end
    dists = reshape(mean((imgs .- target) .^2, dims=(1,2,3)), num_images)
    return imgs, 1 .- dists
end

Benchmarking this with @btime withCairo();

I see 428.101 ms (3157 allocations: 205.09 MiB).

The cross product method

Now the cool part. Move your mouse inside and outside of the triangle below. You will see a bar chart, depicting the magnitude and direction of the 3rd components of the cross products, AB with AP (reds), BC with BP (greens) and CA with CP (blues). Observe that all those bars point to the same direction ONLY inside the triangle!

Whats great about this is that, cross products are just multiplications and subtractions, perfect job to parallelize with a GPU.

So, what needs to be done is clear. For each of the M images, and for each of the N triangles, our operation is to update a pixel color to blend with the current triangle’s color, if that pixel is inside the triangle. We will parallelize this operation with a CUDA kernel, shown below:

using CUDA

function puttri(prms, imgs, tri, ins)    
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x  
    idy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    
    abx = prms[3,tri,ins] - prms[1,tri,ins]
    aby = prms[4,tri,ins] - prms[2,tri,ins]
    apx = idx - prms[1,tri,ins]
    apy = idy - prms[2,tri,ins]
    cr1 = apx * aby - apy * abx

    bcx = prms[5,tri,ins] - prms[3,tri,ins]
    bcy = prms[6,tri,ins] - prms[4,tri,ins]
    bpx = idx - prms[3,tri,ins]
    bpy = idy - prms[4,tri,ins]
    cr2 = bpx * bcy - bpy * bcx

    cax = prms[1,tri,ins] - prms[5,tri,ins]
    cay = prms[2,tri,ins] - prms[6,tri,ins]
    cpx = idx - prms[5,tri,ins]
    cpy = idy - prms[6,tri,ins]
    cr3 = cpx * cay - cpy * cax

    if ((cr1>=0) & (cr2>=0) & (cr3>=0)) | ((cr1<=0) & (cr2<=0) & (cr3<=0))
        oneMinusAlpha = (1.0f0-prms[10,tri,ins])        
        imgs[1,idx,idy,ins] = imgs[1,idx,idy,ins] * oneMinusAlpha + prms[7,tri,ins] * prms[10,tri,ins]
        imgs[2,idx,idy,ins] = imgs[2,idx,idy,ins] * oneMinusAlpha + prms[8,tri,ins] * prms[10,tri,ins]
        imgs[3,idx,idy,ins] = imgs[3,idx,idy,ins] * oneMinusAlpha + prms[9,tri,ins] * prms[10,tri,ins]
    end
    return
end

We will need to pass our parameters and target array to the GPU, and then call the kernel with @cuda. We can create white canvases with CUDA.ones here, so no need to pass it.

function withGpu()
    prms, target, num_images, num_triangles, w, h = prepare()     

    prms = CuArray(prms)    
    imgs = CUDA.ones(3, h, w, num_images)
    target = CuArray(target)
    for tri in 1:num_triangles
        for i in 1:num_images
            @cuda threads=(32,32) blocks=(8,8) puttri(prms, imgs, tri, i)        
        end                                
    end
    gpufitnesses = 1.0f0 .- reshape(mean((imgs .- target) .^ 2, dims=(1,2,3)),num_images)
    return Array(imgs), Array(gpufitnesses)
end

Benchmarking this I see 120.315 ms (38689 allocations: 52.53 MiB)

That’s about 4x speedup, not really impressive, but perhaps not too bad considering I have an old GPU. Note that I benchmarked with --check-bounds=no, which is a startup option that you pass to Julia, when launching, that disables the performance killer “bounds checking”.

In the next post, I will talk about the very cool and general PGPE algorithm used in EvoJax to steer these images towards a target image. You can see one example of this here.

Please let me know if you have any comments, suggestions, improvements.