Vulkan Render-Queues and how they Sync

This summer I embarked on a big rewrite of the foundations of Island, the Vulkan renderer I’ve been working on. This touched the big, load-bearing synchronisation parts, which was quite painful. All this to shave a yak of epic proportions, really: eventually I’d like to implement Vulkan Video, and for this to work, I will need a way to decode frames on a special video-decode queue, which cannot be the main draw-compute-transfer versatile queue.

This means that for Vulkan Video to work, we need to be able to run multiple queues, and for this to work, resources must be able to change queue ownership. That doesn’t sound that bad at first, but changing queue family ownership in Vulkan is a ceremony almost matching the intricacy of a baroque courtship ritual - it requires both queues from different families to perform a precise, polite little dance of release & acquire - and if, in this sordid sarabande, a step gets missed or misplaced, oops, that’s it: your program deadlocks and the wedding is off.

There is, of course, a way to make queue ownership transfers the responsibility of the driver: declare every resource to be VK_SHARING_MODE_CONCURRENT instead of VK_SHARING_MODE_EXCLUSIVE. But how much fun would that be? Also, it is said that this is rather bad for performance (I would assume that queue ownership is then internally transferred lazily at the last possible moment, and that this could cause bubbles). And, heck, we’re using a rendergraph, which means the renderer should get our intentions telegraphed far enough in advance to make the right decisions…

Why Multiple Queues? A Chance for Extra Parallelism 

So here’s the goal: we’d like a system that scales gracefully, whether there is one queue or sixteen - a system that can flexibly distribute renderpasses onto available queues, that adapts to the number of available queues and to their capabilities, and that can automatically transfer resources between queues as needed.

On the bonus side, if this works, we get a new way of exploiting GPU parallelism - because independent Vulkan queues may execute in parallel. This means that for example we could run a compute pass at the same time as we run an independent graphics pass, which may give us better GPU utilisation.

Return of the Rendergraph 

Now, where to start? Probably where I left off with the previous post: once Rendergraph assembly is complete.

Fig. 1:
A rendergraph with three independent subgraphs - the thick underlined pass names symbolize root passes, dark filled triangles before resource names ▼ indicate write, unfilled triangles △ indicate read

In Island, once the Rendergraph has been assembled, the backend receives a list of independent subgraphs, where each subgraph guarantees that it will only access its own resources - or, if resources are shared, that these resources are strictly READ_ONLY within the same queue family, for the full frame.

When the renderer forms subgraphs out of passes, the GPU queue requirements for all passes are accumulated: Say, you have a subgraph that consists of two passes, one COMPUTE and a DRAW pass, and because they use the same resource, they have been combined into a single subgraph. Now, this subgraph will have the requirements COMPUTE|DRAW, and will only run on a queue that supports hybrid COMPUTE and DRAW operations. This is pretty nice, because by using a hybrid queue, we can save ourselves resource ownership transfer.
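The accumulation of requirements can be sketched in a few lines. This is a minimal illustration and not Island's actual code: the capability bits, `Pass`, `accumulate_requirements` and `queue_supports` are names made up for this example, and the enum merely stands in for Vulkan's queue flag bits so that the sketch builds without Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for Vulkan's queue capability bits (hypothetical names).
enum QueueCaps : uint32_t {
    CAP_DRAW     = 1u << 0, // graphics work
    CAP_COMPUTE  = 1u << 1,
    CAP_TRANSFER = 1u << 2,
};

struct Pass {
    uint32_t required_caps; // what this single pass needs from a queue
};

// A subgraph needs the union of everything its passes need.
inline uint32_t accumulate_requirements( std::vector<Pass> const& passes ) {
    uint32_t required = 0;
    for ( Pass const& p : passes ) {
        required |= p.required_caps;
    }
    return required;
}

// A queue qualifies only if it offers every required capability bit.
inline bool queue_supports( uint32_t queue_caps, uint32_t required ) {
    assert( required != 0 && "a subgraph should require at least one capability" );
    return ( queue_caps & required ) == required;
}
```

A subgraph made of one COMPUTE and one DRAW pass ends up with both bits set, so a compute-only queue is rejected while a hybrid queue is accepted.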

Another nice consequence of this is that we can treat subgraphs as completely isolated from each other. And since these subgraphs are perfectly isolated from each other, the backend can translate each subgraph into a render command batch, and submit each batch on whichever matching queue is available and the most idle.
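To make "matching and most idle" a bit more concrete, here is a small sketch of one possible selection strategy. It is an assumption for illustration only: `Queue`, `pick_queue` and `distribute` are hypothetical names, and counting pending batches is just one crude way to estimate how busy a queue is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct Queue {
    uint32_t caps;            // capability bit mask of this queue
    size_t   pending_batches; // crude "busyness" measure (an assumption)
};

// Returns the index of the least busy queue that offers all required
// capabilities, or std::nullopt if no queue matches.
inline std::optional<size_t> pick_queue( std::vector<Queue> const& queues, uint32_t required ) {
    std::optional<size_t> best;
    for ( size_t i = 0; i != queues.size(); ++i ) {
        if ( ( queues[ i ].caps & required ) != required ) continue;
        if ( !best || queues[ i ].pending_batches < queues[ *best ].pending_batches ) best = i;
    }
    return best;
}

// Assigns each subgraph (given by its accumulated capability mask) to a queue.
inline std::vector<size_t> distribute( std::vector<Queue>& queues, std::vector<uint32_t> const& subgraph_caps ) {
    std::vector<size_t> assignment;
    for ( uint32_t required : subgraph_caps ) {
        std::optional<size_t> q = pick_queue( queues, required );
        assert( q.has_value() && "no queue supports this subgraph" );
        queues[ *q ].pending_batches++;
        assignment.push_back( *q );
    }
    return assignment;
}
```

With one hybrid queue and two compute queues, three subgraphs spread out over all three queues. With only a single hybrid queue, the same call places every subgraph on that queue.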

(Not so) Hidden Complexity 

Of course, using multiple queues comes at a complexity cost: now we have the additional responsibility that all GPU queues that we bring to the party must synchronise with each other. How do they do that? Well, by talking to each other, for a start.

Semaphores are how queues talk to each other in Vulkan. In a queue submission, you name the semaphores that the queue should wait for, the command batch to process, and then the semaphores to signal once this submission has been processed:

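// wait_infos, cmd_infos and signal_infos are placeholder arrays, named here for illustration only
VkSubmitInfo2 submitInfo{
    .sType                    = VK_STRUCTURE_TYPE_SUBMIT_INFO_2,
    .waitSemaphoreInfoCount   = uint32_t( wait_infos.size() ),   // number of semaphores to wait for
    .pWaitSemaphoreInfos      = wait_infos.data(),               // semaphores to wait for before processing batch
    .commandBufferInfoCount   = uint32_t( cmd_infos.size() ),    // number of command buffers for this batch
    .pCommandBufferInfos      = cmd_infos.data(),                // array of command buffers forming this batch
    .signalSemaphoreInfoCount = uint32_t( signal_infos.size() ), // number of semaphores to signal
    .pSignalSemaphoreInfos    = signal_infos.data(),             // semaphores to signal once batch has been processed
};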

If Semaphores are how Vulkan GPU queues talk to each other, then Timeline Semaphores are how modern, more relaxed queues talk to each other. Timeline Semaphores can nowadays be used in place of the more clunky Binary Semaphores, and they are nicer to work with, because they are more tolerant: not every signal event on a Timeline Semaphore must be answered. Binary Semaphores, by contrast, require every signal event to have a matching wait event. Timeline Semaphores are more forgiving: it’s enough if you guarantee that the value signalled by the semaphore is strictly increasing - and then you can wait on the highest signalled value that makes sense for a specific point in the execution timeline. You could also wait on more than one queue for the same Timeline Semaphore value.
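The difference between the two kinds is perhaps easiest to see in a toy model. The sketch below contains no Vulkan calls and does not block like a real semaphore would; `BinarySemaphoreModel` and `TimelineSemaphoreModel` are hypothetical names that only capture the bookkeeping rules described above.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a binary semaphore: one signal can satisfy exactly one wait.
struct BinarySemaphoreModel {
    bool signalled = false;

    void signal() {
        assert( !signalled && "previous signal has not been waited on yet" );
        signalled = true;
    }

    // Returns true if the wait is satisfied; a satisfied wait consumes the signal.
    bool try_wait() {
        if ( !signalled ) return false;
        signalled = false;
        return true;
    }
};

// Toy model of a timeline semaphore: a counter that may only grow.
struct TimelineSemaphoreModel {
    uint64_t value = 0;

    void signal( uint64_t new_value ) {
        assert( new_value > value && "timeline values must strictly increase" );
        value = new_value;
    }

    // Any number of waiters may test against any target; waiting consumes nothing.
    bool is_reached( uint64_t target ) const { return value >= target; }
};
```

In this model a second wait on the binary semaphore fails until someone signals it again, whereas the timeline semaphore happily answers the same question for as many waiters as you like, including waiters that skipped intermediate values.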

I find Timeline Semaphores so useful, that in Island, each queue automatically gets its own exclusive Timeline Semaphore, whether it uses it or not.

I hope that this short paean to Timeline Semaphores has you convinced that Timeline Semaphores are the future, and that we therefore would like to use Timeline Semaphores for every & all our realtime-graphics needs. Sounds good? Well, unfortunately, at the time of writing, we can’t use Timeline Semaphores for swapchain operations, which is the place where we absolutely must use semaphores. That’s sad, but there’s a way around: we can still use a hybrid of Binary and Timeline Semaphores.

Extending Island’s Single Queue System 

Currently Island uses two Binary Semaphores per frame in order to synchronise with the WSI API (that’s the swapchain system): PRESENT_COMPLETE, and RENDER_COMPLETE.

We therefore need to wait for PRESENT_COMPLETE before we begin rendering, and we signal that we’re done with rendering by signalling RENDER_COMPLETE. In a single-queue renderer, we can submit all the render commands in one big batch, and synchronise this single submission by first waiting on PRESENT_COMPLETE, then processing the command batch, and then signalling RENDER_COMPLETE on completion.

Simple single Queue Submission
Fig. 2:
A simple, single queue submission: wait for PRESENT_COMPLETE, process a renderbatch, signal RENDER_COMPLETE

In the hypothetical case that we wanted to submit two render batches, this becomes a bit more complicated, because we can’t wait more than once for the same Binary Semaphore unless we signal it again. Fortunately, we don’t need to: we can just split the workload into two, and wait on PRESENT_COMPLETE on the first submission, and then signal RENDER_COMPLETE on the second submission. Because submissions on the same queue get executed in submission order, things still happen in the correct sequence.

Two submissions to single queue
Fig. 3:
The first submission waits for PRESENT_COMPLETE then processes its renderbatch, while the second submission waits for the first, processes its batch, and then signals RENDER_COMPLETE

Multi-Queue Rendering 

Now let’s say, we allow the user to choose (via a shopping list of queue capabilities) which kind and how many queues they wish to use for their Island application.

VkQueueFlags queue_capabilities[ 3 ] = {
    // one VkQueueFlags entry per requested queue
};
le_backend_vk::settings_i.set_requested_queue_capabilities( queue_capabilities, 3 );

Note that it is not guaranteed that the requested queues are always available: this is hardware dependent, which is why everything we do must have a graceful fallback. More on this later. But let’s say we have more than one GPU queue available, and we have multiple independent subgraphs assembled into Vulkan command batches, then we could submit these to run in parallel.

Of course, we would still have to enforce that whoever writes to the backbuffer needs to wait for the backbuffer resource to be available, signalled via PRESENT_COMPLETE - and we would also need to enforce that RENDER_COMPLETE only gets signalled, once all renderbatches have been submitted and processed. How can we enforce this?

We will use a trick: since operations on a queue happen in submission order, we can extend the previous split submission over multiple queues by adding some Timeline Semaphores:

On queue_0 (a graphics/compute queue) we first wait for PRESENT_COMPLETE, then execute the command batch for queue_0, then signal the Timeline Semaphore for queue_0.

Starting at the same time, queue_1 (a compute-only queue) doesn’t need to wait for anyone, it executes its command batch, and then signals its Timeline Semaphore for queue_1.

Back on queue_0, we wait for all Timeline Semaphores, then execute zero commands, and immediately signal RENDER_COMPLETE. The last queue submission on queue_0 is only there for synchronisation - it harvests all Timeline Semaphores across all queues, and then, on the main queue, signals RENDER_COMPLETE.

Submissions to two queues
Fig. 4: Two queues in sync - note how queue_1 may start processing even before queue_0, as it does not have to wait for anyone.

This then is the first piece of the puzzle: this will help us synchronize parallel queue submissions while still keeping up the pretense to the swapchain system that nothing has changed.

It will also gracefully fall back in case we only have one queue available: regardless of whether the first or the second sub-graph gets processed first, the sub-graph which draws to the backbuffer image will wait for that image to be available, and at the end, once both sub-graphs have been processed, we signal RENDER_COMPLETE. There’s no chance of a deadlock.
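The submission pattern described above can be written down as plain data. The following sketch is a simplified model and not Island's implementation: it assumes exactly one render batch per queue, that queue 0 is the main queue and draws to the backbuffer, and it uses made-up names (`Submission`, `timeline_name`, `build_frame_submissions`) with semaphores represented as strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Submission {
    size_t                   queue_index;
    std::vector<std::string> wait_on;      // semaphores to wait for
    bool                     has_commands; // false for sync-only submissions
    std::vector<std::string> signal;       // semaphores to signal when done
};

// Hypothetical helper: name of the timeline semaphore owned by queue i.
inline std::string timeline_name( size_t i ) { return "TIMELINE_" + std::to_string( i ); }

// Builds the per-frame submission list, one render batch per queue.
inline std::vector<Submission> build_frame_submissions( size_t num_queues ) {
    assert( num_queues > 0 && "need at least one queue" );
    std::vector<Submission>  submissions;
    std::vector<std::string> all_timelines;
    for ( size_t q = 0; q != num_queues; ++q ) {
        Submission s{ q, {}, true, { timeline_name( q ) } };
        // Only the queue that draws to the backbuffer waits for the swapchain image.
        if ( q == 0 ) s.wait_on.push_back( "PRESENT_COMPLETE" );
        all_timelines.push_back( timeline_name( q ) );
        submissions.push_back( s );
    }
    // Sync-only submission on the main queue: harvest every timeline
    // semaphore, execute nothing, then signal RENDER_COMPLETE.
    submissions.push_back( Submission{ 0, all_timelines, false, { "RENDER_COMPLETE" } } );
    return submissions;
}
```

For two queues this yields three submissions, matching the three steps above. For a single queue it yields two, which is the split submission from the single-queue case with one Timeline Semaphore in between.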

Resources: Best Kept within the Family 

Now that we have found a way to synchronise multiple queues, we need to look into how we will transfer queue family ownership for resources.

Note that there is a difference between queue ownership and queue family ownership. Two queues may merrily read from the same resource - as long as they are from the same family. If the two queues, however, are from different queue families, then this is a different story, as a resource that has been declared VK_SHARING_MODE_EXCLUSIVE can only belong to one family at a time - not even shared read-only access is allowed - so we must make sure that queue family ownership is transferred before a new family accesses the resource.

A resource transfer consists of two matching operations: release and acquire. It must follow this procedure: First, the currently owning queue family must release the resource. This is done by issuing a pipeline barrier on a queue of the owning family. The barrier contains two pieces of important information: it must name the source queue family and also the destination queue family. Acquire works similarly: the acquiring queue must issue a pipeline barrier that names the source queue family (the queue family that previously released the resource) and then names the destination queue family (the queue family that acquires the resource). The acquire must execute after the release, which means the two queues have to synchronise via a semaphore. It is this kind of double accounting that makes Vulkan programming sometimes feel a bit bureaucratic, but so be it. We must do this for every resource that changes queue family ownership.
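Here is a sketch of that double accounting. It deliberately avoids the Vulkan headers: `OwnershipBarrier` is a hypothetical stand-in that keeps only the two fields which matter here, corresponding to `srcQueueFamilyIndex` and `dstQueueFamilyIndex` of Vulkan's buffer and image memory barriers. `make_ownership_transfer` and `resource_id` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the queue family fields of a buffer or image memory barrier.
struct OwnershipBarrier {
    uint64_t resource_id;            // hypothetical handle for the resource
    uint32_t src_queue_family_index; // family that gives up ownership
    uint32_t dst_queue_family_index; // family that takes ownership
};

struct OwnershipTransfer {
    OwnershipBarrier release; // to be recorded on a queue of the source family
    OwnershipBarrier acquire; // to be recorded on a queue of the destination family
};

// Both halves must name the same source and the same destination family,
// so we build them from a single description.
inline OwnershipTransfer make_ownership_transfer( uint64_t resource_id, uint32_t src_family, uint32_t dst_family ) {
    assert( src_family != dst_family && "no transfer needed within one family" );
    OwnershipBarrier barrier{ resource_id, src_family, dst_family };
    return OwnershipTransfer{ barrier, barrier };
}
```

Generating both barriers from one description is a small design choice that makes it hard to end up with a release and an acquire that disagree about who hands over to whom.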

Keeping it Within the Transfer Window 

So that we don’t end up with a huge number of operations on different queues that depend on each other, it’s perhaps good to think of ways to organise things, and group things together that belong together. It might also be a good idea to do all transfers before the rendering season begins. Once we have done this, a pattern appears:

First, for every queue family, we release all resources whose ownership moves away from this queue family. Then, for each queue family, we acquire all resources whose ownership moves to that queue family.
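Grouping the transfers per queue family could look roughly like this. It is a sketch under assumptions: `OwnershipChange`, `TransferPlan` and `plan_transfers` are hypothetical names, and resources are plain integer ids rather than Vulkan handles.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct OwnershipChange {
    uint64_t resource_id; // hypothetical resource handle
    uint32_t src_family;  // family that owns the resource now
    uint32_t dst_family;  // family that needs it this frame
};

struct TransferPlan {
    std::vector<std::vector<uint64_t>> releases; // releases[f]: resources family f gives up
    std::vector<std::vector<uint64_t>> acquires; // acquires[f]: resources family f receives
};

// Sorts all ownership changes into one release list and one acquire list per family.
inline TransferPlan plan_transfers( std::vector<OwnershipChange> const& changes, uint32_t num_families ) {
    TransferPlan plan{
        std::vector<std::vector<uint64_t>>( num_families ),
        std::vector<std::vector<uint64_t>>( num_families ),
    };
    for ( OwnershipChange const& c : changes ) {
        assert( c.src_family < num_families && c.dst_family < num_families );
        if ( c.src_family == c.dst_family ) continue; // stays within the family: nothing to do
        plan.releases[ c.src_family ].push_back( c.resource_id );
        plan.acquires[ c.dst_family ].push_back( c.resource_id );
    }
    return plan;
}
```

Each family then needs at most one release submission and one acquire submission per frame, no matter how many resources change hands.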

We must guard against one extra edge case: What if two queues from the same queue family exist, and both require READ_ONLY access to an acquired resource? How do we prevent the queue which is not involved in acquiring the resource from racing ahead and reading from that resource before it has been safely acquired? We can do this by adding a separate must_wait_acquire submission to any of these sibling queues - all this submission does is wait for the main sibling to signal via Timeline Semaphore that resource acquisition has completed. Because such a submission essentially blocks a sibling queue until it gets the correct signal, any subsequent submissions on this queue are protected from accidentally starting too early.
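Deciding which queues need such a must_wait_acquire submission boils down to a small filter. The sketch below is my own simplification for illustration: `QueueInfo` and `queues_needing_wait` are hypothetical names, and it looks at a single acquired resource at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QueueInfo {
    uint32_t family_index; // queue family this queue belongs to
};

// Returns the indices of queues that need a must_wait_acquire submission:
// siblings of the acquiring queue (same family) that read the acquired resource.
inline std::vector<size_t> queues_needing_wait( std::vector<QueueInfo> const& queues,
                                                size_t                        acquiring_queue,
                                                std::vector<size_t> const&    reader_queues ) {
    assert( acquiring_queue < queues.size() );
    std::vector<size_t> result;
    for ( size_t q : reader_queues ) {
        assert( q < queues.size() );
        // The acquiring queue itself is protected by submission order.
        if ( q == acquiring_queue ) continue;
        if ( queues[ q ].family_index == queues[ acquiring_queue ].family_index ) {
            result.push_back( q );
        }
    }
    return result;
}
```

With one graphics queue and two compute queues from the same family, where the first compute queue acquires and both read, only the second compute queue gets a wait submission. If there is just one compute queue, the list comes back empty.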

From here on, we can continue with how our frame was built before: we submit renderbatches on each queue, and at the end, on the main queue, in a separate submission exclusively used for synchronisation, we wait for all Timeline Semaphores, and then signal RENDER_COMPLETE.

Putting it all together 

So let’s say we have a rendergraph that can be split into three independent subgraphs just like this, where the resource compute_buffer[0|1] ping-pongs between two compute passes and a draw pass:

Fig. 5:
A rendergraph with three independent subgraphs - the thick underlined pass names symbolize root passes, dark filled triangles before resource names ▼ indicate write, unfilled triangles △ indicate read

Island will, if three queues are available, distribute this workload as follows:

Frame rendered with three queues
Fig. 6:
A more complex frame, built with 3 queues. queue_0 is a versatile graphics/compute queue, the two other queues are compute queues

Note in the diagram above that Island inserted a must_wait_acquire step to protect queue_2 from accessing compute_buffer[1] before queue_1 had a chance to acquire it for their shared queue family. This is only necessary if two queues from the same queue family want to access a READ_ONLY resource which needs to be acquired first.

In case there are only two queues available, the renderer detects that the must_wait_acquire element is not necessary anymore because submission order protects the resource from being accessed first:

Frame rendered with two queues
Fig. 7:
The same frame, built with 2 queues. queue_0 is a versatile graphics/compute queue, queue_1 is a compute queue. Note that subgraph {one} is implicitly synchronised because it is issued after subgraph {three}

The nice thing about this approach to resource ownership transfer is that it just slots into the current code path - it’s one extra function which generates the extra sync submissions. And this makes it very easy to skip. Now, why would we want to skip this function? Well, if the renderer detects that there is only a single queue family available, all the resource queue ownership bookkeeping and the transfer logic is not needed, and we can just skip it.

If we only have a single queue available - what does our frame look like? Like this:

Single-queue frame
Fig. 8:
The same frame, mapped to a single versatile queue

Note that the queue ownership transfer operations have melted away as there is only one queue family in the game and therefore no need to transfer ownership.

If you’re interested in how I applied the method described in this post to the Island codebase, I recommend you take a look at the relevant lines in the source code inside Island’s Vulkan backend module, le_backend_vk.cpp, on github.

What I’ve Learned so far 

Overall I’m quite pleased with how this has turned out - Island seems a pretty robust and adaptive renderer right now when it comes to using multiple queues.

Some early design choices for simplification took a lot of complexity out of the system: the decision to make subgraphs completely resource-independent from each other made it much simpler to reason about queue submissions, because now each queue submission could be looked at in isolation.

Sorting and grouping submissions showed me that there was a lot of repetition in the system, and these patterns led to the current architecture of the frame.

I don’t like to interleave new functionality into existing code - because it blurs the intent of the code - but with the implementation of resource ownership transfer, I got lucky, and found a way to place it into its own dedicated, isolated function by taking advantage of the implicit synchronisation guarantee that submissions on a queue happen in submission order.

Automatically drawing diagrams also helped a lot when reasoning about synchronisation. It makes it much easier for me to spot bugs and possible issues when I can see how a system fits together. All the diagrams in this post were automatically generated within Island - and rendered through graphviz.



