This summer I embarked on a big rewrite of the foundations of Island, the Vulkan renderer I’ve been working on. This touched the big, load-bearing synchronisation parts, which was quite painful. All this to shave a yak of epic proportions, really: eventually I’d like to implement Vulkan Video, and for this to work, I will need a way to decode frames on a special video-decode queue, which cannot be the main draw-compute-transfer versatile queue.
This means that for Vulkan Video to work, we need to be able to run multiple queues, and for that, resources must be able to change queue ownership. This doesn’t sound too bad at first, but changing queue family ownership in Vulkan is a ceremony almost matching the intricacy of a baroque courtship ritual - it requires both queues from different families to perform a precise, polite little dance of release & acquire - and if, in this sordid sarabande, a step gets missed or misplaced, oops, that’s it: a deadlock, and the wedding is off for your program.
"…a precise, polite little dance of release & acquire…"
There is, of course, a way to make queue ownership transfers the responsibility of the driver: declare every resource VK_SHARING_MODE_CONCURRENT instead of VK_SHARING_MODE_EXCLUSIVE. But how much fun would that be? It is also said to be rather bad for performance (I would assume that queue ownership is then internally transferred lazily at the last possible moment, and that this could cause bubbles). And, heck, we’re using a rendergraph, which means the renderer gets our intentions telegraphed far enough in advance to make the right decisions…
So here’s the goal: we’d like a system that scales gracefully whether there is one queue or sixteen - a system that can flexibly distribute renderpasses onto the available queues, adapting to their number and capabilities, and that can automatically transfer resources between queues as needed.
As a bonus, if this works, we get a new way of exploiting GPU parallelism - because independent Vulkan queues may execute in parallel. This means that, for example, we could run a compute pass at the same time as an independent graphics pass, which may give us better GPU utilisation.
Now, where to start? Probably where I left off with the previous post - once Rendergraph assembly is complete.
In Island, once the Rendergraph has been assembled, the backend receives a list of independent subgraphs, where each subgraph guarantees that it will only access its own resources - or, if resources are shared, that these resources are strictly READ_ONLY within the same queue family for the full frame.
When the renderer forms subgraphs out of passes, the GPU queue requirements for all passes are accumulated: say you have a subgraph that consists of two passes, one COMPUTE and one DRAW pass, and because they use the same resource, they have been combined into a single subgraph. This subgraph now has the requirements COMPUTE|DRAW, and will only run on a queue that supports hybrid COMPUTE and DRAW operations. This is pretty nice, because by using a hybrid queue, we save ourselves a resource ownership transfer.
Another nice consequence is that we can treat subgraphs as completely isolated from each other. And because they are isolated, the backend can translate each subgraph into a render command batch, and submit each batch on whichever matching queue is available and most idle.
Of course, using multiple queues comes at a complexity cost: now we have the additional responsibility that all GPU queues that we bring to the party must synchronise with each other. How do they do that? Well, by talking to each other, for a start.
Semaphores are how queues talk to each other in Vulkan. You can only use them on queue submissions, where you name the semaphores you wish the queue to wait for, the command batch to process, and then the semaphores to signal once this submission has been processed.
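Here’s a minimal sketch of such a submission in plain Vulkan (synchronization2 style, using vkQueueSubmit2), assuming the queue, both semaphores and the recorded command buffer already exist, and that <vulkan/vulkan.h> is included:

```cpp
// One queue submission: wait on one semaphore, process one command batch,
// signal another semaphore once the batch has been processed.
// `queue`, `wait_semaphore`, `signal_semaphore` and `cmd_buffer` are assumed
// to have been created elsewhere.

VkSemaphoreSubmitInfo wait_info = {
    .sType     = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO,
    .semaphore = wait_semaphore,
    .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT, // stages that must wait
};

VkSemaphoreSubmitInfo signal_info = {
    .sType     = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO,
    .semaphore = signal_semaphore,
    .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT, // signal once all stages are done
};

VkCommandBufferSubmitInfo cmd_info = {
    .sType         = VK_STRUCTURE_TYPE_COMMAND_BUFFER_SUBMIT_INFO,
    .commandBuffer = cmd_buffer,
};

VkSubmitInfo2 submit_info = {
    .sType                    = VK_STRUCTURE_TYPE_SUBMIT_INFO_2,
    .waitSemaphoreInfoCount   = 1,
    .pWaitSemaphoreInfos      = &wait_info,
    .commandBufferInfoCount   = 1,
    .pCommandBufferInfos      = &cmd_info,
    .signalSemaphoreInfoCount = 1,
    .pSignalSemaphoreInfos    = &signal_info,
};

vkQueueSubmit2(queue, 1, &submit_info, VK_NULL_HANDLE);
```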
If Semaphores are how Vulkan GPU queues talk to each other, then Timeline Semaphores are how modern, more relaxed queues talk to each other. Timeline Semaphores can nowadays be used in place of the clunkier Binary Semaphores, and they are nicer to work with because they are more tolerant: not every signal event on a Timeline Semaphore must be answered. Binary Semaphores, by contrast, require every signal event to have a matching wait event. Timeline Semaphores are more forgiving: it’s enough to guarantee that the value signalled by the semaphore is monotonically increasing - and then you can wait on the highest signalled value that makes sense for a specific point in the execution timeline. You could also wait on more than one queue for the same Timeline Semaphore value.
I find Timeline Semaphores so useful that in Island, each queue automatically gets its own exclusive Timeline Semaphore, whether it uses it or not.
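For reference, this is roughly what creating a Timeline Semaphore looks like in plain Vulkan (core since 1.2, or via VK_KHR_timeline_semaphore); the device handle is assumed to exist. On a queue submission, the value field of VkSemaphoreSubmitInfo then carries the timeline value to wait for or to signal - and the host can also wait on a value directly:

```cpp
// Create a Timeline Semaphore with an initial value of 0.
VkSemaphoreTypeCreateInfo type_info = {
    .sType         = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
    .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
    .initialValue  = 0,
};
VkSemaphoreCreateInfo create_info = {
    .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    .pNext = &type_info,
};
VkSemaphore timeline_semaphore;
vkCreateSemaphore(device, &create_info, nullptr, &timeline_semaphore);

// Block on the host until some queue has signalled a value >= 42.
uint64_t            wait_value = 42;
VkSemaphoreWaitInfo wait_info  = {
    .sType          = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
    .semaphoreCount = 1,
    .pSemaphores    = &timeline_semaphore,
    .pValues        = &wait_value,
};
vkWaitSemaphores(device, &wait_info, UINT64_MAX); // timeout in nanoseconds
```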
I hope that this short paean to Timeline Semaphores has convinced you that Timeline Semaphores are the future, and that we would therefore like to use them for any & all of our realtime-graphics needs. Sounds good? Well, unfortunately, at the time of writing, we can’t use Timeline Semaphores for swapchain operations - the one place where we absolutely must use semaphores. That’s sad, but there’s a way around it: we can still use a hybrid of Binary and Timeline Semaphores.
Currently, Island uses two Binary Semaphores per frame in order to synchronise with the WSI API (that’s the swapchain system): PRESENT_COMPLETE and RENDER_COMPLETE.
PRESENT_COMPLETE is signalled when the presentation system has finished acquiring the image which will be the current backbuffer, telling us that this image is ready to be written to.

RENDER_COMPLETE is signalled when all commands from the render command batch have been submitted and processed, and said backbuffer image is ready to be flipped onto the screen.

We therefore need to wait for PRESENT_COMPLETE before we begin rendering, and we signal that we’re done with rendering by signalling RENDER_COMPLETE. In a single-queue renderer, we can submit all the render commands in one big batch, and synchronise this single submission by first waiting on PRESENT_COMPLETE, then processing the command batch, and then signalling RENDER_COMPLETE on completion.
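For context, here is roughly how those two Binary Semaphores bracket a frame on the swapchain side - a sketch, assuming device, swapchain, queue, the two semaphores and the recorded command batch already exist:

```cpp
uint32_t image_index = 0;

// 1. Acquire the next backbuffer image; PRESENT_COMPLETE will be signalled
//    once the presentation engine is actually done with the image.
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      present_complete, VK_NULL_HANDLE, &image_index);

// 2. Submit the render command batch: wait on PRESENT_COMPLETE, signal
//    RENDER_COMPLETE once all commands have been processed
//    (the VkSubmitInfo2 plumbing is the same as in the earlier sketch).

// 3. Present: the presentation engine waits on RENDER_COMPLETE before
//    flipping the image onto the screen.
VkPresentInfoKHR present_info = {
    .sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores    = &render_complete,
    .swapchainCount     = 1,
    .pSwapchains        = &swapchain,
    .pImageIndices      = &image_index,
};
vkQueuePresentKHR(queue, &present_info);
```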
In the hypothetical case that we wanted to submit two render batches, this becomes a bit more complicated, because we can’t wait more than once on the same Binary Semaphore unless we signal it again. Fortunately, we don’t need to: we can just split the workload into two, wait on PRESENT_COMPLETE on the first submission, and then signal RENDER_COMPLETE on the second submission. Because submissions on the same queue get executed in submission order, things still happen in the correct sequence.
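In code, that split might look roughly like this. The sem_info() and submit() helpers below are illustrative shorthand around VkSubmitInfo2 - not Island’s actual API - and the command buffer infos are assumed to have been recorded already:

```cpp
#include <vector>
#include <vulkan/vulkan.h>

// Illustrative helper: wrap a semaphore into a VkSemaphoreSubmitInfo.
// The value parameter is only meaningful for Timeline Semaphores.
VkSemaphoreSubmitInfo sem_info(VkSemaphore semaphore, uint64_t value = 0) {
    VkSemaphoreSubmitInfo info = {
        .sType     = VK_STRUCTURE_TYPE_SEMAPHORE_SUBMIT_INFO,
        .semaphore = semaphore,
        .value     = value,
        .stageMask = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
    };
    return info;
}

// Illustrative helper: bundle waits, command buffers and signals
// into one vkQueueSubmit2 call.
void submit(VkQueue queue,
            std::vector<VkSemaphoreSubmitInfo> const&     waits,
            std::vector<VkCommandBufferSubmitInfo> const& cmds,
            std::vector<VkSemaphoreSubmitInfo> const&     signals) {
    VkSubmitInfo2 submit_info = {
        .sType                    = VK_STRUCTURE_TYPE_SUBMIT_INFO_2,
        .waitSemaphoreInfoCount   = uint32_t(waits.size()),
        .pWaitSemaphoreInfos      = waits.data(),
        .commandBufferInfoCount   = uint32_t(cmds.size()),
        .pCommandBufferInfos      = cmds.data(),
        .signalSemaphoreInfoCount = uint32_t(signals.size()),
        .pSignalSemaphoreInfos    = signals.data(),
    };
    vkQueueSubmit2(queue, 1, &submit_info, VK_NULL_HANDLE);
}

// Two render batches on the same queue: the first waits on PRESENT_COMPLETE,
// the second signals RENDER_COMPLETE. The semaphore wait in the first
// submission also covers everything submitted later on this queue.
void submit_split(VkQueue queue,
                  VkSemaphore present_complete, VkSemaphore render_complete,
                  VkCommandBufferSubmitInfo batch_0, VkCommandBufferSubmitInfo batch_1) {
    submit(queue, {sem_info(present_complete)}, {batch_0}, {});
    submit(queue, {}, {batch_1}, {sem_info(render_complete)});
}
```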
Now let’s say we allow the user to choose (via a shopping list of queue capabilities) which kind of queues, and how many, they wish to use for their Island application.
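Conceptually, such a shopping list boils down to a set of requested queue capability flags - something along these lines (a made-up illustration, not Island’s actual API; only the VkQueueFlagBits values are real Vulkan):

```cpp
// Hypothetical sketch of a queue "shopping list" - the identifier below is
// illustrative only, not Island's actual API. Here we ask for one hybrid
// graphics/compute queue and two compute-only queues.
std::vector<VkQueueFlags> requested_queue_capabilities = {
    VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT,
    VK_QUEUE_COMPUTE_BIT,
    VK_QUEUE_COMPUTE_BIT,
};
```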
Note that it is not guaranteed that queues are always available: this is hardware dependent, which is why we must make sure that everything we do has a graceful fallback. More on this later. But let’s say we have more than one GPU queue available, and we have multiple independent subgraphs assembled into Vulkan command batches - then we could submit these to run in parallel.
Of course, we would still have to enforce that whoever writes to the backbuffer waits for the backbuffer resource to be available, signalled via PRESENT_COMPLETE - and we would also need to enforce that RENDER_COMPLETE only gets signalled once all render batches have been submitted and processed. How can we enforce this?
We will use a trick: since operations on a queue happen in submission order, we can extend the previous split submission over multiple queues by adding some Timeline Semaphores:
On queue_0 (a graphics/compute queue) we first wait for PRESENT_COMPLETE, then execute the command batch for queue_0, then signal the Timeline Semaphore for queue_0.

Starting at the same time, queue_1 (a compute-only queue) doesn’t need to wait for anyone: it executes its command batch, and then signals its own Timeline Semaphore.

Back on queue_0, we wait for all Timeline Semaphores, then execute zero commands, and immediately signal RENDER_COMPLETE. This last queue submission on queue_0 is only there for synchronisation - it harvests all Timeline Semaphores across all queues, and then, on the main queue, signals RENDER_COMPLETE.
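Reusing the illustrative sem_info() / submit() helpers from the earlier sketch, the frame could then look roughly like this - timeline_0 and timeline_1 are the per-queue Timeline Semaphores, and frame is a monotonically increasing frame counter used as the timeline value:

```cpp
// Sketch: the split submission extended over two queues.
void submit_frame(VkQueue queue_0, VkQueue queue_1,
                  VkSemaphore present_complete, VkSemaphore render_complete,
                  VkSemaphore timeline_0, VkSemaphore timeline_1,
                  VkCommandBufferSubmitInfo batch_0, VkCommandBufferSubmitInfo batch_1,
                  uint64_t frame) {
    // queue_0: wait for the backbuffer, render, then signal its Timeline Semaphore.
    submit(queue_0, {sem_info(present_complete)}, {batch_0},
           {sem_info(timeline_0, frame)});

    // queue_1: nothing to wait for - render, then signal its Timeline Semaphore.
    submit(queue_1, {}, {batch_1}, {sem_info(timeline_1, frame)});

    // Back on queue_0: an empty, synchronisation-only submission which harvests
    // all Timeline Semaphores and only then signals RENDER_COMPLETE.
    submit(queue_0,
           {sem_info(timeline_0, frame), sem_info(timeline_1, frame)},
           {},
           {sem_info(render_complete)});
}
```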
This then is the first piece of the puzzle: this will help us synchronize parallel queue submissions while still keeping up the pretense to the swapchain system that nothing has changed.
It will also gracefully fall back in case we only have one queue available: regardless of whether the first or the second sub-graph gets processed first, the sub-graph which draws to the backbuffer image will wait for that image to be available, and at the end, once both sub-graphs have been processed, we signal RENDER_COMPLETE. There’s no chance of a deadlock.
Now that we have found a way to synchronise multiple queues, we need to look into how we will transfer queue family ownership for resources.
Note that there is a difference between queue ownership and queue family ownership. Two queues may merrily read-only from the same resource - as long as they are from the same family. If the two queues, however, are from different queue families, then this is a different story: a resource that has been declared VK_SHARING_MODE_EXCLUSIVE can only belong to one family at a time - not even shared read-only access is allowed - so we must make sure that queue family ownership is transferred before a new family accesses the resource.
A resource transfer consists of two matching operations: release and acquire. It must follow this procedure: first, the currently owning queue family must release the resource. This is done by issuing a pipeline barrier on the owning queue family. The barrier contains two important pieces of information: it must name the source queue family and also the destination queue family. Acquire works similarly: the acquiring queue family must issue a pipeline barrier that again names the source queue family (the one that previously released the resource) and the destination queue family (the one that acquires the resource). It is this kind of double accounting that makes Vulkan programming sometimes feel a bit bureaucratic, but so be it. We must do this for every resource that changes queue family ownership.
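As a concrete sketch in plain Vulkan (synchronization2 style) for a buffer resource - the release barrier is recorded into a command buffer that executes on the releasing family, the matching acquire barrier into one that executes on the acquiring family, and both must name exactly the same pair of families:

```cpp
// Sketch: queue family ownership transfer of `buffer` from `src_family`
// to `dst_family`. Handles and family indices are assumed to exist.

// Release - recorded on a command buffer that runs on the owning family.
void record_release(VkCommandBuffer cmd, VkBuffer buffer,
                    uint32_t src_family, uint32_t dst_family) {
    VkBufferMemoryBarrier2 barrier = {
        .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
        .srcStageMask        = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
        .srcAccessMask       = VK_ACCESS_2_MEMORY_WRITE_BIT,
        // dst stage/access are ignored for the release half of the transfer
        .srcQueueFamilyIndex = src_family,
        .dstQueueFamilyIndex = dst_family,
        .buffer              = buffer,
        .offset              = 0,
        .size                = VK_WHOLE_SIZE,
    };
    VkDependencyInfo dep = {
        .sType                    = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
        .bufferMemoryBarrierCount = 1,
        .pBufferMemoryBarriers    = &barrier,
    };
    vkCmdPipelineBarrier2(cmd, &dep);
}

// Acquire - recorded on a command buffer that runs on the receiving family,
// naming the very same source and destination families.
void record_acquire(VkCommandBuffer cmd, VkBuffer buffer,
                    uint32_t src_family, uint32_t dst_family) {
    VkBufferMemoryBarrier2 barrier = {
        .sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
        // src stage/access are ignored for the acquire half of the transfer
        .dstStageMask        = VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
        .dstAccessMask       = VK_ACCESS_2_MEMORY_READ_BIT | VK_ACCESS_2_MEMORY_WRITE_BIT,
        .srcQueueFamilyIndex = src_family,
        .dstQueueFamilyIndex = dst_family,
        .buffer              = buffer,
        .offset              = 0,
        .size                = VK_WHOLE_SIZE,
    };
    VkDependencyInfo dep = {
        .sType                    = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
        .bufferMemoryBarrierCount = 1,
        .pBufferMemoryBarriers    = &barrier,
    };
    vkCmdPipelineBarrier2(cmd, &dep);
}
```

Vulkan also requires us to make sure that the release has actually completed before the acquire executes - across queues that means a semaphore dependency, which is exactly where the per-queue Timeline Semaphores come in handy.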
So that we don’t end up with a huge number of operations on different queues that all depend on each other, it’s worth thinking about how to organise things, and to group together what belongs together. It might also be a good idea to do all transfers before the rendering season begins. Once we have done this, a pattern appears:
First, for every queue family, we release all resources that lose queue ownership from this queue family. Then, for each queue family, we acquire all resources that gain queue ownership on this queue family.
We must guard against one extra edge case: what if two queues from the same queue family exist, and both require READ_ONLY access to an acquired resource? How do we prevent the queue which is not involved in acquiring the resource from racing ahead and reading from that resource before it has been safely acquired? We can do this by adding a separate must_wait_acquire submission to any of these sibling queues - all this submission does is wait for the main sibling to signal, via its Timeline Semaphore, that resource acquisition has completed. Because such a submission essentially blocks a sibling queue until it gets the correct signal, any subsequent submissions on this queue are protected from accidentally starting too early.
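In terms of the earlier illustrative helpers, a must_wait_acquire submission is just an empty submission that waits - something like:

```cpp
// Sketch: a synchronisation-only "must_wait_acquire" submission on a sibling
// queue. It carries no command buffers and signals nothing; it only waits
// until the sibling that performs the acquire has signalled its Timeline
// Semaphore, so that later submissions on this queue cannot start too early.
// (sem_info() and submit() are the illustrative helpers from above; the value
// is whatever the acquiring sibling signals once its acquire has completed.)
void must_wait_acquire(VkQueue sibling_queue, VkSemaphore acquiring_sibling_timeline,
                       uint64_t acquire_done_value) {
    submit(sibling_queue,
           {sem_info(acquiring_sibling_timeline, acquire_done_value)}, // wait only
           {},   // no command buffers
           {});  // nothing to signal
}
```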
From here on, we can continue with how our frame was built before: we submit render batches on each queue, and at the end, on the main queue, in a separate submission used exclusively for synchronisation, we wait for all Timeline Semaphores, and then signal RENDER_COMPLETE.
So let’s say we have a rendergraph that can be split into three independent subgraphs, just like this, where the resource compute_buffer[0|1] ping-pongs between two compute passes and a draw pass:
Island will, if three queues are available, distribute this workload as follows:
Note in the diagram above that Island inserted a must_wait_acquire step to protect queue_2 from accessing compute_buffer[1] before queue_1 had a chance to acquire it for their shared queue family. This is only necessary if two queues from the same queue family want to access a READ_ONLY resource which needs to be acquired first.
In case there are only two queues available, the renderer detects that the must_wait_acquire element is no longer necessary, because submission order already protects the resource from being accessed before it has been acquired:
The nice thing about this approach to resource ownership transfer is that it just slots into the current code path - it’s one extra function which generates the extra sync submissions. And this makes it very easy to skip. Now, why would we want to skip this function? Well, if the renderer detects that there is only a single queue family available, all the resource queue ownership bookkeeping and transfer logic is not needed, and we can simply skip it.
If we have only a single queue available - what does our frame look like? Like this:
Note that the queue ownership transfer operations have melted away, as there is only one queue family in play and therefore no need to transfer ownership.
Overall I’m quite pleased with how this has turned out - Island seems a pretty robust and adaptive renderer right now when it comes to using multiple queues.
Some early simplifying design choices took a lot of complexity out of the system: the decision to make subgraphs completely resource-independent from each other made it much simpler to reason about queue submissions, because now each queue submission could be looked at in isolation.
Sorting and grouping submissions showed me that there was a lot of repetition in the system, and these patterns led to the current architecture of the frame.
I don’t like to interleave new functionality into existing code - because it blurs the intent of the code - but with the implementation of resource ownership transfer, I got lucky, and found a way to place it into its own dedicated, isolated function by taking advantage of the implicit synchronisation guarantee that submissions on a queue happen in submission order.
Automatically drawing diagrams also helped a lot when reasoning about synchronisation. It makes it much easier for me to spot bugs and possible issues when I can see how a system fits together. All the diagrams in this post were automatically generated within Island - and rendered through graphviz.