At GDC 2019, Arm and Samsung were joined on stage in the “All-in-One Guide to Vulkan on Mobile” talk to share their learning from helping numerous developers and studios in optimizing their Vulkan mobile games. In tandem, Arm released Vulkan Best Practices for Mobile Developers to address some of the most common challenges faced when coding Vulkan applications on mobile. It includes an expansive list of runnable samples with full source code available online.
This blog series delves in detail into each sample, investigates individual Vulkan features, and demonstrates best practices of how to use them.
Setting up a Vulkan swapchain involves picking between options that don’t have a straightforward connection to performance. The default options might not be the most efficient ones, and what works best on a desktop may be different from what works on mobile.
Looking at the VkSwapchainCreateInfoKHR struct, we identified three options that need a more detailed analysis:
- presentMode: what does each present mode imply in terms of performance?
- minImageCount: which is the best number of images?
- preTransform: what does it mean, and what do we need to do about it?
This blog post covers the first two points, as they are both tied to the concept of buffering swapchain images. Surface transform is quite a complex topic that we’ll cover in a future post on the Arm community.
Choosing a present mode
Vulkan has several present modes, but mobile GPUs only support a subset of them. In general, presenting an image directly to the screen (immediate mode) is not supported.
The application will render an image, then pass it to the presentation engine via vkQueuePresentKHR. The presentation engine will display the image for the next VSync cycle, and then it will make it available to the application again.
The only present modes which support VSync are:
- FIFO: VK_PRESENT_MODE_FIFO_KHR
- MAILBOX: VK_PRESENT_MODE_MAILBOX_KHR
We will now each of these in more detail to understand which one is better for mobile.
Figure 1 shows an outline of how the FIFO present mode works. The presentation engine has a queue (or “FIFO”) of images, in this case, three of them. At each VSync signal, the image in front of the queue displays on screen and is then released. The application will acquire one of the available ones, draw to it and then hand it over to the presentation engine, which will push it to the back of the queue. You may be used to this behavior from other graphics APIs, double or triple buffering – more on that later!
An interesting property of the FIFO present mode is that if the GPU can process images really fast, the queue can become full at some point. When this happens, the CPU and the GPU will idle until an image finishes its time on screen and is available again. The framerate will be capped at a stable 60 fps, corresponding to VSync.
This idling behavior works well on mobile because it means that no unnecessary work is performed. The extra CPU and GPU budget will be detected by the DVFS (Dynamic Voltage and Frequency Scaling) system, which reduces their frequencies to save power at no performance cost. This limits overheating and saves battery life – even a small detail such as the present mode can have a significant impact on your users’ experience!
Let us take a look at MAILBOX now. The main difference, as you can see from Figure 2 below, is that there is no queue anymore. The presentation engine will now hold a single image that will be presented at each VSync signal.
The app can acquire a new image straight away, render to it, and present it. If an image is queued for presentation, it will be discarded. Mobile demands efficiency; hence, the word “discarded” should be a big red flag when developing on mobile – the aim should always be to avoid unnecessary work.
Since an image was queued for presentation, the framerate will not improve. What is the advantage of MAILBOX then? Being able to keep submitting frames lets you ensure you have the latest user input, so input latency can be lower versus FIFO.
The price you pay for MAILBOX can be very steep. If you don’t throttle your CPU and GPU at all, one of them may be fully utilized, resulting in higher power consumption. Unless you need low-input latency, our recommendation is to use FIFO.
Choosing the number of images
It is now clear that FIFO is the most efficient present mode for mobile, but what about minImageCount? In the context of FIFO, minImageCount differentiates between double and triple buffering, which can have an impact on performance.
The number of images you ask for needs to be bound within the minimum and maximum images supported by the surface (you can query these values via surface capabilities). You will typically ask for 2 or 3 images, but the presentation engine can decide to allocate more.
Let us start with double buffering. Figure 4 outlines the expected double-buffering behavior.
Double buffering works well if frames can be processed within 16.6ms, which is the interval between VSync signals at a rate of 60 fps. The rendered image is presented to the swapchain, and the previously presented one is made available to the application again.
What happens if the GPU cannot process frames within 16.6ms?
Double buffering breaks! As you can see from Figure 5, if no images are ready when the VSync signal arrives, the only option for the presentation engine is to keep the current image on screen. The app has to wait for another whole VSync cycle before it acquires a new image, which effectively limits the framerate to 30 fps. A much higher rate could be achieved if the GPU could keep processing frames. This may be ok for you if you are happy to limit framerate to 30 fps, but if you’re aiming for 60 fps, you should consider triple buffering.
Even if your app can achieve 60 fps most of the time, with double buffering the tiniest slowdown below 60 fps results in an immediate drop to 30 fps.
Figure 6 shows triple buffering in action. Even if the GPU has not finished rendering when VSync arrives, a previous frame is queued for presentation. This means that the presentation engine can release the currently displayed image and the GPU can acquire it as soon as it is ready.
In the example shown, triple buffering results in ~50 fps versus 30 fps with double buffering.
Our Vulkan Best Practice for Mobile Developers project on Github has a sample on swapchain images, that specifically compares double and triple buffering. You can check out the tutorial for the Swapchain Images sample.
As you can see from Figures 7 and 8, triple buffering lets the app achieve a stable 60 fps (16.6 ms frame time), providing x2 higher frame rate. When switching to double buffering the framerate drops.
We encourage you to check out the project on the Vulkan Mobile Best Practice GitHub page and try this or other samples for yourself! The sample code gives developers on-screen control to demonstrate multiple ways of using the feature. It also shows the performance impact of the different approaches through real-time hardware counters on the display. You are also warmly invited to contribute to the project by providing feedback and fixes and creating additional samples.
Please also visit the Arm Community for more in-depth blogs on the other Vulkan samples.