Symbolibre - Graphics acceleration on the Raspberry Pi Zero

The minimalist nature of the Raspberry Pi Zero makes graphics acceleration a very useful optimization to maintain performance on both the desktop and applications. Yet documentation on the subject is quite scarce, and the richest source of information is years of Internet articles by Raspberry Pi users investigating the configuration and use of accelerated rendering on the platform.

This article is but another record of such an attempt, which I hope can help future users in the same situation. 😀

TL;DR Summary:

The Raspberry Pi Zero's VideoCore IV GPU is used basically out-of-the-box when using the Mesa support and the vc4 module.
Applications using DRI2 (with X) or OpenGL rendering (Wayland) will use the GPU without a problem.
For some reason, the X server's Glamor has major issues running with vc4, dropping below CPU-only performance levels, while Wayland compositors have no problem running smoothly. XWayland has the same issues as X.
Therefore I recommend using Wayland with Wayland-native applications, unless the issues with the X server can be found and solved.

Quick tour of accelerated rendering in Linux, X and Wayland

In the kernel and OpenGL implementation

The earlier interfaces to GPUs in Linux left most of the work to the applications. For instance, the X server would directly access the GPU device to set the resolution and color depth of the display (called a mode), and applications that performed rendering would manage the memory and framebuffers of the GPU. This resulted in stability and concurrency problems whenever several applications attempted to interact with the GPU simultaneously. At first, these problems were mostly worked around; for example, the mode-setting question was commonly addressed by running the X server as root and not setting the mode in any other application.

To truly solve these issues, a kernel interface called the Direct Rendering Manager (DRM) was introduced. DRM abstracts away the command queue, memory management and other details of the GPU to allow several applications to use the GPU concurrently. DRM also handles setting the mode of the display, and exposes the Kernel Mode Setting (KMS) interface to applications instead of letting them access the GPU directly. DRM and KMS are now used systematically.

As accelerated rendering developed, the Mesa library that implemented the OpenGL rendering interface, in software back then, was extended to also support accelerated rendering with the new DRM interface. The software renderer in Mesa still exists today, and is known as softpipe or llvmpipe depending on how it is built.

In the X server

The original interface of the X server exposed simple rendering functions to applications, such as shapes, images, or text. Applications communicated with the server via IPC and the server performed all the drawing. Just like Mesa, the X server implemented a GPU-accelerated version of its API, allowing the client applications to benefit from the GPU without changing the API. I'm not sure how it evolved over time, but Glamor ended up being used for this task and was eventually merged in the X server in 2014. This kind of setup code used to be necessary to use Glamor; the Load command loads the plugin and the AccelMethod option enables it.

Section "Module"
    Load "dri2"
    Load "glamoregl"
EndSection
Section "Device"
    Identifier "intel"
    Driver "intel"
    Option "AccelMethod" "glamor"
EndSection

However, this acceleration was not sufficient, as the cost of IPC was becoming a hard limitation on rendering performance. To work around this problem, the Direct Rendering Infrastructure (DRI) was introduced in the X system to allow clients to perform accelerated rendering without communicating with the X server.

The introduction of DRI means that there are two ways to perform rendering under X; either direct rendering using DRI and with minimal IPC, or indirect rendering using X primitives. The first method is inherently hardware-accelerated, and the second method can be too provided that a suitable driver is available and that Glamor is loaded.

This article on Forbidden Projects explains the rationale behind the DRM and DRI systems very clearly.

Note that "DRI" is also the name of the DRM-accelerated implementation of OpenGL in Mesa, which is (among others) used by Glamor when performing indirect rendering. This 2016 blog post by Jasper St. Pierre should help clear the confusion. In this article, I only refer to the DRI subsystem of the X system that helps applications render without communicating with the X server.

In Wayland

Wayland was designed with accelerated rendering from the start and leaves all the rendering to the clients, which simplifies the whole picture quite a bit. Under Wayland, applications perform the rendering on their own, typically with the OpenGL/OpenGLES implementation in Mesa, and supply the finished textures to the Wayland compositor. The compositor then arranges them and uses DRM to push the results to the display.

This diagram from the Wayland rendering model on Wikipedia sums up pretty much every component mentioned in this section. On the left, you can see that Wayland applications communicate fully-rendered textures to the Wayland compositor (yellow arrow), which then accesses KMS and DRM. Applications render on their own using (for instance) Mesa 3D (green arrow).

By contrast, X clients traditionally call into the server to render X primitives (whether the server is the original implementation or XWayland doesn't matter), which then goes into Glamor for accelerated rendering. DRI would be represented as a green arrow from the X11 client to the Mesa 3D library, but is not pictured.

Figure: The Linux Graphics Stack and Glamor, by Shmuel Csaba Otto Traian (diagram of the Wayland and XWayland rendering flow, from Wikipedia). Licensed under CC BY-SA 4.0.

Using the Raspberry Pi Zero's VideoCore IV GPU

The Raspberry Pi Zero comes with a Broadcom 2835 chip that has a VideoCore IV (VC4) GPU. For reference, here is the DRM documentation regarding VC4. Eric Anholt has been developing and maintaining Mesa 3D and KMS support for VC4 since mid-2014, and the main goal as far as the kernel is concerned is to have his driver loaded and running. See the VC4 driver's repository wiki for the most official documentation.

There is an OpenGLES implementation for VC4 in the /opt/vc folder in Raspberry Pi OS. It's documented in pretty good detail on ELinux.org, and the sources are available on the raspberrypi/firmware repository on Github. However, I've had a much easier time using the official (and more recent) Mesa support, and I'd rather use code that is maintained as part of the Linux kernel. Because the Symbolibre OS is based on Raspbian instead of Raspberry Pi OS, it's also much easier to grab the packaged Mesa instead of copying and configuring the original driver.

After installing Mesa, one simply needs to change /boot/config.txt to load the device tree overlay that assigns the VC4 driver to the GPU and add some GPU memory. The 512 MB memory of the Raspberry Pi Zero is shared between CPU and GPU, so everything that is assigned to the GPU (here, 128 MB) will be unavailable to applications.

# Add in /boot/config.txt:
dtoverlay=vc4-kms-d3d
gpu_mem=128
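As a rough way to sanity-check these two settings, the sketch below parses the text of a config.txt and reports whether the overlay is present and how much RAM is left to the CPU. The helper name summarize_gpu_config and its 512 MB default are illustrative choices, not part of any Raspberry Pi tool. The parser is deliberately naive: it only understands plain key=value lines and ignores things like [section] filters.

```python
def summarize_gpu_config(config_text, total_ram_mb=512):
    """Summarize the GPU-related settings found in a config.txt string.

    Hypothetical helper for illustration: it only understands plain
    key=value lines and skips comments and blank lines.
    """
    overlays = []
    gpu_mem_mb = None
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "dtoverlay":
            # An overlay may carry parameters after a comma; keep the name only.
            overlays.append(value.split(",", 1)[0])
        elif key == "gpu_mem":
            gpu_mem_mb = int(value)
    return {
        "vc4_kms_enabled": "vc4-kms-d3d" in overlays,
        "gpu_mem_mb": gpu_mem_mb,
        "cpu_mem_mb": None if gpu_mem_mb is None else total_ram_mb - gpu_mem_mb,
    }


if __name__ == "__main__":
    sample = "# Add in /boot/config.txt:\ndtoverlay=vc4-kms-d3d\ngpu_mem=128\n"
    print(summarize_gpu_config(sample))
```

With the two lines above, this leaves 384 MB for applications out of the Pi Zero's 512 MB.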

Additionally, the main user should be a member of the video group, so that user processes (including the Wayland compositor, Wayland applications and DRI applications) can access the GPU. Once configured, dmesg prints this kind of message at startup.

[   29.554462] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
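If you want to check for this message from a script rather than by eye, something like the following sketch could work. The regular expression is modeled on the single line shown above, so it is an assumption that other kernel versions format the message the same way. The function names are made up for the example.

```python
import re

# Matches lines such as:
# [   29.554462] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
# The layout is inferred from that one sample line only.
DRM_INIT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\]\s+Initialized\s+(?P<driver>\S+)\s+"
    r"(?P<version>\S+)\s+(?P<date>\d+)\s+for\s+(?P<device>\S+)\s+"
    r"on\s+minor\s+(?P<minor>\d+)"
)


def find_drm_drivers(dmesg_text):
    """Return one dict per '[drm] Initialized ...' line in a dmesg dump."""
    drivers = []
    for match in DRM_INIT_RE.finditer(dmesg_text):
        info = match.groupdict()
        info["minor"] = int(info["minor"])
        info["time"] = float(info["time"])
        drivers.append(info)
    return drivers


def vc4_is_loaded(dmesg_text):
    """True if the vc4 driver reported a successful initialization."""
    return any(d["driver"] == "vc4" for d in find_drm_drivers(dmesg_text))


if __name__ == "__main__":
    sample = "[   29.554462] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0"
    print(find_drm_drivers(sample))
    print(vc4_is_loaded(sample))
```

In practice you would feed it the output of dmesg, for instance captured with subprocess.run(["dmesg"], capture_output=True, text=True), which may require root depending on the system's settings.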

The good news so far is that both X and Wayland will automatically find this driver and use it without asking any questions. The main problem is whether they can make the best out of it for themselves and for the applications.

Performance on X: perfect with DRI, terrible with Glamor

I tested the X server with i3 in this configuration. I have two programs to compare performance for: xterm (an indirect rendering program that goes through the IPC-based X drawing API with Glamor) and glxgears (a mesa-utils program that uses DRI). I'm using a 1280×1024 VGA screen connected to the mini-HDMI port.

VideoCoreIV Glamor on your Raspberry Pi explains how Glamor used to be set up under X (around the time Anholt's driver was started), and section 5 in particular mentions two X options:

Driver "modesetting"
Option "AccelMethod" "glamor"

The first option enables the modesetting driver, which is the X server driver that uses KMS and DRM, and replaces the previous xserver-xorg-video-* drivers that would access the GPU directly. This has been the preferred approach for a long time now (see the notice on the xserver-xorg-video-intel Debian package for instance). We could also use the fbdev driver that is standard but CPU-only, and useful as a benchmark baseline.

Fortunately, all of these settings are now the default. Starting up X without any specific configuration gives me this message, which says it all.

[  219.484] (II) modeset(0): glamor X acceleration enabled on VC4 V3D 2.1

However even in this configuration, the desktop is nearly unusable, with xterm struggling to scroll even basic bitmap-font screens, extremely slow window movement and resizing, and 80% CPU usage almost all the time. Only glxgears runs at a flawless 60 FPS while covering half of my screen.

I was unable to track down the reason for this problem, but it is apparent that something is catastrophically preventing Glamor from using the full capabilities of the GPU, while DRI works as expected. (This is the main consideration in this article, and took me ages to pin down.) Ironically, the performance of Glamor is even worse than with the non-accelerated fbdev driver:

Driver	Scrolling in xterm	Resizing windows	glxgears
`modesetting`	~1 FPS	~5 FPS	60 FPS (half-screen, DRI2)
`fbdev` or `modesetting` with `AccelMethod=none`	~2 FPS	~10 FPS	1.7 FPS (half-screen, `llvmpipe`)

xterm is almost unusable in both cases, and although 10 FPS for window operations is borderline acceptable, it is nowhere near the smoothness that one would expect from accelerated rendering. I've played around with server options but could only observe changes when using fbdev or changing AccelMethod.

Note that there is an fbturbo driver that improves fbdev with ARM-specific CPU optimizations, and supports accelerated window moving/scrolling with the BCM2835 DMA. This could be a reasonable option if using X is mandatory, although it doesn't provide full OpenGLES capabilities.

I tried to enable DRI2 with modesetting while keeping Glamor disabled on the server, but could not achieve it (feel free to give it a try, as my experience configuring X is shallow at best). The non-accelerated X server might not be as fast as the non-accelerated Mesa code though, as Mesa uses llvmpipe which compiles code on-the-fly to improve its performance.

There are a couple of entries in Anholt's VC4 driver journal that mention Glamor performance but none noted such issues.

Performance with Wayland: clean for Wayland-native applications

I tested the Wayland system with sway, which I am the most familiar with (but any wlroots-based compositor would work the same). Since sway runs XWayland, a specialized version of the X server that allows X applications to run in a Wayland desktop, I could also try xterm. As before, I tested glxgears, as well as a Wayland-native terminal, foot (which is packaged in Debian).

Starting sway with the default settings will use the Mesa-bundled OpenGL ES implementation (including EGL), and DRM for the video card. (The dri in /dev/dri/card0 is inherited from the early stages of DRM and has nothing to do with the X system's DRI.)

00:00:00.177 [INFO] [backend/drm/backend.c:138] Initializing DRM backend for /dev/dri/card0 (vc4)

00:00:03.426 [INFO] [render/gles2/renderer.c:693] Using OpenGL ES 2.0 Mesa 20.2.3
00:00:03.427 [INFO] [render/gles2/renderer.c:694] GL vendor: Broadcom
00:00:03.427 [INFO] [render/gles2/renderer.c:695] GL renderer: VC4 V3D 2.1
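The renderer line is the one to watch: if it names llvmpipe or softpipe instead of the VC4, rendering is happening on the CPU. Here is a minimal sketch of that check, assuming the log contains a "GL renderer:" line formatted like the one above. The function names are hypothetical, and the list of software renderers only covers the two mentioned earlier in this article.

```python
import re

# Names of Mesa's CPU-only rasterizers, as mentioned earlier in the article.
SOFTWARE_RENDERERS = ("llvmpipe", "softpipe")

# Captures everything after "GL renderer:" up to the end of that line.
GL_RENDERER_RE = re.compile(r"GL renderer:\s*(?P<renderer>.+?)\s*$", re.MULTILINE)


def gl_renderer_from_log(log_text):
    """Return the renderer string from a 'GL renderer:' log line, or None."""
    match = GL_RENDERER_RE.search(log_text)
    return match.group("renderer") if match else None


def is_hardware_accelerated(log_text):
    """True if a renderer is reported and it is not a known software rasterizer."""
    renderer = gl_renderer_from_log(log_text)
    if renderer is None:
        return False
    return not any(name in renderer.lower() for name in SOFTWARE_RENDERERS)


if __name__ == "__main__":
    sample = "00:00:03.427 [INFO] [render/gles2/renderer.c:695] GL renderer: VC4 V3D 2.1"
    print(gl_renderer_from_log(sample), is_hardware_accelerated(sample))
```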

Applications here behave much as they do with X. The inherently hardware-accelerated glxgears maintains 30 FPS even in full-screen; it could go to 60 FPS but automatically synchronizes with the mode chosen by sway (which to be honest is more than enough; we don't want to drain the batteries with 60 FPS on an HD testing display).

Note that glxgears uses GLX and hence goes through XWayland... but this is mostly transparent as it performs direct rendering; these results are sufficient to validate acceleration with Mesa. The same cannot be said of xterm; it uses indirect rendering with X primitives, and the Glamor code in XWayland suffers from the same problems as the original X server, so there is no real improvement here.

By contrast, the Wayland-native foot terminal is able to scroll at a decent speed without rendering partial frames while using an anti-aliased font. That said, foot seems to be less efficient when the window is large, which is reflected in the scroll speed as well as the window resizing speed; the average is around 15 FPS. Other terminals might perform better. Fortunately, in the Symbolibre calculator the display is pretty small (320x240), so this won't be too much of an issue.

The performance of window management is wildly different in this configuration, as the problems encountered by Glamor do not occur with Wayland. Moving around and resizing the glxgears window is consistently fluid (I estimate in the vicinity of 20 FPS).

Driver	Scrolling in xterm	Scrolling in foot	Resizing windows	glxgears
DRM/KMS on `vc4`	~2 FPS	~15 FPS	~20 FPS	30 FPS (half-screen)

The combination of a full-screen (1280x1024) glxgears with XWayland and sway supporting it takes up to 70-80% CPU usage, but running top in a terminal uses about 10%, which is satisfactory for our use case.

A note on using PRIME with SPI displays

There can be other concerns with using an SPI display. Unlike DPI displays which are handled by the GPU, SPI displays are mostly on the CPU side, which means that performing accelerated rendering requires additional CPU/GPU exchanges.

Fortunately, the kernel DRM has a subsystem called DMA-BUF which allows the kernel to move around buffers between several GPUs using DMA. It is the basis for a DRM extension called PRIME which you may know if you have a hybrid graphics laptop. The tinydrm driver that we are using supports using PRIME on SPI displays, which helps with retrieving GPU data. (I don't think the fact that the Pi Zero RAM is a single 512 MB block is exploited; it's not even clear whether concurrent access would be safe.)

I haven't delved too much into PRIME configuration, but I believe the bottleneck remains in the SPI bandwidth. 30 FPS on a 320x240 SPI display requires a transfer speed of almost 40 Mbit/s at 16 bits per pixel, which SPI devices rarely reach (you can overclock, but that's on you). In our tests, changing the SPI bandwidth in the device tree would greatly impact the framerate for sway.
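To make the bandwidth estimate concrete, here is a back-of-the-envelope calculation. It assumes 16 bits per pixel (RGB565, common on small SPI panels but not something stated above), one bit transferred per SPI clock cycle, and no protocol overhead, so real-world figures will be somewhat worse. The 32 MHz clock in the usage example is an arbitrary illustrative value.

```python
def spi_bitrate_mbit(width, height, fps, bits_per_pixel=16):
    """Raw pixel payload in Mbit/s for full-frame updates, ignoring overhead."""
    return width * height * bits_per_pixel * fps / 1_000_000


def max_fps(width, height, spi_clock_hz, bits_per_pixel=16):
    """Upper bound on full-frame refreshes per second for a given SPI clock.

    Assumes one bit is transferred per clock cycle.
    """
    return spi_clock_hz / (width * height * bits_per_pixel)


if __name__ == "__main__":
    # 320x240 at 30 FPS, 16 bits per pixel (assumed pixel format)
    print(f"{spi_bitrate_mbit(320, 240, 30):.3f} Mbit/s")
    # Best-case frame rate with a hypothetical 32 MHz SPI clock
    print(f"{max_fps(320, 240, 32_000_000):.1f} FPS")
```

Under these assumptions, 30 FPS needs about 36.9 Mbit/s (roughly 4.6 MB/s), and a 32 MHz clock tops out at about 26 full frames per second.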

If SPI is the only choice for you but transfer performance is still limiting, you might have luck with the fbcp-ili9341 driver which implements several optimizations to greatly reduce the amount of data being transferred (at the cost of working around some of the kernel's stack).

Conclusion

On a Raspbian system built from the package repositories rather than the official Raspberry Pi OS firmware, using Anholt's VC4 driver for KMS/DRM and Mesa is the most efficient and least painful way to go. The built-in support in Linux means that everything, from the X server to the Wayland server to accelerated applications, will automatically use it.

However, there seem to be backend problems with the X server's Glamor, which result in a dramatic loss of performance for applications that perform indirect rendering with X primitives, as well as window management.

For a smoother hardware-accelerated ride, I thus suggest using a Wayland server with Wayland-supporting programs, all of which have acceleration built-in. XWayland will support direct-rendering X programs without a problem, despite not helping with the indirect-rendering ones because of the same Glamor issues.

In any case, good luck if you're attempting any of this! 😁
