The Three Reasons Why GPUs Didn’t Work Out for You

Modern GPUs are powerful beasts, increasingly capable of taking on tasks that were previously not their forte. It may sound like a gross oversimplification to equate GPUs with high performance (and it is also unfair towards FPGAs). However, I still advise companies dealing with complex computations to investigate whether GPUs could have an application in their workflow. Often, we are talking about a potential speedup of an order of magnitude or more compared to multi-threaded CPU programs, which is nothing to scoff at.

Throughout my career, I have been lucky enough to work with companies that were well aware of the potential advantages. However, on many occasions, my involvement started after a failed initial attempt. In my experience, most of these failures originate from the misguided expectations of GPU-naïve companies, as well as from incomplete scopes for initial investigations, which lead to design choices that severely hinder the chance of unlocking the full potential of GPUs. In this article, I have tried to list some of my favourite pet peeves that can easily lead to disappointment.

1) You Didn’t Optimise the Rest of Your Application Enough

Most GPU-naïve companies would like to think of GPUs as CPUs with many more cores and wider SIMD lanes, but unfortunately, that understanding misses some crucial differences. CPU cores are so-called latency-oriented designs, built to deliver each result sooner, whilst GPUs are throughput-oriented designs, meaning that they are intended to deliver more results over the same time. Although the two things may seem very similar, let me illustrate the difference with a practical example involving our other favourite performance beasts: cars!

Let’s say we have a 100 km distance between cities A and B. If we allowed a speed limit of 100 km/h on a single lane, then from one car’s point of view the drive would take 1 hour, so this is our latency figure. If we reduced the speed limit to 50 km/h, it is easy to see that it would double our latency. However, if we had four lanes instead of one, then, assuming heavy traffic, even though each car’s drive would take twice as long, twice as many cars could get from city A to B over the same time; hence our throughput of vehicles would be twice as much.
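The arithmetic behind the road example can be sketched in a few lines of Python. This is only a toy model: the `road_stats` helper and the traffic density of 20 cars per km per lane are my own illustrative assumptions (the density cancels out when comparing the two roads), and "heavy traffic" is taken to mean that every lane is filled at that same density.

```python
def road_stats(distance_km, speed_kmh, lanes, cars_per_km_per_lane):
    """Return (latency in hours, throughput in cars per hour) for a road."""
    # Latency: how long one car needs to get from city A to city B.
    latency_h = distance_km / speed_kmh
    # Throughput: cars passing a fixed point per hour, summed over all lanes.
    throughput = lanes * speed_kmh * cars_per_km_per_lane
    return latency_h, throughput


# Hypothetical density; only the ratio between the two roads matters.
DENSITY = 20  # cars per km per lane

fast_single = road_stats(100, 100, lanes=1, cars_per_km_per_lane=DENSITY)
slow_wide = road_stats(100, 50, lanes=4, cars_per_km_per_lane=DENSITY)

print(fast_single)  # (1.0, 2000)
print(slow_wide)    # (2.0, 4000)
```

The wide, slow road doubles the latency and doubles the throughput at the same time, which is exactly the trade a throughput-oriented design makes.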

This example should make it clear that expecting the same overall program design to become magically faster is a bit naïve: we are not talking about generally quicker hardware, but about hardware of a different kind.

All those lanes will not make your car faster, but…

Returning to our previous example, having four lanes between City A and City B will not help our poor commuters much if the roads of City A are so jammed that they cannot fill up all four lanes with cars. It is usually the “highway” part that is the target of optimisation, and the rest of the program, which may deal with seemingly uninteresting things like reading and writing files in standard formats, could easily become the new bottleneck as a result.
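One way to put a number on the jammed-city effect is Amdahl's law, which the text does not name but which describes the same situation: if only a fraction of the runtime is accelerated, the untouched remainder caps the overall gain. The sketch below assumes, purely for illustration, that the "highway" part accounts for 80% of the original runtime; the function name is hypothetical.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Whole-program speedup when only part of the runtime gets faster."""
    # The part we did not touch (file I/O, parsing, ...) keeps its cost.
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Assumed split: 80% of the original runtime is the "highway" part.
for kernel_speedup in (10, 100, 1000):
    print(kernel_speedup, round(overall_speedup(0.8, kernel_speedup), 2))
# 10 3.57
# 100 4.81
# 1000 4.98
```

Under this assumption, no amount of GPU horsepower gets the whole application past 5x until the remaining 20% is dealt with as well.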

Asynchronous, pipeline-oriented designs of (almost) the whole application, which would otherwise be considered nice-to-haves or “advanced optimisations” for CPU programs, are 101-level prerequisites for software to stand a chance of fully utilising ever-more powerful discrete GPUs.
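To make "pipeline-oriented" a little more concrete, here is a minimal structural sketch in Python using only the standard library. Each stage runs in its own thread and hands results downstream through a small bounded queue, so reading, computing and writing can overlap instead of taking turns. The stage functions (`decode`, `compute`, `encode`) are hypothetical stand-ins, and the sketch shows the shape of the design only; it says nothing about real performance, least of all in Python.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(items, stages, depth=4):
    """Run each stage in its own thread, connected by bounded queues."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks when the first stage falls behind
        queues[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    # Drain the last queue on the calling thread until the sentinel arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


def decode(x):   # stand-in for reading and parsing input files
    return x + 1


def compute(x):  # stand-in for the GPU kernel
    return x * x


def encode(x):   # stand-in for formatting and writing output
    return str(x)


print(run_pipeline(range(5), [decode, compute, encode]))
# ['1', '4', '9', '16', '25']
```

The bounded queues are the important design choice: they let a slow stage apply back-pressure to the stages feeding it, rather than letting intermediate results pile up in memory.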

2) You Expected Too Much From GPU-backed Libraries, Tools and Wrappers

GPU programming used to be an arcane art of the lucky few who could pull it off, the first being graphics programmers who could achieve beautiful things by bastardising the shading languages that were available back then. With the advent of GPGPU platforms like CUDA and OpenCL, the programming model became more direct and closer to the hardware, allowing for more optimisation opportunities. Since then, the focus has been on expanding the ecosystem and increasing the accessibility of GPU programming. Although this has many advantages from the perspective of democratising access to technology, it also creates the false impression that GPU programming is no longer a “no pain, no gain” genre.

For instance, many libraries have some level of GPU acceleration under the hood. Still, they don’t always come with a consistent API-level design that would enable the user to at least minimise transfers over the PCI-Express bus. Even if that were taken care of, kernel fusion opportunities, which reduce the number of video memory accesses, remain way out of reach. Since, by and large, most optimised GPU programs end up being bandwidth-limited, that’s quite a hit in itself.
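A back-of-the-envelope model shows why missing out on kernel fusion hurts a bandwidth-limited program. The sketch below is an assumption-laden simplification: it treats a chain of element-wise operations where each unfused library call reads its whole input from video memory and writes its whole output back, while a fused kernel reads once and writes once. The function name and the numbers are hypothetical, and caches are ignored.

```python
def video_memory_traffic_bytes(n_elements, n_ops, bytes_per_element=4, fused=False):
    """Bytes read plus written for a chain of element-wise operations."""
    # Unfused: one full pass over the data per operation.
    # Fused: a single pass, intermediates stay in registers.
    passes = 1 if fused else n_ops
    return passes * 2 * n_elements * bytes_per_element  # one read + one write


n = 1 << 24  # about 16.8 million float32 values, i.e. 64 MiB of data
unfused = video_memory_traffic_bytes(n, n_ops=5)
fused = video_memory_traffic_bytes(n, n_ops=5, fused=True)

print(unfused // 2**20, "MiB vs", fused // 2**20, "MiB")
# 640 MiB vs 128 MiB
```

If the program is limited by memory bandwidth, five times the traffic translates fairly directly into roughly five times the runtime for that part, no matter how well each individual library call is tuned.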

An even harsher version of the illusion of accessibility is the one offered by wrappers that enable access to GPUs from languages that (in the context of high-performance computations) are mainly meant for fast prototyping rather than operational code. Of course, accessing CUDA from existing Python code is a fantastic way of accelerating early development. Yet, realistically, how much can you conclude about the final performance from an environment so easily bottlenecked by the Global Interpreter Lock, if not by the painful raw performance of Python alone? Still, many engineering teams attempt to do just that to determine whether further work would be worthwhile or whether the initiative must be shot down.

DALL-E demonstrating my point quite literally

Car analogy again: moving intermediate results unnecessarily between computation steps (and, to top it off, with Python/OpenCV) is not far off from towing a Lamborghini with a horse.

3) You Did Only Half of the Job

It might seem tempting to see GPU optimisation as a natural extension of the iterative optimisation process that started on CPUs, always directing attention to the hottest hotspot only. The problem with this attitude is that it might lead to a fragmented design, with an excessive number of CPU-GPU round trips and synchronisation points. There might indeed be parts of every algorithm that, in themselves, would not make much sense to port to the GPU, as they are more sequential in nature (remember, one fast lane may win the race!), yet sometimes a locally suboptimal solution can lead to better overall performance for the whole system. Under ideal circumstances, GPUs should be fed from and feed into pipelines asynchronously. Every diversion from this logic may have unexpected consequences. I mean, not that unexpected; otherwise, I would not have raised this point…
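The trade-off between a locally optimal and a globally optimal choice can be illustrated with a tiny cost model. All figures below are made-up illustrations rather than measurements, and both helper functions are hypothetical: one models leaving a sequential step on the CPU (copy the data out, synchronise, compute, copy it back), the other models running the same step on the GPU, slower in isolation but without the round trip.

```python
def hybrid_time_ms(cpu_step_ms, transfer_ms_each_way, sync_ms):
    """Sequential step stays on the CPU: copy out, compute, copy back."""
    return 2 * transfer_ms_each_way + sync_ms + cpu_step_ms


def gpu_only_time_ms(cpu_step_ms, gpu_slowdown):
    """Same step runs on the GPU: slower in isolation, but no round trip."""
    return cpu_step_ms * gpu_slowdown


# Made-up illustrative figures, not measurements.
hybrid = hybrid_time_ms(cpu_step_ms=1.0, transfer_ms_each_way=2.0, sync_ms=0.5)
gpu_only = gpu_only_time_ms(cpu_step_ms=1.0, gpu_slowdown=3.0)

print(hybrid, gpu_only)  # 5.5 3.0
```

With these assumed numbers, the step that is three times slower on the GPU still wins, because the transfers and the synchronisation point cost more than the slowdown. The model also leaves out a second-order effect: the synchronisation point stalls the rest of the pipeline, which usually makes the hybrid variant look even worse in practice.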

Am I advocating porting as much of the core calculations as possible to GPUs? Yes, but not in the ordinary sense. Let’s imagine you had a sorting algorithm in the middle of the original design, and it wasn’t really a problem on a CPU. The exact same algorithm may not be the best fit on a GPU. Should you turn back? No, you should just ask more questions. In this particular case, after some research, you may drop quicksort and use a massively parallel version of merge sort instead. Still, the ultimate takeaway is that as long as you keep an open mind and do not aim to port everything the same way, but allow yourself to reinvent bits and pieces of the original concept, you stand a chance of ending up with a purely GPU-friendly design that fully utilises even the most giant GPUs out there. No car analogies here; seriously, just do the whole job properly!
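To hint at why merge sort lends itself to this kind of reinvention, here is a plain, sequential Python sketch of a bottom-up merge sort. It is not a GPU implementation and not necessarily the variant one would end up with after proper research; it only shows the structural property that matters: within each pass, every pair of runs is merged independently of all the others, so a pass could in principle be spread across many parallel workers. Production GPU sorts go further and also parallelise the work inside each individual merge.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list (stable)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def bottom_up_merge_sort(values):
    """Sort by repeatedly merging neighbouring runs, doubling the run length."""
    runs = [[v] for v in values]  # start with runs of length one
    while len(runs) > 1:
        # Pair up neighbouring runs; an odd run at the end is paired with [].
        pairs = [
            (runs[k], runs[k + 1] if k + 1 < len(runs) else [])
            for k in range(0, len(runs), 2)
        ]
        # Each merge in this pass touches only its own pair, so all of them
        # are independent and could run in parallel.
        runs = [merge(a, b) for a, b in pairs]
    return runs[0] if runs else []


print(bottom_up_merge_sort([5, 2, 9, 1, 5, 6, 0]))
# [0, 1, 2, 5, 5, 6, 9]
```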

After reading this article, it may have become apparent that GPU programming is not so much a problem of learning new tools and new languages as of being intimately familiar with hardware architectures and our old friends: algorithms and data structures. This is the main reason we, at TechnoLynx, unlike most companies, do not separate the roles of an algorithm researcher and an embedded programmer. Instead, we are trying to recruit and train people capable of confidently addressing both angles of the usual acceleration problems. Should you have any difficult (over here at TechnoLynx, also known as “fun”) acceleration problem to solve, we’d be more than happy to hear about it, and in the meantime, please follow us on LinkedIn and Medium.com. We are only getting started on sharing our learnings with the community!

