The Three Reasons Why GPUs Didn't Work Out for You

Written by TechnoLynx, published on 01 Feb 2023

Modern GPUs are powerful beasts, increasingly capable of taking on tasks that were previously not their forte. It may sound like a gross oversimplification to equate GPUs with high performance (and it is also unfair towards FPGAs). Even so, I advise companies dealing with complex computations to investigate whether GPUs could have a place in their workflow. Often we are talking about a potential speedup of an order of magnitude or more compared to multi-threaded CPU programs, which is nothing to scoff at.

Throughout my career, I was lucky enough to work with companies that were well aware of these potential advantages. However, on many occasions my involvement started after a failed initial attempt. In my experience, most of these failures originate from the misguided expectations of GPU-naïve companies, as well as from initial investigations whose incomplete scope led to design choices that severely hinder the chance of unlocking the full potential of GPUs. In this article, I have listed some of my favourite pet peeves that can easily lead to disappointment.

1) You Didn’t Optimise the Rest of Your Application Enough

Most GPU-naïve companies would like to think of GPUs as CPUs with many more cores and wider SIMD lanes, but unfortunately, that understanding misses some crucial differences. CPU cores are so-called latency-oriented designs, built to deliver individual results faster, whilst GPUs are throughput-oriented designs, built to deliver more results over the same time. Although the two may seem very similar, let me illustrate the difference with a practical example involving our other favourite performance beasts: cars!

Let’s say we have a 100 km distance between cities A and B. If the speed limit is 100 km/h and we have a single lane, then from one car’s point of view the drive takes 1 hour; this is our latency figure. If we reduced the speed limit to 50 km/h, it is easy to see that our latency would double. However, if we also had four lanes instead of one, then (assuming heavy traffic) even though each car’s drive would take twice as long, twice as many cars could get from city A to city B over the same time; hence our throughput of vehicles would double.

This example should make it clear that expecting to keep the same overall program design and have it magically run faster is a bit naïve: we are not talking about generally quicker hardware, but about hardware of a different kind.

All those lanes will not make your car faster, but…

Returning to our previous example, having four lanes between city A and city B will not help our poor commuters much if the roads of city A are so jammed that they cannot fill all four lanes with cars. It is usually the “highway” part that is the target of optimisation, and the rest of the program, which may deal with seemingly uninteresting things like reading and writing files in standard formats, can easily become the new bottleneck as a result.

An asynchronous, pipeline-oriented design of (almost) the whole application, which would otherwise be considered a nice-to-have or an “advanced optimisation” for a CPU program, is a 101-level prerequisite for software to stand a chance of fully utilising ever-more powerful discrete GPUs.
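To make this concrete, below is a minimal sketch (mine, not a recipe) of what a pipeline-oriented design can look like in CUDA: the input is split into chunks, each chunk gets its own stream, and the uploads, kernels and downloads of different chunks are allowed to overlap. The kernel, the chunk count and the sizes are made up purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in kernel for the real per-element work.
__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int chunks = 4;
    const int chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    float* hostBuf = nullptr;
    float* devBuf  = nullptr;
    cudaMallocHost((void**)&hostBuf, chunks * chunkBytes); // pinned memory, needed for async copies
    cudaMalloc((void**)&devBuf, chunks * chunkBytes);

    cudaStream_t streams[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

    // Each chunk is queued on its own stream, so the upload of chunk c+1 can
    // overlap with the kernel of chunk c and the download of chunk c-1.
    for (int c = 0; c < chunks; ++c) {
        float* h = hostBuf + c * chunkElems;
        float* d = devBuf  + c * chunkElems;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[c]);
        process<<<(chunkElems + 255) / 256, 256, 0, streams[c]>>>(d, chunkElems);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    printf("processed %d chunks\n", chunks);
    return 0;
}
```

The essential ingredients are pinned host memory (asynchronous copies require it) and per-chunk streams; the same idea applies whether the chunks come from a file reader, a camera or a network socket.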

2) You Expected Too Much From GPU-backed Libraries, Tools and Wrappers

Originally, GPU programming was an arcane art of the lucky few who could pull it off, the first of them being graphics programmers who achieved beautiful things by bastardising the shading languages available back then. With the advent of GPGPU platforms like CUDA and OpenCL, the programming model became more direct and closer to the hardware, allowing for more optimisation opportunities. Since then, the focus has been on expanding the ecosystem and increasing the accessibility of GPU programming. Although this has many advantages from the perspective of democratising access to the technology, it also creates the false impression that GPU programming is no longer a pain-and-gain genre.

GPU programming is no joke

For instance, many libraries have some level of GPU acceleration under the hood. Still, they don’t always come with a consistent design at the API level that would enable the user to at least minimise transactions over the PCI-Express bus. Even if that were taken care of, kernel-fusion opportunities that reduce the number of video-memory accesses remain way out of reach. Since, by and large, most optimised GPU programs end up bandwidth-limited, that is quite a hit in itself.
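To show what kernel fusion actually buys you, here is a hedged toy example in CUDA; the kernels and the operation are invented for the sake of illustration and are not taken from any particular library. The unfused pair makes two full read-and-write passes over global memory, whilst the fused kernel makes one, roughly halving memory traffic for a bandwidth-limited workload.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Unfused: two kernels, each doing a full read + write pass over global memory.
__global__ void scale(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

__global__ void add_bias(float* x, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

// Fused: one read and one write per element.
__global__ void scale_add_bias(float* x, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * a + b;
}

int main()
{
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    scale<<<blocks, 256>>>(d, 2.0f, n);                 // library-style call #1
    add_bias<<<blocks, 256>>>(d, 1.0f, n);              // library-style call #2
    scale_add_bias<<<blocks, 256>>>(d, 2.0f, 1.0f, n);  // what a fused kernel would do instead
    cudaDeviceSynchronize();

    cudaFree(d);
    printf("done\n");
    return 0;
}
```

A library that exposes these steps only as separate black-box calls simply cannot give you the fused version; only a design that sees both steps at once can.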

An even harsher version of the illusion of accessibility is the one offered by wrappers that enable access to GPUs from languages which (in the context of high-performance computation) are mainly meant for fast prototyping rather than operational code. Of course, accessing CUDA from existing Python code is a fantastic way of accelerating early development. Yet, realistically, how much can you conclude about the final performance from an environment so easily bottlenecked by the Global Interpreter Lock, if not by the painful raw performance of Python itself? Still, many engineering teams attempt to do just that in order to determine whether further work would be worthwhile or the initiative should be shot down.

DALL-E demonstrating my point quite literally

Car analogy again: moving intermediate results unnecessarily between computation steps (and, to top it off, with Python/OpenCV) is not far off from towing a Lamborghini with a horse.

3) You Did Only Half of the Job

It might seem tempting to see GPU optimisation as a natural extension of the iterative optimisation process that started on CPUs, always directing attention to the hottest hotspot only. The problem with this attitude is that it can lead to a fragmented design, with a sub-optimal number of CPU-GPU round trips and synchronisation points. There may indeed be parts of every algorithm that, in themselves, would not make much sense to port to the GPU, as they are more sequential in nature (remember, one fast lane may win the race!); yet sometimes a locally suboptimal solution leads to better overall performance for the whole system. Under ideal circumstances, GPUs should be fed from, and should feed into, pipelines asynchronously. Every diversion from this logic may have unexpected consequences. I mean, not that unexpected; otherwise, I would not have raised this point…
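The sketch below contrasts the two approaches; stepA and stepB are hypothetical stages standing in for a real algorithm, and the point is only where the intermediate result lives, not what the kernels compute.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Two hypothetical stages of a larger algorithm.
__global__ void stepA(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void stepB(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 0.5f;
}

// Hotspot-by-hotspot porting: the intermediate result bounces over PCI-Express
// twice and forces a synchronisation point between the two stages.
void fragmented(float* host, float* dev, int n)
{
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    stepA<<<blocks, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // needless round trip
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); // and back again
    stepB<<<blocks, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
}

// Whole-pipeline porting: the intermediate result stays resident on the GPU,
// even if stepB alone would not have been "worth porting".
void resident(float* host, float* dev, int n)
{
    const size_t bytes = n * sizeof(float);
    const int blocks = (n + 255) / 256;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    stepA<<<blocks, 256>>>(dev, n);
    stepB<<<blocks, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
}

int main()
{
    const int n = 1 << 20;
    float* host = new float[n]();
    float* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(float));

    fragmented(host, dev, n);
    resident(host, dev, n);

    cudaFree(dev);
    delete[] host;
    printf("done\n");
    return 0;
}
```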

Am I advocating porting as much of the core calculation as possible to GPUs? Yes, but not in the ordinary sense. Let’s imagine you had a sorting algorithm in the middle of the original design that was never really a problem on the CPU. The exact same algorithm may not be the best fit on the GPU. Should you turn back? No, you should just ask more questions. In this particular case, after some research you might drop quicksort and use a massively parallel merge sort instead. The ultimate takeaway is that as long as you keep an open mind and do not aim to port everything the same way, but allow yourself to reinvent bits and pieces of the original concept, you stand a chance of ending up with a genuinely GPU-friendly design, fully utilising even the largest GPUs out there. No car analogies here; seriously, just do the whole job properly!
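As an illustration of that mindset, here is a minimal sketch that hands the sorting step to Thrust’s parallel device sort instead of attempting a one-to-one port of quicksort; the data and sizes are, of course, made up.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(rand()) / RAND_MAX;

    thrust::device_vector<float> d = h;   // copy to the GPU once
    thrust::sort(d.begin(), d.end());     // massively parallel sort, no quicksort port needed
    thrust::copy(d.begin(), d.end(), h.begin());

    printf("min: %f max: %f\n", (float)h[0], (float)h[n - 1]);
    return 0;
}
```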

After reading this article, it may have become apparent that GPU programming is not so much a problem of learning new tools and new languages as of being intimately familiar with hardware architectures and our old friends: algorithms and data structures. This is the main reason why we at TechnoLynx, unlike most companies, do not separate the roles of algorithm researcher and embedded programmer. Instead, we try to recruit and train people capable of confidently addressing both angles of the usual acceleration problems. Should you have any difficult (over here at TechnoLynx, also known as “fun”) acceleration problem to solve, we would be more than happy to hear about it; in the meantime, please follow us on LinkedIn and Medium.com. We are only getting started on sharing our learnings with the community!
