The Three Reasons Why GPUs Didn’t Work Out for You

Written by TechnoLynx. Published on 01 Feb 2023.

Modern GPUs are powerful beasts, increasingly capable of taking on tasks that were previously not their forte. It may sound like a gross oversimplification to equate GPUs with high performance (and also unfair towards FPGAs). However, I still advise companies dealing with complex computations to investigate whether GPUs could have an application in their workflow. Often, we are talking about a potential speedup of an order of magnitude or more compared to multi-threaded CPU programs, which is nothing to scoff at.

Throughout my career, I was lucky enough to work with companies that were well aware of the potential advantages. However, on many occasions, my involvement started after a failed initial attempt. In my experience, most of these failures originate from the misguided expectations of GPU-naïve companies, and from initial investigations scoped too narrowly, which lead to design choices that severely hinder the chance of unlocking the full potential of GPUs. In this article, I have listed some of my favourite pet peeves, each of which can easily lead to disappointment.

1) You Didn’t Optimise the Rest of Your Application Enough

Most GPU-naïve companies would like to think of GPUs as CPUs with many more cores and wider SIMD lanes, but unfortunately, that understanding misses some crucial differences. CPU cores are so-called latency-oriented designs, built to deliver each individual result sooner, whilst GPUs are throughput-oriented designs, built to deliver more results over the same period of time. Although the two may seem very similar, let me illustrate the difference with a practical example involving our other favourite performance beasts: cars!

Let’s say cities A and B are 100km apart. With a speed limit of 100km/h and a single lane, the drive takes 1 hour from one car’s point of view, so this is our latency figure. If we reduced the speed limit to 50km/h, it is easy to see that our latency would double. However, if we also had four lanes instead of one, then assuming heavy traffic with the same spacing between cars, each car’s drive would still take twice as long, yet twice as many cars could get from city A to B in any given hour: four lanes at half the speed. Hence our throughput of vehicles would be twice as much.
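For those who prefer to see the arithmetic, here is a minimal sketch of the road example. The function name and the traffic density of 20 cars per km per lane are my own assumptions for illustration; any density gives the same ratio as long as it is equal in both scenarios.

```python
def road_stats(distance_km, speed_kmh, lanes, cars_per_km_per_lane):
    """Return (latency in hours, throughput in cars per hour) for a saturated road."""
    latency_h = distance_km / speed_kmh
    # In steady heavy traffic, the cars passing a point per hour in one lane
    # equal the car density multiplied by the speed.
    throughput = lanes * cars_per_km_per_lane * speed_kmh
    return latency_h, throughput


# Assumed density: 20 cars per km per lane in both scenarios.
fast_single = road_stats(100, 100, lanes=1, cars_per_km_per_lane=20)
slow_wide = road_stats(100, 50, lanes=4, cars_per_km_per_lane=20)

print(fast_single)  # (1.0, 2000)
print(slow_wide)    # (2.0, 4000)
```

The second road is worse for every individual driver and better for the city as a whole, which is roughly the trade a GPU makes.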

This example should make it clear that keeping the same overall program design and expecting it to be magically faster is a bit naïve: we are not talking about generally quicker hardware, but about a different kind of hardware.

All those lanes will not make your car faster, but…

Returning to our previous example, having four lanes between City A and City B will not help our poor commuters much if the roads of City A are so jammed that they cannot fill up all four lanes with cars. It is usually the “highway” part that is the target of optimisation, and the rest of the program, which may deal with seemingly uninteresting things like reading and writing files in standard formats, can easily become the new bottleneck as a result.
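How quickly the rest of the program takes over can be estimated with Amdahl’s law, which I am bringing in here as a back-of-the-envelope tool. The 80/20 split below is a hypothetical profile, not a measurement.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only part of the runtime is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Hypothetical profile: 80% of the runtime is the "highway",
# the other 20% is file reading/writing and similar housekeeping.
for kernel_speedup in (10, 100, 1000):
    print(kernel_speedup, round(overall_speedup(0.8, kernel_speedup), 2))
# 10 3.57
# 100 4.81
# 1000 4.98
```

With this profile, no GPU can ever deliver more than a 5x overall speedup, no matter how fast the highway becomes, until the remaining 20% is dealt with as well.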

Asynchronous, pipeline-oriented designs of (almost) the whole application, which would otherwise be considered nice-to-haves or “advanced optimisations” for CPU programs, are 101-level prerequisites for software to stand a chance of fully utilising ever more powerful discrete GPUs.
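To make “pipeline-oriented” a little more concrete, below is a minimal sketch of the shape such a design takes: every stage runs in its own thread and hands items over through bounded queues, so that reading, computing and writing can overlap. All names here (`stage`, `run_pipeline`, and the three stand-in lambdas) are hypothetical, and the snippet shows the structure only. In Python, the stages overlap in practice only where they release the interpreter lock, such as during file I/O or while waiting on a device.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(work(item))


def run_pipeline(items, stages, depth=4):
    """Run `stages` concurrently, one thread each, linked by bounded queues."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    results = []

    def drain():
        # Collect finished items so the last queue never blocks the pipeline.
        while True:
            item = queues[-1].get()
            if item is SENTINEL:
                return
            results.append(item)

    threads.append(threading.Thread(target=drain))
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)  # blocks when the pipeline is full (back-pressure)
    queues[0].put(SENTINEL)
    for t in threads:
        t.join()
    return results


decode = lambda x: x + 1   # stand-in for reading and decoding input
compute = lambda x: x * x  # stand-in for launching the GPU work
encode = lambda x: str(x)  # stand-in for encoding and writing output

print(run_pipeline(range(5), [decode, compute, encode]))
# ['1', '4', '9', '16', '25']
```

The bounded queues are a deliberate choice: they keep a fast producer from flooding memory and make it obvious which stage is the jammed city, because its inbox is the one that stays full.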

2) You Expected Too Much From GPU-backed Libraries, Tools and Wrappers

Originally, GPU programming was an arcane art of the lucky few who could pull it off, the first being graphics programmers who achieved beautiful things by bastardising the shading languages available back then. With the advent of GPGPU platforms like CUDA and OpenCL, the programming model became more direct and closer to the hardware, allowing for more optimisation opportunities. Since then, the focus has been on expanding the ecosystem and increasing the accessibility of GPU programming. Although this has many advantages from the perspective of democratising access to technology, it also creates the false impression that GPU programming is no longer a “pain & gain” genre.

GPU programming is no joke

For instance, many libraries have some level of GPU acceleration under the hood. Still, they don’t always come with a consistent API design that would let the user at least minimise transactions over the PCI-Express bus. Even where that is taken care of, kernel fusion opportunities, which reduce the number of video memory accesses by combining several operations into a single kernel, typically remain way out of reach. Since, by and large, most optimised GPU programs end up bandwidth-limited, that is quite a hit in itself.
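A rough way to see what missed kernel fusion costs is to count the bytes that travel to and from video memory. The sketch below is a simplified model under my own assumptions: a chain of element-wise operations with one input and one output each, no intermediate results surviving in cache, and a hypothetical buffer of 16M float32 values.

```python
def memory_traffic_bytes(n_elements, bytes_per_element, n_ops, fused):
    """Bytes moved to/from video memory for a chain of element-wise operations."""
    if fused:
        # One kernel: read the input once, write the final output once.
        passes = 2
    else:
        # One kernel per operation: each reads its input and writes its output.
        passes = 2 * n_ops
    return n_elements * bytes_per_element * passes


n = 1 << 24  # 16M float32 values, a hypothetical buffer size
unfused = memory_traffic_bytes(n, 4, n_ops=3, fused=False)
fused = memory_traffic_bytes(n, 4, n_ops=3, fused=True)

print(unfused // 2**20, fused // 2**20, unfused / fused)
# 384 128 3.0
```

For a bandwidth-limited program, three times the traffic translates to roughly three times the runtime for this part of the work, and that is before counting any trips over the PCI-Express bus.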

An even harsher version of the illusion of accessibility is the one offered by wrappers that give access to GPUs from languages that (in the context of high-performance computations) are mainly meant for fast prototyping rather than operational code. Of course, accessing CUDA from existing Python code is a fantastic way of accelerating early development. Yet, realistically, how much can you conclude about the final performance from an environment so easily bottlenecked by the Global Interpreter Lock, if not by the painful raw performance of Python alone? Still, many engineering teams attempt to do just that, to determine if further work would be worthwhile or if the initiative must be shot down.

DALL-E demonstrating my point quite literally

Car analogy again: moving intermediate results unnecessarily between computation steps (and, to top it off, with Python/OpenCV) is not far off from towing a Lamborghini with a horse.

3) You Did Only Half of the Job

It might seem tempting to see GPU optimisation as a natural extension of the iterative optimisation process that started on CPUs, always directing attention to the hottest hotspot only. The problem with this attitude is that it might lead to a fragmented design, with an excessive number of CPU-GPU round trips and synchronisation points. There might indeed be parts of every algorithm that, in themselves, would not make much sense to port to the GPU, as they are more sequential in nature (remember, one fast lane may win the race!). Yet, sometimes a locally suboptimal solution can lead to better overall performance for the whole system. Under ideal circumstances, GPUs should be fed from and feed into pipelines asynchronously. Every diversion from this logic may have unexpected consequences. I mean, not that unexpected; otherwise, I would not have raised this point…
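A toy cost model shows how a locally suboptimal choice can win overall. Every number below is made up for illustration: a transfer-plus-synchronisation cost of 5 ms, and a sequential middle step that takes 1 ms on the CPU but 4 ms on the GPU.

```python
def total_time_ms(steps, transfer_ms):
    """Sum step times, adding one transfer whenever consecutive steps change device."""
    total = 0.0
    device = "cpu"  # data starts in host memory and must end up there again
    for where, cost_ms in steps:
        if where != device:
            total += transfer_ms
            device = where
        total += cost_ms
    if device != "cpu":
        total += transfer_ms
    return total


TRANSFER_MS = 5.0  # hypothetical cost of one PCIe copy plus synchronisation

# The sequential middle step runs where it is fastest in isolation.
fragmented = [("gpu", 2.0), ("cpu", 1.0), ("gpu", 2.0)]
# The same step ported to the GPU, where it is four times slower in isolation.
all_gpu = [("gpu", 2.0), ("gpu", 4.0), ("gpu", 2.0)]

print(total_time_ms(fragmented, TRANSFER_MS))  # 25.0
print(total_time_ms(all_gpu, TRANSFER_MS))     # 18.0
```

The model ignores overlap between copies and computation, which a well-pipelined design would exploit, but the direction of the result is the point: the slower step on the GPU is still cheaper than the two extra trips across the bus.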

Am I advocating porting as much of the core calculations as possible to GPUs? Yes, but not in the ordinary sense. Let’s imagine you had a sorting algorithm in the middle of the original design, one that was never really a problem on a CPU. The exact same algorithm may not be the best fit on a GPU. Should you turn back? No, you should just ask more questions. In this particular case, after some research, you may drop quicksort and use a massively parallel version of merge sort instead. Still, the ultimate takeaway is that as long as you keep an open mind and do not aim to port everything the same way, but allow yourself to reinvent bits and pieces of the original concept, you stand a chance of ending up with a pure GPU-friendly design that fully utilises even the largest GPUs out there. No car analogies here; seriously, just do the whole job properly!
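To hint at why merge sort is the friendlier candidate, here is a bottom-up variant written sequentially. This is an illustrative sketch rather than a GPU implementation: it only exposes the structure, namely that within each pass every pair of runs is merged independently of all the others. Serious GPU versions go further and also split each individual merge across many threads.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def bottom_up_merge_sort(values):
    """Sort by repeatedly merging adjacent runs of width 1, 2, 4, ..."""
    data = list(values)
    width = 1
    while width < len(data):
        merged = []
        # Every iteration of this loop works on a disjoint slice, so on a GPU
        # each pair of runs could be handed to its own thread or thread block.
        for start in range(0, len(data), 2 * width):
            left = data[start:start + width]
            right = data[start + width:start + 2 * width]
            merged.extend(merge(left, right))
        data = merged
        width *= 2
    return data


print(bottom_up_merge_sort([5, 2, 9, 1, 5, 6, 0]))
# [0, 1, 2, 5, 5, 6, 9]
```

There is no recursion and no data-dependent partitioning here, only a fixed number of passes over regular slices, which is the kind of shape that maps well onto thousands of lanes.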

After reading this article, it may have become apparent that GPU programming is less a matter of learning new tools and new languages than of being intricately familiar with hardware architectures and our old friends: algorithms and data structures. This is the main reason why we at TechnoLynx, unlike most companies, do not separate the roles of algorithm researcher and embedded programmer. Instead, we try to recruit and train people capable of confidently addressing both angles of the usual acceleration problems. Should you have any difficult (over here at TechnoLynx, also known as “fun”) acceleration problem to solve, we’d be more than happy to hear about it. In the meantime, please follow us on LinkedIn and Medium.com. We are only getting started on sharing our learnings with the community!
