Why do we need GPU in AI?

Yes, AI needs GPUs — but most teams overpay for the ones they buy. Profile utilisation before procurement to spot the hidden cost.

Written by TechnoLynx. Published on 16 Jul 2024.

Introduction

The short answer is yes — modern AI workloads need GPUs because the maths that drives training and inference (dense linear algebra over batched tensors) is exactly what a GPU is built to do in parallel. A CPU executes a few threads quickly; a GPU executes thousands of arithmetic operations simultaneously. For a neural network whose forward pass is dominated by matrix multiplies, that throughput difference is the difference between a model that trains in hours and one that trains in weeks.

The interesting question, though, is not whether you need GPUs. It is whether you are actually using the ones you have. Enterprise GPU infrastructure purchased for AI workloads routinely sits underutilised — memory bandwidth unused, compute cores idle during data transfer, batch sizes that leave most of the silicon on the table — and the cost compounds monthly. You cannot know what you are wasting until you profile.

What this means in practice

  • The reason GPUs accelerate AI is parallel arithmetic throughput on dense tensors, not a single magic primitive.
  • GPU-busy % is the most misleading utilisation metric in common use — a GPU can be “busy” doing memory operations while its compute units stall.
  • The honest cost metric is TCO per useful FLOP, not TCO per purchased FLOP.
  • Most cloud GPU spend approved for “more capacity” could be reduced or avoided by first profiling the workloads already running.

How do I calculate the true cost of an underutilised GPU fleet?

The true cost has three components most balance sheets do not separate. First, the capital or rental envelope — the sticker price of a purchased H100, or the per-hour rate of a rented instance. Second, the utilisation-adjusted denominator — what fraction of that capacity actually ran productive AI work, where productive is defined by your application’s success metric rather than by the operating system’s idle counter. Third, the opportunity cost — what the same money would have bought if redirected to optimising the existing fleet (better data pipelines, larger batch sizes, more aggressive kernel fusion) instead of adding capacity.

A defensible calculation looks like: monthly spend × (1 − fraction-of-time-doing-useful-work) = the hidden waste line item. For a typical enterprise AI team that has not profiled, that figure lands between 30% and 60% of GPU spend. It is rarely zero. The corrective question is not “do we need more GPUs” but “what are the existing ones doing when they are nominally working?”
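As a rough illustration of the formula above, the sketch below computes the hidden waste line item. The function name and the example inputs (a monthly spend of 100,000 and a useful-work fraction of 55%) are illustrative assumptions, not figures from any audit.

```python
def hidden_waste(monthly_spend: float, useful_fraction: float) -> float:
    """Return monthly spend x (1 - fraction of time doing useful work)."""
    if not 0.0 <= useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be between 0 and 1")
    return monthly_spend * (1.0 - useful_fraction)


if __name__ == "__main__":
    # Hypothetical inputs: 100,000 per month, 55% of time on useful work.
    waste = hidden_waste(100_000.0, 0.55)
    print(f"Hidden waste: {waste:,.0f} per month")
```

The hard part is not the arithmetic but the input: `useful_fraction` has to come from profiling against your application's success metric, not from an idle counter.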

What does “GPU utilisation” actually measure — and why is the GPU-busy percentage misleading?

The most commonly cited metric, nvidia-smi’s GPU-Util field, reports the percentage of time over the sample window during which one or more kernels were executing on the GPU. It says nothing about how many SMs (streaming multiprocessors) were active, what fraction of memory bandwidth was used, or whether the kernels achieved anything close to peak throughput. A GPU running a small kernel that occupies one SM at 5% achieved throughput will report GPU-Util: 100% for the duration. That number tells the operations team the device is fully loaded; it tells the engineering team almost nothing.

The metrics that matter live one layer deeper: SM occupancy (what fraction of the available warp slots are filled), memory bandwidth utilisation (what fraction of the HBM read/write capacity is consumed), Tensor Core utilisation for the AI-relevant fp16/bf16/int8 paths, and kernel time vs idle time per training step. Tools like NVIDIA Nsight Systems, NCCL profiler outputs, or PyTorch’s built-in profiler expose these directly. The first time a team profiles a workload against these metrics rather than GPU-Util, the typical finding is that “100% utilised” actually means “doing memory copies”.
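To make the gap between the two views concrete, here is a deliberately simplified model. It is not how nvidia-smi or Nsight compute their figures; the `Sample` structure, both function names, and the 132-SM device in the example are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One sampling interval (hypothetical structure for illustration)."""
    kernel_running: bool  # was at least one kernel executing?
    active_sms: int       # SMs that had work during the interval


def busy_percent(samples: list[Sample]) -> float:
    """Time-based view: share of intervals in which any kernel ran."""
    if not samples:
        raise ValueError("need at least one sample")
    return 100.0 * sum(s.kernel_running for s in samples) / len(samples)


def sm_weighted_percent(samples: list[Sample], total_sms: int) -> float:
    """Silicon-based view: share of SM-intervals that had work."""
    if not samples or total_sms <= 0:
        raise ValueError("need samples and a positive SM count")
    used = sum(s.active_sms for s in samples if s.kernel_running)
    return 100.0 * used / (total_sms * len(samples))


if __name__ == "__main__":
    # Assumed device with 132 SMs, running a kernel that occupies one SM.
    window = [Sample(kernel_running=True, active_sms=1) for _ in range(10)]
    print(f"busy: {busy_percent(window):.1f}%")
    print(f"SM-weighted: {sm_weighted_percent(window, 132):.2f}%")
```

In this toy window the time-based figure is at its maximum while the SM-weighted figure is below one percent, which is the situation described above.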

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Purchased FLOPs are advertised peak: an H100 SXM at fp8 with Tensor Cores is rated at ~3.96 PFLOPS with structured sparsity, and about half that for dense matrices. Useful FLOPs are what your specific workload achieves end-to-end on real data. End-to-end is the relevant qualifier: it has to include data loading, preprocessing, inter-GPU communication, gradient synchronisation, and the periods when the device is waiting for something else.

The calculation is mechanical once you have profiler output. Take the achieved FLOP rate for a representative training step (the profiler computes this), multiply by the fraction of wall-clock time the step actually ran (the remainder being data-loading stalls, communication waits, evaluation pauses, checkpointing), and divide your monthly amortised cost (capital depreciation plus power plus data-centre overhead, or cloud invoice) by that number. The result is your monthly cost per unit of sustained useful throughput, for example dollars per useful TFLOP/s per month; dividing again by the number of seconds in the month gives dollars per useful FLOP. Two organisations running the “same” model on the “same” hardware routinely differ by 3-5× on this metric, and the differential is almost entirely in the data pipeline and the kernel-launch overhead, not in the GPU itself.
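The steps above can be sketched in a few lines. This version expresses the result as cost per 10^18 useful floating-point operations, which is one possible normalisation rather than a standard unit. The function names, the 30-day month, and the example inputs are all assumptions for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def useful_flop_rate(achieved_flops_per_s: float, active_fraction: float) -> float:
    """Achieved FLOP/s during a step, scaled by the share of wall-clock
    time in which steps were actually running."""
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    return achieved_flops_per_s * active_fraction


def cost_per_useful_exaflop(monthly_cost: float,
                            achieved_flops_per_s: float,
                            active_fraction: float) -> float:
    """Monthly cost divided by the useful work done that month,
    measured in units of 1e18 floating-point operations."""
    rate = useful_flop_rate(achieved_flops_per_s, active_fraction)
    useful_ops = rate * SECONDS_PER_MONTH
    return monthly_cost / (useful_ops / 1e18)


if __name__ == "__main__":
    # Hypothetical: 2,000 per GPU-month, 400 TFLOP/s achieved, 50% active.
    print(cost_per_useful_exaflop(2_000.0, 400e12, 0.5))
```

Because the active fraction sits in the denominator, halving stall time has the same effect on this metric as halving the invoice.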

Which workload patterns most often leave GPU capacity on the table?

Four patterns account for most of the waste we see in production AI infrastructure:

  • Small batches: fixed by either increasing the batch size until throughput stops improving or device memory runs out, or by switching to a model architecture whose forward pass scales better.
  • CPU-bound data pipelines: fixed by moving augmentation onto the GPU, increasing dataloader workers, or pre-processing data offline.
  • Synchronous evaluation between training steps: fixed by overlapping evaluation with the next training step on a separate stream, or by moving evaluation off the training device entirely.
  • Distributed training with poorly tuned collective communication: fixed by checking the NCCL ring/tree topology against the physical interconnect, and by sizing gradient buckets to hide allreduce behind backward computation.

None of these fixes require buying anything. All of them require profiling first to know which one applies.
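One way to see which pattern applies is to break a profiled training step into where the wall-clock time went. The sketch below assumes you have already extracted those timings from a profiler; the component names and the example numbers are hypothetical.

```python
def time_shares(step_seconds: dict[str, float]) -> dict[str, float]:
    """Return each component's share of total step wall-clock time."""
    total = sum(step_seconds.values())
    if total <= 0:
        raise ValueError("step timings must sum to a positive value")
    return {name: t / total for name, t in step_seconds.items()}


def dominant_stall(step_seconds: dict[str, float],
                   compute_key: str = "gpu_compute") -> str:
    """Name the largest component of the step that is not GPU compute."""
    stalls = {k: v for k, v in step_seconds.items() if k != compute_key}
    if not stalls:
        raise ValueError("no non-compute components supplied")
    return max(stalls, key=stalls.get)


if __name__ == "__main__":
    # Hypothetical per-step timings in seconds.
    step = {
        "gpu_compute": 0.120,
        "data_wait": 0.090,          # CPU-bound data pipeline
        "eval_pause": 0.010,         # synchronous evaluation
        "exposed_allreduce": 0.030,  # communication not hidden by backward
    }
    print(time_shares(step))
    print("Largest stall:", dominant_stall(step))
```

Small batches do not appear as a stall in a breakdown like this; they show up as low achieved throughput inside the compute portion, so that case needs the per-kernel metrics as well.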

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first. The decision rule we apply in audits is simple: until your existing fleet runs at greater than ~60% achieved FLOP utilisation on your representative workloads, additional capacity will only multiply the waste. The financial case for adding capacity rests on the assumption that the existing capacity is being used; once the assumption is checked, the case often collapses. A profiling pass costs days of engineering time. A procurement cycle costs months and tens to hundreds of thousands of dollars. The asymmetry alone justifies profiling as the precondition to procurement.
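The decision rule can be written down as a simple check. This is a minimal sketch: the workload names and utilisation figures are hypothetical, and the 0.60 default mirrors the approximate threshold quoted above rather than a hard limit.

```python
def achieved_flop_utilisation(achieved_flops_per_s: float,
                              peak_flops_per_s: float) -> float:
    """Achieved throughput as a fraction of the relevant peak."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


def workloads_to_optimise_first(utilisation: dict[str, float],
                                threshold: float = 0.60) -> list[str]:
    """Return workloads at or below the threshold, lowest first.
    An empty list means the rule no longer argues against procurement."""
    below = [name for name, u in utilisation.items() if u <= threshold]
    return sorted(below, key=lambda name: utilisation[name])


if __name__ == "__main__":
    # Hypothetical fleet: achieved fraction of peak per representative workload.
    fleet = {"finetune_job": 0.41, "vision_training": 0.67, "batch_inference": 0.22}
    print(workloads_to_optimise_first(fleet))
```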

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

In engagements we have run, the realistic savings band from utilisation optimisation alone — without any hardware change — is 30-60% of the GPU line item, achieved over a 4-8 week intervention. The lower end captures the easy wins (batch sizing, dataloader tuning, mixed-precision adoption); the higher end requires per-kernel optimisation, model architecture adjustment, or a re-think of how the data pipeline meets the device. Beyond 60% the marginal effort typically exceeds the marginal saving, and the decision shifts to model or hardware refresh.

The cost frame matters: a saving from optimisation recurs every month thereafter, whereas additional cloud capacity adds a cost that recurs every month. In net present value terms, profiling once and using less consistently comes out ahead of renting more.
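A worked comparison makes the framing concrete. Every number below is a hypothetical input (current spend, saving fraction, one-off engineering cost, extra rental, horizon, discount rate), and the model ignores the ramp-up period of the intervention; it is a sketch of the comparison, not a forecast.

```python
def present_value(monthly_amount: float, months: int, monthly_rate: float) -> float:
    """Present value of a fixed amount paid at the end of each month."""
    return sum(monthly_amount / (1.0 + monthly_rate) ** m
               for m in range(1, months + 1))


def pv_cost_optimise(current_spend: float, saving_fraction: float,
                     one_off_cost: float, months: int, monthly_rate: float) -> float:
    """One-off optimisation cost now, then a reduced monthly spend."""
    reduced = current_spend * (1.0 - saving_fraction)
    return one_off_cost + present_value(reduced, months, monthly_rate)


def pv_cost_rent_more(current_spend: float, extra_spend: float,
                      months: int, monthly_rate: float) -> float:
    """Current spend plus additional rented capacity, every month."""
    return present_value(current_spend + extra_spend, months, monthly_rate)


if __name__ == "__main__":
    # Hypothetical: 100,000/month today, 24-month horizon, 1% monthly rate.
    optimise = pv_cost_optimise(100_000.0, 0.30, 60_000.0, 24, 0.01)
    rent = pv_cost_rent_more(100_000.0, 30_000.0, 24, 0.01)
    print(f"PV of cost, optimise first: {optimise:,.0f}")
    print(f"PV of cost, rent more:      {rent:,.0f}")
```

The two options are not strictly equivalent in delivered capacity, so a real comparison would also check that the optimised fleet meets the same throughput target as the expanded one.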

Limitations that remain

This article describes what to measure and where the savings live; it does not eliminate the work of measuring them. Three honest gaps remain. First, profiler output is only useful if engineering time is allocated to interpret it — many teams have the data and do not have the bandwidth to act on it. Second, the utilisation framing assumes a stable workload mix; teams running heterogeneous workloads (training, inference, ad-hoc experimentation) on shared infrastructure face a scheduling problem that profiling alone does not solve. Third, the savings band quoted above is based on engagements with teams that had at least one engineer willing to own the optimisation work end-to-end; teams without that ownership see the data, agree it is interesting, and then change nothing.

How TechnoLynx Can Help

TechnoLynx is a visual-computing R&D consultancy. For AI infrastructure teams we run GPU performance audits that quantify per-workload utilisation against the metrics that actually matter (SM occupancy, memory bandwidth, achieved FLOPs), identify the dominant waste sources in the data pipeline and the kernel mix, and produce a prioritised intervention list with measurable before/after targets. We work with engineering teams that want to spend less on capacity by using more of what they already have. Contact us to discuss your AI infrastructure utilisation.

Image credits: Freepik.
