CASE STUDY

GPU Porting from OpenCL to Metal - V-Nova

V-Nova approached TechnoLynx with a large, well-structured GPU codebase written in OpenCL. While it performed well on most platforms, performance on Apple devices (M1 and M2) needed improvement. The goal was to reuse existing kernels on Apple hardware without rewriting everything from scratch, while keeping a single-source codebase and consistent outputs across platforms.

Tags: OpenCL, Metal, GPU Porting, Apple M1/M2, Cross-Platform

The Challenge

V-Nova had a GPU-heavy application optimised for OpenCL on AMD and NVIDIA hardware, but performance dropped on Apple hardware, where Metal is the native GPU framework. The work was not just “code conversion”: it required preserving the existing programming model, handling differences in memory models, syntax, and supported features, and maintaining performance across platforms without fragmenting the codebase.

Maintain a single-source GPU codebase.

The solution needed to expand Apple support without forcing two separate kernel implementations or a fragmented backend.

Bridge OpenCL vs. Metal programming differences.

Thread indexing, address spaces, barriers, event handling, and memory access work differently between OpenCL and Metal.

Hit real-time performance on Apple hardware.

The application needed to deliver fast results for real-time workloads (e.g., ray tracing, video editing, and machine learning) without delays.

Keep GPU compute reliable and consistent.

It wasn’t enough for the kernels to “run”: outputs had to match across platforms, with shared/global memory handled correctly and practical fallbacks where needed.


Project Timeline

From runtime kernel porting to replay-based verification and consistent outputs across platforms

Assessment

Evaluated why OpenCL-optimised kernels performed well on AMD/NVIDIA hardware but slowed down on Apple devices, and defined cross-platform constraints.

Model Mapping

Mapped OpenCL concepts to Metal equivalents: thread indexing, address space keywords, barriers, logical operators, and event behaviour.
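
To make the model-mapping step concrete, here is a minimal sketch of how address space keywords and barriers might be rewritten from OpenCL C to the Metal Shading Language. The function name `translate_keywords` and the two lookup tables are hypothetical and not taken from the project; only the keyword correspondences themselves (for example `__global` to `device`, `__local` to `threadgroup`) are standard. The real tool very likely does more than regex substitution.

```python
import re

# OpenCL address space qualifier -> Metal address space attribute.
ADDRESS_SPACES = {
    "__global": "device",
    "__local": "threadgroup",
    "__constant": "constant",
    "__private": "thread",
}

# OpenCL barrier flag -> Metal memory flag.
BARRIERS = {
    "CLK_LOCAL_MEM_FENCE": "mem_flags::mem_threadgroup",
    "CLK_GLOBAL_MEM_FENCE": "mem_flags::mem_device",
}


def translate_keywords(opencl_src: str) -> str:
    """Rewrite a few OpenCL keywords into Metal spellings (illustrative only)."""
    out = re.sub(r"\b__kernel\b", "kernel", opencl_src)
    for cl_name, metal_name in ADDRESS_SPACES.items():
        out = re.sub(rf"\b{cl_name}\b", metal_name, out)

    def _barrier(match: re.Match) -> str:
        return f"threadgroup_barrier({BARRIERS[match.group(1)]})"

    # Only single-flag barriers are handled in this sketch.
    flags = "|".join(BARRIERS)
    return re.sub(rf"\bbarrier\s*\(\s*({flags})\s*\)", _barrier, out)
```

The output is not yet a complete Metal kernel: buffer binding attributes, thread index parameters, and the unprefixed OpenCL spellings (`global`, `local`) are left out of this sketch.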

Runtime Porting Tool

Built a runtime tool that reads OpenCL kernels and outputs working Metal equivalents so most kernels can be reused with almost no rewriting.

Caching & Iteration Speed

Added checksum-based kernel caching to avoid recompiling unchanged kernels, improving app startup and development iteration time.
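
The idea behind checksum-based caching can be sketched in a few lines. This is an assumption-laden illustration, not the project's code: the `KernelCache` class, the injected `compile_fn`, and the choice of SHA-256 are all hypothetical, and the cache here lives in memory only, whereas a cache that speeds up app startup would need to persist compiled kernels between runs.

```python
import hashlib
from typing import Callable, Dict


class KernelCache:
    """Cache compiled kernels under a checksum of their source (hypothetical helper)."""

    def __init__(self, compile_fn: Callable[[str], object]) -> None:
        self._compile_fn = compile_fn          # stand-in for the real compiler call
        self._compiled: Dict[str, object] = {}
        self.compile_count = 0                 # exposed so cache hits can be observed

    @staticmethod
    def checksum(source: str) -> str:
        # SHA-256 is an assumption; any stable content hash would do.
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def get(self, source: str) -> object:
        key = self.checksum(source)
        if key not in self._compiled:
            # Source changed (or was never seen): compile once and remember it.
            self._compiled[key] = self._compile_fn(source)
            self.compile_count += 1
        return self._compiled[key]
```

Requesting the same source twice triggers a single call to `compile_fn`; editing even one character of the kernel changes the checksum and forces a recompile.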

Replay-Based Verification

Built a replay tool that records inputs/outputs and parameters before and after kernel runs, enabling side-by-side behaviour comparison and faster debugging.
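
As a rough sketch of the recording half of such a tool, the snippet below wraps each kernel call and stores its inputs, parameters, and outputs. Everything here is hypothetical: the `Recorder` and `KernelRecord` names, the use of plain Python lists as buffers, and JSON as the storage format are assumptions made to keep the example self-contained.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class KernelRecord:
    kernel: str
    params: Dict[str, float]
    inputs: List[float]
    outputs: List[float]


@dataclass
class Recorder:
    records: List[KernelRecord] = field(default_factory=list)

    def run(self, name: str, fn: Callable[..., List[float]],
            inputs: List[float], **params: float) -> List[float]:
        # Copy the inputs first so in-place kernels cannot alter the snapshot.
        snapshot = list(inputs)
        outputs = list(fn(inputs, **params))
        self.records.append(KernelRecord(name, dict(params), snapshot, outputs))
        return outputs

    def dump(self) -> str:
        """Serialise the whole run so it can be replayed on another backend."""
        return json.dumps([asdict(r) for r in self.records])
```

A recorded run can then be fed to a second backend with the same inputs and parameters, and the stored outputs serve as the reference for comparison.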

The Solution

We created a runtime GPU porting framework that converts OpenCL kernels into Metal, while preserving key constructs like address spaces, thread hierarchy, and synchronisation. Where Metal didn’t match OpenCL features directly, we implemented custom logic and added replay-based testing to confirm consistent outputs across platforms.

Runtime Kernel Porting

A tool that reads OpenCL code and outputs a working Metal version at runtime, so developers can write one kernel and run it on both OpenCL and Metal.

Preserved Programming Model

Carefully mapped address spaces and memory behaviour, handled thread indexing differences, and rebuilt synchronisation/event handling to match Metal’s system.

Replay + Buffer Tracking

Recorded memory usage, parameters, and outputs to replay and compare OpenCL vs. Metal execution, pinpointing where differences begin and speeding up fixes.
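
Pinpointing where differences begin amounts to walking two recorded traces in step and stopping at the first mismatch. The sketch below assumes each trace is a list of `(kernel name, output buffer)` pairs and that a small floating-point tolerance is acceptable; the function name, the trace layout, and the tolerance values are illustrative assumptions. Setting both tolerances to `0.0` gives an exact comparison.

```python
import math
from typing import Optional, Sequence, Tuple

# One entry per pipeline step: (kernel name, output buffer).
Trace = Sequence[Tuple[str, Sequence[float]]]


def first_divergence(reference: Trace, candidate: Trace,
                     rel_tol: float = 1e-5,
                     abs_tol: float = 1e-6) -> Optional[Tuple[int, str, int]]:
    """Return (step, kernel name, element index) of the first mismatch, or None."""
    for step, ((name, ref_out), (_, cand_out)) in enumerate(zip(reference, candidate)):
        if len(ref_out) != len(cand_out):
            # Buffers of different size diverge at the end of the shorter one.
            return (step, name, min(len(ref_out), len(cand_out)))
        for i, (r, c) in enumerate(zip(ref_out, cand_out)):
            if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
                return (step, name, i)
    return None
```

Reporting the earliest diverging step matters because later kernels consume earlier outputs, so one real bug tends to show up as many downstream mismatches.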

Technical Specifications

Source kernels: OpenCL
Apple backend: Metal (Apple M1/M2)
Caching: Checksum-based kernel caching to avoid recompiling unchanged kernels
Indexing: Mapped get_global_id()-style calls by passing thread indices into Metal kernels
Memory mapping: Careful global/shared memory mapping; in Metal, shared buffers can make unmap() a no-op
Differences handled: Logical operator outputs, barrier syntax, vector field access, event handling
Validation: Replay tool records inputs/outputs and replays pipelines to pinpoint kernel output differences
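
The indexing entry above can be illustrated with a short sketch. OpenCL kernels query their position with `get_global_id(n)`, while a Metal kernel receives it as a parameter tagged `[[thread_position_in_grid]]`. The function below is a hypothetical, regex-based approximation of that rewrite: the name `rewrite_global_id`, the parameter name `gid`, and the assumption of one simple kernel signature per string are ours, not the project's.

```python
import re

AXES = "xyz"


def rewrite_global_id(src: str, index_name: str = "gid") -> str:
    """Turn get_global_id(N) calls into components of an injected thread-index parameter."""
    body = re.sub(
        r"\bget_global_id\s*\(\s*([012])\s*\)",
        lambda m: f"{index_name}.{AXES[int(m.group(1))]}",
        src,
    )
    if body == src:
        return src  # kernel never asks for its index; leave the signature alone

    param = f"uint3 {index_name} [[thread_position_in_grid]]"

    def _add_param(m: re.Match) -> str:
        existing = m.group(1).strip()
        params = f"{existing}, {param}" if existing else param
        return f"({params}) {{"

    # Assumes the first "(...) {" in the string is the kernel signature.
    return re.sub(r"\(([^()]*)\)\s*\{", _add_param, body, count=1)
```

This sketch ignores OpenCL global offsets and the `size_t` return type of `get_global_id()`, both of which a production translator would need to account for.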

The Outcome

V-Nova kept a single GPU codebase while supporting Apple devices through Metal and other platforms through OpenCL. The runtime porting approach simplified ongoing development and made testing/debugging more reliable across backends.

229% average encoding speedup after porting from OpenCL to Metal
340% average decoding speedup after porting from OpenCL to Metal

Key Achievements

Real-time GPU tasks ran at smooth frame rates, making the software viable for video rendering, design, and scientific computing on Apple devices.

Checksum-based kernel caching avoided recompiling unchanged kernels, speeding up app startup and improving development time.

Replay and buffer tracking made debugging faster and more accurate by tracing output differences back to the exact kernel where they first appeared.

Shared memory access was handled efficiently and global memory usage was mapped properly, helping avoid memory access errors common when switching GPU frameworks.

Need to Port GPU Code to Apple Silicon?

Let’s discuss how cross-platform GPU strategies (OpenCL, Metal, and runtime tooling) can help you keep one codebase, improve performance, and ship reliably across hardware targets.