V-Nova approached TechnoLynx with a large, well-structured GPU codebase written in OpenCL. While it performed well on most platforms, performance on Apple devices (M1 and M2) needed improvement. The goal was to reuse existing kernels on Apple hardware without rewriting everything from scratch, while keeping a single-source codebase and consistent outputs across platforms.
V-Nova had a GPU-heavy application optimised for OpenCL on AMD and NVIDIA hardware, but performance dropped on Apple’s Metal framework. The work was not just “code conversion”: it required preserving the existing programming model, handling differences in memory models, syntax, and supported features, and maintaining performance across platforms without fragmenting the codebase.
Maintain a single-source GPU codebase.
The solution needed to expand Apple support without forcing two separate kernel implementations or a fragmented backend.
Bridge OpenCL vs. Metal programming differences.
Thread indexing, address spaces, barriers, event handling, and memory access all differ between OpenCL and Metal.
Hit real-time performance on Apple hardware.
The application needed to deliver results fast enough for real-time workloads (e.g., ray tracing, video editing, and machine learning).
Keep GPU compute reliable and consistent.
It wasn’t enough for kernels to “run”: outputs had to match across platforms, with shared/global memory handled correctly and practical fallbacks where needed.
From runtime kernel porting to replay-based verification and consistent outputs across platforms
Evaluated why OpenCL-optimised kernels performed well on AMD/NVIDIA but dropped on Apple’s Metal framework, and defined cross-platform constraints.
Mapped OpenCL concepts to Metal equivalents: thread indexing, address space keywords, barriers, logical operators, and event behaviour.
Built a runtime tool that reads OpenCL kernels and outputs working Metal equivalents so most kernels can be reused with almost no rewriting.
Added checksum-based kernel caching to avoid recompiling unchanged kernels, improving app startup and development iteration time.
Built a replay tool that records inputs/outputs and parameters before and after kernel runs, enabling side-by-side behaviour comparison and faster debugging.
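The checksum-based caching step can be illustrated with a minimal sketch. This is not the project's actual code: the `KernelCache` class, the `fake_compile` stand-in, and the choice of SHA-256 are all assumptions made for illustration. The idea is simply that a kernel is recompiled only when the checksum of its source (plus build options) changes.

```python
import hashlib


class KernelCache:
    """Hypothetical cache keyed by a checksum of kernel source and build options."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # stand-in for the real backend compiler
        self._store = {}               # checksum -> compiled artifact
        self.compile_count = 0         # how many real compilations happened

    @staticmethod
    def checksum(source, options=""):
        # Hash source and options together so a change to either invalidates the entry.
        digest = hashlib.sha256()
        digest.update(source.encode("utf-8"))
        digest.update(b"\0")
        digest.update(options.encode("utf-8"))
        return digest.hexdigest()

    def get(self, source, options=""):
        key = self.checksum(source, options)
        if key not in self._store:
            # Cache miss: compile once and remember the result.
            self._store[key] = self._compile_fn(source, options)
            self.compile_count += 1
        return self._store[key]


def fake_compile(source, options):
    # Placeholder for an expensive compile step (assumption, not a real API).
    return {"source_length": len(source), "options": options}


kernel_src = "__kernel void scale(__global float* data, float k) { }"
cache = KernelCache(fake_compile)
first = cache.get(kernel_src)
second = cache.get(kernel_src)        # unchanged source: served from the cache
edited = cache.get(kernel_src + " ")  # changed source: compiled again
```

In a real application the cache would likely be persisted to disk so that the saving also applies across app launches; this in-memory version only shows the lookup logic.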
We created a runtime GPU porting framework that converts OpenCL kernels into Metal, while preserving key constructs like address spaces, thread hierarchy, and synchronisation. Where Metal didn’t match OpenCL features directly, we implemented custom logic and added replay-based testing to confirm consistent outputs across platforms.
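To make the mapping concrete, here is a deliberately simplified sketch of how OpenCL constructs line up with their Metal Shading Language counterparts. It is an illustration only, not the framework described here: the function `translate_body` and the parameter names `gid`, `lid`, and `group_id` are hypothetical, and a production translator would need a proper parser rather than regular expressions.

```python
import re

# OpenCL qualifiers and their Metal Shading Language counterparts.
KEYWORDS = {
    "__kernel": "kernel",
    "__global": "device",
    "__local": "threadgroup",
    "__constant": "constant",
    "__private": "thread",
}

# OpenCL barrier flags and the corresponding Metal memory flags.
BARRIERS = {
    "CLK_LOCAL_MEM_FENCE": "mem_flags::mem_threadgroup",
    "CLK_GLOBAL_MEM_FENCE": "mem_flags::mem_device",
}

# OpenCL index functions -> hypothetical uint3 parameter names. A generated Metal
# signature would declare them with attributes such as [[thread_position_in_grid]].
INDEX_FUNCS = {
    "get_global_id": "gid",
    "get_local_id": "lid",
    "get_group_id": "group_id",
}

AXES = "xyz"


def translate_body(opencl_src):
    """Rewrite qualifiers, barriers, and index calls; the signature is not completed."""
    out = opencl_src
    for cl_kw, metal_kw in KEYWORDS.items():
        out = re.sub(r"\b" + re.escape(cl_kw) + r"\b", metal_kw, out)
    # Only single-flag barriers are handled; combined flags are left untouched.
    out = re.sub(
        r"\bbarrier\s*\(\s*(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE)\s*\)",
        lambda m: "threadgroup_barrier(" + BARRIERS[m.group(1)] + ")",
        out,
    )
    # get_global_id(0) -> gid.x, get_local_id(1) -> lid.y, and so on.
    out = re.sub(
        r"\b(get_global_id|get_local_id|get_group_id)\s*\(\s*([0-2])\s*\)",
        lambda m: INDEX_FUNCS[m.group(1)] + "." + AXES[int(m.group(2))],
        out,
    )
    return out


opencl_kernel = """__kernel void scale(__global float* data, float k) {
    size_t i = get_global_id(0);
    data[i] *= k;
    barrier(CLK_LOCAL_MEM_FENCE);
}"""
metal_fragment = translate_body(opencl_kernel)
```

The result is a fragment, not a complete Metal kernel: the signature would still need buffer bindings such as `[[buffer(0)]]` and the `gid` parameter itself, which is part of why features without a direct equivalent call for custom logic.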
A tool that reads OpenCL code and outputs a working Metal version at runtime, so developers can write one kernel and run it on both OpenCL and Metal.
Carefully mapped address spaces and memory behaviour, handled thread indexing differences, and rebuilt synchronisation/event handling to fit Metal’s execution model.
Recorded memory usage, parameters, and outputs to replay and compare OpenCL vs. Metal execution, pinpointing where differences begin and speeding up fixes.
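The replay comparison can be pictured with a small sketch. Everything here is an assumption for illustration: the `KernelRecord` structure, the `first_divergence` helper, and the tolerance values are hypothetical, and the sketch assumes both backends recorded the same kernels in the same order. It walks two recorded traces and reports the first kernel and element where the outputs stop agreeing.

```python
import math
from dataclasses import dataclass


@dataclass
class KernelRecord:
    name: str      # kernel name as recorded during the run
    outputs: list  # flattened output buffer captured after the kernel ran


def first_divergence(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return (kernel_name, element_index) for the first mismatch, or None."""
    for ref, cand in zip(reference, candidate):
        # A different kernel or buffer size is itself a divergence.
        if ref.name != cand.name or len(ref.outputs) != len(cand.outputs):
            return (ref.name, None)
        for i, (a, b) in enumerate(zip(ref.outputs, cand.outputs)):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return (ref.name, i)
    return None


opencl_trace = [
    KernelRecord("blur", [0.5, 0.25]),
    KernelRecord("scale", [1.0, 2.0, 3.0]),
]
metal_trace = [
    KernelRecord("blur", [0.5, 0.25]),
    KernelRecord("scale", [1.0, 2.0, 3.5]),
]
mismatch = first_divergence(opencl_trace, metal_trace)
print(mismatch)  # ('scale', 2)
```

Whether outputs should match bit-for-bit or within a tolerance depends on the workload; the tolerances above are placeholders rather than values taken from the project.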
V-Nova kept a single GPU codebase while supporting Apple devices through Metal and other platforms through OpenCL. The runtime porting approach simplified ongoing development and made testing/debugging more reliable across backends.
Real-time GPU tasks ran at smooth frame rates, making the software viable for video rendering, design, and scientific computing on Apple devices.
Checksum-based kernel caching avoided recompiling unchanged kernels, speeding up app startup and improving development time.
Replay and buffer tracking made debugging faster and more accurate by tracing output differences back to the exact point of failure.
Shared memory access was handled efficiently and global memory usage was mapped properly, helping avoid memory access errors common when switching GPU frameworks.
Let’s discuss how cross-platform GPU strategies (OpenCL, Metal, and runtime tooling) can help you keep one codebase, improve performance, and ship reliably across hardware targets.