Background
V-Nova approached TechnoLynx with a clear need. They had a large, well-structured GPU codebase written in OpenCL. Their solution worked well on most platforms, but performance on Apple devices using M1 and M2 chips needed improvement.
Apple supports Metal as the primary GPU framework. To avoid rewriting everything from scratch, V-Nova wanted a way to reuse their existing code on Apple hardware.
This was not just about code conversion. It was about making sure the existing programming model stayed intact. The task also included maintaining performance across different platforms without fragmenting the codebase.
GPU porting can be complex, especially when frameworks differ in memory models, syntax, and supported features. V-Nova needed a solution that worked across all platforms, handled shared memory correctly, and delivered performance improvements without compromising functionality.
Problem
V-Nova had a GPU-heavy application built for high performance on AMD and NVIDIA hardware. It used OpenCL and performed well on standard graphics processing units. But performance dropped on Apple hardware, where Metal is the primary GPU framework.
Creative professionals often use Apple devices. With M1 and M2 chips, Apple hardware is now widely used in fields like virtual reality, 3D design, and computer graphics. Many of these tasks depend on GPU performance.
V-Nova needed their software to perform well on every platform their users rely on, including Apple's custom Metal framework. Their goal was to maintain a single-source codebase while expanding support.
We also ran into differences in thread indexing and address space usage. OpenCL uses keywords like __local and __private, while Metal uses threadgroup and thread. Barriers, event handling, and memory access also worked differently.
The aim was not just to make things run. We needed GPU computing to work reliably and efficiently, with consistent outputs across all platforms. This meant solving the problem in a new way.
Challenges
The main challenge was GPU architecture. OpenCL and Metal use different concepts. We needed to match performance without changing the core algorithm.
V-Nova’s code was deeply optimised for OpenCL. The code assumed GPU scheduling and address space behaviour typical of AMD and NVIDIA devices. Apple’s Metal works differently. It needed custom adjustments to work with the Apple GPU pipeline.
We also had to address parallelism. The software used both the central processing unit (CPU) and GPU. Moving this logic to Metal required precise control to prevent bottlenecks.
Real-time performance was key. The app needed to deliver fast results for users in fields like ray tracing, video editing, and machine learning. We couldn’t afford delays.
Another issue was GPGPU use. The app did more than graphics. It performed complex general-purpose tasks such as physics and AI computations. We had to preserve this capability.
Some tools used in the original setup also depended on specific GPU features not supported on Intel graphics. We needed fallbacks.
Solution
We created a tool that could port code at runtime. It reads OpenCL code and outputs a working Metal version. This made it possible to reuse most of the existing kernels with almost no rewriting.
We kept key GPU constructs intact. Address space handling, thread hierarchy, and synchronisation were preserved. This made the programming model behave in a familiar way across both systems.
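To make the address space and synchronisation mapping concrete, here is a minimal sketch in Python. It is not the actual porting tool: the helper name translate_qualifiers and the regex-based approach are assumptions made for illustration, and a production translator would need a proper parser. The keyword pairs themselves (__global to device, __local to threadgroup, __private to thread, __constant to constant) are the standard OpenCL C and Metal Shading Language counterparts.

```python
import re

# OpenCL C address-space qualifiers and their Metal Shading Language counterparts.
ADDRESS_SPACES = {
    "__global": "device",
    "__local": "threadgroup",
    "__private": "thread",
    "__constant": "constant",
}

# OpenCL barrier flags and the Metal memory flags with the matching scope.
BARRIERS = {
    "CLK_LOCAL_MEM_FENCE": "mem_flags::mem_threadgroup",
    "CLK_GLOBAL_MEM_FENCE": "mem_flags::mem_device",
}


def translate_qualifiers(opencl_source: str) -> str:
    """Hypothetical helper: rewrite address-space keywords and simple barrier calls."""
    out = opencl_source
    for cl_keyword, metal_keyword in ADDRESS_SPACES.items():
        # Lookarounds keep longer identifiers such as __global_count untouched.
        out = re.sub(rf"(?<!\w){re.escape(cl_keyword)}(?!\w)", metal_keyword, out)
    for cl_flag, metal_flag in BARRIERS.items():
        # Only single-flag barriers are handled in this sketch.
        out = re.sub(
            rf"\bbarrier\s*\(\s*{cl_flag}\s*\)",
            f"threadgroup_barrier({metal_flag})",
            out,
        )
    return out
```

Calling translate_qualifiers on a line such as "__local float* tmp; barrier(CLK_LOCAL_MEM_FENCE);" yields the threadgroup equivalents. Combined flags, the kernel entry-point keyword, and argument binding attributes are deliberately left out of this sketch.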
The tool caches compiled kernels using checksums. If a kernel hasn’t changed, we don’t recompile it. This speeds up app startup and improves development time. In cases where Metal lacked a direct match for OpenCL features, we wrote custom code to fill the gap.
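The caching idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real implementation: the KernelCache class, the choice of SHA-256 as the checksum, the in-memory dictionary, and the fake_compile stand-in are all hypothetical. The source only states that compiled kernels are cached by checksum.

```python
import hashlib
from typing import Callable, Dict


class KernelCache:
    """Hypothetical cache: compile each distinct kernel source only once."""

    def __init__(self, compile_fn: Callable[[str], bytes]) -> None:
        self._compile = compile_fn
        self._binaries: Dict[str, bytes] = {}
        self.compilations = 0  # counts how often the compiler actually ran

    @staticmethod
    def checksum(source: str) -> str:
        # Assumption: SHA-256 of the source text serves as the cache key.
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def get(self, source: str) -> bytes:
        key = self.checksum(source)
        if key not in self._binaries:
            # Cache miss: the source is new or has changed, so compile it.
            self._binaries[key] = self._compile(source)
            self.compilations += 1
        return self._binaries[key]


def fake_compile(source: str) -> bytes:
    # Stand-in for the real translate-and-compile step.
    return source.upper().encode("utf-8")
```

Requesting the same source twice through KernelCache(fake_compile).get(...) triggers one compilation; any change to the source changes the checksum and forces a recompile. A cache that survives app restarts would also need to write the binaries to disk, which this sketch omits.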
For example, where OpenCL code calls get_global_id(), Metal passes thread indices to the kernel as arguments. Global memory and shared memory concepts were mapped carefully. In Metal, buffers are often shared by default between CPU and GPU, so unmap() becomes a no-op.
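A rough Python sketch of the thread index rewrite follows. The helper names rewrite_global_id and add_gid_param and the parameter name gid are made up for this example. The thread_position_in_grid attribute is the Metal kernel argument that carries the thread's position in the grid, which corresponds to get_global_id() when no global work offset is in use; that no-offset condition is an assumption of the sketch.

```python
import re

AXES = "xyz"
# Hypothetical parameter name "gid"; the attribute itself is standard Metal.
GID_PARAM = "uint3 gid [[thread_position_in_grid]]"


def rewrite_global_id(kernel_body: str) -> str:
    """Replace get_global_id(N) with gid.x, gid.y, or gid.z for a literal N in 0..2."""
    return re.sub(
        r"\bget_global_id\s*\(\s*([0-2])\s*\)",
        lambda match: "gid." + AXES[int(match.group(1))],
        kernel_body,
    )


def add_gid_param(param_list: str) -> str:
    """Append the thread-index parameter to a comma-separated parameter list."""
    params = param_list.strip()
    return f"{params}, {GID_PARAM}" if params else GID_PARAM
```

With these helpers, a body containing "int i = get_global_id(0);" becomes "int i = gid.x;", and the kernel signature gains the extra index argument. Calls with a non-literal dimension, such as get_global_id(d), are left untouched and would need separate handling.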
We also handled small but important differences. Logical operators return different values for true and false in the two systems. The syntax for memory barriers is not the same.
Vector field access needed adjustment. Event handling was rebuilt to fit the Metal system.
This GPU porting tool means developers can write one kernel and run it on both OpenCL and Metal. No need for two separate codebases. It saves time, prevents bugs, and keeps development simpler.
The work also included testing. We built a tool that records all memory usage, parameters, and results before and after running a kernel. This data goes into a file that can be replayed later.
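The recording step might look something like the Python sketch below. Everything here is a hypothetical simplification: kernels are modelled as plain functions over lists of floats, the capture is written as JSON, and the names run_and_record, save_capture, load_capture, and scale are invented for illustration. The real tool's file format is not described in this case study.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Buffers = Dict[str, List[float]]
Kernel = Callable[[Buffers, dict], Buffers]


def run_and_record(name: str, kernel: Kernel, buffers: Buffers,
                   params: dict, log: List[dict]) -> Buffers:
    """Run one kernel and append its inputs, parameters, and outputs to the log."""
    # Snapshot the buffers before the launch so later mutation cannot alter the record.
    before = {key: list(values) for key, values in buffers.items()}
    after = kernel({key: list(values) for key, values in buffers.items()}, params)
    log.append({"kernel": name, "params": params, "before": before, "after": after})
    return after


def save_capture(log: List[dict], path: Path) -> None:
    # Assumption: JSON is used as the capture format for readability.
    path.write_text(json.dumps(log, indent=2))


def load_capture(path: Path) -> List[dict]:
    return json.loads(path.read_text())


def scale(buffers: Buffers, params: dict) -> Buffers:
    # Toy stand-in for a GPU kernel: multiply every element by a factor.
    return {"data": [x * params["factor"] for x in buffers["data"]]}
```

Each log entry holds enough information to rerun the kernel in isolation: load the capture, feed the recorded "before" buffers and parameters to a kernel, and check the result against the recorded "after" buffers.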
Using this file, we can compare the behaviour of OpenCL and Metal kernels. We can even pass output from one backend into the next one in the pipeline. This helps us find where differences begin. We can spot which kernel causes incorrect results and fix it fast.
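The search for the first diverging kernel can be sketched as follows. This is a simplified illustration in Python, assuming each pipeline stage is a function from one list of floats to another. The names first_divergence and close, and the tolerance value, are assumptions and not part of the actual tooling.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[List[float]], List[float]]]


def close(a: List[float], b: List[float], tol: float = 1e-6) -> bool:
    """Element-wise comparison with an absolute tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def first_divergence(reference: Sequence[Stage], candidate: Sequence[Stage],
                     data: List[float], tol: float = 1e-6) -> Optional[str]:
    """Return the name of the first stage whose candidate output differs, else None."""
    for (name, ref_kernel), (_, cand_kernel) in zip(reference, candidate):
        expected = ref_kernel(list(data))
        # The candidate stage receives the same known-good input as the reference.
        actual = cand_kernel(list(data))
        if not close(expected, actual, tol):
            return name
        # Feed the reference output forward so one faulty stage cannot mask later ones.
        data = expected
    return None
```

Feeding the reference output into the next stage is the design choice that matters here: it means a mismatch at stage N points at stage N itself and not at an error inherited from an earlier kernel.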
Results
The final solution worked well. V-Nova did not have to split their GPU codebase. They could use the same source code on Apple devices with Metal and on other platforms with OpenCL. Our framework improved GPU speed significantly. On the M1, performance increased by over 300%. On M2, results were even better.
Real-time GPU tasks ran at smooth frame rates. This made the software viable for video rendering, design, and scientific computing on Apple devices.
The runtime porting tool made it easier to test and update the code. Developers could focus on improving features without worrying about backend differences.
Porting the code directly to Metal and avoiding emulation reduced execution time and increased throughput on Apple M1 and M2 chips.
Shared memory access was handled efficiently. Global memory usage was mapped properly. This helped avoid memory access errors, which are common when switching between GPU frameworks.
The tool also made testing much easier. With output replay and buffer tracking, debugging became faster and more accurate. Differences in kernel output were tracked to the exact point of failure.
The solution even allowed for some creativity. We worked within the limitations of Metal but still achieved the same results as OpenCL.
We didn’t forget the basics. Even though we dealt with advanced GPU frameworks, we still respected the programming language rules and worked from clear goals. At times, it felt like working through the periodic table: picking and adapting elements like iron (for structure) and pure metals (for clarity), and adjusting weight where needed.

Final Thoughts
This case showed that smart GPU porting can solve complex issues. With the right tools and planning, moving from OpenCL to Metal can be smooth.
We made sure that shared memory, global memory, and kernel logic matched across both platforms. This allowed V-Nova to keep their performance high and avoid code duplication.
TechnoLynx delivered a solution that was simple to maintain, efficient to run, and easy to extend. The programming model stayed consistent, which kept developers happy.
And most importantly, the software now works well on Apple hardware, without extra work from the client’s side. GPU computing on M1 and M2 is no longer a limitation. It’s just part of the process.