Problem

Our client had a highly specialised GPU application, specifically designed and optimised for AMD and NVIDIA hardware. The client had invested significant resources in developing a high-quality algorithm, which was performing exceptionally well on these platforms. However, despite the software’s ability to run on Apple devices, the performance on Apple’s Metal framework was far below optimal levels.

This was a significant issue for the client because many creative professionals prefer Apple hardware. Apple devices, particularly those using the M1 and M2 chips, have become popular due to their power and design, making them a staple in the creative industry.

However, Apple builds its GPU stack around its proprietary Metal framework rather than the cross-vendor OpenCL standard, which it has deprecated on macOS. This posed a problem for software designed around more standard GPU programming models. Optimising the software to run smoothly on Apple devices therefore became a high-priority task for the client, as it had the potential to unlock more business opportunities and expand their customer base.

The sub-par performance on Apple hardware was primarily due to the way the algorithm was optimised. The original code relied heavily on OpenCL, which works well on AMD and NVIDIA graphics cards but is far less effective on Apple hardware, where Metal is the native GPU framework. This difference in how GPU tasks are handled across platforms led to significant inefficiencies when the software ran on Apple hardware, making it unsuitable for creative professionals who rely on high performance for tasks like video editing, 3D rendering, and other GPU-intensive work.

Solution

Our team approached this project with the goal of ensuring the GPU application could be fully optimised for Apple’s Metal framework while maintaining the performance levels the client had achieved on AMD and NVIDIA hardware. This required building a flexible, highly adaptable framework that would convert the existing OpenCL codebase into Metal, Apple’s proprietary system.

To solve the problem, we developed a generic and flexible framework capable of covering the subset of OpenCL functionality that the application used and mapping these features to Metal. Our approach involved moving the execution of the algorithm closer to Metal’s native functionality, allowing for improved performance on Apple GPUs, particularly the M1 and M2 chips.

This conversion was not just a one-time process; it resulted in a sophisticated tool that could automatically convert nearly 100% of the original OpenCL code into Metal functionality.
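To make the idea of mapping OpenCL features to Metal more concrete, here is a minimal sketch in Python. It is not the tool built for the client, whose internals are not described here; the function name `translate_kernel`, the parameter name `grid_pos_x`, and the tiny sample kernel are all hypothetical. It covers only a few well-known correspondences between OpenCL C and the Metal Shading Language (address-space qualifiers, barriers, and the one-dimensional global thread index), and its output is not claimed to compile as-is: a real converter would also need to handle headers, argument bindings, scalar arguments, and much more.

```python
import re

# Assumed, deliberately partial mapping of OpenCL C qualifiers to their
# Metal Shading Language counterparts.
QUALIFIER_MAP = {
    "__kernel": "kernel",
    "__global": "device",
    "__local": "threadgroup",
    "__constant": "constant",
}

# Work-group barriers map to threadgroup barriers with a memory-scope flag.
BARRIER_MAP = {
    "barrier(CLK_LOCAL_MEM_FENCE)":
        "threadgroup_barrier(mem_flags::mem_threadgroup)",
    "barrier(CLK_GLOBAL_MEM_FENCE)":
        "threadgroup_barrier(mem_flags::mem_device)",
}

_QUALIFIER_RE = re.compile(
    r"\b(" + "|".join(re.escape(q) for q in QUALIFIER_MAP) + r")\b"
)
# Matches a kernel signature up to (not including) its closing parenthesis.
# Assumes the argument list itself contains no parentheses.
_SIGNATURE_RE = re.compile(r"(\bkernel\s+void\s+\w+\s*\([^)]*)\)")


def translate_kernel(opencl_src: str) -> str:
    """Translate a small subset of OpenCL C into Metal-style source."""
    msl = opencl_src
    for cl_call, msl_call in BARRIER_MAP.items():
        msl = msl.replace(cl_call, msl_call)
    msl = _QUALIFIER_RE.sub(lambda m: QUALIFIER_MAP[m.group(1)], msl)

    # OpenCL queries the thread index with a function call; Metal passes it
    # in as an attributed kernel parameter, so the signature must change too.
    if "get_global_id(0)" in msl:
        msl = msl.replace("get_global_id(0)", "grid_pos_x")
        msl = _SIGNATURE_RE.sub(
            r"\1, uint grid_pos_x [[thread_position_in_grid]])", msl, count=1
        )
    return msl


# Hypothetical sample kernel, used only for illustration.
OPENCL_SRC = """__kernel void scale(__global const float* src, __global float* dst) {
    size_t i = get_global_id(0);
    dst[i] = src[i] * 2.0f;
}"""

if __name__ == "__main__":
    print(translate_kernel(OPENCL_SRC))
```

Even this toy version shows why the conversion is more than renaming keywords: the thread index moves from a function call inside the kernel body to a parameter in the kernel signature, so the translator has to rewrite both places consistently.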

By doing so, we ensured that Apple GPUs could handle the same tasks as their AMD and NVIDIA counterparts with minimal performance loss. Our framework was specifically designed to address the computational needs of high-performance computing, ensuring real-time efficiency even in resource-intensive tasks.

Challenges of the Conversion Process

One of the key challenges in this project was maintaining the high-performance characteristics of the algorithm during the conversion process. OpenCL and Metal operate differently, and it was essential to ensure that the functionality provided by OpenCL could be replicated within Metal without sacrificing efficiency.

Another difficulty lay in balancing parallel processing tasks between the CPU (Central Processing Unit) and GPU. GPUs are designed for parallel processing, which is what made the original algorithm so effective on AMD and NVIDIA hardware. In moving to Apple’s Metal framework, we had to ensure that the balance between central processing units and GPUs was maintained, especially when the algorithm was performing real-time computations in applications that required immediate feedback, such as machine learning models and graphics rendering.

In many ways, this was a project that required careful consideration of general purpose computing on GPUs (GPGPU). Ensuring that the algorithm could handle specific tasks efficiently without overwhelming the system was key to achieving the performance boost our client required. The new framework we developed needed to handle a wide range of tasks, from graphics rendering to artificial intelligence computations, all while performing at a level that met industry standards for creative professionals.

Results

After implementing our conversion tool, we achieved dramatic performance improvements. On the smaller M1 GPU, the algorithm saw speed improvements of over 300%, and the larger M2 GPU delivered even better results. These speed improvements not only made the client’s software viable on Apple hardware, but also positioned it as a top-tier option for users who required high-performance solutions for their creative work.

The conversion framework we developed was more than just a quick-fix solution. It was designed with the future in mind, ensuring that the client’s software could be easily updated and maintained. The internal structure of the framework retained the single-source approach used in the original project, which made future development more streamlined. With the support of two backends (OpenCL and Metal), the client now had the flexibility to continue offering their solution across different platforms without additional overhead or complexity.
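One way to picture a single-source design with two backends is sketched below in Python. This is an illustrative assumption, not the structure of the client's project: the template, the `DIALECTS` table, and `render_kernel` are hypothetical names. The kernel is written once, and each backend supplies only the spellings that differ between OpenCL C and the Metal Shading Language.

```python
from string import Template

# One kernel definition shared by both backends. The placeholders mark the
# only places where the two dialects differ in this small example.
KERNEL_TEMPLATE = Template(
    "${KERNEL} void add(${GLOBAL} const float* a,\n"
    "         ${GLOBAL} const float* b,\n"
    "         ${GLOBAL} float* out${INDEX_PARAM}) {\n"
    "    uint i = ${INDEX_EXPR};\n"
    "    out[i] = a[i] + b[i];\n"
    "}\n"
)

# Per-backend spellings (assumed, minimal). OpenCL reads the thread index
# with a built-in call; Metal receives it as an attributed parameter.
DIALECTS = {
    "opencl": {
        "KERNEL": "__kernel",
        "GLOBAL": "__global",
        "INDEX_PARAM": "",
        "INDEX_EXPR": "get_global_id(0)",
    },
    "metal": {
        "KERNEL": "kernel",
        "GLOBAL": "device",
        "INDEX_PARAM": ",\n         uint gid [[thread_position_in_grid]]",
        "INDEX_EXPR": "gid",
    },
}


def render_kernel(backend: str) -> str:
    """Return the kernel source for the requested backend."""
    if backend not in DIALECTS:
        raise ValueError(f"unknown backend: {backend!r}")
    return KERNEL_TEMPLATE.substitute(DIALECTS[backend])


if __name__ == "__main__":
    for name in DIALECTS:
        print(f"--- {name} ---")
        print(render_kernel(name))
```

The appeal of this kind of layout is that a change to the algorithm is made in one place and reaches both backends, while supporting a further backend means adding one more dialect entry rather than another copy of the kernel.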

As a result, the client was able to expand their market to include Apple users, a significant business opportunity given the large number of creative professionals who rely on Apple devices for their work. The client’s software, previously optimised only for AMD and NVIDIA GPUs, could now deliver comparable performance on Apple’s Metal framework, thus broadening their appeal and increasing their customer base.

Long-Term Impact and Future-Proofing

One of the significant advantages of this project was its long-term impact. The solution we developed was not just a one-time fix; it was a high-performance computing solution that could continue to evolve with new hardware releases from Apple.

As Apple continues to develop its M-series chips, the software will be able to scale accordingly, thanks to the flexible nature of the conversion framework. The fact that the tool converts nearly 100% of the original codebase ensures that future updates and optimisations will be straightforward, saving the client time and money in future development cycles.

Moreover, this flexible framework provides the client with the ability to optimise for different GPUs, meaning they can easily expand their software’s compatibility across other platforms if needed.

Conclusion

This case study demonstrates the real-world challenges of optimising GPU technology for a wide range of hardware, particularly when dealing with proprietary frameworks like Apple’s Metal. By developing a flexible, high-performance computing solution, we were able to deliver a product that not only met the client’s immediate needs but also positioned them for long-term success in the evolving GPU market.

The project highlighted the importance of understanding the nuances of different GPUs and their related frameworks. Whether working with OpenCL, Metal, or any other system, the ability to map parallel processing tasks efficiently across central processing units and GPUs is essential for achieving optimal performance.

For businesses looking to optimise their software for multiple platforms, this case study offers a detailed example of how to approach the problem. The key takeaway is that real-time performance improvements are possible, even when working with fundamentally different GPU frameworks, as long as the project is approached with flexibility and future-proofing in mind.

At TechnoLynx, we specialise in tackling these kinds of high-performance computing challenges, helping businesses optimise their software for a wide range of platforms and hardware. Whether you’re working with AMD, NVIDIA, Apple, or any other hardware provider, our team has the expertise to ensure your software runs at its best, delivering the performance your customers expect.

Contact us to find out more!

Image by Freepik