Problem

Our client, a small government-backed startup, set out to improve accessibility for people with visual impairments. The goal was to expand multi-platform screen-reading tools by adding Kazakh language support on consumer devices, as part of a broader effort to make devices easier to use for this audience.

At first glance, it seemed like a standard text-to-speech (TTS) task. But the Kazakh language posed unique challenges.

Kazakh is not widely supported by major platforms: few high-quality pre-trained models exist, and existing screen-reading tools lacked native support for the language.

At the same time, the client required fast deployment across Android, Windows, and iOS. The project had a tight schedule and high-quality standards.

We needed a working Kazakh TTS that produced natural speech and ran in real time across platforms. The system also had to meet memory and performance limits on mobile devices. Our task was clear: build a reliable solution using existing resources while meeting real-life user needs.

Challenges and Constraints

The project came with several technical constraints, all of which shaped our design decisions.

Limited Time

The key obstacle was the tight deadline. The client approached us with a firm release date already set, leaving limited time to prepare. Despite the late start, our team embraced the challenge and used pre-trained models for Kazakh, as there wasn't sufficient time to develop custom models or retrain existing ones from scratch.

Format Compatibility

Our deployment required converting the PyTorch models into ONNX and CoreML formats. ONNX was chosen for Android and Windows. CoreML was needed for iOS. This process involved reworking inputs, outputs, and inference pipelines.

iOS Memory Limitations

The iOS screen reader framework (AVSpeechSynthesisProviderAudioUnit) had strict memory limits. The original model could not run within these constraints. This issue had to be addressed during implementation.

Audio Quality

Natural-sounding speech was a core requirement. The Kazakh TTS had to be usable for formal communication, not just casual content. Robotic or artificial tones were unacceptable.

Outdated Application Layers

Some helpful open-source tools existed but used old build systems and dependencies. We had to update them or rewrite core components to ensure stability and compatibility.

Toolchain

To meet these challenges, we used the following tools:

  • PyTorch (for pre-trained model checkpoints)

  • ONNX Runtime (for Android and Windows deployment)

  • CoreML Tools (for iOS optimisation)

  • FFmpeg (for audio pre/post-processing)

  • Apple’s Xcode and AVFoundation (for iOS integration)

Solution

Our guiding questions were simple: how could we deploy a Kazakh TTS model on three platforms using existing tools, and how could we ensure high-quality, real-time speech under memory limits?

For Android and Windows, we converted the PyTorch model to ONNX, then reduced its size by pruning layers and quantising weights. This cut the model from 230MB to 97MB and average inference time from 320ms to 130ms.
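The weight-quantisation step can be sketched as a dynamic INT8 pass over a network's Linear layers. The network below is an illustrative placeholder, not the project's actual optimisation script:

```python
# Sketch: dynamic INT8 quantisation of Linear layers, the kind of
# weight quantisation that shrinks a model several-fold; the network
# here is a toy stand-in, not the real TTS model.
import torch
import torch.nn as nn

float_model = nn.Sequential(
    nn.Linear(256, 1024),
    nn.ReLU(),
    nn.Linear(1024, 80),
).eval()

# Linear weights are stored as INT8; activations stay in float and
# are quantised on the fly at inference time.
quantised = torch.ao.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

out = quantised(torch.randn(1, 256))
```

Dynamic quantisation like this trades a small amount of accuracy for a large drop in model size and faster CPU inference, which is why it suits on-device TTS.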

For iOS, the same model was too heavy. The screen reader framework imposed hard limits. We decided to build a standalone app for iOS. This let us run the model outside the system’s built-in memory limits.

The compromise was that the app could not integrate directly with VoiceOver. Still, it provided full text-to-speech functionality with natural voice quality.

Our solution passed internal tests and met all core requirements. Users could now listen to Kazakh content read aloud with high clarity. Deployment worked across all platforms.

Results

The project ultimately delivered a high-quality text-to-speech system across Android, Windows, and iOS. While the iOS version required a standalone application due to memory limitations, the overall solution met its goals. Our engineers reduced the footprint from a 65MB original to a set of ONNX models of 10-20MB each and a 3.6MB CoreML model.

It let users hear natural-sounding speech in Kazakh. This made devices easier to use for people with visual impairments.

The generated speech was of sufficient quality to be practically indistinguishable from human speech, helping users interact with their devices more naturally and comfortably. It was important that the voice worked for both casual reading and formal settings such as school or business.

Future Steps

Our team proposed a number of potential improvements for future projects. Specifically, we suggested continuing to optimise the model for iOS devices, exploring alternative ways to reduce memory consumption. Additionally, as AI voice technologies continue to evolve, future projects could incorporate more advanced deep learning models to enhance the quality of text-to-speech software even further.

Outlook and Lessons

This case shows how careful planning and the right tools can solve practical language support problems. The project followed a structured approach with clear goals. We adapted pre-trained models, updated legacy tools, and worked within device limits. The result was a working system used by real people.

Text-to-speech systems like this are part of everyday life. They help with accessibility, education, and customer support. Our work here demonstrates that even languages with limited support can be included.

The role of AI in this field is growing. As models improve, the voices will sound even more natural. Devices will run faster with smaller models. This helps everyone, especially users with special access needs.

We can apply the approach we used in this project to other languages or platforms. Any community needing voice support can benefit. Whether it's in a government system, classroom, or personal device, this kind of tool can help.

Why This Case Study Matters

This case study shows how AI helps in real life. It focuses on a clear goal, a specific group of users, and measurable results. The work followed practical steps, asked direct questions, and solved real problems.

We believe this case shows how small changes make a big difference. Quality AI does not have to be complex. It just has to work. That's what we did here, and that's what we aim to do in every project.

Text-to-Speech: A Broader Perspective

Text-to-speech technologies have gained increasing importance in recent years, thanks to the rise of AI technologies. These systems are used not only to support accessibility for those with visual impairments but also in a wide range of other industries. For example, text-to-speech software is used in customer service applications, virtual assistants, and content generation. A free text-to-speech solution can help businesses save time by automating the reading of documents, reports, or other text-heavy content.

AI voice technology has come a long way from its early, robotic-sounding origins. Today’s systems use deep learning to generate voices that sound more human-like, and they can operate in multiple languages, making them suitable for a wide range of global applications. By developing solutions that can handle niche languages like Kazakh, companies can expand their markets and serve previously underserved populations.

Our Perspective

At TechnoLynx, we apply structured methods. We combine computer science with human-centred design. Each case teaches us how to work better and faster. We rely on deep learning, efficient frameworks, and strong quality control.

If you are working on a project that includes AI voice technology, accessibility, or language support, get in touch. We’re ready to help you build something that works for your users, not just your specs. Contact us now to discuss your needs and find the ultimate solution!

Image by Freepik