

How to select the perfect AI model for a project?

Armand Brière

In this article, we look at how Osedea's AI team selects the ideal AI model for a project, balancing a client's specific needs with the best possible performance. We will also cover the importance of data and how to validate potential options, even without access to the client's data.

Following the latest trends in AI, we are seeing a growing need for AI speech-to-text models. Clients are looking for new ways to interact with their systems, especially in the context of LLM chatbots. Let's dive into speech-to-text models in the context of the pharmaceutical industry.

An overview of SOTA models

Every client is unique and has different needs, but when it comes to AI, everyone wants the best performance. To select the best state-of-the-art (SOTA) model, understanding the strengths and weaknesses of the different models is key. Below is an overview of some of the top-performing models that we looked at:

  1. Whisper from OpenAI

Whisper is a strong model that supports multiple languages with high accuracy. It's known for handling different accents and background noise well.

  2. Speech-to-Text from Google

Google's Speech-to-Text API provides real-time transcription of audio at scale, running in environments managed by Google.

  3. Transcribe from AWS

AWS Transcribe is a great model, similar to Google's Speech-to-Text. Mainly used for customer service and call centers, it's highly scalable and suitable for processing large amounts of data.

All of these models are great at what they do, but sometimes the best model is not the one with the best performance on common benchmarking tasks. Since we are working on a pharmaceutical project where data is highly specific to the domain, we needed to find a model that could be fine-tuned to our specific use case.

Understanding the use case

The use case is the most important part of the project. Understanding the client's needs enables us to choose the best model for our implementation. Do we need to transcribe a call center interaction, a podcast, or a video? Are we looking for real-time transcription or batch processing? These are some of the questions we need to ask ourselves before selecting models to evaluate. In this case, we are looking at developing an AI assistant that will help scientists in drug development at a pharmaceutical company.

The key feature of this AI assistant is to transcribe the scientist's voice to text in real-time. This allows the scientists to focus on their work and not on taking notes.



Choosing the right model



After understanding the use case, we explored the list of available models. We quickly realized that none of these SOTA offerings were suitable for our use case. Although they all provide real-time transcription, the cost of using these models at scale can quickly become too high. Additionally, the complexity of deploying and maintaining these models, along with the significant computational resources required, made them impractical for our needs. As a result, we decided to explore more cost-effective and resource-efficient alternatives that could still meet our accuracy and performance requirements.

Fortunately, OpenAI open-sourced their Whisper model in 2022, in various sizes. This allowed us to train the model on our own data and fine-tune it to our specific use case. This approach gave us the flexibility to customize the model to our needs, while also reducing the cost and complexity of deployment and maintenance.

The Whisper model is open-source, but we still needed to choose the right model size and the best hyperparameters, and validate the model on our specific use case.
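To give a concrete idea of what working with the open-source checkpoints looks like, here is a minimal sketch using the openai-whisper Python package; the audio file name is only a placeholder.

import whisper

# The open-source checkpoints come in several sizes, from tiny to large.
print(whisper.available_models())

# Load one of the smaller checkpoints on CPU and transcribe a sample file.
model = whisper.load_model("tiny", device="cpu")
result = model.transcribe("sample.wav", language="en", fp16=False)
print(result["text"])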

To achieve this, we followed three steps that helped us select the best model for our client:

1. Data Collection

2. Data Preprocessing

3. Model Benchmarking


Data collection for benchmarking

Since we are working on a pharmaceutical project, the data is highly specific to the domain. We started by identifying keywords and terms specific to the pharmaceutical industry, and we established a detailed list of sentences that we would use to benchmark the models. Once we had our list of sentences, totalling around 2,000 words, we started to look for the best way to transform those written sentences into audio files.

A lot of tools are available to transform text to speech, but how good are they? They tend to be too clean and robotic for proper benchmarking. The best data is data that reflects what we will have in a real-world, production environment. With this in mind, we simply decided to record ourselves reading the sentences and create our own custom dataset. Osedea's diverse team allowed us to collect a good mix of accents and vocal tones to ensure the data was representative of the environment it would be used in.

Gathering our custom dataset at the office was a fun experience, and we developed a small script that would present the sentences to the reader and record the audio through a simple command line interface. We realized that some of the sentences were hard to read and that some of the words were not pronounced correctly. This was a good opportunity to improve our data and make it more robust.
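The recording script itself is not shown here, but a minimal sketch of such a command line tool could look like the following, assuming the sounddevice and soundfile libraries; the file names, output directory, and recording duration are illustrative.

import os

import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # matches the sample rate expected by the models
SENTENCES_FILE = "sentences.txt"  # one benchmark sentence per line


def record_sentence(text: str, out_path: str, seconds: float = 10.0) -> None:
    """Show a sentence to the reader, record it, and save it as a WAV file."""
    print(f"\nPlease read aloud:\n  {text}")
    input("Press Enter to start recording...")
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()  # block until the recording is done
    sf.write(out_path, audio, SAMPLE_RATE)


if __name__ == "__main__":
    os.makedirs("recordings", exist_ok=True)
    with open(SENTENCES_FILE) as f:
        sentences = [line.strip() for line in f if line.strip()]
    for i, sentence in enumerate(sentences):
        record_sentence(sentence, f"recordings/sentence_{i:03d}.wav")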

Once finished, we had a dataset of audio files totalling about 1 hour of audio, which is sufficient to benchmark the models.

Benchmarking the models

As stated earlier, the Whisper model comes in different sizes. We benchmarked the tiny, base, small, and medium models on a basic CPU. After all, why default to the most powerful GPU? Is it really necessary? Running the benchmark on a CPU allowed us to run it on multiple machines and compare results. We also wanted to see how the models perform on basic hardware to evaluate the cost of running them in production. Our work was inspired by the Picovoice speech-to-text benchmark, an open-source project that takes a smart object-oriented approach to testing multiple models: a default Engine class is defined, and each new engine is in charge of implementing its own transcribe() function.

Here are a few snippets of what the code looks like:

class Engine:
    """Base class for all engines."""

    def transcribe(self, path: str) -> str:
        """Transcribe the given audio file."""
        raise NotImplementedError()

The default Engine class is fairly simple by design, with a few additional methods handling metrics collection.

From this Engine, a new WhisperTiny engine class can be created without any difficulty:

import time

import soundfile
import whisper


# WHISPER_TINY_MODEL_NAME is defined elsewhere in the benchmark code
# (it names the "tiny" checkpoint).
class WhisperTiny(Engine):
    """Whisper Tiny engine for the benchmark."""

    SAMPLE_RATE = 16000

    def __init__(self, model: str = WHISPER_TINY_MODEL_NAME):
        # Load the open-source Whisper checkpoint on CPU and initialize the
        # counters used later to compute the real-time factor.
        self._model = whisper.load_model(model, device="cpu")
        self._audio_sec = 0.0
        self._proc_sec = 0.0

    def transcribe(self, path: str, **transcribe_params) -> str:
        """Transcribe the given audio file."""

        # Read the file to accumulate the total audio duration.
        audio, sample_rate = soundfile.read(path, dtype="int16")
        assert sample_rate == self.SAMPLE_RATE
        self._audio_sec += audio.size / sample_rate

        # Time the transcription itself to accumulate processing time.
        start_sec = time.time()
        res: str = self._model.transcribe(
            path, language="en", fp16=False, **transcribe_params
        )["text"]
        self._proc_sec += time.time() - start_sec

        # _normalize() is part of the benchmark code (not shown here) and
        # cleans up the raw transcription before the WER computation.
        return self._normalize(res)
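
To show how such an engine is driven over the dataset, here is a hedged sketch of what the benchmark loop could look like; the file names and reference sentences below are placeholders, not the actual benchmark code.

# Each recorded audio file is paired with the sentence that was read.
dataset = [
    ("recordings/sentence_000.wav", "First reference sentence."),
    ("recordings/sentence_001.wav", "Second reference sentence."),
]

engine = WhisperTiny()
results = []
for audio_path, reference in dataset:
    hypothesis = engine.transcribe(audio_path)
    results.append((reference, hypothesis))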

With the dataset defined and the models and hardware ready, let's look at the metrics we used to evaluate the models:

The Word Error Rate (WER) is a common metric used to evaluate the performance of speech recognition models. It measures the percentage of words that are incorrectly transcribed by the model. The lower the WER, the better the model's performance.

The Real-Time Factor (RTF) is a measure of the speed of the model. It represents the ratio of the time taken to transcribe the audio to the duration of the audio. A lower RTF indicates that the model is faster at transcribing the audio.
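
To make both metrics concrete, here is a small, self-contained sketch in plain Python; the example strings and the timing comment are made up for illustration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words gives a WER of 0.2 (20%).
print(word_error_rate("the dose is ten milligrams", "the dose is ten milligram"))

# The RTF is simply processing time divided by audio duration, which is what
# the engine's _proc_sec and _audio_sec counters accumulate:
# rtf = proc_sec / audio_sec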

Running each model on our dataset gave us the following results:

(Image: benchmark results table)

If you prefer a visual representation of the results, here is a graph that shows the WER and RTF of the models:

(Image: WER and RTF graph)

The results are clear: as expected, the smaller the model, the faster it runs, but the higher the WER gets. This matches the conclusion seen on public benchmarks. For us, the Small and Medium models gave nearly the same results in both WER and RTF; tripling the size of the model is not worth the 0.06% performance gain between the two.

The difference between the Tiny and Base models is significant: the WER is 3.69% higher for the Tiny model, but the RTF is 0.85 lower. This is a good trade-off for a real-time transcription model. These results are promising, especially since we haven't yet talked about hyperparameter optimization or fine-tuning the model on domain-specific data. Running the model on a GPU would improve its transcription speed but would also increase the cost of running it in production. Future discussions with the client will help us decide whether the performance gain is worth the cost.

Conclusion

Selecting the right model for a project is not easy. It requires a deep understanding of the use case, the data, and the model's performance. In this article, we looked at how we selected the Whisper model for our pharmaceutical project. We started by understanding the use case, creating our own dataset, and benchmarking the model. The results were promising and enabled us to move forward with the development.

Did this article start to give you some ideas? We’d love to work with you! Get in touch and let’s discover what we can do together.
