NLU Benchmarking: How Ultimate's AI Engine Stacks Up Against Others

A graph showing Ultimate's accuracy at 79.83%, higher than any competitors'.

We ran an experiment on our AI engine, and the results show that it identifies user intent more accurately than comparable engines. Here’s a rundown of the experiment, the data sets, and the limitations we identified.

At its heart, NLU (Natural Language Understanding) is about deciphering the intent behind a natural language utterance. The more accurately and reliably an engine identifies intent, the more powerful the solution it drives.

In a broader linguistic sense, intent is defined as the desired outcome of behavior (or action, sentence, or a piece of written text, for that matter). When a person yells “A dog!”, they could be either trying to warn us of a menacing stray dog or tell us there is a fluffy dog looking for a cuddle. So, the intent would be to warn us or get our attention. In reality, though, there are many more potential intentions behind this utterance. And to pin it down correctly, we need accuracy.

In an NLU sense, intent could be defined as an objective or a use case that a customer might need help with. Say a customer on a website types into the chat box: “where can i see the books i ordered?” In this case, the intent could be “[see] orders”.

Each intent comprises similarly expressed visitor messages, called ‘expressions’, which are grouped together. In the example above, these expressions could include messages such as “I want to track my orders”, “How can I update my shipping address” and many other forms.

The underlying process that groups these expressions together is called classification. However, classification is a broader concept, which does a lot more than just stacking new expressions against the existing ones. In fact, classification helps create categories and then predict which input falls into which category — and as a result, it helps create a generalized understanding of the topic. For example, once the NLU engine analyzes a chunk of natural language, it would classify it according to intent and then proceed to direct the customer towards the information they’re looking for.
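To make the idea of classification concrete, here is a minimal sketch in Python. It is a deliberately naive word-overlap classifier, not how Ultimate's engine (or any engine in the benchmark) works. The intent names, training expressions, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical training expressions grouped by intent (illustrative only).
TRAINING = {
    "see_orders": [
        "i want to track my orders",
        "where can i see the books i ordered",
    ],
    "weather": [
        "what is the weather like today",
        "will it rain tomorrow",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def build_profiles(training):
    """Count how often each word appears in each intent's expressions."""
    return {
        intent: Counter(word for expr in expressions for word in tokenize(expr))
        for intent, expressions in training.items()
    }


def classify(message, profiles):
    """Return the intent whose word profile overlaps most with the message."""
    words = tokenize(message)
    scores = {
        # Normalize by profile size so larger intents are not favored.
        intent: sum(profile[w] for w in words) / sum(profile.values())
        for intent, profile in profiles.items()
    }
    return max(scores, key=scores.get)


profiles = build_profiles(TRAINING)
print(classify("Where are my orders?", profiles))        # see_orders
print(classify("Is it going to rain today?", profiles))  # weather
```

A real engine generalizes far beyond shared words, but the shape is the same: learn categories from grouped expressions, then predict the category of an unseen message.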

As a result, NLU removes friction from the customer’s journey. It serves customers exactly the information they’re looking for, which reduces frustration and waiting time and lets the solution handle far more incoming queries than a human agent could. This is why most existing solutions focus on perfecting their intent classification performance.

But getting to this frictionless intent classification is where the challenge lies. It requires a data set to train the engine, plus samples of natural language that the engine hasn’t seen before to test its mettle.

Replicating the Experiment with Ultimate's Data

The experiment we ran on our NLU engine leans on the Cognigy experiment. All results mentioned in this piece are from the original experiment obtained on the raw data, save for Ultimate's performance metrics.


The data we used come from the home automation bot data set published with “Benchmarking Natural Language Understanding Services for Building Conversational Agents” (2019).

For training, we selected 10 utterances for each of the 64 intents from the smaller data set, and 30 utterances per intent from the larger data set. We then ran our NLU engine against 1076 test utterances for the smaller data set and 5518 test utterances for the larger one. To make the results comparable, the data sets are exactly the same as in the Cognigy test.
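For readers who want to build a similar fixed-size training split, the sketch below picks a set number of utterances per intent. It runs on synthetic placeholder data. Random sampling, the seed, and the function name are our assumptions for illustration, since the splits in this experiment were simply taken over from the Cognigy test.

```python
import random
from collections import defaultdict


def sample_per_intent(examples, per_intent, seed=0):
    """Pick up to `per_intent` utterances for every intent.

    `examples` is a list of (utterance, intent) pairs. The seed only makes
    this illustrative sampling repeatable; it is not from the experiment.
    """
    by_intent = defaultdict(list)
    for utterance, intent in examples:
        by_intent[intent].append(utterance)

    rng = random.Random(seed)
    subset = []
    for intent in sorted(by_intent):
        pool = by_intent[intent]
        chosen = rng.sample(pool, min(per_intent, len(pool)))
        subset.extend((utterance, intent) for utterance in chosen)
    return subset


# Synthetic stand-in data: 64 intents with 40 placeholder utterances each.
examples = [
    (f"utterance {i} for intent {j}", f"intent_{j}")
    for j in range(64)
    for i in range(40)
]

small_train = sample_per_intent(examples, per_intent=10)
large_train = sample_per_intent(examples, per_intent=30)
print(len(small_train), len(large_train))  # 640 1920
```

With 64 intents, 10 utterances per intent gives 640 training utterances and 30 per intent gives 1920.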

On data quality, it’s important to mention that all the utterances were grammatically correct, and that the test data set featured short utterances containing all the relevant details needed to identify the intent.

In comparison, genuine customer service data is significantly messier. Customers don’t always spell or punctuate carefully, and they rarely worry about how grammatically ‘clean’ their input is. They might omit auxiliary verbs or mix abbreviations, numbers, and words, such as ‘l8r’ for ‘later’. Messier input data means worse performance, unless the NLU engine is designed to handle these real-life inputs as well.

Benchmarking Results

In the table below you can see how Ultimate’s NLU engine performed on both the smaller and larger data sets and how it stacked up against other engines.

[Image: AI comparison table, page 3]

19% reduction in error rate, while narrowing the F1 error by 17%

[Image: AI comparison table, page 6]

To put these results into perspective, it’s essential to reflect on the two metrics used: accuracy and F1 score.

Accuracy refers to the percentage of test sentences correctly matched with the underlying intent. A score of 0.51 would mean that 51% of sentences were correctly paired with their intent.
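As a quick illustration, accuracy can be computed in a few lines of Python. The labels below are made up for the example and are not taken from the benchmark.

```python
def accuracy(true_intents, predicted_intents):
    """Share of test sentences whose predicted intent matches the true one."""
    if len(true_intents) != len(predicted_intents):
        raise ValueError("both lists must have the same length")
    if not true_intents:
        raise ValueError("at least one test sentence is required")
    correct = sum(t == p for t, p in zip(true_intents, predicted_intents))
    return correct / len(true_intents)


# Hypothetical labels for four test sentences (not from the benchmark).
true_intents = ["weather", "see_orders", "weather", "alarm"]
predicted = ["weather", "see_orders", "alarm", "alarm"]
print(accuracy(true_intents, predicted))  # 0.75
```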

F1 is the harmonic mean of precision and recall.

Precision measures how many of the expressions the engine assigned to an intent actually belong to it. In our case: out of all the expressions that the engine marked as ‘weather-related’, how many were actually weather-related?

Recall, on the other hand, is the percentage of relevant instances that were retrieved. In our case: out of all the weather-related queries in the test data set, how many did the engine correctly pair with the weather-related intent?

Essentially, the F1 score balances false positives against false negatives by giving equal weight to precision and recall, which makes for a more rounded benchmark of an NLU engine than accuracy alone.
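Here is a small Python sketch of these three metrics for a single intent. The labels are hypothetical and the function name is ours. How the benchmark aggregated F1 across its 64 intents is not stated, so this only shows the per-intent calculation.

```python
def precision_recall_f1(true_intents, predicted_intents, target):
    """Precision, recall, and F1 for a single intent `target`."""
    pairs = list(zip(true_intents, predicted_intents))
    tp = sum(t == target and p == target for t, p in pairs)  # true positives
    fp = sum(t != target and p == target for t, p in pairs)  # false positives
    fn = sum(t == target and p != target for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for six test sentences (not from the benchmark).
true_intents = ["weather", "weather", "weather", "weather", "alarm", "music"]
predicted = ["weather", "weather", "alarm", "alarm", "weather", "music"]

p, r, f1 = precision_recall_f1(true_intents, predicted, "weather")
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.667 0.500 0.571
```

In this toy case the engine marked three sentences as weather-related and two of them were (precision 2/3), while it found only two of the four weather-related sentences (recall 1/2).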

What Benchmarking Tells Us

In short, the results of the experiment we ran show that Ultimate's NLU engine outperforms all other systems examined.

However, there are a few things to consider to put these results into a wider context. As for the data set, the utterances are simple enough for all the systems to show decent performance with large data sets. But as we’ve mentioned, in a different setting such as customer service, the utterances the engine would have to classify would be a lot less systematic and would often lack proper grammar.

Another aspect to consider is that, for Ultimate’s NLU engine, the accuracy and F1 results lie much closer together. This could suggest either that Ultimate’s engine handles all intents equally well, including complex ones, or that it doesn’t favor precision over recall or vice versa. However, a deeper look into the results data would be required for a more confident conclusion.
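One way to see why a gap between accuracy and F1 can hint at uneven handling of intents is the toy comparison below. It assumes F1 is macro-averaged, meaning the per-intent scores are averaged with equal weight. The source experiment does not state this, and the labels and engine names are invented for illustration.

```python
def accuracy(true_intents, predicted_intents):
    """Share of test sentences whose predicted intent matches the true one."""
    correct = sum(t == p for t, p in zip(true_intents, predicted_intents))
    return correct / len(true_intents)


def intent_f1(true_intents, predicted_intents, target):
    """F1 for one intent, written as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_intents, predicted_intents))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(true_intents, predicted_intents):
    """Average the per-intent F1 scores, weighting every intent equally."""
    intents = sorted(set(true_intents))
    scores = [intent_f1(true_intents, predicted_intents, i) for i in intents]
    return sum(scores) / len(scores)


# Hypothetical test set: 8 weather sentences followed by 2 alarm sentences.
truth = ["weather"] * 8 + ["alarm"] * 2
# A lopsided engine that ignores the rarer intent entirely.
lopsided = ["weather"] * 10
# An engine with the same accuracy that spreads its errors across intents.
balanced = ["weather"] * 7 + ["alarm"] + ["alarm", "weather"]

for name, predicted in [("lopsided", lopsided), ("balanced", balanced)]:
    acc = accuracy(truth, predicted)
    mf1 = macro_f1(truth, predicted)
    print(f"{name}: accuracy {acc:.2f}, macro F1 {mf1:.2f}")
# lopsided: accuracy 0.80, macro F1 0.44
# balanced: accuracy 0.80, macro F1 0.69
```

Both toy engines get 8 of 10 sentences right, but the one that neglects an intent shows a much wider gap between accuracy and macro F1.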

Test Limitations

One of the main limitations of the experiment is that it focuses on English-language data. Given its commercial value, English is the go-to testing ground for most NLU engines, and given its structure and the availability of test data, it has become the easiest language to test NLU on. Based on our hands-on experience, our hypothesis is that if we repeated the experiment on more complex languages, the results would vary more significantly.

In terms of technique, we saw little benefit from transfer learning in this experiment, most likely because the data used here is quite different from real-life customer service data.

Finally, if this experiment shows anything, it’s that the performance of an NLU engine is a function of data set size. The tools for creating a training data set for real-life customer service intent classification matter as much as the underlying logic behind the intent classification. We’ve put a lot of thought and work into ensuring our tools for building these data sets are robust and intuitive, and that they rest on AI-driven features to help users and deliver a seamless customer experience.
