What Happens When a Speech Recognition Dataset Goes Wrong


With new voice-activated devices launching every week, it's easy to believe that we're in uncharted territory in applying technology to recognize speech. But a recent Bloomberg article asserts that, while voice recognition technology has made significant advances in recent times, the way Speech Data Collection is carried out has kept the technology from reaching a level where it can transform how people interact with devices. The public has embraced voice-activated devices with enthusiasm, yet the actual experience still leaves plenty of room for improvement. What's holding this technology back?

More data = better performance

According to the author, what's required to improve how well devices understand and respond to users is terabytes of human speech data spanning different accents, languages, and dialects, to sharpen the devices' grasp of natural conversation.

Recent advances in speech engines are the result of a type of artificial intelligence known as neural networks, which learn and improve over time without being explicitly programmed. Loosely modelled on the human brain, these systems train themselves to make sense of the world around us, and they perform better the more data they have. Andrew Ng, Baidu's chief scientist, states: "The more information we put into our systems, the more efficient it is. This is the reason why speech is a capital-intensive procedure; not a lot of companies have this amount of information."
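To make that relationship concrete, here is a minimal sketch, illustrative only: it uses synthetic feature vectors in place of real speech and is not any production pipeline. It trains the same small neural network on progressively larger slices of a training set and reports test accuracy:

```python
# A toy demonstration that, all else equal, more training examples
# tend to mean fewer errors. Synthetic data stands in for speech features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for acoustic feature vectors labelled with the word spoken.
X, y = make_classification(n_samples=20000, n_features=40,
                           n_informative=20, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (200, 2000, 15000):  # growing training sets
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                          random_state=0)
    model.fit(X_train[:n], y_train[:n])
    print(f"{n:>6} examples -> test accuracy {model.score(X_test, y_test):.3f}")
```

On a typical run, accuracy climbs as the training slice grows, which is the whole argument for large-scale speech data collection in miniature.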

It's all about quality and quantity.

While the amount of data matters, its quality is just as crucial for optimizing machine learning techniques. "Quality" in this instance refers to how well the data suits its purpose. For instance, if a speech recognizer is being designed for use in cars, then the data should be collected inside a car to get the best results, capturing all the usual background noise the system will encounter there.
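When recording in the real environment isn't feasible, a common approximation is to mix clean speech with noise recorded in that environment. The sketch below is a simplified illustration (the speech and noise arrays are synthetic stand-ins for real recordings) of mixing at a chosen signal-to-noise ratio:

```python
# Simulating "in-car" training data by mixing clean speech with cabin
# noise at a target signal-to-noise ratio (SNR). Signals are synthetic.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # fake "speech"
noise = rng.normal(0.0, 1.0, 16000)                          # fake engine hum
noisy = mix_at_snr(speech, noise, snr_db=5.0)                # car-like 5 dB SNR
```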

While it's tempting to use off-the-shelf data or to gather data by whatever means come to hand, it's more effective in the long term to collect AI Training Data designed specifically for the purpose it will serve.

The same principle applies to building speech recognition software for a global audience. Human speech is nuanced, inflected, and full of cultural variation. Data collection needs to span a wide range of languages, regional accents, and locations to reduce errors and improve accuracy.
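In practice, that means auditing a collection for coverage before training. A hypothetical sketch, where the manifest structure and field names are assumptions rather than any real schema:

```python
# Tallying recorded hours per (language, accent) pair to spot gaps.
from collections import Counter

manifest = [  # one entry per recording; in practice read from CSV/JSON
    {"lang": "en", "accent": "US", "hours": 120.0},
    {"lang": "en", "accent": "Indian", "hours": 14.5},
    {"lang": "es", "accent": "Mexican", "hours": 8.0},
]

hours = Counter()
for rec in manifest:
    hours[(rec["lang"], rec["accent"])] += rec["hours"]

for (lang, accent), h in hours.most_common():
    print(f"{lang}/{accent}: {h:.1f} h")
# Under-represented accents are candidates for targeted collection.
```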

When Speech Recognition Goes Wrong

Automatic speech recognition (ASR) is something we work with every day at GTS. The accuracy of speech recognition is something we take pride in helping our customers improve, and we're confident those efforts are appreciated around the globe, as people use voice recognition on their smartphones, on their laptops, and in their homes. Virtual personal assistants are within reach, being asked to set reminders, respond to messages or emails, and even to search on our behalf and recommend a good place for a meal.

That's all very well, but even the most advanced voice recognition software struggles to reach 100% accuracy. When problems occur, they can be very obvious, or even amusing.

1. What kinds of errors occur?

A speech recognition device will almost always produce some string of words from the audio it receives; that's exactly what it's built to do. But deciding which string of words it actually heard is a hard problem, and a couple of failure modes can leave users confused. One way recognisers cope is to score many candidate transcripts and keep the best, as sketched below.
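Here is a toy sketch of that selection step. The numbers are invented, not any real engine's scores; the idea is that an acoustic score (how well the audio matched) is combined with a language-model score (how plausible the word string is):

```python
# Choosing among candidate transcripts by combining two log-probabilities.
candidates = [
    # (hypothesis, acoustic log-probability, language-model log-probability)
    ("recognise speech",   -12.0, -4.0),
    ("wreck a nice beach", -11.5, -9.0),
    ("wreck an ice beach", -11.8, -12.0),
]

LM_WEIGHT = 1.2  # how much to trust the language model vs. the audio

best = max(candidates, key=lambda c: c[1] + LM_WEIGHT * c[2])
print("chosen transcript:", best[0])
# The language model rescues the sensible sentence even though the
# acoustics slightly preferred the nonsense one.
```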

2. The wrong word is guessed

This is, naturally, the main issue. Speech recognition software cannot always assemble a plausible complete sentence. There are many possible misinterpretations that sound similar to what was said but make little sense as a whole: the classic example is "recognise speech" coming out as "wreck a nice beach".
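One standard way to quantify how wrong a guess is: word error rate (WER), the word-level edit distance between what was said and what was transcribed, divided by the length of the reference. A small self-contained sketch:

```python
# Word error rate via a dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("it is easy to recognise speech",
          "it is easy to wreck a nice beach"))  # every word after "to" differs
```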

3. Things that don't match the words you were saying

If someone passes by mid-conversation while you're speaking, or you cough in the middle of a phrase, a computer is unlikely to work out which parts of the audio were your speech and which were something else entirely. That can lead to things such as a phone dutifully taking a transcription while its owner practices the tuba.
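One reason this happens: simple front-ends often decide what to transcribe by how loud the audio is, not by what it is. The toy sketch below (synthetic signals, not a real ASR front-end) shows an energy threshold flagging a noisy burst just as readily as speech:

```python
# A naive energy-based "is someone speaking?" detector.
import numpy as np

def frames_above_threshold(signal, frame_len=400, threshold=0.01):
    """Mark frames whose average energy exceeds a fixed threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1) > threshold

rng = np.random.default_rng(1)
speech_like = 0.3 * np.sin(2 * np.pi * 150 * np.arange(8000) / 16000)
cough_like = rng.normal(0.0, 0.3, 8000)  # a loud burst of broadband noise

print(frames_above_threshold(speech_like).sum(), "frames flagged as speech")
print(frames_above_threshold(cough_like).sum(), "frames flagged as speech")
# Energy alone flags both, so the recogniser happily transcribes the cough too.
```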

4. What's going on here?

Why are these well-trained algorithms making mistakes that any human listener would find laughable?

5. What do people do when things fail?

When speech recognition accuracy starts to go wrong, it tends to keep going wrong. People are nervous speaking to a virtual assistant even at the best of times, and it doesn't take much to shake that trust. Once an error is made, speakers will try all kinds of odd things to make themselves clear.

Some people will slow down. Some will over-pronounce their words, making sure every K and T is as crisp as it can be. Others will attempt to mimic the accent they think computers will most easily understand, doing their best impression of Queen Elizabeth II or of Ira Glass.

But here's the problem: although these techniques may help if you're talking to a bewildered tourist or to someone on a poor telephone line, they don't help computers at all. In fact, the further we deviate from natural speech (the kind found in the recordings that trained the recogniser), the worse things get, and the cycle goes on.