Speech Recognition Technology

Speech recognition technology is an area of research at the intersection of computer science, electrical engineering, and acoustic engineering that aims to translate the spoken word into speaker-initiated computer functions, normally in the form of text. Although the technology has existed for over 50 years, it is only with the global spread of personal computers and the more recent explosion of smartphone usage that it has become ubiquitous. Speech recognition is now a part of daily life, as smartphone companies have capitalized on the ability to speak commands and common queries instead of typing them, e.g. Apple’s “Siri”. Applications of speech recognition technology have appeared in several industries, including the military, healthcare, and automotive sectors, and its potential for growth is large given the trend towards simple, accessible technology. Speech recognition is often confused with voice recognition, which identifies an individual speaker, usually for security purposes, whereas speech recognition determines what was said. Speech recognition technology makes use of several algorithms and tools to handle different accents, voices, and languages. Over time, these tools have become more sophisticated, improving the accuracy of speech recognition devices.


Early Origins of Speech Recognition Technology

The first documented speech recognition system, “Audrey”, was created at Bell Labs in 1952. It ran on an analog computing system and was able to recognize spoken strings of digits as long as the speaker paused noticeably (for about 350 milliseconds) between words. The system first built a reference library of the distinct sounds used in its language (in this case, a language restricted to single-digit numbers). When an individual then spoke, the system compared the incoming sounds to the references in the library and output the matching digit. Accuracy peaked at 99% when the speaker was the same person who had “trained” the system by establishing the reference library. For arbitrary speakers, on the other hand, accuracy dipped as low as 60%. About ten years after Audrey, IBM presented its “Shoebox” analog computer at the 1962 Seattle World’s Fair. This machine modestly extended Audrey’s capabilities by recognizing 16 words, including the single digits, yet much like Audrey it was held back by the long pauses it required between words and by a cumbersome design typical of the primitive analog computers of the time. It was not until the 1970s that speech recognition began to progress more quickly, partly due to funding from the United States Department of Defense. Under the department’s DARPA Speech Understanding Research (SUR) program, which ran from 1971 to 1976, many projects came to fruition, and one of the most successful was a program called “Harpy”, developed at Carnegie Mellon University. Harpy used a novel search algorithm called “beam search” that gave it the power to recognize just over 1000 words (roughly the average vocabulary of a three-year-old). Nonetheless, in spite of this larger vocabulary, speakers still needed to pause for long periods between words in order to be understood.

Later Developments

After Harpy, and through the 1980s, speech recognition technology took a large leap forward when researchers began applying a statistical technique called the “Hidden Markov Model” (HMM). With this tool, it was no longer necessary to keep a specific, rigid library of sounds against which to compare input speech in order to produce a transcription. Rather, an unknown sound was assigned a probability of being each candidate word. This flexibility made much larger vocabularies attainable. After the introduction of HMM algorithms, personal computers became increasingly powerful, and it was not long before they were able to run commercial software for personal speech transcription. In 1997, the software company Dragon released its NaturallySpeaking software, which finally permitted speakers to follow their normal flow of speech, without unnatural pauses between words. Even so, at the turn of the millennium speech recognition software seemed to have plateaued in accuracy, with programs unable to surpass an 80% transcription rate unless limited to very small vocabularies. It was not until Google introduced its voice search that the state of affairs changed drastically. With voice search, Google took advantage of cloud computing to move the computationally intensive problem of large-vocabulary transcription to its massive servers. In addition, with its vast amount of data on hand, Google was able to expand its transcription vocabulary to previously unimaginable levels. This method of speech recognition laid the groundwork for current popular systems, such as Apple’s Siri.
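
The idea of assigning a probability to each candidate word can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the two word models, their probabilities, and the two-symbol "acoustic alphabet" (0 for a low-energy frame, 1 for a high-energy frame) are all invented for this sketch. Each word is modeled as a small HMM, the forward algorithm computes how likely the observed sequence is under each model, and the most likely word is chosen.

```python
def forward_likelihood(observations, start, trans, emit):
    """Return P(observations | model) using the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    if not observations:
        raise ValueError("need at least one observation")
    states = range(len(start))
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right word models (all numbers are made up).
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
WORD_MODELS = {
    # "yes" tends to start with symbol 0 and end with symbol 1
    "yes": {"start": [1.0, 0.0], "trans": LEFT_TO_RIGHT,
            "emit": [[0.9, 0.1], [0.2, 0.8]]},
    # "no" tends to start with symbol 1 and end with symbol 0
    "no": {"start": [1.0, 0.0], "trans": LEFT_TO_RIGHT,
           "emit": [[0.1, 0.9], [0.8, 0.2]]},
}


def classify(observations):
    """Pick the word whose model gives the observations the highest likelihood."""
    return max(
        WORD_MODELS,
        key=lambda word: forward_likelihood(observations, **WORD_MODELS[word]),
    )


print(classify([0, 0, 1, 1]))
print(classify([1, 1, 0, 0]))
```

Real systems work with continuous acoustic features and far larger models, but the principle is the same: no exact template match is required, only a comparison of likelihoods.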

Acoustic Transcription

To discuss the methods behind speech recognition, it is important to distinguish between the two phases that combine to form the overall product: acoustic transcription and data analysis. Acoustic transcription refers to the first half of most algorithms, whereby acoustic information is converted into binary data that can then be manipulated to infer a transcription. It makes use of two primary techniques: Pulse-Code Modulation (PCM) and the Fourier Transform, usually computed with the Fast Fourier Transform (FFT). PCM is a technique for digitally representing an analog signal. An analog signal, such as a sound wave, is continuous in time, and digital computers must represent it in binary in order to operate on it. The way this is done is relatively simple: the amplitude of the analog signal is sampled at regular time intervals, and each sample is stored in a digital format. The Fourier Transform is arguably one of the most important mathematical advances of the past couple of centuries. The underlying idea was introduced by the French mathematician Joseph Fourier in order to solve the physical problem of heat propagation; it had far-reaching consequences and eventually led to the Fourier Transform, a method of converting a signal between its time domain and its frequency domain. For sound signals, this means one can look at a sound both in terms of how its intensity changes over time and in terms of how its different frequencies add up to produce its timbre. The Fast Fourier Transform, an efficient algorithm for computing this transform on sampled data, has made the tool indispensable for teasing apart and manipulating the information in a sound signal.
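
As a minimal sketch of the two techniques described above, the following Python code samples a signal at regular intervals, quantizes each amplitude to a 16-bit integer in the manner of PCM, and then applies a textbook radix-2 FFT to find the strongest frequency. The function names, the 8000 Hz sampling rate, and the 1000 Hz test tone are assumptions chosen for illustration; production systems would normally rely on an optimized FFT library rather than this recursive version.

```python
import cmath
import math


def pcm_sample(signal, n_samples, sample_rate_hz, bit_depth=16):
    """Sample signal(t) at regular intervals and quantize each amplitude.

    Amplitudes are assumed to lie in [-1.0, 1.0] and are clipped if they do not.
    """
    max_level = 2 ** (bit_depth - 1) - 1      # 32767 for 16-bit samples
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # time of the n-th sample in seconds
        amplitude = max(-1.0, min(1.0, signal(t)))
        samples.append(round(amplitude * max_level))
    return samples


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])                  # transform of even-indexed samples
    odd = fft(values[1::2])                   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC bin below Nyquist."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate_hz / len(samples)


def tone(t):
    """A stand-in for an analog sound wave: a 1000 Hz sine at 80% amplitude."""
    return 0.8 * math.sin(2 * math.pi * 1000 * t)


pcm = pcm_sample(tone, n_samples=256, sample_rate_hz=8000)
print(pcm[:4])
print(dominant_frequency(pcm, 8000))
```

The list `pcm` is the time-domain view of the sound (amplitude per sample), while the output of `fft` is the frequency-domain view of the same data.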

Data Analysis

Data analysis refers to the bulk of the processing that speech recognition software carries out in order to categorize sounds and language. Several models are used for data analysis in speech recognition. The Hidden Markov Model (HMM) has been one of the most powerful advances in speech recognition software. Treating speech as a stationary process on short time scales, and modeling the resulting sequence of auditory signals as a Markov process, allows one to assign words to arbitrary sounds with a certain probability, rather than being limited to a fixed and predetermined vocabulary. Dynamic time warping (DTW) is a method of making two temporal signals comparable by appropriately shortening or lengthening one to match the other’s length. In speech recognition, this method was of particular importance before the arrival of the HMM, as many earlier algorithms relied on comparing incoming sounds to templates stored within the software. Neural networks, as the name suggests, are a computational method inspired by neuronal structures within the brain, where complicated connections between neurons attenuate and magnify signals to perform calculations on specific inputs. Through careful programming, one can create networks that are capable of machine learning (in this case, applied to language recognition). Finally, it is important to note that speech recognition programs fall into one of two categories: speaker-independent systems and speaker-dependent systems. Speaker-dependent systems require a period of learning during which the software fine-tunes its performance to the specific tone modulations of a user’s voice, whereas speaker-independent systems do not and are programmed generically for all possible voice inputs.
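
Dynamic time warping, as described above, can be sketched with a small dynamic-programming table. In this illustrative Python example the "signals" are short lists of numbers standing in for acoustic features, and the templates and word labels are hypothetical; a real system would compare sequences of spectral feature vectors instead.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping cost between two numeric sequences.

    cost[i][j] is the cheapest way to align the first i items of a with the
    first j items of b, where one item may be matched to several items of the
    other sequence (this is what stretches or compresses time).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],       # a[i-1] repeats a match with b[j-1]
                cost[i][j - 1],       # b[j-1] repeats a match with a[i-1]
                cost[i - 1][j - 1],   # both sequences advance together
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical stored templates, one feature value per time step.
templates = {
    "one": [1, 3, 4, 3, 1],
    "two": [5, 5, 2, 1, 1],
}

# The same shape as "one", spoken at half speed.
slow_one = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]

print(dtw_distance(slow_one, templates["one"]))
print(recognize(slow_one, templates))
```

Because the slow utterance is only a stretched copy of the "one" template, the warped alignment costs nothing, whereas a direct sample-by-sample comparison would not even be defined for sequences of different lengths.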

Industrial Applications

Some of the main industries that make use of speech recognition technology are the automotive, defense, and healthcare industries. In the automotive industry, for example, as the concept of “smart” cars becomes more widespread, car manufacturers are using speech recognition technology to let drivers give simple vocal commands for basic functions such as initiating phone calls, changing the radio station, or playing music from an external device. While these applications are basic for now and vary from model to model, they have the potential to become more sophisticated and to encompass functions that are not currently available. In the defense and military sector, several advanced research programs are exploring the potential of speech recognition systems in certain aircraft. For example, the Eurofighter Typhoon, an aircraft currently in use by the Royal Air Force, includes a system that allows pilots to perform basic cockpit functions by voice. The system not only reduces the pilot’s manual workload but also allows pilots to assign targets to themselves or their squadron with a few simple commands. Air traffic control is also increasing its use of speech recognition technology through training systems that let controllers simulate live conversations with pilots, which has drastically reduced the need for training and support personnel. The healthcare industry has implemented speech recognition technology in both front-end operations, where dictated words appear as text immediately, and back-end operations, where recorded dictation is transcribed afterwards and then reviewed.

Retail Applications

The use of speech recognition technology has spread not only through industry but into everyday applications as well. Mobile phones, for example, have speech recognition functions that allow users to call phone numbers, search the internet, and perform other basic functions. Gaming and simulation technology, too, makes heavy use of speech recognition. One of the most useful applications, however, has been in education, where it can be used to learn a second language. In addition to providing practice in pronunciation and speaking, it helps in understanding the differences between accents and languages, a process that in turn helps improve the technology itself. Students who are visually impaired, or who are unable to write due to injury, can also use this technology to complete assignments and participate normally in class. Speech recognition technology has also been reported to help students with learning disabilities become better writers, by letting them speak what they mean to write and taking punctuation, spelling, and grammar out of the way. Other notable uses of speech recognition technology are courtroom reporting, digital transcription, and interactive voice response functions on telephone systems.

Performance and Accuracy

The main measure of accuracy in speech recognition software is the percentage of words successfully transcribed from arbitrary spoken samples. Although this is a simple measure, it is always important to put a specific success rate in the context of the task at hand. For example, it is intuitively clear that as the vocabulary of the speaker grows, error rates should increase as well, for there are more words for the software to confuse with one another. Also, if the individual speaks in broken speech, with pauses between words (as was required by the very first speech recognition systems), it is much easier to separate the words being said, which increases success rates in such tasks. In addition, software that is “trained” on a specific individual’s voice can be expected to have a higher success rate than a speaker-independent system. Finally, there can be significant differences in success rates between reading a prepared text and speaking spontaneously, for the latter often includes filler words such as “uh” or “um” that can throw off the software and decrease success rates. In spite of the difficulty of establishing a universal standard by which to judge speech recognition systems, it is safe to say that their accuracy has been increasing over the years. Whereas the first systems, introduced in the 1960s, had vocabularies limited to fewer than 100 words and success rates as low as 60%, current systems, such as Google’s or Apple’s Siri, have vocabularies many orders of magnitude larger and still maintain high success rates of close to 80%.
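
The percentage described above can be made concrete with a short sketch. One common way to count mistakes, assumed here rather than taken from the text, is the word-level edit distance: the smallest number of substituted, inserted, and deleted words needed to turn the software’s transcript into the reference text. The function names and the example sentences below are illustrative only.

```python
def word_errors(reference, hypothesis):
    """Return (errors, reference_length) using word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edits needed to turn the first j hypothesis words into
    # the reference words processed so far
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # reference word was dropped
                curr[j - 1] + 1,                       # extra word was inserted
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """Return 1 minus the word error rate for a single utterance."""
    errors, n_words = word_errors(reference, hypothesis)
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - errors / n_words


reference = "call the office at nine"
transcript = "call the office uh at five"   # one filler inserted, one word wrong

errors, n_words = word_errors(reference, transcript)
print(errors, n_words)
print(f"{word_accuracy(reference, transcript):.0%}")
```

Note that under this definition inserted words, such as a transcribed “uh”, count as errors, so spontaneous speech is penalized in exactly the way the paragraph describes, and a transcript with many insertions can even score below zero.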

Research and Funding

There is much active research in the field of speech recognition, both in academia and in industry. Furthermore, the defense departments of different countries (in particular the United States Department of Defense) have taken an interest in developing the technology. The Defense Advanced Research Projects Agency (DARPA) achieved earlier breakthroughs under its Speech Understanding Research (SUR) program, which ran from 1971 to 1976 and led to the creation of Carnegie Mellon’s “Harpy” software, but current research has largely replaced the methods used then. Two more recent projects under DARPA funding, EARS and GALE, have once again pushed the boundaries of speech recognition technology. DARPA’s Effective, Affordable, Reusable Speech-to-Text (EARS) project aims to create a much more robust speech recognition technology with greater accuracy and efficiency than currently used transcription methods. In particular, the project focuses on core technology for transcribing everyday, natural human communication across multiple languages. DARPA’s Global Autonomous Language Exploitation (GALE) project expands upon EARS by aiming to produce a system that automates the transcription of human speech in multiple languages, particularly in the form of newscasts. Furthermore, GALE aims to categorize this data and make it available to human queries.

Future of Speech Recognition

There are many possibilities for where the field of speech recognition may branch in the future, but if the current trend continues, applications of this technology will become more widespread as its accuracy increases. The future of speech recognition is intrinsically tied to the closely related field of Artificial Intelligence: whereas speech recognition aims to solve the problem of language transcription, research into programs that actually understand the higher meanings of everyday language is still in its nascent stages. In fact, creating algorithms that deeply understand everyday language is one of the most difficult problems in Artificial Intelligence, a field whose aim is to create machines that act as intelligent agents. Although it may seem that Google searches understand a fair amount of what one asks, there is still much work to be done before algorithms are able to understand and paraphrase texts in an intelligent manner. Once researchers have cracked this problem, and computers are able to deeply understand what we tell them, then perhaps we will not be far from the future painted by the recent movie “Her”, in which meaningful and seamless interaction with artificial agents is commonplace.

