Friday, February 27, 2015

Voice - Part 1


Everything starts as somebody's daydream. - Larry Niven

One of the most long-lasting and influential phenomena to come out of the 1960s was, of all things, a rather short-lived television program. Though the original broadcast of Star Trek was limited to three seasons and 79 episodes on the NBC network between 1966 and 1969, its subsequent effects radiated through most of American culture and especially its engineering, technology and applied science segments.

Standard props on the show - the tricorder, phaser, communicator, transporter and others - turned into actual research programs in corporations such as Motorola, Xerox and IBM. These were bold objectives indeed, yet it was a time for daring and audaciousness. By today's measures, the spacecraft of the Mercury, Soyuz, Gemini and Apollo programs incorporated circuits and systems that were the equivalent of stone knives and bearskins. Yet from 1969 to 1972 and with such crude technology, Neil Armstrong and another 11 astronauts after him walked on another world.


This was an adventurous, even fearless generation. To them, going to the moon was not the slightest bit crazy, but was instead something which, though very difficult, they had to find a way to do. "Where there's a will, there's a way" was not a trite and hackneyed credo with them. Furthermore, once the moon was evidently within reach, it seemed that a future three decades hence as presented by Kubrick's "2001" - where spacefarers would routinely journey to near orbit and Luna while sending expeditions to the far reaches of the solar system - was not an outlandish proposition, but eminently reasonable.

It was also a generation with a clear vision and sense of purpose - a generation that had grown up during the Great Depression and made their first mark on the pages of history during the Second World War, determined to leave those terrible times fading in the dust glimpsed from a rear view mirror as they blazed a path to the future in order to forge a better tomorrow for their children and grandchildren. Their aspirations have been passed down to those who toil in the labs at Google and Apple, the cube farms in Cisco and Qualcomm and Intel, the hallways and meeting rooms of Broadcom and NXP and Lattice and Microsoft and Mediatek and so many other firms - where people are inured to the burden of sacrificing the pleasures of the moment, toiling to design, build, optimize and create wonders.

In addition to Star Trek, the 1960s and 1970s saw an explosion of science fiction in both film and literature, exploring an enormous variety of topics - among them things such as space travel, communications and... robotics. We began our own examination of this last subject back at the beginning of the year with the 1/16 and 1/23 editorials ('Hephaestus' and 'Argus'). Those previous posts focused on Machine Vision. Today, we will examine another capability under development that will be de rigueur for the full realization of future androids: Voice Recognition & Activation.

Naturally, the arts have already preceded us in this. For instance, Star Trek anticipated that verbal communication with computing devices would be a normal and even essential facet of tomorrow's technology-dominated world - and would not be without its peculiarities and foibles:

Such problems can be left for another day. What scientists, researchers and engineers are focused on at the moment is creating the first truly functional voice recognition architectures from basic principles and building blocks - a task which is proving to be anything but simple and straightforward.

As everyone who has studied transcripts of tape-recorded speech knows, we all seem to be extremely reluctant to come right out and say what we mean—thus the bizarre syntax, the hesitations, the circumlocutions, the repetitions, the contradictions, the lacunae in almost every non-sentence we speak. - Janet Malcolm

There is already a broad variety of voice recognition systems deployed today in a diversity of applications. Some need to be 'trained' to understand a given speaker's voice and speech patterns due to variations in tone, inflection and whatnot. Such systems are frequently employed for identity verification. More sophisticated systems are speaker-independent. Many are able to translate voice inputs into text. The range of accepted words for any given system can vary greatly - from as few as 50 to 100,000 or more.

Research in voice recognition is not a new field - it has, in fact, been underway for over three generations. The biggest breakthroughs, however, are quite recent, having begun to appear towards the end of the last decade. Today's ubiquity of computing power and storage - both standalone and distributed - has directly instigated the latest leaps and bounds of progress in the field.

Voice recognition is, at its heart, an issue of audio signal processing. Thus, it is a problem defined by its mathematics. Much of today's voice recognition software is based on the work of Andrey Markov, the brilliant Russian mathematician of the late 19th and early 20th centuries. Markov produced a base model for random systems wherein the future state of the system depends only on the current state and, given that current state, is independent of any previous state. There are actually four models, depending on whether the system is fully or only partially observable and on the level of control over it (autonomous or controlled): the Markov chain, the hidden Markov model, the Markov decision process and the partially observable Markov decision process.
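To make the Markov assumption concrete, here is a minimal Python sketch of the simplest of these models, a fully observable chain with no control. The two states ("silence" and "speech") and every probability in it are invented purely for illustration and are not taken from any real recognizer. The point is only that tomorrow's distribution is computed from today's alone, with no memory of anything earlier.

```python
# Hypothetical two-state Markov chain. The state names and all of the
# probabilities are made up for illustration only.
STATES = ["silence", "speech"]

# TRANSITION[current][next] = probability of moving from current to next.
TRANSITION = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}


def step(distribution):
    """Advance a probability distribution over STATES by one time step.

    Only the current distribution is needed; earlier history plays no
    part, which is exactly the Markov property.
    """
    return {
        nxt: sum(distribution[cur] * TRANSITION[cur][nxt] for cur in STATES)
        for nxt in STATES
    }


now = {"silence": 1.0, "speech": 0.0}  # we start out certain of silence
one_step = step(now)
two_steps = step(one_step)

print(one_step)
print(two_steps)
```

Each call to step() is just a weighted sum over the current state, so the chain can be run forward as far as one likes without ever storing its past.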

One very well known manifestation of Markov's models is the Viterbi algorithm. Widely used in communications applications including wireless and broadband processing, the Viterbi decoder computes the most likely sequence of hidden states given a sequence of observations. Another is the Baum-Welch algorithm. Used when the state of a system is only partially observable, it estimates the model's parameters (its transition and emission probabilities) from observed data, so that the most probable hidden states can then be inferred.
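For readers who want to see the idea in motion, below is a small Python sketch of the Viterbi algorithm on a toy hidden Markov model. It is a teaching illustration under stated assumptions, not how any production recognizer is written: the hidden states ("silence", "speech"), the observations ("low" or "high" audio energy per frame) and all probabilities are hypothetical, and real systems work in log-probabilities over far larger models.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden-state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the state that path was in at time t - 1)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path


# Hypothetical toy model: every number below is invented for illustration.
states = ["silence", "speech"]
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

prob, path = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
print(path, prob)
```

With these made-up numbers, the decoded path for the frames low, high, high is silence, speech, speech. Baum-Welch addresses the complementary question of where tables such as trans_p and emit_p come from in the first place, by fitting them to recorded observations.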

What these probabilistic approaches have in common is that they are used on complex and unpredictable systems and, starting with either full or partial information, try to predict likely outcomes going forward in time. For these methods to have any hope of working properly, there need to be boundary conditions such that the systems under scrutiny lend themselves to the determination of potential patterns. In addition to speech recognition and data communications, these algorithms are also employed in the analysis of protein molecules and their potential chemical reactions & properties, as well as genetics, handwriting recognition and cryptography.

Embracing the philosophy that one learns by doing, there are already quite a few companies deploying voice recognition technology for a wide selection of market segments. Apple's Siri is the company's first commercial attempt at a smartphone AI. Initial reception of its voice recognition technology was - to put it as politely as possible - rather mixed. Some of my friends had very pointed comments to make about the first version of Siri which, for the sake of propriety, I will not quote. 

Nevertheless, Apple has continued work on Siri, both as an AI and on its voice recognition portion. The company stands a better chance than most of eventually producing a product which will change the mobile computing sector. However, Apple will almost certainly have to contend with competition, as both Google and Microsoft are putting a great deal of research effort into their own programs. Whoever can demonstrate significant added value with this technology will have a major effect on the entire mobile computing sector, so developments will bear watching.

Voice recognition has already found its way into the automotive market. My 2009 Acura TL uses a system with a limited set of commands for making cellphone calls. Frankly, I'm not too happy with it, but it's almost 6 years out of date. More current models are undoubtedly much more capable (in fact, they would have to be.)

One sector that has long been an avid proponent of voice activation technology is the military.

Not just the Pentagon, but almost all major western mil/aero companies as well as defense and intelligence ministries have been pursuing this technology energetically for quite some time:

There are also many other market and application segments which stand to be revolutionized by voice recognition - air traffic control, medical applications for the deaf (taking advantage of voice to text translation), court reporting, hands-free computing, robotics and others. Nonetheless, adoption of voice recognition & activation by these and other sectors continues to be hampered by lingering and seemingly intractable problems with the technology.

We have scotch'd the snake, not killed it. - Shakespeare, "Macbeth"

There are environmental factors that play a part, such as location acoustics or background noise. One of the major obstacles is vocabulary. The larger the list of supported words, the higher the error rate. For very large lists, the error rate becomes a fatal handicap. Non-verbal vocal phenomena (sneezing, coughing, laughing) can play havoc with word prediction. Furthermore, speech discontinuities, long periods of silence or speakers who rush and jumble their words can defeat even the most carefully crafted speech parsing and recognition algorithms.

There is an additional layer of complexity stemming from the fundamental nature of human speech and language in general. In any language, there are words or letters that sound similar or even identical to each other, presenting a very difficult problem indeed for current voice recognition technology. Certain words do not predictably follow one another, depending on the concept being expressed, frame of reference or speech context. Some languages are worse than others in this regard. There are further nuances to voice communication which convey feelings of insecurity, stress, uncertainty, growing interest and so on, all of which affect context and thus meaning.

Despite decades of R&D, the failure of current voice recognition systems to perform as envisioned by their developers can no longer be characterized as a problem of insufficient computational resources. It is, in fact, the underlying mathematical modelling which is at the heart of it.

The approaches used in speech recognition heretofore described are all probability-based. As commonly implemented, they assume speech patterns emerge that follow a Gaussian distribution over time. Yet as we can now see from the above description of the many parameters that comprise 'the spoken word', what a given individual says at one time or another is dependent on a huge number of variables whose relative weight varies over time. Stated differently: speech is very complex. It is thus not a linear problem, but distinctly non-linear.

This realization has begun to pervade the research community and is dramatically changing the mathematical tools and methods being applied to voice/speech recognition. The details regarding these new developments, though, will have to wait for a future editorial. ;-)