Friday, March 6, 2015

Voice - Part 2

Source: thekenyonthrill.com

Speech is but broken light upon the depth
Of the unspoken. - George Eliot


There is a valley in Utah, 20 miles north of the northeastern edge of the Mojave Desert, known as Cedar Valley. With high mountain ranges to the east and southwest, the valley is wide & verdant and, at 5800 feet in elevation, catches a fair amount of snow in the winter. Consisting mostly of open grassland with a forested fringe, Cedar Valley seems like the very archetype of horse and cattle country.

There's a cute little town on the eastern edge of the valley called, appropriately enough, Cedar City. The town is the home of Southern Utah University and is a reasonably active tourist destination. Farther east in the mountains is the beautiful Cedar Breaks National Monument, and a little ways down I-15 to the south is a little-used northern entrance to Zion National Park with a truly spectacular drive along red cliff canyons.

But in Cedar City itself, there is a summer event which attracts a different kind of tourist and makes the town truly unique. This event is the Utah Shakespeare Festival. There are several different plays annually, with the productions staged in a re-creation of the famous Globe Theater. 

I've seen a good dozen of Shakespeare's plays at this festival. The cast changes every year, but the organizers consistently assemble a fine acting troupe for the performances. Nevertheless, attending the shows always requires several weeks of preparation on my part - if I don't read the plays beforehand, I have the devil of a time understanding what's going on.

The totality of what one views in these plays, though, cannot be understood from merely reading the text of each folio. There is so much more that goes into the staging and enactment of any of these plays - the costumes, the stage & background scenery, the gestures, accents and postures of the players, and their interactions with each other all communicate so much more than the words printed in their scripts could ever hope to do.

One understands only in proportion to becoming himself that which he understands. - Kierkegaard

What this starkly illustrates is that speech is often framed by a great deal of context which alters & enhances both its meaning and significance. Gestures; posture; the relative positions of speaker and audience; timbre and intonation reflecting anger, sarcasm, humor, stress, confusion or insecurity; the placement of a word, phrase or sentence within a conversation; the emphasis given to any word or phrase; facial expressions; the state of a speaker's eyes (shape, pupil size, direction of gaze, etc.); and even what the speaker is wearing in terms of clothing, makeup, jewelry or other accoutrements - all bear upon the understanding of the spoken word. Through these cues, an individual can even distinguish between two words which sound the same, or among the multiple potential meanings of a single word.

From this we can begin to comprehend why the voice recognition methods we scrutinized last week - the ones based on Markov Model approaches - inevitably experience escalating failure rates as the volume and range of words they attempt to recognize increases. Recall that the various Markov models and associated audio codecs capture and filter audio input, then attempt to match that input against a 'library' of audio & voice data. The matches are expected to be partial, with the probability of an actual match characterized by a Gaussian distribution.
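
To make that concrete, here is a minimal sketch - not drawn from any production recognizer - of what Gaussian-scored matching against a template library looks like. The one-feature-per-word 'library' and its numbers are entirely my own simplification:

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log-probability density of observation x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical 'library' of templates: each word is summarized by the mean
# and variance of a single acoustic feature (a gross simplification).
library = {
    "yes": (4.2, 0.5),
    "no":  (1.1, 0.3),
}

def best_match(observation):
    # Matches are partial: candidates are ranked by Gaussian likelihood
    # rather than any demand for an exact hit.
    scores = {word: gaussian_log_likelihood(observation, m, v)
              for word, (m, v) in library.items()}
    return max(scores, key=scores.get)

print(best_match(3.9))   # -> 'yes' (the closest template under its Gaussian)
```

Real systems score sequences of multi-dimensional features against mixtures of Gaussians inside the Markov framework, but the ranking-by-likelihood idea is the same.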

The casino is the only human venture I know where the probabilities are known, Gaussian (i.e., bell-curve), and almost computable. - Nassim Taleb, "The Black Swan"

Current conceptions are, then, completely linear in their approach to voice recognition. The Markov methods, in their linearity, attempt to emulate human audio recognition without sufficient additional data input or context. Such solutions are inescapably limited in their dimensionality and, as a consequence, highly restricted in their applicability. It is thus inevitable that once these systems employ word lists in the tens of thousands or more, the frequency and magnitude of the errors they generate overwhelm their inherently bounded functional range.

In response to these limitations, researchers are delving into alternative approaches built on entirely different mathematical principles. These research efforts go well beyond the Markov Model techniques and draw on the emerging field of Machine Learning. They are, in fact, attempting to re-create how the human brain and auditory apparatus work regarding speech. Thus, it's no longer a purely mechanical effort to emulate or imitate human speech understanding, but an endeavor to develop a synthetic analogue to the whole biological system.

The determining factor which distinguishes the new research paths from the old is that the latest work is based on nonlinear mathematical principles. What, though, does "nonlinear" specifically mean? 

In a sense, the concept is rather simple - a nonlinear system is one where the output is not directly proportional to the input. The output may appear random, but it is not. It is simply chaotic, or more accurately, extraordinarily complex, as it depends on multiple variables whose values, weights and interactions can change dynamically and are highly sensitive to initial conditions. Stated differently: a nonlinear system or phenomenon does indeed form patterns, but those patterns can manifest themselves differently, even drastically so, based on initial conditions. A classic example would be the weather and the notorious butterfly effect.
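
The textbook demonstration of this sensitivity is the logistic map, x(n+1) = r·x(n)·(1 - x(n)). Here is a tiny sketch - the step count and starting values are arbitrary choices of mine - showing two trajectories that begin a hair apart and end up bearing no resemblance to each other:

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r*x*(1-x), a textbook nonlinear system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.20000)
b = logistic_trajectory(0.20001)   # initial condition differs by only 1e-5

# After a few dozen iterations the two trajectories diverge completely -
# the change in output is wildly disproportionate to the change in input.
for n in (0, 10, 30, 50):
    print(n, round(a[n], 5), round(b[n], 5))
```

Note that nothing here is random: the same starting value always yields the same trajectory. The system is deterministic, merely exquisitely sensitive.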

Here's a little thought exercise on how nonlinear systems can, from very small beginnings, produce extraordinary outcomes. Let's say you write a program for your computer which randomly generates white dots on your display. You would expect that, after a period of time, your screen would fill up with dots and would have gone from a uniformly black screen to a white one. This is, in fact, what would happen.

However, let's take this up just a tiny bit in complexity. Instead of a white dot, we'll write the program to instantiate a small white line just a few pixels long. We'll tell the computer to draw these little white lines randomly, but with two caveats:
1. The lines cannot overlap.
2. Any new line must connect at right angles to an endpoint of at least one already existing line.
With that in mind, you would expect the screen to fill up with little white squares, right? But that's not at all what happens. Instead, you get a pattern that looks very much like this:

[Image: the resulting pattern]
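For the curious, here is a minimal sketch of that thought experiment, with an ASCII grid standing in for the display (the grid size, segment length and center seeding are my own choices). Running it produces a branching, maze-like tangle - nothing like a field of neat squares:

```python
import random

SIZE = 79       # width/height of our ASCII 'screen' (an assumption)
LINE_LEN = 4    # each little line is just a few pixels long
grid = [[' '] * SIZE for _ in range(SIZE)]
endpoints = []  # (x, y, is_horizontal): an endpoint plus its line's orientation

def try_draw(x, y, dx, dy):
    """Draw a LINE_LEN segment growing out of (x, y); fail on any overlap."""
    cells = [(x + i * dx, y + i * dy) for i in range(1, LINE_LEN + 1)]
    for cx, cy in cells:
        if not (0 <= cx < SIZE and 0 <= cy < SIZE) or grid[cy][cx] != ' ':
            return False            # caveat 1: lines cannot overlap
    for cx, cy in cells:
        grid[cy][cx] = '#'
    endpoints.append((cells[-1][0], cells[-1][1], dx != 0))
    return True

# Seed: a single pixel in the center acts as the first 'endpoint'.
grid[SIZE // 2][SIZE // 2] = '#'
endpoints.append((SIZE // 2, SIZE // 2, True))

for _ in range(50000):
    x, y, horizontal = random.choice(endpoints)
    # Caveat 2: a new line must meet an existing endpoint at a right angle.
    dx, dy = random.choice([(0, 1), (0, -1)] if horizontal else [(1, 0), (-1, 0)])
    try_draw(x, y, dx, dy)

print('\n'.join(''.join(row) for row in grid))
```

Two trivially simple rules, applied to random inputs, yield structure instead of uniform fill - which is the whole point of the exercise.
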
Many scientists have come to believe that the way humans receive, process and produce voice data, with all its nuance and context, can best be grasped conceptually within a nonlinear mathematical framework. These researchers are pursuing a variety of ideas to apply nonlinear/complex/chaotic precepts to voice recognition.

As we examine this in the section below, we must do so with the following shared understanding: the overview in this editorial can do no more than just brush the surface of this incredibly complicated and rapidly evolving topic. The mathematics and their expression in hardware and software are much deeper than can be properly explored within the confines of this blog, and there are verifiable geniuses who spend their entire lives delving into the intellectually entangling details of these matters. Nevertheless, the overview, shallow as it is, should still prove to be enlightening and stimulating.

For most of my life, one of the persons most baffled by my own work was myself. - Benoit Mandelbrot

The sector of Machine Learning that appears to be attracting the most attention at the moment is widely referred to as "Deep Learning." It is sometimes called "Deep Neural Networks", as its basis is derived from a specialized application of Artificial Neural Networks - a field that has arisen from AI (artificial intelligence) research.

Artificial neural networks have proven promising for identifying words. However, they have also shown themselves to be weak at dealing with sentences or long stretches of speech because of their inability to account for temporal dependencies (something Markov models and audio codecs handle quite well).
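
One way to see that weakness: a plain feedforward network scores each acoustic frame in isolation, so it literally cannot tell whether the frames arrived in order or reversed. A toy illustration - the network and the 'frames' here are random stand-ins, not real speech data:

```python
import numpy as np

rng = np.random.default_rng(3)
frames = rng.standard_normal((5, 4))     # five 'acoustic frames', oldest first
w = rng.standard_normal((3, 4))          # a simple frame-by-frame classifier

def frame_scores(f):
    return np.tanh(w @ f)                # no memory of any previous frame

# Because each frame is scored in isolation, reversing the sequence leaves
# the combined score untouched - the network has no notion of time.
forward_total  = sum(frame_scores(f).sum() for f in frames)
backward_total = sum(frame_scores(f).sum() for f in frames[::-1])
print(np.isclose(forward_total, backward_total))   # True
```

A Markov model, by contrast, scores a sequence through its transition probabilities, so order matters by construction.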

In Deep Learning, a family of Machine Learning algorithms applies nonlinear models to very large data sets - in particular, data sets which grow over time. The inspiration for the approach sprang from observations on the response of neurons in the brain to stimuli from the human senses - hence the interest from both speech recognition researchers and developers of Machine Vision (which we examined in January).

Here, in a nutshell, is how it is intended to work:
Algorithms are applied to a large and deep data set and - using pattern recognition along with statistical methods - reduce the data to a much smaller representation containing what are hoped to be its vital components. This is called 'feature extraction.' Note the conceptual similarities to image and video processing.
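
As an illustration only - real front ends are far more sophisticated - here is a sketch of feature extraction that reduces raw audio samples to a handful of log band-energy numbers per frame. The frame length and band count are arbitrary assumptions:

```python
import numpy as np

def extract_features(signal, frame_len=256, n_bands=8):
    """Reduce raw audio to per-frame log band energies - a toy stand-in
    for the 'feature extraction' step described above."""
    n_frames = len(signal) // frame_len
    features = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum
        bands = np.array_split(spectrum, n_bands)        # crude 'filterbank'
        features.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(features)   # shape: (n_frames, n_bands) << len(signal)

# One second of fake 'speech' at 8 kHz: 8000 samples -> 31 frames x 8 numbers.
rng = np.random.default_rng(0)
feats = extract_features(rng.standard_normal(8000))
print(feats.shape)
```

The essential move is the same as in image processing: throw away the bulk of the raw data while keeping a compact representation of what (hopefully) matters.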

These extracted features can be further modified with applied weights and thresholds. Sometimes these factors are preset (such as in Deep Belief Networks, an artificial neural network variation). In other instances, they can be created and dynamically modified with discriminative algorithms. Both can be done together by starting with a preset that is then modified through experience (with the preset serving as an analogue to basic human instinct). It depends on the approach being pursued and the phenomena under study.
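
A toy rendering of that idea: a single artificial neuron starts with preset weights and a threshold (the 'instinct' analogue), and a simple perceptron-style discriminative rule then reshapes the weights through experience. All the numbers here are arbitrary:

```python
import numpy as np

# A single artificial 'neuron': a weighted sum passed through a threshold.
def fire(x, w, threshold):
    return 1 if np.dot(w, x) > threshold else 0

# Preset weights and threshold - the starting 'instinct'...
w = np.array([0.5, 0.5])
threshold = 0.6

# ...then modified through experience with a simple discriminative rule:
# nudge the weights whenever the neuron's output is wrong.
training = [(np.array([1.0, 0.0]), 1),   # should fire
            (np.array([0.0, 1.0]), 0)]   # should stay quiet
lr = 0.1
for _ in range(20):
    for x, target in training:
        error = target - fire(x, w, threshold)
        w += lr * error * x

print(w)   # the preset has been reshaped by the data
```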

This process can be applied to different yet related data sets, or to the same data set with varying rigor and repetition. The features extracted can then be pooled, connections can be formed between them, and different feature sets can even be organized hierarchically. This builds what we can conceptualize as layers of abstraction, which can be used collectively to form higher, more complete representations from the minimized/reduced sets.
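
Conceptually, the hierarchy amounts to feeding the features extracted at one level into the next level as its raw input. A minimal sketch with made-up layer sizes (pooling and cross-connections omitted for brevity):

```python
import numpy as np

def layer(x, w, b):
    """One layer: a linear combination of features plus a nonlinearity."""
    return np.tanh(w @ x + b)

rng = np.random.default_rng(1)
sizes = [32, 16, 8, 4]   # assumed hierarchy: 32 raw inputs -> 4 abstractions
weights = [rng.standard_normal((n_out, n_in)) * 0.5
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.standard_normal(32)      # e.g. a reduced feature set from the step above
for w, b in zip(weights, biases):
    x = layer(x, w, b)           # each pass yields a higher layer of abstraction

print(x.shape)                   # (4,) - a compact, layered summary of 32 inputs
```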

Note that the interactions between feature sets - each at a different level of the hierarchy and with its own weights and thresholds - do not necessarily produce a linear output. In fact, the whole approach is intended to support nonlinear outputs, where the whole is not only potentially greater than the sum or product of the parts, but completely disproportional to them and not necessarily the same every time. The outcome depends on the dynamic interactions of the feature sets and their individual weighted characteristics, which can themselves be dynamic.
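
That disproportionality is easy to demonstrate: in a network with a nonlinear activation between layers, doubling the input does not double the output. A quick check with random, purely illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(2)
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((1, 8))

def net(x):
    """Two layers with a tanh in between - a small nonlinear system."""
    return (w2 @ np.tanh(w1 @ x)).item()

x = rng.standard_normal(4)
print(net(2 * x), 2 * net(x))   # not equal: output is not proportional to input
```

Strip out the tanh and the two printed numbers become identical - the nonlinearity is what breaks the proportionality.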

One can easily conjecture how this implies enormous computational burdens, both in terms of logic and memory, not to mention speed, power and storage. How is this done as a practical matter?

Source: asharperfocus.com

It turns out that GPUs are a very popular choice. Current arrays include an enormous number of parallel processing units with local store memory that can execute iteratively and in parallel. Such high-performance multi-processor architectures, which inherently support floating point operations, are a natural fit.

FPGAs, however, may also have a significant role to play in this field. Even the most advanced of these chips cannot match the raw data throughput of GPUs, but they bring some hidden advantages. First and foremost, FPGAs have more feature-rich architectures. By their very nature, leading FPGAs have both bit-oriented and word-oriented programmable blocks and can support logic, DSP and CPU operations simultaneously. Surprisingly, they may also have some distinct power advantages - a shocking realization, considering that we're talking about programmable logic. Depending on the application, several FPGAs working in tandem might be just as computationally effective as a GPU, with reduced power to boot.

Yet the applicability of existing microelectronic technology should not obscure the fact that both machine vision and voice recognition represent a rising tide of disruptive change for High Tech. Of particular significance is that the relationships between computing variables are in a continual state of flux.

What makes these relationships, along with their shifting weights and thresholds, so dynamic? It is a fundamental aspect of artificial neural networks, as their purpose is to interact with real-world data and learn from it. And, as we have seen in today's overview of voice recognition, the problem is actually greater than that of a machine teaching itself to recognize sounds and words. Even the more limited task of voice recognition opens up a whole new order of complexity, as this sort of layered learning requires an interpretive faculty. Such interpretation takes intelligence - in this case, an Artificial one.

But THAT discussion, my dear readers, will have to wait for a future post. ;-)