Multi-Microphone Signal Acquisition for Speech Recognition Systems

by

Kevin Fink

EE 586 - Speech Recognition Systems

Prof. Holden

December 16, 1993


INTRODUCTION

Vocal communication is the most natural and quickest form of high-level communication for humans. It has long been an appealing goal to extend our abilities in this area to computers, so that we could communicate with them with the same simplicity and speed with which we communicate with other people. In addition, there are many circumstances in which vocal communication is the only safe way to communicate. For example, it is much safer for a driver to ask his/her car's computer to dial home than to manually dial a car telephone while driving.

Unfortunately, the situations in which speech recognition would be most helpful are often the ones in which it performs worst. The high level of ambient noise in a moving car, for example, renders laboratory speech recognition systems worthless. Laboratory systems work very well in the laboratory, where the speech signals to be recognized can be gathered in the absence of noise, other speakers, or other interfering signals. The next step is to modify these systems so that they also work in high-noise conditions.

Traditional speech recognition systems use a single microphone to collect the speech data to be analyzed. These systems have been successful in low-noise, single-source environments when designed and/or trained for a specific microphone. However, when the environment includes noise, interfering signals, or when a different microphone is used to acquire data than was used to train the system, the system's performance invariably degrades, usually drastically.

Many methods have been proposed to improve the robustness of speech recognition systems. These include training the system in a noisy environment, using noise masking techniques, modifying the parameter selection and computation process, improving segmentation techniques, using different types of microphones, and using multiple microphones with beamforming algorithms. Each of these areas will be summarized briefly; then the idea of using multiple microphones with beamforming algorithms will be explored in more detail. In addition, several recent conference papers will be reviewed to portray the current state of research in this area. Finally, a simple project will be described which utilizes beamforming techniques to reduce the amount of noise present in a simulated microphone signal.

NOISE, DISTORTION, AND ARTICULATION EFFECTS

In order to understand how to decrease the effects of noise on speech recognition systems, we must first understand what types of noise are present in typical speech recognition environments. Three types are commonly encountered: ambient noise, distortion, and articulation effects.

Ambient noise is a feature of any environment, but is especially a problem in the office environment and in vehicles. In a typical office environment the ambient noise includes office machinery sounds like typewriters, computer fans and disk drives, and telephones ringing, as well as sounds generated by people moving around and background conversations. The sound pressure level (SPL) in a typical personal office is around 45-50 dBA (noise criterion NC 40-45) (Juang, 1991). The SPL in a business office with secretarial staff could be 15-20 dB higher.

Vehicles are also high-noise environments. Several studies (Roe, 1987; Lecomte et al, 1989) showed that the signal-to-noise ratio (SNR) of speech signals recorded in a passenger car traveling at 90 km/hr with windows up and fan off could drop below -5 dB. Probably the noisiest environment where speech recognition would be useful is the cockpit of jet fighter planes. SPLs of 90 dB across the speech frequency band have been reported in such environments (Powell et al, 1987).

The frequency distribution of noise is also important. In the passenger car environment, for example, the low-frequency noise is as high as 95 dB but decays rapidly as the frequency increases, until the noise above 1000 Hz is relatively flat at about 45 dB.

In addition to ambient noise, the speech signal is subject to a variety of distortions. Reverberation off the walls and ceilings, the type of microphone used and its position and orientation, and transmission through a telephone network are sources of distortion. In an extreme case, the performance of a speech recognition system fell from 85% to 19% simply by changing the microphone from a close-talking type to a desk-mounted type (Acero & Stern, 1990).

Not only do different speakers sound different, but the same speaker can sound different under different conditions. Even the awareness of communicating with a speech recognition system can produce a significant change in formant positions and rhythmic stability (Lecomte et al, 1989). Even greater changes are produced when a speaker is talking in a noisy environment. This is referred to as the Lombard effect.

STRATEGIES FOR IMPROVING RECOGNITION IN THE PRESENCE OF NOISE

A variety of methods have been suggested to overcome the effects of noise on speech recognition systems. They can be broken into four general categories: feature classification, feature selection, signal processing, and signal acquisition.

The first class of methods for reducing noise effects involves modifying the algorithms which classify features into meaningful speech. The usual technique is to train the classifier under noise conditions similar to those in which it will be used. However, this assumes that the use environment is known and does not change over time, which is usually not a valid assumption. A better possibility is to use classification schemes which are less sensitive to noise.

Before classifying features, the speech recognition system must first select and compute those features. Linear predictive coefficients (LPCs) and cepstral coefficients are the most commonly used features. These tend to be very sensitive to noise, however, so a second class of methods deals with modifying or replacing these features with more noise-invariant ones. One proposal is to use the short-time modified coherence (SMC) of speech instead of a standard spectral representation. Another is to use computational models based on the human auditory system, which is quite good at ignoring noise.
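
To make these two feature types concrete, the following sketch computes LPCs for a single windowed speech frame by the autocorrelation method (Levinson-Durbin) and then converts them to cepstral coefficients with the standard recursion. It is an illustrative NumPy implementation, not the front end of any particular recognizer; the frame length and model order are arbitrary example values.

import numpy as np

def lpc(frame, order):
    # LPC by the autocorrelation method (Levinson-Durbin recursion).
    # Returns A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order and the
    # prediction-error energy.
    r = np.array([np.dot(frame[:len(frame)-k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    # Cepstral coefficients of the all-pole model 1/A(z), by the usual recursion.
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]          # c[0] (the gain term) is omitted here

# Example: a 30 ms Hamming-windowed frame at 8 kHz (synthetic data here).
frame = np.hamming(240) * np.random.randn(240)
a, gain = lpc(frame, order=10)
ceps = lpc_to_cepstrum(a, n_ceps=12)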

Signal processing techniques can be used to modify the raw, incoming speech waveforms in order to reduce the noise or emphasize the desired signal. Noise masking techniques and stress compensation fall in this category. In addition, if multiple microphones are used, noise cancellation and acoustical beamforming techniques can be used to reduce noise and focus on a specific signal.

Finally, the method of signal acquisition can be changed. This includes using specialized microphones, using multiple microphones, and using other types of transducers, such as accelerometers.

Using higher-level knowledge, such as grammar, context, or meaning, should also improve the accuracy of speech recognition in noise. The specific question of how higher-level knowledge integration affects noise performance seems to be largely unexplored at the moment, however, and so will not be addressed in this paper. Undoubtedly, an optimal speech recognition system would include concepts from many areas. The complementary (or possibly interfering) effects of these techniques are another unexplored area.

FEATURE CLASSIFICATION

If the spectral characteristics and level of the noise are known beforehand, the pattern classifiers can be trained using data corrupted with matching noise. A speech recognition system trained at an SNR of 18 dB, for example, will work better in an 18 dB noise situation than the same system trained using clean speech. However, its performance will fall off for other noise levels. In fact, its performance on clean speech will be much worse than that of the clean-speech-trained system. Dautrich, Rabiner, and Martin (1983) investigated this effect. They used an isolated word recognizer which had an accuracy of 95% on clean speech when trained with clean speech. The accuracy dropped to about 60% when it was used on speech corrupted by noise with an SNR of 18 dB. By training the system with 18 dB noise, the performance climbed to about 90%.

In a similar study, Fried and Cuperman (1989) tested a commercially available speech recognition system (which was not identified). They trained the system at four different SNRs: 15, 18, 21, and 24 dB, and tested it at a range of SNRs from 10 to 30 dB. The recognition accuracy was highest in each case at a recognition SNR slightly higher than the training SNR and decreased for other recognition SNRs. For example, the recognizer trained at 15 dB achieved a maximum of 90% recognition, but dropped to 70% for a recognition SNR of 30 dB. Similar results were reported for the other training SNRs. Thus the system would work fine if everyone in the office was talking, but wouldn't work once everyone went home!

FEATURE SELECTION

The two most commonly used features of speech are linear predictive coefficients (LPCs) and cepstral coefficients. Both of these features are sensitive to noise in the signal, as well as to the type of microphone used and other irrelevant factors. In addition, vector quantization is commonly used to reduce the number of coefficients and as a first step in classifying them. The more these coefficients change with noise, the more likely it is that the vector quantization process will incorrectly quantize the coefficients, resulting in large errors.

There are two approaches to improving features. The first is to modify the LPCs or cepstral coefficients to minimize the effect of noise. This can be done by introducing robust distortion measures. The idea behind robust distortion measures is that by defining an appropriate measure for robust speech recognition, one can use it to automatically emphasize (or lend more credence to) those features which have less distortion.

Spectral noise masking is a simple example. By measuring the amount of noise power in each frequency band, the bands with less noise can be emphasized and those with lots of noise can be ignored. Most features are not directly tied to spectral components, however, so applying distortion measures to them is less straightforward. Distortion measures discussed by Juang (1991) include the cepstral distance, the weighted (liftered) cepstral distance, the likelihood ratio distortion, the weighted likelihood ratio distortion, and the asymmetrically weighted likelihood ratio distortion. Matsumoto and Imai (1986) showed an improvement in speech recognition performance under 18 dB white noise from 60% to 90% word accuracy when using these measures. In addition, weighted cepstral measures have been found to improve clean-speech performance as well as noisy-speech performance. This makes them more appealing than the noise training techniques discussed previously, which had deleterious effects on clean-speech performance.
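
As a concrete illustration, the sketch below computes a liftered cepstral distance between two cepstral vectors using a raised-sine lifter, one commonly used weighting; the specific lifter shown is an assumption for illustration, not necessarily the exact measure used in the cited studies.

import numpy as np

def liftered_cepstral_distance(c1, c2):
    # Weighted (liftered) Euclidean distance between two cepstral vectors.
    # The raised-sine lifter w[k] = 1 + (L/2)sin(pi*k/L) de-emphasizes the
    # low-order coefficients (sensitive to spectral tilt and channel effects)
    # and the high-order coefficients (sensitive to noise).
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    L = len(c1)
    k = np.arange(1, L + 1)
    w = 1.0 + (L / 2.0) * np.sin(np.pi * k / L)
    d = w * (c1 - c2)
    return np.sqrt(np.sum(d * d))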

The second method to improve features is to discard LPCs and cepstral coefficients entirely. This can be done either from a signal processing viewpoint, or as an attempt to mimic the human auditory system.

Mansour and Juang (1989) proposed a replacement feature set called the short-time modified coherence (SMC) of speech. It takes advantage of the fact that adjacent segments of speech signals have high coherence, due to the finite speed at which the human vocal tract changes shape. In LPC analysis, each speech segment is modeled as the impulse response of an all-pole filter. In SMC analysis, the autocorrelation between adjacent speech segments is modeled instead of the actual speech segments themselves. Because the noise tends to be cancelled out in the autocorrelation process, these coefficients are less noise-dependent than LPC coefficients. The amount of improvement depends on the signal, but was found to be around 10-12 dB for SNRs between 0 and 20 dB. For the specific case of a speaker-dependent digit recognition test, a system with clean accuracy of 99.2% was found to degrade to 39.8% with LPC coefficients, but only to 98.3% using the SMC coefficients (Juang, 1991).

Since humans do a much better job at ignoring noise than any machine, another promising approach is to attempt to model the working of the human auditory system. Ghitza (1986) did this when developing his ensemble interval histogram (EIH) method, which models the auditory-nerve firing pattern. It uses 85 simulated cochlear filters which break up the signal into frequency bands from 200-3200 Hz, measurements of level crossing for each filter output, and accumulation of the histograms for the level-crossing intervals. This gives a result similar to a spectrum, but is non-linear and does not have uniform frequency spacing. It does seem to be much less affected by noise than the corresponding LPC spectrum (see Juang, 1991).

SIGNAL PRE-PROCESSING

The effects of noise on speech recognition can be reduced by removing some of the noise from the signal before features are extracted. Signal pre-processing deals with this area.

By using some estimate of the characteristics of the noise, such as noise power, the noise spectrum, or the signal to noise ratio, an improved spectral model of the speech can be obtained. Porter and Boll (1984) came up with a least-squares estimator of short-time independent spectral components. In it, the conditional mean of the spectral component is derived from the sample average estimator of clean speech rather than from an assumed parametric distribution. This is done by using a clean speech database and adding noise to it to make a noisy version of it. Then a function can be found which maps a noisy spectral component to a clean one. Under very rigid conditions (fixed and known signal and noise levels), this approach was able to reduce noise errors from 40% to 10% on a speaker-dependent digit recognition task.
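
The core idea here, learning a mapping from noisy to clean spectral components from a paired clean/artificially-corrupted database, can be sketched as below. This is only an illustration of a conditional-mean (sample-average) mapping for one frequency channel, not Porter and Boll's actual estimator; the bin count is an arbitrary choice.

import numpy as np

def learn_spectral_map(clean_mags, noisy_mags, n_bins=40):
    # For one frequency channel, estimate E[clean | noisy] by averaging the
    # clean magnitudes whose paired noisy magnitudes fall into each bin.
    # clean_mags and noisy_mags are paired 1-D arrays built from a clean
    # database and a copy of it with noise added.
    edges = np.quantile(noisy_mags, np.linspace(0.0, 1.0, n_bins + 1))
    which = np.clip(np.searchsorted(edges, noisy_mags) - 1, 0, n_bins - 1)
    cond_mean = np.array([clean_mags[which == b].mean() if np.any(which == b) else 0.0
                          for b in range(n_bins)])
    return edges, cond_mean

def apply_spectral_map(noisy_mag, edges, cond_mean):
    # Replace an observed noisy magnitude with the learned clean estimate.
    b = np.clip(np.searchsorted(edges, noisy_mag) - 1, 0, len(cond_mean) - 1)
    return cond_mean[b]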

Another approach was taken by Ephraim, Wilpon, and Rabiner (1987). Rather than just estimating the all-pole spectral model for the speech, they iteratively estimate both the spectral model and the short-time noise level. The iterations minimize the Itakura-Saito distortion between the noisy spectrum and a composite model spectrum (the sum of the estimated signal and noise spectra). The final spectral model is then used in the recognizer. This resulted in an improvement from 42% to 70% recognition accuracy on a 10 dB SNR isolated-word recognition test.

Noise cancellation is a technique which attempts to use multiple signal sources to get rid of some of the noise in a signal. The technique, in its simplest form, uses two microphones, one near the speaker and one away from the speaker. The microphone near the speaker will pick up both the desired signal and the background noise, but the other microphone will only pick up noise. By subtracting the second signal from the first (using an adaptive filter to delay and shape the reference appropriately), the speech signal is left alone while the noise is (at least partially) cancelled out. This approach requires that the noise reaching the two microphones be coherent, however, or it will not cancel. Experiments have shown that in many situations this assumption is not valid. For example, if two microphones are placed more than 50 cm apart in a car, the only coherent noise is that from the engine. Unfortunately, most of the noise in a vehicle comes from sources other than the engine. In order to cancel out 90% of the total noise energy, the microphones could not be more than 5 cm apart, making it impossible for one microphone to hear the speech and the other not to (Dal Degan & Prati, 1988). Therefore, this approach will work to get rid of low-frequency (such as engine) noise, but not high-frequency noise.
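
A minimal sketch of the two-microphone arrangement, using a normalized LMS adaptive filter (the filter length and step size below are arbitrary illustrative choices):

import numpy as np

def lms_noise_canceller(primary, reference, n_taps=32, mu=0.5):
    # Adaptive noise cancellation.
    #   primary:   microphone near the speaker (speech + noise)
    #   reference: microphone away from the speaker (correlated noise only)
    # The adaptive filter shapes the reference so that it matches the noise in
    # the primary channel; the error signal is the enhanced speech estimate.
    primary = np.asarray(primary, float)
    reference = np.asarray(reference, float)
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]          # most recent reference samples
        e = primary[n] - np.dot(w, x)              # speech estimate = error signal
        w += mu * e * x / (np.dot(x, x) + 1e-8)    # normalized LMS update
        out[n] = e
    return out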

SIGNAL ACQUISITION

The final way of reducing noise effects is to change the method of gathering the speech signal. In some environments headset microphones can be used. These can be very directional and so pick up very little noise from outside sources, since they are in a fixed position with respect to the speaker. However, headsets are not practical in all situations. They must be connected by a wire or radio link to the computer, they only work with one speaker, and they are fragile. A headset would not be practical for public use, such as at an ATM, for example.

Noise cancelling microphones are another possibility. In these microphones, both sides of the diaphragm are exposed. For sounds (presumably noise) coming from a distance, both sides see the same sound pressure and no net force is felt, so no signal is generated. For a speaker close to the microphone, however, the front side shades the back side and the signal is transduced. These work well if they are kept very close to the speaker's mouth and parallel to the sound waves. Their performance drops drastically, however, as the person moves around or looks in different directions.

In some situations, such as in jet fighter cockpits, the noise level is so high that microphones are virtually useless. In these situations other transducers, such as throat-mounted accelerometers, may be used.

ACOUSTICAL BEAMFORMING USING MULTIPLE MICROPHONES

A final possibility is to use a whole array of microphones placed in the speaker's vicinity. By using acoustical beamforming techniques, the microphone array can "focus" on the speaker's position. By knowing the speaker's position, the individual microphone outputs can be combined in such a way as to add the separate signal contributions while cancelling the noise contributions. In addition, the array can sense the changing position of a speaker and "follow" that signal source as the speaker moves around.

Acoustic beamforming is used principally in sonar and underwater imaging applications, where light doesn't travel far enough to generate useful pictures. Submarines tow arrays of hydrophones behind them to listen for ships and other submarines. By analyzing the signals from the hydrophones, the sonar operator can tell where the target is, which direction it's heading, and how fast it's moving.

The simplest type of beamforming uses the "delay and sum" concept. Each microphone's signal is delayed by an amount proportional to the distance between a known target and that microphone, so that the target's contribution is time-aligned in every channel. All of these delayed signals are then added together. As long as the noise is not coming from the exact same position as the desired signal, the noise components won't be coherent and thus won't add up. After summing N aligned channels, the signal amplitude grows by a factor of N (signal power by N-squared) while the incoherent noise power grows only by a factor of N, so the output SNR improves by roughly a factor of N.
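
In code, the delay-and-sum idea reduces to a few lines. The sketch below assumes the per-microphone propagation delays toward the known source have already been computed in integer samples, much as the geometric calculation in the simulation of Appendix A does.

import numpy as np

def delay_and_sum(mic_signals, delays):
    # Delay-and-sum beamformer with integer-sample delays.
    #   mic_signals: array of shape (n_mics, n_samples), raw microphone data
    #   delays:      per-microphone propagation delay from the source, in samples
    # Each channel is delayed so that the source components line up, then the
    # channels are averaged: the coherent source adds in amplitude while the
    # incoherent noise tends to average out.
    mic_signals = np.asarray(mic_signals, dtype=float)
    delays = np.asarray(delays, dtype=int)
    n_mics, n_samples = mic_signals.shape
    max_delay = delays.max()
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = max_delay - delays[m]        # extra delay needed for alignment
        out[shift:] += mic_signals[m, :n_samples - shift]
    return out / n_mics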

Improved performance can be attained by adding aperture shading to the system. Rather than simply summing the appropriately delayed microphone outputs, the signals are multiplied by different gain factors (or weights) before summing. This provides the effect of shading the aperture, giving the ability to trade between beamwidth and sidelobe attenuation. This is the analog of choosing a window shape in 1-D filter design. For example, a boxcar window gives the smallest beamwidth but highest sidelobes, whereas a Hanning window gives a larger beamwidth but smaller sidelobes.
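
The beamwidth/sidelobe trade-off can be visualized by computing the far-field array factor of a line array under different shading windows. The sketch below is illustrative; the element count, spacing, and frequency are arbitrary example values.

import numpy as np

def array_factor_db(weights, spacing, freq, angles_deg, c=343.0):
    # Far-field response (in dB, normalized) of a uniform line array steered
    # to broadside, for the given per-element shading weights.
    weights = np.asarray(weights, float)
    n = np.arange(len(weights))
    theta = np.radians(np.asarray(angles_deg, float))
    phase = 2j * np.pi * freq * spacing * np.outer(n, np.sin(theta)) / c
    response = np.abs(np.sum(weights[:, None] * np.exp(phase), axis=0))
    return 20.0 * np.log10(np.maximum(response / response.max(), 1e-6))

angles = np.linspace(-90.0, 90.0, 361)
af_boxcar = array_factor_db(np.ones(16), spacing=0.04, freq=2000.0, angles_deg=angles)
af_hanning = array_factor_db(np.hanning(16), spacing=0.04, freq=2000.0, angles_deg=angles)
# af_boxcar has the narrower mainlobe; af_hanning has the lower sidelobes.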

Beamforming algorithms other than delay and sum are also available, including the adaptive Frost and generalized sidelobe canceler (GSC) beamforming algorithms.

The adaptive Frost method (Frost, 1972) uses a constrained least mean-squares (LMS) algorithm to continuously adjust the signal weights. The algorithm is constrained to a chosen frequency response in the look direction, then iteratively adapts the weights to minimize the noise power at the output. This allows the array to adapt to changing noise characteristics.
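
A sketch of the constrained LMS update for a presteered array is shown below (the channels are assumed to be already delayed toward the look direction, so the constraint is simply that the weights summed across sensors equal a chosen look-direction response, here unit gain). This follows the general form of Frost's algorithm but is only an illustration; the tap count and step size are arbitrary.

import numpy as np

def frost_beamformer(presteered, n_taps=8, mu=1e-3):
    # Frost constrained LMS beamformer (sketch).
    #   presteered: (n_sensors, n_samples), channels already time-aligned on
    #               the look direction.
    # The constraint C^T w = f forces the sensor-summed tap weights to equal a
    # chosen look-direction response f (here unit gain, zero delay).
    K, n_samples = presteered.shape
    J = n_taps
    # Constraint matrix: column j selects tap j of every sensor
    # (weights are stacked sensor-major: w[k*J + j]).
    C = np.zeros((K * J, J))
    for j in range(J):
        C[j::J, j] = 1.0
    f = np.zeros(J)
    f[0] = 1.0
    CtC_inv = np.linalg.inv(C.T @ C)          # equals I/K for this C
    F = C @ CtC_inv @ f                        # quiescent (delay-and-sum) weights
    P = np.eye(K * J) - C @ CtC_inv @ C.T      # projects orthogonally to the constraints
    w = F.copy()
    y = np.zeros(n_samples)
    for n in range(J - 1, n_samples):
        # Stacked data vector: the J most recent samples from each sensor.
        x = presteered[:, n - J + 1:n + 1][:, ::-1].reshape(-1)
        y[n] = w @ x
        w = P @ (w - mu * y[n] * x) + F        # constrained LMS update
    return y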

Many modifications and improvements to this algorithm have been proposed. Griffiths and Jim (1982) described a generalization of the Frost algorithm which they termed the adaptive sidelobe canceling beamformer; it has since been called the generalized sidelobe canceler (GSC) or Griffiths-Jim algorithm. It can be viewed as an alternative implementation of the Frost beamformer, but it also allows the effects of errors such as imperfect steering and reverberation to be observed, and the constraints to be changed to account for these errors.
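
The structure can be sketched as a fixed delay-and-sum path plus a blocking matrix (adjacent-channel differences, which cancel the presteered look-direction signal) feeding an unconstrained adaptive filter. This is an illustrative sketch of the GSC idea, not the authors' implementation; the filter length and step size are arbitrary, and the usual half-filter delay in the fixed path is omitted for brevity.

import numpy as np

def gsc_beamformer(presteered, n_taps=16, mu=0.5):
    # Generalized sidelobe canceler (sketch).
    #   presteered: (n_sensors, n_samples), channels time-aligned on the look direction.
    # Upper path: fixed delay-and-sum beamformer.
    # Lower path: blocking matrix removes the look-direction signal, leaving
    # noise references; an unconstrained multichannel NLMS filter then cancels
    # whatever in the upper path is correlated with those references.
    presteered = np.asarray(presteered, float)
    K, N = presteered.shape
    fixed = presteered.mean(axis=0)               # delay-and-sum output
    blocked = presteered[:-1] - presteered[1:]    # (K-1) noise reference channels
    w = np.zeros((K - 1, n_taps))
    out = np.zeros(N)
    for n in range(n_taps, N):
        X = blocked[:, n - n_taps:n][:, ::-1]     # recent samples of each reference
        e = fixed[n] - np.sum(w * X)              # enhanced output sample
        w += mu * e * X / (np.sum(X * X) + 1e-8)  # NLMS update
        out[n] = e
    return out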

If the position of the desired signal source is known and constant, the appropriate delays can be set up beforehand and then left alone. However, in most cases the position of the source is not known and/or changes during operation. Therefore, the beamforming array needs to be able to change its delays to follow the source. Various direction-finding methods are used to estimate the source direction and set the delays and weights accordingly. Probably the best known and most used algorithm is the multiple signal classification (MUSIC) algorithm (Schmidt, 1986).

The MUSIC algorithm is very general and finding the position of the desired signal source is only one small part of what it can do. The algorithm generates asymptotically unbiased estimates of the number of signals, the directions of arrival, the strengths and cross-correlations among the directional waveforms, the polarizations, and the strength of noise and interference.
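
For the common special case of a uniform linear array and narrowband signals, the MUSIC pseudospectrum can be sketched as follows (here the number of sources is assumed known; estimating it from the eigenvalues is part of the full algorithm).

import numpy as np

def music_spectrum(snapshots, n_sources, spacing, wavelength, angles_deg):
    # Narrowband MUSIC pseudospectrum for a uniform linear array.
    #   snapshots: complex array of shape (n_sensors, n_snapshots)
    # Peaks of the returned pseudospectrum indicate directions of arrival.
    snapshots = np.asarray(snapshots, dtype=complex)
    K = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # sample covariance
    _, eigvecs = np.linalg.eigh(R)                # eigenvalues in ascending order
    En = eigvecs[:, :K - n_sources]               # noise-subspace eigenvectors
    n = np.arange(K)
    p = np.empty(len(angles_deg))
    for i, ang in enumerate(np.radians(angles_deg)):
        a = np.exp(-2j * np.pi * spacing * n * np.sin(ang) / wavelength)  # steering vector
        v = En.conj().T @ a
        p[i] = 1.0 / np.real(v.conj() @ v)        # 1 / ||En^H a||^2
    return p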

The Maximum Likelihood (ML) technique, proposed by Ziskind and Wax (1988), is an improvement on MUSIC. It provides better resolution and accuracy, and also applies to coherent signals (such as those coming from multipath environments) which MUSIC can't handle. It uses an Alternating Projection (AP) algorithm to solve a nonlinear optimization problem. Its complexity, however, results in a very high computational cost.

Other algorithms have been proposed which have lower complexity but good accuracy. Watanabe proposed a method called Maximum Likelihood Bearing Estimation by Quasi-Newton Method (Watanabe et al, 1991a) which is about 24 times as fast as the AP algorithm yet has almost the same accuracy. It uses the efficient quasi-Newton method, a gradient-based technique, to solve the optimization problem. An adaptive version of this algorithm which can track moving sources has also been developed recently (Watanabe et al, 1993).

Other methods are also possible. Kellermann (1991) describes a voting algorithm which uses elements of pattern classification and exploits the temporal characteristics of speech. His system will be discussed in more detail in a paper review.

MULTI-MICROPHONE SPEECH ACQUISITION SYSTEMS

Many actual systems using microphone arrays have been built and tested, and several will be discussed in paper reviews following the body of this paper. They use a variety of beamforming and direction-finding algorithms. In an actual system, several parameters must be considered. In addition to complexity and computation issues, which are dependent on the algorithm(s) chosen and the number and positioning of microphones in the array, some physical considerations must be made. A compromise must be made between a large aperture, which will provide good spatial resolution, and a small aperture, which better conforms to the far-field assumption on which beamforming is based. In addition, the spacing between microphones must be less than half of the smallest wavelength of interest to avoid spatial aliasing.
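
As a worked example of the spacing requirement (using an assumed sound speed of about 343 m/s and telephone-bandwidth speech sampled at 8 kHz, so frequencies of interest up to 4 kHz):

c = 343.0                   # speed of sound in air, m/s (approximate)
f_max = 4000.0              # highest frequency of interest, Hz
lambda_min = c / f_max      # smallest wavelength: about 8.6 cm
d_max = lambda_min / 2.0    # maximum element spacing: about 4.3 cm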

CONCLUSION

Recognizing speech under noisy conditions remains an unsolved problem. However, a great variety of methods have been proposed which improve the performance of speech recognition systems. The continued evolution of these methods, and combinations of multiple methods working together, will undoubtedly provide better systems in the future.


PAPER REVIEW 1

A Customized Beamformer System for Acquisition of Speech Signals

T. Switzer, D. Linebarger, E. Dowling, Y. Tong, and M. Munoz

This paper discusses a system designed to use a linear array of microphones to acquire speech signals. The system uses eight microphones and a single DSP chip to implement a direction finding algorithm and a beamforming algorithm. This paper focuses on the choice of the beamforming algorithm, but also discusses the overall system and results.

Three beamforming algorithms were investigated in the context of speech acquisition with an 8-microphone linear array. These were the conventional delay and sum, the adaptive Frost, and the generalized sidelobe canceler beamformers. Testing used array outputs simulated from a recorded speech signal. The parameters varied were the number of sensors M, the number of taps per sensor N, and the adaptation rate μ. The two adaptive beamformers were configured to perform identically.

Two interfering signals were used, both of which originated 30 degrees off-axis. The first was a 1200 Hz sine wave. The single-sensor SNR was -7.7 dB. The SNRs for 4 taps, 8 taps, and 16 taps with the delay and sum beamformer were -5.9 dB, 0.9 dB, and 5.6 dB, respectively. The adaptive beamformers were much better than the delay and sum, especially with 10 or more taps per microphone. They peaked at approximately 32 taps, with SNRs of about 17, 20, and 22 dB for M=4, 8, and 16, respectively. Thus the adaptive beamformers increased the SNR by as much as 23 dB over the delay and sum and as much as 30 dB over the single-microphone case.

The second interfering signal was a recording of the same speaker as the desired signal, but reading a different sentence. The results were similar, although the adaptive beamformers' performance depended more heavily on the number of taps per sensor. With 25 taps per sensor, the adaptive beamformers' SNRs were about 10, 12, and 14 dB. The delay and sum beamformer's SNRs were the same as in the narrowband case.

The beamformers' complexity was also examined for an 8 kHz sampling rate, eight sensors, and 20 taps per sensor. The delay and sum is by far the least complex, requiring 64k FLOPS on the TMS320C30 DSP, while the GSC requires 3M FLOPS and the Frost 6.8M FLOPS. Since the GSC and Frost performed identically and much better than the delay and sum, the less complex GSC beamforming algorithm was chosen.

The paper only reported preliminary, qualitative results with real array data. The system seems to reduce interfering signals, but no quantitative measurements were shown.


PAPER REVIEW 2

Switching Adaptive Filters for Enhancing Noisy and Reverberant Speech from Microphone Array Recordings

Dirk Van Compernolle

Compernolle (1990) built a system which incorporates both direction-finding and adaptive beamforming algorithms. It utilizes a modified form of the Griffiths-Jim beamformer discussed earlier. The system has two sections: the first is a "speech focus" (or direction-finding) section, and the second an adaptive noise canceler. It is set up so that only one (and sometimes neither) of the two sections is allowed to adapt at any point in time. The switching is controlled by a speech detection function, incorporated as an adaptive energy threshold which assumes that high-energy bursts are speech and low-energy segments are noise.

The speech focus section needs a relatively wide field of view so that it can track a speaker's movement and won't lose the signal due to sudden movements. The modified Griffiths-Jim beamformer structure has several interesting features. Since the filter coefficients are only adapted under appropriate conditions, a signal-free noise reference is not needed. The switching is set so that it emphasizes the speech source over eliminating noise sources. Thus a directional noise source in the same direction as the speech source can be tolerated. It is important that the noise cancellation section only adapts when the speech source is not present. If it were to adapt during speech, the filter would set itself to cancel out the speech, resulting in poor performance. The adaptation in the speech focus section is much slower than in the noise cancellation section. Because of this, changes in the filters in the first section can be quickly compensated for in the second. The beamforming also provides suppression of uncorrelated noises.
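
A toy sketch of an adaptive energy-threshold speech/noise decision of the general kind described is given below; it is purely illustrative (Van Compernolle's actual detector is not specified at this level of detail here), and the frame length, smoothing factor, and margin are arbitrary.

import numpy as np

def speech_frames(signal, frame_len=256, alpha=0.98, margin_db=6.0):
    # Flag frames whose energy exceeds an adaptively tracked noise floor by a
    # fixed margin as speech; the noise floor is only updated on noise frames.
    signal = np.asarray(signal, float)
    n_frames = len(signal) // frame_len
    energies = np.array([np.sum(signal[i*frame_len:(i+1)*frame_len] ** 2)
                         for i in range(n_frames)])
    noise_floor = energies[:5].mean()          # crude initial noise estimate
    ratio = 10.0 ** (margin_db / 10.0)
    is_speech = np.zeros(n_frames, dtype=bool)
    for i, e in enumerate(energies):
        if e > ratio * noise_floor:
            is_speech[i] = True
        else:
            noise_floor = alpha * noise_floor + (1.0 - alpha) * e
    return is_speech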

The system was tested with a linear 4-microphone array in a small laboratory. The speaker was situated about 1 m away from the array. A badly-tuned radio at a similar distance, but pointed at a wall to provide reflected sound, was used as the noise source. The system was able to increase the SNR by up to 10 dB (for an input SNR of 8.5 dB) and held the output SNR to around 20 dB for input SNRs from 8.5 to 14.5 dB.


PAPER REVIEW 3

A Self-Steering Digital Microphone Array

W. Kellermann

The system described in this paper is similar to Compernolle's, but does the beamforming first and the direction-finding second. The approach taken was to form fixed beams whose superposition covers the entire space of interest. A voting algorithm then selects the beam(s) which will contribute to the output signal. This system was designed for telephone applications, so it was designed around a bandwidth of approximately three octaves (at a sampling rate of 8 kHz). The microphone array actually consists of three 11-element arrays. The first array, for the low-frequency range, was spaced at 16 cm; the second, for the mid-frequency range, at 8 cm; and the third, for the high-frequency range, at 4 cm. Since the arrays overlap, some microphones are used in more than one array, so 23 separate microphones, rather than 33, were needed. The arrays were mounted on a wall to provide a 180-degree field of view. Seven beams were chosen, with look directions of 0, ±20, ±40, and ±60 degrees off axis.

The beamforming is accomplished by a sequence of operations. First, the 23 microphone signals are grouped into three groups of 11 signals each (some are used several times, as mentioned above). Next, an aperture shading stage weights each signal appropriately to choose the desired beam shape. The weights are chosen using the Dolph-Chebyshev design method, which gives the minimum beamwidth for a given sidelobe attenuation and number of sensors. After aperture shading, the wavefront reconstruction stage applies the appropriate delays to each signal to form the seven beams for each frequency range. The final stage filters the three ranges (high-pass for the high-frequency, band-pass for the mid-frequency, and low-pass for the low-frequency) and then combines them into one signal for each beam.

The voting stage selects those beam signals which provide the best coverage of the speaker and forms the output signal. This is done on a frame-by-frame (16 ms) basis. Three features from each beam are used to determine whether the signal from each is speech or noise. The discrimination function is based on the Mahalanobis distance between the current feature vector and the estimated background noise feature vector. It requires an adaptive estimation of the background noise, which is updated when the function determines that speech is not present. The actual algorithm used is relatively complicated, with several adaptation procedures running in parallel.
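
For reference, the Mahalanobis distance at the heart of the speech/noise decision is straightforward to compute once the background-noise feature statistics are available (feature extraction and the adaptation procedures are omitted in this sketch):

import numpy as np

def mahalanobis_sq(x, noise_mean, noise_cov):
    # Squared Mahalanobis distance of a feature vector x from the estimated
    # background-noise feature distribution (mean vector and covariance matrix).
    d = np.asarray(x, float) - np.asarray(noise_mean, float)
    return float(d @ np.linalg.solve(noise_cov, d))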

The beam weight assignment was designed to satisfy two requirements. First, it must emphasize those beams which provide the best coverage. Second, since this system was designed for use with human listeners rather than a speech recognition system, it must account for perceptual criteria, such as the unpleasantness of switching noise. It does this by calculating potentials for each beam and then activating high-potential beams. The algorithm for calculating potentials takes into account instantaneous energy levels, estimated SNR, burst echo suppression, and neighbor inhibition.

The weights are then assigned depending on the previous weight and the current potential. In addition, perceptual criteria dictate that when a beam is first turned on, its weight increases with a sigmoid characteristic, and when a beam is turned off, its weight decreases as a decaying exponential.

The system was tested qualitatively with good results. The beamforming performed as expected. The reaction time was fast enough to avoid noticeable chopping of speech, and no switching noise could be heard when activating or deactivating beams. However, quantitative results were not reported.


TUTORIAL PROJECT

This section describes a simple tutorial project which I undertook to better understand acoustic beamforming and its application to speech signals. I implemented a simple delay and sum beamformer in the PV-WAVE programming language/environment. It allows an arbitrary number of microphones which can be placed anywhere. A source position and a noise position are also defined, along with their relative power (defined by the signal-to-noise ratio). The simulation doesn't include direction-finding, so the source position must be known. The simulation will generate either a wide-band noise source (white noise) or a narrow-band source (a single tone).

To test the simulation, I obtained a segment of speech which had been digitized and stored on a computer. The segment was a male speaker saying, "There's usually a valve." Several representative plots are included. Each plot shows the original speech signal, the speech+noise signal as viewed at one microphone, and the output of the beamformed array. The performance can be seen to increase as the number of microphones is increased. The simulation will also play the signals on a Sun SparcStation. The qualitative improvement when using 16 microphones is very impressive. The speech obscured by noise is almost unrecognizable, while the output of the beamformer is quite clear. The remaining noise seems to be mostly high-frequency, as well, which could easily be filtered out.


Appendix A - Beamforming Simulation Source Code

; Simulates a linear array delay and sum beamformer.
;
; Keyword parameters:
;  snr		The input signal-to-noise ratio, in dB
;  numpoints	The number of samples of the speech segment to use.
;  mikes	The number of microphones in the array.
;  size		The aperture size (length) of the array.
;  data_origin  The position [x,y] of the source signal origin.
;  noise_origin	The position [x,y] of the noise signal origin.
;  tone		If set, use a tone as the jammer instead of white noise.
;  sound	If set, play the results on the speaker.
;
pro beam,snr=snr,numpoints=numpoints,mikes=mikes,size=size,sound=sound $
,data_origin=data_origin,noise_origin=noise_origin,tone=tone
 
; Set defaults if not specified.
 if n_elements(snr) eq 0 then snr=5
 if n_elements(numpoints) eq 0 then numpoints=13421
 if n_elements(mikes) eq 0 then mikes=11
 if n_elements(size) eq 0 then size=20
 if n_elements(data_origin) ne 2 then data_origin=[100,0]
 if n_elements(noise_origin) ne 2 then noise_origin=[100,57.735]
 
; Restore the signal data (the sentence "There's usually a valve.")
 restore,'sentence.save'
; Make sure we're not asking for more data points than there are available.
 numpoints=min([numpoints,n_elements(data)])
 dat=data(0:numpoints-1)
; Generate the noise sequence (Tone or White Gaussian Noise)
 if keyword_set(tone) then noise=sin(indgen(numpoints*2)/3.0) else $
  noise=randomn(seed,numpoints*2)
  
; Normalize the signal data.
 dat=dat/sqrt(total(dat^2)/numpoints)
; Normalize the noise to unit power, then scale it to give the requested SNR.
 noise=noise/sqrt(total(noise^2)/(numpoints*2))/sqrt(10.0^(snr/10.0))
; Initialize the arrays. Delay will hold the calculated delay for each
; microphone (to set the look direction.) Ndelay will hold the delay from
; the noise source to each microphone.
 delay=fltarr(mikes)
 ndelay=delay
; Mike will hold the incoming signal (data+noise) as seen by each microphone.
 mike=fltarr(mikes,numpoints)
; All_out will hold the final output signal.
 all_out=fltarr(numpoints)
; Mike_Pos will hold the xy position of each microphone.
 mike_pos=fltarr(mikes,2)
; Evenly space the microphones from (0,-size/2) to (0,size/2).
 mike_pos(*,1)=range(-size/2,size/2,mikes)
 
; Calculate the delay from the data source to each microphone and from the
; noise source to each microphone.
 for index=0,mikes-1 do begin
  delay(index)=sqrt(total((data_origin-mike_pos(index,*))^2))
  ndelay(index)=sqrt(total((noise_origin-mike_pos(index,*))^2))
 endfor 
; For this simple simulation, we'll just normalize the delays to integers
; so that we don't have to worry about fractional sampling.
 delay=round(8.0*(delay-min(delay)))
 max_delay=max(delay)
 ndelay=round(8.0*(ndelay-min(ndelay)))
 
; Calculate the signal (data + noise, appropriately delayed) at each
; microphone.
 for index=0,mikes-1 do begin
  last=min([delay(index)+numpoints,n_elements(dat)])-1
  num=last-delay(index)
  mike(index,0:num)=transpose(dat(delay(index):last) $
 +noise(ndelay(index):ndelay(index)+num))
 endfor
 
; Now we finally do the actual beamforming. (Everything up to this point
; has been simulating the physical system.)
; For each microphone:
 for index=0,mikes-1 do begin
  ; Calculate the needed delay.
  dif=max_delay-delay(index)
  ; Apply the delay.
  out=[transpose(mike(index,dif:*)),fltarr(dif+1)]
  ; Add the delayed signal to the total output signal.
  all_out=all_out+out
  endfor
; Normalize the output signal.
 all_out=all_out/mikes
; Calculate signal-to-noise ratios.
 SNRin=10*alog10(total(dat^2)/total(noise(0:numpoints-1)^2))
 print,'Original SNR:',SNRin
 desired=dat(delay(2):*)
 error=desired-mike(2,*)
 print,'SNR for single Microphone:',10*alog10(total(desired^2)/total(error^2))
 desired=dat(max_delay:*)
 error=desired-all_out
 SNRout=10*alog10(total(desired^2)/total(error^2))
 print,'Output SNR:',SNRout
; If requested, make the sound files to hear the results.
 if keyword_set(sound) then begin
  put,dat,'input'
  spawn,"int2audio input"
  spawn,"rm input"
  put,mike(0,*),'noisy'
  spawn,"int2audio noisy"
  spawn,"rm noisy"
  put,all_out,'output'
  spawn,"int2audio output"
  spawn,"rm output"
  endif
; Plot (and play, if requested) the results.
 titlestring='Input SNR='
 titlestring=titlestring+strcompress(string(format='(F8.1)',SNRin))
 titlestring=titlestring+', Output SNR='
 titlestring=titlestring+strcompress(string(format='(F8.1)',SNRout))
 titlestring=titlestring+', '+string(format='(I0)',mikes)+' Microphones'
 maxdat=max(dat)
 mindat=min(dat)
 maxmike=max(mike(2,*))
 minmike=min(mike(2,*))
 maxout=max(all_out)
 minout=min(all_out)
 dif1=(maxmike-mindat)*1.1
 dif2=(maxout-minmike)*1.1
 plot,dat+dif1,yrange=[minout-dif2,maxdat+dif1],xstyle=1,ystyle=3,ticklen=0 $
  ,xticks=1,xtickname=replicate(' ',2) $
  ,yticks=2,ytickv=[-dif2,0,dif1] $
  ,ytickname=['Output Signal','Single Microphone','Original Signal'] $
  ,title=titlestring
 if keyword_set(sound) then spawn,"play input.au"
 oplot,mike(2,*)
 if keyword_set(sound) then spawn,"play noisy.au"
 oplot,all_out-dif2
 if keyword_set(sound) then spawn,"play output.au"
end

BIBLIOGRAPHY

(Acero & Stern, 1990)
Acero-A. Stern-R-M. "Environmental Robustness in Automatic Speech Recognition." Proceedings of ICASSP-90. Albuquerque, New Mexico. pp. 849-852. April 1990.
(Busnelli, Dal Degan, and Poretta, 1987)
Busnelli-L. Dal Degan-N. Poretta-S. "Theoretical and Experimental Results on a Speech Enhancement Technique Using a Multisensor Input." Digital Signal Processing-87. Ed. Cappellini-V. Constantinides-A-G. Elsevier Science Publishers B.V. (North-Holland), 1987.
(Compernolle, 1990)
Van-Compernolle-D. "Switching adaptive filters for enhancing noisy and reverberant speech from microphone array recordings". ICASSP 90. 1990 International Conference on Acoustics, Speech and Signal Processing. Albuquerque, NM, USA. pp. 833-6 vol.2. IEEE. 3-6 April 1990.
(Dal Degan & Prati, 1988)
Dal Degan-N. Prati-C. "Acoustic Noise Analysis and Speech Enhancement Techniques for Mobile Radio Applications." Signal Processing. vol. 15, pp. 43-56. 1988.
(Dautrich, Rabiner, & Martin, 1983)
Dautrich-B-A. Rabiner-L-R. Martin-T-B. "On the Effects of Varying Filter Bank Parameters on Isolated Word Recognition." IEEE Transactions in Acoustics, Speech and Signal Processing ASSP-31, pp. 793-806. 1983.
(Dowling et al, 1992)
Dowling-E-M. Linebarger-D-A. Tong-Y. Munoz-M. "An adaptive microphone array processing system." Microprocessors and Microsystems. vol.16, no.10. pp. 507-16. 1992.
(Ephraim, Wilpon & Rabiner, 1987)
Ephraim-Y. Wilpon-J-G. Rabiner-L-R. "A Linear Predictive Front-End Processor for Speech Recognition in Noisy Environments." Proceedings of ICASSP- 87. Dallas, Texas. pp. 1324-1327. April 1987.
(Fried & Cuperman, 1989)
Fried-N. Cuperman-V. "Evaluation of Speech Recognition Equipment in a Vehicular Environment." IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. pp. 455-458. 1-2 June 1989.
(Frost, 1972)
Frost-O-L. "An Algorithm for Linearly Constrained Adaptive Array Processing." Proceedings of the IEEE. vol. 60, no. 8, pp. 926-935. August 1972.
(Ghitza, 1986)
Ghitza-O. "Auditory nerve representation as a front-end for speech recognition in a noisy environment." Computer Speech and Language. vol. 1, pp. 109-130. 1986.
(Griffiths & Jim, 1982)
Griffiths-L-J. Jim-C-W. "An Alternative Approach to Linearly Constrained Adaptive Beamforming." IEEE Transactions on Antennas and Propagation. vol. AP-30, no. 1, January 1982.
(Hoffman et al, 1991)
Hoffman-M-W. Buckley-K-M. Link-M-J. Soli-S. "Robust microphone array processor incorporating headshadow effects". ICASSP 91: 1991 International Conference on Acoustics, Speech and Signal Processing. Toronto, Ont., Canada. pp. 3637-40 vol.5. IEEE. 14-17 April 1991.
(Juang, 1991)
Juang-B-H. "Speech Recognition in Adverse Environments." Computer Speech and Language. vol. 5, pp. 275-294. 1991.
(Kellermann, 1991)
Kellermann-W. "A self-steering digital microphone array". ICASSP 91: 1991 International Conference on Acoustics, Speech and Signal Processing. Toronto, Ont., Canada. pp. 3581-4 vol.5. IEEE. 14-17 April 1991.
(Lecomte et al, 1989)
Lecomte-I. Lever-M. Boudy-J. Tassy-A. "Car noise processing for speech input". ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing. Glasgow, UK. pp. 512-15 vol.1. IEEE. 23-26 May 1989.
(Lee & Kashyap, 1990)
Lee-D-D. Kashyap-R-L. "Robust maximum likelihood bearing estimation in contaminated Gaussian noise". Fifth ASSP Workshop on Spectrum Estimation and Modeling. Rochester, NY, USA. pp. 104-8. IEEE. 10-12 Oct. 1990.
(Mansour & Juang, 1989)
Mansour-D. Juang-B-H. "The short-time modified coherence representation and noisy speech recognition." IEEE Transactions on Acoustics, Speech and Signal Processing. vol.37, no.6. pp. 795-804. June 1989.
(Matsumoto & Imai, 1986)
Matsumoto-H. Imai-H. "Comparative study of various spectrum matching measures on noise robustness." Proceedings of ICASSP-86. Tokyo, Japan. pp. 769-772. April 1986.
(Noll et al, 1989)
Noll-A. Hamer-H-H. Piotrowski-H. Ruehl-H-W. Dobler-S. Weith-J. "Real-time connected-word recognition in a noisy environment". ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing. Glasgow, UK. pp. 679-81 vol.1. IEEE. 23-26 May 1989.
(Porter & Boll, 1984)
Porter-J-E. Boll-S-F. "Optimal estimators for spectral restoration of noisy speech." Proceedings of ICASSP-84. San Diego, California. pp. 18A.2.1-18A.2.4. March 1984.
(Powell et al, 1987)
Powell-G-A. "Practical adaptive noise reduction in the aircraft cockpit environment." Proceedings of ICASSP-87. Dallas, Texas. pp. 173-176. April 1987.
(Roe, 1987)
Roe-D-B. "Speech recognition with a noise-adapting codebook." Proceedings of ICASSP-87. Dallas, Texas. pp. 1139-1142. April 1987.
(Schmidt, 1986)
Schmidt-R-O. "Multiple Emitter Location and Signal Parameter Estimation." IEEE Transactions on Antennas and Propagation. vol. AP-34, no. 3, pp. 276-280. March 1986.
(Switzer et al, 1991)
Switzer-T. Linebarger-D. Dowling-E. Tong-Y. Munoz-M. "A Customized Beamformer System for Acquisition of Speech Signals." Conference Record of the Twenty-Fifth Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA, USA. pp. 339-43 vol.1. IEEE. Naval Postgraduate School. San Jose State Univ. 4-6 Nov. 1991.
(Watanabe et al, 1989)
Watanabe-H. Suzuki-M. Nagai-N. Miki-N. "A method for maximum likelihood bearing estimation without nonlinear maximization". Transactions of the Institute of Electronics, Information and Communication Engineers A. vol.J72A, no.8. pp. 303-8. Aug. 1989.
(Watanabe et al, 1991a)
Watanabe-H. Suzuki-M. Nagai-N. Miki-N. "Maximum likelihood bearing estimation by quasi-Newton method using a uniform linear array". ICASSP 91: 1991 International Conference on Acoustics, Speech and Signal Processing. Toronto, Ont., Canada. pp. 3325-8 vol.5. IEEE. 14-17 April 1991.
(Watanabe et al, 1991b)
Watanabe-H. Suzuki-M. Nagai-N. Miki-N. "An approximate method for efficient maximum likelihood bearing estimation by a uniform linear array". Transactions of the Institute of Electronics, Information and Communication Engineers A. vol.J74A, no.5. pp. 794-6. May 1991.
(Watanabe et al, 1992)
Watanabe-H. Suzuki-M. Nagai-N. Miki-N. "An efficient method for exact maximum likelihood bearing estimation using a uniform linear array". Transactions of the Institute of Electronics, Information and Communication Engineers A. vol.J75-A, no.12. pp. 1865-8. Dec. 1992.
(Watanabe et al, 1993)
Watanabe-H. Suzuki-M. Nagai-N. Miki-N. "Algorithms for adaptive bearing estimation based on maximum likelihood method using a uniform linear array". Journal of the Acoustical Society of Japan (E). vol.14, no.3. pp. 217-19. May 1993.
(Ziskind & Wax, 1988)
Ziskind-I. Wax-M. "Maximum likelihood localization of multiple sources by alternating projection." IEEE Transactions on Acoustics, Speech and Signal Processing. vol.36, no.10. pp. 1553-60. Oct. 1988.
