Legacy: Articulatory Synthesis
Articulatory synthesis is a method of generating speech by controlling the speech articulators (e.g., the jaw, tongue, and lips). Changing the position of the articulators results in a change in the shape of the vocal tract. In articulatory models, this controllable vocal tract serves as a “filter” with its own resonance characteristics. Thus, changing the shape of the tract changes the character of the sound, just as changing the shape of a trombone, by moving the slide, changes its sound quality. Articulatory synthesis provides a tool for modeling the physiology of speech production, while also producing an acoustic speech signal.
The first software articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein (1981). This synthesizer, known as ASY, is a computational model of speech production based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein (Mermelstein, 1973), Cecil Coker, Osamu Fujimura, and colleagues. Another popular early model is that of Shinji Maeda, which uses a factor-based approach to control tongue shape. ASY synthesizes speech through control of articulatory rather than acoustic variables. An important aspect of the model’s design is that speech sounds for use in perceptual tests can be generated through controlled variations in the timing or position parameters of the articulators. Another is that the synthesis procedure is fast enough to make interactive, on-line research practical.
The Haskins ASY computational model was designed for studies of speech production and perception (e.g., Abramson et al., 1981; Raphael et al., 1979; Browman et al., 1984; Browman and Goldstein, 1986), examining the linguistically and perceptually significant aspects of articulatory events. It allows for the quick modification of a limited set of key parameters that control the positions of the major articulators: the lips, jaw, tongue body, tongue tip, velum, and hyoid bone position (which sets larynx height and pharynx width). This particular set of parameters provides a description of vocal-tract shape, adequate for research purposes, that incorporates both individual articulatory control and linkages among articulators. Additional input parameters include excitation (sound source) and movement timing information. The ASY model helped shape the development of Catherine Browman and Louis Goldstein’s articulatory phonology approach, and the task dynamics model of Elliot Saltzman and colleagues. All three models (articulatory phonology, task dynamics, and ASY) were incorporated into a Gestural Computational Model at Haskins (Browman, Goldstein, Kelso, Rubin, & Saltzman, 1984; Saltzman, 1986; Saltzman & Munhall, 1989; Browman & Goldstein, 1990a,c). In the 1990s, Rubin, Goldstein, Mark Tiede, Khalil Iskarous, and colleagues designed a radical revision of the ASY system. This configurable, three-dimensional model of the vocal tract (CASY) permits researchers to replicate MRI images of actual vocal tracts and the articulations of different speakers. Douglas Whalen, Goldstein, Rubin, and colleagues extended this work over the following decades to study the relation between speech production and perception.
ASY, the Haskins articulatory synthesis program, provides a kinematic description of speech articulation in terms of the moment-by-moment positions of six major structures: the jaw, velum, tongue body, tongue tip, lips, and hyoid bone, all presented graphically for viewing in the midsagittal plane. The positions of the articulators can be controlled manually or by means of a table of specifications over time; the former produces steady-state utterances and the latter dynamic productions (as in the figure at the top of this page, for the production of /dah/). Tables of parameters can also be used to control the amplitude of glottal excitation, its fundamental frequency, and its mode of representation (i.e., in the time or frequency domain). The amplitude and tract point-of-insertion for fricative excitation can also be specified.
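The table-driven control scheme described above can be sketched as keyframe interpolation over articulator parameters. This is an illustrative reconstruction, not ASY's actual input format: the parameter names and the (time, value) table layout below are assumptions.

```python
# Illustrative sketch of table-driven articulator control.
# The articulator names and table layout are assumptions, not ASY's format.

ARTICULATORS = ("jaw", "velum", "tongue_body", "tongue_tip", "lips", "hyoid")

def track_value(table, t):
    """Linearly interpolate one articulator parameter at time t (ms)
    from a time-sorted table of (time_ms, value) rows."""
    if t <= table[0][0]:
        return table[0][1]
    if t >= table[-1][0]:
        return table[-1][1]
    for (t0, v0), (t1, v1) in zip(table, table[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def tract_state(score, t):
    """Positions of all controlled articulators at time t."""
    return {name: track_value(table, t) for name, table in score.items()}

# A dynamic gesture: the jaw parameter moves over 100 ms while the
# velum parameter is held constant.
score = {
    "jaw": [(0, 0.0), (100, 10.0)],
    "velum": [(0, 0.0)],
}
```

A single-row table reproduces the manually set, steady-state case; multi-row tables yield the dynamic productions described in the text.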
Steps in the production of a synthetic utterance begin with the drawing of the first tract configuration on the graphics screen and the superimposition of a grid structure. The intersection of the grid lines with the tract walls leads to a derivation of the sagittal dimensions, the center line and the length of the tract. Then, using formulae based on a variety of vocal tract measurements (Heinz & Stevens, 1964; Ladefoged, Anthony & Riley, 1971; Mermelstein, Maeda & Fujimura, 1971), the sagittal cross-sections are converted to a smoothed area function approximated by a sequence of uniform tubes each 0.875 cm in length. This simplification of the vocal tract shape permits a rapid calculation of the vocal tract transfer function. Speech output is then generated, at a sampling rate of 20 kHz, by feeding the glottal waveform through the digital filter representation of the transfer function which, for voiced sounds, accounts for both oral and nasal branches of the tract.
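The step from a sequence of uniform tubes to a vocal tract transfer function can be sketched in the frequency domain using lossless-tube chain matrices. This is a minimal illustration of the general technique, not the ASY implementation: the section length matches the 0.875 cm figure above, but the lossless assumption, the ideal open-lip termination, and the physical constants are simplifications chosen for the sketch.

```python
import math

C = 35000.0          # speed of sound in warm air, cm/s (approximate)
RHO = 0.00114        # air density, g/cm^3 (approximate)
SECTION_LEN = 0.875  # uniform tube-section length, cm, as in the text

def transfer_magnitude(areas, freq):
    """|U_lips / U_glottis| for a chain of uniform lossless tubes,
    assuming an ideal open (zero-pressure) termination at the lips.

    `areas` lists the cross-sectional area (cm^2) of each section,
    ordered from glottis to lips.
    """
    k = 2.0 * math.pi * freq / C  # wavenumber, rad/cm
    # Accumulate the chain (ABCD) matrix product, glottis to lips.
    A, B, Cm, D = 1.0, 0.0, 0.0, 1.0
    for area in areas:
        cos = math.cos(k * SECTION_LEN)
        sin = math.sin(k * SECTION_LEN)
        a, b = cos, 1j * (RHO * C / area) * sin
        c, d = 1j * (area / (RHO * C)) * sin, cos
        A, B, Cm, D = A * a + B * c, A * b + B * d, Cm * a + D * c, Cm * b + D * d
    # With P_lips = 0: U_glottis = D * U_lips, so |H| = 1 / |D|.
    return 1.0 / abs(D)
```

For a uniform 20-section tube (17.5 cm total, roughly an adult male vocal tract at a neutral vowel), the magnitude peaks near the quarter-wavelength resonances around 500, 1500, and 2500 Hz; sampling such a response on a frequency grid gives the digital-filter representation through which the glottal waveform is fed.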