Legacy: ASY Details

Articulatory synthesis is a method of synthesizing speech by controlling the speech articulators (e.g., the jaw, tongue, and lips). The legacy demo provides a brief virtual tour of the original Haskins Laboratories articulatory synthesis program, ASY, and related work. ASY was designed as a tool for studying the relationship between speech production and speech perception.

The Haskins articulatory synthesis program (ASY) is a computational model of the vocal tract, begun at Bell Laboratories (Mermelstein, 1973) and subsequently enhanced and extended by Rubin, Baer and Mermelstein (1981) for use in studies of production and perception (e.g., Abramson et al., 1981; Raphael et al., 1979; Browman et al., 1984; Browman and Goldstein, 1986). The software implementation provides a kinematic description of speech articulation in terms of the moment-by-moment positions of six major structures: the jaw, velum, tongue body, tongue tip, lips and hyoid bone, all presented graphically for viewing in the midsagittal plane. The positions of the articulators can be controlled manually or by means of a table of specifications over time, the former producing steady-state utterances and the latter dynamic productions. Tables of parameters can also be used to control the amplitude of glottal excitation, its fundamental frequency and its mode of representation (i.e., in the time or frequency domain). The amplitude and tract point-of-insertion for fricative excitation can also be specified.
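To make the idea of a table of specifications over time concrete, the following is a minimal sketch in Python. It is not ASY's actual table format: the parameter names, the units, the values and the choice of linear interpolation between rows are all assumptions made for illustration, since the description above does not specify them.

```python
from bisect import bisect_right

# Hypothetical table: (time in ms, articulator settings in arbitrary units).
# Names and numbers are illustrative only, not ASY's real parameters.
TABLE = [
    (0.0,   {"jaw": 0.0, "lips": 1.0}),
    (100.0, {"jaw": 1.0, "lips": 0.5}),
    (200.0, {"jaw": 0.5, "lips": 0.0}),
]

def interpolate(table, t):
    """Return all parameters at time t, assuming linear interpolation
    between rows and holding the first/last row outside the table."""
    times = [row[0] for row in table]
    if t <= times[0]:
        return dict(table[0][1])
    if t >= times[-1]:
        return dict(table[-1][1])
    i = bisect_right(times, t)          # index of the first row after t
    t0, p0 = table[i - 1]
    t1, p1 = table[i]
    w = (t - t0) / (t1 - t0)            # fraction of the way from row i-1 to row i
    return {name: p0[name] + w * (p1[name] - p0[name]) for name in p0}

# A dynamic production samples the table frame by frame:
frames = [interpolate(TABLE, t) for t in (0.0, 50.0, 100.0, 150.0, 200.0)]
```

A steady-state utterance, in this sketch, corresponds to a table with a single row, so every frame returns the same configuration.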

The production of a static synthetic vowel begins with drawing a sagittal view of the model’s vocal tract outline on the computer screen. Key articulators are indicated and can be moved to reconfigure the tract, within the constraints of the model. Next, a grid structure is superimposed on the tract. The intersections of the grid lines with the tract walls yield the sagittal dimensions, the center line, and the length of the tract. Then, using formulae based on a variety of vocal tract measurements (Heinz & Stevens, 1964; Ladefoged, Anthony & Riley, 1971; Mermelstein, Maeda & Fujimura, 1971), the sagittal cross-sections are converted to a smoothed area function approximated by a sequence of uniform tubes, each 0.875 cm in length. This simplification of the vocal tract shape permits a rapid calculation of the vocal tract transfer function. Speech output is then generated, at a sampling rate of 20 kHz, by feeding the glottal waveform through the digital filter representation of the transfer function, which, for voiced sounds, accounts for both oral and nasal branches of the tract.
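The step from an area function of uniform tubes to a transfer function can be sketched as follows. This is not ASY's implementation: it is a textbook lossless-tube calculation using chain (ABCD) matrices, written under several assumptions that are not in the description above, namely a speed of sound of 350 m/s, no losses, no nasal branch, and an ideal open termination at the lips. The function names are hypothetical. Under these assumptions the volume-velocity transfer function from glottis to lips is 1/D, where D is the lower-right element of the product of the section matrices, so resonances fall where D crosses zero.

```python
import math

C_SOUND = 35000.0    # cm/s; assumed speed of sound (not given in the text)
SECTION_LEN = 0.875  # cm; uniform tube length stated in the text

def chain_d(areas, freq):
    """D element of the glottis-to-lips chain matrix for lossless tubes.

    Each section has the matrix [[cos kl, j sin(kl)/A], [j A sin(kl), cos kl]]
    with rho*c normalized to 1. The running product is stored as four real
    numbers (a, b, g, d) standing for [[a, j*b], [j*g, d]].
    """
    kl = 2.0 * math.pi * freq / C_SOUND * SECTION_LEN
    c, s = math.cos(kl), math.sin(kl)
    a, b, g, d = 1.0, 0.0, 0.0, 1.0
    for area in areas:  # areas ordered from glottis to lips, in cm^2
        a, b, g, d = (a * c - b * s * area,
                      a * s / area + b * c,
                      g * c + d * s * area,
                      d * c - g * s / area)
    return d

def formants(areas, f_max=5000.0, step=1.0):
    """Resonance frequencies (Hz): zero crossings of D, located by
    scanning in `step` Hz increments and interpolating linearly."""
    found = []
    f_prev, d_prev = step, chain_d(areas, step)
    f = 2.0 * step
    while f <= f_max:
        d_cur = chain_d(areas, f)
        if (d_prev > 0) != (d_cur > 0):   # sign change between grid points
            found.append(f_prev + step * d_prev / (d_prev - d_cur))
        f_prev, d_prev = f, d_cur
        f += step
    return found

# Neutral tract: 20 sections of equal area, i.e. a 17.5 cm uniform tube.
neutral = formants([1.0] * 20)
```

For the uniform 17.5 cm tube, the first three resonances come out at the familiar quarter-wavelength values near 500, 1500 and 2500 Hz. As an aside, with the assumed 350 m/s, 0.875 cm equals the speed of sound divided by twice the 20 kHz sampling rate; this is an arithmetic observation, not a claim taken from the source.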