November 13-15, 2002: Three Publications Examine Voice Analysis Following Release of Alleged Bin Laden Tape

The release of an audio message by a man thought to be Osama bin Laden (see November 12, 2002) prompts several publications to run stories about the authentication of the voice on the tape. These articles make several points about voice analysis of apparent bin Laden recordings:
Machine analysis: Some aspects of voice identification are done by machine. Voice authentication software measures the acoustic qualities of a person’s voice, such as pitch, loudness, basic resonances, frequency, and amplitude. [New Scientist, 11/13/2002; Slate, 11/15/2002] This produces spectrographic information and can also be used to look for specific features of a voice, such as a nasal quality. In addition, every person creates the same sounds using a slightly different set of basic pitches, so the set of frequencies in bin Laden’s vowels, like those in “ea” from “fear,” will be marginally different from anyone else’s. By examining this frequency detail for every vowel and comparing it to previous examples, a machine analysis can tell whether the vowels match and were all spoken by him. [Slate, 11/15/2002] However, “People hardly ever pronounce the same word the same way twice, even in the same utterance,” says Robert Berkovitz, a speech analyst with Sensimetrics Corp. [CBS News, 11/13/2002]
Human analysis: Some aspects of voice identification are done by humans, who are, according to Slate, “very good at doing the kind of thing most people do subconsciously—telling if someone comes from a particular region by recognizing basic vowel and consonant qualities.” For example, a human analyst can tell whether the “Ye” sound in “Yemen” is of the right length and stress for bin Laden’s dialect. [Slate, 11/15/2002] Experts listen to previous recordings of bin Laden and compare them with the new tape syllable by syllable. [New Scientist, 11/13/2002; Slate, 11/15/2002] Experts can also verify whether the words on a tape generally match those used by someone of bin Laden’s age and educational background. [Slate, 11/15/2002]
Quality of tape: According to Slate, the November tape is “allegedly very noisy and possibly went down a phone line at some point.” [Slate, 11/15/2002] However, the New Scientist reports, “Voice analysis experts say the quality of the recording appears good enough to determine if the recording is genuine.” It also quotes Steve Cain of Forensic Tape Analysis, a company that received snippets of the tape from US media, who says, “It seems like it is at least clear enough and there’s enough amplitude of that unknown speaker’s voice that if you had a known sample of bin Laden it would be possible.” [New Scientist, 11/13/2002]
Splicing: Analysis can determine whether a tape has been spliced together. Potential red flags include hitches in timing and rhythm, removal of background noise, and shifts in pitch made to compensate for differences in background noise. [Slate, 11/15/2002]
Language: It makes no difference to voice analysis what language a recording is in. [CBS News, 11/13/2002]
Uncertainty: The New Scientist quotes Tomi Kinnunen, an expert in computer analysis of speech at the University of Joensuu, Finland, as saying: “There is always the possibility of error.… But if you have a clean sample with little noise, you can quite reliably say [who it is].” [New Scientist, 11/13/2002] According to Slate, human and machine analyses can be “formidable,” but “neither type of analysis can say with 100 percent certainty that the speaker on the tape is bin Laden or anyone else.” [Slate, 11/15/2002] CBS finds that intelligence analysts are convinced the tape is from bin Laden, but “they will never be sure,” because “Computer voice analysis lacks the accuracy of fingerprint or DNA identification and can be hamstrung by a skilled impersonator or low-quality recording.” “You can say with some probability, but you can never be sure,” says Kenneth Stevens, a Massachusetts Institute of Technology expert on speech analysis and synthesis. “Where there’s a combination of strong motivation and relatively weak science, there’s an opportunity for deception,” adds Berkovitz. “You can’t put the voice in a slot and have it come out saying, ‘This is Joe Smith.’” [CBS News, 11/13/2002]
One analyst, Matsumi Suzuki of Japan Acoustic Lab, Tokyo, says that, although the recording seems genuine, the speaker sounds ill. [New Scientist, 11/13/2002]
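Illustration: The vowel-frequency comparison described under machine analysis can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not the method used by any analyst or software named in the articles. It assumes each vowel is summarized by its first two resonant frequencies (formants, F1 and F2, in Hz), that the comparison is a plain Euclidean distance averaged over shared vowels, and that the function names are hypothetical. All frequency values are made up for the example and are not measurements of any real speaker.

```python
import math


def formant_distance(a, b):
    """Euclidean distance in Hz between two (F1, F2) formant pairs."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vowel_distance(known, questioned):
    """Average formant distance over the vowels present in both samples.

    A smaller value means the two samples are acoustically closer.
    It is a similarity score, not an identification.
    """
    shared = sorted(set(known) & set(questioned))
    if not shared:
        raise ValueError("the two samples have no vowels in common")
    total = sum(formant_distance(known[v], questioned[v]) for v in shared)
    return total / len(shared)


# Hypothetical values: vowel label -> (F1, F2) in Hz.
known = {"i": (300.0, 2300.0), "a": (750.0, 1200.0), "u": (320.0, 800.0)}
questioned_close = {"i": (310.0, 2280.0), "a": (740.0, 1220.0), "u": (330.0, 790.0)}
questioned_far = {"i": (380.0, 2000.0), "a": (650.0, 1400.0), "u": (400.0, 1000.0)}

close = mean_vowel_distance(known, questioned_close)
far = mean_vowel_distance(known, questioned_far)
print(f"closer sample: {close:.1f} Hz, more distant sample: {far:.1f} Hz")
```

The sketch returns a distance rather than a yes-or-no answer, in keeping with the experts' point that such comparisons yield a probability and not certainty. Any cutoff for calling two samples a match would be a further assumption.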
