Phonetic audio mining, audio searching, speech analytics

What is audio mining?

Audio mining is a technique for searching audio for occurrences of spoken words or phrases. Speech technology is used to recognise the words or phonemes spoken in an audio or video file, and audio mining searches can then be carried out to locate specific words and phrases within the audio. These searches typically run many thousands of times faster than real time, so large quantities of audio or speech can be searched in a short time.

Terminology

A number of different terms are used in connection with audio mining. These include: audio mining, audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, information retrieval. Note that the terms "audio analytics" and "speech analytics" are often used to cover both audio mining and other speech analysis technologies, for example speaker identification - see separate speech analytics page.

Audio mining applications

Audio mining software can be used to search audio or video content that contains speech. Typical applications include searching large audio/media archives for which little or no information describing the audio content is available. This could be used, for example, to retrieve relevant clips for a news story from a large video archive. Because searches typically run many thousands of times faster than real time, it becomes possible to search amounts of speech data that were previously impractical to search, given the time it would take for humans to listen to the material.

Audio mining techniques are also used in telephony applications, for example to help automate quality control in businesses where it is important to check that telephone agents actually said what they were supposed to say. Audio mining searches on the recorded calls can be made to locate words or phrases that must always be said. This greatly increases the number of calls that can be checked, because audio mining finds relevant matches much faster than the traditional method of a human listening to the recorded calls.
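The call-checking idea above can be sketched as a small report over search results. This is only an illustration: the function name `missing_phrases`, the shape of `search_hits` and the example phrases are all assumptions, and the audio mining search that produces the hits is taken as given.

```python
def missing_phrases(search_hits, required_phrases):
    """Report, per call, the mandatory phrases for which no hit was found.

    search_hits is assumed to map a call ID to the set of phrases for
    which an audio mining search returned at least one match.
    """
    report = {}
    for call_id, found in search_hits.items():
        missing = [phrase for phrase in required_phrases if phrase not in found]
        if missing:
            report[call_id] = missing
    return report


# Hypothetical phrases that agents must always say.
required = ["calls may be recorded", "terms and conditions"]

# Hypothetical search results for two recorded calls.
hits = {
    "call-001": {"calls may be recorded", "terms and conditions"},
    "call-002": {"calls may be recorded"},
}

print(missing_phrases(hits, required))
```

Only calls with at least one missing phrase appear in the report, so a human reviewer can go straight to the calls that need attention.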

Audio mining has also been used for captioning (subtitling) of TV and other media content, as the speech content associated with the text for each caption can be located by running an audio mining search. However, a much more effective and efficient way to obtain the start and end times of each word in the caption text is to use speech recognition to automatically align the text with the speech. See, for example, the automatic speech/text alignment software produced by Aurix Ltd.

Approaches

There are two common approaches to audio mining - one uses large vocabulary continuous speech recognition (LVCSR), and the other uses phonetic recognition to carry out phonetic audio mining. An overview of these two approaches to audio mining is given below.

LVCSR audio mining

This is a two-stage process. In the first stage (pre-processing or indexing stage), the speech content of the audio is processed by a large vocabulary recogniser to generate a searchable index file (if the data being processed is a video file, then the relevant audio information must be extracted and fed to the recogniser). The index file contains information about the words spoken in the audio or video data.

In the second stage (search stage), a search term is defined (e.g. a word or phrase), and one or more index files are searched for all occurrences that match the specified search term. The results of the search can be displayed graphically as "search hits" in the audio file, or the relevant portions of the audio or video file can be played to the user.
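The two stages described above can be sketched as follows. This is a minimal illustration, not how any particular product works: the recogniser output is assumed to be a list of (word, start, end) tuples per file, and the names `build_word_index` and `search_phrase` are hypothetical.

```python
from collections import defaultdict


def build_word_index(transcripts):
    """Indexing stage: map each recognised word to where it occurs.

    transcripts is assumed to map a file name to a list of
    (word, start_sec, end_sec) tuples produced by a recogniser.
    """
    index = defaultdict(list)
    for file_name, words in transcripts.items():
        for position, (word, start, end) in enumerate(words):
            index[word.lower()].append((file_name, position, start, end))
    return index


def search_phrase(index, transcripts, phrase):
    """Search stage: return (file_name, start_sec, end_sec) for each match."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Look up the first word, then check that the following words match.
    for file_name, position, start, _ in index.get(terms[0], []):
        words = transcripts[file_name][position:position + len(terms)]
        if [w.lower() for w, _, _ in words] == terms:
            hits.append((file_name, start, words[-1][2]))
    return hits


# Hypothetical recogniser output for one file.
transcripts = {
    "news.wav": [
        ("the", 0.0, 0.2),
        ("prime", 0.2, 0.6),
        ("minister", 0.6, 1.1),
        ("said", 1.1, 1.4),
    ]
}
index = build_word_index(transcripts)
print(search_phrase(index, transcripts, "Prime Minister"))
```

The start and end times in each hit are what would let a user interface mark the "search hit" in the audio file or play back the matching portion.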

Phonetic audio mining

Like LVCSR audio mining, phonetic audio mining is a two-stage process. In the first stage, audio is processed (indexed) with a phonetic recogniser to generate an index file. The index file produced by this phonetic approach to audio mining stores the phonetic content of the speech, in contrast to the index files generated by LVCSR methods, which contain information about words.

The second stage is similar to that of the LVCSR approach: a search term (word or phrase) is defined, and a number of phonetic index files are searched to retrieve matches for the search term. Here, the search term is first converted into a phonetic sequence, and the matches retrieved from the phonetic index files are matches for this phonetic sequence. This is in contrast to the LVCSR approach, where all matching is done with the text that corresponds to the word or phrase.

It is also possible to enter a phonetic search term directly, if the user has sufficient phonetic expertise to enter the sequence of phones that corresponds to the pronunciation of the word or phrase they want to search for.
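The phonetic search stage described above can be sketched as follows. The pronunciation dictionary, the phone symbols and the function names are all hypothetical, and the `max_mismatches` tolerance is an assumption added for illustration, to allow for a recogniser getting some phones wrong; it is not taken from any specific system.

```python
# Hypothetical pronunciation dictionary; the phone symbols are illustrative.
PRONUNCIATIONS = {
    "audio": ["ao", "d", "iy", "ow"],
    "mining": ["m", "ay", "n", "ih", "ng"],
}


def to_phones(term, pronunciations=PRONUNCIATIONS):
    """Convert a word or phrase into a single phone sequence."""
    phones = []
    for word in term.lower().split():
        phones.extend(pronunciations[word])
    return phones


def search_phones(phone_index, query, max_mismatches=0):
    """Slide the query along the index and report close matches.

    phone_index is assumed to be a list of (phone, start_sec, end_sec)
    tuples for one file. Returns (start_sec, end_sec, mismatches) tuples.
    """
    hits = []
    n = len(query)
    if n == 0:
        return hits
    for i in range(len(phone_index) - n + 1):
        window = phone_index[i:i + n]
        mismatches = sum(1 for (p, _, _), q in zip(window, query) if p != q)
        if mismatches <= max_mismatches:
            hits.append((window[0][1], window[-1][2], mismatches))
    return hits


# Hypothetical phonetic index: each phone lasts 0.1 seconds.
spoken = ["s", "ao", "d", "iy", "ow", "m", "ay", "n", "ih", "ng"]
phone_index = [(p, i * 0.1, (i + 1) * 0.1) for i, p in enumerate(spoken)]

# Search by text, converted to phones at search time.
print(search_phones(phone_index, to_phones("audio mining")))

# Search with a phone sequence entered directly.
print(search_phones(phone_index, ["m", "ay", "n"]))
```

A text search term and a directly entered phone sequence go through the same matching step; the only difference is whether `to_phones` is called first.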

The use of phonetic audio mining techniques offers a number of advantages over the LVCSR approach. One advantage is speed: the rate at which the audio content can be pre-processed is many times faster with phonetic search techniques. This is largely because LVCSR recognition requires a sophisticated language model for good recognition, and this greatly increases the amount of processing needed. Phonetic audio mining software can pre-process audio data at rates of 10-15 times faster than real time, compared to around 2 or 3 times faster than real time for LVCSR systems.
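To put the pre-processing rates quoted above into perspective, the short calculation below estimates indexing time for an archive. The 1,000-hour archive size and the function name `indexing_hours` are assumptions chosen for illustration; only the speed factors come from the text.

```python
def indexing_hours(audio_hours, speed_factor):
    """Hours needed to index audio at speed_factor times real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_hours / speed_factor


archive_hours = 1000  # hypothetical archive size

for label, factor in [
    ("phonetic, 10x real time", 10),
    ("phonetic, 15x real time", 15),
    ("LVCSR, 2x real time", 2),
    ("LVCSR, 3x real time", 3),
]:
    print(f"{label}: {indexing_hours(archive_hours, factor):.0f} hours")
```

On these figures, a 1,000-hour archive would take roughly 67 to 100 hours to index phonetically, compared with roughly 333 to 500 hours using LVCSR.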

Another advantage of the phonetic search engine approach is that an open vocabulary is maintained, which means that searches for personal or company names can be performed without the need to reprocess the audio. With LVCSR systems, any word that was not known by the system at the time the speech was indexed can never be found. For example, imagine a new product called "terazap" became popular. This word will not be in the dictionary of words used by an LVCSR audio mining system, which means the recogniser can never output this word, even if the word was actually spoken in audio that was processed by the system. In order to find matches for this new word, the LVCSR system has to be updated with a new dictionary that contains the word "terazap", and all the audio has to be pre-processed again, which is a time-consuming task. This problem does not occur with phonetic audio mining systems because they work at the level of phones, not words. As long as a phonetic pronunciation for the word can be generated at search time, a phonetic system will be able to find matches for the word, and no re-processing of audio is required.
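The "terazap" example above can be illustrated with a toy comparison of the two kinds of index. Everything here is a simplified assumption: the vocabulary, the phone symbols and the function names are hypothetical, and the word recogniser simply records an unknown word as "<unk>", whereas a real recogniser would more likely output similar-sounding dictionary words.

```python
# Hypothetical dictionary, fixed at the time the audio was indexed.
VOCABULARY = {"the", "new", "is", "popular"}

# Hypothetical phone sequences standing in for what was actually spoken.
SPOKEN_PHONES = {
    "the": ["dh", "ax"],
    "new": ["n", "uw"],
    "terazap": ["t", "eh", "r", "ax", "z", "ae", "p"],
    "is": ["ih", "z"],
    "popular": ["p", "aa", "p", "y", "ax", "l", "er"],
}


def lvcsr_index(spoken_words, vocabulary=VOCABULARY):
    """Word index: only dictionary words can ever appear in it."""
    return [w if w in vocabulary else "<unk>" for w in spoken_words]


def phonetic_index(spoken_words, spoken_phones=SPOKEN_PHONES):
    """Phone index: stores sounds, so no dictionary is involved."""
    phones = []
    for word in spoken_words:
        phones.extend(spoken_phones[word])
    return phones


def contains_sequence(sequence, query):
    """True if query occurs as a contiguous run inside sequence."""
    n = len(query)
    return any(sequence[i:i + n] == query for i in range(len(sequence) - n + 1))


spoken = ["the", "new", "terazap", "is", "popular"]
word_index = lvcsr_index(spoken)
phone_index = phonetic_index(spoken)

# Pronunciation generated at search time, after the audio was indexed.
terazap_phones = ["t", "eh", "r", "ax", "z", "ae", "p"]

print("terazap" in word_index)                          # False
print(contains_sequence(phone_index, terazap_phones))   # True
```

The word index would have to be rebuilt with an updated dictionary before "terazap" could be found, while the phone index is searched as it stands.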

Existing audio mining and speech analytics technology

A number of different companies produce audio mining and speech analytics software, applications and SDKs. Below is a list of links to some of the available audio mining and speech analytics tools.

Aurix - Aurix audio miner

Aurix audio miner - phonetic audio mining software.

BBN

BBN audio indexer - LVCSR audio mining system.

CallMiner

CallMiner - Speech analytics software.

Nexidia - FastTalk

Nexidia - phonetic audio mining software ("phonetic search engine").

ScanSoft (Nuance)

ScanSoft speech indexing - LVCSR audio mining system.

Witness Systems

Witness Systems - Speech analytics for call centres.


Speech analytics and audio mining in the news

Try some Google news searches to check for current news articles about audio mining and speech analytics.


Last updated: October 2007

All information in these pages copyright © 2000-2007 Howard Wright unless otherwise stated. All rights reserved.