Audio mining techniques are also used in telephony applications, for example to help automate quality control where it is important to check that telephone agents actually said what they were supposed to say. Recorded calls can be searched to locate words or phrases that must always be said. This allows far more calls to be checked, because audio mining finds relevant matches much faster than the traditional method of a human listening to the recorded calls.
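As a rough illustration of this kind of quality control check, the sketch below flags calls in which a required phrase produced no search hit. The phrase list, the call identifiers and the `missing_phrases` helper are all hypothetical; the search hits are assumed to come from whatever audio mining engine is in use.

```python
from typing import Dict, List, Set

# Hypothetical phrases that every agent must say; a real script would differ.
REQUIRED_PHRASES = ["calls may be recorded", "terms and conditions"]


def missing_phrases(hits_per_call: Dict[str, Set[str]],
                    required: List[str]) -> Dict[str, List[str]]:
    """Return, for each call, the required phrases that had no search hit."""
    report = {}
    for call_id, found in hits_per_call.items():
        missing = [phrase for phrase in required if phrase not in found]
        if missing:
            report[call_id] = missing
    return report


# Assumed search results: the set of required phrases found in each call.
hits = {
    "call_001": {"calls may be recorded", "terms and conditions"},
    "call_002": {"calls may be recorded"},
}
flagged = missing_phrases(hits, REQUIRED_PHRASES)
```

Only the flagged calls would then need to be reviewed by a human listener.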
Audio mining has also been used for captioning (subtitling) of TV and other media content, as the speech content associated with the text for each caption can be located by running an audio mining search. However, a much more effective and efficient way to obtain the start and end times of each word in the caption text is to use speech recognition to automatically align the text with the speech. See, for example, the automatic speech/text alignment software produced by Aurix Ltd.
In the second stage (search stage), a search term is defined (e.g. a word or phrase), and one or more index files are searched for all occurrences that match the specified search term. The results of the search can be displayed graphically as "search hits" in the audio file, or the relevant portions of the audio or video file can be played to the user.
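The search stage can be sketched as follows, assuming the index file holds a list of recognised words with start and end times, as a word-based index might. The `WordEntry` structure and the `search_word_index` function are illustrative names, not the format of any particular product.

```python
from typing import List, NamedTuple, Tuple


class WordEntry(NamedTuple):
    word: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def search_word_index(index: List[WordEntry],
                      term: str) -> List[Tuple[float, float]]:
    """Return (start, end) times for every occurrence of a word or phrase."""
    words = term.lower().split()
    hits: List[Tuple[float, float]] = []
    if not words:
        return hits
    # Slide a window of len(words) entries over the index.
    for i in range(len(index) - len(words) + 1):
        window = index[i:i + len(words)]
        if [entry.word for entry in window] == words:
            hits.append((window[0].start, window[-1].end))
    return hits


# Toy index for one audio file (assumed data).
index = [
    WordEntry("thank", 0.0, 0.3),
    WordEntry("you", 0.3, 0.5),
    WordEntry("for", 0.5, 0.7),
    WordEntry("calling", 0.7, 1.2),
    WordEntry("thank", 5.0, 5.3),
    WordEntry("you", 5.3, 5.6),
]
hits = search_word_index(index, "thank you")
```

Each returned time pair is a "search hit" that could be drawn on a display of the audio file or used to play the matching portion to the user.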
The second stage is similar to the LVCSR approach, in which a search term (word or phrase) is defined, and a number of phonetic index files are searched to retrieve matches for the search term. Here, the search term is converted into a phonetic sequence and it is matches for this phonetic sequence that are actually retrieved from the phonetic index files. This is in contrast to the LVCSR approach, where all matching is done with the text that corresponds to the word or phrase.
It is also possible to enter a phonetic search term directly, if the user has sufficient phonetic expertise to enter the sequence of phones that corresponds to the pronunciation of the word or phrase they want to search for.
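A minimal sketch of the phonetic search stage is given below. It assumes the phonetic index is a list of phones with start and end times, and it uses exact sequence matching for simplicity; a practical phonetic search engine would more likely score approximate matches. The lexicon, the phone symbols and the function names are all made up for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class PhoneEntry(NamedTuple):
    phone: str
    start_ms: int  # start time in milliseconds
    end_ms: int    # end time in milliseconds


# Hypothetical pronunciation lexicon with illustrative phone symbols.
LEXICON: Dict[str, List[str]] = {
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}


def to_phones(term: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Convert a word or phrase into a single phone sequence."""
    phones: List[str] = []
    for word in term.lower().split():
        phones.extend(lexicon[word])
    return phones


def search_phone_index(index: List[PhoneEntry],
                       phones: List[str]) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each exact match of the phone sequence."""
    n = len(phones)
    hits: List[Tuple[int, int]] = []
    if n == 0:
        return hits
    for i in range(len(index) - n + 1):
        window = index[i:i + n]
        if [entry.phone for entry in window] == phones:
            hits.append((window[0].start_ms, window[-1].end_ms))
    return hits


# Toy phonetic index: "the cat sat", 100 ms per phone (assumed data).
spoken = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
index = [PhoneEntry(p, i * 100, (i + 1) * 100) for i, p in enumerate(spoken)]

# Search by word: the term is converted to phones before matching.
word_hits = search_phone_index(index, to_phones("cat", LEXICON))

# Search by entering the phone sequence directly.
direct_hits = search_phone_index(index, ["s", "ae", "t"])
```

In both cases the matching is done on phones, so the only difference between the two searches is where the phone sequence comes from.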
The use of phonetic audio mining techniques offers a number of advantages over the LVCSR approach. One advantage is speed: audio content can be pre-processed many times faster with phonetic search techniques. This is largely because LVCSR requires a sophisticated language model for good recognition, and this greatly increases the amount of processing needed. Phonetic audio mining software can pre-process audio data 10-15 times faster than real time, compared to around 2 or 3 times faster than real time for LVCSR systems.
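To put these rates in perspective, the short calculation below estimates the pre-processing time for an archive of recorded audio. The archive size of 1000 hours is an assumption chosen for the example; the speed factors are the ones quoted above.

```python
def processing_hours(audio_hours: float, speed_factor: float) -> float:
    """Hours needed to pre-process audio at a multiple of real time."""
    return audio_hours / speed_factor


ARCHIVE_HOURS = 1000.0  # assumed archive size, for illustration only

# Speed factors quoted in the text, as (slowest, fastest) multiples of real time.
phonetic_range = (10.0, 15.0)
lvcsr_range = (2.0, 3.0)

phonetic_slowest = processing_hours(ARCHIVE_HOURS, phonetic_range[0])
phonetic_fastest = processing_hours(ARCHIVE_HOURS, phonetic_range[1])
lvcsr_slowest = processing_hours(ARCHIVE_HOURS, lvcsr_range[0])
lvcsr_fastest = processing_hours(ARCHIVE_HOURS, lvcsr_range[1])
```

Under these assumptions, phonetic pre-processing would take roughly 67 to 100 hours, against roughly 333 to 500 hours for LVCSR.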
Another advantage of the phonetic search engine approach is that an open vocabulary is maintained, which means that searches for personal or company names can be performed without the need to reprocess the audio. With LVCSR systems, any word that was not known by the system at the time the speech was indexed can never be found. For example, imagine a new product called "terazap" became popular. This word will not be in the dictionary of words used by an LVCSR audio mining system, which means the recogniser can never output this word, even if the word was actually spoken in audio that was processed by the system. In order to find matches for this new word, the LVCSR system has to be updated with a new dictionary that contains the word "terazap", and all the audio has to be pre-processed again, which is a time-consuming task. This problem does not occur with phonetic audio mining systems because they work at the level of phones, not words. As long as a phonetic pronunciation for the word can be generated at search time, a phonetic system will be able to find matches for the word, and no re-processing of audio is required.
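The "terazap" example can be mimicked with a toy model, shown below. The word recogniser here simply looks up each spoken word in a fixed dictionary and emits a placeholder when the word is unknown; a real LVCSR system would more likely output similar-sounding known words instead. The phone symbols, the dictionary and all function names are assumptions made for the illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Phones actually spoken, grouped per word (toy data, made-up phone symbols).
SPOKEN: List[Tuple[str, ...]] = [
    ("dh", "ax"),                             # "the"
    ("t", "eh", "r", "ax", "z", "ae", "p"),   # "terazap"
    ("s", "ey", "l"),                         # "sale"
]

# Dictionary known to the word recogniser when the audio was indexed.
INDEX_TIME_DICT: Dict[Tuple[str, ...], str] = {
    ("dh", "ax"): "the",
    ("s", "ey", "l"): "sale",
}


def build_word_index(spoken, dictionary) -> List[str]:
    """Toy word recogniser: it can only output words in its dictionary."""
    return [dictionary.get(phones, "<unk>") for phones in spoken]


def build_phone_index(spoken) -> List[str]:
    """Toy phone recogniser: it keeps every phone, whatever the word."""
    return [phone for phones in spoken for phone in phones]


def contains(sequence: Sequence[str], sub: Sequence[str]) -> bool:
    """True if sub occurs as a contiguous run inside sequence."""
    n = len(sub)
    return n > 0 and any(list(sequence[i:i + n]) == list(sub)
                         for i in range(len(sequence) - n + 1))


word_index = build_word_index(SPOKEN, INDEX_TIME_DICT)
phone_index = build_phone_index(SPOKEN)

# Pronunciation generated at search time, after the audio was indexed.
terazap_phones = ["t", "eh", "r", "ax", "z", "ae", "p"]

found_in_word_index = "terazap" in word_index
found_in_phone_index = contains(phone_index, terazap_phones)
```

The word index never contains "terazap", so the word search fails until the audio is re-indexed with an updated dictionary, while the phone index already holds the phones needed to match the new word.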
All information in these pages copyright © 2000-2007 Howard Wright unless otherwise stated. All rights reserved.