KB ID # 20 - What audio files does InfraWare support?

Answer / Solution

Supported Audio Formats

Summary

The InfraWare 360 transcription platform supports a wide range of audio formats and has specific requirements for certain features. This document introduces the major concepts surrounding the support of digital audio files.

Audience

Readers should have a fundamental understanding of the InfraWare 360 platform.

Overview

Digital audio files are computer files that have replaced analog media such as cassette tapes. There is a virtually unlimited number of ways to save or encode audio files, and differences in the method chosen can affect the ability of various programs to understand or play back a recording. Automatic speech recognition (ASR) is particularly sensitive to recording quality. This document outlines the fundamental concepts involved as well as the specific formats supported by the InfraWare 360 platform.

Encoding

When digital audio files are saved, they are encoded. The software element that specifies the method of encoding is called a codec, a contraction of coder-decoder. Sometimes software programs use a proprietary codec to accomplish certain goals the programmer has for the audio files they handle, such as making them very small or matching the requirements of a downstream software process.

As a practical matter, when capturing dictation from a PC or a telephone, the preferred format is PCM (Pulse Code Modulation) wave. Sometimes described as raw, a PCM wave file represents the voice in a pristine state, just as the PC microphone or telephone company delivered it.

Sample Size, Sample Rate & Bit Rate

Digital data is captured as bits which are a stream of 0s and 1s. For audio, this means sampling, or taking a snapshot of the sound every so often. How often, or how many times this is done per second, is described by the sample rate. The higher the sample rate, the more data is captured. High sample rates yield high clarity and large file sizes. Likewise, lower sample rates can yield smaller files at the expense of clarity. Sometimes this clarity is not identifiable to the human ear, but it can make a big difference to a computer process, such as ASR.

Sample size describes how much data (how many bits) is captured each time a sample is taken. As with sample rate, the higher the sample size, the more data is captured, which yields larger files and higher clarity. Bit rate is the product of sample size and sample rate (for a single-channel recording).

The Properties of an audio file on a Windows XP computer are represented by this screen shot. (Right click on an audio file, choose Properties, then select the Summary tab. If necessary, click the Advanced button.) Audio properties are listed. In this case, the sample size is 8 bits. Every time a sample was collected, it recorded 8 bits of data. The sample rate is 8 kHz, meaning a sample was collected 8,000 times per second (pretty often). If you multiply the sample size by the sample rate, you see that the bit rate is 64 kbps (sixty four thousand bits per second).
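The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the InfraWare platform: the function names are hypothetical, and the example builds a small placeholder recording in memory with Python's standard wave module so that no file on disk is needed.

```python
import io
import wave


def bit_rate_bps(sample_size_bits: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Uncompressed PCM bit rate: bits per sample x samples per second x channels."""
    return sample_size_bits * sample_rate_hz * channels


def describe_wav(source) -> dict:
    """Read sample size, sample rate, and bit rate from a PCM wave header."""
    with wave.open(source, "rb") as wav:
        bits = wav.getsampwidth() * 8  # getsampwidth() reports bytes per sample
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return {
        "sample_size_bits": bits,
        "sample_rate_hz": rate,
        "channels": channels,
        "bit_rate_bps": bit_rate_bps(bits, rate, channels),
    }


def make_test_wav(sample_size_bits: int, sample_rate_hz: int, seconds: int = 1) -> io.BytesIO:
    """Build a mono PCM wave recording in memory (placeholder data, not real speech)."""
    bytes_per_sample = sample_size_bits // 8
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(bytes(sample_rate_hz * seconds * bytes_per_sample))
    buffer.seek(0)
    return buffer


# The example from the text: 8 bits x 8,000 samples per second = 64,000 bits per second.
print(bit_rate_bps(8, 8000))
print(describe_wav(make_test_wav(8, 8000)))
```

To inspect a real recording, describe_wav can be given a file path instead of the in-memory buffer. Note that the wave module reads PCM wave files only.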

Conversion

Conversion involves making a copy of an audio file in a new format. This can be helpful to change a recording into a format that a desired program (such as a player) can recognize.

It is important to note that conversion almost always reduces quality (if quality describes how accurately a digital file represents the original analog speech). Sometimes the reduction in quality is small and acceptable, but other times it isn’t. For example, many programs will allow the user to convert a recording from a lower quality format to a higher quality (higher bit rate or sample rate) format. Changing to the higher quality format might accomplish the goal of making the file readable for a particular program, but it does not improve the quality of the recording. The same original information exists and the conversion utility must estimate the additional information needed to fill out the file. Such a process is counter-productive for software applications such as speech recognition that depend on all data being accurate.
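A small experiment can illustrate why converting to a higher sample rate does not restore information. The sketch below uses deliberately naive, hypothetical helper functions (dropping every other sample, then filling the gaps by linear interpolation). Real conversion utilities use more sophisticated methods, but the principle is the same: the added samples are estimates.

```python
import math


def downsample(samples, factor):
    """Keep every `factor`-th sample (a deliberately naive rate reduction)."""
    return samples[::factor]


def upsample_linear(samples, factor):
    """Rebuild a longer sequence by linear interpolation between the kept samples."""
    out = []
    for current, following in zip(samples, samples[1:]):
        for step in range(factor):
            out.append(current + (following - current) * step / factor)
    out.append(samples[-1])
    return out


# A short 1 kHz tone sampled at 8 kHz (81 samples).
original = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(81)]

low_rate = downsample(original, 2)       # as if it had been recorded at 4 kHz
restored = upsample_linear(low_rate, 2)  # same length as the original again

worst_error = max(abs(a - b) for a, b in zip(original, restored))
print(len(original), len(low_rate), len(restored))
print(f"largest difference from the original: {worst_error:.3f}")
```

The restored sequence has the same length as the original, yet every interpolated sample differs from what was actually recorded.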

Compression

The process of compression is used to represent data in less space than the original digital file required. This is very helpful when a smaller file is needed to yield shorter transmission time or smaller storage requirements. This is done with complex algorithms, but the concept itself is quite simple. It can be thought of as shorthand. Compression methods can be divided into two classes: lossy and lossless.

Lossless compression means that no data is actually lost in the process. If an audio recording is compressed with a lossless algorithm, it can be uncompressed such that the resulting file is precisely the same as the original file. The benefits of compression are realized, but no loss of quality is suffered.

On the other hand, lossy compression sacrifices the precise nature of the original file by making a similar file. Lossy compression algorithms often reduce file sizes much more than lossless ones, but the cost is the inability to return to the original file. In other words, lossy compression implies a conversion has taken place and data has been lost.
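The difference between the two classes can be demonstrated with a short Python sketch. It is an illustration under stated assumptions: zlib is a general-purpose lossless compressor chosen here only because it ships with Python, the "lossy" step simply discards the low 4 bits of each sample, and the input is made-up data rather than real speech. Neither technique is claimed to be what any InfraWare component uses.

```python
import zlib

# Hypothetical stand-in for 8-bit PCM audio: a repeating ramp of sample values.
original = bytes(range(256)) * 32

# Lossless: compress, decompress, and compare with the original.
packed = zlib.compress(original)
unpacked = zlib.decompress(packed)

# Lossy (illustrative only): discard the low 4 bits of every sample.
reduced = bytes(sample & 0xF0 for sample in original)

print("lossless round trip identical:", unpacked == original)
print("compressed size:", len(packed), "of", len(original), "bytes")
print("lossy copy identical:", reduced == original)
print("distinct sample values before and after:", len(set(original)), len(set(reduced)))
```

The lossless round trip returns exactly the original bytes, while the lossy copy keeps only a fraction of the distinct sample values and cannot be turned back into the original.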

The InfraWare platform uses compression in at least two parts of the workflow. When dictation is captured using the InfraWare Dictation Client (IDC), lossless compression is used to reduce the size of the recording file and shorten transmission time to the processing center. Since the compression is lossless, the servers at the processing center can restore the original recording without any loss of quality. Speech recognition can be performed as well as if compression had not taken place.

Downstream, after ASR has taken place, the processing center converts the audio dictation to a Windows Media format (wma file type) for transmission to transcriptionists using the InfraWare Transcription Client (ITC). This promotes very fast transmission to the transcriptionist and reduces the storage space required on the transcriptionist's computer. This is a lossy conversion that cannot be reversed, but the reduction in file size is substantial, and the need to retain the original quality is no longer present. (For the human ear, the wma file sounds just fine.)

As you can see, there is a time for each type of compression. The most important observation is that we must not use a lossy compression too early in the process, especially prior to ASR.

Impact of Audio File Formats on ASR

Speech recognition can be very sensitive to file format. Speech engines themselves are usually only written to handle a small selection of possible formats. Even within a format, the sample size and sample rate can be very important.

When working to accommodate the requirements of an ASR engine, conversion usually doesn’t help. While it is possible to down-sample a file to reach a target format, any attempts to up-sample will yield unfavorable results. InfraWare’s First Draft service requires a minimum of 8 bits and 8 kHz (64 kbps). If a recording was made at 4 bits and 8 kHz (32 kbps), that represents only half the desired information. While a conversion utility can up-sample such a file to 8 bits, the extra 4 bits do not represent meaningful data. Instead of being a true recording of the audio dictation, they are merely a software program’s guess at what might have taken place.
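The 4-bit example above can be sketched as follows. The constants come from the minimum stated in the text; the function names and the naive bit-shift conversion are assumptions made for illustration and do not describe how First Draft or any particular conversion utility works.

```python
MIN_SAMPLE_SIZE_BITS = 8    # First Draft minimum stated above
MIN_SAMPLE_RATE_HZ = 8000   # First Draft minimum stated above


def meets_first_draft_minimum(sample_size_bits: int, sample_rate_hz: int) -> bool:
    """True when the format of the original recording reaches 8 bits and 8 kHz."""
    return (sample_size_bits >= MIN_SAMPLE_SIZE_BITS
            and sample_rate_hz >= MIN_SAMPLE_RATE_HZ)


def widen_4_to_8_bits(samples_4bit):
    """Naive up-conversion: move each 4-bit value into the top of an 8-bit value."""
    return [value << 4 for value in samples_4bit]


# Every value a 4-bit recording can hold, widened to 8 bits.
widened = widen_4_to_8_bits(range(16))

print(meets_first_draft_minimum(4, 8000))  # the 32 kbps recording from the text
print(meets_first_draft_minimum(8, 8000))  # the 64 kbps minimum
print(len(set(widened)), "distinct levels out of a possible", 2 ** 8)
```

The check is meaningful only when applied to the format in which the dictation was originally recorded. After widening, the file is nominally 8-bit, but it still carries only the 16 levels the 4-bit recording could express.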

Supported File Types

File formats supported by the platform depend upon whether ASR is used.

Supported File Types for InfraWare First Draft speech recognition service:

File Type                    Encoding   Sample Size / Sample Rate       Typical Source
Wave (*.wav)                 PCM        16 bit / 22 kHz (preferred)     PC Microphone, PDA
Wave (*.wav)                 PCM        16 bit / 11 kHz                 PC Microphone, PDA
Wave (*.wav)                 PCM        8 bit / 8 kHz                   PC Microphone/PDA, or Landline Telephone
DSS (*.dss)                  SP         Standard Play (not Long Play)   Olympus or Philips PDR
DS2
.WMA Windows Media format
MP3

In addition to those file types supported for First Draft, the platform supports additional audio files for routine transcription workflow processing where speech recognition is not required, including:

Common WAV files (those supported by the Windows recorder by default: PCM, ADPCM, TrueSpeech, G.723, A-Law, u-Law)

Limitation: The InfraWare Dictation Client (IDC) supports playback for all of the formats listed above.

Conclusions

Taking care in collecting and handling digital audio files promotes accurate speech recognition, faster transmission, and lower storage requirements.

Related KBs

	Why won't my system play audio in the ITC? Data execution prevention setting in Windows
	Do I need the Olympus DSS Player installed? DSS Player Requirements
	How do I upload files from the Olympus DS-330? Upload Files from the Olympus DS-330 IDC Explained