Audio in Software

Author: thothonegan
Tags: gamedev audio wolf_engine programming

I've been debugging a lot of audio stuff recently, so here's a quick overview of how sound works in software, along with a bit about how it works in Wolf.

In the beginning, there were numbers

In the physical world, sound is created by vibrating particles creating a wave. This wave is what we need to represent so that a sound can be used within a program. To do this, we sample the wave at fixed points to approximate the actual wave. For example, CD audio takes 44100 points of the wave per second (44.1kHz). Each point is a single number, and storing all of these numbers lets you reproduce the sound digitally. For a lot of programs, there are three main operations you need to do: decoding a sound, playing a sound using the sound card, and mixing sounds.
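To make "sampling" concrete, here's a tiny sketch (not from Wolf, just an illustration) that builds one second of a 440Hz tone at 44.1kHz : 44100 floats, each one a point on the wave.

```cpp
#include <cmath>
#include <vector>

// Build one second of a 440Hz sine wave sampled at 44.1kHz.
// Each element is one sampled point of the wave, stored as a float in [-1, 1].
std::vector<float> sampleSineWave (float frequency = 440.0f, int sampleRate = 44100)
{
    std::vector<float> samples (sampleRate); // one second = sampleRate points

    for (int i = 0; i < sampleRate; ++i)
    {
        float time = static_cast<float>(i) / sampleRate;
        samples[i] = std::sin(2.0f * 3.14159265f * frequency * time);
    }

    return samples;
}
```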

Decoding Sounds

If you've ever dealt with WAV files, you know that they can be huge : such as 30MB for a 4 minute song. Yet an MP3/MP4 can be 4MB for the exact same song. The reason has to do with how the sound is represented within the different formats. The WAV file format is one of the simplest formats : it's pretty much just a header followed by the numbers representing the wave (this is called PCM). So given a CD quality (44.1kHz) song that's 4 minutes (240 seconds) long, it takes 44100 * 240 = 10584000 numbers to represent it. If they're stored as floats (4 bytes per number), that's 40.374 MB of data. So instead of storing the raw numbers, or compressing them exactly (known as lossless), most audio algorithms store functions and approximations of the audio (lossy). If you've worked with images, it's similar to the difference between PNG and JPEG. Playing a sound needs the original values though, so you have to decode the audio to be able to use it. So it takes more work to play back an MP3, but the files are a lot smaller.
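If you want to sanity check that math, it's only a few lines:

```cpp
#include <cstdio>

int main ()
{
    const long sampleRate = 44100;                    // CD quality
    const long seconds    = 240;                      // 4 minute song
    const long samples    = sampleRate * seconds;     // 10,584,000 numbers
    const double megabytes = (samples * 4.0) / (1024.0 * 1024.0); // 4 bytes per float

    std::printf("%ld samples, %.2f MB\n", samples, megabytes); // roughly 40.37 MB
    return 0;
}
```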

In Wolf this is handled by Wolf::Audio. Pretty much, it takes a file and decodes it back to the given PCM format. Every driver handles a different file format : such as opus, or ogg vorbis. It can also either preload the entire file into memory (great for quick sounds, so you don't pay the decode cost at play time) or stream from disk (great for music, since 40MB of RAM per track is still a lot and latency isn't as important). So after you've loaded a file, you need some way to play it, which brings us to the audio system.
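As a rough sketch of the two strategies (the interface and names below are made up for illustration, not the actual Wolf::Audio API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical decoder interface, just to illustrate preload vs stream.
class Decoder
{
public:
    virtual ~Decoder () = default;

    // Decode up to maxSamples of PCM into 'out', returning how many samples
    // were written. Returns 0 when the end of the file is reached.
    virtual std::size_t decode (float* out, std::size_t maxSamples) = 0;
};

// Preload : decode the entire file up front. Great for short effects,
// since playback never pays the decode cost.
std::vector<float> preload (Decoder& decoder)
{
    std::vector<float> pcm;
    float chunk[4096];

    std::size_t written = 0;
    while ((written = decoder.decode(chunk, 4096)) > 0)
        pcm.insert(pcm.end(), chunk, chunk + written);

    return pcm;
}

// Streaming : keep the decoder around and only decode a small chunk right
// before it's needed. Great for music, where 40MB of PCM per track adds up.
std::size_t streamNextChunk (Decoder& decoder, float* out, std::size_t maxSamples)
{
    return decoder.decode(out, maxSamples);
}
```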

Playing Sounds

So at this point you have some raw audio, but you need to hand it to the sound card (which I'll call the "audio system"). Audio systems generally work in one of two ways to be given the data to play : pull or push. Push is the simpler approach so I'll cover it first, but over time systems have preferred pull.

For push (such as OpenAL or WaveOut), you set up the sound system, tell it what format you'll give it data in (such as floats, 2 channels), and then hand it buffers. Each buffer goes into a queue, and while your code keeps running, the audio system plays through the queue. As it finishes each buffer it marks it as finished so your code can keep refilling them. Get behind, and the music stops. For an analogy, you have two people running a restaurant : one person serves food, while the other both cooks and washes dishes. As long as enough dishes are washed, food can keep getting served and everything runs smoothly. If dishes run out while the chef is cooking, the waiter just has to wait for the chef to get back to washing. As long as you keep feeding the audio system buffers it's fine, otherwise it will stop. You can control the number of buffers and try to balance the latency cost against having enough data to keep it smooth, but it's a delicate balance.
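Here's roughly what that looks like with OpenAL (a minimal sketch : error handling is skipped, and decodeNextChunk is a hypothetical stand-in for wherever your PCM actually comes from):

```cpp
#include <AL/al.h>

// Hypothetical : fills 'out' with up to maxShorts of 16-bit stereo PCM and
// returns the number of shorts written (0 when the sound is done).
int decodeNextChunk (short* out, int maxShorts);

// Fill 'buffer' with the next chunk of decoded PCM and queue it on 'source'.
static void fillAndQueue (ALuint source, ALuint buffer)
{
    short pcm[8192];
    int shortsWritten = decodeNextChunk(pcm, 8192);

    alBufferData(buffer, AL_FORMAT_STEREO16, pcm, shortsWritten * sizeof(short), 44100);
    alSourceQueueBuffers(source, 1, &buffer);
}

void pushLoop (ALuint source, ALuint* buffers, int bufferCount)
{
    // Prime the queue and start playback.
    for (int i = 0; i < bufferCount; ++i)
        fillAndQueue(source, buffers[i]);
    alSourcePlay(source);

    for (;;)
    {
        // Ask how many queued buffers have finished playing...
        ALint processed = 0;
        alGetSourcei(source, AL_BUFFERS_PROCESSED, &processed);

        // ...then unqueue each one, refill it, and queue it again.
        // Fall behind here and the sound stops.
        while (processed-- > 0)
        {
            ALuint buffer = 0;
            alSourceUnqueueBuffers(source, 1, &buffer);
            fillAndQueue(source, buffer);
        }

        // ... do the rest of the frame's work ...
    }
}
```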

Pull, on the other hand, reverses this. You do setup like normal, but you don't directly feed it buffers. Instead, the system calls you (in the form of a callback) whenever it wants more data, handing you a buffer to fill. Generally this is also on a separate thread, so it can even run if your main thread is off doing something. While it's a little harder to use, it matches more closely how the sound card works. In the restaurant analogy there's a third person who does nothing but wash. 90% of the time he does nothing, but when dishes start getting low the waiter tells him to wash a dish. The chef can then just focus on cooking. This allows a lot of optimizations : one example is zero copy audio. Using the push model you've got to fill a buffer, which the audio system then takes and has to copy to the card's buffer. In the pull case however, it hands you the buffer (which can be the card's buffer directly), so there isn't an extra unnecessary copy.
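For comparison, here's the pull model using SDL2's audio callback (again just a sketch; SDL is only a convenient example of a pull style API here, not a statement about what Wolf uses):

```cpp
#include <SDL2/SDL.h>
#include <cstring>

// SDL calls this on its own audio thread whenever the device needs more data.
// 'stream' is the buffer we must fill, 'len' is its size in bytes.
static void audioCallback (void* userdata, Uint8* stream, int len)
{
    // Fill the buffer with the next chunk of audio.
    // Here we just output silence as a placeholder.
    std::memset(stream, 0, len);
}

int main ()
{
    SDL_Init(SDL_INIT_AUDIO);

    SDL_AudioSpec want = {};
    want.freq     = 44100;
    want.format   = AUDIO_F32SYS;   // 32-bit floats
    want.channels = 2;
    want.samples  = 1024;           // buffer size in sample frames
    want.callback = audioCallback;

    SDL_AudioSpec have = {};
    SDL_AudioDeviceID device = SDL_OpenAudioDevice(nullptr, 0, &want, &have, 0);

    SDL_PauseAudioDevice(device, 0); // start pulling

    SDL_Delay(2000); // let the callback run for a couple of seconds
    SDL_CloseAudioDevice(device);
    SDL_Quit();
    return 0;
}
```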

In Wolf::AudioSystem we use the pull model, even though some of the systems we wrap use push. Most of the high performance systems use pull, and a push system can be integrated into it easily, either with a thread or by registering a handler in the main loop. Each driver handles a different platform and exposes a common API. The pull model is also very event driven, which fits better with Wolf itself.
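The thread approach looks roughly like this (a sketch only : the PushBackend struct is an abstraction invented for illustration, not how the Wolf drivers are actually structured):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

// A pull-style callback : "fill this buffer with the next chunk of audio".
using PullCallback = std::function<void (float* samples, std::size_t count)>;

// A push-style backend, reduced to the two operations we need :
// wait until it can take another buffer, then hand it one.
struct PushBackend
{
    std::function<void ()> waitForFreeBuffer;
    std::function<void (const float*, std::size_t)> pushBuffer;
};

// Feeder loop : run this on its own thread to adapt a push-only backend to the
// pull model. It repeatedly asks the application for data (pull) and hands the
// result to the backend (push).
void feederLoop (PullCallback pull, PushBackend& backend, std::atomic<bool>& running)
{
    std::vector<float> buffer (4096);

    while (running)
    {
        backend.waitForFreeBuffer();                      // block until the backend has room
        pull(buffer.data(), buffer.size());               // ask the app for the next chunk
        backend.pushBuffer(buffer.data(), buffer.size()); // hand it to the push backend
    }
}
```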

So at this point we can decode a file and feed it to the audio system for playback. For simple apps and some music players this is enough. For games though, we need a few more features (and a nicer high level interface, since I don't want to have to mess with bytes all the time).

Mix All The Sounds!

This is the part that can be quite different depending on the implementation. Essentially you need a way to manage and play multiple sounds, preferably at a high level. To have multiple sounds play at once, you combine the waves from each sound by adding their samples together. It's amazingly straightforward, but you have to be careful you don't accidentally amplify the result, or end up going outside the range of your format (which then clips). You can also apply your own effects at the same time you're mixing the sounds (such as playing a sound at half volume).
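The core of mixing really is just a few lines. A sketch, assuming float samples in the [-1, 1] range:

```cpp
#include <algorithm>
#include <cstddef>

// Mix 'source' into 'destination' at the given volume, clamping so the result
// stays inside the valid [-1, 1] range for float PCM (anything outside clips).
void mixInto (float* destination, const float* source, std::size_t sampleCount, float volume)
{
    for (std::size_t i = 0; i < sampleCount; ++i)
    {
        float mixed = destination[i] + (source[i] * volume);
        destination[i] = std::clamp(mixed, -1.0f, 1.0f);
    }
}
```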

Wolf::Mixer is our high level audio system. Every sound is loaded into a track. Each track can be controlled independently : its volume, its playback speed, and so on. Internally, it creates a thread which manages a Wolf::AudioSystem::Player (which handles playing back the data the mixer creates). When data is requested from the Player, the mixer gathers up all the currently playing tracks, applies all the effects needed, combines them into a single set of data, and feeds that data to the audio system for playback.

The nice thing about the Mixer is the application doesn't have to know or care about most of these details for loading or playback. Using its API you tell it to load a track, set its volume, and play. It can also be told to notify you when it passes a specific marker in the sound or finishes playing, making it easy to create loops. The app doesn't have to deal with how the audio is formatted or the details of each individual system.
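Roughly, using it looks something like this (the calls below are invented for illustration; the real Wolf::Mixer API may differ):

```cpp
// Illustrative only : these names are made up to show the level the app works at.
auto track = mixer.loadTrack("music/title_theme.opus"); // decoding handled for you
track.setVolume(0.5f);                                  // per-track control
track.onFinished([&] { track.play(); });                // notification -> simple loop
track.play();
```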

So in summary, audio is a fun subsystem of any program. It's not super complex, but it's really hard to debug when nothing is crashing and nothing is obviously wrong in the code. My current problem involves static randomly showing up in the audio : I can hear it's wrong, but can't identify what code is causing it. The data somewhere is wrong, but the problem could be anywhere along the path from decoding to playback. It's the greatest feeling though when it works properly.