Imagine yourself in the middle of a loud, busy party. Your ears receive input from multiple sound sources: people around you chat, the DJ spins his grooves, glasses clink. All these signals reach your ears simultaneously, yet you can easily isolate the one that interests you and focus on it. For example, you can hold a conversation with a friend, distinguishing their voice among the many voices around you. Your brain has the extraordinary capability to effectively suppress every sound except the one you choose to hear. This process is known in the literature as the “cocktail party effect”.
In truth, your brain's ability to suppress unwanted sound signals remains unmatched. Engineers have long been trying to solve the “cocktail party” problem, i.e. separating signal sources that propagate, interfere and mix in an environment before being captured by a set of sensors. It's a problem that arises in many fields beyond audio, from telecommunications to biomedical applications. Matters are complicated further by the fact that we usually have no access to the original source signals, nor any information about the way they were combined before reaching the sensors. Consider, for example, the case of a mastered track: if you wanted to isolate the vocals or the drums and didn't have the original multitrack available, what options would you have?
For almost three decades now, engineers and scientists have sought to develop signal processing algorithms that can separate the sources from the mixed signals received at the sensors. These algorithms form a set of techniques collectively known as “Source Separation” methods. Usually these techniques are “blind”, meaning that they use only the available sensor signals to perform the separation task. In practice, no method is completely blind: there are always some assumptions about the nature of the source signals and the way they were mixed. In the last decade, Audio Source Separation received significant attention when a new set of powerful techniques was introduced, triggering a recent wave of advances. These techniques are based on Non-negative Matrix Factorization (NMF) algorithms, made popular by Lee and Seung's seminal paper in 1999.
Accusonus has developed the “Advanced Audio Analysis” (A3) technological framework, our own take on Source Separation algorithms. The picture below shows an illustrative block diagram of the technology.
Now let us take a tour of this diagram and look at each basic processing step. First, each microphone signal is transformed into its frequency domain equivalent, resulting in a corresponding spectrogram. Here, the analysis parameters must be chosen carefully, especially in the case of drum signals. Adequate time resolution must be ensured, since drums are transient signals, while at the same time sufficient frequency resolution is needed to distinguish the finer frequency components of each drum. For a typical multichannel drum recording, these requirements result in a significant amount of data that needs to be stored and processed by the algorithm; this is why drumatom requires more memory than typical audio software. At the same time, a set of time domain cues is extracted from each transformed microphone signal and passed to the main source separation algorithm, since such additional information typically improves the performance of a blind source separation method.
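The time-frequency transform described above is, in essence, a short-time Fourier transform (STFT): the signal is sliced into overlapping windowed frames and each frame is Fourier-transformed. As a rough illustration of the resolution trade-off, here is a minimal STFT sketch in NumPy; the frame length (2048 samples) and hop size (512 samples) are illustrative choices, not the parameters drumatom actually uses:

```python
import numpy as np

def stft(signal, frame_len=2048, hop=512):
    """Short-time Fourier transform: slice the signal into overlapping,
    Hann-windowed frames and take the FFT of each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Transpose so rows are frequency bins and columns are time frames,
    # the usual spectrogram layout
    return np.fft.rfft(frames, axis=1).T

# One second of a 440 Hz tone at 44.1 kHz: the energy should concentrate
# near bin round(440 * 2048 / 44100) = 20
sr = 44100
t = np.arange(sr) / sr
spec = np.abs(stft(np.sin(2 * np.pi * 440 * t)))
print(spec.shape)             # (1025, 83)
print(np.argmax(spec[:, 0]))  # 20
```

A longer frame sharpens the frequency axis but smears transients in time, which is exactly the tension described above for drum signals.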
The main algorithm block is an NMF based algorithm. While NMF is quite an involved technique, its principles are easy to grasp. In the case of audio, NMF methods decompose the spectrogram into a set of spectral profiles and a set of activation functions. Each spectral profile corresponds to the spectrum of a source found in the microphone signal, and each activation function indicates when that spectral profile is active.
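To make this concrete, here is a toy sketch of the classic Lee and Seung multiplicative updates for the Euclidean cost, factoring a non-negative matrix V (frequency bins by time frames) into spectral profiles W and activations H. This is a textbook illustration, not the actual A3 algorithm:

```python
import numpy as np

def nmf(V, n_components, n_iter=500, seed=0):
    """Factor V (freq x time) into W (spectral profiles, one per column)
    and H (activations, one per row) with Lee & Seung's multiplicative
    updates, which minimize ||V - WH||^2 while keeping W, H non-negative."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + 1e-3
    H = rng.random((n_components, n_time)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "spectrogram": two sources with distinct spectra, each active in
# different time frames
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H_true = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
V = W_true @ H_true
W, H = nmf(V, n_components=2)
print(np.abs(V - W @ H).max())  # small reconstruction error
```

Because the updates are purely multiplicative, entries of W and H can never turn negative, which is what keeps the factors interpretable as spectra and activations.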
After the successful estimation of the spectral profiles and the activation functions, a number of time-frequency masks can be extracted. Each mask is applied to the spectrogram and, after the appropriate time-frequency synthesis, a component signal is generated. Each component signal may contain part of the direct sound source or part of the leakage. By combining the relevant components appropriately, the desired direct signal can be reconstructed.
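One common way to derive such masks from an NMF factorization, sketched below under the assumption of simple soft (Wiener-style) masking, is to let each component's contribution W[:, k] H[k] define its share of every time-frequency bin. The masks here are illustrative; the source does not specify the exact masking scheme A3 uses:

```python
import numpy as np

def component_masks(W, H, eps=1e-9):
    """Given an NMF factorization V ~ W @ H of a magnitude spectrogram,
    return one soft time-frequency mask per component. Each mask is the
    component's reconstruction divided by the total reconstruction, so
    the masks sum to (approximately) one in every bin."""
    V_hat = W @ H + eps
    return np.stack([np.outer(W[:, k], H[k]) / V_hat
                     for k in range(W.shape[1])])

# Tiny example: 3 frequency bins, 2 time frames, 2 components
W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
H = np.array([[1.0, 0.5], [0.5, 1.0]])
masks = component_masks(W, H)
print(masks.shape)        # (2, 3, 2): one (freq x time) mask per component
print(masks.sum(axis=0))  # ~all ones: the masks partition the energy
```

Multiplying each mask with the mixture spectrogram and inverting the transform (the "time-frequency synthesis" mentioned above) would then yield the individual component signals.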
In principle, a human user would have to listen to all of the components of each microphone signal and decide which ones belong to the direct source. At Accusonus we developed a method to automate this process by assessing how closely each component signal resembles the source. The final output of the Advanced Audio Analysis technology is a set of components, re-ordered in a way that remains meaningful to the end user while still allowing for a degree of manual intervention.
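The source does not disclose the actual resemblance measure, but the idea of ranking components against a reference can be sketched with something as simple as cosine similarity. Everything below (the `rank_components` helper, the choice of similarity) is a hypothetical illustration:

```python
import numpy as np

def rank_components(components, reference):
    """Order component signals by cosine similarity to a reference
    signal, most source-like first. Returns (order, similarities)."""
    ref = reference / (np.linalg.norm(reference) + 1e-12)
    sims = np.array([abs(np.dot(c, ref)) / (np.linalg.norm(c) + 1e-12)
                     for c in components])
    order = np.argsort(-sims)  # descending similarity
    return order, sims

# Toy example: component 1 is a scaled copy of the reference,
# component 0 is unrelated noise
rng = np.random.default_rng(0)
reference = np.sin(np.linspace(0.0, 20.0, 1000))
components = [rng.standard_normal(1000), 2.0 * reference]
order, sims = rank_components(components, reference)
print(order[0])  # 1: the reference-like component ranks first
```

An ordering like this lets the most source-like components surface first, which matches the re-ordered output described above while still leaving the final choice to the user.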
In general, interacting with advanced signal processing algorithms can get extremely complicated; behind the scenes, Drumatom's parameters and variables are simply too numerous to expose directly. We have therefore put significant effort into abstracting the complex underlying process and providing only the controls and information that are essential to achieving your goal, allowing you to focus on the sound and the task at hand!
If you are interested, more technical information is available in our engineering brief presented at the 135th AES Convention (see here).
This article concludes our series on drum leakage. We certainly hope you found it interesting! Stay tuned for more posts in the future.