Applying Source Separation to Sacred Harp Singing
Introduction
Sacred Harp singing is a style of a cappella folk music that originated in the churches and religious communities of the northeastern United States in the 19th century. The style is practiced at communal sings, where participants are arranged by part (treble, alto, tenor, bass) around a central conductor. The singing is not typically done for an audience; rather, singers participate as a communal activity. They sing for worship, and they sing for friendship.
The name is taken from the title of a popular hymnal, The Sacred Harp, published in 1844 by Benjamin Franklin White and Elisha J. King. The book became the standard-bearer for Sacred Harp melodies and lyrics, but the idiosyncrasies of Sacred Harp singing extend beyond the songs’ texts and arrangements. One reporter for National Public Radio described it as “full-body, shout-it-out singing. The harmonies are stark and haunting–raw, even … The room vibrates with the sound.” 1
The tradition’s roots are especially deep in the American South, where there are still many active Sacred Harp groups today. Additionally, there are vibrant communities in Ireland and the United Kingdom, and even a small community of singers in the Netherlands. 2
Sacred Harp poses a number of interesting audio-processing challenges. There are relatively few professionally recorded works, and many historical recordings would benefit from audio-restoration techniques. For this project, however, the focus is audio source separation. Simply put: how technically feasible is it to extract the individual parts of the chorus from a completed mix of a Sacred Harp song?
While blind source separation (BSS) is a decades-old research problem, Sacred Harp offers some unique challenges that make it a compelling case study. Whereas many approaches to source separation exploit timbral artifacts or audio-mastering techniques to isolate sounds, the relative homogeneity of an a cappella performance and the amateur recordings often produced by Sacred Harp hobbyists might minimize the usefulness of these approaches.
There are significant motivations for this project as well. Sacred Harp is a folk style and, as such, is often sung by people with little or no formal musical training. Rather than learning songs by reading notes with a mastery of pitch and harmonic intervals, singers are most likely to join a group sing (as a gathering is often called) and learn from those around them. Being able to isolate the individual parts could help new singers practice Sacred Harp at home.
Additionally, Sacred Harp is a highly spatial experience, but much of that is lost in the process of mixing down to stereo recordings. Enthusiasts, historians and audiences might find it interesting to experience the sound as it is heard in the room, or by the conductor in the middle of the choir. Advances in binaural recording equipment and techniques have made capturing and replaying spatialized sound technically feasible, but they require a song to be recorded with such devices in the first place. Techniques for so-called 3D audio, mostly used in virtual-reality applications, could potentially be combined with source-separated Sacred Harp songs to re-create the original setting.
For this project, the focus will be source separation with an eye towards spatialization as a future possibility.
Approaches
The approaches to BSS attempted in this project can be grouped into two categories: time-frequency masking and machine learning. Mohammed, Ballal and Grbic describe BSS as an “attempt to recover a set of unobserved signals or ‘sources’ from a set of observed mixtures.” 3 More succinctly: with relatively little knowledge (or none at all) about how the individual components of a song were mixed, how might we reproduce the original components?
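In the simplest linear, instantaneous formulation (a standard textbook model rather than anything specific to this project), each observed mixture is a weighted sum of the unknown sources, and the weights themselves are also unknown:

```latex
% Standard linear, instantaneous BSS mixing model (illustrative):
% M observed mixtures x_i, N unknown sources s_j, unknown mixing weights a_ij
x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t), \qquad i = 1, \dots, M
```

The goal is to estimate the sources given only the mixtures; with a single mono or stereo recording of a four-part chorus, the problem is underdetermined, which is why the additional structure exploited below matters.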
Time-frequency Masking
Time-frequency masking is based on the premise that in a piece of recorded audio, different sources will occupy different, distinct frequencies. We can easily imagine how a piccolo and a string bass occupy higher and lower frequency ranges. Indeed, researchers have claimed impressive results using various time-frequency masking techniques when sources have little overlap. This quality is called W-disjoint orthogonality, and techniques for evaluating and approximating it are well researched. 4
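Stated formally (following Yilmaz and Rickard), two sources are W-disjoint orthogonal if their windowed (short-time) Fourier transforms never occupy the same time-frequency point:

```latex
% W-disjoint orthogonality: the time-frequency supports of any two sources
% are disjoint, so a binary mask can in principle recover each source exactly.
\hat{s}_j(\tau, \omega)\,\hat{s}_k(\tau, \omega) = 0
\quad \text{for all } (\tau, \omega),\; j \neq k
```

When this holds, a binary mask that keeps only the time-frequency points dominated by one source can recover that source from the mixture.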
To understand how well time-frequency masking might work on Sacred Harp music, we should first analyze the music and its sonic components. This analysis used Python and the Librosa library. Below, we can see the Mel spectrogram of the popular Sacred Harp song, “The Last Words of Copernicus.” Immediately, we can see that the highest concentrations of energy fall into several distinct bands, beginning at approximately 440 Hz and dissipating near 2048 Hz.
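For reference, a minimal sketch of how such a spectrogram can be produced with Librosa; the file name and parameter values below are placeholders, not necessarily the exact settings used for the figures:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the recording (file name is a placeholder)
y, sr = librosa.load("last_words_of_copernicus.wav", sr=22050)

# Mel spectrogram, converted from power to decibels for display
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)

# Plot with a Mel-scaled frequency axis and a time axis in seconds
plt.figure(figsize=(10, 4))
librosa.display.specshow(S_db, sr=sr, hop_length=512,
                         x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title('Mel spectrogram, "The Last Words of Copernicus"')
plt.tight_layout()
plt.show()
```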
We can see the bands even more clearly if we focus on the first few seconds of the recording. During this time, singers listen to the choir’s leader for the starting pitch (from 0 to 2 seconds) and tune their voices accordingly, with each section coming in one by one (at approximately 2.5, 3.5, and 5 seconds in the image below).
In this close-up examination, we can see especially high intensities near 512 Hz, approaching 1024 Hz. Notably, the song as performed in this recording is in the key of E: E5 (the E above middle C) corresponds to approximately 659 Hz, and the third (G#) and fifth (B) above it sit at approximately 830 Hz and 988 Hz. 5 With the score to inform our reading of the spectrogram, we can see that the visible frequency bands fall at the sung notes and their corresponding harmonics. This information is useful for creating a time-frequency mask to extract the individual parts from the whole song.
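Those frequency figures follow from the standard equal-temperament conversion from note (MIDI) number to pitch, assuming A4 = 440 Hz; a quick check:

```python
# Equal-temperament conversion from MIDI note number to frequency (A4 = 440 Hz)
def midi_to_hz(n: int) -> float:
    return 440.0 * 2.0 ** ((n - 69) / 12)

# E5 = MIDI 76, G#5 = MIDI 80, B5 = MIDI 83
for name, n in [("E5", 76), ("G#5", 80), ("B5", 83)]:
    print(f"{name}: {midi_to_hz(n):.1f} Hz")
# -> E5: 659.3 Hz, G#5: 830.6 Hz, B5: 987.8 Hz
```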
To understand the audio further, component analysis was applied as well, using Librosa’s decomposition with nearest-neighbor filtering, which translates the signal into pitch classes over time. The results, shown below, confirm what we had gleaned from the previous spectrograms: there is a concentration of energy near G, B flat, and E flat. It is also important to note that Sacred Harp singers rarely reference an instrument such as a piano or a pitch pipe before singing, so the sounded pitch is not anchored to a fixed reference.
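A hedged sketch of this kind of analysis in Librosa (the exact parameters behind the figure are not recorded here): compute a chromagram, i.e. the energy in each pitch class over time, and smooth it with nearest-neighbor filtering.

```python
import librosa
import numpy as np

# Load the recording (file name is a placeholder)
y, sr = librosa.load("last_words_of_copernicus.wav")

# Chromagram: energy in each of the 12 pitch classes over time
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

# Nearest-neighbor filtering suppresses transient noise by replacing each
# frame with the median of its most similar frames (cosine similarity)
chroma_smooth = librosa.decompose.nn_filter(chroma,
                                            aggregate=np.median,
                                            metric="cosine")

# Report the pitch classes carrying the most energy overall
pitch_classes = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]
totals = chroma_smooth.sum(axis=1)
for idx in np.argsort(totals)[::-1][:3]:
    print(f"{pitch_classes[idx]}: {totals[idx]:.1f}")
```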
Both the spectrogram and the component analysis also show significant overlap among the various components of the song, and the recording itself contains a substantial amount of noise.
In spite of these factors, a time-frequency filtering approach was applied. The predictable relationship between the spectrogram and the notes of the musical score meant that filters could be designed in a straightforward way. The proposed framework takes MIDI data from the musical score and translates it into a series of band-pass filters centered on the pitch indicated by each MIDI note, as well as several additional harmonics.
To prototype this, the visual programming environment Pure Data was used in combination with MIDI data taken from the song. Prior to this project, a MIDI score for “The Last Words of Copernicus” was unavailable, so the notation for all four parts was transcribed from a scan of the song using the program MuseScore 3. 6 The score was then matched to the appropriate key and tempo-adjusted to align with the audio file. This provided the four MIDI tracks used to guide the filters in the Pure Data patch.
When Pure Data receives the MIDI information, each of the four tracks is split, and the note value from the score is translated by Pure Data into a corresponding frequency. The frequency is then fed to a series of eight band-pass filters centered around the original frequency, as well as four integer-multiple harmonics above and below the fundamental pitch. At the same time, a synced copy of the original audio is passed through the system and the band-pass filters are applied. This is intended to attenuate the frequencies outside the fundamental and harmonics of each vocal part.
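The prototype itself is a Pure Data patch, but the same idea can be sketched offline in Python. The function below is a simplified, hypothetical rendering of the process (fixed bandwidth, harmonics above the fundamental only, no real-time control), not a reproduction of the actual patch:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass(x, sr, center_hz, bandwidth_hz):
    """Fourth-order Butterworth band-pass centered on `center_hz`."""
    low = max(center_hz - bandwidth_hz / 2, 20.0)
    high = min(center_hz + bandwidth_hz / 2, sr / 2 - 1.0)
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfilt(sos, x)

def isolate_part(mix, sr, notes, n_harmonics=4, bandwidth_hz=60.0):
    """Crude score-guided isolation of one vocal part.

    `notes` is a list of (start_sec, end_sec, midi_note) tuples taken from
    that part's MIDI track. For each note, the corresponding slice of the
    mix is band-passed at the fundamental and its first few integer
    harmonics, and the filtered slices are summed into the output.
    """
    out = np.zeros_like(mix)
    for start, end, midi_note in notes:
        f0 = 440.0 * 2.0 ** ((midi_note - 69) / 12)
        i0, i1 = int(start * sr), int(end * sr)
        segment = mix[i0:i1]
        filtered = np.zeros_like(segment)
        for k in range(1, n_harmonics + 1):
            if k * f0 < sr / 2:  # stay below the Nyquist frequency
                filtered += bandpass(segment, sr, k * f0, bandwidth_hz)
        out[i0:i1] += filtered
    return out
```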
Additionally, an interface was made to allow a user to adjust the range of the band-pass filter, as well as the levels of the additional harmonics. A switch was also added for toggling between the vocal lines. The interface can be seen below.
After passing the song through the filters, we can see the differences in the spectrograms. When the bass line is isolated, the lower frequencies have comparatively more energy, with significant attenuation between 512 Hz and 1024 Hz. Similarly, with the treble line, we see the greatest energy between 512 Hz and 1024 Hz, and attenuation above and below that range. However, when we listen to the audio, the isolating effect is quite limited. As expected, the significant overlap among the recording’s vocal lines and harmonics makes separation by time-frequency filtering difficult.
Machine Learning Approaches
In November 2019, as work on this project was underway, the French company Deezer open-sourced its Spleeter source-separation neural network. 7 Spleeter uses a U-Net architecture, a type of convolutional auto-encoder with “skip-connections that bring back detailed information lost during the encoding stage.” The neural network is implemented with TensorFlow. 8 9 The Spleeter framework had already produced impressive results on commercially recorded music, so applying it to Sacred Harp seemed promising. Deezer’s pre-trained models include two-, four- and five-stem variants, but these models are all trained on popular-music tracks containing vocals, drums, bass and other accompaniment. To apply Spleeter to Sacred Harp, building a compatible dataset would be the first step.
The dataset for training would consist of stems (the four isolated vocal tracks) and a complete mix for each song. No public repository for this existed, so after reaching out to the Sacred Harp Singers of Cork, Ireland, and the developers of the FaSoLaMix iOS app, the stems and full mixes of 24 songs were compiled. Because this is not a particularly large dataset, data augmentation was performed using Python and Librosa. For each song, ten pitch-shifted versions, four time-shifted versions, and three versions with added noise were produced, as well as combinations of these variations. Corresponding comma-separated value (.csv) files were then generated to guide the Spleeter training process.
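A sketch of this kind of augmentation pipeline is shown below; the shift amounts, stretch rates and noise levels are illustrative assumptions rather than the exact values used, and time-stretching stands in here for the time-shifting step:

```python
import librosa
import numpy as np
import soundfile as sf

def augment(path, out_prefix):
    """Write a handful of augmented variants of one stem or mix."""
    y, sr = librosa.load(path, sr=None)

    # Pitch-shifted versions (semitone amounts are illustrative)
    for steps in (-2, -1, 1, 2):
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        sf.write(f"{out_prefix}_pitch{steps:+d}.wav", shifted, sr)

    # Time-stretched versions (rates are illustrative)
    for rate in (0.9, 1.1):
        stretched = librosa.effects.time_stretch(y, rate=rate)
        sf.write(f"{out_prefix}_stretch{rate}.wav", stretched, sr)

    # Versions with added low-level Gaussian noise
    for amplitude in (0.002, 0.005):
        noisy = y + amplitude * np.random.randn(len(y))
        sf.write(f"{out_prefix}_noise{amplitude}.wav", noisy, sr)
```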
While the dataset is largely complete, work to apply Deezer’s source-separation tools to Sacred Harp recordings (via the Spleeter Python API) is still ongoing. And even though there are now more than 960 permutations of the original data provided by the FaSoLaMix developers, other useful transformations remain to be applied: remixing and loudness scaling, among others, were used by Deezer when creating Spleeter. 10
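For reference, separation with one of Deezer’s pre-trained models runs through the Python API roughly as follows; a purpose-trained Sacred Harp model would substitute its own configuration for the pre-trained one:

```python
from spleeter.separator import Separator

# Pre-trained four-stem model (vocals / drums / bass / other); a
# purpose-trained Sacred Harp model would pass its own configuration here.
separator = Separator("spleeter:4stems")

# Writes one audio file per stem into the output directory
separator.separate_to_file("last_words_of_copernicus.wav", "output/")
```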
Conclusions and Future Work
The discrete nature of Sacred Harp’s four-part harmonies makes it an interesting challenge for BSS. The relative abundance of musical information on the genre, in the form of scores and MIDI files, as well as the distinct spectral regions those voices occupy, suggests that time-frequency filtering techniques might be effective for separating sources. In this test, MIDI data was used to modify band-pass filters in real time to guide the separation process, but the results only marginally isolated the parts, and distortion was significant.
Machine-learning approaches hold some promise, and the open-source Spleeter tool recently released by Deezer could enable significant strides in applying source separation to Sacred Harp music. Unfortunately, a dearth of training data complicates this particular problem. To that end, a set of files was compiled to form the beginning of such a dataset, and a fully realized, purpose-trained model might be attainable in the near future.
References
1 Melissa Block. “Preserving the Sacred Harp Singing Tradition.” National Public Radio (website). 5 December 2003.
2 “Wie is Sacred Harp Amsterdam?” Sacred Harp Amsterdam (website). Accessed 12 January 2020.
3 Abbas Mohammed, Tarig Ballal, and Nedelko Grbic. “Blind Source Separation Using Time-Frequency Masking.” Radioengineering Vol. 16, No. 4 (December 2007): 96-100.
4 Ozgur Yilmaz, Scott Rickard. “Blind Separation of Speech Mixtures via Time-Frequency Masking.” IEEE Transactions on Signal Processing Vol. 52 No. 7 (July 2004).
5 Joe Wolf. “Note Names, MIDI Numbers and Frequencies.” The University of New South Wales (website). Accessed 12 January 2020.
6 Sarah Lancaster, “The Last Words of Copernicus.” The International Music Score Library Project (website). Accessed 12 January 2020.
7 Deezer. “Spleeter.” GitHub (website). Accessed 14 January 2020.
8 Laure Pretet, Romain Hennequin, Jimena Royo-Letelier, and Andrea Vaglio. “Singing Voice Separation: A Study on Training Data.” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing.
9 Keagan Pang. “Deezer’s Spleeter: Deconstructing Music with AI.” Digital Innovation and Transformation (blog). Harvard Business School. Accessed 12 January 2020.
10 Favio Vazquez. “Separate Music Tracks with Deep Learning.” Towards Data Science (blog). 7 November 2019.