We present a method for learning interpretable music-signal representations directly from waveform signals. The method can be trained with unsupervised objectives and builds on the denoising auto-encoder model, using a simple sinusoidal model as the decoding function to reconstruct the singing voice (a hedged code sketch of such a decoder is given after the table below). This supplementary material focuses on three "qualitative" aspects:
Randomly sampled mixture (mix) and singing voice (vox) segments of 3 seconds, drawn from the test subset.
Example | Input | Reconstructed |
---|---|---|
Mix Ex. 1 | (audio) | (audio) |
Vox Ex. 1 | (audio) | (audio) |
Mix Ex. 2 | (audio) | (audio) |
Vox Ex. 2 | (audio) | (audio) |
Mix Ex. 3 | (audio) | (audio) |
Vox Ex. 3 | (audio) | (audio) |
Mix Ex. 4 | (audio) | (audio) |
Vox Ex. 4 | (audio) | (audio) |
Mix Ex. 5 | (audio) | (audio) |
Vox Ex. 5 | (audio) | (audio) |
Mix Ex. 6 | (audio) | (audio) |
Vox Ex. 6 | (audio) | (audio) |
Mix Ex. 7 | (audio) | (audio) |
Vox Ex. 7 | (audio) | (audio) |
Mix Ex. 8 | (audio) | (audio) |
Vox Ex. 8 | (audio) | (audio) |
Mix Ex. 9 | (audio) | (audio) |
Vox Ex. 9 | (audio) | (audio) |
Mix Ex. 10 | (audio) | (audio) |
Vox Ex. 10 | (audio) | (audio) |
Mix Ex. 11 | (audio) | (audio) |
Vox Ex. 11 | (audio) | (audio) |
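To make the decoding function concrete, the sketch below shows one way a sinusoidal decoder of this kind could look in PyTorch. It is a minimal, hypothetical implementation rather than the exact architecture from the paper: the class name `SinusoidalDecoder` and the hyper-parameters `n_bases`, `segment_len`, and `sample_rate` are illustrative assumptions, and the encoder that produces the per-basis amplitudes is omitted.

```python
# Minimal sketch (an assumption, not the paper's exact architecture) of a
# decoder that reconstructs a waveform segment as a weighted sum of
# learnable sinusoidal basis signals.
import torch
import torch.nn as nn

class SinusoidalDecoder(nn.Module):
    def __init__(self, n_bases=800, segment_len=1024, sample_rate=44100):
        super().__init__()
        # Learnable frequency and phase for each basis signal.
        self.freqs = nn.Parameter(torch.rand(n_bases) * sample_rate / 2)
        self.phases = nn.Parameter(torch.zeros(n_bases))
        # Fixed time axis for one segment, in seconds.
        self.register_buffer("t", torch.arange(segment_len) / sample_rate)

    def basis(self):
        # One sinusoid per row: cos(2*pi*f*t + phi); shape (n_bases, segment_len).
        return torch.cos(2 * torch.pi * self.freqs[:, None] * self.t
                         + self.phases[:, None])

    def forward(self, amplitudes):
        # amplitudes: (batch, n_bases) code produced by the encoder.
        # Output: (batch, segment_len) reconstructed waveform segment.
        return amplitudes @ self.basis()
```

Under a denoising objective, one would corrupt the input mixture, encode it to per-basis amplitudes, and penalise the decoder's reconstruction against the clean singing-voice segment.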
We took the best-performing model reported in our paper and computed the discrete Fourier transform (DFT) of the basis signals that the decoder uses. A cherry-picked selection of these bases is shown below.
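As a rough illustration of this analysis step, the sketch below takes the DFT magnitude of each basis signal and plots a few of them. It assumes the hypothetical `SinusoidalDecoder` from the earlier sketch and a 44.1 kHz sample rate; in practice one would load the trained model's weights rather than a freshly initialised module.

```python
# Hedged sketch of the analysis step: instantiate the illustrative
# SinusoidalDecoder from above (in practice, load the trained weights),
# take the DFT magnitude of each basis signal, and plot a selection.
import numpy as np
import matplotlib.pyplot as plt

decoder = SinusoidalDecoder(n_bases=800, segment_len=1024, sample_rate=44100)
basis = decoder.basis().detach().cpu().numpy()   # (n_bases, segment_len)

spectra = np.abs(np.fft.rfft(basis, axis=-1))    # magnitude DFT per basis
freqs = np.fft.rfftfreq(basis.shape[-1], d=1.0 / 44100)

for k in (0, 1, 2):                              # a few cherry-picked bases
    plt.plot(freqs, spectra[k], label=f"basis {k}")
plt.xlabel("Frequency (Hz)")
plt.ylabel("DFT magnitude")
plt.legend()
plt.show()
```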
(Figures: DFT magnitude spectra of the selected decoder basis signals.)
From the above collection of figures, we can draw the following conclusions: