Soundmosaic
Copyright © 2001-2007 by Steven Hazel <sah@awesame.org>

Soundmosaic constructs an approximation of one sound out of small pieces of other sounds.

Starting with a target file and a set of source files, soundmosaic splits the target file into equal-sized segments, or "tiles". For each tile in the target file, it finds the closest match in the source files and replaces the target tile with that matching source tile.
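
The loop described above can be sketched roughly as follows. This is a minimal illustration, not soundmosaic's actual code: the function names are made up for the example, samples are plain Python lists, the candidate source tiles are assumed to be precomputed, a trailing partial tile is simply dropped, and the similarity function is a placeholder (squared error) rather than the real metric.

```python
def split_into_tiles(samples, tile_size):
    """Split samples into equal-sized, non-overlapping tiles.

    Assumption: a trailing remainder shorter than tile_size is dropped.
    """
    count = len(samples) // tile_size
    return [samples[i * tile_size:(i + 1) * tile_size] for i in range(count)]


def build_mosaic(target, source_tiles, tile_size, similarity):
    """Replace every target tile with the most similar candidate source tile."""
    output = []
    for tile in split_into_tiles(target, tile_size):
        best = max(source_tiles, key=lambda candidate: similarity(tile, candidate))
        output.extend(best)
    return output


def negative_squared_error(a, b):
    """Placeholder similarity for the example: higher means closer."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


target = [0.0, 1.0, 0.0, -1.0]
sources = [[0.1, 0.9], [0.0, -0.8], [0.5, 0.5]]
mosaic = build_mosaic(target, sources, 2, negative_squared_error)
```

Here the first target tile is replaced by the first source tile and the second target tile by the second source tile.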

I've made some sample mp3s:

    For the first demo, the target sound was a recording of a chimpanzee screaming, and the source files were a few short recordings from George W. Bush's public speeches. The final product is a concatenation of soundmosaic results for decreasing tile sizes, starting at a few seconds per tile (such that the first sound is a direct clip of GW's speech), and decreasing to one microsecond per tile (such that the last sound is a perfect reproduction of the chimp's scream):
    bushchimp.mp3

    The second demo is based on a recording of the Beatles introducing themselves, replaced by snippets of John Coltrane performing "A Love Supreme". The tile size is about half a centisecond. You can hear the sax pretty clearly, especially when one of the Beatles whistles after George's introduction. Some of the clicks you hear are artifacts of the concatenation, but others are drums, mostly the ride cymbal:
    beatles-coltrane.mp3


Download:


Distance Metric:

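    The similarity between two tiles is defined as the correlation of the normalized vectors. This is the cosine of the angle between the vectors, and can be calculated with a dot product once the vectors have been scaled to a common length (with unit length the dot product is exactly the cosine; with any other common length it is proportional to the cosine, which is enough for ranking matches). A higher value means a closer match.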

    In fact, the prospective match is scaled to the volume of the original tile before comparison, and it is written to the output file at that volume. Normalization before comparison means that the overall volume of tiles does not affect the comparison. This also serves to make the output sound a little bit more like the target, since it follows the same broad amplitude changes.
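
    As a rough sketch of the metric and the volume matching, assuming tiles are lists of floats and that "volume" means the Euclidean norm of the tile (an assumption, as are the function names and the handling of silent tiles):

```python
import math


def norm(v):
    """Euclidean length of a tile."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Cosine of the angle between two tiles; 1.0 is a perfect match.

    Assumption: an all-zero (silent) tile scores 0.0 against everything.
    """
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def scale_to_volume(candidate, target_tile):
    """Scale the candidate so its norm equals the target tile's norm."""
    nc = norm(candidate)
    if nc == 0.0:
        return list(candidate)
    gain = norm(target_tile) / nc
    return [x * gain for x in candidate]
```

    Because the cosine ignores length, a quiet source tile and a loud one with the same shape score identically; only the scaled copy written to the output differs.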

    Before 1.1, soundmosaic used the Manhattan distance between the "normalized" vectors, where "normalization" was done in the common audio sense of increasing the volume as much as possible without clipping (this corresponds to mapping onto the surface of a hypercube rather than a hypersphere). The old metric worked reasonably well, but the new metric is much better.
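
    For comparison, the older metric might be sketched like this. The function names are invented for the example, and leaving silent tiles unchanged is an assumption:

```python
def peak_normalize(v):
    """Scale a tile so its largest absolute sample is 1.0.

    This places the tile on the surface of a hypercube.
    Assumption: an all-zero tile is returned unchanged.
    """
    peak = max((abs(x) for x in v), default=0.0)
    if peak == 0.0:
        return list(v)
    return [x / peak for x in v]


def manhattan_distance(a, b):
    """Sum of absolute per-sample differences; 0.0 is a perfect match."""
    return sum(abs(x - y) for x, y in zip(a, b))


def old_metric(a, b):
    """Manhattan distance between the peak-normalized tiles."""
    return manhattan_distance(peak_normalize(a), peak_normalize(b))
```

    Unlike the cosine, this is a true distance: lower values mean closer matches.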

Resampling:

    Soundmosaic automatically resamples the source files to match the sample rate of the target file. It does this using a simple zero-order hold / drop-sample resampler, which is low quality and introduces all kinds of artifacts -- it doesn't even low-pass filter at the relevant Nyquist frequency. If resampling quality is important to you, you should use a higher quality resampler to convert all of your source material to the sample rate of the target file before you run soundmosaic.
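
    To show what that kind of resampler amounts to, here is a minimal sketch. It assumes integer sample rates and list-of-samples input, and the function name is made up; soundmosaic's own implementation may differ in its details:

```python
def resample_zero_order_hold(samples, source_rate, target_rate):
    """Resample by repeating samples (upsampling) or dropping them (downsampling).

    No interpolation and no low-pass filtering is done, so downsampling
    can alias and upsampling produces a stair-step waveform.
    """
    out_len = len(samples) * target_rate // source_rate
    # Each output sample copies the most recent input sample at or before it.
    return [samples[i * source_rate // target_rate] for i in range(out_len)]
```

    Doubling the rate of [1, 2, 3] gives [1, 1, 2, 2, 3, 3], and halving the rate of [1, 2, 3, 4] gives [1, 3].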

Dealing with Large Amounts of Data:

    In order to find matches good enough to make both the target and source inputs recognizable in the output, it helps to have a tremendous amount of source data, and a tremendous amount of data storage and processing to go with it. Distributing the system across multiple machines using the --master and --slave options helps to handle that load so that a decent result can be achieved in a more reasonable amount of time.

    Normally, we compare each target tile with every possible tile position in the source files (one tile beginning at the first sample, another beginning at the second, and so on). That's very time consuming, though, even for a small amount of data, so the --partition flag is provided to merely partition the source files into non-overlapping tiles, the same as is done with the target file. This method produces lower quality results, but it still allows for a variety of source tiles, and prevents the processing time from getting out of hand. It can be a useful way to "test run" a soundmosaic project to get an idea of what the results might be like.
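
    The difference between the two ways of choosing candidate tiles can be sketched as follows. The function names are hypothetical, and dropping a trailing partial tile is an assumption:

```python
def sliding_tiles(samples, tile_size):
    """Every tile position: one candidate starting at each sample."""
    return [samples[i:i + tile_size]
            for i in range(len(samples) - tile_size + 1)]


def partitioned_tiles(samples, tile_size):
    """Non-overlapping tiles only, as with the target file.

    Assumption: a trailing remainder shorter than tile_size is dropped.
    """
    return [samples[i:i + tile_size]
            for i in range(0, len(samples) - tile_size + 1, tile_size)]
```

    For example, one minute of 44.1 kHz audio with 441-sample tiles has 2,645,560 sliding candidates but only 6,000 partitioned ones.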

Future Development:

    I'm interested in ways of speeding up the calculation of distance -- I'm not sure whether soundmosaic can use the standard DSP techniques for calculating correlation more efficiently, because I think the per-tile normalization probably gets in the way.
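
    One observation that may bear on this, offered as an untested sketch rather than a solution: the cosine at each source position splits into a plain sliding dot product divided by two norms, and the norm of every source window can be read off a running sum of squares. The sliding dot product on its own is an ordinary cross-correlation, which is the part FFT-based methods speed up. The code below only demonstrates the decomposition; it still computes the dot products naively, the function names are invented, and the prefix sums may lose precision on long inputs.

```python
import math


def window_norms(source, tile_size):
    """Norm of every length-tile_size window, from prefix sums of squares."""
    prefix = [0.0]
    for x in source:
        prefix.append(prefix[-1] + x * x)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return [math.sqrt(max(prefix[i + tile_size] - prefix[i], 0.0))
            for i in range(len(source) - tile_size + 1)]


def sliding_cosine(tile, source):
    """Cosine similarity of the tile against every window of the source."""
    tile_norm = math.sqrt(sum(x * x for x in tile))
    scores = []
    for i, wn in enumerate(window_norms(source, len(tile))):
        # Naive dot product: the step a fast cross-correlation could replace.
        dot = sum(t * source[i + k] for k, t in enumerate(tile))
        scores.append(dot / (tile_norm * wn) if tile_norm and wn else 0.0)
    return scores
```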

    I'm also interested in distance metrics that better reflect what matters to the human ear. It might be helpful to filter out some frequency ranges before doing the comparison, or to use mp3 compression to strip out less important information.

    Soundmosaic usually produces output that clicks loudly at the edges of tiles. I'd like to fix that. I could fade the ends of every output tile, but I'm not sure that would sound any better for small tile sizes, and I don't know what the falloff curve should be or how quickly to fade the edges. Or I could split tiles at the nearest zero crossing, but I don't like the idea of having variable-length tiles.
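
    As an illustration of the first option only, here is what a fade might look like with a linear ramp. The ramp shape, the fade_len parameter, and the function name are all placeholders chosen for the example; nothing here settles which curve or fade length would actually sound best:

```python
def fade_edges(tile, fade_len):
    """Apply a linear fade-in and fade-out to a copy of the tile.

    fade_len is the number of samples ramped at each end; it is capped
    at half the tile so the two ramps never overlap.
    """
    fade_len = min(fade_len, len(tile) // 2)
    out = list(tile)
    for i in range(fade_len):
        gain = i / fade_len  # 0.0 at the very edge, rising toward 1.0
        out[i] *= gain
        out[-1 - i] *= gain
    return out
```

    With very small tiles the ramps cover most of each tile, which is one way to see why fading may not help at small tile sizes.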

Related Work: