Finding Music That Fits: Experiments with Audio Embeddings in the Gemini API
Introduction
A DJ set is not a playlist. A playlist is a list of songs you like. A set is a journey — each track has to connect to the next one tonally, energetically, and stylistically. Finding those connections in a library of thousands of tracks is a non-trivial problem, and it’s one I’ve been thinking about for a long time.
When I started in the Internet business in 2000, my first role was Product Manager for KaZaA, the file-sharing application. One question consumed me: how do you find the right music in a library of millions of tracks shared across the world? The answer we pursued back then was metadata — leveraging the ID3 tags embedded in MP3 files to build filtering and search. It worked, to a degree, but it was always limited by the quality and consistency of what people wrote into those tags. Twenty-five years later, I’m still working on the same problem — just with better tools.
This post documents an ongoing experiment: building a personal DJ library manager (Codexi) that uses AI-powered embeddings to answer the question “what else sounds like this?”. We tried three distinct approaches. Two didn’t work as well as we hoped. The third — native audio embeddings via the new Gemini API — opened up something genuinely interesting. This is an honest account of all three.
The Problem
When you’re preparing a set, you need to rapidly evaluate whether a track fits the mood you’re building. For a large library (3,000+ tracks), that means your mental model of the music has to be very good, or you miss things. The appeal of AI-assisted search is obvious: embed the library into a vector space, then find the nearest neighbors to whatever you’re currently playing.
The question is: what should you embed?
How others have approached this
Music similarity is a well-studied problem, and the range of existing approaches is instructive because each one reveals a different assumption about what “similar” means.
Pandora’s Music Genome Project took the most labor-intensive route: human analysts listen to every track and hand-label it across roughly 450 attributes — things like “minor key tonality,” “mild rhythmic syncopation,” or “extensive vamping.” The result is extremely precise but entirely dependent on human annotation, which doesn’t scale and encodes the analyst’s frame of reference.
Collaborative filtering — the approach underlying most streaming recommendation systems, including large parts of Spotify — sidesteps audio analysis entirely. If users who listen to track A also tend to listen to track B, then A and B are “similar” in a useful sense, regardless of whether they sound alike. This works remarkably well at scale, but it’s blind to acoustic content and has a cold-start problem: new or obscure tracks with no listening history don’t get recommended.
Spotify’s audio analysis (originally developed by The Echo Nest, acquired in 2014) combines both worlds. They run audio through CNNs to extract timbral features, and they also crawl web text — reviews, blogs, playlists — to build cultural context around tracks. Their public Audio Features API exposed some of this as numeric descriptors: energy, danceability, valence, acousticness, tempo. This was genuinely useful for developers, but the API was restricted in 2024/2025 to approved business partners only, which effectively closed off that route for independent tooling.
Open-source audio embedding libraries like VGGish (Google), PANNs (Pre-trained Audio Neural Networks trained on AudioSet), MusicNN, and OpenL3 take the neural approach without requiring proprietary APIs. These models learn audio representations from large labeled datasets and produce fixed-length vectors that encode timbral and structural properties of audio. They work, but they require running inference locally, which adds significant complexity and computational overhead for a personal library tool.
CLAP (Contrastive Language-Audio Pretraining, analogous to CLIP for images) represents a newer approach: train a model jointly on audio and text descriptions so that the two modalities land in the same embedding space. This enables natural language audio search — “find me something that sounds like early morning rain on concrete” — and is a direction several research teams are actively pursuing.
What’s notable about the Gemini audio embedding API is that it sits in the same conceptual space as VGGish/PANNs — a pre-trained neural model that encodes audio into a fixed-dimension vector — but it’s available as a hosted API call rather than a locally-run model. For a personal app with no GPU budget, that’s perfect.

Approach 1: Embedding Vibe Metadata as Text
The first approach was entirely metadata-driven. We prompted Gemini 2.5 Flash to act as an “experienced underground DJ” and analyze each track’s existing metadata — artist, title, label, year, BPM, genre/style tags from the file and from Discogs — and produce a structured vibe assessment.
The output was constrained to a controlled vocabulary of 50 tags across three categories:
- Sub-genre (20 options): Acid House, Acid Techno, Minimal Techno, Deep House, Detroit Techno, EBM, IDM, Dub Techno, and so on
- Sonic character (20 options): 303 Bassline, Hypnotic, Driving, Dark, Atmospheric, Warehouse, Raw, Funky, etc.
- Texture/feel (10 options): Classic, Modern, Underground, Peak Time, Late Night, Stripped Back, etc.
In addition, the model assigned an energy score on a 1–10 scale with explicit anchors:
2 = ambient or drone / 4 = warm-up / 6 = mid-set depth track / 7.5 = floor-filler / 9 = peak time / 10 = extreme
And a placement value from five fixed options: Warm-up, Early Set, Mid Set, Peak Time, After-hours.
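For illustration, a single vibe assessment ends up looking roughly like this (field names are simplified for this post; the actual schema differs in detail):

const vibeAssessment = {
  styles: ['Acid Techno', 'Minimal Techno', 'Warehouse'], // sub-genre and sonic character tags
  energy: 9,                                              // 1–10 scale, using the anchors above
  placement: 'Peak Time',                                 // one of the five fixed options
  vibe: ['Dark', 'Driving', 'Hypnotic'],                  // remaining character / texture tags
};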
The resulting structured data was then serialized into a short text string for embedding:
Styles: acid techno, minimal techno, warehouse. Energy: 9. Placement: Peak Time. Vibe: dark, driving, hypnotic
This string was passed to gemini-embedding-001 to generate a 768-dimensional vector:
// `ai` is the GoogleGenAI client (setup shown in the audio section below)
const response = await ai.models.embedContent({
  model: 'gemini-embedding-001',
  contents: textToEmbed,
  config: { outputDimensionality: 768 },
});
const embedding = response.embeddings[0].values; // 768-dim vector
The resulting vector is stored in a sqlite-vec virtual table for approximate nearest-neighbor search.
What worked
The approach works well for what it is. Tracks from similar sub-genres cluster together. The similarity search surfaces genuinely related music. We added two derived metrics to the playlist view:
- Cohesion score: the magnitude of the centroid vector across all track embeddings in a playlist. A tight genre cluster produces a centroid with high magnitude (approaching 1.0); a chaotic collection produces a low-magnitude centroid. In practice, a well-curated single-genre set scores 0.75–0.85; a diverse DJ set sits around 0.50–0.60.
- Per-track fit score: cosine similarity of each track’s embedding against the normalized centroid. We min-max normalize within the playlist (best fit = 100%, worst = 0%) because the raw cosine range was too compressed (0.50–0.57) to be readable.
The cohesion calculation is simple linear algebra — average the vectors, measure how “aligned” the result is:
// Compute centroid (element-wise average of all track embeddings)
const centroid = new Float32Array(768);
for (const { embedding } of trackEmbeddings) {
  for (let i = 0; i < 768; i++) centroid[i] += embedding[i];
}
for (let i = 0; i < 768; i++) centroid[i] /= trackEmbeddings.length;

// Centroid magnitude = cohesion (1.0 = all vectors identical, ~0 = pointing everywhere)
let magnitude = 0;
for (let i = 0; i < 768; i++) magnitude += centroid[i] * centroid[i];
const cohesion = Math.sqrt(magnitude);

// Per-track fit: cosine similarity vs. normalized centroid, then min-max scale
const norm = new Float32Array(768);
for (let i = 0; i < 768; i++) norm[i] = centroid[i] / cohesion;

const rawScores = trackEmbeddings.map(({ path, embedding }) => {
  let dot = 0;
  for (let i = 0; i < 768; i++) dot += embedding[i] * norm[i];
  return { path, score: dot };
});

const min = Math.min(...rawScores.map(s => s.score));
const max = Math.max(...rawScores.map(s => s.score));
const range = max - min;

const fitScores = rawScores.map(({ path, score }) => ({
  path,
  score: range > 0 ? (score - min) / range : 1, // 0% = worst fit, 100% = best fit
}));
The intuition: if every track in a playlist points in roughly the same direction in embedding space, their average (the centroid) retains most of its length. If the tracks point in wildly different directions, the vectors cancel out and the centroid shrinks toward zero. This gives us a single number for “how coherent does this playlist feel?” — without any genre labels or human judgment.

A key lesson about embedding text
Early experiments included artist name, label, and year in the embedding text. This caused same-label tracks to cluster regardless of how they sounded — Tresor records clustered together even when the music was stylistically unrelated. Removing those fields and keeping only the sonic descriptors (styles, energy, placement, vibe tags) improved the quality of similarity matches significantly.
The ceiling
The limitation is inherent to the approach: it encodes what we say about the music, not what the music sounds like. If the metadata is sparse or incorrect, the embedding reflects that. And the controlled vocabulary, while useful for consistency, constrains the expressiveness. Two tracks can share all three vibe tags and still sound nothing alike.
Approach 2: Raw Audio Feature Extraction (MFCC)
Given the metadata ceiling, the next experiment went in the opposite direction: analyze the audio itself, without any AI or human labels involved.
The approach:
- Decode each track to mono PCM at 22,050 Hz via ffmpeg
- Process in 2,048-sample frames with a 1,024-sample hop
- Extract per-frame: MFCC coefficients 1–12, spectral centroid, spectral rolloff, zero-crossing rate, RMS energy, and 12 chroma bins
- Average all frames into a single 28-dimensional mean vector
- L2-normalize and store in a separate sqlite-vec table
This is a classic approach from music information retrieval research. The hypothesis was that genre-similar tracks would have similar mean spectral shapes.
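For reference, a sketch of that frame-wise extraction and averaging, assuming the Meyda DSP library for the feature extractors (the original implementation has since been removed, so this is illustrative only):

import Meyda from 'meyda';

function meanFeatureVector(signal /* Float32Array, mono PCM at 22,050 Hz */) {
  const frameSize = 2048;
  const hop = 1024;
  Meyda.bufferSize = frameSize;
  Meyda.sampleRate = 22050;

  const sums = new Float64Array(28); // 12 MFCC + centroid + rolloff + ZCR + RMS + 12 chroma
  let frames = 0;

  for (let start = 0; start + frameSize <= signal.length; start += hop) {
    const f = Meyda.extract(
      ['mfcc', 'spectralCentroid', 'spectralRolloff', 'zcr', 'rms', 'chroma'],
      signal.subarray(start, start + frameSize)
    );
    const vec = [
      ...f.mfcc.slice(1, 13), // coefficients 1–12 (skip the 0th)
      f.spectralCentroid, f.spectralRolloff, f.zcr, f.rms,
      ...f.chroma,
    ];
    for (let i = 0; i < 28; i++) sums[i] += vec[i];
    frames++;
  }

  // Averaging across all frames is exactly the step that discards temporal structure
  const mean = Array.from(sums, (s) => s / frames);

  // L2-normalize before storing
  const len = Math.sqrt(mean.reduce((acc, x) => acc + x * x, 0));
  return mean.map((x) => x / len);
}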
What happened
It did not work. At all.
Even tracks from completely different genres — ambient recordings, acid techno, classic rock — scored 0.90–0.99 cosine similarity against each other. The features discriminated essentially nothing.
The root cause is conceptually simple once you see it: averaging thousands of audio frames into a single mean vector destroys all temporal structure. A techno track and an ambient track both have sustained low-frequency energy, both have mid-range spectral content, and both have similar mean chroma distributions over time. The mean spectral shape of most music is broadly similar regardless of genre because it reflects the physics of sound production, not the stylistic choices behind it.
What would be needed to make audio similarity work at this level is a pre-trained music-specific neural embedding model — something like PANNs (Pre-trained Audio Neural Networks), VGGish, or MusicNN — trained on labeled music data with enough examples to learn genre-discriminative features. Hand-crafted DSP feature averaging is not sufficient.
We abandoned the approach entirely and removed the code.
Approach 3: Native Audio Embeddings via the Gemini API
The third approach became possible when gemini-embedding-2-preview launched with multimodal input support, including audio. Rather than extracting features ourselves, we send raw audio directly to the model and receive a 768-dimensional embedding in return.
The model handles the feature extraction internally, drawing on whatever representations it learned during training on large-scale audio data. We don’t know exactly what it’s learned to distinguish — that’s the interesting part.
Implementation
The pipeline for each track:
- Extract a segment from the track midpoint using ffmpeg. We chose the midpoint because DJ tracks frequently have long intros and outros that don’t represent the body of the track. For a 7-minute track, we extract 60–80 seconds starting at the 3.5-minute mark.
- Encode as base64 MP3 and send to the Gemini API.
- Store the 768-dim vector in a second sqlite-vec virtual table (vec_audio_tracks), separate from the text vibe embeddings.
The segment extraction uses ffmpeg. We get the track duration, calculate the midpoint, and extract an 80-second MP3 clip:
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import path from 'node:path';
import os from 'node:os';

const execFileAsync = promisify(execFile);

async function extractSegment(inputPath, startSec, durationSec) {
  const tmpFile = path.join(os.tmpdir(), `embed-${Date.now()}.mp3`);
  await execFileAsync('ffmpeg', [
    '-ss', String(startSec),   // seek to start position
    '-t', String(durationSec), // duration to extract
    '-i', inputPath,
    '-acodec', 'libmp3lame',
    '-q:a', '2',               // variable bitrate, high quality
    '-y',                      // overwrite
    tmpFile,
  ]);
  return tmpFile;
}

// Extract 80s from the midpoint of a 7-minute track
const duration = 420; // seconds (parsed from ffmpeg probe)
const segment = await extractSegment(trackPath, Math.floor(duration / 2), 80);
The segment file is then base64-encoded and sent to the Gemini API. We use the @google/genai SDK:
npm install @google/genai
The core API call is straightforward:
import { GoogleGenAI } from '@google/genai';
import fs from 'node:fs/promises';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const data = await fs.readFile(segmentPath, { encoding: 'base64' });

const response = await ai.models.embedContent({
  model: 'gemini-embedding-2-preview',
  contents: [{
    parts: [{
      inlineData: {
        mimeType: 'audio/mpeg',
        data, // base64-encoded MP3 segment
      },
    }],
  }],
  config: {
    outputDimensionality: 768,
  },
});

const embedding = response.embeddings[0].values; // Float32Array, 768 dims
We rate-limit to one request every 3 seconds to avoid quota exhaustion, with a 30-second backoff on 429 responses.
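A sketch of that throttling loop, assuming a hypothetical embedTrack() helper that wraps the segment extraction and the embedContent call shown above:

const REQUEST_INTERVAL_MS = 3_000; // one request every 3 seconds
const BACKOFF_MS = 30_000;         // pause after a 429 before retrying

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const track of tracks) {
  try {
    await embedTrack(track); // extract segment + embedContent, as above
  } catch (err) {
    // Assumes the SDK error exposes the HTTP status; adjust for your version
    if (err?.status === 429) {
      await sleep(BACKOFF_MS);
      await embedTrack(track); // single retry after backing off
    } else {
      throw err;
    }
  }
  await sleep(REQUEST_INTERVAL_MS);
}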
Storing and querying vectors with sqlite-vec
We use sqlite-vec, a SQLite extension that adds virtual tables for vector similarity search. The schema is minimal:
CREATE VIRTUAL TABLE vec_audio_tracks USING vec0(
  embedding float[768]
);
Each track’s embedding is stored as a Float32Array. The vector table’s rows are linked to the main tracks table by shared row ID, so a single JOIN connects vectors back to track metadata.
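A sketch of that insert, assuming better-sqlite3 and sqlite-vec's raw float32 BLOB input format:

// Look up the track's rowid so the vector row shares it
const { rowid } = db.prepare('SELECT rowid FROM tracks WHERE path = ?').get(trackPath);

// sqlite-vec accepts a float[768] value as a raw little-endian float32 BLOB
db.prepare('INSERT INTO vec_audio_tracks(rowid, embedding) VALUES (?, ?)').run(
  rowid,
  Buffer.from(new Float32Array(embedding).buffer),
);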
Finding similar tracks is a single query. sqlite-vec’s MATCH operator returns the k nearest neighbors by L2 (Euclidean) distance. Because our embeddings are L2-normalized (unit vectors), we can convert L2 distance back to cosine similarity:
const sql = `
  SELECT t.path, t.artist, t.title, t.bpm, v.distance
  FROM (
    SELECT rowid, distance
    FROM vec_audio_tracks
    WHERE embedding MATCH ?
    ORDER BY distance
    LIMIT ?
  ) v
  JOIN tracks t ON v.rowid = t.rowid
`;
const rows = db.prepare(sql).all(queryEmbedding, limit);

// Convert L2 distance on unit vectors to cosine similarity:
// cos_sim = 1 - (L2^2 / 2)
const results = rows.map(row => ({
  ...row,
  score: 1 - (row.distance * row.distance) / 2,
}));
The math here is worth explaining. For two unit vectors a and b, the L2 distance and cosine similarity are related by: ||a - b||² = 2 - 2·cos(a,b), which rearranges to cos(a,b) = 1 - L2²/2. This lets us use sqlite-vec’s native L2 index while reasoning about results in terms of cosine similarity.
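A quick numeric sanity check of that identity on a pair of 2-D unit vectors:

const a = [1, 0];
const b = [Math.SQRT1_2, Math.SQRT1_2];           // both unit length
const cos = a[0] * b[0] + a[1] * b[1];            // 0.7071
const l2 = Math.hypot(a[0] - b[0], a[1] - b[1]);  // 0.7654
console.log(1 - (l2 * l2) / 2);                   // 0.7071, matches cos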
Segment length
We ran the same 9-track experiment at 30 seconds and 80 seconds to see whether segment length affected quality. The script accepts the duration as a command-line argument:
GEMINI_API_KEY=<key> node audio-embed.mjs 30 # 30s segments
GEMINI_API_KEY=<key> node audio-embed.mjs 80 # 80s segments (default)


Results: Similarity Matrix
We ran the experiment on 9 tracks chosen for maximum genre diversity: Chicago acid house (Armando, Phuture), acid techno (Emmanuel Top), ambient (Brian Eno), Warp-era electronic (LFO), Chicago house (Green Velvet), beat-driven experimental (Flying Lotus), disco (Sylvester), and jungle (Congo Natty). The goal was to test whether the model can distinguish across genres, not just within them.
What the numbers show
The acid cluster is tight and correct. Armando and Phuture — both foundational Chicago acid house — are the closest pair at 0.923 (80s) / 0.904 (30s). LFO sits right next to them (0.911 with Phuture, 0.902 with Armando at 80s), and Emmanuel Top’s acid techno connects strongly too (0.890 with Phuture). The model recognizes the shared DNA of 303-driven machine music across decades and scenes without being told anything about genre.
Brian Eno is correctly isolated. Music for Airports scores lowest against almost everything — its nearest neighbor is Armando at 0.754 (80s) / 0.734 (30s), and it drops to 0.566 / 0.554 against Congo Natty. As the only non-rhythmic, purely atmospheric track in the set, it sits at the edge of the embedding space.
Congo Natty is the biggest outlier. Jungle breakbeats are acoustically very different from four-to-the-floor electronic music. Congo Natty’s highest similarity is with LFO at just 0.737 (80s) / 0.734 (30s) — the lowest “best match” of any track. The chopped breaks, sub-bass, and ragga vocals put it in its own region.
Sylvester’s disco sits apart from the electronic cluster. The vocal-driven disco scores moderately against the acid tracks (0.806 with Armando at 80s) but drops off against ambient (0.684) and jungle (0.683). It occupies a middle ground — rhythmic enough to connect with the dancefloor tracks, but vocally and harmonically distinct.
The overall spread is wide. At 80s, scores range from 0.566 (Eno ↔ Congo Natty) to 0.923 (Armando ↔ Phuture) — a spread of 0.357. At 30s, the range is 0.554 to 0.904 — a spread of 0.350. Both are dramatically more discriminating than what we saw with the MFCC approach (where everything scored 0.90–0.99), confirming that the model captures real acoustic differences.
Longer segments sharpen the picture. Comparing the 30s and 80s matrices, the overall ranking of pairs stays remarkably consistent — the model correctly identifies the same clusters and outliers regardless of segment length. But the 80s version tightens the top-end clusters: Armando ↔ Phuture jumps from 0.904 to 0.923, LFO ↔ Phuture from 0.864 to 0.911, and Armando ↔ LFO from 0.843 to 0.902. Meanwhile the distant pairs stay roughly stable (Eno ↔ Congo Natty: 0.554 → 0.566). The effect is that 80 seconds gives the model enough audio to be more confident about genuine similarity, while 30 seconds is enough to get the broad picture right. For a production system, 80 seconds is worth the extra API cost.
Library Visualization
Beyond track-to-track matching, the embeddings enable a different kind of library exploration: visualizing the entire collection as a graph.
We built a force-directed visualization (visualize.mjs) where:
- Each track is a node
- Each track connects to its 4 most acoustically similar neighbors (by audio embedding cosine similarity). Because the relationship is asymmetric — track A might be in track B’s top 4 without B being in A’s — hub tracks that appear in many neighbors’ lists end up with more connections. This is itself informative: a node with many edges is a genre bridge.
- A physics simulation runs repulsion between all nodes and attraction along edges, then cools to a stable layout
The result is a spatial map of the library. Genres self-organize into clusters without any label information being provided to the layout algorithm. Tracks at the center of dense clusters are the “core sound” of that region. Tracks at the periphery are the outliers — unusual or transitional records.
Color modes let you overlay different dimensions on the same layout:
- Isolation (default): red = acoustically isolated / few strong connections; green = well-connected hub
- Connectors: highlights tracks that bridge multiple clusters — the genre-spanning records
The isolation score is particularly useful for a DJ context. An isolated track is acoustically unique — it doesn't connect strongly to any cluster. These tracks are either genre outliers (records that don't fit your usual sound), or they're the "surprise" records that can bridge unexpected combinations. A hub track, by contrast, connects to many neighbors across clusters — these are the genre bridges, ideal for transitions between different sounds in a set.
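A sketch of the graph construction, assuming a cosine() helper and an in-memory map from track path to embedding (the force-directed layout itself is a standard repulsion/attraction iteration and is omitted here):

const K = 4;
const edges = new Map(); // "pathA|pathB" -> similarity, deduplicated across directions

for (const a of tracks) {
  const neighbors = tracks
    .filter((b) => b.path !== a.path)
    .map((b) => ({ b, sim: cosine(embeddings.get(a.path), embeddings.get(b.path)) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, K); // each track nominates its top 4; the relation is not symmetric

  for (const { b, sim } of neighbors) {
    edges.set([a.path, b.path].sort().join('|'), sim);
  }
}

// Degree per node: many edges = hub / genre bridge, few edges = acoustically isolated
const degree = new Map();
for (const key of edges.keys()) {
  for (const p of key.split('|')) degree.set(p, (degree.get(p) ?? 0) + 1);
}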
Below is an example from a larger collection of tracks in my personal library.

Comparing the Two Embedding Approaches
We now run both embedding types in parallel and expose them as separate search modes:
| | Text Vibe Embeddings | Audio Embeddings |
|---|---|---|
| Input | AI-analyzed metadata text string | 30–80s raw MP3 segment from midpoint |
| Model | gemini-embedding-001 | gemini-embedding-2-preview |
| Dimensions | 768 | 768 |
| Speed | Fast (text API call) | Slower (ffmpeg extraction + audio API call) |
| What it captures | Semantic/label-based similarity | Acoustic/sonic similarity |
| Works without metadata? | No — requires good tags + Discogs enrichment | Yes |
| Granularity | Controlled by ontology | Emergent from audio content |
| Best for | Playlist cohesion, vibe-based search | "Sounds like" search, library visualization |
The approaches are genuinely complementary rather than redundant. Text vibe embeddings encode intentional genre and vibe labels — the result of human curation and AI classification. Audio embeddings encode what the track actually sounds like in a more direct, label-free way.
A track labeled "Deep House" by its original tagger but with a heavy industrial texture will score differently in each system. That discrepancy is information, not noise.
Open Questions
The experiment answered some questions and raised more:
DJ transition compatibility. The most practically useful test would be to take track pairs we know mix well and pairs that clash, and see whether the cosine scores reflect that. If audio embedding similarity correlates with actual mixing compatibility, the case for using it as the primary similarity metric becomes strong. We haven't run this experiment systematically yet.
Subgenre discrimination. The matrix shows that techno and industrial cluster together (0.88–0.94 range). Can the model distinguish Detroit techno from Berlin techno from industrial techno? Our suspicion is that within-genre distinctions will be harder, and the signal will be in the 0.90–0.99 range where the spread is tighter.
Tempo vs. texture. Does a slow ambient track score closer to a slow techno warm-up track, or to a faster ambient piece? This would reveal whether the model prioritizes energy level or sonic texture — both are relevant for DJs but in different ways.
Multi-segment sampling. Extracting one segment from the midpoint is a heuristic. A more robust approach would embed at 25%, 50%, and 75% of track duration and average the three vectors. This would reduce the influence of any one segment being atypically quiet or loud.
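A sketch of what that would look like, reusing extractSegment() from above and a hypothetical embedAudioFile() wrapper around the embedContent call:

async function embedMultiSegment(trackPath, durationSec) {
  const positions = [0.25, 0.5, 0.75];
  const sum = new Float32Array(768);

  for (const p of positions) {
    // A real version should clamp so the segment doesn't run past the end of the track
    const segment = await extractSegment(trackPath, Math.floor(durationSec * p), 80);
    const vec = await embedAudioFile(segment); // hypothetical wrapper, see the API call above
    for (let i = 0; i < 768; i++) sum[i] += vec[i];
  }

  // Average, then L2-normalize so the result is comparable to single-segment embeddings
  let mag = 0;
  for (let i = 0; i < 768; i++) { sum[i] /= positions.length; mag += sum[i] * sum[i]; }
  mag = Math.sqrt(mag);
  for (let i = 0; i < 768; i++) sum[i] /= mag;
  return sum;
}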
Same-artist and same-label clustering. Do tracks from the same artist cluster together in audio embedding space? What about label catalogs — do Tresor or R&S releases form coherent acoustic clusters independent of genre labels?
Try It Yourself
If you want to experiment with audio embeddings on your own music library, the core loop is surprisingly small.
What you need:
- Node.js (v18+)
- ffmpeg installed on your system
- A Gemini API key — create one at aistudio.google.com. The embedding API has a free tier that's generous enough for experimentation.
The zipfile contains a minimal end-to-end script called audio-embed.mjs that embeds a folder of MP3s, prints a similarity matrix, and writes the results to a JSON file. It's ~160 lines, self-contained, and needs only @google/genai and ffmpeg. Drop your MP3s into a tracks/ folder next to the script and run:
npm install
GEMINI_API_KEY=<your-key> node audio-embed.mjs # 80s segments (default)
GEMINI_API_KEY=<your-key> node audio-embed.mjs 30 # or try 30s
The script comes with a minimal package.json — just run npm install to pull in the @google/genai dependency.
It outputs the raw embeddings to output/embeddings-<duration>s.json, writes a markdown similarity matrix to output/similarity-matrix-<duration>s.md, and prints the results to the console.
There's also a Python script, heatmap.py, that generates heatmap images from the similarity matrices. It auto-discovers any output/similarity-matrix-*s.md files and produces a PNG for each one with numbered axes and a track list legend. If you have uv installed, you can run it without manually installing dependencies:
uv run --with matplotlib --with numpy heatmap.py
Or with a standard pip environment: pip install matplotlib numpy && python heatmap.py.
A second script, visualize.mjs, reads the embeddings and generates an interactive force-directed graph as a self-contained HTML file:
node visualize.mjs # uses the latest embeddings file
node visualize.mjs 80 # or specify a duration explicitly
It opens the result in your browser — each track is a node, edges connect to the 4 nearest neighbors, and the physics simulation settles into clusters. Hover for similarity scores, pan and zoom with the mouse.
Conclusion
We started trying to solve a specific problem — finding music that fits together in a DJ context — and ended up exploring the boundary between metadata-based and audio-based music representation.
The MFCC experiment was a useful failure: it clarified that raw audio feature averaging is not discriminative enough for this use case, and that what's needed is a model trained to understand music, not just to measure spectral properties.
The Gemini audio embedding API represents a qualitatively different approach. It's not perfect — scores are compressed, fine-grained subgenre discrimination is uncertain, and we've only run small-scale tests so far. But the clustering in our 9-track experiment is directionally correct without any genre labels or metadata, and the force-directed library visualization reveals structure that isn't visible when you're browsing a flat file tree.
The library visualization clearly showed that broad genres like techno, acid, DnB, and UKG form well-separated clusters. Ambient music like Pete Namlook lands in more or less the same cluster as the chill-out tracks from Air, which makes intuitive sense. More subtle groupings emerge too: lo-fi-sounding acid (Paranoid London) ends up in a cluster with similar-sounding tracks, regardless of whether they are old (Acid Junkies) or new (Delta Funktionen).
Some final words: this technology will not replace the work of a great DJ and selector. The intuitive, surprising selections a DJ makes are hard for this kind of algorithm to reproduce. What it did for me was occasionally surface tracks that were somewhat unexpected but fit together remarkably well across genres and eras. Very interesting, and a lot of fun to work with.
Codexi is a local-first Electron app for DJ library management. The audio embedding experiment uses gemini-embedding-2-preview via the @google/genai SDK, with sqlite-vec for vector storage and approximate nearest-neighbor search.