Building a RAG System Over Structured Biometric Data

Most RAG tutorials embed PDFs or markdown documents. AthleteOS presented a different challenge: the source of truth was a Snowflake warehouse with structured dimensional tables — sessions, reps, biometric readings, form scores. The question was how to make that queryable in natural language without hallucinating numbers that don't exist in the data.

Here's how I approached it.

The problem with structured data and RAG

Traditional RAG works well for unstructured corpora. You chunk documents, embed them, store vectors in something like pgvector or Pinecone, retrieve the top-k chunks at query time, and pass them as context to the LLM. The LLM synthesizes an answer from the retrieved text.

Structured data breaks this in two ways. First, the "documents" are rows — a session record isn't a paragraph, it's 30 columns of floats and timestamps. Embedding raw JSON blobs of tabular data produces terrible retrieval. Second, precise numeric answers require exact aggregation, not fuzzy semantic similarity. If I ask "what was my average HRV on deadlift days in March?", I need the actual SQL result, not the closest-sounding training log entry.

The solution is a hybrid: use embeddings for semantic retrieval of context, but use the LLM to generate SQL for precise numeric questions, and verify every number against actual query results before surfacing it.

The data model

AthleteOS writes to Snowflake via dbt. The relevant tables after transformation:

- fct_sessions — one row per training session: athlete_id, session_id, start_ts, duration_mins, exercise_type, total_reps, avg_form_score

- fct_biometrics — one row per biometric reading: session_id, timestamp, hrv, strain, recovery_score, heart_rate

- fct_reps — one row per rep: session_id, rep_number, form_score, joint_angles (JSON), flagged_errors (array)

- dim_exercises — exercise metadata: exercise_id, name, muscle_groups, movement_pattern

The embeddings live in PostgreSQL with pgvector, not Snowflake. I write session summaries to Postgres as part of the dbt post-hook — a plain-text paragraph summarizing each session that gets embedded and stored in a session_embeddings table.

Session summary generation

The dbt post-hook calls a FastAPI endpoint after each model run. The endpoint takes the session ID, queries Snowflake for the session's stats, and generates a natural language summary:


Session 2025-03-14: 45-minute deadlift session.
6 sets × 5 reps. Average form score 0.82/1.0.
HRV entering session: 62ms (above baseline).
Recovery score: 74%.
Flagged errors: lower back rounding on reps 3 and 5 of set 4.
Progressive overload maintained — 10lb increase from previous session.

This summary is then embedded via the OpenAI embedding API and stored in pgvector alongside the session_id foreign key.

Query routing

At query time, I classify the incoming question into one of three types before doing anything else:

- Semantic recall — "What did my deadlift form feel like last week?" → retrieve similar session summaries via vector similarity

- Precise numeric — "What was my average HRV on Tuesday sessions this month?" → generate SQL, execute against Snowflake, return exact numbers

- Hybrid — "Why has my form score been dropping?" → retrieve context via embeddings, then query for the actual trendline numbers

The classifier is a simple LLM call with a system prompt that returns one of {semantic, numeric, hybrid} and nothing else. Fast and cheap.

SQL generation and verification

For numeric queries, I pass the Snowflake schema (table definitions + column descriptions) as context and ask the LLM to generate a read-only SELECT query. The key constraints in the prompt:

- Only SELECT — no mutations allowed, enforced at the connection level too

- Always filter by athlete_id = {current_user} — the LLM doesn't get to decide scope

- Return the SQL before execution so I can log it

The generated SQL runs against Snowflake. If it errors, I send the error back to the LLM with the original query and ask for a corrected version — one retry, then fallback to "I couldn't compute that precisely."

The numeric result is then passed back to the LLM as verified context for the final answer. The LLM can interpret and explain the numbers, but it cannot invent them.

What actually works well

The hybrid approach handles the hardest question type well: "Why was my deadlift form worse last Tuesday?" Vector retrieval pulls the session summary with the flagged errors. A follow-up numeric query pulls the form score trend for the past 4 weeks. The LLM synthesizes both into an explanation that cites actual numbers from actual data.

The biggest win was keeping the LLM out of the arithmetic. Every number in the final answer is a Snowflake query result. The LLM's job is interpretation and language, not computation.