AI-Driven Financial Insight Pipeline
2025·Sole engineer·shipped

An AI-powered data pipeline that extracts financial data from unstructured PDF documents, loads it into BigQuery, transforms it with dbt, and generates automated variance analysis and revenue forecasts. Measured impact: +20% forecast accuracy, +15% revenue identified, and 10 hours per week recovered for the finance team.

Problem

Critical payment data was locked in unstructured PDFs (invoices, bank statements, transaction records) that no existing pipeline could read. The finance team spent 10+ hours per week manually extracting and reconciling this data, with no systematic forecasting or anomaly detection in place.

Solution

Built an LLM-powered extraction layer using Claude and Gemini APIs to parse unstructured financial PDFs into structured records. Airflow DAGs orchestrate the full pipeline: extract → validate → load to BigQuery → dbt transforms → automated variance analysis with narrative insight generation.

Architecture

Extraction

Claude and Gemini APIs parse raw PDF content — invoices, statements, transaction records — into structured JSON payloads with merchant, amount, date, and category fields.
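
As an illustration of what a structured payload could look like once it leaves the model, here is a minimal sketch that validates one JSON response and converts it to a typed record. The four field names come from the description above; the `PaymentRecord` class, the `parse_payment_record` helper, the ISO date format, and the lowercase category normalization are assumptions made for this example, not the project's actual schema.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation

# Field names taken from the description; everything else is illustrative.
REQUIRED_FIELDS = ("merchant", "amount", "date", "category")


@dataclass(frozen=True)
class PaymentRecord:
    merchant: str
    amount: Decimal
    date: date
    category: str


def parse_payment_record(raw_json: str) -> PaymentRecord:
    """Validate one model response and convert it to a typed record.

    Raises ValueError for malformed JSON, missing fields, or bad values.
    """
    payload = json.loads(raw_json)  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")

    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    try:
        # Decimal avoids float rounding on currency amounts.
        amount = Decimal(str(payload["amount"]))
    except InvalidOperation as exc:
        raise ValueError(f"invalid amount: {payload['amount']!r}") from exc
    if not amount.is_finite():
        raise ValueError(f"invalid amount: {payload['amount']!r}")

    return PaymentRecord(
        merchant=str(payload["merchant"]).strip(),
        amount=amount,
        # Assumes the model was asked to emit ISO 8601 dates (YYYY-MM-DD).
        date=date.fromisoformat(str(payload["date"])),
        category=str(payload["category"]).strip().lower(),
    )
```

Rejecting a bad payload at this boundary keeps malformed model output from reaching the warehouse, where it would be harder to trace back to a source document.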

Orchestration

Apache Airflow DAGs manage the full pipeline: extraction, validation, BigQuery loading, and dbt model execution. Retries and alerting handle transient API failures.
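
In Airflow, retry behavior is normally configured per task (for example through the `retries` and `retry_delay` task arguments) rather than hand-written. The sketch below only illustrates the underlying idea in plain Python: retry a call on transient failures with exponential backoff, and re-raise once attempts run out so that alerting can fire. `TransientAPIError` and `call_with_retries` are hypothetical names for this example, and the attempt count and delays are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientAPIError with exponential backoff.

    The final failure is re-raised so the caller (or the scheduler)
    can mark the task as failed and trigger alerting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_attempts:
                raise
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter makes the backoff testable without real waiting. Only errors classified as transient are retried; anything else fails immediately.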

Transformation

dbt models in BigQuery define staging, intermediate, and mart layers. Dimensional modeling separates raw financial records from aggregated KPI and forecast tables.

Forecasting

Time-series models trained on historical payment data generate revenue forecasts with confidence intervals. Variance analysis highlights deviations from plan automatically.
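
The variance step can be sketched as a comparison of actuals against plan per period, flagging any deviation beyond a tolerance. This is a simplified illustration, not the project's implementation: the `Variance` record, the `variance_report` function, the period keys, and the 10% threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Variance:
    period: str
    plan: float
    actual: float
    delta: float            # actual minus plan
    pct: Optional[float]    # delta as a percentage of plan; None if plan is 0
    flagged: bool


def variance_report(
    plan: Dict[str, float],
    actual: Dict[str, float],
    threshold_pct: float = 10.0,  # assumed tolerance, for illustration only
) -> List[Variance]:
    """Compare actuals to plan for every period present in both inputs."""
    rows = []
    for period in sorted(plan.keys() & actual.keys()):
        p, a = plan[period], actual[period]
        delta = a - p
        if p == 0:
            # Percentage is undefined against a zero plan; flag any deviation.
            pct, flagged = None, delta != 0
        else:
            pct = delta * 100.0 / p
            flagged = abs(pct) > threshold_pct
        rows.append(Variance(period, p, a, delta, pct, flagged))
    return rows
```

For example, with a plan of 200.0 and an actual of 150.0 for one period, the delta is -50.0 and the variance is -25%, which exceeds the assumed 10% tolerance and is flagged. Flagged rows are the natural input for a narrative summary step.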

Insight generation

LLM-generated narrative summaries explain key variances, flag anomalies, and surface revenue opportunities — replacing manual analyst commentary with automated reports.

Highlights

  • +20% improvement in revenue forecast accuracy over the previous manual process.
  • +15% revenue identified through systematic analysis of previously unread PDF data.
  • 10 hours per week recovered for the finance team — zero manual extraction.
  • Claude and Gemini API extraction layer handles diverse PDF formats without schema changes.
  • Airflow-orchestrated dbt pipeline with full data lineage and quality validation.
  • Automated variance analysis with LLM-generated narrative commentary.