Development Engine for Complex Data Platforms (part II)
Setting up the enablers: CLAUDE.md, skills, and the unified extensibility system that embeds architectural discipline into the development process.
01 · The Development Problem Behind Data Platforms
Building a medallion-architecture data platform with pipelines across multiple processing domains creates a specific engineering challenge that has nothing to do with the data itself: how do you keep architectural discipline consistent across every notebook, every pipeline, and every environment as the platform evolves?
Quality rules get forgotten. Partition keys get omitted. Configuration patterns drift between notebooks. A new developer copies an old notebook and misses the latest pattern. A production fix bypasses the MERGE pattern and introduces duplicates. These are not hypothetical risks — they are the natural entropy of any complex codebase under active development.
Traditional approaches — code reviews, style guides, wiki documentation — rely on human attention. They work until the platform reaches a scale where no single person can hold the full architectural context. Claude Code solves this by embedding the platform’s architectural knowledge directly into the development toolchain.
02 · The Technology Stack
Before discussing how Claude Code helps, it is worth mapping the stack that needs to be maintained. Each component has specific configuration requirements, failure modes, and interaction patterns that must be consistently applied across every pipeline.
Microsoft Fabric — Unified analytics platform. Lakehouses, pipelines, notebooks, and scheduling under one capacity.
Delta Lake — ACID transactions on the lakehouse. MERGE/UPSERT for idempotent writes. Schema enforcement at Silver layer.
Apache Spark — All transformation logic in PySpark notebooks. Shared sessions via orchestrator pattern. Window functions for time-series continuity.
Azure Cosmos DB — Low-latency serving layer for real-time dashboards. Partition discipline enforced by code. REST API exposure.
Azure Key Vault — Secrets management. 3-tier config cascade: Key Vault → Spark Config → Default Value across all environments.
Terraform — Infrastructure as Code for capacity provisioning, Cosmos DB accounts, Key Vault policies, and network rules.
The challenge is not any single component — it is the interactions between them. A Cosmos container defined in Terraform must match the partition key used in the notebook that writes to it. A Key Vault secret path must be consistent across the Terraform policy, the Spark config, and the notebook’s fallback default. These cross-cutting dependencies are exactly where drift causes silent failures.
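One way to catch that kind of drift mechanically is to cross-check the two definitions in code. The sketch below is a minimal, hypothetical example — the Terraform and notebook sources are inlined strings here, and a real audit would read `main.tf` and the notebook files from disk:

```python
import re

# Hypothetical inputs: the Terraform container definition and the
# notebook that writes to it. A real check would load these from files.
terraform_src = '''
resource "azurerm_cosmosdb_sql_container" "gold_telemetry" {
  name               = "gold-telemetry-events"
  partition_key_path = "/partition_date"
}
'''

notebook_src = '''
doc = {"entity_id": row.entity_id, "partition_date": row.event_date}
container.upsert_item(doc)
'''

def terraform_partition_key(src: str) -> str:
    """Extract the partition key path declared in Terraform."""
    match = re.search(r'partition_key_path\s*=\s*"/(\w+)"', src)
    return match.group(1) if match else ""

def notebook_writes_key(src: str, key: str) -> bool:
    """Check that the notebook includes the key in its documents."""
    return f'"{key}"' in src

key = terraform_partition_key(terraform_src)
assert notebook_writes_key(notebook_src, key), f"notebook missing {key}"
print(f"partition key '{key}' consistent across Terraform and notebook")
```

This is exactly the kind of mechanical cross-reference a governance skill can run on every commit, instead of relying on a reviewer to remember it.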
03 · Why Claude Code for Data Platform Development
A data platform is not a web app. There is no npm run dev. The development loop involves PySpark notebooks that run on remote clusters, Delta Lake tables you can only inspect by querying, documents in a serving layer that silently misdirect if a partition key is missing, and Terraform plans that manage infrastructure across three environments.
Claude Code fits this workflow because it operates in the terminal — the same place where you run az login, terraform plan, pipe logs, and inspect tables. It reads the full project structure, understands architectural context from CLAUDE.md, and executes agentic workflows that chain multiple tools: search the codebase for all MERGE statements, validate partition key consistency across Terraform and notebooks, run a lint pass, and generate a compliance report — all in a single session.
The Strategic Value
The value of Claude Code for data platforms is not speed — it is consistency at scale. When every notebook must follow the same structure, every write must use the same pattern, and every deployment must pass the same governance checks, the question is not whether rules will be violated. It is how quickly those violations get caught. Claude Code shifts detection from code review (hours/days) to code generation (instant).
04 · CLAUDE.md — The Platform’s Persistent Memory
The CLAUDE.md file sits at the root of the repository. It encodes the architectural decisions, quality rules, naming conventions, and environment setup that every developer — human or AI — must follow. Claude Code reads it at the start of every session. No repetition, no drift.
# Data Platform — Real-Time Analytics Platform
## Project Overview
Medallion architecture on Microsoft Fabric (F4 capacity).
Bronze → Silver → Gold → Cosmos DB / Power BI serving layer.
Three processing domains: orchestrated batch, hybrid event/scheduled,
incremental replication. 15+ active pipelines.
## Architecture Rules
- Every write uses Delta Lake MERGE (UPSERT). No append-only writes.
- Every Cosmos DB document MUST include partition_date field.
- All fact-to-dimension joins MUST be LEFT JOIN with count guard.
- Every cache() must have a matching unpersist().
- Named constants only. No hardcoded literals for thresholds.
- Imports at module top level. Never inside loops or conditionals.
## Environment Configuration
3-tier cascade: Key Vault → Spark Config → Default Value.
Notebooks MUST work standalone (dev) and under orchestrator (prod).
@docs/environment-setup.md for Key Vault paths per environment.
## Notebook Structure
Every notebook follows: Config → Read → Transform → Write → Cleanup.
Orchestrator children detect shared temp views via try/except.
Fallback file discovery scans last N timestamped folders via regex.
## Validation
- spark-submit --deploy-mode local for unit testing
- Domain-specific thresholds: @docs/validation-rules.md
- Geospatial: reject coordinates outside configured bounding box
- Alert classification: auto-computed, not manually overridable
## Documentation
Every code change requires a docs update in the same commit.
User-facing docs are bilingual (Spanish/English).
See @docs/checklist-template.md for mandatory checklist.
This file is under 60 lines. It does not try to cover everything — it uses @path imports to reference deeper documentation that Claude reads only when working on related tasks. This follows the progressive disclosure principle: load context only when it is relevant.
Design Principle
A good CLAUDE.md is not a knowledge dump — it is a routing table. It tells Claude what the rules are and where to find deeper context. The architecture rules section encodes the six quality rules from the platform. The @docs references point to validation thresholds, environment configs, and schema definitions that change more frequently than the architectural patterns themselves.
05 · The Unified Skills System
The unified skills (.claude/skills/) system has two invocation modes:
User-invoked skills are triggered explicitly with /skill-name. These are the workflows you call when you need them — creating a notebook, running a governance audit, validating infrastructure consistency.
Agent-invoked skills are loaded automatically by Claude when it recognizes a matching context. If a skill’s description field matches what you are working on, Claude loads it without being asked. This means a skill describing “Spark notebook creation” can activate automatically when Claude detects you are building a new pipeline.
Every skill lives in its own directory with a SKILL.md as the entry point. The SKILL.md file has two parts: YAML frontmatter that controls when and how the skill runs, and markdown content with the instructions Claude follows.
What Skills Can Do Now
Beyond simple prompt instructions, the current skills system supports: frontmatter fields that restrict which tools Claude can use (allowed-tools), override the model (model), or spawn the skill as an isolated subagent (agent: true); supporting files like templates, example outputs, and validation scripts that live alongside SKILL.md; and dynamic context injection via !command syntax, which runs a shell command and injects its output into the skill prompt before Claude sees it. This makes skills genuinely programmable, not just instructional.
06 · Pipeline Creation Skills
The most valuable skills for a data platform encode the pipeline structure itself, so every new notebook starts compliant. Here is the spark-notebook skill with proper frontmatter:
.claude/skills/spark-notebook/
├── SKILL.md # Main instructions (required)
├── templates/
│ └── notebook-template.py # Skeleton Claude fills in
├── examples/
│ └── silver-telemetry.py # Reference output showing expected format
└── scripts/
└── validate-structure.sh # Script to verify cell order compliance
---
name: spark-notebook
description: >
Creates or modifies PySpark notebooks for the Data platform.
Use when building new pipelines, adding analytical notebooks, or
refactoring existing notebooks to match the mandatory structure.
allowed-tools:
- Read
- Write
- Bash(spark-submit:*)
- Bash(python:*)
---
# Spark Notebook Creation Skill
When creating or modifying a PySpark notebook for this platform,
follow this structure strictly.
## Mandatory Cell Order
1. Parameters cell — Fabric parameters + inj_* injected params
2. Configuration cell — 3-tier config resolution
3. Source reading cell — Check for shared temp views first
4. Transformation cells — Business logic with named constants
5. Write cell — Delta Lake MERGE with composite key
6. Serving sync cell — Include partition_date in every document
7. Cleanup cell — unpersist() all cached DataFrames
## Config Resolution Pattern
Always implement the 3-tier cascade:
endpoint = spark.conf.get(
    "spark.dataplatform.cosmos.endpoint",
    "https://default-dev.documents.azure.com"
)
## Orchestrator Detection
try:
    # Shared temp view exists when a parent orchestrator created it
    df = spark.sql("SELECT * FROM tmp_source_data")
    running_under_orchestrator = True
except Exception:  # no shared view: standalone (dev) run
    df = spark.read.parquet(fallback_path)
    running_under_orchestrator = False
## Write Pattern
ALWAYS use MERGE. Never .mode("overwrite") or .mode("append"):
delta_table.alias("target").merge(
    df.alias("source"),
    "target.entity_id = source.entity_id "
    "AND target.event_date = source.event_date"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
## Anti-Patterns (reject if found)
- NEVER use .mode("overwrite") on Silver/Gold tables
- NEVER omit partition_date in serving-layer documents
- NEVER use cache() without unpersist()
- NEVER hardcode thresholds as inline literals
## Supporting Files
- See templates/notebook-template.py for the cell skeleton
- See examples/silver-telemetry.py for a complete reference
- Run scripts/validate-structure.sh to verify compliance
When a developer types /spark-notebook, Claude loads the skill, reads the supporting files for context, follows the cell structure, and produces a notebook that is already compliant with every quality rule. Because the description field mentions “building new pipelines” and “refactoring existing notebooks,” Claude can also load this skill automatically when it detects related work — no slash command needed.
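The 3-tier config resolution the skill mandates can be sketched more fully in plain Python. This is an illustrative stand-in, not the platform’s actual implementation: the dictionaries below play the role of Key Vault and the Spark conf, and the names are hypothetical.

```python
# Stand-ins for the real stores: in a notebook, key_vault would be a
# Key Vault lookup and spark_conf would be spark.conf.
key_vault = {"cosmos-endpoint": "https://prod.documents.azure.com"}
spark_conf = {"spark.dataplatform.cosmos.endpoint": "https://uat.documents.azure.com"}

def resolve_config(secret_name, conf_key, default):
    """3-tier cascade: Key Vault -> Spark config -> default value."""
    value = key_vault.get(secret_name)        # tier 1: Key Vault
    if value is None:
        value = spark_conf.get(conf_key)      # tier 2: Spark config
    if value is None:
        value = default                       # tier 3: hardcoded default
    return value

endpoint = resolve_config(
    "cosmos-endpoint",
    "spark.dataplatform.cosmos.endpoint",
    "https://default-dev.documents.azure.com",
)
print(endpoint)  # -> https://prod.documents.azure.com (Key Vault wins)
```

The ordering is the whole point: production values come from Key Vault, orchestrated runs can inject overrides through the Spark conf, and the default keeps standalone dev runs working.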
The same approach works for other repeatable patterns: a telemetry-ingestion skill that encodes the hybrid event/scheduled execution modes, a tail-fetching skill that encodes the incremental replication pattern with LEAD window functions, or a serving-sync skill that enforces partition discipline and retry logic. Each gets its own directory, its own frontmatter, and its own supporting files.
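The continuity check behind the tail-fetching pattern can be illustrated without a cluster. In PySpark it would use F.lead over a Window ordered by the record’s start field; the pure-Python emulation below shows the same logic, with illustrative field names:

```python
# Each record covers [start, end); replication is continuous when every
# record's end equals the next record's start (the LEAD of start).
records = [
    {"start": 0, "end": 10},
    {"start": 10, "end": 20},
    {"start": 25, "end": 30},  # gap: previous record ended at 20
]

def find_gaps(rows):
    """Emulate LEAD(start) OVER (ORDER BY start) to detect gaps."""
    rows = sorted(rows, key=lambda r: r["start"])
    gaps = []
    for current, nxt in zip(rows, rows[1:]):
        lead_start = nxt["start"]          # what F.lead("start") would return
        if current["end"] != lead_start:   # discontinuity -> refetch the tail
            gaps.append((current["end"], lead_start))
    return gaps

print(find_gaps(records))  # -> [(20, 25)]
```

Any gap the check finds marks a window that the incremental replication must fetch again before the Silver table can be considered complete.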
07 · Governance Audit Skills
Where pipeline creation skills help create correct code, governance audit skills help verify that existing code stays correct. These are user-invoked skills — you call them explicitly when you need a cross-cutting check.
---
name: validate-serving-layer
description: >
Scans all notebooks that write to the serving layer and verifies
partition key presence, naming conventions, and retry logic.
Use before commits, during PR reviews, or as a pre-deploy check.
allowed-tools:
- Read
- Bash(grep:*)
- Bash(find:*)
---
# Validate Serving Layer Documents
Scan all notebooks in the repository that write to Cosmos DB.
For each one, verify:
1. Every document dictionary includes a "partition_date" key
2. The partition key value is derived from data, not hardcoded
3. The container name matches convention: "gold-{domain}-{entity}"
4. Error handling wraps the write with retry logic
Report findings as a table:
| Notebook | Container | partition_date present | Naming OK | Retry logic |
Flag any notebook that fails any check as NEEDS FIX with severity.
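The retry logic this skill audits for might resemble the following sketch. Exception handling, attempt counts, and backoff values are assumptions for illustration, not the platform’s actual implementation:

```python
import time

def upsert_with_retry(write_fn, doc, max_attempts=3, backoff_s=1.0):
    """Wrap a serving-layer write with bounded exponential backoff."""
    if "partition_date" not in doc:
        raise ValueError("document missing partition_date")  # fail fast
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn(doc)
        except Exception:
            if attempt == max_attempts:
                raise                       # exhausted: surface the error
            time.sleep(backoff_s * 2 ** (attempt - 1))

# Usage with a flaky stand-in for container.upsert_item:
calls = {"n": 0}
def flaky_write(doc):
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient")
    return "ok"

result = upsert_with_retry(flaky_write, {"partition_date": "2024-01-01"},
                           backoff_s=0.01)
print(result)  # -> ok
```

Note that the wrapper also fails fast on a missing partition_date — the same invariant the audit checks statically, enforced at runtime.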
---
name: audit-cache
description: >
Finds orphaned cache() calls without matching unpersist().
Critical on constrained Spark capacity where orphaned caches
cause OOM crashes on shared sessions.
allowed-tools:
- Read
- Bash(grep:*)
---
# Audit Cache/Unpersist Pairs
Search all .py and notebook files for .cache() calls.
For each cached DataFrame, verify a matching .unpersist()
exists in the same notebook.
On constrained capacity, orphaned caches cause OOM crashes
on shared Spark sessions. This is a production blocker.
List all violations with the exact line where
unpersist() should be added.
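The core of such an audit can be sketched in a few lines of plain Python. The notebook source is a toy inline string here; a real audit would walk the repository with pathlib and read each file:

```python
import re

# Toy notebook source with one properly released cache and one orphan.
notebook_src = '''
df_events = spark.read.table("silver.events").cache()
df_dims = spark.read.table("silver.dims").cache()
result = df_events.join(df_dims, "entity_id", "left")
df_events.unpersist()
'''

def orphaned_caches(src: str) -> list[str]:
    """Return DataFrame names with .cache() but no matching .unpersist()."""
    cached = set(re.findall(r"(\w+)\s*=.*\.cache\(\)", src))
    released = set(re.findall(r"(\w+)\.unpersist\(\)", src))
    return sorted(cached - released)

print(orphaned_caches(notebook_src))  # -> ['df_dims']
```

A regex pass like this is deliberately crude — the skill lets Claude apply judgment on top of it, such as recognizing caches released inside a finally block.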
---
name: check-merge-keys
description: >
Verifies that every Delta Lake MERGE operation uses the correct
composite key as documented in the table schemas. Partial keys
cause silent duplicates that compound over time.
allowed-tools:
- Read
- Bash(grep:*)
context:
- docs/table-schemas.md
---
# Check MERGE Key Consistency
For every Delta Lake MERGE operation in the codebase:
1. Extract the merge condition (composite key)
2. Verify the key matches the table's documented primary key
in @docs/table-schemas.md
3. Flag any MERGE that uses a partial key — this causes
silent duplicates that compound over time
Report as: | Table | Expected Key | Actual Key | Match? |
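The key-extraction step of this audit can be sketched as follows. The notebook source and documented keys are toy inputs; a real run would read the notebooks and parse @docs/table-schemas.md:

```python
import re

# Toy inputs: a MERGE using only a partial key, and the documented key.
notebook_src = '''
delta_table.alias("target").merge(
    df.alias("source"),
    "target.entity_id = source.entity_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
'''
documented_keys = {"silver.events": ["entity_id", "event_date"]}

def merge_key_columns(src: str) -> list[str]:
    """Extract target columns referenced in the MERGE condition string."""
    cond = re.search(r'"([^"]*target\.[^"]*)"', src)
    return re.findall(r"target\.(\w+)", cond.group(1)) if cond else []

actual = merge_key_columns(notebook_src)
expected = documented_keys["silver.events"]
missing = sorted(set(expected) - set(actual))
print(missing)  # -> ['event_date']  (partial key: silent duplicates)
```

Any non-empty `missing` list is exactly the partial-key condition the skill flags, since a MERGE on a subset of the documented key will match too many rows and quietly duplicate the rest.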
Notice the frontmatter differences. The validate-serving-layer skill restricts tools to Read and basic Bash commands — it is an auditor, not a modifier. The check-merge-keys skill uses the context field to automatically load the table schemas documentation, so Claude has the reference data it needs without searching for it.
Running /validate-serving-layer takes seconds and catches the exact failure mode that cost days to debug in production: a serving database silently misdirecting documents when a partition key is missing. The skill encodes that institutional knowledge permanently — it does not depend on any individual remembering to check.
Agent Skills vs. User-Invoked Skills
Governance audits are user-invoked — you call /validate-serving-layer when you want a check. But the spark-notebook skill can work both ways: invoked explicitly with /spark-notebook, or loaded automatically when Claude detects you are creating a new pipeline. The description field in the frontmatter is what drives automatic loading. Write it like a search query for the kind of work it applies to, and Claude will load it when the context matches.
08 · Built-in Skills Worth Knowing
Claude Code ships with bundled skills that complement custom platform skills. Several are directly relevant to data platform development:
/review — Code review with context-aware analysis. Useful for catching quality rule violations in PRs before custom audit skills run.
/simplify — Refactors complex code for clarity. On a platform where notebook readability directly affects maintainability, this is high leverage.
/batch — Runs a command across multiple files. Combined with audit skills, this enables bulk compliance checks: “run /audit-cache across all notebooks in the telemetry domain.”
/loop — Executes a prompt at recurring intervals within a session. Acts as a lightweight monitoring layer during deployment windows: /loop 30m "check pipeline logs for CRITICAL errors".
/debug — Reads the session debug log and diagnoses Claude Code issues. Useful when a skill behaves unexpectedly or a tool call fails silently.
These bundled skills work alongside custom skills. A typical governance workflow might chain /review for general code quality, then /validate-serving-layer for platform-specific checks, then /audit-cache for memory safety — all in the same session.
What’s Next
This post covered the setup: CLAUDE.md as the architectural routing table, the unified skills system with frontmatter and supporting files, pipeline creation skills that encode mandatory structures, and governance audit skills that verify compliance.
In Part III, we take these enablers into production. We cover how the six quality rules become an enforceable governance framework, how Terraform and notebook code are cross-referenced to prevent infrastructure drift, a complete Spark notebook example using the tail-fetching pattern, how auto memory compounds debugging insights across sessions, and how agentic workflows chain everything into full-repository audits that run in seconds.
Claude Code · CLAUDE.md · Unified Skills · Frontmatter · PySpark · Delta Lake · Data Governance · Platform Engineering
Disclaimer: Built from the ground up using documentation and diagrams from production-scale Medallion architectures. While Claude assisted in the structural organization of this post, all technical strategies and data-processing milestones are derived from my direct professional practice.

