# Data Engineer Agent

The pipeline builder for a zero-human AI startup. This agent ensures data flows cleanly from where it's created to where it's needed. Every agent in the org depends on data for decisions — this agent makes sure that data is accurate, timely, and accessible. It builds ETL/ELT pipelines, monitors data quality continuously, and creates dashboards that the CEO, CTO, CGO, and COO rely on for strategic and tactical decisions.

## Quick Start
1. **Deploy the agent** using OpenClaw with the ClawPack bundle:

   ```bash
   clawpack install @agentebox/data-engineer
   ```

2. **Configure communication channels** — the Data Engineer communicates with the Lead Engineer (upstream) and delivers data products to the CTO, CGO, COO, and CEO (dashboard consumers).

3. **Set up the Remote Project Board** — primary tracking for pipeline development, quality issues, and dashboard requests.

4. **Connect data infrastructure** — pipeline orchestration platform (Airflow/Prefect/dbt), cloud data warehouse, and code repository.

5. **Configure cadences** — daily pipeline check (morning, 10 min) and continuous sprint data work.

6. **Initialize quality monitoring** — set up quality checks and baselines for all existing data products.
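Step 6 can be sketched as follows. This is a minimal illustration, not the bundle's actual implementation: the `QualityBaseline` class, the equal weighting of dimensions, and the example dataset name are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

# The five quality dimensions this agent monitors (see the Architecture section).
DIMENSIONS = ("completeness", "accuracy", "freshness", "consistency", "schema")

@dataclass
class QualityBaseline:
    """Hypothetical per-dataset baseline: one score in [0, 1] per dimension."""
    dataset: str
    scores: dict = field(default_factory=dict)

    def overall(self) -> float:
        # Unweighted mean across all five dimensions; missing dimensions count as 0.
        return sum(self.scores.get(d, 0.0) for d in DIMENSIONS) / len(DIMENSIONS)

    def meets_target(self, target: float = 0.98) -> bool:
        return self.overall() >= target

baseline = QualityBaseline(
    dataset="events.page_views",  # illustrative dataset name
    scores={"completeness": 0.999, "accuracy": 0.995, "freshness": 1.0,
            "consistency": 0.99, "schema": 1.0},
)
print(round(baseline.overall(), 4), baseline.meets_target())
```

Establishing one such baseline per existing data product gives the daily quality check something concrete to compare against.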
## Environment Variables

| Variable | Description | Required |
|---|---|---|
| `REMOTE_PROJECT_ID` | Project ID on the Remote board | Yes |
| `LEAD_ENGINEER_AGENT_ID` | Session ID or label for the Lead Engineer agent | Yes |
| `DATA_WAREHOUSE_URL` | Connection string for the analytics warehouse | Yes |
| `PIPELINE_ORCHESTRATOR` | Pipeline orchestration platform (`airflow`/`prefect`/`dbt`) | Yes |
| `CODE_REPO_URL` | Repository URL for pipeline code | Yes |
| `QUALITY_SCORE_TARGET` | Target data quality score (default: `0.98`) | No |
| `FRESHNESS_SLA_HOURS` | Default freshness SLA in hours (default: `6`) | No |
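A minimal sketch of how a deployment might load and validate these variables at startup, failing fast when a required one is missing. `load_config` is a hypothetical helper written for this example, not part of the ClawPack bundle; the defaults match the table above.

```python
import os

# Required variables from the table above.
REQUIRED = ["REMOTE_PROJECT_ID", "LEAD_ENGINEER_AGENT_ID", "DATA_WAREHOUSE_URL",
            "PIPELINE_ORCHESTRATOR", "CODE_REPO_URL"]

def load_config(env=None) -> dict:
    """Hypothetical startup helper: validate required vars, apply defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    orchestrator = env["PIPELINE_ORCHESTRATOR"].lower()
    if orchestrator not in {"airflow", "prefect", "dbt"}:
        raise ValueError(f"Unsupported orchestrator: {orchestrator}")
    return {
        **{name.lower(): env[name] for name in REQUIRED},
        # Optional variables fall back to the documented defaults.
        "quality_score_target": float(env.get("QUALITY_SCORE_TARGET", "0.98")),
        "freshness_sla_hours": float(env.get("FRESHNESS_SLA_HOURS", "6")),
    }
```

Failing at startup rather than mid-pipeline keeps misconfiguration from surfacing as a confusing data-quality incident later.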
## File Listing

| File | Description |
|---|---|
| `SOUL.md` | Complete agent identity: behaviors, decision framework, communication protocols, boundaries, failure modes |
| `IDENTITY.md` | Quick-reference identity card (name, role, emoji) |
| `manifest.json` | Machine-readable configuration: skills, tools, cadences, autonomy levels |
| `README.md` | This file — setup guide and integration reference |
| `skills/pipeline-development/SKILL.md` | ETL/ELT pipeline design, implementation with 4 quality gates, validation, documentation |
| `skills/data-quality-monitoring/SKILL.md` | 5-dimension quality checks, alert investigation, baseline management, remediation |
| `skills/dashboard-creation/SKILL.md` | Metric definition, dashboard design, accuracy validation, ongoing maintenance |
## Architecture

```
Lead Engineer
   ↕ (data requirements, schema coordination)
Data Engineer ──── 🔄
   ├── Pipelines (extract → validate → transform → validate → load → monitor)
   ├── Quality Monitoring (completeness, accuracy, freshness, consistency, schema)
   └── Dashboards (CEO KPIs, CTO engineering, CGO funnel, COO operations)

Data flows to:
   → CEO Orchestrator (company-wide KPI dashboard)
   → CTO (engineering metrics dashboard)
   → CGO (funnel and revenue dashboards)
   → COO (financial and operational dashboards)
```
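The pipeline shape in the diagram — validation gates on both sides of the transform — can be sketched as a plain function. All stage callables here are illustrative placeholders, not the bundle's real interfaces.

```python
def run_pipeline(extract, transform, load, validate, monitor):
    """Sketch of extract → validate → transform → validate → load → monitor."""
    raw = extract()
    if not validate(raw):            # gate 1: reject bad source data early
        raise ValueError("source validation failed")
    clean = transform(raw)
    if not validate(clean):          # gate 2: catch transform bugs before loading
        raise ValueError("post-transform validation failed")
    load(clean)
    monitor(clean)                   # emit quality metrics after loading
    return clean

# Toy run with placeholder stages:
rows = run_pipeline(
    extract=lambda: [{"id": 1, "amount": 10}],
    transform=lambda rs: [{**r, "amount_usd": r["amount"]} for r in rs],
    load=lambda rs: None,
    validate=lambda rs: all("id" in r for r in rs),
    monitor=lambda rs: None,
)
print(rows)
```

Validating both before and after the transform means a failure pinpoints whether the source data or the transformation logic broke.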
## Framework Integration

### OpenClaw (Native)

```yaml
# openclaw.yaml
agent:
  name: data-engineer
  soul: ./SOUL.md
  identity: ./IDENTITY.md
  skills:
    - ./skills/pipeline-development/
    - ./skills/data-quality-monitoring/
    - ./skills/dashboard-creation/
  heartbeat:
    interval: 30m
    file: ./HEARTBEAT.md
```
### CrewAI

```python
from crewai import Agent, Task, Crew

data_eng = Agent(
    role="Data Engineer",
    goal="Build reliable data pipelines and accurate dashboards that the entire org trusts for decision-making",
    backstory=open("SOUL.md").read(),
    tools=[pipeline_tool, warehouse_tool, remote_board_tool, messaging_tool],
    verbose=True,
)

daily_check = Task(
    description="Run daily pipeline and quality check: verify overnight runs, check freshness SLAs, investigate quality alerts",
    agent=data_eng,
    expected_output="Pipeline health report with quality scores and any issues flagged",
)

crew = Crew(agents=[data_eng], tasks=[daily_check], verbose=True)
crew.kickoff()
```
## Monitoring
The Data Engineer is healthy when:
- Pipeline uptime stays above 99.5%
- Data freshness SLAs are met for all production datasets
- Data quality score stays ≥0.98 across all monitored datasets
- Query performance (p95) stays under 1 second for dashboard queries
- Every metric has a documented definition traceable to source data
- No two dashboards show conflicting numbers for the same metric
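The numeric criteria above can be expressed as simple threshold checks. A sketch with assumed metric names and illustrative values; the real agent's telemetry schema is not specified here.

```python
# Threshold per metric: (target, comparison). Values mirror the checklist above.
THRESHOLDS = {
    "pipeline_uptime": (0.995, ">="),       # uptime above 99.5%
    "quality_score": (0.98, ">="),          # quality score ≥ 0.98
    "dashboard_p95_seconds": (1.0, "<"),    # p95 query latency under 1s
}

def health_report(metrics: dict) -> dict:
    """Return {metric: passed?} for every thresholded metric."""
    ops = {">=": lambda v, t: v >= t, "<": lambda v, t: v < t}
    return {name: ops[op](metrics[name], target)
            for name, (target, op) in THRESHOLDS.items()}

report = health_report({"pipeline_uptime": 0.998,
                        "quality_score": 0.991,
                        "dashboard_p95_seconds": 0.42})
print(all(report.values()))
```

The qualitative criteria (documented metric definitions, no conflicting dashboards) still need review rather than a threshold.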
Warning signs:
- Pipeline failures increasing (source data instability)
- Data freshness SLA breaches (pipelines running slower than expected)
- Quality score declining (data degradation)
- Dashboard consumers reporting "numbers don't look right"
- Schema drift alerts (source data changing without coordination)
- Pipeline costs growing faster than data volume
## Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-16 | Initial creation |