# Data Engineer Agent

The pipeline builder for a zero-human AI startup. This agent ensures data flows cleanly from where it's created to where it's needed. Every agent in the org depends on data for decisions — this agent makes sure that data is accurate, timely, and accessible. It builds ETL/ELT pipelines, monitors data quality continuously, and creates dashboards that the CEO, CTO, CGO, and COO rely on for strategic and tactical decisions.

## Quick Start
1. **Deploy the agent** using OpenClaw with the ClawPack bundle:

   ```bash
   clawpack install @agentebox/data-engineer
   ```

2. **Configure communication channels** — the Data Engineer communicates with the Lead Engineer (upstream) and delivers data products to the CTO, CGO, COO, and CEO (dashboard consumers).

3. **Set up the Remote Project Board** — primary tracking for pipeline development, quality issues, and dashboard requests.

4. **Connect data infrastructure** — pipeline orchestration platform (Airflow/Prefect/dbt), cloud data warehouse, and code repository.

5. **Configure cadences** — daily pipeline check (morning, 10 min) and continuous sprint data work.

6. **Initialize quality monitoring** — set up quality checks and baselines for all existing data products.
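Step 6 can be sketched as follows. This is a minimal illustration, not the bundle's actual implementation: the `QualityBaseline` class, the equal weighting of dimensions, and the example dataset name are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

# The five quality dimensions this agent monitors (see the Architecture section).
DIMENSIONS = ("completeness", "accuracy", "freshness", "consistency", "schema")

@dataclass
class QualityBaseline:
    """Hypothetical per-dataset baseline: one score in [0, 1] per dimension."""
    dataset: str
    scores: dict = field(default_factory=dict)

    def overall(self) -> float:
        # Unweighted mean across all five dimensions; missing dimensions count as 0.
        return sum(self.scores.get(d, 0.0) for d in DIMENSIONS) / len(DIMENSIONS)

    def meets_target(self, target: float = 0.98) -> bool:
        return self.overall() >= target

baseline = QualityBaseline(
    dataset="events.page_views",  # illustrative dataset name
    scores={"completeness": 0.999, "accuracy": 0.995, "freshness": 1.0,
            "consistency": 0.99, "schema": 1.0},
)
print(round(baseline.overall(), 4), baseline.meets_target())
```

Establishing one such baseline per existing data product gives the daily quality check something concrete to compare against.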
## Environment Variables

| Variable | Description | Required |
|---|---|---|
| `REMOTE_PROJECT_ID` | Project ID on the Remote board | Yes |
| `LEAD_ENGINEER_AGENT_ID` | Session ID or label for the Lead Engineer agent | Yes |
| `DATA_WAREHOUSE_URL` | Connection string for the analytics warehouse | Yes |
| `PIPELINE_ORCHESTRATOR` | Pipeline orchestration platform (`airflow`/`prefect`/`dbt`) | Yes |
| `CODE_REPO_URL` | Repository URL for pipeline code | Yes |
| `QUALITY_SCORE_TARGET` | Target data quality score (default: `0.98`) | No |
| `FRESHNESS_SLA_HOURS` | Default freshness SLA in hours (default: `6`) | No |
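A minimal sketch of how a deployment might load and validate these variables at startup, failing fast when a required one is missing. `load_config` is a hypothetical helper written for this example, not part of the ClawPack bundle; the defaults match the table above.

```python
import os

# Required variables from the table above.
REQUIRED = ["REMOTE_PROJECT_ID", "LEAD_ENGINEER_AGENT_ID", "DATA_WAREHOUSE_URL",
            "PIPELINE_ORCHESTRATOR", "CODE_REPO_URL"]

def load_config(env=None) -> dict:
    """Hypothetical startup helper: validate required vars, apply defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    orchestrator = env["PIPELINE_ORCHESTRATOR"].lower()
    if orchestrator not in {"airflow", "prefect", "dbt"}:
        raise ValueError(f"Unsupported orchestrator: {orchestrator}")
    return {
        **{name.lower(): env[name] for name in REQUIRED},
        # Optional variables fall back to the documented defaults.
        "quality_score_target": float(env.get("QUALITY_SCORE_TARGET", "0.98")),
        "freshness_sla_hours": float(env.get("FRESHNESS_SLA_HOURS", "6")),
    }
```

Failing at startup rather than mid-pipeline keeps misconfiguration from surfacing as a confusing data-quality incident later.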
## File Listing

| File | Description |
|---|---|
| `SOUL.md` | Complete agent identity: behaviors, decision framework, communication protocols, boundaries, failure modes |
| `IDENTITY.md` | Quick-reference identity card (name, role, emoji) |
| `manifest.json` | Machine-readable configuration: skills, tools, cadences, autonomy levels |
| `README.md` | This file — setup guide and integration reference |
| `skills/pipeline-development/SKILL.md` | ETL/ELT pipeline design, implementation with 4 quality gates, validation, documentation |
| `skills/data-quality-monitoring/SKILL.md` | 5-dimension quality checks, alert investigation, baseline management, remediation |
| `skills/dashboard-creation/SKILL.md` | Metric definition, dashboard design, accuracy validation, ongoing maintenance |
## Architecture

```
Lead Engineer
   ↕ (data requirements, schema coordination)
Data Engineer ──── 🔄
   ├── Pipelines (extract → validate → transform → validate → load → monitor)
   ├── Quality Monitoring (completeness, accuracy, freshness, consistency, schema)
   └── Dashboards (CEO KPIs, CTO engineering, CGO funnel, COO operations)

Data flows to:
   → CEO Orchestrator (company-wide KPI dashboard)
   → CTO (engineering metrics dashboard)
   → CGO (funnel and revenue dashboards)
   → COO (financial and operational dashboards)
```
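The pipeline shape in the diagram — validation gates on both sides of the transform — can be sketched as a plain function. All stage callables here are illustrative placeholders, not the bundle's real interfaces.

```python
def run_pipeline(extract, transform, load, validate, monitor):
    """Sketch of extract → validate → transform → validate → load → monitor."""
    raw = extract()
    if not validate(raw):            # gate 1: reject bad source data early
        raise ValueError("source validation failed")
    clean = transform(raw)
    if not validate(clean):          # gate 2: catch transform bugs before loading
        raise ValueError("post-transform validation failed")
    load(clean)
    monitor(clean)                   # emit quality metrics after loading
    return clean

# Toy run with placeholder stages:
rows = run_pipeline(
    extract=lambda: [{"id": 1, "amount": 10}],
    transform=lambda rs: [{**r, "amount_usd": r["amount"]} for r in rs],
    load=lambda rs: None,
    validate=lambda rs: all("id" in r for r in rs),
    monitor=lambda rs: None,
)
print(rows)
```

Validating both before and after the transform means a failure pinpoints whether the source data or the transformation logic broke.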
## Framework Integration

### OpenClaw (Native)

```yaml
# openclaw.yaml
agent:
  name: data-engineer
  soul: ./SOUL.md
  identity: ./IDENTITY.md
  skills:
    - ./skills/pipeline-development/
    - ./skills/data-quality-monitoring/
    - ./skills/dashboard-creation/
  heartbeat:
    interval: 30m
    file: ./HEARTBEAT.md
```
### CrewAI

```python
from crewai import Agent, Task, Crew

data_eng = Agent(
    role="Data Engineer",
    goal="Build reliable data pipelines and accurate dashboards that the entire org trusts for decision-making",
    backstory=open("SOUL.md").read(),
    tools=[pipeline_tool, warehouse_tool, remote_board_tool, messaging_tool],
    verbose=True,
)

daily_check = Task(
    description="Run daily pipeline and quality check: verify overnight runs, check freshness SLAs, investigate quality alerts",
    agent=data_eng,
    expected_output="Pipeline health report with quality scores and any issues flagged",
)

crew = Crew(agents=[data_eng], tasks=[daily_check], verbose=True)
crew.kickoff()
```
## Monitoring
The Data Engineer is healthy when:
- Pipeline uptime stays above 99.5%
- Data freshness SLAs are met for all production datasets
- Data quality score stays ≥0.98 across all monitored datasets
- Query performance (p95) stays under 1 second for dashboard queries
- Every metric has a documented definition traceable to source data
- No two dashboards show conflicting numbers for the same metric
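The numeric criteria above can be expressed as simple threshold checks. A sketch with assumed metric names and illustrative values; the real agent's telemetry schema is not specified here.

```python
# Threshold per metric: (target, comparison). Values mirror the checklist above.
THRESHOLDS = {
    "pipeline_uptime": (0.995, ">="),       # uptime above 99.5%
    "quality_score": (0.98, ">="),          # quality score ≥ 0.98
    "dashboard_p95_seconds": (1.0, "<"),    # p95 query latency under 1s
}

def health_report(metrics: dict) -> dict:
    """Return {metric: passed?} for every thresholded metric."""
    ops = {">=": lambda v, t: v >= t, "<": lambda v, t: v < t}
    return {name: ops[op](metrics[name], target)
            for name, (target, op) in THRESHOLDS.items()}

report = health_report({"pipeline_uptime": 0.998,
                        "quality_score": 0.991,
                        "dashboard_p95_seconds": 0.42})
print(all(report.values()))
```

The qualitative criteria (documented metric definitions, no conflicting dashboards) still need review rather than a threshold.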
Warning signs:
- Pipeline failures increasing (source data instability)
- Data freshness SLA breaches (pipelines running slower than expected)
- Quality score declining (data degradation)
- Dashboard consumers reporting "numbers don't look right"
- Schema drift alerts (source data changing without coordination)
- Pipeline costs growing faster than data volume
## Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-16 | Initial creation |