
Data Engineer Agent

The pipeline builder for a zero-human AI startup. This agent ensures data flows cleanly from where it's created to where it's needed. Every agent in the org depends on data for decisions — this agent makes sure that data is accurate, timely, and accessible. It builds ETL/ELT pipelines, monitors data quality continuously, and creates dashboards that the CEO, CTO, CGO, and COO rely on for strategic and tactical decisions.

Quick Start

  1. Deploy the agent using OpenClaw with the ClawPack bundle:

    clawpack install @agentebox/data-engineer
    
  2. Configure communication channels — the Data Engineer coordinates with the Lead Engineer (upstream) and delivers data products to the CTO, CGO, COO, and CEO (dashboard consumers).

  3. Set up the Remote Project Board — primary tracking for pipeline development, quality issues, and dashboard requests.

  4. Connect data infrastructure — pipeline orchestration platform (Airflow/Prefect/dbt), cloud data warehouse, code repository.

  5. Configure cadences — daily pipeline check (morning, 10 min), continuous sprint data work.

  6. Initialize quality monitoring — set up quality checks and baselines for all existing data products.

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| REMOTE_PROJECT_ID | Project ID on the Remote board | Yes |
| LEAD_ENGINEER_AGENT_ID | Session ID or label for the Lead Engineer agent | Yes |
| DATA_WAREHOUSE_URL | Connection string for the analytics warehouse | Yes |
| PIPELINE_ORCHESTRATOR | Pipeline orchestration platform (airflow/prefect/dbt) | Yes |
| CODE_REPO_URL | Repository URL for pipeline code | Yes |
| QUALITY_SCORE_TARGET | Target data quality score (default: 0.98) | No |
| FRESHNESS_SLA_HOURS | Default freshness SLA in hours (default: 6) | No |
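
A minimal environment file might look like the following. Every value here is a placeholder for illustration — substitute your own project IDs, connection strings, and repository URL:

```shell
# Example environment — all values are placeholders.
export REMOTE_PROJECT_ID="proj-12345"
export LEAD_ENGINEER_AGENT_ID="lead-engineer-session"
export DATA_WAREHOUSE_URL="postgresql://analytics:PASSWORD@warehouse.internal:5432/analytics"
export PIPELINE_ORCHESTRATOR="airflow"          # one of: airflow, prefect, dbt
export CODE_REPO_URL="git@github.com:example/pipelines.git"
export QUALITY_SCORE_TARGET="0.98"              # optional
export FRESHNESS_SLA_HOURS="6"                  # optional
```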

File Listing

| File | Description |
| --- | --- |
| SOUL.md | Complete agent identity: behaviors, decision framework, communication protocols, boundaries, failure modes |
| IDENTITY.md | Quick-reference identity card (name, role, emoji) |
| manifest.json | Machine-readable configuration: skills, tools, cadences, autonomy levels |
| README.md | This file — setup guide and integration reference |
| skills/pipeline-development/SKILL.md | ETL/ELT pipeline design, implementation with 4 quality gates, validation, documentation |
| skills/data-quality-monitoring/SKILL.md | 5-dimension quality checks, alert investigation, baseline management, remediation |
| skills/dashboard-creation/SKILL.md | Metric definition, dashboard design, accuracy validation, ongoing maintenance |

Architecture

Lead Engineer
     ↕ (data requirements, schema coordination)
Data Engineer ──── 🔄
     ├── Pipelines (extract → validate → transform → validate → load → monitor)
     ├── Quality Monitoring (completeness, accuracy, freshness, consistency, schema)
     └── Dashboards (CEO KPIs, CTO engineering, CGO funnel, COO operations)

Data flows to:
     → CEO Orchestrator (company-wide KPI dashboard)
     → CTO (engineering metrics dashboard)
     → CGO (funnel and revenue dashboards)
     → COO (financial and operational dashboards)
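
The extract → validate → transform → validate → load flow in the diagram above can be sketched as a minimal Python pipeline. All names and the in-memory "warehouse" below are illustrative assumptions, not part of the package; a real pipeline would run under the configured orchestrator (Airflow/Prefect/dbt) with real sources and sinks:

```python
# Sketch of the pipeline stages with validation gates between them.
# Hypothetical functions and sample data — for illustration only.

def extract():
    # Pull raw rows from a source system (hardcoded sample here).
    return [{"user_id": 1, "revenue": "120.50"},
            {"user_id": 2, "revenue": "80.00"}]

def validate(rows, stage):
    # Quality gate: fail fast on empty batches or missing keys.
    if not rows:
        raise ValueError(f"{stage}: empty batch")
    for row in rows:
        if "user_id" not in row:
            raise ValueError(f"{stage}: row missing user_id")
    return rows

def transform(rows):
    # Normalize types so downstream dashboards aggregate cleanly.
    return [{"user_id": r["user_id"], "revenue": float(r["revenue"])}
            for r in rows]

def load(rows, warehouse):
    warehouse.extend(rows)
    return len(rows)

warehouse = []
rows = validate(extract(), "post-extract")
rows = validate(transform(rows), "post-transform")
loaded = load(rows, warehouse)
print(loaded)  # 2
```

The monitor stage is omitted here; the quality-monitoring skill covers it separately.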

Framework Integration

OpenClaw (Native)

# openclaw.yaml
agent:
  name: data-engineer
  soul: ./SOUL.md
  identity: ./IDENTITY.md
  skills:
    - ./skills/pipeline-development/
    - ./skills/data-quality-monitoring/
    - ./skills/dashboard-creation/
  heartbeat:
    interval: 30m
    file: ./HEARTBEAT.md

CrewAI

from crewai import Agent, Task, Crew

data_eng = Agent(
    role="Data Engineer",
    goal="Build reliable data pipelines and accurate dashboards that the entire org trusts for decision-making",
    backstory=open("SOUL.md").read(),
    # pipeline_tool, warehouse_tool, remote_board_tool, and messaging_tool
    # are assumed to be pre-configured CrewAI tool instances.
    tools=[pipeline_tool, warehouse_tool, remote_board_tool, messaging_tool],
    verbose=True
)

daily_check = Task(
    description="Run daily pipeline and quality check: verify overnight runs, check freshness SLAs, investigate quality alerts",
    agent=data_eng,
    expected_output="Pipeline health report with quality scores and any issues flagged"
)

crew = Crew(agents=[data_eng], tasks=[daily_check], verbose=True)
crew.kickoff()

Monitoring

The Data Engineer is healthy when:

  • Pipeline uptime stays above 99.5%
  • Data freshness SLAs are met for all production datasets
  • Data quality score stays ≥0.98 across all monitored datasets
  • Query performance (p95) stays under 1 second for dashboard queries
  • Every metric has a documented definition traceable to source data
  • No two dashboards show conflicting numbers for the same metric
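
The ≥0.98 target can be made concrete as a per-dataset score averaged over the five monitored dimensions (completeness, accuracy, freshness, consistency, schema). The equal weighting, the domain rule for accuracy, and the placeholder consistency check below are assumptions for illustration; the real checks live in skills/data-quality-monitoring/:

```python
# Illustrative quality score: average of five dimension checks, each in [0, 1].
from datetime import datetime, timedelta, timezone

def quality_score(rows, required_fields, last_loaded_at, sla_hours=6):
    now = datetime.now(timezone.utc)
    completeness = sum(
        all(r.get(f) is not None for f in required_fields) for r in rows
    ) / len(rows)
    accuracy = sum(r["revenue"] >= 0 for r in rows) / len(rows)  # example domain rule
    freshness = 1.0 if now - last_loaded_at <= timedelta(hours=sla_hours) else 0.0
    consistency = 1.0  # placeholder: cross-dataset totals agree
    schema = 1.0 if all(set(r) == set(required_fields) for r in rows) else 0.0
    return (completeness + accuracy + freshness + consistency + schema) / 5

rows = [{"user_id": 1, "revenue": 120.5}, {"user_id": 2, "revenue": 80.0}]
score = quality_score(rows, ["user_id", "revenue"],
                      last_loaded_at=datetime.now(timezone.utc))
print(score >= 0.98)  # True when all dimensions pass
```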

Warning signs:

  • Pipeline failures increasing (source data instability)
  • Data freshness SLA breaches (pipelines running slower than expected)
  • Quality score declining (data degradation)
  • Dashboard consumers reporting "numbers don't look right"
  • Schema drift alerts (source data changing without coordination)
  • Pipeline costs growing faster than data volume
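
Schema drift, the alert in the list above, can be caught with a simple column-set comparison against a stored baseline. Baseline storage and alert delivery are left out; the function name and data are hypothetical:

```python
# Illustrative schema-drift check: compare incoming columns to a baseline.
def detect_schema_drift(baseline_columns, incoming_rows):
    incoming = set().union(*(row.keys() for row in incoming_rows))
    added = incoming - set(baseline_columns)
    removed = set(baseline_columns) - incoming
    return {"added": sorted(added), "removed": sorted(removed)}

drift = detect_schema_drift(
    ["user_id", "revenue"],
    [{"user_id": 1, "revenue": 10.0, "currency": "USD"}],
)
print(drift)  # {'added': ['currency'], 'removed': []}
```

A non-empty "added" or "removed" set would trigger coordination with the Lead Engineer before the pipeline is adapted.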

Version History

| Version | Date | Changes |
| --- | --- | --- |
| 1.0.0 | 2026-03-16 | Initial creation |
