#13957: Enhanced OpenClaw Observability with OTEL Integration
docs
scripts
stale
Cluster:
Session Management Enhancements
# Enhanced OpenClaw Observability with OpenTelemetry Integration
## ๐ฏ Overview
This PR implements a comprehensive observability solution for OpenClaw, adding production-ready monitoring with OpenTelemetry, Prometheus, and Grafana integration. It provides enterprise-grade visibility into both application performance and system health.
## โจ Features
### ๐ง **OTEL Integration**
- Enhanced diagnostics-otel extension functionality
- Proper dependency management for OpenTelemetry packages
- Enabled traces, metrics, and logs export to OTEL collectors
- Full configuration support for endpoints, sampling, and protocols
### ๐ **Advanced Monitoring Stack**
- **System Metrics**: Node Exporter integration for CPU, memory, disk, network monitoring
- **Application Metrics**: OpenClaw diagnostic events, token usage, costs, performance
- **Business Intelligence**: Trading analytics, cost tracking, efficiency metrics
- **Infrastructure Health**: Queue depths, error rates, session management
### ๐ **Grafana Dashboards**
- **Infrastructure Dashboard**: Operational metrics and system health
- **Business Dashboard**: Cost analysis and trading performance
- **System Dashboard**: EC2 performance and resource utilization
### ๐ **Production Ready**
- Automated installation scripts
- Complete configuration templates
- Comprehensive documentation
- Security-hardened configurations
## ๐๏ธ Files Added
### Core Infrastructure
- `scripts/install-node-exporter.sh` - System metrics setup
- `docs/observability/otel-collector-config.example.yaml` - Complete OTEL configuration
- `docs/observability/observability-stack-config.yaml` - Stack configuration reference
### Grafana Dashboards
- `dashboards/infrastructure-dashboard.json` - OpenClaw operational metrics
- `dashboards/business-dashboard.json` - Trading and cost analytics
- `dashboards/system-monitoring-dashboard.json` - System performance metrics
### Documentation
- `docs/observability/OBSERVABILITY.md` - Complete setup and configuration guide
- `docs/observability/README.md` - Quick start guide
## ๐ง Configuration Example
```yaml
diagnostics:
enabled: true
flags: ["*"]
otel:
enabled: true
endpoint: "http://localhost:4318"
serviceName: "openclaw-gateway"
traces: true
metrics: true
logs: false
flushIntervalMs: 5000
```
## ๐ Metrics Provided
### Application Metrics
- `openclaw_tokens_total` - Token usage by type (input/output)
- `openclaw_cost_usd_total` - Model costs in USD
- `openclaw_message_processed_total` - Message processing outcomes
- `openclaw_queue_depth` - Queue depth monitoring
- `openclaw_run_duration_ms` - Agent run durations
### System Metrics
- `node_cpu_seconds_total` - CPU usage by core and mode
- `node_memory_MemAvailable_bytes` - Available memory
- `node_filesystem_avail_bytes` - Disk space availability
- `node_network_receive_bytes_total` - Network I/O
## ๐๏ธ Architecture
```
OpenClaw Gateway โ OTEL Collector:4318 โ
System Metrics โ Node Exporter:9100 โ OTEL Collector โ Prometheus:8889 โ Grafana
```
## ๐ Quick Start
### 1. Install System Monitoring
```bash
chmod +x scripts/install-node-exporter.sh
./scripts/install-node-exporter.sh
```
### 2. Configure OpenClaw
```bash
openclaw gateway config.patch '{"diagnostics":{"enabled":true,"otel":{"enabled":true,"endpoint":"http://localhost:4318"}}}'
```
### 3. Import Dashboards
- Upload dashboard JSON files to Grafana
- Configure Prometheus data source: `http://localhost:8889`
- Set refresh intervals and alerts
## ๐ Security
- All sensitive credentials removed from configurations
- Example templates use placeholder values
- Security best practices documented
- Network binding considerations addressed
## ๐งช Testing
Tested with:
- OpenClaw 2026.2.6+
- OTEL Collector 0.91.0+
- Node Exporter 1.8.2+
- Prometheus compatible endpoints
- Grafana Cloud integration
## ๐ฏ Impact
### For Operators
- **Complete Visibility**: Application + system metrics in one place
- **Proactive Monitoring**: Alerts before issues impact users
- **Cost Tracking**: Detailed model usage and spending analytics
- **Performance Optimization**: Identify bottlenecks and inefficiencies
### For Developers
- **Debug Production Issues**: Traces and metrics for troubleshooting
- **Performance Profiling**: Detailed timing and resource usage
- **Capacity Planning**: Historical data for scaling decisions
- **Quality Metrics**: Success rates and error patterns
## ๐ Migration Notes
- Backward compatible with existing OpenClaw configurations
- Optional feature - can be enabled incrementally
- Minimal performance impact when configured properly
- Existing diagnostic events enhanced, not replaced
## ๐ฎ Future Enhancements
- Custom business metric collection
- Advanced alerting rules
- Distributed tracing across tools
- Automated remediation hooks
- Cost optimization recommendations
---
This observability enhancement transforms OpenClaw from a functional system into a fully monitored, enterprise-ready AI assistant platform with comprehensive visibility into every aspect of its operation.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds comprehensive observability infrastructure with OpenTelemetry, Prometheus, and Grafana integration for OpenClaw. However, there are **critical metric naming mismatches** that will prevent the dashboards from working.
## Key Issues
- **Metric Name Conflicts**: The OTEL collector config sets `namespace: openclaw` but the extension already exports metrics with `openclaw.` prefix, causing double-prefixing (`openclaw_openclaw_*` instead of expected `openclaw_*_total`)
- **Missing Metrics**: Dashboard references `openclaw_sessions_active_total` which doesn't exist in the OTEL extension code
- **Installation Script Problems**:
- Hardcoded `User=ubuntu` won't work on non-Ubuntu systems
- Appends to OTEL config without idempotency checks (breaks on re-runs)
- Assumes `otel-collector.service` exists but never creates it
- **Personal Information**: Grafana Cloud URL contains author's personal account (`trevorbgordon.grafana.net`)
## Metric Mapping Issue
The extension exports metrics like:
- `openclaw.tokens` (line 135 in extensions/diagnostics-otel/src/service.ts)
- `openclaw.cost.usd` (line 139)
- `openclaw.message.processed` (line 167)
But dashboards expect:
- `openclaw_tokens_total`
- `openclaw_cost_usd_total`
- `openclaw_message_processed_total`
With `namespace: openclaw` in the Prometheus exporter, actual output will be `openclaw_openclaw_tokens`, `openclaw_openclaw_cost_usd`, etc.
## What Works
- Documentation structure is comprehensive
- OTEL collector pipeline configuration is sound
- System metrics collection approach is valid
- Dashboard layouts and visualizations are well-designed
## Recommendations
1. Remove `namespace: openclaw` from otel-collector-config.example.yaml
2. Update all dashboard metric names to match actual OTEL extension output
3. Fix installation script to be idempotent and work on all Linux distributions
4. Add OTEL collector systemd service creation to installation docs
5. Replace personal Grafana URL with placeholder
<h3>Confidence Score: 1/5</h3>
- This PR has critical issues that will prevent the observability stack from functioning
- The metric naming mismatches between the OTEL extension and the dashboards mean the dashboards won't display any data. The installation script has multiple bugs that will cause failures on most systems. These are not minor issues but fundamental problems that prevent the feature from working as designed.
- Critical: dashboards/infrastructure-dashboard.json, dashboards/business-dashboard.json, docs/observability/otel-collector-config.example.yaml, scripts/install-node-exporter.sh - all contain logic errors that prevent functionality
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#13872: feat: Cost Optimization Suite - Session Management & Resource Effic...
by trevorgordon981 ยท 2026-02-11
73.6%
#21223: elliott commit
by elliottamador ยท 2026-02-19
72.9%
#16865: fix(diagnostics-otel): share listeners/transports across module bun...
by leonnardo ยท 2026-02-15
72.8%
#19353: fix(diagnostics-otel): fix cross-chunk module isolation breaking evenโฆ
by nez ยท 2026-02-17
72.0%
#4255: fix(diagnostics-otel): complete OpenTelemetry v2.x compatibility
by arbgjr ยท 2026-01-29
71.5%
#21290: feat(diagnostics-otel): OpenTelemetry diagnostics with GenAI semant...
by Baukebrenninkmeijer ยท 2026-02-19
70.2%
#11530: diagnostics-otel: fix OpenTelemetry v2 resource/logs API compatibility
by erain ยท 2026-02-07
70.1%
#10367: CLI/Ops: resilient browser fill + failover hardening + operations t...
by cluster2600 ยท 2026-02-06
69.6%
#17273: feat: add security-guard extension โ agentic safety guardrails
by miloudbelarebia ยท 2026-02-15
69.5%
#14313: feat: Atomic OpenClaw Configuration Management
by aronchick ยท 2026-02-11
69.1%