Monitoring Stack Overview
Recommended Stack
- Prometheus: Metrics collection
- Grafana: Visualization and dashboards
- AlertManager: Alert routing and management
- Node Exporter: System metrics
- Loki: Log aggregation (optional)
Quick Monitoring Setup
Step 1: Enable Prometheus Metrics
Step 2: Install Prometheus
Step 3: Install Grafana
Key Metrics to Monitor
Node Health Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| up | Node availability | = 0 for 5m |
| stablebft_consensus_height | Current block height | No increase for 5m |
| stablebft_consensus_validators | Active validators | N/A |
| stablebft_consensus_rounds | Consensus rounds | > 3 |
| stablebft_consensus_block_interval | Block time | > 10s |
| stablebft_p2p_peers | Connected peers | < 3 |
| stablebft_mempool_size | Mempool size | > 1500 |
| stablebft_mempool_failed_txs | Failed transactions | > 100/min |
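The thresholds in the table above can be sketched as a small evaluation routine. This is a minimal illustration, not the project's alerting logic: `NodeSample`, `evaluate`, and the alert names are hypothetical, and the sketch assumes the metric values have already been scraped into plain numbers. In practice these conditions would normally live in Prometheus alert rules.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One scrape of the node health metrics (hypothetical container type)."""
    timestamp: float       # seconds since epoch
    height: int            # stablebft_consensus_height
    rounds: int            # stablebft_consensus_rounds
    block_interval: float  # stablebft_consensus_block_interval, in seconds
    peers: int             # stablebft_p2p_peers
    mempool_size: int      # stablebft_mempool_size


STALL_WINDOW_SECONDS = 5 * 60  # "No increase for 5m"


def evaluate(samples):
    """Return alert names for a time-ordered list of NodeSample objects."""
    if not samples:
        return []
    latest = samples[-1]
    alerts = []

    # Height stall: compare against the newest sample that is at least
    # five minutes older than the latest one.
    window_start = latest.timestamp - STALL_WINDOW_SECONDS
    older = [s for s in samples if s.timestamp <= window_start]
    if older and latest.height <= older[-1].height:
        alerts.append("height_stalled")

    # Instantaneous thresholds taken from the table.
    if latest.rounds > 3:
        alerts.append("too_many_rounds")
    if latest.block_interval > 10:
        alerts.append("slow_blocks")
    if latest.peers < 3:
        alerts.append("low_peers")
    if latest.mempool_size > 1500:
        alerts.append("mempool_backlog")
    return alerts
```

The stall check only fires once the sample history covers the full five-minute window, which avoids false alerts right after the monitor starts.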
System Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| node_cpu_seconds_total | CPU usage | > 80% for 5m |
| node_memory_MemAvailable_bytes | Available memory | < 10% |
| node_filesystem_avail_bytes | Available disk | < 10% |
| node_network_receive_bytes_total | Network RX | > 100 MB/s |
| node_disk_io_time_seconds_total | Disk I/O | > 80% |
| node_load15 | System load | > CPU cores * 2 |
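Several of the system metrics above are raw counters or byte counts, so the percentage thresholds need a small conversion step. The sketch below shows one way to do that arithmetic; the function names are hypothetical, and it assumes the totals come from the matching Node Exporter metrics (`node_memory_MemTotal_bytes` and `node_filesystem_size_bytes`) and that the idle CPU counter has been summed across all cores.

```python
def cpu_usage_percent(idle_before, idle_after, elapsed_seconds, cores):
    """Percent of CPU time spent busy between two readings.

    idle_before / idle_after are readings of
    node_cpu_seconds_total{mode="idle"} summed over all cores.
    """
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    idle_fraction = (idle_after - idle_before) / (elapsed_seconds * cores)
    # Clamp to [0, 1] to absorb scrape jitter and counter resets.
    idle_fraction = min(max(idle_fraction, 0.0), 1.0)
    return 100.0 * (1.0 - idle_fraction)


def available_percent(available_bytes, total_bytes):
    """Percent of memory or disk still available."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * available_bytes / total_bytes


def load_is_high(load15, cores):
    """True when node_load15 exceeds twice the number of CPU cores."""
    return load15 > cores * 2
```

For example, 60 idle CPU-seconds over a 60-second interval on a 4-core machine means the cores were idle a quarter of the time, so CPU usage is 75%.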
Grafana Dashboard Setup
Import Stable Dashboard
Custom Dashboard Import
Import dashboards via Grafana UI:
AlertManager Configuration
Install AlertManager
Alert Rules
Log Monitoring
Systemd Logs
Log Analysis Scripts
Loki Setup (Optional)
Health Check Endpoints
HTTP Endpoints
Health Check Script
Maintenance Tasks
Daily Maintenance
Weekly Maintenance
Database Maintenance
Performance Monitoring
Resource Usage Tracking
Query Performance
Monitoring Best Practices
- Set Up Redundant Monitoring
  - Use external monitoring services
  - Implement cross-node monitoring
  - Set up dead man’s switch alerts
- Alert Fatigue Prevention
  - Tune alert thresholds based on baseline
  - Use alert grouping and inhibition
  - Implement escalation policies
- Data Retention
  - Keep metrics for 30 days minimum
  - Archive important logs
  - Regular backup of monitoring configs
- Security
  - Secure Grafana with strong passwords
  - Use HTTPS for all endpoints
  - Restrict Prometheus access
- Documentation
  - Document all custom metrics
  - Maintain runbooks for alerts
  - Keep dashboard descriptions updated
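The dead man’s switch idea from the list above can be illustrated with a short sketch. The monitoring pipeline sends a heartbeat at regular intervals, and an independent watcher raises an alarm when the heartbeats stop, which catches the case where the monitoring stack itself has failed. The `DeadMansSwitch` class and its parameters are hypothetical names for illustration only; a real deployment would typically pair an always-firing Prometheus alert with an external heartbeat service.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has been received within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the switch can be tested
        self._last_beat = clock()

    def beat(self):
        """Record a heartbeat from the monitoring pipeline."""
        self._last_beat = self._clock()

    def is_tripped(self):
        """True when the last heartbeat is older than the timeout."""
        return self._clock() - self._last_beat > self._timeout
```

The watcher should run on a separate host from the one being monitored; otherwise a single outage silences both the node and its alarm.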
Next Steps
- Review Troubleshooting Guide for issue resolution
- Configure Upgrades with monitoring
- Set up custom alerts based on your requirements

