# Monitoring & maintenance guide

> Monitoring setup using Prometheus, Grafana, and AlertManager for Stable node observability and alerting.

This guide covers monitoring Stable nodes and performing routine maintenance tasks.

## Monitoring stack overview

### Recommended stack

* **Prometheus**: Metrics collection
* **Grafana**: Visualization and dashboards
* **AlertManager**: Alert routing and management
* **Node Exporter**: System metrics
* **Loki**: Log aggregation (optional)

## Quick monitoring setup

### Step 1: enable Prometheus metrics

```toml theme={"dark"}
# Edit ~/.stabled/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"
```

Restart the node to apply the change (`SERVICE_NAME` is the name of your node's systemd unit):

```bash theme={"dark"}
sudo systemctl restart ${SERVICE_NAME}
```
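After the restart, it can help to confirm that the metrics endpoint is serving samples under the configured namespace. The following is a minimal sketch, not an official tool: `count_namespace_metrics` is a hypothetical helper that reads Prometheus text-format output on stdin and counts sample lines with the given prefix.

```bash
# Sketch: count metric samples for a namespace in Prometheus text format.
# Reads exposition text on stdin; the namespace defaults to "stablebft".
count_namespace_metrics() {
    local namespace="${1:-stablebft}"
    # Drop "# HELP" / "# TYPE" comment lines, then count samples with the prefix.
    # grep -c exits non-zero on a zero count, so "|| true" keeps the helper quiet.
    grep -v '^#' | grep -c "^${namespace}_" || true
}

# Example (assumes the node is running and listening on :26660):
#   curl -s http://localhost:26660/metrics | count_namespace_metrics stablebft
```

A count of `0` suggests that the `[instrumentation]` settings were not picked up or that the namespace differs from the one in `config.toml`.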

### Step 2: install Prometheus

```bash theme={"dark"}
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create config
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF

# Create systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data

[Install]
WantedBy=multi-user.target
EOF

# Start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus
```
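The `node-exporter` scrape job above expects Node Exporter to be listening on `localhost:9100`. The sketch below shows one way to install it; the version number and the helper names `node_exporter_url` and `install_node_exporter` are assumptions for illustration, so check the Node Exporter releases page for the current version before running it.

```bash
# Sketch: install Node Exporter so the 'node-exporter' scrape job has a target.
# The default version is an assumption; override NODE_EXPORTER_VERSION as needed.
NODE_EXPORTER_VERSION="${NODE_EXPORTER_VERSION:-1.6.1}"

# Build the release tarball URL for a given version.
node_exporter_url() {
    local v="$1"
    echo "https://github.com/prometheus/node_exporter/releases/download/v${v}/node_exporter-${v}.linux-amd64.tar.gz"
}

# Download, unpack, and install the binary (defined here, not run automatically).
install_node_exporter() {
    local v="$NODE_EXPORTER_VERSION"
    wget "$(node_exporter_url "$v")"
    tar xvf "node_exporter-${v}.linux-amd64.tar.gz"
    sudo mv "node_exporter-${v}.linux-amd64/node_exporter" /usr/local/bin/
    # Node Exporter listens on :9100 by default, matching the scrape target.
}

# Run manually: install_node_exporter
```

To keep it running across reboots, a systemd unit modeled on the Prometheus unit above, with `ExecStart=/usr/local/bin/node_exporter`, is one option.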

### Step 3: install Grafana

```bash theme={"dark"}
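# Add the Grafana APT repository and its signing key
# (apt-key is deprecated; the key is stored in a dedicated keyring instead)
sudo apt-get install -y apt-transport-https software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana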

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

# Access at http://your-ip:3000
# Default login: admin/admin (change the password on first login)
```

## Key metrics to monitor

### Node health metrics

| Metric                               | Description          | Alert Threshold    |
| ------------------------------------ | -------------------- | ------------------ |
| `up`                                 | Node availability    | = 0 for 5m         |
| `stablebft_consensus_height`         | Current block height | No increase for 5m |
| `stablebft_consensus_validators`     | Active validators    | N/A                |
| `stablebft_consensus_rounds`         | Consensus rounds     | > 3                |
| `stablebft_consensus_block_interval` | Block time           | > 10s              |
| `stablebft_p2p_peers`                | Connected peers      | \< 3               |
| `stablebft_mempool_size`             | Mempool size         | > 1500             |
| `stablebft_mempool_failed_txs`       | Failed transactions  | > 100/min          |

### System metrics

| Metric                             | Description      | Alert Threshold  |
| ---------------------------------- | ---------------- | ---------------- |
| `node_cpu_seconds_total`           | CPU usage        | > 80% for 5m     |
| `node_memory_MemAvailable_bytes`   | Available memory | \< 10%           |
| `node_filesystem_avail_bytes`      | Available disk   | \< 10%           |
| `node_network_receive_bytes_total` | Network RX       | > 100MB/s        |
| `node_disk_io_time_seconds_total`  | Disk I/O         | > 80%            |
| `node_load15`                      | System load      | > CPU cores \* 2 |

## Grafana dashboard setup

### Import Stable dashboard

```json theme={"dark"}
{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Block Time",
        "targets": [
          {
            "expr": "1 / rate(stablebft_consensus_height[1m])"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}
```
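One way to load this JSON without the UI is Grafana's HTTP API, whose `POST /api/dashboards/db` endpoint accepts a payload wrapped in a `dashboard` key like the one above. The sketch below assumes the JSON is saved to a local file and that basic authentication is enabled; `GRAFANA_URL`, `GRAFANA_AUTH`, and `import_dashboard` are illustrative names, not part of Stable or Grafana.

```bash
# Sketch: upload a dashboard JSON file through Grafana's HTTP API.
# GRAFANA_URL and GRAFANA_AUTH are assumptions; adjust them for your setup.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"
GRAFANA_AUTH="${GRAFANA_AUTH:-admin:admin}"

import_dashboard() {
    local file="$1"
    # Fail early if the file is missing or unreadable.
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    curl -sS -X POST "${GRAFANA_URL}/api/dashboards/db" \
        -u "$GRAFANA_AUTH" \
        -H "Content-Type: application/json" \
        --data-binary "@${file}"
}

# Example: import_dashboard stable-dashboard.json
```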

### Custom dashboard import

To import a dashboard through the Grafana UI:

* Go to **Dashboards > Import** and upload the JSON file, or
* Enter a dashboard ID from Grafana's dashboard library.

## AlertManager configuration

### Install AlertManager

```bash theme={"dark"}
# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

# Configure
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
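    # Slack incoming webhooks need slack_configs; a generic webhook_configs
    # entry posts AlertManager's own JSON payload, which Slack rejects.
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'  # placeholder channel name
        send_resolved: true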
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF

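# Create a dedicated user and systemd service (mirrors the Prometheus unit)
sudo useradd -rs /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /opt/alertmanager
sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=AlertManager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data

[Install]
WantedBy=multi-user.target
EOF

# Start AlertManager
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager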
```

### Alert rules

```yaml theme={"dark"}
# /opt/prometheus/alerts.yml
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: BlockProductionStopped
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production stopped"

      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }}"

      - alert: HighMempool
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High mempool size: {{ $value }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value | humanizePercentage }}"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"
```
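Prometheus only evaluates these rules once `prometheus.yml` references the rules file, and it only forwards alerts once it knows where AlertManager is. The sketch below prints the configuration fragment for both, assuming AlertManager runs on the same host on its default port 9093; `alerting_fragment` is a hypothetical helper name.

```bash
# Sketch: the prometheus.yml fragment that loads the rules file and points
# Prometheus at a local AlertManager (default port 9093 is assumed).
alerting_fragment() {
    cat <<'EOF'

rule_files:
  - /opt/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
EOF
}

# Validate, append, and restart (run manually):
#   /opt/prometheus/promtool check rules /opt/prometheus/alerts.yml
#   alerting_fragment | sudo tee -a /opt/prometheus/prometheus.yml > /dev/null
#   /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
#   sudo systemctl restart prometheus
```

`promtool` ships in the Prometheus release tarball, so it should already be present in `/opt/prometheus`.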

## Log monitoring

### Systemd logs

```bash theme={"dark"}
# View recent logs
sudo journalctl -u ${SERVICE_NAME} -n 100

# Follow logs
sudo journalctl -u ${SERVICE_NAME} -f

# Filter by time
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago"

# Export logs
sudo journalctl -u ${SERVICE_NAME} --since today > stable-logs-$(date +%Y%m%d).log
```

### Log analysis scripts

```bash theme={"dark"}
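#!/bin/bash
# analyze-logs.sh

# systemd unit name of the node; override via the environment if yours differs
SERVICE_NAME="${SERVICE_NAME:-stable}"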

# Count errors in last hour
echo "Errors in last hour:"
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago" | grep -c ERROR

# Show peer connections
echo "Peer connections:"
sudo journalctl -u ${SERVICE_NAME} --since "10 minutes ago" | grep "Peer connection" | tail -10

# Check for consensus issues
echo "Consensus rounds:"
sudo journalctl -u ${SERVICE_NAME} --since "30 minutes ago" | grep -E "enterNewRound|Timeout" | tail -20

# Memory usage patterns
echo "Memory warnings:"
sudo journalctl -u ${SERVICE_NAME} --since "1 day ago" | grep -i memory
```

### Loki setup (optional)

```bash theme={"dark"}
# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki

# Install Promtail
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail

# Configure Promtail
sudo tee /etc/promtail-config.yml > /dev/null <<EOF
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

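scrape_configs:
  - job_name: stable
    # Promtail reads the systemd journal through the "journal" block
    journal:
      matches: "_SYSTEMD_UNIT=stabled.service"
      labels:
        job: stable
        host: localhost
EOF

# Start Promtail (Loki itself must already be running and listening on :3100)
promtail -config.file=/etc/promtail-config.yml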
```

## Health check endpoints

### HTTP endpoints

```bash theme={"dark"}
# Basic health check
curl -s http://localhost:26657/health

# Node status
curl -s http://localhost:26657/status | jq

# Net info
curl -s http://localhost:26657/net_info | jq

# Consensus state
curl -s http://localhost:26657/consensus_state | jq

# Unconfirmed transactions
curl -s http://localhost:26657/num_unconfirmed_txs | jq
```
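The `/status` response can also be reduced to the fields that matter for alerting. The sketch below is one possible approach, assuming `jq` and GNU `date` are available: `status_summary` prints the height and sync flag, and `block_age_seconds` reports how old a block timestamp is. Both helper names are hypothetical.

```bash
# Sketch: summarize the /status response on one line.
# Reads the JSON on stdin; requires jq, like the commands above.
status_summary() {
    jq -r '.result.sync_info | "height=\(.latest_block_height) catching_up=\(.catching_up)"'
}

# Seconds between an RFC 3339 block time and a reference epoch (default: now).
# Uses GNU date; the fractional part of the timestamp is dropped first.
block_age_seconds() {
    local ts="${1%%.*}" now="${2:-$(date +%s)}"
    ts="${ts%Z}Z"   # restore the UTC marker removed with the fraction
    echo $(( now - $(date -u -d "$ts" +%s) ))
}

# Example:
#   curl -s http://localhost:26657/status | status_summary
#   block_age_seconds "$(curl -s http://localhost:26657/status | jq -r '.result.sync_info.latest_block_time')"
```

A block age that keeps growing while `catching_up` is `false` is a sign that the node has stopped receiving new blocks.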

### Health check script

```bash theme={"dark"}
#!/bin/bash
# health-check.sh

set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# systemd unit name of the node; override via the environment if yours differs
export SERVICE_NAME="${SERVICE_NAME:-stable}"

echo "=== Stable Node Health Check ==="
echo

# Check if service is running
if systemctl is-active --quiet ${SERVICE_NAME}; then
    echo -e "${GREEN}✓${NC} Service is running"
else
    echo -e "${RED}✗${NC} Service is not running"
    exit 1
fi

# Check node sync status
SYNC_STATUS=$(curl -s localhost:26657/status | jq -r '.result.sync_info.catching_up')
if [ "$SYNC_STATUS" = "false" ]; then
    echo -e "${GREEN}✓${NC} Node is synced"
else
    echo -e "${YELLOW}⚠${NC} Node is syncing"
fi

# Check peer count
PEERS=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers')
if [ "${PEERS:-0}" -ge 3 ]; then
    echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
    echo -e "${YELLOW}⚠${NC} Low peer count: $PEERS"
fi

# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
    echo -e "${YELLOW}⚠${NC} High disk usage: ${DISK_USAGE}%"
fi

# Check memory
MEM_AVAILABLE=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
MEM_PERCENT=$((100 - (MEM_AVAILABLE * 100 / MEM_TOTAL)))
if [ "$MEM_PERCENT" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Memory usage: ${MEM_PERCENT}%"
else
    echo -e "${YELLOW}⚠${NC} High memory usage: ${MEM_PERCENT}%"
fi

echo
echo "=== Health Check Complete ==="
```

## Maintenance tasks

### Daily maintenance

```bash theme={"dark"}
#!/bin/bash
# daily-maintenance.sh

# Rotate logs
sudo journalctl --rotate
sudo journalctl --vacuum-time=7d

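# Clear the page cache (usually unnecessary: the kernel reclaims cache on
# demand, and dropping it slows disk reads until the cache is rebuilt)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Check for updates
echo "Checking for updates..."
curl -s https://api.github.com/repos/stable-chain/stable/releases/latest | jq -r '.tag_name'

# Backup important config files (the node key is secret: keep the copy private)
mkdir -p ~/backups ~/reports
cp ~/.stabled/config/node_key.json ~/backups/node_key_$(date +%Y%m%d).json
chmod 600 ~/backups/node_key_$(date +%Y%m%d).json

# Generate report
REPORT=~/reports/daily_$(date +%Y%m%d).log
echo "Daily report generated: $(date)" > "$REPORT"
curl -s localhost:26657/status | jq . >> "$REPORT"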
```

### Weekly maintenance

```bash theme={"dark"}
#!/bin/bash
# weekly-maintenance.sh

# Prune old data
stabled prune

# Compact database
stabled compact

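# Update peer list (assumes peers.txt lists one peer per line)
# Appending the raw file to config.toml would corrupt it, so set the
# persistent_peers key instead; a backup is kept as config.toml.bak
wget -O peers.txt https://raw.githubusercontent.com/stable-chain/networks/main/testnet/peers.txt
PEERS=$(paste -sd, peers.txt)
sed -i.bak "s|^persistent_peers *=.*|persistent_peers = \"${PEERS}\"|" ~/.stabled/config/config.toml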

# Create snapshot (optional)
./create-snapshot.sh

# System updates
sudo apt update
sudo apt upgrade -y

# Restart node (during low activity)
sudo systemctl restart ${SERVICE_NAME}
```
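If you want these scripts to run on a schedule, cron is one option. The sketch below only prints candidate crontab entries; the script directory, the run times, and the helper name `maintenance_cron_entries` are assumptions. Because both scripts call `sudo`, the account running them would need non-interactive sudo, and whether to automate the weekly restart at all is a judgment call for your deployment.

```bash
# Sketch: crontab entries for the daily and weekly maintenance scripts.
# The directory and the schedule (03:00 daily, 04:00 on Sundays) are assumptions.
maintenance_cron_entries() {
    local dir="${1:-$HOME/scripts}"
    printf '0 3 * * * %s/daily-maintenance.sh\n' "$dir"
    printf '0 4 * * 0 %s/weekly-maintenance.sh\n' "$dir"
}

# Install (run manually; keeps any existing entries):
#   (crontab -l 2>/dev/null; maintenance_cron_entries) | crontab -
```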

### Database maintenance

```bash theme={"dark"}
# Check database size
du -sh ~/.stabled/data/

# Analyze database
stabled debug db stats ~/.stabled/data
```

## Performance monitoring

### Resource usage tracking

```bash theme={"dark"}
#!/bin/bash
# track-resources.sh

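mkdir -p ~/metrics

while true; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    # One top snapshot; take %CPU and %MEM from the first stabled process line
    read -r CPU MEM < <(top -bn1 | awk '$NF == "stabled" {print $9, $10; exit}')
    # %util is the last column of iostat -x; this reads it for the last device listed
    IO=$(iostat -dx 1 2 | awk 'NF {last = $NF} END {print last}')

    echo "$TIMESTAMP,CPU:$CPU,MEM:$MEM,IO:$IO" >> ~/metrics/resources.csv

    sleep 60
done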
```

### Query performance

```bash theme={"dark"}
# Monitor RPC response times
while true; do
    START=$(date +%s%N)
    curl -s http://localhost:26657/status > /dev/null
    END=$(date +%s%N)
    DIFF=$((($END - $START) / 1000000))
    echo "RPC response time: ${DIFF}ms"
    sleep 5
done
```
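An alternative that avoids the two `date` calls is curl's built-in timer, which measures only the request itself. This is a sketch under the same assumption that the RPC listens on `localhost:26657`; `rpc_time_ms` and `seconds_to_ms` are hypothetical helper names.

```bash
# Convert fractional seconds (e.g. "0.012345") to whole milliseconds, rounded.
seconds_to_ms() {
    awk -v s="$1" 'BEGIN {printf "%d\n", s * 1000 + 0.5}'
}

# Sketch: time one RPC request using curl's %{time_total} write-out variable.
rpc_time_ms() {
    local url="${1:-http://localhost:26657/status}"
    local secs
    secs=$(curl -s -o /dev/null -w '%{time_total}' "$url") || return 1
    seconds_to_ms "$secs"
}

# Example: echo "RPC response time: $(rpc_time_ms)ms"
```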

## Monitoring best practices

1. **Set up redundant monitoring**
   * Use external monitoring services
   * Implement cross-node monitoring
   * Set up dead man's switch alerts

2. **Alert fatigue prevention**
   * Tune alert thresholds based on baseline
   * Use alert grouping and inhibition
   * Implement escalation policies

3. **Data retention**
   * Keep metrics for 30 days minimum
   * Archive important logs
   * Regular backup of monitoring configs

4. **Security**
   * Secure Grafana with strong passwords
   * Use HTTPS for all endpoints
   * Restrict Prometheus access

5. **Documentation**
   * Document all custom metrics
   * Maintain runbooks for alerts
   * Keep dashboard descriptions updated
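
Two of the practices above map directly onto Prometheus flags: retention defaults to 15 days unless `--storage.tsdb.retention.time` is set, and the web interface listens on all interfaces unless `--web.listen-address` narrows it. The sketch below prints a systemd drop-in for the Prometheus unit from Step 2 that applies both; it assumes Grafana runs on the same host, so a loopback-only listener is sufficient. `prometheus_dropin` is a hypothetical helper name.

```bash
# Sketch: a systemd drop-in that sets 30-day retention and a loopback-only
# listener for the Prometheus unit created in Step 2.
prometheus_dropin() {
    cat <<'EOF'
[Service]
# An empty ExecStart= clears the original command before redefining it.
ExecStart=
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
EOF
}

# Apply (run manually):
#   sudo mkdir -p /etc/systemd/system/prometheus.service.d
#   prometheus_dropin | sudo tee /etc/systemd/system/prometheus.service.d/override.conf > /dev/null
#   sudo systemctl daemon-reload && sudo systemctl restart prometheus
```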

## Next steps

* [Review Troubleshooting Guide](/en/how-to/troubleshoot-node) for issue resolution
* [Configure Upgrades](/en/how-to/upgrade-node) with monitoring
* Set up custom alerts based on your requirements
