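A comprehensive guide to monitoring Stable nodes and performing routine maintenance tasks.

Monitoring Stack Overview

Quick Monitoring Setup

Step 1: Enable Prometheus Metrics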

# Edit ~/.stabled/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"

Restart the node (`${SERVICE_NAME}` is the name of your node's systemd unit):

sudo systemctl restart ${SERVICE_NAME}
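Once the node is back up, it may be worth confirming that the metrics endpoint serves series under the configured namespace before pointing Prometheus at it. The following is a minimal sketch, assuming the listen address and namespace from the config above; `count_namespace_metrics` is a hypothetical helper name, not part of any Stable tooling.

```shell
#!/bin/bash
# Count the metric samples whose name starts with the given namespace prefix.
# Reads Prometheus text-format output on stdin; comment lines (# HELP, # TYPE) are skipped.
count_namespace_metrics() {
    local namespace="$1"
    grep -v '^#' | grep -c "^${namespace}_" || true
}

# Usage: expect a non-zero count when instrumentation is enabled.
# curl -s http://localhost:26660/metrics | count_namespace_metrics stablebft
```

A count of 0 usually means the node was not restarted after the config change or the namespace differs from the one in config.toml.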

Step 2: Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
 
# Create config
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'
 
  # Requires node_exporter listening on port 9100; this guide does not install it
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF
 
# Create systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
After=network.target
 
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
 
[Install]
WantedBy=multi-user.target
EOF
 
# Start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus

Step 3: Install Grafana

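# Add the Grafana signing key first, then the repository, so the package index can be verified
# (apt-key is deprecated on recent Debian/Ubuntu releases; use a keyring file there instead)
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
 
# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana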
 
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
 
# Access at http://your-ip:3000
# Default login: admin/admin

Key Metrics to Monitor

Node Health Metrics

Metric	Description	Alert threshold
`up`	Node availability	= 0 for 5m
`stablebft_consensus_height`	Current block height	No increase within 5m
`stablebft_consensus_validators`	Active validators	N/A
`stablebft_consensus_rounds`	Consensus rounds	> 3
`stablebft_consensus_block_interval`	Block time	> 10s
`stablebft_p2p_peers`	Connected peers	< 3
`stablebft_mempool_size`	Mempool size	> 1500
`stablebft_mempool_failed_txs`	Failed transactions	> 100/min

System Metrics

Metric	Description	Alert threshold
`node_cpu_seconds_total`	CPU usage	> 80% for 5m
`node_memory_MemAvailable_bytes`	Available memory	< 10%
`node_filesystem_avail_bytes`	Available disk space	< 10%
`node_network_receive_bytes_total`	Network receive rate	> 100MB/s
`node_disk_io_time_seconds_total`	Disk I/O utilization	> 80%
`node_load15`	System load (15m average)	> CPU cores * 2

Grafana Dashboard Configuration

Import the Stable Dashboard

{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Blocks per Minute",
        "targets": [
          {
            "expr": "rate(stablebft_consensus_height[1m]) * 60"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}

Custom Dashboard Import

Import the dashboard through the Grafana UI:

# Navigate to Dashboards > Import > Upload JSON file
# Or use Dashboard ID in Grafana's dashboard library
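The same import can also be scripted against Grafana's HTTP API, which accepts a body shaped like the JSON above at `POST /api/dashboards/db`. The following is a sketch under assumptions: `build_import_payload`, `GRAFANA_URL`, `GRAFANA_AUTH`, and the file name `stable-dashboard.json` are illustrative placeholders, and `jq` must be installed.

```shell
#!/bin/bash
# Build the request body for Grafana's POST /api/dashboards/db endpoint from a
# file shaped like the dashboard JSON above ({"dashboard": {...}}).
build_import_payload() {
    local dashboard_file="$1"
    # id is set to null so Grafana creates the dashboard; overwrite lets a re-import replace it
    jq '.dashboard.id = null | .overwrite = true' "$dashboard_file"
}

# Usage (GRAFANA_AUTH is a placeholder in user:password form):
# build_import_payload stable-dashboard.json |
#   curl -s -X POST -H "Content-Type: application/json" \
#        -u "$GRAFANA_AUTH" -d @- "${GRAFANA_URL:-http://localhost:3000}/api/dashboards/db"
```

Scripting the import keeps the dashboard definition in version control, which also helps with backing up monitoring configuration.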

AlertManager Configuration

Install AlertManager

# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager
 
# Configure
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m
 
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'
 
receivers:
  - name: 'team-notifications'
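    # Slack incoming webhooks need slack_configs; the generic webhook_configs payload is not accepted by Slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        send_resolved: true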
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'
 
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF
 
# Start AlertManager
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
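The `systemctl` commands above need a unit file, which the preceding steps do not create. The following is a minimal sketch modeled on the Prometheus unit from Step 2; the `alertmanager` service user, the data directory, and the helper name `render_alertmanager_unit` are assumptions.

```shell
#!/bin/bash
# Print a systemd unit for AlertManager, modeled on the Prometheus unit earlier in this guide.
# The install prefix is passed in; the service user/group "alertmanager" is an assumption.
render_alertmanager_unit() {
    local prefix="${1:-/opt/alertmanager}"
    cat <<EOF
[Unit]
Description=Alertmanager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=${prefix}/alertmanager --config.file=${prefix}/alertmanager.yml --storage.path=${prefix}/data

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (run before enabling the service):
# sudo useradd -rs /bin/false alertmanager
# sudo chown -R alertmanager:alertmanager /opt/alertmanager
# render_alertmanager_unit | sudo tee /etc/systemd/system/alertmanager.service > /dev/null
# sudo systemctl daemon-reload
```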

Alert Rules

# /opt/prometheus/alerts.yml
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
 
      - alert: BlockProductionStopped
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production stopped"
 
      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }}"
 
      - alert: HighMempool
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High mempool size: {{ $value }}"
 
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value | humanizePercentage }}"
 
      - alert: HighCPUUsage
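        # rate() averages over the whole window; irate() uses only the last two samples and is too jumpy for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80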
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"
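These rules only take effect once prometheus.yml loads the rules file and knows where AlertManager listens; the config written in Step 2 contains neither section. The following sketch prints the two missing sections, assuming the paths used in this guide and AlertManager's default port 9093. The helper name `render_alerting_config` is hypothetical.

```shell
#!/bin/bash
# Print the prometheus.yml sections that load a rules file and point at AlertManager.
# 9093 is AlertManager's default listen port; adjust it if you changed the port.
render_alerting_config() {
    local rules_file="${1:-/opt/prometheus/alerts.yml}"
    local alertmanager="${2:-localhost:9093}"
    cat <<EOF
rule_files:
  - '${rules_file}'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['${alertmanager}']
EOF
}

# Usage: append the sections, validate both files with promtool (shipped in the
# Prometheus tarball), then restart Prometheus.
# render_alerting_config | sudo tee -a /opt/prometheus/prometheus.yml > /dev/null
# /opt/prometheus/promtool check rules /opt/prometheus/alerts.yml
# /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
# sudo systemctl restart prometheus
```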

Log Monitoring

Systemd Logs

# View recent logs
sudo journalctl -u ${SERVICE_NAME} -n 100
 
# Follow logs
sudo journalctl -u ${SERVICE_NAME} -f
 
# Filter by time
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago"
 
# Export logs
sudo journalctl -u ${SERVICE_NAME} --since today > stable-logs-$(date +%Y%m%d).log

Log Analysis Script

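#!/bin/bash
# analyze-logs.sh
 
# Systemd unit to inspect; defaults to the name used in health-check.sh
SERVICE_NAME="${SERVICE_NAME:-stable}"
 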
# Count errors in last hour
echo "Errors in last hour:"
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago" | grep -c ERROR
 
# Show peer connections
echo "Peer connections:"
sudo journalctl -u ${SERVICE_NAME} --since "10 minutes ago" | grep "Peer connection" | tail -10
 
# Check for consensus issues
echo "Consensus rounds:"
sudo journalctl -u ${SERVICE_NAME} --since "30 minutes ago" | grep -E "enterNewRound|Timeout" | tail -20
 
# Memory usage patterns
echo "Memory warnings:"
sudo journalctl -u ${SERVICE_NAME} --since "1 day ago" | grep -i memory
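The error count above is only printed, so someone has to read it. One way to make it actionable is to turn it into an exit status that cron or another script can react to. The following is a sketch; `check_error_budget` and the limit of 50 are illustrative assumptions, and it matches the same literal `ERROR` string as the script above.

```shell
#!/bin/bash
# Read log lines on stdin and fail (exit status 1) when more than
# max_errors of them contain the string "ERROR".
check_error_budget() {
    local max_errors="$1"
    local count
    count=$(grep -c 'ERROR' || true)
    echo "errors=${count} limit=${max_errors}"
    [ "$count" -le "$max_errors" ]
}

# Usage:
# sudo journalctl -u "${SERVICE_NAME}" --since "1 hour ago" | check_error_budget 50 \
#     || echo "error budget exceeded"
```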

Loki Setup (Optional)

# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
 
# Install Promtail
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
 
# Configure Promtail
sudo tee /etc/promtail-config.yml > /dev/null <<EOF
server:
  http_listen_port: 9080
 
positions:
  filename: /tmp/positions.yaml
 
clients:
  - url: http://localhost:3100/loki/api/v1/push
 
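scrape_configs:
  - job_name: stable
    # Promtail reads the systemd journal through a "journal" block
    journal:
      # Adjust to your node's systemd unit name
      matches: "_SYSTEMD_UNIT=stabled.service"
      labels:
        job: stable
        host: localhost
EOF
 
# Start Promtail (Loki itself must already be running with its own config and listening on port 3100)
promtail -config.file=/etc/promtail-config.yml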

Health Check Endpoints

HTTP Endpoints

# Basic health check
curl -s http://localhost:26657/health
 
# Node status
curl -s http://localhost:26657/status | jq
 
# Net info
curl -s http://localhost:26657/net_info | jq
 
# Consensus state
curl -s http://localhost:26657/consensus_state | jq
 
# Unconfirmed transactions
curl -s http://localhost:26657/num_unconfirmed_txs | jq
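The full `/status` response is verbose, and most checks need only two of its fields. The following sketch reduces it to one line; `summarize_status` is a hypothetical helper, and it assumes the response carries `result.sync_info.latest_block_height` and `result.sync_info.catching_up`, as the health check script in this guide expects for `catching_up`.

```shell
#!/bin/bash
# Reduce a /status response (JSON on stdin) to "height catching_up" on one line.
summarize_status() {
    jq -r '.result.sync_info | "\(.latest_block_height) \(.catching_up)"'
}

# Usage:
# curl -s http://localhost:26657/status | summarize_status
```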

Health Check Script

#!/bin/bash
# health-check.sh
 
set -e
 
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
export SERVICE_NAME="stable"
 
echo "=== Stable Node Health Check ==="
echo
 
# Check if service is running
if systemctl is-active --quiet ${SERVICE_NAME}; then
    echo -e "${GREEN}✓${NC} Service is running"
else
    echo -e "${RED}✗${NC} Service is not running"
    exit 1
fi
 
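# Check node sync status (anything other than true/false means the RPC endpoint gave no usable answer)
SYNC_STATUS=$(curl -s --max-time 5 localhost:26657/status | jq -r '.result.sync_info.catching_up' || true)
if [ "$SYNC_STATUS" = "false" ]; then
    echo -e "${GREEN}✓${NC} Node is synced"
elif [ "$SYNC_STATUS" = "true" ]; then
    echo -e "${YELLOW}⚠${NC} Node is syncing"
else
    echo -e "${RED}✗${NC} RPC endpoint returned no sync status"
fi
 
# Check peer count (treat a missing or non-numeric value as 0)
PEERS=$(curl -s --max-time 5 localhost:26657/net_info | jq -r '.result.n_peers' || true)
case "$PEERS" in ''|*[!0-9]*) PEERS=0 ;; esac
if [ "$PEERS" -ge 3 ]; then
    echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
    echo -e "${YELLOW}⚠${NC} Low peer count: $PEERS"
fi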
 
# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
    echo -e "${YELLOW}⚠${NC} High disk usage: ${DISK_USAGE}%"
fi
 
# Check memory
MEM_AVAILABLE=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
MEM_PERCENT=$((100 - (MEM_AVAILABLE * 100 / MEM_TOTAL)))
if [ "$MEM_PERCENT" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Memory usage: ${MEM_PERCENT}%"
else
    echo -e "${YELLOW}⚠${NC} High memory usage: ${MEM_PERCENT}%"
fi
 
echo
echo "=== Health Check Complete ==="

Maintenance Tasks

Daily Maintenance

#!/bin/bash
# daily-maintenance.sh
 
# Rotate logs
sudo journalctl --rotate
sudo journalctl --vacuum-time=7d
 
# Clear cache
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
 
# Check for updates
echo "Checking for updates..."
curl -s https://api.github.com/repos/stable-chain/stable/releases/latest | jq -r '.tag_name'
 
# Backup important config files (create the backup and report directories first)
mkdir -p ~/backups ~/reports
cp ~/.stabled/config/node_key.json ~/backups/node_key_$(date +%Y%m%d).json
 
# Generate report
echo "Daily report generated: $(date)" > ~/reports/daily_$(date +%Y%m%d).log
curl -s localhost:26657/status | jq >> ~/reports/daily_$(date +%Y%m%d).log

Weekly Maintenance

#!/bin/bash
# weekly-maintenance.sh
 
# Prune old data
stabled prune
 
# Compact database
stabled compact
 
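# Update peer list (assumes peers.txt holds one node_id@host:port entry per line).
# Appending the raw file to config.toml would corrupt it, so rewrite persistent_peers instead.
wget -O peers.txt https://raw.githubusercontent.com/stable-chain/networks/main/testnet/peers.txt
PEERS=$(paste -sd, peers.txt)
sed -i "s|^persistent_peers *=.*|persistent_peers = \"${PEERS}\"|" ~/.stabled/config/config.toml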
 
# Create snapshot (optional)
./create-snapshot.sh
 
# System updates
sudo apt update
sudo apt upgrade -y
 
# Restart node (during low activity)
sudo systemctl restart ${SERVICE_NAME}
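Both maintenance scripts are meant to run on a schedule, and cron is one straightforward way to do that. The following sketch prints two crontab entries; the script directory, the run times (03:30 daily, 04:00 on Sundays), and the helper name `render_maintenance_crontab` are assumptions to adapt. Note that the weekly script upgrades packages and restarts the node, so unattended runs may not suit every operator.

```shell
#!/bin/bash
# Print crontab entries that run the daily and weekly maintenance scripts
# from the given directory and append their output to log files next to them.
render_maintenance_crontab() {
    local script_dir="$1"
    cat <<EOF
# m h dom mon dow command
30 3 * * * ${script_dir}/daily-maintenance.sh >> ${script_dir}/daily-maintenance.log 2>&1
0 4 * * 0 ${script_dir}/weekly-maintenance.sh >> ${script_dir}/weekly-maintenance.log 2>&1
EOF
}

# Usage: merge the entries into the current user's crontab.
# (crontab -l 2>/dev/null; render_maintenance_crontab "$HOME/scripts") | crontab -
```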

Database Maintenance

# Check database size
du -sh ~/.stabled/data/
 
# Analyze database
stabled debug db stats ~/.stabled/data

Performance Monitoring

Resource Usage Tracking

#!/bin/bash
# track-resources.sh
 
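mkdir -p ~/metrics
 
while true; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    # %CPU and %MEM of the stabled process, read from ps instead of parsing top output
    read -r CPU MEM < <(ps -C stabled -o %cpu= -o %mem= | head -n1)
    # %util is the last column of iostat -x; take it from the final device row of the second report
    IO=$(iostat -x 1 2 | awk 'NF {last=$NF} END {print last}')
 
    echo "$TIMESTAMP,CPU:$CPU,MEM:$MEM,IO:$IO" >> ~/metrics/resources.csv
 
    sleep 60
done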

Query Performance

# Monitor RPC response times
while true; do
    START=$(date +%s%N)
    curl -s http://localhost:26657/status > /dev/null
    END=$(date +%s%N)
    DIFF=$((($END - $START) / 1000000))
    echo "RPC response time: ${DIFF}ms"
    sleep 5
done
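The loop above also counts the time spent starting `date` and `curl`. curl can report the request duration itself through `-w '%{time_total}'`, which gives a somewhat cleaner number. The following is a sketch; the helper names `seconds_to_ms` and `rpc_latency_ms` are hypothetical, and the 5-second timeout is an assumption.

```shell
#!/bin/bash
# Convert curl's %{time_total} value (fractional seconds) to whole milliseconds.
seconds_to_ms() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Time a single request as measured by curl; returns non-zero if the request fails.
rpc_latency_ms() {
    local url="${1:-http://localhost:26657/status}"
    local t
    t=$(curl -s -o /dev/null --max-time 5 -w '%{time_total}' "$url") || return 1
    seconds_to_ms "$t"
}

# Usage:
# while true; do
#     echo "RPC response time: $(rpc_latency_ms)ms"
#     sleep 5
# done
```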

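Monitoring Best Practices

Set up redundant monitoring
- Use an external monitoring service
- Have nodes monitor one another (cross-node monitoring)
- Configure dead man's switch alerts
Prevent alert fatigue
- Tune alert thresholds against a measured baseline
- Use alert grouping and inhibition
- Define an escalation policy
Data retention
- Retain metrics for at least 30 days (Prometheus keeps 15 days by default; set `--storage.tsdb.retention.time=30d`)
- Archive important logs
- Back up monitoring configuration regularly
Security
- Protect Grafana with a strong password
- Serve all endpoints over HTTPS
- Restrict access to Prometheus
Documentation
- Document all custom metrics
- Maintain a runbook for each alert
- Keep dashboard descriptions up to date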
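A dead man's switch, as recommended in the list above, inverts the usual alerting direction: the node reports in while it is healthy, and an external service raises the alarm when the reports stop. The following sketch covers the decision part, sending a heartbeat only when the block height has advanced since the last run. The helper name `height_advanced`, the state file path, and `HEARTBEAT_URL` are all assumptions; the `latest_block_height` field is assumed to be present in the `/status` response.

```shell
#!/bin/bash
# Succeed only when the current height is a number strictly greater than the
# previous one. A missing or non-numeric previous height is treated as 0.
height_advanced() {
    local prev="$1" cur="$2"
    case "$cur" in ''|*[!0-9]*) return 1 ;; esac
    case "$prev" in ''|*[!0-9]*) prev=0 ;; esac
    [ "$cur" -gt "$prev" ]
}

# Usage (run from cron every few minutes; HEARTBEAT_URL is a placeholder for an
# external heartbeat service):
# STATE=~/.stable-heartbeat-height
# CUR=$(curl -s --max-time 5 localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
# if height_advanced "$(cat "$STATE" 2>/dev/null)" "$CUR"; then
#     echo "$CUR" > "$STATE"
#     curl -fsS --max-time 10 "$HEARTBEAT_URL" > /dev/null
# fi
```

Because the heartbeat is skipped both when the RPC call fails and when the height is stuck, the external service catches a crashed host as well as a stalled chain.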

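Next Steps

- Consult the troubleshooting guide to resolve issues
- Configure upgrades in step with your monitoring
- Set up custom alerts tailored to your needs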
