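A guide to monitoring Stable nodes and performing routine maintenance tasks.

Monitoring stack overview

Recommended stack

  • Prometheus: metrics collection
  • Grafana: visualization and dashboards
  • AlertManager: alert routing and management
  • Node Exporter: system metrics
  • Loki: log aggregation (optional)

Quick monitoring setup

Step 1: Enable Prometheus metrics

# Edit ~/.stabled/config/config.toml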
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"
Restart the node (SERVICE_NAME is the name of your systemd unit, for example "stable"):
sudo systemctl restart "${SERVICE_NAME}"
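
After the restart, it can be worth confirming that the node serves metrics under the configured namespace before pointing Prometheus at it. The sketch below is one way to do that. The function name count_namespace_metrics is hypothetical, and the usage line assumes the endpoint is reachable at localhost:26660/metrics, matching the prometheus_listen_addr setting above.

```shell
# Count metric samples whose name starts with a given namespace prefix.
# Reads Prometheus text-format metrics on stdin and prints the count.
count_namespace_metrics() {
    local prefix="$1"
    # Drop "# HELP" / "# TYPE" comment lines, then count lines whose
    # metric name begins with "<prefix>_". grep -c exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under set -e.
    grep -v '^#' | grep -c "^${prefix}_" || true
}

# Usage against a live node (assumed endpoint):
#   curl -s localhost:26660/metrics | count_namespace_metrics stablebft
```

A count of 0 suggests that instrumentation is still disabled or that the namespace differs from the one in config.toml.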

Step 2: Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create the configuration
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF

# Create the systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data

[Install]
WantedBy=multi-user.target
EOF

# Create the service user, then start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Step 3: Install Grafana

# Add the Grafana repository (import the signing key before adding the repository)
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"

# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

# Open http://your-ip:3000
# Default login: admin/admin (Grafana prompts for a new password on first login)

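Key monitoring metrics

Node health metrics

| Metric | Description | Alert threshold |
|---|---|---|
| up | Node availability | = 0 for 5 minutes |
| stablebft_consensus_height | Current block height | No increase in 5 minutes |
| stablebft_consensus_validators | Active validators | N/A |
| stablebft_consensus_rounds | Consensus rounds | > 3 |
| stablebft_consensus_block_interval | Block time | > 10 seconds |
| stablebft_p2p_peers | Connected peers | < 3 |
| stablebft_mempool_size | Mempool size | > 1500 |
| stablebft_mempool_failed_txs | Failed transactions | > 100 per minute |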
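
The "no increase in 5 minutes" threshold for block height can also be checked by hand, for example while Prometheus is still being set up. The following is a minimal sketch: the function name height_stalled is hypothetical, and the usage lines assume the node RPC listens on localhost:26657 and that jq is installed.

```shell
# Return success (exit 0) when the chain height did NOT advance
# between two samples, and failure (exit 1) when it did.
height_stalled() {
    local previous="$1" current="$2"
    [ "$current" -le "$previous" ]
}

# Usage sketch (assumed RPC address, 5 minutes between samples):
#   h1=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
#   sleep 300
#   h2=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
#   height_stalled "$h1" "$h2" && echo "no new blocks in 5 minutes"
```

Comparing two samples keeps the check independent of the absolute height, so it works on any network.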

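System metrics

| Metric | Description | Alert threshold |
|---|---|---|
| node_cpu_seconds_total | CPU usage | > 80% for 5 minutes |
| node_memory_MemAvailable_bytes | Available memory | < 10% of total |
| node_filesystem_avail_bytes | Available disk space | < 10% of total |
| node_network_receive_bytes_total | Network receive rate | > 100 MB/s |
| node_disk_io_time_seconds_total | Disk I/O utilization | > 80% |
| node_load15 | System load (15-minute average) | > 2 × CPU core count |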
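
Several of these thresholds are percentages, while Node Exporter reports raw values such as bytes, so a ratio has to be computed before comparing. As an illustration of that arithmetic for available memory, the sketch below derives the percentage from /proc/meminfo, the file Node Exporter reads its memory metrics from on Linux. The function name mem_available_pct is hypothetical.

```shell
# Print available memory as an integer percentage of total memory.
# Reads /proc/meminfo-formatted text on stdin.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Fail instead of dividing by zero when MemTotal is missing.
            if (total > 0) printf "%d\n", (avail * 100) / total
            else exit 1
        }
    '
}

# Usage on the node itself:
#   pct=$(mem_available_pct < /proc/meminfo)
#   [ "$pct" -lt 10 ] && echo "available memory below 10%: ${pct}%"
```

The result is truncated to a whole number, which is precise enough for a 10% threshold.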

Grafana dashboard setup

Import the Stable dashboard

{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Connected Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Blocks per Minute",
        "targets": [
          {
            "expr": "rate(stablebft_consensus_height[1m]) * 60"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}

AlertManager configuration

Install AlertManager

# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

# Configure AlertManager
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        send_resolved: true
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'
EOF
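
Once AlertManager is running with this configuration, a synthetic alert can help confirm that routing reaches the receiver. The sketch below builds a payload for the AlertManager v2 API. The function name build_test_alert and the alert name RoutingTest are hypothetical, the usage line assumes AlertManager listens on its default port 9093, and the arguments are inserted without JSON escaping, so they should not contain quotes.

```shell
# Print a JSON array holding one synthetic alert, in the shape
# accepted by POST /api/v2/alerts.
# Arguments: alert name, severity label (neither is JSON-escaped).
build_test_alert() {
    local name="$1" severity="$2"
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Test alert, please ignore"}}]\n' \
        "$name" "$severity"
}

# Usage (assumed address http://localhost:9093):
#   build_test_alert RoutingTest warning | \
#     curl -s -X POST -H 'Content-Type: application/json' \
#          --data @- http://localhost:9093/api/v2/alerts
```

If routing works, the notification should arrive after the group_wait delay configured in the route.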

Alert rules

# /opt/prometheus/alerts.yml
# Prometheus loads this file only if its path is listed under rule_files in prometheus.yml.
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: BlockProductionStopped
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production has stopped"

      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Peer count is low: {{ $value }}"

      - alert: MempoolOverloaded
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Mempool size is high: {{ $value }}"

Health check script

#!/bin/bash
# health-check.sh

set -e

# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
SERVICE_NAME="stable"

echo "=== Stable node health check ==="
echo

# Check that the service is running
if systemctl is-active --quiet "${SERVICE_NAME}"; then
    echo -e "${GREEN}✓${NC} Service is running"
else
    echo -e "${RED}✗${NC} Service is not running"
    exit 1
fi

# Check node sync status
SYNC_STATUS=$(curl -s localhost:26657/status | jq -r '.result.sync_info.catching_up')
if [ "$SYNC_STATUS" = "false" ]; then
    echo -e "${GREEN}✓${NC} Node is synced"
else
    echo -e "${YELLOW}⚠${NC} Node is still syncing (or the RPC endpoint is unreachable)"
fi

# Check the peer count; fall back to 0 if the RPC call returns nothing
PEERS=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers // 0')
PEERS=${PEERS:-0}
if [ "$PEERS" -ge 3 ]; then
    echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
    echo -e "${YELLOW}⚠${NC} Peer count is low: $PEERS"
fi

# Check disk usage on the root filesystem
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
    echo -e "${YELLOW}⚠${NC} Disk usage is high: ${DISK_USAGE}%"
fi

echo
echo "=== Health check complete ==="

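Maintenance tasks

Daily maintenance

#!/bin/bash
# daily-maintenance.sh

# Rotate logs and discard journal entries older than 7 days
sudo journalctl --rotate
sudo journalctl --vacuum-time=7d

# Drop the page cache, dentries, and inodes
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Check for updates
echo "Checking for updates..."
curl -s https://api.github.com/repos/stable-chain/stable/releases/latest | jq -r '.tag_name'

# Back up important configuration files
mkdir -p ~/backups
cp ~/.stabled/config/node_key.json ~/backups/node_key_$(date +%Y%m%d).json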
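
Because the backup step writes a new dated file on every run, the backup directory grows without bound. One possible companion step is to prune old copies. This is a sketch only: the function name prune_old_backups and the 30-day retention in the usage line are assumptions, not part of the original script.

```shell
# Delete node_key backups in a directory that are older than a given
# number of days. Other files in the directory are left untouched.
prune_old_backups() {
    local dir="$1" keep_days="$2"
    # -mtime +N matches files whose age, counted in whole days,
    # is greater than N.
    find "$dir" -maxdepth 1 -type f -name 'node_key_*.json' \
        -mtime +"$keep_days" -delete
}

# Usage (assumed retention of 30 days):
#   prune_old_backups ~/backups 30
```

Restricting the match to node_key_*.json keeps the function from removing anything else stored alongside the backups.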

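Monitoring best practices

  1. Set up redundant monitoring
    • Use an external monitoring service
    • Have nodes monitor each other
    • Set up dead man's switch alerts
  2. Prevent alert fatigue
    • Tune alert thresholds against observed baselines
    • Use alert grouping and inhibition
    • Define an escalation policy
  3. Data retention
    • Keep at least 30 days of metrics
    • Archive important logs
    • Back up monitoring configuration regularly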
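
A dead man's switch inverts the usual alert logic: the monitored side sends a regular heartbeat, and an independent observer raises an alarm when the heartbeat stops arriving. The core comparison can be sketched as follows. The function name heartbeat_stale, the heartbeat file path, and the 300-second limit are all hypothetical.

```shell
# Return success (exit 0) when the last heartbeat is older than
# max_age seconds, and failure (exit 1) otherwise.
# Arguments: last heartbeat time, current time (both Unix seconds), max age.
heartbeat_stale() {
    local last_beat="$1" now="$2" max_age="$3"
    [ $(( now - last_beat )) -gt "$max_age" ]
}

# Usage sketch on a second machine (hypothetical heartbeat file):
#   last=$(stat -c %Y /var/run/stable-heartbeat)
#   heartbeat_stale "$last" "$(date +%s)" 300 && echo "monitoring heartbeat missing"
```

Running the check on a separate machine matters, because a check hosted on the monitored node goes silent together with it.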

Next steps