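A comprehensive guide to monitoring Stable nodes and performing routine maintenance tasks.
Monitoring Stack Overview
Recommended Stack
- Prometheus: metrics collection
- Grafana: visualization and dashboards
- AlertManager: alert routing and management
- Node Exporter: system metrics
- Loki: log aggregation (optional)
Quick Monitoring Setup
Step 1: Enable Prometheus Metrics
# Edit ~/.stabled/config/config.toml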
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"
Restart the node:
sudo systemctl restart ${SERVICE_NAME}
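After the restart, it can be worth confirming that the endpoint actually serves metrics under the configured namespace. The following is a minimal sketch, assuming the listen address `:26660` and the namespace `stablebft` from the config above; `count_namespace_metrics` is a hypothetical helper name, not part of any Stable tooling.

```shell
# count_namespace_metrics NAMESPACE
# Reads Prometheus text-format metrics on stdin and prints the number of
# sample lines whose metric name starts with "NAMESPACE_".
# "# HELP" and "# TYPE" comment lines are not counted.
count_namespace_metrics() {
  ns="$1"
  # grep -c exits non-zero when the count is 0; "|| true" keeps a caller
  # running under "set -e" from aborting in that case.
  grep -c "^${ns}_" || true
}

# Example usage against a running node (assumes the port configured above):
#   curl -s localhost:26660/metrics | count_namespace_metrics stablebft
```

A result of `0` suggests that the instrumentation section was not picked up or that the namespace differs from the one configured.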
Step 2: Install Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
# Create the configuration
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF
# Create the systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
[Install]
WantedBy=multi-user.target
EOF
# Create the prometheus user, then start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
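Once the service has been started, a short readiness check helps catch configuration errors early. The sketch below assumes Prometheus listens on its default port 9090 (the unit above does not override it) and uses the server's `/-/ready` endpoint; `wait_for_ready` is a hypothetical helper name.

```shell
# wait_for_ready URL [TRIES]
# Polls URL once per second until it answers with a successful HTTP status
# or TRIES attempts (default 10) have been used.
# Returns 0 on success, 1 otherwise.
wait_for_ready() {
  url="$1"
  tries="${2:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl return non-zero on HTTP errors; -s silences progress output.
    if curl -sf "$url" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example:
#   wait_for_ready http://localhost:9090/-/ready 30 && echo "Prometheus is ready"
```

If the check times out, `sudo journalctl -u prometheus` is usually the quickest way to see why the process did not come up.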
Step 3: Install Grafana
# Add the Grafana signing key and repository
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Access Grafana at http://your-ip:3000
# Default login: admin/admin (change the password after the first login)
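Grafana still needs a data source before any dashboard can show data. One way to add it without clicking through the UI is a provisioning file. This is a sketch under assumptions: Prometheus on its default port 9090, and the provisioning directory used by the Debian package. `write_prometheus_datasource` is a hypothetical helper name.

```shell
# write_prometheus_datasource FILE
# Writes a Grafana data source provisioning file that points at a local
# Prometheus. The URL assumes Prometheus listens on its default port 9090.
write_prometheus_datasource() {
  cat > "$1" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Example (verify the provisioning directory on your system first):
#   write_prometheus_datasource /tmp/prometheus.yaml
#   sudo install -m 644 /tmp/prometheus.yaml /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```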
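Key Monitoring Metrics
Node Health Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| up | Node availability | = 0 for 5 minutes |
| stablebft_consensus_height | Current block height | No increase in 5 minutes |
| stablebft_consensus_validators | Active validators | N/A |
| stablebft_consensus_rounds | Consensus rounds | > 3 |
| stablebft_consensus_block_interval | Block time | > 10 seconds |
| stablebft_p2p_peers | Connected peers | < 3 |
| stablebft_mempool_size | Mempool size | > 1500 |
| stablebft_mempool_failed_txs | Failed transactions | > 100/minute |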
System Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| node_cpu_seconds_total | CPU usage | > 80% for 5 minutes |
| node_memory_MemAvailable_bytes | Available memory | < 10% |
| node_filesystem_avail_bytes | Available disk space | < 10% |
| node_network_receive_bytes_total | Network receive rate | > 100 MB/s |
| node_disk_io_time_seconds_total | Disk I/O utilization | > 80% |
| node_load15 | System load | > 2 × CPU core count |
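Several thresholds in the system metrics table are percentages, while Node Exporter exposes raw counters and byte values, so the percentage has to be computed in PromQL. The sketch below writes three example rules that do this. The alert names, the rule group name, and the `/` mount point are assumptions chosen for illustration, and `write_system_alerts` is a hypothetical helper.

```shell
# write_system_alerts FILE
# Writes Prometheus alert rules that turn raw Node Exporter metrics into
# the percentage thresholds used for CPU, memory, and disk.
write_system_alerts() {
  cat > "$1" <<'EOF'
groups:
  - name: system_alerts
    rules:
      - alert: HighCpuUsage
        # Share of non-idle CPU time, averaged over all cores of an instance.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
      - alert: LowAvailableMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
      - alert: LowDiskSpace
        # Assumes the chain data lives on the root filesystem.
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
EOF
}

# Example:
#   write_system_alerts ./system-alerts.yml
```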
Grafana Dashboard Setup
Import the Stable Dashboard
{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Blocks per Minute",
        "targets": [
          {
            "expr": "rate(stablebft_consensus_height[1m]) * 60"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}
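The dashboard JSON above can be pasted into the Grafana import dialog, or posted to Grafana's HTTP API. The following is a sketch of the API route, assuming the JSON has been saved as `dashboard.json` (a hypothetical filename), that Grafana runs on `localhost:3000`, and that the default `admin:admin` credentials are still in place. `import_dashboard` is a hypothetical helper name.

```shell
# import_dashboard FILE [GRAFANA_URL] [USER:PASSWORD]
# Posts a dashboard definition to Grafana's dashboard API.
# FILE must contain a JSON object with a top-level "dashboard" key.
import_dashboard() {
  file="$1"
  grafana_url="${2:-http://localhost:3000}"
  credentials="${3:-admin:admin}"
  curl -s -X POST \
    -H "Content-Type: application/json" \
    -u "$credentials" \
    --data-binary "@${file}" \
    "${grafana_url}/api/dashboards/db"
}

# Example:
#   import_dashboard dashboard.json
#   import_dashboard dashboard.json http://your-ip:3000 'admin:new-password'
```

Grafana answers with a JSON body describing the result, so piping the output through `jq .` makes a rejected payload easier to read.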
AlertManager Configuration
Install AlertManager
# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager
# Configure
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'
receivers:
  - name: 'team-notifications'
    # Slack incoming webhook
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        send_resolved: true
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'
EOF
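The configuration alone does not start AlertManager. A systemd unit modeled on the Prometheus one is a reasonable way to run it. This is a sketch under assumptions: it reuses the `prometheus` user created earlier and stores state in `/opt/alertmanager/data`, neither of which the guide prescribes. `write_alertmanager_unit` is a hypothetical helper name.

```shell
# write_alertmanager_unit FILE
# Writes a systemd unit for AlertManager, mirroring the Prometheus unit.
# Assumes the binary and config live in /opt/alertmanager.
write_alertmanager_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Alertmanager
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data

[Install]
WantedBy=multi-user.target
EOF
}

# Example:
#   /opt/alertmanager/amtool check-config /opt/alertmanager/alertmanager.yml
#   write_alertmanager_unit /tmp/alertmanager.service
#   sudo install -m 644 /tmp/alertmanager.service /etc/systemd/system/alertmanager.service
#   sudo chown -R prometheus:prometheus /opt/alertmanager
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now alertmanager
```

Running `amtool check-config` first catches YAML mistakes before the service is started.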
Alert Rules
# /opt/prometheus/alerts.yml
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
      - alert: BlockProductionStalled
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production has stalled"
      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Peer count is too low: {{ $value }}"
      - alert: MempoolOverloaded
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Mempool size is too high: {{ $value }}"
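Prometheus only evaluates these rules if its main configuration references the rules file, and it only forwards alerts if an AlertManager target is listed. The `prometheus.yml` created in Step 2 contains neither section, so a sketch of the missing wiring follows. It assumes AlertManager runs locally on its default port 9093; `append_alerting_config` is a hypothetical helper name.

```shell
# append_alerting_config FILE
# Appends rule_files and alerting sections to an existing prometheus.yml.
# Both are top-level keys, so appending them at the end of the file is valid
# as long as the file does not already define them.
append_alerting_config() {
  cat >> "$1" <<'EOF'
rule_files:
  - /opt/prometheus/alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
EOF
}

# Example (run as a user that can write the file):
#   append_alerting_config /opt/prometheus/prometheus.yml
#   /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
#   /opt/prometheus/promtool check rules /opt/prometheus/alerts.yml
#   sudo systemctl restart prometheus
```

`promtool` ships in the Prometheus release archive and reports syntax errors in both files before the restart.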
Health Check Script
#!/bin/bash
# health-check.sh
set -e
# Colored output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
export SERVICE_NAME="stable"
echo "=== Stable Node Health Check ==="
echo
# Check that the service is running
if systemctl is-active --quiet "${SERVICE_NAME}"; then
  echo -e "${GREEN}✓${NC} Service is running"
else
  echo -e "${RED}✗${NC} Service is not running"
  exit 1
fi
# Check node sync status
SYNC_STATUS=$(curl -s localhost:26657/status | jq -r '.result.sync_info.catching_up')
if [ "$SYNC_STATUS" = "false" ]; then
  echo -e "${GREEN}✓${NC} Node is synced"
elif [ "$SYNC_STATUS" = "true" ]; then
  echo -e "${YELLOW}⚠${NC} Node is syncing"
else
  echo -e "${RED}✗${NC} Could not read sync status from the RPC endpoint"
fi
# Check peer count (an empty or missing value counts as 0)
PEERS=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers // 0')
if [ "${PEERS:-0}" -ge 3 ]; then
  echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
  echo -e "${YELLOW}⚠${NC} Peer count is too low: ${PEERS:-0}"
fi
# Check disk space (-P keeps each filesystem on one line for parsing)
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
  echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
  echo -e "${YELLOW}⚠${NC} Disk usage is too high: ${DISK_USAGE}%"
fi
echo
echo "=== Health check complete ==="
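The script above reports whether the node is catching up, but not whether the block height is still advancing, which is the condition the alert threshold for `stablebft_consensus_height` describes. A possible extension is sketched below. It assumes the same local RPC port 26657 and `jq`, and reads `latest_block_height` from the `/status` response; `get_height` and `height_advanced` are hypothetical helper names.

```shell
# get_height
# Prints the node's latest block height as reported by the local RPC.
get_height() {
  curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height'
}

# height_advanced PREVIOUS CURRENT
# Returns 0 when both values are non-negative integers and CURRENT is
# strictly greater than PREVIOUS; returns 1 otherwise.
height_advanced() {
  for value in "$1" "$2"; do
    case "$value" in
      ''|*[!0-9]*) return 1 ;;  # empty, "null", or otherwise non-numeric
    esac
  done
  [ "$2" -gt "$1" ]
}

# Example: sample the height twice, one minute apart.
#   before=$(get_height); sleep 60; after=$(get_height)
#   height_advanced "$before" "$after" || echo "Block height did not advance"
```

Validating the inputs before the numeric comparison means an unreachable RPC endpoint is reported as a stall instead of producing a shell error.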
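Maintenance Tasks
Daily Maintenance
#!/bin/bash
# daily-maintenance.sh
# Rotate logs and keep the last 7 days
sudo journalctl --rotate
sudo journalctl --vacuum-time=7d
# Drop the page cache, dentries, and inodes
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# Check for updates
echo "Checking for updates..."
curl -s https://api.github.com/repos/stable-chain/stable/releases/latest | jq -r '.tag_name'
# Back up important configuration files
mkdir -p ~/backups
cp ~/.stabled/config/node_key.json ~/backups/node_key_$(date +%Y%m%d).json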
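The backup step can be made a little more defensive, since `node_key.json` is sensitive and a silent failure is easy to miss. The helper below is a sketch, not part of the original script: it checks that the source exists, creates the target directory, keeps the same dated naming pattern, and restricts permissions on the copy. `backup_file` is a hypothetical name.

```shell
# backup_file SRC DEST_DIR
# Copies SRC into DEST_DIR as <name>_<YYYYMMDD>.<ext>, readable only by the
# owner, and prints the path of the copy. Returns 1 if SRC does not exist.
backup_file() {
  src="$1"
  dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "missing file: $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir"
  name=$(basename "$src")
  stamp=$(date +%Y%m%d)
  case "$name" in
    *.*) dest="${dest_dir}/${name%.*}_${stamp}.${name##*.}" ;;
    *)   dest="${dest_dir}/${name}_${stamp}" ;;
  esac
  cp -p "$src" "$dest"
  chmod 600 "$dest"
  echo "$dest"
}

# Example:
#   backup_file ~/.stabled/config/node_key.json ~/backups
```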
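Monitoring Best Practices
- Set up redundant monitoring
  - Use an external monitoring service
  - Implement cross-node monitoring
  - Set up dead man's switch alerts
- Prevent alert fatigue
  - Tune alert thresholds against a baseline
  - Use alert grouping and inhibition
  - Implement an escalation policy
- Data retention
  - Retain metrics for at least 30 days
  - Archive important logs
  - Back up the monitoring configuration regularly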
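Two of the practices above map directly onto Prometheus features. A dead man's switch is commonly built from an alert that always fires, so that an external service can raise an alarm when the heartbeat stops arriving. Retention is controlled by a server flag. The sketch below writes such a heartbeat rule; the alert name `Watchdog` and the group name are conventions chosen for illustration, and `write_watchdog_rule` is a hypothetical helper.

```shell
# write_watchdog_rule FILE
# Writes an always-firing alert rule that can serve as a heartbeat for a
# dead man's switch.
write_watchdog_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: meta_alerts
    rules:
      - alert: Watchdog
        # vector(1) always returns a sample, so this alert fires continuously.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; its absence means the alerting pipeline is broken"
EOF
}

# Example:
#   write_watchdog_rule ./watchdog.yml
#
# Retention: to keep 30 days of metrics, add this flag to the ExecStart line
# of the Prometheus unit and restart the service:
#   --storage.tsdb.retention.time=30d
```

The heartbeat is only useful if AlertManager routes it to a receiver outside the monitored host, which is where the external monitoring service comes in.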
Next Steps