Monitoring Stack Overview
Recommended Stack
- Prometheus: metrics collection
- Grafana: visualization and dashboards
- AlertManager: alert routing and management
- Node Exporter: system metrics
- Loki: log aggregation (optional)
Quick Monitoring Setup
Step 1: Enable Prometheus Metrics
# Edit ~/.stabled/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"
sudo systemctl restart ${SERVICE_NAME}
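To confirm that the metrics endpoint is live after the restart, it can help to pull a single value out of the Prometheus text format. The following is a minimal sketch: `metric_value` is a hypothetical helper name, and it assumes samples carry no trailing timestamp (which is typical for this kind of endpoint).

```shell
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the value of every sample whose metric name is exactly NAME.
# Assumption: samples have no trailing timestamp, so the last field is the value.
metric_value() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the {label="..."} part, if any
      if (metric == name) print $NF    # last field is the sample value
    }'
}
```

With the listen address and namespace configured above, a call would look like `curl -s http://localhost:26660/metrics | metric_value stablebft_consensus_height`. An empty result suggests the endpoint is not enabled or the metric name differs on your build.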
Step 2: Install Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus
# Create config
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF
# Create systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data
[Install]
WantedBy=multi-user.target
EOF
# Create the service user, fix ownership, and start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
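The scrape configuration above targets `localhost:9100`, but the steps do not show how to install Node Exporter. The sketch below mirrors the Prometheus download steps. The function names are hypothetical, the install path `/usr/local/bin` is an assumption, and the version you pass in should be checked against the Node Exporter releases page.

```shell
# node_exporter_url VERSION [ARCH]: build the GitHub release URL for a
# Node Exporter tarball (hypothetical helper; ARCH defaults to amd64).
node_exporter_url() {
  local version="$1" arch="${2:-amd64}"
  echo "https://github.com/prometheus/node_exporter/releases/download/v${version}/node_exporter-${version}.linux-${arch}.tar.gz"
}

# install_node_exporter VERSION: download, unpack, and install the binary.
# Assumption: /usr/local/bin is an acceptable install location.
install_node_exporter() {
  local version="$1" url tarball
  url=$(node_exporter_url "$version")
  tarball=$(basename "$url")
  wget "$url"
  tar xvf "$tarball"
  sudo mv "${tarball%.tar.gz}/node_exporter" /usr/local/bin/node_exporter
}
```

For example, `install_node_exporter 1.6.1` would fetch that release, assuming it is still published. Node Exporter listens on port 9100 by default, so it matches the scrape target once it runs under a systemd unit modeled on the Prometheus one.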
Step 3: Install Grafana
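# Add the Grafana signing key, then the repository
# Note: apt-key is deprecated on recent Debian/Ubuntu releases; if it is unavailable, use the keyring-based method from Grafana's install docs
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana
# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Access at http://your-ip:3000
# Default login: admin/admin (Grafana prompts for a new password on first login)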
Key Metrics to Monitor
Node Health Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| up | Node availability | = 0 for 5 minutes |
| stablebft_consensus_height | Current block height | No increase for 5 minutes |
| stablebft_consensus_validators | Number of active validators | N/A |
| stablebft_consensus_rounds | Consensus rounds | > 3 |
| stablebft_p2p_peers | Connected peers | < 3 |
| stablebft_mempool_size | Mempool size | > 1500 |
| stablebft_mempool_failed_txs | Total failed transactions | > 100/min |
System Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| node_cpu_seconds_total | CPU usage | > 80% for 5 minutes |
| node_memory_MemAvailable_bytes | Available memory | < 10% of total |
| node_filesystem_avail_bytes | Available disk space | < 10% of total |
| node_network_receive_bytes_total | Network RX | > 100 MB/s |
| node_disk_io_time_seconds_total | Disk I/O | > 80% |
| node_load15 | System load | > CPU cores * 2 |
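Several thresholds in this table are percentages even though the underlying metrics are raw byte counts, so each needs a ratio against a total. For memory, the PromQL form would be `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10`. As a local cross-check, the same ratio can be computed from `/proc/meminfo`, which is where Node Exporter reads these values. This is a rough sketch, and `mem_available_percent` is a hypothetical helper name.

```shell
# mem_available_percent: read /proc/meminfo-style text on stdin and print
# available memory as a whole-number percentage of total memory.
mem_available_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1          # no MemTotal line found
    }'
}
```

Run it as `mem_available_percent < /proc/meminfo`. A result below 10 corresponds to the memory threshold in the table.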
Grafana Dashboard Setup
Import the Stable Dashboard
{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Blocks per Minute",
        "targets": [
          {
            "expr": "rate(stablebft_consensus_height[1m]) * 60"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}
Import a Custom Dashboard
Import a dashboard through the Grafana UI:
# Navigate to Dashboards > Import > Upload JSON file
# Or use Dashboard ID in Grafana's dashboard library
AlertManager Configuration
Install AlertManager
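# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

# Configure
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
    # Slack incoming webhooks expect slack_configs; a generic webhook_configs payload is not in the format Slack accepts
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        send_resolved: true
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF

# Create systemd service (assumption: reuses the prometheus user created during the Prometheus install)
sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=AlertManager
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data
[Install]
WantedBy=multi-user.target
EOF

# Start AlertManager
sudo chown -R prometheus:prometheus /opt/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager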
Alert Rules
# /opt/prometheus/alerts.yml
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: BlockProductionStopped
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production stopped"

      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }}"

      - alert: HighMempool
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High mempool size: {{ $value }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value | humanizePercentage }}"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"
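Prometheus only evaluates these rules if the rules file is listed under `rule_files`, and it only forwards firing alerts if an `alerting` section points at AlertManager. The Prometheus configuration shown earlier has neither. The sketch below appends both sections when they are missing. `enable_alerting` is a hypothetical helper name, and `localhost:9093` assumes AlertManager runs on the same host with its default port.

```shell
# enable_alerting FILE: append rule_files and alerting sections to a
# Prometheus config file unless it already defines them.
# Assumptions: rules live in /opt/prometheus/alerts.yml and AlertManager
# listens on localhost:9093 (its default port).
enable_alerting() {
  local config="$1"
  if ! grep -q '^rule_files:' "$config"; then
    cat >> "$config" <<'EOF'

rule_files:
  - /opt/prometheus/alerts.yml
EOF
  fi
  if ! grep -q '^alerting:' "$config"; then
    cat >> "$config" <<'EOF'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
EOF
  fi
}
```

One way to use it is to save the function in a script that calls `enable_alerting /opt/prometheus/prometheus.yml` and to run that script with sudo. Afterwards, `/opt/prometheus/promtool check config /opt/prometheus/prometheus.yml` validates the result, and `sudo systemctl restart prometheus` applies it.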
Log Monitoring
Systemd Logs
# View recent logs
sudo journalctl -u ${SERVICE_NAME} -n 100
# Follow logs
sudo journalctl -u ${SERVICE_NAME} -f
# Filter by time
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago"
# Export logs
sudo journalctl -u ${SERVICE_NAME} --since today > stable-logs-$(date +%Y%m%d).log
Log Analysis Script
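#!/bin/bash
# analyze-logs.sh
# Abort early if SERVICE_NAME (the node's systemd unit name) is not set
: "${SERVICE_NAME:?Set SERVICE_NAME to your systemd unit name}"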
# Count errors in last hour
echo "Errors in last hour:"
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago" | grep -c ERROR
# Show peer connections
echo "Peer connections:"
sudo journalctl -u ${SERVICE_NAME} --since "10 minutes ago" | grep "Peer connection" | tail -10
# Check for consensus issues
echo "Consensus rounds:"
sudo journalctl -u ${SERVICE_NAME} --since "30 minutes ago" | grep -E "enterNewRound|Timeout" | tail -20
# Memory usage patterns
echo "Memory warnings:"
sudo journalctl -u ${SERVICE_NAME} --since "1 day ago" | grep -i memory
Loki Setup (Optional)
# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Install Promtail
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
# Configure Promtail
sudo tee /etc/promtail-config.yml > /dev/null <<EOF
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: stable
    # Promtail's journal scrape block is named "journal"
    journal:
      matches: "_SYSTEMD_UNIT=stabled.service"
      labels:
        job: stable
        host: localhost
EOF
# Start Promtail (Loki itself must already be running and listening on port 3100)
promtail -config.file=/etc/promtail-config.yml
Health Check Endpoints
HTTP Endpoints
# Basic health check
curl -s http://localhost:26657/health
# Node status
curl -s http://localhost:26657/status | jq
# Net info
curl -s http://localhost:26657/net_info | jq
# Consensus state
curl -s http://localhost:26657/consensus_state | jq
# Unconfirmed transactions
curl -s http://localhost:26657/num_unconfirmed_txs | jq
Health Check Script
#!/bin/bash
# health-check.sh
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
export SERVICE_NAME="stable"
echo "=== Stable Node Health Check ==="
echo
# Check if service is running
if systemctl is-active --quiet ${SERVICE_NAME}; then
  echo -e "${GREEN}✓${NC} Service is running"
else
  echo -e "${RED}✗${NC} Service is not running"
  exit 1
fi

# Check node sync status
SYNC_STATUS=$(curl -s localhost:26657/status | jq -r '.result.sync_info.catching_up')
if [ "$SYNC_STATUS" = "false" ]; then
  echo -e "${GREEN}✓${NC} Node is synced"
else
  echo -e "${YELLOW}⚠${NC} Node is syncing"
fi

# Check peer count
PEERS=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers')
if [ "$PEERS" -ge 3 ]; then
  echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
  echo -e "${YELLOW}⚠${NC} Low peer count: $PEERS"
fi

# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
  echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
  echo -e "${YELLOW}⚠${NC} High disk usage: ${DISK_USAGE}%"
fi

# Check memory
MEM_AVAILABLE=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
MEM_PERCENT=$((100 - (MEM_AVAILABLE * 100 / MEM_TOTAL)))
if [ "$MEM_PERCENT" -lt 80 ]; then
  echo -e "${GREEN}✓${NC} Memory usage: ${MEM_PERCENT}%"
else
  echo -e "${YELLOW}⚠${NC} High memory usage: ${MEM_PERCENT}%"
fi
echo
echo "=== Health Check Complete ==="
Performance Monitoring
Resource Usage Tracking
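#!/bin/bash
# track-resources.sh
# Make sure the output directory exists before appending to the CSV
mkdir -p ~/metrics
while true; do
  TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
  CPU=$(top -bn1 | grep "stabled" | awk '{print $9}')
  MEM=$(top -bn1 | grep "stabled" | awk '{print $10}')
  # Column 14 of iostat -x varies between sysstat versions; check it against your header line
  IO=$(iostat -x 1 2 | tail -n2 | awk '{print $14}')
  echo "$TIMESTAMP,CPU:$CPU,MEM:$MEM,IO:$IO" >> ~/metrics/resources.csv
  sleep 60
done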
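Once the tracker has been running for a while, the CSV is easier to act on as a summary than as raw rows. The sketch below averages one of the `KEY:value` columns written by the script above. `avg_field` is a hypothetical helper name, and it assumes each row has the `timestamp,CPU:x,MEM:y,IO:z` layout with a single numeric value per key.

```shell
# avg_field KEY: read rows like "2024-01-01 12:00:00,CPU:12.5,MEM:3.2,IO:0.4"
# on stdin and print the mean of the values stored under KEY.
avg_field() {
  local field="$1"
  awk -F',' -v key="$field" '
    {
      # Field 1 is the timestamp, so start at field 2
      for (i = 2; i <= NF; i++) {
        split($i, kv, ":")
        if (kv[1] == key && kv[2] != "") { sum += kv[2]; n++ }
      }
    }
    END {
      if (n > 0) printf "%.2f\n", sum / n
      else exit 1          # no samples for this key
    }'
}
```

For example, `avg_field CPU < ~/metrics/resources.csv` prints the mean CPU percentage across all recorded samples. Rows where the value is empty (for instance, when the process was not running) are skipped.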
Query Performance
# Monitor RPC response times
while true; do
  START=$(date +%s%N)
  curl -s http://localhost:26657/status > /dev/null
  END=$(date +%s%N)
  DIFF=$((($END - $START) / 1000000))
  echo "RPC response time: ${DIFF}ms"
  sleep 5
done
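Individual response times are noisy, so a percentile over a batch of samples is usually a better signal than any single reading. The sketch below computes a nearest-rank percentile from one sample per line. `percentile` is a hypothetical helper name, and it assumes the millisecond values from the loop above have been saved to a file, one per line.

```shell
# percentile P: read one numeric sample per line on stdin and print the
# P-th percentile using the nearest-rank method (P is an integer 1..100).
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1               # no samples
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

For example, if the samples are stored in a hypothetical `rpc-times.txt`, `percentile 95 < rpc-times.txt` prints the latency that 95% of requests stayed at or below.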
Monitoring Best Practices
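- Set up redundant monitoring
  - Use an external monitoring service
  - Implement cross-monitoring between nodes
  - Configure dead man's switch alerts
- Avoid alert fatigue
  - Tune alert thresholds against observed baselines
  - Use alert grouping and inhibition
  - Implement escalation policies
- Data retention
  - Keep metrics for at least 30 days
  - Archive important logs
  - Back up the monitoring configuration regularly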
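A dead man's switch is commonly built as an alert that fires all the time: an external heartbeat service receives it and notifies you when it stops arriving, which signals that Prometheus or AlertManager itself is down. The sketch below writes such a rule to a file. `write_deadman_rule`, the group name, and the `severity: none` label are illustrative assumptions, and the file still has to be listed under `rule_files` in the Prometheus configuration and routed to a receiver of your choice.

```shell
# write_deadman_rule PATH: write a Prometheus rule file containing one
# always-firing alert. Names and labels here are illustrative assumptions.
write_deadman_rule() {
  local path="$1"
  cat > "$path" <<'EOF'
groups:
  - name: meta_alerts
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert that should always be firing"
EOF
}
```

For example, `write_deadman_rule /opt/prometheus/deadman.yml` (a hypothetical path) creates the file. For the retention item in the list, Prometheus accepts a `--storage.tsdb.retention.time=30d` flag on its command line.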

