Stable Docs | Stable Architecture & Integration Guide

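A comprehensive guide to monitoring Stable nodes and carrying out routine maintenance.

Monitoring Stack Overview

Recommended Stack

Prometheus: metrics collection
Grafana: visualization and dashboards
AlertManager: alert routing and management
Node Exporter: system metrics
Loki: log aggregation (optional)

Quick Monitoring Setup

Step 1: Enable Prometheus Metrics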

# Edit ~/.stabled/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "stablebft"

Restart the node (SERVICE_NAME is the name of your node's systemd unit):

sudo systemctl restart ${SERVICE_NAME}
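
After the restart, it can help to confirm that the metrics endpoint is actually serving samples before pointing Prometheus at it. The sketch below assumes the endpoint is reachable at http://localhost:26660/metrics (the default Prometheus scrape path on the listen address configured above) and that metric names start with the stablebft namespace. The helper name count_metrics is hypothetical.

```shell
#!/bin/bash
# count_metrics PREFIX: read Prometheus text-format metrics on stdin and
# print how many lines start with PREFIX. Comment lines (# HELP, # TYPE)
# begin with '#', so they are never counted.
count_metrics() {
    local prefix="$1"
    awk -v p="$prefix" 'index($0, p) == 1 { n++ } END { print n + 0 }'
}

# Example (requires a running node):
#   curl -s http://localhost:26660/metrics | count_metrics stablebft_
```

A result of 0 suggests that instrumentation is still disabled, the node has not restarted, or the namespace differs from the one assumed here.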

Step 2: Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create config
sudo tee /opt/prometheus/prometheus.yml > /dev/null <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'stable-node'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'mainnode'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF

# Create systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data

[Install]
WantedBy=multi-user.target
EOF

# Start Prometheus
sudo useradd -rs /bin/false prometheus
sudo chown -R prometheus:prometheus /opt/prometheus
sudo systemctl enable prometheus
sudo systemctl start prometheus
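
The scrape configuration above also targets localhost:9100, the default Node Exporter port, but these steps do not install Node Exporter itself. A minimal install sketch follows. The version number (1.6.1), the install path, and the helper names are assumptions, so check the Node Exporter releases page for a current version before running it.

```shell
#!/bin/bash
# Version is an assumption; pick a current release before running.
NODE_EXPORTER_VERSION="${NODE_EXPORTER_VERSION:-1.6.1}"

# node_exporter_url VERSION: print the linux-amd64 release tarball URL.
node_exporter_url() {
    local v="$1"
    echo "https://github.com/prometheus/node_exporter/releases/download/v${v}/node_exporter-${v}.linux-amd64.tar.gz"
}

# install_node_exporter VERSION: download, unpack, and install the binary
# to /usr/local/bin (assumed location).
install_node_exporter() {
    local v="$1"
    local dir="node_exporter-${v}.linux-amd64"
    wget "$(node_exporter_url "$v")" || return 1
    tar xvf "${dir}.tar.gz" || return 1
    sudo mv "${dir}/node_exporter" /usr/local/bin/node_exporter
}

# Example:
#   install_node_exporter "$NODE_EXPORTER_VERSION"
#   /usr/local/bin/node_exporter &   # listens on :9100 by default
```

For anything beyond a quick test, running the binary under a systemd unit similar to the Prometheus one above is the more durable choice.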

Step 3: Install Grafana

# Add Grafana repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

# Install Grafana
sudo apt-get update
sudo apt-get install grafana

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

# Access at http://your-ip:3000
# Default login: admin/admin

Key Monitoring Metrics

Node Health Metrics

Metric	Description	Alert threshold
`up`	Node availability	= 0 for 5 minutes
`stablebft_consensus_height`	Current block height	No increase for 5 minutes
`stablebft_consensus_validators`	Number of active validators	N/A
`stablebft_consensus_rounds`	Consensus rounds	> 3
`stablebft_p2p_peers`	Connected peers	< 3
`stablebft_mempool_size`	Mempool size	> 1500
`stablebft_mempool_failed_txs`	Total failed transactions	> 100/min

System Metrics

Metric	Description	Alert threshold
`node_cpu_seconds_total`	CPU usage	> 80% for 5 minutes
`node_memory_MemAvailable_bytes`	Available memory	< 10%
`node_filesystem_avail_bytes`	Available disk space	< 10%
`node_network_receive_bytes_total`	Network RX	> 100 MB/s
`node_disk_io_time_seconds_total`	Disk I/O	> 80%
`node_load15`	System load	> CPU cores * 2
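
Several thresholds in these tables (available memory, system load, consensus rounds) have no matching rule in the alert rules shown later in this guide. As a minimal sketch, the helper below writes a few of them to a separate rule file. The helper name, the rule group name, the alert names, and the `for` durations are assumptions chosen for illustration, not values taken from the Stable documentation.

```shell
#!/bin/bash
# write_system_rules FILE: write example alert rules derived from the
# threshold tables to FILE. Names and durations are illustrative.
write_system_rules() {
    local file="$1"
    # The quoted 'EOF' stops the shell from expanding {{ $value }}.
    cat > "$file" <<'EOF'
groups:
  - name: stable_system_alerts
    rules:
      - alert: LowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low available memory: {{ $value | humanizePercentage }}"

      - alert: HighLoad
        expr: node_load15 > on(instance) (2 * count by (instance) (node_cpu_seconds_total{mode="idle"}))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High 15-minute load: {{ $value }}"

      - alert: HighConsensusRounds
        expr: stablebft_consensus_rounds > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consensus needed {{ $value }} rounds"
EOF
}

# Example:
#   write_system_rules /tmp/system-alerts.yml
#   /opt/prometheus/promtool check rules /tmp/system-alerts.yml
```

The HighLoad expression counts idle-mode CPU series per instance to get the core count, and `on(instance)` is needed because the two sides of the comparison carry different label sets.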

Grafana Dashboard Setup

Import the Stable Dashboard

{
  "dashboard": {
    "title": "Stable Node Monitoring",
    "panels": [
      {
        "title": "Block Height",
        "targets": [
          {
            "expr": "stablebft_consensus_height{chain_id=\"stabletestnet_2201-1\"}"
          }
        ]
      },
      {
        "title": "Peers",
        "targets": [
          {
            "expr": "stablebft_p2p_peers"
          }
        ]
      },
      {
        "title": "Blocks per Minute",
        "targets": [
          {
            "expr": "rate(stablebft_consensus_height[1m]) * 60"
          }
        ]
      },
      {
        "title": "Mempool Size",
        "targets": [
          {
            "expr": "stablebft_mempool_size"
          }
        ]
      }
    ]
  }
}

Import a Custom Dashboard

Import the dashboard through the Grafana UI:

# Navigate to Dashboards > Import > Upload JSON file
# Or use Dashboard ID in Grafana's dashboard library
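
Besides the UI, Grafana exposes an HTTP API for creating dashboards, and the JSON shown earlier already has the top-level "dashboard" key that POST /api/dashboards/db expects. The sketch below assumes the JSON is saved to a local file (stable-dashboard.json is a hypothetical name), that Grafana listens on localhost:3000, and that the default admin/admin login is still in place. The helper name import_dashboard is also hypothetical.

```shell
#!/bin/bash
# import_dashboard FILE [GRAFANA_URL] [USER:PASSWORD]
# POST a dashboard JSON file to Grafana's dashboard API.
# Returns non-zero without contacting Grafana if FILE is not readable.
import_dashboard() {
    local file="$1"
    local url="${2:-http://localhost:3000}"
    local auth="${3:-admin:admin}"

    if [ ! -r "$file" ]; then
        echo "dashboard file not found: $file" >&2
        return 1
    fi

    # --fail makes curl exit non-zero on HTTP error responses.
    curl --silent --show-error --fail \
        -u "$auth" \
        -H 'Content-Type: application/json' \
        -X POST --data-binary "@${file}" \
        "${url}/api/dashboards/db"
}

# Example:
#   import_dashboard stable-dashboard.json
```

Once the admin password has been changed, pass the new credentials as the third argument rather than relying on the default.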

AlertManager Configuration

Install AlertManager

# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager

# Configure
sudo tee /opt/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        send_resolved: true
    email_configs:
      - to: 'alerts@yourteam.com'
        from: 'prometheus@yournode.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your@gmail.com'
        auth_password: 'app-specific-password'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
EOF

# Start AlertManager
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
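
The enable and start commands above only work once a systemd unit named alertmanager exists, and the steps so far do not create one. A minimal sketch of such a unit follows. It assumes the binary and configuration live in /opt/alertmanager as installed above, and that the service reuses the prometheus user created in Step 2. The data directory and the helper name write_alertmanager_unit are assumptions as well.

```shell
#!/bin/bash
# write_alertmanager_unit FILE: write a systemd unit for AlertManager.
# Paths and the service user are assumptions; adjust them to your setup.
write_alertmanager_unit() {
    local file="$1"
    # Quoted 'EOF' keeps the trailing backslashes in the unit file,
    # where systemd treats them as line continuations.
    cat > "$file" <<'EOF'
[Unit]
Description=AlertManager
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data

[Install]
WantedBy=multi-user.target
EOF
}

# Example (run before enabling the service):
#   write_alertmanager_unit /tmp/alertmanager.service
#   sudo install -m 644 /tmp/alertmanager.service /etc/systemd/system/alertmanager.service
#   sudo chown -R prometheus:prometheus /opt/alertmanager
#   sudo systemctl daemon-reload
```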

Alert Rules

# /opt/prometheus/alerts.yml
groups:
  - name: stable_alerts
    rules:
      - alert: NodeDown
        expr: up{job="stable-node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: BlockProductionStopped
        expr: increase(stablebft_consensus_height[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block production stopped"

      - alert: LowPeerCount
        expr: stablebft_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }}"

      - alert: HighMempool
        expr: stablebft_mempool_size > 1500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High mempool size: {{ $value }}"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space: {{ $value | humanizePercentage }}"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ $value }}%"
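
Prometheus only evaluates these rules if its configuration lists the rule file, and it only forwards firing alerts if an AlertManager target is configured. The prometheus.yml created in Step 2 contains neither section. The sketch below appends both, assuming AlertManager listens on its default port 9093 on the same host. The helper name enable_alerting is hypothetical.

```shell
#!/bin/bash
# enable_alerting CONFIG: append rule_files and alerting sections to a
# Prometheus config, unless a rule_files section is already present.
enable_alerting() {
    local config="$1"
    if grep -q '^rule_files:' "$config"; then
        echo "rule_files already configured in $config" >&2
        return 0
    fi
    cat >> "$config" <<'EOF'

rule_files:
  - /opt/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
EOF
}

# Example (run as a user that can write the config file):
#   enable_alerting /opt/prometheus/prometheus.yml
#   /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml
#   sudo systemctl restart prometheus
```

Running promtool before the restart catches YAML and rule syntax errors that would otherwise stop Prometheus from starting.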

Log Monitoring

Systemd Logs

# View recent logs
sudo journalctl -u ${SERVICE_NAME} -n 100

# Follow logs
sudo journalctl -u ${SERVICE_NAME} -f

# Filter by time
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago"

# Export logs
sudo journalctl -u ${SERVICE_NAME} --since today > stable-logs-$(date +%Y%m%d).log

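Log Analysis Script

#!/bin/bash
# analyze-logs.sh

# systemd unit to inspect; export SERVICE_NAME to override the default
SERVICE_NAME="${SERVICE_NAME:-stable}"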

# Count errors in last hour
echo "Errors in last hour:"
sudo journalctl -u ${SERVICE_NAME} --since "1 hour ago" | grep -c ERROR

# Show peer connections
echo "Peer connections:"
sudo journalctl -u ${SERVICE_NAME} --since "10 minutes ago" | grep "Peer connection" | tail -10

# Check for consensus issues
echo "Consensus rounds:"
sudo journalctl -u ${SERVICE_NAME} --since "30 minutes ago" | grep -E "enterNewRound|Timeout" | tail -20

# Memory usage patterns
echo "Memory warnings:"
sudo journalctl -u ${SERVICE_NAME} --since "1 day ago" | grep -i memory

Loki Setup (Optional)

# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.0/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki

# Install Promtail
wget https://github.com/grafana/loki/releases/download/v2.9.0/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail

# Configure Promtail
sudo tee /etc/promtail-config.yml > /dev/null <<EOF
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: stable
    journal:
      # Adjust the unit name if your service is not stabled.service
      matches: "_SYSTEMD_UNIT=stabled.service"
      labels:
        job: stable
        host: localhost
EOF

# Start Promtail (it pushes to Loki, which must already be running on port 3100)
promtail -config.file=/etc/promtail-config.yml

Health Check Endpoints

HTTP Endpoints

# Basic health check
curl -s http://localhost:26657/health

# Node status
curl -s http://localhost:26657/status | jq

# Net info
curl -s http://localhost:26657/net_info | jq

# Consensus state
curl -s http://localhost:26657/consensus_state | jq

# Unconfirmed transactions
curl -s http://localhost:26657/num_unconfirmed_txs | jq
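
A node can report that it is not catching up while its chain view has actually stalled, so the age of the latest block is a useful extra signal from the status endpoint. The sketch below assumes the status response carries an RFC 3339 UTC timestamp at .result.sync_info.latest_block_time, as CometBFT-style RPC servers do. That field name is an assumption about Stable's RPC, and block_age_seconds is a hypothetical helper that relies on GNU date.

```shell
#!/bin/bash
# block_age_seconds BLOCK_TIME [NOW_EPOCH]
# Print the number of seconds between BLOCK_TIME (an RFC 3339 UTC
# timestamp such as 2024-01-01T00:00:00.123456789Z) and NOW_EPOCH,
# which defaults to the current time. Requires GNU date.
block_age_seconds() {
    local ts="${1%%.*}"          # drop fractional seconds, if any
    ts="${ts%Z}Z"                # leave exactly one trailing Z
    local now="${2:-$(date +%s)}"
    local block_epoch
    block_epoch=$(date -u -d "$ts" +%s) || return 1
    echo $(( now - block_epoch ))
}

# Example (requires a running node and jq):
#   t=$(curl -s http://localhost:26657/status | jq -r '.result.sync_info.latest_block_time')
#   block_age_seconds "$t"
```

An age that keeps growing well beyond the expected block interval points to a stalled node or a halted chain.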

Health Check Script

#!/bin/bash
# health-check.sh

set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# systemd unit name; export SERVICE_NAME before running to override
SERVICE_NAME="${SERVICE_NAME:-stable}"

echo "=== Stable Node Health Check ==="
echo

# Check if service is running
if systemctl is-active --quiet ${SERVICE_NAME}; then
    echo -e "${GREEN}✓${NC} Service is running"
else
    echo -e "${RED}✗${NC} Service is not running"
    exit 1
fi

# Check node sync status
SYNC_STATUS=$(curl -s localhost:26657/status | jq -r '.result.sync_info.catching_up')
if [ "$SYNC_STATUS" = "false" ]; then
    echo -e "${GREEN}✓${NC} Node is synced"
else
    echo -e "${YELLOW}⚠${NC} Node is syncing"
fi

# Check peer count
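PEERS=$(curl -s localhost:26657/net_info | jq -r '.result.n_peers')
# Treat a missing or non-numeric answer (for example, RPC unreachable) as zero peers
case "$PEERS" in ''|*[!0-9]*) PEERS=0 ;; esac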
if [ "$PEERS" -ge 3 ]; then
    echo -e "${GREEN}✓${NC} Connected peers: $PEERS"
else
    echo -e "${YELLOW}⚠${NC} Low peer count: $PEERS"
fi

# Check disk space
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Disk usage: ${DISK_USAGE}%"
else
    echo -e "${YELLOW}⚠${NC} High disk usage: ${DISK_USAGE}%"
fi

# Check memory
MEM_AVAILABLE=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
MEM_PERCENT=$((100 - (MEM_AVAILABLE * 100 / MEM_TOTAL)))
if [ "$MEM_PERCENT" -lt 80 ]; then
    echo -e "${GREEN}✓${NC} Memory usage: ${MEM_PERCENT}%"
else
    echo -e "${YELLOW}⚠${NC} High memory usage: ${MEM_PERCENT}%"
fi

echo
echo "=== Health Check Complete ==="

Performance Monitoring

Resource Usage Tracking

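#!/bin/bash
# track-resources.sh

# Make sure the output directory exists before appending to the CSV
mkdir -p ~/metrics

while true; do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    # %CPU and %MEM (top columns 9 and 10) of the first stabled process
    STATS=$(top -bn1 | grep -m1 "stabled")
    CPU=$(echo "$STATS" | awk '{print $9}')
    MEM=$(echo "$STATS" | awk '{print $10}')
    # %util (last column) of the last device in the second iostat report
    IO=$(iostat -x 1 2 | awk 'NF {last=$NF} END {print last}')

    echo "$TIMESTAMP,CPU:$CPU,MEM:$MEM,IO:$IO" >> ~/metrics/resources.csv

    sleep 60
done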

Query Performance

# Monitor RPC response times
while true; do
    START=$(date +%s%N)
    curl -s http://localhost:26657/status > /dev/null
    END=$(date +%s%N)
    DIFF=$((($END - $START) / 1000000))
    echo "RPC response time: ${DIFF}ms"
    sleep 5
done
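
As an alternative to timing the request with two date calls, curl can report the total request time itself through its --write-out variable time_total, which leaves process start-up overhead out of the measurement. The sketch below is one way to wrap that. The helper names secs_to_ms and rpc_latency_ms are hypothetical.

```shell
#!/bin/bash
# secs_to_ms SECONDS: convert decimal seconds to whole milliseconds.
secs_to_ms() {
    awk -v s="$1" 'BEGIN { printf "%.0f\n", s * 1000 }'
}

# rpc_latency_ms URL: print the total request time for URL in
# milliseconds, as measured by curl.
rpc_latency_ms() {
    local secs
    secs=$(curl -s -o /dev/null -w '%{time_total}' "$1") || return 1
    secs_to_ms "$secs"
}

# Example:
#   rpc_latency_ms http://localhost:26657/status
```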

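Monitoring Best Practices

Set up redundant monitoring
- Use an external monitoring service
- Implement cross-monitoring between nodes
- Configure dead man's switch alerts
Prevent alert fatigue
- Tune alert thresholds against observed baselines
- Use alert grouping and inhibition
- Implement an escalation policy
Retain data
- Keep metrics for at least 30 days
- Archive important logs
- Back up the monitoring configuration regularly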
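
One common way to build the dead man's switch mentioned above is an alert that always fires. AlertManager routes it to an external service, and that service raises the alarm when the alert stops arriving, which signals that Prometheus or AlertManager itself is down. The sketch below writes such a rule. The alert name Watchdog, the group name, the severity label, and the helper name are assumptions, and the external receiver is not shown.

```shell
#!/bin/bash
# write_watchdog_rule FILE: write an always-firing alert rule to FILE.
# vector(1) always returns one sample, so the alert never resolves
# while Prometheus is evaluating rules.
write_watchdog_rule() {
    local file="$1"
    cat > "$file" <<'EOF'
groups:
  - name: meta_alerts
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Always-firing alert that proves the alerting pipeline works"
EOF
}

# Example:
#   write_watchdog_rule /tmp/watchdog.yml
#   /opt/prometheus/promtool check rules /tmp/watchdog.yml
```

For the 30-day retention item, Prometheus accepts --storage.tsdb.retention.time=30d on its command line. Its default retention is 15 days.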

Next Steps

Review the troubleshooting guide for help resolving problems
Configure upgrades alongside monitoring
Set up custom alerts to match your requirements
