Observability

Monitor Nivo’s health and performance with Prometheus metrics and Grafana dashboards.


Table of Contents

  1. Overview
  2. Accessing Dashboards
    1. Production
    2. Local Development
  3. Mission Control Dashboard
    1. Row 1: Service Health
    2. Row 2: RED Metrics
    3. Row 3: Simulation Status
    4. Row 4: Infrastructure
    5. Rows 5-6: Time Series
  4. Metrics Exposed
    1. HTTP Metrics
    2. Go Runtime Metrics
    3. Simulation Metrics
  5. Alert Rules
    1. Example Alert Rule
  6. Adding Custom Metrics
    1. Custom Business Metrics
  7. Prometheus Configuration
  8. Troubleshooting
    1. Metrics not appearing in Grafana
    2. Dashboard not loading
    3. High memory alerts
  9. Architecture
  10. Related Documentation

Overview

Nivo implements comprehensive observability using the RED methodology:

| Metric   | Description        | Example                         |
|----------|--------------------|---------------------------------|
| Rate     | Request throughput | Requests per second per service |
| Errors   | Error percentage   | 4xx/5xx responses               |
| Duration | Request latency    | P50, P95, P99 response times    |

graph LR
    subgraph Services
        ID[Identity]
        W[Wallet]
        TX[Transaction]
        L[Ledger]
        R[Risk]
        N[Notification]
        SIM[Simulation]
        GW[Gateway]
        RBAC[RBAC]
    end

    subgraph Observability Stack
        PROM[Prometheus<br/>Metrics Collection]
        GRAF[Grafana<br/>Visualization]
    end

    ID -->|/metrics| PROM
    W -->|/metrics| PROM
    TX -->|/metrics| PROM
    L -->|/metrics| PROM
    R -->|/metrics| PROM
    N -->|/metrics| PROM
    SIM -->|/metrics| PROM
    GW -->|/metrics| PROM
    RBAC -->|/metrics| PROM

    PROM --> GRAF

Accessing Dashboards

Production

| Dashboard | URL                   | Access         |
|-----------|-----------------------|----------------|
| Grafana   | grafana.nivomoney.com | Login required |

Local Development

# Start observability stack
make obs-up

# Or with full stack
make dev

| Dashboard  | URL            |
|------------|----------------|
| Grafana    | localhost:3003 |
| Prometheus | localhost:9090 |

Default Grafana credentials:

  • Username: admin
  • Password: admin

Mission Control Dashboard

The Mission Control dashboard provides at-a-glance visibility into the entire Nivo platform.

Row 1: Service Health

Nine stat panels showing real-time service status:

graph LR
    subgraph Service Health Status
        GW[Gateway<br/>🟢 UP]
        ID[Identity<br/>🟢 UP]
        L[Ledger<br/>🟢 UP]
        RBAC[RBAC<br/>🟢 UP]
        W[Wallet<br/>🟢 UP]
        TX[Transaction<br/>🟢 UP]
        R[Risk<br/>🟢 UP]
        N[Notification<br/>🟢 UP]
        SIM[Simulation<br/>🟢 UP]
    end

  • Green: Service is healthy (responding to scrapes)
  • Red: Service is down or unhealthy
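
These stat panels are most likely driven by Prometheus's built-in `up` metric, which is 1 while a target answers scrapes and 0 otherwise — for example, `up{job="gateway"}` for the Gateway panel. (The exact panel queries are an assumption, not taken from the dashboard JSON.)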

Row 2: RED Metrics

| Panel | Query | Purpose |
|-------|-------|---------|
| Total Request Rate | `sum(rate(http_requests_total[5m]))` | Overall throughput |
| Error Rate % | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100` | Error percentage |
| P99 Latency | `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))` | Worst-case latency |

Row 3: Simulation Status

For the demo simulation engine:

  • Users Created
  • KYC Verified
  • Transactions Generated
  • Total Operations
  • Success Rate
  • Average Delay

Row 4: Infrastructure

| Panel                    | Purpose                                       |
|--------------------------|-----------------------------------------------|
| Goroutines by Service    | Concurrency health (goroutine leak detection) |
| Memory Usage by Service  | Resource consumption                          |

Rows 5-6: Time Series

Detailed trends over time:

  • Request rate by service (stacked)
  • Error rate by service
  • Latency percentiles (P50/P95/P99)
  • Memory and goroutine trends

Metrics Exposed

All services expose metrics at /metrics in Prometheus format.

HTTP Metrics

# Request count by method, path, status
http_requests_total{method="POST", path="/api/v1/auth/login", status="200"}

# Request duration histogram
http_request_duration_seconds_bucket{le="0.1"}

# Active connections
http_connections_active
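
For reference, counters and histograms like these are usually defined with the Prometheus Go client. Below is a minimal sketch, assuming nothing about Nivo's shared package — all variable and function names are illustrative:

import "github.com/prometheus/client_golang/prometheus"

var (
    // Mirrors http_requests_total, labeled by method, path, and status.
    httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "http_requests_total", Help: "HTTP requests."},
        []string{"method", "path", "status"},
    )
    // Mirrors http_request_duration_seconds as a latency histogram.
    httpDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency in seconds.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path"},
    )
)

func init() {
    prometheus.MustRegister(httpRequests, httpDuration)
}

// recordRequest would be called from middleware once per request.
func recordRequest(method, path, status string, seconds float64) {
    httpRequests.WithLabelValues(method, path, status).Inc()
    httpDuration.WithLabelValues(method, path).Observe(seconds)
}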

Go Runtime Metrics

# Goroutines
go_goroutines

# Memory
go_memstats_alloc_bytes
go_memstats_heap_inuse_bytes

# GC
go_gc_duration_seconds
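
These runtime metrics come for free from the Go client's built-in collectors; the default registry includes them automatically. A sketch of wiring them onto a custom registry, in case a service uses one (illustrative, not Nivo's actual setup):

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
)

func newRegistry() *prometheus.Registry {
    reg := prometheus.NewRegistry()
    reg.MustRegister(
        collectors.NewGoCollector(), // go_goroutines, go_memstats_*, go_gc_*
        collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
    )
    return reg
}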

Simulation Metrics

The simulation service exposes custom gauges:

# Simulation progress
nivo_simulation_users_created
nivo_simulation_kyc_verified
nivo_simulation_transactions_total
nivo_simulation_success_rate
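
A minimal sketch of how one of these gauges could be defined and updated — the metric name matches the list above, but the surrounding code is an assumption about the simulation service:

import "github.com/prometheus/client_golang/prometheus"

var usersCreated = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "nivo_simulation_users_created",
    Help: "Simulated users created so far.",
})

func init() { prometheus.MustRegister(usersCreated) }

// onUserCreated would be called by the simulation loop after each signup.
func onUserCreated() { usersCreated.Inc() }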

Alert Rules

Pre-configured alerts in monitoring/prometheus/alerts.yml:

| Alert         | Condition                      | Severity |
|---------------|--------------------------------|----------|
| ServiceDown   | Service not responding for 1m  | Critical |
| HighErrorRate | Error rate > 5% for 5m         | Warning  |
| HighLatency   | P99 > 2s for 5m                | Warning  |
| HighMemory    | Memory > 80% of limit          | Warning  |

Example Alert Rule

groups:
  - name: nivo-alerts
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service  is down"
          description: " has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          / sum(rate(http_requests_total[5m])) by (job) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on "

Adding Custom Metrics

Services use the shared metrics package:

import "nivo/shared/metrics"

// Initialize collector
collector := metrics.NewPrometheusCollector()

// In your handler, wrap with metrics middleware
r.Use(metrics.MetricsMiddleware(collector))

// Expose /metrics endpoint
r.Handle("/metrics", promhttp.Handler())

Custom Business Metrics

// Record a transaction
collector.RecordTransaction("transfer", "completed", 1500.00)

// Record wallet operation
collector.RecordWalletOperation("credit", "success")

// Record risk event
collector.RecordRiskEvent("blocked", "high_velocity")
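
The shared package's internals aren't shown in this document; as an assumption, RecordTransaction is probably backed by a labeled counter plus an amount histogram, roughly:

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical shape of the shared collector; field and metric names are guesses.
type PrometheusCollector struct {
    transactionsTotal *prometheus.CounterVec   // e.g. nivo_transactions_total{type, status}
    transactionAmount *prometheus.HistogramVec // e.g. nivo_transaction_amount{type}
}

func (c *PrometheusCollector) RecordTransaction(txType, status string, amount float64) {
    c.transactionsTotal.WithLabelValues(txType, status).Inc()   // count by type and outcome
    c.transactionAmount.WithLabelValues(txType).Observe(amount) // track amount distribution
}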

Prometheus Configuration

Scrape configuration in monitoring/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'gateway'
    static_configs:
      - targets: ['gateway:8000']
    metrics_path: /metrics

  - job_name: 'identity'
    static_configs:
      - targets: ['identity:8080']

  - job_name: 'ledger'
    static_configs:
      - targets: ['ledger:8081']

  # ... all 9 services

Troubleshooting

Metrics not appearing in Grafana

  1. Check Prometheus targets:
    curl http://localhost:9090/api/v1/targets
    
  2. Verify service is exposing metrics:
    curl http://localhost:8080/metrics
    
  3. Check Prometheus logs:
    docker logs nivo-prometheus
    

Dashboard not loading

  1. Verify Grafana is running:
    docker ps | grep grafana
    
  2. Check datasource configuration in Grafana UI

  3. Verify network connectivity:
    docker exec nivo-grafana wget -q -O - http://prometheus:9090/-/healthy
    

High memory alerts

  1. Check which service is consuming memory:
    • View “Memory Usage by Service” panel
  2. Investigate with pprof (if enabled):
    go tool pprof http://localhost:8080/debug/pprof/heap
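
    If pprof isn't already wired up, the standard way to enable it is the net/http/pprof side-effect import; a minimal sketch (the localhost:6060 port is an example, not Nivo's configuration):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve pprof on a separate local port, away from API traffic.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}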
    

Architecture

graph TB
    subgraph Docker Network
        subgraph Services
            S1[Service 1]
            S2[Service 2]
            S9[Service N]
        end

        subgraph Observability
            PROM[Prometheus<br/>:9090<br/>Internal Only]
            GRAF[Grafana<br/>:3000<br/>Internal Only]
        end

        subgraph Proxy
            NGINX[Nginx<br/>:443]
        end
    end

    S1 -->|/metrics| PROM
    S2 -->|/metrics| PROM
    S9 -->|/metrics| PROM

    PROM --> GRAF

    NGINX -->|grafana.nivomoney.com| GRAF

    User[User] -->|HTTPS| NGINX

Security notes:

  • Prometheus not exposed externally (internal network only)
  • Grafana accessible via nginx reverse proxy with HTTPS
  • Authentication required for Grafana access


Observability: If you can’t measure it, you can’t improve it.