Operations Hardening in v0.18.0: Storage Profiles, Witness Endpoints, Observability

Why operations work is protocol work

Protocol code is what miners, validators, and light clients see. Operations code is what node operators see — and in a decentralized network, node operators are the protocol. If running a node is difficult, observability is poor, or storage behavior is unpredictable, the network degrades in ways that no amount of protocol elegance can fix.

v0.18.0 addresses four concrete gaps.

OPS-1: Storage Profiles

The problem

Prior to v0.18.0, every Shell Chain node stored everything — full block bodies, complete state trie, all historical data — regardless of how it was used. A validator node and a light client proxy were configured identically. This made it impossible to operate a resource-constrained node correctly.

What shipped

Three declared profiles, set at node startup:

# node.toml
[storage]
profile = "full"   # "archive" | "full" | "light"

Profile	Block bodies	State trie	History depth
`archive`	All	Complete	Unlimited
`full`	Recent only	Recent only	Configurable (default 100k blocks)
`light`	Headers only	None (use witness proofs)	N/A

On startup, the node validates that on-disk data is consistent with the declared profile. If you declare full but the disk contains archive data, the node starts normally (archive is a superset). If you declare archive but the disk has already been pruned, startup fails with a clear error.
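The startup rules above can be sketched as a small decision function. This is an illustration only, not the node's actual code: `check_profile`, `ProfileMismatch`, and the idea that the node can summarize its data directory as a single `on_disk` label are assumptions made for the example. Only two cases come from the text above (archive data under a full declaration, pruned data under an archive declaration); every other mismatch is rejected conservatively here.

```python
class ProfileMismatch(Exception):
    """Raised when the declared profile cannot be served from on-disk data."""


def check_profile(declared: str, on_disk: str) -> None:
    """Return normally if the declared profile is compatible with the disk.

    `on_disk` is a hypothetical label describing what the data directory holds.
    """
    known = {"archive", "full", "light"}
    if declared not in known or on_disk not in known:
        raise ValueError(f"unknown profile: {declared!r} / {on_disk!r}")

    if declared == on_disk:
        return  # exact match, nothing to reconcile

    if declared == "full" and on_disk == "archive":
        return  # archive data is a superset of what a full node needs

    if declared == "archive":
        # Pruned history cannot be recovered locally, so refuse to start.
        raise ProfileMismatch(
            f"declared 'archive' but disk holds '{on_disk}' data"
        )

    # Assumption: any remaining combination is rejected rather than guessed at.
    raise ProfileMismatch(
        f"declared {declared!r} but disk holds {on_disk!r} data"
    )
```

Calling `check_profile("full", "archive")` returns normally, while `check_profile("archive", "full")` raises `ProfileMismatch`, mirroring the two cases described above.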

The shell_getStorageProfile RPC returns the current profile as a string. Wallets and explorers can surface this to users.
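For a wallet or explorer, querying the profile could look roughly like the sketch below. It only builds the request body and parses a reply, leaving transport to the caller. The helper names are hypothetical, and the empty `params` list and the plain-string `result` are assumptions based on the description above; check docs/rpc-reference.md for the exact shape.

```python
import json


def storage_profile_request(request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body for shell_getStorageProfile."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "shell_getStorageProfile",
        "params": [],  # assumption: the method takes no parameters
    })


def parse_storage_profile(body: str) -> str:
    """Extract the profile string from a JSON-RPC reply body."""
    reply = json.loads(body)
    if "error" in reply:
        err = reply["error"]
        raise RuntimeError(f"RPC error {err.get('code')}: {err.get('message')}")
    profile = reply["result"]
    if profile not in ("archive", "full", "light"):
        raise ValueError(f"unexpected profile: {profile!r}")
    return profile
```

Given the reply `{"jsonrpc": "2.0", "id": 1, "result": "full"}`, `parse_storage_profile` returns `"full"`.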

Runtime switching

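Switching between archive and full is supported: change the profile in node.toml and restart. Going from archive to full, pruning happens incrementally after the next startup. Going from full to archive is subject to the startup check described above, so it fails if the disk has already been pruned. Downgrading to light requires a full re-sync from a checkpoint.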

OPS-2: Witness Endpoint Hardening

The problem

shell_getWitness returned 501 Not Implemented on full nodes and was simply absent from light client API surfaces. This made it impossible to build compliant light client verifiers against production nodes.

What shipped

shell_getWitness(block_hash) now returns:

{
  "block_hash": "0x...",
  "state_root": "0x...",
  "merkle_proof": ["0x...", "0x..."],
  "pq_witness": {
    "validators": ["0x..."],
    "signatures": ["0x..."],
    "aggregate_proof": "0x..."
  }
}

This is served on all node types:

Archive / full: proof is constructed from local trie state.
Light: proof is fetched from peers and cached.

The pq_witness field contains the aggregated ML-DSA-65 signatures from the validator set for that block. A light client can verify finality independently, without trusting the responding node.
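Before any cryptographic check, a verifier has to confirm that the response has the shape shown above. The sketch below does only that structural step. `Witness` and `parse_witness` are hypothetical names, and the rule that every value is a non-empty 0x-prefixed hex string is an assumption drawn from the sample response. Merkle proof and ML-DSA-65 verification are deliberately left out, since they need the chain's hash function and a signature library that this example does not assume.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List

_HEX = re.compile(r"0x[0-9a-fA-F]+")


def _hex(value: Any, field: str) -> str:
    """Return value if it is a 0x-prefixed hex string, else raise ValueError."""
    if not isinstance(value, str) or _HEX.fullmatch(value) is None:
        raise ValueError(f"{field}: expected a 0x-prefixed hex string")
    return value


def _hex_list(value: Any, field: str) -> List[str]:
    """Validate a list of hex strings, reporting the failing index."""
    if not isinstance(value, list):
        raise ValueError(f"{field}: expected a list")
    return [_hex(item, f"{field}[{i}]") for i, item in enumerate(value)]


@dataclass(frozen=True)
class Witness:
    block_hash: str
    state_root: str
    merkle_proof: List[str]
    validators: List[str]
    signatures: List[str]
    aggregate_proof: str


def parse_witness(obj: Dict[str, Any]) -> Witness:
    """Check field presence and encoding only; no proof is verified here."""
    try:
        pq = obj["pq_witness"]
        if not isinstance(pq, dict):
            raise ValueError("pq_witness: expected an object")
        return Witness(
            block_hash=_hex(obj["block_hash"], "block_hash"),
            state_root=_hex(obj["state_root"], "state_root"),
            merkle_proof=_hex_list(obj["merkle_proof"], "merkle_proof"),
            validators=_hex_list(pq["validators"], "pq_witness.validators"),
            signatures=_hex_list(pq["signatures"], "pq_witness.signatures"),
            aggregate_proof=_hex(pq["aggregate_proof"], "pq_witness.aggregate_proof"),
        )
    except KeyError as missing:
        raise ValueError(f"missing field: {missing}") from None
```

Rejecting malformed responses early keeps the later proof checks from operating on partially populated data.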

Unit tests in crates/light-client/ cover the full verification path.

OPS-3: Observability

Prometheus metrics

The /metrics endpoint (default: same port as RPC, configurable via SHELL_METRICS_PORT) exposes:

# HELP shell_mempool_size Current number of transactions in the mempool
# TYPE shell_mempool_size gauge
shell_mempool_size 42

# HELP shell_blocks_per_second Block production rate (1-minute window)
# TYPE shell_blocks_per_second gauge
shell_blocks_per_second 0.5

# HELP shell_peer_count Connected peers
# TYPE shell_peer_count gauge
shell_peer_count 8

# HELP shell_rpc_latency_seconds RPC handler latency
# TYPE shell_rpc_latency_seconds histogram
shell_rpc_latency_seconds_bucket{method="eth_getBlockByNumber",le="0.01"} 847

# HELP shell_libp2p_errors_total libp2p protocol errors
# TYPE shell_libp2p_errors_total counter
shell_libp2p_errors_total{error="connection_refused"} 3
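For quick scripts that do not run a full Prometheus server, the text format above is simple enough to read directly. The sketch below is a minimal parser, not a complete implementation of the exposition format: `parse_metrics` is a hypothetical helper, it keeps the label set as part of the key, and it assumes samples carry no trailing timestamp.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample line ('name{labels} value') to its float value."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # malformed line with no value, ignore it
        samples[series] = float(value)
    return samples
```

With the sample output above, `parse_metrics(text)["shell_peer_count"]` evaluates to `8.0`, which is enough to drive a simple low-peer-count alert.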

Health endpoints

GET /healthz  → 200 OK if process is alive
GET /readyz   → 200 OK if synced within 2 blocks of network tip

These are compatible with Kubernetes liveness / readiness probes.
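Outside Kubernetes, the same two endpoints can drive a simple external monitor. The sketch below is one way to combine them; `http_status` and `node_state` are hypothetical helpers, and the state names "down", "syncing", and "ready" are labels chosen for this example rather than anything the node reports.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def node_state(healthz: Optional[int], readyz: Optional[int]) -> str:
    """Combine the two probe results into a single coarse state."""
    if healthz != 200:
        return "down"  # process not alive or not reachable
    if readyz != 200:
        return "syncing"  # alive, but not within 2 blocks of the tip
    return "ready"
```

A monitor would call `http_status` once for `/healthz` and once for `/readyz` on the node's base URL (which depends on your deployment) and pass both results to `node_state`.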

Tracing

Structured spans cover the full critical path. Set SHELL_LOG=trace (or debug, info) to control verbosity. Spans include:

rpc.handle — per-method, includes full request/response size
execution.batch — covers full batch tx execution
consensus.tick — one span per consensus round

A starter Grafana dashboard JSON is in docs/observability.md.

OPS-4: RPC Stability & Documentation

Unified error codes

crates/rpc/src/error.rs is now the single source of truth for all -32xxx error codes. Previously these were scattered across handlers with occasional collisions. The table is:

Code	Name	Meaning
-32700	`ParseError`	Invalid JSON
-32600	`InvalidRequest`	Not a valid request object
-32601	`MethodNotFound`	Method does not exist
-32602	`InvalidParams`	Invalid method parameters
-32603	`InternalError`	Internal JSON-RPC error
-32001	`MempoolFull`	Mempool at capacity
-32002	`NonceTooLow`	Nonce already used
-32003	`InsufficientFunds`	Balance too low
-32004	`InvalidSignature`	PQ signature verification failed
-32005	`PaymasterRejected`	Paymaster admission check failed
-32006	`BatchTooLarge`	Exceeds AA_MAX_INNER_CALLS (16)
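Client code can mirror this table so that errors are handled by name rather than by magic number. The sketch below copies the codes and names from the table; the `RpcErrorCode` enum and `describe_error` helper are hypothetical client-side names, not part of crates/rpc.

```python
from enum import IntEnum
from typing import Any, Dict


class RpcErrorCode(IntEnum):
    # Standard JSON-RPC 2.0 codes
    ParseError = -32700
    InvalidRequest = -32600
    MethodNotFound = -32601
    InvalidParams = -32602
    InternalError = -32603
    # Node-specific codes from the table above
    MempoolFull = -32001
    NonceTooLow = -32002
    InsufficientFunds = -32003
    InvalidSignature = -32004
    PaymasterRejected = -32005
    BatchTooLarge = -32006


def describe_error(error: Dict[str, Any]) -> str:
    """Render a JSON-RPC error object as 'Name (code): message'."""
    code = error.get("code")
    try:
        name = RpcErrorCode(code).name
    except ValueError:
        name = "Unknown"  # code not listed in the table
    return f"{name} ({code}): {error.get('message', '')}"
```

For example, `describe_error({"code": -32002, "message": "nonce too low"})` returns `"NonceTooLow (-32002): nonce too low"`.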

Auto-generated RPC reference

docs/rpc-reference.md is generated from annotations on Rust handler functions via a cargo xtask rpc-docs command. It covers all eth_* and shell_* methods with parameter types, return types, and example request/response pairs.

Batch JSON-RPC (array of request objects in a single POST body) is now formally tested for both namespaces in crates/rpc/tests/batch_rpc.rs.
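On the client side, a batch is just a JSON array of request objects, and JSON-RPC 2.0 allows the server to return the responses in any order, so replies should be matched by `id`. The sketch below shows both halves. `build_batch` and `replies_in_order` are hypothetical helpers, and the empty parameter list for `shell_getStorageProfile` is an assumption.

```python
import json
from typing import Any, Dict, List, Tuple


def build_batch(calls: List[Tuple[str, list]]) -> str:
    """Encode (method, params) pairs as one batch body with ids 1..n."""
    return json.dumps([
        {"jsonrpc": "2.0", "id": i, "method": method, "params": params}
        for i, (method, params) in enumerate(calls, start=1)
    ])


def replies_in_order(n_calls: int, body: str) -> List[Dict[str, Any]]:
    """Reorder a batch response to match the request order, keyed by id."""
    by_id = {reply["id"]: reply for reply in json.loads(body)}
    return [by_id[i] for i in range(1, n_calls + 1)]


calls = [
    ("eth_getBlockByNumber", ["latest", False]),
    ("shell_getStorageProfile", []),  # assumption: no parameters
]
batch_body = build_batch(calls)  # POST this string as the request body
```

Each element returned by `replies_in_order` carries either a `result` or an `error` member, so one failed call does not hide the results of the others.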

How to upgrade

# Build and install the new binary from source
cargo install --git https://github.com/ShellDAO/shell-chain shell-node

# Add storage profile to your config
printf '\n[storage]\nprofile = "full"\n' >> node.toml

# Expose metrics (optional)
export SHELL_METRICS_PORT=9100

# Start
shell-node --config node.toml

First startup will run a one-time profile consistency check. On most full nodes this completes in under a second.

What's next

OPS items planned for v0.19.0:

Automated pruning scheduler with configurable retention windows
P2P metrics (per-peer bandwidth, gossip fan-out efficiency)
OpenTelemetry export alongside Prometheus