Why operations work is protocol work
Protocol code is what miners, validators, and light clients see. Operations code is what node operators see — and in a decentralized network, node operators are the protocol. If running a node is difficult, observability is poor, or storage behavior is unpredictable, the network degrades in ways that no amount of protocol elegance can fix.
v0.18.0 addresses four concrete gaps.
OPS-1: Storage Profiles
The problem
Prior to v0.18.0, every Shell Chain node stored everything — full block bodies, complete state trie, all historical data — regardless of how it was used. A validator node and a light client proxy were configured identically. This made it impossible to operate a resource-constrained node correctly.
What shipped
Three declared profiles, set at node startup:
# node.toml
[storage]
profile = "full" # "archive" | "full" | "light"
| Profile | Block bodies | State trie | History depth |
|---|---|---|---|
| archive | All | Complete | Unlimited |
| full | Recent only | Recent only | Configurable (default 100k blocks) |
| light | Headers only | None (use witness proofs) | N/A |
On startup, the node validates that on-disk data is consistent with the
declared profile. If you declare full but the disk contains archive data,
the node starts normally (archive is a superset). If you declare archive
but the disk has already been pruned, startup fails with a clear error.
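The startup rule can be read as a superset check: the disk must hold at least what the declared profile needs. The sketch below illustrates that rule only and is not the node's actual implementation. The names `check_profile`, `ProfileMismatch`, and `RETENTION` are hypothetical, and only the archive and full cases spelled out above are modelled.

```python
# Hypothetical sketch of the startup consistency rule; not shell-node's real code.


class ProfileMismatch(Exception):
    """Raised when on-disk data cannot satisfy the declared profile."""


# Profiles ranked by how much data they retain. Only the two cases the text
# describes explicitly are modelled here.
RETENTION = {"full": 1, "archive": 2}


def check_profile(declared: str, on_disk: str) -> None:
    """Return normally if the disk holds at least what `declared` needs."""
    if declared not in RETENTION or on_disk not in RETENTION:
        raise ValueError(f"unmodelled profile: {declared!r} / {on_disk!r}")
    if RETENTION[on_disk] < RETENTION[declared]:
        # For example: declared "archive" but the disk has already been pruned.
        raise ProfileMismatch(
            f"declared {declared!r} but disk only holds {on_disk!r} data"
        )


check_profile("full", "archive")  # accepted: archive is a superset of full
try:
    check_profile("archive", "full")  # pruned disk cannot serve archive
except ProfileMismatch as err:
    print("startup would fail:", err)
```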
The shell_getStorageProfile RPC returns the current profile as a string.
Wallets and explorers can surface this to users.
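A wallet or explorer could query the profile with an ordinary JSON-RPC 2.0 call. The following is a minimal sketch, assuming the method takes no parameters and that the result is one of the three profile names. The helper names are hypothetical, and the reply is hand-written so the example runs without a node.

```python
import json


def build_profile_request(request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 call to shell_getStorageProfile."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "shell_getStorageProfile",
        "params": [],  # assumption: the method takes no parameters
    })


def parse_profile_response(body: str) -> str:
    """Extract the profile string, raising if the node returned an error."""
    reply = json.loads(body)
    if "error" in reply:
        err = reply["error"]
        raise RuntimeError(f"RPC error {err.get('code')}: {err.get('message')}")
    profile = reply["result"]
    if profile not in ("archive", "full", "light"):
        raise ValueError(f"unexpected profile: {profile!r}")
    return profile


# Hand-written reply for illustration; a real one would come from the node.
sample_reply = '{"jsonrpc": "2.0", "id": 1, "result": "full"}'
print(parse_profile_response(sample_reply))
```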
Runtime switching
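Switching between archive and full is supported by changing the declared
profile and restarting. Moving from archive to full prunes incrementally on
the next startup; moving from full to archive is subject to the startup
check described above, which rejects a disk that has already been pruned.
Downgrading to light requires a full re-sync from a checkpoint.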
OPS-2: Witness Endpoint Hardening
The problem
shell_getWitness returned 501 Not Implemented on full nodes and was
simply absent from light client API surfaces. This made it impossible to
build compliant light client verifiers against production nodes.
What shipped
shell_getWitness(block_hash) now returns:
{
  "block_hash": "0x...",
  "state_root": "0x...",
  "merkle_proof": ["0x...", "0x..."],
  "pq_witness": {
    "validators": ["0x..."],
    "signatures": ["0x..."],
    "aggregate_proof": "0x..."
  }
}
This is served on all node types:
- Archive / full: proof is constructed from local trie state.
- Light: proof is fetched from peers and cached.
The pq_witness field contains the aggregated ML-DSA-65 signatures from
the validator set for that block. A light client can verify finality
independently, without trusting the responding node.
Unit tests in crates/light-client/ cover the full verification path.
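Before attempting cryptographic verification, a client may want to confirm that a response has the shape shown above. The sketch below checks structure only: it does not verify the Merkle proof or the ML-DSA-65 signatures. It assumes every value is a `0x`-prefixed hex string, as the sample suggests, and the function names are hypothetical.

```python
import re

# Assumption: values are 0x-prefixed hex strings, as in the sample response.
HEX = re.compile(r"0x[0-9a-fA-F]*")


def is_hex(value) -> bool:
    return isinstance(value, str) and HEX.fullmatch(value) is not None


def witness_shape_problems(witness: dict) -> list:
    """Return human-readable problems; an empty list means the shape is OK."""
    problems = []
    for key in ("block_hash", "state_root"):
        if not is_hex(witness.get(key)):
            problems.append(f"{key} is missing or not hex")
    proof = witness.get("merkle_proof")
    if not isinstance(proof, list) or not all(is_hex(p) for p in proof):
        problems.append("merkle_proof must be a list of hex strings")
    pq = witness.get("pq_witness")
    if not isinstance(pq, dict):
        problems.append("pq_witness is missing")
        return problems
    for key in ("validators", "signatures"):
        items = pq.get(key)
        if not isinstance(items, list) or not all(is_hex(i) for i in items):
            problems.append(f"pq_witness.{key} must be a list of hex strings")
    if not is_hex(pq.get("aggregate_proof")):
        problems.append("pq_witness.aggregate_proof is missing or not hex")
    return problems


# Made-up values for illustration only.
example = {
    "block_hash": "0xab",
    "state_root": "0xcd",
    "merkle_proof": ["0x01", "0x02"],
    "pq_witness": {
        "validators": ["0x0a"],
        "signatures": ["0x0b"],
        "aggregate_proof": "0xff",
    },
}
print(witness_shape_problems(example))
```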
OPS-3: Observability
Prometheus metrics
The /metrics endpoint (default: same port as RPC, configurable via
SHELL_METRICS_PORT) exposes:
# HELP shell_mempool_size Current number of transactions in the mempool
# TYPE shell_mempool_size gauge
shell_mempool_size 42
# HELP shell_blocks_per_second Block production rate (1-minute window)
# TYPE shell_blocks_per_second gauge
shell_blocks_per_second 0.5
# HELP shell_peer_count Connected peers
# TYPE shell_peer_count gauge
shell_peer_count 8
# HELP shell_rpc_latency_seconds RPC handler latency
# TYPE shell_rpc_latency_seconds histogram
shell_rpc_latency_seconds_bucket{method="eth_getBlockByNumber",le="0.01"} 847
# HELP shell_libp2p_errors_total libp2p protocol errors
# TYPE shell_libp2p_errors_total counter
shell_libp2p_errors_total{error="connection_refused"} 3
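For quick scripting without a Prometheus server, the text output can be parsed directly. This is a minimal sketch of a reader for the sample lines above, not a complete parser for the exposition format: it ignores optional timestamps, and the function name `parse_metrics` is hypothetical.

```python
SAMPLE = """\
# HELP shell_mempool_size Current number of transactions in the mempool
# TYPE shell_mempool_size gauge
shell_mempool_size 42
# HELP shell_peer_count Connected peers
# TYPE shell_peer_count gauge
shell_peer_count 8
shell_rpc_latency_seconds_bucket{method="eth_getBlockByNumber",le="0.01"} 847
shell_libp2p_errors_total{error="connection_refused"} 3
"""


def parse_metrics(text: str) -> dict:
    """Map each sample (name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token on the line.
        # Optional trailing timestamps are not handled in this sketch.
        key, _, value = line.rpartition(" ")
        samples[key.strip()] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
# The threshold of 3 peers is an arbitrary example, not a recommendation.
if metrics["shell_peer_count"] < 3:
    print("warning: low peer count")
```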
Health endpoints
GET /healthz → 200 OK if process is alive
GET /readyz → 200 OK if synced within 2 blocks of network tip
These are compatible with Kubernetes liveness / readiness probes.
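Outside Kubernetes, a deploy script can poll the readiness endpoint itself. The sketch below is one way to do that with the standard library. The helper names, the retry defaults, and the URL in the trailing comment are assumptions, and the status source is passed in as a callable so the loop can be exercised without a running node.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for a GET, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx answers still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(get_status, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `get_status()` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if get_status() == 200:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)
    return False


# Against a real node (host and port are assumptions):
# wait_until_ready(lambda: http_status("http://127.0.0.1:8545/readyz"))
```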
Tracing
Structured spans cover the full critical path. Set SHELL_LOG=trace (or
debug, info) to control verbosity. Spans include:
- rpc.handle: per-method, includes full request/response size
- execution.batch: covers full batch tx execution
- consensus.tick: one span per consensus round
A starter Grafana dashboard JSON is in docs/observability.md.
OPS-4: RPC Stability & Documentation
Unified error codes
crates/rpc/src/error.rs is now the single source of truth for all
-32xxx error codes. Previously these were scattered across handlers with
occasional collisions. The table is:
| Code | Name | Meaning |
|---|---|---|
| -32700 | ParseError | Invalid JSON |
| -32600 | InvalidRequest | Not a valid request object |
| -32601 | MethodNotFound | Method does not exist |
| -32602 | InvalidParams | Invalid method parameters |
| -32603 | InternalError | Internal JSON-RPC error |
| -32001 | MempoolFull | Mempool at capacity |
| -32002 | NonceTooLow | Nonce already used |
| -32003 | InsufficientFunds | Balance too low |
| -32004 | InvalidSignature | PQ signature verification failed |
| -32005 | PaymasterRejected | Paymaster admission check failed |
| -32006 | BatchTooLarge | Exceeds AA_MAX_INNER_CALLS (16) |
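Client code can mirror the table to turn numeric codes into readable names. The sketch below is a hypothetical client-side helper, not generated from `crates/rpc/src/error.rs`. The split between standard and server-defined codes follows the JSON-RPC 2.0 convention that -32000 to -32099 is reserved for implementation-defined server errors.

```python
# Codes and names copied from the table above.
SHELL_RPC_ERRORS = {
    -32700: "ParseError",
    -32600: "InvalidRequest",
    -32601: "MethodNotFound",
    -32602: "InvalidParams",
    -32603: "InternalError",
    -32001: "MempoolFull",
    -32002: "NonceTooLow",
    -32003: "InsufficientFunds",
    -32004: "InvalidSignature",
    -32005: "PaymasterRejected",
    -32006: "BatchTooLarge",
}


def error_name(code: int) -> str:
    """Look up the symbolic name for a code, or 'Unknown' if not listed."""
    return SHELL_RPC_ERRORS.get(code, "Unknown")


def is_server_defined(code: int) -> bool:
    """True for the JSON-RPC 2.0 range reserved for server-defined errors."""
    return -32099 <= code <= -32000


# Hand-written error object for illustration.
error = {"code": -32002, "message": "nonce too low"}
kind = "server-defined" if is_server_defined(error["code"]) else "standard"
print(f"{error_name(error['code'])} ({kind})")
```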
Auto-generated RPC reference
docs/rpc-reference.md is generated from annotations on Rust handler
functions via a cargo xtask rpc-docs command. It covers all eth_* and
shell_* methods with parameter types, return types, and example
request/response pairs.
Batch JSON-RPC (array of request objects in a single POST body) is now
formally tested for both namespaces in crates/rpc/tests/batch_rpc.rs.
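A batch call is a JSON array of request objects, and JSON-RPC 2.0 allows the responses to come back in any order, so clients should pair them by `id`. The following sketch shows that pairing with hand-written responses. The helper names are hypothetical, and it assumes `shell_getStorageProfile` takes no parameters.

```python
import json


def build_batch(calls):
    """Turn [(method, params), ...] into request objects with ids 1..n."""
    return [
        {"jsonrpc": "2.0", "id": i, "method": method, "params": params}
        for i, (method, params) in enumerate(calls, start=1)
    ]


def match_responses(requests, responses):
    """Pair each request's method with its response, matched by id.

    A request with no matching response is paired with None.
    """
    by_id = {resp["id"]: resp for resp in responses}
    return [(req["method"], by_id.get(req["id"])) for req in requests]


batch = build_batch([
    ("eth_blockNumber", []),
    ("shell_getStorageProfile", []),  # assumption: no parameters
])
body = json.dumps(batch)  # this string would be the single POST body

# Hand-written responses, deliberately out of order.
responses = [
    {"jsonrpc": "2.0", "id": 2, "result": "full"},
    {"jsonrpc": "2.0", "id": 1, "result": "0x10"},
]
for method, resp in match_responses(batch, responses):
    print(method, "->", resp["result"])
```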
How to upgrade
# Pull the new binary
cargo install --git https://github.com/ShellDAO/shell-chain shell-node
# Add storage profile to your config
# (printf is used because plain echo does not expand \n in every shell)
printf '[storage]\nprofile = "full"\n' >> node.toml
# Expose metrics (optional)
export SHELL_METRICS_PORT=9100
# Start
shell-node --config node.toml
First startup will run a one-time profile consistency check. On most full nodes this completes in under a second.
What's next
OPS items planned for v0.19.0:
- Automated pruning scheduler with configurable retention windows
- P2P metrics (per-peer bandwidth, gossip fan-out efficiency)
- OpenTelemetry export alongside Prometheus