Real Qwen3 inference on infrastructure we own. A signed receipt chain across three tiers, written to our own audit ledger. A live tamper test that caught a single mutated byte. Built and torn down in one work session, with the cost broken out below.
The HALO project is a sovereign hybrid AI architecture: an embodied sensor tier, a local routing tier on a phone, and an inference substrate running open-weights MoE inference on infrastructure the tenant controls. The previous post laid out the wire protocol, the receipt schema, and the work packages. The job for this session was simpler: actually run it end to end, with real inference and signed receipts, on real AWS.
Scope for the Day-1 run:
Qwen3-30B-A3B-Instruct loaded onto an r7iz.4xlarge in our Howdify Lab VPC, generating real tokens for a real prompt
howdify-receipts DynamoDB table

Below is the literal log from the working run, with the loading-progress noise trimmed.
=== HALO end-to-end demo at 2026-05-27T22:19:09 UTC ===
session_id = 323fbf8f-0c73-481f-8b7b-03fc7a325abd
tenant_id = halo-demo-tenant
model = /data/models/Qwen3-30B-A3B

[0/5] derive HKDF keys
  TIER_KEY[0..2] derived (32B each)

[1/5] TIER 0: simulated sensor capture, audio chunk
  receipt_id=d80e9016-..., hmac=3e66a9b0...
  ✓ signed

[2/5] TIER 1: routing decision, escalate to substrate
  receipt_id=803e334e-..., hmac=02a04a2a...
  ✓ signed, parent-pointer to T0

[3/5] TIER 2: load Qwen3-30B-A3B
  model loaded in 9.6s
  prompt: 'What is sovereign inference, in one sentence?' (9 tokens)
  generated 32 tokens in 5.1s = 6.260 tok/s
  decoded: ' Sovereign inference is the process by which a sovereign entity, such as a nation-state, draws logical conclusions or makes decisions based on its own authority, independent of'
  receipt_id=adbb6cb9-..., hmac=712fc224...
  ✓ signed, parent-pointer to T1

[4/5] HALO-VERIFY: walk chain, check HMACs + parent pointers
  CLEAN PATH RESULT: PASS - 3 receipts verified, chain intact

[5/5] TAMPER TEST: flip one byte in Tier 2 payload, re-verify
  mutated event_payload.output_preview on receipt adbb6cb9-...
  TAMPERED PATH RESULT: CORRECTLY DETECTED
  HMAC FAIL at receipt adbb6cb9-... (tier 2, kind halo_tier_response)

=== demo done at 2026-05-27T22:29:33 UTC ===
3 receipts in howdify-receipts (org_id=halo-demo-tenant)
Notes on the inference number. Qwen3-30B-A3B is a 30B-parameter Mixture of Experts model with 3B active parameters per token. We ran it in plain Hugging Face transformers on CPU, with no specialized inference framework, on a single r7iz.4xlarge instance (16 vCPU, Intel Xeon Gold 6455B with AMX). 6.26 tokens per second is the unaccelerated baseline: the next phase wires the KTransformers AMX kernels into the inference path, which should land the substrate in the 30+ tok/s range we projected in the previous post.
Three receipts were written to our existing audit ledger, parent-pointer linked, ready for any future audit. The session id 323fbf8f-0c73-481f-8b7b-03fc7a325abd can be replayed at any time by the verifier.
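For readers who want to see the shape of the chain, here is a minimal sketch of per-tier key derivation and parent-pointer signing using only the Python standard library. It is an illustration under stated assumptions, not the HALO implementation: the all-zero root secret, the use of the session id as HKDF salt, the `halo-tier-N` info labels, the `parent_hash` field name, and the first two `kind` values are all hypothetical. Only `receipt_id`, `event_payload`, `output_preview`, and `halo_tier_response` appear in the log above.

```python
import hashlib
import hmac
import json
import uuid


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256 as described in RFC 5869: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def canonical(obj) -> bytes:
    """Deterministic JSON bytes: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_receipt(tier_key: bytes, tier: int, kind: str, payload: dict, parent=None) -> dict:
    """Build a receipt that commits to its parent's SHA-256 hash, then HMAC it."""
    body = {
        "receipt_id": str(uuid.uuid4()),
        "tier": tier,
        "kind": kind,
        "event_payload": payload,
        # Hypothetical field name: hash of the full parent receipt, or None at Tier 0.
        "parent_hash": hashlib.sha256(canonical(parent)).hexdigest() if parent else None,
    }
    mac = hmac.new(tier_key, canonical(body), hashlib.sha256).hexdigest()
    return dict(body, hmac=mac)


# Assumptions for illustration only: placeholder root secret, session id as salt.
root_secret = bytes(32)
session_id = "00000000-0000-0000-0000-000000000000"
tier_keys = [
    hkdf_sha256(root_secret, session_id.encode(), f"halo-tier-{t}".encode())
    for t in range(3)
]

chain = []
events = [
    (0, "sensor_capture", {"chunk": "audio"}),            # hypothetical kind
    (1, "routing_decision", {"route": "substrate"}),      # hypothetical kind
    (2, "halo_tier_response", {"output_preview": " Sovereign inference is"}),
]
for tier, kind, payload in events:
    parent = chain[-1] if chain else None
    chain.append(sign_receipt(tier_keys[tier], tier, kind, payload, parent))

for receipt in chain:
    print(receipt["tier"], receipt["receipt_id"], receipt["hmac"][:8])
```

Because each tier signs with its own derived key, a compromise of one tier's key does not let an attacker forge receipts for another tier, and the parent hash ties each receipt to the exact bytes of the one before it.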
Logging a chain is one thing. Detecting active tampering is what makes the chain matter.
The test was straightforward. After the verifier reported PASS on the clean chain, we read the Tier 2 receipt back from DynamoDB, mutated a single string inside its event payload, and wrote it back. Then we re-ran the verifier on the same session id.
The verifier deterministically identified the failure: an HMAC mismatch at receipt adbb6cb9-..., on Tier 2, of kind halo_tier_response.
No false alarms on the upstream tiers. No silent acceptance. The mutation broke verification at the next downstream check, exactly where the protocol said it should.
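The clean-path and tamper-path checks can be sketched as a single chain walk. This is a simplified stand-in for the verifier, not its actual code: the demo keys, the `demo-N` receipt ids, the `parent_hash` field, and the first two `kind` values are hypothetical, and the receipts live in a Python list rather than DynamoDB.

```python
import copy
import hashlib
import hmac
import json


def canonical(obj) -> bytes:
    """Deterministic JSON bytes used as signing and hashing input."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(receipt: dict) -> str:
    return hashlib.sha256(canonical(receipt)).hexdigest()


def sign(key: bytes, body: dict) -> dict:
    return dict(body, hmac=hmac.new(key, canonical(body), hashlib.sha256).hexdigest())


def verify_chain(receipts, tier_keys):
    """Walk the chain in order; return (ok, message) at the first failure."""
    parent = None
    for r in receipts:
        body = {k: v for k, v in r.items() if k != "hmac"}
        expected = hmac.new(tier_keys[r["tier"]], canonical(body), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, r["hmac"]):
            return False, (f"HMAC FAIL at receipt {r['receipt_id']} "
                           f"(tier {r['tier']}, kind {r['kind']})")
        if r["parent_hash"] != (digest(parent) if parent else None):
            return False, f"PARENT FAIL at receipt {r['receipt_id']}"
        parent = r
    return True, f"PASS - {len(receipts)} receipts verified, chain intact"


# Stand-in keys for the sketch; a real run would use the derived tier keys.
tier_keys = [hashlib.sha256(f"demo-tier-{t}".encode()).digest() for t in range(3)]

chain = []
for tier, kind, payload in [
    (0, "sensor_capture", {"chunk": "audio"}),          # hypothetical kind
    (1, "routing_decision", {"route": "substrate"}),    # hypothetical kind
    (2, "halo_tier_response", {"output_preview": " Sovereign inference is"}),
]:
    body = {
        "receipt_id": f"demo-{tier}",
        "tier": tier,
        "kind": kind,
        "event_payload": payload,
        "parent_hash": digest(chain[-1]) if chain else None,
    }
    chain.append(sign(tier_keys[tier], body))

clean_ok, clean_msg = verify_chain(chain, tier_keys)

# Tamper path: change one character in the Tier 2 payload, leave the HMAC alone.
tampered = copy.deepcopy(chain)
tampered[2]["event_payload"]["output_preview"] = " Sovereign inference iz"
tampered_ok, tampered_msg = verify_chain(tampered, tier_keys)

print(clean_msg)
print(tampered_msg)
```

The upstream receipts still verify in the tampered copy because their bytes are unchanged; only the receipt whose payload no longer matches its HMAC is reported.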
This is the bit that compounds. Inference models commoditize on a 12-month clock. Hardware commoditizes faster. The cross-tier audit chain is the architectural element that survives those cycles and gives regulated industries a reason to deploy sovereign infrastructure instead of accepting vendor-cloud opacity.
| Layer | Status | Note |
|---|---|---|
| VPC + private subnet provisioning | ✓ working | Howdify Lab VPC, in-band via VPCE for DynamoDB |
| HKDF key hierarchy (per-tier subkeys from root) | ✓ working | Three tier keys derived per session, byte-stable |
| Per-event HMAC-SHA256 signing | ✓ working | Tier 0, 1, 2 all signing with their own keys |
| Parent-pointer chain across tiers | ✓ working | Each receipt commits to parent's SHA-256 hash |
| Real LLM inference on sovereign substrate | ✓ working | Qwen3-30B-A3B, 6.26 tok/s on plain transformers CPU |
| Receipts persisted to howdify-receipts DDB | ✓ working | 3 receipts written, all retrievable |
| Canonical-bytes round-trip across DDB | ✓ working | Numeric normalization fix lands the read-side verify |
| Tamper detection via re-verify | ✓ working | Single mutated byte caught deterministically |
| AMX-accelerated MoE inference path | ~ deferred | Kernel layer validated, end-to-end wiring is next-phase |
| Physical hardware (Halo glasses) integration | ~ pending | Waits on the device shipping |
Everything in the upper rows is in place. The two lower rows are the gap between Lab POC and the next phase.
Lab POC integration always reveals the assumptions that the spec didn't make explicit. Five things broke before the final pass. Each fix took 1-5 minutes; together they were the difference between "we have a working stack" and "we have a working chain."
kt bench harness spent significant time trying to pin threads to cores 8-47 (which don't exist on r7iz.4xlarge). For the next phase we will either upsize the instance to one with the expected core count or write a custom benchmark targeted at the 8-core shape.

Issue 5 is the one worth remembering. The spec called for "canonical JSON of the receipt payload" as the signing input. Spec-correct, implementation-broken: the canonicalization function has to match the type system at both write and read time, not just at write. This is the kind of detail that only surfaces when you actually run the full round-trip against the real database.
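To make the round-trip problem concrete, here is a sketch of one way such a mismatch can arise and be normalized away. It assumes the failure mode is numeric: boto3's DynamoDB deserializer returns numbers as `decimal.Decimal`, so a payload signed with plain `int` and `float` values no longer serializes to the same bytes on read. The exact fix used in the run is not shown in this post, and `canonical_bytes` and its rendering rules are hypothetical.

```python
import json
from decimal import Decimal


def _normalize(value):
    """Map int, float, and Decimal onto one representation before serializing."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int, so handle it first
    if isinstance(value, (int, float, Decimal)):
        d = value if isinstance(value, Decimal) else Decimal(str(value))
        if not d.is_finite():
            raise ValueError("non-finite numbers are not supported in receipts")
        if d == d.to_integral_value():
            return int(d)  # 32, 32.0, and Decimal('32') all become 32
        # Non-integers become a plain decimal string without trailing zeros.
        return format(d.normalize(), "f")
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_normalize(v) for v in value]
    return value


def canonical_bytes(obj) -> bytes:
    """Signing input that is stable across a write and a Decimal-typed read."""
    return json.dumps(
        _normalize(obj), sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


# What the writer signed versus what a Decimal-typed read hands back.
written = {"tier": 2, "tokens": 32, "elapsed_s": 5.1}
read_back = {"tier": Decimal("2"), "tokens": Decimal("32"), "elapsed_s": Decimal("5.1")}

print(canonical_bytes(written))
print(canonical_bytes(written) == canonical_bytes(read_back))
```

Rendering non-integers as strings is a deliberate trade-off in this sketch: it sidesteps float formatting differences entirely, at the cost of the canonical form no longer being type-faithful JSON. That is acceptable when the bytes are only ever fed to an HMAC.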
Full transparency. The day-1 spike cost ~$14.50 total, spread across:
| Resource | Usage | Cost |
|---|---|---|
| r7iz.4xlarge compute | ~5 hours active | $6.75 |
| NAT Gateway hourly + data transfer | ~70 GB through (model + source repos + pip wheels) | $3.15 |
| EBS gp3 4 TB (model storage) | Prorated for the run | $2.10 |
| Bedrock Haiku (intent classifier stub) | Not invoked in the demo | $0.00 |
| DynamoDB writes (3 receipts) | PAY_PER_REQUEST | marginal |
| Snapshots, EIPs, idle NAT cycles | Brief windows | $2.50 |
| Day-1 total | | ~$14.50 |
After the run, we terminated the EC2 instance (which auto-deletes the attached EBS), deleted the NAT gateway, released the elastic IP, removed the IAM role and instance profile, and dropped the security group rules we added. Going-forward day-to-day cost: $0.
The next run rebuilds from the same provisioning script in about 10 minutes. The model re-downloads from Hugging Face in about half an hour. No persistent infrastructure carries forward, which is the right shape for spike work.
Three concrete items, in priority order.
Today's 6.26 tok/s was plain transformers on CPU. The kernel layer is already validated: kt_kernel imports cleanly, AMX-INT8 and AMX-BF16 extensions load on the Xeon Gold 6455B, and the doctor command confirms the full instruction set. The next session wires the kernel into the model's MoE blocks and re-measures. We expect a meaningful step up.
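As an independent sanity check alongside the doctor command, the AMX feature flags can also be read straight from `/proc/cpuinfo` on Linux. The sketch below is not part of KTransformers; `cpu_flags`, `missing_amx`, and `read_cpuinfo` are hypothetical helpers, and the flag names are the ones the Linux kernel reports for AMX-capable Xeons.

```python
# Flag names as they appear in the "flags" line of /proc/cpuinfo on Linux.
AMX_FLAGS = {"amx_tile", "amx_int8", "amx_bf16"}


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_amx(cpuinfo_text: str) -> set:
    """Return the AMX flags that are absent; an empty set means all are present."""
    return AMX_FLAGS - cpu_flags(cpuinfo_text)


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the cpuinfo file; only meaningful on Linux hosts."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On the instance, `missing_amx(read_cpuinfo())` returning an empty set would indicate that the tile, INT8, and BF16 extensions are all exposed to the guest.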
Right now the substrate is a Python script that loads the model, runs inference, signs a receipt, and exits. Production needs the FastAPI WebSocket endpoint that the Tier 1 phone calls into: receive escalation request, verify the incoming receipt chain, run the inference, sign per-token receipts as they stream, return tokens back to the phone in real time. That's the WP-B inference endpoint as scoped in the SOW.
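The per-token signing step can be sketched independently of the WebSocket layer. The generator below is an assumption about how streaming receipts might be chained, not the WP-B design: each token receipt commits to the hash of the previous one, starting from the hash of the incoming Tier 1 receipt. The field names, the demo key, and the placeholder parent hash are hypothetical, and the FastAPI endpoint itself is left out.

```python
import hashlib
import hmac
import json


def canonical(obj) -> bytes:
    """Deterministic JSON bytes used as signing and hashing input."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def stream_with_receipts(tokens, key: bytes, parent_hash: str):
    """Yield (token, receipt) pairs; each receipt links to the one before it."""
    prev = parent_hash
    for index, token in enumerate(tokens):
        body = {"index": index, "token": token, "parent_hash": prev}
        mac = hmac.new(key, canonical(body), hashlib.sha256).hexdigest()
        receipt = dict(body, hmac=mac)
        prev = hashlib.sha256(canonical(receipt)).hexdigest()
        yield token, receipt


# Hypothetical inputs: a demo Tier 2 key and a placeholder Tier 1 receipt hash.
demo_key = hashlib.sha256(b"demo-tier-2").digest()
tier1_hash = hashlib.sha256(b"tier-1-receipt-placeholder").hexdigest()

streamed = list(stream_with_receipts([" Sovereign", " inference", " is"], demo_key, tier1_hash))
for token, receipt in streamed:
    print(repr(token), receipt["index"], receipt["hmac"][:8])
```

Because the function is a generator, an endpoint could send each token and its receipt as soon as the model produces it, rather than waiting for the full completion.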
The Tier 0 emulator implemented the wire protocol exactly as specified. When the open-source glasses arrive, the firmware engineers implement the same protocol natively, the transport changes from local WebSocket to BLE GATT, and the rest of the stack works unchanged. The audit chain begins at the lens.
Episode 04 lands the KTransformers AMX path and the FastAPI inference endpoint. Episode 05 picks up when the hardware ships.
Receipt chain first. Hardware second. Demos third.