A self-hosted stack still hits two or three tasks where a frontier model wins. Buying that access from Anthropic means a KYC account and a card. ppq.ai is the other door: an OpenAI-compatible proxy to Claude, GPT and others, paid per query over Bitcoin Lightning, no account. Here is what it is good for, where it betrays the sovereign premise, and exactly how I wired it as the fallback behind local Qwen.

Read article
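As a rough illustration of the fallback idea in the teaser above, here is a minimal sketch of trying a local model first and falling through to a cloud proxy only when the local call fails. Everything in it is an assumption for illustration: `complete_with_fallback`, `local_qwen` and `cloud_proxy` are hypothetical names, the backends are plain stand-in functions rather than real HTTP clients, and the article's actual wiring may route by task rather than by failure.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is any callable that takes a prompt and returns a completion.
Backend = Callable[[str], str]


def complete_with_fallback(
    prompt: str, backends: Sequence[Tuple[str, Backend]]
) -> Tuple[str, str]:
    """Try each (name, backend) in order; return (name, completion) from the first success."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real setup would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Hypothetical stand-ins: neither function talks to a real server.
def local_qwen(prompt: str) -> str:
    raise ConnectionError("local server unreachable")


def cloud_proxy(prompt: str) -> str:
    return f"[cloud] {prompt}"


if __name__ == "__main__":
    used, text = complete_with_fallback(
        "hi", [("local", local_qwen), ("cloud", cloud_proxy)]
    )
    print(used, text)
```

Keeping the backends as ordered callables means the routing order lives in one list, so swapping the primary or adding a third option does not touch the retry logic.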

Jun 25, 2026 strategy dgx-spark

Gemma-4-31B NVFP4 on a Single DGX Spark: When the Quantization Is the Bottleneck

Gemma-4-31B in NVIDIA's NVFP4 format fits a single DGX Spark and is a strong reasoner. But on Blackwell sm_121 the default FP4 kernel path is broken, and a dense 31B is bandwidth-bound at around 4 tok/s no matter what you do. I measured the baseline, the Marlin fix, and the honest conclusion: the real speedup is a model swap, not a flag.

Read article

Jun 25, 2026 strategy dgx-spark

GLM-4.7-Flash on a Single DGX Spark: the Repo Says AWQ, the Model Says MLA

GLM-4.7-Flash is a 30B-A3B MoE coding model that fits a single 128GB DGX Spark with room to spare. Bringing it up on Blackwell sm_121 took two failures that every published recipe gets wrong: the 'AWQ' build is actually compressed-tensors, and the model speaks MLA, so flash_attn is illegal. Here is the working recipe, the single-stream decode number nobody reports, and what it does to my coding agent.

Read article

Jun 25, 2026 strategy agents

goose vs vibe vs opencode: Picking a Local Coding CLI for a Sovereign vLLM Stack (2026)

Three local, self-hostable coding-agent CLIs that drive your own vLLM models instead of a cloud API: opencode, goose, and vibe. I run opencode as primary and goose as backup on a DGX Spark, and I retired vibe. Here is the decision, with the licences, the maintenance reality, and the one config gotcha each, so you can choose for your own box.

Read article

Jun 25, 2026 sovereign-ai rag

A No-Vector RAG That Works: The Architecture, Decision by Decision

The complete design of the retrieval system my local models run on: Markdown files, one JSON index, full-body BM25 chunked per section, served to agents over MCP. No vector database, no embeddings. Here is every decision and the reason behind it, with the external evidence that backs each one.

Read article
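To make "BM25 chunked per section" concrete, here is a minimal sketch under stated assumptions: it splits one Markdown string at heading lines and scores each section with a textbook Okapi BM25. The function names, the tokenizer, and the `k1`/`b` defaults are illustrative choices, not the article's implementation, and the JSON index and MCP layer are left out.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_sections(markdown: str) -> List[Tuple[str, str]]:
    """Split a Markdown document into (heading, body) chunks at each heading line."""
    sections, heading, body = [], "(preamble)", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous chunk, skipping an empty preamble.
            if body or heading != "(preamble)":
                sections.append((heading, "\n".join(body)))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body)))
    return sections


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, sections: List[Tuple[str, str]],
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[float, str]]:
    """Return (score, heading) pairs, best first, using Okapi BM25 over section text."""
    docs = [tokenize(h + " " + body) for h, body in sections]
    avg_len = sum(len(d) for d in docs) / len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for (heading, _), doc in zip(sections, docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, heading))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    doc = "# Setup\nInstall vLLM on the box.\n\n# Retrieval\nBM25 ranks each section by term overlap.\n"
    for score, heading in bm25_rank("bm25 section", split_sections(doc)):
        print(f"{score:.3f}  {heading}")
```

Scoring per section rather than per file keeps each hit small enough to hand to a model directly, which is the usual motivation for chunking at headings.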

Running AI at home. No cloud. No compromises.

// Latest Articles

Frontier AI on Bitcoin: ppq.ai as the No-KYC Cloud Fallback for a Sovereign Stack (2026)

Gemma-4-31B NVFP4 on a Single DGX Spark: When the Quantization Is the Bottleneck

GLM-4.7-Flash on a Single DGX Spark: the Repo Says AWQ, the Model Says MLA

goose vs vibe vs opencode: Picking a Local Coding CLI for a Sovereign vLLM Stack (2026)

A No-Vector RAG That Works: The Architecture, Decision by Decision
