Skip to main content

announcements

Your model, converted: sub-quadratic inference for the models you already trust

lab358 converts open-weight LLMs into a natively sparse top-k attention architecture served by ANN retrieval — same benchmark quality, sub-quadratic long-context cost. Coming to AWS Marketplace.

Michael Beck
June 4, 2026

lab358 converts open-weight language models into a natively sparse attention architecture and serves them with approximate nearest-neighbor retrieval — same benchmark quality, sub-quadratic long-context cost. Coming to AWS Marketplace.

Inference, not training, is where AI budgets go: roughly 60–80% of total AI compute spend, in a market estimated at 106Bin2025andprojectedtoreach106B in 2025 and projected to reach 255B by 2030. And the workloads growing fastest — contracts, case law, patient histories, filings, codebases — are exactly the ones standard attention punishes, because its cost grows with the square of context length while the KV cache devours GPU memory.

The usual escape routes all give something up. State-space conversions compress history into a fixed-size state and lose long-range recall. Training-free sparse-attention and cache-eviction tricks approximate or discard context at inference time, trading quality for speed. Retraining a frontier model from scratch on a new architecture costs more than most companies will ever spend on inference.

We took a different route: don't replace the model — convert it.

What lab358 does

lab358 has developed a conversion-and-retraining procedure that transforms an existing pretrained transformer into a model with natively sparse attention: after conversion, each token needs only the top-k most relevant cached tokens — with k as small as 16–32 — to match full-attention quality. Sparsity isn't an approximation bolted on at serving time; it's exact model behavior.

Once attention is genuinely top-k at small k, the O(n2)O(n^2) attention computation can be replaced by approximate nearest-neighbor (ANN) search over the key-value cache — mature, hardware-flexible technology that runs on GPU, CPU, or SSD. Attention becomes a retrieval problem. Inference cost grows sub-quadratically with context length, and the KV cache no longer has to live in GPU memory at all.

In practice, three steps:

  1. Bring a model. An open-weight model you already use — standard tokenizer, standard corpora, no multi-phase retraining schedule.
  2. We convert it. Our conversion-and-retraining procedure retrains it to natively sparse top-k attention while preserving benchmark quality.
  3. Deploy it. The converted model ships as a SageMaker model package on AWS Marketplace. It deploys inside your AWS account and exposes an OpenAI-compatible endpoint. AWS brokers the deployment so lab358 cannot pull the container image or the model artifacts; your prompts, completions, and attached documents never leave your VPC.

What we've measured

We applied the procedure to SmolLM3-3B (Apache 2.0), producing a retrained 3B-parameter converted model:

  • With standard RoPE positional encoding, the converted model is benchmark-equivalent to the base.
  • With DRoPE — our decoupled variant that lets cached keys be indexed directly by standard ANN structures — benchmarks decrease by less than 3%.
  • In both cases, k = 16–32 retrieved tokens per query match full-attention quality.

The architecture itself has a second, independent proof point: a 130M-parameter model built on it from scratch matched GPT-2 Small on HellaSwag, ARC-Easy, and MMLU on roughly a quarter of the training data. The conversion procedure is how that architecture now ships.

SmolLM3-3B is the current conversion target; extending the recipe to additional open-weight families — and to larger checkpoints beyond 3B — is in progress. We'll publish those numbers when we have them: measured, not projected.

What it means for your workloads

The following benefits are projected from the measured results above; per-workload benchmarks are in progress.

  • Long-context cost drops a complexity class. Sub-quadratic scaling means the documents that are most expensive today see the largest savings.
  • The KV cache stops being a GPU problem. With ANN retrieval, cached context can live in CPU RAM or on SSD — smaller GPU footprints, and context lengths device memory alone would never allow.
  • Compute a document's cache once, reuse it everywhere. Long-document KV caches become reusable assets across every query against that document.
  • Nothing about your stack changes. Same model family you already trust, same tokenizer, OpenAI-compatible API.

Availability

lab358 is coming to AWS Marketplace as a SageMaker model package: metered, procurement-friendly, and sealed inside your AWS account. One delivery model — same posture for every customer, regulated or not. Data residency is the default, not a premium tier.

The listing isn't live yet. If you run long-context workloads and want early access — or want your model family on the supported list — request early access.

Investors and researchers: michael@lab358.ai.