Sluice

Queue-driven, scale-from-zero GPU inference for any Kubernetes.

Sluice is the open-source, vendor-neutral take on managed async inference (the SageMaker-Async / Vertex-Batch pattern) — on your own cluster. Traffic bursts deepen a queue instead of returning 503s, workers own their lifecycle and are never killed mid-job, and when your cluster runs out of GPUs, Sluice bursts to spot VMs in another region.

Why Sluice

Bring only your model — the worker SDK, gateway, autoscaler, and charts do the rest.

Burst-proof intake

A spike never fails a request. The gateway stores the body, enqueues an ID, and answers immediately with a ticket and a wait-time estimate.
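
As a rough illustration of that intake path, here is a minimal in-memory sketch. It is not Sluice's gateway code: the `Gateway` class, the `apps/{app}/requests/{id}` key layout, the choice to reuse the request ID as the ticket, and the ETA formula are all assumptions made for the example. A dict stands in for the object store and a deque for the queue.

```python
import uuid
from collections import deque


class Gateway:
    """Toy intake path: store the body, enqueue the ID, return a ticket."""

    def __init__(self, avg_job_seconds=5.0):
        self.bucket = {}        # stands in for S3/GCS/MinIO
        self.queue = deque()    # stands in for Redis/SQS
        self.avg_job_seconds = avg_job_seconds  # assumed input to the ETA

    def infer(self, app, body):
        request_id = uuid.uuid4().hex
        # 1. Persist the body first so a queued ID always has a body behind it.
        self.bucket[f"apps/{app}/requests/{request_id}"] = body  # hypothetical key layout
        # 2. Enqueue only the ID; the queue never carries payloads.
        self.queue.append(request_id)
        # 3. Answer immediately. This ETA formula is a guess, not Sluice's.
        eta = len(self.queue) * self.avg_job_seconds
        return {"ticket": request_id, "etaSeconds": eta}
```

Because the handler does no inference work itself, a burst only lengthens the queue and raises the estimate in the reply.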

Never kills a busy worker

Workers self-terminate when the queue is empty. The control plane only scales up and reaps workers that have already exited, so no worker is ever SIGKILLed mid-inference.
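
The self-terminating loop could look roughly like the sketch below. The function name `run_worker` and the `requests/` and `results/` key prefixes are hypothetical, and a real worker would talk to Redis or SQS and an object store rather than a deque and a dict. It might also wait out a short idle period before exiting, which is not modeled here.

```python
from collections import deque


def run_worker(queue, bucket, handler):
    """Drain the queue, then return so the process can exit on its own."""
    processed = 0
    while True:
        try:
            request_id = queue.popleft()
        except IndexError:
            # Queue is dry: the worker decides to stop; nothing kills it.
            return processed
        body = bucket[f"requests/{request_id}"]           # hypothetical key prefix
        bucket[f"results/{request_id}"] = handler(body)   # hypothetical key prefix
        processed += 1
```

The exit decision sits inside the loop, between jobs, so a worker can never be interrupted partway through a handler call by its own shutdown logic.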

Scale to zero, emergently

Zero is not a controller decision — it's what's left when workers finish and exit. Idle apps cost nothing. No CRDs, no Deployment churn.

Stockout-aware placement

Spot first, on-demand second. A stuck pod marks its zone/GPU/pricing candidate as stocked out, and that mark is shared across apps. The playbook then walks the remaining zones and, once Kubernetes is exhausted, provisions VMs in other regions via Terraform.
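
One way to picture the playbook is as an ordered walk over candidates that skips anything marked stocked out. The sketch below is only that, a picture: the `Candidate` type, the `next_candidate` function, and the zone names are illustrative assumptions, not Sluice's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    zone: str
    gpu: str
    pricing: str  # "spot" or "on-demand"


def next_candidate(candidates, stocked_out):
    """Return the first candidate not marked stocked out, spot before on-demand."""
    rank = {"spot": 0, "on-demand": 1}
    # sorted() is stable, so zones keep their listed order within each pricing tier.
    for c in sorted(candidates, key=lambda c: rank[c.pricing]):
        if c not in stocked_out:
            return c
    return None  # Kubernetes exhausted; the caller would fall back to VMs


candidates = [
    Candidate("us-central1-a", "nvidia-l4", "on-demand"),
    Candidate("us-central1-a", "nvidia-l4", "spot"),
    Candidate("us-central1-b", "nvidia-l4", "spot"),
]
stocked_out = {Candidate("us-central1-a", "nvidia-l4", "spot")}  # shared across apps
print(next_candidate(candidates, stocked_out))
```

Here the spot candidate in the first zone is skipped and the spot candidate in the second zone is chosen before any on-demand capacity is considered. A `None` result is the point at which cross-region VMs would come into play.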

Bring your own backends

Redis or SQS queues; S3, GCS, or MinIO object stores. App specs live in your bucket (kops-style), so the control plane is stateless and restartable.

One tiny contract

Request ID in the queue, body and result in the bucket. Any worker that can reach both the queue and the bucket can serve, whether it is a pod in your cluster or a VM on another continent.

How it works

Four moving parts: a gateway, a queue, a bucket, and a controller that only ever scales up.

  1. Define an app, apply it. One YAML: image, handler, resources, scaling, placement. sluice apply stores it in your object store, and that bucket is the source of truth.
  2. Clients post work. POST /v1/{app}/infer writes the body to the bucket, enqueues the request ID, and returns a ticket with an ETA. Batches work the same way.
  3. The controller scales up. Queue depth ÷ messages-per-worker says how many workers should exist. The placement playbook picks the cheapest viable candidate: spot zones first, then on-demand, then cross-region VMs.
  4. Workers drain and disappear. Each worker pulls IDs, fetches bodies, runs your handler, writes results, and exits when the queue is dry. The controller reaps the husks. Back to zero.
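
The scale-up arithmetic in step 3 can be sketched as follows. Rounding up, and treating `maxWorkers: 0` as "no cap", are assumptions drawn from the description and the sample spec. The function name `workers_to_add` is hypothetical.

```python
import math


def workers_to_add(queue_depth, messages_per_worker, running, max_workers=0):
    """How many workers to start; never negative, because the controller only scales up."""
    desired = math.ceil(queue_depth / messages_per_worker)
    if max_workers > 0:          # 0 means unbounded, as in app.yaml
        desired = min(desired, max_workers)
    return max(0, desired - running)
```

With `messagesPerWorker: 8`, a queue of 20 messages and one running worker gives `ceil(20 / 8) = 3` desired workers, so two more are started. An empty queue yields zero additions and leaves scale-down to the workers themselves.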

Quickstart

An app is one YAML file. This one runs on cluster GPUs when it can, and bursts to spot VMs in another region when it can't.

app.yaml
name: segmentation
image: ghcr.io/jugrajsingh/sluice-example-segmentation:latest
handler: handler:SegmentationHandler
queue:
  ref: sluice-segmentation
storage:
  prefix: apps/segmentation
resources:
  gpu: 1
  gpuType: nvidia-l4
  cpu: 4
  memoryGb: 16
scaling:
  messagesPerWorker: 8
  maxWorkers: 0   # unbounded
placement:
  mode: both        # kubernetes | vm | both
  pricing: [spot, on-demand]
  vm:
    provider: gce
    machineType: g2-standard-8
    regions: [us-central1, europe-west3]
shell
# register the app (spec lands in your bucket)
sluice apply -f app.yaml

# submit work — returns a ticket instantly
curl -s "$GATEWAY/v1/segmentation/infer" \
     -d @body.json
# {"ticket":"…","etaSeconds":42}

# poll for the result
curl -s "$GATEWAY/v1/segmentation/status/$TICKET"

# idle again? workers have already exited.
sluice status segmentation
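
The same submit-and-poll flow can be scripted. The sketch below uses only the Python standard library and the two endpoints shown above. The helper names are hypothetical, and because the shape of the status response is not documented here, the caller supplies the `is_done` check rather than the code guessing a field name.

```python
import json
import time
import urllib.request


def submit(gateway, app, body):
    """POST the body (bytes) and return the gateway's JSON reply (ticket plus ETA)."""
    req = urllib.request.Request(
        f"{gateway}/v1/{app}/infer", data=body, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_status(gateway, app, ticket):
    """GET the status document for one ticket."""
    with urllib.request.urlopen(f"{gateway}/v1/{app}/status/{ticket}") as resp:
        return json.load(resp)


def wait_for(fetch, is_done, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until is_done(reply) is true or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        reply = fetch()
        if is_done(reply):
            return reply
        if time.monotonic() + interval > deadline:
            raise TimeoutError("result not ready before timeout")
        sleep(interval)
```

A caller would pass something like `lambda: fetch_status(gateway, "segmentation", ticket)` as `fetch`, plus a predicate matching whatever the status reply actually contains. Using the `etaSeconds` value from the ticket as the first sleep is a reasonable refinement.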

Install with Helm: gateway, console, and autoscaler ship as one chart; workers are bare pods synthesized from the spec — nothing else to deploy.