Sluice

Queue-driven, scale-from-zero GPU inference for any Kubernetes.

Sluice is the open-source, vendor-neutral take on managed async inference (the SageMaker-Async / Vertex-Batch pattern), running on your own cluster. You bring only your model; Sluice supplies a queue for it to sit behind, GPUs that scale from zero, and two ways to feed it: online and batch. Traffic bursts deepen the queue instead of returning 503s, workers own their lifecycle and are never killed mid-job, and when your cluster runs out of GPUs, Sluice bursts to spot VMs in another region.


Why Sluice

Bring only your model — the worker SDK, gateway, autoscaler, and charts do the rest.

Burst-proof online intake

A spike never fails a request. On POST /v1/{app}/infer, a cache hit returns an instant 200; otherwise the gateway enqueues the request and briefly long-polls, returning either a 200 with the result or a 202 with a ticket and a Retry-After. An optional _rid in the body is the idempotency/cache key, so a repeated request is a free 200 that uses no GPU.
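A client for this flow might look like the sketch below. It assumes the request and response bodies are JSON objects and that the 202 body carries ticket and retry_after fields, as in the Quickstart; the helper names (http_json, infer), the one-second default wait, and the max_polls limit are illustrative choices, not part of Sluice.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send one JSON request and return (status code, parsed JSON reply)."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status, json.loads(response.read() or b"{}")


def infer(gateway, app, payload, rid=None,
          send=http_json, sleep=time.sleep, max_polls=30):
    """Submit online work, then poll the ticket until a 200 arrives."""
    body = dict(payload)
    if rid is not None:
        body["_rid"] = rid  # idempotency/cache key: a repeat can be answered from cache
    status, reply = send("POST", f"{gateway}/v1/{app}/infer", body)
    if status == 200:  # cache hit, or the long-poll finished in time
        return reply
    ticket = reply["ticket"]  # assumed field name, taken from the Quickstart output
    for _ in range(max_polls):
        sleep(reply.get("retry_after", 1))  # assumed default when the hint is absent
        status, reply = send("GET", f"{gateway}/v1/{app}/status/{ticket}")
        if status == 200:
            return reply
    raise TimeoutError(f"ticket {ticket} not ready after {max_polls} polls")
```

Passing the same rid on a retry is what lets the gateway answer from cache instead of queueing the work a second time.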

Batch is its own lane

Not the same path as online — batch is an upload-first, multi-file job on a 24h SLA. Create a job, push JSONL files to presigned URLs, submit (one queue message per file), poll status, then pull gzipped output parts. If a spot VM dies mid-file, the next worker resumes from the last checkpoint.
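The checkpoint-and-resume idea can be sketched as follows. This is not Sluice's worker code: the function names, the key layout (checkpoint, part-…), and the checkpoint interval are assumptions, and a plain dict stands in for the object store.

```python
import gzip
import json


def process_file(lines, handle, store, key, every=100):
    """Run handle() over JSONL lines, checkpointing every `every` records."""
    start = store.get(f"{key}/checkpoint", 0)  # resume point; 0 for a fresh file
    buffer = []
    for index in range(start, len(lines)):
        buffer.append(handle(json.loads(lines[index])))
        done = index + 1
        if done % every == 0 or done == len(lines):
            part_start = done - len(buffer)
            payload = "\n".join(json.dumps(r) for r in buffer).encode("utf-8")
            # Write the gzipped part first, then advance the checkpoint, so a
            # crash in between only causes the same part to be rewritten.
            store[f"{key}/part-{part_start:08d}.jsonl.gz"] = gzip.compress(payload)
            store[f"{key}/checkpoint"] = done
            buffer = []
    return store.get(f"{key}/checkpoint", 0)


def read_parts(store, key):
    """Concatenate all output parts of one file in input order."""
    names = sorted(n for n in store if n.startswith(f"{key}/part-"))
    results = []
    for name in names:
        text = gzip.decompress(store[name]).decode("utf-8")
        results.extend(json.loads(line) for line in text.splitlines())
    return results
```

If a worker dies partway through a file, results buffered since the last checkpoint are lost and recomputed by the next worker; everything before the checkpoint is already in an output part.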

Never kills a busy worker

Workers self-terminate when the queue is empty. The control plane only scales up and reaps the exited — no mid-inference SIGKILL, ever.

Scale to zero, emergently

Zero is not a controller decision — it's what's left when workers finish and exit. Idle apps cost nothing. No CRDs, no Deployment churn.

Ordered, stockout-aware placement

You list placement candidates in priority order. A stuck pod marks its cluster/selector/pricing candidate as stocked out (shared across apps), and the controller advances to the next — another pool, another cluster, then VMs.
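A minimal sketch of ordered, stockout-aware placement, under stated assumptions: Candidate, StockoutRegistry, and place are hypothetical names, try_launch stands in for whatever actually creates a pod or VM, and the cooldown after which a stocked-out mark expires is an assumption the description above does not specify.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str      # "kubernetes" or "vm"
    cluster: str   # cluster or provider name
    selector: str  # node selector or machine type
    pricing: str   # e.g. "spot"


class StockoutRegistry:
    """Remembers stocked-out candidates; one instance is shared by all apps."""

    def __init__(self, cooldown_s=300.0, clock=time.monotonic):
        self._until = {}
        self._cooldown_s = cooldown_s  # assumed expiry, not taken from Sluice
        self._clock = clock

    def mark(self, candidate):
        self._until[candidate] = self._clock() + self._cooldown_s

    def is_stocked_out(self, candidate):
        return self._clock() < self._until.get(candidate, float("-inf"))


def place(candidates, registry, try_launch):
    """Walk the ordered list and return the first candidate that takes a worker."""
    for candidate in candidates:
        if registry.is_stocked_out(candidate):
            continue  # another app already found this one empty
        if try_launch(candidate):
            return candidate
        registry.mark(candidate)  # stocked out: advance to the next candidate
    return None
```

Because the registry is shared, a second app asking for the same pool skips it immediately instead of waiting on its own stuck pod.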

Bring your own backends

Redis or SQS queues; S3, GCS, or MinIO object stores. App specs live in your bucket (kops-style), so the control plane is stateless and restartable.

One box, both lanes

Request ID in the queue, body and result in the bucket. The same worker serves online and batch off separate queues — online is prioritized, batch backfills the idle GPU. A pod in your cluster or a VM on another continent, same contract.
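The two-lane behaviour might be sketched like this. In-memory deques stand in for the Redis or SQS queues and a dict for the bucket; the requests/ and results/ key prefixes and the function names are assumptions, and a batch message is simplified here to a single request ID rather than a whole file.

```python
from collections import deque


def next_message(online, batch):
    """Online is always drained first; batch only backfills an idle worker."""
    if online:
        return "online", online.popleft()
    if batch:
        return "batch", batch.popleft()
    return None


def run_worker(online, batch, bucket, handle):
    """Process until both lanes are dry, then return so the process can exit."""
    processed = []
    while (item := next_message(online, batch)) is not None:
        lane, request_id = item
        body = bucket[f"requests/{request_id}"]        # only the ID travels in the queue
        bucket[f"results/{request_id}"] = handle(body)  # the result goes back to the bucket
        processed.append((lane, request_id))
    return processed
```

Because the lane is re-evaluated before every message, online work that arrives while a batch backlog is draining is picked up as soon as the current item finishes.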

Pack the GPU, no MPS

Declare how many model replicas share one GPU — run your own BaseHandler in-process, or front an unmodified HTTP model server with a Sluice queue-adapter. Same model on Kubernetes and burst VMs, tuned per GPU.

How it works

Four moving parts: a gateway, a queue, a bucket, and a controller that only ever scales up.

1. Define an app, apply it. One YAML: image, handler, resources, scaling, placement. sluice apply stores it in your object store; that bucket is the source of truth.
2. Clients post online work. POST /v1/{app}/infer writes the body to the bucket and enqueues its request ID. A cache hit answers 200 instantly; otherwise a short long-poll returns 200, or 202 with a ticket and a Retry-After to poll GET /v1/{app}/status/{ticket}.
3. Or run a batch job. Batch is a separate, upload-first lane on a 24h SLA. POST /v1/{app}/batch returns a job_id; you upload JSONL files to presigned URLs, submit (one queue message per file), poll the job, and download gzipped output parts. Batch rides its own queue: online is prioritized, batch backfills the idle GPU.
4. The controller scales up. Queue depth ÷ messages-per-worker says how many workers should exist. The placement playbook tries your ordered placement list top-to-bottom, marking each stocked-out candidate before moving to the next: pools, other clusters, then VMs.
5. Workers drain and disappear. Each worker pulls IDs, fetches bodies, runs your handler, writes results, and exits when the queue is dry. The controller reaps the husks. Back to zero.
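The scale-up arithmetic from the controller step can be written out as a short sketch. Rounding up is an assumption (the text only says depth divided by messages-per-worker), the function names are illustrative, and max_instances=0 is treated as unbounded, following the comment in the Quickstart spec.

```python
import math


def desired_workers(queue_depth, messages_per_instance, max_instances=0):
    """Workers that should exist for the current queue depth."""
    if messages_per_instance <= 0:
        raise ValueError("messages_per_instance must be positive")
    wanted = math.ceil(queue_depth / messages_per_instance)  # assumed rounding
    if max_instances > 0:  # 0 means unbounded
        wanted = min(wanted, max_instances)
    return wanted


def workers_to_launch(queue_depth, running, messages_per_instance, max_instances=0):
    """Scale-up only: never negative, so busy workers are never removed."""
    target = desired_workers(queue_depth, messages_per_instance, max_instances)
    return max(0, target - running)
```

With messagesPerInstance: 8, a queue of 17 messages asks for 3 workers. If 5 are already running the controller launches none and removes none; the surplus workers exit on their own once the queue is dry.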

Quickstart

An app is one YAML file. This one runs on cluster GPUs when it can, and bursts to spot VMs in another region when it can't.

app.yaml

name: segmentation
image: ghcr.io/jugrajsingh/sluice-example-segmentation:latest
handler: handler:SegmentationHandler
queue:
  ref: sluice-segmentation
resources:
  gpu: 1
  gpuType: nvidia-l4
  cpu: 4
  memoryGb: 16
scaling:
  messagesPerInstance: 8
  maxInstances: 0   # unbounded
placement:                    # ordered — tried top to bottom
  - type: kubernetes
    provider: in-cluster
    spec:
      pricing: spot
      nodeSelectors: [{ cloud.google.com/gke-spot: "true" }]
  - type: vm             # burst when the cluster is out of GPUs
    provider: gce
    spec:
      pricing: spot
      machineType: g2-standard-8
      regions: [us-central1, europe-west3]

shell

# register the app (spec lands in your bucket)
sluice apply -f app.yaml

# submit online work: a cache hit returns 200 immediately;
# otherwise a short long-poll returns 200, or 202 + a ticket
curl -s "$GATEWAY/v1/segmentation/infer" \
     -H 'Content-Type: application/json' \
     -d @body.json
# 202 -> {"ticket":"…","retry_after":42}

# poll the ticket from the 202 response until it returns 200 with the result
curl -s "$GATEWAY/v1/segmentation/status/$TICKET"

# idle again? workers have already exited.
sluice get segmentation

Install with Helm: gateway, console, and autoscaler ship as one chart; workers are bare pods synthesized from the spec — nothing else to deploy.