Sluice is the open-source, vendor-neutral take on managed async inference (the SageMaker-Async / Vertex-Batch pattern) — on your own cluster. Traffic bursts deepen a queue instead of returning 503s, workers own their lifecycle and are never killed mid-job, and when your cluster runs out of GPUs, Sluice bursts to spot VMs in another region.
Bring only your model — the worker SDK, gateway, autoscaler, and charts do the rest.
A spike never fails a request. The gateway stores the body, enqueues an ID, and answers immediately with a ticket and a wait-time estimate.
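As a rough illustration of that submit path, here is a minimal in-memory sketch. It is not Sluice's gateway code: the `submit` function, the key layout, and the ETA formula (jobs in the queue spread over workers, times an average job time) are all assumptions made for the example, with a dict standing in for the bucket and a deque for the queue.

```python
import math
import uuid
from collections import deque

bucket = {}       # stand-in for S3/GCS/MinIO: key -> bytes
queue = deque()   # stand-in for Redis/SQS: holds request IDs only

def submit(app, body, workers=1, avg_job_seconds=5.0):
    """Store the body, enqueue only its ID, and return a ticket at once."""
    ticket = uuid.uuid4().hex
    bucket[f"apps/{app}/requests/{ticket}"] = body  # hypothetical key layout
    queue.append(ticket)
    # Hypothetical ETA: queued jobs (including this one) spread over workers.
    eta = math.ceil(len(queue) / max(workers, 1)) * avg_job_seconds
    return {"ticket": ticket, "etaSeconds": eta}

first = submit("segmentation", b'{"image": "a.png"}')
second = submit("segmentation", b'{"image": "b.png"}')
print(first["etaSeconds"], second["etaSeconds"])
```

A burst only makes the deque longer and the estimate larger; nothing in the submit path can reject a request for lack of workers.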
Workers self-terminate when the queue is empty. The control plane only scales up and reaps the exited — no mid-inference SIGKILL, ever.
Zero is not a controller decision — it's what's left when workers finish and exit. Idle apps cost nothing. No CRDs, no Deployment churn.
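The self-terminating worker can be pictured as a loop that simply returns when there is nothing left to pull. The sketch below is an assumption about the shape of that loop, not the worker SDK's actual API: `run_worker`, the `requests/` and `results/` key names, and the lambda handler are hypothetical, and a real worker may wait for an idle grace period before exiting.

```python
import queue

def run_worker(jobs, bucket, handler):
    """Serve until the queue is empty, then return so the process can exit."""
    served = 0
    while True:
        try:
            request_id = jobs.get_nowait()
        except queue.Empty:
            return served  # empty queue: the worker ends itself
        body = bucket[f"requests/{request_id}"]           # body lives in the bucket
        bucket[f"results/{request_id}"] = handler(body)   # so does the result
        served += 1

jobs = queue.Queue()
bucket = {}
for i, text in enumerate([b"a", b"bc"]):
    bucket[f"requests/{i}"] = text
    jobs.put(i)

count = run_worker(jobs, bucket, handler=lambda body: body.upper())
print(count, bucket["results/1"])
```

Because the exit decision sits inside the loop, between jobs, no outside component ever has to interrupt a job that is in flight.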
Spot first, on-demand second. Stuck pods mark a zone/GPU/pricing candidate as stocked out (shared across apps), and the playbook walks zones — then provisions VMs in other regions via Terraform when Kubernetes is exhausted.
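One way to read that playbook is as an ordered walk over (zone, GPU, pricing) candidates that skips anything already marked stocked out. The sketch below is illustrative only: `next_candidate` is a hypothetical helper, the zone names are examples, and trying every zone on spot before any zone on-demand is an assumed ordering.

```python
from itertools import product

def next_candidate(zones, gpu_type, pricing, stocked_out):
    """Return the first (zone, gpu, pricing) not marked stocked out, else None."""
    # Pricing is the outer loop, so every zone is tried on spot before on-demand.
    for price, zone in product(pricing, zones):
        candidate = (zone, gpu_type, price)
        if candidate not in stocked_out:
            return candidate
    return None  # every in-cluster candidate is exhausted: time to burst to VMs

stocked_out = set()  # one shared set, so all apps see the same stockouts
zones = ["us-central1-a", "us-central1-b"]  # example zone names
pricing = ["spot", "on-demand"]

first = next_candidate(zones, "nvidia-l4", pricing, stocked_out)
stocked_out.add(first)  # a stuck pod marks this candidate as stocked out
second = next_candidate(zones, "nvidia-l4", pricing, stocked_out)
print(first, second)
```

Keeping the stockout set outside any single app is what lets one app's stuck pod save every other app from retrying the same candidate.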
Redis or SQS queues; S3, GCS, or MinIO object stores. App specs live in your bucket (kops-style), so the control plane is stateless and restartable.
Request ID in the queue, body and result in the bucket. Any worker that can read both can serve — a pod in your cluster or a VM on another continent.
Four moving parts: a gateway, a queue, a bucket, and a controller that only ever scales up.
sluice apply stores the app spec in your object store; that bucket is the source of truth.
POST /v1/{app}/infer writes the body to the bucket, enqueues the request ID, and returns a ticket with an ETA. Batches work the same way.
An app is one YAML file. This one runs on cluster GPUs when it can, and bursts to spot VMs in another region when it can't.
name: segmentation
image: ghcr.io/jugrajsingh/sluice-example-segmentation:latest
handler: handler:SegmentationHandler
queue:
  ref: sluice-segmentation
storage:
  prefix: apps/segmentation
resources:
  gpu: 1
  gpuType: nvidia-l4
  cpu: 4
  memoryGb: 16
scaling:
  messagesPerWorker: 8
  maxWorkers: 0 # unbounded
placement:
  mode: both # kubernetes | vm | both
  pricing: [spot, on-demand]
  vm:
    provider: gce
    machineType: g2-standard-8
    regions: [us-central1, europe-west3]
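To make the scaling fields concrete, here is a small sketch of a scale-up-only calculation. It assumes messagesPerWorker means "one worker per 8 queued messages, rounded up" and that maxWorkers: 0 disables the cap, as the spec's comment suggests. The function name and the rounding rule are assumptions, not Sluice's documented behavior.

```python
import math

def workers_to_launch(queue_depth, running, messages_per_worker=8, max_workers=0):
    """How many new workers to start; never negative, so it only scales up."""
    desired = math.ceil(queue_depth / messages_per_worker)
    if max_workers > 0:  # 0 is treated as unbounded
        desired = min(desired, max_workers)
    return max(0, desired - running)

print(workers_to_launch(queue_depth=20, running=1))   # backlog needs more workers
print(workers_to_launch(queue_depth=0, running=3))    # empty queue: launch nothing
print(workers_to_launch(queue_depth=100, running=0, max_workers=5))
```

The result is clamped at zero on purpose: with an empty queue the controller starts nothing, and the running workers drain away by exiting on their own.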
# register the app (spec lands in your bucket)
sluice apply -f app.yaml
# submit work — returns a ticket instantly
curl -s $GATEWAY/v1/segmentation/infer \
  -d @body.json
# {"ticket":"…","etaSeconds":42}
# poll for the result
curl -s $GATEWAY/v1/segmentation/status/$TICKET
# idle again? workers have already exited.
sluice status segmentation
Install with Helm: gateway, console, and autoscaler ship as one chart; workers are bare pods synthesized from the spec — nothing else to deploy.