Inference

Real-time inference.
Cold starts in under a second.

Deploy and scale low-latency inference for LLMs, audio and image generation. From a single function to thousands of GPUs — same API, same pricing.

Start free See pricing

python

 1from asc import Function, gpu
 2 
 3@Function(gpu=gpu.L4(), keep_warm=1)
 4def generate(prompt: str) -> str:
 5    from llm import Pipeline
 6    pipe = Pipeline.from_pretrained("meta-llama/Llama-3-8B")
 7    return pipe(prompt, max_new_tokens=256)
 8 
 9# Deploy
10# $ asc deploy generate.py

<0ms

Cold start

GPUs burstable

Uptime SLA

Features

Designed for production.

Sub-second cold starts

Container snapshotting + lazy image loading get your first token in <800ms on L4 and A10G.

Per-second autoscale

Replicas scale in/out within seconds, billed per active GPU-second only.

Streaming first

Native SSE + WebSocket streaming with backpressure. Drop-in OpenAI-compatible router.

Pinned weights

Versioned model artifacts in our object store — instant rollbacks, zero re-downloads.

Global edge routing

Multi-region failover and locality-aware routing built in. No load balancer wiring.

Batched + concurrent

Dynamic batching and concurrent execution across replicas to maximise GPU utilisation.

Pricing

Metered. No markup.

Pay per active second / per GiB. Free tier covers small projects; $200/mo cap until you opt in. See the full calculator.

Line item	Unit	Rate (USD)
Functions — CPU	per 1M requests	$0.23
Functions — GPU L4	per GPU-second	$0.000095
GPU A10G pod	per hour	$0.35
Egress	per GiB	$0.098

Inference docs Pricing calculator GPU cloud

Ship your first deploy in minutes.

Free $30/month of compute. No card required.

Get started Read the docs

Real-time inference. Cold starts in under a second.