Modelplane

In early development · building in the open

The open source control plane for AI inference

Modelplane is the control plane above your inference clusters across cloud, neocloud, and on-premises environments. Platform teams set policy and capacity; developers declare a model and get a serving endpoint. Modelplane continuously reconciles the whole fleet: provisioning, scheduling, autoscaling, routing, and caching. All of it runs entirely under your control.

ModelDeployment · POST · SGLang · deepseek-r1 · prefill / decode · 8× B200
ModelDeployment · POST · vLLM · llama-4-70b · tensor parallel · 4× H100
ModelDeployment · POST · TRT-LLM · qwen3-235b · data / expert · 8× H200

reconciling · provisioning · scheduling · autoscaling · routing · caching
policy · governance · compliance

InferenceCluster · GCP · Cloud · us-central1 · 256× TPU v6e, 8× H100
InferenceCluster · CoreWeave · Neocloud · gpu-east · 72× GB200, 8× H200
InferenceCluster · DGX · On-prem · dc-1 · 32× H100, 8× A100



The inference ecosystem. Under one control plane.

Modelplane doesn’t replace the inference ecosystem; it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, spanning accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.

composes · provisions · schedules · autoscales · routes · caches

orchestrates

Models

open weights & custom

Llama · Qwen · DeepSeek · Mistral · gpt-oss · Gemma · + any open-weight model

Serving

inference engines

vLLM · SGLang · TensorRT-LLM · TGI · llama.cpp · LMDeploy · + any engine

Infrastructure

accelerators & providers

Accelerators

NVIDIA · AMD · Google TPU · AWS Trainium · Intel Gaudi · + any accelerator

Providers

AWS · GCP · Azure · CoreWeave · Lambda · on-prem · + any Kubernetes

Advanced serving. From single GPU to frontier.

Modelplane matches each model’s requirements and serving topologies to the hardware available, using expressive CEL selectors and composable API shapes. Topology is declared as shape, so it places anything from a single GPU to multi-node, disaggregated frontier serving, and new parallelism strategies as they emerge.

tensor parallel

Split each layer across GPUs in a node for low-latency single-model serving.

pipeline parallel

Stage a model across nodes so very large models fit beyond a single box.

data / expert

Replicate workers, or shard experts across them for MoE throughput.

prefill / decode

Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology

Described as shape, so future parallelism strategies just work.
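
One way to read "topology as shape" is as a small record of parallelism degrees from which the hardware request follows. The sketch below is illustrative only: `TopologyShape`, its field names, and `nodes_required` are hypothetical and are not Modelplane's API. It assumes the common convention that one replica needs the product of its tensor, pipeline, and data/expert degrees in accelerators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyShape:
    """Hypothetical shape of one serving replica (not Modelplane's schema)."""
    tensor_parallel: int = 1    # GPUs each layer is split across, within a node
    pipeline_parallel: int = 1  # stages the model is split into, across nodes
    data_parallel: int = 1      # replicated or expert-sharded workers

    def gpus_required(self) -> int:
        # Assumption: total accelerators is the product of the three degrees.
        return self.tensor_parallel * self.pipeline_parallel * self.data_parallel


def nodes_required(shape: TopologyShape, gpus_per_node: int) -> int:
    """Round the GPU count up to whole nodes of a given size."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return -(-shape.gpus_required() // gpus_per_node)  # ceiling division


single_node = TopologyShape(tensor_parallel=4)
multi_node = TopologyShape(tensor_parallel=8, pipeline_parallel=2)
print(single_node.gpus_required(), nodes_required(single_node, 8))
print(multi_node.gpus_required(), nodes_required(multi_node, 8))
```

A disaggregated prefill / decode deployment could be modelled as two such shapes, one per pool, though how Modelplane actually expresses that is not specified here.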


A resource API for inference. Serving two roles.

Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.

Development & ML teams

Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService
name: prod-llama
routing: weighted, openai

weight: 60
kind: ModelDeployment
model: llama-4-70b
cluster: aws-us-east

weight: 30
kind: ModelDeployment
model: llama-4-70b
cluster: gcp-eu-west

weight: 10
kind: ModelEndpoint
target: vendor-api
type: managed

Platform teams

Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway
name: prod-gateway
routes: all endpoints

kind: InferenceCluster
name: aws-us-east
pools: h200, h100

kind: InferenceCluster
name: gcp-eu-west
pools: tpu-v6e, a100

kind: InferenceCluster
name: onprem-dc1
pools: h100, l40s


Capabilities built for the fleet. Not just the cluster.

01 / Provisioning

Provision the fleet, or bring your own

Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.

Provisioning

Provision · GKE / EKS
Bring your own · any K8s
Modelplane installs & reconciles
InferenceCluster · ● reconciled

classes: h200-8x, h100-8x · node pools · gateway

GPU operator & drivers

Serving engines

Inference gateway
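
"Installed and continuously reconciled" follows the usual control-plane pattern: compare the declared state of a cluster with what is observed, then act on the difference. The sketch below shows only that diff step. The function name, component names, and version strings are hypothetical and do not describe Modelplane's internals.

```python
def plan_reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[tuple[str, str]]:
    """Return (action, component) steps that move observed state toward desired.

    Both arguments map a component name to a version string.
    """
    steps: list[tuple[str, str]] = []
    for name, version in sorted(desired.items()):
        if name not in observed:
            steps.append(("install", name))   # declared but missing
        elif observed[name] != version:
            steps.append(("upgrade", name))   # present at the wrong version
    for name in sorted(observed.keys() - desired.keys()):
        steps.append(("remove", name))        # present but no longer declared
    return steps


# Hypothetical component names and versions, for illustration only.
desired = {"gpu-operator": "v2", "serving-engine": "v1", "gateway": "v1"}
observed = {"gpu-operator": "v1", "gateway": "v1", "legacy-agent": "v0"}

for action, component in plan_reconcile(desired, observed):
    print(action, component)
```

A real reconciler would run this comparison in a loop and apply each step idempotently; once observed matches desired, the plan is empty and nothing changes.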

02 / Scheduling

One global pool of capacity

Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA, with support for advanced schedulers like KAI, Kueue, and Volcano.

Two-level scheduling

fleet scheduler

one global pool

tracks requirements ↔ capabilities

places replicas

aws-us-east
gcp-eu-west
azure-us2

cluster scheduler

DRA · KAI / Kueue / Volcano

bound
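
The fleet-level half of two-level scheduling can be sketched as a filter-then-pick step over clusters. Everything here is an assumption made for illustration: the `Cluster` and `Request` types, the free-capacity numbers, and the "most free capacity wins" policy are hypothetical, and Modelplane's actual scheduler may weigh cost, policy, or locality instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    free: dict[str, int]  # accelerator class -> free accelerator count


@dataclass
class Request:
    model: str
    accelerator: str  # required accelerator class, e.g. "h100"
    count: int        # accelerators needed by one replica


def place(request: Request, clusters: list[Cluster]) -> Optional[str]:
    """Choose the matching cluster with the most free accelerators of the
    requested class, reserve them, and return its name (None if none fit)."""
    candidates = [
        c for c in clusters
        if c.free.get(request.accelerator, 0) >= request.count
    ]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda c: c.free[request.accelerator])
    chosen.free[request.accelerator] -= request.count
    # From here the cluster's own scheduler would bind the replica to nodes.
    return chosen.name


fleet = [
    Cluster("aws-us-east", {"h200": 16, "h100": 8}),
    Cluster("gcp-eu-west", {"tpu-v6e": 256, "a100": 8}),
    Cluster("onprem-dc1", {"h100": 32, "l40s": 8}),
]
print(place(Request("llama-4-70b", "h100", 4), fleet))
```

The point of the split is that the fleet scheduler only decides which cluster receives a replica; node-level binding stays with the cluster scheduler.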

03 / Autoscaling

Scale replicas across clouds and regions

Every model exposes the standard Kubernetes scale subresource, so its replicas scale out across clusters, clouds, and regions, driven by hand or by HPA and KEDA.

roadmap · Scale-to-zero is on the roadmap.

Autoscaling

load
spec.replicas 6
GCP · eu-west
AWS · us-east
Azure · us-east2
min 1 · max 8 replicas
scale subresource · HPA / KEDA
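
Because scaling goes through the standard scale subresource, the replica count an HPA would request follows the documented Kubernetes formula: desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds. The sketch below reproduces that arithmetic with the min 1 and max 8 bounds from the diagram. The metric values are made up, and the HPA's tolerance band and stabilization window are deliberately left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Kubernetes HPA core formula, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical load: 4 replicas each at 150 in-flight requests, target 100.
print(desired_replicas(4, 150, 100))
```

How the resulting replicas are then spread across clusters and regions is the fleet scheduler's decision, not the autoscaler's.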

04 / Routing

One service, many replicas and endpoints

A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.

roadmap · Automatic cross-cloud failover is on the roadmap.

One service, many endpoints

ModelService · prod-llama

● one OpenAI-compatible endpoint

weight: 60
ModelEndpoint
replica · aws-us-east

weight: 30
ModelEndpoint
replica · gcp-eu-west

weight: 10
ModelEndpoint
managed · vendor
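
The 60 / 30 / 10 split can be approximated with a weighted random choice per request. This is a minimal sketch of the idea, not how Modelplane's gateway is implemented: the weights come from the example, while `pick_backend`, `simulate`, and the backend labels are hypothetical.

```python
import random
from collections import Counter

# Weights from the prod-llama example; the keys are illustrative labels.
WEIGHTS = {"aws-us-east": 60, "gcp-eu-west": 30, "vendor-api": 10}


def pick_backend(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one backend with probability proportional to its weight."""
    if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) == 0:
        raise ValueError("weights must be non-negative with a positive total")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights: dict[str, int], requests: int, seed: int = 0) -> Counter:
    """Route `requests` simulated requests and count hits per backend."""
    rng = random.Random(seed)
    return Counter(pick_backend(weights, rng) for _ in range(requests))


counts = simulate(WEIGHTS, 10_000)
for name in WEIGHTS:
    print(f"{name}: {counts[name] / 10_000:.1%}")
```

Over many requests the observed shares converge on the declared weights, which is what makes a small weight usable for a canary or an A/B arm.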

05 / Platform native

The tools your team already knows

Platform teams operate Modelplane with the primitives they already own: Kubernetes APIs, GitOps workflows, Crossplane, Prometheus metrics, and RBAC. Declare inference clusters as code, manage them with Argo CD or Flux, and observe the fleet with your existing monitoring stack. No new control plane to learn. No separate operational model. Inference becomes just another workload your platform owns and governs.


Genuinely open. Yours to run and operate.

Modelplane is licensed under Apache 2.0 and open source end to end. The control plane lives entirely in your infrastructure and depends on nothing outside it, so no vendor can restrict, throttle, or revoke access. Donation to a neutral open source foundation is planned.

Built by the team behind Crossplane, the proven open source foundation for infrastructure control planes, trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB.


Build it in the open.

Modelplane is licensed under Apache 2.0 and developed in public, headed for a neutral foundation. If you run accelerators from any vendor, in any cloud or on-prem, you can help prove out the fleet across real hardware and shape the API before it sets.