In early development · building in the open

The open source control plane for AI inference

Modelplane is the control plane above your inference clusters across cloud, neocloud, and on-premise. Platform teams set policy and capacity; developers declare a model and get a serving endpoint. Modelplane continuously reconciles the whole fleet: provisioning, scheduling, autoscaling, routing, and caching. All of it runs entirely under your control.

Get started → · View on GitHub

ModelDeployment · POST · SGLang · deepseek-r1 · prefill / decode · 8× B200
ModelDeployment · POST · vLLM · llama-4-70b · tensor parallel · 4× H100
ModelDeployment · POST · TRT-LLM · qwen3-235b · data / expert · 8× H200

reconciling: provisioning · scheduling · autoscaling · routing · caching
policy · governance · compliance

InferenceCluster · GCP · Cloud · us-central1 · 256× TPU v6e, 8× H100
InferenceCluster · CoreWeave · Neocloud · gpu-east · 72× GB200, 8× H200
InferenceCluster · DGX · On-prem · dc-1 · 32× H100, 8× A100

The inference ecosystem. Under one control plane.

Modelplane doesn’t replace the inference ecosystem; it orchestrates it. It works across three layers: the models you run, the engines that serve them, and the infrastructure underneath, spanning accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.

orchestrates: composes · provisions · schedules · autoscales · routes · caches

Models

open weights & custom

Llama

Qwen

DeepSeek

Mistral

gpt-oss

Gemma

+ any open-weight model

Serving

inference engines

vLLM

SGLang

TensorRT-LLM

TGI

llama.cpp

LMDeploy

+ any engine

Infrastructure

accelerators & providers

Accelerators

NVIDIA

AMD

Google TPU

AWS Trainium

Intel Gaudi

+ any accelerator

Providers

AWS

GCP

Azure

CoreWeave

Lambda

on-prem

+ any Kubernetes

Advanced serving. From single GPU to frontier.

Modelplane matches each model’s requirements and serving topology to the available hardware, using expressive Common Expression Language (CEL) selectors and composable API shapes. Because topology is declared as a shape, the same mechanism places anything from a single GPU to multi-node, disaggregated frontier serving, and can accommodate new parallelism strategies as they emerge.

tensor parallel

Split each layer across GPUs in a node for low-latency single-model serving.

pipeline parallel

Stage a model across nodes so very large models fit beyond a single box.

data / expert

Replicate workers, or shard experts across them for MoE throughput.

prefill / decode

Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology

Described as shape, so future parallelism strategies just work.
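To make "topology as shape" concrete, here is a minimal Python sketch. It is not Modelplane's API: the `Shape` and `Topology` classes and their field names are hypothetical, and the even 4 + 4 split of the prefill / decode example is an assumption. It only shows how the parallelism strategies above reduce to one data structure from which a GPU request can be derived.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Shape:
    """Hypothetical shape: parallelism degrees for one group of workers."""
    tensor: int = 1    # GPUs each layer is split across
    pipeline: int = 1  # sequential stages the model is cut into
    data: int = 1      # replicated workers, or expert shards for MoE

    def gpus(self) -> int:
        # Total GPUs is the product of the parallelism degrees.
        return self.tensor * self.pipeline * self.data


@dataclass(frozen=True)
class Topology:
    """Either one aggregated shape, or separate prefill and decode pools."""
    serve: Optional[Shape] = None
    prefill: Optional[Shape] = None
    decode: Optional[Shape] = None

    def gpus(self) -> int:
        # Sum over whichever pools are declared.
        return sum(s.gpus() for s in (self.serve, self.prefill, self.decode) if s)


single_gpu = Topology(serve=Shape())
tensor_parallel = Topology(serve=Shape(tensor=4))
# Assumed split: 4 GPUs for prefill, 4 for decode.
disaggregated = Topology(prefill=Shape(tensor=4), decode=Shape(tensor=4))

for name, topo in [("single", single_gpu), ("tp4", tensor_parallel), ("p/d", disaggregated)]:
    print(name, topo.gpus())
```

A new strategy would be one more field on the shape rather than a new code path, which is the property the "emerging topology" card describes.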

A resource API for inference. Serving two roles.

Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.

Development & ML teams

Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService
routing: weighted, openai

kind: ModelDeployment
model: llama-4-70b
cluster: aws-us-east

kind: ModelDeployment
model: llama-4-70b
cluster: gcp-eu-west

kind: ModelEndpoint
target: vendor-api
type: managed

Platform teams

Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway
routes: all endpoints

kind: InferenceCluster
pools: h200, h100

kind: InferenceCluster
pools: tpu-v6e, a100

kind: InferenceCluster
pools: h100, l40s

Capabilities built for the fleet. Not just the cluster.

01 / Provisioning

Provision the fleet, or bring your own

Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.

Provisioning: Provision · GKE / EKS, or bring your own · any K8s

Modelplane installs & reconciles

InferenceCluster · ● reconciled
classes: h200-8x, h100-8x · node pools · gateway
✓ GPU operator & drivers
✓ Serving engines
✓ Inference gateway
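"Installed and continuously reconciled" refers to the usual control-loop pattern: compare the desired state with what is observed, then act on the difference. The sketch below illustrates that pattern in Python. It is an illustration only, not Modelplane's implementation; the component names and the `reconcile` and `apply` functions are hypothetical.

```python
def reconcile(desired: set[str], installed: set[str]) -> list[tuple[str, str]]:
    """Return the actions that move `installed` toward `desired`."""
    actions = [("install", c) for c in sorted(desired - installed)]
    actions += [("remove", c) for c in sorted(installed - desired)]
    return actions


def apply(installed: set[str], actions: list[tuple[str, str]]) -> set[str]:
    """Apply actions to a copy of the observed state and return the result."""
    result = set(installed)
    for verb, component in actions:
        if verb == "install":
            result.add(component)
        else:
            result.discard(component)
    return result


# Hypothetical component names for the serving stack listed above.
DESIRED = {"gpu-operator", "serving-engines", "inference-gateway"}

observed = {"gpu-operator"}          # a cluster with only the GPU operator
plan = reconcile(DESIRED, observed)  # what is missing
observed = apply(observed, plan)
print(plan)
print(reconcile(DESIRED, observed))  # a converged cluster yields no actions
```

Running the loop again on a converged cluster produces an empty plan, which is what makes it safe to run continuously.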

02 / Scheduling

One global pool of capacity

Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and Kubernetes Dynamic Resource Allocation (DRA), with support for advanced schedulers like KAI, Kueue, and Volcano.

Two-level scheduling

fleet scheduler · one global pool · tracks requirements ↔ capabilities
→ places replicas · aws-us-east · gcp-eu-west · azure-us2
→ cluster scheduler · DRA · KAI / Kueue / Volcano · bound
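The first level of this flow, matching requirements to capabilities across one pool, can be sketched as a small placement function. This is a simplified illustration under stated assumptions: a plain Python predicate stands in for a CEL selector, placement is greedy in list order, and the `Pool` type, the `place` function, and the free-GPU numbers are all hypothetical. The second level (the cluster's own scheduler) is not modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pool:
    """Hypothetical view of one node pool's advertised capabilities."""
    cluster: str
    accelerator: str
    free_gpus: int


# A Python predicate stands in for a CEL selector expression here.
Selector = Callable[[Pool], bool]


def place(replicas: int, gpus_per_replica: int,
          selector: Selector, pools: list[Pool]) -> dict[str, int]:
    """Greedy fleet-level placement: fill matching pools in list order."""
    placement: dict[str, int] = {}
    remaining = replicas
    for pool in pools:
        if remaining == 0:
            break
        if not selector(pool):
            continue  # requirements do not match this pool's capabilities
        fit = min(remaining, pool.free_gpus // gpus_per_replica)
        if fit > 0:
            placement[pool.cluster] = placement.get(pool.cluster, 0) + fit
            remaining -= fit
    if remaining:
        raise RuntimeError(f"{remaining} replica(s) unschedulable")
    return placement


POOLS = [
    Pool("aws-us-east", "h100", free_gpus=8),
    Pool("gcp-eu-west", "a100", free_gpus=8),
    Pool("azure-us2", "h100", free_gpus=16),
]

# Three replicas, each needing 4 H100s.
result = place(3, 4, lambda p: p.accelerator == "h100", POOLS)
print(result)
```

A real fleet scheduler would also weigh cost, locality, and policy; the point here is only that replicas of one model can land in several clusters when no single cluster has room.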

03 / Autoscaling

Scale replicas across clouds and regions

Every model exposes the standard Kubernetes scale subresource, so its replicas scale out across clusters, clouds, and regions, driven by hand or by HPA and KEDA.

Roadmap: scale-to-zero.

Autoscaling

load · spec.replicas 6
GCP · eu-west · AWS · us-east · Azure · us-east2
min 1 · max 8 replicas · scale subresource · HPA / KEDA
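For readers unfamiliar with how HPA arrives at a replica count, the sketch below reproduces the core formula from the Kubernetes Horizontal Pod Autoscaler documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min 1 and max 8 bounds shown above. The metric values are made up for illustration, and the sketch omits HPA's tolerance band and stabilization window. How Modelplane then spreads the resulting replicas across clusters is not shown.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     lo: int = 1, hi: int = 8) -> int:
    """Core HPA formula, clamped to [lo, hi]. Tolerance is not modeled."""
    raw = math.ceil(current * metric / target)
    return max(lo, min(hi, raw))


# Illustrative numbers: 4 replicas at 90 units of load each, target 60.
print(desired_replicas(current=4, metric=90.0, target=60.0))
# Load doubles again: the result is capped at the maximum.
print(desired_replicas(current=6, metric=120.0, target=60.0))
# Load drops to nothing: the result is held at the minimum.
print(desired_replicas(current=2, metric=0.0, target=60.0))
```

The floor of 1 in this sketch is consistent with scale-to-zero being listed as roadmap rather than available.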

04 / Routing

One service, many replicas and endpoints

A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.

Roadmap: automatic cross-cloud failover.

One service, many endpoints

ModelService · prod-llama · ● one OpenAI-compatible endpoint
ModelEndpoint · replica · aws-us-east
ModelEndpoint · replica · gcp-eu-west
ModelEndpoint · managed · vendor
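Weighted routing itself is a simple idea: each request goes to one endpoint, chosen with probability proportional to its weight. The Python sketch below shows that selection step only. The 60 / 30 / 10 weights and the `pick` function are hypothetical, and this is not how Modelplane's gateway is implemented.

```python
import random
from collections import Counter


def pick(weights: dict[str, int], rng: random.Random) -> str:
    """Choose one endpoint with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical weights: two self-hosted replicas plus a managed vendor share.
WEIGHTS = {"aws-us-east": 60, "gcp-eu-west": 30, "vendor-api": 10}

rng = random.Random(0)  # seeded so the run is repeatable
counts = Counter(pick(WEIGHTS, rng) for _ in range(10_000))
for name in WEIGHTS:
    print(name, counts[name])
```

A canary rollout is the same mechanism with a small weight on the new deployment, raised step by step as confidence grows; setting a weight to 0 takes an endpoint out of rotation without deleting it.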

05 / Platform native

The tools your team already knows

Platform teams operate Modelplane with the primitives they already own — Kubernetes APIs, GitOps workflows, Crossplane, Prometheus metrics, and RBAC. Declare inference clusters as code, manage them with ArgoCD or Flux, observe the fleet with your existing monitoring stack. No new control plane to learn. No separate operational model. Inference becomes just another workload your platform owns and governs.

Genuinely open. Yours to run and operate.

Modelplane is licensed under Apache 2.0 and open source end to end. The control plane lives entirely in your infrastructure and depends on nothing outside it, so no vendor can restrict, throttle, or revoke access. Donation to a neutral open source foundation is planned.

Built by the team behind Crossplane, the proven open source foundation for infrastructure control planes, trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB.

Build it in the open.

Modelplane is licensed under Apache 2.0 and developed in public, headed for a neutral foundation. If you run accelerators (any vendor, any cloud, on-prem), you can help prove out the fleet across real hardware and shape the API before it sets.

★ Star on GitHub · Join the community · Bring hardware to the open test fleet
