Why every LLM provider speaks the same shape — and why that matters for your product
Tonight's lab had you design your own API from scratch. But most of the LLM APIs you'll call in the real world aren't "yours" — they're built on a shared shape that OpenAI defined and the rest of the industry copied.
That shared shape is the OpenAI API standard. Mistral, Groq, vLLM, Ollama, together.ai, and Azure all serve a version of it, and Anthropic offers it through a compatibility layer. The consequence is huge: switching model providers is usually a config change, not a rewrite.
Our class server serves phi-4-mini through the OpenAI standard. The live demo further down this page is calling it right now.
The standard defines a handful of endpoints. Two matter most:
POST /v1/chat/completions: the main event
A conversation in, a reply out. Everything modern is built on this.
# Request
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer sk-...

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "What is an API?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

# Response
{
  "id": "chatcmpl-abc",
  "object": "chat.completion",
  "created": 1776721000,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 34, "total_tokens": 54}
}
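The request and response above are plain JSON, so they map directly onto Python dictionaries. Here is a minimal sketch of building one and reading the other. The helper names `build_chat_request` and `extract_reply` are hypothetical, invented for this illustration, and not part of any SDK.

```python
from typing import Any, Optional


def build_chat_request(
    model: str,
    user_text: str,
    system_text: Optional[str] = None,
    max_tokens: int = 100,
    temperature: float = 0.7,
) -> dict[str, Any]:
    """Assemble a /v1/chat/completions request body as a plain dict."""
    messages = []
    if system_text is not None:
        # The system message, when present, goes first.
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_reply(response: dict[str, Any]) -> tuple[str, str, int]:
    """Return (reply text, finish_reason, total_tokens) from a response body."""
    choice = response["choices"][0]
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        response["usage"]["total_tokens"],
    )
```

Because both helpers work on dictionaries only, they can be exercised without a network connection or an API key.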
GET /v1/models: discovery
Lists what the server can serve. Essential for clients that talk to multiple providers.
GET /v1/models
{
"object": "list",
"data": [
{"id": "gpt-4o", "object": "model", ...},
{"id": "gpt-4o-mini", "object": "model", ...}
]
}
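A client that talks to several providers can use this listing to decide which model to request. The sketch below works on an already-parsed response body. The function names `model_ids` and `pick_model` are hypothetical, and the preference-order strategy is just one possible choice.

```python
from typing import Any, Optional


def model_ids(models_response: dict[str, Any]) -> list[str]:
    """Collect the "id" of every entry in a GET /v1/models response body."""
    return [
        entry["id"]
        for entry in models_response.get("data", [])
        if entry.get("object") == "model"
    ]


def pick_model(preferred: list[str], available: list[str]) -> Optional[str]:
    """Return the first preferred model the server actually offers, else None."""
    for candidate in preferred:
        if candidate in available:
            return candidate
    return None
```

Returning `None` rather than raising leaves the decision about what to do next (try another provider, fail the request) to the caller.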
Other endpoints you'll see — /v1/embeddings, /v1/completions (legacy), /v1/audio/transcriptions, /v1/images/generations — all follow the same "model name + inputs + standard response envelope" pattern.
The Python and JS openai SDKs had massive adoption. Speaking the same protocol meant inheriting that ecosystem on day one.
Change base_url and an API key. Every application code path stays identical. Critical for fallbacks, cost optimization, and vendor risk.
LangChain, LlamaIndex, observability tools, evals — everything assumes the OpenAI shape. Your own FastAPI could, too.
Run vLLM or Ollama on your laptop with the same client code you'd use against GPT-4o. No mocking required.
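The "config change, not a rewrite" idea can be sketched as a small provider table plus a fallback loop. Everything named here (`Provider`, `PROVIDERS`, `complete_with_fallback`) is hypothetical. The base URLs are assumptions to check against each provider's documentation. The actual network call is injected as a function so the fallback logic stays independent of any SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    model: str
    api_key_env: str  # environment variable holding the key; "" if none is needed


# Assumed values for illustration; verify URLs and model names before use.
PROVIDERS = [
    Provider("class-server", "https://class.wize73.com/llm/v1", "/models/phi-4-mini", ""),
    Provider("openai", "https://api.openai.com/v1", "gpt-4o", "OPENAI_API_KEY"),
]


def complete_with_fallback(
    providers: list[Provider],
    call: Callable[[Provider, str], str],
    prompt: str,
) -> tuple[str, str]:
    """Try each provider in order; return (provider name, reply) from the first success."""
    errors = []
    for provider in providers:
        try:
            return provider.name, call(provider, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In real use, `call` would construct an OpenAI client from `provider.base_url` and send `prompt` to `provider.model`; only the table changes when a provider is added or removed.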
This is a real call to phi-4-mini, served by vLLM on our class server (guapo) through the same OpenAI API standard that GPT-4o uses. Only the base_url changes.
Endpoint: /llm/v1/chat/completions
Model: /models/phi-4-mini
If the demo returns HTTP 429, wait a few seconds and try again.
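The "wait and try again" advice for HTTP 429 can be automated with a small retry loop. This is a sketch under assumptions: `post_with_retry` is a hypothetical helper, the exponential delay schedule is one common choice rather than anything the class server requires, and the request itself is passed in as a `send` function returning `(status_code, parsed_body)`.

```python
import time
from typing import Any, Callable


def post_with_retry(
    send: Callable[[], tuple[int, dict[str, Any]]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call send() until it returns 200, backing off on 429 (rate limited)."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status != 429:
            # Only rate limiting is retried; other errors are reported immediately.
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... (with the default base_delay) before the next try.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

With requests, `send` could be a lambda that posts to the endpoint and returns `(r.status_code, r.json())`. Passing `sleep` as a parameter keeps the loop testable without real waiting.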
curl https://class.wize73.com/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/phi-4-mini",
    "messages": [{"role": "user", "content": "What is an API?"}],
    "max_tokens": 100
  }'
from openai import OpenAI

client = OpenAI(
    base_url="https://class.wize73.com/llm/v1",
    api_key="not-required",  # vLLM doesn't check it here
)

response = client.chat.completions.create(
    model="/models/phi-4-mini",
    messages=[
        {"role": "user", "content": "What is an API?"},
    ],
    max_tokens=100,
)

print(response.choices[0].message.content)
Swap base_url (along with the API key and model name) for OpenAI, Anthropic's compatibility layer, Groq, Mistral, or Ollama, and the rest of the code doesn't change.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://class.wize73.com/llm/v1",
  apiKey: "not-required",
});

const response = await client.chat.completions.create({
  model: "/models/phi-4-mini",
  messages: [{ role: "user", content: "What is an API?" }],
  max_tokens: 100,
});

console.log(response.choices[0].message.content);
import requests

r = requests.post(
    "https://class.wize73.com/llm/v1/chat/completions",
    json={
        "model": "/models/phi-4-mini",
        "messages": [{"role": "user", "content": "What is an API?"}],
        "max_tokens": 100,
    },
    timeout=60,
)
r.raise_for_status()  # fail loudly on HTTP errors such as 429

print(r.json()["choices"][0]["message"]["content"])
When you design your product's API tonight, you have a choice:
Your API accepts and returns OpenAI-shaped payloads. Every chat client, LangChain app, and tool in the ecosystem can integrate with zero adaptation.
Best for: chat interfaces, agent backends, anything where the end caller is already OpenAI-aware.
Your API takes product-meaningful inputs and returns product-meaningful outputs. The LLM behind it is an implementation detail.
Best for: domain-specific features (sentiment, extraction, classification) where clients don't want to reason about prompts or tokens.
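For the first option, returning an OpenAI-shaped payload mostly means wrapping your reply in the standard response envelope. Here is a minimal sketch, assuming a hypothetical helper `to_chat_completion`. The token counts are supplied by the caller, the `id` format is only illustrative, and `finish_reason` is hard-coded to "stop" for simplicity.

```python
import time
import uuid
from typing import Any


def to_chat_completion(
    model: str,
    reply_text: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> dict[str, Any]:
    """Wrap a reply string in a chat.completion-style response body."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",  # illustrative id format
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in seconds
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply_text},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

A client written against the OpenAI SDK reads `choices[0].message.content` from this body exactly as it would from any other provider.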
POST /v1/predict takes text and returns a sentiment label. The OpenAI call happens inside your service, invisible to the client. That insulation is the whole point of API-first design.
Swap the stub inference in api/main.py for a real call to the phi-4-mini endpoint. Your /v1/predict response shape doesn't change — just the implementation underneath.
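import requests

LABELS = ("positive", "negative", "neutral")


def run_inference(text: str) -> tuple[str, float, dict[str, float]]:
    """Classify sentiment by asking phi-4-mini for a single-word label."""
    prompt = (
        "Classify the sentiment of this text as positive, negative, or neutral. "
        f"Reply with only the single word. Text: {text}"
    )
    r = requests.post(
        "https://class.wize73.com/llm/v1/chat/completions",
        json={
            "model": "/models/phi-4-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 5,
            "temperature": 0,
        },
        timeout=30,
    )
    r.raise_for_status()  # surface HTTP errors (e.g. 429) instead of a confusing KeyError
    # Normalize the reply: trim whitespace, lowercase, drop trailing punctuation.
    raw = r.json()["choices"][0]["message"]["content"].strip().lower().strip(".!")
    # Fall back to "neutral" if the model replied with anything unexpected.
    label = raw if raw in LABELS else "neutral"
    # The chat endpoint returns text, not class probabilities, so these are
    # placeholder scores: 0.9 for the chosen label, the remainder spread to sum to 1.
    if label == "neutral":
        probs = {"neutral": 0.9, "positive": 0.05, "negative": 0.05}
    else:
        probs = {label: 0.9, "neutral": 0.1}
    return label, probs[label], probs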
Notice: your API's PredictResponse schema is unchanged. Clients don't know — or care — that there's an LLM behind the curtain now.
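One way to see why the schema stays put: the handler depends only on the shape of what the inference function returns, not on how it is computed. The sketch below uses plain dictionaries and hypothetical names (`stub_inference`, `predict`, and the field names `label`, `confidence`, `probabilities`). The lab's actual PredictResponse fields may differ.

```python
from typing import Any, Callable

# Any function with this signature can sit behind the endpoint.
InferenceFn = Callable[[str], tuple[str, float, dict[str, float]]]


def stub_inference(text: str) -> tuple[str, float, dict[str, float]]:
    """Placeholder model: always answers "neutral"."""
    return "neutral", 1.0, {"neutral": 1.0}


def predict(text: str, infer: InferenceFn = stub_inference) -> dict[str, Any]:
    """Build the response body for POST /v1/predict from any inference function."""
    label, confidence, probabilities = infer(text)
    # The keys below are the client-facing contract; they do not depend on `infer`.
    return {
        "label": label,
        "confidence": confidence,
        "probabilities": probabilities,
    }
```

Passing an LLM-backed function as `infer` changes the values in the response but not its keys, which is the property clients rely on.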