
The OpenAI API Standard

Why every LLM provider speaks the same shape — and why that matters for your product

1. Why This Page Exists

Tonight's lab had you design your own API from scratch. But most of the LLM APIs you'll call in the real world aren't "yours" — they're built on a shared shape that OpenAI defined and the rest of the industry copied.

That shared shape is the OpenAI API standard. Mistral, Groq, vLLM, Ollama, together.ai, and Azure all serve a version of it, and Anthropic offers it through a compatibility layer alongside its native API. The consequence is huge: for typical chat workloads, switching model providers is mostly a config change, not a rewrite.

Applied to this class: the vLLM instance running on guapo serves phi-4-mini through the OpenAI standard. The live demo further down this page is calling it right now.

2. The Core Shape

The standard defines a handful of endpoints. Two matter most:

POST /v1/chat/completions — the main event

A conversation in, a reply out. Everything modern is built on this.

# Request
POST /v1/chat/completions
Content-Type: application/json
Authorization: Bearer sk-...

{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "What is an API?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

# Response
{
  "id": "chatcmpl-abc",
  "object": "chat.completion",
  "created": 1776721000,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 34, "total_tokens": 54}
}
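As a minimal sketch of how a client might read that response, the snippet below parses the sample payload and pulls out the fields most applications use. The helper name `summarize_completion` and the stand-in reply text are assumptions made for illustration; the field names come from the example above.

```python
import json

# Sample response from the example above; the reply text is a stand-in
# because the original elides it.
RAW_RESPONSE = """
{
  "id": "chatcmpl-abc",
  "object": "chat.completion",
  "created": 1776721000,
  "model": "gpt-4o",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "A contract between programs."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 20, "completion_tokens": 34, "total_tokens": 54}
}
"""


def summarize_completion(payload: dict) -> dict:
    """Pull the commonly used fields out of a chat.completion payload."""
    choice = payload["choices"][0]
    return {
        "text": choice["message"]["content"],
        # finish_reason "length" means the reply was cut off by max_tokens.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": payload["usage"]["total_tokens"],
    }


summary = summarize_completion(json.loads(RAW_RESPONSE))
print(summary)
```

Because every compatible server returns this envelope, a helper like this should work unchanged whichever provider produced the payload.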

GET /v1/models — discovery

Lists what the server can serve. Essential for clients that talk to multiple providers.

GET /v1/models

{
  "object": "list",
  "data": [
    {"id": "gpt-4o", "object": "model", ...},
    {"id": "gpt-4o-mini", "object": "model", ...}
  ]
}
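One way a multi-provider client could use this listing is to pick the first model it prefers that the server actually offers. The sketch below assumes a hypothetical helper, `pick_model`, and works on an already-parsed payload shaped like the example above.

```python
def pick_model(models_payload: dict, preferred: list[str]) -> str:
    """Return the first preferred model the server lists, else the first listed model."""
    available = [entry["id"] for entry in models_payload.get("data", [])]
    if not available:
        raise ValueError("server lists no models")
    for name in preferred:
        if name in available:
            return name
    return available[0]


# Payload shaped like the GET /v1/models example, with the elided fields left out.
sample = {
    "object": "list",
    "data": [
        {"id": "gpt-4o", "object": "model"},
        {"id": "gpt-4o-mini", "object": "model"},
    ],
}

chosen = pick_model(sample, ["/models/phi-4-mini", "gpt-4o-mini"])
print(chosen)
```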

Other endpoints you'll see — /v1/embeddings, /v1/completions (legacy), /v1/audio/transcriptions, /v1/images/generations — all follow the same "model name + inputs + standard response envelope" pattern.

3. Why the Industry Copied It

Client libraries already existed

The Python and JS openai SDKs had massive adoption. Speaking the same protocol meant inheriting that ecosystem on day one.

Swap providers in one line

Change base_url, the API key, and the model name. The application code paths stay the same, which is critical for fallbacks, cost optimization, and managing vendor risk.

Tooling compounds

LangChain, LlamaIndex, observability tools, and eval frameworks all assume the OpenAI shape. Your own FastAPI service could, too.

Local dev = prod

Run vLLM or Ollama on your laptop with the same client code you'd use against GPT-4o. No mocking required.
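To make the "swap providers" point concrete, here is a minimal sketch that treats the provider as configuration. The `PROVIDERS` table and `build_chat_request` are hypothetical names; the class URL comes from this page, while the OpenAI and Ollama base URLs and the Ollama model name are assumptions to check against each provider's documentation.

```python
import os

# Provider settings live in data, not in code paths.
# Only the "class" entry is taken from this page; the rest are assumptions.
PROVIDERS = {
    "class": {
        "base_url": "https://class.wize73.com/llm/v1",
        "key_env": None,  # the class endpoint does not check a key
        "model": "/models/phi-4-mini",
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "key_env": "OPENAI_API_KEY",
        "model": "gpt-4o",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "key_env": None,
        "model": "llama3.2",  # assumed; use whichever model you have pulled
    },
}


def build_chat_request(provider: str, user_text: str) -> tuple[str, dict, dict]:
    """Return (url, headers, body) for a chat completion against the chosen provider."""
    cfg = PROVIDERS[provider]
    headers = {"Content-Type": "application/json"}
    if cfg["key_env"]:
        headers["Authorization"] = f"Bearer {os.environ.get(cfg['key_env'], '')}"
    body = {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 100,
    }
    return f"{cfg['base_url']}/chat/completions", headers, body


url, headers, body = build_chat_request("class", "What is an API?")
print(url)
```

The request body has the same structure for every entry; only the values read from the table differ.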

4. Live Demo: phi-4-mini on guapo

This is a real call to phi-4-mini, served by vLLM on our class server (guapo). It goes through the same OpenAI API standard that GPT-4o uses; only the base_url and model name change.

Endpoint: /llm/v1/chat/completions
Model: /models/phi-4-mini
Note: the live demo endpoint has a light rate limit (20 requests/min per IP) to prevent abuse while keeping it dead-simple to call. If you get HTTP 429, wait a few seconds and try again.
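The "wait and try again" advice can be sketched as a small retry loop. This is one possible approach, not the class server's prescribed client: `call_with_retry` is a hypothetical helper, and `send` is assumed to be any zero-argument callable that performs the request and returns `(status_code, parsed_json)`.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], tuple[int, dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Call send() until it returns a non-429 status, doubling the wait after each 429."""
    for attempt in range(max_attempts):
        status, payload = send()
        if status != 429:
            # Other errors (4xx, 5xx) are handed back for the caller to handle.
            return status, payload
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Offline demonstration: a fake sender that is rate limited twice, then succeeds.
responses = iter([
    (429, {}),
    (429, {}),
    (200, {"choices": [{"message": {"content": "ok"}}]}),
])
waits: list[float] = []
status, payload = call_with_retry(lambda: next(responses), sleep=waits.append)
print(status, waits)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in real use the default `time.sleep` applies.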

5. Calling It From Code

curl

curl https://class.wize73.com/llm/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/phi-4-mini",
    "messages": [{"role": "user", "content": "What is an API?"}],
    "max_tokens": 100
  }'

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://class.wize73.com/llm/v1",
    api_key="not-required",   # vLLM doesn't check it here
)

response = client.chat.completions.create(
    model="/models/phi-4-mini",
    messages=[
        {"role": "user", "content": "What is an API?"},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)

Swap base_url (along with the API key and model name) for OpenAI, Anthropic's compatibility layer, Groq, Mistral, or Ollama, and the rest of the code doesn't change.

JavaScript (openai SDK)

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "https://class.wize73.com/llm/v1",
    apiKey: "not-required",
});

const response = await client.chat.completions.create({
    model: "/models/phi-4-mini",
    messages: [{ role: "user", content: "What is an API?" }],
    max_tokens: 100,
});
console.log(response.choices[0].message.content);

Python (requests)

import requests

r = requests.post(
    "https://class.wize73.com/llm/v1/chat/completions",
    json={
        "model": "/models/phi-4-mini",
        "messages": [{"role": "user", "content": "What is an API?"}],
        "max_tokens": 100,
    },
    timeout=60,
)
r.raise_for_status()  # fail loudly on HTTP errors such as 429
print(r.json()["choices"][0]["message"]["content"])

6. What This Means for Your MVP

When you design your product's API tonight, you have a choice:

Option A — Speak OpenAI

Your API accepts and returns OpenAI-shaped payloads. Every chat client, LangChain app, and tool in the ecosystem can integrate with zero adaptation.

Best for: chat interfaces, agent backends, anything where the end caller is already OpenAI-aware.

Option B — Product-shaped API

Your API takes product-meaningful inputs and returns product-meaningful outputs. The LLM behind it is an implementation detail.

Best for: domain-specific features (sentiment, extraction, classification) where clients don't want to reason about prompts or tokens.

Tonight's lab uses Option B: POST /v1/predict takes text and returns a sentiment label. The OpenAI-style call happens inside your service, invisible to the client. That insulation is the whole point of API-first design.
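For contrast, choosing Option A would mean your service wraps its own output in the OpenAI response envelope. The sketch below shows roughly what that wrapping could look like; `to_chat_completion` is a hypothetical helper, and the token counts are assumed to be supplied by whatever produced the reply.

```python
import time
import uuid


def to_chat_completion(model: str, reply: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Wrap a reply string in an OpenAI-shaped chat.completion envelope."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


envelope = to_chat_completion("my-product-model", "An API is a contract.", 12, 6)
print(envelope["choices"][0]["message"]["content"])
```

A client written against the OpenAI SDK could then read `choices[0].message.content` from your service the same way it does from any other provider.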

7. Stretch Goal for Tonight's Lab

Swap the stub inference in api/main.py for a real call to the phi-4-mini endpoint. Your /v1/predict response shape doesn't change — just the implementation underneath.

import requests

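LABELS = ("positive", "negative", "neutral")

def run_inference(text: str) -> tuple[str, float, dict[str, float]]:
    # Ask for a one-word answer so the reply is easy to parse.
    prompt = (
        "Classify the sentiment of this text as positive, negative, or neutral. "
        f"Reply with only the single word. Text: {text}"
    )
    r = requests.post(
        "https://class.wize73.com/llm/v1/chat/completions",
        json={
            "model": "/models/phi-4-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 5,
            "temperature": 0,
        },
        timeout=30,
    )
    r.raise_for_status()  # surface HTTP errors (e.g. 429) instead of a confusing parse error below
    raw = r.json()["choices"][0]["message"]["content"].strip().lower().strip(".!\"'")
    # Fall back to neutral when the model replies with anything unexpected.
    label = raw if raw in LABELS else "neutral"
    # Placeholder scores: the reply carries no probabilities, so these are fixed values that sum to 1.
    probs = {name: 0.05 for name in LABELS}
    probs[label] = 0.9
    return label, probs[label], probs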

Notice: your API's PredictResponse schema is unchanged. Clients don't know — or care — that there's an LLM behind the curtain now.
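The "unchanged schema" claim can be illustrated with a small offline sketch. The `PredictResponse` fields here are hypothetical (the lab's real schema may differ), and `stub_inference` and `keyword_inference` are stand-ins for the stub and the LLM-backed version, so nothing calls the network.

```python
from dataclasses import asdict, dataclass
from typing import Callable

# Same return shape as run_inference above: (label, confidence, per-label scores).
Inference = Callable[[str], tuple[str, float, dict[str, float]]]


@dataclass
class PredictResponse:
    # Hypothetical fields; the lab's real schema may differ.
    label: str
    confidence: float
    probabilities: dict[str, float]


def stub_inference(text: str) -> tuple[str, float, dict[str, float]]:
    """Stand-in for the lab's stub: always answers neutral."""
    return "neutral", 1.0, {"positive": 0.0, "negative": 0.0, "neutral": 1.0}


def keyword_inference(text: str) -> tuple[str, float, dict[str, float]]:
    """Stand-in for the LLM-backed version, so the sketch runs offline."""
    label = "positive" if "love" in text.lower() else "negative"
    other = "negative" if label == "positive" else "positive"
    return label, 0.9, {label: 0.9, other: 0.05, "neutral": 0.05}


def predict(text: str, infer: Inference) -> dict:
    """Build the response the client sees; only `infer` varies between versions."""
    label, confidence, probabilities = infer(text)
    return asdict(PredictResponse(label, confidence, probabilities))


before = predict("I love this class", stub_inference)
after = predict("I love this class", keyword_inference)
print(sorted(before) == sorted(after))
```

Both calls return dictionaries with the same keys, which is all a client of /v1/predict depends on.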