LLM API routers, gateways, serving infrastructure, and model hosting. Tools that sit between your app and one or more language models.







Nebius is an AI-native GPU cloud platform that rents NVIDIA H100 through GB200 clusters with managed Slurm, Kubernetes and an inference API.

Ollama is a local LLM runtime that downloads, runs, and serves open models on your own hardware via a CLI and an OpenAI-compatible API.

Cloud service for developers to build with open-source AI, offering APIs, distributed training systems, and leading open-source models.

Enterprise-scale AI solutions for ultra-fast language processing and inference.

High-speed, cost-efficient generative AI for product innovation with advanced fine-tuning capabilities.

Cloud platform for running, deploying, and scaling machine learning models with ease.

Globally distributed GPU cloud for AI tasks.

Modal offers an easy way for developers to run code in the cloud with serverless compute and containerized environments.

Unified API and marketplace for the best LLMs at the best prices for any prompt.

Unified compute platform for scalable AI and Python applications using Ray

Universal LLM proxy — call 100+ LLMs (OpenAI, Anthropic, Bedrock, Vertex) with one API.

Voltage Park is a GPU cloud platform that rents NVIDIA H100 and Blackwell clusters on-demand or on dedicated reserve for AI training and inference.

DeepInfra is an inference cloud that serves open-weight AI models — Llama, DeepSeek, Qwen, Mistral — behind a pay-per-token, OpenAI-compatible API.

Platform for software engineers to build AI applications.
An LLM gateway is a single API that routes requests to many language models behind one interface, handling keys, fallbacks, and cost tracking. OpenRouter is a common example, letting you switch models without rewriting code. It simplifies comparing providers and avoids lock-in to one vendor.
Groq is known for very low latency using custom hardware, and Fireworks and Together also optimize open-model serving for speed. The fastest choice depends on the model and request pattern, so benchmark on your own prompts. Latency, throughput, and cost trade off differently across providers.
Together AI, Fireworks, and Replicate host open models behind an API so you avoid managing GPUs, while RunPod and Modal give you raw compute to run them yourself. For local use, Ollama runs models on your own machine. Choose based on scale, control, and whether you want managed or self-operated serving.
A gateway like OpenRouter routes requests across providers through one API but does not host the models itself. A serving platform like Fireworks or Together runs the models and returns results. Many teams use a gateway in front of one or more serving platforms to balance cost and reliability.
Tools like Ollama download and run open models on your own hardware with a simple command, exposing a local API your app can call. Local serving keeps data private and removes per-call cost, but it is limited by your GPU or CPU. It suits development, privacy-sensitive use, and smaller models.
Route cheaper requests to smaller or open models, cache repeated responses, and trim prompt length. A gateway like OpenRouter makes it easy to switch models by price and performance, and open-model hosts like Together often cost less than frontier APIs. Match each task to the smallest model that meets quality.
Receive weekly updates so you can stay up-to-date with the world of AI
Receive weekly updates so you can stay up-to-date with the world of AI