LLM Gateways & Serving · Reviewed June 16, 2026

DeepInfra

DeepInfra is an inference cloud that serves open-weight AI models — Llama, DeepSeek, Qwen, Mistral — behind a pay-per-token, OpenAI-compatible API.

Pricing
Paid
Rating
4.46/ 5 · 167 reviews
Last reviewed
June 16, 2026
Channels
DeepInfra website homepage screenshot showing the product
01

Overview

DeepInfra

DeepInfra is an inference cloud that serves open-weight AI models behind a pay-per-token, OpenAI-compatible API. DeepInfra runs LLMs, embeddings, speech-to-text, and image and video models on its own GPU infrastructure, so developers call models like Llama, DeepSeek, Qwen, and Mistral without managing servers. As of 2026 it bills per token with no minimum commitment, and also rents dedicated GPU instances and reserved clusters for teams that need guaranteed throughput. DeepInfra is listed as a supported Hugging Face inference provider.

Production credibility: Founded September 2022 in Palo Alto by Nikola Borisov, Yessenzhar Kanapin, and Georgios Papoutsis, a team that previously built the backend for the 200M-user imo messaging app. Raised a $107M Series B in 2026 with participation from NVIDIA, Felicis, Samsung Next, and Supermicro, bringing total funding to roughly $136M; the prior $18M Series A closed in April 2025. Carries SOC 2 and ISO 27001 compliance with a stated zero-data-retention policy.

Key Features

  • OpenAI-compatible REST API for 50+ open-weight models
  • Per-token, pay-as-you-go billing with no minimum commitment
  • Text models including Llama, DeepSeek, Qwen, Mistral, Gemma, and Kimi
  • Embeddings, speech-to-text, and image/video generation endpoints
  • Dedicated GPU instances and reserved clusters for high throughput
  • SOC 2 and ISO 27001 compliance with zero-data-retention policy

Ideal Use Case

Developers and teams running open-weight models in production who want low per-token cost and an OpenAI-compatible API without standing up their own GPU fleet — and the option to reserve dedicated capacity as volume grows.

How DeepInfra differentiates

Together AI and Fireworks are the closest comparisons — all three are pay-per-token inference clouds for open models with OpenAI-compatible APIs. DeepInfra competes mainly on low per-model pricing (small models start in cents per million tokens) and breadth of model and modality coverage, backed by a team that ran infrastructure at consumer scale. There is no standing free tier; you add a card and pay per token, which suits production workloads more than casual experimentation.

FAQ

Q: What is DeepInfra? A: DeepInfra is an inference cloud that serves open-weight models — LLMs, embeddings, speech-to-text, and image generation — through a pay-per-token, OpenAI-compatible API, plus dedicated GPU instances for high-throughput teams.

Q: DeepInfra vs Together AI? A: Both are pay-per-token inference clouds for open models with OpenAI-compatible APIs; they compete largely on per-model pricing, latency, and model selection rather than feature category.

Q: Is DeepInfra open source? A: DeepInfra itself is a commercial platform (not open source) that specializes in serving open-weight models. It was founded in 2022 by ex-imo infrastructure engineers and is backed by investors including NVIDIA.

Q: Does DeepInfra have a free tier? A: No standing free tier — you add a card or pre-pay and are billed per token or request, though prices for small models are very low.

tl;dr

DeepInfra is an inference cloud that serves open-weight models — Llama, DeepSeek, Qwen, Mistral, and more — behind a pay-per-token, OpenAI-compatible API, with dedicated GPU options for scale. Founded by ex-imo infra engineers; $107M Series B with NVIDIA. A lower-cost alternative to Together AI and Fireworks for running open models in production.

02

Why Use DeepInfra

Rating
4.46
Across 167 verified reviews
Saved
170
By ToolDirectory readers
Pricing
Paid
Paid · publisher-listed
Listed
Since 2026
Continuously re-reviewed by editors
Category
LLM Gateways & Serving
Primary listing
Verified by editors during the most recent review · ToolDirectory.AI
03

FAQ

Q.
A.
Q: What is DeepInfra?
A: DeepInfra is an inference cloud that serves open-weight models — LLMs, embeddings, speech-to-text, and image generation — through a pay-per-token, OpenAI-compatible API, plus dedicated GPU instances for high-throughput teams.
Q.
A.
Q: DeepInfra vs Together AI?
A: Both are pay-per-token inference clouds for open models with OpenAI-compatible APIs; they compete largely on per-model pricing, latency, and model selection rather than feature category.
Q.
A.
Q: Is DeepInfra open source?
A: DeepInfra itself is a commercial platform (not open source) that specializes in serving open-weight models. It was founded in 2022 by ex-imo infrastructure engineers and is backed by investors including NVIDIA.
Q.
A.
Q: Does DeepInfra have a free tier?
A: No standing free tier — you add a card or pre-pay and are billed per token or request, though prices for small models are very low.
DeepInfra website homepage screenshot showing the product
04

User Reviews

4.46
Out of 5 · 167 ratings
5
104
4
44
3
13
2
4
1
2
05

Similar Tools

Sign up for our newsletter

Receive weekly updates so you can stay up-to-date with the world of AI