AI Inference

The right inference infrastructure for every model size, at a cost you can forecast

Run frontier models on dedicated GPU servers. Run leaner models on high-performance CPU or Apple Silicon. Either way, you’re on Summit’s private, managed infrastructure instead of a shared public cloud.

Compare the two paths

Talk to an engineer

Dedicated GPUs

Reserved for you, never pooled or rationed by quota.

Flat pricing

A known monthly rate on dedicated hardware, easy to forecast.

Private tenancy

Single-tenant infrastructure with no shared neighbors.

U.S.-based support

Remote Hands and engineers who pick up the phone.

Size the infrastructure

Why inference infrastructure isn’t one-size-fits-all

The hardware a 70B model needs and the hardware a quantized 7B model needs are not the same. Paying for one when you need the other is where budgets slip.

Large models need GPUs

Anything in the 70B-plus range needs GPU memory and parallel compute to serve at a usable speed. CPU inference isn’t practical at that scale.

Small models don’t

A quantized 7B to 13B model runs efficiently on a modern CPU or Apple Silicon. Renting GPUs for it means paying for capacity you’ll never touch.
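As a rough illustration of why those two cases land on different hardware, the sketch below estimates the memory needed just to hold a model's weights: parameter count times bits per weight, divided by eight. The function name and the example precisions (16-bit and 4-bit) are assumptions chosen for illustration. The figure leaves out KV cache, activations, and runtime overhead, so real requirements will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes each / (1e9 bytes per GB)
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    # Example sizes and precisions are illustrative assumptions.
    for name, params, bits in [
        ("70B at 16-bit", 70, 16),
        ("13B at 4-bit", 13, 4),
        ("7B at 4-bit", 7, 4),
    ]:
        print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
```

Under these assumptions, a 70B model at 16-bit needs about 140 GB for weights alone, more than a single 80 GB accelerator can hold, while a 4-bit 7B model needs about 3.5 GB, which fits comfortably in ordinary system memory.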

Cloud pricing moves on you

Public cloud GPU availability comes and goes, and per-hour rates turn cost forecasting into a guessing game. Bills climb fast once you’re in production.

Right hardware, right cost

Match the infrastructure to the model and you get lower cost, better latency, and nothing sitting idle.

GPU and non-GPU paths

Choose the right fit for your model

Two paths, one private infrastructure. Pick based on the model you’re serving, and change your mind later if the workload changes.

GPU servers

Dedicated GPUs, private cloud, Kubernetes

Best for

LLM API serving and copilot backends

Real-time inference with strict SLAs

Multimodal pipelines that pair vision and language

Non-GPU servers

Mac hosting, bare-metal servers

Best for

Internal tools, document summarization, and classification

Edge-sensitive or compliance-driven deployments

Teams replacing SaaS AI APIs with self-hosted models

Decide the tier

How to choose

A quick gut check before you talk to us. If you’re between two tiers, that’s what our engineers are for.

Go GPU when

Your model is 30B parameters or larger

Latency needs to stay sub-second

Throughput is high

You’re serving many concurrent users

Go non-GPU when

Your model is 7B to 13B and quantized

The workload is async or batch

Cost per token matters more than raw speed

Data locality or compliance drives placement

Both paths run on Summit’s private infrastructure with dedicated hardware and single-tenant isolation. Our engineers will size the right config for your workload before you commit to anything.
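For teams that want to run the gut check against a list of workloads, the checklist can be sketched as a small function. This is one simplified reading, not Summit's sizing method: it assumes GPU is the call at 30B parameters or more, or when sub-second latency is combined with high concurrency, and that non-GPU fits a quantized 7B to 13B model on async or batch work. The `Workload` fields and the `suggest_tier` name are hypothetical, and anything outside those two cases is treated as borderline.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    params_billions: float         # model size in billions of parameters
    quantized: bool                # True if the model is served quantized
    needs_subsecond_latency: bool  # responses must come back in under a second
    high_concurrency: bool         # many simultaneous users or high throughput
    batch_or_async: bool           # work can be queued rather than served live


def suggest_tier(w: Workload) -> str:
    """Return 'gpu', 'non-gpu', or 'talk to an engineer' for borderline cases."""
    # GPU signals (assumed reading): 30B+ parameters, or tight latency under load.
    if w.params_billions >= 30 or (w.needs_subsecond_latency and w.high_concurrency):
        return "gpu"
    # Non-GPU signals: a quantized 7B to 13B model on async or batch work.
    if 7 <= w.params_billions <= 13 and w.quantized and w.batch_or_async:
        return "non-gpu"
    # Everything else sits between the two tiers.
    return "talk to an engineer"


if __name__ == "__main__":
    print(suggest_tier(Workload(70, False, True, True, False)))
    print(suggest_tier(Workload(7, True, False, False, True)))
    print(suggest_tier(Workload(20, True, False, False, True)))
```

The third example, a 20B model, falls between the two checklists, which is the case the sketch hands back to a person instead of guessing.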

Inference workloads

Where teams run inference on Summit

The same private infrastructure covers heavy real-time serving and lean, cost-sensitive jobs. Here’s how the two paths break down in practice.

On GPU servers

LLM API backends

Copilots

Real-time fraud scoring

Computer vision

Multimodal pipelines

On non-GPU servers

Document summarization

Internal chatbots

Classification

RAG on smaller corpora

Compliance-sensitive inference

Why Summit

Why teams pick Summit for inference

You get the performance of dedicated hardware, the economics of owning your capacity, and people who help you get the sizing right.

Private, single-tenant hardware

Your GPUs and servers are yours. No shared tenants competing for the same silicon.

Pricing you can forecast

Flat monthly rates on dedicated hardware. No per-hour meter to reconcile at the end of the quarter.

U.S.-based support and Remote Hands

When something needs hands on the hardware, you reach engineers in the U.S., not a queue on another continent.

Compliance-ready by design

SOC 2 Type II (AT-101), plus HIPAA and PCI DSS handled through BAAs and shared responsibility frameworks.

22 data centers, 6 continents

Place inference close to your users or your data, wherever that needs to be.

Right-sizing from real engineers

Tell us the models and the workload. We’ll spec the config, GPU or non-GPU, so you’re not overbuying capacity.

Getting started

How it works

Four steps from first conversation to a running deployment you can scale.

1. Share your models and workload

Model sizes, latency targets, concurrency, and any compliance needs.

2. We size the config

Our engineers recommend GPU or non-GPU and the exact spec to match.

3. Deploy on dedicated hardware

Your infrastructure stands up on Summit’s private, single-tenant footprint.

4. Scale with ongoing support

Tune, grow, and lean on U.S.-based support as the workload changes.

Already running inference on public cloud?

Send us your current setup and usage. We’ll model the cost difference against dedicated Summit hardware so you can see the number before you move anything.

Model my cost difference

Cost comparison

Summit dedicated vs public cloud on-demand GPUs

The difference is less about a single hourly rate and more about how you pay, what you’re guaranteed, and who helps when it breaks.

	AWS p4d / p5 on-demand	Summit dedicated
Pricing model	Per-hour and metered, variable month to month	Flat monthly on hardware that’s yours
Cost forecasting	Hard to predict once you’re at production volume	Known in advance, easy to budget
GPU availability	Subject to regional capacity and quotas	Reserved and dedicated to you
Tenancy	Shared by default	Single tenant, private
Support	Tiered, ticket-based	U.S.-based engineers and Remote Hands
Right-sizing	Your team’s problem	Engineers spec the config with you
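The pricing-model row comes down to simple arithmetic: a metered bill scales with hours used, a flat bill does not, so there is a usage level above which flat is cheaper. The sketch below shows that break-even calculation. The rates are round placeholder numbers chosen to make the math easy to follow. They are not AWS or Summit prices, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours per year / 12


def on_demand_monthly_cost(hourly_rate: float, hours_used: float) -> float:
    """Metered cost for one instance over a month."""
    return hourly_rate * hours_used


def break_even_hours(flat_monthly_rate: float, hourly_rate: float) -> float:
    """Hours of use per month above which the flat rate costs less."""
    return flat_monthly_rate / hourly_rate


if __name__ == "__main__":
    # PLACEHOLDER rates for illustration only; substitute real quotes.
    hourly_rate = 10.0
    flat_monthly_rate = 4000.0

    hours = break_even_hours(flat_monthly_rate, hourly_rate)
    always_on = on_demand_monthly_cost(hourly_rate, HOURS_PER_MONTH)
    print(f"Break-even at {hours:.0f} hours per month")
    print(f"Always-on metered cost: {always_on:.0f} vs flat {flat_monthly_rate:.0f}")
```

With these placeholder rates the break-even is 400 hours a month, a little over half of an always-on month. Production inference endpoints usually run around the clock, which is why the comparison tends to shift once a workload leaves the experimentation stage.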

“We were having trouble with our apps not sending data. We called up Summit, and they spent 2 hours talking us through it. It was a simple command line change and they fixed it. AWS won’t do that.”

Cloud exit reading

Thinking about moving off public cloud?

The cost case for owning your inference capacity is the same case teams are making for their whole stack.

Case study

Why companies like 37signals are leaving the cloud

37signals (makers of Basecamp and HEY) on the economics behind their cloud exit.

Read the story ›

Solution

Cloud repatriation

Move workloads off public cloud onto dedicated, managed infrastructure.

Explore repatriation ›

Talk to us

Get a workload sizing

Tell us your models and let our engineers recommend the right config.

Start the conversation ›

Common questions

Inference infrastructure FAQ

That’s what our engineers do first. Share your model sizes, latency targets, concurrency, and any compliance requirements, and we’ll recommend the tier and the exact spec. If a workload sits on the line between the two, we’ll tell you that too.

Quantized models in the 7B to 13B range run efficiently on modern CPUs or Apple Silicon, especially for async and batch workloads where cost per token matters more than raw speed. Above roughly 30B, or when you need sub-second latency at high concurrency, GPUs are the right call.

Public cloud bills you per hour on shared, quota-limited capacity. Summit gives you dedicated hardware at a flat monthly rate, so cost is predictable and you’re not paying a premium for on-demand availability. Send us your current usage and we’ll model the difference for your specific workload.

Yes. Both paths run on Summit’s single-tenant infrastructure. Your GPUs and servers are dedicated to you, with no shared neighbors on the hardware.

We run SOC 2 Type II (AT-101), and handle HIPAA and PCI DSS through BAAs and shared responsibility frameworks. With 22 data centers across 6 continents, we can also place inference where data residency requires.

GPU deployments run on private cloud with Kubernetes, so you can orchestrate serving the way your team already works. Non-GPU workloads run on Mac hosting or bare-metal servers depending on the fit.

Get started

Talk to an engineer about your inference workload

Tell us the models you’re serving and what you need from them. We’ll size the right config and model the cost.