The right inference infrastructure for every model size, at a cost you can forecast
Run frontier models on dedicated GPU servers. Run leaner models on high-performance CPU or Apple Silicon. Either way, you’re on Summit’s private, managed infrastructure instead of a shared public cloud.
Why inference infrastructure isn’t one-size-fits-all
The hardware a 70B model needs and the hardware a quantized 7B model needs are not the same. Paying for one when you need the other is where budgets slip.
Large models need GPUs
Anything in the 70B-plus range needs GPU memory and parallel compute to serve at a usable speed. CPU inference isn’t practical at that scale.
Small models don’t
A quantized 7B to 13B model runs efficiently on a modern CPU or Apple Silicon. Renting GPUs for it means paying for capacity you’ll never touch.
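As a rough illustration of why the two ends of the range land on different hardware, the sketch below estimates the memory taken by model weights alone. The function name and the sample precisions (16-bit for the large model, 4-bit quantization for the small one) are assumptions chosen for the example, not a sizing method from Summit. Real deployments also need room for KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    One billion parameters at 8 bits each is about 1 GB, so the estimate
    scales linearly with parameter count and bits per weight.
    """
    return params_billion * bits_per_weight / 8


# Hypothetical comparison: an unquantized 70B model vs a 4-bit 7B model.
print(weight_memory_gb(70, 16))  # 140.0 GB of weights
print(weight_memory_gb(7, 4))    # 3.5 GB of weights
```

A 140 GB weight footprint has to be spread across several GPUs, while 3.5 GB fits comfortably in the memory of an ordinary server or a Mac.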
Cloud pricing moves on you
Public cloud GPU availability comes and goes, and per-hour rates turn cost forecasting into a guessing game. Bills climb fast once you’re in production.
Right hardware, right cost
Match the infrastructure to the model and you get lower cost, better latency, and nothing sitting idle.
Choose the right fit for your model
Two paths, one private infrastructure. Pick based on the model you’re serving, and change your mind later if the workload changes.
GPU servers
Non-GPU servers
How to choose
A quick gut check before you talk to us. If you’re between two tiers, that’s what our engineers are for.
Go GPU when
Go non-GPU when
Both paths run on Summit’s private infrastructure with dedicated hardware and single-tenant isolation. Our engineers will size the right config for your workload before you commit to anything.
Where teams run inference on Summit
The same private infrastructure covers heavy real-time serving and lean, cost-sensitive jobs. Here’s how the two paths break down in practice.
Why teams pick Summit for inference
You get the performance of dedicated hardware, the economics of owning your capacity, and people who help you get the sizing right.
Private, single-tenant hardware
Your GPUs and servers are yours. No shared tenants competing for the same silicon.
Pricing you can forecast
Flat monthly rates on dedicated hardware. No per-hour meter to reconcile at the end of the quarter.
U.S.-based support and Remote Hands
When something needs hands on the hardware, you reach engineers in the U.S., not a queue on another continent.
Compliance-ready by design
SOC 2 Type II (AT-101), plus HIPAA and PCI DSS handled through BAAs and shared responsibility frameworks.
22 data centers, 6 continents
Place inference close to your users or your data, wherever that needs to be.
Right-sizing from real engineers
Tell us the models and the workload. We’ll spec the config, GPU or non-GPU, so you’re not overbuying capacity.
How it works
Four steps from first conversation to a running deployment you can scale.
Share your models and workload
Model sizes, latency targets, concurrency, and any compliance needs.
We size the config
Our engineers recommend GPU or non-GPU and the exact spec to match.
Deploy on dedicated hardware
Your infrastructure stands up on Summit’s private, single-tenant footprint.
Scale with ongoing support
Tune, grow, and lean on U.S.-based support as the workload changes.
Already running inference on public cloud?
Send us your current setup and usage. We’ll model the cost difference against dedicated Summit hardware so you can see the number before you move anything.
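For a rough sense of what that comparison involves, here is a minimal sketch of the arithmetic. Every number in it is a made-up placeholder, not an AWS or Summit price, and the function names are hypothetical. It assumes an average month of 730 hours and ignores storage, egress, and support costs.

```python
HOURS_PER_MONTH = 730.0  # average month: 8760 hours / 12


def on_demand_monthly_cost(hourly_rate: float, instances: int,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Metered cost of running `instances` machines for `hours` each."""
    return hourly_rate * instances * hours


def break_even_hours(flat_monthly_rate: float, hourly_rate: float,
                     instances: int) -> float:
    """Hours per instance per month at which metered cost equals the flat rate."""
    return flat_monthly_rate / (hourly_rate * instances)


# Placeholder inputs for illustration only.
hourly_rate = 10.0          # hypothetical on-demand rate per instance-hour
instances = 2               # hypothetical always-on fleet size
flat_monthly_rate = 9000.0  # hypothetical flat monthly rate

print(on_demand_monthly_cost(hourly_rate, instances))               # 14600.0
print(break_even_hours(flat_monthly_rate, hourly_rate, instances))  # 450.0
```

With these placeholder inputs, the flat rate comes out ahead once each instance runs more than 450 hours a month. An always-on production service runs about 730.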
Summit dedicated vs public cloud on-demand GPUs
The difference is less about a single hourly rate and more about how you pay, what you’re guaranteed, and who helps when it breaks.
| | AWS p4d / p5 on-demand | Summit dedicated |
|---|---|---|
| Pricing model | Per-hour and metered, variable month to month | Flat monthly on hardware that’s yours |
| Cost forecasting | Hard to predict once you’re at production volume | Known in advance, easy to budget |
| GPU availability | Subject to regional capacity and quotas | Reserved and dedicated to you |
| Tenancy | Shared by default | Single tenant, private |
| Support | Tiered, ticket-based | U.S.-based engineers and Remote Hands |
| Right-sizing | Your team’s problem | Engineers spec the config with you |
We were having trouble with our apps not sending data. We called up Summit, and they spent 2 hours talking us through it. It was a simple command line change and they fixed it. AWS won’t do that.
Thinking about moving off public cloud?
The cost case for owning your inference capacity is the same case teams are making for their whole stack.
Why companies like 37signals are leaving the cloud
37signals (makers of Basecamp and HEY) on the economics behind their cloud exit.
Read the story ›
Cloud repatriation
Move workloads off public cloud onto dedicated, managed infrastructure.
Explore repatriation ›
Get a workload sizing
Tell us your models and let our engineers recommend the right config.
Start the conversation ›
Inference infrastructure FAQ
That’s what our engineers do first. Share your model sizes, latency targets, concurrency, and any compliance requirements, and we’ll recommend the tier and the exact spec. If a workload sits on the line between the two, we’ll tell you that too.
Quantized models in the 7B to 13B range run efficiently on modern CPUs or Apple Silicon, especially for async and batch workloads where cost per token matters more than raw speed. Above roughly 30B, or when you need sub-second latency at high concurrency, GPUs are the right call.
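The rule of thumb in that answer can be written down as a small helper. This is only a sketch: the function name, its parameters, and the "borderline" outcome are assumptions for illustration, and treating quantized models below 7B as non-GPU is an extrapolation from the stated 7B to 13B range. The thresholds themselves (13B and roughly 30B) come from the answer above.

```python
def suggest_tier(params_billion: float, quantized: bool,
                 needs_subsecond_latency: bool, high_concurrency: bool) -> str:
    """Return 'gpu', 'non-gpu', or 'borderline' for a workload.

    A first-pass gut check only; it is not a substitute for sizing
    against real latency targets and traffic.
    """
    # Above roughly 30B, or sub-second latency at high concurrency: GPU.
    if params_billion > 30 or (needs_subsecond_latency and high_concurrency):
        return "gpu"
    # Quantized models up to 13B run efficiently on CPU or Apple Silicon.
    if quantized and params_billion <= 13:
        return "non-gpu"
    # Everything else sits on the line between the two tiers.
    return "borderline"


print(suggest_tier(70, False, False, False))  # gpu
print(suggest_tier(7, True, False, False))    # non-gpu
print(suggest_tier(20, True, False, False))   # borderline
```

A "borderline" result is the case where the sizing conversation matters most.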
Public cloud bills you per hour on shared, quota-limited capacity. Summit gives you dedicated hardware at a flat monthly rate, so cost is predictable and you’re not paying a premium for on-demand availability. Send us your current usage and we’ll model the difference for your specific workload.
Yes. Both paths run on Summit’s single-tenant infrastructure. Your GPUs and servers are dedicated to you, with no shared neighbors on the hardware.
We run SOC 2 Type II (AT-101), and handle HIPAA and PCI DSS through BAAs and shared responsibility frameworks. With 22 data centers across 6 continents, we can also place inference where data residency requires.
GPU deployments run on private cloud with Kubernetes, so you can orchestrate serving the way your team already works. Non-GPU workloads run on Mac hosting or bare-metal servers depending on the fit.
Talk to an engineer about your inference workload
Tell us the models you’re serving and what you need from them. We’ll size the right config and model the cost.