Cloud GPU Guide

Firstly, which AI tools are worth running? #

Three good places to start are:

  • Run Llama 2 70B
  • Run Stable Diffusion on your own GPU (locally, or on a rented GPU)
  • Run Whisper on your own GPU (locally, or on a rented GPU)
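
To give a sense of how little code the “run it yourself” path involves, here’s a minimal Stable Diffusion sketch using the Hugging Face diffusers library. The checkpoint name, prompt, and filename are illustrative assumptions, not anything this guide prescribes:

```python
# Minimal text-to-image sketch with Hugging Face diffusers
# (assumes an NVIDIA GPU and `pip install torch diffusers transformers accelerate`).
import torch
from diffusers import StableDiffusionPipeline

# "runwayml/stable-diffusion-v1-5" is one commonly used checkpoint (an assumption here).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # fp16 keeps VRAM usage comfortably under 16GB
)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse, photorealistic").images[0]
image.save("astronaut.png")
```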

So, which GPUs should I be using? #

If you’re using cloud GPUs:

  • If you want to run Llama 2 70B
    • Something with roughly 50-60GB+ of VRAM for the GPTQ 4-bit version (see the rules of thumb below), e.g. 1x A100 80GB or 1x H100
  • If you want to run Stable Diffusion
    • 1x 4090 if you want a balance of price and performance, or 1x 3090 if you want a lower price (it can run on cheaper GPUs too, and you could use 1x H100 if you wanted to go savage with it)
  • If you want to run Whisper
    • Same recommendations as Stable Diffusion. Though whisper-large can run on cards with lower VRAM, most of the clouds don’t have those cards, and the 4090 or 3090 will work well. You can run it on a CPU, too (there’s a minimal sketch of that below).
  • If you want to fine-tune a large LLM
    • An H100 cluster or A100 cluster
  • If you want to train a large LLM
    • A large H100 cluster

More info here.
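
Since Whisper is the least demanding of the three workloads above, here’s a minimal sketch of running the open-source openai-whisper package locally. The model size and audio filename are assumptions for illustration, and ffmpeg needs to be installed for audio decoding:

```python
# Minimal local transcription sketch with the open-source Whisper package
# (`pip install openai-whisper`; ffmpeg must be on PATH).
import whisper

# "large" benefits from a 12GB+ VRAM GPU; "base"/"small" run fine on a CPU, just slower.
model = whisper.load_model("base")

result = model.transcribe("meeting.mp3")  # hypothetical audio file
print(result["text"])
```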

If you’re using a local GPU:

  • Same as above, but you probably won’t be able to train or fine-tune an LLM!
  • Most of the open LLMs have versions available that can run on lower-VRAM cards; e.g., a GGML version of Llama 2 7B will even run on most CPUs (see the sketch after this list).
  • Thanks to Bruce for prompting me to add this section.
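
Here’s a minimal sketch of that CPU-friendly route, using llama-cpp-python with a 4-bit GGML quantization of Llama 2 7B. The model path and prompt are assumptions; you’d download the GGML file separately:

```python
# Minimal sketch of running a 4-bit GGML quant of Llama 2 7B on CPU
# (`pip install llama-cpp-python`; the model file is downloaded separately).
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b.ggmlv3.q4_0.bin")  # hypothetical local path

output = llm(
    "Q: What GPU do I need to run Stable Diffusion? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```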

Should I run these models locally or with a cloud GPU? #

  • Both are reasonable choices!
  • For running models locally, see this wiki.
    • You can run quite a few things even on surprisingly weak hardware.
    • If you have powerful local hardware, this is a fun option to play around with
  • For running models in the cloud, Runpod’s templates are the quickest start
  • The easiest option is of course using a hosted instance: DreamStudio, RunDiffusion, or Playground AI for Stable Diffusion; ChatGPT for an LLM; this for Falcon-40B (currently offline) and Falcon-40B-Uncensored; this for Falcon-40B-Instruct; OpenAI’s API for Whisper (sketched at the end of this section); and so on

In short, if you want to run them locally, run them locally; if you want to run them in the cloud, run them in the cloud. It’s your preference, based on what GPU you have, how much time you want to spend, how much money you want to spend, and what seems more fun to you.
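
For the hosted route, here’s a minimal sketch of OpenAI’s Whisper API, assuming the 2023-era openai Python SDK (0.x), an OPENAI_API_KEY set in the environment, and a hypothetical local audio file:

```python
# Minimal hosted-transcription sketch using OpenAI's API
# (`pip install openai`, 0.x SDK; reads the key from the OPENAI_API_KEY environment variable).
import openai

with open("meeting.mp3", "rb") as audio_file:  # hypothetical audio file
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])
```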

Which GPU cloud should I use? #

  • If you need a huge number of A100s/H100s: talk to Oracle, FluidStack, Lambda Labs, and maybe a few others. Capacity for large quantities is very low, though, especially for H100s, based on a couple of cloud founders/execs I’ve talked with.
  • If you need a couple of A100s or H100s: Runpod, or perhaps Tensordock or Latitude.
  • If you need 1x H100: Runpod (FluidStack and Lambda have been out of on-demand capacity for quite a while).
  • If you need cheap 3090s, 4090s, or A6000s: Tensordock.
  • If you need Stable Diffusion inference only: Salad.
  • If you need a wide variety of GPUs: Runpod or Tensordock.
  • If you want to play around with templates / general hobbyist: Runpod.

The large clouds generally have worse pricing and more complicated setups than the above.

If you’re tied to one of the big clouds (AWS, Azure, GCP), then you don’t have a choice, so use that.

More info here, here, here, here, here and here.

What’s the easiest GPU cloud to start with? #

Runpod and their templates. Pick a template, pick a GPU, click customize deployment and increase the temporary and persistent disk space to an appropriate size, click set overrides, click continue, click deploy, then click view logs. Once setup is done, either use the URL provided by the logs or click to connect to whatever you deployed.

Note that Runpod Pods are not full-featured VMs, they are docker containers on host machines.

How much VRAM and system RAM do I need, and how many vCPUs? #

Here are some basic and often-wrong rules of thumb (a quick way to check what your machine actually has is sketched after this list):

  • VRAM (Video RAM / GPU RAM)
    • Llama 2 70B (GPTQ, 4-bit): 50-60GB
    • Stable Diffusion: 16GB+ preferred
    • Whisper: 12GB+ if using the OpenAI version for optimal transcription speed; community versions can run with much less, down to CPU-only
  • System RAM
    • 1-2x your amount of VRAM
  • vCPUs
    • 8-16 vCPUs should be more than sufficient for most non-large-scale GPU workloads
  • Disk space
    • Very use-case dependent. If you’re not sure, start with 100GB and see if that’s enough for your use case
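
Here’s the quick hardware-check sketch mentioned above, for comparing your machine or rented instance against these rules of thumb. It assumes PyTorch and psutil are installed:

```python
# Print GPU VRAM, system RAM, and vCPU count (`pip install torch psutil`).
import os

import psutil
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU visible")

print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.1f} GB")
print(f"vCPUs: {os.cpu_count()}")
```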

More info here.

If you’re not sure, then it probably doesn’t matter much for your use case. More info here.

What about InfiniBand? #

If you’re renting one or two GPUs, it’s not relevant for you. If you’re doing a cluster of thousands, you’ll likely want InfiniBand.

What’s the difference between RTX 6000, A6000, and 6000 Ada? #

Three different cards! It’s a confusing naming scheme.

  • RTX 6000 (Quadro RTX 6000, 24 GB VRAM, launched Aug 13, 2018)
  • RTX A6000 (48 GB VRAM, launched Oct 5, 2020)
  • RTX 6000 Ada (48 GB VRAM, launched Dec 3, 2022)

What about the difference between a DGX GH200, a GH200, and an H100? #

Chart version #

graph TB
  A(1x DGX GH200)
  subgraph GH200s [256x GH200s]
    B(Each GH200)
    subgraph H100Grace [H100 GPU & Grace CPU]
      C(1x H100)
      D(1x Grace CPU)
    end
    B -- "contains" --> H100Grace
  end
  A -- "contains" --> GH200s

Text version #

  • 1x DGX GH200
    • Contains 256x GH200s (“Grace Hoppers”)
      • Each GH200 contains 1x H100 and 1x Grace CPU

More info here.

What about DGX Cloud? #

It’s Nvidia’s official cloud offering aimed at enterprises. You buy it through Nvidia but rent through an existing cloud like Oracle.

8 GPUs per instance, starting at $37k/month.

More info here.

Are H100s a big upgrade from A100s? #

Yes, the speedup is significant, and I’m told that H100s scale performance to large numbers of GPUs better than A100s do.

So for training LLMs, H100s are the best bet.

What about AMD, Intel, and Cerebras? #

For now, Nvidia is easier. We’ll put out content about those other cards soon. There are also relevant things besides chips that make Nvidia alternatives more workable, and we’ll write about those too.

AMD has some cards with 128GB and 192GB of HBM3 VRAM, which is cool. (MI300A and MI300X)

What next? #

If you’re ready to experiment, go set up a Runpod account, add a $10 balance, browse their templates, and deploy a template to a GPU instance. If you’re more experienced and don’t need templates, then consider starting with a different GPU cloud.

Submit feedback on this post or get early access and/or notifications of future posts.