
Running Llama 2 on Runpod with Oobabooga's text-generation-webui


As of July 27 I think the docker templates are now fixed, so the below shouldn’t be needed.

Here’s a quick guide with some fixes to get Llama 2 running on Runpod using Oobabooga’s (it’s not oogabooga, I got this wrong myself for a while!) text-generation-webui and TheBloke’s dockerLLM.

It won’t work out of the box with dockerLLM, so you’ll need to use some fixes like these.

Following this guide, from start to prompting the model, should take roughly 14 minutes if you already have a runpod account, or around 20 minutes if you need to create one first.

I’ve tested this with 70B base model (TheBloke/Llama-2-70B-GPTQ), 70B chat model (TheBloke/Llama-2-70B-chat-GPTQ), and 7B base model (TheBloke/Llama-2-7B-GPTQ). It worked for all 3.

I’ve tested this on 1x A100 and 1x A6000. Both were able to run the 70B-GPTQ models.

For the 70B-GPTQ base model, 1x A6000 GPU (not the 6000 Ada) ran at 5.55 tokens/s.

For the 7B chat model, 1x A100 GPU ran at 15.93 tokens/s.

The Guide #

  1. Click this runpod template link
  2. Select a suitable GPU (VRAM needed depends on which model size you want to use - for 70B-GPTQ it uses approx 35 GB of VRAM) and click ‘deploy’
  3. Click customize deployment
  4. Change 'Expose HTTP Ports' from ‘7860,’ to ‘7860,7861’, then click set overrides, then click continue, then click deploy (you can modify storage space if needed, but the default will be enough for 2x 70B GPTQ model downloads - they’re about 36 GB of disk space each)
  5. On the ‘my pods’ page click ‘connect’ then click ‘start web terminal’ (if it shows that it’s not ready yet, wait 3-4 minutes then check again)
  6. Copy the ‘username’ provided, then open the ‘connect to web terminal’ link in a new tab, and log in using the username and password from the previous tab
  7. Run the below set of commands, or use the one-liner

Individual commands:

cd text-generation-webui/
# Remove the preinstalled auto-gptq
pip3 uninstall auto-gptq
# Install transformers from GitHub and pin accelerate to 0.21.0
pip3 install git+https://github.com/huggingface/transformers accelerate==0.21.0
# Reinstall auto-gptq 0.2.2 (keep the GITHUB_ACTIONS=true prefix)
GITHUB_ACTIONS=true pip3 install auto-gptq==0.2.2

Now download the model you want to use. Replace the model name in the command below with the exact name (case-sensitive) of the model you want. In the same terminal, run:

python3 download-model.py TheBloke/Llama-2-70B-GPTQ
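
For example, if you want the 70B chat model tested above instead, the same command with that model name would be:

python3 download-model.py TheBloke/Llama-2-70B-chat-GPTQ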

Now start the server with the command below, replacing the model name with the exact name of the model you downloaded, but with an underscore instead of the / (case-sensitive).

python3 server.py --loader autogptq --model TheBloke_Llama-2-70B-GPTQ --no_inject_fused_attention --listen
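
Following the same / to underscore substitution, the corresponding command for the 70B chat model would be:

python3 server.py --loader autogptq --model TheBloke_Llama-2-70B-chat-GPTQ --no_inject_fused_attention --listen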

One-liner:

Alternatively, here’s all of the above combined into a single command:

cd text-generation-webui/ && pip3 uninstall auto-gptq && pip3 install git+https://github.com/huggingface/transformers accelerate==0.21.0 && GITHUB_ACTIONS=true pip3 install auto-gptq==0.2.2 && python3 download-model.py TheBloke/Llama-2-70B-GPTQ && python3 server.py --loader autogptq --model TheBloke_Llama-2-70B-GPTQ --no_inject_fused_attention --listen

You’ll need to hit ‘Y’ to confirm the auto-gptq uninstall. The whole process might take about 11 minutes. You can follow the progress by watching the output in the web terminal.
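
If you’d rather the one-liner run unattended, pip’s -y flag skips that confirmation prompt; swapping the uninstall step for the line below (a variant I haven’t tested as part of this guide) should work, with the rest of the command unchanged:

pip3 uninstall -y auto-gptq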

If it worked successfully, here’s what you’ll see:

In the ‘web terminal’ you should see:

2023-07-19 16:35:55 INFO:Loaded the model in 24.66 seconds.

Running on local URL: http://0.0.0.0:7861

To create a public link, set share=True in launch().

If you see something like that, it was successful.

  1. Keep the terminal window running. If you close it, it will shut down the web ui

  2. Now go back to runpod, click ‘connect’ and then click ‘connect to http service [port 7861]’ (if the link says the port isn’t ready, try clicking it anyway, assuming you’ve seen the ‘Running on local URL’ success message shown above)

  3. Now type in a prompt, set the max_tokens, and hit generate.

Prompt templates:

Base models:

None. Write the start of a fake document so that when the model naturally continues it, the continuation is the result you want. For example, instead of “What is the capital of France?” you’d write “The capital of France is ”
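
Another hypothetical framing is to shape the prompt like a question-and-answer document, so the natural continuation is the answer you’re after:

Q: What is the capital of France?
A: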

Chat models:

<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST]

and, for multi-turn conversations:

<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>

{user_message_1} [/INST] {model_reply_1}</s><s>[INST] {user_message_2} [/INST]

See here.
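
As a concrete example (the system and user messages here are just placeholders I made up), a single-turn prompt would look like:

<s>[INST] <<SYS>>
You are a helpful assistant. Answer concisely.
<</SYS>>

What is the capital of France? [/INST]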

Additional info #

The above tutorial assumes you’ve used runpod before. If you haven’t, it’ll probably be helpful to watch the runpod video here.

If you’re not sure which model version to use (ggml, gptq, base), check out this - the short answer is: use GGML if you have minimal VRAM, and use GPTQ if you have plenty of VRAM. This tutorial covers GPTQ only.

The above tutorial should work for TheBloke’s Llama 2 GPTQ quants (quantized models), including the three tested at the top of this guide.

Once there’s an easier version, email me or message me on Discord (p_clay) and I’ll update the guide to reflect it.

Acknowledgements #

Thanks to MrDragonFox and bartowski.