How to install LLaMA: 8-bit and 4-bit

Getting Started with LLaMA

August 2023 Update: If you're new to Llama and local LLMs, this post is for you. This guide has been updated with the latest information, including the simplest ways to get started. You can skip the sections on manually installing with text generation web UI, which were part of the original guide from six months ago; those sections are everything below the Old Guide header.

If you're looking for the link to the new Discord server, it's here: https://discord.gg/Y8H8uUtxc3

If you're looking for the subreddit list of models, go to the wiki: https://www.reddit.com/r/LocalLLaMA/wiki/models.

LLaMA FAQ

Q: What is r/LocalLLaMA about?

A: LocalLLaMA is a subreddit for discussing Llama, the family of large language models created by Meta AI. It was created to foster a community around Llama, similar to the communities that have formed around open source projects like Stable Diffusion. Discussion of other local LLMs is welcome.

To learn more about Llama, read the Wikipedia page.

Q: Is Llama like ChatGPT?

A: The foundational Llama models are not fine-tuned for dialogue or question answering like ChatGPT. They should be prompted so that the expected answer is the natural continuation of the prompt. Fine-tuned Llama models have scored high on benchmarks and can resemble GPT-3.5-Turbo. Llama models are not yet GPT-4 quality.
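
For a concrete sense of what prompting for a "natural continuation" looks like, here is a minimal sketch using the Hugging Face transformers library; the model path and generation settings are placeholders, not part of the original post:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/llama-7b-hf"  # placeholder: any converted base LLaMA model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# A base model continues text, so phrase the prompt so that the desired answer
# is its natural continuation rather than asking a chat-style question.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))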

Q: How do I get started? Will this run on my [insert computer specs here]?

A: To get started, keep reading. You can very likely run Llama-based models on your hardware, even if it's not high-end.

System Requirements

8-bit Model Requirements for GPU inference

Model | VRAM Used | Card examples | RAM/Swap to Load*
LLaMA 7B / Llama 2 7B | 10GB | 3060 12GB, 3080 10GB | 24 GB
LLaMA 13B / Llama 2 13B | 20GB | 3090, 3090 Ti, 4090 | 32 GB
LLaMA 33B / Llama 2 34B | ~40GB | A6000 48GB, A100 40GB | ~64 GB
LLaMA 65B / Llama 2 70B | ~80GB | A100 80GB | ~128 GB

*System RAM, not VRAM, required to load the model, in addition to having enough VRAM. Not required to run the model. You can use swap space if you do not have enough RAM.

4-bit Model Requirements for GPU inference

Model | Minimum Total VRAM | Card examples | RAM/Swap to Load*
LLaMA 7B / Llama 2 7B | 6GB | GTX 1660, 2060, AMD 5700 XT, RTX 3050, 3060 | 6 GB
LLaMA 13B / Llama 2 13B | 10GB | AMD 6900 XT, RTX 2060 12GB, 3060 12GB, 3080, A2000 | 12 GB
LLaMA 33B / Llama 2 34B | ~20GB | RTX 3080 20GB, A4500, A5000, 3090, 4090, 6000, Tesla V100 | ~32 GB
LLaMA 65B / Llama 2 70B | ~40GB | A100 40GB, 2x3090, 2x4090, A40, RTX A6000, 8000 | ~64 GB

*System RAM, not VRAM, required to load the model, in addition to having enough VRAM. Not required to run the model. You can use swap space if you do not have enough RAM.
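
If your card isn't in the tables, you can get a rough sense of requirements yourself: 8-bit weights take about one byte per parameter and 4-bit weights about half a byte, plus overhead for the context and the framework. A back-of-the-envelope sketch in Python (the 1.2x overhead factor is an assumption for illustration; the tables above include extra headroom):

# Rough VRAM estimate: parameters x bytes-per-parameter, plus overhead.
# The 1.2 factor is an assumption covering activations, context, and framework
# overhead; real usage varies by model and settings.
def estimate_vram_gb(params_billion, bits, overhead=1.2):
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

for params in (7, 13, 33, 65):
    print(f"{params}B: 8-bit ~{estimate_vram_gb(params, 8):.0f} GB, "
          f"4-bit ~{estimate_vram_gb(params, 4):.0f} GB")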

llama.cpp Requirements for CPU inference

Model | Original Size | Quantized Size (4-bit)
7B | 13 GB | 3.9 GB
13B | 24 GB | 7.8 GB
33B | 60 GB | 19.5 GB
65B | 120 GB | 38.5 GB

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
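
As a quick sanity check on those numbers: 4-bit quantization stores roughly half a byte per parameter plus a small amount of per-block scaling data, so a 7B model works out to about 7 × 0.5 ≈ 3.5 GB before overhead, in line with the 3.9 GB listed. Because llama.cpp loads the whole file into memory, RAM use is roughly the file size plus room for the context.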

Projects and Installation

Since LLaMA was unveiled several months ago, the available tools have become better documented and simpler to use. There are three main projects that this community uses: text generation web UI, llama.cpp, and koboldcpp. This section contains information on each one.

text generation web UI

A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.

The developer of the project has created extensive documentation for installation and other information, and the old guide for manual installation is no longer necessary. To get started, all you have to do is download the one-click installer for the OS of your choice then download a model. For the full documentation, check here.

llama.cpp

Inference of the LLaMA model in pure C/C++

This is the preferred option for CPU inference. For building on Linux or macOS, see the repository for usage. If you're on Windows, you can download the latest release from the releases page and start using it immediately.

For everything else on usage, the documentation here explains the different options and modes of interaction.
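
If you'd rather call llama.cpp from a script than from the command line, the separate llama-cpp-python bindings (not part of the original guide) wrap the same engine; a minimal sketch, assuming llama-cpp-python is installed and a quantized model file has been downloaded:

# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path

# Completion-style prompt, same idea as prompting the base models described above.
output = llm("Q: Name three Linux distributions. A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])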

koboldcpp

A self contained distributable from Concedo that exposes llama.cpp function bindings, allowing it to be used via a simulated Kobold API endpoint. You get llama.cpp with a fancy UI, persistent stories, editing tools, save formats, memory, world info, author's note, characters, scenarios and everything Kobold and Kobold Lite have to offer.

The koboldcpp wiki explains everything you need to know to get started.

Models

To find known good models to download, including the base LLaMA and Llama 2 models, visit this subreddit's wiki: https://www.reddit.com/r/LocalLLaMA/wiki/models. You can also search Hugging Face.

Although many fine-tuned models have been released, not all are of the same quality. For the best first-time experience, it's recommended to start with the official Llama 2 Chat models released by Meta AI or Vicuna v1.5 from LMSYS. They are the most similar to ChatGPT.
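
If you want to fetch a model from a script rather than through a web UI, the huggingface_hub library can download an entire model repository; a minimal sketch, where the repo id and target folder are just examples (gated repos such as Meta's official Llama 2 weights also require an access token):

from huggingface_hub import snapshot_download

# Example repo id; swap in the model you actually want. Gated repos (e.g. the
# official meta-llama repositories) additionally require huggingface-cli login.
snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.5",
    local_dir="text-generation-webui/models/vicuna-7b-v1.5",
)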

If you need a locally run model for coding, use Code Llama or a fine-tuned derivative of it. 7B, 13B, and 34B Code Llama models exist. If you're looking for visual instruction, then use LLaVA or InstructBLIP with Vicuna.

Other Info and FAQ

Q: Do these models provide refusals like ChatGPT?

A: This depends on the model. Some, like the Vicuna models trained on ShareGPT data, inherit refusals from ChatGPT for certain queries. Other models never provide refusals at all. If this is important for your use case, you can experiment with different choices to find your preferred option.

Q: How can I train a LoRA for a specific task or purpose?

A: Read this guide. If you have any questions after reading all of that, then you can ask in this subreddit.
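
For orientation before reading that guide, this is roughly what attaching a LoRA adapter looks like with the peft library; the hyperparameters and target modules are common illustrative defaults, not recommendations from the guide:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative values only; see the linked guide for real training settings.
base = AutoModelForCausalLM.from_pretrained("models/llama-7b-hf", device_map="auto")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama blocks
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter weights will train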

Q: Where can I keep up with the latest news for local LLMs?

A: This subreddit! While this subreddit is named r/LocalLLaMA and focuses on LLaMA, discussion of all local LLMs is allowed and encouraged. You can be sure that the latest news and resources will be shared here.

Old Guide

Everything below this point is the old guide that made up the original post, and you can skip all of it. Most of it has since been deleted, including the tips, resources, and LoRA tutorial, but the manual steps for the web UI remain as a reference for anyone who wants them or is curious about how the process used to work. The old guide below will no longer be updated.

Installing Windows Subsystem for Linux (WSL)

WSL installation is optional. If you do not want to install this, you can skip over to the Windows specific instructions below for 8-bit or 4-bit. This section requires an NVIDIA GPU.

On Windows, you may get better performance by using WSL. To install WSL using the instructions below, first ensure you are running at least Windows 10 version 2004 (Build 19041) or Windows 11. To check, type info in the search box on your taskbar and then select System Information. Alternatively, hit Windows+R, type msinfo32 into the "Open" field, and then hit enter. Look at "Version" to see what version you are running.
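
If you already have Python installed, you can also check the build number programmatically; a small sketch (Windows only, and purely optional):

import sys

# Build 19041 corresponds to Windows 10 version 2004; Windows 11 builds are higher.
build = sys.getwindowsversion().build
print(f"Windows build {build}:",
      "OK for wsl --install" if build >= 19041 else "older than build 19041")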

Instructions:

  1. Open PowerShell in administrator mode
  2. Enter the following command then restart your machine: wsl --install

This command will enable WSL, download and install the latest Linux kernel, set WSL 2 as the default, and download and install the Ubuntu Linux distribution.

  3. After restart, Windows will finish installing Ubuntu. You'll be asked to create a username and password for Ubuntu. These have no bearing on your Windows username.

  4. Windows will not automatically update or upgrade Ubuntu. Update and upgrade your packages by running the following command in the Ubuntu terminal (search for Ubuntu in the Start menu or taskbar and open the app): sudo apt update && sudo apt upgrade

  5. You can now continue by following the Linux setup instructions for LLaMA. Check the necessary troubleshooting info below to resolve errors. If you plan on using 4-bit LLaMA with WSL, you will need to install the WSL-Ubuntu CUDA toolkit using the instructions below.

Extra tips:

To install conda, run the following inside the Ubuntu environment:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

To find the name of a WSL distribution and uninstall it (afterward, you can create a new virtual machine environment by opening the app again):

wsl -l
wsl --unregister <DistributionName>

To access the web UI from another device on your local network, you will need to configure port forwarding:

netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=7860 connectaddress=localhost connectport=7860

Troubleshooting:

If you will use 4-bit LLaMA with WSL, you must install the WSL-Ubuntu CUDA toolkit, and it must be version 11.7. Unlike the default CUDA toolkit, this one will not overwrite your WSL2 driver. Follow these steps:

sudo apt-key del 7fa2af80 
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.0-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

In order to avoid a CUDA error when starting the web UI, you will need to apply the following fix as seen in this comment and issue #400:

cd /home/USERNAME/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/
cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
conda install cudatoolkit

If for some reason installing the WSL-Ubuntu CUDA toolkit does not work for you, this alternate fix should resolve any errors relating to that.

You may also need to create symbolic links to get everything working correctly. Do not do this if the above commands resolve your errors. To create the symlinks, follow the instructions here then restart your machine.

Installing 8-bit LLaMA with text-generation-webui

Linux:

  1. Follow the instructions here under "Installation"
  2. Download the desired Hugging Face converted model for LLaMA here
  3. Copy the entire model folder, for example llama-13b-hf, into text-generation-webui\models
  4. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit
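
For context, the --load-in-8bit flag relies on the bitsandbytes integration in transformers; a stripped-down sketch of the equivalent direct load (the path is a placeholder, and none of this is needed if the command above works):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "text-generation-webui/models/llama-13b-hf"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    load_in_8bit=True,   # quantize weights to 8-bit via bitsandbytes at load time
    device_map="auto",   # place layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)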

Windows:

  1. Install miniconda
  2. Activate conda via powershell, replacing USERNAME with your username: powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\USERNAME\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\USERNAME\miniconda3' "
  3. Follow the instructions here under "Installation", starting with the step "Create a new conda environment."
  4. Download the desired Hugging Face converted model for LLaMA here
  5. Copy the entire model folder, for example llama-13b-hf, into text-generation-webui\models
  6. Download libbitsandbytes_cuda116.dll and put it in C:\Users\xxx\miniconda3\envs\textgen\lib\site-packages\bitsandbytes\
  7. In \bitsandbytes\cuda_setup\main.py search for: if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.so', None, None, None, None and replace with: if torch.cuda.is_available(): return 'libbitsandbytes_cuda116.dll', None, None, None, None
  8. In \bitsandbytes\cuda_setup\main.py search for this twice: self.lib = ct.cdll.LoadLibrary(binary_path) and replace with: self.lib = ct.cdll.LoadLibrary(str(binary_path))
  9. Run the following command in your conda environment: python server.py --model llama-13b-hf --load-in-8bit

Note: for decapoda-research models, you must change "tokenizer_class": "LLaMATokenizer" to "tokenizer_class": "LlamaTokenizer" in text-generation-webui/models/llama-13b-hf/tokenizer_config.json
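
If you'd rather not edit the file by hand, a short snippet can make the same change; this assumes the llama-13b-hf example path used in the steps above:

# One-off fix: rename the tokenizer class in tokenizer_config.json for
# decapoda-research conversions. Path assumes the llama-13b-hf example above.
import json

path = "text-generation-webui/models/llama-13b-hf/tokenizer_config.json"
with open(path) as f:
    config = json.load(f)

config["tokenizer_class"] = "LlamaTokenizer"

with open(path, "w") as f:
    json.dump(config, f, indent=2)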

Installing 4-bit LLaMA with text-generation-webui

Linux:

  1. Follow the instructions here under "Installation"
  2. Continue with the 4-bit specific instructions here

Windows (Step-by-Step):

  1. Install Build Tools for Visual Studio 2019 (has to be 2019) here. Check "Desktop development with C++" when installing.
  2. Install miniconda
  3. Install Git from the website or simply with cmd prompt: winget install --id Git.Git -e --source winget
  4. Open "x64 native tools command prompt" as admin
  5. Activate conda, replacing USERNAME with your username: powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\USERNAME\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\USERNAME\miniconda3' "
  6. conda create -n textgen python=3.10.9
  7. conda activate textgen
  8. conda install cuda -c nvidia/label/cuda-11.3.0 -c nvidia/label/cuda-11.3.1
  9. git clone https://github.com/oobabooga/text-generation-webui
  10. cd text-generation-webui
  11. pip install -r requirements.txt
  12. pip install torch==1.12+cu113 -f https://download.pytorch.org/whl/torch_stable.html
  13. mkdir repositories
  14. cd repositories
  15. git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa --branch cuda --single-branch
  16. cd GPTQ-for-LLaMa
  17. git reset --hard c589c5456cc1c9e96065a5d285f8e3fac2cdb0fd
  18. pip install ninja
  19. $env:DISTUTILS_USE_SDK=1
  20. python setup_cuda.py install
  21. Download the 4-bit model of your choice and place it directly into your models folder. For instance, models/llama-13b-4bit-128g. The links for the updated 4-bit models are listed below in the models directory section. If you will use 7B 4-bit, download without group-size. For 13B 4-bit and up, download with group-size.
  22. Run the following command in your conda environment:
      without group-size: python server.py --model llama-7b-4bit --wbits 4 --no-stream
      with group-size: python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128 --no-stream

Note: If you get the error "CUDA Setup failed despite GPU being available", do the patch in steps 6-8 of the 8-bit instructions above.

For a quick reference, here is an example chat with LLaMA 13B:

https://preview.redd.it/yhyuzgg6d8oa1.png?width=634&format=png&auto=webp&v=enabled&s=e703895ad863c57239d9d4602f105558ce6f1a8f

Comments

absolute life savers! I recommend making an edit to make this clearer in the instructions :) I'm sure a bunch of people would like to push the limit of what their hardware can load

Hello, I am trying to set up a custom device_map via Hugging Face's instructions:

https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu

I have this code inserted into my "server.py" file for text-generation-webui:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Set the quantization config with llm_int8_enable_fp32_cpu_offload set to True
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

# Manually map each module to a device: 0 = first GPU, "cpu" = system RAM
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

model_path = "decapoda-research/llama-7b-hf"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device_map,
    quantization_config=quantization_config,
)

However, there are two problems:

  1. It downloads a new copy of the model from Hugging Face rather than using the one in my models directory.
  2. I get this error even after the download:

File "C:\Windows\System32\text-generation-webui\server7b.py", line 33, in <module>
model_8bit = AutoModelForCausalLM.from_pretrained(
File "C:\Users\justi\miniconda3\envs\textgen\lib\site-packages\transformers\models\auto\auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\justi\miniconda3\envs\textgen\lib\site-packages\transformers\modeling_utils.py", line 2643, in from_pretrained
) = cls._load_pretrained_model(
File "C:\Users\justi\miniconda3\envs\textgen\lib\site-packages\transformers\modeling_utils.py", line 2966, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "C:\Users\justi\miniconda3\envs\textgen\lib\site-packages\transformers\modeling_utils.py", line 662, in _load_state_dict_into_meta_model
raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.layers.0.self_attn.q_proj.weight doesn't have any device set.
(textgen) C:\Windows\System32\text-generation-webui>

Does anyone know how to do CPU/GPU offloading for text-generation-webui?


I can confirm I am having the exact same error and issues with ozcur/alpaca-native-4bit
