Nvidia Driver, LLaMA and ChatGPT
LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs), released by Meta AI starting in February 2023.
I recently built my computer with an Nvidia GPU. You can read about it in How to Build a Computer, https://lzwjava.github.io/computer.
After that, I started to run the LLaMA project. Its GitHub URL is https://github.com/facebookresearch/llama.
Install Nvidia Driver
When you run the command,
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir llama-2-7b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 128 --max_batch_size 4
It shows the error, “RuntimeError: Distributed package doesn’t have NCCL built in”. Let’s learn about NCCL.
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. I referred to the websites below to install the NVIDIA driver; a quick Python check for NCCL support is sketched after the list.
- CUDA Toolkit 12.2 Update 1 Downloads, https://developer.nvidia.com/cuda-downloads
- NVIDIA NCCL, https://developer.nvidia.com/nccl
- NVIDIA Deep Learning NCCL Documentation, https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
- NVIDIA CUDA Installation Guide for Linux, https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
- After Installing Ubuntu, You Encounter Perform MOK Management, https://www.cnblogs.com/yutian-blogs/p/13019226.html
- Ubuntu 22.04 for Deep Learning, https://gist.github.com/amir-saniyan/b3d8e06145a8569c0d0e030af6d60bea
- Ubuntu 22.04 Notes, https://github.com/kmcminn/thinkpad/tree/main/extreme3g
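Before reinstalling anything, it is worth checking whether the current PyTorch build was compiled with NCCL at all. A minimal sketch using the standard torch.distributed API:

import torch.distributed as dist

# False means this PyTorch build ships without NCCL, which is exactly what
# triggers "Distributed package doesn't have NCCL built in" under torchrun.
print(dist.is_nccl_available())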
Once we have successfully installed the NVIDIA driver for our graphics card, the nvidia-smi command shows its details:
(base) lzw@lzw-MS-7E01:~$ nvidia-smi
Thu Aug 17 04:15:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 On | 00000000:01:00.0 On | N/A |
| 0% 34C P8 9W / 215W | 666MiB / 12282MiB | 15% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1926 G /usr/lib/xorg/Xorg 381MiB |
| 0 N/A N/A 2065 G /usr/bin/gnome-shell 120MiB |
| 0 N/A N/A 3482 G gnome-control-center 2MiB |
| 0 N/A N/A 3803 G ...irefox/2987/usr/lib/firefox/firefox 149MiB |
+---------------------------------------------------------------------------------------+
Actually, it is hard to reach this stage. Please follow the notes carefully: Ubuntu 22.04 Notes, https://github.com/kmcminn/thinkpad/tree/main/extreme3g.
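Once the driver works, it is also worth confirming that PyTorch itself can see the GPU. A minimal sketch, assuming a CUDA-enabled PyTorch build is installed in the active environment:

import torch

print(torch.cuda.is_available())      # True if the driver and CUDA runtime are working
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4070"
print(torch.version.cuda)             # CUDA version the PyTorch build targets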
Learn LLaMA
After downloading the models and trying to run the command, we encounter the error below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.69 GiB total capacity; 9.70 GiB already allocated; 64.81 MiB free; 9.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
Since our graphics card has only 12 GB of memory and the llama-2-7b model is around 13 GB, we can't run it on our card.
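The arithmetic is simple: about seven billion 16-bit parameters already need roughly 13 GiB for the weights alone, before activations and the KV cache. A rough back-of-the-envelope sketch:

# rough estimate of llama-2-7b weight memory in 16-bit precision
n_params = 7_000_000_000
bytes_per_param = 2                               # fp16 / bf16 weights
print(n_params * bytes_per_param / 2**30)         # ~13.0 GiB, more than the 12 GiB RTX 4070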
We try another project, open_llama_3b, https://huggingface.co/openlm-research/open_llama_3b.
We encounter the error below.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
And we ask ChatGPT about this.
ChatGPT gives us a beautiful fix. We need to add the line below.
input_ids = input_ids.to(model.device)
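For context, run.py roughly follows the example on the open_llama_3b model card, plus that one line. Treat this as a sketch; the actual script may differ:

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = "openlm-research/open_llama_3b"
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Q: What is the largest animal?\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to(model.device)  # the fix: inputs must live on the same device as the model

generation_output = model.generate(input_ids=input_ids, max_new_tokens=32)
print(tokenizer.decode(generation_output[0]))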
Finally, we can run it:
(llama) lzw@lzw-MS-7E01:~/Projects/open_llama_3b$ python run.py
Q: What is the largest animal?
A: The blue whale.
Q: What is the largest animal?
A: The blue whale. It is the largest animal on Earth. It is also the
We also try the project, llama2.c, https://github.com/karpathy/llama2.c.
(base) lzw@lzw-MS-7E01:~/Projects/llama2.c$ ./run stories15M.bin
One day, a big lion and a little rabbit were in the woods. The big lion said, "Let's race to the river and drink some water." The little rabbit agreed, and they started to race.
The lion was fast, but he was careful not to hurt the little rabbit. The rabbit was getting close to the river, but the big lion didn't see him. When the lion crossed the river, he slipped and fell into the river.
The little rabbit saw the lion in the water and thought, "I should tell the lion not to drink." He said, "Lion, don't drink from the river. It's not safe." But the lion didn't listen and drank from the river.
The little rabbit tried to help the lion, but he couldn't. The lion drank too much and his tummy started to hurt. The big lion was very sad and said, "I told you not to drink from the river." The little rabbit went home crying, and the big lion still couldn't help.
achieved tok/s: 185.527329
Understand Code by ChatGPT
Zhiwei: Could you explain the code below?
def generate(
    self,
    prompt_tokens: List[List[int]],
    max_gen_len: int,
    temperature: float = 0.6,
    top_p: float = 0.9,
    logprobs: bool = False,
    echo: bool = False,
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
    params = self.model.params
    bsz = len(prompt_tokens)
    assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
    min_prompt_len = min(len(t) for t in prompt_tokens)
    max_prompt_len = max(len(t) for t in prompt_tokens)
    assert max_prompt_len <= params.max_seq_len
    total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
    pad_id = self.tokenizer.pad_id
    tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
    for k, t in enumerate(prompt_tokens):
        tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
    if logprobs:
        token_logprobs = torch.zeros_like(tokens, dtype=torch.float)
    prev_pos = 0
    eos_reached = torch.tensor([False] * bsz, device="cuda")
    input_text_mask = tokens != pad_id
    for cur_pos in range(min_prompt_len, total_len):
        logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
        if logprobs:
            token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
                input=logits.transpose(1, 2),
                target=tokens[:, prev_pos + 1 : cur_pos + 1],
                reduction="none",
                ignore_index=pad_id,
            )
        if temperature > 0:
            probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
            next_token = sample_top_p(probs, top_p)
        else:
            next_token = torch.argmax(logits[:, -1], dim=-1)
        next_token = next_token.reshape(-1)
        # only replace token if prompt has already been generated
        next_token = torch.where(
            input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
        )
        tokens[:, cur_pos] = next_token
        eos_reached |= (~input_text_mask[:, cur_pos]) & (
            next_token == self.tokenizer.eos_id
        )
        prev_pos = cur_pos
        if all(eos_reached):
            break
    if logprobs:
        token_logprobs = token_logprobs.tolist()
    out_tokens, out_logprobs = [], []
    for i, toks in enumerate(tokens.tolist()):
        # cut to max gen len
        start = 0 if echo else len(prompt_tokens[i])
        toks = toks[start : len(prompt_tokens[i]) + max_gen_len]
        probs = None
        if logprobs:
            probs = token_logprobs[i][start : len(prompt_tokens[i]) + max_gen_len]
        # cut to eos tok if any
        if self.tokenizer.eos_id in toks:
            eos_idx = toks.index(self.tokenizer.eos_id)
            toks = toks[:eos_idx]
            probs = probs[:eos_idx] if logprobs else None
        out_tokens.append(toks)
        out_logprobs.append(probs)
    return (out_tokens, out_logprobs if logprobs else None)
ChatGPT: Sure, here’s an explanation of the provided code:
def generate(
    self,
    prompt_tokens: List[List[int]],
    max_gen_len: int,
    temperature: float = 0.6,
    top_p: float = 0.9,
    logprobs: bool = False,
    echo: bool = False,
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
This function is named generate and accepts several parameters:
- prompt_tokens is a list containing sequences of tokens for prompts.
- max_gen_len is the maximum length of the generated text.
- temperature is a parameter used for controlling randomness during generation.
- top_p is a probability threshold between 0 and 1 used for controlling diversity during sampling.
- logprobs is a boolean indicating whether to record the log probabilities of the generation process.
- echo is a boolean indicating whether to echo the input during generation.
params = self.model.params
bsz = len(prompt_tokens)
assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
Here, the function gets the model parameters and calculates the batch size of the prompts. It ensures that the batch size is not greater than the maximum batch size specified in the model parameters.
min_prompt_len = min(len(t) for t in prompt_tokens)
max_prompt_len = max(len(t) for t in prompt_tokens)
assert max_prompt_len <= params.max_seq_len
total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
These lines calculate the minimum and maximum lengths of the prompts and ensure that the maximum prompt length does not exceed the maximum sequence length specified in the model parameters. The total length is calculated based on the maximum generated length and the maximum prompt length.
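For a concrete feel, here is a small worked example using max_seq_len 128 from the torchrun command above; the max_gen_len and prompt lengths are made up for illustration:

max_seq_len, max_gen_len = 128, 64
prompt_lens = [12, 20]                              # two hypothetical prompts in the batch
total_len = min(max_seq_len, max_gen_len + max(prompt_lens))
print(total_len)                                    # min(128, 84) = 84
# generation then fills positions min(prompt_lens) = 12 through 83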
pad_id = self.tokenizer.pad_id
tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
for k, t in enumerate(prompt_tokens):
    tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
This section initializes the tokens tensor with padding tokens and fills it with the token sequences from the prompts.
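A tiny CPU-only illustration of what that padded tensor looks like (the pad_id value and token ids are made up; in the real code they come from the tokenizer):

import torch

pad_id = -1                                    # placeholder value for illustration
prompt_tokens = [[5, 7, 9], [3, 8]]            # two prompts of different lengths
bsz, total_len = len(prompt_tokens), 6
tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long)
for k, t in enumerate(prompt_tokens):
    tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long)
print(tokens)
# tensor([[ 5,  7,  9, -1, -1, -1],
#         [ 3,  8, -1, -1, -1, -1]])
print(tokens != pad_id)                        # this is what input_text_mask (used below) looks like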
if logprobs:
    token_logprobs = torch.zeros_like(tokens, dtype=torch.float)
If logprobs is True, a tensor token_logprobs is created to store log probabilities.
prev_pos = 0
eos_reached = torch.tensor([False] * bsz, device="cuda")
input_text_mask = tokens != pad_id
Variables prev_pos (previous position), eos_reached (whether end-of-sequence tokens are encountered), and input_text_mask (positions not filled with padding) are initialized.
The following loop generates tokens starting from the minimum prompt length up to the total length.
for cur_pos in range(min_prompt_len, total_len):
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
    if logprobs:
        token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
            input=logits.transpose(1, 2),
            target=tokens[:, prev_pos + 1 : cur_pos + 1],
            reduction="none",
            ignore_index=pad_id,
        )
    if temperature > 0:
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_token = sample_top_p(probs, top_p)
    else:
        next_token = torch.argmax(logits[:, -1], dim=-1)
    next_token = next_token.reshape(-1)
    next_token = torch.where(
        input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
    )
    tokens[:, cur_pos] = next_token
    eos_reached |= (~input_text_mask[:, cur_pos]) & (
        next_token == self.tokenizer.eos_id
    )
    prev_pos = cur_pos
    if all(eos_reached):
        break
Inside this loop, the model produces logits, a probability distribution over the next token. If logprobs is True, it computes the log probabilities using cross-entropy. Depending on the temperature, it either samples from a temperature-scaled softmax with top-p filtering or falls back to argmax to select the next token. It then updates the tokens tensor and checks for end-of-sequence tokens.
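The sample_top_p helper is not shown above. A nucleus-sampling implementation in the same style looks roughly like this; treat it as a sketch rather than the repo's exact code:

import torch

def sample_top_p(probs, p):
    # sort probabilities in descending order, keep the smallest prefix whose
    # cumulative mass exceeds p, renormalize, then sample one token from it
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)
    return torch.gather(probs_idx, -1, next_token)  # map back to original vocab indices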
if logprobs:
    token_logprobs = token_logprobs.tolist()
If logprobs is True, the log probabilities tensor is converted to a Python list.
out_tokens, out_logprobs = [], []
for i, toks in enumerate(tokens.tolist()):
    ...
In this loop, for each sample in the batch, it extracts the output token sequence (and the associated log probabilities), optionally keeping the prompt when echo is True and trimming everything after the end-of-sequence token.
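A toy illustration of that slicing logic; the token ids, pad value, and eos_id here are made up:

toks = [10, 11, 12, 50, 51, 2, -1, -1]           # 3 prompt tokens, then generated tokens, then padding
prompt_len, max_gen_len, eos_id = 3, 4, 2
start = prompt_len                               # echo=False: drop the prompt
toks = toks[start : prompt_len + max_gen_len]    # -> [50, 51, 2, -1]
if eos_id in toks:
    toks = toks[: toks.index(eos_id)]            # -> [50, 51]
print(toks)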
return (out_tokens, out_logprobs if logprobs else None)
Finally, the function returns the generated token sequences and their corresponding log probabilities (if requested).
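To tie it together, a hypothetical call might look like the following; generator stands for an already-built Llama instance, and the encode/decode signatures are assumed to follow the repo's tokenizer:

# hypothetical usage of generate(); generator is an already-built Llama instance
prompt = "What is the largest animal?"
prompt_tokens = [generator.tokenizer.encode(prompt, bos=True, eos=False)]
out_tokens, _ = generator.generate(prompt_tokens, max_gen_len=64, temperature=0.6, top_p=0.9)
print(generator.tokenizer.decode(out_tokens[0]))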