Nvidia Driver, LLaMA, and ChatGPT


LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs) released by Meta AI, starting in February 2023.

I recently built a computer with an Nvidia GPU. You can see how I built it here: https://lzwjava.github.io/computer




torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

Running this command fails with the error: "RuntimeError: Distributed package doesn't have NCCL built in". Let's look at what NCCL is.

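The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking.

I referred to the following website to install the NVIDIA driver: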


(base) lzw@lzw-MS-7E01:~$ nvidia-smi
Thu Aug 17 04:15:43 2023       
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4070        On  | 00000000:01:00.0  On |                  N/A |
|  0%   34C    P8               9W / 215W |    666MiB / 12282MiB |     15%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A      1926      G   /usr/lib/xorg/Xorg                          381MiB |
|    0   N/A  N/A      2065      G   /usr/bin/gnome-shell                        120MiB |
|    0   N/A  N/A      3482      G   gnome-control-center                          2MiB |
|    0   N/A  N/A      3803      G   ...irefox/2987/usr/lib/firefox/firefox      149MiB |

In fact, getting to this stage was not easy. Follow these Ubuntu 22.04 notes carefully: https://github.com/kmcminn/thinkpad/tree/main/extreme3g
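With the driver visible in nvidia-smi, the NCCL error from earlier can be checked from Python: the message means the installed PyTorch build was compiled without NCCL, as happens with a CPU-only build. The sketch below asks PyTorch what it supports. `pick_backend` and `detect_backend` are hypothetical helper names, and falling back to the `gloo` backend is an assumption about what a script could do, not something the example script is known to support.

```python
def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Choose "nccl" only when both CUDA and NCCL are usable, else "gloo"."""
    return "nccl" if cuda_available and nccl_available else "gloo"


def detect_backend() -> str:
    """Ask the local PyTorch install which backend it can support."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed; report the CPU-friendly default.
        return pick_backend(False, False)
    # Short-circuit so is_nccl_available() is only called when the
    # distributed package itself is present in this build.
    nccl = dist.is_available() and dist.is_nccl_available()
    return pick_backend(torch.cuda.is_available(), nccl)


if __name__ == "__main__":
    print("usable backend:", detect_backend())
```

If this prints `gloo` on a machine with a working driver, the likely fix is installing a CUDA-enabled PyTorch build rather than changing the script.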



torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.69 GiB total capacity; 9.70 GiB already allocated; 64.81 MiB free; 9.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
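A rough back-of-the-envelope calculation suggests why this happens. The sketch below assumes the failure came from loading a roughly 7B-parameter checkpoint in 16-bit precision, which is my guess from the model names in this post rather than something the log states. The parameter counts are nominal and `weights_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / GIB


GPU_TOTAL_GIB = 11.69  # "total capacity" reported in the error message

# Nominal parameter counts (assumptions based on the model names).
for name, n_params in [("7B", 7e9), ("3B", 3e9)]:
    need = weights_gib(n_params, bytes_per_param=2)  # 2 bytes per fp16 weight
    verdict = "fits" if need < GPU_TOTAL_GIB else "does not fit"
    print(f"{name} in fp16: {need:.2f} GiB of weights, {verdict} in {GPU_TOTAL_GIB} GiB")
```

Under these assumptions the 7B weights alone need about 13 GiB, more than the card offers, while a 3B model needs under 6 GiB. The allocator hint in the message (max_split_size_mb, set through the PYTORCH_CUDA_ALLOC_CONF environment variable) targets fragmentation, so it is unlikely to help when the weights themselves do not fit.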




RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)



input_ids = input_ids.to(model.device)


(llama) lzw@lzw-MS-7E01:~/Projects/open_llama_3b$ python run.py 
Q: What is the largest animal?   
A: The blue whale.   
Q: What is the largest animal?    
A: The blue whale. It is the largest animal on Earth. It is also the  


(base) lzw@lzw-MS-7E01:~/Projects/llama2.c$ ./run stories15M.bin 
One day, a big lion and a little rabbit were in the woods. The big lion said, "Let's race to the river and drink some water." The little rabbit agreed, and they started to race.
The lion was fast, but he was careful not to hurt the little rabbit. The rabbit was getting close to the river, but the big lion didn't see him. When the lion crossed the river, he slipped and fell into the river.
The little rabbit saw the lion in the water and thought, "I should tell the lion not to drink." He said, "Lion, don't drink from the river. It's not safe." But the lion didn't listen and drank from the river.
The little rabbit tried to help the lion, but he couldn't. The lion drank too much and his tummy started to hurt. The big lion was very sad and said, "I told you not to drink from the river." The little rabbit went home crying, and the big lion still couldn't help.
achieved tok/s: 185.527329


Zhiwei: Can you explain the code below?

    def generate(
        self,
        prompt_tokens: List[List[int]],
        max_gen_len: int,
        temperature: float = 0.6,
        top_p: float = 0.9,
        logprobs: bool = False,
        echo: bool = False,
    ) -> Tuple[List[List[int]], Optional[List[List[float]]]]:
        params = self.model.params
        bsz = len(prompt_tokens)
        assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)

        min_prompt_len = min(len(t) for t in prompt_tokens)
        max_prompt_len = max(len(t) for t in prompt_tokens)
        assert max_prompt_len <= params.max_seq_len
        total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)

        pad_id = self.tokenizer.pad_id
        tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
        for k, t in enumerate(prompt_tokens):
            tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
        if logprobs:
            token_logprobs = torch.zeros_like(tokens, dtype=torch.float)

        prev_pos = 0
        eos_reached = torch.tensor([False] * bsz, device="cuda")
        input_text_mask = tokens != pad_id
        for cur_pos in range(min_prompt_len, total_len):
            logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
            if logprobs:
                token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
                    input=logits.transpose(1, 2),
                    target=tokens[:, prev_pos + 1 : cur_pos + 1],
                    reduction="none",  # keep one loss value per token
                    ignore_index=pad_id,  # padding positions contribute zero
                )
            if temperature > 0:
                probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
                next_token = sample_top_p(probs, top_p)
            else:
                next_token = torch.argmax(logits[:, -1], dim=-1)

            next_token = next_token.reshape(-1)
            # only replace token if prompt has already been generated
            next_token = torch.where(
                input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
            )
            tokens[:, cur_pos] = next_token
            eos_reached |= (~input_text_mask[:, cur_pos]) & (
                next_token == self.tokenizer.eos_id
            )
            prev_pos = cur_pos
            if all(eos_reached):
                break

        if logprobs:
            token_logprobs = token_logprobs.tolist()
        out_tokens, out_logprobs = [], []
        for i, toks in enumerate(tokens.tolist()):
            # cut to max gen len
            start = 0 if echo else len(prompt_tokens[i])
            toks = toks[start : len(prompt_tokens[i]) + max_gen_len]
            probs = None
            if logprobs:
                probs = token_logprobs[i][start : len(prompt_tokens[i]) + max_gen_len]
            # cut to eos tok if any
            if self.tokenizer.eos_id in toks:
                eos_idx = toks.index(self.tokenizer.eos_id)
                toks = toks[:eos_idx]
                probs = probs[:eos_idx] if logprobs else None
            out_tokens.append(toks)
            out_logprobs.append(probs)
        return (out_tokens, out_logprobs if logprobs else None)

ChatGPT: Sure, here is an explanation of the provided code:

def generate(
    self,
    prompt_tokens: List[List[int]],
    max_gen_len: int,
    temperature: float = 0.6,
    top_p: float = 0.9,
    logprobs: bool = False,
    echo: bool = False,
) -> Tuple[List[List[int]], Optional[List[List[float]]]]:


params = self.model.params
bsz = len(prompt_tokens)
assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)


min_prompt_len = min(len(t) for t in prompt_tokens)
max_prompt_len = max(len(t) for t in prompt_tokens)
assert max_prompt_len <= params.max_seq_len
total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)


pad_id = self.tokenizer.pad_id
tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
for k, t in enumerate(prompt_tokens):
    tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")


if logprobs:
    token_logprobs = torch.zeros_like(tokens, dtype=torch.float)


prev_pos = 0
eos_reached = torch.tensor([False] * bsz, device="cuda")
input_text_mask = tokens != pad_id
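To make the buffer setup concrete, here is a pure-Python sketch of the same bookkeeping using plain lists instead of CUDA tensors. `build_buffer` is a hypothetical helper, and the token ids and `pad_id = -1` are placeholder values chosen for the example.

```python
from typing import List, Tuple


def build_buffer(
    prompt_tokens: List[List[int]], max_gen_len: int, max_seq_len: int, pad_id: int
) -> Tuple[List[List[int]], List[List[bool]], int, int]:
    """Mirror the length checks, padding, and mask from generate() with lists."""
    min_prompt_len = min(len(t) for t in prompt_tokens)
    max_prompt_len = max(len(t) for t in prompt_tokens)
    assert max_prompt_len <= max_seq_len
    total_len = min(max_seq_len, max_gen_len + max_prompt_len)

    # One row per prompt, filled with pad_id, then overwritten by the prompt.
    tokens = [[pad_id] * total_len for _ in prompt_tokens]
    for k, t in enumerate(prompt_tokens):
        tokens[k][: len(t)] = t

    # True where a position holds prompt text, False where it is still padding.
    mask = [[tok != pad_id for tok in row] for row in tokens]
    return tokens, mask, min_prompt_len, total_len


prompts = [[1, 15, 27], [1, 8, 42, 99, 7]]
tokens, mask, start, total_len = build_buffer(
    prompts, max_gen_len=4, max_seq_len=128, pad_id=-1
)
print(total_len)  # 9: longest prompt (5) plus max_gen_len (4)
print(start)      # 3: generation starts at the shortest prompt's length
print(tokens[0])  # [1, 15, 27, -1, -1, -1, -1, -1, -1]
```

Because the loop starts at the shortest prompt length, the first row begins generating at position 3 while the second row is still replaying its own prompt. The mask is what lets the loop tell those two cases apart.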



for cur_pos in range(min_prompt_len, total_len):
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
    if logprobs:
        token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy(
            input=logits.transpose(1, 2),
            target=tokens[:, prev_pos + 1 : cur_pos + 1],
            reduction="none",
            ignore_index=pad_id,
        )
    if temperature > 0:
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        next_token = sample_top_p(probs, top_p)
    else:
        next_token = torch.argmax(logits[:, -1], dim=-1)

    next_token = next_token.reshape(-1)
    next_token = torch.where(
        input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
    )
    tokens[:, cur_pos] = next_token
    eos_reached |= (~input_text_mask[:, cur_pos]) & (
        next_token == self.tokenizer.eos_id
    )
    prev_pos = cur_pos
    if all(eos_reached):
        break
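The loop calls `sample_top_p`, which the listing does not define. As a rough illustration of the temperature and top-p step, here is a pure-Python sketch for a single sequence. It assumes the helper performs ordinary nucleus sampling. The function names below are hypothetical and this is not the repository's tensor implementation.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the most likely tokens until their mass passes top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        if cumulative > top_p:  # mass of the tokens already kept exceeds top_p
            break
        kept[i] = probs[i]
        cumulative += probs[i]
    total = sum(kept)
    return [p / total for p in kept]


def sample_next_token(
    logits: List[float], temperature: float, top_p: float, rng: random.Random
) -> int:
    """Sample with temperature and top-p, or take the argmax when temperature is 0."""
    if temperature > 0:
        probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(logits)), key=lambda i: logits[i])


rng = random.Random(0)  # fixed seed so repeated runs behave the same
logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next_token(logits, temperature=0.0, top_p=0.9, rng=rng))  # 0 (greedy)
print([round(p, 3) for p in top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)])
```

A lower temperature sharpens the distribution before filtering, and a smaller `top_p` shrinks the set of candidate tokens. With `temperature = 0` the sketch is fully deterministic, matching the argmax branch in the loop.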


if logprobs:
    token_logprobs = token_logprobs.tolist()


out_tokens, out_logprobs = [], []
for i, toks in enumerate(tokens.tolist()):
    ...  # per-sequence trimming, as in the full listing above


return (out_tokens, out_logprobs if logprobs else None)
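The per-sequence trimming in the full listing can be tried in isolation. The sketch below mirrors that logic for one row of token ids. `trim_sequence` is a hypothetical helper, and the ids, including `eos_id = 2` and the `-1` padding, are placeholder values.

```python
from typing import List


def trim_sequence(
    toks: List[int], prompt_len: int, max_gen_len: int, eos_id: int, echo: bool = False
) -> List[int]:
    """Drop the prompt unless echo is set, cap the length, and cut at the first EOS."""
    start = 0 if echo else prompt_len
    toks = toks[start : prompt_len + max_gen_len]
    if eos_id in toks:
        toks = toks[: toks.index(eos_id)]
    return toks


# Three prompt tokens, two generated tokens, EOS, then untouched padding.
row = [1, 15, 27, 301, 302, 2, -1, -1, -1]
print(trim_sequence(row, prompt_len=3, max_gen_len=4, eos_id=2))             # [301, 302]
print(trim_sequence(row, prompt_len=3, max_gen_len=4, eos_id=2, echo=True))  # [1, 15, 27, 301, 302]
```

The slice runs first, so any padding that falls inside the slice after the EOS token is discarded together with the EOS token itself.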


2025.03.07