meta.pkl in nanoGPT Training | Generated by AI
Question: What is meta.pkl in nanoGPT training?
Answer:
meta.pkl is the vocabulary metadata file produced by the character-level prep scripts (e.g. data/shakespeare_char/prepare.py). It pickles the tokenizer mapping so training and sampling share the same encoding.
It contains exactly three things:
meta = {
    'vocab_size': vocab_size,  # number of unique chars
    'itos': itos,              # int -> str (decode)
    'stoi': stoi,              # str -> int (encode)
}
with open('meta.pkl', 'wb') as f:
    pickle.dump(meta, f)
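To make the three fields concrete, here is a minimal sketch of how a character-level vocabulary of this shape could be built and pickled. The helper names build_meta and write_meta are hypothetical (they are not nanoGPT functions), and sorting the characters is an assumption made here to keep the ids deterministic.

```python
import pickle
from pathlib import Path


def build_meta(text: str) -> dict:
    """Build a char-level vocabulary in the same shape as meta.pkl (hypothetical helper)."""
    chars = sorted(set(text))                     # unique characters, in a stable order
    stoi = {ch: i for i, ch in enumerate(chars)}  # str -> int (encode)
    itos = {i: ch for i, ch in enumerate(chars)}  # int -> str (decode)
    return {'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}


def write_meta(meta: dict, out_dir: str) -> Path:
    """Pickle the metadata into out_dir as meta.pkl (hypothetical helper)."""
    path = Path(out_dir) / 'meta.pkl'
    with open(path, 'wb') as f:
        pickle.dump(meta, f)
    return path


meta = build_meta("hello world")
```

For the string "hello world" this yields a vocab_size of 8, since the text contains eight distinct characters including the space.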
Why it exists:
train.py looks for meta.pkl in the data dir to auto-set meta_vocab_size, so the model’s embedding/output layer is sized correctly:
meta_path = os.path.join(data_dir, 'meta.pkl')
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']  # otherwise falls back to 50304 (GPT-2's 50257, padded up to a multiple of 64)
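The lookup and its fallback can be sketched as one self-contained function. This is an illustration, not the code in train.py: resolve_vocab_size and pad_to_multiple are hypothetical names, and the padding helper is included only to show where 50304 comes from.

```python
import os
import pickle

GPT2_VOCAB_SIZE = 50257  # size of the GPT-2 BPE vocabulary


def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


def resolve_vocab_size(data_dir: str) -> int:
    """Use the vocab size from meta.pkl if present, else the padded GPT-2 size."""
    meta_path = os.path.join(data_dir, 'meta.pkl')
    if os.path.exists(meta_path):
        with open(meta_path, 'rb') as f:
            return pickle.load(f)['vocab_size']
    return pad_to_multiple(GPT2_VOCAB_SIZE)  # 50257 rounded up to 50304
```

With a char-level dataset such as Shakespeare, the first branch returns a small number (the count of distinct characters); with no meta.pkl, the function returns 50304.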
sample.py loads it to decode generated token ids back to text:
stoi, itos = meta['stoi'], meta['itos']
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
The key point: it’s only generated for the char-level datasets. The GPT-2 BPE datasets (openwebtext, shakespeare) skip it because they use the fixed tiktoken gpt2 encoding. In that case train.py falls back to vocab_size=50304 (GPT-2's 50257 tokens, padded up to a multiple of 64) and sample.py uses tiktoken.get_encoding("gpt2").
So: no meta.pkl = “use GPT-2 BPE”; meta.pkl present = “use this custom char vocab.”
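That rule can be sketched as a single helper that returns an encode/decode pair. The name load_codec is hypothetical, and this is a simplified illustration rather than the exact logic in sample.py. tiktoken is a third-party package and is imported only on the BPE path, so the char-level path runs without it.

```python
import os
import pickle


def load_codec(data_dir: str):
    """Return (encode, decode) based on whether data_dir contains meta.pkl."""
    meta_path = os.path.join(data_dir, 'meta.pkl')
    if os.path.exists(meta_path):
        # Custom char-level vocabulary stored by the prep script.
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        stoi, itos = meta['stoi'], meta['itos']
        encode = lambda s: [stoi[c] for c in s]
        decode = lambda ids: ''.join(itos[i] for i in ids)
    else:
        # No meta.pkl: assume the fixed GPT-2 BPE encoding.
        import tiktoken  # third-party; needed only on this branch
        enc = tiktoken.get_encoding("gpt2")
        encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
        decode = lambda ids: enc.decode(ids)
    return encode, decode
```

On the char-level path, encode raises a KeyError for any character that was not in the training text, because the vocabulary covers only the characters seen by the prep script.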