GloVe Format and Loading Guide | Generated by AI

GloVe vectors are plain-text files. Each line is one token followed by its embedding values, all space-separated. There’s no header (unlike many word2vec text files).

Example (300-d GloVe):

the 0.418 0.24968 -0.41242 0.1217 ... 0.34527   # 300 floats total
to   0.68047 -0.039263 -0.67412 0.056051 ... -0.067549
U.S. 0.3459  0.21513  -0.09982  0.25344 ... 0.024118

Your loader

import numpy as np

def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    return words, word_to_vec_map

matches that format. Two small upgrades you may want:

  1. Use float32 to cut memory in half:
vec = np.asarray(line[1:], dtype=np.float32)
  2. Skip accidental “headers” or bad lines (some repackaged files add one):
parts = line.rstrip().split(' ')
if len(parts) < 2:
    continue
if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
    # looks like a word2vec-style header: "<vocab> <dim>"
    continue
word, *vals = parts

Building an embedding matrix for your vocab

After you create a dictionary mapping word -> id, copy the matching GloVe rows into a weight matrix for your nn.Embedding:

import numpy as np
import torch

def build_embedding_matrix(dictionary, glove_path, dim=300, dtype=np.float32):
    _, glove = read_glove_vecs(glove_path)  # loader from above
    V = len(dictionary)
    # OOV words keep small random vectors; known words get their GloVe row.
    W = np.random.uniform(-0.05, 0.05, (V, dim)).astype(dtype)
    for w, idx in dictionary.items():
        vec = glove.get(w)
        if vec is not None and vec.shape[0] == dim:
            W[idx] = vec
    return torch.tensor(W)

# usage
# emb_weight = build_embedding_matrix(dictionary, "glove.6B.300d.txt", dim=300)
# model.in_embeddings.weight.data.copy_(emb_weight)
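Since OOV words keep their random initialization, it's worth checking how much of your vocab GloVe actually covers before training. A minimal helper (`glove_coverage` is my name, not from the original):

```python
def glove_coverage(dictionary, word_to_vec_map):
    """Fraction of vocab entries that received a pretrained GloVe vector."""
    hits = sum(1 for w in dictionary if w in word_to_vec_map)
    return hits / max(len(dictionary), 1)
```

Coverage well below ~90% on English text usually signals a tokenization mismatch (e.g. casing: the 6B vectors are lowercased, so lowercase your vocab before lookup).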

Quick contrast with word2vec text/binary

word2vec text files begin with a header line "<vocab_size> <dim>"; GloVe files do not. If you need tooling that expects word2vec format, convert with gensim.scripts.glove2word2vec, or in gensim >= 4.0 load the GloVe file directly via KeyedVectors.load_word2vec_format(path, binary=False, no_header=True).
