Load GloVe Vectors Two Ways | Generated by AI


You’ve got two easy ways to get GloVe vectors for words like the, to, and U.S., and for embedding your own text:


Option A — Download Stanford GloVe and look up with gensim

This is the most common route and works fully offline once downloaded.

# 1) Download & unzip (first run only)
import os, zipfile, urllib.request
url = "https://nlp.stanford.edu/data/glove.6B.zip"
zip_path = "glove.6B.zip"
if not os.path.exists(zip_path):
    urllib.request.urlretrieve(url, zip_path)
if not os.path.exists("glove.6B.300d.txt"):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract("glove.6B.300d.txt")  # the zip also ships 50d, 100d, and 200d sets

# 2) Load the GloVe text file with gensim
from gensim.models import KeyedVectors

glove_path = "glove.6B.300d.txt"
# gensim >= 4.0 reads the headerless GloVe format directly; on older gensim,
# first convert with gensim.scripts.glove2word2vec (deprecated in 4.x).
vectors = KeyedVectors.load_word2vec_format(glove_path, binary=False, no_header=True)
print(vectors.vector_size)        # 300
print(len(vectors.key_to_index))  # 400,000-token vocabulary

# 3) Single-word lookups
print(vectors['the'][:5])   # first 5 dims
print(vectors['to'][:5])

# NOTE: the glove.6B vocabulary is lowercased.
# 'U.S.' -> 'u.s.' in this vocab (if present). Safest: lowercase your tokens
# and guard the lookup so a missing token doesn't raise KeyError.
if 'u.s.' in vectors.key_to_index:
    print(vectors['u.s.'][:5])
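
As a quick sanity check that the file loaded correctly, gensim's KeyedVectors also exposes similarity queries (most_similar and similarity are standard KeyedVectors methods; the query words here are arbitrary examples):

print(vectors.most_similar("king", topn=5))  # nearest neighbors by cosine; exact results depend on the corpus
print(vectors.similarity("coffee", "tea"))   # scalar cosine similarity in [-1, 1]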

Now embed your own text:

import re, numpy as np

# simple tokenizer that keeps dots in tokens like u.s.
TOKEN_RE = re.compile(r"[A-Za-z0-9\.\-']+")

def tokenize(text: str):
    return TOKEN_RE.findall(text.lower())

def embed_tokens(tokens, kv: KeyedVectors):
    vecs = [kv[w] for w in tokens if w in kv.key_to_index]
    return np.stack(vecs) if vecs else np.zeros((0, kv.vector_size))

def embed_sentence_avg(text: str, kv: KeyedVectors):
    V = embed_tokens(tokenize(text), kv)
    return V.mean(axis=0) if V.size else np.zeros(kv.vector_size)

# Examples
print(embed_sentence_avg("The quick brown fox jumps over the lazy dog.", vectors)[:10])
print(embed_tokens(tokenize("U.S. interest rates rose today."), vectors).shape)  # (n_tokens, 300)
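
A typical use of these averaged vectors is comparing sentences. Here is a minimal cosine-similarity sketch on top of embed_sentence_avg (the two sentences are arbitrary examples):

def cosine(a, b):
    # Cosine similarity with a guard for zero vectors (e.g. fully-OOV input)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

s1 = embed_sentence_avg("Interest rates rose today.", vectors)
s2 = embed_sentence_avg("The central bank raised rates.", vectors)
print(cosine(s1, s2))  # closer to 1.0 means more similar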

Tips

- The glove.6B vocabulary is uncased: lowercase every token before lookup.
- Skip out-of-vocabulary tokens (as embed_tokens does) instead of letting a KeyError abort the whole batch.
- glove.6B.zip also contains 50d, 100d, and 200d files if you want smaller, faster vectors.

Option B — Let torchtext download and serve GloVe for you

This avoids manual downloading/conversion.

import torch
from torchtext.vocab import GloVe
import re

glove = GloVe(name="6B", dim=300)  # auto-downloads to ~/.vector_cache
stoi = glove.stoi    # word -> index
vecs = glove.vectors # tensor [vocab, 300]

def get_word_vec(word: str):
    idx = stoi.get(word.lower())
    return vecs[idx] if idx is not None else None

print(get_word_vec("the")[:5])
print(get_word_vec("to")[:5])
print(get_word_vec("U.S.")[:5])     # becomes None in many builds; try "u.s.":
print(get_word_vec("u.s.")[:5])

# Sentence embedding (average)
TOKEN_RE = re.compile(r"[A-Za-z0-9\.\-']+")
def embed_sentence_avg(text: str):
    toks = TOKEN_RE.findall(text.lower())
    xs = [get_word_vec(t) for t in toks]
    xs = [x for x in xs if x is not None]
    return torch.stack(xs, dim=0).mean(0) if xs else torch.zeros(glove.dim)

print(embed_sentence_avg("The quick brown fox...")[:10])
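
The same sentence comparison works on the torch side; torch.nn.functional.cosine_similarity handles the normalization (again, the sentences are just example inputs):

import torch.nn.functional as F

a = embed_sentence_avg("Interest rates rose today.")
b = embed_sentence_avg("The central bank raised rates.")
# cosine_similarity expects a batch dimension, hence unsqueeze(0)
print(F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item())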

What you asked about the specific rows

Those rows like:

the  0.418 0.24968 -0.41242 0.1217 ... 0.34527  # 300 floats total
to   0.68047 -0.039263 -0.67412 0.056051 ... -0.067549
U.S. 0.3459  0.21513  -0.09982 0.25344 ... 0.024118

are exactly what you’ll see in the plain-text .txt files: each line is a token followed by its 300 floats, space-separated. With gensim or torchtext you don’t need to parse those lines yourself; both load the file and give you direct vector lookups.
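
If you ever do want to read the raw file yourself (say, to keep only a small vocabulary), the parsing is one split per line. A minimal sketch; note the naive split assumes tokens contain no internal spaces, which holds for glove.6B:

import numpy as np

def load_glove_txt(path, keep=None):
    # Parse a GloVe .txt file into {token: vector}; 'keep' optionally limits the vocab.
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *floats = line.rstrip().split(" ")
            if keep is None or token in keep:
                table[token] = np.asarray(floats, dtype=np.float32)
    return table

small = load_glove_txt("glove.6B.300d.txt", keep={"the", "to", "u.s."})
print(small["the"][:5])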


Common pitfalls (and fixes)

- KeyError or None on "U.S.": the 6B vocabulary is lowercased, so look up "u.s." (better: lowercase all tokens up front).
- glove2word2vec is deprecated in gensim 4.x: load the .txt directly with no_header=True instead.
- Empty or fully out-of-vocabulary input: both embed_sentence_avg versions above return a zero vector instead of crashing on an empty mean.
- Slow first run with torchtext: GloVe(name="6B") downloads a large zip (roughly 800 MB) into ~/.vector_cache; later runs read the cache.

If you want, tell me your exact text processing goal (classification, similarity, clustering, etc.) and I’ll sketch a minimal pipeline using GloVe end-to-end.

