
Improving Your Jekyll Blog with ML/AI Using PyTorch

Since you’re running a Jekyll blog (a static site generator), integrating ML/AI features like recommendations or categorization takes some creativity. Jekyll builds static HTML, so dynamic elements (e.g., real-time recommendations) either need client-side JavaScript or have to be precomputed during the build via scripts or plugins. You mentioned avoiding LLM APIs and training your own neural networks with PyTorch, which keeps everything local and customizable. I’ll outline practical ideas focused on PyTorch implementations. They assume you have basic libraries like NumPy for data handling and can do text preprocessing with simple tokenization (advanced NLP libraries like Hugging Face aren’t mentioned in your setup, but you can add them locally if needed).

You’ll likely create Python scripts (e.g., in your scripts/ directory) that run as part of your build pipeline (via a Makefile hook or GitHub Actions if deployed), before jekyll build so their output is available to Liquid. For example: process the Markdown posts in _posts/, write the results as JSON into _data/ (which Liquid exposes as site.data), and render them in your layouts.
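
For instance, a small helper like the following can be called at the end of each script to drop its output where Liquid can see it. This is only a sketch: the write_site_data name, the file layout, and the example payload are placeholders, not part of any existing script.

import json
import os

# Hypothetical helper: write a Python object as JSON into Jekyll's _data/
# directory, where Liquid templates can read it as site.data.<name>.
def write_site_data(name, payload, data_dir='_data'):
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, f'{name}.json'), 'w') as f:
        json.dump(payload, f, indent=2)

# Example: write_site_data('categories', {'2025-08-01-some-post': 'ML'})

In a layout you could then loop over site.data.categories (or whatever name you choose) to print each post’s label.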

1. Article Categorization with a PyTorch Classifier

Categorize posts automatically (e.g., into topics like “ML”, “Notes”, “Latex”) by training a simple neural network classifier. This is supervised learning: you’ll need to manually label a subset of your posts as training data. If you don’t have labels, start with unsupervised clustering (see below).

Steps:

1. Load the post bodies from _posts/ and strip the YAML frontmatter.
2. Build a vocabulary of the most frequent words and turn each post into a bag-of-words vector.
3. Define a small feed-forward classifier in PyTorch.
4. Train it with cross-entropy loss on the posts you’ve labeled.
5. At build time, classify every post and write the results to JSON for Jekyll.

Example PyTorch Code Snippet (in a script like scripts/categorize_posts.py):

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import os
from collections import Counter

# Step 1: Load and preprocess data (simplified)
def load_posts(posts_dir='_posts'):
    texts = []
    labels = []  # Assume manual labels: 0=ML, 1=Notes, etc.
    for file in os.listdir(posts_dir):
        if file.endswith('.md'):
            with open(os.path.join(posts_dir, file), 'r') as f:
                content = f.read().split('---', 2)[2].strip()  # Skip YAML frontmatter; maxsplit keeps any '---' inside the body
                texts.append(content)
                # Placeholder: load label from a dict or CSV
                labels.append(0)  # Replace with actual labels
    return texts, labels

texts, labels = load_posts()
# Build vocab (top 1000 words)
all_words = ' '.join(texts).lower().split()
vocab = {word: idx for idx, (word, _) in enumerate(Counter(all_words).most_common(1000))}  # most_common gives (word, count) pairs
vocab_size = len(vocab)

# Convert text to vectors (bag-of-words)
def text_to_vec(text):
    vec = np.zeros(vocab_size)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

X = np.array([text_to_vec(t) for t in texts])
y = torch.tensor(labels, dtype=torch.long)

# Step 2: Define model
class Classifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, num_classes)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = Classifier(vocab_size, num_classes=3)  # Adjust num_classes

# Step 3: Train
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
X_tensor = torch.tensor(X, dtype=torch.float32)

for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(X_tensor)
    loss = loss_fn(outputs, y)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Step 4: Inference on new post
def classify_post(text):
    vec = torch.tensor(text_to_vec(text), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        pred = model(vec).argmax(1).item()
    return pred  # Map back to category name

# Save model: torch.save(model.state_dict(), 'classifier.pth')
# In build script: classify all posts and write to JSON

Improvements: For better accuracy, use word embeddings (train a simple Embedding layer in PyTorch, as sketched below) or add more layers. If your posts are unlabeled, switch to clustering (e.g., KMeans on the embeddings; see the next section). Run the script from your Makefile before the build, e.g. python scripts/categorize_posts.py && jekyll build, so the generated JSON exists when Jekyll renders the site.
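
Here’s a rough sketch of that embedding idea (not a drop-in replacement: it reuses the vocab dict built above and assumes you feed it word indices rather than bag-of-words vectors):

import torch
import torch.nn as nn

# Sketch: classifier with a learned embedding layer instead of bag-of-words.
# Assumes `vocab` (word -> index) from the script above; indices are offset
# by 1 so that 0 can serve as padding.
class EmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=64):
        super().__init__()
        # EmbeddingBag averages the word embeddings of each document.
        self.embedding = nn.EmbeddingBag(vocab_size + 1, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.fc(self.embedding(token_ids, offsets))

def encode(texts, vocab):
    # Flatten all documents into one index tensor plus per-document offsets,
    # the input format nn.EmbeddingBag expects.
    ids, offsets = [], []
    for text in texts:
        offsets.append(len(ids))
        ids.extend(vocab[w] + 1 for w in text.lower().split() if w in vocab)
    return torch.tensor(ids, dtype=torch.long), torch.tensor(offsets, dtype=torch.long)

# Usage: token_ids, offsets = encode(texts, vocab)
#        logits = EmbeddingClassifier(len(vocab), num_classes=3)(token_ids, offsets)

You can train it with the same Adam/CrossEntropyLoss loop as above.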

2. Recommendation System with PyTorch Embeddings

Recommend similar articles to readers (e.g., “You might also like…”). Use content-based recommendation: learn an embedding for each post, then compute pairwise cosine similarity. No user data is needed, just the post content.

Steps:

1. Reuse load_posts and the bag-of-words vectors from the categorization script.
2. Train an autoencoder that compresses each post vector into a small embedding.
3. Take the encoder outputs as post embeddings.
4. Compute pairwise cosine similarity and keep the top 3 neighbors per post.
5. Export the results to JSON so Jekyll (or client-side JS) can render “related posts”.

Example PyTorch Code Snippet (in scripts/recommend_posts.py):

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Reuse load_posts and text_to_vec from above

texts, _ = load_posts()  # Ignore labels
X = np.array([text_to_vec(t) for t in texts])
X_tensor = torch.tensor(X, dtype=torch.float32)

# Autoencoder model
class Autoencoder(nn.Module):
    def __init__(self, input_size, embedding_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_size, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_size)
        )
        self.decoder = nn.Sequential(
            nn.Linear(embedding_size, 256),
            nn.ReLU(),
            nn.Linear(256, input_size)
        )
    
    def forward(self, x):
        emb = self.encoder(x)
        return self.decoder(emb)

model = Autoencoder(vocab_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    reconstructed = model(X_tensor)
    loss = loss_fn(reconstructed, X_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 20 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')

# Get embeddings
with torch.no_grad():
    embeddings = model.encoder(X_tensor).numpy()

# Recommend: for post i, find top 3 similar
similarities = cosine_similarity(embeddings)
for i in range(len(texts)):
    rec_indices = similarities[i].argsort()[-4:-1][::-1]  # Top 3 excluding self
    print(f'Recs for post {i}: {rec_indices}')

# Save embeddings to JSON for Jekyll
import json
with open('embeddings.json', 'w') as f:
    json.dump({'embeddings': embeddings.tolist(), 'posts': [f'post_{i}' for i in range(len(texts))]}, f)
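
If you’d rather not ship raw embeddings to the browser, you can also precompute each post’s neighbors in the same script and write them into _data/ for Liquid. A sketch, appended to the script above (recommendations.json and the post_i identifiers are placeholders; swap in your real post slugs):

import os

# Sketch: turn the similarity matrix into per-post recommendation lists and
# write them where Liquid can read them as site.data.recommendations.
recommendations = {}
for i in range(len(texts)):
    rec_indices = similarities[i].argsort()[-4:-1][::-1]  # top 3, excluding self
    recommendations[f'post_{i}'] = [f'post_{j}' for j in rec_indices]

os.makedirs('_data', exist_ok=True)
with open(os.path.join('_data', 'recommendations.json'), 'w') as f:
    json.dump(recommendations, f, indent=2)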

Improvements: Use a variational autoencoder for better embeddings. If you have user views (via analytics), add collaborative filtering with a matrix factorization model in PyTorch; a minimal sketch follows. Client-side: load embeddings.json in JavaScript and compute similarities on the fly for per-reader personalization.
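
A matrix factorization model can be as small as two embedding tables. This is only a sketch under the assumption that you can map readers and posts to integer IDs and collect (reader, post, view-count) triples; the numbers below are made up:

import torch
import torch.nn as nn
import torch.optim as optim

# Sketch: collaborative filtering via matrix factorization.
# Assumes (reader_id, post_id, score) triples from analytics; the IDs and
# scores below are placeholder data.
class MatrixFactorization(nn.Module):
    def __init__(self, num_readers, num_posts, dim=32):
        super().__init__()
        self.reader_emb = nn.Embedding(num_readers, dim)
        self.post_emb = nn.Embedding(num_posts, dim)

    def forward(self, reader_ids, post_ids):
        # Predicted affinity = dot product of reader and post vectors.
        return (self.reader_emb(reader_ids) * self.post_emb(post_ids)).sum(dim=1)

# Placeholder interactions: reader i viewed post j with some intensity.
readers = torch.tensor([0, 0, 1, 2])
posts = torch.tensor([1, 3, 2, 1])
scores = torch.tensor([1.0, 2.0, 1.0, 3.0])

model = MatrixFactorization(num_readers=3, num_posts=4)
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(readers, posts), scores)
    loss.backward()
    optimizer.step()

After training, model.post_emb.weight gives post vectors you can feed into the same cosine-similarity step as above, or you can score every post for a given reader and surface the top hits.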

3. Other Ideas with PyTorch

General Tips:

If you share more details (e.g., post count, topics), I can refine these!

