Image Processing with CNNs and Vision Transformers (ViT) | Generated by AI


Below are code examples demonstrating how images are processed using Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) in Python with PyTorch.


1. Image Processing with a CNN (Convolutional Neural Network)

CNNs are widely used for image classification, object detection, and feature extraction.

Example: Using a Pre-trained CNN (ResNet)

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pre-trained ResNet model
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # 'pretrained=True' is deprecated in recent torchvision
model.eval()  # Set to evaluation mode

# Define image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load and preprocess an image
image = Image.open("example.jpg").convert("RGB")  # Replace with your image path
input_tensor = preprocess(image)
input_batch = input_tensor.unsqueeze(0)  # Add batch dimension

# Move to GPU if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

# Run a forward pass (the full model outputs the 1000-dimensional
# ImageNet class logits, not an intermediate feature vector)
with torch.no_grad():
    logits = model(input_batch)

print("Output shape:", logits.shape)  # torch.Size([1, 1000])

Explanation:

  1. ResNet18 is a CNN architecture pre-trained on ImageNet.
  2. The image is preprocessed (resized, center-cropped, and normalized with ImageNet statistics).
  3. Running the full model yields a 1000-dimensional vector of ImageNet class logits; for a true feature vector, the final classification layer is dropped, as sketched below.
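
If you want the 512-dimensional feature vector from the layer just before the classifier, one common approach is to swap the fully connected head for an identity mapping. This is a minimal sketch that reuses the preprocessing and input_batch from the snippet above:

# Replace the classification head with an identity so the forward pass
# returns the 512-dimensional pooled features instead of class logits
feature_model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_model.fc = torch.nn.Identity()
feature_model.eval()
feature_model.to(input_batch.device)  # keep model and input on the same device

with torch.no_grad():
    embedding = feature_model(input_batch)

print("Embedding shape:", embedding.shape)  # torch.Size([1, 512])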

2. Image Processing with a Vision Transformer (ViT)

ViTs treat images as sequences of patches and use self-attention mechanisms (like in NLP).

Example: Using a Pre-trained ViT (Hugging Face)

from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import torch

# Load a pre-trained Vision Transformer (ViT)
model_name = "google/vit-base-patch16-224-in21k"
image_processor = ViTImageProcessor.from_pretrained(model_name)  # ViTFeatureExtractor is deprecated
model = ViTModel.from_pretrained(model_name)

# Load an image
image = Image.open("example.jpg").convert("RGB")  # Replace with your image path

# Preprocess the image (resize and normalize into a pixel_values tensor;
# the model itself splits the result into 16x16 patches)
inputs = image_processor(images=image, return_tensors="pt")

# Extract features (CLS token or patch embeddings)
with torch.no_grad():
    outputs = model(**inputs)

# Get the feature vector (CLS token)
features = outputs.last_hidden_state[:, 0, :]  # Shape: [1, 768]

print("Feature vector shape:", features.shape)  # e.g., torch.Size([1, 768])

Explanation:

  1. ViT splits the image into 16x16-pixel patches (a 224x224 image gives 14 x 14 = 196 patches) and processes them like tokens in NLP.
  2. The CLS token (first token) represents the entire image’s feature vector.
  3. The output is a 768-dimensional vector (for vit-base); the sketch below verifies the token count.
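
To make the patch arithmetic concrete, here is a small check against the outputs already computed in the snippet above:

# The hidden states contain one 768-dimensional vector per patch plus the CLS token:
# (224 / 16) ** 2 = 196 patches, plus 1 CLS token = 197 tokens
tokens = outputs.last_hidden_state
print("Token sequence shape:", tokens.shape)  # torch.Size([1, 197, 768])

cls_embedding = tokens[:, 0, :]      # whole-image feature, shape [1, 768]
patch_embeddings = tokens[:, 1:, :]  # per-patch features, shape [1, 196, 768]
print("CLS:", cls_embedding.shape, "Patches:", patch_embeddings.shape)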

3. Comparing CNN vs. ViT Feature Extraction

| Model | Approach | Feature Vector Size | Libraries |
|-------|----------|---------------------|-----------|
| CNN (ResNet18) | Convolutional layers + pooling | 1000 (ImageNet class logits); 512 with the classifier head removed | torchvision |
| ViT (Google ViT-Base) | Patch embeddings + Transformer self-attention | 768 (hidden dimension) | transformers |


4. Applications of Image Feature Vectors
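
A common application is image similarity: two images whose feature vectors point in nearly the same direction usually show similar content. The snippet below assumes features1 and features2 are vectors of shape [1, D] extracted with either model above.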

# Example: Compute similarity between two images
from sklearn.metrics.pairwise import cosine_similarity

# Assume features1 and features2 are extracted vectors of shape [1, D];
# convert torch tensors to NumPy arrays before calling scikit-learn
similarity = cosine_similarity(features1.cpu().numpy(), features2.cpu().numpy())
print("Similarity score:", similarity[0][0])

Key Takeaways

  1. CNNs (e.g., ResNet) use convolutional layers to extract hierarchical features.
  2. ViTs (e.g., Google’s ViT) treat images as patch sequences and use self-attention.
  3. Both convert images into feature vectors for downstream tasks.

Would you like an example on training a custom CNN/ViT from scratch? Let me know! 🚀

