Image Processing with CNNs and Vision Transformers (ViT) | Generated by AI
Below are code examples demonstrating how images are processed using Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) in Python with PyTorch.
1. Image Processing with a CNN (Convolutional Neural Network)
CNNs are widely used for image classification, object detection, and feature extraction.
Example: Using a Pre-trained CNN (ResNet)
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
# Load a pre-trained ResNet model
model = models.resnet18(pretrained=True)
model.eval() # Set to evaluation mode
# Define image preprocessing
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Load and preprocess an image
image = Image.open("example.jpg") # Replace with your image path
input_tensor = preprocess(image)
input_batch = input_tensor.unsqueeze(0) # Add batch dimension
# Move to GPU if available
if torch.cuda.is_available():
input_batch = input_batch.to('cuda')
model.to('cuda')
# Extract features (before the final classification layer)
with torch.no_grad():
features = model(input_batch)
print("Feature vector shape:", features.shape) # e.g., torch.Size([1, 1000])
Explanation:
- ResNet18 is a CNN architecture pre-trained on ImageNet.
- The image is preprocessed (resized, normalized).
- The model converts the image into a feature vector (e.g., 1000-dimensional for ResNet18).
2. Image Processing with a Vision Transformer (ViT)
ViTs treat images as sequences of patches and use self-attention mechanisms (like in NLP).
Example: Using a Pre-trained ViT (Hugging Face)
from transformers import ViTFeatureExtractor, ViTModel
from PIL import Image
import torch
# Load a pre-trained Vision Transformer (ViT)
model_name = "google/vit-base-patch16-224-in21k"
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)
# Load an image
image = Image.open("example.jpg") # Replace with your image path
# Preprocess the image (convert to patches)
inputs = feature_extractor(images=image, return_tensors="pt")
# Extract features (CLS token or patch embeddings)
with torch.no_grad():
outputs = model(**inputs)
# Get the feature vector (CLS token)
features = outputs.last_hidden_state[:, 0, :] # Shape: [1, 768]
print("Feature vector shape:", features.shape) # e.g., torch.Size([1, 768])
Explanation:
- ViT splits the image into 16x16 patches and processes them like tokens in NLP.
- The
CLS token
(first token) represents the entire image’s feature vector. - The output is a 768-dimensional vector (for
vit-base
).
3. Comparing CNN vs. ViT Feature Extraction
| Model | Approach | Feature Vector Size | Libraries |
|——-|———-|———————|———–|
| CNN (ResNet18) | Convolutional layers + pooling | 1000 (ImageNet classes) | torchvision
|
| ViT (Google ViT-Base) | Patch embeddings + Transformer | 768 (hidden dimension) | transformers
|
4. Applications of Image Feature Vectors
- Image Search: Compare feature vectors (e.g., cosine similarity).
- Transfer Learning: Use pre-trained features for custom tasks.
- Object Detection: Extract regions of interest (ROI) from features.
# Example: Compute similarity between two images
from sklearn.metrics.pairwise import cosine_similarity
# Assume features1 and features2 are extracted vectors
similarity = cosine_similarity(features1, features2)
print("Similarity score:", similarity[0][0])
Key Takeaways
- CNNs (e.g., ResNet) use convolutional layers to extract hierarchical features.
- ViTs (e.g., Google’s ViT) treat images as patch sequences and use self-attention.
- Both convert images into feature vectors for downstream tasks.
Would you like an example on training a custom CNN/ViT from scratch? Let me know! 🚀