# U-Net for Polyp Segmentation — Kvasir-SEG
Binary semantic segmentation model trained to detect gastrointestinal polyps in colonoscopy images.
## Model Architecture
U-Net (Ronneberger et al., 2015) implemented from scratch in PyTorch.
Encoder (contracting path) — 4 blocks of two 3×3 Conv → ReLU, each followed by a 2×2 MaxPool:
| Level | Channels |
|---|---|
| enc1 | 3 → 64 |
| enc2 | 64 → 128 |
| enc3 | 128 → 256 |
| enc4 | 256 → 512 |
| bottleneck | 512 → 1024 |
Decoder (expanding path) — 4 blocks of ConvTranspose2d (2×2, stride 2) + skip-connection concatenation + two 3×3 Conv → ReLU:
| Level | Channels (after concat) |
|---|---|
| dec4 | 1024 → 512 |
| dec3 | 512 → 256 |
| dec2 | 256 → 128 |
| dec1 | 128 → 64 |
Output: 1×1 Conv → 1 channel (raw logits for the binary mask)
Skip connections use concatenation (not addition), transferring fine-grained spatial features from each encoder level to the matching decoder level. This is the key difference from ResNet residual connections, which use addition for gradient flow within the same path.
Input images are resized to 256×256 RGB. Output is a (B, 1, 256, 256) logit map; apply a sigmoid and threshold at 0.5 to obtain the binary mask.
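The tables above map directly onto a compact PyTorch module. The sketch below is illustrative only — the repository ships its own `model.py`, and names such as `double_conv` and `base` are ours, not necessarily the repo's:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 Conv -> ReLU layers, as in each encoder/decoder block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, base=64):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)                # 3 -> 64
        self.enc2 = double_conv(base, base * 2)             # 64 -> 128
        self.enc3 = double_conv(base * 2, base * 4)         # 128 -> 256
        self.enc4 = double_conv(base * 4, base * 8)         # 256 -> 512
        self.bottleneck = double_conv(base * 8, base * 16)  # 512 -> 1024
        self.pool = nn.MaxPool2d(2)
        self.up4 = nn.ConvTranspose2d(base * 16, base * 8, 2, stride=2)
        self.dec4 = double_conv(base * 16, base * 8)        # 1024 -> 512 after concat
        self.up3 = nn.ConvTranspose2d(base * 8, base * 4, 2, stride=2)
        self.dec3 = double_conv(base * 8, base * 4)         # 512 -> 256
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)         # 256 -> 128
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)             # 128 -> 64
        self.head = nn.Conv2d(base, out_ch, 1)              # 1x1 conv -> raw logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        b = self.bottleneck(self.pool(e4))
        # Skip connections: channel-wise concat (not addition, as in ResNet)
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)
```

Because the concatenation doubles the channel count at each decoder level, the first conv of every decoder block takes twice as many input channels as the matching encoder block outputs.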
## Loss Function
Combined BCE + Dice loss:
L = BCE(logits, targets) + DiceLoss(logits, targets)
Why not standard cross-entropy alone? Polyp segmentation has strong class imbalance — background pixels vastly outnumber polyp pixels. A model that predicts all background achieves high pixel accuracy but is clinically useless. BCE alone is dominated by the majority class.
Dice Loss directly optimizes the overlap between prediction and ground truth:
L_Dice = 1 - (2 * |Pred ∩ GT| + smooth) / (|Pred| + |GT| + smooth)
It is largely insensitive to background dominance because it measures only the ratio of overlap to the total predicted and ground-truth foreground pixels. The smoothing term (1e-6) guards against division by zero.
BCE + Dice combines stable gradients (BCE) with overlap-aware optimization (Dice). This is the standard default for medical image segmentation.
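The combined loss can be sketched in a few lines of PyTorch. This is an illustrative version, assuming a soft (sigmoid-probability) Dice term and mean-reduced BCE; the training code's exact reduction choices may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Combined BCE-with-logits + soft Dice loss for binary segmentation."""

    def __init__(self, smooth=1e-6):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, targets):
        # BCE on raw logits: numerically stable, gives dense per-pixel gradients
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        # Soft Dice on sigmoid probabilities: overlap-aware, imbalance-robust
        probs = torch.sigmoid(logits)
        inter = (probs * targets).sum()
        dice = 1 - (2 * inter + self.smooth) / (probs.sum() + targets.sum() + self.smooth)
        return bce + dice
```

A near-perfect prediction drives both terms toward zero, while an all-background prediction on a polyp image keeps the Dice term near 1 even when pixel accuracy looks high — exactly the imbalance failure mode described above.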
## Training
- Dataset: Kvasir-SEG — 800 train / 100 validation (from Angelou0516/kvasir-seg)
- Epochs: 50
- Optimizer: AdamW (via HuggingFace Trainer)
- Learning rate: 1e-4 (linear decay)
- Batch size: 4
- Device: Apple MPS (M4)
- Final training loss: 0.075
Data augmentation (train only, synchronized image+mask):
- Random horizontal flip (p=0.5)
- Random vertical flip (p=0.5)
- Random rotation ±30°
- Color jitter — brightness=0.3, contrast=0.3 (image only)
## Usage
```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torchvision.transforms import Compose, Resize, ToTensor
from model import UNet  # model.py from this repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = UNet().to(device)
weights_path = hf_hub_download(repo_id="beatrizfarias/unet-kvasir-seg", filename="model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Preprocess: resize to the 256x256 training resolution
pil_image = Image.open("polyp.jpg")  # path to your colonoscopy image
transform = Compose([Resize((256, 256)), ToTensor()])
image = transform(pil_image.convert("RGB")).unsqueeze(0).to(device)

# Predict: sigmoid + 0.5 threshold turns logits into a binary mask
with torch.no_grad():
    logits = model(image)
mask = (torch.sigmoid(logits) > 0.5).float()
```
## Results
Training loss (BCE + Dice) over 50 epochs:
| Epoch | Loss |
|---|---|
| 1 | ~1.10 |
| 10 | ~0.90 |
| 20 | ~0.70 |
| 30 | ~0.51 |
| 40 | ~0.19 |
| 50 | 0.075 |
The model performs well on medium-to-large polyps with clear boundaries. Small or subtle lesions are harder — expected behavior for a from-scratch U-Net without attention mechanisms.
## References
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
- Jha, D., et al. (2020). Kvasir-SEG: A Segmented Polyp Dataset. MMM.