# U-Net for Polyp Segmentation — Kvasir-SEG
Binary semantic segmentation model trained to detect gastrointestinal polyps in colonoscopy images.
## Model Architecture
U-Net (Ronneberger et al., 2015) implemented from scratch in PyTorch.
Encoder (contracting path) — 4 blocks of two 3×3 Conv → ReLU, each followed by a 2×2 MaxPool:
| Level | Channels |
|---|---|
| enc1 | 3 → 64 |
| enc2 | 64 → 128 |
| enc3 | 128 → 256 |
| enc4 | 256 → 512 |
| bottleneck | 512 → 1024 |
Decoder (expanding path) — 4 blocks of ConvTranspose2d (2×2, stride 2) + skip-connection concatenation + two 3×3 Conv → ReLU:
| Level | Channels (after concat) |
|---|---|
| dec4 | 1024 → 512 |
| dec3 | 512 → 256 |
| dec2 | 256 → 128 |
| dec1 | 128 → 64 |
Output: 1×1 Conv → 1 channel (raw logits for the binary mask)
Skip connections use concatenation (not addition), transferring fine-grained spatial features from each encoder level to the matching decoder level. This is the key difference from ResNet residual connections, which use addition for gradient flow within the same path.
Input images are resized to 256×256 RGB. Output is a (B, 1, 256, 256) logit map; apply a sigmoid and threshold at 0.5 to obtain the binary mask.
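The tables above map directly onto a compact PyTorch module. The sketch below is illustrative only — the repository ships its own `model.py`, and names such as `double_conv` and `base` are ours, not necessarily the repo's:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 Conv -> ReLU layers, as in each encoder/decoder block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=1, base=64):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)                # 3 -> 64
        self.enc2 = double_conv(base, base * 2)             # 64 -> 128
        self.enc3 = double_conv(base * 2, base * 4)         # 128 -> 256
        self.enc4 = double_conv(base * 4, base * 8)         # 256 -> 512
        self.bottleneck = double_conv(base * 8, base * 16)  # 512 -> 1024
        self.pool = nn.MaxPool2d(2)
        self.up4 = nn.ConvTranspose2d(base * 16, base * 8, 2, stride=2)
        self.dec4 = double_conv(base * 16, base * 8)        # 1024 -> 512 after concat
        self.up3 = nn.ConvTranspose2d(base * 8, base * 4, 2, stride=2)
        self.dec3 = double_conv(base * 8, base * 4)         # 512 -> 256
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)         # 256 -> 128
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = double_conv(base * 2, base)             # 128 -> 64
        self.head = nn.Conv2d(base, out_ch, 1)              # 1x1 conv -> raw logits

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        b = self.bottleneck(self.pool(e4))
        # Skip connections: channel-wise concat (not addition, as in ResNet)
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)
```

Because the concatenation doubles the channel count at each decoder level, the first conv of every decoder block takes twice as many input channels as the matching encoder block outputs.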
## Loss Function
Combined BCE + Dice loss:
L = BCE(logits, targets) + DiceLoss(logits, targets)
Why not standard cross-entropy alone? Polyp segmentation has strong class imbalance — background pixels vastly outnumber polyp pixels. A model that predicts all background achieves high pixel accuracy but is clinically useless. BCE alone is dominated by the majority class.
Dice Loss directly optimizes the overlap between prediction and ground truth:
L_Dice = 1 - (2 * |Pred ∩ GT| + smooth) / (|Pred| + |GT| + smooth)
It is largely insensitive to background dominance because it measures only the ratio of overlap to the total predicted and ground-truth foreground pixels. The smoothing term (1e-6) guards against division by zero.
BCE + Dice combines stable gradients (BCE) with overlap-aware optimization (Dice). This is the standard default for medical image segmentation.
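The combined loss can be sketched in a few lines of PyTorch. This is an illustrative version, assuming a soft (sigmoid-probability) Dice term and mean-reduced BCE; the training code's exact reduction choices may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Combined BCE-with-logits + soft Dice loss for binary segmentation."""

    def __init__(self, smooth=1e-6):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, targets):
        # BCE on raw logits: numerically stable, gives dense per-pixel gradients
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        # Soft Dice on sigmoid probabilities: overlap-aware, imbalance-robust
        probs = torch.sigmoid(logits)
        inter = (probs * targets).sum()
        dice = 1 - (2 * inter + self.smooth) / (probs.sum() + targets.sum() + self.smooth)
        return bce + dice
```

A near-perfect prediction drives both terms toward zero, while an all-background prediction on a polyp image keeps the Dice term near 1 even when pixel accuracy looks high — exactly the imbalance failure mode described above.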
## Training
- Dataset: Kvasir-SEG — 800 train / 100 validation (from Angelou0516/kvasir-seg)
- Epochs: 50
- Optimizer: AdamW (via HuggingFace Trainer)
- Learning rate: 1e-4 (linear decay)
- Batch size: 4
- Device: Apple MPS (M4)
- Final training loss: 0.075
Data augmentation (train only, synchronized image+mask):
- Random horizontal flip (p=0.5)
- Random vertical flip (p=0.5)
- Random rotation ±30°
- Color jitter — brightness=0.3, contrast=0.3 (image only)
## Usage
```python
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from torchvision.transforms import Compose, Resize, ToTensor
from model import UNet  # model.py from this repository

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = UNet().to(device)
weights_path = hf_hub_download(repo_id="beatrizfarias/unet-kvasir-seg", filename="model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Preprocess: resize to the 256x256 training resolution
pil_image = Image.open("polyp.jpg")  # path to your colonoscopy image
transform = Compose([Resize((256, 256)), ToTensor()])
image = transform(pil_image.convert("RGB")).unsqueeze(0).to(device)

# Predict: sigmoid + 0.5 threshold turns logits into a binary mask
with torch.no_grad():
    logits = model(image)
mask = (torch.sigmoid(logits) > 0.5).float()
```
## Results
Training loss (BCE + Dice) over 50 epochs:
| Epoch | Loss |
|---|---|
| 1 | ~1.10 |
| 10 | ~0.90 |
| 20 | ~0.70 |
| 30 | ~0.51 |
| 40 | ~0.19 |
| 50 | 0.075 |
The model performs well on medium-to-large polyps with clear boundaries. Small or subtle lesions are harder — expected behavior for a from-scratch U-Net without attention mechanisms.
## References
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
- Jha, D., et al. (2020). Kvasir-SEG: A Segmented Polyp Dataset. MMM.