U-Net for Polyp Segmentation — Kvasir-SEG

Binary semantic segmentation model trained to detect gastrointestinal polyps in colonoscopy images.

Model Architecture

U-Net (Ronneberger et al., 2015) implemented from scratch in PyTorch.

Encoder (contracting path) — 4 blocks of two 3×3 Conv → ReLU, followed by 2×2 MaxPool:

Level       Channels
enc1        3 → 64
enc2        64 → 128
enc3        128 → 256
enc4        256 → 512
bottleneck  512 → 1024

Decoder (expanding path) — 4 blocks of ConvTranspose2d (2×2, stride 2) + skip connection concatenation + two 3×3 Conv → ReLU:

Level  Channels (after concat)
dec4   1024 → 512
dec3   512 → 256
dec2   256 → 128
dec1   128 → 64

Output: 1×1 Conv → 1 channel (raw logits for binary mask)

Skip connections use concatenation (not addition), transferring fine-grained spatial features from each encoder level to the matching decoder level. This is the key difference from ResNet residual connections, which use addition for gradient flow within the same path.

Input images are resized to 256×256 RGB. Output is a (B, 1, 256, 256) logit map; apply a sigmoid and threshold at 0.5 to get the binary mask.
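
The channel tables above can be sketched as a compact PyTorch module. This is a minimal re-implementation for illustration — layer and variable names are assumptions, not necessarily those in the repository's `model.py`:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 Conv -> ReLU, as in every encoder/decoder block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [64, 128, 256, 512]
        self.encoders = nn.ModuleList()
        c_in = 3
        for c in chs:  # enc1..enc4
            self.encoders.append(double_conv(c_in, c))
            c_in = c
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(512, 1024)
        # dec4..dec1: upsample, then convolve the concatenated channels
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(c * 2, c, 2, stride=2) for c in reversed(chs)
        )
        self.decoders = nn.ModuleList(
            double_conv(c * 2, c) for c in reversed(chs)  # c*2 after skip concat
        )
        self.head = nn.Conv2d(64, 1, 1)  # 1x1 conv -> raw logits

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))  # skip: concatenation, not addition
        return self.head(x)

logits = UNet()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 1, 256, 256])
```

Note how the decoder channel counts (e.g. dec4: 1024 → 512) only work out because each `double_conv` receives the upsampled features concatenated with the matching encoder skip.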

Loss Function

Combined BCE + Dice loss:

L = BCE(logits, targets) + DiceLoss(logits, targets)

Why not standard cross-entropy alone? Polyp segmentation has strong class imbalance — background pixels vastly outnumber polyp pixels. A model that predicts all background achieves high pixel accuracy but is clinically useless. BCE alone is dominated by the majority class.

Dice Loss directly optimizes the overlap between prediction and ground truth:

L_Dice = 1 - (2 * |Pred ∩ GT| + smooth) / (|Pred| + |GT| + smooth)

It is largely insensitive to background dominance because it measures only the overlap relative to the total predicted and ground-truth foreground pixels. The smooth term (1e-6) guards against division by zero.

BCE + Dice combines stable gradients (BCE) with overlap-aware optimization (Dice), a common default for medical image segmentation.
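
The combined objective can be written directly from the formula above. A sketch (the smooth value follows the text; the soft Dice here sums over the whole batch):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, smooth=1e-6):
    # Soft Dice on sigmoid probabilities: 1 - (2*intersection + s) / (|P| + |G| + s)
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum()
    return 1 - (2 * inter + smooth) / (probs.sum() + targets.sum() + smooth)

def bce_dice_loss(logits, targets):
    # L = BCE(logits, targets) + DiceLoss(logits, targets)
    return F.binary_cross_entropy_with_logits(logits, targets) + dice_loss(logits, targets)

targets = torch.randint(0, 2, (4, 1, 256, 256)).float()
loss = bce_dice_loss(torch.randn(4, 1, 256, 256), targets)
```

Using `binary_cross_entropy_with_logits` (rather than sigmoid + BCE) keeps the BCE term numerically stable.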

Training

  • Dataset: Kvasir-SEG — 800 train / 100 validation (from Angelou0516/kvasir-seg)
  • Epochs: 50
  • Optimizer: AdamW (via HuggingFace Trainer)
  • Learning rate: 1e-4 (linear decay)
  • Batch size: 4
  • Device: Apple MPS (M4)
  • Final training loss: 0.075
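
The hyperparameters above map onto `TrainingArguments` roughly as follows. A sketch only — `output_dir` is a placeholder, and the custom BCE + Dice loss would be wired in via a `Trainer` subclass in the actual training code:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="unet-kvasir-seg",      # placeholder path
    num_train_epochs=50,
    learning_rate=1e-4,
    lr_scheduler_type="linear",        # linear decay, per the list above
    per_device_train_batch_size=4,
    optim="adamw_torch",               # AdamW
)
```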

Data augmentation (train only, synchronized image+mask):

  • Random horizontal flip (p=0.5)
  • Random vertical flip (p=0.5)
  • Random rotation ±30°
  • Color jitter — brightness=0.3, contrast=0.3 (image only)

Usage

import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file
from torchvision.transforms import Compose, Resize, ToTensor
from model import UNet

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model (strict loading so any weight/architecture mismatch raises)
model = UNet().to(device)
weights_path = hf_hub_download(repo_id="beatrizfarias/unet-kvasir-seg", filename="model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Preprocess
transform = Compose([Resize((256, 256)), ToTensor()])
pil_image = Image.open("colonoscopy.jpg")  # replace with your input image
image = transform(pil_image.convert("RGB")).unsqueeze(0).to(device)

# Predict
with torch.no_grad():
    logits = model(image)
    mask = (torch.sigmoid(logits) > 0.5).float()  # (1, 1, 256, 256) binary mask

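Since the network predicts at 256×256, the mask usually needs to be resized back to the source resolution before overlay or measurement. A sketch, assuming `original_size` holds the input image's (H, W):

```python
import torch
import torch.nn.functional as F

def upscale_mask(mask, original_size):
    # Nearest-neighbor interpolation keeps the mask strictly binary
    return F.interpolate(mask, size=original_size, mode="nearest")

mask = (torch.rand(1, 1, 256, 256) > 0.5).float()
full = upscale_mask(mask, (720, 576))  # e.g. back to a 720x576 frame
```
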
Results

Training loss (BCE + Dice) over 50 epochs:

Epoch  Loss
1      ~1.10
10     ~0.90
20     ~0.70
30     ~0.51
40     ~0.19
50     0.075

The model performs well on medium-to-large polyps with clear boundaries. Small or subtle lesions are harder — expected behavior for a from-scratch U-Net without attention mechanisms.

References

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
  • Jha, D., et al. (2020). Kvasir-SEG: A Segmented Polyp Dataset. MMM.