Falcon Perception — ONNX (WebGPU)

ONNX export of tiiuae/falcon-perception (0.6B parameters) for in-browser inference via WebGPU.

The encoder and decoder weights are INT4 quantized (MatMulNBits, block_size=128) for efficient browser delivery.

Model Files

| File | Description | Size |
|---|---|---|
| encoder.onnx | 28-layer transformer backbone (INT4 quantized) | 357 MB |
| decoder_step.onnx | Single autoregressive decode step with KV cache (INT4 quantized) | 261 MB |
| embed_tokens.onnx | Token embedding lookup | 256 MB |
| segm_head.onnx | Segmentation mask projection | 9 MB |
| coord_decoder.onnx | Coordinate prediction head | 96 MB |
| size_decoder.onnx | Size prediction head | 96 MB |
| coord_encoder.onnx | Coordinate Fourier encoding | 2 MB |
| size_encoder.onnx | Size Fourier encoding | 2 MB |
| img_projector.onnx | Image patch projection | 3 MB |

Total download: ~1.1 GB (vs 2.5 GB fp32 original)
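The coord_encoder.onnx and size_encoder.onnx files listed above apply a Fourier feature encoding to scalar inputs. A minimal sketch of that kind of encoding, assuming log-spaced frequency bands (the band count and spacing used by the released model are not documented here):

```javascript
// Sketch of a Fourier feature encoding, in the style of coord_encoder.onnx /
// size_encoder.onnx. The number of bands (4) and the 2^k * pi frequency
// schedule are illustrative assumptions, not the model's actual values.
function fourierEncode(values, numBands = 4) {
  const features = [];
  for (const v of values) {
    for (let k = 0; k < numBands; k++) {
      const freq = Math.pow(2, k) * Math.PI; // assumed log-spaced frequency
      features.push(Math.sin(freq * v), Math.cos(freq * v));
    }
  }
  return features;
}

// A normalized (x, y) coordinate yields 2 values * 4 bands * (sin, cos) = 16 features.
const encoded = fourierEncode([0.25, 0.5]);
console.log(encoded.length); // 16
```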

Architecture

Falcon Perception is an early-fusion vision-language model that performs open-vocabulary object detection and segmentation. It processes image patches and text tokens in a unified transformer with hybrid attention masking (bidirectional for images, causal for text).

The model outputs a structured chain-of-perception sequence per detected object:

<coord> → <size> → <seg>
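On the consumer side, a decoded stream in this ordering can be grouped into per-object records. A hypothetical sketch, assuming each `<coord>` token starts a new object (the token representation here is invented for illustration):

```javascript
// Hypothetical sketch: group a chain-of-perception stream into per-object
// records, following the <coord> -> <size> -> <seg> ordering described above.
// The {type, value} token shape is an assumption for this example.
function groupDetections(stream) {
  const objects = [];
  let current = null;
  for (const tok of stream) {
    if (tok.type === 'coord') {
      // A coord token opens a new detection record.
      current = { coord: tok.value, size: null, mask: null };
      objects.push(current);
    } else if (tok.type === 'size' && current) {
      current.size = tok.value;
    } else if (tok.type === 'seg' && current) {
      current.mask = tok.value;
    }
  }
  return objects;
}
```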

Inference Pipeline

1. Tokenize text query → token IDs
2. Process image → pixel patches
3. embed_tokens(token_ids) → token embeddings
4. img_projector(pixel_patches) → image features
5. Scatter image features into token sequence
6. encoder(embeddings, freqs, mask) → logits + hidden states
7. Autoregressive decode loop:
   a. Sample next token from logits
   b. decoder_step(token, kv_cache, ...) → next logits
   c. If <coord> token: coord_decoder(hidden) → xy coordinates
   d. If <size> token: size_decoder(hidden) → hw sizes
   e. If <seg> token: segm_head(hidden, hr_features) → binary mask
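Step 5 above can be sketched in isolation: placeholder image-token slots in the embedded sequence are overwritten with the projected image features. The flattened `[seq, dim]` layout and the placeholder id are illustrative assumptions:

```javascript
// Sketch of step 5: scatter projected image features into the token
// embedding sequence. The placeholder id and flat [seq * dim] layout are
// assumptions for illustration, not the model's actual convention.
const IMAGE_TOKEN_ID = -1; // hypothetical placeholder id

function scatterImageFeatures(tokenIds, embeddings, imageFeatures, dim) {
  let imgIdx = 0;
  for (let i = 0; i < tokenIds.length; i++) {
    if (tokenIds[i] === IMAGE_TOKEN_ID) {
      // Copy one image feature vector over the placeholder slot.
      for (let d = 0; d < dim; d++) {
        embeddings[i * dim + d] = imageFeatures[imgIdx * dim + d];
      }
      imgIdx++;
    }
  }
  return embeddings;
}

const emb = scatterImageFeatures([5, IMAGE_TOKEN_ID, 7], new Float32Array(6), [1, 2], 2);
console.log(Array.from(emb)); // [0, 0, 1, 2, 0, 0]
```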

Conversion Details

  • Source model: tiiuae/falcon-perception (Apache 2.0)
  • Quantization: INT4 weight-only (MatMulNBits, asymmetric, block_size=128)
  • ONNX opset: 18
  • Modifications for ONNX compatibility:
    • FlexAttention → F.scaled_dot_product_attention with dense bool mask
    • Triton squared_relu_gate kernel → pure PyTorch: relu(gate).pow(2) * up
    • Complex-valued RoPE → real cos/sin rotation
    • masked_scatter/masked_select → torch.where + index gather
    • AnyUp FlexCrossAttention → SDPA with precomputed window mask
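The "complex-valued RoPE → real cos/sin rotation" rewrite amounts to rotating each (even, odd) channel pair by a position-dependent angle, which is what complex multiplication does in the original formulation. A minimal sketch, assuming the common base of 10000 (the model's actual base is not stated here):

```javascript
// Sketch of real-valued RoPE: rotate each (x[2i], x[2i+1]) pair by a
// position-dependent angle theta, equivalent to complex multiplication by
// e^{i*theta}. The base 10000 is the common default and an assumption here.
function ropeRotate(x, pos, base = 10000) {
  const out = new Float32Array(x.length);
  const halfDim = x.length / 2;
  for (let i = 0; i < halfDim; i++) {
    const theta = pos / Math.pow(base, (2 * i) / x.length);
    const cos = Math.cos(theta);
    const sin = Math.sin(theta);
    // 2x2 rotation of the channel pair, as in complex multiplication.
    out[2 * i] = x[2 * i] * cos - x[2 * i + 1] * sin;
    out[2 * i + 1] = x[2 * i] * sin + x[2 * i + 1] * cos;
  }
  return out;
}
```

At position 0 the rotation is the identity, and at any position it preserves the norm of each pair, which makes the real and complex formulations easy to cross-check.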

Usage with ONNX Runtime Web (WebGPU)

```js
import { InferenceSession } from 'onnxruntime-web/webgpu';

const encoder = await InferenceSession.create('./encoder.onnx', {
  executionProviders: ['webgpu'],
});
```

Citation

```bibtex
@article{falcon-perception,
  title={Falcon Perception},
  author={TII},
  year={2025},
  url={https://huggingface.co/tiiuae/falcon-perception}
}
```