Falcon Perception — ONNX (WebGPU)

ONNX export of tiiuae/falcon-perception (0.6B parameters) for in-browser inference via WebGPU.

The encoder and decoder weights are INT4 quantized (MatMulNBits, block_size=128) for efficient browser delivery.

Model Files

| File | Description | Size |
|---|---|---|
| encoder.onnx | 28-layer transformer backbone (INT4 quantized) | 357 MB |
| decoder_step.onnx | Single autoregressive decode step with KV cache (INT4 quantized) | 261 MB |
| embed_tokens.onnx | Token embedding lookup | 256 MB |
| segm_head.onnx | Segmentation mask projection | 9 MB |
| coord_decoder.onnx | Coordinate prediction head | 96 MB |
| size_decoder.onnx | Size prediction head | 96 MB |
| coord_encoder.onnx | Coordinate Fourier encoding | 2 MB |
| size_encoder.onnx | Size Fourier encoding | 2 MB |
| img_projector.onnx | Image patch projection | 3 MB |

Total download: ~1.1 GB (vs 2.5 GB fp32 original)
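The coord_encoder.onnx and size_encoder.onnx files listed above apply a Fourier feature encoding to scalar inputs. A minimal sketch of that kind of encoding, assuming log-spaced frequency bands (the band count and spacing used by the released model are not documented here):

```javascript
// Sketch of a Fourier feature encoding, in the style of coord_encoder.onnx /
// size_encoder.onnx. The number of bands (4) and the 2^k * pi frequency
// schedule are illustrative assumptions, not the model's actual values.
function fourierEncode(values, numBands = 4) {
  const features = [];
  for (const v of values) {
    for (let k = 0; k < numBands; k++) {
      const freq = Math.pow(2, k) * Math.PI; // assumed log-spaced frequency
      features.push(Math.sin(freq * v), Math.cos(freq * v));
    }
  }
  return features;
}

// A normalized (x, y) coordinate yields 2 values * 4 bands * (sin, cos) = 16 features.
const encoded = fourierEncode([0.25, 0.5]);
console.log(encoded.length); // 16
```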

Architecture

Falcon Perception is an early-fusion vision-language model that performs open-vocabulary object detection and segmentation. It processes image patches and text tokens in a unified transformer with hybrid attention masking (bidirectional for images, causal for text).

The model outputs a structured chain-of-perception sequence per detected object:

<coord> → <size> → <seg>
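On the consumer side, a decoded stream in this ordering can be grouped into per-object records. A hypothetical sketch, assuming each `<coord>` token starts a new object (the token representation here is invented for illustration):

```javascript
// Hypothetical sketch: group a chain-of-perception stream into per-object
// records, following the <coord> -> <size> -> <seg> ordering described above.
// The {type, value} token shape is an assumption for this example.
function groupDetections(stream) {
  const objects = [];
  let current = null;
  for (const tok of stream) {
    if (tok.type === 'coord') {
      // A coord token opens a new detection record.
      current = { coord: tok.value, size: null, mask: null };
      objects.push(current);
    } else if (tok.type === 'size' && current) {
      current.size = tok.value;
    } else if (tok.type === 'seg' && current) {
      current.mask = tok.value;
    }
  }
  return objects;
}
```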

Inference Pipeline

1. Tokenize text query → token IDs
2. Process image → pixel patches
3. embed_tokens(token_ids) → token embeddings
4. img_projector(pixel_patches) → image features
5. Scatter image features into token sequence
6. encoder(embeddings, freqs, mask) → logits + hidden states
7. Autoregressive decode loop:
   a. Sample next token from logits
   b. decoder_step(token, kv_cache, ...) → next logits
   c. If <coord> token: coord_decoder(hidden) → xy coordinates
   d. If <size> token: size_decoder(hidden) → hw sizes
   e. If <seg> token: segm_head(hidden, hr_features) → binary mask
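Step 5 above can be sketched in isolation: placeholder image-token slots in the embedded sequence are overwritten with the projected image features. The flattened `[seq, dim]` layout and the placeholder id are illustrative assumptions:

```javascript
// Sketch of step 5: scatter projected image features into the token
// embedding sequence. The placeholder id and flat [seq * dim] layout are
// assumptions for illustration, not the model's actual convention.
const IMAGE_TOKEN_ID = -1; // hypothetical placeholder id

function scatterImageFeatures(tokenIds, embeddings, imageFeatures, dim) {
  let imgIdx = 0;
  for (let i = 0; i < tokenIds.length; i++) {
    if (tokenIds[i] === IMAGE_TOKEN_ID) {
      // Copy one image feature vector over the placeholder slot.
      for (let d = 0; d < dim; d++) {
        embeddings[i * dim + d] = imageFeatures[imgIdx * dim + d];
      }
      imgIdx++;
    }
  }
  return embeddings;
}

const emb = scatterImageFeatures([5, IMAGE_TOKEN_ID, 7], new Float32Array(6), [1, 2], 2);
console.log(Array.from(emb)); // [0, 0, 1, 2, 0, 0]
```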

Conversion Details

  • Source model: tiiuae/falcon-perception (Apache 2.0)
  • Quantization: INT4 weight-only (MatMulNBits, asymmetric, block_size=128)
  • ONNX opset: 18
  • Modifications for ONNX compatibility:
    • FlexAttention → F.scaled_dot_product_attention with dense bool mask
    • Triton squared_relu_gate kernel → pure PyTorch: relu(gate).pow(2) * up
    • Complex-valued RoPE → real cos/sin rotation
    • masked_scatter/masked_select → torch.where + index gather
    • AnyUp FlexCrossAttention → SDPA with precomputed window mask
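The "complex-valued RoPE → real cos/sin rotation" rewrite amounts to rotating each (even, odd) channel pair by a position-dependent angle, which is what complex multiplication does in the original formulation. A minimal sketch, assuming the common base of 10000 (the model's actual base is not stated here):

```javascript
// Sketch of real-valued RoPE: rotate each (x[2i], x[2i+1]) pair by a
// position-dependent angle theta, equivalent to complex multiplication by
// e^{i*theta}. The base 10000 is the common default and an assumption here.
function ropeRotate(x, pos, base = 10000) {
  const out = new Float32Array(x.length);
  const halfDim = x.length / 2;
  for (let i = 0; i < halfDim; i++) {
    const theta = pos / Math.pow(base, (2 * i) / x.length);
    const cos = Math.cos(theta);
    const sin = Math.sin(theta);
    // 2x2 rotation of the channel pair, as in complex multiplication.
    out[2 * i] = x[2 * i] * cos - x[2 * i + 1] * sin;
    out[2 * i + 1] = x[2 * i] * sin + x[2 * i + 1] * cos;
  }
  return out;
}
```

At position 0 the rotation is the identity, and at any position it preserves the norm of each pair, which makes the real and complex formulations easy to cross-check.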

Usage with ONNX Runtime Web (WebGPU)

```js
import { InferenceSession } from 'onnxruntime-web/webgpu';

const encoder = await InferenceSession.create('./encoder.onnx', {
  executionProviders: ['webgpu'],
});
```

Citation

```bibtex
@article{falcon-perception,
  title={Falcon Perception},
  author={TII},
  year={2025},
  url={https://huggingface.co/tiiuae/falcon-perception}
}
```