# Falcon Perception → ONNX (WebGPU)
ONNX export of tiiuae/falcon-perception (0.6B parameters) for in-browser inference via WebGPU.
The encoder and decoder weights are INT4 quantized (MatMulNBits, block_size=128) for efficient browser delivery.
## Model Files
| File | Description | Size |
|---|---|---|
| `encoder.onnx` | 28-layer transformer backbone (INT4 quantized) | 357 MB |
| `decoder_step.onnx` | Single autoregressive decode step with KV cache (INT4 quantized) | 261 MB |
| `embed_tokens.onnx` | Token embedding lookup | 256 MB |
| `segm_head.onnx` | Segmentation mask projection | 9 MB |
| `coord_decoder.onnx` | Coordinate prediction head | 96 MB |
| `size_decoder.onnx` | Size prediction head | 96 MB |
| `coord_encoder.onnx` | Coordinate Fourier encoding | 2 MB |
| `size_encoder.onnx` | Size Fourier encoding | 2 MB |
| `img_projector.onnx` | Image patch projection | 3 MB |
**Total download: ~1.1 GB** (vs. ~2.5 GB for the fp32 original).
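The `coord_encoder.onnx` and `size_encoder.onnx` graphs apply Fourier feature encodings to continuous values before they enter the transformer. A minimal sketch of that idea, with an assumed sin/cos pair per frequency on a geometric frequency ladder (the actual frequency count and base used by the model are not specified here):

```javascript
// Illustrative Fourier feature encoding for a normalized 2D coordinate.
// numFreqs and the frequency base are assumptions, not the model's values.
function fourierEncode(xy, numFreqs = 4) {
  const out = [];
  for (const v of xy) {
    for (let k = 0; k < numFreqs; k++) {
      const freq = Math.PI * 2 ** k; // geometric frequency ladder
      out.push(Math.sin(freq * v), Math.cos(freq * v));
    }
  }
  return out; // length = 2 coords * numFreqs * 2 (sin, cos)
}
```

Encoding coordinates this way lets a small MLP-style head distinguish nearby positions that would otherwise be nearly identical scalars.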
## Architecture
Falcon Perception is an early-fusion vision-language model that performs open-vocabulary object detection and segmentation. It processes image patches and text tokens in a unified transformer with hybrid attention masking (bidirectional for images, causal for text).
The model outputs a structured chain-of-perception sequence per detected object:
`<coord>` → `<size>` → `<seg>`
## Inference Pipeline
1. Tokenize text query → token IDs
2. Process image → pixel patches
3. `embed_tokens(token_ids)` → token embeddings
4. `img_projector(pixel_patches)` → image features
5. Scatter image features into token sequence
6. `encoder(embeddings, freqs, mask)` → logits + hidden states
7. Autoregressive decode loop:
   a. Sample next token from logits
   b. `decoder_step(token, kv_cache, ...)` → next logits
   c. If `<coord>` token: `coord_decoder(hidden)` → xy coordinates
   d. If `<size>` token: `size_decoder(hidden)` → hw sizes
   e. If `<seg>` token: `segm_head(hidden, hr_features)` → binary mask
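The decode loop (step 7) can be sketched as plain control flow. In this sketch `step`, the special-token IDs, and the `heads` callbacks are stand-ins for the real ONNX sessions (`decoder_step.onnx`, `coord_decoder.onnx`, etc.); only the loop structure mirrors the pipeline above, and treating `<seg>` as the token that closes one detected object is an assumption based on the chain-of-perception ordering:

```javascript
// Control-flow sketch of the autoregressive decode loop.
// step() stands in for sampling + decoder_step.onnx; heads.* stand in for
// the coord/size/seg ONNX heads. Real tensor plumbing is omitted.
function decodeLoop(step, heads, ids, maxSteps = 64) {
  const objects = [];
  let current = {};
  for (let i = 0; i < maxSteps; i++) {
    const { token, hidden } = step();
    if (token === ids.EOS) break;
    if (token === ids.COORD) current.xy = heads.coord(hidden);
    else if (token === ids.SIZE) current.hw = heads.size(hidden);
    else if (token === ids.SEG) {
      current.mask = heads.seg(hidden);
      objects.push(current); // <seg> assumed to close one detected object
      current = {};
    }
  }
  return objects;
}
```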
## Conversion Details
- Source model: tiiuae/falcon-perception (Apache 2.0)
- Quantization: INT4 weight-only (MatMulNBits, asymmetric, block_size=128)
- ONNX opset: 18
- Modifications for ONNX compatibility:
  - FlexAttention → `F.scaled_dot_product_attention` with dense bool mask
  - Triton `squared_relu_gate` kernel → pure PyTorch: `relu(gate).pow(2) * up`
  - Complex-valued RoPE → real cos/sin rotation
  - `masked_scatter`/`masked_select` → `torch.where` + index gather
  - AnyUp FlexCrossAttention → SDPA with precomputed window mask
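Two of the rewrites above are simple enough to illustrate elementwise (batching and tensor shapes omitted; the adjacent-pair RoPE convention is an assumption, as rotated-half layouts also exist):

```javascript
// squared_relu_gate: relu(gate)^2 * up, elementwise.
function squaredReluGate(gate, up) {
  return gate.map((g, i) => Math.max(g, 0) ** 2 * up[i]);
}

// RoPE as a real rotation: rotate each (even, odd) pair of features by the
// cached angle, instead of multiplying complex numbers.
function ropeRotate(x, cos, sin) {
  const out = new Array(x.length);
  for (let i = 0; i < x.length; i += 2) {
    const [a, b] = [x[i], x[i + 1]];
    const [c, s] = [cos[i / 2], sin[i / 2]];
    out[i] = a * c - b * s;
    out[i + 1] = a * s + b * c;
  }
  return out;
}
```

Both forms trace the same math as the fused Triton and complex-valued originals, which is why the swap preserves model outputs while staying within ONNX's operator set.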
## Usage with ONNX Runtime Web (WebGPU)
```javascript
import { InferenceSession } from 'onnxruntime-web/webgpu';

const encoder = await InferenceSession.create('./encoder.onnx', {
  executionProviders: ['webgpu'],
});
```
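WebGPU is not available in every browser, so a runtime check before session creation is worthwhile; onnxruntime-web also ships a wasm backend that can serve as a fallback. A small helper, assuming the standard `navigator.gpu` feature check:

```javascript
// Pick the execution provider at runtime: WebGPU when the browser exposes
// navigator.gpu, otherwise fall back to onnxruntime-web's wasm backend.
function pickExecutionProviders() {
  const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
  return hasWebGPU ? ['webgpu'] : ['wasm'];
}
```

Pass the result as `executionProviders` when creating each session; note that INT4 inference on the wasm fallback will be substantially slower than on WebGPU.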
## Citation
```bibtex
@article{falcon-perception,
  title={Falcon Perception},
  author={TII},
  year={2025},
  url={https://huggingface.co/tiiuae/falcon-perception}
}
```