aufklarer (Ivan)

posted an update about 24 hours ago

Post

116

Voice cloning models measured across five languages: OmniVoice, Chatterbox, VoxCPM2, Fish Audio

I published a new Soniqo benchmark post for local voice cloning models across five languages:

https://www.soniqo.audio/blog/voice-cloning-benchmarks

Models:

- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16

Languages:

- English
- German
- Modern Standard Arabic
- Spanish
- Mandarin Chinese

The benchmark uses Google FLEURS test clips as dataset references. Each row includes the reference audio, generated audio, speaker similarity, WER/CER, generated audio length, and RTF.

Main result in this run: OmniVoice was the strongest all-around row set, with 0.707 mean speaker cosine across all five languages, 0.0% ASR error, and mean RTF 0.45. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German/Arabic similarity but slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.

This is an engineering benchmark, not a human MOS study. The speaker-similarity values should be compared within this table because every row uses the same local speaker-embedding pipeline.

Try the stack locally with Speech Studio:

https://www.soniqo.audio/speech-studio
https://github.com/soniqo/speech-studio

Underlying Swift library/CLI:

https://github.com/soniqo/speech-swift

Soniqo models and exports:

soniqo @aufklarer

What model or language should I add next?

reacted to their post with 👍 2 months ago

Post

1722

After running extensive benchmarks across ASR, TTS, and VAD on Apple Silicon, we found some results that weren't documented anywhere.

The most counterintuitive: INT8 runs 3.3x faster than INT4 on the Neural Engine. A 332 MB CoreML model allocates 1,677 MB at runtime. And the right architecture uses both MLX and CoreML simultaneously — not one or the other.

MLX talks to the GPU — programmable, fast for large transformer inference. CoreML talks to the Neural Engine — fixed-function silicon, 135x real-time for small feedforward models like VAD, near-zero power draw.

All benchmarks are from speech-swift, our open-source Swift library for on-device speech AI: ASR, TTS, VAD, diarization, speech-to-speech — everything running locally on Apple Silicon with no API, no cloud, no data leaving the device.

Models on HF: aufklarer/Qwen3-ASR-0.6B-MLX-4bit · aufklarer/parakeet-tdt-0.6b-coreml-int8 · aufklarer/PersonaPlex-7B-MLX-4bit

Full article: https://blog.ivan.digital
Library: https://github.com/soniqo/speech-swift

posted an update 2 months ago

Post

1722

After running extensive benchmarks across ASR, TTS, and VAD on Apple Silicon, we found some results that weren't documented anywhere.

The most counterintuitive: INT8 runs 3.3x faster than INT4 on the Neural Engine. A 332 MB CoreML model allocates 1,677 MB at runtime. And the right architecture uses both MLX and CoreML simultaneously — not one or the other.

MLX talks to the GPU — programmable, fast for large transformer inference. CoreML talks to the Neural Engine — fixed-function silicon, 135x real-time for small feedforward models like VAD, near-zero power draw.

All benchmarks are from speech-swift, our open-source Swift library for on-device speech AI: ASR, TTS, VAD, diarization, speech-to-speech — everything running locally on Apple Silicon with no API, no cloud, no data leaving the device.

Models on HF: aufklarer/Qwen3-ASR-0.6B-MLX-4bit · aufklarer/parakeet-tdt-0.6b-coreml-int8 · aufklarer/PersonaPlex-7B-MLX-4bit

Full article: https://blog.ivan.digital
Library: https://github.com/soniqo/speech-swift

posted an update 3 months ago

Post

175

AI Sentiment Analysis of 225,000 Central Bank Sentences Across 26 Banks

Central banks shape global markets through their communications — but tracking policy language across 26 institutions is impossible manually. I built an NLP pipeline that classifies every sentence in central bank statements and minutes into hawkish/dovish sentiment.

Why Sentence-Level?

Most analysis works at the document level: "the Fed was hawkish." But a single statement might have hawkish inflation language, dovish labor guidance, and a dovish dissent — all in one document. Sentence-level classification captures these tensions.

The Hard Part

Central bank language is domain-specific enough that generic sentiment fails badly:

- "Future monetary policy decisions will be conditional on the inflation outlook" — sounds like guidance, but it's boilerplate. Correct: neutral.
- "The member voted against the rate increase" — naive model says dissent_hawkish. But the dissenter wanted lower rates: dissent_dovish.
- "Average interest rate on ruble loans rose to 8.5%" — looks like a rate hike, but it's a market description, not policy: neutral.

The pipeline uses LLM classification with bank-specific prompts and dual-temperature self-validation.

Current Divergences

- BOJ at 0.75% — cautiously hawkish, normalizing after decades at zero https://monetary.live/boj.html
- SNB at 0.00% — neutral, back to the floor https://monetary.live/snb.html
- TCMB at 37% — hawkish, emergency tightening https://monetary.live/tcmb.html
- PBOC at 3.00% — dovish, supporting growth https://monetary.live/pboc.html
- Fed at 3.75% — mixed, cutting but cautious language https://monetary.live/fed.html

Explore

- Dashboard — all 26 banks at a glance: https://monetary.live
- Statements — recent publications with AI summaries: https://monetary.live/statements.html
- Full: https://dev.to/aufklarer/building-an-nlp-pipeline-to-classify-225000-central-bank-sentences-gaf

posted an update 3 months ago

Post

2215

We benchmarked https://github.com/soniqo/speech-swift, our open-source Swift library for on-device speech AI, against Whisper Large v3 (FP16) on LibriSpeech test-clean.

Three models beat it. Two architectural approaches:

Qwen3-ASR (LALM — Qwen3 LLM as ASR decoder, AuT encoder pretrained on ~40M hours) hits 2.35% WER at 1.7B 8-bit, running at 43x real-time on MLX. Greedy decoding matches beam search — the LLM decoder is strong enough that the greedy path is nearly always optimal.

Parakeet TDT (non-autoregressive transducer — FastConformer + TDT joint network) hits 2.74% WER in 634 MB as a CoreML INT8 model on the Neural Engine. No generative hallucination by design. Leaves GPU completely free.

Two findings worth flagging:
- 4-bit quantization is catastrophic for non-English: Korean 6.89% → 19.95% WER on FLEURS. Use 8-bit for multilingual.
- On CoreML, INT8 is 3.3x *faster* than INT4 — opposite of GPU behavior. Native ANE INT8 MACs vs INT4 lookup table indirection.

All numbers reproducible in 15 minutes.

Full article: https://blog.ivan.digital/we-beat-whisper-large-v3-with-a-600m-model-running-entirely-on-your-mac-20e6ce191174

Library: https://github.com/soniqo/speech-swift

Models: Qwen/Qwen3-ASR-0.6B, Qwen/Qwen3-ASR-1.7B, nvidia/parakeet-tdt-0.6b-v2

posted an update 4 months ago

Post

540

Speaker Diarization and VAD on Apple Silicon — MLX-Native Models

Three MLX-optimized models for on-device speaker diarization and voice activity detection, running natively on Apple Silicon via https://github.com/ivan-digital/qwen3-asr-swift:

- aufklarer/Silero-VAD-v5-MLX — Streaming VAD, 309K params, ~1.2 MB. Processes 32ms chunks at 23× real-time on M2 Max.
- aufklarer/Pyannote-Segmentation-MLX — Multi-speaker segmentation, ~1.49M params, ~5.7 MB. 7-class powerset output for up to 3 simultaneous speakers.
- aufklarer/WeSpeaker-ResNet34-LM-MLX — Speaker embedding, ~6.6M params, ~25 MB. 256-dim L2-normalized vectors with BatchNorm fused into Conv2d.

Together they form a diarization pipeline: pyannote segments → WeSpeaker embeds → agglomerative clustering links speakers across the recording. ~32 MB total.

git clone https://github.com/ivan-digital/qwen3-asr-swift
cd qwen3-asr-swift && swift build -c release

.build/release/audio diarize meeting.wav --max-speakers 4 --json
.build/release/audio vad-stream recording.wav

The library also includes ASR, TTS, multilingual synthesis, forced alignment, and speech-to-speech (PersonaPlex 7B). Apache 2.0.

Full architecture details: https://blog.ivan.digital/speaker-diarization-and-voice-activity-detection-on-apple-silicon-native-swift-with-mlx

Library: https://github.com/ivan-digital/qwen3-asr-swift

posted an update 4 months ago

Post

2528

PersonaPlex-7B on Apple Silicon (Swift + MLX Swift)

NVIDIA PersonaPlex is a full-duplex speech-to-speech model — it can listen while it speaks, which enables more natural conversational behaviors like interruptions, overlaps, and quick backchannels.

We put together a native Swift implementation using MLX Swift so it can run locally on Apple Silicon, along with a 4-bit MLX conversion and a small CLI/demo to make it easy to try out.

If you’re interested in on-device voice agents (or just want to see what full-duplex S2S looks like in a real Swift codebase), the details and setup notes are here:

Blog post: https://blog.ivan.digital/nvidia-personaplex-7b-on-apple-silicon-full-duplex-speech-to-speech-in-native-swift-with-mlx-0aa5276f2e23

Repo: https://github.com/ivan-digital/qwen3-asr-swift

reacted to their post with 🔥 5 months ago

Post

3465

Context Engineering for Code Agents: Why They Fail and How to Fix Them

Code agents don't fail because they can't code — they fail because their context turns into a junk drawer.

I wrote a practical survey covering the emerging discipline of context engineering for agentic hybrid applications: the techniques, papers, and architectural patterns that keep long-running code agents on track as their token windows fill up with tool logs, stale diffs, and repeated file dumps.
What's covered:

Why long context windows alone don't save you (position bias, distractor sensitivity)
Observation masking vs. LLM summarization — and when simple beats clever
Tool-output compression with approaches like LLMLingua-2
Trajectory reduction: pruning dead branches from agent history
Memory hierarchies: session → working set → notes → cross-session
How MCP and standardized tool interfaces reduce context debt
Dynamic context policies trained with RL (DeepMiner, MEM1)
Meta-agent CI loops for measuring regressions across agent configs

The core argument: the engineering challenge isn't "make the model smarter" — it's make the agent's context and verification smarter. That's where the real leverage is in 2026.

👉 Read the full post: https://blog.ivan.digital/context-engineering-for-agentic-hybrid-applications-why-code-agents-fail-and-how-to-fix-them-076cab699262

2 replies

·

posted an update 5 months ago

Post

3465

Context Engineering for Code Agents: Why They Fail and How to Fix Them

Code agents don't fail because they can't code — they fail because their context turns into a junk drawer.

I wrote a practical survey covering the emerging discipline of context engineering for agentic hybrid applications: the techniques, papers, and architectural patterns that keep long-running code agents on track as their token windows fill up with tool logs, stale diffs, and repeated file dumps.
What's covered:

Why long context windows alone don't save you (position bias, distractor sensitivity)
Observation masking vs. LLM summarization — and when simple beats clever
Tool-output compression with approaches like LLMLingua-2
Trajectory reduction: pruning dead branches from agent history
Memory hierarchies: session → working set → notes → cross-session
How MCP and standardized tool interfaces reduce context debt
Dynamic context policies trained with RL (DeepMiner, MEM1)
Meta-agent CI loops for measuring regressions across agent configs

The core argument: the engineering challenge isn't "make the model smarter" — it's make the agent's context and verification smarter. That's where the real leverage is in 2026.

👉 Read the full post: https://blog.ivan.digital/context-engineering-for-agentic-hybrid-applications-why-code-agents-fail-and-how-to-fix-them-076cab699262

2 replies

·

posted an update 5 months ago

Post

844

Qwen3-ASR Swift: On-Device Speech Recognition for Apple Silicon

I'm excited to release https://github.com/ivan-digital/qwen3-asr-swift, an open-source Swift implementation of Alibaba's
Qwen3-ASR, optimized for Apple Silicon using MLX.

Why Qwen3-ASR? Exceptional noise robustness — 3.5x better than Whisper in noisy conditions (17.9% vs 63% CER).

Features:
- 52 languages (30 major + 22 Chinese dialects)
- ~600MB model (4-bit quantized)
- ~100ms latency on M-series chips
- Fully local, no cloud API

Also more inference and model architecture in blog post https://blog.ivan.digital/qwen3-asr-swift-on-device-asr-tts-for-apple-silicon-architecture-and-benchmarks-27cbf1e4463f

replied to their post 7 months ago

I hear you — working code and not burning through your quota matter way more than any internal architecture diagram.

I’m looking at the architecture mainly to understand why some tools feel more reliable and easier to debug than others.

What’s been your experience so far — any setup that actually feels close to “it just works”?

reacted to their post with 🔥 7 months ago

Post

629

I did deep dive comparison of Claude Code vs OpenAI Codex code agents architectures, interesting what is your personal experience on this?

Both Claude Code and OpenAI Codex are built on the same backbone: a single-agent event loop that repeatedly thinks, calls tools, inspects the result, and repeats until it’s done. No swarms, no hidden graph orchestration — just one reflective agent iterating through a ReAct-style cycle.

https://blog.ivan.digital/claude-code-vs-openai-codex-agentic-planner-vs-shell-first-surgeon-d6ce988526e8

3 replies

·

reacted to their post with 🔥 7 months ago

Post

1459

Couple months ago I fine‑tuned Qwen3 Embeddings with LoRA on the LSPC dataset. This time I went the opposite way: a small, task‑specific 80M encoder with bidirectional attention, trained end‑to‑end. It outperforms the Qwen3 LoRA baseline on the same data (0.9315 macro‑F1 vs 0.8360). Details and code: https://blog.ivan.digital/beating-qwen3-lora-with-a-tiny-pytorch-encoder-on-the-large-scale-product-corpus-afe536de205f

posted an update 7 months ago

Post

629

I did deep dive comparison of Claude Code vs OpenAI Codex code agents architectures, interesting what is your personal experience on this?

Both Claude Code and OpenAI Codex are built on the same backbone: a single-agent event loop that repeatedly thinks, calls tools, inspects the result, and repeats until it’s done. No swarms, no hidden graph orchestration — just one reflective agent iterating through a ReAct-style cycle.

https://blog.ivan.digital/claude-code-vs-openai-codex-agentic-planner-vs-shell-first-surgeon-d6ce988526e8

3 replies

·

posted an update 7 months ago

Post

1459

Couple months ago I fine‑tuned Qwen3 Embeddings with LoRA on the LSPC dataset. This time I went the opposite way: a small, task‑specific 80M encoder with bidirectional attention, trained end‑to‑end. It outperforms the Qwen3 LoRA baseline on the same data (0.9315 macro‑F1 vs 0.8360). Details and code: https://blog.ivan.digital/beating-qwen3-lora-with-a-tiny-pytorch-encoder-on-the-large-scale-product-corpus-afe536de205f

reacted to their post with 🔥 8 months ago

Post

3284

Fine-Tuning Qwen3 Embeddings for product category classification on the Large-Scale Product Corpus

Language-models such as GPT, Llama, DeepSeek, Qwen trained with a filtered slice of Common Crawl. For e-commerce work, though, we can start with the Web Data Commons (WDC), the project by the University of Mannheim. It extracts web pages that carry some metadata and publishes the result as the Large-Scale Product Corpus (LSPC).

Search engines like Google reward pages that include detailed product markup, so merchants already populate their sites with SEO-friendly fields such as title, brand, GTIN, price — and, crucially, category labels. Thanks to these built-in annotations, the WDC Large-Scale Product Corpus arrives almost fully self-labelled. I used those labels to fine-tune Qwen3 Embedding with Low-Rank Adaptation (LoRA), code is available on github. The resulting 615 million-parameter checkpoint fits comfortably in limited GPU memory yet updates the model’s representation space, mapping raw product titles to six top-level categories with a macro-F1 of 0.836 (83.6 %).

More details: https://blog.ivan.digital/fine-tuning-qwen3-embeddings-for-product-category-classification-on-the-large-scale-product-corpus-3a0919506bc8

posted an update 8 months ago

Post

3284

Fine-Tuning Qwen3 Embeddings for product category classification on the Large-Scale Product Corpus

Language-models such as GPT, Llama, DeepSeek, Qwen trained with a filtered slice of Common Crawl. For e-commerce work, though, we can start with the Web Data Commons (WDC), the project by the University of Mannheim. It extracts web pages that carry some metadata and publishes the result as the Large-Scale Product Corpus (LSPC).

Search engines like Google reward pages that include detailed product markup, so merchants already populate their sites with SEO-friendly fields such as title, brand, GTIN, price — and, crucially, category labels. Thanks to these built-in annotations, the WDC Large-Scale Product Corpus arrives almost fully self-labelled. I used those labels to fine-tune Qwen3 Embedding with Low-Rank Adaptation (LoRA), code is available on github. The resulting 615 million-parameter checkpoint fits comfortably in limited GPU memory yet updates the model’s representation space, mapping raw product titles to six top-level categories with a macro-F1 of 0.836 (83.6 %).

More details: https://blog.ivan.digital/fine-tuning-qwen3-embeddings-for-product-category-classification-on-the-large-scale-product-corpus-3a0919506bc8

Ivan PRO

AI & ML interests

Recent Activity

Organizations

Ivan PRO

AI & ML interests

Recent Activity

Organizations

aufklarer's activity