SeaWolf-AI (VIDRAFT

replied to their post 4 days ago

Agreed on the framing, and the push is fair.

25.7 (single-stream) and 601 (aggregate under concurrency) aren't the same axis — you're right. The 23.4× is a system-level number, and part of it is batching that any MoE gets for free. That isn't the VKAE claim, and we'll present it as what it is rather than fold it into the speedup.

What the paper leads with is the controlled comparison: identical concurrency, batch, and harness, VKAE on vs. off. Single-stream and the throughput ceiling stay as separate, labeled context.

Same discipline on quality — the tail, not the mean: long-context recall and the hard subset, measured against VKAE-off on identical inputs. If it doesn't hold there, it doesn't ship.

To your question directly: VKAE-on vs. VKAE-off at identical concurrency. That's the headline.
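A minimal sketch of the bookkeeping this reply describes, assuming each benchmark run is recorded with its concurrency and a VKAE on/off flag. The `Run` type and `speedup` helper are hypothetical names, not part of any VIDRAFT tooling; the only measured numbers used are the two figures quoted in the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark measurement (field names are illustrative assumptions)."""
    vkae_on: bool
    concurrency: int
    tokens_per_s: float


def speedup(baseline: Run, candidate: Run) -> float:
    """Throughput ratio, refusing comparisons that mix axes."""
    if baseline.concurrency != candidate.concurrency:
        raise ValueError("runs differ in concurrency; not a controlled comparison")
    if baseline.vkae_on or not candidate.vkae_on:
        raise ValueError("expected a VKAE-off baseline and a VKAE-on candidate")
    return candidate.tokens_per_s / baseline.tokens_per_s


# The figures from the post mix single-stream and aggregate throughput,
# so this ratio is a system-level number, not the controlled speedup.
system_level_ratio = 601 / 25.7
print(round(system_level_ratio, 1))  # 23.4
```

The guard in `speedup` is the point of the sketch: a ratio is only reported as the headline when both runs share the same concurrency and differ only in the VKAE flag.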

reacted to their post with ❤️🔥 4 days ago

Post

5110

🔓 We ran genuine quantum key recovery on real IBM quantum hardware and pushed the frontier well past the largest hardware demos we're aware of (which sat at N=4).

Using Simon's algorithm on ibm_kingston, we recovered the secret key of two symmetric-cipher structures:
• Even–Mansour — N=5 → N=10
• 3-round Feistel (DES-family) — block 6 → 8

Each was verified against an independent control key, using error mitigation only (no QEC).

🧭 Honest scope: this is not a quantum speedup (the effective difficulty tracks the classical birthday bound ~2^{n/2}), not a break of real AES/RSA, and not 16-round DES (ours is 3-round). The recovery method is reserved for a forthcoming paper; formal record status is pending peer review.
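For readers wondering why Simon's algorithm applies to Even–Mansour at all: in the textbook construction E(x) = P(x ⊕ k1) ⊕ k2 with a public permutation P, the function f(x) = E(x) ⊕ P(x) satisfies f(x) = f(x ⊕ k1), so k1 is a hidden period. The sketch below checks that structure classically, by brute force, on a toy instance. It is not the recovery method from the post (that is reserved for the paper), and every function name, key value, and the seeded random permutation are illustrative assumptions.

```python
import random


def make_even_mansour(n, k1, k2, seed=0):
    """Toy Even-Mansour cipher on n-bit blocks: E(x) = P(x ^ k1) ^ k2."""
    rng = random.Random(seed)
    perm = list(range(2 ** n))
    rng.shuffle(perm)  # stand-in for the public permutation P

    def encrypt(x):
        return perm[x ^ k1] ^ k2

    return perm, encrypt


def simon_function(perm, encrypt):
    """f(x) = E(x) ^ P(x); by construction f(x) == f(x ^ k1)."""
    return lambda x: encrypt(x) ^ perm[x]


def find_periods(f, n):
    """Brute-force every nonzero s with f(x) == f(x ^ s) for all x."""
    size = 2 ** n
    return [s for s in range(1, size)
            if all(f(x) == f(x ^ s) for x in range(size))]


perm, encrypt = make_even_mansour(n=5, k1=0b10110, k2=0b01101)
f = simon_function(perm, encrypt)
print(0b10110 in find_periods(f, 5))
```

The brute-force search costs on the order of 2^(2n) evaluations, which is why it only works as a check on toy sizes; Simon's algorithm instead recovers the period from a linear system over GF(2) built from quantum measurements.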

📄 Write-up: https://huggingface.co/blog/FINAL-Bench/quantum
🕹️ Try it live in your browser: https://vidraft-quantumos.hf.space/crypto
🏆 Leaderboard: FINAL-Bench/quantum-bench-leaderboard

#quantum #cryptography #quantumcomputing

reacted to their post with ❤️ 6 days ago

Post

2991

🚀 Adding a GPU without building one

AI is usually framed as "how smart is the model / how many GPUs did you buy." The real bottleneck is elsewhere — how efficiently you use the GPUs you already have.

Training happens once; inference runs the entire time users use your product. So a service's economics come down to cost per token. Inference acceleration uses software to pull several times more out of the same GPU — the effect of plugging in one more "virtual GPU."

VIDRAFT's VKAE, measured (B200, same-harness, no quality loss):

Qwen3.5-35B-A3B (MoE): 25.7 → 601 tok/s (23.4×)
Darwin-36B-Opus (in-house MoE): 25.0 → 280.8 (11.2×)
10,000+ tok/s peak aggregate under concurrency
The key: it's reproducible — model + serving shipped as one container.

docker pull vidraft/qwen35-vkae:601
Don't take our word for it — run it yourself. The mechanism will be released as a paper.

🏆 Leaderboard & demo 👉 VIDraft/vkae
Articles 👉 https://huggingface.co/blog/FINAL-Bench/vkae-leaderboard

3 replies

replied to their post 6 days ago

You've pinned it exactly. Surprise rate turns the denominator — the modeled surface — from an assertion into a measurement: it makes the coverage number falsifiable.

Honest answer to your question: today Chitos records what Phase 3 fired (the numerator), but it does not yet formally diff "nodes actually reached" against "nodes the static model predicted reachable" and emit that as a surprise rate. That's precisely the gap you've named, and it's the right next step.

Here's how we intend to fold it in: every time a Phase 3 payload lands on a node the static model marked unreachable, we log it as a surprise, feed it back into the enumerator (the surface model gets falsified by its own execution), and report it per phase and per vuln class alongside each "not demonstrated." A low surprise rate is then what earns the coverage number, rather than us asserting it.

One extension: surprise rate is most meaningful stratified by vuln class (1% on the SQLi surface and 1% on the XSS surface don't mean the same thing), and to separate "accurate model" from "blunt prober" it has to be read together with a payload-diversity measure — a prober incapable of surprising will trivially report a low rate. Thanks for this; it's the kind of push that keeps the tool honest.
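A minimal sketch of the stratified surprise rate described above, assuming nodes are identified by a `(vuln_class, node_id)` pair and that the rate is normalized by the number of distinct nodes reached. Both the data shape and the choice of denominator are assumptions for illustration; Chitos does not yet emit this metric.

```python
from collections import defaultdict


def surprise_rates(reached, predicted_reachable):
    """Per-vuln-class share of reached nodes the static model called unreachable.

    reached: iterable of (vuln_class, node_id) pairs a payload actually landed on.
    predicted_reachable: set of (vuln_class, node_id) pairs the static model
        marked reachable.
    """
    hits = defaultdict(int)
    surprises = defaultdict(int)
    for vuln_class, node_id in set(reached):  # count each node once
        hits[vuln_class] += 1
        if (vuln_class, node_id) not in predicted_reachable:
            surprises[vuln_class] += 1
    return {vc: surprises[vc] / hits[vc] for vc in hits}


rates = surprise_rates(
    reached=[("sqli", "a"), ("sqli", "b"), ("xss", "c")],
    predicted_reachable={("sqli", "a"), ("xss", "c")},
)
print(rates)
```

Keeping the result keyed by vulnerability class avoids averaging the SQLi and XSS surfaces into one number; a payload-diversity measure would have to be reported next to it, as the reply notes.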

reacted to ginigen-ai's post with 👀 8 days ago

Post

10418

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even the K-AI #1 model (JGOS-31B-Citizen), the strongest on multiple-choice traps (trap_rate 0.005, roughly 2 misses in 400), is blind to its own free-form mistakes (self-confidence AUROC = 0.5, no better than chance). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row): ① trap_rate — does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ — how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)
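To make the two quantities concrete, here is a small sketch of how a trap rate and a self-confidence AUROC can be computed from per-problem records. The function names and input shapes are assumptions for illustration, not the benchmark's scoring code; the AUROC uses the standard pairwise (Mann-Whitney) definition, with ties counted as half.

```python
def trap_rate(fell_for_trap):
    """Fraction of problems on which the model chose the trap option."""
    return sum(fell_for_trap) / len(fell_for_trap)


def auroc(p_wrong, is_wrong):
    """Probability that a wrong answer gets a higher P(wrong) than a right one.

    0.5 means the score carries no information about errors; 1.0 means every
    wrong answer is ranked above every right one.
    """
    pos = [s for s, w in zip(p_wrong, is_wrong) if w]
    neg = [s for s, w in zip(p_wrong, is_wrong) if not w]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one right answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(trap_rate([True] * 2 + [False] * 398))            # 0.005
print(auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))        # 0.5
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))        # 1.0
```

The second call shows why a constant confidence score lands at exactly 0.5: every wrong/right pair is a tie.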

What's open:
📊 300+100 trap problems (each with a hidden trap + TICOS type)
🏆 24-model leaderboard
🧩 11 per-model adapters: adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).

11 replies

reacted to ginigen-ai's post with 🤯❤️👀➕😎👍🔥 10 days ago

Post

5193

🍳 The RoboCasa Kitchen Leaderboard
What does it take for a robot to handle kitchen chores the way a person does? It has to see (Vision), understand instructions (Language), and actually act (Action) — and VLA (Vision-Language-Action) models are emerging as the answer. They're the bridge between large multimodal models and real-world embodied control.

RoboCasa Kitchen is a leading robot-learning benchmark in which a single-arm robot (Franka Panda) performs 24 atomic manipulation tasks — picking up cups and bowls, opening drawers and doors, turning faucets, pressing buttons, and more — inside a photorealistic simulated kitchen. Because the layout and object placement are randomized every episode, it tests genuine generalization rather than memorized motions. The score (success rate, SR) is the average fraction of the 24 tasks completed as instructed, measured over multiple seeds so results aren't down to luck.
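One way to read the score definition above, as a sketch: average the per-task success fractions within each seed, then average across seeds. The nesting order and the input shape are assumptions; individual papers may aggregate differently, which is part of why the leaderboard separates protocols.

```python
from statistics import mean


def success_rate(results_by_seed):
    """Mean over seeds of the mean per-task success fraction.

    results_by_seed: {seed: {task_name: fraction of episodes succeeded}}
    """
    per_seed = [mean(list(task_sr.values()))
                for task_sr in results_by_seed.values()]
    return mean(per_seed)


# Hypothetical two-task, two-seed example (not real benchmark numbers).
example = {
    0: {"open_drawer": 1.0, "turn_on_faucet": 0.5},
    1: {"open_drawer": 0.5, "turn_on_faucet": 0.0},
}
print(success_rate(example))  # 0.5
```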

The catch: this benchmark has no official leaderboard, and protocols (number of demonstrations, evaluation setup) differ from paper to paper, leaving scores scattered. Lining the numbers up naively quickly turns into an apples-to-oranges comparison.

This leaderboard fixes that by collecting published scores with their sources and comparing only what is genuinely comparable. It's split into three tables:

🏆 Kitchen 24-task (matched) — head-to-head under identical conditions (per the RLDX-1 Technical Report). This is the core ranking you can actually trust.
➕ Other protocols — self-reported under different setups (e.g. fewer demos). Not directly comparable, so kept separate.
🤖 GR1-Tabletop — a different, humanoid-based variant suite, separated to avoid confusion.

Any researcher can submit their own model's score directly, and submissions are reviewed before they appear on the board. Every number links to its source paper, so you can verify it yourself.

👉 ginigen-ai/robocasa-kitchen-leaderboard

reacted to their post with 👀 10 days ago

Post

5144

🐯 Chitos — The Security Scanner That Actually Proves It

Most security scanners hand you a suspect list and walk away. That gap between detection and proof is where attackers live — and it's exactly the gap that Chitos was built to close.

Chitos is the successor to Mythos, a static analyzer built for quick code health checks. Mythos was good at pattern matching: spotting dangerous sinks, mapping CWEs, producing readable reports. But static analysis has a structural ceiling. A rule that sees eval(user_input) can tell you that it looks dangerous. It cannot tell you whether the input is reachable, whether sanitization three layers up covers this path, or whether there's a live exploit chain for your exact framework version. Chitos was built to answer those questions.
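The ceiling is easy to see with a toy pattern rule. The sketch below uses Python's standard `ast` module to flag every call to `eval`; it is an illustration of the general technique, not a Mythos or Chitos rule, and `find_eval_calls` and `SAMPLE` are made-up names.

```python
import ast


def find_eval_calls(source):
    """Return sorted (line, column) positions of calls to a bare `eval` name."""
    tree = ast.parse(source)
    return sorted(
        (node.lineno, node.col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
    )


SAMPLE = '''
def handler(user_input):
    if False:
        return eval(user_input)   # dead code, still flagged
    return eval("1 + 1")          # constant argument, still flagged
'''

print([line for line, _ in find_eval_calls(SAMPLE)])  # [4, 5]
```

Both calls are reported with equal weight, although the first is unreachable and the second never touches user input. Telling them apart takes reachability and data-flow information that a syntactic match does not have.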

🔍 Phase 1 applies 50 language-agnostic rules across Python, JavaScript, Go, Java, C/C++, Rust, PHP, YAML and more — covering injection sinks, deserialization gadgets, credential leakage, broken crypto, and prototype pollution. Every candidate is re-verified before reaching the report. Findings that can't be substantiated are excluded, not handed to you as noise.

🔬 Phase 2 dispatches an autonomous web-search agent to hunt live CVE databases, exploit advisories, and public PoC repositories. It formulates hypotheses, verifies them, and synthesizes a structured threat narrative. This phase needs a user-supplied Claude API key — Phases 1 and 3 run entirely free.

🎯 Phase 3 is where Chitos diverges from everything else. Against targets you own or are authorized to test, it fires real payloads — XSS, SQLi, path traversal, command injection — mutates on block, captures hard evidence, and connects every proven finding into a kill-chain showing which vulnerabilities to remediate first.

No installation. No account. No code sent to third-party APIs.

Article: https://huggingface.co/blog/FINAL-Bench/chitos

Try it now 👉 https://chitos.vidraft.net

9 replies

reacted to their post with 🤗👍❤️ 23 days ago

Post

6791

🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.

Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.

Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation

The rules are simple and strict:
✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable).
🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made.
🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits.
📤 Anyone can submit their own method via the Submit tab for review and listing.
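As an illustration of what a 95% confidence interval on a shot-based measurement can look like, the sketch below computes a Wilson score interval for a success proportion. The post does not say which interval method the benchmark uses, so the choice of Wilson and the function name are assumptions.

```python
import math


def wilson_interval(successes, shots, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if shots <= 0 or not 0 <= successes <= shots:
        raise ValueError("need shots > 0 and 0 <= successes <= shots")
    p = successes / shots
    denom = 1 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots
                         + z * z / (4 * shots * shots)) / denom
    return center - half, center + half
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains usable when the measured proportion is close to 0 or 1, which is common for fidelity-style metrics.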

Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.

A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.

Leaderboard: FINAL-Bench/quantum-bench-leaderboard
Article: https://huggingface.co/blog/FINAL-Bench/quantum-leaderboard

#quantum #QEC #QuantumComputing #benchmark

2 replies

VIDRAFT_LAB

AI & ML interests

Recent Activity

Organizations

SeaWolf-AI's activity