MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

MusaCoder-27B

This repository contains model weights and configuration files for MusaCoder-27B, a specialized code generation model for native GPU kernel synthesis.

MusaCoder-27B is designed to generate CUDA/MUSA native kernels from PyTorch reference implementations, with a focus on compilability, numerical correctness, anti-fallback legality (no forbidden PyTorch/ATen compute fallbacks), and empirical speedup.

Introduction

MusaCoder-27B is a 27B-parameter code model developed by Moore Threads for PyTorch-to-CUDA/MUSA native kernel generation. Unlike general-purpose code models, MusaCoder focuses on low-level GPU programming tasks, including tensor shape reasoning, thread/block mapping, memory indexing, boundary handling, reduction strategies, numerical stability, and performance-oriented kernel optimization.

The model is trained through a full-stack post-training pipeline consisting of:

multi-source supervised fine-tuning data construction;
verifier-filtered rejection fine-tuning;
execution-feedback reinforcement learning;
strict native-kernel verification with MooreEval;
CUDA/MUSA-oriented kernel repair and optimization data.

MusaCoder-27B is released to promote the development of the MUSA open-source ecosystem, facilitate research on LLM-based code generation and GPU kernel synthesis, and encourage the community to explore cross-platform native kernel optimization.

Highlights

Native CUDA/MUSA Kernel Generation

MusaCoder-27B is optimized for generating native GPU kernels from PyTorch reference code. The model is not intended for generic business code generation; instead, it targets low-level kernel authoring where generated code must compile, run correctly, satisfy task constraints, and achieve measurable speedup.

MUSA-Oriented Kernel Synthesis

MusaCoder-27B supports PyTorch-to-MUSA kernel generation and can be used to explore automatic generation of MUSA native kernels from PyTorch reference programs. This gives the MUSA developer community a foundation model for kernel authoring and lowers the barrier to writing, validating, and optimizing MUSA kernels.

Full-Stack Training Pipeline

MusaCoder-27B is trained with a full-stack pipeline:

SFT (supervised fine-tuning) teaches the model the PyTorch-to-kernel task format, common kernel implementation patterns, GPU programming knowledge, code review, and performance analysis.
RFT (rejection fine-tuning) uses execution-based verification to select correct model-generated implementations while preserving implementation diversity.
RL (reinforcement learning) uses real compilation, execution, correctness checking, anti-fallback detection, and runtime measurement as reward signals.

Execution-Based Verification

MusaCoder is developed together with MooreEval, an execution-based verifier and reward environment. MooreEval checks whether generated kernels:

can be parsed and compiled;
pass randomized correctness tests against PyTorch reference outputs;
avoid forbidden PyTorch/ATen computational fallbacks;
achieve real runtime speedup under synchronized event timing.
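
The anti-fallback check can be illustrated with a static scan. The sketch below is not MooreEval's implementation: it assumes a hypothetical deny-list (`FORBIDDEN_CALLS`) and uses Python's standard `ast` module to flag dotted calls such as `torch.relu` inside `ModelNew.forward`. A real verifier would likely also need runtime checks, since a static scan misses aliased imports and indirect calls.

```python
import ast

# Hypothetical deny-list for illustration; the actual MooreEval rules are not given here.
FORBIDDEN_CALLS = {"torch.relu", "torch.matmul", "torch.nn.functional.relu"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_forbidden_calls(source, class_name="ModelNew", method="forward"):
    """List forbidden dotted calls made inside class_name.method."""
    hits = []
    for cls in ast.walk(ast.parse(source)):
        if not (isinstance(cls, ast.ClassDef) and cls.name == class_name):
            continue
        for fn in cls.body:
            if isinstance(fn, ast.FunctionDef) and fn.name == method:
                for call in ast.walk(fn):
                    if isinstance(call, ast.Call):
                        name = dotted_name(call.func)
                        if name in FORBIDDEN_CALLS:
                            hits.append(name)
    return hits


candidate = '''
class ModelNew(nn.Module):
    def forward(self, x):
        return torch.relu(x)
'''
print(find_forbidden_calls(candidate))  # ['torch.relu']
```

The source is only parsed, never executed, so the scan is safe to run on untrusted model output before any compilation step.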

RL Stabilization Techniques

The training pipeline incorporates three stabilization techniques:

PrimeEcho: first-turn-anchored multi-turn reward for balancing repair ability and first-attempt quality.
Buffered Dynamic Retry: converts all-failed groups into feedback-conditioned repair tasks.
MirrorPop: sequence-level off-policy filtering based on absolute log-ratio deviation.
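
The one-line description of MirrorPop leaves the details open. One plausible reading, sketched below under stated assumptions, is to compute a per-sequence log-probability ratio between the current policy and the policy that generated the sample, and to drop sequences whose absolute ratio exceeds a threshold. The use of a token-mean (rather than a sum), the threshold value, and all function and field names here are assumptions for illustration, not the published method.

```python
def sequence_log_ratio(new_logprobs, old_logprobs):
    """Mean per-token log(pi_new / pi_old) for one sampled sequence."""
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return total / len(new_logprobs)


def filter_off_policy(batch, max_abs_log_ratio=0.5):
    """Keep sequences whose |mean log-ratio| stays within the threshold."""
    return [
        seq for seq in batch
        if abs(sequence_log_ratio(seq["new"], seq["old"])) <= max_abs_log_ratio
    ]


batch = [
    # Nearly on-policy: the two policies assign similar token log-probs.
    {"id": "a", "new": [-1.0, -2.0], "old": [-1.1, -1.9]},
    # Far off-policy: the current policy is much more confident than the sampler.
    {"id": "b", "new": [-0.2, -0.3], "old": [-2.0, -2.5]},
]
kept = filter_off_policy(batch)
print([seq["id"] for seq in kept])  # ['a']
```

Filtering on the absolute value treats sequences that became much more likely and much less likely symmetrically, which is one way to read "absolute log-ratio deviation".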

Model Details

Item	Description
Model name	MusaCoder-27B
Developer	Moore Threads
Base model	Qwen3.6-27B
Model type	Causal language model
Primary use	PyTorch-to-CUDA/MUSA native kernel generation
License	Apache License 2.0
Training precision	bf16
Recommended framework	Transformers / vLLM / SGLang-compatible inference

Intended Use

MusaCoder-27B is intended for research and development in:

PyTorch-to-CUDA/MUSA kernel generation;
native GPU kernel synthesis;
code generation for accelerator programming;
automatic kernel repair and optimization;
MUSA ecosystem development;
execution-feedback reinforcement learning for code models.

A typical input contains a PyTorch reference implementation, input constraints, and generation requirements. The model is expected to produce a ModelNew implementation using custom native CUDA/MUSA kernels.

Quickstart

Installation

pip install transformers accelerate torch

For high-throughput inference, users may also use vLLM or SGLang depending on their deployment environment.

Basic Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "MooreThreads/MusaCoder-27B"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = r"""
You are given a PyTorch reference implementation. Write a replacement ModelNew
that implements the same computation using a custom native CUDA/MUSA kernel.

Reference:
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        return torch.relu(x)
```

Requirements:

* Define class ModelNew(nn.Module).
* Do not use forbidden PyTorch/ATen compute fallback in ModelNew.forward().
* The implementation must be compilable and numerically correct.
  """

messages = [
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=32000,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
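
The decoded response may mix explanation with code, so the ModelNew source usually has to be pulled out before compiling. The helper below is a minimal sketch, assuming the model wraps its answer in fenced code blocks; `extract_model_new` is a hypothetical name, and the fence marker is built programmatically only so that this example nests cleanly inside a fenced block.

```python
import ast
import re

TICKS = "`" * 3  # a Markdown code-fence marker
FENCE_RE = re.compile(TICKS + r"[\w+-]*[ \t]*\n(.*?)" + TICKS, re.DOTALL)


def extract_model_new(response):
    """Return the last fenced block that parses as Python and defines ModelNew."""
    for block in reversed(FENCE_RE.findall(response)):
        try:
            tree = ast.parse(block)
        except SyntaxError:
            continue  # not Python (for example, a standalone CUDA source block)
        if any(isinstance(n, ast.ClassDef) and n.name == "ModelNew" for n in tree.body):
            return block
    return None


sample_response = "\n".join([
    "Here is the kernel wrapper.",
    TICKS + "python",
    "import torch.nn as nn",
    "",
    "class ModelNew(nn.Module):",
    "    pass",
    TICKS,
    "",
])
code = extract_model_new(sample_response)
print(code is not None)  # True
```

Searching from the last block backwards favors the final answer when the model emits drafts or repaired versions earlier in the same response.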

Prompt Format

We recommend using a structured prompt that includes:

PyTorch reference code;
input shape and dtype constraints;
target backend, e.g., CUDA or MUSA;
explicit instruction to define ModelNew;
anti-fallback constraints;
optional correctness and performance requirements.

Example:

Given the following PyTorch reference model, generate a new implementation
class ModelNew(nn.Module) that uses custom native CUDA/MUSA kernels.

The generated implementation must:
- match the PyTorch reference numerically;
- compile successfully;
- avoid forbidden PyTorch/ATen compute fallback in forward();
- handle boundary cases correctly;
- prefer native kernel implementations over high-level library calls.
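
The recommended fields can be assembled programmatically. The function below is a sketch only: `build_prompt` and its parameters are hypothetical names, and the wording simply reuses the example above. The fence marker is built at runtime so the example nests cleanly inside a fenced block.

```python
FENCE = "`" * 3  # Markdown code-fence marker for the embedded reference code


def build_prompt(reference_code, backend="MUSA", input_spec="", extra_requirements=()):
    """Assemble a structured kernel-generation prompt (hypothetical helper)."""
    if backend not in ("CUDA", "MUSA"):
        raise ValueError(f"unsupported backend: {backend}")
    requirements = [
        "match the PyTorch reference numerically",
        "compile successfully",
        "avoid forbidden PyTorch/ATen compute fallback in forward()",
        "handle boundary cases correctly",
        "prefer native kernel implementations over high-level library calls",
        *extra_requirements,
    ]
    lines = [
        "Given the following PyTorch reference model, generate a new implementation",
        f"class ModelNew(nn.Module) that uses custom native {backend} kernels.",
        "",
        "Reference:",
        f"{FENCE}python",
        reference_code.strip(),
        FENCE,
    ]
    if input_spec:
        lines += ["", f"Input constraints: {input_spec}"]
    lines += ["", "The generated implementation must:"]
    lines += [f"- {item};" for item in requirements[:-1]]
    lines += [f"- {requirements[-1]}."]
    return "\n".join(lines)


example_prompt = build_prompt(
    "class Model(nn.Module):\n    def forward(self, x):\n        return torch.relu(x)",
    backend="CUDA",
    input_spec="x is float32 with shape (1024, 1024)",
)
print(example_prompt)
```

Naming a single backend per prompt keeps the request unambiguous when the same reference is targeted at both CUDA and MUSA.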

Evaluation

MusaCoder-27B is evaluated using the MooreEval protocol on KernelBench-style tasks.

The evaluation checks:

code extraction and interface validity;
compilation success;
randomized correctness against PyTorch reference;
forbidden PyTorch/ATen fallback detection;
synchronized runtime measurement;
Faster Rate, where a kernel counts as faster only if its speedup over the baseline exceeds 1.1x.
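
As a sketch of how the last metric could be computed, assume each task yields one record holding a correctness flag plus baseline and kernel runtimes. The record layout, the function name, and the choice of all tasks as the denominator are assumptions for illustration; only the 1.1x threshold comes from the protocol above.

```python
def faster_rate(results, threshold=1.1):
    """Fraction of tasks whose verified kernel beats the baseline by more than threshold."""
    if not results:
        return 0.0
    faster = sum(
        1
        for r in results
        # Incorrect kernels never count, however fast they run.
        if r["correct"] and r["baseline_ms"] / r["kernel_ms"] > threshold
    )
    return faster / len(results)


results = [
    {"correct": True, "baseline_ms": 2.0, "kernel_ms": 1.0},   # 2.0x, counts
    {"correct": True, "baseline_ms": 1.0, "kernel_ms": 0.95},  # about 1.05x, below threshold
    {"correct": False, "baseline_ms": 5.0, "kernel_ms": 1.0},  # fast but wrong, ignored
    {"correct": True, "baseline_ms": 3.0, "kernel_ms": 2.0},   # 1.5x, counts
]
print(faster_rate(results))  # 0.5
```

The threshold filters out marginal wins that could be explained by timing noise rather than a genuinely better kernel.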

KernelBench Results

Model	Overall Pass@8	Overall Avg.@8	Faster vs. Eager	Faster vs. Compile
Kimi K2.6	84.0	69.10	3.3	1.4
GLM-5.1	85.6	76.25	7.4	3.9
DeepSeek-V4_ProMax	84.8	60.05	5.7	3.0
Claude Opus 4.7	87.2	77.30	11.8	7.5
Qwen3.6-27B	67.2	35.60	3.4	1.6
MusaCoder-27B-SFT	84.8	79.40	6.3	4.1
MusaCoder-27B-RL	93.2	88.60	15.0	9.2

MUSA KernelBench Results

Model	Overall Pass@8	Overall Avg.@8	Faster vs. Eager
DeepSeek-V4-Pro	92.0	56.9	5.7
GLM-5.1	88.0	66.4	6.9
MusaCoder-27B-SFT	79.6	63.5	5.2
MusaCoder-27B-RL	92.4	81.7	12.5
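
The Pass@8 and Avg.@8 columns are not defined in this card. The sketch below assumes the common reading: 8 samples are drawn per task, Pass@8 is the percentage of tasks with at least one passing sample, and Avg.@8 is the per-sample pass rate averaged over tasks. Both function names are hypothetical.

```python
def pass_at_k(outcomes):
    """Percentage of tasks where at least one of the k samples passed."""
    return 100.0 * sum(any(task) for task in outcomes) / len(outcomes)


def avg_at_k(outcomes):
    """Mean per-sample pass rate across tasks, as a percentage."""
    return 100.0 * sum(sum(task) / len(task) for task in outcomes) / len(outcomes)


# Two tasks with 8 samples each: True means the sample passed verification.
outcomes = [
    [True, True, True, True, True, True, False, False],  # 6 of 8 pass
    [False] * 8,                                          # never solved
]
print(pass_at_k(outcomes))  # 50.0
print(avg_at_k(outcomes))   # 37.5
```

Under this reading, a large gap between the two columns indicates that a model can solve a task occasionally but not reliably.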

Notes on Generated Code

Generated kernels should always be compiled and tested before use. GPU kernel generation is a high-risk code generation task because small mistakes in indexing, boundary handling, dtype conversion, or memory layout can lead to incorrect outputs, runtime failures, or illegal memory access.

We recommend validating generated code with:

randomized correctness tests;
multiple input shapes and dtypes;
non-contiguous tensor cases when applicable;
runtime profiling;
forbidden fallback detection.
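
A randomized correctness test can be sketched without a GPU. The example below uses plain Python lists in place of tensors and compares a candidate against a reference over several sizes, including sizes that are not a multiple of a 32-element block. The function names, the size list, and the simulated bug are all illustrative; with real kernels the same loop would call the compiled extension and compare with a tensor-level check such as `torch.testing.assert_close`.

```python
import random


def allclose(actual, expected, rtol=1e-5, atol=1e-8):
    """Elementwise |a - e| <= atol + rtol * |e|, the rule torch.allclose uses."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


def random_check(reference, candidate, sizes=(0, 1, 31, 1023, 1025), trials=3, seed=0):
    """Return the first (size, trial) where candidate disagrees, else None."""
    rng = random.Random(seed)
    for size in sizes:
        for trial in range(trials):
            x = [rng.uniform(-10.0, 10.0) for _ in range(size)]
            if not allclose(candidate(list(x)), reference(list(x))):
                return size, trial
    return None


def relu_reference(x):
    return [max(v, 0.0) for v in x]


def relu_buggy(x):
    # Simulates a kernel whose launch grid covers only full 32-element blocks,
    # leaving the tail of the input unprocessed.
    covered = len(x) // 32 * 32
    return [max(v, 0.0) for v in x[:covered]] + x[covered:]


print(random_check(relu_reference, relu_reference))            # None
print(random_check(relu_reference, relu_buggy) is not None)    # True
```

Sizes just below and just above a block boundary are the cheapest way to expose missing tail handling, which is why they appear in the default size list.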

Limitations

MusaCoder-27B is specialized for GPU kernel generation and may not be optimal for general-purpose chat or application development. The model may still generate code that:

fails to compile;
produces incorrect results for unseen edge cases;
uses inefficient thread/block layouts;
relies on disallowed high-level fallback APIs;
requires additional engineering adaptation for specific platforms or compiler versions.

Users should treat generated code as a candidate implementation that must be verified before deployment.

License

MusaCoder-27B is released under the Apache License 2.0.

MusaCoder-27B is initialized from Qwen3.6-27B and further trained on top of it. Users should comply with the license terms of MusaCoder-27B as well as the applicable license terms of upstream models and third-party components.

Citation

If you find MusaCoder useful, please cite:

@article{cheng2026musacoder,
  title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
  author={Cheng, Kun and Lu, Songshuo and Liao, Sicong and Li, Tankun and Zhang, Yafei and Yang, Dong and Lv, Qiheng and Wang, Hua and Chen, Zhi and Tang, Yaohua},
  journal={arXiv preprint arXiv:2606.04847},
  year={2026},
  eprint={2606.04847},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.04847}
}

Acknowledgements

MusaCoder is developed by Moore Threads AI. We thank the open-source community for advancing GPU programming, code generation, and execution-feedback learning. We also acknowledge the upstream base model and software ecosystems that make this work possible.