# Qwen3.5-9B-Thai-Law-Base
This is a continued pre-training (CPT) model based on Qwen/Qwen3.5-9B-Base, adapted to capture the context, terminology, and nuances of Thai law.
The model was further trained on a carefully cleaned and curated dataset of Thai legal documents, including acts, decrees, and official announcements.
## Model Details
- Base Model: Qwen/Qwen3.5-9B-Base
- Architecture: Qwen2ForCausalLM
- Language: Thai (th), English (en)
- Domain: Law, Legal Advisory, Governance
- Parameters: 8.95 Billion
- Context Length: 4,096 tokens (trained with a 2,048-token context)
- Format: Safetensors
## Intended Use
This is a base model (not chat-tuned or instruction-tuned). It requires further Supervised Fine-Tuning (SFT) or Reinforcement Learning (RLHF/GRPO) to act as a conversational AI or legal assistant.
Its primary intended use cases are:
- Acting as a foundational backbone for downstream NLP legal tasks (Classification, Information Retrieval, RAG).
- Serving as a starting point for fine-tuning a chat model that answers legal inquiries.
## Training Data
The model was trained on approximately 68 million tokens of Thai legal-domain data.
- Data sources: Thai Acts, Royal Decrees, Supreme Court Rulings, and official government publications.
- Data preprocessing: Raw PDF OCR data was heavily cleaned to fix common Thai OCR errors (e.g., removing spaces before trailing vowels "า", merging floating tone marks, and normalizing Unicode).
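The cleanup steps above can be sketched as a small normalization pass. The rule set and function name below are illustrative assumptions, not the project's actual pipeline:

```python
import re
import unicodedata

def clean_thai_ocr(text: str) -> str:
    """Illustrative cleanup for common Thai OCR artifacts (hypothetical rules)."""
    # Normalize Unicode so decomposed characters match their composed forms.
    text = unicodedata.normalize("NFC", text)
    # Remove stray spaces OCR inserts before trailing vowels such as "า".
    text = re.sub(r"\s+([ะาำ])", r"\1", text)
    # Merge floating tone marks (Mai Ek through Mai Chattawa) back onto
    # the preceding character.
    text = re.sub(r"\s+([\u0E48-\u0E4B])", r"\1", text)
    # Collapse runs of whitespace left over from PDF extraction.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_thai_ocr("มาตร า ๑"))  # -> "มาตรา ๑"
```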
## Training Configuration
The training was performed using Hugging Face transformers and accelerate on a single NVIDIA H100 80GB GPU.
- Epochs: 1
- Effective Batch Size: 16 (Gradient Accumulation = 8, Per-device Batch = 2)
- Max Sequence Length: 2,048
- Optimizer: AdamW
- Scheduler: Cosine with Warmup (200 steps)
- Peak Learning Rate: 1e-5
- Mixed Precision: bfloat16 (bf16)
- Memory Optimization: Gradient Checkpointing + SDPA Attention
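For reference, the hyperparameters above can be collected in one place. Key names follow Hugging Face `TrainingArguments` conventions, but this is a sketch rather than the actual training script:

```python
# Sketch of the training hyperparameters listed above (not the real script).
train_config = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "optim": "adamw_torch",
    "lr_scheduler_type": "cosine",
    "warmup_steps": 200,
    "learning_rate": 1e-5,
    "bf16": True,
    "gradient_checkpointing": True,
    "attn_implementation": "sdpa",
}

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
print(effective_batch)  # -> 16
```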
## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Phonsiri/Qwen3.5-9B-Thai-Law-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

text = "มาตรา ๑ พระราชบัญญัตินี้เรียกว่า"
# Move inputs to the model's device so they match the device_map placement.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Inference Examples
### Basic Chat (Thinking Mode)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

MODEL_PATH = "Phonsiri/Qwen3.5-9B-Thai-Law-Base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

conversation_history = []

def chat(prompt, max_new_tokens=512, temperature=0.7, top_p=0.9, repetition_penalty=1.1):
    global conversation_history
    conversation_history.append({"role": "user", "content": prompt})
    formatted = tokenizer.apply_chat_template(
        conversation_history, tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=4096
    ).to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    # Stream tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            streamer=streamer,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    # Keep only the newly generated tokens, not the echoed prompt.
    generated_ids = output_ids[0][input_len:]
    response_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    conversation_history.append({"role": "assistant", "content": response_text})
    return response_text

def clear_history():
    global conversation_history
    conversation_history = []
```
### Sample Output
Input:

```
ในการจัดตั้งตลาดประเภทที่ ๑ อาคารที่ปลูกสร้างสำหรับให้เช่าขายของต้องมีลักษณะใดบ้าง
```

(English: "For establishing a Type 1 market, what characteristics must buildings constructed for rent to vendors have?")

Output (31 input tokens):

```
ลักษณะอาคารที่สำคัญ:
1. โครงสร้างและความมั่นคงแข็งแรง
- อาคารต้องมีโครงสร้างมั่นคงแข็งแรงเพียงพอต่อการใช้งาน
- ไม่อยู่ในเขตพื้นที่เสี่ยงภัยธรรมชาติ
2. การเข้าออกและการจราจร
- มีทางเข้า-ออกอย่างน้อย ๒ ทาง
3. ระบบไฟฟ้าและความปลอดภัยอัคคีภัย
- มีระบบไฟฟ้าที่ได้มาตรฐานและมีเครื่องตัดไฟอัตโนมัติ
4. ระบบระบายน้ำและสุขาภิบาล
- มีระบบท่อระบายน้ำและบ่อบำบัดน้ำเสียที่เหมาะสม
- ห้องน้ำสาธารณะสะอาด มีจำนวนเพียงพอต่อผู้ใช้งาน
...
```

(English summary: the model lists key building requirements covering structural integrity, entrances and traffic, electrical and fire safety, and drainage and sanitation.)
Note: The model uses a built-in reasoning block before generating its final answer. This is expected behavior from the Qwen3.5 chat template and reflects the model's internal chain-of-thought process.
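If you only want the final answer, the reasoning block can be stripped after decoding. The sketch below assumes Qwen-style `<think>...</think>` markers; check the actual chat template for the exact tags:

```python
import re

def strip_reasoning(text: str) -> str:
    # Remove an optional <think>...</think> block (assumed Qwen-style tag names)
    # along with any whitespace that follows it.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_reasoning("<think>draft reasoning</think>Final answer"))  # -> "Final answer"
```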
## Limitations & Disclaimer
- This model generates text based on patterns in its training data. It is NOT a qualified legal professional. Any output derived from this model should be carefully reviewed by a human lawyer before being applied in any real-world legal scenario.
- As a base model, it may not follow instructions well and tends to behave as an autocomplete engine.
## Authors (Contributors)
This model was trained, fine-tuned, and maintained by the following contributors:
- Phonsiri (@Phonsiriwillbejommarn)
- Pimnara (@Pimnara-som)
- CYP777 (@CYP777)
- Nattanan (@GGEarth5632144)

For questions or issues, contact: B6639334@g.sut.ac.th, b6643041@g.sut.ac.th, B6643904@g.sut.ac.th, nattanant563214@gmail.com
## Acknowledgements
We would like to express our sincere gratitude to Lightning AI for providing us with 15 monthly credits through their generous credit program. Access to Lightning AI's cloud GPU infrastructure was instrumental in enabling the training runs for this project. Without their support, training a 9B-parameter model on a large-scale Thai legal corpus would not have been feasible for our team.