# Qwen3.5-9B-Thai-Law-Base
This is a continued pre-training (CPT) model based on Qwen/Qwen3.5-9B-Base, adapted to capture the context, terminology, and nuances of Thai law.
The model was further trained on a carefully cleaned and curated dataset of Thai legal documents, including acts, decrees, and official announcements.
## Model Details
- Base Model: Qwen/Qwen3.5-9B-Base
- Architecture: Qwen2ForCausalLM
- Language: Thai (th), English (en)
- Domain: Law, Legal Advisory, Governance
- Parameters: 8.95 Billion
- Context Length: 4,096 tokens (trained with a 2,048-token context)
- Format: Safetensors
## Intended Use
This is a base model (not chat-tuned or instruction-tuned). It requires further Supervised Fine-Tuning (SFT) or Reinforcement Learning (RLHF/GRPO) to act as a conversational AI or legal assistant.
Its primary intended use cases are:
- Acting as a foundational backbone for downstream NLP legal tasks (Classification, Information Retrieval, RAG).
- Serving as a starting point for fine-tuning a chat model that answers legal inquiries.
## Training Data
The model was trained on approximately 68 million tokens of Thai legal-domain data.
- Data sources: Thai Acts, Royal Decrees, Supreme Court Rulings, and official government publications.
- Data preprocessing: Raw PDF OCR data was heavily cleaned to fix common Thai OCR errors (e.g., removing spaces before trailing vowels "า", merging floating tone marks, and normalizing Unicode).
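The cleanup steps above can be sketched as a small normalization pass. The rule set and function name below are illustrative assumptions, not the project's actual pipeline:

```python
import re
import unicodedata

def clean_thai_ocr(text: str) -> str:
    """Illustrative cleanup for common Thai OCR artifacts (hypothetical rules)."""
    # Normalize Unicode so decomposed characters match their composed forms.
    text = unicodedata.normalize("NFC", text)
    # Remove stray spaces OCR inserts before trailing vowels such as "า".
    text = re.sub(r"\s+([ะาำ])", r"\1", text)
    # Merge floating tone marks (Mai Ek through Mai Chattawa) back onto
    # the preceding character.
    text = re.sub(r"\s+([\u0E48-\u0E4B])", r"\1", text)
    # Collapse runs of whitespace left over from PDF extraction.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_thai_ocr("มาตร า ๑"))  # -> "มาตรา ๑"
```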
## Training Configuration
The training was performed using Hugging Face transformers and accelerate on a single NVIDIA H100 80GB GPU.
- Epochs: 1
- Effective Batch Size: 16 (Gradient Accumulation = 8, Per-device Batch = 2)
- Max Sequence Length: 2,048
- Optimizer: AdamW
- Scheduler: Cosine with Warmup (200 steps)
- Peak Learning Rate: 1e-5
- Mixed Precision: bfloat16 (bf16)
- Memory Optimization: Gradient Checkpointing + SDPA Attention
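For reference, the hyperparameters above can be collected in one place. Key names follow Hugging Face `TrainingArguments` conventions, but this is a sketch rather than the actual training script:

```python
# Sketch of the training hyperparameters listed above (not the real script).
train_config = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "optim": "adamw_torch",
    "lr_scheduler_type": "cosine",
    "warmup_steps": 200,
    "learning_rate": 1e-5,
    "bf16": True,
    "gradient_checkpointing": True,
    "attn_implementation": "sdpa",
}

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
print(effective_batch)  # -> 16
```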
## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Phonsiri/Qwen3.5-9B-Thai-Law-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

text = "มาตรา ๑ พระราชบัญญัตินี้เรียกว่า"
# Move inputs to the model's device so they match the device_map placement.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Inference Examples
### Basic Chat (Thinking Mode)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

MODEL_PATH = "Phonsiri/Qwen3.5-9B-Thai-Law-Base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

conversation_history = []

def chat(prompt, max_new_tokens=512, temperature=0.7, top_p=0.9, repetition_penalty=1.1):
    global conversation_history
    conversation_history.append({"role": "user", "content": prompt})
    formatted = tokenizer.apply_chat_template(
        conversation_history, tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(
        formatted, return_tensors="pt", truncation=True, max_length=4096
    ).to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    # Stream tokens to stdout as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            streamer=streamer,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    # Keep only the newly generated tokens, not the echoed prompt.
    generated_ids = output_ids[0][input_len:]
    response_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    conversation_history.append({"role": "assistant", "content": response_text})
    return response_text

def clear_history():
    global conversation_history
    conversation_history = []
```
### Sample Output
Input:

```
ในการจัดตั้งตลาดประเภทที่ ๑ อาคารที่ปลูกสร้างสำหรับให้เช่าขายของต้องมีลักษณะใดบ้าง
```

(English: "For establishing a Type 1 market, what characteristics must buildings constructed for rent to vendors have?")

Output (31 input tokens):

```
ลักษณะอาคารที่สำคัญ:
1. โครงสร้างและความมั่นคงแข็งแรง
- อาคารต้องมีโครงสร้างมั่นคงแข็งแรงเพียงพอต่อการใช้งาน
- ไม่อยู่ในเขตพื้นที่เสี่ยงภัยธรรมชาติ
2. การเข้าออกและการจราจร
- มีทางเข้า-ออกอย่างน้อย ๒ ทาง
3. ระบบไฟฟ้าและความปลอดภัยอัคคีภัย
- มีระบบไฟฟ้าที่ได้มาตรฐานและมีเครื่องตัดไฟอัตโนมัติ
4. ระบบระบายน้ำและสุขาภิบาล
- มีระบบท่อระบายน้ำและบ่อบำบัดน้ำเสียที่เหมาะสม
- ห้องน้ำสาธารณะสะอาด มีจำนวนเพียงพอต่อผู้ใช้งาน
...
```

(English summary: the model lists key building requirements covering structural integrity, entrances and traffic, electrical and fire safety, and drainage and sanitation.)
Note: The model uses a built-in reasoning block before generating its final answer. This is expected behavior from the Qwen3.5 chat template and reflects the model's internal chain-of-thought process.
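If you only want the final answer, the reasoning block can be stripped after decoding. The sketch below assumes Qwen-style `<think>...</think>` markers; check the actual chat template for the exact tags:

```python
import re

def strip_reasoning(text: str) -> str:
    # Remove an optional <think>...</think> block (assumed Qwen-style tag names)
    # along with any whitespace that follows it.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

print(strip_reasoning("<think>draft reasoning</think>Final answer"))  # -> "Final answer"
```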
## Limitations & Disclaimer
- This model generates text based on patterns in its training data. It is NOT a qualified legal professional. Any output derived from this model should be carefully reviewed by a human lawyer before being applied in any real-world legal scenario.
- As a base model, it may not follow instructions well and tends to behave as an autocomplete engine.
## Authors (Contributors)
This model was trained, fine-tuned, and maintained by the following contributors:
- Phonsiri (@Phonsiriwillbejommarn)
- Pimnara (@Pimnara-som)
- CYP777 (@CYP777)
- Nattanan (@GGEarth5632144)

For questions or issues, contact: B6639334@g.sut.ac.th, b6643041@g.sut.ac.th, B6643904@g.sut.ac.th, nattanant563214@gmail.com
## Acknowledgements
We would like to express our sincere gratitude to Lightning AI for providing us with 15 monthly credits through their generous credit program. Access to Lightning AI's cloud GPU infrastructure was instrumental in enabling the training runs for this project. Without their support, training a 9B-parameter model on a large-scale Thai legal corpus would not have been feasible for our team.