dpo_40k_pon

This model is a fine-tuned version of /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct on the finegrained_mc_dpo_4_pon dataset. It achieves the following results on the evaluation set:

  • Loss: 0.3368
  • Rewards/chosen: -0.5524
  • Rewards/rejected: -2.3684
  • Rewards/accuracies: 0.8600
  • Rewards/margins: 1.8160
  • Logps/chosen: -35.5458
  • Logps/rejected: -57.5918
  • Logits/chosen: -0.1963
  • Logits/rejected: -0.1836

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 4
  • optimizer: adamw_torch (betas=(0.9, 0.999), epsilon=1e-08; no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1.0
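The settings above imply an effective batch size of 2 × 4 × 8 = 64 and a linear-warmup-then-cosine-decay learning-rate schedule. A minimal sketch of both (step counts in any usage are illustrative; they are not taken from this run):

```python
import math

train_batch_size = 2
num_devices = 4
gradient_accumulation_steps = 8
# Effective batch size: per-device batch x devices x accumulation steps
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps

def lr_at(step, total_steps, base_lr=5e-06, warmup_ratio=0.1):
    # Linear warmup over the first warmup_ratio of steps,
    # then cosine decay from base_lr down to 0.
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The learning rate peaks at 5e-06 once warmup ends (10% of total steps) and decays to zero by the final step.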

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6935 | 0.0402 | 50 | 0.6920 | -0.0014 | -0.0040 | 0.5250 | 0.0026 | -30.0355 | -33.9482 | 0.4799 | 0.5010 |
| 0.6855 | 0.0804 | 100 | 0.6795 | -0.0330 | -0.0623 | 0.6350 | 0.0293 | -30.3516 | -34.5315 | 0.4683 | 0.4915 |
| 0.6672 | 0.1206 | 150 | 0.6494 | -0.1083 | -0.2066 | 0.7000 | 0.0983 | -31.1045 | -35.9736 | 0.4596 | 0.4794 |
| 0.61 | 0.1608 | 200 | 0.6038 | -0.2134 | -0.4363 | 0.7150 | 0.2229 | -32.1554 | -38.2706 | 0.4275 | 0.4467 |
| 0.5835 | 0.2010 | 250 | 0.5565 | -0.2225 | -0.6133 | 0.7175 | 0.3907 | -32.2469 | -40.0405 | 0.3819 | 0.4059 |
| 0.5212 | 0.2412 | 300 | 0.5275 | -0.2683 | -0.7938 | 0.7325 | 0.5255 | -32.7043 | -41.8459 | 0.3332 | 0.3546 |
| 0.509 | 0.2814 | 350 | 0.5019 | -0.3348 | -1.0059 | 0.7350 | 0.6712 | -33.3692 | -43.9671 | 0.2697 | 0.2931 |
| 0.4192 | 0.3216 | 400 | 0.4780 | -0.4106 | -1.2206 | 0.7625 | 0.8100 | -34.1277 | -46.1145 | 0.1839 | 0.2070 |
| 0.4495 | 0.3618 | 450 | 0.4522 | -0.5549 | -1.5425 | 0.7850 | 0.9877 | -35.5702 | -49.3332 | 0.1206 | 0.1397 |
| 0.3982 | 0.4020 | 500 | 0.4248 | -0.5299 | -1.6677 | 0.8025 | 1.1378 | -35.3205 | -50.5853 | 0.0812 | 0.0967 |
| 0.3802 | 0.4422 | 550 | 0.4040 | -0.4697 | -1.7501 | 0.8150 | 1.2804 | -34.7186 | -51.4086 | 0.0321 | 0.0461 |
| 0.3785 | 0.4824 | 600 | 0.3878 | -0.4314 | -1.8178 | 0.8400 | 1.3864 | -34.3354 | -52.0860 | -0.0049 | 0.0072 |
| 0.3252 | 0.5226 | 650 | 0.3779 | -0.5087 | -1.9993 | 0.8425 | 1.4906 | -35.1086 | -53.9007 | -0.0433 | -0.0318 |
| 0.2898 | 0.5628 | 700 | 0.3647 | -0.5194 | -2.0933 | 0.8475 | 1.5739 | -35.2159 | -54.8409 | -0.0803 | -0.0727 |
| 0.3258 | 0.6030 | 750 | 0.3559 | -0.4871 | -2.1277 | 0.8525 | 1.6406 | -34.8930 | -55.1855 | -0.1065 | -0.0989 |
| 0.3676 | 0.6432 | 800 | 0.3500 | -0.5069 | -2.1902 | 0.8525 | 1.6833 | -35.0906 | -55.8103 | -0.1341 | -0.1283 |
| 0.3104 | 0.6834 | 850 | 0.3514 | -0.4703 | -2.1667 | 0.8575 | 1.6963 | -34.7249 | -55.5747 | -0.1501 | -0.1408 |
| 0.3575 | 0.7236 | 900 | 0.3445 | -0.4988 | -2.2507 | 0.8575 | 1.7518 | -35.0100 | -56.4149 | -0.1680 | -0.1594 |
| 0.3041 | 0.7638 | 950 | 0.3427 | -0.5245 | -2.2966 | 0.8550 | 1.7721 | -35.2667 | -56.8744 | -0.1816 | -0.1695 |
| 0.2917 | 0.8040 | 1000 | 0.3397 | -0.5382 | -2.3321 | 0.8550 | 1.7939 | -35.4036 | -57.2287 | -0.1876 | -0.1769 |
| 0.3623 | 0.8442 | 1050 | 0.3389 | -0.5467 | -2.3512 | 0.8550 | 1.8045 | -35.4884 | -57.4199 | -0.1934 | -0.1870 |
| 0.2827 | 0.8844 | 1100 | 0.3388 | -0.5524 | -2.3621 | 0.8550 | 1.8097 | -35.5454 | -57.5290 | -0.1939 | -0.1878 |
| 0.3302 | 0.9246 | 1150 | 0.3373 | -0.5536 | -2.3699 | 0.8550 | 1.8163 | -35.5578 | -57.6074 | -0.1996 | -0.1904 |
| 0.2456 | 0.9648 | 1200 | 0.3379 | -0.5563 | -2.3672 | 0.8600 | 1.8109 | -35.5843 | -57.5798 | -0.2003 | -0.1815 |

Framework versions

  • PEFT 0.17.1
  • Transformers 4.49.0
  • Pytorch 2.5.1+cu124
  • Datasets 4.0.0
  • Tokenizers 0.21.0
Model tree for xiaorui638/qwen2_5vl7b-dpo_80k_pon-lora

Adapter
(243)
this model