OK, but what were the gains over the untrained Qwen-VL 4B and 8B? You don't provide a baseline.
Also check the validation set in a very granular way. What works and what doesn't? Is it the same for both, with only a few slim differences? Is there a part of the validation that is always wrong: orientation, relative position, identification, inter-relation, etc.? You have question_category; add question_type, object_type, object, other_object_type, other_object...
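Something like this is enough to get that breakdown (a minimal sketch, assuming you export one row per validation question to a CSV with those metadata columns plus a boolean "correct" column; the file name and column names are placeholders, not your actual schema):

```python
# Minimal sketch: break validation accuracy down by metadata column.
# Assumes a CSV with one row per validation question, the metadata columns
# mentioned above, and a boolean "correct" column -- names are placeholders.
import pandas as pd

results = pd.read_csv("validation_results.csv")

for col in ["question_category", "question_type", "object_type", "other_object_type"]:
    breakdown = (results.groupby(col)["correct"]
                        .agg(["mean", "count"])
                        .sort_values("mean"))
    print(f"\n=== accuracy by {col} ===")
    print(breakdown)  # rows with a low "mean" are the parts that are almost always wrong
```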
Can density vs diversity help for a specific subset of the validation questions?
I think you have too many variables to be able to get the answer you want.
Take one type of question, have it formulated in different ways (does A touch B, is A adjacent to B, is A over B...), then validate on checkpoints every X steps, starting at step 0. Do you see different behaviour for diversity vs density? Does it plateau, or does accuracy keep improving (more epochs needed)? Once you have a working evaluation pipeline, test a few more question types. Save your validation runs to a dataset: question, type (diverse/dense), step X, generated answer, reference answer, pass/fail (see the sketch below). Then you can reprocess your validation results with different code or an LLM judge to inspect them, etc.
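To make that concrete, this is the kind of record I mean (just a sketch: one JSONL row per question per checkpoint; the field names are illustrative, adapt them to whatever you actually track):

```python
# Sketch of the per-checkpoint validation log I have in mind.
# One JSONL record per (question, checkpoint step); field names are illustrative.
import json

def log_validation(path, *, question, phrasing, question_type, train_mode,
                   step, generated, reference, passed):
    record = {
        "question": question,            # e.g. "does A touch B"
        "phrasing": phrasing,            # which reformulation was used
        "question_type": question_type,  # e.g. "contact", "relative_position"
        "train_mode": train_mode,        # "diverse" or "dense"
        "step": step,                    # checkpoint step, starting at 0 (untrained baseline)
        "generated_answer": generated,
        "reference_answer": reference,
        "pass": passed,                  # bool from whatever judge you use
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Once inference results are stored this way, re-judging with different code or a different LLM judge is just a pass over the file, no re-running of the model needed.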
I would really recommend that you group your validation by task/item/orientation/interaction labels. This could show where the specific gains of diversity vs density are. Try different labeling strategies, as the choice can greatly influence the result. This can all be done on the validation data you already saved. Generate lots of plots to visualize it.
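For the plots, something like this works directly on the saved log (again a sketch; it assumes the JSONL format above plus a hypothetical "interaction_label" column, which you would swap for whichever labeling strategy you try):

```python
# Sketch: accuracy per label, diverse vs dense, from the saved validation log.
# Assumes the JSONL format above plus a hypothetical "interaction_label" field.
import json
import pandas as pd
import matplotlib.pyplot as plt

with open("validation_log.jsonl") as f:
    df = pd.DataFrame([json.loads(line) for line in f])

pivot = (df.groupby(["interaction_label", "train_mode"])["pass"]
           .mean()                      # fraction passed per (label, train_mode)
           .unstack("train_mode"))      # columns become "diverse" / "dense"
pivot.plot.bar(figsize=(10, 4))
plt.ylabel("accuracy")
plt.title("diverse vs dense accuracy per interaction label")
plt.tight_layout()
plt.savefig("acc_by_interaction_label.png")
```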
Also, remove validation items that the baseline already passes; it will make your evaluation faster, but you will not be able to test for regressions.
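If you go that route, the filtering can come from the same log (sketch, same assumed JSONL; step 0 is the untrained baseline):

```python
# Sketch: drop validation questions the step-0 (untrained) checkpoint already passes.
# Same assumed JSONL log as above; remember this removes regression coverage.
import json
import pandas as pd

with open("validation_log.jsonl") as f:
    df = pd.DataFrame([json.loads(line) for line in f])

baseline_pass = set(df.loc[(df["step"] == 0) & df["pass"], "question"])
reduced = df[~df["question"].isin(baseline_pass)]
print(f"kept {reduced['question'].nunique()} of {df['question'].nunique()} questions")
```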
Before going even bigger, I would try a few more micro training runs. Select a small set of 100 images and check the gains on validation questions specific to that image set (questions on the same subjects, actions...), to see what a small set contributes to validation. Also try up to 15 epochs, or until it plateaus.
Also, for validation, try absolute answers like left/right/yes/closer/farther/over, depending on your validation function. Or do you use an LLM to check the generated answer against the reference answer in the dataset? For example, checking that "right hand" is present in the answer. Did you verify that your accuracy handling has no errors?
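If you go the rule-based route, this is roughly what I mean (a sketch only; the normalization and synonym table are assumptions to tune to your data, and the word-boundary match avoids things like "no" matching inside "know"):

```python
# Sketch of a rule-based checker for short absolute answers
# (left/right/yes/no/closer/farther/over); normalization and the synonym
# table are assumptions, tune them to your data.
import re

SYNONYMS = {"further": "farther", "above": "over"}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text.lower().strip())  # lowercase, drop punctuation
    return SYNONYMS.get(text, text)

def check_answer(generated: str, reference: str) -> bool:
    gen, ref = normalize(generated), normalize(reference)
    if gen == ref:
        return True
    # word-boundary match so "no" does not match inside "know" or "not"
    return re.search(rf"\b{re.escape(ref)}\b", gen) is not None

# tiny self-test: also the quickest way to catch bugs in the accuracy handling itself
assert check_answer("It is on the Right.", "right")
assert check_answer("The person raises the right hand.", "right hand")
assert not check_answer("I do not know", "no")
```

Whichever checker you end up with (rules or LLM judge), run it on a handful of hand-labeled pass/fail pairs first; that answers the "no errors in accuracy handling" question before you trust any of the curves.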
Good luck