daiqi committed on
Commit 3b30c22 Β· verified Β· 1 Parent(s): 2177a30

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +91 -3
README.md CHANGED
@@ -1,3 +1,91 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model:
+ - microsoft/Phi-3.5-vision-instruct
+ tags:
+ - GUI
+ - Agent
+ - Grounding
+ - CUA
+ ---
+
+ # Microsoft Phi-Ground-4B-7C
+
+ <p align="center">
+ <a href="https://zhangmiaosen2000.github.io/Phi-Ground/" target="_blank">πŸ€– HomePage</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">πŸ“„ Paper </a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">πŸ“„ Arxiv </a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank"> 😊 Model </a> | <a href="" target="_blank"> 😊 Eval data </a>
+ </p>
+
+ ![overview](docs/images/abstract.png)
+
+ **Phi-Ground-4B-7C** is a member of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in the tech report, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks.
+
+ ### Main results
+
+ ![overview](docs/images/r1.png)
+
+ ### Usage
+ The current `transformers` version can be verified with: `pip list | grep transformers`.
+
+ Examples of required packages:
+ ```
+ flash_attn==2.5.8
+ numpy==1.24.4
+ Pillow==10.3.0
+ Requests==2.31.0
+ torch==2.3.0
+ torchvision==0.18.0
+ transformers==4.43.0
+ accelerate==0.30.0
+ ```
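+
+ As a quick sanity check (a minimal sketch, not part of the original instructions), the pinned versions above can be compared against the installed ones from Python:
+
+ ```python
+ # Print the installed versions of the key packages listed above so they can
+ # be compared against the pins in this README.
+ import importlib.metadata as md
+
+ for pkg in ["transformers", "torch", "torchvision", "accelerate", "Pillow"]:
+     try:
+         print(f"{pkg}: {md.version(pkg)}")
+     except md.PackageNotFoundError:
+         print(f"{pkg}: not installed")
+ ```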
+
+
+ ### Input Formats
+
+ The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and a system prompt.
+
+ Input preprocessing:
+
+ ```python
+ from PIL import Image
+
+ def process_image(img):
+     # Resize the screenshot to fit inside the fixed 1008x672 (336*3 x 336*2) canvas
+     # while keeping its aspect ratio, then pad the remainder with white.
+     target_width, target_height = 336 * 3, 336 * 2
+
+     img_ratio = img.width / img.height
+     target_ratio = target_width / target_height
+
+     if img_ratio > target_ratio:
+         new_width = target_width
+         new_height = int(new_width / img_ratio)
+     else:
+         new_height = target_height
+         new_width = int(new_height * img_ratio)
+     reshape_ratio = new_width / img.width  # scale factor from the original image to the resized one
+
+     img = img.resize((new_width, new_height), Image.LANCZOS)
+     new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
+     paste_position = (0, 0)  # the resized image is pasted into the top-left corner
+     new_img.paste(img, paste_position)
+     return new_img
+
+ instruction = "<your instruction>"
+ prompt = """<|user|>
+ The description of the element:
+ {RE}
+
+ Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
+ <|image_1|>
+ <|end|>
+ <|assistant|>""".format(RE=instruction)
+
+ image_path = "<your image path>"
+ image = process_image(Image.open(image_path))
+ ```
+
+
+ Then you can run inference with the Hugging Face model or [vllm](https://github.com/vllm-project/vllm). End-to-end examples and reproduction of the benchmark results can be found [here]().
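+
+ For reference, here is a minimal inference sketch (not the official example) that loads the checkpoint with the standard Phi-3.5-vision `transformers` usage pattern and parses the predicted box. The checkpoint id is a placeholder, the generation settings are arbitrary, and the assumption that the answer is four integers in relative coordinates (multiplied by 1000) on the padded 1008x672 canvas follows from the prompt above:
+
+ ```python
+ # Minimal sketch, not the official recipe: load the checkpoint with the
+ # Phi-3.5-vision-style transformers API and decode the predicted bounding box.
+ import re
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_id = "<path or repo id of the Phi-Ground-4B-7C checkpoint>"  # placeholder
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
+ )
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # `prompt` and `image` come from the preprocessing snippet above.
+ inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
+ output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ response = processor.batch_decode(
+     output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(response)
+
+ # Assumed output format: four numbers (x1, y1, x2, y2), each a relative
+ # coordinate multiplied by 1000 on the padded 1008x672 canvas.
+ x1, y1, x2, y2 = [int(n) / 1000 for n in re.findall(r"\d+", response)[:4]]
+ box_on_canvas = (x1 * 1008, y1 * 672, x2 * 1008, y2 * 672)
+ print(box_on_canvas)
+ ```
+
+ To map the box back to the original screenshot, one would additionally divide by the resize factor computed in `process_image` (its `reshape_ratio`), since the resized image is pasted into the top-left corner of the canvas.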