Upload folder using huggingface_hub
README.md CHANGED
---
license: mit
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-4B-7C

<p align="center">
<a href="https://zhangmiaosen2000.github.io/Phi-Ground/" target="_blank">HomePage</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">Paper</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">Arxiv</a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank">Model</a> | <a href="" target="_blank">Eval data</a>
</p>



**Phi-Ground-4B-7C** is a member of the Phi-Ground model family, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision. We believe that the various details discussed in the tech report, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks.

### Main results



### Usage

The current `transformers` version can be verified with: `pip list | grep transformers`.

Examples of required packages:
```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

### Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first order, and the system prompt.

Input preprocessing:

```python
from PIL import Image

def process_image(img):
    # Letterbox the screenshot onto a fixed 1008x672 canvas (3x2 tiles of 336):
    # resize while preserving aspect ratio, then pad the remainder with white.
    target_width, target_height = 336 * 3, 336 * 2

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    reshape_ratio = new_width / img.width  # scale factor relative to the original image

    img = img.resize((new_width, new_height), Image.LANCZOS)
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img

instruction = "<your instruction>"
prompt = """<|user|>
The description of the element:
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)

image_path = "<your image path>"
image = process_image(Image.open(image_path))
```
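The prompt asks for a bounding box in relative coordinates multiplied by 1000. The sketch below (an illustrative helper, not part of the official pipeline) shows one way to map such a prediction back to pixel coordinates in the original screenshot. It assumes the coordinates refer to the padded 1008x672 canvas produced by `process_image` and that the response contains four numbers in (x1, y1, x2, y2) order; verify both assumptions against real model outputs.

```python
import re

def parse_box(response, original_width, original_height,
              canvas_width=336 * 3, canvas_height=336 * 2):
    """Illustrative helper: convert a predicted box back to pixel coordinates
    in the original screenshot. Assumes the response contains four numbers
    (x1, y1, x2, y2) given as relative coordinates multiplied by 1000 on the
    padded canvas used by process_image."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    if len(nums) < 4:
        raise ValueError(f"Could not find a bounding box in: {response!r}")
    x1, y1, x2, y2 = (float(v) for v in nums[:4])

    # Relative (x1000) coordinates -> pixels on the padded canvas.
    x1, x2 = x1 / 1000 * canvas_width, x2 / 1000 * canvas_width
    y1, y2 = y1 / 1000 * canvas_height, y2 / 1000 * canvas_height

    # Undo the aspect-preserving resize; the image is pasted at (0, 0),
    # so only rescaling is needed, no offset to subtract.
    img_ratio = original_width / original_height
    target_ratio = canvas_width / canvas_height
    if img_ratio > target_ratio:
        reshape_ratio = canvas_width / original_width
    else:
        reshape_ratio = canvas_height / original_height
    return [v / reshape_ratio for v in (x1, y1, x2, y2)]
```

If a click point is needed, the center of the recovered box is a reasonable choice.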

Then you can run inference with the Hugging Face `transformers` model or with [vllm](https://github.com/vllm-project/vllm). End-to-end examples and benchmark results reproduction can be found [here]().
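For the Hugging Face route, the following is a minimal sketch in the style of the Phi-3.5-vision-instruct usage example. It assumes this checkpoint exposes the same `AutoProcessor`/`AutoModelForCausalLM` interface via `trust_remote_code`, reuses the `prompt` and `image` variables from the preprocessing snippet, and uses the family repo id `microsoft/Phi-Ground` as a placeholder for the actual 4B-7C checkpoint; check the linked end-to-end examples for the exact attention implementation, processor settings, and generation parameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Placeholder repo id; point this at the actual Phi-Ground-4B-7C checkpoint.
model_id = "microsoft/Phi-Ground"

# Loading follows the Phi-3.5-vision-instruct card; processor settings may need
# to match the fixed 1008x672 training resolution (an assumption here).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # or "eager" if flash-attn is unavailable
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

The same preprocessed image and formatted prompt can also be fed to vllm through its multimodal interface if you prefer a dedicated inference engine.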