daiqi committed on
Commit 3b30c22 Β· verified Β· 1 Parent(s): 2177a30

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +91 -3
README.md CHANGED
@@ -1,3 +1,91 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model:
+ - microsoft/Phi-3.5-vision-instruct
+ tags:
+ - GUI
+ - Agent
+ - Grounding
+ - CUA
+ ---
+
+ # Microsoft Phi-Ground-4B-7C
+
+ <p align="center">
+ <a href="https://zhangmiaosen2000.github.io/Phi-Ground/" target="_blank">πŸ€– HomePage</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">πŸ“„ Paper </a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">πŸ“„ Arxiv </a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank"> 😊 Model </a> | <a href="" target="_blank"> 😊 Eval data </a>
+ </p>
+
+ ![overview](docs/images/abstract.png)
+
+ **Phi-Ground-4B-7C** is a member of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in the tech report, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks.
+
+ ### Main results
+
+ ![overview](docs/images/r1.png)
+
+ ### Usage
+ The current `transformers` version can be verified with: `pip list | grep transformers`.
+
+ Examples of required packages:
+ ```
+ flash_attn==2.5.8
+ numpy==1.24.4
+ Pillow==10.3.0
+ Requests==2.31.0
+ torch==2.3.0
+ torchvision==0.18.0
+ transformers==4.43.0
+ accelerate==0.30.0
+ ```
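+
+ As a quick sanity check (a minimal sketch, not part of the original instructions), the pinned versions above can be compared against the installed ones from Python:
+
+ ```python
+ # Print the installed versions of the key packages listed above so they can
+ # be compared against the pins in this README.
+ import importlib.metadata as md
+
+ for pkg in ["transformers", "torch", "torchvision", "accelerate", "Pillow"]:
+     try:
+         print(f"{pkg}: {md.version(pkg)}")
+     except md.PackageNotFoundError:
+         print(f"{pkg}: not installed")
+ ```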
+
+
+ ### Input Formats
+
+ The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and a system prompt.
+
+ Input preprocessing:
+
+ ```python
+ from PIL import Image
+
+ def process_image(img):
+     # Resize the screenshot to fit inside the fixed 1008x672 (336*3 x 336*2) canvas
+     # while keeping its aspect ratio, then pad the remainder with white.
+     target_width, target_height = 336 * 3, 336 * 2
+
+     img_ratio = img.width / img.height
+     target_ratio = target_width / target_height
+
+     if img_ratio > target_ratio:
+         new_width = target_width
+         new_height = int(new_width / img_ratio)
+     else:
+         new_height = target_height
+         new_width = int(new_height * img_ratio)
+     reshape_ratio = new_width / img.width  # scale factor from the original image to the resized one
+
+     img = img.resize((new_width, new_height), Image.LANCZOS)
+     new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
+     paste_position = (0, 0)  # the resized image is pasted into the top-left corner
+     new_img.paste(img, paste_position)
+     return new_img
+
+ instruction = "<your instruction>"
+ prompt = """<|user|>
+ The description of the element:
+ {RE}
+
+ Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
+ <|image_1|>
+ <|end|>
+ <|assistant|>""".format(RE=instruction)
+
+ image_path = "<your image path>"
+ image = process_image(Image.open(image_path))
+ ```
+
+
+ Then you can run inference with the Hugging Face model or [vllm](https://github.com/vllm-project/vllm). End-to-end examples and reproduction of the benchmark results can be found [here]().
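+
+ For reference, here is a minimal inference sketch (not the official example) that loads the checkpoint with the standard Phi-3.5-vision `transformers` usage pattern and parses the predicted box. The checkpoint id is a placeholder, the generation settings are arbitrary, and the assumption that the answer is four integers in relative coordinates (multiplied by 1000) on the padded 1008x672 canvas follows from the prompt above:
+
+ ```python
+ # Minimal sketch, not the official recipe: load the checkpoint with the
+ # Phi-3.5-vision-style transformers API and decode the predicted bounding box.
+ import re
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_id = "<path or repo id of the Phi-Ground-4B-7C checkpoint>"  # placeholder
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, trust_remote_code=True, torch_dtype="auto", device_map="cuda"
+ )
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # `prompt` and `image` come from the preprocessing snippet above.
+ inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
+ output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ response = processor.batch_decode(
+     output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(response)
+
+ # Assumed output format: four numbers (x1, y1, x2, y2), each a relative
+ # coordinate multiplied by 1000 on the padded 1008x672 canvas.
+ x1, y1, x2, y2 = [int(n) / 1000 for n in re.findall(r"\d+", response)[:4]]
+ box_on_canvas = (x1 * 1008, y1 * 672, x2 * 1008, y2 * 672)
+ print(box_on_canvas)
+ ```
+
+ To map the box back to the original screenshot, one would additionally divide by the resize factor computed in `process_image` (its `reshape_ratio`), since the resized image is pasted into the top-left corner of the canvas.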