---
title: Perceptual Copilot
emoji: 👁️
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.33.1
app_file: main.py
pinned: false
license: mit
---
## ✨ What is Perceptual Copilot?
Perceptual Copilot is a prototype that connects OpenAI agents to visual tools for processing real-time video streams. It is an experimental platform that showcases both the promise and the current limitations of giving agents vision capabilities to understand and interact with live visual data.
### Architecture Overview
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  Webcam  │────▶│  Memory  │────▶│  Gradio  │
└──────────┘     └──────────┘     └──────────┘
                       │
                       ▼
                 ┌──────────┐     ┌──────────┐
                 │  Agent   │────▶│  Tools   │
                 └──────────┘     └──────────┘
```
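To make the diagram concrete, here is a minimal, hypothetical sketch of how the boxes could be wired with Gradio. The single-frame memory and the function names (`remember`, `chat`) are illustrative assumptions, not the actual `main.py` implementation:

```python
import gradio as gr

latest_frame = None  # the "Memory" box: most recent webcam frame

def remember(frame):
    """Store the newest frame so the agent always reasons over the current view."""
    global latest_frame
    latest_frame = frame
    return frame

def chat(message, history):
    """Stand-in for the agent: the real app would hand `latest_frame`
    to an OpenAI agent equipped with the vision tools listed below."""
    if latest_frame is None:
        return "No frame captured yet."
    h, w = latest_frame.shape[:2]
    return f"(agent would answer {message!r} using the current {w}x{h} frame)"

with gr.Blocks() as demo:
    cam = gr.Image(sources=["webcam"], streaming=True)  # Webcam
    view = gr.Image()                                   # Gradio display
    cam.stream(remember, inputs=cam, outputs=view)      # Webcam -> Memory
    gr.ChatInterface(chat)                              # user <-> Agent

demo.launch()
```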
### Available Tools

| Tool | Description | Output |
|------|-------------|--------|
| `caption` | Generate detailed image descriptions | Rich visual descriptions |
| `ocr` | Extract text from images | Extracted text content |
| `localize` | Detect and locate objects | Bounding boxes with labels |
| `qa` | Answer questions about images | Contextual answers |
| `time` | Get current timestamp | Current date and time |
| _More tools coming soon..._ | Additional capabilities in development | Various outputs |
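As a sketch of how a tool reaches the agent: with the OpenAI Agents SDK (`pip install openai-agents`), each entry in the table could be registered as a function tool. The wiring below is an assumption for illustration, shown only for the `time` tool; the repo's actual tool definitions may differ:

```python
from datetime import datetime
from agents import Agent, Runner, function_tool  # assumes the openai-agents package

@function_tool
def time() -> str:
    """Get the current date and time."""
    return datetime.now().isoformat()

# caption, ocr, localize, and qa would be registered the same way, each
# pulling the current frame from memory and calling a vision endpoint.
agent = Agent(
    name="Perceptual Copilot",
    instructions="Answer questions about the live camera view using your tools.",
    tools=[time],
)

result = Runner.run_sync(agent, "What time is it?")
print(result.final_output)
```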
## 🚀 Quick Start

### Prerequisites

- Webcam access (a quick sanity check is sketched below)
- Python 3 with `pip`
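If you want to verify webcam access before launching, a throwaway check like the following works. It assumes OpenCV is installed, which this repo does not guarantee:

```python
# Quick sanity check that a webcam is reachable (assumes OpenCV).
import cv2

cap = cv2.VideoCapture(0)  # default camera
ok, frame = cap.read()     # ok is False if no frame could be grabbed
cap.release()
print("Webcam OK:", ok)
```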
### Installation

1. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```
2. **Set up environment variables** (read at startup; see the configuration sketch after these steps)

   ```bash
   export HF_TOKEN="your_huggingface_token"
   export API_KEY="your_openai_api_key"
   export END_LANG="your_llm_endpoint"
   export END_TASK="your_task_endpoint"
   export MODEL_AGENT="your_agent_model"
   export MODEL_MLLM="your_multimodal_model"
   export MODEL_LOC="your_localization_model"
   ```
3. **Launch the application**

   ```bash
   python main.py
   ```
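For reference, here is how the configuration could be read at startup. The comments are inferred from the variable names and are assumptions, not documentation of `main.py`:

```python
import os

HF_TOKEN = os.environ["HF_TOKEN"]        # Hugging Face access token
API_KEY = os.environ["API_KEY"]          # OpenAI-compatible API key
END_LANG = os.environ["END_LANG"]        # endpoint for the language model
END_TASK = os.environ["END_TASK"]        # endpoint for the vision task models
MODEL_AGENT = os.environ["MODEL_AGENT"]  # model driving the agent loop
MODEL_MLLM = os.environ["MODEL_MLLM"]    # multimodal model (caption/ocr/qa)
MODEL_LOC = os.environ["MODEL_LOC"]      # object-localization model
```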
## 💡 Usage Examples

### Basic Interaction

- **User**: "What do you see?"
- **Assistant**: *Generates a detailed caption of the current view*

### OCR Functionality

- **User**: "Read the text in this document"
- **Assistant**: *Extracts and returns all visible text*

### Object Detection

- **User**: "What objects are in front of me?"
- **Assistant**: *Identifies and localizes objects with bounding boxes (drawn as sketched below)*
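Since the Acknowledgments note that Supervision handles frame annotation, a `localize`-style result could be drawn like this. The detection boxes and labels are made up for illustration:

```python
import numpy as np
import supervision as sv

# Stand-in for a webcam frame and for hypothetical `localize` output.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
detections = sv.Detections(
    xyxy=np.array([[50, 60, 200, 220], [300, 100, 460, 300]]),  # boxes
    class_id=np.array([0, 1]),
)
labels = ["cup", "keyboard"]

# Draw boxes, then overlay the class labels.
annotated = sv.BoxAnnotator().annotate(scene=frame.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(
    scene=annotated, detections=detections, labels=labels
)
```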
## Acknowledgments

- Built with [Gradio](https://gradio.app/) for the interactive web interface
- Uses [Supervision](https://supervision.roboflow.com/) for frame annotation
- WebRTC integration via [FastRTC](https://github.com/gradio-app/gradio)