scratchtoscale (Scratch to Scale)

Steveeeeeeen

authored 2 papers 18 days ago

Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement

Paper • 2510.23141 • Published Oct 27 • 4

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Paper • 2510.06961 • Published Oct 8 • 10

dhruv3006

posted an update about 1 month ago

Post

214

GLM-4.5V model for local computer use

On OSWorld-V, it scores 35.8% - beating UI-TARS-1.5, matching Claude-3.7-Sonnet-20250219, and setting SOTA for fully open-source computer-use models.

Run it with Cua either: Locally via Hugging Face Remotely via OpenRouter

Github : https://github.com/trycua

Docs + examples: https://docs.trycua.com/docs/agent-sdk/supported-agents/computer-use-agents#glm-45v

dhruv3006

posted an update 2 months ago

Post

183

Computer Use with Sonnet 4.5

We ran one of our hardest computer-use benchmarks on Anthropic Sonnet 4.5, side-by-side with Sonnet 4.

Ask: "Install LibreOffice and make a sales table".

Sonnet 4.5: 214 turns, clean trajectory

Sonnet 4: 316 turns, major detours

The difference shows up in multi-step sequences where errors compound.

32% efficiency gain in just 2 months. From struggling with file extraction to executing complex workflows end-to-end. Computer-use agents are improving faster than most people realize.

Anthropic Sonnet 4.5 and the most comprehensive catalog of VLMs for computer-use are available in our open-source framework.

Start building: https://github.com/trycua/cua

dhruv3006

posted an update 3 months ago

Post

415

Cua: Best State-of-the-Art Computer-Use Agent

Build a SOTA Computer-Use Agent using Cua (https://github.com/trycua/cua), the open-source infrastructure and agent framework for controlling real desktop and browser environments.
Submissions are evaluated in HUD’s OSWorld-Verified benchmarking environment. The top-scoring team earns a secured interview with a Y Combinator partner for the next batch.

Prizes:
Guaranteed YC partner interview
Feature on the Cua blog + social channels
Swag pack for each team member

Eligibility: To be considered for judging and prizes, sign up at https://www.trycua.com/hackathon

eliebak

posted an update 3 months ago

Post

3747

Super excited to announce that our research team at Hugging Face will be doing an AMA on reddit r/LocalLLaMA.

Come ask any questions to the team behind SmolLM, FineWeb and more! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗

science

dhruv3006

posted an update 3 months ago

Post

259

Cua is hiring a Founding Engineer, UX & Design in SF

Cua is hiring a Founding Engineer, UX & Design in our brand new SF office.

Cua is building the infrastructure for general AI agents - your work will define how humans and computers interact at scale.

Location : SF

Referal Bonus : $5000

Apply here : https://www.ycombinator.com/companies/cua/jobs/a6UbTvG-founding-engineer-ux-design

Discord : https://discord.gg/vJ2uCgybsC

Github : https://github.com/trycua

Saurav2023

published a Space 3 months ago

README

👀

dhruv3006

posted an update 3 months ago

Post

1810

Human in the Loop for computer use agents (instant handoff from AI to you)

Sometimes the best “agent” is you.

We’re introducing Human in the Loop: instantly hand off from automation to human control when a task needs judgment.

Yesterday we shared our HUD evals for measuring agents at scale. Today you can become the agent when it matters take over the same session see what the agent sees and keep the workflow moving.

Lets you create clean training demos, establish ground truth for tricky cases, intervene on edge cases ( CAPTCHAs, ambiguous UIs) or step through debug without context switching.

You have full human control when you want.We even a fallback version where in it starts automated but escalate to a human only when needed.

Works across common stacks (OpenAI, Anthropic, Hugging Face) and with our Composite Agents. Same tools, same environment take control when needed.

Feedback welcome,curious how you’d use this in your workflows.

Blog : https://www.trycua.com/blog/human-in-the-loop.md

Github : https://github.com/trycua/cua

dhruv3006

posted an update 3 months ago

Post

1406

Pair a vision grounding model with a reasoning LLM with Cua

Cua just shipped v0.4 of the Cua Agent framework with Composite Agents - you can now pair a vision/grounding model with a reasoning LLM using a simple modelA+modelB syntax. Best clicks + best plans.

The problem: every GUI model speaks a different dialect.
• some want pixel coordinates
• others want percentages
• a few spit out cursed tokens like <|loc095|>

We built a universal interface that works the same across Anthropic, OpenAI, Hugging Face, etc.:

agent = ComputerAgent(
model="anthropic/claude-3-5-sonnet-20241022",
tools=[computer]
)

But here’s the fun part: you can combine models by specialization.
Grounding model (sees + clicks) + Planning model (reasons + decides) →

agent = ComputerAgent(
model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
tools=[computer]
)

This gives GUI skills to models that were never built for computer use. One handles the eyes/hands, the other the brain. Think driver + navigator working together.

Two specialists beat one generalist. We’ve got a ready-to-run notebook demo - curious what combos you all will try.

Github : https://github.com/trycua/cua

Blog : https://www.trycua.com/blog/composite-agents

dhruv3006

posted an update 4 months ago

Post

248

Computer-Use Agents SOTA Challenge @ Hack the North (YC interview for top team) + Global Online ($2000 prize)

We’re bringing something new to Hack the North, Canada’s largest hackathon, this year: a head-to-head competition for Computer-Use Agents - on-site at Waterloo and a Global online challenge. From September 12–14, 2025, teams build on the Cua Agent Framework and are scored in HUD’s OSWorld-Verified environment to push past today’s SOTA on OS-World.

On-site (Track A)
Build during the weekend and submit a repo with a one-line start command. HUD executes your command in a clean environment and runs OSWorld-Verified. Scores come from official benchmark results; ties break by median, then wall-clock time, then earliest submission. Any model setup is allowed (cloud or local). Provide temporary credentials if needed.

HUD runs official evaluations immediately after submission. Winners are announced at the closing ceremony.

Deadline: Sept 15, 8:00 AM EDT

Global Online (Track B)
Open to anyone, anywhere. Build on your own timeline and submit a repo using Cua + Ollama/Ollama Cloud with a short write-up (what's local or hybrid about your design). Judged by Cua and Ollama teams on: Creativity (30%), Technical depth (30%), Use of Ollama/Cloud (30%), Polish (10%). A ≤2-min demo video helps but isn't required.

Deadline: Sept 22, 8:00 AM EDT (1 week after Hack the North)

Submission & rules (both tracks)
Deadlines: Sept 15, 8:00 AM EDT (Track A) / Sept 22, 8:00 AM EDT (Track B)
Deliverables: repo + README start command; optional short demo video; brief model/tool notes
Where to submit: links shared in the Hack the North portal and Discord
Commit freeze: we evaluate the submitted SHA
Rules: no human-in-the-loop after the start command; internet/model access allowed if declared; use temporary/test credentials; you keep your IP; by submitting, you allow benchmarking and publication of scores/short summaries.
Github : https://github.com/trycua

eliebak

posted an update 4 months ago

Post

689

Motif 2.6B tech report is pretty insane, first time i see a model with differential attention and polynorm trained at scale!

> It's trained on 2.5T of token, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving average" averaging the last 6 ckpt every 8B token.
> They trained on Finemath, Fineweb2, DCLM, TxT360.
> Lot of details in the finetuning data they used, for instance they used EvolKit and did some "dataset fusion" to have more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm and Cross Layer Attention.

Motif-Technologies/Motif-2.6B

dhruv3006

posted an update 4 months ago

Post

1984

GPT 5 for Computer Use agents.

Same tasks, same grounding model we just swapped GPT 4o with GPT 5 as the thinking model.

Left = 4o, right = 5.

Watch GPT 5 pull away.

Reasoning model: OpenAI GPT-5

Grounding model: Salesforce GTA1-7B

Action space: CUA Cloud Instances (macOS/Linux/Windows)

The task is: "Navigate to {random_url} and play the game until you reach a score of 5/5”....each task is set up by having claude generate a random app from a predefined list of prompts (multiple choice trivia, form filling, or color matching)"

Try it yourself here : https://github.com/trycua/cua

Docs : https://docs.trycua.com/docs/agent-sdk/supported-agents/composed-agents

2 replies

·

dhruv3006

posted an update 4 months ago

Post

183

So OpenAI is releasing a model on hf today???

dhruv3006

posted an update 5 months ago

Post

2416

Ever wish you could have someone watching your Github repo 24/7?

We built an agent that monitors your repo, finds who most recently starred it, and autonomously reaches out via email!

Discord : https://discord.com/invite/ZYN7f7KPjS

Get your API Key here : https://tally.so/r/nrYr4X

eliebak

posted an update 5 months ago

Post

4771

Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip which is not trivial at all (at small scale but with aggressive threshold). Also a cool explanation of why muon makes the logit explode in appendix E (tl;dr is that muon makes the singular value of the update matrix higher)
> Sparsity scaling laws to justify their ratio, they have a very solid training infra that allows the model to be trained at this sparsity level, they could have increased even more but as sparsity increases the training becomes less efficient.
> They diminish the number of attention heads to make it more efficient for long context since attention heads are a big bottleneck for long context. They also remove 2 of the 3 "first dense" layers in the dsv3 arch.

With the sparsity and attention heads (divided by 2) they achieve 83% increased flops compared to deepseek v3 arch at 128k.

> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus to have different styles, for longer documents they do it by chunk. I'm (half) surprised by the fact that ONLY 1 epoch (assuming same number of training tokens I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge, for Math they apply the ShallowMath recipe and instruct the model to rephrase in a "learning note" style
> They talk about diversity and probably have some internal stuff/eval to test that, as always still a bit unclear for me how to properly measure that.

The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, zero1
> No FP8 computation but for storage of specific layers, selective recomputation for inexpensive block, activation offloading to CPU

dsouzadaniel

authored a paper 5 months ago

When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Paper • 2506.20544 • Published Jun 25 • 10

dhruv3006

posted an update 6 months ago

Post

259

WebBench: A real-world benchmark for Browser Agents

WebBench is an open, task-oriented benchmark designed to measure how effectively browser agents handle complex, realistic web workflows. It includes 2,454 tasks across 452 live websites selected from the global top-1000 by traffic.

Github : https://github.com/Halluminate/WebBench

dhruv3006

posted an update 6 months ago

Post

1493

Introducing Windows Sandbox support - run computer-use agents on Windows business apps without VMs or cloud costs.

Your enterprise software runs on Windows, but testing agents required expensive cloud instances. Windows Sandbox changes this - it's Microsoft's built-in lightweight virtualization sitting on every Windows 10/11 machine, ready for instant agent development.

Enterprise customers kept asking for AutoCAD automation, SAP integration, and legacy Windows software support. Traditional VM testing was slow and resource-heavy. Windows Sandbox solves this with disposable, seconds-to-boot Windows environments for safe agent testing.

What you can build: AutoCAD drawing automation, SAP workflow processing, Bloomberg terminal trading bots, manufacturing execution system integration, or any Windows-only enterprise software automation - all tested safely in disposable sandbox environments.

Free with Windows 10/11, boots in seconds, completely disposable. Perfect for development and testing before deploying to Windows cloud instances (coming later this month).

Check out the github here : https://github.com/trycua/cua

dhruv3006

posted an update 6 months ago

Post

4322

C/ua Cloud Containers - Docker for Computer-Use Agents. Zero local setup. Same Computer and Agent interfaces. Scale 1-100 agents instantly.

Github : https://github.com/trycua/cua

Website : https://www.trycua.com

AI & ML interests

Team members 278

scratchtoscale's activity

README