I believe Vy takes a vision-based approach similar to ChatGPT Atlas, but the model feels very lightweight and accurate. Vy also has access to all the tools a desktop user has, along with a general understanding of how everything works. The model is proprietary rather than a wrapper around someone else's model, which helps it reject sophisticated prompt injection attacks that some other products, like Browser Use, may fall for. The Claude browser extension is pretty similar, but it takes a DOM approach, which I think falls short in the long run: vision generalizes to anything on screen, while DOM approaches are specialized to certain tasks.
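
To make the vision vs. DOM point concrete, here is a toy sketch in Python. It is not how Vy, Atlas, or the Claude extension actually work; every name in it (`Widget`, `App`, `dom_click`, `vision_click`) is hypothetical, and the "vision" lookup is just a stand-in for a real model reading pixels. It only shows why an agent that depends on selectors has nothing to grab onto outside a web page, while one that works from what is visible on screen can act in both places.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Widget:
    label: str                      # text a user (or a vision model) sees on screen
    x: int                          # screen coordinates of the widget
    y: int
    selector: Optional[str] = None  # only set when the app exposes a DOM


@dataclass
class App:
    name: str
    widgets: List[Widget]
    exposes_dom: bool


def dom_click(app: App, selector: str) -> Optional[Tuple[int, int]]:
    """DOM-style agent: can only act if the app exposes matching selectors."""
    if not app.exposes_dom:
        return None
    for widget in app.widgets:
        if widget.selector == selector:
            return (widget.x, widget.y)
    return None


def vision_click(app: App, label: str) -> Optional[Tuple[int, int]]:
    """Vision-style agent: finds the target by what is visible on screen.

    Assumption: a real system would locate the widget from a screenshot;
    matching on the label here is only a placeholder for that step.
    """
    for widget in app.widgets:
        if widget.label.lower() == label.lower():
            return (widget.x, widget.y)
    return None


# A web page exposes selectors; a native desktop dialog does not.
web_form = App("web form", [Widget("Submit", 120, 300, "#submit-btn")], True)
native_dialog = App("native dialog", [Widget("Submit", 480, 620)], False)

print(dom_click(web_form, "#submit-btn"))       # DOM agent works on the web page
print(dom_click(native_dialog, "#submit-btn"))  # DOM agent has nothing to target
print(vision_click(web_form, "Submit"))         # vision agent works on the web page
print(vision_click(native_dialog, "Submit"))    # and on the native dialog
```

The trade-off the sketch leaves out is that selectors are exact when they exist, whereas a real vision model can misread the screen, so "generalized" here means broader reach, not guaranteed accuracy.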
Vy by Vercept