Humalike - Give your AI agents the social intelligence they're missing

Today's models are capable enough. Smart enough. Fast enough. But we still feel they don’t fit in the room. Humalike is building the behavioral infrastructure for humanlike AI agents. The social skills & proactiveness your agents have been missing. APIs, models, benchmarks.

congrats!!!

 tysm Madalina!

 Thank you so muuucch!!!

Amazing stuff!

Are you guys planning on launching a separate agent, or just the APIs?

 Thanks for the comment! :)

No separate agent planned, but we did build an open-source plugin for Hermes Agent that uses our APIs, so you can see how it works in action :))

Come take a look:

 Thanks for the supp Kacper :)) We are building a lot. Not a separate agent, but soon we will launch new stuff!

How do you evaluate and correct agentic behavior without the agent losing its personality?

 The components that change the agent's behavior (Theory of Mind, Turn-taking) take personality into account. It's not about changing the agent's personality, it's about making its judgment and use of context more humanlike while staying aligned with that personality.

 tysm for supp! I'm curious, what made you think about this question? Are you working on something similar or?

This is an awesome idea.

Do you have some benchmarks comparing Humalike behaviors with the baseline model?

 Sadly there are no pre-existing benchmarks for this problem, so we had to define the problems and evals ourselves. We open-sourced one of our benchmarks, which covers inferring local social norms in group conversations.
Paper link:

We don't brag about it too much because being the best on your own benchmark has "trust me bro" energy haha

 Thanks for the comment!

 tysm Rezgar :))

the turn-taking and social observability APIs make sense to me, but persona/norms feel like they could go wrong in a way that's hard to detect. if the agent reads the room and picks a side or a tone to fit in, how do you catch it drifting into something the team didn't actually want, before a customer sees it

 Great question - this is exactly the failure mode we think about a lot.

We don’t want persona/norms to mean “the agent blindly adapts to the room.” The agent should understand the local context, but still stay inside the team’s intended personality, brand rules, safety boundaries, and escalation policy.

So persona is an anchor, not a free variable. Norms are interpreted, not blindly copied. And social observability is what helps catch drift in tone, intervention rate, conflict level, or user reactions before it becomes a customer-facing issue.

 I will also add that social observability is meant to help you monitor drift in how users perceive your agent. The components complement each other.

that makes sense as a monitoring layer, my worry was more about the response lag though. if drift only shows up after enough interactions to trend, is there a way to catch it on a single bad interaction before it ships, or is this mostly a "watch the pattern over time" tool

 Oh I see. You can put hard rules on what the agent should never do. If you use turn-taking, just include them in the turn-taking system prompt. It will watch the generated message and catch it if it breaks a hard rule.

There are also other things you could do to avoid drift but our goal is to make it so you don't have to worry about it. Solving this problem is on us, so you don't have to think about it:)
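To make the "single bad interaction" case concrete, here is a minimal sketch of a pre-send gate that checks one draft message against hard rules. It is not Humalike's API: in the reply above the rules live in the turn-taking system prompt, and the regex matching here is only a local stand-in for that check. `HardRule`, `gate`, and the two example rules are hypothetical names made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HardRule:
    """Hypothetical rule: a pattern that must never appear in an outgoing message."""
    name: str
    pattern: str  # regular expression, matched case-insensitively


def violated_rules(message: str, rules: list[HardRule]) -> list[str]:
    """Return the names of every hard rule the draft message breaks."""
    return [r.name for r in rules if re.search(r.pattern, message, re.IGNORECASE)]


def gate(message: str, rules: list[HardRule]) -> tuple[bool, list[str]]:
    """Decide on a single draft, without waiting for a trend to form."""
    broken = violated_rules(message, rules)
    return (not broken, broken)


# Example rules, invented for this sketch.
RULES = [
    HardRule("no_refund_promises", r"\bguarantee\w*\b.*\brefund\b"),
    HardRule("no_competitor_bashing", r"\bcompetitor\w*\b.*\b(terrible|garbage)\b"),
]

if __name__ == "__main__":
    print(gate("We guarantee a full refund today", RULES))
    print(gate("Happy to help with that!", RULES))
```

A gate like this only covers rules you can write down in advance. Slower drift in tone or intervention rate would still need the trend-level monitoring discussed earlier in the thread.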

The community manager anecdote is universal, I've watched it happen in Discord servers, Slack workspaces, and group chats. The failure mode isn't the bot being wrong, it's the bot being present. Silence has always been the harder signal to model because there's no reward function for "you correctly didn't do anything."

The split between Turn-Taking and Theory of Mind is what I'd want to understand better. In practice they feel related but the failure modes are different, an agent can have decent turn-taking (waits for pauses, doesn't interrupt) while still fundamentally misreading what people actually want from the conversation. And vice versa: an agent can read the room well emotionally but still fire at the wrong beat. Is Turn-Taking gated by Theory of Mind under the hood, or are they genuinely independent modules that can score high/low separately?

Rooting for this. Building social behavior as infrastructure rather than as prompt tricks is overdue.

 Yes exactly! Turn-taking is the component that benefits from all the other components, and we actually use ToM in turn-taking under the hood, nice catch. Turn-taking is the king of all components: it benefits from Social Signals, Norms, Persona, ToM and Memory, because knowing when to say something vs. stay silent requires as much context as possible and good judgment on top of that context.

We split it because the components can still be used independently. For example, we used the ToM component internally to analyze transcript history after a chat ended, not only to guide the agent in real time.

The split also helps when thinking about Social Intelligence in general. "How do I make my AI behave better and less annoying" is the initial problem. It took us a while to categorize failure modes, understand the different dimensions of social intelligence and build solutions for them. The split makes it easier to understand, debug and talk about :))
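A rough sketch of what "composed by default, usable independently" could look like. Everything here is a toy and an assumption: the heuristics, the `Signal` type, and the averaging rule are invented for illustration and say nothing about how Humalike's components work internally.

```python
from dataclasses import dataclass
from typing import Callable

Transcript = list[str]


@dataclass(frozen=True)
class Signal:
    component: str
    speak_score: float  # 0.0 = stay silent, 1.0 = speak
    note: str


def toy_theory_of_mind(transcript: Transcript) -> Signal:
    # Toy heuristic: an open question in the last message suggests someone wants an answer.
    wants_answer = bool(transcript) and transcript[-1].rstrip().endswith("?")
    note = "last message is a question" if wants_answer else "no open question"
    return Signal("tom", 1.0 if wants_answer else 0.2, note)


def toy_norms(transcript: Transcript) -> Signal:
    # Toy heuristic: when most messages are very short, the room is in banter mode.
    short = sum(1 for m in transcript if len(m.split()) <= 5)
    banter = bool(transcript) and short > len(transcript) / 2
    note = "short-message banter" if banter else "longer-form discussion"
    return Signal("norms", 0.3 if banter else 0.7, note)


Component = Callable[[Transcript], Signal]


def decide_turn(
    transcript: Transcript, components: list[Component], threshold: float = 0.5
) -> tuple[bool, list[Signal]]:
    """Average the component scores and keep each signal for later debugging."""
    signals = [c(transcript) for c in components]
    mean = sum(s.speak_score for s in signals) / len(signals)
    return mean >= threshold, signals


if __name__ == "__main__":
    chat = ["lol", "same", "anyone know the deploy command for staging?"]
    speak, signals = decide_turn(chat, [toy_theory_of_mind, toy_norms])
    print(speak, [(s.component, s.note) for s in signals])
    # The same component can also run on its own, e.g. over a finished transcript.
    print(toy_theory_of_mind(["lol", "same", "brb"]).note)
```

Returning the per-component signals next to the decision is the point of the split: when a session goes badly, you can see which component pushed the agent to speak.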

The "categorize failure modes first, then build solutions" progression is exactly the shape of good infrastructure work. Feels obvious in retrospect and impossible in advance, most teams skip that step and end up with a monolithic "make it feel more human" prompt that can't be debugged when it breaks. Splitting the components so you can isolate which one failed on a bad session is the debugging superpower that only shows up if you did the taxonomy work first.

The ToM-in-turn-taking-under-the-hood detail is the honest architecture answer. Bundled feature that also ships as an independent module is the sweet spot, teams get the composed behavior by default but can dig deeper when they need to. That's the pattern I keep seeing work across infrastructure categories.

Curious about the eval side. Building a benchmark like LoSoNA for social norms feels genuinely hard, norms are context-dependent by definition, so any benchmark risks either overfitting to a specific culture or being so generic it doesn't measure anything real. How did you handle that tradeoff, is LoSoNA weighted toward a specific cultural context, or did you build it with explicit deltas per region/community type?

   Really appreciate this - you described the motivation very well.

On LoSoNA: we didn’t try to build a “universal norm” benchmark, because that would miss the point. The benchmark is about whether a model can infer a local norm from the conversation and adapt to it when that norm differs from the default assistant behavior.

So the unit we test is not “does the model know what is polite globally?” but “given this group’s demonstrated behavior, can it predict what response fits here?”

That also makes the cultural tradeoff more manageable. Long-term, the right direction is definitely broader coverage across regions, communities, domains, and communication styles, but the core eval is about local group norm inference, not a fixed list of norms.
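As a sketch of that eval unit: each item pairs a group's demonstrated behavior with candidate replies, one that fits the local norm and one in default assistant style. The item format, field names, and metrics below are assumptions made for illustration, not LoSoNA's actual schema or scoring, and the two items are invented examples.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NormItem:
    context: list[str]     # the group's demonstrated behavior
    candidates: list[str]  # possible next replies
    local_fit: int         # index of the reply that fits this group's norm
    default_style: int     # index of the generic assistant-style reply


# A chooser stands in for the model under test: it picks a candidate index.
Chooser = Callable[[list[str], list[str]], int]


def evaluate(items: list[NormItem], choose: Chooser) -> dict[str, float]:
    """Score how often the chooser fits the local norm vs. falls back to the default."""
    if not items:
        raise ValueError("need at least one item")
    picks = [choose(item.context, item.candidates) for item in items]
    n = len(items)
    return {
        "local_fit_rate": sum(p == i.local_fit for p, i in zip(picks, items)) / n,
        "default_fallback_rate": sum(p == i.default_style for p, i in zip(picks, items)) / n,
    }


def always_longest(context: list[str], candidates: list[str]) -> int:
    # Baseline that ignores the context and picks the most verbose reply.
    return max(range(len(candidates)), key=lambda k: len(candidates[k]))


ITEMS = [
    NormItem(
        context=["friday?", "works for me", "same"],
        candidates=[
            "yep friday works",
            "Certainly! I would be delighted to confirm that Friday works well.",
        ],
        local_fit=0,
        default_style=1,
    ),
    NormItem(
        context=["port for staging?", "8443", "thx"],
        candidates=["Hello everyone! Great question. The port is 8443.", "8443"],
        local_fit=1,
        default_style=0,
    ),
]

if __name__ == "__main__":
    print(evaluate(ITEMS, always_longest))
```

Tracking the fallback rate separately from the fit rate shows whether a miss came from reverting to default assistant behavior or from misreading the group in some other way.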

The reframe from "universal norm" to "local inference" is what makes this actually useful. Universal-norm benchmarks always end up testing whether a model matches western-professional defaults, which just measures how well the training data aligned with a specific cultural assumption. Local-inference-from-demonstrated-behavior is the harder and more honest problem and it means the model can be wrong in a specific group without being wrong in general.

The follow-up I'd ask is about eval robustness. When you're testing local norm inference, the model's success depends on the group's demonstrated behavior being consistent enough to actually signal a norm. In real groups the demonstrated behavior is noisy, new members break patterns, subgroups have different sub-norms, and sometimes the loudest members set a norm the quiet majority doesn't share. Does LoSoNA handle noisy signal within a group, or is the eval mostly on cleaner group examples where the norm reads clearly?

Not trying to poke holes, genuinely curious because it's the class of problem where the eval design determines whether the metric tracks with real usage.

 Totally!! Theory of mind can be used as a solo component, but it also complements Turn-taking perfectly! Thanks for the supp!

Turn-taking feels like the sharp wedge here because a group-chat agent can be technically right and still hurt the conversation by speaking at the wrong moment. I like that you are treating social timing as infrastructure instead of another prompt rule. What kinds of examples do you show builders when an agent should wait, interrupt, or hand the floor to someone else?

 Thanks for your support!

 100%! This is a hard problem to solve. Agents will multiply, but their behavior is not improving at all. Any example with and without Humalike works here, the difference is huge.

Congrats on #2! The turn-taking piece feels underrated - a lot of agents are smart enough, but still don’t know when to speak, pause, or read the room.

 Thanks for the support!

 Totally!! tysm for the supp


Congrats! How does social memory balance personalization with user privacy? Can users control what the agent remembers?


 Good question, we didn't see it as a requirement from the start. Can you share more info about the use case or how you imagine it?

 Thank you very much! That's a really great question!!

Social intelligence feels like a missing layer for many agents. Tool use is getting better, but reading context, timing, and group dynamics is what makes an AI teammate actually feel usable.

 Couldn't agree more. All of the APIs are supposed to make an agent feel like a teammate instead of a bot. That's exactly the gap we're building for, and it only gets harder the moment there's more than one person in the room.

Always happy to chat more about it!

 100%! tysm for the supp :))