Democratizing dataset influence on model performance

by Victor Strandmoe

AI teams are data-constrained, not model-constrained, and they waste millions retraining models on data that has little or even negative impact.

They spend most of their budget collecting, processing, and labeling data without knowing what actually improves performance.

This leads to repeated failed retraining cycles, wasted GPU runs, and slow iteration, because teams lack insight into which datasets improve the model and which degrade it.

Influence-guided training has been shown to halve convergence time. Dowser by Durinn tells AI teams which training data improves model performance and which data hurts it, democratizing what the big model providers already do.

How it works

Teams define a target capability or task → Dowser identifies high-impact datasets from Hugging Face and suggests optimized training directions.
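
For illustration, this is roughly the shape of the flow from a team's side. The endpoint, payload fields, and response format below are placeholders for the sketch, not the actual Dowser API:

```python
# Illustrative client sketch only -- endpoint, fields, and response shape are
# placeholders, not the real Dowser API.
import requests

payload = {
    "target_capability": "multi-step math word problems",
    "max_datasets": 10,
}

resp = requests.post("https://example.invalid/dowser/rank-datasets",
                     json=payload, timeout=30)
resp.raise_for_status()

# Assumed response: Hugging Face dataset IDs ranked by estimated influence,
# with a suggested weight for the training mix.
for entry in resp.json()["datasets"]:
    print(entry["hf_id"], entry["influence_score"], entry["suggested_mix_weight"])
```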

Why now?

  • Training costs are exploding while performance gains are flattening

  • Synthetic data is increasingly contaminating training pipelines

  • Teams need precision, not more data

  • Influence methods are now viable via proxy models and distillation (minimal sketch after this list)
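
A minimal sketch of the underlying idea: gradient-alignment (TracIn-style) influence scored with a small proxy model instead of the full model. The proxy, loss, and data below are illustrative stand-ins, not our production pipeline:

```python
# Minimal sketch of proxy-model influence scoring (TracIn-style gradient
# dot products). Proxy model, loss, and data here are illustrative stand-ins.
import torch
import torch.nn as nn

def flat_grad(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. trainable params."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(proxy, loss_fn, candidates, target_examples):
    """Dot product between each candidate's gradient and the aggregate gradient
    on the target task: positive = likely helps the target capability,
    negative = likely hurts it, as judged by the proxy model."""
    target_grad = torch.stack(
        [flat_grad(proxy, loss_fn, x, y) for x, y in target_examples]).sum(dim=0)
    return [torch.dot(flat_grad(proxy, loss_fn, x, y), target_grad).item()
            for x, y in candidates]

# Tiny end-to-end example with random data, purely to show the mechanics.
proxy = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()
candidates = [(torch.randn(1, 8), torch.tensor([0])) for _ in range(5)]
target = [(torch.randn(1, 8), torch.tensor([1])) for _ in range(3)]
print(influence_scores(proxy, loss_fn, candidates, target))
```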

Market

  • Every company training or fine-tuning LLMs

  • 59% of AI budgets go to training data

  • 40% of firms spend over 70% of their AI budget on data

https://durinn-concept-explorer.azurewebsites.net/

Replies

yama

Data quality is often the bottleneck that teams overlook. This approach to identifying high-impact datasets could save a lot of wasted compute cycles. I'm curious about how Dowser handles cases where a dataset's influence varies depending on the existing training data composition - does it account for those interaction effects?

Victor Strandmoe

Thanks for your comment @yamamoto7 

Yes, but with an important caveat. Influence is always conditional on the current model and the concept directions. In our case, concepts are generated first and then used to define the projection directions for influence. If the concept definitions change, the influence scores change. We don’t model explicit dataset–dataset interactions, but redundancy and diminishing returns emerge naturally relative to the current data mix. All of this can be exposed via an API, where teams send representative training samples and recompute influence under different compositions using proxy models, without full retrains.
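
Roughly what that looks like, illustrative only: the concept-direction construction is simplified, and `flat_grad` is the per-example gradient helper from the sketch in the launch text above.

```python
# Illustrative only: concept-conditioned influence via projection onto a
# concept direction in the proxy model's parameter space.
# flat_grad is the per-example gradient helper defined in the earlier sketch.
import torch

def concept_direction(proxy, loss_fn, concept_examples):
    """Unit direction built from the aggregate gradient over examples that
    exemplify the concept. Change the concept definition and this direction
    changes, and with it every influence score projected onto it."""
    grads = [flat_grad(proxy, loss_fn, x, y) for x, y in concept_examples]
    direction = torch.stack(grads).sum(dim=0)
    return direction / direction.norm()

def conditional_influence(proxy, loss_fn, candidates, direction):
    """Projection of each candidate's gradient onto the concept direction.
    Scores are conditional on the proxy (i.e. the current data mix it reflects),
    so recomputing under a different composition only needs a cheap proxy
    update, not a full retrain."""
    return [torch.dot(flat_grad(proxy, loss_fn, x, y), direction).item()
            for x, y in candidates]
```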

yama

@durinn That makes sense - using proxy models to estimate influence without full retrains sounds like a practical approach for iteration speed. The API design you mentioned could be really valuable for teams experimenting with different data mixes. Thanks for the detailed explanation.

Victor Strandmoe

@yamamoto7 We believe so too.

Do you know of a team that might be interested in trying it out? Feel free to test it yourself by entering whatever you're looking to train for. Link in the launch.

(Professional) feedback is appreciated, as we're currently raising.