The problem with multi-model databases: Treating everything as a Document

by Farhan Syah

Ten years ago, the "database-per-microservice" / polyglot persistence model was the right way to build. But today, the bottleneck isn't the database engine; it's the glue.

What is the "ETL Tax"?

The ETL (Extract, Transform, Load) Tax is the hidden, compounding cost of moving data between specialized databases in a microservice/polyglot architecture. When your data is fractured across Postgres, Redis, Pinecone, and Neo4j, you don't just pay for the databases — you pay a massive "tax" to keep them synchronized.

It impacts engineering teams in three fatal ways:

  1. Engineering Time: Developers stop building product features and instead become "plumbers," writing brittle sync scripts, Debezium connectors, and Kafka pipelines.

  2. Data Consistency & Staleness: Teams end up chasing dual-write bugs, race conditions, and out-of-sync data (e.g., a user deletes their account in Postgres, but their embeddings still live in Pinecone); see the sketch after this list.

  3. Network Latency: You cannot do complex, real-time AI queries (like traversing a Graph and doing a Vector search) if the engines have to communicate over a network boundary.
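
To make the staleness problem concrete, here is a minimal sketch of the dual-write pattern from point 2. The helper functions are hypothetical stand-ins for real Postgres and Pinecone client code; the point is that the two writes share no transaction, so any failure between them leaves the stores disagreeing.

    # Hypothetical stand-ins for the real Postgres and Pinecone client calls.
    def postgres_delete_user(user_id: str) -> None:
        print(f"DELETE FROM users WHERE id = '{user_id}'")   # placeholder

    def pinecone_delete_embeddings(user_id: str) -> None:
        print(f"delete embeddings for user {user_id}")       # placeholder

    def delete_account(user_id: str) -> None:
        postgres_delete_user(user_id)        # write #1 commits
        # If the process crashes or the network drops right here,
        # write #2 never runs: the account is gone from Postgres,
        # but its embeddings keep showing up in vector search.
        pinecone_delete_embeddings(user_id)  # write #2

The usual mitigations (outbox tables, CDC connectors, reconciliation jobs) are exactly the plumbing work described in point 1.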

Trying to Escape the Duct-Tape Architecture

I wanted out of the duct-tape architecture, so I looked into the current wave of "multi-model" databases.

I have massive respect for SurrealDB. They proved that you can consolidate workloads, and their Developer Experience (DX) is arguably the best in the industry right now. But as a developer looking to deeply optimize my stack, I hit an architectural wall when I looked under the hood.

The "Wrapper" Penalty

These multi-model databases are essentially smart query/compute layers wrapped around generic KV storage backends (like RocksDB, TiKV, or FoundationDB).

This means that underneath the slick API, almost every type of data is ultimately serialized and stored as a Document/KV pair.

  • Have Time-Series/Analytics data? It's stored as a document.

  • Have Graph data? The nodes and edges are stored as documents.

The problem is that you cannot properly optimize database workloads if you don't control the underlying memory and disk layouts. If you want to do fast aggregations, you need a Columnar layout. If you want to do high-performance, deep graph traversals, you need a CSR (Compressed Sparse Row) layout. You simply cannot fake a true columnar engine or a native graph engine on top of a generic KV store without paying a severe performance penalty and relying heavily on network hops to the storage backend.
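
To illustrate the layout argument, here is a small self-contained sketch (the encodings are assumptions for illustration, not any particular database's on-disk format) contrasting a graph stored as per-node KV documents with the same graph in CSR form.

    import json

    # KV/document layout (assumed encoding): each node's adjacency list is a
    # serialized document behind its own key. Every hop costs a key lookup
    # plus deserialization; on a remote KV backend it can also cost a
    # network round trip.
    kv_store = {
        "node:0": json.dumps({"out": [1, 2]}),
        "node:1": json.dumps({"out": [2]}),
        "node:2": json.dumps({"out": [3]}),
        "node:3": json.dumps({"out": []}),
    }

    def neighbors_kv(node_id: int) -> list[int]:
        return json.loads(kv_store[f"node:{node_id}"])["out"]

    # CSR layout: two dense arrays. A node's neighbors are a contiguous
    # slice of `targets`, so deep traversals become cache-friendly scans
    # with no parsing and no per-node lookups.
    offsets = [0, 2, 3, 4, 4]   # node i's edges live at targets[offsets[i]:offsets[i+1]]
    targets = [1, 2, 2, 3]

    def neighbors_csr(node_id: int) -> list[int]:
        return targets[offsets[node_id]:offsets[node_id + 1]]

    assert neighbors_kv(0) == neighbors_csr(0) == [1, 2]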

Building a Native Engine

I realized that if I wanted a consolidated database without sacrificing the raw performance of specialized engines, I had to build the storage layers from scratch.

I spent the last year building NodeDB. It’s a distributed, multi-model database written in Rust.

Instead of being a wrapper, NodeDB implements native, specialized storage engines that all live within the same memory space:

  • A true Columnar engine for Time-Series and analytics.

  • A native Graph engine using CSR layouts.

  • Native Vector / AI search indexing.

  • Relational / Document engines.

Because these engines share the same memory space, there are no internal network hops. You can execute a single query that does a semantic vector search, traverses a graph to find related entities, and filters by a relational tenant ID, all at native speeds.
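
NodeDB's actual query interface isn't shown in this post, so the following is only a conceptual sketch with hypothetical names (not the real API of NodeDB or of any of the services mentioned above); it illustrates the structural difference between stitching the three steps across services and running them as in-process calls against co-located engines.

    # Conceptual sketch only; every name below is a hypothetical placeholder.

    def search_polyglot(vector_db, graph_db, sql_db, query_vec, tenant_id):
        ids = vector_db.search(query_vec)                # network hop 1: vector store
        related = graph_db.traverse(ids, depth=2)        # network hop 2: graph store
        return sql_db.filter_tenant(related, tenant_id)  # network hop 3: relational store

    def search_unified(db, query_vec, tenant_id):
        # Same three steps, but each call is an in-process function call
        # against engines sharing one memory space: no serialization or
        # network boundary between the steps.
        ids = db.vector.search(query_vec)
        related = db.graph.traverse(ids, depth=2)
        return db.relational.filter_tenant(related, tenant_id)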

For those of you who have moved to multi-model databases like SurrealDB or ArangoDB, how has the performance held up at scale when doing heavy analytics or deep graph traversals? Does the convenience of a unified API outweigh the underlying KV-storage penalty?
