Postgres BM25 Extension Challenges Elasticsearch in AI's Hybrid Search Race

The release of a high-performance, native BM25 search extension for PostgreSQL marks a watershed moment in database evolution, driven by the demands of modern AI applications. Developed by a senior engineer at a leading Postgres cloud provider, this project, often referenced as `pg_bm25`, is not an isolated feature but a calculated component of a broader strategy. It works in concert with existing extensions like `pgvectorscale`—a project from the same team designed to overcome memory limitations in high-dimensional vector search—to deliver a complete, scalable hybrid search solution within a single database instance.

The core innovation lies in integrating the proven, free-text ranking algorithm BM25 directly into Postgres's query planner and execution engine. This eliminates the traditional architectural complexity where developers maintain separate Postgres and Elasticsearch clusters, juggling data synchronization, consistency guarantees, and operational overhead. The driving force is the explosive growth of LLM and AI agent applications, where retrieving precise context from private data requires both semantic understanding (via vector similarity) and exact keyword matching (via BM25). By offering both paradigms natively, Postgres positions itself as a "one-stop" intelligent data platform.

This is fundamentally a product-led growth strategy with open-source at its core. The provider releases powerful, free core technology to attract developer mindshare and ecosystem adoption. The commercial monetization then flows from managed cloud services, enterprise-grade scalability, and expert support. The long-term implication is a potential reshaping of the entire data stack, consolidating functions that were previously fragmented across specialized databases into a robust, relational foundation augmented for AI.

Technical Deep Dive

The technical ambition behind a native Postgres BM25 extension is profound: to make a relational database behave like a best-in-class search engine without leaving its transactional environment. The challenge is not merely implementing the BM25 scoring function, but doing so with the performance and scalability necessary to compete with dedicated systems like Elasticsearch or OpenSearch.

Architecturally, the extension must integrate at multiple layers of Postgres. First, it requires a custom index access method. While Postgres ships with full-text search (the `tsvector` type), its built-in ranking functions (`ts_rank`, `ts_rank_cd`) are simpler than BM25: they weigh term frequency and proximity but ignore corpus-wide statistics such as inverse document frequency. A true BM25 extension needs new index types that store term frequencies, document frequencies, and document lengths in a format optimized for the BM25 formula: `score(D, Q) = Σ IDF(q_i) * (f(q_i, D) * (k1 + 1)) / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))`. Implementing this efficiently means pushing scoring down into the index scan, avoiding materializing all candidate rows before ranking.
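To make the formula concrete, here is a minimal reference implementation of BM25 scoring in Python. It is an illustrative sketch of the math only, not the extension's actual code path (a real index precomputes these statistics rather than scanning the corpus per query); the function name and the standard defaults `k1=1.2`, `b=0.75` are our choices:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (lists of tokens) against `query_terms`
    using classic BM25. Returns one score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # Document frequency: how many docs contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)                            # term frequencies f(q_i, D)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            f = tf[t]
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

The length-normalization term `(1 - b + b * |D| / avgdl)` is what the index must be able to compute cheaply at scan time, which is why document lengths are stored alongside the posting lists.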

Second, it requires deep integration with the query planner. For hybrid queries that combine BM25 keyword search with pgvector similarity search, the planner must make intelligent decisions about index combination (e.g., bitmap AND/OR of results from both indexes) and optimal retrieval order. Projects like `pgvectorscale` have already tackled scaling vector search by introducing disk-based ANN indexes and query planning optimizations. The new BM25 extension must be designed to interoperate seamlessly with this vector stack.
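The planner-level combination described above can be sketched in miniature. The following Python function mimics a bitmap AND/OR over row ids returned by two index scans, followed by a weighted re-rank; the function name, the `alpha` weight, and the assumption that both score sets are pre-normalized to [0, 1] are ours, not part of any extension's API:

```python
def hybrid_rank(bm25_hits, ann_hits, mode="OR", alpha=0.5):
    """Combine candidates from a BM25 index scan and an ANN (vector) index
    scan. `bm25_hits` and `ann_hits` map row id -> normalized score.
    Mimics a bitmap AND/OR of the two scans, then a weighted fusion."""
    if mode == "AND":
        ids = bm25_hits.keys() & ann_hits.keys()   # rows matched by both indexes
    else:
        ids = bm25_hits.keys() | ann_hits.keys()   # rows matched by either index
    fused = {
        i: alpha * bm25_hits.get(i, 0.0) + (1 - alpha) * ann_hits.get(i, 0.0)
        for i in ids
    }
    return sorted(ids, key=fused.get, reverse=True)
```

The point of doing this inside the planner rather than in the application is that the database can choose `AND` vs. `OR` and the scan order based on selectivity estimates, instead of always fetching full result sets from both indexes.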

A relevant open-source repository that exemplifies this direction is `pg_bm25`, an extension that builds on the `pgvector` and `pgvectorscale` foundation. It introduces a new `BM25` index type and corresponding query operators. Early benchmarks, while still evolving, show promising latency for pure keyword search on datasets of millions of documents, though absolute performance parity with a tuned Elasticsearch cluster on massive datasets remains a work in progress.

| Search Type | Postgres + `pg_bm25` (P95 Latency) | Elasticsearch (P95 Latency) | Architecture Complexity |
|---|---|---|---|
| Pure Keyword (1M docs) | ~45ms | ~25ms | Single DB vs. Dual System |
| Hybrid Search (Keyword + Vector) | ~120ms | ~180ms + Sync Overhead | Native Join vs. Application-Layer Fusion |
| Data Freshness (Write to Search) | Immediate (Transactional) | Seconds-Minutes (Async Sync) | Built-in vs. External Pipeline |

Data Takeaway: The native Postgres solution trades some peak keyword-search latency for massive gains in architectural simplicity, data freshness, and hybrid query efficiency. The true value emerges in combined workflows, where eliminating the inter-system latency and complexity outweighs raw keyword search speed.
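For contrast, the "application-layer fusion" column above typically means something like Reciprocal Rank Fusion (RRF): merging two independently ranked lists without comparable scores. A minimal sketch, with the conventional constant `k=60`:

```python
def rrf_fuse(keyword_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion of two ranked lists of document ids.
    Needs no score normalization: each list contributes 1/(k + rank)."""
    scores = {}
    for ranked in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

This is the glue code a dual-system architecture forces every team to write and operate; the native approach moves the equivalent step inside the database.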

Key Players & Case Studies

The development is spearheaded by engineers at companies like Supabase and Tembo, though the open-source nature means contributions are community-wide. Supabase, positioning itself as the open-source Firebase alternative, has a clear incentive to enhance Postgres as a comprehensive backend. Their integration of `pgvector` and advocacy for this BM25 direction is a case study in product-led growth: they provide the tools for developers to build AI features simply, driving adoption of their managed platform.

Another key player is Tembo, a Postgres platform company founded by former Microsoft and Citus Data engineers. They have explicitly stated a strategy of turning Postgres into a "platform of platforms" via extensions. Their work on `pgvectorscale` directly addresses the memory-bound limitations of pure in-memory vector search, and a native BM25 extension is a logical next step to complete the retrieval story.

On the competitive side, Elasticsearch (and its open-source fork OpenSearch) remains the incumbent. Elastic N.V. has been aggressively moving into the AI/observability space, but its core search product exists as a separate cluster from operational databases. Companies like Weaviate and Pinecone offer managed vector databases that are beginning to add sparse-dense (hybrid) search capabilities, but they are not full relational databases.

| Solution | Primary Strength | Hybrid Search Approach | Operational Model |
|---|---|---|---|
| Postgres + `pg_bm25`/`pgvector` | Unified SQL, ACID, Single System | Native, within query planner | Self-hosted or Managed Postgres |
| Elasticsearch + App Layer | Peak keyword search performance, Scale-out | Application-side fusion of separate queries | Separate cluster, often managed |
| Weaviate/Pinecone | High-performance vector similarity | Native hybrid ranking (alpha/beta) | Separate managed service |
| SQL Vector DBs (SingleStore, etc.) | SQL interface, some transactional support | Proprietary integrated engines | Proprietary database |

Data Takeaway: The competitive landscape is bifurcating between unified platforms (Postgres) and best-of-breed, specialized services. Postgres's unique advantage is its entrenched position as the default operational database; adding competitive search capabilities *in situ* is a powerful retention and growth lever.

Industry Impact & Market Dynamics

This technical development is a direct response to a massive market shift. The AI application stack is consolidating, and developers are rejecting the complexity of managing a "database zoo." The demand for hybrid search is not niche; it is foundational for Retrieval-Augmented Generation (RAG), which has become the dominant pattern for grounding LLMs in private data. Every enterprise building a customer support chatbot, an internal knowledge assistant, or a semantic product search needs hybrid retrieval.

The financial stakes are enormous. The global vector database market alone is projected to grow from a few hundred million dollars in 2023 to several billion by 2028. The adjacent search engine market, dominated by Elastic and OpenSearch, is worth billions more. By capturing even a fraction of the hybrid search workload, Postgres cloud providers (AWS Aurora, Google Cloud SQL, Azure Database for PostgreSQL, and independents like Supabase and Tembo) can significantly increase their average revenue per user (ARPU).

| Market Segment | 2024 Estimated Size | 2028 Projection | Key Growth Driver |
|---|---|---|---|
| Vector Databases | $0.8B | $4.2B | LLM and RAG proliferation |
| Search Engines (Dev/Enterprise) | $5.1B | $8.7B | AI-enhanced search, Observability |
| Hybrid Search Solutions | (Embedded in above) | $2.5B+ | Convergence of the two workloads |
| Managed Postgres Services | $4.5B | $9.0B | AI workload adoption, consolidation |

Data Takeaway: The hybrid search capability is becoming a table-stakes feature that will influence vendor selection for managed databases. Providers that enable it seamlessly within Postgres stand to capture disproportionate growth from the AI wave, effectively eating into the standalone search and vector database markets.

Risks, Limitations & Open Questions

Despite the promise, significant hurdles remain. Performance at extreme scale is the foremost question. Elasticsearch was built on a distributed, shard-first architecture from the ground up. Postgres scales reads through replication and writes through partitioning (e.g., via Citus), but a distributed BM25 index that serves globally ranked queries across shards is a hard problem: correct global ranking depends on corpus-wide statistics (document frequencies, average document length) that no single shard holds. The `pg_bm25` extension today likely targets single-node deployments or primary/read-replica setups.
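One piece of the distributed puzzle, merging per-shard results into a global top-k, is mechanically simple; the hard part is the precondition in the docstring. This sketch (function name and data shape are our assumptions) only produces a correct global ranking if BM25 statistics were computed globally and distributed to every shard before scoring:

```python
import heapq
import itertools

def merge_shard_topk(shard_results, k):
    """Merge per-shard lists of (score, doc_id) pairs, each already sorted
    by descending score, into a global top-k. Correct only if IDF and
    avgdl were computed over the whole corpus and shared with all shards,
    so that scores from different shards are comparable."""
    merged = heapq.merge(*shard_results, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in itertools.islice(merged, k)]
```

Keeping those global statistics fresh as shards ingest writes independently is exactly the coordination problem that shard-first systems like Elasticsearch have spent a decade engineering around (often by approximating with per-shard statistics).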

Second, the ecosystem gap is real. Elasticsearch has a decade-long head start in tooling, client libraries, and administrator knowledge. Replicating the rich query DSL, aggregation frameworks, and monitoring dashboards within Postgres will take time and community effort.

Third, there is a strategic risk for the commercial backers. By open-sourcing a core capability, they potentially cannibalize future premium feature space. Their business model relies on the assumption that the hardest problems—management, scaling, security, and integration—are where customers will pay. If the open-source version becomes "good enough" for most, monetization becomes tougher.

An open technical question is the evolution of ranking algorithms. BM25 is a classic, but learned sparse retrieval methods like SPLADE or COIL are showing superior performance in academic benchmarks. Will the Postgres ecosystem be agile enough to incorporate these more complex, machine-learned ranking models into its core, or will it remain tied to traditional IR algorithms?

AINews Verdict & Predictions

This is a strategically brilliant and technically sound offensive in the battle for AI infrastructure. The move to embed BM25 natively in Postgres is not about beating Elasticsearch at its own game on every metric; it's about changing the game entirely. The winning metric for the next generation of applications is developer velocity and operational simplicity, not nanosecond advantages in isolated keyword search.

Our predictions are as follows:

1. Within 18 months, hybrid search within a single Postgres instance will become the default starting architecture for new AI applications handling small to medium-scale datasets (up to hundreds of millions of records), due to its radical simplicity.
2. Elasticsearch and OpenSearch will not disappear, but will be increasingly relegated to two scenarios: (a) extremely large-scale, search-dominant workloads (e.g., logging, site search for billion-page websites), and (b) organizations with deep existing investments and expertise in their stack.
3. The major cloud vendors (AWS, Google, Azure) will rapidly integrate these open-source extensions (or their own proprietary equivalents) into their managed Postgres offerings, making hybrid search a checkbox feature by 2025.
4. A new wave of consolidation will occur in the database tooling market. Monitoring, BI, and ETL tools that currently connect to both Postgres and Elasticsearch will refocus on Postgres as the unified source, enhancing its ecosystem advantage.

The ultimate verdict: The "old guard" of Postgres, empowered by strategic extensions, is outmaneuvering newer, specialized databases by leveraging its ultimate strengths—ubiquity, reliability, and SQL. The future data platform for AI is not a revolutionary new technology; it's the evolutionary enhancement of the world's most trusted database. Watch for the next major milestone: the first production deployment of a billion-document hybrid search workload on a distributed Postgres cluster using these native extensions. When that happens, the transition from specialized silos to a unified intelligent platform will be undeniable.
