Kagi Snaps Redefines Search: When AI Learns to See and Understand Images

May 18, 2026, 05:31, AINews (Hacker News, May 2026)

Source: Hacker News Archive: May 2026

Kagi has introduced Snaps, a feature that embeds multimodal AI directly into the search pipeline. It lets the search engine interpret image content, produce context-aware summaries, and explain what an image means. This turns search from a retrieval tool into an understanding engine built on a subscription model.


Kagi, the subscription-based search engine known for its ad-free, privacy-first approach, has unveiled Snaps, a feature that fundamentally reimagines how search engines interact with visual data. Unlike conventional image search, which returns thumbnails matched against metadata and alt text, Snaps leverages multimodal large language models (MLLMs) to analyze the actual content of an image—objects, text, scenes, and even implied narratives—and returns a human-readable summary explaining what the image means and why it matters.

This is not a minor feature update; it is a structural rethinking of search's core logic. Traditional search treats images as files with tags and links, ranking them by keyword relevance and backlink authority. Snaps treats images as data to be understood. When a user searches for a complex scientific chart, a historical photograph, or a product diagram, Snaps returns not just the image but an AI-generated analysis: the chart's key trend, the photo's historical context, or the diagram's functional explanation.

The technical backbone must balance inference speed against depth of understanding, a trade-off that Kagi's subscription model is well placed to fund. Because it does not depend on ad impressions, Kagi can optimize for comprehension quality rather than page-view volume. Industry observers see Snaps as a harbinger of a broader paradigm shift: the future of search is less about finding what you asked for and more about the engine proactively telling you what it means. Kagi is betting that once AI learns to see, search becomes a tool for understanding rather than lookup.

Technical Deep Dive

Kagi Snaps represents a significant engineering departure from conventional image search architectures. Traditional systems like Google Images or Bing Visual Search rely on a pipeline of: (1) image ingestion with metadata extraction, (2) feature vector generation (e.g., via ResNet or CLIP embeddings), (3) approximate nearest neighbor search in a vector database, and (4) ranking based on text relevance signals and PageRank-style link authority. The image itself is never "understood"—it is matched.
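To make the "matched, not understood" point concrete, here is a minimal sketch of steps (2) through (4) using only the Python standard library. The three-dimensional vectors, the `text_scores` values, and the `alpha` blend weight are illustrative assumptions, and the exhaustive scan is a brute-force stand-in for a real approximate nearest neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_images(query_vec, index, text_scores, alpha=0.7, k=3):
    """Blend embedding similarity with a text-relevance score; return top-k ids.

    `alpha` is a hypothetical weight, not a value from any real search engine.
    """
    scored = []
    for image_id, vec in index.items():
        similarity = cosine(query_vec, vec)
        score = alpha * similarity + (1 - alpha) * text_scores.get(image_id, 0.0)
        scored.append((score, image_id))
    scored.sort(reverse=True)  # highest blended score first
    return [image_id for _, image_id in scored[:k]]

# Toy index: in practice these would be CLIP-style embeddings with hundreds of dimensions.
INDEX = {
    "chart.png": [0.9, 0.1, 0.0],
    "cat.jpg": [0.0, 0.2, 0.9],
    "map.png": [0.6, 0.6, 0.1],
}
TEXT_SCORES = {"chart.png": 0.8, "cat.jpg": 0.1, "map.png": 0.4}

results = rank_images([1.0, 0.0, 0.0], INDEX, TEXT_SCORES, k=2)
print(results)
```

Nothing in this path inspects what the image depicts: the ranking depends only on vector geometry and text signals.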

Snaps, by contrast, integrates a multimodal large language model (MLLM) directly into the search response path. When a user performs a search that returns images, Kagi's backend passes the top-ranked image candidates through an MLLM (likely a fine-tuned variant of a model like LLaVA or GPT-4V, though Kagi has not disclosed the exact model). The MLLM processes the image pixels jointly with the user's query text, generating a natural language summary that describes the image's content, extracts text (OCR), identifies objects and scenes, and infers context or narrative.
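A rough sketch of that response path might look like the following. Kagi has not published its implementation, so every name here (`ImageResult`, `Snap`, `annotate_top_images`, the prompt wording) is hypothetical, and `fake_describe` is a stub standing in for whatever multimodal model is actually called.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ImageResult:
    url: str
    rank_score: float  # score assigned by the ordinary retrieval stage

@dataclass
class Snap:
    url: str
    summary: str

def build_prompt(query: str) -> str:
    """Condition the model on the user's query (prompt wording is an assumption)."""
    return (
        f"The user searched for: {query!r}. "
        "Describe the image, transcribe any visible text, "
        "and explain how it relates to the search."
    )

def annotate_top_images(
    query: str,
    candidates: List[ImageResult],
    describe: Callable[[str, str], str],
    top_n: int = 3,
) -> List[Snap]:
    """Send only the top-ranked candidates through the (expensive) model."""
    ranked = sorted(candidates, key=lambda c: c.rank_score, reverse=True)
    prompt = build_prompt(query)
    return [Snap(c.url, describe(c.url, prompt)) for c in ranked[:top_n]]

def fake_describe(url: str, prompt: str) -> str:
    """Stub model call; a real system would run MLLM inference here."""
    return f"[stub summary of {url}]"

demo = annotate_top_images(
    "gdp growth chart",
    [ImageResult("a.png", 0.2), ImageResult("b.png", 0.9)],
    fake_describe,
    top_n=1,
)
print(demo)
```

Passing the model in as a callable keeps the retrieval logic independent of any particular provider, which matters for the vendor lock-in concern raised later in the article.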

The key engineering challenge is latency. A full MLLM inference pass on a high-resolution image can take 2–5 seconds even on optimized hardware. To make Snaps feel instantaneous, Kagi likely employs several optimizations:
- Speculative pre-caching: The system pre-generates summaries for the top-N images in the search index during idle compute cycles and caches them for instant retrieval.
- Adaptive resolution: Lower-resolution thumbnails are used for initial inference, with high-resolution passes triggered only for complex images (e.g., dense charts or text-heavy slides).
- Model distillation with a fallback cascade: A smaller, faster student model (e.g., 7B parameters) handles the majority of queries, while the larger teacher model (e.g., 70B) is invoked only for edge cases or when the student's confidence is low.
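The adaptive-resolution and small-model/large-model ideas in the list above can be combined into a single escalation routine. The sketch below is an assumption about how such routing could work, not a description of Kagi's system: the resolutions, the confidence threshold, and both stub models are hypothetical.

```python
from typing import Callable, Tuple

# A model takes (image_id, resolution) and returns (summary, confidence in [0, 1]).
Model = Callable[[str, int], Tuple[str, float]]

def summarize_with_cascade(
    image_id: str,
    small: Model,
    large: Model,
    low_res: int = 336,
    high_res: int = 1344,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Try the cheapest option first and escalate only when confidence is low."""
    summary, confidence = small(image_id, low_res)
    if confidence >= min_confidence:
        return summary, "small@low"
    # Adaptive resolution: retry the small model with more pixels.
    summary, confidence = small(image_id, high_res)
    if confidence >= min_confidence:
        return summary, "small@high"
    # Last resort: the larger, slower model.
    summary, _ = large(image_id, high_res)
    return summary, "large@high"

def small_stub(image_id: str, resolution: int) -> Tuple[str, float]:
    """Stub: dense charts need more pixels; 'hard' images defeat the small model."""
    if "hard" in image_id:
        return "unsure", 0.3
    if "chart" in image_id and resolution < 1000:
        return "blurry chart", 0.5
    return f"small summary of {image_id}", 0.9

def large_stub(image_id: str, resolution: int) -> Tuple[str, float]:
    return f"large summary of {image_id}", 0.95

for name in ("photo.jpg", "chart.png", "hard.png"):
    print(name, summarize_with_cascade(name, small_stub, large_stub))
```

The design choice is that each escalation step costs more, so most requests should exit at the first check. How well that works depends on how trustworthy the small model's confidence estimate is, which this sketch simply assumes.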

An open-source reference for this approach is the LLaVA repository (GitHub: haotian-liu/LLaVA, currently 20k+ stars), which demonstrates visual instruction tuning on multimodal data. Another relevant project is CogVLM (GitHub: THUDM/CogVLM, 15k+ stars), which uses a visual expert module to achieve deep image understanding. These repos provide the foundational architecture that a production system like Snaps would build upon, adding caching, latency optimization, and search-specific ranking integration.

| Optimization Technique | Latency Reduction | Quality Trade-off | Deployment Complexity |
|---|---|---|---|
| Speculative pre-caching | 60-80% | Stale summaries for rapidly changing images | Medium (requires cache invalidation logic) |
| Adaptive resolution | 40-60% | Minor accuracy loss on low-res passes | Low (simple resolution threshold) |
| Model distillation (7B vs 70B) | 50-70% | 5-10% accuracy drop on complex queries | High (requires training pipeline) |

Data Takeaway: Speculative pre-caching offers the best latency gains with the least quality degradation for static or slowly changing images, making it the most likely primary optimization in Kagi's stack. The trade-off is acceptable because most search images (e.g., historical photos, product shots) do not update frequently.
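The cache invalidation logic mentioned in the table could be as simple as the sketch below, which drops a cached summary when the image bytes change or when the entry is older than a time-to-live. The `SummaryCache` class and its parameters are hypothetical. A production system would more plausibly compare an HTTP `ETag` or a stored hash than re-read the image on every lookup.

```python
import hashlib
import time

class SummaryCache:
    """Pre-computed image summaries, invalidated by content change or age."""

    def __init__(self, ttl_seconds=86400.0, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # url -> (content_hash, stored_at, summary)

    @staticmethod
    def _digest(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def put(self, url, image_bytes, summary):
        """Store a summary generated ahead of time, e.g. during idle cycles."""
        self._entries[url] = (self._digest(image_bytes), self._clock(), summary)

    def get(self, url, image_bytes):
        """Return the cached summary, or None if missing, stale, or changed."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        digest, stored_at, summary = entry
        expired = self._clock() - stored_at > self._ttl
        changed = digest != self._digest(image_bytes)
        if expired or changed:
            del self._entries[url]
            return None
        return summary

# Usage with a fake clock so the example is deterministic.
now = [0.0]
cache = SummaryCache(ttl_seconds=10, clock=lambda: now[0])
cache.put("https://example.com/chart.png", b"v1", "Revenue rises after 2020.")
print(cache.get("https://example.com/chart.png", b"v1"))
```

A cache miss falls back to live inference, so rapidly changing images such as live news photos get little benefit from this approach.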

Key Players & Case Studies

Kagi is not the only player exploring multimodal search, but its approach is distinct in its subscription-based, ad-free model. Here is how the competitive landscape breaks down:

- Google Lens / Google Multisearch: Google's visual search tool uses a combination of OCR, object detection, and text-to-image matching. It excels at identifying objects and translating text, but it does not generate contextual summaries or explain "what this means." It remains ad-supported, meaning the user experience is optimized for click-through rates, not comprehension depth.
- Microsoft Bing Visual Search / Copilot: Bing integrates GPT-4V for image understanding within its Copilot chat interface. Users can upload images and ask questions. However, this is a conversational feature, not integrated into the core search results page. It requires explicit user action (uploading or clicking) and is not automatic for every image result.
- Perplexity AI: Perplexity's search engine uses multimodal models to answer queries with images, but its focus is on text-based answers with supporting images, not on analyzing images returned as search results.
- You.com: Offers multimodal capabilities within its chat interface, similar to Bing Copilot, but lacks automatic image analysis on search result pages.

| Feature | Kagi Snaps | Google Lens | Bing Visual Search | Perplexity AI |
|---|---|---|---|---|
| Automatic image analysis on SERP | Yes | No (requires click) | No (requires upload) | No |
| Contextual summary generation | Yes | No | Yes (in chat) | Yes (in answers) |
| Ad-free experience | Yes | No | No | Yes (paid tier) |
| Subscription required | Yes | No | No | Optional |
| Latency (typical) | ~1-2s | ~0.5s | ~2-5s | ~3-8s |

Data Takeaway: Kagi Snaps is the only product that automatically generates contextual summaries for every image in search results, without requiring user clicks or uploads. This creates a fundamentally different user experience—one where the engine proactively explains, rather than passively returns. The trade-off is higher latency and compute cost, which the subscription model directly subsidizes.

Industry Impact & Market Dynamics

The introduction of Snaps signals a broader shift in the search industry from "retrieval" to "understanding." This has profound implications for business models, user expectations, and competitive dynamics.

Business Model Shift: Traditional search engines (Google, Bing) are ad-supported, with revenue tied to page views and click-through rates. Understanding-oriented search reduces the need to click on links, potentially cannibalizing ad revenue. Kagi's subscription model ($10/month for standard, $25/month for professional) decouples revenue from engagement metrics, allowing it to optimize for answer quality. This is a structural advantage as AI-driven search becomes more capable.

Market Size and Growth: The global search engine market was valued at approximately $200 billion in 2024, with Google commanding over 90% of market share. However, the AI-native search segment (including Perplexity, You.com, and Kagi) is growing at 40-60% year-over-year, albeit from a small base. Kagi's subscriber base is estimated at 50,000–100,000 users, generating $6–12 million in annual recurring revenue. While tiny compared to Google, the growth trajectory is steep.
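The recurring-revenue range follows directly from the subscriber estimate if one assumes, for simplicity, that every subscriber pays the $10/month standard price. That assumption is ours: the real plan mix is not given here.

```python
def annual_recurring_revenue(subscribers, monthly_price_usd):
    """ARR = subscribers x monthly price x 12 months."""
    return subscribers * monthly_price_usd * 12

# Assumes every subscriber is on the $10/month standard plan.
arr_low = annual_recurring_revenue(50_000, 10)
arr_high = annual_recurring_revenue(100_000, 10)
print(f"${arr_low / 1e6:.0f}M to ${arr_high / 1e6:.0f}M")  # prints "$6M to $12M"
```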

| Metric | Google Search | Bing Search | Kagi | Perplexity AI |
|---|---|---|---|---|
| Monthly active users (MAU) | 4B+ | 1B+ | ~0.1M | ~10M |
| Revenue model | Advertising | Advertising | Subscription | Freemium + Subscription |
| 2024 estimated revenue | $240B | $15B | $10M | $50M |
| AI feature integration | Gemini (limited) | GPT-4V (Copilot) | MLLM (Snaps) | GPT-4 + Claude |
| User growth (YoY) | 2% | 3% | 150% | 200% |

Data Takeaway: Kagi's growth rate (150% YoY) far outpaces incumbents, but from a negligible base. The key question is whether the subscription model can scale beyond early adopters. If Kagi can demonstrate that users are willing to pay for understanding-oriented search, it could force larger players to reconsider their ad-supported models—or acquire Kagi outright.

Risks, Limitations & Open Questions

Despite its promise, Snaps faces several significant challenges:

1. Hallucination and Misinterpretation: MLLMs are prone to hallucinating details, especially in complex images like scientific charts or historical photographs. A Snaps summary that incorrectly interprets a data trend or misidentifies a historical figure could mislead users. Kagi must implement robust fact-checking layers or confidence scoring to mitigate this.

2. Latency vs. Quality Trade-off: As noted, Snaps requires significant compute. If Kagi cannot keep latency under 2 seconds, users may abandon the feature. The speculative caching approach works for static images but fails for real-time or rapidly changing content (e.g., live news photos).

3. Cost Scalability: Each Snaps query consumes GPU compute that costs roughly $0.001–$0.005 per image (based on current MLLM inference pricing from providers like Together AI or Replicate). For a user who views 50 images per day, that adds $0.05–$0.25 per user per day, or $1.50–$7.50 over a 30-day month. That is 15–75% of the $10/month standard subscription before any other costs are counted. To remain profitable, Kagi must either limit the number of Snaps per user, optimize model efficiency further, or raise prices.

4. Privacy Implications: Snaps requires sending image data to Kagi's servers for inference. While Kagi has a strong privacy stance (no tracking, no logs), users may be uncomfortable with their search images being processed by third-party AI models. Kagi must be transparent about data handling and offer opt-out options.

5. Dependence on Third-Party Models: If Kagi relies on an API from a provider like OpenAI or Anthropic for MLLM inference, it introduces vendor lock-in and cost volatility. A shift toward open-source models (e.g., LLaVA, CogVLM) could reduce dependence but may sacrifice quality.
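The per-user cost estimate in item 3 above can be reproduced with a few lines of arithmetic. The 30-day month and the choice to compare against the $10 standard plan are simplifying assumptions. The per-image prices and the 50 images per day are the figures quoted in that item.

```python
def monthly_inference_cost(images_per_day, cost_per_image_usd, days=30):
    """Inference spend per user per month, assuming every image is analyzed live."""
    return images_per_day * cost_per_image_usd * days

SUBSCRIPTION_USD = 10.0  # standard plan price quoted in the article

cost_low = monthly_inference_cost(50, 0.001)
cost_high = monthly_inference_cost(50, 0.005)

for label, cost in (("low", cost_low), ("high", cost_high)):
    share = cost / SUBSCRIPTION_USD
    print(f"{label}: ${cost:.2f}/month, {share:.0%} of the subscription")
```

The calculation ignores cache hits, so pre-computed summaries would lower the effective cost per viewed image.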

AINews Verdict & Predictions

Kagi Snaps is not just a feature; it is a statement of intent. It declares that the future of search is about understanding, not retrieval. This is a bet that users will pay for comprehension, and that the ad-supported model is fundamentally incompatible with deep AI integration.

Our Predictions:
1. Within 12 months, every major search engine will introduce a similar automatic image understanding feature. Google will likely integrate Gemini into its image search results, and Microsoft will expand Copilot's capabilities to cover all SERP images. The race to "understand" will replace the race to "index."

2. Kagi will face a scalability crunch within 18 months. The cost of running MLLM inference at scale will force either a price increase (to $15–$20/month) or a usage cap. The most likely outcome is a tiered pricing model where Snaps is limited on the basic plan and unlimited on the professional plan.

3. Open-source MLLMs will become the backbone of understanding-oriented search. Projects like LLaVA and CogVLM will see accelerated adoption as search engines seek to avoid vendor lock-in and reduce inference costs. We predict that Kagi will eventually switch to a self-hosted, fine-tuned open-source model to control costs and latency.

4. The biggest winner may not be Kagi but rather the infrastructure providers. Companies like Together AI, Replicate, and Fireworks AI that offer cheap, fast MLLM inference will see a surge in demand from search engines and content platforms.

5. By 2027, the phrase "search engine" will feel archaic. Users will expect every query—text, image, or voice—to return not just links but explanations, summaries, and actionable insights. Kagi Snaps is the first credible step toward that future.

The ultimate test is whether Kagi can turn its technical lead into a sustainable business. If it succeeds, it will have rewritten the rules of search. If it fails, it will have shown the industry what is possible—and that may be just as valuable.
