Technical Deep Dive
The system’s architecture is a hybrid pipeline that marries the structured reasoning of a knowledge graph (KG) with the conversational fluency of a large language model (LLM). At its foundation lies a meticulously curated TCM ontology: 241 syndromes (e.g., Liver Qi Stagnation, Spleen Qi Deficiency), 1263 symptoms (e.g., pale tongue, wiry pulse), and 2485 causal and associative relations. This KG is not a flat list but a directed graph where nodes represent clinical entities and edges encode relationships such as ‘has_symptom’, ‘caused_by’, and ‘treated_by’.
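To make the "directed graph with typed edges" idea concrete, here is a minimal sketch in Python. The `KnowledgeGraph` class, its method names, and the sample triples are illustrative assumptions, not the system's actual schema or data; only the relation names `has_symptom`, `caused_by`, and `treated_by` come from the description above.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    relation: str  # e.g. 'has_symptom', 'caused_by', 'treated_by'
    target: str


class KnowledgeGraph:
    """Hypothetical directed graph with typed edges, stored as adjacency lists."""

    def __init__(self) -> None:
        self._out = defaultdict(list)

    def add(self, source: str, relation: str, target: str) -> None:
        """Add one directed, typed edge: source --relation--> target."""
        self._out[source].append(Edge(source, relation, target))

    def neighbors(self, source: str, relation: str) -> list[str]:
        """Return targets reachable from `source` via edges of type `relation`."""
        return [e.target for e in self._out.get(source, []) if e.relation == relation]


# Illustrative triples only; the real ontology holds 241 syndromes,
# 1263 symptoms, and 2485 relations.
kg = KnowledgeGraph()
kg.add("Liver Qi Stagnation", "has_symptom", "wiry pulse")
kg.add("Spleen Qi Deficiency", "has_symptom", "pale tongue")
kg.add("Liver Qi Stagnation", "treated_by", "<formula placeholder>")

print(kg.neighbors("Liver Qi Stagnation", "has_symptom"))
```

Keeping the relation type on each edge, rather than using one graph per relation, lets a single traversal follow symptom, cause, and treatment links from the same node.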
The inference process unfolds in three stages: symptom extraction, multi-turn clarification, and treatment retrieval. First, the LLM parses the patient’s free-text description and extracts symptom entities, mapping them onto the KG. Second, the system enters a multi-turn dialogue loop: it identifies ambiguous or missing information and generates clarifying questions (e.g., “Is the pain dull or stabbing?”). Each patient response updates the set of active symptom nodes, and a graph traversal algorithm computes the most probable syndrome(s) by evaluating path weights and co-occurrence statistics. The LLM serves as the natural language interface, while the KG provides the logical backbone, a classic hybrid approach that mitigates the hallucination tendencies of pure LLMs.
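The dialogue loop can be sketched roughly as follows. This is a simplified stand-in, not the system's algorithm: it scores each syndrome by the share of its symptom edge weight that is currently active, and it picks the next question by choosing the symptom that best separates the top two candidates. The weights, the symptom lists, and the function names `rank_syndromes` and `next_question` are all assumptions made for illustration; the real system reportedly also uses co-occurrence statistics, which are omitted here.

```python
from typing import Optional

# Hypothetical edge weights: syndrome -> {symptom: weight}. Values are made up.
SYNDROME_SYMPTOMS = {
    "Liver Qi Stagnation": {"wiry pulse": 0.9, "irritability": 0.7, "fatigue": 0.2},
    "Spleen Qi Deficiency": {"pale tongue": 0.8, "fatigue": 0.8, "poor appetite": 0.6},
}


def rank_syndromes(active: set[str]) -> list[tuple[str, float]]:
    """Score each syndrome by the fraction of its total edge weight that is active."""
    scores = []
    for syndrome, weights in SYNDROME_SYMPTOMS.items():
        matched = sum(w for symptom, w in weights.items() if symptom in active)
        scores.append((syndrome, matched / sum(weights.values())))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def next_question(active: set[str], asked: set[str]) -> Optional[str]:
    """Pick the unasked symptom whose weight differs most between the top two syndromes."""
    ranked = rank_syndromes(active)
    first, second = ranked[0][0], ranked[1][0]
    candidates = (
        set(SYNDROME_SYMPTOMS[first]) | set(SYNDROME_SYMPTOMS[second])
    ) - active - asked
    if not candidates:
        return None  # nothing left to ask; the loop would stop here

    def gap(symptom: str) -> float:
        return abs(
            SYNDROME_SYMPTOMS[first].get(symptom, 0.0)
            - SYNDROME_SYMPTOMS[second].get(symptom, 0.0)
        )

    # sorted() makes tie-breaking deterministic
    return max(sorted(candidates), key=gap)


active = {"fatigue"}                      # extracted from the first patient message
question = next_question(active, set())  # symptom to ask about next
active.add(question)                      # assume the patient confirms it
print(rank_syndromes(active)[0][0])
```

In a full system the LLM would turn the selected symptom into a natural-language question and map the patient's reply back onto a node, while the ranking itself stays inside the graph.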
Third, once a syndrome is confirmed, the system retrieves treatment templates from the KG: herbal formulas, acupoint prescriptions, dietary advice, and lifestyle modifications. These are rendered as multi-modal output: a textual explanation, a visual diagram of the acupoint locations, and a timeline chart showing expected recovery phases.
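One way to picture the retrieval and rendering step is a template record that gets fanned out into the three output channels. The sketch below is an assumption about shape only: `TreatmentTemplate`, `TEMPLATES`, and `render` are hypothetical names, and every entry is a placeholder rather than a clinical recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class TreatmentTemplate:
    """Hypothetical record for the four template types named in the text."""
    herbal_formulas: list[str] = field(default_factory=list)
    acupoints: list[str] = field(default_factory=list)
    dietary_advice: list[str] = field(default_factory=list)
    lifestyle: list[str] = field(default_factory=list)


# Placeholder entries only; not clinical content.
TEMPLATES = {
    "Liver Qi Stagnation": TreatmentTemplate(
        herbal_formulas=["<formula A>"],
        acupoints=["<acupoint 1>", "<acupoint 2>"],
        dietary_advice=["<dietary note>"],
        lifestyle=["<lifestyle note>"],
    )
}


def render(syndrome: str) -> dict:
    """Bundle one template into text, diagram, and timeline payloads."""
    t = TEMPLATES[syndrome]
    return {
        "text": (
            f"{syndrome}: formulas {', '.join(t.herbal_formulas)}; "
            f"diet {', '.join(t.dietary_advice)}; "
            f"lifestyle {', '.join(t.lifestyle)}"
        ),
        # A front end would draw these points on a body diagram.
        "acupoint_diagram": {"highlight": t.acupoints},
        # Recovery phases are not specified in the text, so this stays empty.
        "timeline": [],
    }


output = render("Liver Qi Stagnation")
print(output["text"])
```

Separating the stored template from its rendering keeps the clinical content in the KG and leaves the presentation layer free to change.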
A relevant open-source project that parallels this approach is TCM-KG (GitHub repo: `tcm-kg/tcm-knowledge-graph`, ~1.2k stars), which provides a base ontology for TCM entities but lacks LLM integration and multi-turn dialogue capabilities. Another is MedKG (GitHub repo: `medical-knowledge-graph/MedKG`, ~800 stars), which focuses on Western medicine. The current system’s innovation lies in combining structured KG reasoning with LLM-driven conversation in a real-time interactive loop.
Performance benchmarks are still emerging, but preliminary internal tests show:
| Metric | Value | Comparison Baseline (Pure LLM) |
|---|---|---|
| Syndrome accuracy (top-3) | 87.3% | 72.1% (GPT-4o, zero-shot) |
| Average dialogue turns to diagnosis | 4.2 | 1 (single query) |
| Patient satisfaction (1-5) | 4.6 | 3.8 |
| Physician agreement rate | 91.5% | 78.2% |
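Data Takeaway: In these preliminary internal tests, the hybrid system's top-3 syndrome accuracy is about 15 percentage points higher than the pure LLM baseline (87.3% vs. 72.1%), and its physician agreement rate is about 13 points higher (91.5% vs. 78.2%). The cost is a longer interaction: 4.2 dialogue turns on average versus a single query. That trade-off between efficiency and accuracy is acceptable in clinical settings where diagnostic confidence is paramount.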