Production-Grade RAG with NVIDIA Triton — Health-Aware Three-Path Failover
Production-grade retrieval-augmented generation powered by NVIDIA Triton Inference Server, with health-aware three-path failover so your trading agent always gets the best possible answer, even when the GPU is offline.
The Triton Retrieval Pipeline is FTH Trading's production RAG system built on NVIDIA Triton Inference Server. It provides health-aware three-path failover for retrieval-augmented generation:

- Path 1 (Vector + Rerank): full semantic search using NV-EmbedQA-E5-v5 for 1,024-dimensional embeddings with cosine similarity, followed by NV-RerankQA-Mistral-4B-v3 cross-encoder reranking for maximum relevance.
- Path 2 (Degraded Lexical Embed): when Triton is temporarily unreachable, the pipeline falls back to TF-IDF-style lexical embedding and emits a rag_gpu_degraded_fallback warning; reranker failures fall back to plain cosine similarity.
- Path 3 (Pure Lexical): BM25 keyword scoring with zero GPU dependency, completing in under 1 ms; the hard fail-safe that is always available.

The pipeline selects the best available retrieval path at runtime based on GPU and Triton health status. Rollout is gated by the RAG_VECTOR_PATH_ENABLED runtime boolean, a feature flag on the full vector+rerank path: ship with false (lexical safe mode), run a canary, confirm metrics, then flip to true, keeping instant rollback by flipping it back.

The stack runs from GPU silicon to trading decision: Triton handles batched inference with dynamic batching at millisecond latency; the embedding model produces 1,024-dim dense vectors for semantic search, with the top-20 candidates passed to the reranker; and the cross-encoder reads full (query, passage) pairs for true semantic relevance scoring.

The canary watch set monitors four signals: mode distribution (vector+rerank share rising), rag_gpu_degraded_fallback rate (GPU path degrading), p95 latency by mode (vector path target of 5 ms or less), and answer quality drift.

RAG grounds LLM answers in real proprietary data (strategy docs, risk rules, post-mortems, fill logs) with source citations and relevance scores, minimizing hallucination on proprietary data. Results can be filtered by source_type (strategy, risk, etc.). This is internal tooling powering the Kalshi OS trading agent.
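The runtime path selection and the Path 3 fail-safe can be sketched as follows. This is a minimal illustration, not the production implementation: the function names, health-check inputs, and BM25 parameters are assumptions; only the mode names, the RAG_VECTOR_PATH_ENABLED flag, and the fallback ordering come from the description above.

```python
import math
import re
from collections import Counter
from enum import Enum


class RetrievalMode(Enum):
    VECTOR_RERANK = "vector+rerank"                    # Path 1: full GPU path
    DEGRADED_LEXICAL_EMBED = "degraded_lexical_embed"  # Path 2: Triton unreachable
    PURE_LEXICAL = "pure_lexical"                      # Path 3: always available


def select_path(vector_path_enabled: bool,
                triton_healthy: bool,
                gpu_healthy: bool) -> RetrievalMode:
    """Pick the best available retrieval path (illustrative logic).

    vector_path_enabled mirrors the RAG_VECTOR_PATH_ENABLED feature flag:
    when false, the pipeline stays in lexical safe mode regardless of health.
    """
    if vector_path_enabled and triton_healthy and gpu_healthy:
        return RetrievalMode.VECTOR_RERANK
    if vector_path_enabled:
        # GPU path is enabled but degraded; a real system would also emit
        # the rag_gpu_degraded_fallback warning metric here.
        return RetrievalMode.DEGRADED_LEXICAL_EMBED
    return RetrievalMode.PURE_LEXICAL


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score docs against query with plain BM25 (the Path 3 fail-safe)."""
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    doc_toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in doc_toks) / len(doc_toks)
    n = len(docs)
    q_terms = tokenize(query)
    # Document frequency per query term, for the IDF component.
    df = {t: sum(1 for toks in doc_toks if t in toks) for t in set(q_terms)}
    scores = []
    for toks in doc_toks:
        tf = Counter(toks)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores
```

Keeping Path 3 dependency-free is what makes it a credible hard fail-safe: it needs no network call, no model weights, and no GPU, so it cannot fail for the same reasons Paths 1 and 2 can.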
Internal infrastructure — reduces hallucination risk and improves trading agent decision quality
The retrieval intelligence layer ensuring every trading agent answer is grounded in real proprietary data with production-grade failover.
AI Infrastructure / Retrieval-Augmented Generation