Technical Reference Document

AI Modules, LLMs & Inference Architecture

Deerwalk Stop-Process Resolution Analyst — RAG-Powered Flask Application

Application: ai.techmauri.com/spartan
Version: 1.0
Stack: Python · Flask · LangChain · FAISS · Groq
Date: April 2026

Table of Contents

1. System Overview
2. Embedding Model — BAAI/bge-small-en-v1.5
3. Vector Store — FAISS
4. Frameworks & Supporting Libraries
5. Large Language Model — LLaMA 3.3 70B via Groq
6. LangChain Orchestration — LCEL RAG Chain
7. End-to-End RAG Pipeline Flow
8. Configuration, Environment & Performance Tuning

🏗️ 1. System Overview

Architecture summary of the AI-powered resolution assistant

This application is a Retrieval-Augmented Generation (RAG) system built to help teams query historical IT stop-process resolution tickets. It ingests structured Excel data, encodes it into a searchable vector index, and answers natural-language questions using a large language model.

📄 Excel Knowledge Base → 🧠 BGE-Small Embedding → 🗄️ FAISS Vector Index → 🔍 Top-K Retriever → 💬 LLaMA 3.3 via Groq → 📡 Flask API (/chat)
Component       | Technology              | Role                      | Provider
----------------|-------------------------|---------------------------|--------------------
Language Model  | LLaMA 3.3 70B Versatile | Answer generation         | Groq (cloud API)
Embedding Model | BAAI/bge-small-en-v1.5  | Semantic encoding         | HuggingFace (local)
Vector Store    | FAISS                   | Similarity search         | Meta (local)
Orchestration   | LangChain               | Chain & prompt management | LangChain OSS
Web Framework   | Flask                   | REST API server           | Pallets
Data Ingestion  | pandas + openpyxl       | Excel parsing             | PyData

🧬 2. Embedding Model — BAAI/bge-small-en-v1.5

Local sentence-embedding model for semantic vector generation

Model ID: BAAI/bge-small-en-v1.5 (BGE Small English v1.5)
Bidirectional General Embedding model from the Beijing Academy of Artificial Intelligence (BAAI), optimized for English-language retrieval tasks.

Deployment: Local CPU inference (device: cpu)
Runs entirely on-premise through the HuggingFaceEmbeddings wrapper; no external API calls are made for vector generation.

Configuration Details

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={"device": "cpu"},
    encode_kwargs={
        "normalize_embeddings": True,  # cosine similarity-ready
        "batch_size": 8,               # conservative batch for low-RAM servers
    },
)
Parameter            | Value                  | Purpose
---------------------|------------------------|-------------------------------------------------------
Model Name           | BAAI/bge-small-en-v1.5 | Pre-trained checkpoint from HuggingFace Hub
Device               | cpu                    | Shared hosting compatibility — no GPU required
Normalize Embeddings | True                   | Produces unit vectors for cosine distance comparisons
Batch Size           | 8                      | Low memory footprint for constrained environments
Vector Dimensions    | 384                    | Compact embedding size; fast index lookups
💡 Why BGE Small? BGE Small v1.5 achieves near-large-model retrieval quality at a fraction of the compute. It ranks among the top small models on the MTEB benchmark for English retrieval tasks, making it ideal for production deployment on shared cPanel hosting.
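
As a quick sanity check of the configuration above, the following sketch embeds a sample query with the embeddings object defined earlier and confirms the 384-dimension, unit-norm properties; the query text is illustrative only:

import numpy as np

# Embed a sample query with the `embeddings` object configured above
vec = np.array(embeddings.embed_query("service stopped unexpectedly"))

print(vec.shape)                    # (384,) — BGE-Small output dimension
print(float(np.linalg.norm(vec)))   # ≈ 1.0 because normalize_embeddings=True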

🗄️ 3. Vector Store — FAISS

Facebook AI Similarity Search — persistent local index

FAISS (Facebook AI Similarity Search) stores and retrieves document embeddings using approximate nearest-neighbor search. The index is persisted to disk, so it is only built once from the Excel data and reloaded on subsequent starts.

from langchain_community.vectorstores import FAISS

# Build index from documents (one-time)
vs = FAISS.from_documents(docs, embeddings)
vs.save_local("/home/techqbcv/ai.techmauri.com/faiss_index")

# Reload on subsequent starts (fast)
vs = FAISS.load_local(FAISS_PATH, embeddings,
                      allow_dangerous_deserialization=True)

# Retrieve top-10 relevant documents per query
retriever = vs.as_retriever(search_kwargs={"k": 10})
Setting         | Value                                       | Impact
----------------|---------------------------------------------|--------------------------------------------------------
Index Path      | /home/techqbcv/ai.techmauri.com/faiss_index | Persisted to disk; no re-indexing on restart
Retrieval k     | 10                                          | Top 10 most-similar tickets sent to the LLM as context
Search Type     | Approximate NN (L2 / cosine)                | Fast sub-linear lookup even at scale
Deserialization | allow_dangerous_deserialization=True        | Required for loading the pickle-based FAISS index
⚠️ Security Note allow_dangerous_deserialization=True is required to load persisted FAISS indexes (they use Python pickle). Ensure the faiss_index directory is not writable by untrusted users.
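
Combining the calls in the snippet above, the build-once / reload-later behavior can be expressed as a simple guard. This is a sketch assuming FAISS_PATH, docs, and embeddings are defined as shown earlier:

import os

if os.path.isdir(FAISS_PATH):
    # Fast path: reload the persisted index (pickle-based, hence the flag)
    vs = FAISS.load_local(FAISS_PATH, embeddings,
                          allow_dangerous_deserialization=True)
else:
    # One-time path: embed all documents and persist the index to disk
    vs = FAISS.from_documents(docs, embeddings)
    vs.save_local(FAISS_PATH)

retriever = vs.as_retriever(search_kwargs={"k": 10})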

📦 4. Frameworks & Supporting Libraries

Full dependency inventory with roles

Orchestration: LangChain (langchain-core · langchain-groq · langchain-huggingface · langchain-community)
Chains prompts, retrievers, LLMs, and output parsers into a unified RAG pipeline using LCEL (LangChain Expression Language).

Inference Runtime: PyTorch (torch, CPU mode)
Underlies HuggingFace model execution. Capped to one thread via torch.set_num_threads(1) for shared-server stability.

Web API: Flask + Werkzeug (flask · ProxyFix)
Provides the /, /chat, and /health endpoints. ProxyFix handles reverse-proxy headers from Passenger/cPanel.

Data Processing: pandas + openpyxl (pandas, dtype=str)
Reads all sheets from the Excel knowledge base, normalizes column names, and converts rows to LangChain Document objects; see the sketch below.
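
A sketch of that ingestion step, for illustration; the exact column layout and normalization rule are assumptions here, but the pattern (read every sheet with dtype=str, normalize headers, one Document per row tagged with its team) follows the description above:

import pandas as pd
from langchain_core.documents import Document

# Read every sheet as strings; sheet_name=None returns {sheet_name: DataFrame}
sheets = pd.read_excel(EXCEL_FILE, sheet_name=None, dtype=str)

docs = []
for team, df in sheets.items():
    # Normalize column names (exact rule assumed)
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    for _, row in df.iterrows():
        # One Document per ticket row, tagged with its team sheet
        text = "\n".join(f"{col}: {val}" for col, val in row.dropna().items())
        docs.append(Document(page_content=text, metadata={"team": team}))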
Library               | Package               | Role
----------------------|-----------------------|----------------------------------------
LangChain Core        | langchain-core        | Runnables, prompts, output parsers
LangChain Groq        | langchain-groq        | ChatGroq client integration
LangChain HuggingFace | langchain-huggingface | HuggingFaceEmbeddings wrapper
LangChain Community   | langchain-community   | FAISS vector store integration
FAISS                 | faiss-cpu             | Similarity search index
PyTorch               | torch                 | Tensor computation for embeddings
Transformers          | transformers          | HuggingFace model loading (transitive)
pandas                | pandas                | Excel data ingestion
Flask                 | flask                 | HTTP REST API
Werkzeug              | werkzeug              | WSGI middleware (ProxyFix)

🤖 5. Large Language Model — LLaMA 3.3 70B via Groq

Cloud-hosted inference for answer generation

Base Model: Meta LLaMA 3.3 70B (llama-3.3-70b-versatile)
Meta's flagship open-weight model with 70 billion parameters. The "versatile" variant is optimized for general instruction-following tasks.

Inference Provider: Groq LPU™ Cloud (api.groq.com)
Groq's Language Processing Units deliver extremely low-latency inference compared to traditional GPU-based cloud providers.

LLM Configuration

from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0.2,   # low → deterministic, factual responses
    max_tokens=2048,   # sufficient for structured bullet-point answers
)
Parameter          | Value                           | Rationale
-------------------|---------------------------------|------------------------------------------------------------
Model              | llama-3.3-70b-versatile         | High reasoning capability for technical IT analysis
Temperature        | 0.2                             | Near-deterministic; reduces hallucination for factual Q&A
Max Tokens         | 2048                            | Enough for multi-step resolutions with bullet points
API Key Source     | Hardcoded (env var recommended) | Should be read from the GROQ_API_KEY environment variable
Inference Hardware | Groq LPU (cloud)                | No local GPU required; inference offloaded to Groq

System Prompt Design

The prompt constrains the LLM to act strictly as a resolution analyst, preventing hallucination by grounding answers in retrieved context:

""" You are an expert Deerwalk Stop-Process Resolution Analyst. Answer questions using **only** the provided historical tickets from the 7 teams. Be precise, professional, and structured. Use bullet points for causes, fixes, and steps. If the exact issue isn't found, say so and suggest the closest similar cases. Context: {context} Question: {question} Answer: """
🔐 Security Recommendation The GROQ API key is currently hardcoded in the source file. It should be moved to a cPanel environment variable (GROQ_API_KEY) immediately to prevent key leakage via version control or log files.
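
A minimal sketch of the recommended fix; in recent langchain-groq releases the key can be passed as api_key, and ChatGroq also falls back to the GROQ_API_KEY environment variable automatically when no key is passed:

import os
from langchain_groq import ChatGroq

# Fail fast if the key is missing instead of embedding it in source control
api_key = os.environ["GROQ_API_KEY"]

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0.2,
    max_tokens=2048,
    api_key=api_key,
)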

⛓️ 6. LangChain Orchestration — LCEL RAG Chain

LangChain Expression Language pipeline composition

The application uses LangChain Expression Language (LCEL) to compose the retrieval and generation steps into a single chainable pipeline. Each component is a Runnable that passes output to the next step.

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    # Join retrieved tickets into one context block, separated by ---
    return "\n---\n".join(d.page_content for d in docs)

# Full LCEL chain definition
chain = (
    {
        "context": retriever | format_docs,   # FAISS lookup → string
        "question": RunnablePassthrough(),    # user question passed through
    }
    | prompt              # ChatPromptTemplate fills {context} and {question}
    | llm                 # ChatGroq generates completion
    | StrOutputParser()   # extracts plain string from AIMessage
)
LCEL Component      | Class / Source                | Input → Output
--------------------|-------------------------------|-----------------------------------
Retriever           | VectorStoreRetriever (FAISS)  | Query string → List[Document]
format_docs         | Custom function               | List[Document] → formatted string
RunnablePassthrough | langchain_core.runnables      | Question string → same string
ChatPromptTemplate  | langchain_core.prompts        | Dict → PromptValue
ChatGroq            | langchain_groq                | PromptValue → AIMessage
StrOutputParser     | langchain_core.output_parsers | AIMessage → str
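
With the chain assembled, a query is a single invoke call; the question below is illustrative:

# Query string in, plain answer string out (StrOutputParser at the tail)
answer = chain.invoke("Why did the nightly batch process stop for Team A?")
print(answer)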

🔄 7. End-to-End RAG Pipeline Flow

Step-by-step data flow from user query to response

Step | Action                                            | Component               | Output
-----|---------------------------------------------------|-------------------------|--------------------------------
1    | User submits question via POST /chat              | Flask route             | JSON { "question": "..." }
2    | Question is embedded into a vector                | BGE-Small (local)       | 384-dim float vector
3    | Top-10 nearest documents retrieved                | FAISS retriever         | List of 10 Document objects
4    | Documents concatenated with --- separator         | format_docs()           | Plain-text context block
5    | Context + question injected into prompt template  | ChatPromptTemplate      | Formatted chat messages
6    | Prompt sent to Groq API for LLaMA inference       | ChatGroq (cloud)        | AIMessage with resolution text
7    | Response parsed to plain string and returned      | StrOutputParser + Flask | JSON { "answer": "..." }
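
A minimal sketch of the Flask route implementing steps 1 and 7 above; error handling and the actual /spartan prefix wiring are simplified:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    # Step 1: extract the question from the JSON body
    payload = request.get_json(silent=True) or {}
    question = (payload.get("question") or "").strip()
    if not question:
        return jsonify({"error": "question is required"}), 400

    # Steps 2-6 run inside the LCEL chain (embed → retrieve → prompt → generate)
    answer = chain.invoke(question)

    # Step 7: return the parsed string as JSON
    return jsonify({"answer": answer})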

Startup Initialization Sequence

# On server start (Passenger/WSGI import)
1. Load Excel → parse all sheets → build Document list
2. Initialize HuggingFace embeddings (BGE model downloaded if absent)
3. Load/build FAISS index (from disk cache, or fresh from the Documents)
4. Build LangChain LCEL chain (retriever + prompt + LLM)
5. Flask app ready — /health returns "ok"
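
Step 5's readiness probe can be tiny; a sketch with response field names assumed (the real /health also reports doc count and KB load state, per the endpoint table in section 8):

@app.route("/health")
def health():
    # Readiness probe used once startup completes (field names assumed)
    return jsonify({
        "status": "ok",
        "documents": len(docs),
        "kb_loaded": bool(docs),
    })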

⚙️ 8. Configuration, Environment & Performance Tuning

Environment variables, paths, and threading constraints

Key Configuration Constants

Constant         | Value                                         | Purpose
-----------------|-----------------------------------------------|-----------------------------------
EXCEL_FILE       | .../Teamwise_Stop_Issues_Resolution_2023.xlsx | Multi-sheet knowledge base source
FAISS_PATH       | .../faiss_index                               | Persistent vector index directory
EMBEDDING_MODEL  | BAAI/bge-small-en-v1.5                        | HuggingFace model identifier
LLM_MODEL        | llama-3.3-70b-versatile                       | Groq model identifier
APPLICATION_ROOT | /spartan                                      | cPanel URL prefix for all routes

Performance & Stability Environment Variables

Variable                     | Value              | Effect
-----------------------------|--------------------|------------------------------------------------------
OMP_NUM_THREADS              | 1                  | Prevents OpenMP from spawning extra threads
MKL_NUM_THREADS              | 1                  | Intel MKL thread cap for shared hosting
RAYON_NUM_THREADS            | 1                  | Rust-based Rayon thread limit (tokenizers)
TOKENIZERS_PARALLELISM       | false              | Prevents HuggingFace tokenizer deadlock
TQDM_DISABLE                 | 1                  | Suppresses tqdm progress bars (BrokenPipeError fix)
HF_HUB_DISABLE_PROGRESS_BARS | 1                  | Suppresses HuggingFace Hub download bars
TMPDIR / TEMP / TMP          | /home/techqbcv/tmp | Redirects temp files to a writable home directory
📝 Deployment Notes The app runs under Passenger (cPanel's WSGI host). stdout/stderr are redirected to /home/techqbcv/tmp/app_startup.log on import to prevent BrokenPipeError caused by Passenger closing the standard streams before tqdm can write.
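
A sketch of how these caps are typically applied at the top of the module: the environment variables must be set before torch/transformers are imported, or the thread pools will already have been sized:

import os, sys

# Thread and progress-bar caps — must precede the heavy imports below
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("RAYON_NUM_THREADS", "1")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("TQDM_DISABLE", "1")
os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
for var in ("TMPDIR", "TEMP", "TMP"):
    os.environ[var] = "/home/techqbcv/tmp"

# Redirect streams so Passenger closing stdout can't raise BrokenPipeError
log = open("/home/techqbcv/tmp/app_startup.log", "a")
sys.stdout = sys.stderr = log

import torch  # imported only after the caps are in place
torch.set_num_threads(1)  # single-threaded tensor ops for shared hosting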

API Endpoints

Endpoint        | Method | Description
----------------|--------|------------------------------------------------------------
/spartan/       | GET    | Serves the frontend HTML UI with doc count and status
/spartan/chat   | POST   | Accepts { "question": "..." }, returns { "answer": "..." }
/spartan/health | GET    | Returns system status, doc count, and KB load state
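
For reference, a hypothetical client call against the /chat endpoint; the host and question are placeholders:

import requests

resp = requests.post(
    "https://ai.techmauri.com/spartan/chat",
    json={"question": "How do we resolve a stuck stop-process ticket for ETL jobs?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["answer"])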