Technical Reference Document

AI Modules, LLMs & Inference Architecture

Deerwalk Stop-Process Resolution Analyst — RAG-Powered Flask Application

Application: ai.techmauri.com/spartan
Version: 1.0
Stack: Python · Flask · LangChain · FAISS · Groq
Date: April 2026

Table of Contents

1. System Overview
2. Embedding Model — BAAI/bge-small-en-v1.5
3. Vector Store — FAISS
4. Frameworks & Supporting Libraries
5. Large Language Model — LLaMA 3.3 70B via Groq
6. LangChain Orchestration — LCEL RAG Chain
7. End-to-End RAG Pipeline Flow
8. Configuration, Environment & Performance Tuning

🏗️ 1. System Overview

Architecture summary of the AI-powered resolution assistant

This application is a Retrieval-Augmented Generation (RAG) system built to help teams query historical IT stop-process resolution tickets. It ingests structured Excel data, encodes it into a searchable vector index, and answers natural-language questions using a large language model.

📄 Excel Knowledge Base → 🧠 BGE-Small Embedding → 🗄️ FAISS Vector Index → 🔍 Top-K Retriever → 💬 LLaMA 3.3 via Groq → 📡 Flask API (/chat)
Component       | Technology              | Role                      | Provider
----------------|-------------------------|---------------------------|--------------------
Language Model  | LLaMA 3.3 70B Versatile | Answer generation         | Groq (cloud API)
Embedding Model | BAAI/bge-small-en-v1.5  | Semantic encoding         | HuggingFace (local)
Vector Store    | FAISS                   | Similarity search         | Meta (local)
Orchestration   | LangChain               | Chain & prompt management | LangChain OSS
Web Framework   | Flask                   | REST API server           | Pallets
Data Ingestion  | pandas + openpyxl       | Excel parsing             | PyData

🧬 2. Embedding Model — BAAI/bge-small-en-v1.5

Local sentence-embedding model for semantic vector generation

Model ID: BAAI/bge-small-en-v1.5 (BGE Small English v1.5)
Bidirectional General Embedding model from the Beijing Academy of Artificial Intelligence (BAAI), optimized for English-language retrieval tasks.

Deployment: Local CPU inference (device: cpu)
Runs entirely on-premise through the HuggingFaceEmbeddings wrapper; no external API calls are made for vector generation.

Configuration Details

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    model_kwargs={"device": "cpu"},
    encode_kwargs={
        "normalize_embeddings": True,  # cosine similarity-ready
        "batch_size": 8,               # conservative batch for low-RAM servers
    },
)
Parameter            | Value                  | Purpose
---------------------|------------------------|-------------------------------------------------------
Model Name           | BAAI/bge-small-en-v1.5 | Pre-trained checkpoint from HuggingFace Hub
Device               | cpu                    | Shared hosting compatibility — no GPU required
Normalize Embeddings | True                   | Produces unit vectors for cosine distance comparisons
Batch Size           | 8                      | Low memory footprint for constrained environments
Vector Dimensions    | 384                    | Compact embedding size; fast index lookups
💡 Why BGE Small? BGE Small v1.5 achieves near-large-model retrieval quality at a fraction of the compute. It ranks among the top small models on the MTEB benchmark for English retrieval tasks, making it ideal for production deployment on shared cPanel hosting.
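
As a quick sanity check of the configuration above, the following sketch embeds a sample query with the embeddings object defined earlier and confirms the 384-dimension, unit-norm properties; the query text is illustrative only:

import numpy as np

# Embed a sample query with the `embeddings` object configured above
vec = np.array(embeddings.embed_query("service stopped unexpectedly"))

print(vec.shape)                    # (384,) — BGE-Small output dimension
print(float(np.linalg.norm(vec)))   # ≈ 1.0 because normalize_embeddings=True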

🗄️ 3. Vector Store — FAISS

Facebook AI Similarity Search — persistent local index

FAISS (Facebook AI Similarity Search) stores and retrieves document embeddings using approximate nearest-neighbor search. The index is persisted to disk, so it is only built once from the Excel data and reloaded on subsequent starts.

from langchain_community.vectorstores import FAISS

# Build index from documents (one-time)
vs = FAISS.from_documents(docs, embeddings)
vs.save_local("/home/techqbcv/ai.techmauri.com/faiss_index")

# Reload on subsequent starts (fast)
vs = FAISS.load_local(FAISS_PATH, embeddings,
                      allow_dangerous_deserialization=True)

# Retrieve top-10 relevant documents per query
retriever = vs.as_retriever(search_kwargs={"k": 10})
Setting         | Value                                       | Impact
----------------|---------------------------------------------|--------------------------------------------------------
Index Path      | /home/techqbcv/ai.techmauri.com/faiss_index | Persisted to disk; no re-indexing on restart
Retrieval k     | 10                                          | Top 10 most-similar tickets sent to the LLM as context
Search Type     | Approximate NN (L2 / cosine)                | Fast sub-linear lookup even at scale
Deserialization | allow_dangerous_deserialization=True        | Required for loading the pickle-based FAISS index
⚠️ Security Note allow_dangerous_deserialization=True is required to load persisted FAISS indexes (they use Python pickle). Ensure the faiss_index directory is not writable by untrusted users.
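
Combining the calls in the snippet above, the build-once / reload-later behavior can be expressed as a simple guard. This is a sketch assuming FAISS_PATH, docs, and embeddings are defined as shown earlier:

import os

if os.path.isdir(FAISS_PATH):
    # Fast path: reload the persisted index (pickle-based, hence the flag)
    vs = FAISS.load_local(FAISS_PATH, embeddings,
                          allow_dangerous_deserialization=True)
else:
    # One-time path: embed all documents and persist the index to disk
    vs = FAISS.from_documents(docs, embeddings)
    vs.save_local(FAISS_PATH)

retriever = vs.as_retriever(search_kwargs={"k": 10})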

📦 4. Frameworks & Supporting Libraries

Full dependency inventory with roles

Orchestration: LangChain (langchain-core · langchain-groq · langchain-huggingface · langchain-community)
Chains prompts, retrievers, LLMs, and output parsers into a unified RAG pipeline using LCEL (LangChain Expression Language).

Inference Runtime: PyTorch (torch, CPU mode)
Underlies HuggingFace model execution. Capped to one thread via torch.set_num_threads(1) for shared-server stability.

Web API: Flask + Werkzeug (flask · ProxyFix)
Provides the /, /chat, and /health endpoints. ProxyFix handles reverse-proxy headers from Passenger/cPanel.

Data Processing: pandas + openpyxl (pandas, dtype=str)
Reads all sheets from the Excel knowledge base, normalizes column names, and converts rows to LangChain Document objects; see the sketch below.
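
A sketch of that ingestion step, for illustration; the exact column layout and normalization rule are assumptions here, but the pattern (read every sheet with dtype=str, normalize headers, one Document per row tagged with its team) follows the description above:

import pandas as pd
from langchain_core.documents import Document

# Read every sheet as strings; sheet_name=None returns {sheet_name: DataFrame}
sheets = pd.read_excel(EXCEL_FILE, sheet_name=None, dtype=str)

docs = []
for team, df in sheets.items():
    # Normalize column names (exact rule assumed)
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    for _, row in df.iterrows():
        # One Document per ticket row, tagged with its team sheet
        text = "\n".join(f"{col}: {val}" for col, val in row.dropna().items())
        docs.append(Document(page_content=text, metadata={"team": team}))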
Library               | Package               | Role
----------------------|-----------------------|----------------------------------------
LangChain Core        | langchain-core        | Runnables, prompts, output parsers
LangChain Groq        | langchain-groq        | ChatGroq client integration
LangChain HuggingFace | langchain-huggingface | HuggingFaceEmbeddings wrapper
LangChain Community   | langchain-community   | FAISS vector store integration
FAISS                 | faiss-cpu             | Similarity search index
PyTorch               | torch                 | Tensor computation for embeddings
Transformers          | transformers          | HuggingFace model loading (transitive)
pandas                | pandas                | Excel data ingestion
Flask                 | flask                 | HTTP REST API
Werkzeug              | werkzeug              | WSGI middleware (ProxyFix)

🤖 5. Large Language Model — LLaMA 3.3 70B via Groq

Cloud-hosted inference for answer generation

Base Model: Meta LLaMA 3.3 70B (llama-3.3-70b-versatile)
Meta's flagship open-weight model with 70 billion parameters. The "versatile" variant is optimized for general instruction-following tasks.

Inference Provider: Groq LPU™ Cloud (api.groq.com)
Groq's Language Processing Units deliver extremely low-latency inference compared to traditional GPU-based cloud providers.

LLM Configuration

from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0.2,   # low → deterministic, factual responses
    max_tokens=2048,   # sufficient for structured bullet-point answers
)
Parameter          | Value                           | Rationale
-------------------|---------------------------------|------------------------------------------------------------
Model              | llama-3.3-70b-versatile         | High reasoning capability for technical IT analysis
Temperature        | 0.2                             | Near-deterministic; reduces hallucination for factual Q&A
Max Tokens         | 2048                            | Enough for multi-step resolutions with bullet points
API Key Source     | Hardcoded (env var recommended) | Should be read from the GROQ_API_KEY environment variable
Inference Hardware | Groq LPU (cloud)                | No local GPU required; inference offloaded to Groq

System Prompt Design

The prompt constrains the LLM to act strictly as a resolution analyst, preventing hallucination by grounding answers in retrieved context:

""" You are an expert Deerwalk Stop-Process Resolution Analyst. Answer questions using **only** the provided historical tickets from the 7 teams. Be precise, professional, and structured. Use bullet points for causes, fixes, and steps. If the exact issue isn't found, say so and suggest the closest similar cases. Context: {context} Question: {question} Answer: """
🔐 Security Recommendation The GROQ API key is currently hardcoded in the source file. It should be moved to a cPanel environment variable (GROQ_API_KEY) immediately to prevent key leakage via version control or log files.
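
A minimal sketch of the recommended fix; in recent langchain-groq releases the key can be passed as api_key, and ChatGroq also falls back to the GROQ_API_KEY environment variable automatically when no key is passed:

import os
from langchain_groq import ChatGroq

# Fail fast if the key is missing instead of embedding it in source control
api_key = os.environ["GROQ_API_KEY"]

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0.2,
    max_tokens=2048,
    api_key=api_key,
)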

⛓️ 6. LangChain Orchestration — LCEL RAG Chain

LangChain Expression Language pipeline composition

The application uses LangChain Expression Language (LCEL) to compose the retrieval and generation steps into a single chainable pipeline. Each component is a Runnable that passes output to the next step.

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    # Join retrieved tickets into one context block, separated by ---
    return "\n---\n".join(d.page_content for d in docs)

# Full LCEL chain definition
chain = (
    {
        "context": retriever | format_docs,   # FAISS lookup → string
        "question": RunnablePassthrough(),    # user question passed through
    }
    | prompt              # ChatPromptTemplate fills {context} and {question}
    | llm                 # ChatGroq generates completion
    | StrOutputParser()   # extracts plain string from AIMessage
)
LCEL Component      | Class / Source                | Input → Output
--------------------|-------------------------------|-----------------------------------
Retriever           | VectorStoreRetriever (FAISS)  | Query string → List[Document]
format_docs         | Custom function               | List[Document] → formatted string
RunnablePassthrough | langchain_core.runnables      | Question string → same string
ChatPromptTemplate  | langchain_core.prompts        | Dict → PromptValue
ChatGroq            | langchain_groq                | PromptValue → AIMessage
StrOutputParser     | langchain_core.output_parsers | AIMessage → str
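
With the chain assembled, a query is a single invoke call; the question below is illustrative:

# Query string in, plain answer string out (StrOutputParser at the tail)
answer = chain.invoke("Why did the nightly batch process stop for Team A?")
print(answer)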

🔄 7. End-to-End RAG Pipeline Flow

Step-by-step data flow from user query to response

Step | Action                                            | Component               | Output
-----|---------------------------------------------------|-------------------------|--------------------------------
1    | User submits question via POST /chat              | Flask route             | JSON { "question": "..." }
2    | Question is embedded into a vector                | BGE-Small (local)       | 384-dim float vector
3    | Top-10 nearest documents retrieved                | FAISS retriever         | List of 10 Document objects
4    | Documents concatenated with --- separator         | format_docs()           | Plain-text context block
5    | Context + question injected into prompt template  | ChatPromptTemplate      | Formatted chat messages
6    | Prompt sent to Groq API for LLaMA inference       | ChatGroq (cloud)        | AIMessage with resolution text
7    | Response parsed to plain string and returned      | StrOutputParser + Flask | JSON { "answer": "..." }
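
A minimal sketch of the Flask route implementing steps 1 and 7 above; error handling and the actual /spartan prefix wiring are simplified:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    # Step 1: extract the question from the JSON body
    payload = request.get_json(silent=True) or {}
    question = (payload.get("question") or "").strip()
    if not question:
        return jsonify({"error": "question is required"}), 400

    # Steps 2-6 run inside the LCEL chain (embed → retrieve → prompt → generate)
    answer = chain.invoke(question)

    # Step 7: return the parsed string as JSON
    return jsonify({"answer": answer})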

Startup Initialization Sequence

# On server start (Passenger/WSGI import)
1. Load Excel → parse all sheets → build Document list
2. Initialize HuggingFace embeddings (BGE model downloaded if absent)
3. Load/build FAISS index (from disk cache, or fresh from the Documents)
4. Build LangChain LCEL chain (retriever + prompt + LLM)
5. Flask app ready — /health returns "ok"
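
Step 5's readiness probe can be tiny; a sketch with response field names assumed (the real /health also reports doc count and KB load state, per the endpoint table in section 8):

@app.route("/health")
def health():
    # Readiness probe used once startup completes (field names assumed)
    return jsonify({
        "status": "ok",
        "documents": len(docs),
        "kb_loaded": bool(docs),
    })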

⚙️ 8. Configuration, Environment & Performance Tuning

Environment variables, paths, and threading constraints

Key Configuration Constants

Constant         | Value                                         | Purpose
-----------------|-----------------------------------------------|-----------------------------------
EXCEL_FILE       | .../Teamwise_Stop_Issues_Resolution_2023.xlsx | Multi-sheet knowledge base source
FAISS_PATH       | .../faiss_index                               | Persistent vector index directory
EMBEDDING_MODEL  | BAAI/bge-small-en-v1.5                        | HuggingFace model identifier
LLM_MODEL        | llama-3.3-70b-versatile                       | Groq model identifier
APPLICATION_ROOT | /spartan                                      | cPanel URL prefix for all routes

Performance & Stability Environment Variables

Variable                     | Value              | Effect
-----------------------------|--------------------|------------------------------------------------------
OMP_NUM_THREADS              | 1                  | Prevents OpenMP from spawning extra threads
MKL_NUM_THREADS              | 1                  | Intel MKL thread cap for shared hosting
RAYON_NUM_THREADS            | 1                  | Rust-based Rayon thread limit (tokenizers)
TOKENIZERS_PARALLELISM       | false              | Prevents HuggingFace tokenizer deadlock
TQDM_DISABLE                 | 1                  | Suppresses tqdm progress bars (BrokenPipeError fix)
HF_HUB_DISABLE_PROGRESS_BARS | 1                  | Suppresses HuggingFace Hub download bars
TMPDIR / TEMP / TMP          | /home/techqbcv/tmp | Redirects temp files to a writable home directory
📝 Deployment Notes The app runs under Passenger (cPanel's WSGI host). stdout/stderr are redirected to /home/techqbcv/tmp/app_startup.log on import to prevent BrokenPipeError caused by Passenger closing the standard streams before tqdm can write.
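
A sketch of how these caps are typically applied at the top of the module: the environment variables must be set before torch/transformers are imported, or the thread pools will already have been sized:

import os, sys

# Thread and progress-bar caps — must precede the heavy imports below
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("RAYON_NUM_THREADS", "1")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("TQDM_DISABLE", "1")
os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
for var in ("TMPDIR", "TEMP", "TMP"):
    os.environ[var] = "/home/techqbcv/tmp"

# Redirect streams so Passenger closing stdout can't raise BrokenPipeError
log = open("/home/techqbcv/tmp/app_startup.log", "a")
sys.stdout = sys.stderr = log

import torch  # imported only after the caps are in place
torch.set_num_threads(1)  # single-threaded tensor ops for shared hosting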

API Endpoints

Endpoint        | Method | Description
----------------|--------|------------------------------------------------------------
/spartan/       | GET    | Serves the frontend HTML UI with doc count and status
/spartan/chat   | POST   | Accepts { "question": "..." }, returns { "answer": "..." }
/spartan/health | GET    | Returns system status, doc count, and KB load state
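
For reference, a hypothetical client call against the /chat endpoint; the host and question are placeholders:

import requests

resp = requests.post(
    "https://ai.techmauri.com/spartan/chat",
    json={"question": "How do we resolve a stuck stop-process ticket for ETL jobs?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["answer"])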