In this tutorial, we’ll walk through the process of building a Financial Question-Answering chatbot using Retrieval-Augmented Generation (RAG), LLaMA 3, and Streamlit. We’ll deploy our solution on RunPod’s GPU cloud environment for cost-effective development and testing.
What We’re Building
We’ll create an investment education chatbot that:
- Retrieves relevant information from financial educational content
- Uses the LLaMA 3 model to generate beginner-friendly responses
- Has a simple, intuitive UI built with Streamlit
- Runs efficiently on a GPU-accelerated environment
Prerequisites
- Basic Python knowledge
- Familiarity with NLP concepts (helpful but not required)
- A RunPod account (or any other GPU cloud provider)
- Access to the LLaMA 3 model (we'll use meta-llama/Llama-3.2-3B-Instruct)
Setting Up Your RunPod Environment
RunPod provides cost-effective GPU environments perfect for AI development. Let’s walk through setting up the environment step by step:
Create an Account and Add Credits
- Sign up at https://www.runpod.io
- Add credits to your account to deploy GPU pods
Add Your SSH Public Key
To enable secure and password-less access from your local machine:
- Generate an SSH key (if you don’t already have one):
ssh-keygen -t ed25519 -C "your_email@example.com"
Press Enter to accept the default path (~/.ssh/id_ed25519) and optionally add a passphrase.
- Copy your public key:
cat ~/.ssh/id_ed25519.pub
- Go to RunPod Dashboard → Settings → SSH Public Keys
- Paste your copied public key
- Click Update Public Key
This ensures your pod will automatically trust your device when connecting via SSH.
Deploy a GPU Pod
- Go to the Pods tab → click + Deploy
- Select the PyTorch template or RunPod VS Code Server if you want in-browser development
- Choose a GPU with at least 16GB VRAM (e.g., T4, A10, or 3090)
- Under instance settings, check: SSH Terminal Access, Start Jupyter Notebook (optional)
Connect via VSCode SSH
Once your pod is deployed:
- In the Pod Dashboard, click Connect → Connection Options
- Choose SSH over exposed TCP
- Copy the SSH connection string (e.g., ssh root@213.xxx.xxx.xxx -p 40000 -i ~/.ssh/id_ed25519)
- Open VSCode → Remote Explorer → Add New SSH Host
- Paste the copied string or add it to your ~/.ssh/config:
Host runpod
    HostName 213.xxx.xxx.xxx
    User root
    Port 40000
    IdentityFile ~/.ssh/id_ed25519
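With this entry in place, you can verify the connection from a local terminal before wiring up VSCode:

ssh runpod

If you land in the pod's shell without a password prompt, the key-based setup is working.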
- In Remote Explorer, click on the runpod host to connect
Project Setup
Once connected to your RunPod via VSCode, let’s set up our project structure:
mkdir invest-rag
cd invest-rag
touch invest_app.py requirements.txt README.md .gitignore
Create a basic .gitignore file:
venv/
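The venv/ entry assumes you work inside a Python virtual environment named venv (a convention, not a requirement). If you want one, create and activate it before installing packages:

python -m venv venv
source venv/bin/activate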
Install the required packages:
pip install streamlit torch transformers sentence-transformers accelerate bitsandbytes
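Note that the LLaMA 3 weights are gated on Hugging Face, so you will likely need to request access on the model page and authenticate with an access token before the model can be downloaded (the CLI below ships with the libraries installed above):

huggingface-cli login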
Create your requirements.txt:
streamlit==1.35.0
torch==2.1.0
transformers==4.36.0
sentence-transformers==2.5.0
accelerate==0.25.0
bitsandbytes==0.41.1
Here's what each package does:

streamlit==1.35.0
A Python framework for building interactive web apps for data science and machine learning projects. Used to build and serve the chatbot UI in your project.
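torch==2.1.0
The PyTorch deep learning framework. Supplies the tensor operations and GPU acceleration that both the LLM and the embedding model run on.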
transformers==4.36.0
A Hugging Face library that provides pre-trained models like LLaMA, BERT, GPT, etc. Enables loading and running large language models (LLMs) for generating answers.
sentence-transformers==2.5.0
A library that extends Hugging Face Transformers for computing dense sentence embeddings. Used to encode chunks of financial text and queries for semantic similarity.
accelerate==0.25.0
A utility library from Hugging Face that simplifies device placement (e.g., GPU/CPU) and multi-GPU training or inference. Required when using device_map="auto" or quantized models.
bitsandbytes==0.41.1
A lightweight CUDA-based library for quantization-aware inference (e.g., 4-bit/8-bit). Enables running large models like LLaMA with lower memory usage using 4-bit quantization.
Preparing Your Investment Dataset
Here's how to prepare clean_text.pkl and invest_embeddings.pt from your raw financial education text. The pipeline produces two artifacts:
- clean_text.pkl → cleaned text chunks (a Python list)
- invest_embeddings.pt → sentence embeddings (a PyTorch tensor)
Load and Preprocess the Raw Text
You can start with a .txt or .pdf file. Here’s an example using plain text:
# Load raw data
with open("financial_education.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
If you’re using PDFs, use PyPDF2:
import PyPDF2
reader = PyPDF2.PdfReader("your_file.pdf")
raw_text = ""
for page in reader.pages:
    raw_text += page.extract_text() or ""  # extract_text() can return None for empty pages
Chunk the Text
Break long documents into manageable chunks for embedding.
def chunk_text(text, max_words=100):
    sentences = text.split('.')
    chunks = []
    current_chunk = []
    for sentence in sentences:
        current_chunk.append(sentence.strip())
        if len(" ".join(current_chunk).split()) > max_words:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
    # Don't drop the trailing sentences that never reached max_words
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
chunk_list = chunk_text(raw_text)
Clean the Chunks (optional)
You can clean the chunks with a simple LLM-based function or regular expressions to remove unwanted artifacts (page numbers, headers, etc.).
cleaned_chunks = [chunk.replace('\n', ' ').strip() for chunk in chunk_list]
For heavier cleanup you can use regular expressions or even an LLM with a cleaning prompt.
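For instance, a regex-based pass might strip page-number artifacts and collapse leftover whitespace (a rough sketch; the exact patterns depend on your source document, and clean_chunk is just an illustrative helper):

import re

def clean_chunk(chunk):
    # Drop "Page 12"-style artifacts left over from PDF extraction
    chunk = re.sub(r'\bPage\s*\d+\b', ' ', chunk, flags=re.IGNORECASE)
    # Collapse runs of whitespace left by line breaks
    chunk = re.sub(r'\s+', ' ', chunk)
    return chunk.strip()

cleaned_chunks = [clean_chunk(chunk) for chunk in chunk_list]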
Save the Cleaned Text
import pickle
with open("clean_text.pkl", "wb") as f:
pickle.dump(cleaned_chunks, f)
Generate Embeddings
Use SentenceTransformer to encode the cleaned chunks:
from sentence_transformers import SentenceTransformer
import torch
embedding_model = SentenceTransformer("all-MiniLM-L12-v2")
embeddings = embedding_model.encode(
    cleaned_chunks,
    convert_to_tensor=True,
    normalize_embeddings=True
)
Save the Embeddings
torch.save(embeddings, "invest_embeddings.pt")
print("Saved invest_embeddings.pt")
After running the script, your folder will contain:
- clean_text.pkl: list of cleaned text chunks
- invest_embeddings.pt: dense tensor of sentence embeddings
You can now load them in your Streamlit app using:
with open("clean_text.pkl", "rb") as f:
    cleaned_text = pickle.load(f)
embeddings = torch.load("invest_embeddings.pt").to(device)
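A quick sanity check that the two artifacts line up (each chunk should map to exactly one embedding row):

print(len(cleaned_text))   # number of chunks
print(embeddings.shape)    # should be (num_chunks, embedding_dim)
assert len(cleaned_text) == embeddings.shape[0]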
Building the RAG Chatbot
Now, let’s create our invest_app.py file with the main application code:
import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# util provides cos_sim, which works across sentence-transformers versions
from sentence_transformers import SentenceTransformer, util
import pickle

device = 'cuda' if torch.cuda.is_available() else 'cpu'
@st.cache_resource
def load_models(model_name="meta-llama/Llama-3.2-3B-Instruct"):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map=device,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_safetensors=True)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model.generation_config.pad_token_id = 128001  # Llama 3's <|end_of_text|> token
    return model, tokenizer
@st.cache_resource
def load_embedding_model(embedding_model_name="all-MiniLM-L12-v2"):
    return SentenceTransformer(embedding_model_name)

@st.cache_data
def load_data():
    with open("clean_text.pkl", "rb") as f:
        cleaned_text = pickle.load(f)
    embeddings = torch.load("invest_embeddings.pt").to(device)
    return cleaned_text, embeddings
model, tokenizer = load_models()
embedding_model = load_embedding_model()
cleaned_text, embeddings = load_data()
def RAG_INVEST(query):
    # Embed the query and score it against the pre-computed chunk embeddings
    query_encoded = embedding_model.encode([query], convert_to_tensor=True)
    # util.cos_sim works across sentence-transformers versions (model.similarity requires >= 3.0)
    similarities = util.cos_sim(query_encoded, embeddings)
    scores, top_5_indices = torch.topk(similarities[0], k=5)
    # Keep only positively scored chunks as context
    CONTEXT_TEXT = '\n'.join([cleaned_text[idx] for idx in top_5_indices if similarities[0][idx] > 0])
    conversation = [
        {"role": "user", "content": f'''{query}
You are a knowledgeable investment assistant. You only answer using the provided context, which contains educational content about investing and stock markets.
If the question is unrelated to investments, say "This question is not related to investment education."
You must provide accurate, beginner-friendly explanations that a high school student can understand.
Use only the context below when answering:
CONTEXT: {CONTEXT_TEXT}
Answer in plain English within 200 words.'''},
    ]
    # add_generation_prompt=True appends the assistant header so the model answers directly
    prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    # The template already includes the BOS token, so don't let the tokenizer add another
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=256
        )
    # Decode only the newly generated tokens, skipping the prompt
    processed_text = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    return processed_text, CONTEXT_TEXT
# Streamlit UI
st.title("📈 Investment Q&A Chatbot")
query = st.text_area("💬 Ask your investment question:", "What is the difference between stocks and mutual funds?")

if st.button("Get Answer") and query:
    with st.spinner("Searching through investment knowledge..."):
        response, CONTEXT_TEXT = RAG_INVEST(query)
    st.write(response)
    with st.expander("🔎 See Retrieved Context"):
        st.markdown(
            f'<div style="background-color:#f3f3f3; padding:10px; border-radius:5px;">{CONTEXT_TEXT}</div>',
            unsafe_allow_html=True
        )
Running the Application
Now let's run our Streamlit app on RunPod. Since the pod is a remote server, bind Streamlit to all network interfaces so it can be reached from outside the pod:

streamlit run invest_app.py --server.address 0.0.0.0 --server.port 8501

Then either expose port 8501 as an HTTP port in your pod's settings, or forward it to your local machine over SSH and open http://localhost:8501 in your browser:

ssh -L 8501:localhost:8501 runpod
Understanding the Code
Let’s break down the key components of our RAG system:
Caching Mechanisms
We use Streamlit's caching decorators to avoid reloading models between interactions (a toy contrast follows the list):
- @st.cache_resource: For models that should be loaded once and reused
- @st.cache_data: For data that can be reused between queries
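To illustrate the difference, here is a minimal sketch (the names get_model and get_chunks are illustrative, not part of the app):

import pickle
import streamlit as st
from sentence_transformers import SentenceTransformer

@st.cache_resource
def get_model():
    # Heavy, unpicklable object: loaded once per process and shared across sessions
    return SentenceTransformer("all-MiniLM-L12-v2")

@st.cache_data
def get_chunks(path):
    # Plain serializable data: cached per argument value and returned as a copy
    with open(path, "rb") as f:
        return pickle.load(f)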
Model Quantization
We use 4-bit quantization to reduce the memory footprint of our LLM:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
This allows running larger models on less powerful GPUs: at 4 bits, the 3B-parameter model's weights need roughly 1.5 to 2 GB of VRAM, versus about 6 GB in float16.
RAG Pipeline
Our RAG pipeline follows these steps (a standalone sketch of the retrieval half follows the list):
- Encode the user query into an embedding vector
- Compare this vector with pre-computed embeddings of our financial texts
- Retrieve the most relevant text chunks based on similarity scores
- Provide these chunks as context to the LLM
- Generate a targeted response based on this context
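The first three steps can be exercised on their own, which is handy for debugging retrieval quality. A minimal sketch, assuming the clean_text.pkl and invest_embeddings.pt artifacts from earlier and a hypothetical test query:

import pickle
import torch
from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer("all-MiniLM-L12-v2")
with open("clean_text.pkl", "rb") as f:
    cleaned_text = pickle.load(f)

query = "What is an index fund?"
query_encoded = embedding_model.encode([query], convert_to_tensor=True)
embeddings = torch.load("invest_embeddings.pt", map_location="cpu").to(query_encoded.device)

scores = util.cos_sim(query_encoded, embeddings)[0]
top_scores, top_indices = torch.topk(scores, k=3)
for score, idx in zip(top_scores, top_indices):
    print(f"{score:.3f}  {cleaned_text[idx][:100]}")

If the top chunks look unrelated to the query, retrieval (not generation) is where to focus your tuning.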
Prompt Engineering
We carefully craft our prompt (a system-message variant is sketched after this list) to ensure the model:
- Stays in character as an investment assistant
- Only uses the provided context
- Creates beginner-friendly explanations
- Politely declines unrelated questions
- Keeps responses concise (within 200 words)
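One common refinement, not used in the app above, is to move the instructions into a dedicated system message (which Llama 3 chat templates support) and leave the user turn for the question alone. A sketch of what the conversation list inside RAG_INVEST might look like under that design:

conversation = [
    {"role": "system", "content": (
        "You are a knowledgeable investment assistant. Answer only from the provided context. "
        'If the question is unrelated to investments, say "This question is not related to investment education." '
        "Explain in plain English a high school student can understand, within 200 words."
    )},
    {"role": "user", "content": f"CONTEXT: {CONTEXT_TEXT}\n\nQUESTION: {query}"},
]

This keeps the behavioral rules stable if you later extend the app to multi-turn chat.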
Testing and Improving
Test your chatbot with various investment questions:
- “What is dollar-cost averaging?”
- “Should I invest in stocks or bonds?”
- “How do ETFs work?”
- “What’s the difference between a bull and bear market?”
Watch how the system retrieves and uses different contexts. If you notice any issues:
- Poor retrieval: Improve your embedding model or text chunking strategy (see the overlap-chunking sketch after this list)
- Inaccurate answers: Refine your prompt structure or add more guardrails
- Slow responses: Consider further optimizing with model quantization or better caching
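For the retrieval point, one low-effort improvement is overlapping chunks, so sentences near a boundary appear in two chunks instead of being cut off from their context. A sketch using a sliding word window in place of the sentence-based splitter from earlier:

def chunk_text_overlap(text, max_words=100, overlap=20):
    words = text.split()
    chunks = []
    step = max_words - overlap  # slide the window forward, re-covering the last 20 words
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks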
Deployment Considerations
For a production deployment, consider:
- Authentication: Add user authentication for service access
- Rate limiting: Prevent abuse by implementing rate limits (a minimal sketch follows this list)
- Logging: Track usage patterns and identify improvement areas
- Content filtering: Ensure the system rejects inappropriate questions
- Regular updates: Refresh financial data to maintain accuracy
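As a taste of app-layer rate limiting, here is a minimal per-session cooldown using Streamlit's session state (a sketch only; COOLDOWN_SECONDS is an arbitrary policy, and production rate limiting usually belongs in a gateway or reverse proxy):

import time
import streamlit as st

COOLDOWN_SECONDS = 10  # hypothetical policy: one question per 10 seconds per session

def allow_request():
    now = time.time()
    last = st.session_state.get("last_request", 0.0)
    if now - last < COOLDOWN_SECONDS:
        st.warning("Please wait a few seconds between questions.")
        return False
    st.session_state["last_request"] = now
    return True

The button handler would then become: if st.button("Get Answer") and query and allow_request():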
Conclusion
Congratulations! You’ve built a financial RAG chatbot that leverages the power of LLaMA 3 to provide contextually relevant investment information. This system demonstrates how to combine retrieval and generation for more accurate and helpful AI responses.
The full code for this tutorial is available at https://github.com/seonokkim/invest-rag.
By using RunPod’s GPU capabilities, we’ve made development cost-effective while maintaining the performance needed for responsive AI applications. This approach can be extended to many other domains beyond finance — from healthcare to education or customer support.
Happy coding!
'Studies & Courses > NLP & Text Mining' 카테고리의 다른 글
[Text Mining] Text Classification (0) | 2020.05.24 |
---|---|
[Text Mining] Text Reprocessing (0) | 2020.04.02 |
[Text Mining] Introduction to Text Mining (0) | 2020.03.30 |
댓글