
Building a Financial RAG Chatbot Using LLaMA, Streamlit and RunPod (VSCode)

by Air’s Big Data 2025. 5. 16.

In this tutorial, we’ll walk through the process of building a Financial Question-Answering chatbot using Retrieval-Augmented Generation (RAG), LLaMA 3, and Streamlit. We’ll deploy our solution on RunPod’s GPU cloud environment for cost-effective development and testing.

What We’re Building

We’ll create an investment education chatbot that:

  • Retrieves relevant information from financial educational content
  • Uses the LLaMA 3 model to generate beginner-friendly responses
  • Has a simple, intuitive UI built with Streamlit
  • Runs efficiently on a GPU-accelerated environment

Prerequisites

  • Basic Python knowledge
  • Familiarity with NLP concepts (helpful but not required)
  • A RunPod account (or any other GPU cloud provider)
  • Access to the LLaMA 3 model (we’ll use meta-llama/Llama-3.2-3B-Instruct)

Setting Up Your RunPod Environment

RunPod provides cost-effective GPU environments perfect for AI development. Let’s walk through setting up the environment step by step:

Create an Account and Add Credits

Add Your SSH Public Key

To enable secure and password-less access from your local machine:

  • Generate an SSH key (if you don’t already have one):
ssh-keygen -t ed25519 -C "your_email@example.com"

 

Press Enter to accept the default path (~/.ssh/id_ed25519) and optionally add a passphrase.

  • Copy your public key:
cat ~/.ssh/id_ed25519.pub
  • Go to RunPod Dashboard → Settings → SSH Public Keys
  • Paste your copied public key
  • Click Update Public Key

This ensures your pod will automatically trust your device when connecting via SSH.

Deploy a GPU Pod

  • Go to the Pods tab → click + Deploy
  • Select the PyTorch template or RunPod VS Code Server if you want in-browser development
  • Choose a GPU with at least 16GB VRAM (e.g., T4, A10, or 3090)
  • Under instance settings, check: SSH Terminal Access, Start Jupyter Notebook (optional)

Connect via VSCode SSH

Once your pod is deployed:

  1. In the Pod Dashboard, click Connect → Connection Options
  2. Choose SSH over exposed TCP
  3. Copy the SSH connection string (e.g., ssh root@213.xxx.xxx.xxx -p 40000 -i ~/.ssh/id_ed25519)
  4. Open VSCode → Remote Explorer → Add New SSH Host
  5. Paste the copied string or add it to your ~/.ssh/config:
Host runpod
  HostName 213.xxx.xxx.xxx
  User root
  Port 40000
  IdentityFile ~/.ssh/id_ed25519

  6. In Remote Explorer, click on the runpod host to connect

 

 

Project Setup

Once connected to your RunPod via VSCode, let’s set up our project structure:

mkdir invest-rag
cd invest-rag
touch invest_app.py requirements.txt README.md .gitignore

Create a basic .gitignore file:

venv/

 

Install the required packages:

pip install streamlit torch transformers sentence-transformers accelerate bitsandbytes

 

Create your requirements.txt:

streamlit==1.35.0
torch==2.1.0
transformers==4.36.0
sentence-transformers==2.5.0
accelerate==0.25.0
bitsandbytes==0.41.1

 

streamlit==1.35.0

A Python framework for building interactive web apps for data science and machine learning projects. Used to build and serve the chatbot UI in your project.
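
torch==2.1.0

The core deep learning framework providing tensors and GPU execution. Both transformers and sentence-transformers run on top of it, and it is also used here to save and load the embedding tensor.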

transformers==4.36.0

A Hugging Face library that provides pre-trained models like LLaMA, BERT, GPT, etc. Enables loading and running large language models (LLMs) for generating answers.

sentence-transformers==2.5.0

A library that extends Hugging Face Transformers for computing dense sentence embeddings. Used to encode chunks of financial text and queries for semantic similarity.

accelerate==0.25.0

A utility library from Hugging Face that simplifies device placement (e.g., GPU/CPU) and multi-GPU training or inference. Required when using device_map="auto" or quantized models.

bitsandbytes==0.41.1

A lightweight CUDA-based library for quantization-aware inference (e.g., 4-bit/8-bit). Enables running large models like LLaMA with lower memory usage using 4-bit quantization.

 

Preparing Your Investment Dataset

Here’s how to prepare clean_text.pkl and invest_embeddings.pt from your raw financial education text, following the flow of this project:

  • clean_text.pkl → cleaned text chunks (Python list)
  • invest_embeddings.pt → tensor of embeddings (PyTorch tensor)

Load and Preprocess the Raw Text

You can start with a .txt or .pdf file. Here’s an example using plain text:

# Load raw data
with open("financial_education.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

If you’re using PDFs, use PyPDF2:

import PyPDF2
reader = PyPDF2.PdfReader("your_file.pdf")
raw_text = ""
for page in reader.pages:
    raw_text += page.extract_text()

 

Chunk the Text

Break long documents into manageable chunks for embedding.

def chunk_text(text, max_words=100):
    sentences = text.split('.')
    chunks = []
    current_chunk = []

    for sentence in sentences:
        current_chunk.append(sentence.strip())
        if len(" ".join(current_chunk).split()) > max_words:
            chunks.append(" ".join(current_chunk))
            current_chunk = []

    # Keep whatever is left over as the final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

chunk_list = chunk_text(raw_text)

 

Clean the Chunks (optional)

You can clean the chunks with a simple LLM-based function or regular expressions to remove unwanted artifacts (page numbers, headers, etc.).

cleaned_chunks = [chunk.replace('\n', ' ').strip() for chunk in chunk_list]

You may also use an LLM for more sophisticated cleaning if simple string operations are not enough.
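
If regular expressions are enough for your source, here is a minimal sketch; the page-number pattern is an assumption about how your particular document was extracted (isolated one- to three-digit numbers), so adjust it to your own material:

import re

def clean_chunk(chunk):
    # Collapse newlines and repeated whitespace into single spaces
    text = re.sub(r'\s+', ' ', chunk)
    # Drop standalone page numbers -- assumes they survived extraction as isolated digits
    text = re.sub(r'\s\d{1,3}\s', ' ', text)
    return text.strip()

cleaned_chunks = [clean_chunk(chunk) for chunk in chunk_list]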

Save the Cleaned Text

import pickle
with open("clean_text.pkl", "wb") as f:
    pickle.dump(cleaned_chunks, f)

 

Generate Embeddings

Use SentenceTransformer to encode the cleaned chunks:

from sentence_transformers import SentenceTransformer
import torch

embedding_model = SentenceTransformer("all-MiniLM-L12-v2")

embeddings = embedding_model.encode(
    cleaned_chunks,
    convert_to_tensor=True,
    normalize_embeddings=True
)

 

Save the Embeddings

torch.save(embeddings, "invest_embeddings.pt")
print("Saved invest_embeddings.pt")

After running the script, your folder will contain:

  • clean_text.pkl: list of cleaned text chunks
  • invest_embeddings.pt: dense tensor of sentence embeddings

You can now load them in your Streamlit app using:

with open("clean_text.pkl", "rb") as f:
    cleaned_text = pickle.load(f)
embeddings = torch.load("invest_embeddings.pt").to(device)

 

Building the RAG Chatbot

Now, let’s create our invest_app.py file with the main application code:

import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util
import pickle

device = 'cuda' if torch.cuda.is_available() else 'cpu'

@st.cache_resource
def load_models(model_name="meta-llama/Llama-3.2-3B-Instruct"):
    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, 
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map=device,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model.generation_config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer

@st.cache_resource
def load_embedding_model(embedding_model_name="all-MiniLM-L12-v2"):
    return SentenceTransformer(embedding_model_name)

@st.cache_data
def load_data():
    with open("clean_text.pkl", "rb") as f:
        cleaned_text = pickle.load(f)
    embeddings = torch.load("invest_embeddings.pt").to(device)
    return cleaned_text, embeddings

model, tokenizer = load_models()
embedding_model = load_embedding_model()
cleaned_text, embeddings = load_data()

def RAG_INVEST(query):
    query_encoded = embedding_model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    # util.cos_sim works with the pinned sentence-transformers 2.x release;
    # model.similarity() only exists from version 3.0 onwards
    similarities = util.cos_sim(query_encoded, embeddings)
    scores, top_5_indices = torch.topk(similarities[0], k=5)

    CONTEXT_TEXT = '\n'.join([cleaned_text[idx] for idx in top_5_indices if similarities[0][idx] > 0])

    conversation = [
        {"role": "user", "content": f'''{query}
You are a knowledgeable investment assistant. You only answer using the provided context, which contains educational content about investing and stock markets.
If the question is unrelated to investments, say "This question is not related to investment education."
You must provide accurate, beginner-friendly explanations that a high school student can understand.
Use only the context below when answering:
CONTEXT: {CONTEXT_TEXT}
Answer in plain English within 200 words.'''},
    ]

    prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            do_sample=True,
            max_new_tokens=256
        )

    # Decode only the newly generated tokens, skipping the prompt itself
    processed_text = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return processed_text, CONTEXT_TEXT

# Streamlit UI
st.title("📈 Investment Q&A Chatbot")

query = st.text_area("💬 Ask your investment question:", "What is the difference between stocks and mutual funds?")

if st.button("Get Answer") and query:
    with st.spinner("Searching through investment knowledge..."):
        response, CONTEXT_TEXT = RAG_INVEST(query)
    st.write(response)
    with st.expander("🔎 See Retrieved Context"):
        st.markdown(
            f'<div style="background-color:#f3f3f3; padding:10px; border-radius:5px;">{CONTEXT_TEXT}</div>',
            unsafe_allow_html=True
        )

 

Running the Application

Now let’s run our Streamlit app on RunPod. Since the pod is a remote server, the Streamlit port (8501 by default) needs to be reachable from your local machine: either expose it as an HTTP port in your pod settings, or start the app from the VSCode integrated terminal and let VSCode forward the port automatically:

streamlit run invest_app.py --server.port 8501 --server.address 0.0.0.0
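
If you prefer an explicit tunnel instead, you can forward the port over SSH from your local machine (this assumes the runpod host alias defined earlier in ~/.ssh/config):

ssh -L 8501:localhost:8501 runpod

With the tunnel open, the app is available at http://localhost:8501 in your local browser.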

 

Understanding the Code

Let’s break down the key components of our RAG system:

Caching Mechanisms

We use Streamlit’s caching decorators to avoid reloading models between interactions:

  • @st.cache_resource: For models that should be loaded once and reused
  • @st.cache_data: For data that can be reused between queries

Model Quantization

We use 4-bit quantization to reduce the memory footprint of our LLM:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

This allows running larger models on less powerful GPUs.
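
As a rough back-of-the-envelope check: a 3B-parameter model needs about 3B × 2 bytes ≈ 6 GB for its weights in float16, while 4-bit NF4 brings that down to roughly 3B × 0.5 bytes ≈ 1.5 GB, plus some extra memory for activations and the KV cache during generation.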

 

RAG Pipeline

Our RAG pipeline follows these steps (a condensed sketch follows the list):

  1. Encode the user query into an embedding vector
  2. Compare this vector with pre-computed embeddings of our financial texts
  3. Retrieve the most relevant text chunks based on similarity scores
  4. Provide these chunks as context to the LLM
  5. Generate a targeted response based on this context
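
Condensed, the retrieval half of the pipeline (steps 1–3) boils down to a few lines; steps 4 and 5 are the prompt construction and generate() call already shown in RAG_INVEST:

def retrieve_context(query, k=5):
    # Step 1: encode the query into an embedding vector
    query_vec = embedding_model.encode([query], convert_to_tensor=True)
    # Step 2: compare it against the pre-computed chunk embeddings
    sims = util.cos_sim(query_vec, embeddings)[0]
    # Step 3: keep the top-k chunks by similarity score
    scores, indices = torch.topk(sims, k=k)
    return "\n".join(cleaned_text[i] for i in indices)

# Steps 4-5: build the prompt from this context and call model.generate(), as in RAG_INVEST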

 

Prompt Engineering

We carefully craft our prompt to ensure the model:

  • Stays in character as an investment assistant
  • Only uses the provided context
  • Creates beginner-friendly explanations
  • Politely declines unrelated questions
  • Keeps responses concise (within 200 words)

 

Testing and Improving

Test your chatbot with various investment questions:

  • “What is dollar-cost averaging?”
  • “Should I invest in stocks or bonds?”
  • “How do ETFs work?”
  • “What’s the difference between a bull and bear market?”

Watch how the system retrieves and uses different contexts. If you notice any issues:

  1. Poor retrieval: Improve your embedding model or text chunking strategy (see the inspection sketch after this list)
  2. Inaccurate answers: Refine your prompt structure or add more guardrails
  3. Slow responses: Consider further optimizing with model quantization or better caching
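
To diagnose retrieval quality, it helps to print the top chunks and their similarity scores for a test question. Here is a minimal sketch that reuses the objects already loaded in invest_app.py:

test_query = "What is dollar-cost averaging?"
query_vec = embedding_model.encode([test_query], convert_to_tensor=True)
sims = util.cos_sim(query_vec, embeddings)[0]
scores, indices = torch.topk(sims, k=5)

# Print each retrieved chunk with its cosine similarity score
for score, idx in zip(scores, indices):
    print(f"{score.item():.3f}  {cleaned_text[idx][:120]}...")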

 

Deployment Considerations

For a production deployment, consider:

  1. Authentication: Add user authentication for service access
  2. Rate limiting: Prevent abuse by implementing rate limits (a minimal sketch follows this list)
  3. Logging: Track usage patterns and identify improvement areas
  4. Content filtering: Ensure the system rejects inappropriate questions
  5. Regular updates: Refresh financial data to maintain accuracy
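
These topics are beyond the scope of this tutorial, but as an illustration of rate limiting, a very simple per-session limit can be added with st.session_state. This is only a sketch (the 10-requests-per-minute threshold is an arbitrary example), not a substitute for proper server-side rate limiting:

import time

WINDOW_SECONDS = 60            # length of the rate-limit window
MAX_REQUESTS_PER_WINDOW = 10   # arbitrary example threshold

if "request_times" not in st.session_state:
    st.session_state.request_times = []

def allow_request():
    now = time.time()
    # Keep only the timestamps that fall inside the current window
    st.session_state.request_times = [
        t for t in st.session_state.request_times if now - t < WINDOW_SECONDS
    ]
    if len(st.session_state.request_times) >= MAX_REQUESTS_PER_WINDOW:
        return False
    st.session_state.request_times.append(now)
    return True

# Call allow_request() before invoking RAG_INVEST in the button handler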

 

Conclusion

Congratulations! You’ve built a financial RAG chatbot that leverages the power of LLaMA 3 to provide contextually relevant investment information. This system demonstrates how to combine retrieval and generation for more accurate and helpful AI responses.

The full code for this tutorial is available at https://github.com/seonokkim/invest-rag.

By using RunPod’s GPU capabilities, we’ve made development cost-effective while maintaining the performance needed for responsive AI applications. This approach can be extended to many other domains beyond finance — from healthcare to education or customer support.

Happy coding!
