🤖 Understanding Large Language Models: Running and Analyzing a Quantized LLM on a Local Machine 🚀
Introduction 🎯
LLaMA (Large Language Model Meta AI) is Meta's family of openly released language models, and llama.cpp is an open-source framework that runs LLaMA-style models efficiently on local machines. In this guide, I will explain step by step how I set up and interacted with llama.cpp to run the Mistral-7B-Instruct model, explored its context window limitations, and experimented with follow-up questions.
Additionally, I conducted a detailed inference-level analysis to understand exactly how the model processes and retains information during long interactions. Soon, I will be pre-training the model to consistently remember values such as a = 5 and c = 6 throughout multiple questioning sessions, and I will also explore new methods to train the model instantaneously.
This article includes:
- 🛠️ Installing dependencies
- 🚀 Running Mistral-7B using llama.cpp
- 💬 Understanding how the model handles conversations
- 🔍 Investigating the effect of long conversations and context limits
- 🧠 Suggestions to extend model memory
- 🎯 Upcoming plans for fine-tuning and pre-training for better recall
- 🏆 Use Cases: I will be writing a separate article on the magical applications of this model and how it can be leveraged in different domains. Stay tuned! 😊
Understanding llama.cpp vs Mistral-7B 🔍
llama.cpp is a highly optimized C/C++ inference engine that runs LLaMA-family and other GGUF-format models (Mistral included) on consumer hardware, including CPUs, Apple Silicon, and NVIDIA GPUs.
Mistral-7B-Instruct is the actual model that I used for inference, which is loaded into llama.cpp for execution. While llama.cpp provides the environment to run various models, the specific behavior of the model depends on the weights used — in this case, Mistral-7B.
In simple terms:
llama.cpp = The software that runs large language models efficiently
Mistral-7B = The specific model I used for inference
This means that while I am using llama.cpp, I am actually running Mistral-7B for all interactions and experiments.
Prerequisites ✅
Before getting started, ensure that your system meets the following requirements:
- Operating System: Linux/macOS/Windows.
- Hardware: At least 16GB RAM (32GB recommended for larger models).
- GPU (Optional): A CUDA-compatible NVIDIA GPU for faster inference (quick checks shown below).
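If you want to verify these requirements before building, a couple of standard commands are enough (adjust for your OS):
# Linux: available RAM
free -h
# macOS: total RAM in bytes
sysctl hw.memsize
# Optional: confirm an NVIDIA GPU and its driver are visible
nvidia-smi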
Step 1: Clone the llama.cpp Repository 🖥️
To begin, clone the llama.cpp implementation repository:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Step 2: Install Dependencies ⚙️
You need to install the required dependencies. Here’s how:
For Ubuntu/Linux:
sudo apt update && sudo apt install build-essential cmake python3 python3-pip git
pip install torch numpy scipy
(These Python packages are only needed for llama.cpp's model-conversion scripts; building and running an already-quantized GGUF file requires just the C/C++ toolchain above.)
For macOS:
brew install cmake python3
For Windows:
- Install Visual Studio with C++ build tools
- Install Python 3.8+
- Use WSL (Windows Subsystem for Linux) for better compatibility (one-line install shown below)
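If you go the WSL route, installing Ubuntu from an elevated PowerShell prompt is a one-liner, after which the Linux steps above apply inside WSL:
# Run in PowerShell as Administrator; installs WSL 2 with an Ubuntu distro
wsl --install -d Ubuntu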
Step 3: Download and Use the Mistral-7B-Instruct Model 📥
For this experiment, I used the Mistral-7B-Instruct-v0.2.Q4_K_M.gguf model. After acquiring the model weights, move them into the models directory:
mkdir -p models && mv mistral-7b-instruct-v0.2.Q4_K_M.gguf models/
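If you do not already have the weights, one common source for this exact quantization is the TheBloke/Mistral-7B-Instruct-v0.2-GGUF repository on Hugging Face; treating that repository name as an assumption on my part, the download can be scripted with the Hugging Face CLI:
# Install the Hugging Face CLI (once)
pip install -U "huggingface_hub[cli]"
# Pull the Q4_K_M quantization straight into the models directory
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir models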
Step 4: Build llama.cpp 🔨
Navigate to the cloned llama.cpp directory and compile the project:
For CPU-only Inference
make
For macOS (Apple Silicon M1/M2)
make LLAMA_METAL=1
Note: All Apple M-series chips can use the GPU through the Metal API 🙂
For NVIDIA GPUs (CUDA Acceleration)
make LLAMA_CUBLAS=1
This will compile the necessary binaries for running the model efficiently on your system.
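One caveat from my own experience rather than the steps above: newer llama.cpp checkouts have been moving from the Makefile to CMake, and the GPU flags were renamed along the way (for example, LLAMA_CUBLAS became GGML_CUDA). If make fails on a recent revision, a CMake build along these lines should work; treat the exact flags as version-dependent:
# Configure and build with CMake (CPU-only by default)
cmake -B build
cmake --build build --config Release -j
# GPU variants, enabled at configure time:
# cmake -B build -DGGML_METAL=ON   (Apple Silicon / Metal)
# cmake -B build -DGGML_CUDA=ON    (NVIDIA / CUDA)
# The resulting binaries (llama-cli, llama-server, ...) land in build/bin/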
Step 5: Running Mistral-7B using llama.cpp 🏃‍♂️
After building the project, the compiled binaries (llama-cli, llama-server, and other tools) appear in the bin directory, and each offers a different mode of interaction; for now, I am using the CLI only. Run the Mistral model with a prompt:
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "You are my teacher."
To enter interactive mode, add the -i flag:
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -i
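A few additional llama-cli options are useful for the experiments below; this is just a sketch of a typical invocation, and defaults vary between versions:
# Interactive chat with a larger context window (-c), a per-reply token
# budget (-n), and a low temperature for more deterministic arithmetic
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -i -c 4096 -n 256 --temp 0.2 --color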
🧠 How It Works: Handling Follow-Up Questions 💬
Understanding Context Length and Memory 📏
When interacting with the model, each follow-up question is processed along with the previous conversation within the model's context window. That window is fixed when the model is loaded: llama.cpp uses a modest default (on the order of a couple of thousand tokens) unless you raise it with -c / --ctx-size, even though Mistral-7B-Instruct-v0.2 itself supports a much longer context. This means (see the command sketch right after this list):
- The model remembers past interactions as long as they fit within the context window.
- Once the conversation exceeds the context limit, older parts of the conversation are discarded.
- The model does not have built-in persistent memory — everything is forgotten when the session ends.
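In practice, the simplest way to delay this forgetting is to load the model with a larger window and pin the opening instructions so they survive when older turns are dropped. A sketch, assuming the flags behave as in mainline llama.cpp:
# 8192-token window; --keep pins the first 64 tokens of the initial prompt
# (e.g. "You are my teacher.") so they are not discarded when the
# conversation overflows and older turns get dropped
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -i -c 8192 --keep 64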
🔍 Step-by-Step Analysis of Follow-Up Questions
Step 1: First Question
> User: If a=2 and b=4, what is a+b?
[TOKENIZED INPUT]
Token 0: "If"
Token 1: "a"
Token 2: "="
Token 3: "2"
...
Token 12: "b"
Token 13: "+"
Token 14: "?"
[MODEL OUTPUT] "a+b=6"
Step 2: Follow-Up Question
> User: Now what is b-a?
[TOKENIZED INPUT]
Token 0: "If"
Token 1: "a"
Token 2: "="
Token 3: "2"
...
Token 14: "?"
Token 15: "Now"
Token 16: "what"
Token 17: "is"
Token 18: "b"
Token 19: "-"
Token 20: "a"
Token 21: "?"
[MODEL OUTPUT] "b-a=2"
🔹 Key Insight: Each new question is processed along with all previous interactions to ensure coherence.
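One way to see this mechanically, outside interactive mode, is to re-send the entire transcript yourself on every turn with llama-cli's -f (prompt file) option. A minimal sketch, using a hypothetical transcript.txt that grows with each exchange:
# Turn 1: record the first exchange
cat > transcript.txt << 'EOF'
User: If a=2 and b=4, what is a+b?
Assistant: a+b=6
EOF
# Turn 2: append the follow-up and feed the WHOLE transcript back in.
# The model only "remembers" a and b because they are re-sent here.
echo "User: Now what is b-a?" >> transcript.txt
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -f transcript.txt -n 64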
Step 3: Long Conversation (Exceeding Context Window)
> User: Now what is a*b?
[MODEL OUTPUT] "I don't have previous values. Can you provide them again?"
🔹 Why? The model has a fixed context size. Older exchanges get removed when new ones arrive.
Screenshots (use the built-in server to explore the web UI 😊):
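For reference, the server is launched roughly like this (the port and context size are my own arbitrary choices); the UI then appears in the browser at http://localhost:8080, and the same endpoint exposes an OpenAI-compatible API:
# Start the HTTP server with the bundled web UI
./llama-server -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -c 4096 --host 127.0.0.1 --port 8080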
🚀 Future Work: Exploring Instant Training for Better Recall
I am planning to explore methods to train the model instantly so it can remember values such as a=5 and c=6 across multiple follow-ups without losing context. This will involve the following (a simple prompt-based stopgap is also sketched right after this list):
- Experimenting with on-the-fly learning techniques 🧠
- Testing fast fine-tuning within a session
- Analyzing real-time reinforcement learning
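Until that instant-training work lands, the usual non-training workaround for extending memory is to keep the critical facts in a small file and prepend them to every prompt, so they always sit inside the context window. A rough sketch of that stopgap, not the instant-training approach itself:
# facts.txt holds the values the model must never forget
cat > facts.txt << 'EOF'
Remember these facts for the whole session: a=5 and c=6.
EOF
# Prepend the facts to each new question before sending it to the model
QUESTION="What is a+c?"
PROMPT="$(cat facts.txt) Question: $QUESTION"
./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -p "$PROMPT" -n 64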
By following these steps, you can run the model locally, interact with it, and understand its memory constraints. Next goal: Find ways to train the model instantly! 🚀