🤖 Understanding Large Language Models: Running and Analyzing a Quantized LLM on a Local Machine 🚀

Partha Sai Guttikonda

Introduction 🎯

LLaMA (Large Language Model Meta AI) is Meta's family of open language models, and llama.cpp is the open-source framework that makes it possible to run such models efficiently on local machines. In this guide, I explain step by step how I set up and interacted with llama.cpp to run the Mistral-7B-Instruct model, explored its context window limitations, and experimented with follow-up questions.
Additionally, I conducted a detailed inference-level analysis to understand exactly how the model processes and retains information during long interactions. Soon, I will be pre-training the model to consistently remember values such as a = 5 and c = 6 across multiple questioning sessions, and I will also explore new methods to train the model instantaneously.

This article includes:

  • 🛠️ Installing dependencies
  • 🚀 Running Mistral-7B using llama.cpp
  • 💬 Understanding how the model handles conversations
  • 🔍 Investigating the effect of long conversations and context limits
  • 🧠 Suggestions to extend model memory
  • 🎯 Upcoming plans for fine-tuning and pre-training for better recall
  • 🏆 Use Cases: I will be writing a separate article on the magical applications of this model and how it can be leveraged in different domains. Stay tuned! 😊

Understanding llama.cpp vs Mistral-7B 🔍

llama.cpp is a highly optimized inference engine that enables running LLaMA-based models on consumer hardware, including CPUs, Apple Silicon, and NVIDIA GPUs.

Mistral-7B-Instruct is the actual model that I used for inference, which is loaded into llama.cpp for execution. While llama.cpp provides the environment to run various models, the specific behavior of the model depends on the weights used — in this case, Mistral-7B.

In simple terms:

llama.cpp = The software that runs large language models efficiently
Mistral-7B = The specific model I used for inference

This means that while I am using llama.cpp, I am actually running Mistral-7B for all interactions and experiments.

Prerequisites ✅

Before getting started, ensure that your system meets the following requirements:

  • Operating System: Linux/macOS/Windows.
  • Hardware: At least 16GB RAM (32GB recommended for larger models).
  • GPU (Optional): A CUDA-compatible NVIDIA GPU for faster inference.
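
On Linux, you can quickly sanity-check the hardware side with the commands below (macOS and Windows have their own equivalents; nvidia-smi is only relevant if an NVIDIA driver is installed):

free -h       # shows total and available RAM
nvidia-smi    # lists any CUDA-capable NVIDIA GPU, its driver version, and VRAM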

Step 1: Clone the llama.cpp Repository 🖥️

To begin, clone the llama.cpp implementation repository:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Step 2: Install Dependencies ⚙️

You need to install the required dependencies. Here’s how:

For Ubuntu/Linux:

sudo apt update && sudo apt install build-essential cmake python3 python3-pip git
pip install torch numpy scipy

For macOS:

brew install cmake python3

For Windows:

  1. Install Visual Studio with C++ build tools
  2. Install Python 3.8+
  3. Use WSL (Windows Subsystem for Linux) for better compatibility
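
If you go the WSL route, a minimal setup sketch (assuming Windows 11 or a recent Windows 10 build, run from an elevated PowerShell) is:

wsl --install -d Ubuntu   # installs WSL with an Ubuntu distribution; after a reboot, follow the Linux steps above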

Step 3: Download and Use the Mistral-7B-Instruct Model 📥

For this experiment, I used the Mistral-7B-Instruct-v0.2.Q4_K_M.gguf model. After acquiring the model weights, move them into the models directory:

mkdir models && mv mistral-7b-instruct-v0.2.Q4_K_M.gguf models/
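
If you don't already have the weights locally, one way to fetch them (assuming the quantized GGUF file is hosted in the TheBloke/Mistral-7B-Instruct-v0.2-GGUF repository on Hugging Face) is the Hugging Face CLI:

pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir models
# downloads only the Q4_K_M quantization straight into the models/ directory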

Step 4: Build llama.cpp 🔨

Navigate to the cloned llama.cpp directory and compile the project:

For CPU-only Inference

make

For macOS (Apple Silicon M1/M2)

make LLAMA_METAL=1
Note: all Apple M-series chips can use the GPU through the Metal API 🙂

For NVIDIA GPUs (CUDA Acceleration)

make LLAMA_CUBLAS=1

This will compile the necessary binaries for running the model efficiently on your system.
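
Note that more recent llama.cpp checkouts have moved to a CMake-based build, so if the Makefile targets above are unavailable in your version, a rough equivalent (flag names as I understand the current build system) is:

cmake -B build
cmake --build build --config Release -j
# for CUDA, configure with: cmake -B build -DGGML_CUDA=ON
# the resulting binaries (llama-cli, llama-server, ...) end up in build/bin/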

Step 5: Running Mistral-7B using llama.cpp 🏃‍♂️

After building the project, the compiled binaries cover several modes of interaction (CLI, server, and more); for now I am using the CLI only. Run the Mistral model with a prompt:

./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "You are my teacher."

To enter interactive (chat) mode (depending on your llama.cpp version, you may need to add the -i or -cnv flag explicitly):

./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
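
A few runtime flags I found useful when launching the CLI (the numbers below are illustrative, not tuned values):

./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -cnv -ngl 32 -n 256 --color
# -cnv    : conversation (chat) mode
# -ngl 32 : offload 32 layers to the GPU (only useful on Metal/CUDA builds)
# -n 256  : cap the number of tokens generated per reply
# --color : colorize the prompt vs. the generated text in the terminal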

🧠 How It Works: Handling Follow-Up Questions 💬

Understanding Context Length and Memory 📏

When interacting with the model, each follow-up question is processed together with the previous conversation inside the model's context window. llama.cpp uses whatever context size is set at launch (a few thousand tokens by default, adjustable with the -c flag), even though Mistral-7B-Instruct-v0.2 itself was trained for a much larger window. This means:

  1. The model remembers past interactions as long as they fit within the context window.
  2. Once the conversation exceeds the context limit, older parts of the conversation are discarded.
  3. The model does not have built-in persistent memory — everything is forgotten when the session ends.
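
To give the conversation more room before older turns fall out of the window, the context size can be raised at launch. A minimal sketch, where 8192 is an arbitrary illustrative value bounded by your available RAM and the model's trained limit:

./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -cnv -c 8192
# -c 8192 sets the context window to 8192 tokens for this session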

🔍 Step-by-Step Analysis of Follow-Up Questions

Step 1: First Question

> User: If a=2 and b=4, what is a+b?
[TOKENIZED INPUT]
Token 0: "If"
Token 1: "a"
Token 2: "="
Token 3: "2"
...
Token 12: "b"
Token 13: "+"
Token 14: "?"
[MODEL OUTPUT] "a+b=6"

Step 2: Follow-Up Question

> User: Now what is b-a?
[TOKENIZED INPUT]
Token 0: "If"
Token 1: "a"
Token 2: "="
Token 3: "2"
...
Token 14: "?"
Token 15: "Now"
Token 16: "what"
Token 17: "is"
Token 18: "b"
Token 19: "-"
Token 20: "a"
Token 21: "?"
[MODEL OUTPUT] "b-a=2"

🔹 Key Insight: Each new question is processed along with all previous interactions to ensure coherence.

Step 3: Long Conversation (Exceeding Context Window)

> User: Now what is a*b?
[MODEL OUTPUT] "I don't have previous values. Can you provide them again?"

🔹 Why? The model has a fixed context size; once the window fills up, the oldest tokens are dropped to make room for new ones, so the original values of a and b are no longer visible to the model.
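
One partial mitigation built into llama.cpp is the --keep flag, which pins the first N tokens of the initial prompt so they survive when older context is discarded. A sketch with an illustrative value:

./llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -cnv --keep 64
# --keep 64 preserves the first 64 tokens of the initial prompt (e.g., the a and b definitions)
# even after the rest of the conversation is trimmed to fit the context window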

Screenshots (you can also use llama-server to explore the web UI 😊):

[local screenshot of the CLI session]

🚀 Future Work: Exploring Instant Training for Better Recall

I am planning to explore methods to train the model instantly so it can remember values such as a=5 and c=6 across multiple follow-ups without losing context. This will involve:

  • Experimenting with on-the-fly learning techniques 🧠
  • Testing fast fine-tuning within a session
  • Analyzing real-time reinforcement learning

By following these steps, you can run the model locally, interact with it, and understand its memory constraints. Next goal: Find ways to train the model instantly! 🚀
