
Context Engineering Part 3: Beyond Message Counting

Not all messages are equal. So why treat them that way?


In Part 2, we saw how sliding windows fail by treating all messages as equally disposable. Now we'll explore token-based management—a smarter approach that considers actual content size rather than just message count.

The Key Insight: Messages Aren't Equal in Size

Sliding windows drop the oldest message, regardless of size. But in reality:

A one-word reply and a 500-word code block are very different in terms of token usage.
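
A quick way to see the difference is to count tokens directly. Here is a small check using tiktoken (exact counts vary by model encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5-turbo encoding

short_reply = "Sure."
long_reply = "word " * 500   # stand-in for a 500-word code block

print(len(enc.encode(short_reply)))   # ~2 tokens
print(len(enc.encode(long_reply)))    # ~500 tokens

Dropping one of these frees up a wildly different amount of space, yet a sliding window treats them as interchangeable.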

How Token-based Dropping Works

Implementation

import tiktoken

# Tokenizer for OpenAI models; swap in the encoding that matches your model
tokenizer = tiktoken.get_encoding("cl100k_base")


class TokenBasedManager:
    def __init__(self, max_tokens=4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._trim_context()

    def _trim_context(self):
        total = self._count_tokens(self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            # Remove the oldest non-system message
            removed = None
            for i, msg in enumerate(self.messages):
                if msg['role'] != 'system':
                    removed = self.messages.pop(i)
                    total -= self._count_tokens([removed])
                    break
            if removed is None:
                # Only system messages remain; nothing left to drop
                break

    def _count_tokens(self, messages):
        # Count tokens in message contents (role/formatting overhead ignored)
        return sum(len(tokenizer.encode(m['content'])) for m in messages)
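
A quick usage sketch (a hypothetical conversation, assuming the class above): once the running total exceeds the budget, the oldest non-system messages are dropped first while the system prompt survives.

manager = TokenBasedManager(max_tokens=300)
manager.add_message("system", "You are a helpful assistant.")
for i in range(10):
    manager.add_message("user", f"Question {i}: " + "some detail " * 20)
    manager.add_message("assistant", f"Answer {i}: " + "some explanation " * 20)

# Each turn is roughly 40-50 tokens, so only the most recent few turns fit
# in the 300-token budget; older turns were dropped, the system prompt was kept
print(len(manager.messages), [m["role"] for m in manager.messages][:3])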

Problems with Token-based Dropping

Token-based management is more precise, but introduces its own issues:

Problem 1: Unequal Treatment of Messages

Problem 2: Broken Conversation Pairs
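
Dropping by token count alone can remove a user message while keeping the assistant reply that answered it (or vice versa), leaving orphaned turns that confuse the model. One mitigation is to drop turns as pairs; here is a minimal sketch (the helper below is hypothetical, not part of the manager above):

def drop_oldest_exchange(messages):
    # Drop the oldest user message together with the assistant reply that
    # follows it, so neither half of the exchange is left orphaned
    for i, msg in enumerate(messages):
        if msg['role'] == 'user':
            end = i + 1
            if end < len(messages) and messages[end]['role'] == 'assistant':
                end += 1
            return messages[:i] + messages[end:]
    return messages  # nothing droppable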

Problem 3: Token Counting Overhead

# Every trim operation requires tokenizing ALL messages
# This is computationally expensive

def count_tokens(messages):
    total = 0
    for msg in messages:
        total += len(tokenizer.encode(msg['content']))  # Expensive!
    return total

# Optimization: cache token counts per message so trimming never re-encodes history
class OptimizedTokenManager:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = []       # (message, token_count) tuples
        self.total_tokens = 0

    def add_message(self, message):
        # Tokenize each message exactly once, at insertion time
        count = len(tokenizer.encode(message['content']))
        self.messages.append((message, count))
        self.total_tokens += count
        self._trim()

    def _trim(self):
        # Drop the oldest entry until we fit; note this version does not
        # protect the system message the way TokenBasedManager does
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            removed_msg, removed_count = self.messages.pop(0)
            self.total_tokens -= removed_count
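
A quick usage sketch (hypothetical values, assuming the same tokenizer as above): each message is tokenized exactly once, and trimming only subtracts cached counts.

manager = OptimizedTokenManager(max_tokens=100)
manager.add_message({"role": "user", "content": "How do I parse the config file?"})
manager.add_message({"role": "assistant", "content": "You can use the yaml module. " * 20})

# The long reply pushed the total over 100 tokens, so the older message was
# dropped using its cached count; nothing was re-encoded during trimming
print(manager.total_tokens, len(manager.messages))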

Smart Token Allocation

A better approach—budget your tokens like you budget money:

Implementation

def smart_token_allocation(messages, max_tokens=4000):
    # Define budget zones (the four budgets sum to the 4000-token limit)
    system_budget     = 500
    recent_budget     = 1500
    historical_budget = 1500
    response_buffer   = 500   # Leave room for the model's response

    system_msgs = [m for m in messages if m['role'] == 'system']
    other_msgs  = [m for m in messages if m['role'] != 'system']

    # Recent = the last few messages, kept only as far as the recent budget allows
    recent_msgs = select_within_budget(other_msgs[-5:], recent_budget)

    # Historical = important older messages selected by relevance
    older_msgs      = other_msgs[:-5]
    historical_msgs = select_important(older_msgs, historical_budget)

    return system_msgs + historical_msgs + recent_msgs
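
The two selection helpers are not spelled out above; here is a minimal sketch of what they might look like, reusing the tokenizer from the earlier examples and using message length as a stand-in for importance (a real system might score relevance with embeddings instead):

def select_within_budget(messages, budget):
    # Walk backwards from the newest message, keeping turns until the budget is spent
    selected, used = [], 0
    for msg in reversed(messages):
        cost = len(tokenizer.encode(msg['content']))
        if used + cost > budget:
            break
        selected.insert(0, msg)
        used += cost
    return selected

def select_important(messages, budget):
    # Placeholder importance heuristic: prefer longer messages, then restore
    # chronological order so the conversation still reads in sequence
    ranked = sorted(range(len(messages)), key=lambda i: len(messages[i]['content']), reverse=True)
    chosen, used = set(), 0
    for i in ranked:
        cost = len(tokenizer.encode(messages[i]['content']))
        if used + cost <= budget:
            chosen.add(i)
            used += cost
    return [messages[i] for i in sorted(chosen)]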

The Core Limitation

Both sliding window and token-based dropping share the same fundamental flaw: they discard information outright. Once a message is dropped, everything it contained is gone, no matter how relevant it becomes later in the conversation.

This limitation brings us to our next topic: summarization—compressing context without losing meaning.


Read Part 4: Summarization to learn how to compress context intelligently instead of dropping it entirely.

#AI #TokenManagement #ContextEngineering #LLM

Context Engineering

Part 3 of 4

Explore the critical discipline of Context Engineering for Large Language Models (LLMs). This comprehensive series dives deep into the fundamental constraints of context windows, why traditional approaches fail, and how to build robust, efficient, and intelligent AI systems that truly understand and remember. Learn about advanced techniques like summarization, RAG, tool use, and causal-aware context management to overcome AI amnesia and unlock the full potential of LLMs in production environments.
