
Context Engineering Part 3: Beyond Message Counting

Not all messages are equal. So why treat them that way?


In Part 2, we saw how sliding windows fail by treating all messages as equally disposable. Now we'll explore token-based management—a smarter approach that considers actual content size rather than just message count.

The Key Insight: Messages Aren't Equal in Size

Sliding windows drop the oldest message, regardless of size. But in reality:

A one-word reply and a 500-word code block are very different in terms of token usage.
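
A quick way to see the difference is to count tokens directly. Here is a small check using tiktoken (exact counts vary by model encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5-turbo encoding

short_reply = "Sure."
long_reply = "word " * 500   # stand-in for a 500-word code block

print(len(enc.encode(short_reply)))   # ~2 tokens
print(len(enc.encode(long_reply)))    # ~500 tokens

Dropping one of these frees up a wildly different amount of space, yet a sliding window treats them as interchangeable.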

How Token-based Dropping Works

Implementation

import tiktoken

# Tokenizer for OpenAI models; swap in the encoding that matches your model
tokenizer = tiktoken.get_encoding("cl100k_base")


class TokenBasedManager:
    def __init__(self, max_tokens=4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._trim_context()

    def _trim_context(self):
        total = self._count_tokens(self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            # Remove the oldest non-system message
            removed = None
            for i, msg in enumerate(self.messages):
                if msg['role'] != 'system':
                    removed = self.messages.pop(i)
                    total -= self._count_tokens([removed])
                    break
            if removed is None:
                # Only system messages remain; nothing left to drop
                break

    def _count_tokens(self, messages):
        # Count tokens in message contents (role/formatting overhead ignored)
        return sum(len(tokenizer.encode(m['content'])) for m in messages)
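
A quick usage sketch (a hypothetical conversation, assuming the class above): once the running total exceeds the budget, the oldest non-system messages are dropped first while the system prompt survives.

manager = TokenBasedManager(max_tokens=300)
manager.add_message("system", "You are a helpful assistant.")
for i in range(10):
    manager.add_message("user", f"Question {i}: " + "some detail " * 20)
    manager.add_message("assistant", f"Answer {i}: " + "some explanation " * 20)

# Each turn is roughly 40-50 tokens, so only the most recent few turns fit
# in the 300-token budget; older turns were dropped, the system prompt was kept
print(len(manager.messages), [m["role"] for m in manager.messages][:3])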

Problems with Token-based Dropping

Token-based management is more precise, but introduces its own issues:

Problem 1: Unequal Treatment of Messages

Problem 2: Broken Conversation Pairs
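
Dropping by token count alone can remove a user message while keeping the assistant reply that answered it (or vice versa), leaving orphaned turns that confuse the model. One mitigation is to drop turns as pairs; here is a minimal sketch (the helper below is hypothetical, not part of the manager above):

def drop_oldest_exchange(messages):
    # Drop the oldest user message together with the assistant reply that
    # follows it, so neither half of the exchange is left orphaned
    for i, msg in enumerate(messages):
        if msg['role'] == 'user':
            end = i + 1
            if end < len(messages) and messages[end]['role'] == 'assistant':
                end += 1
            return messages[:i] + messages[end:]
    return messages  # nothing droppable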

Problem 3: Token Counting Overhead

# Every trim operation requires tokenizing ALL messages
# This is computationally expensive

def count_tokens(messages):
    total = 0
    for msg in messages:
        total += len(tokenizer.encode(msg['content']))  # Expensive!
    return total

# Optimization: cache token counts per message so trimming never re-encodes history
class OptimizedTokenManager:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = []       # (message, token_count) tuples
        self.total_tokens = 0

    def add_message(self, message):
        # Tokenize each message exactly once, at insertion time
        count = len(tokenizer.encode(message['content']))
        self.messages.append((message, count))
        self.total_tokens += count
        self._trim()

    def _trim(self):
        # Drop the oldest entry until we fit; note this version does not
        # protect the system message the way TokenBasedManager does
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            removed_msg, removed_count = self.messages.pop(0)
            self.total_tokens -= removed_count
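
A quick usage sketch (hypothetical values, assuming the same tokenizer as above): each message is tokenized exactly once, and trimming only subtracts cached counts.

manager = OptimizedTokenManager(max_tokens=100)
manager.add_message({"role": "user", "content": "How do I parse the config file?"})
manager.add_message({"role": "assistant", "content": "You can use the yaml module. " * 20})

# The long reply pushed the total over 100 tokens, so the older message was
# dropped using its cached count; nothing was re-encoded during trimming
print(manager.total_tokens, len(manager.messages))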

Smart Token Allocation

A better approach—budget your tokens like you budget money:

Implementation

def smart_token_allocation(messages, max_tokens=4000):
    # Define budget zones (the four budgets sum to the 4000-token limit)
    system_budget     = 500
    recent_budget     = 1500
    historical_budget = 1500
    response_buffer   = 500   # Leave room for the model's response

    system_msgs = [m for m in messages if m['role'] == 'system']
    other_msgs  = [m for m in messages if m['role'] != 'system']

    # Recent = the last few messages, kept only as far as the recent budget allows
    recent_msgs = select_within_budget(other_msgs[-5:], recent_budget)

    # Historical = important older messages selected by relevance
    older_msgs      = other_msgs[:-5]
    historical_msgs = select_important(older_msgs, historical_budget)

    return system_msgs + historical_msgs + recent_msgs
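
The two selection helpers are not spelled out above; here is a minimal sketch of what they might look like, reusing the tokenizer from the earlier examples and using message length as a stand-in for importance (a real system might score relevance with embeddings instead):

def select_within_budget(messages, budget):
    # Walk backwards from the newest message, keeping turns until the budget is spent
    selected, used = [], 0
    for msg in reversed(messages):
        cost = len(tokenizer.encode(msg['content']))
        if used + cost > budget:
            break
        selected.insert(0, msg)
        used += cost
    return selected

def select_important(messages, budget):
    # Placeholder importance heuristic: prefer longer messages, then restore
    # chronological order so the conversation still reads in sequence
    ranked = sorted(range(len(messages)), key=lambda i: len(messages[i]['content']), reverse=True)
    chosen, used = set(), 0
    for i in ranked:
        cost = len(tokenizer.encode(messages[i]['content']))
        if used + cost <= budget:
            chosen.add(i)
            used += cost
    return [messages[i] for i in sorted(chosen)]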

The Core Limitation

Both sliding window and token-based dropping share the same fundamental flaw: they discard information outright. Once a message is dropped, everything it contained is gone, no matter how relevant it becomes later in the conversation.

This limitation brings us to our next topic: summarization—compressing context without losing meaning.


Read Part 4: Summarization to learn how to compress context intelligently instead of dropping it entirely.

#AI #TokenManagement #ContextEngineering #LLM

Context Engineering

Part 3 of 4

Explore the critical discipline of Context Engineering for Large Language Models (LLMs). This comprehensive series dives deep into the fundamental constraints of context windows, why traditional approaches fail, and how to build robust, efficient, and intelligent AI systems that truly understand and remember. Learn about advanced techniques like summarization, RAG, tool use, and causal-aware context management to overcome AI amnesia and unlock the full potential of LLMs in production environments.
