Context Engineering Part 3: Beyond Message Counting
Not all messages are equal. So why treat them that way?
I'm a Software Engineer with 7+ years of experience designing, implementing, and debugging software, including backend services, automation tools, and mobile SDKs.
I love building things and am currently building lessentext.com
In Part 2, we saw how sliding windows fail by treating all messages as equally disposable. Now we'll explore token-based management—a smarter approach that considers actual content size rather than just message count.
The Key Insight: Messages Aren't Equal in Size
Sliding windows drop the oldest message, regardless of size. But in reality:
A one-word reply and a 500-word code block are very different in terms of token usage.
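To get a concrete sense of the gap, here is a quick check with tiktoken (the exact numbers depend on the encoding; cl100k_base is assumed here):

import tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 family
enc = tiktoken.get_encoding("cl100k_base")

short_reply = "Sure."
long_reply = "Here is the full implementation:\n" + "def handler(event):\n    ...\n" * 50

print(len(enc.encode(short_reply)))   # a handful of tokens
print(len(enc.encode(long_reply)))    # hundreds of tokens

Dropping "the oldest message" tells you nothing about how much room you actually reclaimed.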
How Token-based Dropping Works
Implementation
import tiktoken

# Tokenizer for OpenAI models; cl100k_base covers the GPT-3.5/GPT-4 family
tokenizer = tiktoken.get_encoding("cl100k_base")

class TokenBasedManager:
    def __init__(self, max_tokens=4000):
        self.max_tokens = max_tokens
        self.messages = []

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._trim_context()

    def _trim_context(self):
        total = self._count_tokens(self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            # Remove the oldest non-system message
            for i, msg in enumerate(self.messages):
                if msg['role'] != 'system':
                    removed = self.messages.pop(i)
                    total -= self._count_tokens([removed])
                    break
            else:
                # Only system messages remain; nothing left to drop
                break

    def _count_tokens(self, messages):
        return sum(len(tokenizer.encode(m['content'])) for m in messages)
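A quick usage sketch (the message contents here are made up for illustration):

manager = TokenBasedManager(max_tokens=200)
manager.add_message("system", "You are a helpful assistant.")
manager.add_message("user", "Summarize this 500-word document: ...")
manager.add_message("assistant", "Here is the summary: ...")

# Once the running total exceeds 200 tokens, the oldest non-system
# message is dropped; the system prompt always survives.
print(len(manager.messages))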
Problems with Token-based Dropping
Token-based management is more precise, but introduces its own issues:
Problem 1: Unequal Treatment of Messages
Problem 2: Broken Conversation Pairs
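If a user message is dropped while the assistant reply that answers it stays, the remaining context reads as an answer with no question. A minimal sketch of one way around this, assuming messages alternate user/assistant after an optional system prompt (this pairing logic is my own illustration, not part of the manager above):

def trim_in_pairs(messages, max_tokens, count_tokens):
    # Assumes strict user/assistant alternation after the system prompt
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    while count_tokens(system + rest) > max_tokens and len(rest) >= 2:
        # Drop the oldest user/assistant pair together so no reply is orphaned
        rest = rest[2:]
    return system + rest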
Problem 3: Token Counting Overhead
# Every trim operation requires tokenizing ALL messages
# This is computationally expensive
def count_tokens(messages):
    total = 0
    for msg in messages:
        total += len(tokenizer.encode(msg['content']))  # Expensive!
    return total
# Optimization: Cache token counts per message
class OptimizedTokenManager:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = []  # list of (message, token_count) tuples
        self.total_tokens = 0

    def add_message(self, message):
        # Tokenize once, when the message arrives
        count = len(tokenizer.encode(message['content']))
        self.messages.append((message, count))
        self.total_tokens += count
        self._trim()

    def _trim(self):
        # The running total means no re-tokenizing on every trim
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            removed_msg, removed_count = self.messages.pop(0)
            self.total_tokens -= removed_count
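The trade-off: each message is encoded exactly once on insertion, so trimming becomes a cheap pop-and-subtract instead of re-encoding the whole history. Note that this simplified version also pops system messages; in practice you would skip them the way _trim_context does above.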
Smart Token Allocation
A better approach—budget your tokens like you budget money:
Implementation
def smart_token_allocation(messages, max_tokens=4000):
    # Define budget zones (system + recent + historical + buffer = max_tokens)
    system_budget = 500
    recent_budget = 1500
    historical_budget = 1500
    response_buffer = 500  # Leave room for the model's response

    system_msgs = [m for m in messages if m['role'] == 'system']
    other_msgs = [m for m in messages if m['role'] != 'system']

    # Recent = last N messages that fit in their budget
    recent_msgs = select_within_budget(other_msgs[-5:], recent_budget)

    # Historical = important older messages selected by relevance
    older_msgs = other_msgs[:-5]
    historical_msgs = select_important(older_msgs, historical_budget)

    return system_msgs + historical_msgs + recent_msgs
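The snippet above leaves select_within_budget and select_important undefined. Here is one possible interpretation, reusing the tiktoken tokenizer from earlier, with a deliberately naive relevance heuristic (longer messages kept first) standing in for whatever scoring you actually use:

def select_within_budget(messages, budget):
    # Walk backwards from the newest message, keeping as many as fit
    selected, used = [], 0
    for msg in reversed(messages):
        cost = len(tokenizer.encode(msg['content']))
        if used + cost > budget:
            break
        selected.insert(0, msg)  # keep chronological order
        used += cost
    return selected

def select_important(messages, budget):
    # Placeholder "importance" score: prefer longer messages.
    # In practice this is where embedding or keyword relevance would go.
    ranked = sorted(messages, key=lambda m: len(m['content']), reverse=True)
    selected, used = [], 0
    for msg in ranked:
        cost = len(tokenizer.encode(msg['content']))
        if used + cost <= budget:
            selected.append(msg)
            used += cost
    # Restore the original conversation order
    return [m for m in messages if m in selected]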
The Core Limitation
Both sliding window and token-based dropping share the same fundamental flaw: they discard information outright. Once a message is dropped, whatever it contained is gone, no matter how relevant it turns out to be later.
This limitation brings us to our next topic: summarization, or compressing context without losing meaning.
#AI #TokenManagement #ContextEngineering #LLM
