Clean Up Your LLM API Streaming With Word-Aware Chunking
If you’ve built APIs that stream LLM responses, you’ve probably run into this annoying issue: words getting cut off mid-stream. You know, when your UI shows something like:
Hel lo wo rld
Yeah, not great. While most frameworks give you streaming responses out of the box, they don’t care about word boundaries. But with a little buffering magic, we can fix this pretty easily.
The idea is simple: instead of forwarding text exactly as it arrives, we collect it until we have complete words before sending anything to the client.
Here’s how to do it:
def chunk_text(text: str) -> list[str]:
    """Split text into complete words and whitespace chunks"""
    chunks = []
    buffer = ""
    for char in text:
        buffer += char
        # When we hit whitespace, we can safely split
        if char.isspace():
            # If buffer is all whitespace, keep it together
            if buffer.isspace():
                chunks.append(buffer)
                buffer = ""
            # Otherwise split at the whitespace
            else:
                word = buffer[:-1]  # everything before whitespace
                chunks.append(word)
                chunks.append(char)  # the whitespace itself
                buffer = ""
    # Don't forget any remaining text
    if buffer:
        chunks.append(buffer)
    return chunks
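To make the splitting behavior concrete, here's what chunk_text gives you on a small string (my own quick check, not from the original post):

chunks = chunk_text("Hello world, how")
print(chunks)
# ['Hello', ' ', 'world,', ' ', 'how']

Note that the trailing "how" is returned even though it may only be a partial word; the streaming wrapper below uses exactly that to decide what to hold back: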
def stream_text(text_iterator):
    buffer = ""
    for text in text_iterator:
        buffer += text
        chunks = chunk_text(buffer)
        # Keep incomplete words in the buffer until more text arrives
        if chunks and not chunks[-1].isspace():
            buffer = chunks.pop()
        else:
            buffer = ""
        for chunk in chunks:
            yield {"type": "token", "content": chunk}
    # Flush whatever is left once the stream ends (usually the last word)
    if buffer:
        yield {"type": "token", "content": buffer}
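As a quick sanity check (again, my own example), feed in fragments the way LLM deltas tend to arrive and words come out whole, with the final flush catching the last one:

fragments = ["Hel", "lo wo", "rld"]
for event in stream_text(iter(fragments)):
    print(event)
# {'type': 'token', 'content': 'Hello'}
# {'type': 'token', 'content': ' '}
# {'type': 'token', 'content': 'world'}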
That’s all there is to it! When text comes in from your LLM, like a completion response (e.g. llm.complete(prompt)), it gets buffered until you have complete words, then sent along to the client. Words stay intact and your users get a smoother experience.
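If you want to see where this slots into an actual endpoint, here's a minimal sketch using FastAPI and Server-Sent Events. The llm_stream generator is just a stand-in for whatever streaming call your LLM client exposes, so swap in your own:

import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def llm_stream(prompt: str):
    # Stand-in for your real LLM client's streaming call
    yield from ["Hel", "lo wo", "rld"]

@app.get("/chat")
def chat(prompt: str):
    def event_stream():
        for event in stream_text(llm_stream(prompt)):
            yield f"data: {json.dumps(event)}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")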
The nice thing about this approach is that it’s pretty generic - you can use it with any kind of text streaming, not just LLM responses. Just wrap your stream with this buffer logic and you’re good to go. Users might not notice when it works perfectly, but they’ll definitely notice when words are getting chopped up!