How I added free article narration with Piper, then moved to Kokoro for quality
I wanted article narration on this site without adding API cost, a backend, or a heavy client-side player.
That meant the constraints were clear from the start:
- generation had to be local
- the workflow had to be free
- the site still had to stay static
- audio files had to stay small
- the reading experience had to remain great with or without audio
I started with Piper because it is simple, local, and easy to script. I later moved to Kokoro because quality matters more than raw convenience once people actually have to listen for a few minutes.
The implementation shape
The site is static Astro, so I treated narration as a build artifact, not a runtime feature.
The flow is straightforward:
- write the article in Markdown
- transform it into cleaner narration text
- synthesize audio locally
- compress it for the web
- render a native audio player only when audio exists
That keeps the frontend lean. There is no waveform library, no streaming layer, and no extra framework code just to play an article.
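The "only when audio exists" rule can be sketched as a build-time check. This is a minimal illustration in Python rather than the site's actual Astro code; the function name `audio_html`, the `/audio` URL prefix, and the `.opus` extension are all assumptions made for the example.

```python
from pathlib import Path


def audio_html(slug: str, audio_dir: Path, url_prefix: str = "/audio") -> str:
    """Return a native <audio> tag if narration exists for the slug, else ""."""
    # Hypothetical layout: one file per article, named after its slug.
    audio_file = audio_dir / f"{slug}.opus"
    if not audio_file.is_file():
        # No narration generated for this article: render nothing.
        return ""
    return f'<audio controls preload="none" src="{url_prefix}/{slug}.opus"></audio>'
```

Because the check happens at build time, articles without audio ship no player markup at all.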
The first version with Piper
Piper was the right first engine.
It gave me:
- local text-to-speech
- no API bill
- a scriptable CLI
- easy voice downloads
- small enough output for a portfolio site
For a static site, that is a strong starting point. I could generate files locally, commit or publish the outputs, and let the article page render a native <audio> player.
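Scripting Piper can be as small as piping text into its CLI. The sketch below assumes the `piper` executable is on the PATH and that it accepts `--model` and `--output_file`, with the text arriving on standard input; confirm against `piper --help` for your installed version. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def piper_command(model_path: Path, wav_path: Path) -> list[str]:
    """Build the Piper CLI invocation (flags assumed; verify with `piper --help`)."""
    return ["piper", "--model", str(model_path), "--output_file", str(wav_path)]


def synthesize(text: str, model_path: Path, wav_path: Path) -> None:
    """Send text to Piper on stdin and let it write a WAV file."""
    subprocess.run(
        piper_command(model_path, wav_path),
        input=text.encode("utf-8"),
        check=True,  # fail loudly if Piper exits with an error
    )
```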
The first lesson, though, was that a working TTS pipeline is not the same as a pleasant narration experience.
The real quality problem was not only the model
The earliest audio sounded too flat.
It was understandable, but it had the typical symptoms of “text was converted to sound” rather than “an article was narrated”:
- pacing felt uniform
- headings landed awkwardly
- wrapped Markdown lines hurt the rhythm
- transitions between ideas felt abrupt
- longer articles became tiring to listen to
That exposed an important lesson: text cleanup and chunking matter almost as much as the engine.
What improved Piper the most
Before changing models, I changed the pipeline.
1. Rebuild paragraphs from Markdown
Markdown source is often wrapped for readability in the editor. A TTS engine does not know that those are soft wraps. If you synthesize line by line, the voice inherits that awkward structure.
Joining consecutive lines back into real paragraphs improved flow immediately.
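A minimal sketch of that joining step, assuming a simplified Markdown subset: blank lines end a block, `#` lines are headings, and `- ` or `* ` starts a list item. The function name `rebuild_paragraphs` is hypothetical, and a real article would also need handling for code fences, links, and emphasis.

```python
def rebuild_paragraphs(markdown: str) -> list[str]:
    """Join soft-wrapped lines into one string per paragraph or list item."""
    blocks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the lines collected so far as a single block.
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            flush()                 # blank line ends the current block
        elif line.startswith("#"):
            flush()
            blocks.append(line)     # headings always stand alone
        elif line.startswith(("- ", "* ")):
            flush()
            current.append(line)    # a list marker starts a new block
        else:
            current.append(line)    # soft wrap: continue the current block
    flush()
    return blocks
```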
2. Generate per chunk, not in one long pass
I switched to chunk-based synthesis:
- title
- headings
- paragraphs
- list items
Then I concatenated the output with controlled silence between chunks.
This made the narration breathe more like spoken structure and less like a single flat block.
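One way to represent those chunks is a small typed record per block, so later steps can treat headings, paragraphs, and list items differently. This is a sketch under assumptions: the input is a list of already-joined blocks, and the `Chunk` type and `to_chunks` name are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    kind: str  # "title", "heading", "paragraph", or "list_item"
    text: str  # text to speak, with Markdown markers removed


def to_chunks(title: str, blocks: list[str]) -> list[Chunk]:
    """Label each block so it can be synthesized and spaced on its own."""
    chunks = [Chunk("title", title)]
    for block in blocks:
        if block.startswith("#"):
            chunks.append(Chunk("heading", block.lstrip("#").strip()))
        elif block.startswith(("- ", "* ")):
            chunks.append(Chunk("list_item", block[2:].strip()))
        else:
            chunks.append(Chunk("paragraph", block))
    return chunks
```

Each chunk would then be synthesized to its own audio file before concatenation.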
3. Add pauses intentionally
A small pause after a sentence is not enough for article narration. Headings, paragraphs, and lists need different spacing.
That was another useful reminder: prosody is partly model quality, but pacing is also an editorial choice.
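To illustrate, here is a sketch that joins per-chunk WAV files using Python's standard `wave` module and inserts a different silence after each chunk type. The pause values are placeholders, not the ones used on this site, and the code assumes all inputs are signed PCM (for example 16-bit) with identical format, where zero bytes are silence.

```python
import wave
from pathlib import Path

# Hypothetical pause lengths in seconds; tune these by ear.
PAUSE_AFTER = {"title": 1.0, "heading": 0.8, "paragraph": 0.5, "list_item": 0.3}


def concat_with_pauses(parts: list[tuple[str, Path]], out_path: Path) -> None:
    """Join (kind, wav_path) pairs into one WAV with silence after each part."""
    if not parts:
        raise ValueError("no chunks to join")
    with wave.open(str(parts[0][1]), "rb") as first:
        channels = first.getnchannels()
        width = first.getsampwidth()
        rate = first.getframerate()
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        for kind, path in parts:
            with wave.open(str(path), "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if fmt != (channels, width, rate):
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(src.getnframes()))
            # Zero bytes are silence for signed PCM samples.
            silent_frames = round(PAUSE_AFTER.get(kind, 0.5) * rate)
            out.writeframes(b"\x00" * (silent_frames * channels * width))
```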
4. Compress for spoken-word, not music
For the web output, I optimized around speech:
- mono audio
- Opus first
- loudness normalization
- slightly slower playback when needed
- no shipping of large WAV masters
That kept the files small without making the listening experience noticeably worse.
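A speech-oriented encode along those lines could be driven through ffmpeg. The sketch below only builds the command; it assumes an ffmpeg build with libopus, and the 32k bitrate and loudness targets are illustrative values, not the settings used for this site.

```python
def opus_command(src_wav: str, dst_opus: str,
                 bitrate: str = "32k", tempo: float = 1.0) -> list[str]:
    """Build an ffmpeg command for mono, loudness-normalized Opus speech."""
    # Illustrative loudness targets; adjust to taste.
    filters = ["loudnorm=I=-16:TP=-1.5:LRA=11"]
    if tempo != 1.0:
        # atempo changes speed without changing pitch (e.g. 0.95 = slightly slower).
        filters.append(f"atempo={tempo}")
    return [
        "ffmpeg", "-y", "-i", src_wav,
        "-ac", "1",            # mono is enough for a single voice
        "-ar", "48000",        # a sample rate Opus supports natively
        "-c:a", "libopus",
        "-b:a", bitrate,
        "-af", ",".join(filters),
        dst_opus,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, and the WAV master deleted once the Opus file has been verified.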
Where Piper hit its ceiling
After the pipeline improvements, Piper became much better.
But better structure did not fully solve voice quality. The remaining issue was prosody. Even with better chunking and a stronger voice, the delivery could still feel a little too even for long-form narration.
Piper remained good at:
- being local
- being free
- being simple to automate
- generating compact spoken-word files
But it was less convincing at sounding relaxed and natural across an entire article.
That was the point where I stopped asking “can this work?” and started asking “would I actually want to listen to this for five minutes?”
Why I moved to Kokoro
Kokoro made sense as the next quality step while staying local and free.
The tradeoff is simple:
- Piper is easier and lighter
- Kokoro sounds better
For article narration, the second point matters more.
The voice I targeted for comparison was George (bm_george), because I wanted a calmer, more natural long-form read than the Piper voices were giving me.
The shift to Kokoro also introduced a practical lesson that had nothing to do with sound quality: local tooling still has ecosystem constraints.
In my case, Kokoro required a compatible Python version and espeak-ng, while Piper fit more easily into the existing workflow. That is the kind of friction that is easy to ignore in a demo and impossible to ignore in a real authoring pipeline.
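A small preflight check can surface that friction before a long generation run starts. This is a sketch: the function name `preflight` is hypothetical, and the supported Python range is left as a parameter because it depends on the Kokoro release you install.

```python
import shutil
import sys


def preflight(min_python: tuple[int, int], max_python: tuple[int, int],
              tools: tuple[str, ...] = ("espeak-ng",)) -> list[str]:
    """Return a list of environment problems; an empty list means ready."""
    problems: list[str] = []
    current = sys.version_info[:2]
    # The supported range is supplied by the caller, not hard-coded here.
    if not (min_python <= current <= max_python):
        problems.append(
            f"Python {current[0]}.{current[1]} is outside the supported range "
            f"{min_python[0]}.{min_python[1]} to {max_python[0]}.{max_python[1]}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Running this at the top of the generation script turns an obscure import or runtime failure into a readable message.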
What I learned from the transition
Local and free is realistic
You do not need a paid TTS API to add article narration to a static site.
If the workflow is offline, scripted, and editorially selective, local generation is completely viable.
Quality is a product decision, not just an engineering one
The cheapest or simplest model is not automatically the right one if the output becomes part of the reading experience.
For article narration, “good enough” quality gets exposed very quickly.
The text pipeline matters a lot
Better narration came less from magical model settings and more from treating the article as spoken structure:
- clean paragraphs
- separate headings
- controlled pauses
- fewer noisy artifacts
Start with a ranked subset
Not every article deserves audio on day one.
The right move was to start with a small set of high-value articles and evaluate:
- quality
- file size
- workflow friction
- whether the feature actually improves the site
Storage matters once the archive grows
Audio files accumulate quickly. Even optimized speech files are still binary artifacts.
That is why the long-term direction is to move the audio off-repo and into R2, while keeping the site itself lean.
What I would recommend
If you want to add narrated articles to a static site with minimal cost, I would approach it in this order:
- start local
- clean the text before synthesis
- generate per paragraph or section
- optimize for spoken-word delivery
- test on a small set of important articles
- move to a better model if the voice quality is still not there
That sequence matters. Without the cleanup and chunking work, it is too easy to blame the engine for problems caused by the pipeline.
Where I landed
Piper was the right way to prove the concept.
Kokoro was the right next step once the goal changed from “generate audio” to “publish narration worth listening to.”
That is probably the cleanest summary of the whole experiment: the first version proved that free local narration was feasible, and the second version taught me that audio quality is part of product quality, not a bonus layer on top.
If you need help building features like this into a content product or developer site, get in touch.