How I added free article narration with Piper, then moved to Kokoro for quality

I wanted article narration on this site without adding API cost, a backend, or a heavy client-side player.

That meant the constraints were clear from the start:

generation had to be local
the workflow had to be free
the site still had to stay static
audio files had to stay small
the reading experience had to remain great with or without audio

I started with Piper because it is simple, local, and easy to script. I later moved to Kokoro because quality matters more than raw convenience once people actually have to listen for a few minutes.

The implementation shape

The site is static Astro, so I treated narration as a build artifact, not a runtime feature.

The flow is straightforward:

write the article in Markdown
transform it into cleaner narration text
synthesize audio locally
compress it for the web
render a native audio player only when audio exists

That keeps the frontend lean. There is no waveform library, no streaming layer, and no extra framework code just to play an article.

The first version with Piper

Piper was the right first engine.

It gave me:

local text-to-speech
no API bill
a scriptable CLI
easy voice downloads
small enough output for a portfolio site

For a static site, that is a strong starting point. I could generate files locally, commit or publish the outputs, and let the article page render a native <audio> player.
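As a rough illustration of what "scriptable CLI" can look like, here is a minimal sketch that shells out to Piper from Python. It assumes the `piper` executable is on the PATH, reads text from standard input, and accepts `--model` and `--output_file`; check those flags against your installed version. The function names and the `voice.onnx` file name are hypothetical.

```python
import subprocess
from pathlib import Path


def piper_command(model: Path, out_wav: Path) -> list[str]:
    """Build the Piper invocation (flags assumed from the Piper CLI; verify locally)."""
    return ["piper", "--model", str(model), "--output_file", str(out_wav)]


def synthesize(text: str, model: Path, out_wav: Path) -> None:
    """Pipe the narration text to Piper on stdin and write a WAV file."""
    subprocess.run(
        piper_command(model, out_wav),
        input=text.encode("utf-8"),
        check=True,  # raise if Piper exits with a non-zero status
    )


if __name__ == "__main__":
    # Hypothetical voice model file; use whichever voice you downloaded.
    synthesize("Hello from a static site.", Path("voice.onnx"), Path("hello.wav"))
```

Keeping the command builder separate from the call makes the invocation easy to inspect or log before anything is synthesized.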

The first lesson, though, was that a working TTS pipeline is not the same as a pleasant narration experience.

The real quality problem was not only the model

The earliest audio sounded too flat.

It was understandable, but it had the typical symptoms of “text was converted to sound” rather than “an article was narrated”:

pacing felt uniform
headings landed awkwardly
wrapped Markdown lines hurt the rhythm
transitions between ideas felt abrupt
longer articles became tiring to listen to

That exposed an important lesson: text cleanup and chunking matter almost as much as the engine.

What improved Piper the most

Before changing models, I changed the pipeline.

1. Rebuild paragraphs from Markdown

Markdown source is often wrapped for readability in the editor. A TTS engine does not know that those are soft wraps. If you synthesize line by line, the voice inherits that awkward structure.

Joining consecutive lines back into real paragraphs improved flow immediately.
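A minimal sketch of that joining step, assuming blank lines separate paragraphs, `#` marks headings, and `-`, `*`, `+`, or `1.` mark list items. The function name is hypothetical, and fenced code blocks and other Markdown constructs are deliberately not handled here.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+")
LIST_ITEM = re.compile(r"^([-*+]|\d+\.)\s+")


def rebuild_paragraphs(markdown: str) -> list[str]:
    """Join soft-wrapped lines into whole blocks (paragraphs, headings, list items)."""
    blocks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Emit the lines collected so far as one block.
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            flush()                  # a blank line ends the current block
        elif HEADING.match(line):
            flush()
            blocks.append(line)      # headings are always single-line blocks
        elif LIST_ITEM.match(line):
            flush()
            current.append(line)     # a new list item starts a new block
        else:
            current.append(line)     # soft wrap: continue the current block
    flush()
    return blocks


if __name__ == "__main__":
    sample = "# Title\n\nFirst line\nwrapped here.\n\n- item one\n  continued\n- item two\n"
    for block in rebuild_paragraphs(sample):
        print(repr(block))
```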

2. Generate per chunk, not in one long pass

I switched to chunk-based synthesis:

title
headings
paragraphs
list items

Then I concatenated the output with controlled silence between chunks.

This made the narration breathe more like spoken structure and less like a single flat block.
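The concatenation step can be sketched with Python's standard `wave` module. This assumes every chunk is a 16-bit PCM WAV with the same sample rate and channel count, so that zero bytes are valid silence. The function name and its arguments are hypothetical.

```python
import wave


def concat_with_silence(chunk_paths, pauses_ms, out_path):
    """Concatenate WAV chunks, adding pauses_ms[i] of silence after chunk i."""
    if not chunk_paths:
        raise ValueError("need at least one chunk")
    if len(chunk_paths) != len(pauses_ms):
        raise ValueError("one pause value is required per chunk")

    fmt = None  # (channels, sample width in bytes, frame rate) of the first chunk
    with wave.open(str(out_path), "wb") as out:
        for path, pause_ms in zip(chunk_paths, pauses_ms):
            with wave.open(str(path), "rb") as chunk:
                chunk_fmt = (
                    chunk.getnchannels(),
                    chunk.getsampwidth(),
                    chunk.getframerate(),
                )
                if fmt is None:
                    fmt = chunk_fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif chunk_fmt != fmt:
                    raise ValueError(f"format mismatch in {path}")
                out.writeframes(chunk.readframes(chunk.getnframes()))
            # Zero-valued samples are silence for signed 16-bit PCM.
            silent_frames = fmt[2] * pause_ms // 1000
            out.writeframes(b"\x00" * (silent_frames * fmt[0] * fmt[1]))
```

For example, `concat_with_silence(["title.wav", "p1.wav"], [800, 0], "article.wav")` would place 800 ms of silence between the two chunks; the file names and the 800 ms value are illustrative only.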

3. Add pauses intentionally

A small pause after a sentence is not enough for article narration. Headings, paragraphs, and lists need different spacing.

That was another useful reminder: prosody is partly model quality, but pacing is also an editorial choice.
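One way to make that editorial choice explicit is a small table of pause lengths per chunk type. The sketch below is illustrative: the millisecond values, the chunk type names, and the `pause_plan` helper are all assumptions, not measured or recommended settings.

```python
# Hypothetical pause lengths in milliseconds; tune these by ear.
PAUSE_AFTER_MS = {
    "title": 900,
    "heading": 700,
    "paragraph": 450,
    "list_item": 250,
}


def pause_plan(kinds: list[str], default_ms: int = 400) -> list[int]:
    """Return the silence to insert after each chunk, with none after the last."""
    plan = [PAUSE_AFTER_MS.get(kind, default_ms) for kind in kinds]
    if plan:
        plan[-1] = 0  # no trailing silence at the end of the article
    return plan


if __name__ == "__main__":
    print(pause_plan(["title", "heading", "paragraph", "list_item", "paragraph"]))
```

Keeping the values in one table means pacing can be adjusted without touching the synthesis code.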

4. Compress for spoken-word, not music

For the web output, I optimized around speech:

mono audio
Opus first
loudness normalization
slightly slower playback when needed
no shipping of large WAV masters

That kept the files small without making the listening experience noticeably worse.
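A sketch of how such an encode could be scripted, assuming ffmpeg built with libopus is the encoder (the article does not name the tool). The bitrate, loudness targets, and function name are illustrative assumptions.

```python
from pathlib import Path


def opus_command(src: Path, dst: Path, bitrate_kbps: int = 32, tempo: float = 1.0) -> list[str]:
    """Build an ffmpeg command for mono, loudness-normalized Opus speech."""
    # loudnorm targets are assumptions; adjust to taste.
    filters = ["loudnorm=I=-16:TP=-1.5:LRA=11"]
    if tempo != 1.0:
        # atempo changes speed without changing pitch; below 1.0 is slower.
        filters.append(f"atempo={tempo}")
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ac", "1",                    # downmix to mono
        "-af", ",".join(filters),
        "-c:a", "libopus",
        "-b:a", f"{bitrate_kbps}k",
        "-application", "voip",        # libopus mode tuned for speech
        str(dst),
    ]


if __name__ == "__main__":
    print(" ".join(opus_command(Path("article.wav"), Path("article.opus"), tempo=0.95)))
```

The WAV master stays local as an intermediate file, and only the encoded output needs to be published.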

Where Piper hit its ceiling

After the pipeline improvements, Piper became much better.

But better structure did not fully solve voice quality. The remaining issue was prosody. Even with better chunking and a stronger voice, the delivery could still feel a little too even for long-form narration.

Piper remained good at:

being local
being free
being simple to automate
generating compact spoken-word files

But it was less convincing at sounding relaxed and natural across an entire article.

That was the point where I stopped asking “can this work?” and started asking “would I actually want to listen to this for five minutes?”

Why I moved to Kokoro

Kokoro made sense as the next quality step while staying local and free.

The tradeoff is simple:

Piper is easier and lighter
Kokoro sounds better

For article narration, the second point matters more.

The voice I targeted for comparison was George (bm_george), because I wanted a calmer, more natural long-form read than the Piper voices were giving me.

The shift to Kokoro also introduced a practical lesson that had nothing to do with sound quality: local tooling still has ecosystem constraints.

In my case, Kokoro required a compatible Python version and espeak-ng, while Piper fit more easily into the existing workflow. That is the kind of friction that is easy to ignore in a demo and impossible to ignore in a real authoring pipeline.
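That kind of friction is easier to manage when the pipeline fails early with a clear message. Below is a minimal preflight sketch; the function name is hypothetical, and the version bounds are parameters on purpose because the compatible range depends on the Kokoro release you install.

```python
import shutil
import sys


def preflight(min_python, max_python, tools=("espeak-ng",), version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    current = tuple(version) if version is not None else tuple(sys.version_info[:2])
    if not (min_python <= current <= max_python):
        problems.append(
            f"Python {current[0]}.{current[1]} is outside the supported range "
            f"{min_python[0]}.{min_python[1]} to {max_python[0]}.{max_python[1]}"
        )
    for tool in tools:
        if which(tool) is None:  # shutil.which returns None when not on PATH
            problems.append(f"missing executable on PATH: {tool}")
    return problems


if __name__ == "__main__":
    # Placeholder bounds, not Kokoro's actual requirement; check its documentation.
    for problem in preflight((3, 10), (3, 12)):
        print("preflight:", problem)
```

The `version` and `which` parameters exist only so the check can be tested without changing the real environment.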

What I learned from the transition

Local and free is realistic

You do not need a paid TTS API to add article narration to a static site.

If the workflow is offline, scripted, and editorially selective, local generation is completely viable.

Quality is a product decision, not just an engineering one

The cheapest or simplest model is not automatically the right one if the output becomes part of the reading experience.

For article narration, “good enough” quality gets exposed very quickly.

The text pipeline matters a lot

Better narration came less from magical model settings and more from treating the article as spoken structure:

clean paragraphs
separate headings
controlled pauses
fewer noisy artifacts

Start with a ranked subset

Not every article deserves audio on day one.

The right move was to start with a small set of high-value articles and evaluate:

quality
file size
workflow friction
whether the feature actually improves the site

Storage matters once the archive grows

Audio files accumulate quickly. Even optimized speech files are still binary artifacts.

That is why the long-term direction is to move the audio out of the repository and into Cloudflare R2 object storage, while keeping the site itself lean.

What I would recommend

If you want to add narrated articles to a static site with minimal cost, I would approach it in this order:

start local
clean the text before synthesis
generate per paragraph or section
optimize for spoken-word delivery
test on a small set of important articles
move to a better model if the voice quality is still not there

That sequence matters. Without the cleanup and chunking work, it is too easy to blame the engine for problems caused by the pipeline.

Where I landed

Piper was the right way to prove the concept.

Kokoro was the right next step once the goal changed from “generate audio” to “publish narration worth listening to.”

That is probably the cleanest summary of the whole experiment: the first version proved that free local narration was feasible, and the second version taught me that audio quality is part of product quality, not a bonus layer on top.

If you need help building features like this into a content product or developer site, get in touch.
