This article is co-authored with generative AI. While I have cross-checked facts against official documentation where possible, errors may remain. Please verify primary sources before making important decisions.

I ran an experiment to narrate technical blog articles with a synthetic voice cloned from my own speech. The audio is generated with ElevenLabs Voice Cloning + the v3 model (eleven_v3, in alpha at the time of writing).

This post records an A/B comparison of v2 (eleven_multilingual_v2) and v3 on identical Japanese narration material, together with operational observations.

As a by-product, the resulting audio is wrapped in an MP4 (cover image + audio + frequency-bar overlay) and published on YouTube in a dedicated playlist.

Background

I'm interested in approaches that use AI to reproduce a specific person's voice and speaking style, and then read written texts or interview transcripts in something close to that person's voice. Similar efforts are being made in the context of digitally archiving historical or deceased figures, and I wanted to gather first-hand technical and ethical findings on what works and what does not.

Working with someone else's voice raises rights, consent, and ethical concerns, so I am running this self-experiment first — putting my own voice through the same pipeline to assess synthesis quality, operational cost, and pitfalls.

Experimental pipeline

Article (Markdown)
  ↓ Narration script (.txt) — currently semi-manual via Claude Code
  ↓ ElevenLabs API (eleven_v3) → MP3
  ↓ Pillow → 1920x1080 cover image
  ↓ ffmpeg: still cover + audio + showfreqs bars → MP4
  ↓ YouTube Data API: publish + dedicated playlist + tag-based playlists
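
As a rough illustration of the synthesis step, the sketch below builds (but does not send) a text-to-speech request for each model, so that v2 and v3 can be run on the same script. It assumes the public ElevenLabs REST endpoint `POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header. The helper names, the `output_format` value, and the placeholder voice ID and key are illustrative, not taken from my actual scripts.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"  # assumed public REST base URL


def build_tts_request(text, voice_id, api_key,
                      model_id="eleven_v3",
                      output_format="mp3_44100_128"):
    """Build a text-to-speech request without sending it."""
    url = f"{API_BASE}/text-to-speech/{voice_id}?output_format={output_format}"
    payload = {"text": text, "model_id": model_id}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


def build_ab_requests(text, voice_id, api_key):
    """One request per model, for an A/B comparison on identical text."""
    models = ("eleven_multilingual_v2", "eleven_v3")
    return {m: build_tts_request(text, voice_id, api_key, model_id=m)
            for m in models}
```

Calling `synthesize(req, "v3.mp3")` on each request would produce the two MP3 files to compare. Keeping request construction separate from the network call makes it easy to inspect the payload before spending API credits.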

A purely static cover looks visually flat, so the ffmpeg showfreqs filter overlays animated frequency-spectrum bars along the bottom of the frame.
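
The MP4 step can be sketched as an ffmpeg argument list built in Python. This is a minimal sketch, not the exact command I run: the bar height, the `fscale`/`ascale` choices, the codec flags, and the file names are all assumptions. `showfreqs` renders the audio power spectrum as video, and `overlay=0:H-h` pins that strip to the bottom edge of the cover. Whether the strip's background comes out transparent should be checked against your ffmpeg build.

```python
import subprocess


def build_ffmpeg_args(cover_png, audio_mp3, out_mp4,
                      width=1920, bar_height=200):
    """Return the ffmpeg command as a list (nothing is executed here)."""
    filter_graph = (
        # Render the audio spectrum as bars in a strip as wide as the cover.
        f"[1:a]showfreqs=s={width}x{bar_height}:mode=bar"
        ":fscale=log:ascale=sqrt[bars];"
        # Place the strip at the bottom; stop when the bar stream ends.
        "[0:v][bars]overlay=0:H-h:shortest=1[v]"
    )
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", cover_png,   # loop the still cover image
        "-i", audio_mp3,
        "-filter_complex", filter_graph,
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",
        out_mp4,
    ]


def render(cover_png, audio_mp3, out_mp4):
    """Run ffmpeg; raises CalledProcessError if encoding fails."""
    subprocess.run(build_ffmpeg_args(cover_png, audio_mp3, out_mp4), check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with the filter graph, which contains brackets and semicolons.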

Voice Cloning options

ElevenLabs offers two voice cloning paths:

  • IVC (Instant Voice Cloning): creates a clone almost immediately from a 1–5 minute audio sample, using inference-time conditioning
  • PVC (Professional Voice Cloning): trains a fine-tuned model on 30+ minutes of audio, said to be more stable for long-form narration

This experiment uses IVC, with a short public talk I gave in the past (a 2023 Digital Archives Society lightning talk) as the reference sample. PVC would likely be more appropriate for long-form stability, but I first wanted to see how far IVC can go in practice.
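
Before uploading a reference sample, it can help to confirm that its length falls within the 1–5 minute range mentioned above for IVC. The following is a small sketch using only the standard library. It assumes the sample has been exported as an uncompressed WAV file, and the function names and thresholds are illustrative rather than anything ElevenLabs enforces.

```python
import wave

IVC_MIN_SECONDS = 60       # lower bound of the 1-5 minute range in the text
IVC_MAX_SECONDS = 5 * 60   # upper bound


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from frame count and rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_ivc_sample(path):
    """True if the sample length is within the assumed IVC range."""
    duration = wav_duration_seconds(path)
    return IVC_MIN_SECONDS <= duration <= IVC_MAX_SECONDS
```

A lightning talk recording usually fits this range. A longer recording would need trimming first, or would be a candidate for PVC instead.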

Writing the narration script