mitchell vitez

Character Voicing

Characters in the Animal Crossing series speak Animalese, a language built by picking apart the syllables or letters of a host language and synthesizing them individually at a fairly fast clip.

We can program a version of this kind of thing, taking any source paragraph in English and producing a spoken language that sounds very little like English. Throughout, we’ll be using the example of this little poem:

The cow is of the bovine ilk;
One end is moo, the other, milk

Ogden Nash

First of all, we can remove all punctuation and make everything uniformly lowercase:

the cow is of the bovine ilk
one end is moo the other milk

ogden nash
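As a minimal sketch of this first step (the function name `normalize` is just an illustration, and it assumes ASCII punctuation only, so curly quotes would need extra handling):

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text and drop ASCII punctuation, keeping line breaks."""
    drop_punctuation = str.maketrans("", "", string.punctuation)
    return text.lower().translate(drop_punctuation)

poem = "The cow is of the bovine ilk;\nOne end is moo, the other, milk"
print(normalize(poem))
```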

To simplify, we can map several letters to similar-sounding ones. This reduces the number of different syllables we’ll have to synthesize later, without hugely affecting the overall sound of the language.

replacements = [
  ('y', 'i'),
  ('w', 'u'),
  ('x', 's'),
  ('z', 's'),
  ('j', 'g'),
  ('q', 'k'),
  ('c', 'k'),
]

In our case, the only change made is from cow to kou.

the kou is of the bovine ilk
one end is moo the other milk

ogden nash
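One way to apply the table is a plain loop of string replacements. This is a sketch, and `simplify` is a name made up for the example:

```python
replacements = [
    ('y', 'i'),
    ('w', 'u'),
    ('x', 's'),
    ('z', 's'),
    ('j', 'g'),
    ('q', 'k'),
    ('c', 'k'),
]

def simplify(text: str) -> str:
    """Apply each single-letter replacement to the whole text in turn."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(simplify("the cow is of the bovine ilk"))  # the kou is of the bovine ilk
```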

We’ll switch now to processing each word individually. Let’s take these as examples:

bovine
other
moo
ogden

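We split up the word into “syllables” defined by the placement of consonants. We grab each consonant along with its following vowel, if it has one; a consonant with no vowel after it stands alone. Any vowel that isn’t claimed by a preceding consonant (such as one at the start of a word, or the second o in moo) gets its own syllable.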

bo vi ne
o t he r
mo o
o g de n
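The splitting rule can be sketched as a single left-to-right scan. The name `split_syllables` is hypothetical, and the sketch assumes the word has already been lowercased and reduced to plain letters:

```python
VOWELS = "aeiou"

def split_syllables(word: str) -> list[str]:
    """Split a word into consonant+vowel pairs, lone vowels, and lone consonants."""
    syllables = []
    i = 0
    while i < len(word):
        is_consonant = word[i] not in VOWELS
        next_is_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
        if is_consonant and next_is_vowel:
            # A consonant grabs the vowel that follows it.
            syllables.append(word[i:i + 2])
            i += 2
        else:
            # A lone vowel, or a consonant with no vowel after it.
            syllables.append(word[i])
            i += 1
    return syllables

for word in ["bovine", "other", "moo", "ogden"]:
    print(" ".join(split_syllables(word)))
```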

Next, for each syllable we’ll want to phoneticize—convert that syllable into a sound. In our language, there are exactly five vowels (since we mapped y to i and w to u). We can turn them into phonemes—individual sounds.

vowels = 'aeiou'

phonemes = {
  'a': 'ah',
  'e': 'ay',
  'i': 'ee',
  'o': 'oh',
  'u': 'ooh'
}

For each character of a syllable, we replace it with its phoneme if it’s a vowel and leave it alone if it’s a consonant. A lone consonant has no vowel of its own, so we append one, chosen by whatever method we want (even randomly, though this means words won’t be pronounced the same way every time).

boh vee nay
oh tah hay ray
moh oh
oh gah day nee
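Here is a rough sketch of that step, assuming each syllable is either a lone vowel, a lone consonant, or a consonant followed by a vowel. The `phoneticize` name and the `rng` parameter are illustrative choices; lone consonants get a random vowel here, which is only one of the possible methods:

```python
import random

VOWELS = "aeiou"
PHONEMES = {"a": "ah", "e": "ay", "i": "ee", "o": "oh", "u": "ooh"}

def phoneticize(syllable: str, rng=random) -> str:
    """Turn one syllable into a sound, inventing a vowel for lone consonants."""
    consonant = "" if syllable[0] in VOWELS else syllable[0]
    if syllable[-1] in VOWELS:
        vowel = syllable[-1]
    else:
        # Lone consonant: pick any vowel (random here, so output varies per run).
        vowel = rng.choice(VOWELS)
    return consonant + PHONEMES[vowel]

print(" ".join(phoneticize(s) for s in ["bo", "vi", "ne"]))  # boh vee nay
print(" ".join(phoneticize(s) for s in ["o", "t", "he", "r"]))
```

Passing a seeded `random.Random` instance as `rng` would make the lone-consonant choices repeatable between runs.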

Next, we’ll join the phonemes of each word with hyphens (so text-to-speech readers group them more closely).

boh-vee-nay
oh-tah-hay-ray
moh-oh
oh-gah-day-nee

One final bit of cleanup before we synthesize sounds: we add a period at the end of each line so that text-to-speech takes a longer pause there.

tah-hay koh-ooh ee-see oh-fooh tah-hay boh-vee-nay ee-lah-kah.
oh-nay ay-nah-dee ee-see moh-oh tah-hay oh-tah-hay-ray mee-lah-kah.

oh-gah-day-nee nah-sah-hah.
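The hyphenation and trailing period amount to two joins. In this sketch a line is assumed to arrive as a list of words, each word being a list of already-phoneticized sounds; `format_line` is a name invented for the example:

```python
def format_line(words: list[list[str]]) -> str:
    """Hyphenate the sounds within each word, space the words, and end with a period."""
    return " ".join("-".join(sounds) for sounds in words) + "."

line = [["boh", "vee", "nay"], ["ee", "lah", "kah"]]
print(format_line(line))  # boh-vee-nay ee-lah-kah.
```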

In this scheme, there are 75 total sounds: five vowel sounds for each of our 14 consonants (70 in all), plus the five lone vowel sounds:

 ah  ay  ee  oh  ooh
bah bay bee boh booh
dah day dee doh dooh
fah fay fee foh fooh
gah gay gee goh gooh
hah hay hee hoh hooh
kah kay kee koh kooh
lah lay lee loh looh
mah may mee moh mooh
nah nay nee noh nooh
pah pay pee poh pooh
rah ray ree roh rooh
sah say see soh sooh
tah tay tee toh tooh
vah vay vee voh vooh
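We can double-check that count by generating the table. This sketch derives the 14 consonants by removing the vowels and the seven replaced letters from the alphabet; the variable names are just for illustration:

```python
import string

VOWEL_SOUNDS = ["ah", "ay", "ee", "oh", "ooh"]
REPLACED = "ywxzjqc"  # letters mapped away by the replacements table

# Every letter that is neither a vowel nor one of the replaced letters.
consonants = [c for c in string.ascii_lowercase
              if c not in "aeiou" and c not in REPLACED]

# Lone vowel sounds first, then each consonant paired with each vowel sound.
sounds = list(VOWEL_SOUNDS) + [c + v for c in consonants for v in VOWEL_SOUNDS]

print(len(consonants), len(sounds))  # 14 75
```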

We could theoretically record each of the 75 syllables individually and put them together via “concatenative synthesis”—just gluing phonemes back-to-back. However, I’m just going to drop our poem into a text-to-speech site. Here’s the original poem again, for reference:

The cow is of the bovine ilk;
One end is moo, the other, milk

Ogden Nash

The default speed from that site was super slow, far too slow for our language.

It sounds better sped up about 4x.

But combining 4x speed with a pitch boost of 4–6 semitones seemed to provide the best final results.
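For a sense of scale, a pitch shift in semitones corresponds to a frequency multiplier of 2^(n/12), assuming standard twelve-tone equal temperament. The helper below is a hypothetical illustration of that arithmetic, not something taken from the text-to-speech site:

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a pitch shift, assuming 12-tone equal temperament."""
    return 2 ** (semitones / 12)

for n in (4, 5, 6):
    print(n, round(semitones_to_ratio(n), 3))
```

So a boost of 4 to 6 semitones raises every frequency by roughly 26% to 41%.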

With this technique, we could take any paragraph in English and convert it into our language automatically. This could be used to provide character voices in video games given just the text (perhaps in multiple different source languages).
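Putting the steps together, an end-to-end converter might look something like the sketch below. The function names and the `seed` parameter are assumptions made for this example. It keeps only the letters a to z, so accented characters are simply dropped, and lone consonants get a seeded random vowel so that the same input gives the same output each time.

```python
import random
import string

VOWELS = "aeiou"
PHONEMES = {"a": "ah", "e": "ay", "i": "ee", "o": "oh", "u": "ooh"}
REPLACEMENTS = [("y", "i"), ("w", "u"), ("x", "s"), ("z", "s"),
                ("j", "g"), ("q", "k"), ("c", "k")]

def translate_word(word: str, rng: random.Random) -> str:
    """Convert one cleaned-up word into hyphen-joined sounds."""
    sounds = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in VOWELS:
            # Lone vowel.
            sounds.append(PHONEMES[ch])
            i += 1
        elif i + 1 < len(word) and word[i + 1] in VOWELS:
            # Consonant plus the vowel that follows it.
            sounds.append(ch + PHONEMES[word[i + 1]])
            i += 2
        else:
            # Lone consonant: borrow a randomly chosen vowel.
            sounds.append(ch + PHONEMES[rng.choice(VOWELS)])
            i += 1
    return "-".join(sounds)

def translate(text: str, seed: int = 0) -> str:
    """Convert English text, line by line, into the made-up language."""
    rng = random.Random(seed)
    out_lines = []
    for line in text.lower().splitlines():
        # Keep only a-z and whitespace; punctuation and digits are dropped.
        line = "".join(ch for ch in line
                       if ch in string.ascii_lowercase or ch.isspace())
        for old, new in REPLACEMENTS:
            line = line.replace(old, new)
        words = [translate_word(w, rng) for w in line.split()]
        # End non-empty lines with a period so text-to-speech pauses there.
        out_lines.append((" ".join(words) + ".") if words else "")
    return "\n".join(out_lines)

print(translate("The cow is of the bovine ilk;\nOne end is moo, the other, milk"))
```

The resulting text would still need to go through a text-to-speech step, with the speed and pitch adjustments described above, to become audio.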