English to IPA Translation Using a Neural Network

Published: Aug. 23, 2020

Is it possible for a computer to learn the general patterns in English spelling?

By: Zach Ryan
Course: Machine Learning & Linguistics (Ling 4100)
Advisor: Prof. Mans Hulden
LURA 2020


Introduction to g2p

English spelling is notoriously irregular. While some linguists - notably Noam Chomsky and Morris Halle - have argued that the spelling system of English is actually great (they called it “close to optimal”), it is obvious that the system is full of irregularities. Thinking for a moment about word pairs such as do and no, though and thought, plough and rough, cough and trough should quickly convince most speakers that there are no definite rules for getting from a word’s spelling to its pronunciation. Not only that, but we have complete outliers such as the words colonel and boatswain - pronounced [kɝnl̩] and [boʊsn̩], respectively. However, even though the pronunciations of some highly frequent words may be quite irregular, most people tend to agree on the pronunciation of new or unseen words.

A grapheme-to-phoneme (or g2p) system is a common tool in computational linguistics that automatically converts a spelled-out word to a sequence of IPA symbols. It is often a component of text-to-speech systems - programs where you can enter text and get an audio representation of the pronunciation. It is especially useful when dealing with new and unknown words, because the pronunciation of such words has to be guessed somehow and can’t be looked up in a pronunciation dictionary.

To develop a g2p system for a language, there are essentially two common approaches. The first is to develop explicit rules about how to convert sequences of letters to sequences of IPA symbols. The second is to train a machine learning system on known examples of spelling-pronunciation pairs and hope that it generalizes well and can pronounce previously unseen words. In this work, I experimented with a neural network machine learning model, largely because developing rules by hand is a non-trivial, time-consuming process. Why? Consider examples such as these:


Examples of non-trivial g2p

We can’t develop rules about how to convert single letters to single IPA symbols in English. For some languages this is possible, but not for English. For example, in the case of the word phoenix one can see how multiple letters can map to one IPA symbol and how one letter can sometimes map to multiple IPA symbols: ph maps to the single symbol [f], while x maps to the two symbols [ks].

Even for sequences of symbols, it is not possible to create simple rules that can be followed in every situation. Take, for example, the words phone and phoenix. From these two words you could assume that the letter sequence ph always maps to [f].

Unfortunately this does not hold true for a word like loophole, where the ph would instead have to follow a different kind of rule: the p and the h are pronounced separately, as [p] and [h].


Of course, you could get around this problem but that means that your system needs explicit knowledge about morphology to figure out that loophole is a compound of loop + hole to not make an incorrect generalization.

Finally, you’d need a large set of memorized exceptions to the rules, such as boatswain and colonel mentioned above. And of course also frequent irregulars such as no and do.
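To make the difficulty concrete, here is a minimal sketch of what a hand-written system might look like. The rule table, the exception list, and the function name `rule_based_g2p` are illustrative assumptions of mine, not part of the system described in this post; the point is only to show the ph rule misfiring on loophole.

```python
# Toy rule-based g2p: a few multi-letter rules plus memorized exceptions.
# The rules below are hypothetical and far too small for real use.
RULES = {"ph": "f", "oe": "i", "x": "ks"}
EXCEPTIONS = {"colonel": "kɝnl̩", "boatswain": "boʊsn̩"}

def rule_based_g2p(word):
    """Convert a spelled word using exceptions first, then greedy rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    out, i = [], 0
    while i < len(word):
        # Prefer a two-letter rule over a one-letter rule at position i.
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])  # no rule applies: copy the letter unchanged
            i += 1
    return "".join(out)

print(rule_based_g2p("phoenix"))   # ph, oe and x are rewritten by the rules
print(rule_based_g2p("loophole"))  # ph is wrongly turned into f across loop + hole
```

Fixing the second case would require the rule system to know where the morpheme boundary falls, which is exactly the extra knowledge discussed above.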

Neural Networks for Translation

It turns out that the problems listed above are also why machine translation between languages is difficult to write explicit rules for. We often have multiple words in one language translating to a single word in another. In many cases, the translation of a particular word will depend on its context (like the IPA translation of ph). We also have idioms that need to be memorized, word orders that need to be swapped sometimes... The list is long. However, certain types of neural networks - called sequence-to-sequence (seq2seq) models - have recently revolutionized the field of machine translation. We can think of seq2seq models as black boxes that learn to mimic the translation of examples they have seen previously.

What’s an example? An example is a pair of sequences of symbols such as this one:

INPUT  (English): How are you ?
OUTPUT (Spanish): ¿ Cómo estás ?

Here, the input consists of 4 symbols: “How”, “are”, “you”, and “?”. The output consists of the symbols “¿”, “Cómo”, “estás”, and “?”. If you provide a seq2seq system with enough such examples (usually hundreds of thousands for machine translation), it will eventually learn to translate text it hasn’t seen before surprisingly fluently.

In reality these “symbols” are represented as numbers and the neural network has no knowledge about the meaning or even spelling of these words. It just learns to imitate the types of translations that are being done - essentially learning to translate sequences of numbers to sequences of other numbers, based on examples. One can train a neural network to translate any sequences of symbols as long as there is some consistent pattern that it can learn.
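The idea of symbols as numbers can be sketched in a few lines. The helper names `build_vocab` and `encode` are hypothetical, and real toolkits typically also reserve ids for padding and unknown symbols, which this sketch leaves out.

```python
def build_vocab(sequences):
    """Assign each distinct symbol an integer id, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for symbol in seq:
            vocab.setdefault(symbol, len(vocab))
    return vocab

def encode(seq, vocab):
    """Replace every symbol by its integer id."""
    return [vocab[symbol] for symbol in seq]

source = "How are you ?".split()
target = "¿ Cómo estás ?".split()
src_vocab = build_vocab([source])
tgt_vocab = build_vocab([target])

print(encode(source, src_vocab))  # [0, 1, 2, 3]
print(encode(target, tgt_vocab))  # [0, 1, 2, 3]
```

Both sentences end up as the same list of numbers here, which underlines that the ids carry no meaning on their own: the network only ever sees which id tends to follow or correspond to which.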

So, a way to translate into IPA spelling using this technology is by doing a kind of “machine translation” based on input/output examples of sequences of letter symbols - treating a letter in a word the same as you would a word in a sentence. Instead of word sequences, we “train” the model on sequences of letters, such as:

INPUT : p h o e n i x
OUTPUT: f i n ɪ k s
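Preparing such a letter-level example might look like the sketch below. The function name `to_training_pair` is my own, and I assume the pronunciation is already available as a list of IPA symbols: splitting an IPA string character by character would separate combining marks (as in [n̩]) from their base symbol.

```python
def to_training_pair(spelling, ipa_symbols):
    """Return (source_line, target_line) with symbols separated by spaces."""
    source_line = " ".join(spelling.lower())
    target_line = " ".join(ipa_symbols)
    return source_line, target_line

src, tgt = to_training_pair("phoenix", ["f", "i", "n", "ɪ", "k", "s"])
print(src)  # p h o e n i x
print(tgt)  # f i n ɪ k s
```

Writing one such source line and one target line per word gives the kind of parallel text that machine translation tools are usually trained on.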


The model

My model and results are all available on GitHub. Specifically, I developed the system using an open-source neural network toolkit called OpenNMT. The network I used was a 2-layer long short-term memory (LSTM) model with 500 hidden units in the encoder and decoder. This is a fairly standard setup for machine translation, and it also works well for g2p.


Results

To evaluate the performance of the model, I first set aside some spelling/pronunciation pairs that the seq2seq model wasn’t trained on (my test set), and evaluated it only on how well it guessed the pronunciation of these words, which it had never seen as examples.
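A held-out split of this kind can be sketched as follows. The 10% test fraction, the fixed seed, and the function name `train_test_split` are assumptions for illustration; the post does not say how large the actual test set was.

```python
import random

def train_test_split(pairs, test_fraction=0.1, seed=0):
    """Shuffle the pairs and hold out a fraction that is never trained on."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for real spelling/pronunciation pairs.
pairs = [(f"word{i}", f"pron{i}") for i in range(20)]
train, test = train_test_split(pairs)
print(len(train), len(test))  # 18 2
```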


BLEU score

To determine whether the model I created for English-to-IPA translation is any good, I evaluated it using two methods. The first is the BLEU score, which is widely used to evaluate machine translation systems. At its simplest, this evaluation checks whether each symbol in the predicted output is also found in the gold standard (the correct answer), without caring where in the sequence it appears; the full score also counts matching short runs of consecutive symbols (n-grams), so it is sensitive to order only locally. A BLEU score of one means the compared strings were identical. In a machine translation setting this is acceptable, because languages differ in structure and some permit free word order, so we may not want to penalize a word order different from the “correct” one too heavily. When translating from graphemes to phonemes, however, preserving the correct order in the predicted output is of course vital.
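The order problem can be illustrated with unigram precision, the simplest ingredient of BLEU. This is a deliberately reduced sketch, not the full BLEU score (which also uses longer n-grams and a length penalty), and the function name `unigram_precision` is my own.

```python
from collections import Counter

def unigram_precision(predicted, gold):
    """Fraction of predicted symbols also found in gold (counts are clipped)."""
    if not predicted:
        return 0.0
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # A symbol is credited at most as often as it occurs in the gold answer.
    matched = sum(min(count, gold_counts[sym]) for sym, count in pred_counts.items())
    return matched / len(predicted)

gold = ["f", "i", "n", "ɪ", "k", "s"]
scrambled = ["s", "k", "ɪ", "n", "i", "f"]

print(unigram_precision(gold, gold))       # 1.0
print(unigram_precision(scrambled, gold))  # 1.0, despite the wrong order
```

A completely scrambled pronunciation receives a perfect score under this measure, which is why an order-sensitive metric is needed as well.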


Edit distance

With this in mind, I also used an evaluation method based on edit distance. This measure counts the smallest number of single-symbol changes (substitutions, insertions, or deletions) needed to transform the system’s answer into the correct answer. It provides a much better representation of what the model is capable of. An edit distance of zero means the system’s guess and the gold (correct) answer are identical. An edit distance of 1 means that the answer is one change away from the correct one.
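Edit distance of this kind is usually computed with the standard Levenshtein dynamic-programming algorithm. The sketch below is a generic version over lists of symbols, not necessarily the exact script I used; `average_edit_distance` is a hypothetical helper for averaging over a test set.

```python
def edit_distance(predicted, gold):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(gold) + 1))  # distances from an empty prediction
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert g
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]

def average_edit_distance(pairs):
    """Mean edit distance over (predicted, gold) pairs."""
    return sum(edit_distance(p, g) for p, g in pairs) / len(pairs)

predicted = ["f", "i", "n", "i", "k", "s"]
gold = ["f", "i", "n", "ɪ", "k", "s"]
print(edit_distance(predicted, gold))  # 1
```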


Edit distance example

For example, below you’ll see an output provided by my system, and the correct output for the word tribunal.

Here each arrow represents a substitution that is needed to turn the predicted IPA text into the actual IPA text. That means in this example the word would have an edit distance of 2.

In my experiments on English g2p, my model got a BLEU score of 0.929 and an average edit distance of 0.341. One way to think about this result is that, if every error were a single-symbol error, roughly every third word it was tested on would contain one, and to correct it you would have to change one IPA symbol to some other symbol, or delete or insert an IPA symbol.


Fun things to do with g2p

One obvious thing to do is to train an inverse model - one that goes from pronunciation to spelling. Once we have systems to perform the translation both ways, we could also use them for other things. One use case is to compare the transparency of spelling systems in different languages. We could do this by answering two questions: is it hard to go from the spelling to the IPA (is the language hard to read?), and is it hard to go from IPA to spelling (is the language hard to write?)

We know that neural networks learn consistently regular translation patterns quite well. This fact would allow us to compare languages in a somewhat objective way, and see which languages have “easy” spelling and writing rules and which are full of irregulars. Something like this was recently done in an article posted on arXiv, which is where the below figure comes from:


As you can see, English (en) is second only to French (fr) in unpredictability when going from IPA to spelling (writing). In the other direction - going from spelling to IPA (reading) - English is the most irregular language(!) Another thing these models can tell us is which words have the most surprising pronunciations in the language. Here’s a list that comes from someone who trained a model similar to mine and found the weirdest spellings in English, according to the model. Do you know how these words are pronounced?


Worcestershire

hors d'oeuvre

dachshunds

Descartes

Aristophanes

palimpsest

marquis


Zach Ryan

I am a senior majoring in Computer Science. Besides writing code, I enjoy skiing, hiking, and fly fishing, and after graduation I would like to pursue a career in software development or machine learning.
