How to create a language
by Pablo David Flores




Writing your language

Once you have determined which sounds your language will have, you'll need a way to write them down in the Roman alphabet (transliterate them), and perhaps an alphabet of its own. We'll talk about alphabets in a minute.

Transliteration can be a nightmare. The ideal would be one symbol for each sound, but the Roman alphabet has no symbols for some very common sounds. Here you face your first choice: will you invent or use one symbol for each sound, or use some other device? If you want one symbol for each sound, you'll probably have to use either non-letter symbols (such as ' @ ?) or resort to diacritic marks, i.e. modify letters with little signs above (or below) them. The accents and diaeresis over vowels are diacritic marks: á è î ÿ. English uses almost no diacritic marks (they survive mostly in loanwords such as café). Spanish marks some stressed vowels with an acute accent: acá éramos ínfimos órganos súbitos, and writes the palatal nasal sound as ñ (as in año). French uses accents to show that an e sound should be pronounced and for the sake of tradition in many words: été âme à mère; and it uses the letter ç for the s sound before a, o, u. Portuguese shows nasalized vowels with a tilde (~) over them (as in são). German shows front versions of back vowels with a diaeresis over them (ö ü). Danish writes a rounded a with å, and a front o with ø. Many languages have nonstandard letters for certain sounds, and unless you speak those languages and your keyboard is configured for them, you won't be able to type them easily when writing your language on your computer.

If you don't want to use so many strange symbols, you'll probably have to use two or more symbols to represent some sounds, the way English uses sh and th for single sounds. These are called digraphs (trigraphs are possible but best avoided, since they make words longer). The letter h is very good for digraphs. But you have to take something into account: two symbols should never be used to form a digraph if they can also appear side by side representing two different sounds. English can use th because the cluster /t/+/h/ practically never appears in English, but it couldn't use sn to represent a nasal fricative, because some words, such as snow, have sn with the value of /sn/.
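The rule about digraphs can be checked mechanically once you have a list of spellings and a list of the sound clusters your language allows. What follows is only a minimal sketch: the function name, the sound labels ("TH", "SN") and the tiny inventory are all made up for illustration.

```python
def find_digraph_clashes(spelling, clusters):
    """Return digraphs that could also be read as two separate sounds.

    spelling: dict mapping each sound to its written form.
    clusters: set of (sound, sound) pairs that may occur side by side.
    """
    clashes = []
    for sound, written in spelling.items():
        if len(written) != 2:
            continue  # only digraphs are checked in this sketch
        for first, second in clusters:
            if spelling[first] + spelling[second] == written:
                clashes.append((sound, written, (first, second)))
    return clashes


# Hypothetical inventory: "th" spells one sound, "sn" is proposed for another.
spelling = {"t": "t", "h": "h", "s": "s", "n": "n", "TH": "th", "SN": "sn"}
# Hypothetical phonotactics: /s/+/n/ can occur together, /t/+/h/ cannot.
clusters = {("s", "n")}

print(find_digraph_clashes(spelling, clusters))
```

With these assumptions only sn is reported, because /s/ followed by /n/ is an allowed cluster while /t/ followed by /h/ is not.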

Transliteration has no rules on which symbols you use to represent which sound, but you should try to make the language readable: it's OK to use zh to represent /f/, but most people will surely read something completely different from /f/ when they find it, and besides, you already have a more familiar f to fill that place, right?

Transliteration should be as phonemic as possible. English is a bad example; many words are written the way they were pronounced centuries ago, so the written and spoken forms of a word are often inconsistent. French is even worse (in a word like oiseau, pronounced /wa'zo/, not one sound corresponds to its 'proper' letter). Written Spanish and Italian are quite phonemic and, almost as important, the sounds can be guessed from the written form, even if the guess is not always exact. Some languages are remarkably consistent in their written forms.



Alphabets and other scripts

An alphabet is a collection of symbols representing sounds. You can invent an alphabet for your language if you want to. If you do, and your romanized spelling is phonemic, then your alphabet should be too: one symbol for one sound. You can use digraphs and add diacritics in your own alphabet as well. If your language derives from another language for which you already had an alphabet, then the newer language will probably use the old alphabet, but some letters will have changed sound. For example, Spanish uses the Latin alphabet, but the letter c now represents s before e, i. This is not phonemic spelling, but the change is completely regular.

When inventing letters, play around with them and write them quickly one after another. People write carelessly in most cases, and elaborate letters are likely to be simplified. Also try to make each letter different from all the others, so that they are not confused. When two symbols look very similar, people find ways to distinguish them. The dot over the i appeared when the little stroke of the lowercase i began to be confused with the vertical lines of m's and n's in Gothic handwriting. Some computer fonts and many programmers distinguish 0 (zero) from O (the letter o) by drawing a slash through the zero.

You have to decide in which direction your language will be read and written. Will it be from left to right, as the Roman and Cyrillic alphabets are usually written? Hebrew and Arabic are written from right to left, and vowels are not written except in children's books and (for Arabic) in the Koran. Japanese is traditionally written from top to bottom, with the columns running from right to left, but it's written from left to right in certain books, such as mathematics textbooks.

Alphabets are not the only kind of writing. Chinese uses ideograms, characters which used to represent a picture of an object. Each character represents a concept and is read as a syllable; but words that sound the same and are not related are written with different characters. Most Chinese characters have two parts, the radical and the phonetic. The radical gives an idea of the meaning, while the phonetic gives an idea of the sound; a radical can sometimes act as a phonetic and vice versa.

Japanese uses a mixed system of kanji (ideograms) and kana (phonetic syllabic characters). In general, the main content of what you're trying to say is written in kanji, while particles, conjunctions and inflectional endings are written in kana. There are about 90 kana divided into two sets (hiragana and katakana). Hiragana are most often used for original Japanese words; katakana are preferred for borrowed words, and also to add emphasis, just like italics in the Roman alphabet. Also, when an unusual kanji is used, it can be clarified by spelling it phonetically in small hiragana next to it; these are called furigana ('handicap kana'). You can change the quality of the consonant in a kana by using some diacritic marks. The official list of 'standard' kanji had 1945 characters (2136 since it was revised in 2010), of which about a thousand are taught in elementary school, and each kanji can be read according to its Japanese pronunciation (kun-yomi) or its original Chinese pronunciation (on-yomi). As if that weren't confusing enough, each kanji can have several readings of each of the two kinds. [See a description of Japanese and Chinese writing here. Includes a hiragana-katakana chart!]

Korean uses an alphabet called Hangul, which is a featural code, a system in which similar sounds are represented by similar symbols. It was created in the 15th century under King Sejong, and it reflects a remarkable phonetic analysis. In Hangul, symbols are grouped in syllables, making the writing look as if it were composed of many ideograms or syllabic characters, which is not the case.
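To see how individual Hangul letters stack into syllable blocks, here is a small sketch that builds a block from its letters. It relies on the way Unicode lays out precomposed Hangul syllables (starting at U+AC00, in a fixed initial/vowel/final order); the function name compose is just an illustrative choice.

```python
# Letters in the order Unicode uses for precomposed Hangul syllables.
LEADS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")        # 19 initial consonants
VOWELS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")    # 21 vowels
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none

SYLLABLE_BASE = 0xAC00  # code point of the first syllable block


def compose(lead, vowel, tail=""):
    """Combine an initial consonant, a vowel and an optional final
    consonant into a single precomposed Hangul syllable block."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(SYLLABLE_BASE + index + TAILS.index(tail))


# The two syllables of the word "Hangul": h+a+n and g+eu+l.
print(compose("ㅎ", "ㅏ", "ㄴ") + compose("ㄱ", "ㅡ", "ㄹ"))
```

Each block looks like a single character, but it is fully predictable from the two or three letters inside it, which is what separates Hangul from a true syllabary.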

Arabic uses a cursive alphabet, which is unusual because most peoples in history have started out with block letters, due to the nature of the material support for writing. Arabic was written with fine brushes on some kind of smooth surface from the beginning, I guess; cursive letters are completely inadequate for (quick) stone carving or clay.

Thai, while a syllabic language, uses a phonetic alphabet of single letters, which often have little curls and twists at the ends. Some other scripts from that part of the world use characters of the same kind, which can seem a bit too elaborate. The reason is that they were first written using materials which required lines to be 'closed' in some way.

This all boils down to a principle: to invent an alphabet, you must know where it's going to be written and by what means.

Inventing an alphabet is simple, but a syllabary (or a set of ideograms) can be a headache, so you should think it over carefully beforehand. Ideograms are probably the hardest kind of writing to manage, and you should probably refrain from using them unless you have a photographic memory. Syllabaries are fine, but they work best for languages with a very restricted syllable structure; English has an enormous number of possible syllables, and inventing a sign for each one would be impractical.
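You can estimate how big a syllabary would have to be by counting the syllables your sound rules allow. This is a rough sketch with an invented inventory (ten consonants, five vowels); the function name and the lists are assumptions for illustration, and a real language would rule out some of the combinations.

```python
from itertools import product


def all_syllables(onsets, nuclei, codas):
    """List every onset + nucleus + coda combination as a string."""
    return ["".join(parts) for parts in product(onsets, nuclei, codas)]


consonants = list("ptkmnshrwy")  # hypothetical consonant inventory
vowels = list("aiueo")           # hypothetical vowel inventory

# (C)V: an optional consonant followed by a vowel.
open_only = all_syllables([""] + consonants, vowels, [""])
# (C)V(C): the same, plus an optional final consonant from a short list.
with_codas = all_syllables([""] + consonants, vowels, ["", "n", "s", "t", "k"])

print(len(open_only), len(with_codas))
```

Allowing just four final consonants multiplies the number of signs needed by five, which is why a language that also permits consonant clusters quickly becomes unmanageable for a syllabary.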

Take a look at some natural language scripts in Ancient Scripts, a page with examples from all around the world.

Ordering your script

We're used to having our letters in order. This is very useful for dictionaries and phone books, and for indexes in general. How are you going to order your symbols?

Western alphabets derived from the Roman alphabet usually follow a predictable order. English uses a relatively small set of symbols, and digraphs aren't considered independent symbols, but this is not so in other languages. For example:

Some other languages, using non-Latin scripts, order their characters in a different fashion. Some of them use the phonetic features of sounds to order the letters; for example, first the labials (p, b, m, f), then the alveolars (t, d, n, s) and so on.
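If you keep your word list on a computer, an invented letter order has to be spelled out, because the default sort follows the Roman order and treats digraphs as two letters. Here is a minimal sketch, assuming a made-up alphabet ordered by phonetic features in which sh counts as a single letter; make_sort_key and the sample words are hypothetical.

```python
def make_sort_key(alphabet):
    """Build a sort key for words written in a custom letter order.

    alphabet: list of letters in order; a letter may be a digraph.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        ranks = []
        i = 0
        while i < len(word):
            # Try the longest possible letter first, so "sh" wins over "s".
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += size
                    break
            else:
                raise ValueError(f"no letter matches {word!r} at position {i}")
        return ranks

    return key


# Hypothetical order: labials, alveolars, sh, velars, then the vowels.
alphabet = ["p", "b", "m", "f", "t", "d", "n", "s", "sh", "k", "g", "a", "i", "u"]
words = ["kasa", "sima", "shima", "tana", "pata", "basu"]

print(sorted(words, key=make_sort_key(alphabet)))
```

Under this order pata comes first and kasa last, and sima sorts before shima because s precedes sh as a separate letter.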

As for syllabaries, there's usually also a fixed order. In Japanese, both types of kana are arranged like this: first the vowels, a i u e o, then the syllables beginning with k (ka, ki, ku, ke, ko), then s-, t-, n-, h-, m-, y-, r-, w-, and finally the symbol for syllabic n. Another order, more traditional, was used in former times (and is still used in indexes and tables, as opposed to the modern order, which is used in dictionaries). This order follows a poem traditionally attributed to the Buddhist monk Kuukai, which uses each character of hiragana exactly once:

Iro ha nihohe to
chirinuru wo
waka yo tare so.
Tsune naramu
uwi no okuyama
kefu koete
asaki yume
mishi wehi mo sesu.
(Note: this is probably not good modern Japanese, nor is this the correct pronunciation. The kana for ha is pronounced wa, and the kana for wi and we are obsolete. The kana for wo is pronounced o.)

As for ideograms, Japanese kanji (and Chinese hanzi) are traditionally ordered by radical number and, within the same radical, by the number of strokes needed to write the character (there's a method to count them properly).
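The radical-then-strokes ordering amounts to sorting on a pair of numbers. The sketch below uses a tiny hand-made table with traditional radical numbers and total stroke counts for six characters; the table and the function name are only an illustration, and real dictionaries need further rules to break ties.

```python
# character: (radical number, total number of strokes)
CHARACTERS = {
    "森": (75, 12),  # forest, tree radical
    "水": (85, 4),   # water, water radical
    "休": (9, 6),    # rest, person radical
    "木": (75, 4),   # tree, tree radical
    "人": (9, 2),    # person, person radical
    "林": (75, 8),   # grove, tree radical
}


def dictionary_order(characters):
    """Sort by radical number first, then by stroke count."""
    return sorted(characters, key=lambda ch: characters[ch])


print("".join(dictionary_order(CHARACTERS)))
```

Characters sharing a radical end up next to each other, so someone who recognizes the radical only has to scan a short run of entries.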

It would be a nice idea to have letters with names that mean something, or that can be recited in order. Latin letters have meaningless names in all languages that use them, and their names are often too similar to one another, hence the need for codes like 'Alpha, Bravo, Charlie'... Other languages and scripts don't have such problems.