The phonology and phonotactics of Zemo in regular expressions

TL;DR: It's a precise way to describe how a language works that's also compatible with most programming languages. Reading it would be cool. I mean, you don't have to.

The main reason I went with regular expressions for my language is that I want it to be very regular. This is kind of comparable to the formal grammar specification of Lojban, except orders of magnitude easier to understand.

If you're not familiar with regular expressions, I explain how this works as I go. If you have no idea what's going on, http://www.regular-expressions.info/ is a good resource to learn regular expressions, along with http://rubular.com/ to practice them.

Here's the whole thing, where the last line matches any valid word:

G = [ʔgɟdb] # stops
Ɣ = [ ɣʝzβ] # fricatives
Ŋ = [ ŋɲnm] # nasals
L = [ ʟɻlu] # liquids
O = [ oaei] # simple vowels
C = G|Ɣ|Ŋ # consonants
V = L|O   # vowels
P = C|V   # phonemes
I = (?<!P)V         # initial V
  | C(?!P)          # terminating C
  | (?:gŋ|ɟɲ|dn|bm) # homorganic GŊ
  | (?<!V)(C)\      # doubled C not after V
  | V(C)(?!\ )      # non-doubled C after V
  | (G)(?!\ )G      # GG that isn't a double
(?!P*I)P+ # I-free phoneme sequence

To make it easier to see what's going on, I assigned sub-patterns to capital letters, each of which can simply be substituted where it's used, wrapped like (?:I), and I used free-spacing syntax (which ignores whitespace and comments). Also note that this wasn't designed for efficiency so much as to codify the phonotactics in a precise and clear manner.

G = [ʔgɟdb] # stops
Ɣ = [ ɣʝzβ] # fricatives
Ŋ = [ ŋɲnm] # nasals
L = [ ʟɻlu] # liquids
O = [ oaei] # simple vowels

These are the basic classes for the phonemes, organized in the same type of grid I usually go with. Square brackets denote a character class, which matches any one of its constituent characters.
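As a rough illustration in Python (not necessarily the engine this was written for), each class can be tried on single characters. The ASCII names and the classify helper are placeholders made up for this sketch.

```python
import re

# The five rows of the grid as character classes.
# The names on the left are hypothetical; the post uses G, Ɣ, Ŋ, L, O.
CLASSES = [
    ("stop", "[ʔgɟdb]"),
    ("fricative", "[ɣʝzβ]"),
    ("nasal", "[ŋɲnm]"),
    ("liquid", "[ʟɻlu]"),
    ("simple vowel", "[oaei]"),
]

def classify(ch):
    """Return the name of the first class that matches the single character ch."""
    for name, char_class in CLASSES:
        if re.fullmatch(char_class, ch):
            return name
    return None  # not a phoneme of the language

for ch in "gzɲua":
    print(ch, classify(ch))
```

A character class always matches exactly one character, which is what lets the later lookbehinds stay fixed-width.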

C = (?:G|Ɣ|Ŋ) # consonants
V = (?:L|O)   # vowels
P = (?:C|V)   # phonemes

These are less granular: they group the first three rows as consonants and the last two as vowels. | means "or". Note that (?:...) is used because plain (...) denotes a capture group, a section that can be referenced later in the expression, and extra capture groups would interfere with the backreferences in the next section.
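A small Python sketch of why the non-capturing form matters, assuming the capital letters are expanded by plain string substitution. The ASCII names Y and N stand in for Ɣ and Ŋ and are choices made for this example only.

```python
import re

G, Y, N = "[ʔgɟdb]", "[ɣʝzβ]", "[ŋɲnm]"   # stops, fricatives (Ɣ), nasals (Ŋ)
L, O = "[ʟɻlu]", "[oaei]"                 # liquids, simple vowels

# Each composite is wrapped in (?:...) so it can be dropped into a larger
# pattern without changing the numbering of any capture groups there.
C = f"(?:{G}|{Y}|{N})"
V = f"(?:{L}|{O})"
P = f"(?:{C}|{V})"

non_capturing = re.compile(P)
capturing = re.compile(f"({C}|{V})")      # same matches, but adds a group

print(non_capturing.groups)  # number of capture groups in P
print(capturing.groups)      # one more than above
```

With the capturing variant, every use of P would shift the numbers that later backreferences rely on.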

Now for the core of the expression, the illegal elements.

(?<!P)V
C(?!P)

This is where we start using lookarounds, which don't consume characters themselves but check what comes immediately before or after the current position. The first is a vowel not preceded by a phoneme, which makes the vowel initial. The second is a consonant not followed by a phoneme, which makes it final.
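To see the two lookarounds in isolation, here is a hedged Python sketch. It flattens C, V and P into single character classes, which should be equivalent to the nested definitions, and the test strings are made-up letter sequences rather than attested words.

```python
import re

# Flattened equivalents of C, V and P (an assumption made for brevity).
C = "[ʔgɟdbɣʝzβŋɲnm]"
V = "[ʟɻluoaei]"
P = "[ʔgɟdbɣʝzβŋɲnmʟɻluoaei]"

INITIAL_V = re.compile(f"(?<!{P}){V}")   # vowel with no phoneme before it
FINAL_C = re.compile(f"{C}(?!{P})")      # consonant with no phoneme after it

def starts_with_vowel(word):
    return INITIAL_V.search(word) is not None

def ends_with_consonant(word):
    return FINAL_C.search(word) is not None

for word in ["gaga", "aga", "gag"]:
    print(word, starts_with_vowel(word), ends_with_consonant(word))
```

Because the checks look at neighbouring characters rather than string anchors, they also find word edges inside running text separated by spaces.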

(?:gŋ|ɟɲ|dn|bm)

This is just to prevent homorganic GŊ sequences because they could be interpreted as GeŊ within the allophony, which would break the phonotactics by effectively hiding a vowel.
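The four banned pairs line up column by column in the grid, so a Python sketch can derive them instead of listing them. The names below are hypothetical, and ʔ is left out because the grid shows no nasal in its column.

```python
import re

STOPS = "gɟdb"     # stop row without ʔ
NASALS = "ŋɲnm"    # nasal row, same column order

# Pair each stop with the nasal at the same place of articulation.
PAIRS = [stop + nasal for stop, nasal in zip(STOPS, NASALS)]
HOMORGANIC = re.compile("(?:" + "|".join(PAIRS) + ")")

def has_homorganic_cluster(word):
    """True if the word contains gŋ, ɟɲ, dn or bm anywhere."""
    return HOMORGANIC.search(word) is not None

print(PAIRS)
print(has_homorganic_cluster("dna"), has_homorganic_cluster("gna"))
```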

(?<!V)(C)\ 
V(C)(?!\ )

The first is a consonant not preceded by a vowel and followed by itself, i.e. doubled. The second is a vowel followed by a consonant that is not doubled. Combined, these form the requirement that doubled consonants appear only after vowels, and that any consonant directly after a vowel must be doubled. (C) is a capture group, and \ refers back to the most recent capture group, which I used for modularity. This doesn't work in all regular expression engines, but I couldn't find a universal equivalent.
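Python's built-in re module has no "most recent group" backreference, so this sketch swaps in the numbered form \1, which is safe here only because each pattern contains exactly one capture group. The flattened classes and the sample strings are assumptions for illustration.

```python
import re

C = "[ʔgɟdbɣʝzβŋɲnm]"   # flattened consonants
V = "[ʟɻluoaei]"         # flattened vowels

# Illegal: a doubled consonant that does not directly follow a vowel.
DOUBLED_NOT_AFTER_V = re.compile(rf"(?<!{V})({C})\1")
# Illegal: a vowel followed by a consonant that is not doubled.
SINGLE_AFTER_V = re.compile(rf"{V}({C})(?!\1)")

def breaks_doubling_rules(word):
    """True if either of the two illegal shapes occurs in the word."""
    return bool(DOUBLED_NOT_AFTER_V.search(word) or SINGLE_AFTER_V.search(word))

for word in ["gazza", "gaza", "ggaza", "gazzza"]:
    print(word, breaks_doubling_rules(word))
```

In the last sample the third z in a row is what trips the first pattern: it is doubled, but the character before it is a consonant rather than a vowel.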

(G)(?!\ )G

This is a stop followed by a stop that is not itself. It works because the negative lookahead and the second G are both tested at the same position: the next character has to match G but must not match the captured stop. This ensures that different stops cannot be combined, which is a criterion I might relax later.
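A minimal Python check of this pattern on its own, again using \1 as a stand-in for the backreference since there is only one capture group. The sample strings are invented.

```python
import re

G = "[ʔgɟdb]"   # stops

# A stop, then (at the same position) "not the same stop" and "some stop".
MIXED_STOPS = re.compile(rf"({G})(?!\1){G}")

def has_mixed_stop_cluster(word):
    return MIXED_STOPS.search(word) is not None

for word in ["gda", "gga", "ʔba", "gza"]:
    print(word, has_mixed_stop_cluster(word))
```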

If you made it this far, you're almost there!

(?!P*I)P+

This is a sequence of one or more phonemes P such that there is no instance of I within it. The P* means "zero or more phonemes", and I used it because the negative lookahead needs to be able to find I anywhere in the word but no farther. This pattern (?!A*B)A in particular is quite useful for describing groups without B, which could be reversed to (?!A*(?!B))A to give only groups composed of B, a more standard way of describing phonotactics.
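Putting the pieces together, here is one possible Python rendering of the whole expression. It is a sketch under assumptions: the backreferences are numbered \1 to \3 by hand (Python's re has no relative form), the final P is repeated with + to cover "one or more phonemes", and the sample strings are invented, not real Zemo words.

```python
import re

G, Y, N = "[ʔgɟdb]", "[ɣʝzβ]", "[ŋɲnm]"   # stops, fricatives (Ɣ), nasals (Ŋ)
L, O = "[ʟɻlu]", "[oaei]"                 # liquids, simple vowels
C = f"(?:{G}|{Y}|{N})"
V = f"(?:{L}|{O})"
P = f"(?:{C}|{V})"

# Illegal elements. The three capture groups are numbered in order of
# appearance, so I must be inserted exactly once for \1, \2, \3 to line up.
I = rf"""(?:
      (?<!{P}){V}         # initial V
    | {C}(?!{P})          # terminating C
    | (?:gŋ|ɟɲ|dn|bm)     # homorganic stop + nasal
    | (?<!{V})({C})\1     # doubled C not after V
    | {V}({C})(?!\2)      # non-doubled C after V
    | ({G})(?!\3){G}      # two different stops in a row
)"""

WORD = re.compile(rf"(?!{P}*{I}){P}+", re.VERBOSE)

def is_valid_word(word):
    """True if the whole string is an I-free sequence of phonemes."""
    return WORD.fullmatch(word) is not None

for word in ["ga", "gazza", "gza", "gaza", "aga", "gag", "dna", "gda"]:
    print(word, is_valid_word(word))
```

Using fullmatch rather than search keeps a legal fragment inside an illegal word from counting as a match.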

Hopefully this was clear enough to follow. If anyone's interested in doing this kind of thing for their languages, I could write a script that takes text formatted like this and tests for matches.
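For anyone who wants to experiment before such a script exists, here is a rough Python sketch of one way it might work. Everything about it is an assumption: the input format (single-capital-letter definitions, # comments, continuation lines starting with |, and a final bare pattern), the compile_spec name, and the toy two-class SPEC used to try it.

```python
import re

SPEC = """
G = [ʔgɟdb] # stops
V = [oaei]  # vowels
P = G|V     # phonemes
I = (?<!P)V # initial vowel
  | G(?!P)  # terminating stop
(?!P*I)P+
"""

def compile_spec(spec):
    """Expand the letter definitions and compile the last entry of the spec."""
    entries = []  # [name or None, body] in source order
    for raw in spec.splitlines():
        # Drop the comment, then all whitespace (free-spacing style).
        line = "".join(raw.split("#", 1)[0].split())
        if not line:
            continue
        m = re.fullmatch(r"(.)=(.+)", line)
        if m and m.group(1).isupper():
            entries.append([m.group(1), m.group(2)])   # new definition
        elif line.startswith("|") and entries:
            entries[-1][1] += line                     # continuation line
        else:
            entries.append([None, line])               # bare pattern
    defs, expanded = {}, ""
    for name, body in entries:
        # Replace every defined letter with its expansion in (?:...).
        expanded = "".join(f"(?:{defs[ch]})" if ch in defs else ch for ch in body)
        if name:
            defs[name] = expanded
    return re.compile(expanded)

word = compile_spec(SPEC)
for candidate in ["gaga", "ag", "gag"]:
    print(candidate, word.fullmatch(candidate) is not None)
```

The substitution is deliberately naive: it would also rewrite a defined letter that appears inside a character class or in a construct like (?P<name>...), and it cannot handle a literal # in a pattern, so a real tool would need a proper tokenizer.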

Also, yes, this is the language I keep renaming. I've almost got it.

Author

digigon

/r/sika (en) [es fr ja]

Subreddit

r/minlangs

Post Details

Posted: 10 years ago
External URL: reddit.com/r/minlangs/co...