Unicode Parser by Andrew Plotkin
Approved for the Public Library


Version 8/150625, for Inform 7 6L38




Tags: glulx, parser, unicode


When you include this extension, I7 will appear to behave as it always does. However, the command line will be read using a Unicode-friendly input call, and the internal parsing dictionary will contain Unicode strings instead of byte strings.

This means that, theoretically, you can define nouns and verbs using any Unicode character (not just basic Latin-1). However, the I7 language does not currently permit this, so we have to indulge in some trickery to make these definitions possible.

(By the way, if you're reading these docs in the I7 IDE, you'll see a lot of "[unicode ...]" substitutions in the sample code. You can type Unicode characters directly in your I7 source code! The samples should be written that way, but I7 mangles them when it formats the IDE documentation. Read the Unicode Parser.i7x file directly to see cleaner sample code.)

Section: Unicode synonyms for verbs

To define a verb synonym with Unicode characters:

	Include (-
	Verb '@{3C0}@{3B1}@{3AF}@{3C1}@{3BD}@{3C9}' = 'get';
	Verb '@{3C0}@{3B1}@{3B9}@{3C1}@{3BD}@{3C9}' = 'get';
	Verb '@{11D}et' = 'get';
	-) after "Grammar" in "Output.i6t".

The strings here are single-quoted strings of characters defined with the I6 '@{hexadecimal}' format.

The first line is the Greek word "παίρνω". (I apologize for butchering Greek here -- all my translation is due to Google!) With this definition, the command "παίρνω lamp" will work. So will "Παίρνω Lamp"; as usual, commands are converted to lower case where possible.

The second line is the same word, but without the accent mark. The dictionary considers accents significant while matching, so if you want to accept the verb "παιρνω" (or "Παιρνω") you need this line. (Again, I don't know if a Greek speaker would leave off the accent mark! Probably not.)

The third line defines the verb "ĝet" in the same way. This is by way of demonstrating normalization. The Unicode standard permits two ways to define this string: "ĝet" and "ĝet".

These probably look the same to you, but they're not. The former is three characters long, as you might expect; it starts with the Unicode character named LATIN SMALL LETTER G WITH CIRCUMFLEX. (Unicode loves these verbose names.) The second example is *four* characters long; the first character is LATIN SMALL LETTER G, and the second is COMBINING CIRCUMFLEX ACCENT. The "^" stacks on top of the "g" when the pair is displayed.

The combined form is more common, but a player might type either form. Therefore, this extension *tries* to accept both, by "normalizing" the input words. However, the Glk normalization function is relatively new, and may not be available. The Mac Inform IDE 6G60 lacks this call, for example. When the call is missing, the four-character form will not be recognized by the verb definition shown above. To accept it, we'd need an additional line:

	Include (-
	Verb 'g@{302}et' = 'get';
	-) after "Grammar" in "Output.i6t".

You can also define an entire verb line (with prepositions and everything), using the I6 syntax:

	Include (-
	Verb '@{11D}et'
		* 'i@{3B7}' noun -> Enter;
	-) after "Grammar" in "Output.i6t".

(Accepts the command "ĝet iη boat".) However, it is currently not possible to refer to a custom action this way -- only to the predefined ones.

Section: Unicode synonyms for nouns

This is also ugly. To define a synonym for an object, we have to define an I6 class:

	Include (-
	Class rock_name_class
		with name '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C2}' '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C3}';
	-) before "Object Tree" in "Output.i6t".

	The rock is a thing.
	Include (- class rock_name_class -) when defining the rock.

The rock_name_class class acts as a mix-in which adds the strings "βράχος" and "βράχοσ" to the rock. (I'm including the two variations on the final letter sigma.) We can now accept the command "παίρνω βράχος" (or "Παίρνω Βράχος").
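Writing the '@{hexadecimal}' strings by hand is tedious, so here is a minimal sketch of an offline helper in Python that generates them. It is not part of the extension; the function names (i6_escape, strip_accents, dictionary_variants) are hypothetical, and it assumes the word is already lower case and contains no apostrophe or "@", which I6 treats specially inside dictionary words. It also shows the composed and decomposed forms of "ĝet" discussed above.

```python
import unicodedata


def i6_escape(word):
    """Return an I6 single-quoted dictionary word, with every
    non-ASCII character written in the '@{hexadecimal}' format.
    Assumption: no apostrophe or '@' appears in the word."""
    parts = []
    for ch in word:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.append("@{%X}" % ord(ch))
    return "'" + "".join(parts) + "'"


def strip_accents(word):
    """Drop combining marks: decompose, filter, then recompose."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def dictionary_variants(word):
    """List spellings worth declaring: as given, without accents,
    and (for Greek) with a final sigma swapped for a medial one."""
    base = [word, strip_accents(word)]
    swapped = [w[:-1] + "\u03c3" for w in base if w.endswith("\u03c2")]
    result = []
    for w in base + swapped:
        if w not in result:
            result.append(w)
    return result


# The two spellings of the same visible word differ in length.
composed = unicodedata.normalize("NFC", "g\u0302et")    # 3 characters
decomposed = unicodedata.normalize("NFD", "\u011det")   # 4 characters
print(len(composed), i6_escape(composed))      # 3 '@{11D}et'
print(len(decomposed), i6_escape(decomposed))  # 4 'g@{302}et'

# Print every variant of the rock's Greek name as an I6 string.
for variant in dictionary_variants("\u03b2\u03c1\u03ac\u03c7\u03bf\u03c2"):
    print(i6_escape(variant))
```

The printed strings can be pasted into a Verb line or a "with name" list as shown in the sections above. Whether a given variant is worth accepting is still a judgment call about what players will actually type.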
Section: Synonyms from Unicode properties

It's possible to recognize an object from an indexed text property, and the indexed text can contain Unicode. This is less ugly, and you can set it up without requiring I6 code. But it's not very flexible; it only lets you recognize one Unicode word per object. (Or one per property, I suppose. You could add several properties that work this way.)

	The lamp has an indexed text called the greek-synonym.
	Understand the greek-synonym property as describing the lamp.
	The greek-synonym of the lamp is "λυχνία".

Section: Details for the I6 hacker

This extension modifies Inform's internal command buffers to be Unicode arrays (arrays of 32-bit integers) rather than plain character arrays (arrays of 8-bit characters). These are the "buffer", "buffer2", and "buffer3" arrays.

We update the parser functions that manage these arrays: VM_ReadKeyboard, VM_CopyBuffer, VM_PrintToBuffer, VM_Tokenise, LTI_Insert, GGWordCompare, WordAddress, PrintSnippet, SpliceSnippet, NounDomain, CPrintOrRun, SetPlayersCommand, DECIMAL_TOKEN, TIME_TOKEN, INDEXED_TEXT_TY_ROGPR, DA_Topic, TestKeyboardPrimitive, and of course a couple of sections of Parser__parse.

We add a Glulx_PrintAnyToArrayUni function, which prints to a Unicode array.

Section: Caveats

This extension is intended for Inform 7 build 6L38. It will not work with earlier versions, and has not been tested with any later version.

Things that definitely don't work (as of 6L38):

- Parsing defined units, such as "$1.25" or "26 kg". The parsing routines for these are generated by I7.

- Automatic testing of Unicode commands, such as "test me with 'get λυχνία'." The test-command arrays are generated by I7 as byte arrays, and any Unicode characters are mangled into literal "[unicode ...]" strings. (Test commands that contain only Latin-1 characters will continue to work.)

- Writing and reading Unicode in command-history files.
  This is possible (it would require modifying more uses of gg_commandstr) but the feature is not in common use these days.

- Any extension that uses I6 code to manipulate the command buffer directly.

Example: ** Ungrammatical Greek - Defining verb and noun synonyms containing Unicode characters.

In this sample, we accept the synonym "παίρνω" for taking, "σήμα" for the sign, and "βράχος" for the rock. We also accept the variants "παιρνω", "σημα", "βραχος", "βράχοσ", and "βραχοσ".

	*: "Ungrammatical Greek"

	Include Unicode Parser by Andrew Plotkin.

	Ancient Greece is a room. "You stand in the crossroads at the center of Classical Athens, circa 330 BC. Except that you used a cut-rate time machine to get here, so everybody is wearing blue jeans and you're pretty sure their Greek is by way of Google Translate."

	A sign is fixed in place in Greece. "A [sign] reads: Test me with 'παίρνω βράχος'!"
	After printing the name of the sign: say " (σήμα)".

	A rock is in Greece.
	After printing the name of the rock: say " (βράχος)".

	Include (-
	Class sign_name_class
		with name '@{3C3}@{3AE}@{3BC}@{3B1}' '@{3C3}@{3B7}@{3BC}@{3B1}';
	-) before "Object Tree" in "Output.i6t".
	Include (- class sign_name_class -) when defining the sign.

	Include (-
	Class rock_name_class
		with name '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C2}' '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C3}' '@{3B2}@{3C1}@{3B1}@{3C7}@{3BF}@{3C2}' '@{3B2}@{3C1}@{3B1}@{3C7}@{3BF}@{3C3}';
	-) before "Object Tree" in "Output.i6t".
	Include (- class rock_name_class -) when defining the rock.

	Include (-
	Verb '@{3C0}@{3B1}@{3AF}@{3C1}@{3BD}@{3C9}' '@{3C0}@{3B1}@{3B9}@{3C1}@{3BD}@{3C9}' = 'get';
	-) after "Grammar" in "Output.i6t".

Example: **** Tedious UniParse Test - A bunch of boring test cases to ensure that everything works.

	*: "Tedious Test"

	Include Unicode Parser by Andrew Plotkin.

	The Kitchen is a room. The description is "To really test this extension, run through all of the following commands. (I can't use a 'test me' script for all of this, because Unicode isn't interpreted correctly in testing commands!)[para][command list]".

	To say command list:
		say " [fix]>> test me[/fix] [em]('x me'; tests a basic test command)[/em][br]";
		say " [fix]>> παίρνω βράχος[/fix] [em](takes the rock)[/em][br]";
		say " [fix]>> drop ΒΡΑΧΟΣ[/fix] [em](drops the rock)[/em][br]";
		say " [fix]>> examine article[/fix] [em](prints 'An article is a device to test capitalization. The article is not otherwise interesting; it's just an article'; tests a/an/the/A/An/The)[/em][br]";
		say " [fix]>> examine brass lamp[/fix] [em](tests property recognition of indexed text)[/em][br]";
		say " [fix]>> xyz me[/fix] [em](translated to 'examine me'; tests snippet splicing)[/em][br]";
		say " [fix]>> xyz βράχος[/fix] [em](examines the rock; tests snippet splicing with unicode)[/em][br]";
		say " [fix]>> say hello there to steve[/fix] [em](tests topic parsing)[/em][br]";
		say " [fix]>> say παίρνω βράχος to steve[/fix] [em](ditto, unicode)[/em][br]";
		say " [fix]>> x qlamp[/fix] [em](examines the lamp; tests replacing the player's command)[/em][br]";
		say " [fix]>> x qrock[/fix] [em](examines the rock; ditto, unicode)[/em][br]";
		say " [fix]>> say qrock foo to steve[/fix] [em](tests splicing *and* replacement)[/em][br]";
		say " [fix]>> set lamp to lead[/fix] [em]('You set the lead lamp to 'lead''; tests displaying an action with a topic)[/em][br]";
		say " [fix]>> x lead lamp[/fix] [em](recognition of new property)[/em][br]";
		say " [fix]>> set lamp to Ω37∞Б[/fix] [em]('You set the ω37∞б lamp to 'ω37∞б''; ditto, unicode; also lowercasing)[/em][br]";
		say " [fix]>> x ω37∞б lamp[/fix] [em](recognition of new property)[/em][br]";
		say " [fix]>> examine dfg rock[/fix] [em]('You can't see any such thing'...)[/em][br]";
		say " [fix].. oops βράχος[/fix] [em](tests 'oops')[/em][br]";
		say " [fix]>> get [/fix][em]('What do you want to get?')[/em][br]";
		say " [fix].. βράχος[/fix] [em](takes the rock; tests disambiguation splicing)[/em][br]";
		say " [fix]>> get lamp then get rock [/fix][em](tests command chaining)[/em][br]";
		say " [fix]>> examine me[/fix][br]";
		say " [fix].. again[/fix] [em](tests 'again')[/em][br]";
		say " [fix]>> examine βράχος[/fix] [br]";
		say " [fix].. again[/fix] [em](ditto, unicode)[/em][br]";
		say " [fix]>> i.again[/fix] [em](tests a particular parser guard against infinite loop)[/em][br]";
		say " [fix]>> count 3. count 19. count 321. count five[/fix] [em](test number parsing)[/em][br]";
		say " [fix]>> count 98765. count -543210[/fix] [em](test large number parsing)[/em][br]";
		say " [fix]>> measure 3. measure -2.1. measure 1.2e3. measure 4*10^-1[/fix] [em](test real number parsing)[/em][br]";
		say " [fix]>> measure -4.jump. measure 3.1 * 10^1. examine me[/fix] [em](more real number parsing)[/em][br]";
		say " [fix]>> time 3[/fix] [em](test time parsing)[/em][br]";
		say " [fix]>> time 11 pm[/fix] [em](ditto; multiple on a line don't work)[/em][br]";
		say " [fix]>> time 4:50[/fix] [br]";
		say " [fix]>> time 20 to 5 pm[/fix] [br]";

	The lamp is in the Kitchen.
	The lamp has an indexed text called the adjective. The adjective of the lamp is "brass".
	The printed name of the lamp is "[adjective] lamp".
	Understand the adjective property as describing the lamp.

	The rock is in the Kitchen.

	An article is in the Kitchen.

	Steve is a person in the Kitchen.

	Check examining the article:
		instead say "[A noun] is a device to test capitalization. [The noun] is not otherwise interesting; it's just [a noun]."

	Include (-
	Class rock_name_class
		with name '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C2}' '@{3B2}@{3C1}@{3AC}@{3C7}@{3BF}@{3C3}' '@{3B2}@{3C1}@{3B1}@{3C7}@{3BF}@{3C2}' '@{3B2}@{3C1}@{3B1}@{3C7}@{3BF}@{3C3}';
	-) before "Object Tree" in "Output.i6t".
	Include (- class rock_name_class -) when defining the rock.

	Include (-
	Verb '@{3C0}@{3B1}@{3AF}@{3C1}@{3BD}@{3C9}' '@{3C0}@{3B1}@{3B9}@{3C1}@{3BD}@{3C9}' = 'get';
	-) after "Grammar" in "Output.i6t".

	To decide what snippet is snippet at word (N - number) length (L - number):
		(- (({N})*100 + ({L})) -).

	To say para -- running on:
		(- DivideParagraphPoint(); new_line; -).
	To say br -- running on:
		(- new_line; -).
	To say em -- running on:
		(- style underline; -).
	To say /em -- running on:
		(- style roman; -).
	To say fix -- running on:
		(- font off; -).
	To say /fix -- running on:
		(- font on; -).

	After reading a command:
		let T be indexed text;
		let T be the player's command;
		if T matches the regular expression "^xyz":
			replace word number 1 in T with "examine";
			say "(Changing command to '[T]'.)";
			change the text of the player's command to T;
		if word number 2 in T is "qlamp":
			let snip be snippet at word 2 length 1;
			replace snip with "lamp";
			say "(Changing command to '[the player's command]'.)";
		if word number 2 in T is "qrock":
			let snip be snippet at word 2 length 1;
			replace snip with "βράχος";
			say "(Changing command to '[the player's command]'.)";

	Check answering Steve that:
		instead say "You say '[the topic understood]' to Steve."

	Check setting the lamp to:
		say "(Current action: [current action].)[br]";
		now the adjective of the lamp is the topic understood;
		instead say "You set the lamp to '[the topic understood]'."

	Counting is an action applying to one number.
	Understand "count [number]" as counting.
	Report counting:
		say "You count to [the number understood]."

	Measuring is an action applying to one real number.
	Understand "measure [real number]" as measuring.
	Report measuring:
		say "You measure [the real number understood]."

	Time-checking is an action applying to one time.
	Understand "time [time]" as time-checking.
	Report time-checking:
		say "That's [the time understood]."

	Test me with "x me".