The language of string::Parse algorithm

The string::Parse command differentiates between parsed types. Available ones include:

  • Automaton,
  • Grammar,
  • String,
  • RegExp, and
  • Tree.

All these are themselves described by their own gammars.

To access the parser of the particular types the call to the string::Parse algorithm is followed with the @-initiated type.

Labels

The exposed parsers for the types listed above seldom parse strings and numbers (among others) themselves. In general, if the sequece is to be treated as a label (a value of an alphabet’s symbol, etc.) a general parser is used.

Labels include primitive literals:

  • string and
  • integer or composites:
  • pair,
  • set, and
  • vector.

Anything that starts with a character or underscore and continues as a character, an underscore, or a digit sequece is a string literal. Anything that starts with a digit and continues with digits is an integer literal. Composites are enclosed in a matching pair of brackets, and contain a comma separated sequence of labels (which for pair has to be of length two and for set and vector it is arbitrary).

<1, 2>
{a, b}
[{a}, b]

Be careful, the parser differentiates between integer and string labels. Symbol 01234 and 1234 are integers therefore 01234 == 1234. Zero does not mean octal numbers.

Finite Automata

Finite automata (FA) use a “tabular notation” which is very similar to the notation from the AAG course.

The header of the table (the first row) specifies the type of the automaton:

  • DFA – deterministic FA,
  • NFA – nondeterministic FA,
  • ENFA – nondeterministic FA with epsilon transitions,
  • MISNFA – nondeterministic FA with multiple initial states, and
  • MISENFA – nondeterministic FA with multiple initial states and epsilon transitions.

Further, the first row contains a list of alphabet symbols. Alphabet symbols are separated with a space. If the automaton type allows epsilon-transitions, you have to add a special symbol #E (ε) to end of the list.

Every other row specifies a state and its transition function in the following way. An initial (> symbol), a final (<), or both state traits can be specified (in any order) at the beginning of each row before the state label. After the state label a transition function specification follows. Transition function for a state is a space separated list of states (in case of determnistic automaton) or state sets (in case of nondeterministic automaton). State sets are pipe (|) separated states. Naturally, the length of the transition list and the order relates to the list of alphabet symbols (including epsilon if applicable).

If there is no outgoing transition for given state and symbol combination a dash (-) symbol is used.

NFA example:

NFA a g
>0 0|1 0
1 2 -
2 - 3
<3 - -

Given the fact that alphabet symbols and state labels are parsed with a general label parser, the labels of states can also be sets. State {0,1,2} stands for the set of states 0, 1 a 2, meaning {0, 1, 2} == {2, 0, 1}.

DFA example:

DFA a g
>{0} {0, 1} {0}
{0, 1} {0, 1, 2} {0}
<{0, 3} {0, 1} {0}
{0, 1, 2} {0, 1, 2} {0, 3}

Tree automata

Tree automata are parsed similarly to finite automata. The available types to parse are:

  • DFTA and
  • NFTA.

The format is table like again, with the first row specifying the type and alphabet. Alphabet symbols are ranked; it is the number of predecesor nodes in the accepted tree.

The rest of the table has reversed meaning, first come the comma separated sequences of the source states in brackets for each element of the alphabet. The dash (-) stands for no source. Each source state pack size matches the rank of the corresponding alphabet symbol. The target state concludes each line allong with the specification whether the state is final (>) or not.

DFTA alter 2 concat 2 iter 1 a 0 b 0 c 0 eps 0 emp 0
- - - [] - - - - al
- - - - [] [] [] [] as>
[al,al] [al,al] - - - - - - as
[as,as] [as,as] [as] - - - - - as
[al,as],[as,al] [al,as],[as,al] [al] - - - - - al

Note that there may be multiple lines for a single target state to allow some logical grouping in somewhat bigger transition function set. In such a case the target state is final in at least one line marks the target state as final.

Regular Expressions

Regular expressions are described using the well-known string format.

We use the following operators:

  • + to describe alternation,
  • * to describe iteration and parentheses, and
  • () to achieve priorities. The concatenation is implicit, however, you need to delimit the two concatenated symbols with a space otherwise the sequence is treated as a single symbol.

Regular expressions ε and ∅ are described using the special symbols #E and #0, respectively.

One symbol of the regular expression corresponds to one symbol of the alphabet. Two concatenated symbols must be delimited with a space. For instance, the input 11 corresponds to one alphabet symbol 11. However, the input 1 1 described the concatenation of two symbols 1. Integer literals loose leading zeros, if an integer is followed by a string literal, the two are separate concatenated symbols.

Example: a(b + c)* + #0 b c c(#E + a) corresponds to the regular expression a(b+c)* + ∅ bcc(ε+a) over alphabet {a,b,c}.

Grammars

Grammars are described using very similar notation to the one you know from AAG course. Grammar is a quadruplet (N, T, P, S) where N, T and P are sets and S is an element of N.

First, we specify the type of the grammar:

  • RIGHT_RG stands for (right) regular grammar,
  • LEFT_RG stands for left regular grammar,
  • CFG stands for context-free grammar,
  • GNF stands for context-free grammar in the Greibach normal form, and
  • CNF stands for context-free grammar in the Chomsky normal form,
  • there are more types, RIGHT_LG, LEFT_LG (right and left linear grammars) to conclude grammars describing regular languages and LG (linear grammar), EpsilonFreeCFG (CFG without epsilon rules) to conclude grammars describing context free languages.

Immediately following are the opening parenthesis of the quadruplet, the set of nonterminal symbols, the set of terminal symbols, the set of rules, the initial nonterminal, and the closing parenthesis.

The set of nonterminal and terminal symbols is inside curly brackets {}. The symbols in the set are delimited by a comma and a space, e.g. {A, B, C}.

The set of rules is again inside curly brackets. Left hand side and right hand side of the rules are delimited using an arrow (->). We can join multiple rules with the same left hand side using a pipe (|). The individual rules are delimited by a comma symbol. See example below.

In the case of epsilon-rule (A -> ε), the right hand side of the rule is empty (there is no special symbol for &; see Example 1 - nonterminal S or Example 2 - nonterminal D). Again, spaces need to be used to delimit the symbols on the rule’s right hand side, i.e. abc is a single symbol abc, a b c is a sequence of symbols a, b, c.

Example 1: Regular grammar

RIGHT_RG (
{S, A, B, C},
{a, b, c},
{
	S -> a A | b B |,
	A -> a A | b B,
	B -> b C,
	C -> c A | a
},
S)

Example 2: Context-free grammar

CFG (
{C, A, B, D},
{a, b},
{
	A -> a A b |,
	B -> a a A b b | B,
	C -> b B | D,
	D -> 
},
A)

Strings

Strings are enclosed in quotes, actually, since the query language already uses quotes to recognise string literals, the quotes enclosing the actual string need to be escaped.

The usual rules for a sequence of symbols apply, i.e. to delimit the symbols spaces need to be used.

String parser automatically computes an alphabet from the string content.

\"a b\" is a string ab consisting of two symbols. \"\" is an empty string. \"ab\" is a string ab consisting of only one symbol, ab.

Trees

In general there are two options, the tree may be ranked or unranked. Either the ranks are associated with each symbol, or each sequence of subtrees is enclosed in a root-bar pair.

Ranks correspond to the number of subtrees of the given node.

RANKED_TREE a 2 b 2 b 2 a 0 a 0 a 0 a 0
UNRANKED_TREE a b b a | a | | a | | a | |

There are other kinds of trees that allow various special symbols:

  • RANKED_PATTERN,
  • RANKED_EXTENDED_PATTERN,
  • RANKED_NONLINEAR_PATTERN,
  • UNRANKED_PATTERN,
  • UNRANKED_NONLINEAR_PATTERN, and
  • UNRANKED_EXTENDED_PATTERN.

To specify a node wildcard, #N is used. It behaves like a node’s symbol, i.e. rank or bar symbol need to be used with the node wildcard. Node wildcards are only permitted in extended patterns. To specity a subtree wildcard, #S is used. It automatically is associated with rank 0 and thus it must not be specified, however in order to use it in an unranked pattern it needs to be followed by a bar symbol. Subtree wildcards are allowed in all patterns. Nonlinear variables are any other #-initiated symbols. They behave eactly like subtree wildcards. Nonlinear variables are only permitted in nonlinear patterns.

RANKED_PATTERN a 2 a 1 #S a 0
RANKED_EXTENDED_PATTERN a 2 #N 1 #S a 0
UNRANKED_EXTENDED_PATTERN a #N #S | #S | #S | | |"