From: blockedofcourse@foo.invalid (Don Y)
Newsgroups: comp.arch.embedded
Subject: Re: Text on FSM
Date: Wed, 22 Mar 2023 18:15:43 -0700
Message-ID: <tvg986$rvkg$1@dont-email.me>
In-Reply-To: <gujm1ihuvs8n1939vh42davuqq6meel3ht@4ax.com>

On 3/22/2023 1:37 PM, George Neuner wrote:
>>>>> Then there is flex.
>>>>>
>>>>> flex has some DFA optimizations available. First, flex can compress
>>>>> the data tables which encode its DFA states. Second, flex can
>>>>> discover "equivalence" classes: groups of characters which result in
>>>>> the same action. And finally, flex can [sometimes] discover
>>>>> "meta-equivalence" classes: commonly expected sequences of characters
>>>>> and/or other equivalence classes.
>>>>
>>>> But, those are mapping equivalent *inputs* together, not equivalent
>>>> *states*. I.e., treating "BEGIN" and "begin" as equivalent.
>>>
>>> No. Consider the case of just recognizing a decimal digit: compare
>>> the graph using the alternation: (0|1|2|3|4|5|6|7|8|9), vs the graph
>>> using the class [:digit:].
>>>
>>> Using the OR alternation, including start you have 11 nodes. Start has
>>> 10 transitions exiting, and each digit node has a single transition
>>> entering.
>>
>> Are you using "node" as a synonym for "state"?
>
> I am using graph terminology - mostly because all FA builders start by
> constructing a state /graph/. After the graph is complete, it may be
> implemented using tables, or Duff's device, etc. ... but it doesn't
> start that way. 8-)

I build FSMs similarly. But, you can't commit graphs to
ASCII text, whereas tables fall out of the graph naturally.
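
To make that concrete, here is a rough sketch in C (names, types and
layout are purely illustrative, not my actual tool's output): each
state reduces to a block of (event, next-state, action) tuples, and
the dispatcher is trivial.

#include <stddef.h>

typedef enum {
    ENTRY, POWERED_UP, OPERATION_COMPLETE,
    HANDLE_BARCODE, SIGNAL_OUTAGE, SAFEGUARD_DATA,
    N_STATES
} state_t;

typedef enum {
    EV_DIGIT, EV_BARCODE, EV_POWER_FAIL, EV_LOW_BATTERY,
    N_EVENTS
} event_t;

typedef void (*action_t)(int arg);

typedef struct {
    event_t  event;    /* symbol recognized in this state   */
    state_t  next;     /* state entered when it is seen     */
    action_t action;   /* work performed on the transition  */
} transition_t;

typedef struct {
    const transition_t *rules;
    size_t              n_rules;
} state_entry_t;

/* Dispatch: scan the current state's tuples for the first match. */
static state_t dispatch(const state_entry_t machine[N_STATES],
                        state_t current, event_t ev, int arg)
{
    const state_entry_t *s = &machine[current];
    for (size_t i = 0; i < s->n_rules; i++) {
        if (s->rules[i].event == ev) {
            if (s->rules[i].action)
                s->rules[i].action(arg);
            return s->rules[i].next;
        }
    }
    return current;    /* unhandled event: stay put (a policy choice) */
}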

The difference seems largely to be that DFAs are geared towards
expressing "languages" (sequences of symbols) whereas FSMs
are geared towards expressing sequences of events/actions.

E.g., you can build an expression to describe the legal
symbol sequences to create a particular type of token
(NUMBER, BEGIN, END, etc.) with the assumption that every
symbol *in* those tokens is handled by a single action
(accumulate_digit).

By contrast, an FSM will often have a variety of very different
symbols recognized in a given state and acted upon differently
(POWER_FAIL, POWER_RESTORED, LOW_BATTERY, CLEAR, ENTER, BARCODE_DETECTED,
etc.). These tend to have more "work" associated with their
recognition than a set of equivalent symbols (e.g., digits).

And, while each may be handled differently from the others,
they tend to be handled the same way when encountered in
different states. I.e., a POWER_FAIL is processed the same
each place it is considered significant.

So, you want a way to express this common set of
conditions/symbols/events that can be reused in many different states.
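
One way to get that reuse, continuing the illustrative sketch above
(all names hypothetical), is to let each state carry its own rules plus
a pointer to a shared block that the dispatcher falls back to:

/* Hypothetical continuation of the sketch above: a shared block of rules
 * that several states chain to, instead of each state repeating the
 * power-fail / low-battery / barcode handling verbatim.                 */
static void accept_digit(int a)          { (void)a; /* ... */ }
static void overwrite_accumulator(int a) { (void)a; /* ... */ }
static void alert_operator(int a)        { (void)a; /* ... */ }
static void orderly_shutdown(int a)      { (void)a; /* ... */ }

static const transition_t exception_rules[] = {
    { EV_BARCODE,     HANDLE_BARCODE, overwrite_accumulator },
    { EV_POWER_FAIL,  SIGNAL_OUTAGE,  alert_operator        },
    { EV_LOW_BATTERY, SAFEGUARD_DATA, orderly_shutdown      },
};

/* A state carries its own rules plus an optional shared block; the
 * dispatcher falls through to the shared block when nothing
 * state-specific matches.                                                */
typedef struct {
    const transition_t *rules;   size_t n_rules;
    const transition_t *shared;  size_t n_shared;
} state_rules_t;

static const transition_t powered_up_rules[] = {
    { EV_DIGIT, ENTRY, accept_digit },
};

static const state_rules_t powered_up = {
    powered_up_rules, sizeof powered_up_rules / sizeof *powered_up_rules,
    exception_rules,  sizeof exception_rules  / sizeof *exception_rules,
};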

>> E.g.,
>>
>> State Powered_Up
>> On '1' Goto Entry Executing Accept_Digit()
>> On '2' Goto Entry Executing Accept_Digit()
>> On '3' Goto Entry Executing Accept_Digit()
>> On '4' Goto Entry Executing Accept_Digit()
>> On '5' Goto Entry Executing Accept_Digit()
>> On '6' Goto Entry Executing Accept_Digit()
>> On '7' Goto Entry Executing Accept_Digit()
>> On '8' Goto Entry Executing Accept_Digit()
>> On '9' Goto Entry Executing Accept_Digit()
Check Exceptions
>> ..
>>
>> State Operation_Complete
>> On '1' Goto Entry Executing Accept_Digit()
>> On '2' Goto Entry Executing Accept_Digit()
>> On '3' Goto Entry Executing Accept_Digit()
>> On '4' Goto Entry Executing Accept_Digit()
>> On '5' Goto Entry Executing Accept_Digit()
>> On '6' Goto Entry Executing Accept_Digit()
>> On '7' Goto Entry Executing Accept_Digit()
>> On '8' Goto Entry Executing Accept_Digit()
>> On '9' Goto Entry Executing Accept_Digit()
Check Exceptions
>> ..

State Exceptions (this is a misnomer but used for syntactic simplicity)
On Barcode_Detected Goto HandleBarcode Executing OverwriteAccumulator()
On Power_Fail Goto SignalOutage Executing AlertOperator()
On Low_Battery Goto SafeguardData Executing OrderlyShutdown()

>> In each of these, I would have a single transition exiting the
>> state labeled "0+1+2+3+4+5+6+7+8+9" invoking Accept_Digit()
>> and destined for "Entry"
>>
>> Operation_Complete (i.e., the *end* of some undisclosed sequence
>> of actions, preparing to restart on another iteration) is equivalent
>> to Powered_Up -- presumably the state just before that sequence of
>> actions is first started (assuming "..." are equivalent)
>>
>> I can detect this with an implication table. Even if a series of
>> states are involved (step1, step2, step3) An algorithm can
>> similarly detect it. And, remove one of these two states (or
>> series of states) from the machine (for example, "Operation_Complete"),
>> replacing all references to it in the machine with "Powered_Up".
>
> Absolutely it /could/ be done with table based algorithms, but I've
> never seen it done that way in any real (non toy) implementation ...
> graph based solutions scale better to large problems.
>
>>> Using the digit class, you have 2 nodes, with 10 transitions that all
>>> get you from start to the digit class node.
>>
>> I don't see where the extra nodes (states) come from.
>
> Think about the "class" as a state in an /NDFA/ ... there are multiple
> transitions that can get you in - one transition (action) that gets
> you out.
>
> When the "class" is implemented it becomes (for lack of better term) a
> subroutine (subgraph) in the resulting DFA. The more the subgraph can
> be reused, the more savings in the DFA.
>
>>> Obviously this is simplistic, because the members of the character
>>> class form a subgraph which itself has to be recognized. The
>>> important point here is that the subgraph as a whole can represent a
>>> /single/ node in a much more complex graph - its constituent
>>> characters need not be repeated in the complex graph. More on this
>>> below.
>>>
>>> A complex DFA that combines many different regex may present other
>>> opportunities to recognize given (possibly arbitrary) sets of
>>> characters - opportunities that may not be apparent from looking at the
>>> constituent regex.
>>>
>>>> Would it recognize a common sequence of state transitions
>>>> that occurs in two different places in the grammar? E.g.,
>>>> specifying the syntax for a numeric quantity in two different
>>>> places only to realize that it's actually the same part of
>>>> the grammar expressed as two different instances?
>>>
>>> When given the option to find equivalence classes, flex can identify
>>> sets of characters that are used repeatedly. Those characters are
>>> gathered into an "equivalence" that then can be a node in the DFA
>>> instead of redundantly repeating individual characters.
>>
>> OK, but that just replaces N *identical*, "parallel" transitions
>> with a single one labeled with a class instead of a discrete value.
>> Just like "0+1+2+3+4+5+6+7+8+9". It hasn't reduced the number of states.
>
> It will reduce the number of states IF the class is reused with the
> same action. Consider
>
> integer: [+-]?[:digit:]+
> float: (((integer)\.)|((integer)?\.(integer)))(\e(integer))?
>
> The [:digit:] class is reused 5 times. The definition of "integer"
> also forms a (5x) reused meta-class that flex could recognize if told
> to look for them.

OK. It's a special case of alternation *and* equivalence.
I build "state subroutines" to handle sets of symbols that
are handled the same way but "invoked" from different states
(see Exceptions). But, the individual symbols can invoke
different actions *and* different next states -- as long as
they are consistent in each "application".
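
For comparison, the input-side compression flex does looks roughly like
this in table terms (illustrative C, not flex's actual generated code):

/* flex-style "equivalence classes" compress the table by first mapping
 * each raw input byte to a small class index; the transition table is
 * then indexed by (state, class) rather than (state, byte), so ten
 * digit transitions collapse into one column.                           */
enum { CLS_OTHER, CLS_DIGIT, CLS_SIGN, CLS_DOT, N_CLASSES };

static unsigned char char_class[256];

static void init_classes(void)
{
    for (int c = 0; c < 256; c++)
        char_class[c] = CLS_OTHER;
    for (int c = '0'; c <= '9'; c++)
        char_class[c] = CLS_DIGIT;
    char_class['+'] = char_class['-'] = CLS_SIGN;
    char_class['.'] = CLS_DOT;
}

/* lookup becomes: next_state = next[state][char_class[input_byte]]; */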

> Since, in this example, the [:digit:] class is always used with the
> same action, it will be implemented in the DFA state graph just once.
> Since the class itself consists of ~13 states: START that waits for
> input, 0..9 that accept input, common EXIT, and common ERROR out if
> something else is input ... treating it AS a class saves 52 states (13
> x 4) in the state graph.
>
> The common exit and error states out may be eliminated from the final
> FA (assuming no conflicting uses of [:digit:]), but they will be
> included in initial construction of the state graph (think of them
> like subroutine preamble/postamble).
>
>>> Remember DFA are deterministic - a node can't take different actions
>>> depending on which of multiple transitions entered (or left) it ... so
>>> if you want the same character to be recognized in a different context
>>> (leading to a different action), you must repeat it in a different
>>> node.
>
>>>> <number> ::= <digit><digits>
>>>>
>>>> <value> ::= <digit><digits>
>>>>
>>>> <expr> ::= <number> <op> <number> | <value> <op> <value>
>>>>
>>>> (silly example; but the inputs in each case are the same)
>>>
>>> You're mixing abstraction levels here: <digit>, <digits>, <number>,
>>> and <value> are lexical tokens, whereas <expr> is syntax.
>>
>> I'm making the point that <number> and <value> are equivalent
>> results. A machine that determines if an input satisfied
>> one would similarly satisfy the other.
>>
>> So, the two machines would be duplicates of each other and
>> subject to being optimized into one (because <expr> treats
>> <number> and <value> as equivalent in *its* machine.)
>
> Yes they are duplicates at some level of abstraction, but that level
> is above what the tool can recognize and deal with.

That was what I was hoping *might* be possible. I can automate
implication analysis to detect and *suggest* these equivalence
reductions (cuz it's a simple process and my tables reduce to
just simple blocks of tuples).

But, I don't want the tool to make that reduction automatically
because the developer may not have *intended* the two states to be
equivalent. I.e., flagging them as equivalent alerts the
developer to the fact that he may have forgotten something
(or, incorrectly specified a symbol/action).

OTOH, for nontrivial FSMs, it is often too hard for a developer to
recognize sequences of states that *can* be equivalenced -- esp.
with the notational shorthands that I've developed.

E.g.,

State Operation_Finished
On Barcode_Detected Goto HandleBarcode Executing OverwriteAccumulator()
On '1' Goto Entry Executing Accept_Digit()
On '2' Goto Entry Executing Accept_Digit()
On '3' Goto Entry Executing Accept_Digit()
On Power_Fail Goto SignalOutage Executing AlertOperator()
On Low_Battery Goto SafeguardData Executing OrderlyShutdown()
On '4' Goto Entry Executing Accept_Digit()
On '5' Goto Entry Executing Accept_Digit()
On '6' Goto Entry Executing Accept_Digit()
On '7' Goto Entry Executing Accept_Digit()
On '8' Goto Entry Executing Accept_Digit()
On '9' Goto Entry Executing Accept_Digit()
..

is equivalent to Operation_Complete. And it would remain so if it
additionally (redundantly) listed "Check Exceptions".
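
A minimal sketch of the implication-table pass I have in mind, reusing
the illustrative transition_t/state_entry_t from the earlier sketch and
assuming each state's tuples have been sorted by event beforehand:

#include <stdbool.h>
#include <string.h>

/* Assume all pairs of states are equivalent, then repeatedly strike out
 * pairs whose rule blocks differ, or whose next-states form a pair that
 * has already been struck out, until nothing changes.  Surviving pairs
 * are the candidates to *suggest* merging (not merge automatically).    */
static bool rows_compatible(const state_entry_t *a, const state_entry_t *b,
                            bool equiv[N_STATES][N_STATES])
{
    if (a->n_rules != b->n_rules)
        return false;
    for (size_t i = 0; i < a->n_rules; i++) {
        const transition_t *ra = &a->rules[i], *rb = &b->rules[i];
        if (ra->event != rb->event || ra->action != rb->action)
            return false;
        if (ra->next != rb->next && !equiv[ra->next][rb->next])
            return false;
    }
    return true;
}

static void find_equivalent_states(const state_entry_t machine[N_STATES],
                                   bool equiv[N_STATES][N_STATES])
{
    memset(equiv, 1, sizeof(bool) * N_STATES * N_STATES);

    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < N_STATES; i++)
            for (int j = 0; j < N_STATES; j++)
                if (equiv[i][j] &&
                    !rows_compatible(&machine[i], &machine[j], equiv)) {
                    equiv[i][j] = equiv[j][i] = false;
                    changed = true;
                }
    }
}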

Finally, mnemonics for symbols resolve to specific values.
So, there's nothing that prevents a developer from assigning
two mnemonics to the same value -- intentionally or accidentally.

[I have some mechanisms that minimize the chance of this
happening. But, there's nothing to prevent me from defining
two semantically equivalent symbols that get bound to
different values -- e.g., Power_Fail and Power_Die]

> The programmer
> labeled them differently and so the tool has to assume both are
> required, even if in use there is no practical way to distinguish them
> in the input.
>
>>> Knowing that yacc and bison CAN handle characters as tokens, and
>>> assuming you have defined <digit> and <digits> elsewhere in your
>>> grammar, neither yacc nor bison can find this kind of equivalence. In
>>> yacc it will result in a reduce/reduce error. In bison what happens
>>> depends on the kind of parser you asked for (LALR,SLR,LR,GLR), but in
>>> any case the result won't be pretty.
>>>
>>> Assuming instead that you meant for <number> and <value> to be
>>> recognized by the lexer rather than the parser ... flex (not lex)
>>
>> Then the question would be whether or not the lexer could
>> recognize these "little machines" were equivalent and rewrite
>> the grammar so only one was present.
>
> You'd need a combination tool that produced parser and lexer from a
> unified spec. There are such tools: e.g., ANTLR ... but I'm not aware
> of any tools that do /that/ kind of optimization.
>
> It's all about the context: there's no practical way to merge
> identical recognizers if they directly lead to different actions. In
> the examples above, [:digit:] could be reused only because every use
> of it simply accumulated another input character ... the differences
> occurred when a non-digit character was entered.

Yes. My point wrt FSM having lots of different input symbols and
different associated next-states/actions.

> If instead you did something like:
>
> integer: [:digit:] return 'i'
> hex: [:digit:]|['a'-'f'] return 'h';
>
> This would blow up in your face because 0..9 would never be recognized
> a hex digit, but more importantly the 2 uses of the class lead
> /immediately/ to different actions so the class subroutine (subgraph)
> would have to be repeated in the FA with different exit actions.

Yes. If the tool places an implicit priority on the rules
based on the order in which they are encountered. I intentionally
don't specify this in the design of the tables, leaving the
"post processor" some latitude in how it implements them
and the runtime some potential efficiencies.

[This also lets me embed ambiguities -- for certain clients :> ]

E.g.,
State Operation
On Barcode_Detected Goto HandleBarcode Executing OverwriteAccumulator()
On '1' Goto Entry Executing Accept_Digit()
On '2' Goto Entry Executing Accept_Digit()
On '3' Goto Entry Executing Accept_Digit()

and

State Operation
On '1' Goto Entry Executing Accept_Digit()
On '2' Goto Entry Executing Accept_Digit()
On '1' Goto Entry Executing Reject_Digit()
On '3' Goto Entry Executing Accept_Digit()
On Barcode_Detected Goto HandleBarcode Executing OverwriteAccumulator()

can be made equivalent -- or different. Barcodes could take priority
over individual digits -- or not. Etc. The dispatch can trade
speed for space -- or vice versa.

[Of course, any parser generator can offer similar options.
But, as most parsers are used in a different context, the
tradeoffs aren't as productive]
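
The two extremes of that trade might look like this (again purely
illustrative, using the earlier sketch's types):

/* (a) Ordered linear scan: the first matching rule wins, so listing
 *     order *is* the priority.  Small, and deliberate ambiguities
 *     (duplicate events) resolve by position.                          */
static const transition_t *match_first(const state_entry_t *s, event_t ev)
{
    for (size_t i = 0; i < s->n_rules; i++)
        if (s->rules[i].event == ev)
            return &s->rules[i];
    return NULL;                      /* unhandled in this state */
}

/* (b) Dense row: one slot per possible event, filled in when the table
 *     is generated (unhandled events point at a default "ignore" rule).
 *     O(1) dispatch, at the cost of a full-width row per state, and any
 *     duplicate source rules must be resolved when the row is built.   */
static const transition_t *match_indexed(const transition_t row[N_EVENTS],
                                         event_t ev)
{
    return &row[ev];
}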

>>> Assuming an appropriate graphical IDE, a designer certainly could
>>> specify a state graph and code for actions, and have a program
>>> generated from it. Given the right input from the designer, a great
>>> deal of checking could be done against the graph to verify that it
>>> covers enumerated inputs and transitions, that specified inputs lead
>>> to specified actions, that action code exists, etc.
>>>
>>> But what is NOT possible is to verify that all /possible/ inputs and
>>> state transitions have been enumerated. Nor is it possible to verify
>>> that required actions have been specified, or necessarily that the
>>> actions are being taken in proper context ... those are things for
>>> which the tool simply MUST trust the graph designer.
>>
>> And, as above, if you want the code to do X, then, somehow, you must
>> put "X" into the graph. In the examples, here, we've lumped everything
>> that needs to be "done" into single function calls associated with each
>> of the transitions. As a single verb-noun is unlikely to be capable of
>> describing much detail, you're still stuck with code -- that the developer
>> had to explicitly write.
>>
>> The dream that you can convert a drawing into an arbitrary *application*
>> exists as a pipe.
>
> Absolutely: somehow actions have to be specified, whether as arbitrary
> user entered code attached to graph nodes, or as predefined "action"
> nodes linked into the graph.

And, the actions must be *complete*. Or, you have to drag in
another abstraction mechanism.

A state diagram for a hardware machine is complete as it spells out
all of the gazintas and cumzoutas.

A railroad diagram effectively describes *syntax* but rarely much
more.

A "programming tool" with the same level of visibility would
likely have a boatload of cryptic syntax/notations that would
make reliable use of the tool dubious (unless a full-time job).

> But as I said above, there are things that simply can't be checked
> without embedding significant domain knowledge into the tool itself.
> That essentially precludes any notion of a generic tool ... even if
> the tool included an expert system, it's likely that no generic
> interface to the expert system could be designed that would
> satisfactorily deal with the needs of many different domains.
