HW2 - Regular Language Processing

Draw (eg by hand with a pen) NFAs to accept the languages
1. number of a's is even and number of b's is multiple of three (alphabet = {a, b})
2. a comma-serparated list of legal C identifiers, with white space allowed where appropriate. (By a legal C identifier I mean any sequence of letters, digits, and '_', starting with a letter or an underscore.)

Write Scheme functions to do each of the following. They should not use any side effects, should be properly commented and indented, etc. Your filename should have the extension .scm, for our convenience as well as your own. (10 points each.)

Write accepts? which takes a DFA in the form described below, and a list of symbols representing a string that might be in the language, and returns true iff the DFA accepts the given string.
Write nfa-to-dfa which takes an NFA in the below syntax and transforms it into a DFA, using the naive power-set method. (There is no need for the resulting DFA to be minimized or optimized in any fashion.)
Write re-to-nfa which takes a regular expression in the below syntax and returns an NFA which accepts the language described by the given regular expression.

Examples:

(accepts? (nfa-to-dfa (re-to-nfa '(concat foo (* bar) foo)))
          '(foo bar bar bar bar foo))   ; returns #t

(accepts? (nfa-to-dfa (re-to-nfa '(concat foo (* bar) foo)))
          '(foo bar bar bar bar))       ; returns #f

Representations

All the representations are optimized for ease of manipulation in Scheme. Symbols in the alphabet (ie in Sigma) can be any Scheme symbol.

DFA

List of four elements. The first is the alphabet (as a list). The second is a list consisting of the start state. The third is a list of accepting states. The fourth is a list of list of ``state descriptors.'' Each state descriptor is a list whose first element is the name of the state, and whose remaining elements are lists of targets of transitions and the symbols associated with those transitions. For example, the ``number-of-a's minus number-of-b's equals two modulo three'' DFA from class would be represented as:

((a b)                           ; Sigma (the alphabet)
 (mod0)                          ; start state
 (mod2)                          ; accepting states
 ((mod0 (mod1 a) (mod2 b))       ; transitions out of state MOD0
  (mod1 (mod2 a) (mod0 b))       ; ... MOD1
  (mod2 (mod0 a) (mod1 b))))     ; ... MOD2

There is an implicit ``sink state'' as described in class.

NFA

The representation of NFAs is the same as for DFAs, except that there might be more than one listed transition out of a given state for a given symbol, and the list of symbols in a transition can be empty, to indicate an ``epsilon transition''. Eg:

((a b c d e f g h i j k l m n o p q r s t u v w x y z)
 (s)
 (b0c b1c)
 ((s (b0a b) (b1a b))
  (b0a (b0b o))
  (b0b (b0c b))
  (b0c (b0a b))
  (b1a (b1b i))
  (b1b (b1c b))
  (b1c (b1a b))))

Regular Expression

This is a scheme-ey syntax for REs.

A symbol is an RE.
The s-expression (* re) is an RE, denoting the Kleene closure of re.
The s-expression (+ re) is an RE, denoting one or more repetitions of re.
The s-expression (concat re1 re2 ...) is an RE, denoting the concatenation of the given re's.
The s-expression (union re1 re2 ...) is an RE, denoting the union of the given re's.

Due fifteen minutes before class: 3:45pm, Wed Sep 5.

We will take off as many points for a 23-hour-late assignment as for a 24-second late one, so please don't skip class to work on it.

Turn in your code by running

 ~bap/bin/handin your-file.scm

on a regular UNM CS machine. (You should use whatever filename is appropriate in place of your-file.scm. You can put multiple files on the command line, or even directories. Directories will have their entire contents handed in, so please be careful with them, and in particular be sure to clean out any cruft.)

If you wish, the answers to problem 1 can be drawn by hand and turned in before the start of class, in the classroom, on Wed Sep 5.

Barak Pearlmutter <bap@cs.unm.edu>