CS 241 — Summer 2024 — Assignment 2 / Out of 10 marks

Assignments for CS 241
← Assignment 1	Assignment 2	Assignment 3 →
	Friday, June 7 at 5:00 pm
P1 • P2 • P3 • P4 • P5

On this assignment, you will create DFAs for certain regular languages, write a tool that determines if a DFA recognizes a string, then extend this tool to a Simplified Maximal Munch scanner.

You will write DFAs in the cs241 DFA file format, and you will use the cs241.DFA tool to test your DFAs in problems 1 and 2, and optionally in problems 3, 4, and 5.

The cs241.DFA (web version) tool reads a DFA file and for a given test input (specified in the DFA file), the DFA tool tests the input against the DFA.

You will submit your solutions to Marmoset, which will run your solution against some automated tests. Each problem specifies a required filename that you must use for your Marmoset submission.

Note: From this problem onwards for the remainder of the course, when an assignment uses the description "release+secret", what this means is that release tests are worth no marks, but each release test has a similar (but not identical) secret test which is worth marks. The goal of release+secret testing is to ensure that you test for yourself and don't overspecialize to our test cases. When marks are described just as "secret", this means that there are secret test cases which are unrelated to release tests, and you must consider the full range of testing possibilities on your own for such marks, with no feedback.

Problem 1: Alfalfa [1 mark of 10: 1 public] (filename: `alfalfa.dfa`)

Write a DFA with the alphabet {a, l, f} recognizing the language of strings that contain the string "alfalfa".

DFAs are specified in the course specific DFA file format we have created for the course.

Extra Practice Problems

Here are some extra optional problems if you would like more practice with DFAs. You can submit the following files to A2P1 on Marmoset, and Marmoset will check that your DFA is correct.

These problems are not worth marks. There is no bonus for completing them.

catstar.dfa: Construct a DFA with alphabet {a,c,t} recognizing the language described by the regular expression (ca(a)*t)*.
cattac.dfa: Construct a DFA with alphabet {a,c,t} recognizing strings that do not start or end with a, and for every a in the string, the letter before the a is different from the letter after the a.
endalfalfa.dfa: Construct a DFA with alphabet {a, l, f} recognizing the language of strings that end with alfalfa.

Submit all DFA files together in a zip file. If you submit only one, Marmoset will still run the tests but you will only get marks for the one you submitted.

Problem 2: -64 to 512 [1 mark of 10: 0 public / 1 release] (filename: `64-to-512.dfa`)

Write a DFA with the alphabet {-, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9} that recognizes the language of integers made of numeric characters (optionally started with a negative sign), between the range -64 and 512 (inclusive), with no leading 0s, and no negative 0 (i.e. "-0")

Problem 3: DFA Recognizer [2 marks of 10: 1 public / 1 release+secret] (filename: `dfa.c(c|pp)` or `dfa.rkt`)

Write a program that reads a DFA file from standard input, and for each string in the input section of the file, determine whether the string is accepted by the DFA. Print a single line of output (to standard output) for each string:

If the string is accepted, print the string, followed by a space, followed by true
If the string is not accepted, print the string, followed by a space, followed by false

To simplify your task, starter code is provided that reads a DFA file from standard input. The starter code does not actually do anything, but comments are provided at the locations where the code has finished reading certain pieces of information. You should fill in these sections with your own code that sets up appropriate data structures and implements the DFA recognition algorithm.

c++: dfa.cpp
racket: dfa.rkt

If you use the c++ starter code, you can compile the code with:

      g++ -g -O0 -fPIC -std=gnu++20 -Wall -Wextra -pedantic-errors -Wno-unused-parameter -o dfa dfa.cpp

And run the code with:

      ./dfa < input.dfa

If you use the racket starter code, you can run the code with:

      racket dfa.rkt < input.dfa

You are not required to use the starter code, and you are allowed to change the starter code in any way you see fit. However, the starter code does not purposefully contain any bugs or issues that will cause you to lose marks on this assignment. We strongly recommend that you use the starter code.

Note that the DFA file format uses the special keyword .EMPTY to denote the empty string. If .EMPTY is given as input, then in your output, you should print .EMPTY (rather than nothing).

Your program does not need to check errors in the DFA file. Each DFA file in the tests follows the DFA file format, and the specified DFA is a valid DFA. Note that inputs in the .INPUT section are not a part of the definition of DFA.

Your program should behave the same as the cs241.DFA tool aside from error checking (cs241.DFA does error checking, but you do not need to).

For this problem, Marmoset uses diff -Zb to compare your output, so that small differences in whitespace are ignored (e.g., extra space at the end of the line).

Problem 4: Simplified Maximal Munch [3 marks of 10: 1 public / 2 release+secret] (filename: `smm.c(c|pp)` or `smm.rkt`)

Updates and Clarifications

Removed a contradictory description of the expected output. (28 May)

In this problem, you will extend your DFA recognizer to a Simplified Maximal Munch scanner.

As in the previous problem, your program will read a DFA file from standard input. However, it should treat the input section as a single string, rather than multiple whitespace-separated strings. Your program should attempt to tokenize the string using the provided DFA and the Simplified Maximal Munch algorithm.

The program should print the tokens obtained, in order, one per line, to standard output, as the algorithm identifies each token. Each line should be terminated by a line feed (ASCII 0x0A).

If an error is detected, no additional output should be sent to standard out. A message should be printed to standard error. The message must start with ERROR. The error message does not need to be anything more than ERROR, but you might find it useful to include more information.

For instance, if you are given the following input:

      .ALPHABET
      $ 0-9
      .STATES
      start $ $n!
      .TRANSITIONS
      start $ $
      $  0-9 $n
      $n 0-9 $n
      .INPUT
      $2$4$1$241

Then your output must be:

The input string to be scanned will not be empty (i.e., the empty string: "") or epsilon (i.e., .EMPTY). It will contain at least one character.

You do not need to output the state of the DFA that you reached (which often represents the "kind" of the token). You only need to output the actual string (the "lexeme" of the token).

You must implement Simplified Maximal Munch, not some other scanning algorithm. The tests are not just looking for any valid tokenization, but that you produce the specific tokenization that Simplified Maximal Munch would produce, or an error if Simplified Maximal Munch cannot find a tokenization.

For this problem, there is no sample executable to compare your output with. Try testing with simple DFAs where you can easily understand how the string should be broken down into tokens.

Your program does not need to check errors in the DFA file. Each DFA file in the tests follows the DFA file format, and the specified DFA is a valid DFA.

As with the previous problem, Marmoset will use the -Zb options with diff to allow for some leniency in terms of whitespace. However, you should not really be printing whitespace other than a single line feed after each token you output.

Problem 5: Simplified Maximal Munch [3 marks of 10: 0 public / 1 release / 2 secret] (filename: `asm.c(c|pp)` or `asm.rkt`)

Updates and Clarifications

Fixed an error in the definition of DEC which incorrectly prevented a 0, and incorrectly allowed for only 2 characters. (26 May)
Fixed an error in the description of DIRECTIVE which allow a directive to be ".". (26 May)
Fixed an error in the output description that incorrectly stated that the output was like problem 4. (28 May)

Write a program that scans MIPS assembly and outputs its tokenization.

Your program will output a list of tokens, one on each line, with the token kind in all caps, a space character, then the lexeme, and the newline character (ascii 0x10)

Your token types must be:

DIRECTIVE: a period followed by at least one alpha character
LABEL: an alpha character, followed by zero or more alphanumeric characters, follow by a colon
ID: an alpha characters, followed by zero or more alphanumeric characters
HEXINT: The characters 0x followed by 1 or more characters between 0-9, a-f, or A-F
REG: The $ followed by one or two decimal digits
DEC: a single 0 character xor ( an optional -, followed by a character between 1-9, optionally followed by zero or more characters between 0-9 )
COMMA: The , character
LPAREN: The ( character
RPAREN: The ) character

The input to the assembler will have each assembly instruction on only one line.

Your assembler should not crash in any ways, even if the input is not a valid assembly language program. Your assembler should not silently ignore errors in the input program. c++ solutions cannot leak resources.

Your scanner will be tested by both valid mips assembly, invalid mips assembly, and input that doesn't scan. Invalid mips assembly can scan successfully. For instance add $1, $2, $3, $4 will successfully scan. Your program does not need to validate that what it scanned is a valid program. It only needs to output the tokens that it scans, stopping either when the end of input is reached, or when an error is detected.

If the input to your program is

      add $1, $2, $3, $4

Then the output of your program must be:

      ID add
      REG $1
      COMMA ,
      REG $2
      COMMA ,
      REG $3
      COMMA ,
      REG $4

In the earlier Simplified Maximal Munch scanner problem, you were provided a scanning DFA, followed by a string to scan using the DFA. For this problem, the input to your program is simply mips assembly. That is, the input will not include a DFA for mips tokens. You will likely have to create such a DFA yourself and embed it in your program somehow.

If you want to reuse your DFA reading code from the earlier problems, you could embed it as a multi-line string literal (these are called raw strings in C++ and here strings in Racket) and then read it from a std::stringstream (C++) or using open-input-string (Racket).

You could also submit the DFA as a separate file (zip it together with your program) and have your program read it using file I/O. However, note that Marmoset blocks you from writing to files. This means you should open the file as read-only; otherwise opening the file could fail and cause unexpected errors in your code. In C++, you can use std::ifstream to open a file as read-only. In Racket, open-input-file should work.

You can embed a DFA file into a c/c++ by using xxd to generate a raw string:

      xxd -i mips.dfa > mips.cpp && g++ dfa.cpp mips.cpp -g -O0 -fPIC -std=gnu++20 -Wall -Wextra -pedantic-errors -Wno-unused-parameter -o dfa

In your dfa.cpp implementation, you can "read" the file with:

      #include <iostream>
      #include <sstream>

      extern unsigned char mips_dfa[];
      extern unsigned int mips_dfa_len;

      int main()
      {
          std::string foo((char *) mips_dfa, mips_dfa_len);
          std::stringstream in(foo);
      }

CS 241 — Summer 2024 — Assignment 2 / Out of 10 marks

Problem 1: Alfalfa [1 mark of 10: 1 public] (filename: alfalfa.dfa)

Extra Practice Problems

Problem 2: -64 to 512 [1 mark of 10: 0 public / 1 release] (filename: 64-to-512.dfa)

Problem 3: DFA Recognizer [2 marks of 10: 1 public / 1 release+secret] (filename: dfa.c(c|pp) or dfa.rkt)

Problem 4: Simplified Maximal Munch [3 marks of 10: 1 public / 2 release+secret] (filename: smm.c(c|pp) or smm.rkt)

Problem 5: Simplified Maximal Munch [3 marks of 10: 0 public / 1 release / 2 secret] (filename: asm.c(c|pp) or asm.rkt)

Problem 1: Alfalfa [1 mark of 10: 1 public] (filename: `alfalfa.dfa`)

Problem 2: -64 to 512 [1 mark of 10: 0 public / 1 release] (filename: `64-to-512.dfa`)

Problem 3: DFA Recognizer [2 marks of 10: 1 public / 1 release+secret] (filename: `dfa.c(c|pp)` or `dfa.rkt`)

Problem 4: Simplified Maximal Munch [3 marks of 10: 1 public / 2 release+secret] (filename: `smm.c(c|pp)` or `smm.rkt`)

Problem 5: Simplified Maximal Munch [3 marks of 10: 0 public / 1 release / 2 secret] (filename: `asm.c(c|pp)` or `asm.rkt`)