Project 1

Deadline P1A: Friday, October 6, 11:59pm
P1B: Friday, November 3, 11:59pm
Name on Marmoset P1A, P1B
To Submit P1A: mipsscan.cc OR mipsscan.rkt
P1B: wlp4scan.cc OR wlp4scan.rkt
Marking Scheme P1A: 50% (14 release test marks, 36 secret test marks)
P1B: 50% (22 release test marks, 28 secret test marks)

Simplified Maximal Munch Scanning

For this project, you will write two scanners:

Required Lecture Material

The lectures have covered enough material to complete both scanners. You need to understand DFAs, the simplified maximal munch algorithm, and the general concepts behind scanning. In terms of the course notes, you need to have read Chapter 3 in its entirety.

Both scanners are to read a sequence of ASCII characters from standard input, tokenize the sequence using simplified maximal munch, and output the tokenization, printing one token per line. Specifications for the valid tokens are given below.

Read This Before You Start!

Reference implementations, starter code, secret tests, and more!

Token Specifications

The following tables define the kinds and lexemes of the tokens your scanner must produce. Restrictions on certain token kinds are also specified.

Kinds, Lexemes, & Restrictions: MIPS Assembly

Kind Description of Associated Lexemes
ID Identifiers: Strings starting with an alphabetic character (a-z or A-Z), followed by zero or more alphanumeric characters (a-z or A-Z or 0-9).
DOTID Dot-Identifiers: An . character followed by an ID lexeme.
LABELDEF Label Definitions: An ID lexeme followed by a : character.
DECINT Decimal Integers: Strings which optionally start with a - character, followed by one or more decimal digits (0-9).
HEXINT Hexadecimal Integers: Strings starting with 0x, followed by one or more hexadecimal digits (0-9 or a-f or A-F).
REGISTER Registers: A $ character followed by one or more decimal digits (0-9). The numeric value of the string of decimal digits cannot exceed 31.
COMMA Comma: A , character.
LPAREN Left Parenthesis: A ( character.
RPAREN Right Parenthesis: A ) character.
NEWLINE Either a line feed character (ASCII 0x0a), or the two-character sequence carriage return, line feed (ASCII 0x0d, ASCII 0x0a).
When printing this token, do not print the lexeme. Print a line containing only the kind.
?WHITESPACE Whitespace: A sequence of one or more space (ASCII 0x20) or tab (ASCII 0x09) characters.
?COMMENT Comment: A ; character, followed a sequence of zero or more ASCII characters that does not contain a line feed (ASCII 0x0a) or carriage return (ASCII 0x0d).
(In other words, once a ; character is encountered, everything up to the next line feed or carriage return is a ?COMMENT.)

Restrictions:

A DFA file that recognizes the lexemes of valid MIPS assembly tokens is provided. Note that the DFA does not assign kinds to lexemes for you, nor verify restrictions. It just recognizes strings that are valid lexemes.

Kinds, Lexemes, & Restrictions: WLP4

Kind Description of Associated Lexemes
ID Identifiers: Strings starting with an alphabetic character (a-z or A-Z), followed by zero or more alphanumeric characters (a-z or A-Z or 0-9), not equal to any of the following keywords: "int", "wain", "if", "else", "while", "println", "return", "new", "delete", "NULL".
NUM Numbers: Strings consisting of one or more decimal digits (0-9).
INT Keyword int: The string "int".
WAIN Keyword wain: The string "wain".
IF Keyword if: The string "if".
ELSE Keyword else: The string "else".
WHILE Keyword while: The string "while".
PRINTLN Keyword println: The string "println".
RETURN Keyword return: The string "return".
NEW Keyword new: The string "new".
DELETE Keyword delete: The string "delete".
NULL Keyword NULL: The string "NULL".
LPAREN Left Parenthesis: The ( character.
RPAREN Right Parenthesis: The ) character.
LBRACE Left Brace: The { character.
RBRACE Right Brace: The } character.
LBRACK Left Bracket: The [ character.
RBRACK Right Bracket: The ] character.
BECOMES Equals: The = character.
PLUS Plus: The + character.
MINUS Minus: The - character.
STAR Star: The * character.
SLASH SLASH: The / character.
PCT Percent: The % character.
AMP Ampersand: The & character.
COMMA Comma: The , character.
SEMI Semicolon: The ; character.
LT Less Than: The < character.
GT Greater Than: The > character.
LE Less Than or Equal To: The string "<=".
GE Greater Than or Equal To: The string ">=".
EQ Equal To: The string "==".
NE Not Equal To: The string "!=".
?WHITESPACE Whitespace: A sequence of one or more of the following characters: space (ASCII 0x20), tab (ASCII 0x09), line feed (ASCII 0x0a) or carriage return (ASCII 0x0d).
?COMMENT Comment: The string "//", followed a sequence of zero or more ASCII characters that does not contain a line feed (ASCII 0x0a) or a carriage return (ASCII 0x0d).

Restrictions:

A DFA file for WLP4 lexemes is not provided.

Testing your Scanners

If using C++, compile your MIPS scanner as follows:

g++ -g -Wall -std=c++17 mipsscan.cc -o mipsscan
Or your WLP4 scanner as follows:
g++ -g -Wall -std=c++17 wlp4scan.cc -o wlp4scan
If your program uses other .cc files, you must include those in the compilation command as well.

In the examples below, if using Racket, substitute ./mipsscan with racket mipsscan.rkt and substitute ./wlp4scan with racket wlp4scan.rkt.

If you are using C++ and want to test with Valgrind, add valgrind before the name of your program. For example: valgrind ./mipsscan

MIPS Scanner, successful scan example

Suppose the file program.asm contains the following data:

lis $4
.word 16
lw $3, 0xfffc($4) ; what does this do?
jr $31
If you run your MIPS scanner with this file, you should get the following output:
./mipsscan < program.asm
ID lis
REGISTER $4
NEWLINE
DOTID .word
DECINT 16
NEWLINE
ID lw
REGISTER $3
COMMA ,
HEXINT 0xfffc
LPAREN (
REGISTER $4
RPAREN )
NEWLINE
ID jr
REGISTER $31
NEWLINE
MIPS Scanner, unsuccessful scan example

Suppose the file program.asm contains the following data:

jr $32
If you run your MIPS scanner with this file, it should produce a message on standard error containing the string ERROR, in ALL CAPS. You can confirm the message is being sent to standard error by redirecting standard output to /dev/null:
./mipsscan < program.asm > /dev/null
ERROR: Invalid register number: $32
The error message does not need to be informative, but it must contain ERROR in ALL CAPS or else Marmoset won't be able to tell that you are producing an error. Making your error messages informative is still recommended because it can help you debug your program.
WLP4 Scanner, successful scan example

Suppose the file program.wlp4 contains the following data:

// Simple program that adds 

int wain(int a, int b) {
  // Just add the numbers together
  return a + b; 
}
If you run your WLP4 scanner with this file, you should get the following output:
./wlp4scan < program.wlp4
INT int
WAIN wain
LPAREN (
INT int
ID a
COMMA ,
INT int
ID b
RPAREN )
LBRACE {
RETURN return
ID a
PLUS +
ID b
SEMI ;
RBRACE }
WLP4 Scanner, unsuccessful scan example

Suppose the file program.wlp4 contains the following data:

int factorial(int n) {
  return (n!);
}
If you run your WLP4 scanner with this file, it should produce a message on standard error containing the string ERROR, in ALL CAPS. You can confirm the message is being sent to standard error by redirecting standard output to /dev/null:
./wlp4scan < program.wlp4 > /dev/null
ERROR: Simplified maximal munch failed at this point:
int factorial(int n) {
  return (n!)
The error message does not need to be informative, but it must contain ERROR in ALL CAPS or else Marmoset won't be able to tell that you are producing an error. Making your error messages informative is still recommended because it can help you debug your program.

You can use the reference implementations mipsscan and wlp4scan to see what the expected output should be for other inputs. The diff command is helpful for comparing your output to the expected output.

Stepping Stones

Here are suggestions for how to approach this project. They may help if you are overwhelmed and unsure how to start.

Begin by downloading the following starter code (C++, Racket). The starter code for each language contains:

By default, the starter code is set up to print information about the DFA stored in the string constant.

Marmoset Note: When submitting to Marmoset, you must submit all files your program uses. If submitting using the web interface, submit a ZIP file containing the files you want to submit. If submitting using the command line, you can just type the names of the files to submit. For example, if you renamed dfa-print.cc to mipsscan.cc, but your code still uses dfa.h and dfa.cc, you would write:

/u/cs_build/bin/marmoset submit cs241 P1A mipsscan.cc dfa.h dfa.cc
Step 1: Representing a DFA in Code

Create a class or struct representing a DFA. Using the function for printing information about a DFA as a starting point, write a function that reads a DFA file and constructs an instance of the DFA class or struct. You can test it by replacing the string constant in dfa.h / dfa.rkt, or by passing your own files to the function directly.

Step 2: Get Next State Function

Using your DFA class or struct, write a "get next state" function which takes a state and a character as parameters. If there is an arrow leading out of the input state that is labelled with the input character, return the state pointed to by the arrow. Otherwise, return a value indicating there is no next state to go to.

Step 3: Simplified Maximal Munch Function

Use the "get next state" function to implement a "simplified maximal munch" function, which takes a string (or an input stream/port), tokenizes it with simplified maximal munch, and returns a sequence of strings (token lexemes). Do not worry about token kinds or filtering out tokens at this stage.

You can test your implementation with the DFA provided in the starter code, the MIPS token lexeme DFA, and any other DFA files you have lying around.

Step 4: MIPS Scanner

Replace the DFA provided in the starter code with the MIPS token lexeme DFA. It is time to write the MIPS scanner.

Create a class or struct for tokens. Modify your "simplified maximal munch" function to return a sequence of tokens instead of a sequence of strings.

Finally, write input and output code to read the input string from standard input, process it into tokens with the "simplified maximal munch" function, and print the tokenization.
Step 5: WLP4 Scanner

Now it is time to write the WLP4 scanner. Create a DFA for WLP4 tokens and replace the MIPS DFA with this new DFA.

Creating the WLP4 DFA may be a little tricky. If you are stuck, study the provided MIPS DFA and try to understand how it works.

Once you have the WLP4 DFA, modify the part of your "simplified maximal munch" function that checks restrictions on tokens, so that it checks the WLP4 restrictions instead of the MIPS restrictions.

Test your code thoroughly, and you're done!