CS 241 — Winter 2025 — Assignment 4 / Out of 9 marks

Assignments for CS 241
← Assignment 3	Assignment 4	Assignment 5 →
	Friday, June 13th at 5:00 pm
P1 • P2 • P3 • P3a • P3b

In this assignment, you will write a tokenizer for WLP4, a simple programming language that we will incrementally write a compiler for in this course. In addition, you will be asked to write some simple programs in WLP4, to better understand the language.

Your WLP4 compiler will eventually translate WLP4 source code into MIPS assembly language. The combination of the tools you write in assignments 4 through 8 will be a complete compiler for WLP4 to assembly language, and in concert with the tools you wrote in assignments 1 through 3 to compile assembly into machine code, you will have a complete pipeline for compiling WLP4 into executable MIPS programs.

You will get some practice with WLP4 to better understand its various quirks and limitations, and then you will specialize your Simplified Maximal Munch scanner to work with WLP4 programs, completing the first component of the compiler.

Introduction to WLP4

Starting from this assignment, you will begin writing a compiler that translates a C-like programming language called WLP4 into MIPS assembly language.

WLP4 is quite similar to a restricted subset of C, but it uses C++ style memory allocation (new/delete instead of malloc/free). The syntax is harshly restricted to make it easier to compile. For example:

All variable declarations must appear at the start of a procedure. Mid-procedure declarations are not allowed.
All variables must be initialized to a constant. They cannot be initialized to another variable or the result of an expression. They cannot be left uninitialized.
Integer literals cannot be negative. In fact, there is no unary (one-operand) minus operator. To get -1, for example, you must write 0 - 1.
If statements must be written with blocks and must have a matching else, even if it's empty. Because of the restriction to blocks, there is no "else if" syntax.
An if statement can only have one boolean condition (no && or || operators).
You can only have one return statement per procedure and it must be at the very end.
Array access must be done using pointer arithmetic. Instead of writing array[index], you must write *(array+index).

More information about WLP4 is below:

How to Compile and Run WLP4 Programs

Brief Introduction to WLP4

Formal WLP4 Specification

Problem 1: WLP4 Practice (filename: practice.wlp4) [2 marks, 1 release]

Warm-Up Exercise 1 (not submitted)

Write a WLP4 procedure that computes the absolute value of an integer. You do not need to handle the case of -2³¹ = -2147483648 (this integer has no positive counterpart in 32-bit two's complement).

Warm-Up Exercise 2 (not submitted)

Write a WLP4 procedure called tieBreak that takes two distinct integers as parameters.

If the integers have different absolute values, return the integer with the smaller absolute value.
If the integers have the same absolute value, then since they are distinct, one is positive and one is negative. Return the negative integer.

To Submit

Write a WLP4 program that takes a non-empty array of integers in the range -241 to 241 (inclusive) as input and returns the integer that occurs most frequently. If there is a tie for the most frequent integer, break the tie using the rule from the tieBreak procedure described in Warm-Up Exercise 2.

Your program should run in linear time (in the size of the input array). This likely means you will need to allocate a sufficiently large block of memory to store the number of times each integer occurs.

Examples

Suppose the input array is [1, 2, 2, 3, 3, 3]. Your program should return 3.

Suppose the input array is [4, 4, 4, 2, 2, 2, -3, -3, -3]. There's a three way tie between 4, 2 and -3. Since 2 has the least absolute value, your program should return 2.

Suppose the input array is [1, -1, 2, -2]. There's a four way tie between 1, -1, 2, and -2. Among these, 1 and -1 tie for the smallest absolute value. To break this tie, you choose the negative number, so you return -1.

Testing

Compile your program with wlp4c, then run the compiled code with mips.array. Since wlp4c produces machine code directly, you don't need to use cs241.binasm.

Alternatively, on the student Linux system:

$ wlp4c < practice.wlp4 > practice.mips
$ mips.array practice.mips

Please remember that the student Linux system is unreliable. Do your development and testing on your own system, using the web tools when possible, so that you won't be stuck when (rather than if) the student Linux servers go down.

The wain function in your WLP4 program should take an int* as the first parameter and an int as the second parameter. The first parameter will receive the value that mips.array places in $1 (a pointer to the array) and the second parameter will receive the value it places in $2 (the size of the array).

When the program terminates, the return value of the wain function will be stored in $3. Ensure the value in $3 is correct.

Problem 2: WLP4 Printing Integers Radix Redux (filename: radix.wlp4) [2 marks, 1 release]

In assignment 3, you wrote an assembly program to print integers in base 10. WLP4 includes a print statement, written println(x);, that directly prints integers in base 10. In this problem, you will write a WLP4 program to print integers in any base. To do so, you will need to use putchar(x);, which outputs the low 8 bits of its integer argument as a character to standard output.

Note that the base of an ordinal number system is also called its radix.

Write a WLP4 program that takes two 32-bit signed integer arguments: the first is the number to print, and the second is the radix to print it in. The number must be non-negative. If a negative number is given as the first argument to wain, your program must print nothing and return 1. The radix must be greater than 1 and less than 37. If a number out of that range is given as the second argument to wain, your program must print nothing and return 2. If the number and radix are both valid, your program must print the digits of the number in the given radix. After the digits, it must print a newline, and return 0.

If the number is 0 and the radix is valid, the correct output is always the single character '0' and a newline. In all other cases, you should print no leading 0s.

If the radix is greater than 10, use the capital English letters for the digits after 9: A, B, C, etc. This is why the radix is restricted to being less than 37: we have 10 standard decimal digits plus 26 letters equals 36 possible digit characters.

Examples

Suppose that the input is 12345, 10. Your program should print "12345" and a newline, and return 0.

Suppose that the input is 342391, 8. Your program should print "1234567" and a newline, and return 0. Note that 1234567 interpreted in base 8 is equal to 342391 interpreted in base 10.

Suppose that the input is 147, 2. Your program should print "10010011" and a newline, and return 0.

Suppose that the input is 683248722, 36. Your program should print "BASE36" and a newline, and return 0.

Suppose that the input is 9172063, 29. Your program should print "CS241" and a newline, and return 0.

Suppose the input is -1, 3. Your program should return 1.

Suppose the input is 15, 1. Your program should return 2.

Suppose the input is -37, 37. Your program should return 1.

Testing

Compile your program with wlp4c, then run the compiled code with mips.twoints. Since wlp4c produces machine code directly, you don't need to use cs241.binasm.

Alternatively, on the student Linux system:

$ wlp4c < radix.wlp4 > radix.mips
$ mips.twoints radix.mips

Problem 3: WLP4 Scanner

Write a scanner for WLP4.

You have two options for how to write your WLP4 scanner. One option is to use a general Simplified Maximal Munch tokenizer provided by us, and provide the DFA, and the other is to write your own scanner from scratch (or adapt your scanner code from assignment 2). If you choose to write your scanner from scratch, solve Problem 3a. If you choose to use the general tokenizer provided by us, solve Problem 3b. YOU DO NOT NEED (and are not advised) TO SOLVE BOTH.

If you were able to complete the assembly scanner for A2, we recommend solving 3a. It will be easier to adapt your existing code than to conform to the behaviour of a tool you haven't otherwise encountered. If you were not able to complete the assembly scanner for A2, you may find 3b easier.

Problem 3a: WLP4 Scanner (Scratch) (filenames: wlp4scan.cc OR wlp4scan.rkt) [5 marks, 2 secret+release, 2 secret]

IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3B. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.

Write a scanner for WLP4. The scanner should read a WLP4 program from standard input and tokenize it, producing a sequence of lines, each representing one token of the program. Each line should be terminated by a line feed (ASCII 0x0A) and should contain the token kind (the all-caps name indicating the type of the token, such as ID, NUM, LPAREN, etc. as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token).

Whitespace and comments are not considered tokens and should not be part of the output. However, tokens in the input might be separated by whitespace or comments.

For programs with a valid tokenization, your output should be identical to that of the wlp4scan tool.

Clarifications

If a NUM token's numeric value is out of range (that is, it represents a number strictly greater than 2³¹ - 1 = 2147483647) you should report an error. If using C++, you are advised to use std::stoll (that is, convert your lexeme to a double) to achieve this, instead of converting the string to an int, as the process of converting a string to an int will naturally truncate it to the size of an int, placing it into range. If using Racket, there is no such difficulty.
If you see something like 2147483647999, you should treat it as one NUM token, and report an error since it is out of range, rather than treating it as two in-range NUM tokens (2147483647 and 999). That is, the range checking should not be part of the DFA.
In the earlier Simplified Maximal Munch scanner problem, you were provided a scanning DFA, followed by a string to scan using the DFA. For this problem, the input to your program is simply WLP4 source code. That is, the input will not include a DFA for WLP4 tokens. You will likely have to create such a DFA yourself and embed it in your program, similar to the suggestion for the assembly scanner.

Example

If the input is this WLP4 program:

//
// WLP4 program with two integer parameters, a and b
//   returns the sum of a and b
//
   int wain(int a, int b) {
     return a + b;   // unhelpful comment about summing a and b
   }

Then the correct output is:

INT int
WAIN wain
LPAREN (
INT int
ID a
COMMA ,
INT int
ID b
RPAREN )
LBRACE {
RETURN return
ID a
PLUS +
ID b
SEMI ;
RBRACE }

Testing

Compare your results to wlp4scan. For this problem, Marmoset again uses diff -Zb, so that small differences in whitespace are ignored (e.g., extra space at the end of the line).

On the student Linux system:

$ ./wlp4scan < input.wlp4 > output.txt
$ wlp4scan < input.wlp4 > expect.txt
$ diff -Zb output.txt expect.txt

Be careful not to confuse wlp4scan (the course-provided reference implementation of the scanner) with your own program. Perhaps name your own executable something different, like mywlp4scan.

Problem 3b: WLP4 Scanner (Preprocessed) (filenames: wlp4.dfa AND (wlp4scan.cc OR wlp4scan.rkt)) [5 marks, 2 secret+release, 2 secret]

IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3A. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.

Write a scanner for WLP4, using a generic simplified maximal munch tokenizer as a preprocessing step.

We have provided a tool, cs241.SMM, which is a general simplified maximal munch tokenizer. It takes a DFA with no .INPUT segment as an argument, and a program to be tokenized with that DFA from standard input. It outputs one token lexeme per line. It is not capable of ignoring whitespace or comments on its own; in cs241.SMM, you must tokenize whitespace like any other token. cs241.SMM supports two extensions to the DFA file format:

The alphabet is allowed to include the special symbols ".SPACE" and ".NEWLINE". .SPACE means any whitespace other than a newline, and .NEWLINE, of course, means newline. The distinction needs to be made because comments end with a newline.
Transitions may be on any alphabet character (including .SPACE and .NEWLINE), or the special symbol ".OTHER". In a transition, .OTHER is used if no other transition is valid for the given character; i.e., it is the "else" case, used for all symbols not otherwise written in a transition for the given state. This even includes characters outside of the alphabet. This is needed because comments are allowed to contain any character.

Note that ".SPACE" and ".NEWLINE" are also included in the lexemes that are output by cs241.SMM. All other characters in lexemes are output verbatim.

cs241.SMM is capable of breaking its input down into a sequence of tokens, but is not capable of categorizing them into kinds, and it would be impractical to use it for range checking. Thus, in order to use it for a scanner, you must write two components: a DFA for cs241.SMM, and a program to take the lexemes output by cs241.SMM, discard whitespace and comment tokens, categorize the remaining tokens, and check ranges.

wlp4.dfa

First component: wlp4.dfa. Write a DFA for WLP4, following the WLP4 Specification, adequate for use by cs241.SMM. Given a WLP4 program as input, it should write the lexemes of each token, one per line, to standard output. This will also include lexemes for whitespace and comments, which must be discarded in the next step.

If the input does not tokenize by the Simplified Maximal Munch algorithm with the given DFA, cs241.SMM will output ".ERROR" as a line to standard output, as well as outputting a descriptive error to standard error and quitting.

Example

If the input is this WLP4 program:

int wain(int a, int b) {
  return a+b;
}

Then the correct output is:

int
.SPACE
wain
(
int
.SPACE
a
,
.SPACE
int
.SPACE
b
)
.SPACE
{
.NEWLINE.SPACE.SPACE
return
a
.SPACE
+
.SPACE
b
;
.NEWLINE
}
.NEWLINE

But, it's OK if your tokenization of whitespace is slightly different, as the whitespace will be discarded.

If you see something like 2147483647999, you should output it as a single token, even though the value will be out of range. This is not an error at the DFA level, and will be checked later.

Testing

Run your DFA with cs241.SMM.

On the student Linux system:

$ ./wlp4scan wlp4.dfa < input.wlp4

Note that your DFA will not be tested in isolation. It will only be tested with the next component.

wlp4scan.cc OR wlp4scan.rkt

Write a scanner for WLP4 that takes one lexeme per line, from cs241.SMM, as input.

The scanner should read a sequence of token lexemes from standard input (i.e., the output of cs241.SMM) and produce a full tokenization, one token per line. Each line should be terminated by a line feed (ASCII 0x0A) and should contain the token kind (the all-caps name indicating the type of the token, such as ID, NUm, LPAREN, etc. as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token, exactly as output by cs241.SMM).

Whitespace and comments are not considered tokens and should be discarded.

If the input cannot be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally. Note that this can be due to an error that was caught by the DFA (and thus a reaction to ".ERROR" from standard input), or due to an error not caught by the DFA which you instead catch in your scanner.

For programs with a valid tokenization, your output should be identical to that of the wlp4scan tool.

If a NUM token's numeric value is out of range (that is, it represents a number strictly greater than 2³¹ - 1 = 2147483647) you should report an error. If using C++, you are advised to use std::stoll (that is, convert your lexeme to a double) to achieve this, instead of converting the double to an int, as the process of converting a string to an int will naturally truncate it to the size of an int, placing it into range. If using Racket, there is no such difficulty.

Example

If the input to cs241.SMM is this WLP4 program, and the input to your program is the output of cs241.SMM:

//
// WLP4 program with two integer parameters, a and b
//   returns the sum of a and b
//
   int wain(int a, int b) {
     return a + b;   // unhelpful comment about summing a and b
   }

Then the correct output is:

INT int
WAIN wain
LPAREN (
INT int
ID a
COMMA ,
INT int
ID b
RPAREN )
LBRACE {
RETURN return
ID a
PLUS +
ID b
SEMI ;
RBRACE }

Testing

Compare your results to wlp4scan. For this problem, Marmoset again uses diff -Zb, so that small differences in whitespace are ignored (e.g., extra space at the end of the line).

On the student Linux system:

$ cs241.SMM wlp4.dfa < input.wlp4 | ./wlp4scan > output.txt
$ wlp4scan < input.wlp4 > expect.txt
$ diff -Zb output.txt expect.txt

Be careful not to confuse wlp4scan (the course-provided reference implementation of the scanner) with your own program. Perhaps name your own executable something different, like mywlp4scan.