Assignments for CS241
Due Friday, February 27th, 5:00 PM
In this assignment, you will write a tokenizer for WLP4, a simple C-like programming language that we will incrementally write a compiler for in this course. In addition, you will be asked to write some simple programs in WLP4 to better understand the language.
Your WLP4 compiler will eventually translate WLP4 source code into ARM64 assembly language. The tools you write in Assignments 4 through 8 will eventually combine to form a complete compiler from WLP4 to ARM64 assembly. Combined with the tools you created in Assignments 1 through 3 to compile assembly to machine code, you will have a complete pipeline for compiling WLP4 into executable ARM64 programs.
You will get some practice with WLP4 to better understand its various quirks and limitations, and then you will specialize your Simplified Maximal Munch scanner to work with WLP4 programs, completing the first component of the compiler.
WLP4 is quite similar to a restricted subset of C, but it uses C++ style memory allocation (new/delete instead of malloc/free). The syntax is heavily restricted to make it easier to write a compiler for. For example:
-1, for example, you must
write 0 - 1.
if statements must be written with blocks and must have a matching
else, even if the else block is empty. Because of the restriction to blocks, there
is no "else if" syntax.
More information about WLP4 is below:
practice.wlp4) [2 marks of 9: 1 public, 1 release]Write a WLP4 procedure (function) that computes the absolute value of an integer. You do not need to handle the case of -263 = -9,223,372,036,854,775,808 (this integer has no positive counterpart in 64-bit two's complement).
Write a WLP4 procedure called tieBreak
that takes two distinct integers as parameters, and does the following:
Write a WLP4 program that takes a non-empty array of integers
in the range -241 to 241 (inclusive) as input and
returns the integer that occurs most frequently.
If there is a tie for the most frequent integer,
break the tie using the rule from the tieBreak
procedure described in Warm-Up Exercise 2.
Your program should run in linear time (in the size of the input array). This likely means you will need to allocate a sufficiently large block of memory to store the number of times each integer occurs.
Suppose the input array is [1, 2, 2, 3, 3, 3]. Your program should return 3.
Suppose the input array is [4, 4, 4, 2, 2, 2, -3, -3, -3].
There's a three way tie between 4, 2 and -3. Since 2 has the least absolute value, your program should return 2.
Suppose the input array is [1, -1, 2, -2].
There's a four way tie between 1, -1, 2, and -2.
Among these, 1 and -1 tie for the smallest absolute value.
To break this tie, you choose the negative number, so you return -1.
Compile your program with cs241.wlp4c, then run
the compiled code with cs241.arm64emu. Since
cs241.wlp4c produces machine code directly,you don't need to use cs241.binasm.
Alternatively, on the student Linux system:
$ cs241.wlp4c < practice.wlp4 > practice.bin
$ cs241.arm64emu -a practice.bin 4 4 4 2 2 2 -3 -3 -3
Please remember that the student Linux system is unreliable. Do your development and testing on your own system, using the web tools when possible, so that you won't be stuck when (rather than if) the student Linux servers go down.
The wain function in your WLP4 program
should take an int* as the first parameter
and an int as the second parameter.
The first parameter will receive the value
that cs241.arm64emu places in x0
(a pointer to the array) and the second parameter
will receive the value it places in x1 (the size
of the array).
When the program terminates, the return value
of the wain function will be stored
in x2. Ensure the value in $x2 is correct.
In Assignment 3, you wrote an assembly program to print integers in base 10. WLP4 includes a print statement, written println(x), that directly prints integers in base 10. In this problem, you will write a WLP4 program to print integers in any base. The base of an ordinal number system is also called its radix; hence, the problem name. To do so, you will need to use putchar(x);, which outputs the least significant 8 bits of its integer argument as a character to standard output (same fashion as outputting a character in base ARM64 assembly).
Write a WLP4 program that takes two 64-bit signed integer arguments: the
first is the number to print, and the second is the radix to print it in.
The number to print must be non-negative. If a negative number is given as the first
argument to wain, your program must print nothing and return 1 to indicate an error. The radix
must be greater than 1 and less than 37. If a number out of that range is
given as the second argument to wain, your program must print nothing and
return 2 to indicate an error. If the number and radix are both valid, your program must print
the digit(s) of the number in the given radix. After the digits, it must
print a newline, and return 0.
If the radix is greater than 10, use the capital English letters for the digits after 9: A, B, C, etc. This is why the radix is restricted to being less than 37: we have 10 standard decimal digits plus 26 letters equals 36 possible digit characters.
Suppose that the input is 12345, 10. Your program should print "12345" and a newline, and return 0.
Suppose that the input is 342391, 8. Your program should print
"1234567" and a newline, and return 0. Note that 1234567 interpreted in
base 8 is equal to 342391 interpreted in base 10.
Suppose that the input is 147, 2. Your program should print
"10010011" and a newline, and return 0.
Suppose that the input is 683248722, 36. Your program should
print "BASE36" and a newline, and return 0.
Suppose that the input is 9172063, 29. Your program should
print "CS241" and a newline, and return 0.
Suppose the input is -1, 3. Your program should return 1.
Suppose the input is 15, 1. Your program should return 2.
Suppose the input is -37, 37. Your program should return 1.
Compile your program with cs241.wlp4c, then run
the compiled code with cs241.arm64emu. Since
cs241.wlp4c produces machine code directly,
you don't need to use cs241.binasm.
Alternatively, on the student Linux system:
$ cs241.wlp4c < radix.wlp4 > radix.bin
$ cs241.arm64emu radix.bin 683248722 36
Write a scanner for WLP4.
You have two options for how to write your WLP4 scanner. One option is to use a general Simplified Maximal Munch tokenizer provided by us, and provide the DFA, and the other is to write your own scanner from scratch (or adapt your scanner code from assignment 2). If you choose to write your scanner from scratch, solve Problem 3a. If you choose to use the general tokenizer provided by us, solve Problem 3b. YOU DO NOT NEED (and are not advised) TO SOLVE BOTH.
If you were able to complete the assembly scanner for A2, we recommend solving 3a. It will be easier to adapt your existing code than to conform to the behaviour of a tool you haven't otherwise encountered. If you were not able to complete the assembly scanner for A2, you may find 3b easier.
IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3B. If solutions are submitted to both problems (why?), the solution receiving a higher score will be used for problem 3.
Write a scanner for WLP4. The scanner should read a WLP4 program
from standard input and tokenize it, producing a sequence of lines,
each representing one token of the program.
Each line should be terminated by a line feed (ASCII 0x0A)
and should contain the token kind
(the all-caps name indicating the type of the token as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token).
Whitespace and comments are not considered tokens and should not be part of the output. However, tokens in the input might be separated by whitespace or comments.
If the input can't be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally.
For programs with a valid tokenization, your output should
be identical to that of the wlp4scan tool.
If the input is this WLP4 program:
//
// WLP4 program with two integer parameters, a and b
// returns the sum of a and b
//
long wain(long a, long b) {
return a + b; // unhelpful comment about summing a and b
}
Then the correct output is:
LONG long
WAIN wain
LPAREN (
LONG long
ID a
COMMA ,
LONG long
ID b
RPAREN )
LBRACE {
RETURN return
ID a
PLUS +
ID b
SEMI ;
RBRACE }
Compare your results to cs241.wlp4scan. For
this problem, Marmoset again uses diff -Zb, so that small
differences in whitespace are ignored (e.g., extra space at the end of the
line).
On the student Linux system:
$ ./wlp4scan < input.wlp4 > output.txt
$ cs241.wlp4scan < input.wlp4 > expect.txt
$ diff -Zb output.txt expect.txt
wlp4.dfa AND wlp4scan.(cc/cpp/rkt)) [5 marks of 9: 1 public, 2 secret+release, 2 secret]IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3A. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.
Write a scanner for WLP4, using a generic simplified maximal munch tokenizer as a preprocessing step.
We have provided a tool, cs241.smm, which is a
general simplified maximal munch tokenizer. It takes a DFA with no .INPUT segment as an argument, and a program
to be tokenized with that DFA from standard input. It outputs one token
lexeme per line. It is not capable of ignoring whitespace or comments on its
own; in cs241.smm, you must tokenize whitespace like any other
token. cs241.smm supports two extensions to the DFA file format:
.SPACE and
.NEWLINE. .SPACE means any whitespace other than a newline, and
.NEWLINE, of course, means newline. The distinction needs to be made
because comments end with a newline..SPACE and
.NEWLINE), or the special symbol .OTHER. In a transition, .OTHER is
used if no other transition is valid for the given character; i.e., it
is the "else" case, used for all symbols not otherwise written in a
transition for the given state. This even includes characters outside
of the alphabet. This is needed because comments are allowed to contain
any character.
Note that .SPACE and .NEWLINE are also included in the lexemes that are
output by cs241.smm. All other characters in lexemes are
output verbatim.
cs241.smm is capable of breaking its input down into a
sequence of tokens, but is not capable of categorizing them into kinds, and
it would be impractical to use it for range checking. Thus, in order to use
it for a scanner, you must write two components: a DFA for
cs241.smm, and a program to take the lexemes output by
cs241.smm, discard whitespace and comment tokens, categorize
the remaining tokens, and check ranges. The description of both components follows.
wlp4.dfa
Write a DFA for WLP4, following
the WLP4 Specification, adequate for use by
cs241.smm. Given a WLP4 program and your DFA as input, cs241.smm should write the lexemes of each token, one per line, to standard output. This will also include lexemes for whitespace and comments, which must be discarded in the
next step.
If the input does not tokenize by the Simplified Maximal Munch algorithm
with the given DFA, cs241.smm will output ".ERROR" as a line
to standard output, as well as outputting a descriptive error to standard
error and quitting.
If the input is this WLP4 program:
long wain(long a, long b) {
return a + b;
}
Then the correct output is:
long
.SPACE
wain
(
long
.SPACE
a
,
.SPACE
long
.SPACE
b
)
.SPACE
{
.NEWLINE
.SPACE.SPACE
return
.SPACE
a
.SPACE
+
.SPACE
b
;
.NEWLINE
}
.NEWLINE
Note: it's OK if your tokenization of whitespace or comments differs from how it's done above; all that matters is that the final tokenization (made in conjunction with the second component) is correct.
Also, your DFA should not check for valid ranges in numerical tokens. If you run into a numerical token that's too large (such as 9223372036854775808), you should output it as a single token, even though the value will be out of range. This is not an error at the DFA level, and can be checked later.
Run your DFA with cs241.smm.
On the student Linux system:
$ ./wlp4scan wlp4.dfa < input.wlp4
Note that your DFA will not be tested in isolation. It will only be tested with the next component.
wlp4scan.(cc/rkt)
Write a scanner for WLP4 that takes one lexeme per line, from
cs241.smm, as input.
The scanner should read a sequence of token lexemes from standard input
(i.e., the output of cs241.smm) and produce a full
tokenization, one token per line. Each line should be terminated by a line
feed (ASCII 0x0A) and should contain the token kind (the all-caps
name indicating the type of the token, such as ID, NUM, LPAREN, etc. as
defined in the WLP4 Specification), followed
by a single space, followed by the token lexeme (the actual string
represented by the token, exactly as output by cs241.smm).
Whitespace and comments are not considered tokens and should be discarded.
If the input cannot be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally. Note that this can be due to an error that was caught by the DFA (and thus a reaction to ".ERROR" from standard input), or due to an error not caught by the DFA which you instead catch in your scanner.
For programs with a valid tokenization, your output should be identical to
that of the cs241.wlp4scan tool.
If a NUM token's numeric value is out of range (that is, it represents a
number strictly greater than 264 - 1 = 9223372036854775807) you should
report an error in this component. If using C++, you are advised to use std::stol
(that is, convert your lexeme to a long) to achieve this. If using Racket, ranges can be checked manually.
If the input to cs241.smm is this WLP4 program...
//
// WLP4 program with two integer parameters, a and b
// returns the sum of a and b
//
long wain(long a, long b) {
return a + b; // unhelpful comment about summing a and b
}
...and the input to your program is the output of cs241.smm, then the correct output is
LONG long
WAIN wain
LPAREN (
LONG long
ID a
COMMA ,
LONG long
ID b
RPAREN )
LBRACE {
RETURN return
ID a
PLUS +
ID b
SEMI ;
RBRACE }
Compare your results to cs241.wlp4scan. For
this problem, Marmoset again uses diff -Zb, so that small
differences in whitespace are ignored (e.g., extra space at the end of the
line).
On the student Linux system:
$ cs241.smm wlp4.dfa < input.wlp4 | ./wlp4scan > output.txt
$ cs241.wlp4scan < input.wlp4 > expect.txt
$ diff -Zb output.txt expect.txt
Be careful not to confuse cs241.wlp4scan (the
course-provided reference implementation of the scanner) with your own program.
Perhaps name your own executable something different, like
mywlp4scan.
© University of Waterloo. | Design by TEMPLATED.