Assignments for CS 241 | ||
---|---|---|
← Assignment 3 | Assignment 4 | Assignment 5 → |
Friday, Nov 1st at 5:00 pm | ||
P1 • P2 • P3 • P3a • P3b |
In this assignment, you will write a tokenizer for WLP4, a simple programming language that we will incrementally write a compiler for in this course. In addition, you will be asked to write some simple programs in WLP4, to better understand the language.
Your WLP4 compiler will eventually translate WLP4 source code into MIPS assembly language. The combination of the tools you write in assignments 4 through 8 will be a complete compiler for WLP4 to assembly language, and in concert with the tools you wrote in assignments 1 through 3 to compile assembly into machine code, you will have a complete pipeline for compiling WLP4 into executable MIPS programs.
You will get some practice with WLP4 to better understand its various quirks and limitations, and then you will specialize your Simplified Maximal Munch scanner to work with WLP4 programs, completing the first component of the compiler.
Starting from this assignment, you will begin writing a compiler that translates a C-like programming language called WLP4 into MIPS assembly language.
WLP4 is quite similar to a restricted subset of C, but it uses C++ style memory allocation (new/delete instead of malloc/free). The syntax is harshly restricted to make it easier to compile. For example:
More information about WLP4 is below:
Write a WLP4 procedure that computes the absolute value of an integer. You do not need to handle the case of -231 = -2147483648 (this integer has no positive counterpart in 32-bit two's complement).
Write a WLP4 procedure called tieBreak
that takes two distinct integers as parameters.
Write a WLP4 program that takes a non-empty array of integers
in the range -241 to 241 (inclusive) as input and
returns the integer that occurs most frequently.
If there is a tie for the most frequent integer,
break the tie using the rule from the tieBreak
procedure described in Warm-Up Exercise 2.
Your program should run in linear time (in the size of the input array). This likely means you will need to allocate a sufficiently large block of memory to store the number of times each integer occurs.
Suppose the input array is [1, 2, 2, 3, 3, 3]
. Your program should return 3.
Suppose the input array is [4, 4, 4, 2, 2, 2, -3, -3, -3]
.
There's a three way tie between 4, 2 and -3. Since 2 has the least absolute value,
your program should return 2.
Suppose the input array is [1, -1, 2, -2]
.
There's a four way tie between 1, -1, 2, and -2.
Among these, 1 and -1 tie for the smallest absolute value.
To break this tie, you choose the negative number, so you return -1.
Compile your program with wlp4c
, then run
the compiled code with mips.array
. Since
wlp4c
produces machine code directly,
you don't need to use cs241.binasm
.
Alternatively, on the student Linux system:
$ wlp4c < practice.wlp4 > practice.mips $ mips.array practice.mips
Please remember that the student Linux system is unreliable. Do your development and testing on your own system, using the web tools when possible, so that you won't be stuck when (rather than if) the student Linux servers go down.
The wain
function in your WLP4 program
should take an int*
as the first parameter
and an int
as the second parameter.
The first parameter will receive the value
that mips.array
places in $1
(a pointer to the array) and the second parameter
will receive the value it places in $2 (the size
of the array).
When the program terminates, the return value
of the wain
function will be stored
in $3.
Ensure the value in $3 is correct.
In assignment 3, you wrote an assembly program to print integers in base 10. WLP4 includes a print statement, written println(x);, that directly prints integers in base 10. In this problem, you will write a WLP4 program to print integers in any base. To do so, you will need to use putchar(x);, which outputs the low 8 bits of its integer argument as a character to standard output.
Note that the base of an ordinal number system is also called its radix.
Write a WLP4 program that takes two 32-bit signed integer arguments: the first is the number to print, and the second is the radix to print it in. The number must be non-negative. If a negative number is given as the first argument to wain, your program must print nothing and return 1. The radix must be greater than 1 and less than 37. If a number out of that range is given as the second argument to wain, your program must print nothing and return 2. If the number and radix are both valid, your program must print the digits of the number in the given radix. After the digits, it must print a newline, and return 0.
If the number is 0 and the radix is valid, the correct output is always the single character '0' and a newline. In all other cases, you should print no leading 0s.
If the radix is greater than 10, use the capital English letters for the digits after 9: A, B, C, etc. This is why the radix is restricted to being less than 37: we have 10 standard decimal digits plus 26 letters equals 36 possible digit characters.
Suppose that the input is 12345, 10
. Your program should print "12345" and a newline, and return 0.
Suppose that the input is 342391, 8
. Your program should print
"1234567" and a newline, and return 0. Note that 1234567 interpreted in
base 8 is equal to 342391 interpreted in base 10.
Suppose that the input is 147, 2
. Your program should print
"10010011" and a newline, and return 0.
Suppose that the input is 683248722, 36
. Your program should
print "BASE36" and a newline, and return 0.
Suppose that the input is 9172063, 29
. Your program should
print "CS241" and a newline, and return 0.
Suppose the input is -1, 3
. Your program should return 1.
Suppose the input is 15, 1
. Your program should return 2.
Suppose the input is -37, 37
. Your program should return 1.
Compile your program with wlp4c
, then run
the compiled code with mips.twoints
. Since
wlp4c
produces machine code directly,
you don't need to use cs241.binasm
.
Alternatively, on the student Linux system:
$ wlp4c < radix.wlp4 > radix.mips $ mips.twoints radix.mips
Write a scanner for WLP4.
You have two options for how to write your WLP4 scanner. One option is to use a general Simplified Maximal Munch tokenizer provided by us, and provide the DFA, and the other is to write your own scanner from scratch (or adapt your scanner code from assignment 2). If you choose to write your scanner from scratch, solve Problem 3a. If you choose to use the general tokenizer provided by us, solve Problem 3b. YOU DO NOT NEED (and are not advised) TO SOLVE BOTH.
If you were able to complete the assembly scanner for A2, we recommend solving 3a. It will be easier to adapt your existing code than to conform to the behaviour of a tool you haven't otherwise encountered. If you were not able to complete the assembly scanner for A2, you may find 3b easier.
IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3B. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.
Write a scanner for WLP4. The scanner should read a WLP4 program from standard input and tokenize it, producing a sequence of lines, each representing one token of the program. Each line should be terminated by a line feed (ASCII 0x0A) and should contain the token kind (the all-caps name indicating the type of the token, such as ID, NUM, LPAREN, etc. as defined in the WLP4 Specification), followed by a single space, followed by the token lexeme (the actual string represented by the token).
Whitespace and comments are not considered tokens and should not be part of the output. However, tokens in the input might be separated by whitespace or comments.
If the input cannot be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally.
For programs with a valid tokenization, your output should
be identical to that of the wlp4scan
tool.
std::stoll
(that is, convert your lexeme to a double) to achieve this,
instead of converting the string to an int, as the process of converting
a string to an int will naturally truncate it to the size of an int,
placing it into range. If using Racket, there is no such difficulty.
If the input is this WLP4 program:
// // WLP4 program with two integer parameters, a and b // returns the sum of a and b // int wain(int a, int b) { return a + b; // unhelpful comment about summing a and b }
Then the correct output is:
INT int WAIN wain LPAREN ( INT int ID a COMMA , INT int ID b RPAREN ) LBRACE { RETURN return ID a PLUS + ID b SEMI ; RBRACE }
Compare your results to wlp4scan
. For
this problem, Marmoset again uses diff -Zb
, so that small
differences in whitespace are ignored (e.g., extra space at the end of the
line).
On the student Linux system:
$ ./wlp4scan < input.wlp4 > output.txt $ wlp4scan < input.wlp4 > expect.txt $ diff -Zb output.txt expect.txt
Be careful not to confuse wlp4scan
(the
course-provided reference implementation of the scanner) with your own program.
Perhaps name your own executable something different, like
mywlp4scan
.
IF YOU SUBMIT A SOLUTION TO THIS PROBLEM, YOU DO NOT NEED TO AND SHOULD NOT SUBMIT A SOLUTION TO PROBLEM 3A. If solutions are submitted to both problems, the solution receiving a higher score will be used for problem 3.
Write a scanner for WLP4, using a generic simplified maximal munch tokenizer as a preprocessing step.
We have provided a tool, cs241.SMM
, which is a
general simplified maximal munch tokenizer. It takes a DFA with no .INPUT segment as an argument, and a program
to be tokenized with that DFA from standard input. It outputs one token
lexeme per line. It is not capable of ignoring whitespace or comments on its
own; in cs241.SMM
, you must tokenize whitespace like any other
token. cs241.SMM supports two extensions to the DFA file format:
Note that ".SPACE" and ".NEWLINE" are also included in the lexemes that are
output by cs241.SMM
. All other characters in lexemes are
output verbatim.
cs241.SMM
is capable of breaking its input down into a
sequence of tokens, but is not capable of categorizing them into kinds, and
it would be impractical to use it for range checking. Thus, in order to use
it for a scanner, you must write two components: a DFA for
cs241.SMM
, and a program to take the lexemes output by
cs241.SMM
, discard whitespace and comment tokens, categorize
the remaining tokens, and check ranges.
First component: wlp4.dfa. Write a DFA for WLP4, following
the WLP4 Specification, adequate for use by
cs241.SMM
. Given a WLP4 program as input, it should write the
lexemes of each token, one per line, to standard output. This will also
include lexemes for whitespace and comments, which must be discarded in the
next step.
If the input does not tokenize by the Simplified Maximal Munch algorithm
with the given DFA, cs241.SMM
will output ".ERROR" as a line
to standard output, as well as outputting a descriptive error to standard
error and quitting.
If the input is this WLP4 program:
int wain(int a, int b) { return a+b; }
Then the correct output is:
int .SPACE wain ( int .SPACE a , .SPACE int .SPACE b ) .SPACE { .NEWLINE.SPACE.SPACE return a .SPACE + .SPACE b ; .NEWLINE } .NEWLINE
But, it's OK if your tokenization of whitespace is slightly different, as the whitespace will be discarded.
If you see something like 2147483647999, you should output it as a single token, even though the value will be out of range. This is not an error at the DFA level, and will be checked later.
Run your DFA with cs241.SMM
.
On the student Linux system:
$ ./wlp4scan wlp4.dfa < input.wlp4
Note that your DFA will not be tested in isolation. It will only be tested with the next component.
Write a scanner for WLP4 that takes one lexeme per line, from
cs241.SMM
, as input.
The scanner should read a sequence of token lexemes from standard input
(i.e., the output of cs241.SMM
) and produce a full
tokenization, one token per line. Each line should be terminated by a line
feed (ASCII 0x0A) and should contain the token kind (the all-caps
name indicating the type of the token, such as ID, NUm, LPAREN, etc. as
defined in the WLP4 Specification), followed
by a single space, followed by the token lexeme (the actual string
represented by the token, exactly as output by cs241.SMM
).
Whitespace and comments are not considered tokens and should be discarded.
If the input cannot be scanned into a sequence of valid tokens as per the requirements in the WLP4 Specification, your program must print an error message containing the string ERROR in all caps to standard error and exit normally. Note that this can be due to an error that was caught by the DFA (and thus a reaction to ".ERROR" from standard input), or due to an error not caught by the DFA which you instead catch in your scanner.
For programs with a valid tokenization, your output should be identical to
that of the wlp4scan
tool.
If a NUM token's numeric value is out of range (that is, it represents a
number strictly greater than 231 - 1 = 2147483647) you should
report an error. If using C++, you are advised to use std::stoll
(that is, convert your lexeme to a double) to achieve this, instead
of converting the double to an int, as the process of converting a string to
an int will naturally truncate it to the size of an int, placing it into
range. If using Racket, there is no such difficulty.
If the input to cs241.SMM
is this WLP4 program, and the input to your program is the output of cs241.SMM
:
// // WLP4 program with two integer parameters, a and b // returns the sum of a and b // int wain(int a, int b) { return a + b; // unhelpful comment about summing a and b }
Then the correct output is:
INT int WAIN wain LPAREN ( INT int ID a COMMA , INT int ID b RPAREN ) LBRACE { RETURN return ID a PLUS + ID b SEMI ; RBRACE }
Compare your results to wlp4scan
. For
this problem, Marmoset again uses diff -Zb
, so that small
differences in whitespace are ignored (e.g., extra space at the end of the
line).
On the student Linux system:
$ cs241.SMM wlp4.dfa < input.wlp4 | ./wlp4scan > output.txt $ wlp4scan < input.wlp4 > expect.txt $ diff -Zb output.txt expect.txt
Be careful not to confuse wlp4scan
(the
course-provided reference implementation of the scanner) with your own program.
Perhaps name your own executable something different, like
mywlp4scan
.