CS 106 Winter 2018
Assignment 08: Text Processing
Question 1 Dramatis Personae
The script of a play is often formatted so that at the start of a character's line, their name is written in ALL CAPS, followed by a period. For example, here's a tiny bit of Shakespeare's Hamlet formatted this way:
LAERTES. Most humbly do I take my leave, my lord.
POLONIUS. The time invites you; go, your servants tend.
LAERTES. Farewell, Ophelia, and remember well
What I have said to you.
OPHELIA. ’Tis in my memory lock’d,
And you yourself shall keep the key of it.
LAERTES. Farewell.
That makes it relatively easy to extract the names of the speakers from a script, and create a list of who delivers every line, in order. For example, here are the speakers of the first 100 lines in Hamlet:
BARNARDO, FRANCISCO, BARNARDO, FRANCISCO, BARNARDO, FRANCISCO, BARNARDO, FRANCISCO, BARNARDO, FRANCISCO, BARNARDO, FRANCISCO, HORATIO, MARCELLUS, FRANCISCO, MARCELLUS, FRANCISCO, MARCELLUS, BARNARDO, HORATIO, BARNARDO, MARCELLUS, BARNARDO, MARCELLUS, HORATIO, BARNARDO, HORATIO, BARNARDO, MARCELLUS, BARNARDO, MARCELLUS, BARNARDO, HORATIO, MARCELLUS, HORATIO, MARCELLUS, BARNARDO, HORATIO, MARCELLUS, BARNARDO, HORATIO, MARCELLUS, HORATIO, MARCELLUS, HORATIO, MARCELLUS, HORATIO, BARNARDO, HORATIO, MARCELLUS, HORATIO, BARNARDO, HORATIO, MARCELLUS, BARNARDO, HORATIO, MARCELLUS, HORATIO, MARCELLUS, KING, KING, LAERTES, KING, POLONIUS, KING, HAMLET, KING, HAMLET, QUEEN, HAMLET, QUEEN, HAMLET, KING, QUEEN, HAMLET, KING, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, MARCELLUS, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO, HAMLET, HORATIO
We can use the ordered list of speakers to create a kind of visualization of a play. We'll draw a chart with a timeline for each character. Whenever that character speaks, we'll put a mark on their timeline, at a location proportional to their speech's position in the script. We won't worry about how many words they say when they speak; each whole speech will count as one unit of time. For Hamlet, we might end up with a visualization like this one:
In this question, you will write a sketch that reads the script of a play from a text file, and creates a data visualization like the one above. We will provide you with code to find the names of the characters in the script. You will have to store that information two ways: in an array, which records the order of the speakers in the play, and in an IntDict, which makes it easy to get an array of the character names with no repeats. Follow the steps below:
Create a folder called A08. In it, create a new sketch called DramatisPersonae. As usual, put your name and Student ID number at the top of the sketch.
Write a setup() function. Set the sketch window to have a width and height of at least 400, and a white background. Absent any enhancements, your sketch will draw a static (i.e., unchanging) visualization, so you can put most or all of your code in setup(); you don't need a draw() function.
Download one or more of Hamlet, Macbeth, and Romeo and Juliet (by right-clicking on those links). The scripts are from the free archive Project Gutenberg. However, I modified them slightly to make them easier to work with in Processing; in particular I removed the license, which you can read online. Add the files you downloaded to the sketch's data/ folder.
Add code to setup() to load one of the scripts (your choice) into the sketch, storing it as usual in an array of strings.
Now declare a second array of strings. This one will hold just the speaker names for each line of dialogue. Loop over the lines of the text file. For each one, if it starts with some capital letters and/or spaces, followed by a period, treat the letters up to the period as the name of a character (as long as they don't form the words "ACT" or "SCENE"). Append all such names to the array so that it contains a list like the one above (e.g., { "BARNARDO", "FRANCISCO", "BARNARDO" ... }).
Detecting a character name can be a bit tricky, so here's a function that does it for you.
String findPlayer( String line )
{
  if ( line.contains( "." ) ) {
    line = trim( line );
    line = line.substring( 0, line.indexOf( '.' ) );
    if ( match( line, "^[A-Z ]+$" ) != null ) {
      if ( !(line.contains( "ACT" ) || line.contains( "SCENE" ) ) ) {
        return line;
      }
    }
  }
  return null;
}
This function takes a line of text as input, and uses a combination of string functions and regular expressions to see if it starts with a character name. It returns either the name of a character if the line is the start of a speech, or null if no character name was found. Copy this function into your sketch and use it on every line of text in the input file.
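To get a feel for what findPlayer() returns, here is a minimal plain-Java sketch of the same logic that you can run outside Processing. It is a port for experimentation, not a replacement for the function above: String.trim() and java.util.regex.Pattern stand in for Processing's trim() and match(), the class name FindPlayerDemo is made up, and the last two sample lines are invented for illustration.

```java
import java.util.regex.Pattern;

// Plain-Java port of findPlayer(), for experimenting outside Processing.
// Assumption: String.trim() and Pattern replace Processing's trim() and match().
class FindPlayerDemo {
  static final Pattern NAME = Pattern.compile("^[A-Z ]+$");

  static String findPlayer(String line) {
    if (line.contains(".")) {
      line = line.trim();
      // Keep only the text before the first period.
      line = line.substring(0, line.indexOf('.'));
      // Accept it only if it is made of capital letters and spaces.
      if (NAME.matcher(line).find()) {
        if (!(line.contains("ACT") || line.contains("SCENE"))) {
          return line;
        }
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String[] samples = {
      "LAERTES. Farewell.",              // "LAERTES"
      "What I have said to you.",        // null: lowercase letters before the period
      "SCENE I. Elsinore.",              // null: invented heading, rejected by the SCENE check
      "  FIRST CLOWN. A made-up line."   // "FIRST CLOWN": spaces are allowed in a name
    };
    for (String s : samples) {
      System.out.println(findPlayer(s) + " <- \"" + s + "\"");
    }
  }
}
```

Note that a null result is the common case: most lines of a script are the middle of a speech, so your loop should check for null before appending anything to the array.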
Double-check that your loop works, and your second array variable holds a reasonable-looking list of character names.
We also need a list of unique character names, basically something like the array above but with all the duplicates removed. In the same place where you declare your second array above, declare an IntDict variable. In the same loop as the previous step, whenever you add a character name to your array, also add it to this IntDict (by setting it to be associated with a dummy int value). The difference is that the IntDict will automatically filter out repeats as you add names.
Now, if you call the IntDict's keyArray() method, the dictionary will give you back an array of strings containing one copy of every unique name in the script. You'll probably want to declare another local String[] variable to hold this array.
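The idea of using a dictionary to filter out repeats can be sketched in plain Java as follows. This is an illustration under assumptions, not the required solution: a LinkedHashMap stands in for Processing's IntDict (put() plays the role of set(), and keySet() the role of keyArray()), and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the IntDict step: a map keeps one entry per key,
// so adding a name a second time does not create a duplicate.
class UniqueSpeakersDemo {
  static String[] uniqueNames(String[] speakers) {
    // LinkedHashMap remembers the order in which keys were first added.
    Map<String, Integer> seen = new LinkedHashMap<>();
    for (String name : speakers) {
      seen.put(name, 0); // 0 is a dummy value; only the keys matter here
    }
    return seen.keySet().toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] speakers = {
      "BARNARDO", "FRANCISCO", "BARNARDO", "FRANCISCO", "HORATIO", "MARCELLUS"
    };
    // Prints: BARNARDO, FRANCISCO, HORATIO, MARCELLUS
    System.out.println(String.join(", ", uniqueNames(speakers)));
  }
}
```

In your sketch you would keep both structures: the full array of speakers (with repeats) drives the marks, and the array of unique names drives the timelines.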
We're ready to start drawing. Loop over all the unique character names you obtained in the previous step. For each one, write the name near the left side of the sketch window, and draw a timeline (just a thin line) extending from there to somewhere near the right edge of the sketch window. Ideally, your character names will be right-aligned (use the textAlign() function). The spacing should be chosen so that all names fit the height of the sketch window.
Inside the same loop, after drawing a character's timeline, put an inner loop that walks over the array of speaker names you created earlier (in Step 5, where you looped over the lines of the script). If a given speaker in that array is the character whose timeline you're drawing, put a mark on that timeline. The x position of the mark is determined from the position of this speech in the array of speaker names, scaled to fit the timeline (try using map() for this). The y position of the mark is chosen so that it's centred on the correct timeline. The mark can be a thin rectangle or ellipse, and can be any colour you want.
For example, let's say that Hamlet has 1123 lines of dialogue. You're currently drawing the timeline for POLONIUS. Walking over the array of speakers, you discover that POLONIUS delivered the line at position 247 in the array. You would put a mark 247/1123 of the way along POLONIUS's timeline.
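The arithmetic behind that example can be checked with a few lines of plain Java. The map() method below is written out by hand to show the linear rescaling that Processing's map() performs; the timeline endpoints (x = 100 and x = 380) and the class name are assumptions chosen for illustration.

```java
// Worked version of the POLONIUS example: speech 247 of 1123.
class TimelineMapDemo {
  // Linear rescaling of value from the range [start1, stop1] to [start2, stop2].
  static float map(float value, float start1, float stop1, float start2, float stop2) {
    return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
  }

  public static void main(String[] args) {
    float left = 100;   // hypothetical x where the timeline starts
    float right = 380;  // hypothetical x where the timeline ends
    int total = 1123;   // number of speeches in the play
    int index = 247;    // position of this speech in the array of speakers

    // 247/1123 is about 0.22, so the mark lands about 22% of the way along.
    float x = map(index, 0, total, left, right);
    System.out.println(x); // roughly 161.6
  }
}
```

Because array positions run from 0 to total - 1, the last speech lands just short of the right end of the timeline, which is usually fine for a chart like this.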
The entire sketch can be written in under 60 lines of code, not counting comments. That total includes the 13-line findPlayer() function, so you should be aiming to write roughly 40–50 lines of your own, not counting comments.
Enhancements
You are free to experiment with enhancements. Creative or ambitious enhancements will receive bonus marks. Here are some fairly natural enhancements that are straightforward to implement:
Experiment with different colour schemes for the sketch and the marks on the timelines.
Experiment with different orderings for the characters' timelines. The default will have characters in order of appearance. Two natural alternatives would be alphabetical, or in order by number of lines delivered.
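The two alternative orderings might be sketched in plain Java as follows. This is one possible approach under assumptions, not the expected solution: instead of a dummy value, the map stores how many speeches each character delivers, and all class and method names are hypothetical. In a Processing sketch, IntDict's own increment() and sorting methods (such as sortKeys() and sortValuesReverse()) may do the same job in fewer lines; check the reference before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TimelineOrderDemo {
  // Count speeches per character; keys stay in order of first appearance.
  static Map<String, Integer> countSpeeches(String[] speakers) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String name : speakers) {
      counts.merge(name, 1, Integer::sum); // start at 1, or add 1 to the old count
    }
    return counts;
  }

  static String[] alphabetical(Map<String, Integer> counts) {
    String[] names = counts.keySet().toArray(new String[0]);
    Arrays.sort(names);
    return names;
  }

  static String[] byLinesDelivered(Map<String, Integer> counts) {
    List<String> names = new ArrayList<>(counts.keySet());
    // Most speeches first; the sort is stable, so ties keep order of appearance.
    names.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
    return names.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] speakers = { "KING", "QUEEN", "HAMLET", "QUEEN", "HAMLET", "QUEEN" };
    Map<String, Integer> counts = countSpeeches(speakers);
    System.out.println(String.join(", ", counts.keySet()));           // KING, QUEEN, HAMLET
    System.out.println(String.join(", ", alphabetical(counts)));      // HAMLET, KING, QUEEN
    System.out.println(String.join(", ", byLinesDelivered(counts)));  // QUEEN, HAMLET, KING
  }
}
```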
The following are much more complicated and would require a lot of additional work:
Instead of having each line of dialogue take up the same space on the timeline, vary their widths according to the number of words spoken by the character.
Allow the user to hover over a mark, and show a popup with the corresponding line of dialogue.
Experiment with different data sources, such as movie or television screenplays. These are easy to find online, but it's not necessarily as easy to figure out where the character names are given to begin lines of dialogue.
Store your solution in a sketch titled DramatisPersonae in the A08 folder.
Submission
When you are ready to submit, please follow these steps.
Please ensure that any sketches you submit compile and run. It's better to submit a sketch that runs smoothly but implements fewer required features than one that has broken code for all features. If you get partway into a feature but can't make it work, comment it out so that the sketch works correctly without it.
If necessary, review the Code Style Guide and use Processing's built-in auto format tool. You do not need to use the precise coding style outlined in the guide, but whatever style you use, your code must be clear, concise, consistent, and commented.
If necessary, review the How To Submit document for a reminder on how to submit to LEARN.
Make sure to include a comment at the top of all source files containing your name and student ID number.
Create a zip file called A08.zip containing the entire A08 folder with its subfolder DramatisPersonae.
Upload A08.zip to LEARN. Remember that you can (and should!) submit as many times as you like. That way, if there's a catastrophe, you and the course staff will still have access to a recent version of your code.
If LEARN isn't working, and only if LEARN isn't working, please email your ZIP file to the course account (see the course home page for the address). In this case, you must mail your ZIP file before the deadline. Please use this only for emergencies, not "just in case". Submissions received after the deadline may receive feedback, but their marks will not count.