CS 106 Winter 2018

Lab 09: Tables


Question 1: Marvel Comic Characters

The statistical analysis website FiveThirtyEight has a great habit of publishing the datasets they build in the process of their reporting, resulting in a veritable treasure trove of fun and unusual CSV files. We'll use one of those datasets to get you comfortable working with tables in Processing. This sketch won't draw anything; it'll just crunch some data and report summaries to the console.

  1. In a folder called L09/, create a new, empty sketch called Marvel.

  2. In 2014, FiveThirtyEight ran a story titled "Comic Books Are Still Made By Men, For Men And About Men". To create that story, they assembled a large dataset of Marvel and DC comic book characters. Download a copy of the Marvel CSV file and add it to your sketch's data/ folder. If you want, you can download our local copy by right-clicking on the link in this sentence.

  3. Study the dataset to get a general sense of what data is in it and how it's laid out. You can load the CSV file into Excel if you want, or just look at the description of the table on the web page where it lives.

  4. In the sketch, create a setup() function. In setup(), load the CSV file into the sketch and store it in a local variable of type Table. Note that this table has a header row. Run the sketch to verify that the table was initialized without any errors.

  5. Add code to setup() to count the number of bisexual characters in Marvel comics. Gather this information by walking over the table and checking whether each character is bisexual, adding one to a local integer variable each time they are. At the end of setup(), use println() to print a line of text to the console of the form

    Bisexual characters: 123
    where 123 is replaced by the number of bisexual characters you found in the table (123 isn't the answer). Don't draw this or any other text in the sketch window; print it to the console.

  6. Add code to setup() to measure the average number of appearances by all characters. The top characters have thousands of appearances, but there's a "long tail" of bit players with one or two appearances each. Where does the average fall? Print a line of the form

    Average number of appearances: 12.3456
    where 12.3456 is replaced by the average you calculate.

  7. Most of the recurring characters are still alive—they drive the storylines. Which deceased character has the most appearances? Find that character and print a line of the form

    Most prominent deceased character: Ultron (Earth-616) [181 appearances]
    Ultron is not the correct answer; replace that text with the name of the most prominent deceased character, and their number of appearances in square brackets, as above. Don't worry about text like "(Earth-616)" in the character's name.

  8. What percentages of characters are male or female? Presumably there must also be characters who are neither. Gather information about the proportions of characters who are listed as male, female, or anything else, and print a line of the form

    Sex: 40.1% male, 39.9% female, 20.0% other
    Those are definitely not the right answers! Note that to calculate a percentage, use an expression like float(count)/float(total)*100.0, where count is the number of characters you found with the trait and total is the total number of characters in the dataset.

  9. A number of different hair types appear in the dataset: Brown Hair, Red Hair, Pink Hair, Bald, etc. How many different categories are there? Print a line of the form

    Hair types: 20
    This is the trickiest piece of information to gather. Create a local variable of type IntDict. In a loop, add all the hair types found in the table as keys in the dictionary, associating them with a dummy value (say, 1). At the end, ask the dictionary how many keys it has (consult the documentation for IntDict to figure out how).
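
The counting, percentage, and distinct-key ideas from steps 5, 8, and 9 can be sketched in isolation. This is a minimal illustration in plain Java rather than Processing, so it does not use Table or IntDict: a String array stands in for one column of the table and a HashMap stands in for the IntDict. The class name, method names, and sample values are all made up for illustration and are not taken from the Marvel CSV. Whether a blank cell should count as its own category is a decision this sketch does not make for you.

```java
import java.util.HashMap;
import java.util.Map;

public class TableStats {
    // Count how many entries in a column equal the target string.
    static int countMatching(String[] column, String target) {
        int count = 0;
        for (String value : column) {
            if (target.equals(value)) {
                count++;
            }
        }
        return count;
    }

    // Turn a count into a percentage of the total. Both operands are
    // converted to float first so the division is not integer division.
    static float percent(int count, int total) {
        return (float) count / (float) total * 100.0f;
    }

    // Count distinct values by storing each one as a key with a dummy
    // value of 1; repeated keys overwrite themselves, so the number of
    // keys at the end is the number of distinct values.
    static int countDistinct(String[] column) {
        Map<String, Integer> seen = new HashMap<>();
        for (String value : column) {
            seen.put(value, 1);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical sample column, not real data from the lab's CSV.
        String[] hair = {"Brown Hair", "Red Hair", "Bald", "Brown Hair"};
        int brown = countMatching(hair, "Brown Hair");
        System.out.println("Brown Hair: " + brown);
        System.out.println("Percent brown: " + percent(brown, hair.length));
        System.out.println("Hair types: " + countDistinct(hair));
    }
}
```

In a sketch, the same loops would walk over the rows of the loaded table instead of an array, and the results would be reported with println().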

When you're done, the sketch's console window should look like this:

Bisexual characters: 123
Average number of appearances: 12.3456
Most prominent deceased character: Ultron (Earth-616) [181 appearances]
Sex: 40.1% male, 39.9% female, 20.0% other
Hair types: 20

Nothing should be drawn in the sketch window. You don't need any code in your sketch apart from a setup() function. And remember: it's incorrect to figure out the answers to these questions in, say, Excel, and then to write a sketch that just prints out the answers. A "solution" of that form will receive no marks. Your sketch should continue to work if we substituted in, say, the CSV file containing DC characters.

Save your solution in a sketch titled Marvel in your L09/ folder.

If you're generally interested in cool visualizations of data taken from comic books, may I recommend Tim Leong's book Super Graphic, basically a large anthology of exactly that.

Question 2: Baseball Salaries

The highest single-year salary ever paid to a Major League Baseball player was $33,000,000 (thirty-three million dollars), which Alex Rodriguez received in 2009 and 2010, and Clayton Kershaw received in 2016. How have player salaries been evolving over time? Is everybody getting paid more, or only top players? You will write a sketch to visualize player salaries from 1985 to the present day.

  1. Visit Sean Lahman's Baseball Archive and download the CSV version of the 2016 data. This is the same archive that we used for a few of the in-class table examples.

  2. In your L09/ folder, create a new, empty sketch titled SalaryViz. Write a setup() function that sets the window size to at least 500×500. From the baseball archive, add the file Salaries.csv to the sketch. For this sketch, that's the only table you'll need.

  3. Add code to setup() to load the table from Salaries.csv. Note that this table includes a header row.

    In this exercise, we care about two fields in the table: the first field, named yearID, and the last field, named salary.

  4. Loop over the rows of the table. For every row, draw a small semi-transparent circle whose x position is determined by the year and whose y position is determined by that player's salary in that year. Use map() (or something like it) to ensure that the range of years 1985–2016 fills the width of the sketch (minus a margin on the left and right) and the range of salaries 0–33000000 fills the height of the sketch (minus a margin on the top and bottom). You'll end up with an image like this:

    For this question, you're allowed to "know" going in that the highest salary in the table is 33000000. That is, you can put that into your sketch as a constant; you don't need to walk through the table to find the maximum salary (this will save a bit of coding).
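
The coordinate arithmetic in step 4 can be worked through on its own. The sketch below is plain Java rather than Processing: linearMap is a hand-written stand-in for Processing's map(), and the class name, helper names, and the 50-pixel margin are hypothetical choices made for illustration. It also assumes that the highest salary should sit at the top of the window, which the lab does not require.

```java
public class SalaryMapping {
    static final int FIRST_YEAR = 1985;
    static final int LAST_YEAR = 2016;
    static final float MAX_SALARY = 33000000f; // known constant from the lab text

    // Linearly rescale value from the range [start1, stop1]
    // to the range [start2, stop2].
    static float linearMap(float value, float start1, float stop1,
                           float start2, float stop2) {
        return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
    }

    // The earliest year lands on the left margin, the latest on the right.
    static float xForYear(int year, float width, float margin) {
        return linearMap(year, FIRST_YEAR, LAST_YEAR, margin, width - margin);
    }

    // Screen y grows downward, so the target range is flipped: a salary
    // of 0 lands on the bottom margin and MAX_SALARY on the top margin.
    static float yForSalary(float salary, float height, float margin) {
        return linearMap(salary, 0, MAX_SALARY, height - margin, margin);
    }

    public static void main(String[] args) {
        float size = 500;   // minimum window size from step 2
        float margin = 50;  // hypothetical margin, pick your own
        System.out.println("x for 1985: " + xForYear(1985, size, margin));
        System.out.println("x for 2016: " + xForYear(2016, size, margin));
        System.out.println("y for 0: " + yForSalary(0, size, margin));
        System.out.println("y for max: " + yForSalary(MAX_SALARY, size, margin));
    }
}
```

In the actual sketch, map() does the same rescaling, and each row's x and y values would be passed to a circle-drawing call with a semi-transparent fill.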

That's the only required code in this assignment. Of course, there are many opportunities for enhancements, to improve the quality of the visualization. For example:

We'll award bonus marks to especially creative or innovative enhancements.

You can complete this sketch in under 20 lines of code, not counting comments. When you're done, store your solution in a sketch titled SalaryViz in the L09/ folder.

Submission

When you are ready to submit, please follow these steps.

  1. If necessary, review the Code Style Guide and use Processing's built-in Auto Format tool. You do not need to use the precise coding style outlined in the guide, but whatever style you use, your code must be clear, concise, consistent, and commented.

  2. If necessary, review the How To Submit document for a reminder on how to submit to LEARN.

  3. Make sure to include a comment at the top of all source files containing your name and student ID number.

  4. Create a zip file called L09.zip containing the entire L09 folder and its subfolders Marvel and SalaryViz.

  5. Upload L09.zip to LEARN. Remember that you can (and should!) submit as many times as you like. That way, if there's a catastrophe, you and the course staff will still have access to a recent version of your code.

  6. If LEARN isn't working, and only if LEARN isn't working, please email your ZIP file to the course account (see the course home page for the address). In this case, you must mail your ZIP file before the deadline. Please use this only for emergencies, not "just in case". Submissions received after the deadline may receive feedback, but their marks will not count.