Dirkjan Ochtman: writing

Compilers on Coursera

Published on 2012-07-21 by Dirkjan Ochtman in learning

Two weeks ago, I completed the Compilers course on Coursera; it was a very worthwhile experience. From the amount of discussion about e-learning, it seems a pretty hot topic. So far, I haven't seen any posts about what it's like to actually participate in a Massive Open Online Course (MOOC), so I figured I'd write up some thoughts on my experience. It's pretty detailed, so:

TL;DR

Taking this course was an great way to learn more about compilers and fill a hole in my CS curriculum. Professor Alex Aiken is a great instructor and covers a good amount of material. I learned a lot about compiler construction despite having toyed with my own compiler before starting the course. The programming assignments were particularly tough, giving me useful experience in building compilers and a great sense of achievement. Coursera seems a nicely designed platform, and I'd like to try some other courses next year.

Joining up

I heard about the course via a Google plus post at the end of April. I'd been playing with writing my own compiler for a language I'm experimenting with, and I figured this would be a good way to learn a few things about what works in compiler design. For my own project, I had gotten started in Python, with a custom regex-based lexer, a Pratt parser and doing fairly basic code generation by writing out LLVM IR. At some point, the code generation code grew unwieldy and I split it up into a "flow" stage, doing what I now know to be called semantic analysis and turning the AST into a control flow graph, and an actual code generation stage to translate the CFG to IR.

There was no compilers class for my CS program in university, and I had substituted another programming class for the assembler programming class they offered, but one of Steve Yegge's rants had stuck with me. So I figured that, with my masters in CS and some experience writing a basic compiler "from scratch" I would be able to handle the course next to my full-time day job.

This compilers course is originally from Stanford, and supervised by professor Alex Aiken and some staff via Coursera. There is a similar class on Udacity, which I didn't find until after I'd already started at Coursera. The Coursera version is comprised of lectures, quizzes, proof assignments, mid-term and final tests and programming assignments, with the programming assignments being graded separately so that there are two levels of participation. This installment (I think it will run again in the future) ran from April 19, when the first lectures were available, to July 6, when the final test was due.

Lectures

Lectures were posted each week. There was at least about 90 minutes of video per week, though some weeks it ran up to 160 minutes. It's divided up into pieces of about 5 to 25 minutes, which made the viewing significantly more manageable. I used HTML5 video inside the browser (which worked great even on 3G internet), but you can also download each video separately. For the in-browser viewer, there's an option to view at speeds from 0.5x to 2x, but it's hidden in the user preferences, so I didn't find it until after I was done with the course.

Each video starts with a short introduction where you can see the professor talking; after that, you see the slides, which the professor scribbles notes on as he goes along explaining the material. The slides can be downloaded separately as PDFs for later review, in two versions: the pristine version and the one with the notes scribbled on them by the professor.

I didn't like lectures much in university, but found that I actually liked watching these. The pacing is pretty good, although of course it's sometimes a little slow and sometimes a little fast, but the professor was engaging and the scribbling on the sheet makes it feel a little more interactive than your standard slides plus narration. It also helped me that the videos were generally pretty short, so you can watch one, do something else for a bit, then watch another one.

Quizzes & tests

There were 6 quizzes, spaced out in time throughout the course, each covering the material in the lectures posted since the last quiz. There were two deadlines on each quiz: the early deadline, something like a week after the quiz was posted, and the end of the course, for half credit. Each quiz could be taken as many times as you wanted, the highest score would count, and you got to see the correct answers as soon as you submitted the quiz.

The questions were pretty challenging. In the first few quizzes, I just clicked through and didn't end up getting very good scores. If I felt the score was too low, I'd take it again and see if I could do better after studying the correct answers for a bit, but it didn't always get much better. I was treating it like a small exam on what I'd learned from the last set of lectures.

At some point halfway through, I changed my strategy and started seeing the quizzes more as additional material. I took extra time, started noting things down on paper and really working through the problems, as many times as necessary to get a perfect score. I feel this was a much better way to do it, because I actually learned things by checking my answers.

I found some of the questions annoying because I felt they required very detailed reasoning through an example DFA or generating MIPS assembly code by hand. I don't mind building up assembly code for the programming assignment (that much -- see below) but I was really hoping more for questions that tested my understanding of the underlying theory rather than my reproduction skills of detailed algorithms laid out in the lectures. Of course, taking more time for the quizzes helped with that, too, but I'm still inclined to dislike those kinds of questions.

The tests were more or less like the quizzes, though slightly harder. I didn't do that well on the mid-term, where I got about 50%; I got 75% on the final, which was much more satisfying. There was some grumbling in the forums about some of the questions being ambiguous or even just wrong, but this wasn't a big issue for me.

DeduceIt

There were 6 proof assignments, run via a small web app called DeduceIt. The web app is a little rough around the edges, and so these assignments could just get you extra credit. In all of these assignments, some part of the week's material was represented in a proof assignment, where we were suplied with a number of given statements, a goal statement and a few production rules. We had to apply the rules to the given statements in a particular number of steps to derive the goal statement.

This would all be fine if it wasn't for the representation used in the assignments. To require students to carefully apply the rules to given statements themselves, both rules and statements are given as LaTeX-like text expressions. These are hard to read and very tedious to reproduce, making the assignments more about understanding the representation than actually going through the proof.

On the other hand, the proof assignments are a good way to go through the algorithms that were covered in the lectures without having to dive into the details of actual code implementing the algorithm. I liked doing them for this reason, but felt the representation got in the way more and more as the assignments got more complicated towards the end of the course.

Programming assignments

The real meat of the course, for me, was in the programming assignments, implementing a compiler for the Classroom Object-Oriented Language (COOL). Each assignment corresponded to one of the essential compiler stages: a lexer, a parser, a semantic analyzer and a code generator. All assignments were available in C++ or Java flavors, with the first two being built around on the flex/JLex and bison/CUP tools respectively. It was also allowed to just use another language, with the caveat that it required reimplementing some of the support code that was made available.

I decided to go with the Java variants of the first two assignments, on the premise that it would be educational to learn the usage of tools like JLex and CUP rather than building my own lexer and parser like I'd done for my own compiler. Getting started with JLex was fairly frustrating, so I almost dumped it in favor of doing it in Python, but once I got the hang of it it turned out okay.

(I've since been wondering whether students could learn more if they implemented a lexer or parser from scratch versus using something like flex or bison. At least from my own experience, writing a lexer/parser for a realistic language is not that hard. Of course flex results in a highly optimized lexer, but writing up a lexing algorithm from scratch would seem more instructive. On the other hand, perhaps using the generator tool allows you to focus more on the characteristics of the grammar you're developing, instead of it potentially getting lost in procedural code.)

The reference compiler is implemented in the same four stages, with a simple text format used to communicate the relevant data through UNIX pipes: the lexer consumes program code and spits out a simple line-based list of nodes, which the parser consumes and turns into an indentation-based AST representation, the semantic analyzer augments the AST with more typing information and finally the code generator outputs MIPS assembler code.

Grading was done via Perl scripts that fuzzily compare the output of your code to that of the reference component, running over a list of sample programs (both valid and invalid) to score your program. This worked quite nicely and made it straightforward to find the remaining issues. In fact, the main way I wrote my code was by starting from a small program, trying to generate some reasonable output, then use diff -u to compare it to the output from the reference program and see where my code failed. I often find that it's easier to stay motivated when generating small successes with some regularity, and this way of working made it harder to get stuck. On the other hand, it might have been better to do a little more design up front, forcing me to think through issues rather than hacking away at problems that come up.

I ended up doing the last two assignments in Python. I wrote up a small parser for the textual representation of the AST used and some infrastructure to easily write passes over the AST, then got started with the actual assignment. This worked great; Python is the language I use most, and not having to think so much about typing or memory management made it easier to get the assignments done. In my opinion, Python is a much better language for most teaching purposes than either Java or C++, because it enables more focus on the actual algorithm (versus the "details" of programming). Apparently the Udacity compilers course uses Python, so I might have gone there if I'd known about it before starting with Coursera.

The programming assignments took a lot of time. With my full-time job, I found it pretty hard to find 10-25 hours per assignment of focused time reading documentation, writing code and testing against sample programs. However, these assignments were also the most instructive and rewarding for me, so I wouldn't want to have missed them, and having some deadlines also helped with not procrastinating so much.

As a result, I ended up giving up on the last 20% of the code generation assignment. Assembly is pretty verbose and hard to read, and working with MIPS assembly certainly made me appreciate LLVM IR more (even if x86 assembly is uglier still). Debugging assembly is also pretty painful, so I got a little frustrated as the deadline was getting closer. The simulator used to run MIPS, SPIM, was also a little limited and buggy in places, which certainly didn't make things easier. In short, I'll be happy to return to LLVM and its suite of tools.

Community

Part of why this course worked for me was definitely the community. The deadlines can be harsh if you're not a full-time student, but being able to get some help from staff, confer about particularly hairy parts of the semantic analyzer or get some extra explanation about a quizz question makes it much more fun to do. It should also make it less likely you hit a wall and have to give up on the course and give some extra incentive to finish the whole course. This made the forums an integral part of the experience for me.

Overall experience

If you similarly lack any education in compilers and find that Steve Yegge's aforementioned rant has you inspired, Coursera's course is a great way to learn a lot about compilers. I'm sure the Udacity course is nice, too, but the Coursera environment seems more attractive. I also found Coursera's relation to renowned universities appealing, though that might just be good marketing. In any case, I look forward to taking another course through their site.