Syntax

Jul 21, 2015 in programming, thoughts

I want to talk about programming language syntax tradeoffs. This stuff has probably been said before, but I’ve been thinking about it a bit recently and want to write a blog post about it, because what’s this blog even for anyway.

I was thinking about people’s complaints about various programming languages, and it got me thinking about tradeoffs of features vs. syntax simplicity. What do I mean? Let’s look at some examples.

Python

I quite like Python’s syntax. Probably its most distinctive feature (i.e., you know about it even if you’ve never written a line of Python) is using indentation to denote block structure. There are no braces. Which is nice, because it forces you to use correct indentation, etc., blah blah. But it does lead to one problem, which is that you can’t really write multi-statement anonymous functions. There are lambdas, but they can only consist of a single expression (e.g. lambda x: x+1). This actually makes sense if you think about it, though. Let’s look at an anonymous function being passed as an argument in JavaScript:

do_stuff(function(x) {
    var y = x + 6;
    return y;
}, 4);
// or
do_stuff(function(x) { var y = x + 6; return y; }, 4);

Now how can we do something similar in Python? Python does allow semicolons between simple statements on one line, but a lambda body has to be a single expression, and statements like the assignment y = x + 6 aren’t expressions. So a multi-statement lambda would need a real block, which means something similar to the first structure above. But there are also no braces to delimit blocks, so we have to use indentation! This means that the above function call would have to look something like this:

do_stuff(lambda x:
    y = x + 6
    return y
, 4)

Not only is that kind of ugly, it’s hard to read and probably hard to parse. So it makes sense that Python would leave this out, because it tends to be concerned with making its syntax as readable as possible.
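What you actually end up writing in Python is a named function defined just before the call. Here is a minimal sketch of that; do_stuff is a hypothetical stand-in (the JavaScript example never defines it), assumed here to simply call its first argument on its second.

```python
def do_stuff(callback, value):
    # Hypothetical stand-in for the do_stuff in the JavaScript example:
    # call the function we were handed with the given value.
    return callback(value)


def add_six(x):
    # The two-statement body that a lambda cannot hold.
    y = x + 6
    return y


# Pass the named function where JavaScript would pass an anonymous one.
result = do_stuff(add_six, 4)

# A lambda still works when the body fits in a single expression.
single_expression = do_stuff(lambda x: x + 6, 4)

print(result, single_expression)
```

The cost is one extra name (add_six) that only exists to be passed along once.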

Python ends up working around the lack of multi-statement anonymous functions with decorators, which are ordinary functions that take a function as an argument and return a function to use in its place. I haven’t had much experience with this part of Python, to be honest, but an interesting use of decorators that I’ve seen is in lightweight web frameworks. First, look at this code for Dancer2, a light Perl web framework:

get '/' => sub {
    return "Hello world!";
};

Accounting for all of Perl’s weird syntax rules, this is syntactic sugar for:

get('/', sub {
    return "Hello world!";
});

(sub is Perl’s function keyword.)

The equivalent code in Flask, a similar light Python web framework, reads like this:

@app.route('/')
def hello():
    return "Hello world!"

This basically does the same thing as the Perl example: app.route('/') returns a decorator, and that decorator is then called with the function hello, which registers hello as the handler for '/'. However, it does it without invoking an anonymous function anywhere. It is a little awkward because you have to give all your page functions names, but that might not necessarily be bad depending on your viewpoint.
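To see what the @ line is doing, here is a small sketch of a route-style decorator. This is not Flask's implementation; the routes dictionary and the route function are made up for illustration, and the only real Python feature on show is the decorator syntax itself.

```python
routes = {}  # hypothetical registry standing in for a framework's routing table


def route(path):
    # Calling route('/') returns the actual decorator...
    def decorator(func):
        # ...which records the function under the path and hands it back unchanged.
        routes[path] = func
        return func
    return decorator


@route('/')
def hello():
    return "Hello world!"


# The same registration written out without the @ syntax:
def goodbye():
    return "Goodbye!"

goodbye = route('/bye')(goodbye)

print(routes['/']())
```

So the decorator form is sugar for "define the function, then pass it to something", which is exactly what the anonymous-function version does in one step.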

I think decorators are a pretty smart solution to this problem in Python, but it still shows how languages can run into conflicts between syntax elegance and functionality. Python, on the whole, tends to err on the side of syntax elegance in these conflicts.

Ruby

In Ruby, parentheses around method/function calls are optional. This is actually quite handy for a number of reasons. For example, to write a getter for an attribute of a class, you can just name it after the attribute, and whenever it’s written, the function will be called.

What I just wrote there didn’t make very much sense, so let’s look at an example instead.

class A
    def initialize
        @x = 0
    end
    def x
        return @x
    end
end

a = A.new
puts a.x # equivalent to: puts a.x()

This prints 0. (And notice how there were no parentheses anywhere in the code.) This means that Ruby code tends to look a lot cleaner than, say, Java, where the “proper” thing to do would be to declare x to be private and then implement a getX() method that returns x, and use that everywhere just in case you want to change how “getting x” works in the future [1]. In Ruby, you don’t have to do that; you can essentially just refer to the property by name. There’s even a built-in Ruby method (attr_accessor) that will write the getter and setter methods for you! So that’s pretty awesome.

Unfortunately, there’s a dark flipside to this parentheses-free design. And that is that Ruby methods aren’t first-class values you can refer to by bare name, because writing a method’s name always calls it. For example, this Python program can’t be directly translated into Ruby:

def myfunc(x):
    return "You wrote %s." % x

def g(z):
    print(z(3))

g(myfunc)   # output: You wrote 3.

If we try to translate this into Ruby, we get an error saying that there aren’t enough arguments for myfunc.

def myfunc x
    return "You wrote #{x}."
end

def g z
    puts z 3
end

g myfunc    # equivalent to g(myfunc()) -- whoops, myfunc needs an argument!

There are several ways to get around this limitation in Ruby; there are objects which represent callable code and which you can pass around (Proc and Method objects; method(:myfunc) gives you one for myfunc, which you then invoke with .call), and there are anonymous functions (lambdas). Like Python, Ruby has come up with its own way to mitigate the limitations of its syntax. Is it worth it? You decide! I haven’t really done anything with Ruby ever, so I don’t know how important this is, but here’s an article about why lacking parentheses is good, for what it’s worth.

Now, let’s look at the other extreme — languages that have bent their syntax to their features, for better or for worse.

C++

C++ is kind of infamous for implementing EVERYTHING POSSIBLE, often at the expense of reasonable syntax. This is often beneficial to people writing code in C++, because it has every feature under the sun, but it can often make programs difficult to read and even more difficult to parse. For a long time, there was an ambiguity with nested template syntax [2]:

a<b<c>>d;
// could be:
a <b <c> > d; // declaring d of type a<b<c>>
a < b < c >> d; // an expression: >> binds tighter than <, so this is (a < b) < (c >> d)

Apparently this has been resolved in the C++11 standard to parse the first way instead of the second way, which is what most people want it to do. (See here for some discussion of why it is a problem and how it’s been fixed.) But there are still some other C++ syntax bugaboos floating around out there. The most vexing parse, for example:

// Code from the Wikipedia article
class Timer {
    public:
        Timer();
};

class TimeKeeper {
    public:
        TimeKeeper(const Timer& t);

        int get_time();
};

int main() {
    TimeKeeper time_keeper(Timer());    // <--- HERE
    return time_keeper.get_time();
}

// the above line could be either:
TimeKeeper time_keeper(Timer (*)()); // (a function declaration)
// or:
TimeKeeper time_keeper(Timer());     // (a variable declaration)

Most people who write something like this mean it to be parsed the second way above, but it ends up being parsed the first way. As noted in the Wikipedia article, C++11 adds a brace-initialization syntax (TimeKeeper time_keeper{Timer{}};) which removes the ambiguity, but it looks like you can still get bitten if you keep using the old parenthesized syntax.

Also, C++11 introduces its own share of ridiculous syntax constructs. C++ got lambdas, and they look something like this:

[=] (int param) -> int { return param + 1; }

The syntax for return types differs from the usual function declaration style (it’s written after the ->), and the square brackets look bizarre, but hey, you can select how you want your variables captured in the closure ([=] captures by copy, [&] by reference)! Giving people this choice at the expense of beautiful syntax was probably the right tradeoff for C++ to make, seeing as C++ tends to be used in applications where speed is at a premium, and you don’t want to waste time copying variables into the closure when a reference would do. In fact, you’ll notice that the two languages I discussed above are both scripting languages, so maybe they can afford to forgo speed for beautiful syntax.

Although this code is probably unforgivable:

int main(){(([](){})());}

Perl

Ah, Perl. Famous for regular expressions, the 1990s internet… and being unreadable to anyone other than the original author. Strictly speaking, this isn’t necessarily true if you avoid some of the language’s features, but it does possess plenty of facets that make it difficult to understand. A lot of these stem from Perl’s origin as a replacement for sed/awk/bash scripts. So, Perl inherited a lot of weird two-character variables ($!, $?, $$) from these languages. It also has the weird implicit variable thing, which makes it easier to come from these languages. This lets you write Perl like this:

while(<>) {
    s/a/b/g;
    print;
}

and all of these actions are being taken on an implicit variable (another two-character variable, $_). The above program, when called with a file as its argument, will print out the file with every a replaced with b.
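For comparison, here is roughly the same program with the variable spelled out, sketched in Python. The function name replace_a_with_b and the sample lines are made up for illustration; the variable line plays the part that $_ plays implicitly in the Perl version.

```python
def replace_a_with_b(lines):
    # `line` is the explicit counterpart of Perl's implicit $_ .
    for line in lines:
        # Equivalent of s/a/b/g: replace every "a" with "b".
        yield line.replace("a", "b")


# Sample input standing in for the lines of a file.
# To mimic Perl's <>, you could pass fileinput.input() (from the standard
# library's fileinput module) instead of a list.
sample = ["banana\n", "xyz\n"]

for out in replace_a_with_b(sample):
    print(out, end="")
```

It is longer than the Perl, but nothing is happening off-screen: every value the loop touches has a name you can search for.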

Allowing this shorthand may have seemed like an acceptable tradeoff in the beginning, because it matched what people were already familiar with, and it allowed fast code-writing, which fit into the niche Perl was going for (writing small shell scripts to accomplish something quickly). But as Perl has grown, it has become clear that writing this way is a very bad idea. For example, $_ is a global, so someone else’s function can overwrite it, which is very hard to track down as an error. Also, it makes your code very hard to read. But Perl has to keep supporting these features for backwards compatibility. And I can vouch that when you are writing a quick script, they’re pretty handy.

So it looks like some of Perl’s tradeoffs have not aged well. I could keep talking about Perl syntax for a while, because I actually think that parts of it are quite elegant, but I’m going to cut myself off for now. Maybe in another post.

So what

If you look carefully, you can find this pattern in any language. In some, the tradeoff is very dramatic (first-class functions vs. parentheses in Ruby), whereas in some languages it’s more subtle. For example, why are parentheses required around the if condition in C/C-like languages? Because leaving them out can lead to ambiguous parsing when a single statement follows the condition. For example:

if a - b - c;
/* is this... */
if (a) { -b-c; }
/* ...or... */
if (a-b) { -c; }
/* ...or even... */
if (a-b-c) {}
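Python makes the same tradeoff with a different delimiter: it drops the parentheses but requires a colon to mark where the condition stops. A quick sketch using the standard library's ast module to check this (ast.unparse needs Python 3.9 or later; the parses helper is just something I made up for the demonstration):

```python
import ast


def parses(source):
    # True if Python accepts the source text as a valid program.
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


with_colon = "if a - b - c:\n    pass\n"
without_colon = "if a - b - c\n    pass\n"

# With the colon, the whole of a - b - c is unambiguously the condition.
condition = ast.parse(with_colon).body[0].test
print(ast.unparse(condition))

# Without any delimiter, the parser refuses rather than guessing.
print(parses(without_colon))
```

Either way, something has to say where the condition ends; the language just gets to pick which punctuation does the job.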

(And don’t even get me started on PHP!)

But anyway, I decided to write this post because I have been thinking about this subject for a while, and it seemed like an interesting subject to write about. After trying (and failing) to write my own toy language, I learned a lot about parsing and syntax, and the tradeoffs that you have to make when designing a language. I started noticing these things in other languages as well.

(Also, I’m writing a new toy language, learning from my mistakes. Perhaps later we’ll be seeing that around on this site as well!)


Footnotes

  1. It looks like Java has discussed the possibility of getting C#-style property getters/setters without setX() et al. But it doesn’t look like they’ve implemented that yet, and I like how it’s all wrapped up in a method call in Ruby. (Or message passing, whatever.)
  2. All the angle brackets in this code sample are murdering Vim’s Markdown/HTML syntax highlighting. It’s all messed up for the rest of the file.