Irreal: March 2010

Monday, March 29, 2010

PLT Scheme Gets a New Name

The PLT group announced that it is changing the name of PLT Scheme to PLT Racket. DrScheme will become DrRacket, mzscheme will become racket, and so on. Normally I dismiss this type of thing as either marketing driven silliness or a cover for other, more serious problems that the vendor is having. I almost always find this sort of gratuitous name change annoying. In this case, however, I agree that the change is warranted.

DrScheme was my very first Scheme and I was impressed by the quality of its implementation and the tools that it provided. Over the years I became less enchanted. First it was little things: rather than use format as the format-and-print function like every other LISP, PLT used printf. Even worse, they did have a function named format that did something else making it harder to use a real format. Rather than use a standard sockets interface for their networking they used their own API that made using PLT Scheme harder for someone used to the normal socket interface. To be sure, PLT provided alternate libraries that ameliorated these annoyances, but it was still something you had to specify in every program file. Not fatal, just irritating. And they did have all those neat tools.

They went too far, however, when they removed set-car! and set-cdr!. This was justified by declaring these two functions "too dangerous" for programmers to use safely. Statements like that should set alarm bells ringing for any Lisper. The point of Lisp, at least to me, is to remove straight-jackets and give programmers the power they need to solve the problem they're working on. Instead, here's PLT throwing up roadblocks that programmers need to waste mental resources working around.

That was it for me. I switched to Guile and haven't looked back. I have been vaguely aware, though, of further changes to PLT Scheme in the name of protecting programmers from themselves. It's been clear for some time that the PLT group has been moving their Scheme towards a "Pascal with Parentheses" language. There's nothing wrong with such a language, of course. The PLT group is, after all, a group of academics interested in finding optimal ways of teaching programming. But if you have such a language you shouldn't call it Scheme. The PLT folks have recognized this and changed the name of their language to Racket. I say good for them.

Monday, March 22, 2010

Writing Emacs Commands That Work on Regions or the Entire Buffer

Some Emacs commands come in pairs: one for regions and one for the entire buffer (print-region / print-buffer, eval-region / eval-buffer, ispell-region / ispell-buffer, etc.), others have a command only for regions and require a C-x h to act on the entire buffer (downcase/upcase-region, many of the compile commands, and so on). But does it really have to be like that?

Consider this code snippet from a nice post on the excellent EMACS-FU:

1:  (defun djcb-count-words (&optional begin end)
2:    "count words between BEGIN and END (region); if no region defined, count words in buffer"
3:    (interactive "r")
4:    (let ((b (if mark-active begin (point-min)))
5:        (e (if mark-active end (point-max))))
6:      (message "Word count: %s" (how-many "\\w+" b e))))

This little gem is full of geeky goodness. The actual word counting happens on line 6—a nice implementation due to Rudolf Olah—but the important part for us happens on lines 3–5. Using (interactive "r") causes the point and mark to be passed to the function, smallest first. Thus if there is an active region, the parameters begin and end will specify its boundaries. If no region is defined—indicated by mark-active being nil—the function uses the beginning and end of the buffer as the boundaries for the how-many function. Otherwise it uses the region boundaries as specified in the begin and end parameters.

I really like this trick and I used it in several of my functions for a long time. Then one day I got an error: Emacs informed me that the mark was not set and therefore the (interactive "r") failed. This doesn't happen very often; you usually see it when you've just opened the file and haven't done anything to cause the mark to be set yet. Still, it was annoying so I wrote a macro to do the same thing without having to worry about whether the mark was set or not.

 1:  ;; Macro to take care of (interactive "r") with no mark set problem
 2:  (defmacro with-region-or-buffer (args &rest body)
 3:    "Execute BODY with BEG and END bound to the beginning and end of the
 4:  current region if one exists or the current buffer if not.
 5:  This macro replaces similar code using (interactive \"r\"), which
 6:  can fail when there is no mark set.
 7:  
 8:  \(fn (BEG END) BODY...)"
 9:    `(let ((,(car args) (if mark-active (region-beginning) (point-min)))
10:           (,(cadr args) (if mark-active (region-end) (point-max))))
11:       ,@body))
12:  
13:  ;; Indent it properly
14:  (put 'with-region-or-buffer 'lisp-indent-function 1)

The \(fn (BEG END) BODY...) on line 8 provides a correctly formatted prototype of the macro for the documentation. Line 14 tells Emacs how to indent elisp code that uses the with-region-or-buffer macro.

Now if we write:

(with-region-or buffer (b e)
  (message "Word count: %s" (how-many "\\w+" b e)))

it will expand to

(let ((b (if mark-active (region-beginning) (point-min)))
      (e (if mark=active (region-end) (point-max))))
  (message "Word count: %s" (how-many "\\w+" b e)))

and count the number of words in the region, if there is one, or the entire buffer otherwise.

Friday, March 19, 2010

Hall of Shame

One of my pet peeves is Web sites that ask for a credit card number and then refuse to accept spaces despite the facts that the spaces occur on the card and that it's significantly more difficult to enter them correctly without the spaces. Similar remarks apply to phone numbers and doubtless others. There's no reason for this other than programmer laziness—or perhaps just cluelessness—and, really, a lack of respect for their users. Why any company would tolerate this is a mystery, but that companies such as Microsoft, Adobe, TurboTax Software, and Apple tolerate it is just inexcusable.

Comes now Steve Friedl in a fit of righteous anger with his No Dashes or Spaces Hall of Shame. As Steve demonstrates, it's often easier to do the right thing than it is to add an instruction line specifying no spaces or dashes. The best part of his post is a long list (with examples) of offenses by major companies that really should know better. Steve is still accepting nominations, so by all means send him any examples that are inflicted upon you. Maybe public shaming will put an end to this disgraceful practice.

Wednesday, March 17, 2010

Russ Cox on Regular Expressions

I've long been a fan of Russ Cox's 2007 paper Regular Expressions Can Be Simple and Fast. In it he explained why many of the common regular expression libraries in languages such as Perl, Python, Java, PHP, and others can exhibit exponential behavior, while the libraries associated with the Unix tools, such as egrep and awk, have provably linear performance on all inputs. Oddly enough, the O(1) Unix regular expression tools are based on an algorithm invented by Ken Thompson in 1968 for the QED editor, while the (potentially) O(2ⁿ) implementations used by PERL et al. were based on Henry Spencer's much later Regular Expression Library.

I've just discovered that Russ wrote two other papers in the series: Regular Expression Matching: the Virtual Machine Approach (2009) and Regular Expression Matching in the Wild (2010). The first of these discusses various virtual machine implementations for regular expressions. The first paper discussed an implementation based on Thompson's ideas but it worked by converting the regular expression into an explicit nondeterministic finite automaton (NFA) and executing that. The implementations in the second paper are closer to Thompson's actual implementation, which generated IBM 7094 machine code and executed it directly. Cox shows that nevertheless the two implementations are essentially the same and that each is revealing in its own way. This is a wonderful paper—in some ways better than the first—and is really a must read.

The final paper discusses the RE2 regular expression system used at Google and now released as an open source project. RE2 is based on the ideas discussed in the first two papers but is a real implementation meant for industrial strength applications.

If you've ever wondered how regular expressions work or even have the slightest interest in them, NFAs, or just enjoy excellent software engineering, you should definitely give these papers a read. You won't be sorry.

Saturday, March 6, 2010

A New Org-mode Talk

On February 8, 2010, Carsten Dominik gave a talk about org-mode at the Max Plank Institute for Neurological Research. This talk is a bit different from the one he gave at Google in that the emphasis is more on how he uses org-mode than on the actual mechanics of its use.

It's a great talk and make a nice companion to Bernt Hansen's excellent writeup on how he uses org-mode to run his company and life. The talk is well worth 53 minutes of your time.