Case-folding function names

Posted: April 30th, 2013 | Author: Mars | Filed under: Design

Every programming language which manipulates strings offers a pair of functions which convert text to upper or lower case. These functions are often used to perform case-insensitive comparisons – you just convert both strings to either upper or lower case first. This generally works, but it fails for some scripts, and so the Unicode standard defines a case-folding transform which produces a normalized string suitable for caseless matching.
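
A small Python sketch of the failure mode, using only the built-in str.lower and str.upper. The helper names are made up for illustration, and the German sharp s is just one convenient example of a character whose upper and lower conversions do not round-trip:

```python
def equal_via_lower(a: str, b: str) -> bool:
    """Caseless comparison by lowercasing both sides."""
    return a.lower() == b.lower()


def equal_via_upper(a: str, b: str) -> bool:
    """Caseless comparison by uppercasing both sides."""
    return a.upper() == b.upper()


# Lowercasing leaves the sharp s alone, while uppercasing expands it to "SS",
# so the two "obvious" comparisons give different answers for the same pair.
lower_matches = equal_via_lower("Straße", "STRASSE")
upper_matches = equal_via_upper("Straße", "STRASSE")
```

The two helpers disagree on the same pair of strings, which is the ambiguity a dedicated case-folding transform is meant to remove.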

Radian’s string library implements to_upper and to_lower, and it seems reasonable that it should offer a case-folding function too. But what to call it? Nobody else seems to be offering such a function: I can’t find one in .NET, in Python, in PHP, in Ruby – Go has a mysterious function called SimpleFold, but whatever it’s doing, it isn’t what I’m trying to do.


Change from ‘const’ to ‘def’

Posted: October 25th, 2012 | Author: Mars | Filed under: Design, Progress, Syntax

Radian offers two simple symbol types: var lets you define a symbol to which you can later assign a new value, while const is a definition which cannot later be changed. I had expected to make heavy use of const in Radian code since it echoes a pattern I use frequently in C or C++, but in practice I’ve found myself shying away from it. The reason is entirely superficial: it doesn’t feel right, because the values I would be assigning just aren’t constants. Instead, most of the consts I would define are intermediate values – things that will change on every invocation of the function or every pass through the loop, but which can remain unchanged once I’ve defined them. As such it just feels weird to call them constants, and so I tend to define them as var even if I have no intention of ever redefining them.

I still think that const has a good place; in fact I think that using it heavily is good style. I’ve decided therefore to rename it. Stealing a keyword from Python, “constants” are now “definitions”, using the keyword def. I’d avoided def since Python uses it for function definitions, specifically, while Radian functions use function, but sometimes one’s nice clean abstract ideas don’t pan out in practice.

It’s about time to freeze the syntax for a while. Aside from the half-finished regex literals, which are actually present in 0.6, I don’t see any further syntax changes on the horizon. All the upcoming work is in libraries and the toolchain.


File reading API

Posted: September 27th, 2012 | Author: Mars | Filed under: Design, Progress

The regex system is turning out to be a larger project than I had anticipated. It’s still important, but as the length of time it appears likely to consume continues to grow, its immediate priority is dropping. I’m still working on it, but I’m not going to let it delay the long list of smaller pieces of functionality impeding other use-cases.

I am continuing to move away from the original monadic IO system. The latest change is the file-input mechanism: the function that used to be io.read_file is now file.read_bytes. I want it to be clear that the result of this function is a byte buffer, not a string. The buffer object implements the sequence interface, so if I just called it file.read an unobservant ASCII-using programmer might be able to get disturbingly far along without noticing that what they’d read was not actually text decoded from its byte form, but merely a sequence of bytes. By naming the function read_bytes I hope to plant a seed of puzzlement which will lead the programmer to its eventual sibling, read_string, which will require you to specify the encoding of the text file you are reading.
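
To make the distinction concrete, here is a rough Python analogue. The function names mirror the Radian ones, but the signatures and behaviour are assumptions for illustration, not Radian's actual API:

```python
import os
import tempfile


def read_bytes(path: str) -> bytes:
    """Return the raw contents of the file; nothing is decoded."""
    with open(path, "rb") as f:
        return f.read()


def read_string(path: str, encoding: str) -> str:
    """Return decoded text; the caller has to name the encoding."""
    return read_bytes(path).decode(encoding)


# Write a small UTF-8 file so the two results can be compared.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("café".encode("utf-8"))
    sample_path = tmp.name

raw = read_bytes(sample_path)             # five bytes: the é takes two
text = read_string(sample_path, "utf-8")  # four characters
os.remove(sample_path)
```

With pure ASCII content the two results would have the same length and look interchangeable, which is exactly the trap the naming is meant to flag.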

Another change is the elimination of the filespec object. I’d intended to use an abstract mechanism for describing a file, but it’s ultimately nothing but a thin wrapper around a path string. Since every platform I care about uses path strings to identify files, I’ve decided to drop the wrapper. Perhaps there will eventually be a module in the library which implements platform-localized transformations on path strings.


Regex literals

Posted: September 6th, 2012 | Author: Mars | Filed under: Design, Syntax

While I’m trying to err on the side of tradition with Radian syntax, and would thus like to use the /regex/ style, Radian already uses the forward-slash character for division. This doesn’t stop Perl, Ruby, or Javascript, but giving the slash character a contextual meaning makes those languages much more difficult to parse. With a non-context-free grammar, you can’t simply parse in one pass and analyze later: you need to analyze the code you’ve seen in order to understand how to parse the characters ahead. (It’s even worse with Perl, where you can’t correctly interpret a slash character without executing all the preceding code!)

This undesirable trait obviously hasn’t been a fatal obstacle to success for those languages, but since I’m starting with a clean sheet I might as well maintain the LL(1) constraint. This will keep the parser simple and make life much easier for any future intelligent editors and other static analysis tools.

I think I’ll combine Ruby’s alternate delimiter syntax and Scala’s hash-quote syntax: %/regex/, or alternately %"regex". The percent sign is otherwise unused, and it will introduce the regex literal; the next character will be a delimiter, which can be either a forward-slash, single-quote, or double-quote.

A regex is not a string literal, so backslashes will be interpreted by the rules of the regular expression sublanguage rather than the rules of the string-literal sublanguage. The delimiter character is an escape, breaking out of the regex sublanguage; therefore you should pick a delimiter which you don’t intend to use in your regex.

Of course someone will eventually need to construct a pathological regex including all three possible delimiters, so I’ll borrow the quote-doubling mechanism from BASIC. Within a regex literal, doubling up the delimiter character will not end the regex, but will insert a single instance of the delimiter. For example, these two will be equivalent:

%/reg//ex/
%"reg/ex"
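
A minimal sketch of how such a literal might be scanned, written in Python rather than in the Radian compiler's own code. The function name and return shape are hypothetical; only the percent prefix, the three delimiters, and the doubling rule come from the description above:

```python
DELIMITERS = ("/", "'", '"')


def scan_regex_literal(source: str, start: int = 0):
    """Scan one %-prefixed regex literal beginning at source[start].

    Returns (pattern, index just past the closing delimiter).
    """
    if start + 1 >= len(source) or source[start] != "%":
        raise ValueError("regex literal must begin with '%' and a delimiter")
    delim = source[start + 1]
    if delim not in DELIMITERS:
        raise ValueError("unsupported delimiter: " + repr(delim))

    pattern = []
    i = start + 2
    while i < len(source):
        ch = source[i]
        if ch == delim:
            # A doubled delimiter stands for one literal delimiter character.
            if i + 1 < len(source) and source[i + 1] == delim:
                pattern.append(delim)
                i += 2
                continue
            # A lone delimiter ends the literal.
            return "".join(pattern), i + 1
        # Everything else, backslashes included, passes through untouched.
        pattern.append(ch)
        i += 1
    raise ValueError("unterminated regex literal")


slash_form = scan_regex_literal("%/reg//ex/")
quote_form = scan_regex_literal('%"reg/ex"')
```

Both calls produce the pattern reg/ex. The scanner never interprets backslashes, so they reach the regex sublanguage exactly as written.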


Regex literals in various languages

Posted: August 22nd, 2012 | Author: Mars | Filed under: Design, Syntax

Languages in which regexes are first-class syntax elements:

Awk: /I (love|hate) regexe(s|n)/
Perl: /I (love|hate) regexe(s|n)/ or m|I (love\|hate) regexe(s\|n)|, where the m prefix allows a delimiter other than the slash
Ruby: /I (love|hate) regexe(s|n)/ or %r!I (love|hate) regexe(s|n)!, where the bang mark can be any delimiter
Javascript: /I (love|hate) regexe(s|n)/
Clojure: #"I (love|hate) regexe(s|n)"

Languages which offer “raw” strings with no internal escapes:

Scala: """I (love|hate) regexe(s|n)"""
Python: r"""I (love|hate) regexe(s|n)""" – the r prefix is needed; without it, triple-quoted strings still process backslash escapes

Languages which offer minimally escaped strings:

PHP: 'I (love|hate) regexe(s|n)' – backslash escapes backslash and single-quote, but no other characters
Python: r"I (love|hate) regexe(s|n)" – can use either single or double quote

The oldest example of a first-class regex literal I can find appears to be in Awk. Ruby and Javascript copied it from there by way of Perl.


String literals and regular expressions

Posted: August 10th, 2012 | Author: Mars | Filed under: Design

Working on the design for the regular expression system, I’ve run into a problem anyone who’s written regexes in C knows well: encoding a piece of code written in the regular-expression language into a string literal for the C language yields an unreadable mass of backslashes. Since Radian’s string literal syntax is derived from C’s – the only difference being Radian’s omission of octal literals – the same problems lurk just over the horizon here.

Different languages have solved this problem in different ways. Python uses both single- and double-quote characters to delimit backslash-escaped string literals, while an “r” prefix introduces a “raw” string mode in which backslashes are included in the string. Ruby uses a different approach: strings delimited with single-quotes use a minimal escape language, where the only escapable characters are the single-quote and the backslash itself, while strings contained in double-quotes use the usual C-derived escape language. Javascript is wussy as usual; single- and double-quotes are interchangeable, and all strings use a reduced list of the original backslash-escapes from C.
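
To show the difference in Python's terms, here is the same small pattern written both ways. The pattern itself is just an arbitrary example:

```python
import re

# Ordinary literal: every backslash meant for the regex engine must be doubled.
escaped = "\\d+\\.\\d+"

# Raw literal: backslashes reach the regex engine exactly as typed.
raw = r"\d+\.\d+"

same_pattern = escaped == raw       # both spell the same pattern text
match = re.fullmatch(raw, "3.14")   # digits, a literal dot, digits
```

The two literals produce identical strings; only the source text differs, and the gap widens quickly as the pattern picks up more backslashes.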

Back when I was working on REALbasic, I once wrote a piece of the IDE that needed to generate string literals containing arbitrary bytes. Strings in REALbasic were delimited with double-quotes, and the only escape mechanism they supported was that you could double up a quote mark to have it treated as a single character. There was a way to introduce arbitrary data into a string, but it involved a function call per byte, and that wouldn’t have worked for these data chunks. I solved the problem by hacking an undocumented feature into the compiler: a new type of string literal which supported the usual array of C-derived backslash escapes. Not the most elegant solution, perhaps, but going through the full design process would have taken more time than the feature I was working on could afford, and since I was the compiler guy it wasn’t like I was foisting extra maintenance work off on anyone else. (So far as I know the code is still in there, never to see the light of day…)

What’s the right solution for Radian? At present, there is but one type of string literal, which uses backslash escapes, but we are fortunately still in a flexible early state where radical changes are possible. Is the escape model the best default? Most string literals have no escapes, and most strings which do are intended for interpolation. Perhaps the default string literal should be a non-escaped “raw” string, like Ruby’s single-quoted strings or REALbasic’s string literals. Python uses prefixes to control the specific string literal subtype: I rather like the idea but it seems that strings should be raw by default, and that special processing should be explicitly turned on rather than off. For now, that might mean a prefixed backslash to enable backslash-escapes; in the future, a prefixed dollar-sign might enable interpolation syntax.


Module identifiers

Posted: August 10th, 2012 | Author: Mars | Filed under: Design, Progress

Modules are a lot like objects, and the implementation of module files in Radian’s compiler shares a great deal of code with the implementation of object blocks. One common element they’ve had is the use of self to refer to the current instance, the object on which the function or method was called.

This works fine until you define an object inside a module, something I’ve had occasion to do once or twice, and which I imagine other Radian programmers may also find to be a useful practice: the object’s definition of “self” shadows the module’s “self”, making it awkward to reach the other members of the module. There are workarounds, of course, but they suck.
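
The shadowing problem can be imitated in Python, with a class standing in for the module. This is only an analogy: the names are invented, and the rename shown here is one guess at what such a workaround looks like, not a transcription of Radian code:

```python
class Shapes:
    """Stands in for a module; `unit` plays the part of a module member."""

    unit = "cm"

    def make_square(self, side):
        # Workaround: keep the outer self reachable under another name.
        module_self = self

        class Square:
            def describe(self):
                # Here `self` is the Square, so the outer self is hidden.
                return f"{side}{module_self.unit} square"

        return Square()


description = Shapes().make_square(3).describe()
```

Every nested object needs its own extra name for the enclosing scope, which is the sort of clutter the change to modules is meant to avoid.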

I’ve just committed some code which changes modules so that the implicit parameter referring to the current module is now simply the name of the module file, minus its “.radian” suffix: that is, it’s the same name you would use to import the module from another file. This has the pleasant implication that references to module members look the same inside the module as they would from outside – though of course code inside the module can refer to private members, while code outside the module cannot.

It does feel just a little strange to have the identifiers available inside a source file depend on a piece of metadata like the file’s name, but the import system is already committed to the idea that filenames matter. It’s conceptually weird, but in practice it’s just requiring you to do something you were probably going to do anyway.


Resolved: constructor parameters

Posted: July 18th, 2012 | Author: Mars | Filed under: Design

With Aaron Ballman’s help, I have resolved yesterday’s question with these changes:

  • Object constructor parameters will no longer become members of the result object. I may introduce explicit memberization syntax some day, which would allow this feature when convenient. Making the behavior optional and off-by-default allows me to upgrade later if necessary without breaking any current code.
  • The compiler will detect references to member vars which do not go through “self” and report a helpful error. I may add direct member access in the future, in place of this error, but it’s not clear that this would improve code clarity.

Constructor parameter memberization was clever and sometimes helpful, but there was no way to turn it off if you didn’t want it, and that’s something I try to avoid. I have updated all the code in the standard library, and while there are a handful of new var declarations now which simply copy a parameter value, the code is clearer overall.


Constructor parameters and object members

Posted: July 17th, 2012 | Author: Mars | Filed under: Design

Object constructors are functions which create and return an object. Everything defined inside the body of the constructor becomes a member of the result object. In the current implementation, “everything” includes the constructor’s parameters. You can always name the parameters with an underscore to make them private, but they will be included as object members all the same.

I thought this might be convenient since I frequently find myself writing do-nothing C++ constructors which merely initialize a list of member variables with a corresponding list of parameters. Easy, I thought: I’ll just bake that pattern into the language. After trying it out for a year or two, however, I find that I avoid the parameter-as-member feature more often than I take advantage of it. Of the 22 parameterized constructors in the radian library right now, only three use the “memberization” feature on purpose, with another one appearing to use it by accident. When I do use the feature, the code doesn’t look right; at a quick glance, the object appears to be missing some of its declarations. I have to remind myself to look at the parameter list for some extra names.
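
For readers who have not met the pattern, here is a Python rendering of both styles. The class names are invented, and Python's dataclass is only a loose analogue of memberization, since it derives the constructor from the member list rather than the other way round:

```python
from dataclasses import dataclass


class PointExplicit:
    def __init__(self, x, y):
        # The do-nothing constructor: each parameter is copied into a member.
        self.x = x
        self.y = y


@dataclass
class PointMemberized:
    # The constructor parameters and the members are one and the same list.
    x: int
    y: int


a = PointExplicit(1, 2)
b = PointMemberized(1, 2)
```

The second form saves typing, but the members are all public and mutable, which is the drawback discussed next.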

The big problem with memberization is that parameters are always vars, and yet I almost never want to give an object public vars. If anything is going to change one of the object’s vars, it should be one of the object’s own methods.

If constructor parameters are not object members, then, what should they be instead? Since all functions and methods are placed inside the object block, the constructor parameters are in scope throughout the object. What should it mean when a method refers to them? Ordinary function parameters are mutable, so it seems that constructor parameters should be vars as well – but that is only possible if they are object members. On the other hand, constructor bodies only contain definitions, and cannot assign new values to vars – only methods can do that. Perhaps it’s OK to prevent methods from changing parameter values, since the parameters belong to the constructor, and not to the object. It does feel inconsistent, though, because a method can assign new values to any other var defined in the constructor!

I’m not sure what to do with this. Simplicity and internal consistency drive me toward a solution which looks good on the surface but turns out to be awkward and a bit hard to read in practice. Pushing the design the other direction would create more rules and more clutter. Perhaps the language can afford it, since Radian’s objects are already mechanically much simpler than those found in most other languages.

This is the kind of problem that makes language design hard in an interesting way.


Asynchronous tasks

Posted: July 16th, 2012 | Author: Mars | Filed under: Design, Progress

I’ve gone back and forth and back again on the nomenclature: the current implementation adds a sync operator. A function which contains a sync becomes a task generator in exactly the same way that a function which contains a yield becomes a sequence generator.

A task generator is a function which returns a task; a task represents a series of related actions. Each action holds a response from the previous action; if the task is_running, you may send a new value. This updates the action pointer, creating a new response.

This scheme allows a program to describe a complex chain of asynchronous actions and continuations using normal imperative syntax. You don’t need to break your code up into a lot of nested callbacks, or laboriously transform a simple loop into some object with state; instead you can use the sync operator and let the compiler do that work for you.

This is very similar to the async function system in C# or Visual Basic, with Radian’s sync operator taking the place of C#’s await. There’s no need to explicitly declare that the function is async, though; the compiler will figure that out. It is also very similar to Python’s enhanced generators, though Python fuses yield and sync into a single operator, reusing iterators as asynchronous tasks. I considered this approach for Radian, but extending iterators in that way turned out to significantly impede the compiler’s ability to extract map/reduce operations out of loops. The constraints are an important part of the design, so I kept the two mechanisms separate.
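
For comparison, this is roughly what the fused Python form looks like: a single yield both hands a response to the caller and receives the next value sent in. The generator and its messages are invented for illustration, and the correspondence with Radian's response and send is only approximate:

```python
def greeter():
    """A generator used as a task: each yield hands out a response and
    then waits for the caller to send the next value."""
    name = yield "who are you?"
    place = yield f"hello {name}, where are you from?"
    yield f"{place} sounds nice"


task = greeter()
responses = [next(task)]                 # run to the first yield
responses.append(task.send("Ada"))       # resume with a value, get a response
responses.append(task.send("Turin"))
```

The body reads as straight-line imperative code even though control passes back to the caller at every yield.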

The point of all this engineering, of course, is that I can now redesign the I/O API around the asynchronous task system. At present, writing a Radian program which performs any kind of I/O interaction or touches global system state in any way is a masochistic exercise in long chains of callbacks. You can’t really use the language the way it’s meant to be used, since you have to turn your code inside out just to talk to the filesystem. With the new I/O model, your entire program will effectively be one big asynchronous task, and only the presence of the sync keyword will distinguish a normal function call from one which performs some IO action.

Inside a sequence generator, one can either yield a single value into the sequence output, or yield from another sequence to splice all of its values in as though the current generator had yielded them itself. Inside an asynchronous task, however, the sync operator expects that everything you return will be another asynchronous task. It’s as though sync is always doing yield from: you are always syncing from another asynchronous task. If you want to create a new atomic action which just returns some value, there will be a utility function in the task module which creates such a task which you can then sync from.
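
Python happens to use the same two spellings for sequence generators, so the splice behaviour can be shown there. This illustrates the sequence side only; it is not Radian code and says nothing about how sync is implemented:

```python
def inner():
    yield 2
    yield 3


def outer():
    yield 1             # a single value goes into the output
    yield from inner()  # every value of inner() is spliced in
    yield 4


values = list(outer())
```

The caller sees one flat sequence and cannot tell which values came from the nested generator.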