Arithmetic types as lexically scoped declarations

Posted: January 15th, 2015 | Author: Mars | Filed under: Design

There are many algorithms for evaluating arithmetic operations, each useful in different situations. Sometimes you want simple, machine-word integers, and sometimes you want floating-point doubles; but sometimes it would be nice to have infinite-precision fixed-point arithmetic instead.

The usual solution for this problem is a system of related numeric types, with various rules about implicit or explicit conversions between those types, such that the output type for any given operation may be inferred from its inputs. Radian does exactly this, though its type implementations are nameless internal details and not explicitly declared or referenced.

This all works well enough, but every design has its tradeoffs, and I wonder if another strategy might be more convenient. Since Radian types are implicit, there’s no straightforward way for a programmer to specify the type of computation they want, and thus the math package has to maintain as much precision as it can – whether that information is actually useful or not. For the great majority of arithmetic operations, simple machine-word integers are plenty, but the Radian library has to overflow into bignums just in case the programmer later decides to care.

Another weakness of the traditional arithmetic model is that there is no interface for specifying behavior when there are multiple valid possibilities. What should the math package do when a program divides by zero? Should it raise an exception, return some special NaN code, or merely approximate infinity and get on with life? Each could be the right answer for some situation, but language designers are generally obligated to pick one and hope it will work for everyone.

What if we separated the ideas of numeric type and arithmetic type? What if, instead of delegating arithmetic operations to number objects, we delegated them to some “calculator” object? One might have a “machine-word integer” calculator, an “IEEE double” calculator, or a “4x IEEE single vector” calculator, each one implementing the various arithmetic operators using some consistent mechanism. Perhaps “calculator” is an interface, with methods named “add”, “subtract”, “multiply”, and so on. The standard library might provide some common calculators, as listed above, but programmers with specialized needs could implement their own calculators in any fashion they saw fit, with any specialized rounding, approximation, or exception-handling behavior they might happen to need. Instead of specifying the types of variables, then, one would specify the types of computations.
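Here is a rough sketch of the calculator idea, written in Python rather than Radian. Every name in it (Calculator, WordCalculator, NanCalculator, the 64-bit width) is an assumption made for illustration and not part of any Radian library; the point is only that the same computation can be handed different overflow and divide-by-zero policies.

```python
import math
from typing import Protocol


class Calculator(Protocol):
    """Hypothetical interface: one method per arithmetic operator."""

    def add(self, x, y): ...
    def divide(self, x, y): ...


class WordCalculator:
    """Assumed 64-bit machine-word integers: wrap on overflow, raise on x/0."""

    MASK = (1 << 64) - 1

    def _wrap(self, value):
        # Reinterpret the low 64 bits as a signed two's-complement integer.
        value &= self.MASK
        return value - (1 << 64) if value >= (1 << 63) else value

    def add(self, x, y):
        return self._wrap(x + y)

    def divide(self, x, y):
        if y == 0:
            raise ZeroDivisionError("integer division by zero")
        # Truncate toward zero, as machine integer division usually does.
        quotient = abs(x) // abs(y)
        return self._wrap(-quotient if (x < 0) != (y < 0) else quotient)


class NanCalculator:
    """Assumed IEEE-style doubles: division by zero yields NaN, never raises."""

    def add(self, x, y):
        return float(x) + float(y)

    def divide(self, x, y):
        return math.nan if y == 0 else float(x) / float(y)


def average(calc: Calculator, a, b):
    # The caller picks the calculator; the computation itself is unchanged.
    return calc.divide(calc.add(a, b), 2)
```

Calling average(WordCalculator(), 3, 4) truncates to an integer, while average(NanCalculator(), 3, 4) keeps the fractional part; neither number needed to know which kind of arithmetic it was taking part in.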

Let’s explore how this might work in Radian. At present, arithmetic operators are syntactic sugar for a method call on the left operand, where each operator has a specific name, so these statements are equivalent:

def foo = bar + baz
def foo = bar.add(baz)

Let’s do something else instead: we’ll still call an “add” method, but we’ll imagine that there is some object named “arithmetic” which implements it. These two statements would then become equivalent:

def foo = bar + baz
def foo = arithmetic.add(bar, baz)

Perhaps the language would provide an implicit global definition for “arithmetic” which links against the existing standard library code. Outer-scope symbols can always be overridden by local symbols, so any function or object or control structure would be free to apply its own calculator object by merely defining its own “arithmetic” symbol:

function float_add(x, y):
  import fancy_arithmetic
  def arithmetic = fancy_arithmetic.configure(my_handler, 42, true)
  result = x + y
end function

The addition function still compiles to the same arithmetic.add(x,y) as ever, but the function has provided its own definition of arithmetic, so it can control the algorithm used. In this example I am imagining that it might include some parameters detailing the desired exception behavior and precision.
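The desugaring can be imitated in Python as a thought experiment. This sketch rewrites binary operators into calls on whatever “arithmetic” symbol is in scope, using Python's ast module; DesugarOperators, DefaultArithmetic, and SaturatingArithmetic are hypothetical names, and none of this reflects how the Radian compiler is actually built.

```python
import ast


class DesugarOperators(ast.NodeTransformer):
    """Rewrite `a + b` into `arithmetic.add(a, b)`, and likewise for - * /."""

    METHODS = {ast.Add: "add", ast.Sub: "subtract",
               ast.Mult: "multiply", ast.Div: "divide"}

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested operands first
        method = self.METHODS.get(type(node.op))
        if method is None:
            return node  # leave other operators untouched
        target = ast.Attribute(value=ast.Name(id="arithmetic", ctx=ast.Load()),
                               attr=method, ctx=ast.Load())
        call = ast.Call(func=target, args=[node.left, node.right], keywords=[])
        return ast.copy_location(call, node)


class DefaultArithmetic:
    """Stands in for the implicit global definition."""

    def add(self, x, y):
        return x + y


class SaturatingArithmetic:
    """A made-up calculator whose sums never exceed a limit."""

    def __init__(self, limit):
        self.limit = limit

    def add(self, x, y):
        return min(x + y, self.limit)


SOURCE = """
def plain_add(x, y):
    return x + y

def clamped_add(x, y):
    arithmetic = SaturatingArithmetic(100)  # shadows the outer definition
    return x + y
"""

tree = ast.fix_missing_locations(DesugarOperators().visit(ast.parse(SOURCE)))
scope = {"arithmetic": DefaultArithmetic(),
         "SaturatingArithmetic": SaturatingArithmetic}
exec(compile(tree, "<desugared>", "exec"), scope)

plain_add = scope["plain_add"]
clamped_add = scope["clamped_add"]
```

Both functions contain the same x + y, but clamped_add resolves “arithmetic” to its own local definition, so ordinary scoping rules decide which algorithm runs.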

Since this is a lexical structure, not a dynamic part of the call stack, it is still possible to determine a value’s type at compile time – in fact it becomes much easier to determine what kind of result a given arithmetic operation will have, since the compiler can always tell which calculator object is currently in play. If the language library defined some standard calculator type which could be implemented efficiently using hardware primitives, the compiler might be able to detect that and generate accordingly more efficient code, instead of leaving all the decisions to runtime as it currently must.

It seems unlikely that this is a new idea, but I haven’t been able to find any references to previous experiments along these lines. If you’re familiar with such a project I would love to hear about it.

Target constraints and module names

Posted: September 10th, 2013 | Author: Mars | Filed under: Design

Following from yesterday’s design discussion, here is the plan I am considering for platform-specific code.

At present, a statement of the form import foo generates an extern reference to an object named foo, then pushes “foo.radian” onto the compile queue. When the compiler reaches “foo.radian”, it generates a static singleton object named foo whose members are the declarations found in the file.

The new system will add awareness of the target OS and architecture. Instead of searching for “foo.radian”, the compiler will search through a sequence of possible file names. Where $OS is the OS component of the LLVM target and $ARCH is the architecture component, the compiler will search for these files, in order, compiling the first one it finds and ignoring the rest:

  1. foo-$OS-$ARCH.radian
  2. foo-$OS.radian
  3. foo-$ARCH.radian
  4. foo.radian

We already have a distinction between the module name, foo, and the file name, “foo.radian”; this change will simply add an additional, optional suffix to the file name, leaving the module name unchanged. Since the compiler will always search for each possible implementation file in order, it will be impossible to import more than one different implementation of the same module for a single target build.
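The search order can be sketched in a few lines of Python. The function names candidate_file_names and resolve_module are hypothetical, and the set of available files stands in for whatever directory lookup the compiler would really perform.

```python
def candidate_file_names(module, os_name, arch):
    """List the file names to try for a module, most specific first."""
    return [
        f"{module}-{os_name}-{arch}.radian",
        f"{module}-{os_name}.radian",
        f"{module}-{arch}.radian",
        f"{module}.radian",
    ]


def resolve_module(module, os_name, arch, available):
    """Return the first candidate present in `available`, or None."""
    for name in candidate_file_names(module, os_name, arch):
        if name in available:
            return name  # later, less specific candidates are ignored
    return None
```

With foo.radian and foo-linux.radian both present, a linux build resolves import foo to foo-linux.radian and a macosx build falls through to foo.radian; in either case exactly one file supplies the module.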

I chose the hyphen instead of Go’s underscore because it is already forbidden from module names. This means it is impossible to accidentally import against a specific platform implementation; even if you only supply an implementation of the file for a single platform, you must always import the module name alone.

Currently supported values of $OS would be “linux” and “macosx”, while $ARCH can be “i386” or “x86_64”.

Conventions for platform-specific source files

Posted: September 9th, 2013 | Author: Mars | Filed under: Design

The Go language uses OS- and architecture-specific file name suffixes to identify platform-specific implementations of a module. There is a more complex system of build constraints of which the name suffix is only one example, but it’s a pretty powerful convention.

I wonder if a similar scheme would work for Radian. Import statements are already abstracted from filenames, after all – one imports a library named “foo.radian” with the statement import foo and not a more C-style import "foo.radian". Perhaps the build tool could first check for a file named “foo_macos.radian”, if one happens to be targeting Mac OS, and only fall back to “foo.radian” if no platform-specific file exists.

In C++ code, one typical pattern is to define an abstract base class whose methods implement common behavior, calling pure virtual functions for platform-specific operations. Each platform then defines a subclass which implements those methods. Another similar strategy uses the bridge pattern: instead of subclassing, the general class delegates its operations to a platform-specific implementation object.

How might we use this strategy in Radian, if its import statement could transparently pick an appropriate implementation file for the target platform?

In order to use the abstract-base-class approach, we’d have to introduce the notion of module inheritance. Your import foo would actually target foo_linux.radian, or whatever was appropriate, and each implementation would then somehow declare itself to be an extension of foo_common.radian. Perhaps this would all happen magically if the build tool observed the presence of both a foo_platform.radian file and a foo.radian file, but it seems like a lot of magic – and if we’re going to postulate the introduction of a module-inheritance scheme, it seems likely that people might want to use it for purposes other than simply isolating platform-dependent code; it should be called out explicitly.

This makes me think the bridge pattern is a better bet. If you were to design a network library, for example, you might define a network.radian which provides the public API. This file might then be packaged with network_internal_macos.radian, network_internal_windows.radian, and network_internal_linux.radian. The main network.radian module would then import network_internal, automatically picking up the appropriate version for the current target. External users would import network.radian and use some reasonable library API, while the library itself would delegate its grungy details to the methods in the platform-specific files.
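For readers who prefer code to pattern names, here is a minimal Python sketch of that bridge arrangement. The classes stand in for the platform-specific files, and every name (Network, open_socket, the target strings) is invented for illustration; in the scheme described above the import mechanism would do the selecting, not a dictionary.

```python
class LinuxNetworkInternal:
    """Stands in for a hypothetical network_internal_linux.radian."""

    def open_socket(self, host, port):
        return f"linux socket to {host}:{port}"


class MacosNetworkInternal:
    """Stands in for a hypothetical network_internal_macos.radian."""

    def open_socket(self, host, port):
        return f"macos socket to {host}:{port}"


# In the proposed scheme the build tool makes this choice from file names.
_IMPLEMENTATIONS = {
    "linux": LinuxNetworkInternal,
    "macos": MacosNetworkInternal,
}


class Network:
    """Public API: callers never touch the platform-specific object."""

    def __init__(self, target):
        self._impl = _IMPLEMENTATIONS[target]()

    def connect(self, host, port):
        # Delegate the grungy details to the platform implementation.
        return self._impl.open_socket(host, port)
```

The public class has no subclasses and no knowledge of any particular platform; it only holds a reference to whichever implementation was selected.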

Given that Radian requires source file names to consist of a legal identifier followed by the extension “.radian”, it might make more sense to use a hyphen or a period than an underscore. Imagine network_internal-linux.radian, for example, which you’d import network_internal and then refer to via x = and the like.

Case-folding function names

Posted: April 30th, 2013 | Author: Mars | Filed under: Design

Every programming language which manipulates strings offers a pair of functions which convert text to upper or lower case. These functions are often used to perform case-insensitive comparisons – you just convert both strings to either upper or lower case first. This generally works, but it fails for some scripts, and so the Unicode standard defines a case-folding transform which produces a normalized string suitable for caseless matching.

Radian’s string library implements to_upper and to_lower, and it seems reasonable that it should offer a case-folding function too. But what to call it? Hardly anyone else seems to be offering such a function: I can’t find one in .NET, in PHP, or in Ruby. Python 3.3 added str.casefold, and Go has a mysterious function called SimpleFold, but whatever that one is doing, it isn’t what I’m trying to do.
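To see what caseless matching needs beyond to_lower, here is a small sketch in Python rather than Radian. It relies on str.casefold, which Python added in version 3.3 and which applies Unicode case folding; the helper name caseless_equal is my own invention.

```python
def caseless_equal(a, b):
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


# Lowercasing leaves the German sharp s alone, so the comparison fails.
lowercase_matches = "Straße".lower() == "STRASSE".lower()

# Case folding maps the sharp s to "ss", so the comparison succeeds.
folded_matches = caseless_equal("Straße", "STRASSE")
```

Here lowercase_matches is False and folded_matches is True, which is exactly the gap a case-folding function in the string library would close.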

Change from ‘const’ to ‘def’

Posted: October 25th, 2012 | Author: Mars | Filed under: Design, Progress, Syntax

Radian offers two simple symbol types: var lets you define a symbol to which you can later assign a new value, while const is a definition which cannot later be changed. I had expected to make heavy use of const in Radian code since it echoes a pattern I use frequently in C or C++, but in practice I’ve found myself shying away from it. The reason is entirely superficial: it doesn’t feel right, because the values I would be assigning just aren’t constants. Instead, most of the consts I would define are intermediate values – things that will change on every invocation of the function or every pass through the loop, but which can remain unchanged once I’ve defined them. As such it just feels weird to call them constants, and so I tend to define them as var even if I have no intention of ever redefining them.

I still think that const has a good place; in fact I think that using it heavily is good style. I’ve decided therefore to rename it. Stealing a keyword from Python, “constants” are now “definitions”, using the keyword def. I’d avoided def since Python uses it for function definitions, specifically, while Radian functions use function, but sometimes one’s nice clean abstract ideas don’t pan out in practice.

It’s about time to freeze the syntax for a while. Aside from the half-finished regex literals, which are actually present in 0.6, I don’t see any further syntax changes on the horizon. All the upcoming work is in libraries and the toolchain.

File reading API

Posted: September 27th, 2012 | Author: Mars | Filed under: Design, Progress

The regex system is turning out to be a larger project than I had anticipated. It’s still important, but as the length of time it appears likely to consume continues to grow, its immediate priority is dropping. I’m still working on it, but I’m not going to let it delay the long list of smaller pieces of functionality impeding other use-cases.

I am continuing to move away from the original monadic IO system. The latest change is the file-input mechanism: the function that used to be io.read_file is now file.read_bytes. I want it to be clear that the result of this function is a byte buffer, not a string. The buffer object implements the sequence interface, so under a less explicit name an unobservant ASCII-using programmer might get disturbingly far along without noticing that what they’d read was not actually text decoded from its byte form, but merely a sequence of bytes. By naming the function read_bytes I hope to plant a seed of puzzlement which will lead the programmer to its eventual sibling, read_string, which will require you to specify the encoding of the text file you are reading.
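The intended split between the two functions might look something like this Python sketch. The function names mirror the ones above, but the signatures are my assumption, since read_string does not exist yet.

```python
from pathlib import Path


def read_bytes(path):
    """Return the file's contents as raw bytes, with no decoding at all."""
    return Path(path).read_bytes()


def read_string(path, encoding):
    """Return the file's contents as text.

    The encoding argument is deliberately required (an assumed design):
    there is no default to fall back on without thinking about it.
    """
    return read_bytes(path).decode(encoding)
```

A file holding the UTF-8 bytes for “café” comes back from read_bytes as five bytes, and only read_string with the right encoding turns it into four characters; ask for the wrong encoding and the result is visibly wrong rather than silently assumed.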

Another change is the elimination of the filespec object. I’d intended to use an abstract mechanism for describing a file, but it’s ultimately nothing but a thin wrapper around a path string. Since every platform I care about uses path strings to identify files, I’ve decided to drop the wrapper. Perhaps there will eventually be a module in the library which implements platform-localized transformations on path strings.

Regex literals

Posted: September 6th, 2012 | Author: Mars | Filed under: Design, Syntax

While I’m trying to err on the side of tradition with Radian syntax, and would thus like to use the /regex/ style, Radian already uses the forward-slash character for division. This doesn’t stop Perl, Ruby, or Javascript, but giving the slash character a contextual meaning makes those languages much more difficult to parse. With a non-context-free grammar, you can’t simply parse in one pass and analyze later: you need to analyze the code you’ve seen in order to understand how to parse the characters ahead. (It’s even worse with Perl, where you can’t correctly interpret a slash character without executing all the preceding code!)

This undesirable trait obviously hasn’t been a fatal obstacle to success for those languages, but since I’m starting with a clean sheet I might as well maintain the LL(1) constraint. This will keep the parser simple and make life much easier for any future intelligent editors and other static analysis tools.

I think I’ll combine Ruby’s alternate delimiter syntax and Clojure’s hash-quote syntax: %/regex/, or alternately %"regex". The percent sign is otherwise unused, and it will introduce the regex literal; the next character will be a delimiter, which can be either a forward-slash, single-quote, or double-quote.

A regex is not a string literal, so backslashes will be interpreted by the rules of the regular expression sublanguage rather than the rules of the string-literal sublanguage. The delimiter character is an escape, breaking out of the regex sublanguage; therefore you should pick a delimiter which you don’t intend to use in your regex.

Of course someone will eventually need to construct a pathological regex including all three possible delimiters, so I’ll borrow the quote-doubling mechanism from BASIC. Within a regex literal, doubling up the delimiter character will not end the regex, but will insert a single instance of the delimiter. For example, these two will be equivalent:
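As a sketch of how a scanner might apply the doubling rule, here is a small Python function; scan_regex_literal is a hypothetical name and this is not the Radian lexer. Under this rule %/a//b/ and %"a/b" would both produce the regex body a/b, and backslashes pass through untouched for the regex engine to interpret.

```python
DELIMITERS = "/'\""


def scan_regex_literal(source, start=0):
    """Scan a %-introduced regex literal beginning at source[start].

    Returns (body, index just past the closing delimiter).
    """
    if source[start:start + 1] != "%":
        raise ValueError("regex literal must begin with '%'")
    delimiter = source[start + 1:start + 2]
    if delimiter == "" or delimiter not in DELIMITERS:
        raise ValueError("expected /, ' or \" after '%'")

    body = []
    i = start + 2
    while i < len(source):
        ch = source[i]
        if ch == delimiter:
            # One character of lookahead: a doubled delimiter is a literal.
            if i + 1 < len(source) and source[i + 1] == delimiter:
                body.append(delimiter)
                i += 2
                continue
            return "".join(body), i + 1
        body.append(ch)  # backslashes are kept as-is for the regex engine
        i += 1
    raise ValueError("unterminated regex literal")
```

The scanner never needs more than the next character to decide what to do, so the doubling rule fits within the single-token lookahead constraint.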


Regex literals in various languages

Posted: August 22nd, 2012 | Author: Mars | Filed under: Design, Syntax

Languages in which regexes are first-class syntax elements:

Awk: /I (love|hate) regexe(s|n)/
Perl: /I (love|hate) regexe(s|n)/ or m|I (love\|hate) regexe(s\|n)|
Ruby: /I (love|hate) regexe(s|n)/ or %r!I (love|hate) regexe(s|n)!, where the bang mark can be any delimiter
Javascript: /I (love|hate) regexe(s|n)/
Clojure: #"I (love|hate) regexe(s|n)"

Languages which offer “raw” strings with no internal escapes:

Scala: """I (love|hate) regexe(s|n)"""
Python: r"""I (love|hate) regexe(s|n)""" – the r prefix is needed; a plain triple-quoted string still processes backslash escapes

Languages which offer minimally escaped strings:

PHP: 'I (love|hate) regexe(s|n)' – backslash escapes backslash and single-quote, but no other characters
Python: r"I (love|hate) regexe(s|n)" – can use either single or double quote

The oldest example of a first-class regex literal I can find appears to be in Awk. Ruby and Javascript copied it from there by way of Perl.

String literals and regular expressions

Posted: August 10th, 2012 | Author: Mars | Filed under: Design

Working on the design for the regular expression system, I’ve run into a problem anyone who’s written regexes in C knows well: encoding a piece of code written in the regular-expression language into a string literal for the C language yields an unreadable mass of backslashes. Since Radian’s string literal syntax is derived from C’s – the only difference being Radian’s omission of octal literals – the same problems lurk just over the horizon here.

Different languages have solved this problem in different ways. Python uses both single- and double-quote characters to delimit backslash-escaped string literals, while an “r” prefix introduces a “raw” string mode in which backslashes are included in the string. Ruby uses a different approach: strings delimited with single-quotes use a minimal escape language, where the only escapable characters are the single-quote and the backslash itself, while strings contained in double-quotes use the usual C-derived escape language. Javascript is wussy as usual; single- and double-quotes are interchangeable, and all strings use a reduced list of the original backslash-escapes from C.
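Python's two modes make the backslash problem easy to see side by side. This is only an illustration in Python; the pattern and the helper name is_decimal are arbitrary choices of mine.

```python
import re

# Escaped literal: every regex backslash must itself be escaped.
ESCAPED = "\\d+\\.\\d+"

# Raw literal: backslashes reach the regex engine exactly as written.
RAW = r"\d+\.\d+"


def is_decimal(text):
    """Return True if text looks like digits, a dot, then more digits."""
    return re.fullmatch(RAW, text) is not None
```

The two literals produce the same string, but only the raw one still looks like the regular expression it contains.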

Back when I was working on REALbasic, I once wrote a piece of the IDE that needed to generate string literals containing arbitrary bytes. Strings in REALbasic were delimited with double-quotes, and the only escape mechanism they supported was that you could double up a quote mark to have it treated as a single character. There was a way to introduce arbitrary data into a string, but it involved a function call per byte, and that wouldn’t have worked for these data chunks. I solved the problem by hacking an undocumented feature into the compiler: a new type of string literal which supported the usual array of C-derived backslash escapes. Not the most elegant solution, perhaps, but going through the full design process would have taken more time than the feature I was working on could afford, and since I was the compiler guy it wasn’t like I was foisting extra maintenance work off on anyone else. (So far as I know the code is still in there, never to see the light of day…)

What’s the right solution for Radian? At present, there is but one type of string literal, which uses backslash escapes, but we are fortunately still in a flexible early state where radical changes are possible. Is the escape model the best default? Most string literals have no escapes, and most strings which do are intended for interpolation. Perhaps the default string literal should be a non-escaped “raw” string, like Ruby’s single-quoted strings or REALbasic’s string literals. Python uses prefixes to control the specific string literal subtype: I rather like the idea but it seems that strings should be raw by default, and that special processing should be explicitly turned on rather than off. For now, that might mean a prefixed backslash to enable backslash-escapes; in the future, a prefixed dollar-sign might enable interpolation syntax.

Module identifiers

Posted: August 10th, 2012 | Author: Mars | Filed under: Design, Progress

Modules are a lot like objects, and the implementation of module files in Radian’s compiler shares a great deal of code with the implementation of object blocks. One common element they’ve had is the use of self to refer to the current instance, the object on which the function or method was called.

This works fine until you define an object inside a module, something I’ve had occasion to do once or twice, and which I imagine other Radian programmers may also find to be a useful practice: the object’s definition of “self” shadows the module’s “self”, making it awkward to reach the other members of the module. There are workarounds, of course, but they suck.

I’ve just committed some code which changes modules so that the implicit parameter referring to the current module is now simply the name of the module file, minus its “.radian” suffix: that is, it’s the same name you would use to import the module from another file. This has the pleasant implication that references to module members look the same inside the module as they would from outside – though of course code inside the module can refer to private members, while code outside the module cannot.

It does feel just a little strange to have the identifiers available inside a source file depend on a piece of metadata like the file’s name, but the import system is already committed to the idea that filenames matter. It’s conceptually weird, but in practice it’s just requiring you to do something you were probably going to do anyway.