Up to Main Index                             Up to Journal for April, 2023

                     JOURNAL FOR SUNDAY 23RD APRIL, 2023
______________________________________________________________________________

SUBJECT: What makes a minimal programming language?
   DATE: Sun 23 Apr 18:41:50 BST 2023

I’ve had many questions from curious readers about Mere: “How did you design
Mere?”, “How can I design a programming language?”, “What are the minimum
requirements for a programming language?”, “What features should a programming
language have?”. Writing this article has taken me quite a while, it’s quite a
long one.

I’m not an expert on creating programming languages and these are just my
thoughts. What follows is what I consider to be the minimal requirements for a
usable programming language. If you have ever thought of writing your own
programming language, or wondered what it involves, I hope you find it useful…

What actually makes up a programming language? A programming language should
provide at least the following:


    • Data types:
        - Integer
        - Floating point
        - Characters
        - Strings
        - Boolean
    • Operators:
        - Assignment
        - Arithmetic
        - Comparison
        - Logical
    • Functions
    • Expressions and statements
    • Control flow constructs:
        - Sequential
        - Conditional
        - Iterative
    • Comments
    • Syntax
    • Built-in functions


DATA TYPES
‾‾‾‾‾‾‾‾‾‾
Data types are used to define the data stored and manipulated by a program.
The data type of a variable is used to determine how the variable may be used
and which operators apply to it.

What are the minimal data types that a programming language should support?


    Integer numbers: whole numbers. For example: 3, 5, -42. They can be used
      for indexing, counting, arithmetic and logic.

    Floating point numbers: fractional numbers or real numbers. For example:
      0.025, -3.14, 99.99. They can be used for measurement and calculations.

    Characters: a single character or symbol. For example: ‘A’, ‘7’, ‘Σ’, ‘→’.
      Used for text processing and a component of larger strings.

    Strings: a sequence of characters. For example: “A7Σ→”. Strings are used
      for storing and manipulating textual information such as names and
      addresses.

    Boolean: a simple true or false flag. Used for conditional flows, logic
      and boolean algebra.


From these basic data types more complex composite types may be derived. For
example: arrays, maps and structures.

For the data types supported, each should have a formal way of expressing
literal values. For example, an integer may be expressed as decimal (42),
hexadecimal (0x2A), octal (0o52) or binary (0b101010).


OPERATORS
‾‾‾‾‾‾‾‾‾
Operators in a programming language specify a mathematical, relational or
logical operation that should be applied to one or more operands. For example:
“1 + 2”, here the operator for addition ‘+’ is applied to the operands ‘1’ and
‘2’, the result being the addition of ‘1’ to ‘2’.

What are the minimal set of operators a programming language should support?


    Assignment operators: You need to be able to assign values to variables.
      For example: “x = 5”. In some languages you can also assign expressions
      to variables. For example: “p = print("hello world")”

    Arithmetic operators: Operators are needed for basic arithmetic at a
      minimum; addition, subtraction, multiplication and division. For
      example: “x = 1 + 2”

    Comparison operators: The ability to test the equality or inequality of
      values in order to be able to make decisions and control logic flow. For
      example: “x > 0”, is the value of the variable ‘x’ grater than zero?

    Logical operators: logical operators allow combining and negating boolean
      conditions. They allow conditions to be evaluated for logic flow. For
      example: “x && y” returns a boolean value.

    Functions: while not an operator, functions allow defining reusable blocks
      of code that can take input parameters and return results. Without
      functions code cannot be modularised or abstracted.


EXPRESSIONS & STATEMENTS
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
Another important design decision is that of what is an expression and what is
a statement. An expression is a piece of code that can be evaluated to a
value. For example: “x + 2” evaluates to the sum of adding ‘2’ to the variable
‘x’. A statement is a piece of code that performs instructions but does not
return a value. An example statement might be “x = y”, where the variable ‘x’
is assigned the value of variable ‘y’.

Expressions can be used in place of other operands. For example: “x + y” is an
expression. If we have another expression “a + b” we can use “x + y” in place
of the operand ‘b’ to produce “a + x + y” which is a valid expression. This is
not possible with statements. For example, if we have the statements “x = y”
and “a = b” we cannot use “x = y” in place of the operand ‘b’ as “a = x = y”
is invalid due to the statement “x = y” not returning a value that can be
assigned to ‘a’.

Another difference of expressions and statements is that statements cause side
effects while expressions evaluate to values. For example the statement “print
x” has the effect of printing the value of ‘x’ but returns nothing. While the
expression “x * y” returns the value of multiplying ‘x’ and ‘y’ but effects
nothing.

Expressions have fixed execution while statements can have variable execution.
For example, the expression “x + 2” always executes adding ‘2’ to the value of
‘x’. The statement “if x > 0 {…}” executes code differently depending on the
value of ‘x’ and the code inside the braces.

Depending on the language, expressions can be used as statements by ignoring
the return value. What normally would be regarded as a statement, such as
“x = y”, may be implemented as an expression by returning a value — in this
case assignment may return the value assigned. This would allow our previous
expression “a = x = y” to be valid.

In Mere everything that can be an expression is an expression. Mere has very
few statements.


CONTROL FLOW CONSTRUCTS
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
The execution flow of a program is the sequence in which expressions and
statements are executed. The flow can be altered using control flow
constructs. What are the minimal execution flow constructs a programming
language should provide?


    Sequential flow: simply executing code in linear order one line at a time.

    Conditional flow: executing code based on a selection criteria. For
      example: “if x<0 { y-- } else { y++ }”, dependant on the value of
      ‘x’ the ‘if’ statement will either execute “y--” or “y++”.

    Iterative flow: repeated execution of code based on a conditional. For
      example: “for x = 1; x<10; x++ { }” the ‘for’ statement will execute the
      code within the braces while the condition ‘x’ is less than 10 is true.


In addition a language will usually provide additional statements that can
effect the control flow constructs. For example: break, continue, goto and
return.


COMMENTS
‾‾‾‾‾‾‾‾
A programming language should provide a way of annotating the source code for
humans. Comments are also useful for marking lines or blocks of code that you
do not want to execute, for example during development. At a minimum the
language should provide:


    • Line comments
        These take the form of a starting marker, such as ‘//’ or ‘#’ and end
        at the end of the current line.
    • Block comments
        These take the form of a starting marker, such as ‘/*’ and end at a
        matching end marker, such as ‘*/’.


SYNTAX
‾‾‾‾‾‾
In order to write code in a programming language you need to have a syntax. A
syntax defines the rules for how expressions and statements, the code, are
written. Syntax can vary wildly and can be very strict or quite lax.

Mere takes a very lazy approach to syntax:


    • A line ends with a line-feed character.
    • White-space characters are only required to separate variable names,
      function names, labels and literals in lists.
    • Continuation lines end with an opening parenthesis or comma.
    • Line comments start ‘// ’ or ‘# ’ and finish at the end of the line.
    • Block comments are enclosed in ‘/*’ and ‘*/’ and may be nested.
    • Parentheses and commas in function calls are optional.
    • Variable and label names can be any Unicode string that is not an
      operator, built-in function name or data type literal.
    • Integer literals can be represented in decimal, hexadecimal, octal or
      binary.
    • Unsigned integers are represented as per integers literals with a ‘u’
      suffix.
    • Float literals can be represented in decimal or scientific notation.
    • String literals can be interpolated inside double quotes "…" or raw if
      enclosed in backticks `…`.
    • Boolean literals are ‘false’ or ‘true’.


BUILT-IN FUNCTIONS
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
When programming there are certain operations that need to be performed that
are not provided by operators alone. For example: printing the value of a
variable or querying the length of a string. The minimal built-in functions
might be:


    delete: delete an element from an array or map
      exit: terminate execution of the program with a status code
      goto: go to a label for flow control
        if: flow control for conditional execution
     input: receive data from standard input
    length: get the length of a string, array or map
     print: display data to standard output
      type: query the data type of a variable


Depending on how the programming language is implement some of these may be
provided as reserved keywords and some may be implemented in a standard
library — or a mixture of both.

                                      ∞

The above are what I regard as the minimum requirements for any programming
language. The bare minimum. Looking over the various lists Mere still has a
few features lacking: single characters, user defined functions and iterative
control flow constructs. These features are actively being worked on. After
that, I need to think about adding a useful standard library of functions.


THE HARD WORK BEGINS
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
You now have a rough design for your programming language. However, designing
a programming language and implementing it are two very different things. Once
you are able to write programs in the language you have designed, you have to
be able to execute them. In order to execute a program the source code must
either be interpreted or compiled.

For Mere I implemented a very simple compiler with three main stages. Once
compiled the output is handed over to the Execution engine to run the program:


    • Lexical Analysis
        Mere has a simple tokenizer that breaks the source code into
        individual tokens.
    • Code generation
        Mere uses a modified shunting yard algorithm to parse the tokens and
        turn them into reverse polish notation.
    • Symbolize
        Mere uses this stage to resolve tokens to their types: variables,
        literals, operators and functions.

    • Execution Engine
        Mere has a simple stack based execution engine that processes the
        compiled reverse polish.


Currently Mere requires the source code to run programs. However, it is a
simple exercise to save the program, the output of the Symbolize phase, and
load it into the Execution engine separately. I just haven’t got around to it
yet and while developing Mere it’s easier to re-compile and run programs.

After writing Mere, twice, I would give the following advice when implementing
your own programming language:


    • Write a simple parser that can break source code into tokens.
    • Write a simple interpreter that can execute the tokens.
    • Implement the integer data type.
    • Implement the basic assignment operator: =
    • Implement a print statement.
    • You should be able to parse “x = 5” and store 5 in the variable ‘x’.
    • Implement the basic arithmetic operators: ‘+’, ‘-’, ‘*’ and ‘/’
    • You should be able to do simple math: “x = 1 + 2”, “x = x + 7”
    • Implement the integer data types that will be indexed: arrays and maps
    • Implement indexed assignments, with nesting: “x[1][1][1] = 5”
    • Implement more data types, operators and built-in functions.


Why do I recommend this approach? Because I found that the step “implementing
indexed assignments” is the killer. This was where everything fell apart when
creating Mere and I had to start over. The steps above get you to that point
as quick as possible without additional work. If you can get something similar
to the following up and running you are well on your way to having your very
own programming language:


    >cat indexing.mr
    x = [int](        // x is an int map
      1 [int](        // with a nested int map
        2 [int](      // with another nested int map
          3 []int(2)  // with a nested int array
        )
      )
    )

    // Assign to nested array element
    println "Before: " x[1][2][3]
    x[1][2][3][0] = 3
    println " After: " x[1][2][3]

    // For extra credit, append array elements
    x[1][2][3] = []int x[1][2][3] 42
    println " Extra: " x[1][2][3]

    // Bonus: short-hand indexed assignment
    x[1][2][3][0] += 7
    println " Bonus: " x[1][2][3]

    >mere indexing.mr
    Before: [2]
     After: [3]
     Extra: [3 42]
     Bonus: [10 42]
    >


Once your programming language can do this, you just need to add more of the
same. More data types, more operators and more built-in functions…

Writing your own programming language can be a huge challenge but it can also
be fun and very rewarding. There are not many people that have their very own
programming language. Nothing beats the feeling you get when writing code in
your own programming language and then see it actually running :)

I will be releasing Mere as open source. The Git repository will be available
so that the history can be examined. The source code will be available and
maybe you could use it to jump start your own little programming language…

--
Diddymus


  Up to Main Index                             Up to Journal for April, 2023