Persistent Data Structures

October 22, 2016


In this post we’ll discuss some basic data structures in functional programming and their implementation in OCaml. Our main reference is the first two chapters of Purely Functional Data Structures, by Chris Okasaki.

The first chapter explains persistent data structures and the second chapter provides examples of three common data structures in imperative programming and how to implement them in ML (a functional language).

Persistent Data Structures

The core idea behind designing data structures in functional languages is that they’re immutable: if you want to modify one, a new copy has to be made.

If we modified a data structure naively, we would need to clone the entire structure along with the modifications. Instead, this is done in a smart way, so that different versions of the data structure share parts of their contents. Two examples illustrate this idea: lists and binary trees.

Immutable Linked Lists

Lists are a central native data structure in functional languages. In OCaml, they’re implemented as linked lists. From our data structures class, we know that inserting an element at the beginning of a list is an O(1) operation, but it requires modifying pointers in the data structure. Since data structures are immutable, we need to make a copy to perform this operation.

Note that we only need to make the new node point to the beginning of the original list, which means we don’t have to clone the list; we just reuse it.

Inserting in the middle of the list, however, requires copying all the nodes that come before our newly inserted element, which can then point to the remainder of the original list. Consider the example below, where we insert the element 4 in the second position of the list.

Inserting a new element (4) in a linked list
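
To make this concrete, here is a minimal sketch of such an insertion (my own code, not from the book). Only the nodes before the insertion point are copied; the suffix of the original list is shared:

let rec insert_at pos value ls = match (pos, ls) with
  | (0, _) -> value :: ls
  | (_, []) -> [value]
  | (_, x :: rest) -> x :: (insert_at (pos - 1) value rest)

For instance, insert_at 1 4 [2; 1; 3] evaluates to [2; 4; 1; 3]: only the node holding 2 is cloned, while the nodes holding 1 and 3 are reused from the original list.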

Note that inserting an element at the beginning of a linked list is much more efficient than inserting at the end, both in time and space. If we were to insert at the end, we would need to copy the entire list, making insertion O(length of the list). As an experiment, we can write a function that generates a list representing a range of integers in two ways.

In example 1, we create the list by inserting each new element at the beginning of the list:

let rec list_range_fast start_range end_range =
  if start_range = end_range then []
  else start_range :: (list_range_fast (start_range + 1) end_range)

In example 2 we do it at the end:

let rec list_range_slow start_range end_range =
  if start_range = end_range then []
  else (list_range_slow start_range (end_range - 1)) @ [end_range]

If I run the slow version with start_range = 0 and end_range = 50000, it takes over a minute to run on my computer, while the fast version runs in a few milliseconds.

Immutable Binary Trees

Binary trees are a generalization of linked lists, and persistence works the same way. The key observation is that when inserting an element into a (binary) tree, only the nodes on the path from the root to the inserted node need to have their pointers modified, so the number of nodes we need to clone is at most the height of the tree. See the example below:

Inserting an element in a binary tree. Green nodes are the new ones
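
As a sketch of this idea (again my own code, for a plain unbalanced binary search tree), note how each recursive call rebuilds one node and shares the untouched subtree:

type tree = Leaf | Node of tree * int * tree

let rec insert x t = match t with
  | Leaf -> Node (Leaf, x, Leaf)
  | Node (left, v, right) ->
    if x < v then Node (insert x left, v, right)      (* right is shared *)
    else if x > v then Node (left, v, insert x right) (* left is shared *)
    else t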

With this basic concept understood for simple constructs like linked lists and binary trees, we can build more elaborate data structures on top of them: heaps and balanced binary search trees.

Leftist Heaps

A (minimum) leftist heap is a binary tree with the following properties:

1. The value of a node is smaller than or equal to the values of its children.

Let the spine of a node be the path defined by following the right children until a leaf is found (i.e. the rightmost path), and the rank of a node the length of its spine.

2. The rank of the left child is always greater than or equal to the rank of the right child.

It’s possible to show that the rank of a leftist heap with n nodes is at most floor(log(n + 1)).

Insertion. Before talking about inserting an element into a heap, let’s define the more general merge() function, which merges two leftist heaps, A and B, into one. We compare the roots of the heaps, A.val and B.val. If A.val <= B.val, we make A.val the root of the new heap, keep A.left as the left child, and make the right child the merge of A.right and B. If A.val > B.val, we do the analogous operation.

Note that the result of the merge might violate property 2, since we’re adding new nodes to the right subtree. This is easy to fix: we just need to swap the left and right subtrees when coming back from the recursion. This is what makeTree() does.
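
Here is a sketch of how merge() and makeTree() might look (the type layout and names here are mine; the full implementation is linked below). Each node caches its rank, so makeTree can compare the ranks of the subtrees and swap them if needed:

(* Node (rank, value, left, right) *)
type heap = Empty | Node of int * int * heap * heap

let rank = function Empty -> 0 | Node (r, _, _, _) -> r

(* Attach subtrees a and b under value v, putting the one with the
   higher rank on the left to restore property 2 *)
let make_tree v a b =
  if rank a >= rank b then Node (rank b + 1, v, a, b)
  else Node (rank a + 1, v, b, a)

let rec merge h1 h2 = match (h1, h2) with
  | (Empty, h) | (h, Empty) -> h
  | (Node (_, v1, l1, r1), Node (_, v2, l2, r2)) ->
    if v1 <= v2 then make_tree v1 l1 (merge r1 h2)
    else make_tree v2 l2 (merge h1 r2)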

Note that we always merge along the right spines, which means the number of merge steps is proportional to the largest rank among the original heaps; this implies we can do it in O(log(A.size + B.size)).

In the context of immutable structures, this implementation of a heap is efficient because the only nodes whose pointers change (even when a swap is needed) are on the spines of the trees, so we only need to create O(log n) new nodes.

The insertion process consists of creating a singleton heap with the new element and merging it into the target heap.

Returning the minimum element is trivial: we just return the element at the root. Removing the minimum is easy as well: we just merge the left and right subtrees using merge().
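
Continuing the sketch above, these operations are one-liners on top of merge():

let insert x h = merge (Node (1, x, Empty, Empty)) h

let find_min = function Empty -> None | Node (_, v, _, _) -> Some v

let delete_min = function
  | Empty -> Empty
  | Node (_, _, left, right) -> merge left right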

My implementation for the leftist heap is on github.

Binomial Heaps

Binomial heaps are an alternative heap implementation. Before defining a binomial heap, let’s introduce the binomial tree. A binomial tree can be defined recursively based on a property called rank. A single node is a binomial tree of rank 0. A tree of rank r > 0 is formed by combining two trees of rank r-1, making one tree the leftmost child of the other. We denote this operation linking.

Examples of binomial trees of different ranks

A binomial heap is a list of binomial trees ordered by increasing rank, with no two trees of the same rank.

Insertion. Before talking about inserting an element into a heap, let’s see how to insert a binomial tree of rank r into the heap. To keep the tree list ordered by rank, we need to traverse it to find the right place for insertion. Since we can’t have two trees with the same rank, if we run into this case we merge the two trees into one of rank r + 1. This can cascade, so we might need to repeat the process with the new tree.

Linking two trees is a direct application of the definition. We just need to decide which of the trees becomes the child of the other. Since we want a minimum heap, we must guarantee that the minimum element is always on top, so we always make the tree with the smallest root the parent.

Linking is a constant-time operation, since we only need to update a pointer in the root of the top tree, which also means we only need to clone that root node to obtain a new immutable binomial tree.
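
As a sketch (the type layout and names are mine; Okasaki’s version is similar), linking and tree insertion might look like this. A tree stores its own rank and keeps its children in decreasing order of rank, and a heap is a list of trees in increasing order of rank:

(* Node (rank, value, children) *)
type tree = Node of int * int * tree list
type heap = tree list

let rank (Node (r, _, _)) = r

(* Link two trees of equal rank: the one with the smaller root
   becomes the parent, so the minimum stays on top *)
let link (Node (r, v1, c1) as t1) (Node (_, v2, c2) as t2) =
  if v1 <= v2 then Node (r + 1, v1, t2 :: c1)
  else Node (r + 1, v2, t1 :: c2)

(* Insert a tree into the heap, cascading links while two trees
   have the same rank *)
let rec insert_tree t heap = match heap with
  | [] -> [t]
  | t' :: rest ->
    if rank t < rank t' then t :: heap
    else insert_tree (link t t') rest

(* Inserting an element is inserting a singleton tree of rank 0 *)
let insert x heap = insert_tree (Node (0, x, [])) heap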

For the complexity of traversing the heap list, first note that a binomial tree of rank r has 2^r nodes, which means a heap of n elements has at most log(n) trees. In fact, a neat way to represent a binomial heap of size n is by the binary representation of n. For example, 10 (dec) = 1010 (bin), so a heap of size 10 has a list of trees with ranks 1 and 3. This shows that we can traverse the list in O(log n), and in the worst case we might need to clone that many nodes.

Returning the minimum element requires scanning the list of trees: even though we know the minimum element is the root of one of the trees, we don’t know which tree, so this operation is O(log n). To remove this element, we can define an auxiliary function, removeMinTree(), which removes the tree with the minimum root from the tree list. Since we only want to remove the root of that tree, we need to re-insert its subtrees back into the heap.

One key observation is that, in a binomial tree of rank r, the children of the root are binomial trees of ranks 0 to r-1, which themselves form a binomial heap. We can then define a merge() function that merges two heaps using an idea similar to merge sort. Referring back to the analogy with the binary representation of the heap size, the merge operation is analogous to adding two binary numbers!
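
Continuing the sketch, merge() walks both tree lists like the merge step of merge sort; when two trees have the same rank, it links them and carries the result, just like a carry in binary addition:

let rec merge h1 h2 = match (h1, h2) with
  | ([], h) | (h, []) -> h
  | (t1 :: rest1, t2 :: rest2) ->
    if rank t1 < rank t2 then t1 :: merge rest1 h2
    else if rank t2 < rank t1 then t2 :: merge h1 rest2
    (* Equal ranks: link and carry, like adding binary digits *)
    else insert_tree (link t1 t2) (merge rest1 rest2)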

My implementation for the binomial heap is on github.

Red Black Trees

A red-black tree is a binary search tree in which every node is either Red or Black. It respects the following invariants:

1. No red node has a red child
2. Every path from the root to an empty node has the same number of black nodes

Property. The height of a red-black tree with n nodes is at most 2*floor(log(n + 1)).

Proof sketch. Suppose the tree has a path from the root to an empty node of length greater than 2*floor(log(n + 1)). This path must contain more than floor(log(n + 1)) black nodes, because we cannot have two consecutive red nodes (invariant 1); by invariant 2, the same is then true of every path from the root. Removing all the red nodes thus leaves a complete tree of height at least floor(log(n + 1)) + 1, which has at least 2^(floor(log(n + 1)) + 1) - 1 nodes, a number greater than n, a contradiction.

Membership. Since a Red-Black tree is a binary search tree, search can be done in O(height of the tree) which is O(log n) by the property above.

Insertion. Inserting an element into a red-black tree is similar to inserting it into a binary search tree. The challenge is that the insertion may violate one of the invariants, so we must perform re-balancing along the path we followed to insert the node.

We always color the inserted node Red. This doesn’t violate the black-node constraint, but it might violate the red-node one, in case its parent is also Red. If that happens, the grandparent of the inserted node is necessarily Black. We then have 4 possible scenarios, depicted as the top, right, bottom and left trees:

Unbalanced Red-Black trees and the result of the balancing operation

We assume that the subtrees a, b, c and d are balanced. For all these 4 cases we can perform a “rotation” to achieve the tree at the center.

The black-node constraint is not violated, because every path from the root still has the same number of Black nodes as before, and we fixed the consecutive Reds. Notice that y might now violate the red-node constraint with its own parent, so we need to apply the fix recursively up the tree.
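
In code, the four cases collapse into a single rotation, in the spirit of Okasaki’s balance() function (a sketch; a, b, c, d are the subtrees and x, y, z the values, with y the median, as in the figure):

type color = Red | Black
type tree = Empty | Node of color * tree * int * tree

let balance = function
  | (Black, Node (Red, Node (Red, a, x, b), y, c), z, d)
  | (Black, Node (Red, a, x, Node (Red, b, y, c)), z, d)
  | (Black, a, x, Node (Red, Node (Red, b, y, c), z, d))
  | (Black, a, x, Node (Red, b, y, Node (Red, c, z, d))) ->
    Node (Red, Node (Black, a, x, b), y, Node (Black, c, z, d))
  | (color, left, v, right) -> Node (color, left, v, right)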

In terms of complexity, insertion can be done in O(log n), since each rebalancing step takes constant time. Regarding immutability, notice that we only need to change nodes along the insertion path, so only O(log n) nodes have to be cloned.

My implementation for the Red-Black tree in OCaml is on github.


The first two chapters of this book are really interesting. I had seen binomial heaps and Red-Black trees before in data structure classes, and I had also implemented some data structures, such as AVL trees, in a functional language (Lisp) during college.

I wasn’t aware of the immutability of data in functional programming until much later, when I was learning Haskell. Okasaki introduces this concept as early as Chapter 2, which allows us to keep it in mind when studying the implementations of the functional data structures.

He doesn’t make it explicit in Chapter 3 that the data structures presented are efficient in terms of the extra memory necessary for cloning, but it is easy to see.

The ML syntax is very similar to OCaml’s, but it was a good exercise to implement the code on my own. I tried to make it more readable with comments and longer variable names. This also led me to learn a few OCaml constructs and libraries, including:

* How to perform assertions (See the binomial heap implementation)
* Unit testing (using OUnit2 library)
* The “or matching” pattern (see the balance() function in the Red-Black tree implementation)

Review: The Design of Everyday Things

September 5, 2016


Don Norman is the director of The Design Lab at University of California, San Diego.

In The Design of Everyday Things, he discusses human psychology, introduces many concepts around design and provides suggestions to improve the usability of products. He takes into account practical real-world challenges, such as time and budget constraints during development.

The book is divided into seven chapters which we’ll summarize in this short post.

1. The psychopathology of everyday things

This first chapter focuses on attributes of products that influence their usability. It introduces concepts that improve usability, such as affordances, mapping and feedback. Affordances help people figure out what actions are possible without the need for labels or instructions. They are relationships (between human and object), not (object) properties.

Affordances make it obvious that this side is to be pushed. The asymmetric bar suggests which side of the door to press.

Sometimes it’s not possible to make actions obvious, in which case we need signifiers to help with it. Signifiers include messages, symbols and legends.

Signifiers, labels in this case, aid users in deciding whether to pull or push.

Mapping is useful when the controls a human interacts with are not in the same place as the object being controlled. A common example is a switch and a light. When there are many lights to control, a physical correspondence between the layout of the switches and the layout of the lights makes it easier to figure out which switch controls each light.

The physical distribution of the switches maps to the actual locations of the lights.

My first reaction to the switch panel above is that it looks ugly and cluttered. One message I got from the book is that good design is not necessarily beautiful and minimal – sometimes these goals even conflict, because beauty and minimalism might hide affordances and signifiers.

Feedback is communicating the result of an action immediately. Examples include lighting up an elevator button when it has been pressed or, in web design, depressing a button and disabling it temporarily (if the result cannot be returned immediately).

A conceptual model is the simplified version of the system that the user keeps in their mind, often built by relating it to an existing product. One example is the use of terms like Desktop, Folders and Files in the GUI of an operating system, which relies on the existing model of organization from an office.

One example of a bad conceptual model is the oven regulated by a thermostat. If you want to pre-heat the oven quicker, one natural idea is to set the temperature to the maximum and then lower it when it’s ready. The problem is that this is not how thermostat ovens work. They have a heater providing a constant flow of heat, and they control the temperature by turning it on and off. The longer the heater stays on, the higher the temperature gets, but setting the dial to the maximum doesn’t make the oven reach a given temperature any faster.

2. The psychology of everyday actions

This chapter focuses on the user side, more specifically, what goes on in users’ heads when they interact with a product. He proposes breaking down an action into stages.

Stages of an action

He discusses levels of processing: visceral (instinct), behavioral (habit) and reflective (conscious). In the picture above, the stages are aligned by these levels. Intention and evaluation are both at the conscious level, plan and interpretation at the behavioral, and finally execution and perception are visceral.

Users blame themselves. Humans are usually eager to blame other people in day-to-day life, but when interacting with machines they often blame themselves, even when the confusion is caused by bad design.

3. Knowledge in the head and in the world

This chapter focuses on how we use knowledge to interact with a product. He categorizes knowledge into two: knowledge in the head (memory) and knowledge in the world (conventions, standards).

Delving into the workings of memory, he talks about short-term vs. long-term memory and how short-term memory can only keep a few items “in cache” (using a computer analogy). The author mentions how constraints help us remember things; for example, it’s easier to remember poems than prose because poems have a more rigid structure. He brings back ideas from Chapter 1, like conceptual models and mapping, which reduce the amount of things to remember.

Regarding knowledge in the world, a lot of conventions vary according to culture or country (e.g. which side of the road to drive on), which must be taken into account especially when developing systems available internationally.

Systems should rely more on knowledge in the world than in the head. Some systems rely on knowledge in the head on purpose, often for security reasons, for example reliance on passwords.

4. Knowing what to do: constraints, discoverability and feedback

This chapter focuses on how the product can help users to interact with it by limiting the universe of possible actions (constraints), making it easy to discover the right way to use it (discoverability) and providing feedback information along the way to tell users whether they’re using it correctly.

He categorizes constraints into four types: physical, cultural, semantic (derived from the purpose of the action) and logical (for example: there’s only one logical way to perform an action).

For discoverability the author analyzes the design of faucets, which have to make it easy for users to control water flow and temperature.

For feedback, he discusses the pros (does not require focused attention) and cons (annoyance, surrounding noise) of using sound as feedback.

5. Human error? No, bad design

In this chapter, the author focuses on user errors. He categorizes them into slips (execution errors) and mistakes (planning errors). Slips are easier to detect because they are a deviation from the expected plan, while a mistake can be a correct execution of the wrong plan.

He suggests designing for errors. This includes preventing errors in the first place (constraints), sensibility checks (e.g. input validation), the option to undo actions, and making errors obvious and easy to correct.

6. Design thinking

This chapter provides a framework for the process of designing. It includes the double diamond: the first diamond tries to find/define the problem, while the second is to find the solution.

The analogy with the diamond shape is that both phases start by expanding the range of ideas and then narrow it down to specific ones. More technically, he defines four phases in each of the diamonds:

1. Observation
2. Idea generation
3. Prototyping
4. Testing

Observation requires a deep understanding of a small set of customers (as opposed to other forms of observation, such as large-scale general A/B testing).

Idea generation is basically brainstorming. Together with prototyping and testing, it should be an iterative process.

In the rest of the chapter the author discusses related design topics: how external factors influence the design process (budget and time constraints), the fact that the buyer might not be the end user (e.g. appliances for a rental place) and how making something harder to use might be desirable (such as to improve security and provide access control).

7. Design in the world of business

In this final chapter, the author focuses on design in the real world. Besides budget and time constraints, one source of bloated design is the featuritis that arises from competition. If a competitor’s product adds a new feature, the product has to follow suit and add it too.

Another design challenge arises from the fact that people don’t like change. An improved design or a new technology sometimes doesn’t take off until much later, when people start getting used to it and adopting it. Around this theme, he discusses the tradeoffs of incremental and radical innovation, and argues that both are important for the development of products.



My impressions: I did like that the book uses consistent terminology to explain concepts and that the author provides a lot of examples. I also liked the fact that he comes up with conceptual models defining relationships between different concepts, such as the stages of an action.

I didn’t think the book was very well organized. He does mention that the book doesn’t have to be consumed linearly, but it felt like a collection of topics around a theme instead of a cohesive text. I’m used to technical books where you can look at the table of contents and see how the small parts (chapters) have well-defined boundaries and assemble together to form the big picture.

Most of my work consists of developing Web interfaces for people to do their jobs better. Usability is a very important concept in this field, so I’m eager to learn more about this subject.

Thoughts: Usability of code

In light of a recent read, Code Complete 2, I’ve been constantly aware of the usability (readability) of source code. If we think about it, it shares similar challenges with end products and maybe it’s possible to leverage ideas from this book and apply them to coding.

Some analogies: good function names are affordances for how to use a function; sticking to code conventions is a good way to move knowledge from the head to the world (Chapter 3); comments can act as signifiers; invariants and unit tests can act as constraints that convey the expected behavior of a function. Conceptual models are achieved by using good abstractions that map intuitively to the business rules the code is meant to implement.

As emphasized in the book, we write code for people, not for machines, so there’s no reason not to strive to make it as usable as the products we interact with every day.

Web Workers

August 4, 2016

In this post we study Web Workers, a technology that allows JavaScript code to run in separate threads. We’ll start by exploring the API with some toy examples and at the end discuss some applications.


By default JavaScript runs in a single thread (the main thread), which can be a problem for user experience if expensive operations need to be performed, since they would affect the responsiveness of the UI.

The thread that runs the web worker code has some environment limitations, including no access to the DOM or to global objects such as window. [3] contains a list of functions and methods available to a Web Worker.

Besides that, memory is not shared between the threads; it has to be explicitly serialized (so it can be cloned) and passed via a method (postMessage()). This can lead to performance issues if the amount of data to be copied is large. In Transferable Objects we’ll discuss an alternative.

Workers might spawn their own workers.

For some reason, the term “web workers” reminds me of these creatures from Spirited Away :)

Let’s work through a simple example. Imagine we have two files, main.js and worker.js (in the same directory). In main.js:


// Initializes the worker with the
// JavaScript code in worker.js
var myWorker = new Worker("worker.js");

// Post a message to the worker
myWorker.postMessage("Hello worker!");

// Callback that gets called when a 
// message is received from the worker.
myWorker.onmessage = function(/*MessageEvent*/ message) {
  console.log('Message received from worker script: ', message.data);
};


And in worker.js:

onmessage = function(/*MessageEvent*/ message) {
  console.log('Message received from main script: ', message.data);
  postMessage("Hello main!");
};

Transferable Objects

By default data is copied when sending information back and forth between the main thread and the worker, using a process called structured cloning.

The serialization/deserialization can be expensive, so there is an alternative: transferable objects. More specifically, we can work with ArrayBuffers, which can be “transferred” instead of cloned, which is a more performant operation, more precisely O(1).

We’ll first cover ArrayBuffers and then see how to apply them in the context of Web Workers.


According to [5]:

The ArrayBuffer is a data type that is used to represent a generic, fixed-length binary data buffer. You can’t directly manipulate the contents of an ArrayBuffer; instead, you create a typed array view or a DataView which represents the buffer in a specific format, and use that to read and write the contents of the buffer.

An ArrayBuffer basically represents an unstructured array of bits which, to have any meaning/interpretation, needs a view, for example an array of 32-bit or 16-bit unsigned ints. In the example below we create an array buffer of 100 bytes.

// An array buffer of 100 bytes
var buffer = new ArrayBuffer(100);

// A 32-bit unsigned int array of length 10 (i.e. 40 bytes), starting 
// from byte 0 at the array buffer
var int32View = new Uint32Array(buffer, 0, 10);
// A 16-bit unsigned int array of length 20 (i.e. 40 bytes), starting 
// from byte 0 at the array buffer
var int16View = new Uint16Array(buffer, 0, 20);

// Fill in the 16-bit array with 0, 1, 2...
for (var i = 0; i < int16View.length; i++) {
  int16View[i] = i;
}

// The memory is shared because we're reading from the same chunk of 
// the byte array.
for (var i = 0; i < int32View.length; i++) {
  console.log("Entry " + i + ": " + int32View[i]);
}

This is a very interesting model. With ArrayBuffers one explicitly works with the serialized form of the data and creates views on top of it. I’m used to working views-first, that is, creating a class representing some data and eventually adding serialization/deserialization methods. One advantage of working with serialized data is that we don’t need to write the serialization methods, only the deserialization. The major disadvantage is that we need to know upfront how much memory we’ll use.

Transferable objects

We can extend the example above to be used between a worker and the main thread.


var buffer = new ArrayBuffer(100);
var int16View = new Uint16Array(buffer, 0, 20);

for (var i = 0; i < int16View.length; i++) {
  int16View[i] = i;
}

console.log('array buffer size', buffer.byteLength); // 100
postMessage(buffer, [buffer]);
console.log('array buffer size?', buffer.byteLength); // 0

and in the main.js:

myWorker.onmessage = function(e) {
  var buffer = e.data;

  // A 32-bit unsigned int view of length 10 over the transferred buffer
  var int32View = new Uint32Array(buffer, 0, 10);

  for (var i = 0; i < int32View.length; i++) {
    console.log("Entry " + i + ": " + int32View[i]);
  }
};

By logging the output to the console, we can see that the main thread received the values written to the array buffer by the worker, and that after the worker transferred the buffer, its data was emptied out.

Note that in the postMessage() API we provide buffer as the first parameter, and then it also appears in the list indicating it should be transferred, not copied. Having to pass it twice is a bit confusing in my opinion, but this allows the example below, in which the objects transferred are nested inside another structure (in this case an object) and we want to transfer both buffer1 and buffer2 but not the top-level object. I’m not sure which use case the API designers had in mind, though.

postMessage(
  {'key1': buffer1, 'key2': buffer2},
  [buffer1, buffer2]
);

Error Handling

If any error goes uncaught in the worker, it can be caught in the main thread through the onerror callback:

myWorker.onerror = function(e) {
  console.log('an error occurred:', e);
};

Where e is an instance of ErrorEvent. We can simulate an error in the worker.js code:

throw new Error("Some error occurred");


The main thread can terminate the worker:

myWorker.terminate();

or the worker can terminate itself:

close();

A lot of the examples using Web Workers involve doing some fake expensive calculation in the worker thread, but I haven’t found any real-world application.

StackOverflow offers some ideas, including ones that are dismissed as bad uses of Web Workers (polling) or that come from long-defunct projects (Mozilla Skywriter). The main issue is that most of the time, heavy processing is done on the server.

One idea that came to mind is using web workers in React. React defers a lot of DOM work to the end by working with the concept of a virtual DOM. Web workers don’t have access to the DOM, but they can work with a virtual DOM. It turns out this idea has been explored already [7, 8], but there were some technical difficulties in implementing events.


In this post we studied Web Workers and some examples utilizing them. I learned a few other related topics, like ArrayBuffers and Compositor Workers. I was a bit disappointed with the lack of compelling applications using Web Workers. I’ll try them out in some of my projects and see if I can get any benefits.


Some of the code presented in this post is available on Github.

[1] MDN – Using Web Workers
[2] HTML5 Rocks – The Basics of Web Workers
[3] MDN – Functions and classes available to Web Workers
[4] Google Developers – Transferable Objects: Lightning Fast!
[5] MDN – JavaScript typed arrays
[6] StackOverflow – What are the use-cases for Web Workers?
[7] React Custom Renderer using Web Workers
[8] GitHub React Issues: Webworkers #3092
[9] Compositor Workers Proposal

OCaml Modules

July 17, 2016


One prevalent construct in some OCaml code I’ve encountered is modules, so I decided to study them a bit to be able to better understand the code I’ve been reading.

This post is based on Chapters 4 and 9 from Real World OCaml [1, 2], which we started in the last post.

Defining modules

By default, a file defines a module. The module name is derived from the filename and is always capitalized (even if the filename is not). Let’s define a toy example in a file myModule.ml:

let a = 1;;

and a main.ml that uses it:

Printf.printf "%d\n" MyModule.a;;

We can now compile using the built-in ocamlc (make sure to follow the setup here):

ocamlc myModule.ml main.ml


Note that the order is important. Since main.ml depends on myModule.ml, myModule.ml has to come first. In case the files do not live in the same directory, we can use the -I option. For example, if myModule was in another_dir, we could compile using:

ocamlc -I another_dir/ another_dir/myModule.ml main.ml

A module has two parts: the definition and the signature. A module defined by a file can be constrained by a signature in a file called filename.mli. The signature (.mli) should be included in the compilation step and appear before the implementation. For example, we could have a myModule.mli containing:


val a : int;;

and compile using:

ocamlc myModule.mli myModule.ml main.ml


It’s possible to define multiple modules inside a file through submodules. We can add this to myModule.ml:

module MySubmodule : sig
  val inc : int -> int
end = struct
  let inc x = x + 1
end

and in main.ml:

Printf.printf "%d\n" (MyModule.MySubmodule.inc 1);;

Note that we still need to qualify the submodule with the module name defined by the filename. The first part of the submodule definition is its type signature and the second is the implementation. The general syntax is:

module <module name> : sig
  type <abstract type name>
  val <name> : <type signature>
end = struct
  type <abstract type name> = <concrete type>
  let <name> = <definition>
end

Alternatively, we can separate the type signature from the implementation. In that case, we create a separate file with extension .mli containing the interface, myModule.mli:


module MySubmodule : sig
  val inc : int -> int
end

and in myModule.ml:

module MySubmodule = struct
  let inc x = x + 1
end

In general, the syntax for the signature is:

module type <module type name> = sig
  type <abstract type name>
  val <name> : <type signature>
end

and for the module definition is:

module <module name> : <module signature> = struct
  type <abstract type name> = <concrete type>
  let <name> = <definition>
end

Including modules

Modules are made available when linking at compile time, but if we want to use a function from a module, we still need to qualify it. We can alternatively open a module explicitly to avoid qualifying its names (similar to using namespace in C++):

open MyModule

After the open, we can refer to a directly instead of writing MyModule.a.
We can also invoke open inline:

# let average x y =
    let open Int64 in
    x + y / of_int 2;;
val average : int64 -> int64 -> int64 = <fun>

or alias the module to a shorter name with the let module construct:

let print_median m =
  let module C = Counter in
  match m with
  | C.Median string -> printf "True median:\n   %s\n" string
  | C.Before_and_after (before, after) ->
    printf "Before and after median:\n   %s\n   %s\n" before after


Functors

Functors are functions that transform modules: a functor takes a module and returns another module. A basic example is provided in [2]. First we define a toy signature:

module type X_int = sig
  val x : int
end

Then, we define a functor, which we call Increment:

module Increment (M : X_int) = struct
    let y = M.x + 1
end

What tells us this is a functor is the extra parameter the module takes (M : X_int). Here, M is the name we give to the module and X_int is its interface. Since the interface tells us M has the value x, we can access it within our functor. Note that Increment acts like a function, taking M (a module) as a parameter and returning another module, defined by the struct block. In this case, the type signature of the result is different, because the returned module has y instead of x. We can force the return type of a functor by adding a constraint:

module Increment (M : X_int) : X_int = struct
    let x = M.x + 1
end

Now, if we try to use y, we get a compilation error. To fix it, we just change y to x.

Functors cannot be used by themselves. They’re useful for creating modules out of existing modules. For example, imagine we have a module implementing X_int:

module Three = struct
  let x = 3
end

We can create a new module Four, by transforming our Three module:

module Four = Increment(Three);;

(* Testing the modules *)
Printf.printf "%d\n" Three.x;; (* 3 *)
Printf.printf "%d\n" Four.x;;  (* 4 *)

“Abstract” functors

In [2], the authors provide an example of a MakeInterval functor, in which there’s a dependent type. First it creates a Comparable signature:

module type Comparable = sig
  type t
  val compare : t -> t -> int
end

To make it shorter (and less “real world”), I’ve created a simpler version here, MakeElement:

module MakeElement (InputModule : Comparable) = struct
  type t = Element of InputModule.t
  let create x = Element x
end

we can then create a module:

module IntElement = MakeElement(Int);;

The above works because Int satisfies the constraint defined by the Comparable module signature. The authors make the point that sticking to standard conventions can improve reuse in cases like this. We can finally use it:

# let e = IntElement.create 10;;
val e : IntElement.t = IntElement.Element 10

The authors argue that IntElement exposes implementation details because Element is “exposed”:

# let e2 = IntElement.Element 10;;
val e2 : IntElement.t = IntElement.Element 10

One solution is to constrain the return type of the functor and not expose Element in the signature. The signature would look like:

module type ElementInterface = sig
  type t
  type element
  val create : element -> t
end

and the functor would look like:

module MakeElement (InputModule : Comparable): ElementInterface = struct
  type t = Element of InputModule.t
  let create x = Element x
end

The problem is that the type element is not bound to anything, so we have to bind it explicitly when defining the functor. The construct is the following:

module MakeElement (InputModule : Comparable):
  (ElementInterface with type element = InputModule.t) = struct
  type t = Element of InputModule.t
  type element = InputModule.t
  let create x = Element x
end

Now MakeElement returns a module with interface ElementInterface, which doesn’t expose Element.

Destructive Substitution

One problem with the approach above is the redundant binding of type element. One slightly different syntax removes that requirement:

module MakeElement (InputModule : Comparable):
  (ElementInterface with type element := InputModule.t) = struct
  type t = Element of InputModule.t
  let create x = Element x
end

We basically changed from = to :=, which is called destructive substitution.

Multiple Interfaces

It’s possible to “extend” more than one module signature when creating a new one. For example:

module type MyFunctor = sig
   include MyModuleInterface
   include MyModuleInterface2 with type t := t
end


Modules seem to be a very powerful concept in OCaml. Besides organizing files, modules can act as functions and can model concepts from Object-Oriented Programming, like classes and interfaces.

I’ve been following a different study strategy while learning OCaml. I try to read some real-world code and, when I get stuck understanding some syntax or a pattern, I search for it in a book. This makes it more interesting than reading a book cover to cover.


[1] Real World OCaml – Chapter 4. Files, Modules, and Programs
[2] Real World OCaml – Chapter 9. Functors
[3] Compiling OCaml projects

Exploring OCaml

June 26, 2016


I’ve been learning the basics of OCaml in order to be able to follow the examples from Purely Functional Data Structures, from Chris Okasaki.

My background: I have zero knowledge of OCaml but, having studied a bit of Haskell, I’m familiar with the main concepts in functional programming, so I’ll try to focus on syntax and on differences between the two languages. I’ll probably skim through concepts from functional programming I’ve already covered in past Haskell posts.


I used to skip the setup from posts, but it might be useful to someone, especially with a similar environment, so I decided to include it here. This setup assumes MacOS and emacs.


Easily available on Brew:

$ brew install ocaml
$ brew install opam
$ ocaml

For other OS’es:

Emacs syntax highlighting

Looks like tuareg is a popular mode for developing OCaml in emacs:

opam init
opam install tuareg

At the configuration file (~/.emacs.d/init.el) we can add:

(load "~/.opam/system/share/emacs/site-lisp/tuareg-site-file")


Typing ocaml in the terminal brings up the CLI, but it’s not very interactive: there’s no good way to go back to previous commands or to edit text in the middle of a line. I’ve learned about a command-line tool called rlwrap that implements this functionality. We can easily install it on a Mac:

brew install rlwrap

And invoke ocaml like this:

rlwrap ocaml

We can also add a simple alias to have these features as default:

alias ocaml='rlwrap ocaml'

Language Basics


Statement boundaries are defined by ;;. Example:

# 1 + 1;;
- : int = 2

This lets us define multi-line expressions:

# 1 +
# 1
# ;;
- : int = 2


(* This is a single-line comment. *)

(* This is a
 * multi-line
 * comment.
 *)
Load code from file

It can become tedious to edit functions in the CLI. It’s possible to execute the contents of a file:

> ocaml
# #use "";;

Basic Types

* int – 31 or 63 bits, depending on the platform; one of the bits is used for internal memory management

When writing literal number values, the underscore character is ignored (as long as it’s not the leading character). For example:

# 10_1___02__  + 1;;
- : int = 10103

This can be useful to define large numbers in a more user friendly way:

# let x = 12_434_934
val x : int = 12434934 

* float – IEEE double-precision floating point

OCaml doesn’t do implicit casts, especially between ints and floats. We have to cast using functions like int_of_float or float_of_int. Examples:

# 1 + 1;;
- : int = 2
# 1.3 +. 1.2;;
- : float = 2.5
# 1 + 1.0;;
Error: This expression has type float but an expression was expected of type int
# 1 +. 1.;;
Error: This expression has type int but an expression was expected of type float
# 1 + (int_of_float 1.0);;
- : int = 2
# (float_of_int 1) +. 1.;;
- : float = 2.

Note that the sum operator for floats has an extra . character (+.)

* bool – true/false
* char – 8-bit ascii character
* string – more than a list of char, efficient internal representation


We can define variables by the use of let

# let a = 3 in
  let b = 4 in
  a + b;;
- : int = 7

This looks like imperative code at first glance, but it’s slightly different: the construct is let <binding> in <expr>. The binding is only made available inside the scope of expr. For example:

# let a = 3 in
  let b = 4 in
  a + b;;
- : int = 7
# a;;
Error: Unbound value a

Here the variable a was defined only for the expression:

  let b = 4 in
  a + b;;

When we terminated it with ;;, a went out of scope. We can also bind multiple variables in the same expression, for example:

# let a = 3 and b = 4 in
  a + b;;
- : int = 7


Defining Functions

Example: A simple sum function

# let sum a b = a + b;;
val sum : int -> int -> int = <fun>
# sum 1 2;;
- : int = 3

Note how the type signature syntax is very similar to Haskell.

Explicit function type signature

Sometimes, to avoid ambiguity, we might want to provide the types of the inputs/outputs of the function. For example, we might want to define a sum function only intended to be used with ints.

let intSum (a: int) (b: int) : int = a + b;;

Lambda functions

Denoted by the use of the fun construct, lambda functions are useful for passing simple functions as parameters (for example, to a map over a list).

# let doubleValues ls = List.map (fun x -> 2*x) ls;;
val doubleValues : int list -> int list = <fun>
# doubleValues [1; 2; 3];;
- : int list = [2; 4; 6]

Recursive functions

A function must be explicitly marked as recursive by adding rec, which, according to [2], is due to technical reasons, mainly related to type inference.

Example: Computing the factorial of a natural number n:

# let rec factorial n =
    if n = 0 then 1
    else n * (factorial (n - 1));;
val factorial : int -> int = <fun>
# factorial 10;;
- : int = 3628800

Matching patterns

Similar to Haskell, OCaml has pattern matching, which we can use to decide which body of a function to apply. For example, to reverse a list we can do:

# let rec reverseList xs =
    match xs with
    | [] -> []
    | x :: xs -> (reverseList xs) @ [x];;
# reverseList [1; 2; 3];;
- : int list = [3; 2; 1]

The _ pattern matches all the remaining cases. For example, we could rewrite the reverseList function as:

# let rec reverseList xs =
    match xs with
    | x :: xs -> (reverseList xs) @ [x]
    | _ -> [];;
# reverseList [1; 2; 3];;
- : int list = [3; 2; 1]

Labeled arguments

We can prefix a function parameter with ~ to indicate it’s a labeled (named) argument. Example:

# let div ~a ~b = float a /. float b;;
val div : a:int -> b:int -> float = <fun>
# div 10 2;;
- : float = 5.
# div ~b:10 ~a:2;;
- : float = 0.2

Note how the argument names show up in the function’s type signature. This is important because we may pass a function with labeled arguments to another function, which may make use of this fact (it’s also useful for documentation).

If the variable name matches the named parameter, we don’t need to repeat ourselves:

# let a = 10;;
val a : int = 10
# let b = 2;;
val b : int = 2
# div ~b ~a;;
- : float = 5.

We can also curry using named arguments. For example, if we want to generate a new function with the value b “bound”, we can do:

# let b = 2;;
val b : int = 2
# let half = div ~b;;
val half : a:int -> float = <fun>
# half ~a:1;;
- : float = 0.5

When currying, positional parameters (i.e. non-labeled arguments) are always bound before the labeled ones. For example:

# let makeList3 ~a ~b c = [a; b; c];;
val makeList3 : a:'a -> b:'a -> 'a -> 'a list = <fun>
# let makeList2 = makeList3 1;;
val makeList2 : a:int -> b:int -> int list = <fun>
# makeList2 2 3;;
- : int list = [2; 3; 1]

Optional parameters

Optional parameters are prefixed with ? like sep in the example below:

# let concat ?sep x y =
    let sep = match sep with None -> "" | Some x -> x in
    x ^ sep ^ y;;
val concat : ?sep:string -> string -> string -> string = <fun>

# concat "ab" "cd";;
- : string = "abcd"
# concat "ab" "cd" ~sep:",";;
- : string = "ab,cd"

The value coming from an optional parameter is either None or Some x. An optional parameter is also a labeled parameter.

In the example above, we use pattern matching to provide a default value to sep. There’s a shorter syntax for this:

let concat ?(sep="") x y =
  x ^ sep ^ y
val concat : ?sep:string -> string -> string -> string = <fun>

By providing a default value, the value in the sep variable won’t be either None/Some, but the actual type expected, in this case a string.

It can be tricky to curry in the presence of optional arguments. [2] discusses in detail the heuristics applied by the compiler in different scenarios.


In this post we covered the bare minimum to get our hands dirty with some OCaml code and learn the basic syntax. I don’t plan to study [2] in full for now; I’m mostly interested in learning enough to follow along with Purely Functional Data Structures.

Some first impressions: the ocaml CLI is pretty limited, but thanks to rlwrap it becomes manageable. Haskell feels more elegant; OCaml, more practical.

For the next post in the series I plan to study the most basic and simple data structure in functional programming: lists.


[1] – The Basics
[2] Real World OCaml – Chapter 2. Variables and Functions

Review: Code Complete 2

May 31, 2016


In this post I’ll share my notes about the book: Code Complete 2, by Steve McConnell. The book has great information about several aspects of Software Development, but it’s quite long: 862 pages.

This is not a summary of the book by any means, but rather points I found interesting, novel or useful. I hope you find it useful too, or at least that it inspires you to go after the book for more details on specific topics.

I’ve written down bullet points about each chapter, and to some of them I added my own thoughts/comments.

Chapter 1 – Introduction

* Motivation: What is Software construction and why it’s important
* Explains the overall format of the rest of the book

Chapter 2 – Metaphors

* Metaphors help us understand problems better: we can take solutions developed for an analogous problem and apply them to the original one.
* Common analogy to software development is civil construction.
* Bad metaphors come from forced analogies and can be misleading.

Chapter 3 – Prerequisites (gathering requirements)

* Errors caught in the specification phase are the cheapest to fix (an error caught in construction costs about 10x more to fix).
* Different types of software require different degrees of investment in prerequisites. Three categories of software:
— business (e.g. website),
— mission-critical (e.g. packaged software) and
— embedded/life critical (e.g. avionics or medical devices).
* Requirements change over time, so your design has to be flexible enough to allow changes.

Chapter 4 – Construction Planning

* Important decisions that have to be taken before the construction phase: programming language, code conventions, development practices (e.g. pair-programming), etc.
* Technology wave and the tradeoffs of choosing technology in different stages of this wave. For example, late-wave technology is more mature, has better documentation and user-friendly error messages. On the other hand early-wave environments are usually created to address problems with existing solutions.

Comments: Adopting early technology also helps with recruiting. Programmers like new technology.

Chapter 5 – Design

* The main goal of the design should be to keep software complexity low.
* Different levels of design: software-level, packages, classes, routines and internal routines — different types of software require different amounts of design details.
* Two main flavors to carry over the design: bottom-up or top-down approaches.
* Information hiding is key in a good design: it helps lower complexity by sparing the reader from implementation details, and it reduces coupling.

Comments: Hiding information properly is an art. It doesn’t help to stick as much code as possible into private methods if the public methods are not intuitive and require diving into implementation details.

Chapter 6 – Classes

* Consistent abstractions – different methods in a class should model the problem at the same level. Example: a class representing an employee record which inherits from a list container, with two methods along the lines of AddEmployee() and NextItemInList() (hypothetical names to illustrate the mismatch).

The second one is closer to an implementation detail. In general, the closer the abstraction is to the business level, the better.

* Inheritance vs. composition: inheritance is often abused, and long inheritance chains are hard to read. Arthur Riel suggests no more than 6 levels; the author says it should be limited to 3.
* Be careful with assigning too many responsibilities to a single class. Heuristic: a class should have at most 7 members.

Chapter 7 – Routines

* Naming: should be verb + noun and should describe the value it returns (if any).
* The purpose of a routine is to reduce complexity.
* The heuristic to the maximum number of parameters is 7.
* Routines should follow the Unix philosophy: do one thing and do it well.

Comments: in the past I used to think of routines as ways to share code. This sometimes conflicts with readability and the Unix principle. This is especially true when you group several routine calls into one just because they’re used in two places, but they’re not cohesive enough to name the routine clearly, so we end up using vague terms such as init, prepare, preprocess, cleanup, etc. Nowadays I prefer being verbose (i.e. repeating the lines in both places) in favor of readable code.

Chapter 8 – Defensive Programming

* When to use assertions: error handling for things you expect to occur and assertion for the ones you don’t expect.
* When to use exceptions: a convention should be defined. The names of exceptions should match the level of abstraction of the current code (e.g. no RuntimeException where business logic is defined); this also means catching and re-throwing if an exception crosses the boundary between two abstraction layers.
* Barricades: a specific layer that deals with bad input so that the core code doesn’t have to deal with them.

Comments: the barricade is pretty nice. It helps reduce complexity in the main code by not having to handle too many corner cases, and it also centralizes where bad data is handled, so you don’t risk double or triple checking the same conditions in several places.


Chapter 9 – Pseudocode Programming Process (PPP)

* Describe the routine first in pseudo-code and then add the actual implementation but leaving the pseudo-code as comment.

Chapter 10 – Variables

* The span of a variable is the number of lines between where a variable is declared and where it’s used. The average span of the variables is an indicator of complexity: high-span variables are spread out across the code. To reduce this number, one can try to re-order statements so that variables are declared close to where they’re used and all their uses are grouped together.

Chapter 11 – Variable naming

* Optimal length is 10 to 16 characters.
* Computed qualities such as total, max, should be suffix, not prefix.
* Use conventions to indicate special types of variables such as loop indexes, temporary, boolean, enums, etc.
* Define a document for consistent variable naming conventions, including abbreviations.

Chapter 12 – Fundamental Data Types

* Consider using special-purpose containers instead of plain arrays. In most cases we need sequential access, so a queue, stack or set can be more appropriate.
* Use intermediate boolean variables for the sole purpose of making complex predicates (in if clauses) simpler.

Chapter 13 – Unusual Data Types

* Organize related set of variables into a structure so they become more readable/easier to copy.
* Global variables are evil. If you need them, at least put them behind access routines (e.g. static member variables in a class).

Chapter 14 – Organizing code within a routine

* Make dependencies between 2 pieces of code obvious: via parameters, comments or flags + invariants.
* To break dependencies between chunks of code, initializing variables at the beginning of the routine might help.

Comments: I also like memoized functions for breaking dependencies. If B depends on A having run, I make A a memoized function that B can call regardless of whether it has already been called.

Chapter 15 – Conditionals

* When doing if/else conditionals, test the “normal” case first and the exception in the “else”.

Comments: I tend to do the opposite, in favor of early returns: this helps reduce nesting and clears up corner cases first – this works well when the handling of the exception case is simple.

Chapter 16 – Loops

* Keep as much code outside of the loop as possible, and treat it as a black box, if possible.
* Perform only one function per loop; prefer multiple loops with simple functions to one loop with many functions – unless you can prove that using multiple loops is the bottleneck of the code.
* The loop body should be short (<20 lines) and should be visible entirely in the screen.
* Loop nesting should be limited to 3 levels.

Chapter 17 – Unusual Control Structures

* Multiple returns within the same routine: use only if this improves readability.
* Avoid recursion if possible. Restrict it to a single routine (no chains like A -> B -> A -> B...).

Chapter 18 – Table-driven methods

* Simplify complicated conditional logic by pre-computing the values and hard-coding them in a lookup table (works if inputs are discrete); see the sketch below.
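
As a hypothetical illustration (a sketch of mine in OCaml, not the book’s code), the classic example is replacing a chain of conditionals that maps a month to its number of days with a lookup table:

(* Lookup table indexed by month, assuming a non-leap year *)
let days_in_month = [| 31; 28; 31; 30; 31; 30; 31; 31; 30; 31; 30; 31 |]

(* days 2 returns 28 *)
let days month = days_in_month.(month - 1)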

Chapter 19 – General control issues

* When comparing numbers, use number-line order; in other words, always use the < and <= operators. For example, instead of

a < 0 || a > 10

write

a < 0 || 10 < a

* Do not put side effects on conditionals.
* Switch/Case statements indicate poorly factored code in OO programming.
* Measuring control complexity in a routine: count the number of if, while, for, and, or and case. It should be less than 10.

Chapter 20 – Quality assurance

* Testing and code reviews are not enough by themselves. A combination of different techniques yields the lowest number of bugs in general.
* In studies, code review by 2 people found twice as many errors as code review by 1 person – this is surprising, because one would think many of the errors found by each reviewer would overlap. The book doesn’t mention the optimal number of reviewers.
* There are many qualities in a software including: correctness, usability, efficiency, reliability, flexibility and robustness, and some of them are conflicting (e.g. improving robustness decreases efficiency). Decide upfront which characteristic to optimize and keep the tradeoffs in mind.

Chapter 21 – Collaborative Development

* Formal inspections of design: the author of the design creates a document and other people in the team have to review it independently. This not only forces the designer to think it thoroughly, but also makes sure other people in the team will be familiar with the architecture.

Chapter 22 – Testing

* Exercise control flows. Instead of testing all combinations of conditionals (which would be exponential, and prohibitive), add 2 tests for each conditional (one for the true case and another for the false case).
* Test data flow. The suggestion is to test all pairs of (definition, usage) of all variables. For example, if we have

  int a = 10;      // 1
  if (x < a) {     // 2
    int b = a + 1; // 3
  }

In this case we would add two tests: one that exercises lines (1) and (2) and another that exercises (1) and (3).

* Bug distribution is not uniform across the code. It’s more likely that 20% of the code contains 80% of the bugs. It’s important to identify these areas and avoid over-testing, especially if using TDD.
* Keep records of bugs and fixes: where the bugs were found, their severity, etc. This will help identify the critical paths.

Chapter 23 – Debugging

* A bug in your code means you don’t fully understand your code.
* Reproduce the error in different ways, to make sure you understand what is really causing the problem.
* Rewriting code might be a better alternative if debugging is taking too long. The idea is to set a maximum time dedicated to debugging, after which more drastic solutions, such as rewrites, are allowed.

Chapter 24 – Refactoring

* Make refactorings safe. In order to accomplish that, they should be small, self-contained, easy to review, documented, and tested.
* Different refactorings have different degrees of risk and should be planned accordingly.
* Book recommendation: Refactoring: Improving the Design of Existing Code by Martin Fowler.

Chapter 25 – Code Tuning

* Code tuning is overrated.
* Performance is often not the best feature to optimize: throughput and correctness are more important.
* 4% of the code accounts for 50% of the performance – and programmers are bad at guessing code bottlenecks. Finish the product first, then profile to find the bottlenecks.
* Common sources of performance bottlenecks are system calls, I/O and pagination.

Chapter 26 – Code Tuning Techniques

* Specific techniques to improve runtime of code.
* Experimental results for each technique show that in some environments the optimizations provide great performance gains, but in other cases no significant improvements are obtained (and sometimes performance even degrades).
* The key takeaway is: profile! Compilers are very smart nowadays, so it’s hard to predict how a source-level optimization will be translated into the final machine code.
* Downside of optimizations is loss of readability:

“The impact of unmeasured code tuning on performance is speculative at best, whereas the impact on readability is as certain as it is detrimental.”

Chapter 27 – How Program Size Affects Construction

* As the code size grows,
— More people are necessary, increasing communication overhead
— Productivity is lower
— Error density increases
— Proportionally more time has to be spent on non-construction phases (planning and design)

Chapter 28 – Managing Construction

This chapter seems to be more targeted to managers, but also useful to developers to understand the “other side”.

* On encouraging good coding:

“If someone on a project is going to define standards, have a respected architect define the standards rather than the manager. Software projects operate as much on an expertise hierarchy as on an authority hierarchy.”

* Configuration management: practices to deal with changes, either in requirements, design or source code.
— Discuss planned changes in group, no matter how easy they are to implement, so everyone keeps track of them.
* Estimating construction time:
— Use estimation software,
— Treat estimation seriously,
— Adjust estimates periodically, and
— If initial estimates were off, learn why.

Chapter 29 – Integration

* Different strategies for doing code integration, mainly top-down (start with a skeleton and fill in the blanks) and bottom-up (start with individual pieces and glue them together).
* Strategies to make the integration smoother, such as automated builds and smoke tests.

Chapter 30 – Programming Tools

* Covers IDEs and tools for compiling, refactoring, debugging and testing
* Large projects require special purpose tools
* Programmers overlook some powerful tools for years before discovering them


Chapter 31 – Layout and Style

* Covers indentation and line breaks
* Do not align assignment statements on the ‘=’ sign. The idea is noble and it improves readability, but it’s a standard that’s hard to maintain: not everyone will take the same care, and automated refactoring tools will likely break the alignment.
* Add at least one blank line before each comment

Chapter 32 – Self-documenting code

* Don’t comment too much or too little :)
* The author admits there’s not a lot of hard data regarding the usefulness of comments and what the “right” amount is
* Comment while or, better yet, before coding
* Especially in bug fixes, the comment should explain why the code works now, not why it didn’t work in the past.
* Comment styles have to be easy to maintain (avoid end-of-line comments, because if the variable gets renamed, the comment will end up misaligned)

Chapter 33 – Personal Character

Traits that the author considers important in a good programmer:

* Humility – admitting your limitations, being open to learning new things and changing your mind
* Curiosity

Analyze and plan before you act: there is a dichotomy between analysis and action, and programmers tend to err on the side of acting.

* Intellectual Honesty – admitting mistakes, being honest/realistic about estimates and delays
* Communication
* Creativity
* Discipline
* Laziness

“The most important work in effective programming is thinking, and people tend not to look busy when they’re thinking. If I worked with a programmer who looked busy all the time, I’d assume he was not a good programmer because he wasn’t using his most valuable tool, his brain.”

Traits that the author thinks are overrated:

* Persistence
* Experience
* Gonzo programming – programming like crazy, non-stop

Chapter 34 – Themes in Software Craftsmanship

Conclusion and review of the book

* The primary goal of software design and construction is conquering complexity
* Process matters.

“My message to the serious programmer is: spend a part of your working day examining and refining your own methods. Even though programmers are always struggling to meet some future or past deadline, methodological abstraction is a wise long-term investment.” – Robert W. Floyd

* Write programs for people first, computers second
* Watch for warning signs. Examples: a high number of bugs in a particular class may indicate the class is poorly designed; a lot of bugs in the project overall might indicate a flawed development process; difficulty reusing code in another place indicates it’s too coupled; etc.

Chapter 35 – Where to find more information

Book recommendations:

* Pragmatic Programmer – Hunt and Thomas.
* Programming Pearls – Bentley, J.
* Extreme Programming Explained: Embrace Change – Beck, K.
* The Psychology of Computer Programming – Weinberg.
* The Mythical Man-Month – Brooks
* Software Creativity – Glass, R.
* Software Engineering: A Practitioner’s Approach – Pressman, R.
* Facts and Fallacies of Software Engineering – Glass, R.
* UML Distilled – Fowler, M.
* Refactoring: Improving the Design of Existing Code – Fowler, M.
* Design Patterns – Gamma et al.
* Writing Solid Code – Maguire, S.


This book contains a lot of valuable information and I’ve incorporated several of its ideas into my day-to-day work, especially regarding making code easier to read.

The suggestions in the book are often backed by hard data, which makes them more credible. Sometimes the advice is subjective, even contradictory, but the author often provides several points of view or alternatives, so the reader can use their best judgement about when to apply them.

Tree Ring Matching using the KMP Algorithm

March 12, 2016

Disclaimer: trees and rings are concepts found in Mathematics and Computer Science, but in this post they refer to the biology concepts ;)

Biosphere 2

Earlier this year I went to Arizona and visited Biosphere 2. Its initial goal was to simulate a closed environment in which a group of 6 humans should survive for 2 years without any exchange with the exterior (except sunlight, of course). It had different biomes, including a rain forest, an ocean and a desert.

Biosphere 2

The plan didn’t go well because they had missed an interaction of the plants with one of the seasons, which caused the oxygen level to fall below the accepted level, so they had to get an external supply. The mission did last 2 years, but it was not a perfect closed environment.

Later, another mission was attempted, but it had to be abandoned after a few months. The project lost its funding and no more attempts were made. Nowadays, the place is used by biologists to study environments under controlled conditions and is also open to the public for tours (the Nautilus magazine recently wrote more about this).

The tour includes a small museum explaining some of the research; one example is analyzing trees to understand the ecosystems of the past. According to one of the exhibits, tree rings can be used to determine the age of the tree and also infer some of the climate from that period, a practice called dendrochronology. The more rings a tree has, the older it is, and the width of each ring varies with the climate during the period of its growth.

A lot of the trees alive today are not very old. On the other hand, there are fossils of older trees. Since they possibly co-existed for a period of time, it’s possible to combine the data from the rings of both trees by matching the outer rings of the fossil tree with the inner rings of the recent tree.


Tool to explain the matching system

The Longest Prefix-Suffix String Overlap problem

I thought about the tree ring matching for a while and realized this can be reduced to the problem of finding the longest overlap between the suffix and prefix of two strings. More formally, given strings A and B, find the longest suffix of A that is also a prefix of B.

This problem can be easily solved using a simple quadratic algorithm, which is probably fast enough for real-world instances, likely in the order of thousands of rings. In Python, we can use slicing to do:

def longest_suffix_prefix_naive(suffix, prefix):
    minlen = min(len(prefix), len(suffix))
    max_match = 0
    for match_len in range(1, minlen + 1):
        if prefix[:match_len] == suffix[-match_len:]:
            max_match = match_len
    return max_match

In the code above, A is named suffix and B prefix.
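
For instance (a small hand-checked example), the longest suffix of 'ababa' that is also a prefix of 'abac' is 'aba':

print longest_suffix_prefix_naive('ababa', 'abac')  # 3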

But can we do better? This problem resembles a string search problem, where we are given a text T and a query string Q and want to find whether Q appears in T as a substring. One famous algorithm to solve this problem is KMP, named after its authors Knuth, Morris and Pratt (Morris came up with the algorithm independently from Knuth and Pratt), which is linear in the sizes of T and Q.

We’ll see how to use the ideas behind the KMP to solve our problem. So let’s start by reviewing the KMP algorithm.

The KMP Algorithm: A refresher

The most straightforward way to search for a query Q in a text T is to check, for every position in T, whether Q occurs starting there:

def substr(Q, T):

    tlen = len(T)
    qlen = len(Q)

    for i, _ in enumerate(T):
        match = 0
        # Searches for Q starting from position i in T
        while match < qlen and \
              i + match < tlen and \
              Q[match] == T[i + match]:
            match += 1

        if match == qlen:
            print 'found substring in position', i
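
As a quick sanity check (the strings below were chosen just for illustration):

substr('abc', 'xabcabc')
# found substring in position 1
# found substring in position 4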

Avoiding redundant searches

Imagine that our query string had no repeated characters. How could we improve the search?

Q = abcde
T = abcdfabcde
        ^ mismatch in position 4

If we were using our naive idea, after failing to find Q at position 4, we’d need to start the search over from T(1) = 'b'. But since all characters in Q are distinct, we know that there’s no 'a' between positions 1 and 3, because those positions matched other characters of Q, so we can safely skip positions 1 to 3 and continue from position 4.

The only case we would need to look back would be if 'a' appeared between 1 and 3, which would also mean 'a' appeared more than once in Q. For example, we could have

Q = ababac
T = ababadac...
         ^ mismatch in position 5

When there is a mismatch at position 5, we need to go back to position 2, because T could potentially be 'abababac…', in which case Q would occur at position 2. Let MM be the current mismatched string, 'ababad' above, and M the string we had right before the mismatch, i.e., MM without the last letter, in our case 'ababa'. Note that M is necessarily a prefix of Q (because MM is the first mismatch).

The string M = 'ababa' has the potential of containing Q because it has a longest proper suffix that is also a prefix of Q, namely 'aba'. Why is it important that the suffix be proper? We saw that M is a prefix of Q, hence the longest (not necessarily proper) suffix of M that is a prefix of Q would be M itself. We have already searched with M and know it ends in a mismatch, so we’re interested in something other than it.

Let’s define LSP(A, B) as being the length of the longest proper suffix of A that is a prefix of B. For example, LSP(M, Q) = len('aba') = 3. If we find a mismatch MM, we know we need to go back LSP(M, Q) positions in M to restart the search, but we also know that the first LSP(M, Q) positions will be a match in Q, so we can skip the first LSP(M, Q) letters from Q. In our previous example we have:

Q =   ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

We’ll have another mismatch but now MM = 'abad' and LSP(M, Q) = len('a') = 1. Using the same logic the next step will be:

Q =     ababac
T = ababadac...
         ^ continue search from here 
           (not from position 2)

Now MM = 'ad' and LSP(M, Q) = 0. This time we shouldn’t go back any positions, but we also won’t skip any characters from Q:

Q =      ababac
T = ababadac...
         ^ continue search from here

If we know LSP(M, Q), it’s easy to know how many characters to skip. We can simplify LSP(M, Q) by noting that M is a prefix of Q and thus can be unambiguously represented by its length, that is, M = prefix(Q, len(M)). Let lsp(i) = LSP(prefix(Q, i), Q); then LSP(M, Q) = lsp(len(M)). Since lsp() only depends on Q, we can pre-compute it for all prefixes of Q and store it in an array lsp, where lsp[i] refers to the prefix ending at position i (so, with 0-indexed arrays, LSP(M, Q) = lsp[len(M) - 1]).
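
For example, for Q = 'ababac' the array works out to (hand-checked from the definition):

i        0  1  2  3  4  5
Q[i]     a  b  a  b  a  c
lsp[i]   0  0  1  2  3  0

Note that lsp[4] = 3 corresponds to LSP('ababa', Q) = len('aba') = 3 from the example above.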

Suppose we have lsp ready. The KMP algorithm would be:

def kmp(Q, T):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen and j < qlen:

        if j < qlen and Q[j] == T[i]:
            # Case 1: match, advance in both T and Q
            j += 1
            i += 1
        elif j > 0:
            # Case 2: mismatch, fall back in Q but stay in T
            j = lsp[j - 1]
        else:
            # Case 3: mismatch with j == 0, advance in T
            i += 1

        # string was found
        if j == qlen:
            print i - j

We have to handle 3 cases inside the loop:

1: A match, in which case we keep searching both in T, Q.

2: A mismatch, but there’s a chance Q could appear inside the mismatched part, so we move Q the necessary amount, but leave T.

3: A mismatch, but there’s no chance Q could appear in the mismatched part, in which case we can advance T.

This algorithm is simple but it’s not easy to see it’s linear. The argument behind it is very clever. First, observe that lsp[j - 1] < j, because we only account for proper suffixes, so in the second case j strictly decreases. It’s clear that i is bounded by len(T), but it might not increase in all iterations of the while loop. The key observation is that (i - j) is also bounded by len(T). Now in the first case, i increases while (i - j) remains constant. In the second, i remains constant but j decreases, hence (i - j) increases. In the third, both i and (i - j) increase. So in every iteration either i or (i - j) increases, and since both are bounded by len(T), at most 2·len(T) iterations can occur.

The pre-computed lsp

How can we pre-compute the lsp array? One simple idea is to, for every suffix of Q, find the maximum prefix it forms with Q. In the following code, i denotes the start of the suffix and j the start of the prefix.

def precompute_lsp_naive(Q):
    qlen = len(Q)
    lsp = [0]*qlen

    i = 1
    while i < qlen :

        j = 0
        while (i + j < qlen and Q[j] == Q[i + j]):
            lsp[i + j] = max(lsp[i + j], j + 1)
            j += 1

        i += 1
    return lsp

The problem is that this code is O(len(Q)^2). In practice len(Q) should be much smaller than len(T), but we can do it in O(len(Q)), using an idea similar to the main idea of the KMP, but this time searching for Q in itself.


The main difference now is that we’ll compute lsp on the fly. While searching for Q in Q itself, when we have a matching prefix, we update the lsp. When we have a mismatch, instead of resetting the search to the beginning of Q, we skip some characters as given by the lsp. This works because j < i and lsp(k) has already been computed for all k < i. In the following code, the only changes we performed were adding the lsp assignment and having T = Q.

def precompute_lsp(Q):

    qlen = len(Q)

    lsp = [0]*qlen

    i = 1
    j = 0
    while i < qlen:

        if Q[j] == Q[i]:
            j += 1
            i += 1
            # We only added this line
            lsp[i - 1] = j
        elif j > 0:
            # Fall back within Q, keeping i in place
            j = lsp[j - 1]
        else:
            # No fallback possible: lsp[i] stays 0, move on
            i += 1

    return lsp

We can use exactly the same arguments from the main code of KMP to prove this routine is O(len(Q)). The total complexity of KMP is O(len(Q) + len(T)).
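
As a quick check, running it on the query from our earlier example reproduces the table we computed by hand:

print precompute_lsp('ababac')  # [0, 0, 1, 2, 3, 0]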

Solving our problem

At this point we can observe that the problem we were initially trying to solve, that is, finding LSP(A, B), is at the core of the KMP algorithm. We can search for B, the prefix, in A, the suffix. If we keep track of the indices i and j, we just need to detect when i reaches the length of the text (here A): at that point, j is the length of the prefix of B that matched a suffix of A.

We can generalize our kmp function to accept a callback that yields the indices of the strings at any step:

def kmp_generic(Q, T, yield_indices):

    lsp = precompute_lsp(Q)

    i = 0
    j = 0

    qlen = len(Q)
    tlen = len(T)

    while i < tlen:

        if j < qlen and Q[j] == T[i]:
            j += 1
            i += 1
        elif j > 0:
            j = lsp[j - 1]
        else:
            i += 1

        yield_indices(i, j)

Now, if we want an array with the indices of all occurrences of Q in T, we can do:

def kmp_all_matches(Q, T):
    matches = []
    qlen = len(Q)
    def match_accumulator(i, j):
        if j == qlen:
            matches.append(i - j)
    kmp_generic(Q, T, match_accumulator)
    return matches
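
For example (hand-checked):

print kmp_all_matches('ab', 'abab')  # [0, 2]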

In kmp_all_matches we used a nested function as our callback. We can also use kmp_generic to solve our original problem:

def longest_suffix_prefix(suffix, prefix):
    slen = len(suffix)
    max_match = [None]
    def max_matcher(i, j):
        if max_match[0] is None and i == slen:
            max_match[0] = j

    # Search prefix in suffix
    kmp_generic(prefix, suffix, max_matcher)

    return 0 if max_match[0] is None else max_match[0]

Note that the nested function can read variables from the scope of the outer function, like max_match and slen, but it cannot rebind them. To share data with the nested function we have to work with references, which is why we define max_match as a single-element list.
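
Running the KMP-based version on our earlier hand-checked example gives the same answer as the naive one:

print longest_suffix_prefix('ababa', 'abac')  # 3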


I used to participate in programming competitions. One of the most interesting parts of the problems is their statements. It’s usually an interesting random story from which you have to extract the actual computer science problem. A lot of the time, though, the problem writer thinks of a problem first and then creates a story around it. It’s nice when we can find real-world problems that can be modelled as classic computer science problems.

I’ve used the KMP algorithm several times in the past but never took the time to study it carefully until now, which only happened because I decided to write it up as a post.


[1] Wikipedia – Knuth–Morris–Pratt algorithm
[2] Annual Growth Rings