John Hearn

Matrices of Boolean Nets

Tue, 17 Dec 2024 16:03:00 GMT

When studying the boolean networks in the last post I wondered what the state update rule would look like as a matrix. Update rules like this can always be represented by matrices. Working out how to do it led to some nice links to quantum computing and category theory.

A few things led me to the way to construct the boolean net matrices. I knew that representation theory says it should be possible, but that doesn’t make the matrices any easier to construct. Trial and error didn’t really get me anywhere. Then I found this article describing a simple boolean matrix construction, which reminded me that this is deeply related to the construction of qubit circuits in quantum computing, something I already know a bit about, but without all the annoying quantum restrictions.

Later I found this paper describing a different encoding: The logic of Boolean matrices - C. R. Edwards. It might be worth reviewing to see if that method maintains the same categorical structure.

Boolean algebra

The thing to notice is that a bit ($F$ or $T$, i.e. 0 or 1) can be represented as a vector: $F = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $T = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$

An identity (NoOp) gate is then $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and a NOT gate is $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

To operate on multiple bits at the same time we tensor them together by applying the Kronecker product. So the boolean number $10$ is equal to $T \otimes F = (0, 0, 1, 0)^T$, for example. We end up with a vector of length $2^n$ with an entry for each possible value of the binary number, in order¹.

¹ Note the most significant bit is on the left in these expressions. The ordering tripped me up a couple of times.

Higher order operators are easier here than in the quantum case. We can just write out the result that we want as columns in a matrix, where each column represents $00$, $01$, $10$, $11$, respectively. So for example the AND gate is simply: $\begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$.

When we multiply a pair of binary digits (tensored together) by this matrix then we get the answer we want. Example: $\mathrm{AND} \cdot (T \otimes F) = F$.
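As a minimal sketch of the encoding described above, the vectors and gates can be written out directly in Julia. The names `F`, `T`, `ID`, `NOT`, `AND` and the helper `tensor` are illustrative choices made here, not the helpers used for the circuits later in the post.

```julia
using LinearAlgebra

# Basis vectors for a single bit: entry 1 is "false", entry 2 is "true"
F = [1, 0]
T = [0, 1]

# One-bit gates
ID  = [1 0; 0 1]
NOT = [0 1; 1 0]

# Two-bit AND: the columns are the inputs 00, 01, 10, 11 in that order
AND = [1 1 1 0;
       0 0 0 1]

# Tensor bit vectors together, most significant bit first
tensor(bits...) = reduce(kron, bits)

@show NOT * F               # the vector T
@show tensor(T, F)          # the two-bit state 10
@show AND * tensor(T, F)    # the vector F
@show AND * tensor(T, T)    # the vector T
```

Each column of a gate is just the encoded output for one input, so multiplying by a basis vector picks out the matching column.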

The beautiful thing is that we can then use wiring diagrams to build more elaborate calculations, like we do in quantum computing circuits. It works because this boolean algebra is an example of a symmetric monoidal category having a braiding that is essentially a SWAP operation between bits. This category allows us to compose operations by shifting bits into place with an appropriate sequence of swaps. This is very similar to the quantum computing equivalents but without the restriction to unitary (reversible) operations. Dropping this restriction means that we can give the category copy² and discard³ operations, turning it into a copy-discard category⁴.

² Also sometimes called clone, as in the no-cloning theorem.

³ Or drop or delete, as in the no-deleting theorem.

⁴ Also called a garbage share category but whoever thought up that name should take a deep look at themselves.

Examples like this make applied category theory such a powerful thinking tool.
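To make the copy, discard and swap operations concrete, here is a small sketch under the same bit encoding. The helper `apply_at` is a hypothetical stand-in written for this example (it is not the `lift` function used later); it pads a gate with identities so that it acts on a chosen position of a larger register.

```julia
using LinearAlgebra

F, T = [1, 0], [0, 1]

CLONE = [1 0;
         0 0;
         0 0;
         0 1]            # F -> F⊗F and T -> T⊗T
DROP  = [1 1]            # any bit -> the empty (scalar) state
SWAP  = [1 0 0 0;
         0 0 1 0;
         0 1 0 0;
         0 0 0 1]        # exchanges 01 and 10
AND   = [1 1 1 0;
         0 0 0 1]

eye(n) = Matrix{Int}(I, n, n)

# Hypothetical helper: apply `gate` to an n-bit register, starting at
# bit i (1 is the most significant bit)
function apply_at(n, i, gate)
    arity = round(Int, log2(size(gate, 2)))    # number of input bits of the gate
    kron(kron(eye(2^(i - 1)), gate), eye(2^(n - i - arity + 1)))
end

@show CLONE * T                                     # T⊗T
@show DROP * T                                      # the scalar state [1]
@show apply_at(3, 2, SWAP) * kron(kron(T, T), F)    # 110 becomes 101
```

Because `CLONE` and `DROP` are not square, they change the number of bits in the register, which is exactly what a unitary quantum gate cannot do.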

Kauffman’s example

A simple circuit for Kauffman’s 3 node example is fairly easy to construct. The three operators correspond to the three truth tables, one for each node. The node value is cloned/copied. This is not allowed in quantum computing circuits but no problem here.

Kauffman’s example circuit

This circuit does a trick to avoid an extra copy. The copied bit is dropped at the end to make a routine with matching inputs and outputs. This allows multiple similar operations to be chained together to evolve the system.

Some more tricks are required⁵ to implement the swapping and shifting of registers in code but luckily I had already done something similar for the quantum computing simulator I wrote some years ago. The CLONE and DROP gates were relatively easy to think through and the tensor product just worked perfectly for all these unfamiliar operations (there are no-cloning and no-deleting restrictions in quantum circuits which make things harder in that context).

⁵ I went back again to this paper which I found helpful to understand this process.

Of course with this simple example (and the benefit of hindsight) the matrix corresponding to this entire circuit could have been constructed from scratch from the extended truth table. However this way the actual relations themselves are baked into the circuit. The final matrix, let’s call it $M$, will be the same anyway. Here’s what it looks like in code (note that matrix multiplication is from the right, the opposite of the circuit diagram):

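# One operator per node, built from that node's row of the truth table
OPX = op(kauffman_truth_table[1,:])
OPY = op(kauffman_truth_table[2,:])
OPZ = op(kauffman_truth_table[3,:])

# Composed right to left; the trailing * continues the product on the next line
M = lift(3,3,DROP)*lift(3,3,OPZ)*lift(3,2,OPY) *
    row_swap(4,2)*row_swap(4,1)*lift(3,2,OPX) *
    row_swap(4,2)*lift(3,3,CLONE)*row_swap(3,2)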

> 8×8 SparseMatrixCSC{Int64, Int64} with 8 stored entries:
    ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅
    1  ⋅  1  1  ⋅  ⋅  1  ⋅
    ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
    ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1
    ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
    ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
    ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅
    ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅

Quite a lot of work went into getting that matrix mostly full of zeros!

Testing out the circuit we do get back the results from the original paper:

for state in 0:7
    s = Bool.(digits(state, base=2, pad=3))
    S = transform(s)
    println(s => measure(M*S))
end

[0, 0, 0] => [0, 0, 1]
[0, 0, 1] => [1, 0, 1]
[0, 1, 0] => [0, 0, 1]
[0, 1, 1] => [0, 0, 1]
[1, 0, 0] => [0, 0, 0]
[1, 0, 1] => [1, 1, 0]
[1, 1, 0] => [0, 0, 1]
[1, 1, 1] => [0, 1, 1]

The whole point of doing this wasn’t just to build the circuit in a new way. It was to use the power of linear algebra to tell us something about the network and its properties. Since $M$ is a matrix we can calculate its eigenvalues and eigenvectors. If we can find eigenvalues with a value of exactly 1 then we will have found stationary points in the network, i.e. cycles.

The eigenvalues $\lambda = \{ \lambda_1, \lambda_2, \ldots, \lambda_8 \}$ of $M$ turn out to be:

λ = eigvals(M)

8-element Vector{ComplexF64}:
   -0.5 - 0.8660254037844388im
   -0.5 + 0.8660254037844388im
    0.0 + 0.0im
    0.0 + 0.0im
    0.0 + 0.0im
    0.0 + 0.0im
    0.0 + 0.0im
    1.0 + 0.0im

Now it just so happens that those non-zero values are the cube roots of unity. Since the eigenvalues of the matrix cubed are the cubes of the eigenvalues of the matrix, the non-zero eigenvalues of $M^3$ should all be ones.

λ = eigvals(M^3)

8-element Vector{Float64}:
   0.0
   0.0
   0.0
   0.0
   0.0
   1.0
   1.0
   1.0

The last 3 eigenvalues are now exactly one! So let’s look at the last 3 eigenvectors and convert them back to the corresponding states:

The kimatograph depicted in the original paper. In more modern language we might call this a state (transition) diagram, a pseudoforest or a functional graph. The states $001$, $101$ and $110$ are in the cycle. One state is missing from the diagram 🤷.

[measure(ν) for ν in eachcol(eigvecs(M^3)[:,6:8])]

3-element Vector{Vector{Bool}}:
 [1, 1, 0]
 [0, 0, 1]
 [1, 0, 1]

These are exactly the states which form the only cycle in the example kimatograph.
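The same conclusion can be cross-checked without any floating point arithmetic. The sketch below rebuilds the transition matrix directly from the eight state transitions listed earlier (encoding each state as an integer, most significant bit first, which is an assumption of this example) and uses the trace: the trace of $M^k$ is the sum of the $k$th powers of the eigenvalues and also counts the states that return to themselves after $k$ steps.

```julia
using LinearAlgebra

# next_state[s + 1] is the successor of state s, read off the printed transitions
next_state = [1, 5, 1, 1, 0, 6, 1, 3]

# Column s + 1 has a single 1 in the row of the successor of s
M = zeros(Int, 8, 8)
for s in 0:7
    M[next_state[s + 1] + 1, s + 1] = 1
end

@show tr(M)      # states that are fixed points
@show tr(M^2)    # states returning after 2 steps
@show tr(M^3)    # states returning after 3 steps

# The states sitting on the diagonal of M^3 are the ones on the 3-cycle
cycle_states = findall(==(1), diag(M^3)) .- 1
@show cycle_states
```

Integer matrix powers are exact, so this avoids the rounding noise that a numerical eigensolver can produce for the repeated zero eigenvalues.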

It is not coincidence that the other eigenvalues are zero. The fact that is also no coincidence. There is much more to learn about the spectral properties of these matrices.

Wolfram’s cellular automata

Once that had worked, I was interested in matrix operations for bigger boolean networks. We can apply the same process to Wolfram’s cellular automata, which are just specialisations of the general boolean network. This network can be as big as you like. How can we go about building it systematically? This was an interesting problem to mull over and I eventually worked it out. First we can come up with a ternary operator representing the rule.

Wolfram rule operation

This operator has 3 inputs, two of which just pass through to the outputs. The other output is the result of the Wolfram coded rule. The reason for passing two of the inputs through is to be able to chain the operators together easily. Of course we could just copy the values but this way seems neater. To create the complete circuit it also has to wrap around the ends to form a periodic boundary. That’s done by just copying the bottom bit to the top and the top bit to the bottom. The extra two bits are dropped at the end to make a composable operation. So we’re left with this circuit:

Wolfram circuit

Note that there are input states and output states so the operation can again be chained by matrix multiplication on the left.

What does this look like in code? Something like this although I expect there are many ways to do it:

function Rule(code::Integer)
    outputs = binary(code, 8)
    R = zeros(Bool, 8, 8)
    for k in 0:7
        X,Y,Z = reverse(binary(k,3))
        R[:,k+1] = transform([outputs[k+1], Y, Z])
    end
    sb(R)
end

function update(N, rule)
    # Clone the ends and shift to top or bottom respectively
    WRAP = row_shift(N+2,1,N+1)*lift(N+1,N,CLONE)*row_shift(N+1,N+1,2)*lift(N,1,CLONE)

    # Build the rule gate
    R = Rule(rule)

    # Chain the rule N times from the left
    Rᴺ = foldl(*, lift(N+2-2,i,R) for i in N:-1:1)

    # Drop the last two bits
    PRUNE = lift(N,N,DROP)*lift(N+1,N+1,DROP)

    # Build the full circuit
    PRUNE*Rᴺ*WRAP
end

Took a while to get the bit order of all this right but the idea was right. I did some nice testing and got it working as it should. As a big test I used the circuit to build and run different systems and compare to existing catalogues. This was the result and I’m pretty happy with it.

Excerpt from summary grid on the Wolfram website. The slight differences are due only to slight modifications to the starting states.

My results

Looking at the spectra of the update matrices is pretty interesting. As a simple example we can discover the eigensystem of the different rules.

In fact we can look at all the eigenvalues of the same group of rules as above. These are plots of the complex plane from (-1,1) in both axes.

A few interesting things immediately jump out. First, every rule has zero eigenvalues and the rest are on the unit circle. This is consistent with what we saw for the simple boolean network above.

The rules with eigenvalues in $\{-1, 0, 1\}$ correspond to the simplest rules: 123, 127, 128, 132, 136. In fact, rules with widely and evenly spaced eigenvalues seem to lead to simple periodic behaviour. Rules 130 and 134 are examples.

Rules 122, 126 and 129 share the same eigenvalue distribution and have triangular run-in evolutions, although they all differ in detail. Rule 131 also has uneven eigenvalues and a triangular update rule.

Rules known to be complex like 124 and 137 have a large number of eigenvalues. There can be a maximum of $2^N$ eigenvalues, so the natural question is whether any rule reaches this maximum. Rule 120 has many eigenvalues but the pattern is simple. I suspect that in this case the behaviour is highly dependent on the initial state.

There is a strong resemblance here with the permutation matrices and that’s not surprising. The update rules are, in a sense, permutations with degeneracies. Those degeneracies are the “spider’s legs” in the state diagram. The results suggest that those degeneracies correspond to the zero eigenvalues of the matrix. The degeneracies are zero rows in the transition matrix. Powers of the transition matrix essentially zero out rows associated with the spider’s legs of the permutations, eventually converging on the cycles of a true permutation.

The spectrum of the update rule

We can remove zero rows and the corresponding columns from the transition matrix without changing the cycle dynamics. This essentially means removing the “leaves” of the state diagram. The outer leaves are sometimes called “Eden” states because they can only be initial states and cannot be reached by any other transition. Continuing this process will result in a permutation matrix with specific cycle characteristics.

We can formalise this: let $Q_k$ be the matrix that, when multiplied with another matrix, removes its $k$th row. The matrix which does this has $n-1$ rows and $n$ columns. To construct it, take the $n \times n$ identity matrix, remove row $k$ and set the $k$th column to zeros. The result looks something like this for the case where $n = 4$ and $k = 2$:

$$ Q_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} $$

If we have a matrix, $M$, then multiplying it from the left by $Q_k$ will remove one row. To remove the corresponding column it turns out that we have to multiply from the right by the transpose. So the final matrix is given as $ M^{\prime} = Q_k M Q_k^T $.

To remove multiple rows and columns, indexed as $k_1, k_2, \ldots, k_m$, just repeat the process: $ M^{\prime} = Q_{k_m} \cdots Q_{k_1} M Q_{k_1}^T \cdots Q_{k_m}^T $. Of course, the sequence can be combined so that $ Q = Q_{k_m} \cdots Q_{k_1} $ and by basic matrix rules⁶ we have $ M^{\prime} = Q M Q^T $.

⁶ $(AB)^T = B^T A^T$ (wiki)

This matrix operation is different from the operations we’ve seen before. It doesn’t discard bits but rather discards individual states or, more specifically, unreachable states. I’m not sure what the categorical construction is for this but it’s hard to see how it would fit on the kind of wiring diagram we’ve been using.

This procedure might result in more rows with all zeros. These are the new unreachable leaf states that replace the removed one. The procedure is essentially pruning the leaf states one by one until we’re left with a pure permutation matrix and its cyclic structure. This construction has the effect of removing the zero eigenvalues from the matrix $M$.
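The pruning procedure can be sketched in a few lines. The function names `Q` and `prune` are invented for this example, and the matrix is the 3 node Kauffman transition matrix rebuilt from its printed state transitions (states encoded as integers, most significant bit first).

```julia
using LinearAlgebra

# Q(n, k): the n×n identity with row k deleted, so Q(n, k) * M drops row k of M
Q(n, k) = Matrix{Int}(I, n, n)[setdiff(1:n, k), :]

# Repeatedly remove a state whose row is all zeros (it has no predecessor),
# together with its column, until no such state is left
function prune(M)
    M = copy(M)
    while true
        k = findfirst(i -> all(iszero, M[i, :]), 1:size(M, 1))
        k === nothing && return M
        q = Q(size(M, 1), k)
        M = q * M * q'
    end
end

# Transition matrix of the Kauffman example
next_state = [1, 5, 1, 1, 0, 6, 1, 3]
M = zeros(Int, 8, 8)
for s in 0:7
    M[next_state[s + 1] + 1, s + 1] = 1
end

P = prune(M)
@show size(P)
@show P
```

For this example the loop stops at a 3×3 permutation matrix acting on the three states of the cycle, with every remaining row and column containing exactly one 1.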

I think that this is related to Krohn–Rhodes theory somehow. That theory relates this kind of cellular automata with actions of semigroups. I’m trying to learn about that stuff but it’s hard. There is a categorical treatment of the Krohn–Rhodes theory⁷ that I would like to look at. This stuff spirals off into all kinds of directions that I find fascinating but time is finite. Nevertheless it’s something to put on the todo list.

⁷ Wells, Charles. ‘A Krohn-Rhodes Theorem for Categories’. Journal of Algebra 64, no. 1 (May 1980): 37–45. doi.org/10.1016/0021-8693(80)90130-1.

I bought the book by John Rhodes which tries to explain this and the scientific and philosophical ramifications. It’s called Applications of Automata Theory and Algebra via the Mathematical Theory of Complexity to Biology, Physics, Psychology, Philosophy, and Games - John Rhodes; Chrystopher L. Nehaniv (Ed.) (2009)

So what? The useful thing about this is only seen when considering processes more generally. There is an argument that says that processes can be designed to naturally be attracted towards desired behaviour, wherever they may start. We might say that a particular iterative approach, akin to the update rule, might be likely to result in a stable end state (which may be cyclic but identifiably stable). This has been investigated and used in the field. For example, Barry O’Reilly’s work on Residuality Theory bases some of its philosophy on this kind of stability.

Nothing is black and white

Another interesting thing that comes out naturally from the categorical approach (and the semigroup approach actually) is that the states don’t have to be restricted to $0$s and $1$s. It turns out that not only are our update rules examples of cd-categories, they are also examples of Markov categories. The additional restriction here is that the weights of the states sum to $1$. This means that rather than apply boolean logic to our circuits we can apply probabilistic (fuzzy) logic. Take the AND operator. In category theory this is often called the conjunction or join operator and it’s fundamental to these (cartesian) monoidal categories. In the world of probabilities this same construction represents the probability of two events occurring together. Likewise the OR operation corresponds to the probability of either event happening or both. The complement rule is the NOT operator⁸.

⁸ What we don’t have with this construction is conditionals. I think it’s possible; I just haven’t needed it.

The beautiful thing about these categories is that they automatically capture dependence between probabilities. Take a look at this:

The AND gate has two inputs and they are both ½. So then why does it output ½ in one case and ¼ in the other? The reason the outputs are different is because the COPY operation has created a dependency between them. In the first case there are two independent variables and therefore 4 possibilities and hence the probability of both being 1 is 1 in 4 or ¼. In the other case, since the inputs are effectively the same variable, they necessarily have the same value so the only possibilities are $00$ or $11$ and therefore the probability of both being 1 is ½. I think that’s pretty nice.

Anyway, the upshot is that we can create variables as real numbers in the form $\begin{pmatrix} 1 - p \\ p \end{pmatrix}$ where $0 \le p \le 1$. $F$ and $T$ (as defined above) correspond to the special cases where $p = 0$ and $p = 1$ respectively. The construction automatically ensures that the state elements sum to 1. The tensoring ensures that this is honoured for any number of variables.
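The two cases can be reproduced numerically in a few lines. This is a sketch that assumes a probabilistic bit is stored as the pair (probability of 0, probability of 1); the helper `bit` and the explicit `AND` and `CLONE` matrices are written out for this example.

```julia
AND   = [1 1 1 0;
         0 0 0 1]
CLONE = [1 0;
         0 0;
         0 0;
         0 1]

# A bit that is 1 with probability p (assumed encoding for this sketch)
bit(p) = [1 - p, p]

half = bit(0.5)

# Two independent variables feeding the AND gate
independent = AND * kron(half, half)

# One variable copied into both inputs of the AND gate
copied = AND * CLONE * half

@show independent    # probability of 1 is a quarter
@show copied         # probability of 1 is a half
```

The only difference between the two lines is whether the two-bit state is a product of separate variables or the image of one variable under `CLONE`, which puts all of its weight on 00 and 11.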

Now that we’ve done it we can just plug real numbers into our cellular automata and see what happens:

The update rule is exactly the same. It just works. In this case the initial state in each case is set to a single value of $0.6$ in the center cell, rather than $1$ as in the diagram above. The cells are grey-scaled, 0 is white and 1 is black, so 0.6 is a medium grey. Notice that the grey cells emanate away from the starting cell and seem to be superimposed onto a background pattern. I think this nicely shows how the update rule permeates through the cells.

There are all kinds of experiments to be done with this. What if multiple cells are initialised to 0.6 in the initial state?

You can see how the effects of the initial state radiate and seem to interact with each other and create interference patterns.

Talking of interference patterns, this category will work on complex numbers too, a sort of analog to a quantum state. Of course, true quantum states don’t have copy and delete operators so let’s just call them complex states? Our bits will be defined this time as $\begin{pmatrix} \sqrt{1 - |z|^2} \\ z \end{pmatrix}$ where $z$ is complex and $|z| \le 1$. In this case the squared magnitudes sum to 1 and the tensor product will honour the L2 norm here too. Let’s generate the same diagrams with this new input state. Remember the update rules are still boolean matrices and haven’t changed at all.

This first one shows the magnitude of the complex states with a simple initial state of . Again a 60% magnitude but with an irrational phase:

And this one shows the phases:

There is a very similar kind of pattern emerging in both the magnitude and phase of the state vector. What if we have two cells set in the initial state?

And this one shows the phases:

Some nice patterns emerge as the two initialised elements interfere with each other. You could play with this all day long. I haven’t looked at all at what patterns come out of this but I suspect that the phases sort of x-ray into the patterns, seeing through static or repeating background behaviour.

So what?

I spent quite a few hours on this, actually whole days, over several weeks. It’s been a fascinating journey and I’ve come away with a much deeper understanding of boolean networks and Wolfram’s cellular automata. I’ve got some additional intuition and respect for how it relates to category theory and a renewed interest in the Applied Category Theory book I bought some time ago but have been slow to read. Nothing in this article is particularly useful in itself. The categorical constructions of the update rules are far too inefficient to be practical but the string diagram reasoning and formalisms have become more familiar and I can see the real power. I didn’t really believe David Spivak when he said somewhere that category theory can be used even for philosophical reasoning, but I think I’m starting to see what he means. It’s a very powerful language but also formal and mathematical. That’s the sort of thing I should like, I guess. Anyway I’m going to leave it there, this write up is never ending. Glad it’s done now.

Kauffman’s Basic Gene Nets

Thu, 05 Dec 2024 06:53:00 GMT

Stuart Kauffman’s 1969 paper Metabolic Stability and Epigenesis in Randomly Constructed Genetic Nets introduced a simple model of gene interaction which seemed to capture certain characteristics of real biological systems. It was a precursor to later work on NK-models and cellular automata. Models like it have come up recently in different forms such as Stephen Wolfram’s work on biological computability¹² and Barry O’Reilly’s Residuality Theory.

¹ Why Does Biological Evolution Work? A Minimal Model for Biological Evolution and Other Adaptive Processes (Wolfram - May 2024).

² Foundations of Biological Evolution: More Results & More Surprises (Wolfram - December 2024).

I’m going to take the approach of just reproducing Kauffman’s initial results, as I like to do, including some additional work to link it to Wolfram’s classes. I have also done some work on the separability, divisibility and reversibility of these nets but that will be in another post.

The setup

The paper begins like this:

Proto-organisms probably were randomly aggregated nets of chemical reactions. The hypothesis that contemporary organisms are also randomly constructed molecular automata is examined by modeling the gene as a binary (on-off) device and studying the behavior of large, randomly constructed nets of these binary “genes”.

First some terminology. Kauffman refers to his net as “a binary (on-off) device”. We will call this a boolean network because each node has a boolean value attached to it, on or off. When he uses the word “gene” (in quotes) he’s referring to the individual nodes of the network.

There are $N$ nodes and each node is attached to $K$ other nodes. By way of example Kauffman uses a simple network with 3 nodes and 2 connections.

Comparison with figure from the original paper which includes the truth table for each node.

The example that Kauffman uses in the paper is a (N=3,K=2) network. By necessity it is a complete graph and is small enough to be worked through by hand.

The network is allowed to evolve in a particular way:

… the inputs to each binary “gene” may be chosen at random; the effect of those inputs on the recipient element’s output behavior may be randomly decided by assigning at random to each element one of the possible Boolean functions of its inputs.

For example, if a node has two inputs A and B (K=2) then the rule might be A OR B, or A AND B, or A XOR B, assigned at random, and the node’s value would be updated accordingly.

Alternatively, these rules can be represented in full generality as a truth table. Since for this study we don’t care what the actual rules are in terms of ANDs and ORs, we will construct the update rules simply by taking a random truth table.
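A random net of this kind can be sketched as follows. The function names `random_net` and `update`, the choice of distinct inputs per node, and the convention that the first input is the most significant bit of the table index are all assumptions made for this example rather than details taken from the paper.

```julia
using Random

# Build a random (N, K) net: K distinct input nodes per node, plus a random
# truth table with one row per node and one column per input combination
function random_net(N, K; rng = Random.default_rng())
    inputs = [randperm(rng, N)[1:K] for _ in 1:N]
    table  = rand(rng, Bool, N, 2^K)
    inputs, table
end

# Synchronous update: every node looks up its next value in its own row
function update(state, inputs, table)
    map(eachindex(state)) do i
        # Read the input values as a binary number, first input most significant
        idx = foldl((acc, j) -> 2 * acc + state[j], inputs[i]; init = 0)
        table[i, idx + 1]
    end
end

inputs, table = random_net(10, 2; rng = MersenneTwister(1))
state = rand(MersenneTwister(2), Bool, 10)
@show update(state, inputs, table)
```

Nothing in `update` depends on what the rules are in terms of ANDs and ORs, only on the table lookup.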

To reproduce Kauffman’s simple example in the paper we assign the following truth table to the network:

As an aside, there are $N \times 2^K$ entries in the truth table. This suggests to me that a tensor representation might be more efficient for simulation, especially for larger values of $K$.


0	1	0	0
0	0	0	1
1	1	0	1

Enumerating the small number of possible states in this example we find the following update rule:

Comparison with truth table depicted in the original paper.

 [0, 0, 0] => [0, 0, 1]
 [0, 0, 1] => [1, 0, 1]
 [0, 1, 0] => [0, 0, 1]
 [0, 1, 1] => [0, 0, 1]
 [1, 0, 0] => [0, 0, 0]
 [1, 0, 1] => [1, 1, 0]
 [1, 1, 0] => [0, 0, 1]
 [1, 1, 1] => [0, 1, 1]
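The listing above can be regenerated with a short lookup. This sketch assumes a 3×4 truth table (one row per node, columns for that node's two inputs reading 00, 01, 10, 11) and assumes each node takes the other two nodes as inputs in index order; both choices, and the name `kauffman_update`, are made for this example.

```julia
# Assumed layout: row i holds node i's output for its inputs 00, 01, 10, 11
truth_table = [0 1 0 0;
               0 0 0 1;
               1 1 0 1]

# Assumed wiring: each node listens to the other two nodes
inputs = [(2, 3), (1, 3), (1, 2)]

function kauffman_update(state)
    [truth_table[i, 2 * state[a] + state[b] + 1] for (i, (a, b)) in enumerate(inputs)]
end

for s in 0:7
    state = reverse(digits(s, base=2, pad=3))   # most significant bit first
    println(state, " => ", kauffman_update(state))
end
```

Under these assumptions the three rows read as "not Y and Z", "X and Z" and "not X or Y" respectively.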

When the update rule is applied multiple times, the values of the nodes enter a dynamic wholly defined by the truth table. Since this is a finite system, at some point the values will enter a cycle. The maximum possible cycle length is $2^N$ (or $8$ in our small example) but often the cycles will be shorter. By tracing the dynamics for each of the possible state values we can draw a graph of their trajectories³.

³ Kauffman calls these diagrams kimatograph in the paper but it doesn’t seem to be a term that caught on, although I quite like it.

The kimatograph depicted in the original paper. One state has been omitted for some reason.

Example kimatograph from paper

By looking at the diagram and following the arrows, it’s clear that all the states eventually fall into the same cyclic behaviour. There is a transient (or run-in) period between the initial state and the first state encountered on a cycle. Kauffman defines a confluent as the set of states leading into, or on, a cycle. In this case there is only one.
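The transient and the cycle can also be found by simply iterating until a state repeats. In this sketch the states of the 3 node example are encoded as integers (most significant bit first), and the names `advance` and `run_in_and_cycle` as well as the `limit` argument are hypothetical choices for illustration.

```julia
# Successor of each state 0..7 in the 3 node example, read off the update rule
next_state = [1, 5, 1, 1, 0, 6, 1, 3]
advance(s) = next_state[s + 1]

# Follow the trajectory from s until a state repeats.
# Returns (transient length, cycle length), or nothing if the limit is hit first.
function run_in_and_cycle(f, s; limit = 10_000)
    seen = Dict{Int,Int}()          # state => time step of first visit
    t = 0
    while !haskey(seen, s) && t < limit
        seen[s] = t
        s = f(s)
        t += 1
    end
    haskey(seen, s) ? (seen[s], t - seen[s]) : nothing
end

for s in 0:7
    println(s, " => ", run_in_and_cycle(advance, s))
end
```

Storing whole states in a dictionary is fine for tiny nets; for large nets the states would need a compact key, and the step limit guards against trajectories that are too long to finish.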

The results

In the paper, Kauffman studied the dynamics of large networks, keeping $K=2$. Just by way of example a graph representation of a network with $N=14$ might look like the following diagram.

This network has 14 nodes labelled from 1 to 14 ($N=14$). Each node has two incoming connections ($K=2$).

The situation clearly becomes more complicated very quickly and you’d expect the dynamics to be equally complicated. However, Kauffman goes on to say:

The results suggest that, if each “gene” is directly affected by two or three other “genes”, then such random nets: behave with great order and stability.

This is the crux of the study and mostly borne out but we will see that there are important exceptions. Kauffman realised this and the nuance became of central importance in the development of complex systems theory although it is understated in this early paper.

Take, for example, the histogram of cycle length. Kauffman discovered that the average cycle length for ($N=400$, $K=2$) nets with random truth tables was smaller than might be expected. We can reproduce the result fairly well.

A histogram of cycles detected in 400 node random boolean nets.

The histogram of cycle length depicted in the original paper. Note the cut-off at 55 and the preponderance of cycles of length 2.

Histogram nets N=400

There are immediately some things to note. First the similarity is remarkable, especially considering that this is a run of 200 random samples out of a possible $2^{400}$ states. There are peaks at 1, 2, 4, 6, 8, 10, 12, 16, 20, 24, etc. and the trend is decreasing with larger cycle lengths.

There are two important caveats, though.

The preponderance of cycles of length two in Kauffman’s results is not present in mine. Initially, I thought this was a bug in my code but I checked and double checked the calculations by hand and couldn’t find a problem. Moreover, closer study reveals a second, related caveat worth mentioning.

I have artificially chopped the histogram at the value of 55 to match Kauffman’s. However, my results include cycles of lengths much greater than 55, some having exceeded the limit of 10,000 which I had to add to stop the simulations running forever. I have individual instances of cycle lengths of 60, 62, 72, 78, 80, 84, 96, 102, 111, 116, 168, 186, 248, 282, 292, 320, 381, 438, 458, 482, 508, 1017, 1260, 1281, 1552, 3066, 3500, 3628, 6527 and eight more reaching the upper limit.

Suspiciously, my results typically show a similar number of cycle lengths above 55 as the difference in Kauffman’s and my results for 2-cycles. For example, the 2-cycle count from my experiment shown in the figure, was 9. The number of cycles of length greater than 55 was 39. Kauffman’s 2-cycle count was 40.

I don’t have Kauffman’s original code and I can’t formally prove my code is correct. Nonetheless I have found an interesting way of testing it which is instructive in itself. The idea is to narrow in on Stephen Wolfram’s well-known 1D cellular automata as a special case of Kauffman’s boolean nets.

Wolfram’s classes

It turns out that Wolfram’s famous study of elementary 1D cellular automata (which he discusses in great depth in his book “A New Kind of Science (2002)”) can easily be modelled as a boolean net. The good thing is that the results are visually quite distinctive and can serve as a reference.

To build it, start with the nodes arranged in a line. Connect each node to itself and its two immediate neighbours, making $K=3$. The system is closed by wrapping around the ends of the line so that the nodes at both extremes are connected together.

These are sometimes called periodic boundaries.

Make a truth table based on one of the $256$ possible rules. Wolfram codified them as numbers from 0 to 255. The truth table is identical for each of the nodes.

Example of the history of evolution of a boolean net with 100 nodes arranged in a line. Each time step is a row on the image. The pattern is cyclic and, in fact, is classified as “class 2” in Wolfram’s scheme.

Allow the system to evolve, keeping records on the history of states as the rows of an image. This is what you get, for example for rule 26:

Wolfram and others have studied these nets in depth and famously found that they exhibit behaviour that fits into 4 classes:

Wolfram’s four classes

The first class leads to a single terminal, stationary state. All 0s or all 1s. In terms of Kauffman’s nets, the cycle length in this class is 1 and we have seen that many rules lead to this outcome. Kauffman notes that removing rules which always resolve to 0 or 1⁴ significantly reduces the occurrence of this class.

⁴ He calls them contradictions and tautologies respectively.

The second class reduces to fixed patterns with simple periodic behaviour. Rule 26 shown above falls into this class. While it is a fairly complicated (using the term advisedly) pattern it is periodic and predictable.

The third class is entirely chaotic. The pattern does not converge and is only periodic to the extent of the finite state space (which may be extremely large). The appearance is of white noise and randomness and, in fact, one such rule (rule 30) has been used as a random number generator in Mathematica.

The fourth class demonstrates complexity. The patterns are non-periodic but also not chaotic. These configurations are on the edge of chaos, neither periodic nor chaotic. One interesting property that at least two of the rules in this category have is universality⁵, or in other words the ability to perform any calculation.

⁵ Universality in Elementary Cellular Automata - Matthew Cook

What’s the relevance of all this? Well, by converting the well-documented Wolfram rules to Boolean nets I’m much more comfortable saying that my simulation is working as expected. I’ve tried many different rules and the patterns are identical to the published ones.
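The construction described above (a line of nodes, each reading itself and its two neighbours, with periodic boundaries and one shared truth table) can be sketched directly. The function names `rule_table`, `ca_step` and `evolve` are invented for this example, and it uses the standard Wolfram numbering in which bit $k$ of the rule code is the output for the neighbourhood spelling $k$ in binary.

```julia
# Entry k + 1 is the output for the neighbourhood (left, centre, right)
# whose three bits spell k in binary
rule_table(code) = [(code >> k) & 1 for k in 0:7]

# One synchronous update with periodic boundaries
function ca_step(cells, table)
    N = length(cells)
    [table[4 * cells[mod1(i - 1, N)] + 2 * cells[i] + cells[mod1(i + 1, N)] + 1]
     for i in 1:N]
end

# Keep every generation so the history can be drawn as rows of an image
function evolve(cells, code, steps)
    table = rule_table(code)
    history = [cells]
    for _ in 1:steps
        push!(history, ca_step(history[end], table))
    end
    history
end

start = zeros(Int, 31)
start[16] = 1
for row in evolve(start, 90, 8)
    println(join(c == 1 ? '#' : '.' for c in row))
end
```

Rule 90 is used in the usage lines only because its nested triangle pattern is easy to recognise in a text printout; any code from 0 to 255 can be passed instead.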

On top of that, since the simple cellular automata studied by Wolfram are an example of a very simple Kauffman network, Kauffman networks in general must necessarily admit such classes of behaviour, including classes 3 and 4, the chaotic and universal ones. In this light, Kauffman’s discovery that the average cycle length remains smaller than might be expected is just a part of a much richer picture.

At this point I got side tracked onto a parallel way of modelling the networks which has some promise. The write up of that will have to wait until the next post.

Christopher Alexander and Network Theory

Sat, 16 Nov 2024 07:30:00 GMT

The aim of this article is not to give a complete picture of 60 years of Christopher Alexander’s work. I suggest you read his books for that. Rather it is to explain just enough to see the arc and place it in context.

In his early work, “Notes on the Synthesis of Form (1964)” (abbreviated in the following as NotSoF), Christopher Alexander explored a mathematical, network-based approach to design. He conceptualised design as an interrelated network of “misfit variables” to be optimised through computational analysis, algorithmically solving conflicts between interconnected elements.

His thesis was that to solve a design problem it should be reduced to multiple smaller problems and that, crucially, we have been limited to using existing concepts that have come down to us as “arbitrary historical accidents” rather than being an optimal description of the situation at hand. At this stage, Alexander’s approach was reductionist and analytical, believing that a design could be synthesised and studied through rigorous structural optimization. He suggested algorithms (precursors to community structure algorithms) to cluster elements together in such a way as to reduce the overall design problem to a hierarchical tree of simpler ones which in turn could be solved and optimised independently.

A figure from NotSoF which shows design variables (small black circles, so-called nodes in the network) and their interactions (lines). The nodes are grouped into clusters (represented as the larger circles).

In this work, Alexander calls the clusters of variables “diagrams” (which he would later rename “patterns”, the concept he became famous for) and this is where his thinking began to change. As he states in the preface to later editions:

As you can see, it is the independence of the diagrams which gives them these powers. At the time I wrote this book, I was very much concerned with the formal definition of “independence,” and the idea of using a mathematical method to discover systems of forces and diagrams which are independent. But once the book was written, I discovered that it is quite unnecessary to use such a complicated and formal way of getting at the independent diagrams.

What does modern network science tell us about this? To answer this we can use the example described in depth at the end of NotSoF.

The example from NotSoF, which in turn is taken from the study “The Determination of Components for an Indian Village” - Conference on Design Method (Oxford : Pergamon, 1963). The layout is force directed, meaning that related nodes are attracted toward each other and unconnected nodes are pushed apart. Sometimes clusters appear naturally but in this case none are easily identified. The graph is interactive, feel free to try and pull the clusters apart yourself.

This is the full and extensive design problem that Alexander studied. Although he was able to group it into 12 hierarchical clusters (represented by the colours of the nodes), the nature of the independence of those clusters is far from clear. Force directed layouts such as this one sometimes reveal underlying structure, but not in this case. Visual impressions can be deceiving, though, so we can apply modern summary statistics to this network and see how difficult it is to cluster at all.

Metric	Value	Interpretation
Average degree	~20	Each node, on average, interacts with 20 other nodes.
Edge density	0.14	About 1 in 7 of all possible interactions are present. This is a dense graph.
Average path length	2.0	On average there are just 2 hops between any pair of nodes.
Global clustering coefficient	0.06	Higher values mean better clusters. This is a very low value.
Algebraic connectivity	>6	Very high. By comparison, tree-like graphs, scale-free and small-world networks typically have values less than 1.0.

These metrics all point at this being a highly connected network without any clear clusters or community structure. Although the diagrams suggested by Alexander in the book do, to some measure, minimise the links between clusters, the result is still very far from the tree-like structure that is espoused.
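For readers who want to try these statistics on their own graphs, the following is a minimal sketch of how the first four rows of the table can be computed from an edge list using only the standard library. The toy edge list is hypothetical (it is not Alexander’s village data), and algebraic connectivity is left out because it needs an eigenvalue solver.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def edge_density(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return m / (n * (n - 1) / 2)  # fraction of possible edges that are present


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS from every node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def global_clustering(adj):
    """Transitivity: closed connected triples divided by all connected triples."""
    closed, triples = 0, 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical toy graph: a triangle (0, 1, 2) with one extra node hanging off node 2.
toy = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
summary = (average_degree(toy), edge_density(toy),
           average_path_length(toy), global_clustering(toy))
```

The breadth-first search is quadratic in the number of nodes, which is fine for a graph of this size but would need replacing for large networks.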

Alexander realised this, of course, and the beginnings of a perspective shift are clear in his essay the very next year “A City is Not a Tree (1965)”, where he claimed that cities and urban structures cannot be fully explained by hierarchical models or purely tree-like networks. He argued that successful cities are not trees but are “semi-lattices”: networks with overlapping connections that resemble the organic, intertwined complexity of real life. Here, Alexander is reflecting on and challenging his own reductionist assumptions, positing that cities function best with fluid, overlapping, and non-hierarchical connections, allowing diverse elements to interrelate dynamically.

Alexander’s reflexivity is to be admired. Remember he was writing this 60 years ago, when hard science was a raging success explaining the unexplainable and we were still optimistic for “Grand Theories of Everything” even though computers filled an entire room.

By the time Alexander finished “A Pattern Language (1977)”, his thinking had moved even further away from strict optimization schemes toward a more life-centered philosophy. He recognized that human environments thrive on the richness of overlapping networks rather than static or strictly hierarchical arrangements. Rather than optimizing individual design problems, he focused on identifying “patterns” — recurring, archetypal solutions that reflect timeless principles of habitability. Each pattern addressed a design aspect that could contribute to a “whole” design when combined with others in rich and intricate ways. Patterns were interconnected in what Alexander saw as a more organic network, linked by relationships that encouraged cohesion and harmony within the design. Rather than algorithmic clustering, these patterns represented design wisdom drawn from observation and experience, creating environments that intuitively supported human needs and preferences.

Taking this idea further, in “The Nature of Order series (2002–2005)”, Alexander considers “centers,” a concept intended to capture essential order that appears naturally within vibrant environments. Alexander believed these centers were interrelated and overlapping, embedded in a “living” structure that achieved harmony through wholeness rather than quantifiable connections.

By this point Alexander’s perspective has shifted entirely to a holistic one. He no longer uses tree structures and prefers words like “wholeness”, a shift that requires experience and expertise to understand the full picture.

I will admit that the mathematical approach of Alexander’s early work still very much appeals to the logical side of my brain and 30+ years of western education: if I follow this procedure, I can solve the problem in the abstract using maths and computers. This is one extreme of a philosophical framework which emphasises narrowing the vision and focusing in on breaking down and solving component problems and then reintegrating them. Some people use the left-hemisphere as a metaphor for this kind of thinking.

This is also the thinking that leads us to search for optimal efficiency through gradient descent to cost reduction and maximal productivity.

Ironically the left-right distinction is itself a left-hemisphere concept. Using such terms encumbers us with a recursive limitation.

However, shifting too dramatically from a left-hemisphere to right-hemisphere approach (System A to System B, as he called them) causes unresolvable tension, poles so far apart that there simply is no sweet spot. Christopher Alexander himself documented these problems in the book “The Battle for the Life and Beauty of the Earth: A Struggle between Two World-Systems (2012)”. System A drives for cold, modern efficiency and System B for life and harmony. I live in a System A building. How much better it could have been with a little System B. Too much System B and it wouldn’t even exist. As Daniel Schmachtenberger says “we’re creating a future that nobody wants”.

Can we apply modern efficiency to find life and harmony and get the best of both worlds? It doesn’t seem like an impossible goal to me.

In modern network theory, patterns, diagrams and even centres might be better considered as motifs rather than clusters determined by the density of their interactions. Motifs are repeating, stable structures with the potential to identify key functional properties embedded within a network of a particular type - certain configurations that always seem to work well together. Importantly, they do not need to be disconnected from the rest of the network but can mix and overlap with other nodes and motifs while maintaining their own essential structure. Although there are modern algorithms for detecting motifs, just like clusters, at this point Alexander was arguing that algorithms could not capture the fundamental quality of “life” that great designs embody; only an intuitive, holistic understanding could bring the sense of harmony aspired to.

There is a certain circularity in Alexander’s story. With the aim of trying to get away from subjective, accidental concepts to a more objective approach he ended up saying that only through expertise are we able to detect and evaluate the concepts in the first place.

I am more optimistic. I wonder if we will be able to complete the circle and detect natural but explainable motifs. For example, new language models seem to be able to glean underlying patterns that maybe even the experts are not aware of. I can envision a world, not too far away, where we will need both the algorithmic tools and expert evaluation to keep things in check.

A nice example of a similar circularity has happened in chess. Many believed that computers could not play chess due to the unfathomable number of possible moves, while others believed that even if they could, the game would become boring and predictable. Nonetheless modern engines have far surpassed humans and are now helping professional players to develop new intuitions. Rather than mechanising the game, modern encounters are quick to leave theory behind and explore more novel openings than they did before the engines existed.

The evolution of Christopher Alexander’s design philosophy also has an interesting analogy with the modern tension between systems thinking and complexity theory. When viewed through the lens of network theory we can place his journey in a firm conceptual framework and show his battles in a modern light. He moved from clustering design variables to a more holistic, fluid approach. This is very similar to the comparison by some of traditional and structural systems thinking approaches (such as lines and circles and systems dynamics) to more fluid approaches from Complexity Theory. Bearing in mind Alexander’s philosophical journey, I am eager to see the evolution and synthesis of both of these subjects.

Predictability and batch size

Sat, 28 Sep 2024 16:30:00 GMT

Some results from a little Monte Carlo simulation of delivery times posted on LinkedIn showed that predictability decreases with larger batch sizes. A little maths shows clearly why this is the case.

The model

The post compared a team delivering one work item per day with a team delivering 5 work items together every 5 days. The teams have a batch size of 1 and 5 respectively. In this situation we’d expect the mean delivery rate to be the same, namely 1 item per day on average. We’ll call the batch size $b$.

In our model, a team will deliver $n$ work items (the size of each item doesn’t matter, we’re only using the finishing rate in this model) in batches with $b$ items in each batch. Let’s call the number of batches $r = n/b$.

In this idealised model, Team 1’s probability of delivering each day is close to 1, so for this case let’s say $p = 1$. Team 2’s probability of delivering 5 work items on any particular day is $p = 1/5$. So in general on a given day a team delivers $b$ work items with a probability of $p = 1/b$.

We want to know the probability of delivering $n$ items of work in $T$ days. This requirement is captured by a Negative Binomial distribution and, luckily for us, all the maths has been previously worked out. We’ll break the total delivery time down into two parts. First, the number of days on which a team fails to deliver, that is the gaps between deliveries. We’ll call that number $X$. Secondly, we need the number of days on which it does deliver, which is $r$. We can then say that the total number of days is the sum of the delivery days, $r$, plus the non-delivery days, $X$. In other words, $T = r + X$. In this case $X$ then follows:

$$X \sim \mathrm{NB}(r, p), \qquad r = \frac{n}{b}, \quad p = \frac{1}{b}$$

And from this we can work out the distribution of $T$.

What does the model tell us?

Given the distribution defined above, and remembering that $r$ is a constant, then the expected value of $T$ is:

$$E[T] = r + E[X] = r + \frac{r(1-p)}{p} = \frac{r}{p} = n$$

So, in this model, to deliver $n$ work items the average total delivery time is independent of the batch size, as expected.

What about its variance? The variance of the total delivery time (a measure of predictability) is:

$$\mathrm{Var}[T] = \mathrm{Var}[X] = \frac{r(1-p)}{p^2} = n(b-1)$$

So the variance increases linearly with batch size. Since the greater the variance the less predictable the result we can say that the predictability decreases with batch size.

The variance also increases linearly with the amount of work. This captures the fact that the predictability decreases the further into the future you look, even if the delivery rate remains constant. This is just one of the factors comprising the cone of uncertainty that results solely from batch size.

A model is a model

In any real team the batch size won’t be constant but, all other things being equal, this model is good enough to tell us that regular delivery of smaller batches is preferable to larger ones.¹

¹ There are other reasons too, like the accumulation of changes increasing the probability of bugs but that is not covered in this model.

Also in real teams the work items are generally not completely independent. This can be due to internal team dynamics, one piece of work being a prerequisite of another, etc. The practical effect of this is to increase the variance further. One nice thing about the negative binomial as a modelling tool is that it can easily be tuned to the team’s actual data by increasing its variance slightly while keeping the mean constant. In the past I have found this to be a very good approximation of data gathered from real teams.

Another nice thing about using a known distribution is that we can go beyond normal Monte Carlo and use Bayesian inference for forecasting. Having a Bayesian model has the advantage of being deterministic and smoother, regulating and reducing noise or small sample effects that are sometimes evident in Monte Carlo simulations.
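As an illustration of what a closed-form Bayesian forecast could look like, here is a minimal sketch that assumes a Beta prior on the daily delivery probability, which is conjugate to the negative binomial likelihood. The function names and the example history are hypothetical, and this is one possible approach rather than the model used for the results in this post.

```python
def update_beta(alpha, beta, delivery_days, idle_days):
    """Conjugate update of a Beta(alpha, beta) belief about the daily delivery probability p."""
    return alpha + delivery_days, beta + idle_days


def expected_total_days(batches, alpha, beta):
    """Posterior mean of batches / p, using E[1/p] = (alpha + beta - 1) / (alpha - 1) for a Beta distribution."""
    if alpha <= 1:
        raise ValueError("E[1/p] is only finite when alpha > 1")
    return batches * (alpha + beta - 1) / (alpha - 1)


# Hypothetical history: 10 delivery days and 40 idle days, starting from a flat Beta(1, 1) prior.
alpha, beta = update_beta(1, 1, delivery_days=10, idle_days=40)
forecast = expected_total_days(batches=8, alpha=alpha, beta=beta)
```

Because the posterior is available in closed form, the forecast is the same every time it is computed, with no sampling noise.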

As always these results need to be used with caution and a full understanding of what they mean. Nonetheless having models to help us understand the mechanisms and principles behind our intuitions can be very handy at times.

Validation

Just to check that the model we’ve described actually works, the shaded area in the plot below shows 100,000 results from Monte Carlo simulations of delivery times for 40 pieces of work when 5 items are delivered together. The red line represents the negative binomial model described here. Hopefully, they match as perfectly now as they did when I wrote this.
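A check along these lines can be sketched with the standard library alone. The code below simulates the day-by-day delivery process directly and compares the sample mean and variance with the negative binomial values, which for 40 items in batches of 5 are a mean of 40 days and a variance of 40 × (5 − 1) = 160. The function names, trial count and seed are assumptions made for illustration; this is not the code behind the original plot.

```python
import random
import statistics


def simulate_delivery_days(n_items, batch_size, rng):
    """Days needed to ship n_items when a full batch ships with probability 1/batch_size each day."""
    batches_needed = n_items // batch_size
    p = 1.0 / batch_size
    days = 0
    delivered = 0
    while delivered < batches_needed:
        days += 1
        if rng.random() < p:  # today is a delivery day
            delivered += 1
    return days


def summarise(n_items, batch_size, trials=20000, seed=1):
    """Sample mean and variance of the total delivery time over many simulated runs."""
    rng = random.Random(seed)
    samples = [simulate_delivery_days(n_items, batch_size, rng) for _ in range(trials)]
    return statistics.fmean(samples), statistics.pvariance(samples)


mean_b5, var_b5 = summarise(40, 5)  # model values: mean 40, variance 160
mean_b1, var_b1 = summarise(40, 1)  # batch size 1 delivers every day, so the variance is zero
```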

Dunbar’s number deconstructed (again)

Tue, 17 Sep 2024 06:30:00 GMT

Last week I was looking at how Dunbar arrived at his famous number and learnt a lot about the effects of different types of linear regression and the limits of their predictive power. I concluded that, while the science was good, the wide error margin in the linear regression over log transformed data means that we need to apply a good deal of caution about arriving at any specific number.

The data I used was from one of his more recent papers¹ so my results were slightly different from his. It was tabulated in the paper and I was just too lazy to copy it out. Nevertheless I was left with some curiosity about how the results might have changed using the original data.

¹ Dunbar, Robin I. M., and Susanne Shultz. ‘Social Complexity and the Fractal Structure of Group Size in Primate Social Evolution’. Biological Reviews 96, no. 5 (October 2021): 1889–1906. doi.org/10.1111/brv.12730.

Reproducing Dunbar’s exact results

So yesterday, with a combination of OCR and careful copying, I recovered Dunbar’s original data from his paper. Here’s a reproduction of a plot of the data - satisfyingly similar to the original.

Comparison with Dunbar’s data taken from his original paper.

Not sure why those axis limits were chosen, maybe just rounding to orders of ten. In any case group sizes less than 1 make no sense and neocortex ratios beyond that of humans are not useful. From now on the plots will focus on the meaningful ranges. They’ll also have the vertical axis on the right so that predicted values are easier to read off.

To this graph Dunbar’s original fit is overlaid. Once again his famous number, 150, appears as the predicted value for human group size.

Figure 2: Same plot restricting the axes to a meaningful range and extending Dunbar’s fit to the measured neocortex ratio for humans (4.102).

Using this data and using exactly the same RMA (aka Geometric) linear regression we can recover his results almost² exactly. The next plot shows the agreement along with the 95% confidence interval as calculated following Rayner³.

² I believe there are some minor transcription errors in the original paper (or my reading of it) which produce a very slightly different line but the difference is minimal and changes nothing.

³ Rayner, J. M. V. ‘Linear Relations in Biomechanics: The Statistics of Scaling Functions’. Journal of Zoology 206, no. 3 (July 1985): 415–39. 10.1111/j.1469-7998.1985.tb05668.x.

Figure 3: plot extending the 95% confidence interval of the geometric mean regression out to the measured neocortex ratio for humans (4.102), confirming agreement with Dunbar’s original results.

This agreement was exactly what I was hoping for by using the original data and reproduces Dunbar’s result satisfactorily.

Residuals

Once again we can apply simple residual checks on the data to ensure we are satisfying the linear regression assumptions. The Q-Q plot shows the residuals close to a normal distribution.

Figure 4: Q-Q plot comparing the standardised residuals with a standard normal distribution.

The residuals plot itself shows some variance but no obvious heteroscedasticity.

Figure 5: Plot of the residuals.

Confidence in the mean

While the fit itself is statistically sound, if we look again at Figure 3, we can see that the logarithmic scale underplays the numerical range. Looking at the same graph with the vertical axis scaled linearly then the problem is clear.

Figure 6: plot extending the 95% confidence interval of the geometric mean regression out to the measured neocortex ratio for humans (4.102).

The confidence interval (which is only for the mean itself) is from under 100 to nearly 250. This is worse than the previous data because the slope is greater. I think you will agree that this is a wide error margin which is partially obscured by the log-log view. If we don’t believe the confidence interval calculation (and we shouldn’t, some doubt has been placed on it⁴, see below) then let’s take a different approach and compare.

⁴ Changyong Feng, Hongyue Wang, Naiji Lu, Tian Chen, Hua He, Ying Lü, and Xin Tu. ‘Log-Transformation and Its Implications for Data Analysis’. Shanghai Archives of Psychiatry 26, no. 2 (1 April 2014): 105–9. 10.3969/j.issn.1002-0829.2014.02.009.

Assuming the parameters of the fit are normal, as we must because it was an assumption of the regression itself and has been somewhat confirmed empirically, then we can generate random samples of regression lines drawn from those distributions. This is a so-called Monte-Carlo simulation of the regression distributions and gives us an alternative way to histogram the predicted values. The following plot shows 3000 such samples together with a histogram of the predicted values for human group size.

Figure 7: plot of Monte Carlo random samples of the regression line together with a histogram of the predicted value for human group size.

Again we find an average in the right range (see below) but once more with very wide error margins for the prediction. Actually it’s worse. The distribution is clearly heavy tailed towards higher group sizes. This is an expected effect of converting a Normal distribution on a log scale to a linear scale where it becomes Log Normal and therefore long tailed. Taking 1 million such predictions it’s evident that it conforms very closely.

Figure 8: histogram of 1 million predictions fitting very closely with a LogNormal distribution. 95% quantiles are shown as vertical lines at 71 and 368 respectively. Mean is 176.

Taking these results we can confirm that the 95% confidence interval has widened further, from 71 as a lower bound to 368 as the upper. Remember that this is still the range for the mean itself and does not take into account the variability in the samples.

There is another insight here too. The mean of the LogNormal distribution is not 150 but rather 176. The reason for this is exactly as stated in Feng et al. and can be confirmed by direct calculation. If the log-transformed value has mean $\mu$ and variance $\sigma^2$ (in natural-log units), the mean of the distribution when transformed back into natural units is not $e^{\mu}$ but rather $e^{\mu + \sigma^2/2}$. The variance shifts the mean.
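The upward shift of the mean under back-transformation is easy to check numerically. The sketch below compares the closed-form Lognormal mean with a plain Monte Carlo average for a value that is normally distributed on a base-10 log scale. The centre and spread used here are hypothetical round numbers chosen for illustration; they are not the fitted values behind Figure 8.

```python
import math
import random

LN10 = math.log(10)


def lognormal_mean_from_log10(mu10, sigma10):
    """Mean of 10**Z where Z ~ Normal(mu10, sigma10**2), via the natural-log parameters."""
    mu, sigma = mu10 * LN10, sigma10 * LN10
    return math.exp(mu + sigma ** 2 / 2)


def simulated_mean(mu10, sigma10, n=100_000, seed=0):
    """Monte Carlo estimate of the same mean, for comparison."""
    rng = random.Random(seed)
    return sum(10 ** rng.gauss(mu10, sigma10) for _ in range(n)) / n


mu10 = math.log10(150)  # hypothetical centre on the log10 scale (back-transforms to 150)
sigma10 = 0.18          # hypothetical standard deviation on the log10 scale

naive = 10 ** mu10                                  # back-transforming the log-scale mean
analytic = lognormal_mean_from_log10(mu10, sigma10)  # larger than naive whenever sigma10 > 0
sampled = simulated_mean(mu10, sigma10)
```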

Confidence in the prediction

We’ve stated several times that until now we have been looking at inferences related only to the prediction of the mean of the regression line, not including the variance in the data itself. Let’s take that into account now. We know the variance in the residuals and under the assumption they are normally distributed in log space then, in principle, we can model the residuals in the predicted value.

The normal distribution of the mean regression and the normal distribution of the residuals are combined using the relation $\sigma^2 = \sigma_{\text{mean}}^2 + \sigma_{\text{res}}^2$. We can then carefully extend this to linear space by expressing the predicted group size as a Lognormal distribution with parameters $\mu \ln 10$ and $\sigma \ln 10$.

The $\ln 10$ factors are necessary because the original data was transformed base 10 and the relation between the normal and the Lognormal is via the natural logarithm.

Using this new distribution the confidence interval has now widened still further. The 95% interval now ranges from 31 to over 740. The mean is now above 200 due to the upward effect of the increased variance.
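The widening can be sketched in a few lines, assuming the uncertainty in the mean and the residual scatter are independent and normal on the log10 scale, so that their variances add. All numbers in the example are hypothetical placeholders rather than the fitted values from this analysis.

```python
import math


def prediction_interval(mu10, sd_mean10, sd_resid10, z=1.96):
    """Approximate 95% interval in natural units for a prediction made on a log10 scale.

    mu10       -- predicted mean on the log10 scale
    sd_mean10  -- standard error of that mean (log10 units)
    sd_resid10 -- standard deviation of the residuals (log10 units)
    """
    sd_total = math.hypot(sd_mean10, sd_resid10)  # square root of the summed variances
    return 10 ** (mu10 - z * sd_total), 10 ** (mu10 + z * sd_total)


# Hypothetical values for illustration only.
mean_only = prediction_interval(math.log10(150), 0.18, 0.0)
with_residuals = prediction_interval(math.log10(150), 0.18, 0.30)
```

The interval is symmetric on the log scale, which is why the upper bound moves much further from the centre than the lower bound once it is expressed in natural units.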

Conclusion

With the original data it’s been possible to reproduce Dunbar’s famous results almost perfectly.

However, it has become even clearer that the extrapolation of the regression line to human scales is highly questionable. The prediction is well outside the existing data, leading to wide error margins, and the log transformation exaggerates the error still further. If we also take into account the residual variance in the data, the confidence interval of any prediction widens beyond useful limits (according to my analysis, the 95% interval is from 31 up to 740).

This is not an issue related to the type of regression nor any internal structure in the samples but rather the overwhelming error margins that make sensible prediction based on the available data unreasonable.

This is all consistent with the Stockholm University group’s “deconstruction”⁵ of not only the number itself, but the notion that such a number is even sensible to talk about.

⁵ Patrik Lindenfors, Andreas Wartel, and Johan Lind. ‘“Dunbar’s Number” Deconstructed’. Biology Letters 17, no. 5 (1 May 2021): 20210158. 10.1098/rsbl.2021.0158.

Investigating Dunbar’s number

Tue, 10 Sep 2024 07:30:00 GMT

In a seminal paper¹ from 1992, Robin Dunbar extrapolated from the brain measurements of different animal species and their typical social group sizes, to make a now famous prediction about human social group size. His analysis led him to what has become known as Dunbar’s number, 150, which has had multiple applications in different organisational contexts.

¹ Dunbar, R. I. M. ‘Neocortex Size as a Constraint on Group Size in Primates’. Journal of Human Evolution 22, no. 6 (1 June 1992): 469–93. doi: 10.1016/0047-2484(92)90081-j.

² Patrik Lindenfors, Andreas Wartel, and Johan Lind. ‘“Dunbar’s Number” Deconstructed’. Biology Letters 17, no. 5 (1 May 2021): 20210158. 10.1098/rsbl.2021.0158.

Prompted by a definition in this blog post, and out of pure curiosity, I searched for a bit more information and stumbled on a relatively recent article by him in which he gives an overview of his research. The article sounded a little snarky to me and the snarkiness continued in the comments. It seems he was responding to some research² published by a group at Stockholm University who ran a set of modern statistical tools over the original dataset and were unable to draw the same conclusions that Mr Dunbar had. This group then responded to Dunbar’s article by writing another article explaining their position in more detail.

Since working with real data and reading the work of different research groups is a fantastic way to learn, I decided to do a bit of amateur data analysis. I learned a lot and hopefully by writing down what I saw, I’ll learn even more.

Fitting the data

On a double log-log plot, my grandmother fits on a straight line - Fritz Houtermans

Much of the mathematics of this section can be found in Rayner, J. M. V. ‘Linear Relations in Biomechanics: The Statistics of Scaling Functions’. Journal of Zoology 206, no. 3 (July 1985): 415–39. 10.1111/j.1469-7998.1985.tb05668.x.

Here’s an example of a similar data set to the one used in the original studies. The data is tabulated in the paper but it’s a lot to copy out so I took a similar data set published by the same author in a later paper. Figure 1 is a plot of the data. For multiple species of animal, it shows the ratio of the size of the neocortex to the size of the rest of the brain on the horizontal axis and the average social group size for that species on the vertical axis.

Figure 1: a plot of neocortex ratio against average social group size for 39 animal species.

Neither Dunbar nor his critics used the data in this form, however; they used its logarithm. This is a common technique used under the assumption that it will make skewed data (as this is) appear more symmetrical (normal). There is no explanation of the reasons for the transform in this particular case. Nonetheless it has an important effect on the results. We’ll come back to this point later. Here’s the transformed data.

Comparison with Dunbar’s data taken from his original paper.

Figure 2: a plot of the log-transformed group size against the log-transformed neocortex ratio for the same 39 animal species.

I’ve presented it in exactly the same way as Dunbar’s paper. Although the data is slightly different the similarity is clear. Also, it does look a bit more like a straight line now and the variance is a little less pronounced. I’m not sure how valid this kind of subjective judgement is.

Figure 3 overlays Dunbar’s original result onto this data set and there is no surprise. The line (the equation for which is given explicitly in the paper) is projected out to show the predicted human mean group size which can be seen to be close to 150. This is the source of Dunbar’s number.

Figure 3: plot extending the 95% confidence interval of the geometric mean regression out to the measured neocortex ratio for humans (4.102).

I’ve tried to reproduce this result, but for now we note that for the analysis to work we have to make the strong assumption that any errors in the measurements are normally distributed. The choice of the log-log transformation is another strong assumption that we’re taking as fact for the moment, but we will return to it.

The parameters of the line are determined using linear regression which attempts to find the best straight line through the points. The question arises: what is the best straight line? It turns out that there are multiple ways of defining best fit and depending on which you choose you can get different results.

The most common type of linear regression is usually called Ordinary Least Squares³ (OLS), where the best line is chosen to be the one that minimises the sum of the squared vertical distances between each point and the line. This is the most basic regression technique taught in statistics classes.

³ Dunbar calls it LSR (Least Squares Regression) in his paper.

For more discussion about regression with known error weights in both axes for each sample see this paper. In this case we don’t know the error margins in the data.

The problem with OLS, as Dunbar points out in his original paper and again in his recent article, is that by minimising only the vertical distances there is an implicit assumption that there are no errors in the measurements on the x-axis, which is not true in this case.

To take these (unknown) error margins into account in both axes we can choose to minimise the geometric distance between the points and the line. This is called Geometric Mean Regression⁴ in much of the modern literature. The geometric distance is the area between the point and the line, the triangle spanning both the distance along the x-axis and the distance along the y-axis.

⁴ Dunbar calls it RMA (Reduced Major Axis) but, just as a heads-up, it turns out that the web page that Dunbar references in his article to explain RMA is wrong. For a neat overview of the main types of regression I’d recommend the paper by Xu, S. (2014). A Property of Geometric Mean Regression. The American Statistician, 68(4), 277–281. doi:10.1080/00031305.2014.962763.

One cited problem with geometric regression is the extra difficulty in the interpretation of confidence intervals for the parameters. I haven’t seen that as being any more of a problem for this type of regression than for any other but I might be wrong. Please read the paper if you are interested.

I did both geometric and OLS regression to compare the difference. Figure 4 shows the results of both types of linear regression transformed back into the original, untransformed scales.

This confirms Dunbar’s assertion that OLS considerably underestimates the slope of the regression line, compared to the geometric one.
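The relationship between the two slopes can be seen in a small sketch. The geometric mean (RMA) slope is the ratio of the standard deviations of y and x, carrying the sign of the correlation, while the OLS slope is that same ratio multiplied by the correlation coefficient, so its magnitude can never be larger. The four data points below are made up for illustration and are not the primate data.

```python
import math
import statistics


def _sums(x, y):
    """Centred sums of squares and cross-products."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols_fit(x, y):
    """Ordinary least squares: minimises squared vertical distances. Returns (intercept, slope)."""
    mx, my, sxx, _, sxy = _sums(x, y)
    slope = sxy / sxx
    return my - slope * mx, slope


def rma_fit(x, y):
    """Geometric mean (reduced major axis) regression. Returns (intercept, slope)."""
    mx, my, sxx, syy, sxy = _sums(x, y)
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return my - slope * mx, slope


# Hypothetical data, already on whatever scale the fit is to be done in.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 1.0, 3.0]
ols_intercept, ols_slope = ols_fit(x, y)
rma_intercept, rma_slope = rma_fit(x, y)
```

Both lines pass through the point of means, so a steeper slope also changes the intercept and therefore any extrapolated prediction.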

There are two other important things to note here beyond the difference between the two lines. First notice the upward curve of the regression lines over the original data. This is a result of the log transformations, essentially producing an approximately exponential fit due to the differing scales of the two axes.

The second thing to notice is the wide spread of the confidence interval for this line. This is also exaggerated by the log transformation. In fact, the validity of the confidence intervals after transformation is highly questionable. See Feng et al. for more details.

Notwithstanding the above caveats, what happens if we extrapolate the curve in an attempt to predict a typical group size for humans?

A direct extrapolation of this data using the geometric fit gives a group size prediction for humans of very approximately 100. This is well below Dunbar’s result but some difference should be expected due to the different data. It does however highlight the disproportionate effect of the log-log transformation on the prediction. It also puts into question the conceptual validity of extrapolating so far outside the available data. Remember that this is the spread of the confidence interval of just the mean itself and does not extrapolate from the variance seen in the rest of the data. This is a key point and an interesting realisation for me personally.

Until now we have collected a series of assumptions about the data that we’ll look at in turn in light of the analysis.

Log-log transformation

It is evident that by taking the log-log transformation of the data we have made the confidence intervals of the predictions extremely wide. Was this choice justified? Let’s take a look at the distributions of the data before and after the transformation.

They show that the transformation has gone some way to make the distributions more symmetrical, especially the group size. It might be interesting to see the results without taking the log of the neocortex ratio but it’s confirmed that this transformation does make sense in this case, even though it does introduce other problems elsewhere. It’s a trade-off I guess.

Distribution of errors

It was assumed that the errors in the measurements in both axes were normally distributed. Figure 7 takes a closer look at the distributions of the residuals of the log transformed data against their regression line.

The residuals are reasonably close to a normal distribution, which is one possible justification for the log-log transformation of the data.

We would like to check for the presence of heteroscedasticity, which can cause ordinary least squares estimates of the variance (and, thus, standard errors) of the coefficients to be biased, possibly above or below the true or population variance. A quick plot of the residuals doesn’t give us too much cause for concern.

We might like to check that more formally with a Breusch-Pagan test. Again this can be taken as further justification for the log-log transformation.
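For a single explanatory variable the test is simple enough to sketch by hand. The version below is Koenker’s studentised variant, an assumption about which form to use: regress the squared residuals on x, take the statistic as n times the R² of that auxiliary regression, and compare it with a chi-square distribution with one degree of freedom. The function name and example residuals are hypothetical.

```python
import math
import statistics


def breusch_pagan(x, residuals):
    """Studentised Breusch-Pagan test for one regressor. Returns (LM statistic, p-value)."""
    squared = [e * e for e in residuals]
    mx, ms = statistics.fmean(x), statistics.fmean(squared)
    sxx = sum((a - mx) ** 2 for a in x)
    sss = sum((s - ms) ** 2 for s in squared)
    sxs = sum((a - mx) * (s - ms) for a, s in zip(x, squared))
    # R^2 of the auxiliary regression of squared residuals on x.
    r2 = 0.0 if sss == 0 else (sxs * sxs) / (sxx * sss)
    lm = len(x) * r2
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square survival function with 1 degree of freedom
    return lm, p_value


# Hypothetical residuals whose spread grows with x (clearly heteroscedastic).
x = [float(i) for i in range(1, 11)]
growing = [xi if i % 2 == 0 else -xi for i, xi in enumerate(x)]
lm_stat, p_val = breusch_pagan(x, growing)
```

A small p-value suggests the residual variance depends on x. A large one, as the quick residual plot suggests for this data, gives no evidence against constant variance.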

Confirmation bias

More recently additional evidence has been gathered to support the original Dunbar number of 150. As we’ve seen, this number is at best a very rough guess of the mean and doesn’t include the potential variance over the extrapolated range. Nonetheless there is a very real possibility of confirmation bias in later results. The phenomenon starts when people latch on to these specific numbers and then send their correlates to Dunbar who then catalogues them as supporting evidence.

What have military units got to do with Christmas card lists? Christmas card lists are definitely a social phenomenon but it is far from clear which band they should be in. If the results had been 50, say, then it would have been taken as evidence for that smaller group size. It’s a post-hoc correlation.

Conclusion

So, after all this analysis, I learnt a great deal about regression considerations. Least squares never seemed so controversial until I started this. My results came in below Dunbar’s original results. This means nothing, of course, but it does lend some credence to the critics’ suspicions, which would be compatible with my little investigation here.

I also learnt a lot about what can and can’t be said about Dunbar’s number as a concept.

First, none of this says anything about the nested structure of the group sizes and other predictions made subsequently, which have also been used widely⁵.

⁵ The Team Topologies book uses Dunbar’s work as a basis for team size discussions. For them, “Dunbar’s number” (as defined in the glossary) actually refers to results from Dunbar’s later work on nested group sizes and is related but different to Dunbar’s number as usually understood.

Secondly, it is most definitely not a maximum or upper limit, as often stated. We need to be careful not to confuse the confidence interval of the sample mean with a measure of confidence of the prediction.

It is clear to me now that stating Dunbar’s number as 150 is misleading and misses a great deal of context. Even assuming the validity of the model and the regression, it is a predicted mean value extrapolated far outside the underlying data with extremely wide error bars and needs to be treated with a great deal of caution. The prediction has even wider margins to the point of losing some of its meaning as a useful concept.

On not having a map

Thu, 22 Aug 2024 19:30:00 GMT

Dora and Boots are on one side of an island and they want to get to the other side.

Dora the Exploradora and her monkey friend Boots are on a mission. They are on one side of a magical island and they need to get to the other.

She gets out her trusty map from her purple rucksack and sees that there are no roads or paths so she carefully plans the very best route she can with the information she has, avoiding the scary dark forest valley and skirting the rocky mountains. On the way they are met with unexpected obstacles: gullies, fallen trees and riddling trolls. Undeterred and with a little of our help, they finally get to their destination. We did it! We did it!

Some time later, Diego (and Baby Jaguar, presumably) are faced with the same challenge. Unfortunately it’s now really foggy. (“Niebla”, you say it!) Poor chap doesn’t have a map and only has a compass to guide him roughly in the right direction. Diego looks around and bravely heads along the easiest route he can see, roughly east, feeling his way across the island the best he can, avoiding only the obstacles he can see around him. He makes it to the other side.

Dora did great, she feels lucky she had her map! However, Diego made it too. They wonder how much having the map helped. Now that they’re across the island they look back at the route they took. With the benefit of hindsight they can see that perhaps the route she took based on the map wasn’t the best one and, in fact, it might have been better to go a different way.

It turns out that in this case Diego actually did better than Dora, even without the map. In fact he’s not far from the best route in hindsight.

By not sticking to a predefined plan he was able to navigate around those fallen trees and even some of those pesky trolls.

Is this true in general or just a fluke? Let’s try and find out.

So how much did the map help?

To answer this question I had an idea for a toy model to compare Dora’s and Diego’s adventures and did a bit of programming to try it out. The images above are actual screenshots. Here’s the setup.

The island is a grid. Getting from one point on the grid to an adjacent point has a difficulty representing the lie of the land (the difficulty of movement). This is calculated numerically as a cost function where the cost depends on the change in height¹. This corresponds to the intuitive notion that steeper sections are more difficult to traverse than flat sections and any bumps on the ground make the going harder.

¹ Mathematically, the cost between any two adjacent points is calculated as a function of the change in height between them.

² The use of the word “ruggedness” here is deliberate. We can draw on some sophisticated science to crystallise this idea and give it real-world meaning.

Then we consider two grids. One grid is set up like a map, an idealised representation of the contours of valleys and mountains and a few of the largest features. Another grid represents the actual terrain, with additional detail of smaller obstacles which are not marked on the map. The ruggedness² of the terrain is applied randomly but its magnitude is parametrisable.

For different levels of ruggedness, I ran a shortest path finding algorithm³ across the map (Dora’s planned route) and across the terrain using step sizes of varying lengths. Think about the size of these steps as corresponding to different amounts of fog for Diego: visibility is the second parameter. In every case I calculated the total difficulty over the whole journey.

³ Dijkstra’s algorithm.
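To make the setup concrete, here is a minimal Python sketch of Dora’s side of the experiment: plan the cheapest route on the map with Dijkstra’s algorithm, then pay for that same route on the real terrain. The cost function (one unit per step plus the absolute change in height), the tiny grids and the fixed destination cell are all assumptions made for the example, not the values used in the actual model.

```python
import heapq

def step_cost(h_from, h_to):
    # Hypothetical cost: one unit for the step plus the absolute change in height.
    return 1.0 + abs(h_to - h_from)

def cheapest_route(heights, start, goal):
    """Dijkstra's algorithm on a 4-connected grid; returns (total_cost, path)."""
    rows, cols = len(heights), len(heights[0])
    best = {start: 0.0}
    came_from = {}
    queue = [(0.0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry, a cheaper way here was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + step_cost(heights[r][c], heights[nr][nc])
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    path = [goal]
    while path[-1] != start:
        path.append(came_from[path[-1]])
    return best[goal], path[::-1]

def route_cost(heights, path):
    """Total difficulty of walking a given path over a given grid of heights."""
    return sum(step_cost(heights[a][b], heights[c][d])
               for (a, b), (c, d) in zip(path, path[1:]))

# The map shows only the big hill in the middle.
map_heights = [
    [0, 0, 0, 0],
    [0, 5, 5, 0],
    [0, 0, 0, 0],
]
# The terrain has an extra obstacle (height 3) that is not on the map.
terrain_heights = [
    [0, 3, 0, 0],
    [0, 5, 5, 0],
    [0, 0, 0, 0],
]

planned_cost, plan = cheapest_route(map_heights, (0, 0), (0, 3))
dora_cost = route_cost(terrain_heights, plan)                     # the plan, walked on real terrain
best_cost, _ = cheapest_route(terrain_heights, (0, 0), (0, 3))    # best route in hindsight
print(planned_cost, dora_cost, best_cost)  # 3.0 9.0 7.0
```

On the map the straight line along the top costs 3. On the terrain the unmarked obstacle pushes that same route to 9, while the detour along the bottom would have cost 7.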

Since this is a numerical model we can get a feeling for what’s happening by studying how the model behaves as the parameters (ruggedness and visibility) change. I ran the model thousands of times to get a distribution of results for different parameter settings.

Diego vs Dora

First let’s see how Diego’s and Dora’s experience compares in general as the ruggedness of the terrain changes.

This graph shows the difficulty of the route followed as the ruggedness of the terrain increases. The width of the ribbon represents the variability of the results over thousands of runs of the model. The model is set up in such a way that ruggedness and difficulty are directly related, in fact linearly related. This is visible as the slope of the lines. Similarly, the intercept of the line is the minimal difficulty of the map. There are no units on the axes. That’s because it’s not clear to me how those units would map to the real world anyway. By removing the units I’m saying that I don’t know: as Carveth Read astutely pointed out, “It is better to be vaguely right than exactly wrong.”

As you might guess, Dora did well when the map was highly accurate (the far left of the graph), but, unfortunately for her, any additional ruggedness of the terrain which was not marked on the map made her experience much more difficult than expected when following the route she planned in advance.

Diego on the other hand didn’t stick to a predefined route so sometimes was able to adapt to obstacles not on the map. As the ruggedness of the terrain increases past a certain point he consistently does better than Dora.

This is the first observation: there is a trade-off between following a plan and adapting as you go and it depends (at least) on the ruggedness of your terrain. There is a book in that statement⁴.

⁴ Just as a teaser, Simon Wardley talks at great length about how novelty generates uncertainty, leading to a rugged terrain while mature components can be planned more confidently. Bent Flyvbjerg has researched uniqueness bias and a tendency for us to believe that our situation is different from any other, overestimating the uncertainty. Or Chris Rodgers’ Wiggly

A second observation here, often overlooked, is predictability. This can be interpreted on the graph as the width of the ribbon. A narrow ribbon means low variability which corresponds to greater predictability. Again Dora’s progress is very predictable when the map matches reality but that predictability decreases rapidly as reality bites. Interestingly, Diego’s progress is much more predictable than Dora’s as the terrain becomes more and more unpredictable; the variability in his routes is, in fact, fairly constant.

More predictability means being able to have greater confidence in the final outcome. In many projects involving multiple external stakeholders this predictability is actually more important than the details of any plan itself. Read that again.

Third observation: Past a certain point of ruggedness, not only does Diego consistently do better than Dora, he also tends to pick a path close to the best route in hindsight. Let’s see this on a graph.

Although rarely finding the absolute best route, Diego’s route deviates in a relatively small and consistent way.

Intuition holds in the sense that the best route also gets more and more difficult (and variable) as the ruggedness of the actual terrain increases. Diego hardly ever finds perfection but, interestingly, his experience tracks surprisingly closely the best possible one. Again we can see that Diego’s strategy is more predictable than Dora’s as he can “think on his feet” and avoid any trouble encountered on the way.

So how much does Dora’s experience deviate from her best made plans?

The relationship between the plan and reality has become a cliché. As Field Marshal Helmuth von Moltke is said to have said:

“No plan of operations extends with any certainty beyond the first encounter with the main enemy forces.”

Often shortened to “No plan survives first contact with the enemy.” Also compare with Mike Tyson’s rather more pithy version. I’m not a big fan of quotes related to violence but they do seem to abound in this context.

Can we see this in our little model? The answer is yes and very strikingly.

Dora made the very best plan with the information available to her. Nonetheless she typically ends up deviating a lot from the plan.

When the map perfectly matches the terrain then her experience is as expected but as the map fails to show the detail, the difficulty and variability of her experience increases enormously. Compare this to Diego’s experience compared to the very best route. Field Marshal Helmuth von Moltke was definitely right.

There is another maxim which is usually attributed to Eisenhower:

“Plans are worthless, but planning is everything.”

The strategies used by Dora and Diego are two extremes but they can be combined. We can plan and we can adapt too. What happens if we lift the fog on Diego to allow him to “look ahead” with greater visibility of the future?

Diego’s experience improves as the fog lifts and he has greater visibility of the future.

If we give Diego predictive superpowers he gets closer and closer to the best possible route. Perfect prediction is equivalent to 20/20 hindsight.

If we believe Winston Churchill who said:

“…the best generals are those who arrive at the results of planning without being tied to plans.”

Then Diego would have made a pretty good general!

A model is just a model

Erica Thompson, in her book Escape from Model Land: How Mathematical Models Can Lead Us Astray and What We Can Do About It (2022), wrote:

“Models are not simple tools that we can take up, use and put down again. The process of generating a model changes the way that we think about a situation, encourages rationalisation and storytelling, strengthens some concepts and weakens others.”

This is a model with many implicit assumptions, for example, there are no existing paths or roads on the map. Assuming these were clear of obstacles then these would potentially give Dora a significant advantage. What roads would do in effect is to reduce uncertainty in the map, making her strategy more effective. Likewise dead ends like impassable cliffs or rivers are unlikely to be present.

Also there is some flexibility in the destination: it’s not a specific location. If it were then Diego would have a more difficult time finding it because he wouldn’t be able to “go roughly east”. There is some scope here for further study. Are there strategies that Diego could follow to get to a specific spot in a reliable way?

The model is set up in such a way that ruggedness and difficulty are directly related, in fact linearly related. This is visible as the slope of the line on the graph. Similarly the intercept of the line is the minimal difficulty of the map. On the other hand, there are no units on the axes. That’s because it’s not clear to me how those units would map to the real world anyway. By removing the units I’m saying that I don’t know⁵. I could attempt to build them in but then it would be a different model.

⁵ “It is better to be vaguely right than exactly wrong.” ― Carveth Read, Logic: Deductive and Inductive (1920)

In fact, where did the parameters of ruggedness and visibility even come from? Well, they came from my intuition and I had to play around with the model to see if my intuition was valid or not. It turned out that it was but it doesn’t preclude other parametrisations and observations.

As it stands, it does do two things, however. Firstly, it reifies my intuition. It’s not just some fuzzy idea in my head, or me being persuaded by appeal to authority. It’s a numerical model that has built-in assumptions and a story to tell. You might have another.

Secondly, it uses data to demonstrate the results, not hand wavy suppositions. Assumptions in, observations of data out. Both ends require a certain leap of faith but at least it’s an honest, transparent and verifiable one. That’s one way to escape Model Land.

It took me a couple of full days to create this model and another couple to write up the results. My intuition is stronger now and I have a model which generates interpretable data. When someone says Alfred Korzybski’s phrase “the map is not the terrain” again, I will think back to this model. The next time someone asks me why we need a plan, I’ll have an honest answer.

Next draft:

Remove repetition of the “no axis” stuff.

Need a “So what?” section which explicitly interprets the model in terms of project management.

GPS?

Flow and Cognitive Load

Tue, 06 Feb 2024 12:30:00 GMT

“Maximise flow by minimising cognitive load”. Wait a minute…

I’ve had a bad back recently and have been scrolling LinkedIn more than I usually do. One thing I noticed was the words “flow” and “cognitive load” coming up a lot, quite often together. Now I’ve heard three people in real life saying almost the exact phrase above and I think it warrants a bit of extra thought.

I guess the first whiff of something fishy is that it smacks a bit of our old friend and adversary Mr Frederick Winslow Taylor. The suggestion of optimisation and words like “cognitive”, sound quite sciency, don’t they? Scientific management has been around a long time. It tends to get sidelined during times of boom to be wheeled out again during the bust. We are no longer in a boom cycle so fair enough but the deserved concern around its dehumanising aspect still remains and we need to be careful. (Also, to me, cognitive load and flow seem to have something vaguely reminiscent of time and motion. Probably my imagination.)

There’s also something fishy about the use of the phrase “cognitive load” itself. Are we talking about an individual’s capacity for learning the skills required to do the work or the whole team’s ability to assimilate that knowledge as a group? It seems quite confused. In either case, there is no better formula to demotivate a skilled person than to remove the need for their skill, especially if you say it’s for their own good. I hope that what is really being meant by “cognitive load” is to minimise the bureaucracy, the burdensome tools, the handoff coordination, the form filling, the approval seeking, all the stuff that really gets in the way of flow.

And talking of flow… secondly, and obviously, flow means so many things. It could just mean not being blocked. It could refer to a simple experience like an easy payment process. It could imply a certain grace and ease such as the flow of a passage of text or a piece of music while at the same time, and less elegantly, it could mean a slicker pipe through which material can be discharged quickly.

Most interestingly to me, “flow” could mean a mental disposition such as the one conceived by Mihály Csíkszentmihályi where work is energising, engaging and enjoyable. Everyone from programmers to PowerPoint wranglers knows and enjoys that feeling. However Csíkszentmihályi proposed that the most productive and satisfying work wasn’t done by minimising the difficulty of the task, but by balancing the difficulty with the skill of the person or people involved. If that is the case, we don’t want to minimise cognitive load but to balance it.

Whatever the case, whatever maximising fast flow means to you and whatever minimising cognitive load means to you, we’ve been saying for years that it’s velocity, not speed that’s important, that is direction and purpose over how fast you go. We’ve also said that we need skilled and motivated employees fully engaged in the work they are doing. Maximising flow and minimising cognitive load sounds like it could be a recipe for the best possible feature factory.

A Little Every Day

Wed, 22 Feb 2023 09:30:00 GMT

People have given me a lot of advice over the years. “Do what you love.” Ha ha. “Measure twice, cut once.” Well, that would depend, wouldn’t it.

In the end there are only a couple of phrases which stick with me. One was “Always run to something, not from something.” Run to a new job, not from an old one. Run to safety, not from a fire. Maybe I’ll write about why that helps sometime.

Another piece of advice which I still apply is “Do a bit every day.” Let me explain.

This was given to me by a friend of my parents when I had a young family and I was renovating our home. The size of the task ahead was daunting and time was in short supply because nappies needed changing, bread needed winning and the basic necessities of life needed attending to. It was overwhelming.

I tried planning the work and sticking to the plan but I always slipped behind which was stressful in itself, just adding to the problem. I suffered and the family suffered.

Then I tried to “do one job every day” and it changed everything. The “one job”, Joe told me, needn’t be big, in fact it could be as simple as placing a single tile or laying a single floorboard. Sometimes the job was just 5 minutes when everyone else was in bed. No pressure to finish, no schedule to keep to.

The amazing thing was that as the days passed, stuff got done. As if by magic walls were tiled and floors were laid. And the stress was gone.

I put the success of this “technique” down to two things. First, sometimes just getting started is the hardest part. After that the job becomes easier than you thought and it’s done before you know it. The other factor is that as the days pass you maintain some momentum. A couple of times I stopped doing a job every day and it became harder to get that momentum back. Each and every day something gets done and you move closer to your goal.

When the house was finally complete it turned out to be an immensely rewarding experience.

Inverse Markov

Sat, 29 Oct 2022 06:46:00 GMT

Markov chains are a fantastic tool for modelling how stuff moves about over a network of possible states. Think about them as probabilistic state machines. Their uses are widespread, from Google’s PageRank algorithm which models the way users move about the internet to analysing the ways players move around a Monopoly board. We can also use them as a simple model of workload in a distributed system or team.

In my case I had some economic data and I wanted to see if I could make a fortune finding a trend in the market so I found myself wanting to determine a Markov chain model from real-world data.

Needless to say the result was negative and my wealth has increased only in terms of knowledge.

It turns out that this topic was studied extensively in the last century¹’² and, although simple and well understood, it took me a while to get my head around it. I’m writing it up here to help me remember.

¹ Lee, T. C., Judge, George G., Takayama, T., 1965. On Estimating the Transition Probabilities of a Markov Process. American Journal of Agricultural Economics 47, 742–762. https://doi.org/10.2307/1236285

² Lee, T. C., Judge, George G., Zellner, A., 1968. Maximum Likelihood and Bayesian Estimation of Transition Probabilities. Journal of the American Statistical Association 63, 1162–1179. https://doi.org/10.1080/01621459.1968.10480918

For the background of the maths behind Markov chains in general there are numerous resources online. Basically we repeatedly apply a transformation to the current state to turn it into a new state. We then apply the same transformation to the new state to create the next state and so on. Remarkably, no matter which state we start in, while conditions stay constant the system always converges to the same steady state (plus a bit of noise). We can use this fact to work backwards from observations of the sequence of states and infer the model that would have generated them.
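A tiny Python sketch of that convergence, using a made-up two-state transition matrix whose rows sum to 1. The state is treated as a row vector that is multiplied by the matrix at each step, which is the convention the Julia code further down appears to use.

```python
def step(state, P):
    """One update: row vector `state` times transition matrix `P`."""
    r = len(P)
    return [sum(state[i] * P[i][j] for i in range(r)) for j in range(r)]

def run(state, P, steps):
    """Apply the same transformation `steps` times."""
    for _ in range(steps):
        state = step(state, P)
    return state

# Made-up transition matrix, purely for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]

from_first = run([1.0, 0.0], P, 50)    # start entirely in the first state
from_second = run([0.0, 1.0], P, 50)   # start entirely in the second state
print(from_first, from_second)
```

Both starting points end up at roughly (5/6, 1/6), the steady state of this particular matrix.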

Say we have a state represented by a vector $\nu$ and a transformation represented by a matrix $P$. Based on a sequence of noisy observations of the state, $\nu_1, \nu_2, \dots, \nu_n$, what is our best estimate of $P$?

We’ll need plenty of observations so let’s put them together into the same matrix, one observation per row:

$$V = \begin{pmatrix} \nu_1 \\ \nu_2 \\ \vdots \\ \nu_{n-1} \end{pmatrix}, \qquad U = \begin{pmatrix} \nu_2 \\ \nu_3 \\ \vdots \\ \nu_n \end{pmatrix}$$

Each application of $P$ takes us to the subsequent state so we have

$$\nu_{t+1} = \nu_t P$$

We end up with an equation involving $P$:

$$U = VP$$

Since $P$ is a square matrix we will need $n$ to be at least equal to the number of states. If we have more then even better. Using the pseudo-inverse of the states we can get the least squares best estimate for $P$:

$$\hat{P} = V^+ U$$

where the symbol $^+$ represents the generalised inverse, which in fact is easy enough to calculate: $V^+ = (V^\top V)^{-1} V^\top$.

In Julia, assuming we already have the state data in the variable ν (one observation per row), this becomes

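n = size(ν, 1)       # number of observations, one per row of ν
V = ν[1:n-1, :]      # states at times 1 … n-1
U = ν[2:n, :]        # the states that followed them

V⁺ = inv(V'V) * V'   # pseudo-inverse of V

P̂ = V⁺ * U           # least squares estimate of P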
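For a feel of the numbers, the same calculation can be sketched in plain Python for a two-state chain. The transition matrix is made up and the observations are generated without noise, so the estimate should match it almost exactly. Real data would be noisy and the match only approximate. The helper functions are written out by hand to keep the example free of dependencies.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Made-up transition matrix used to generate noise-free "observations".
P_true = [[0.9, 0.1],
          [0.5, 0.5]]

nu = [[1.0, 0.0]]                        # nu[t] is the state (a row vector) at step t
for _ in range(5):
    nu.append(matmul([nu[-1]], P_true)[0])

V = nu[:-1]                              # every state except the last
U = nu[1:]                               # the states that followed them
Vt = transpose(V)
V_plus = matmul(inverse_2x2(matmul(Vt, V)), Vt)   # (V'V)^-1 V'
P_hat = matmul(V_plus, U)                # least squares estimate of P
print(P_hat)
```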

This works fairly well but it has a problem. The pseudo-inverse knows nothing of the constraint on the probabilities, which must all be non-negative. Various workarounds exist but in the end we end up having to fiddle with the numbers to get them all non-negative in the optimal way.

Luckily for us, optimisation algorithms for situations like these have also been studied extensively. Julia’s JuMP ecosystem, for instance, provides tools which eat these kinds of problems for breakfast.

We are optimising for the minimum square difference between $U$ and $VP$ where the entries in $P$, always non-negative, are the variables we are trying to determine.

Specifically, we are finding values for $p_{ij}$ with the objective of minimising

$$\sum_{t,j} \left( U - VP \right)_{tj}^2$$

We also have the condition that the entries must be non-negative and the constraint that the rows must sum to 1, $\sum_j p_{ij} = 1$.

Assuming $r$ is the number of states, in JuMP we can define the problem like this:

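using JuMP, Ipopt

model = Model(Ipopt.Optimizer)

@variable(model, p[1:r^2] >= 0.0)
P = reshape(p, r, r)                        # the variables arranged as an r×r matrix (column-major)
@objective(model, Min, sum((U - V*P).^2))   # sum of squared differences between U and V*P
for j in 1:r
    # entries j, j+r, j+2r, … make up row j of P
    @constraint(model, sum(p[i] for i in j:r:r^2) == 1.0)
end

optimize!(model)

Pest = reshape(value.(p), r, r)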

With this code I’ve been able to recover a good approximation to the transition matrix in testing, the approximation becoming better with increasing sample size.

We could give the algorithm an initial hint based on the pseudo-inverse described above but in my experiments the optimisation algorithm is fast enough to not require it.

The method could be improved by quantifying the uncertainty in the results. One way would be to batch observations and look at the distribution of results. Another might be to apply a Bayesian approach using a Dirichlet multinomial as the prior and updating with the observations. A problem for another time.

Accelerate and Farmers’ Gates

Fri, 08 Jul 2022 18:01:00 GMT

Accelerate: Building and Scaling High-Performing Technology Organizations by Nicole Forsgren, Jez Humble, Gene Kim

The kind of gate I’m talking about.

In the rural community where I grew up there was a common aphorism: “A good farmer has good gates” and generations of experience showed that this is a pretty good rule of thumb.

If you took a group of researchers and surveyed the local farmers’ general habits, including how well maintained their gates are, do you think they could infer a relation? Could they even predict which were the better farmers based on their gates? Yes they probably could, because a good farmer has good gates.

In the jargon, this is predictive inference, and it is the kind of inference that Accelerate describes¹.

¹ In chapter 12 they go into some detail.

Does having good gates make you a good farmer? Of course not. The relation between gates and the type of farmer is not causal.

Can a farmer become a better farmer through better maintenance of his gates? Hmm. Accelerate might say that “good gates drive good farming”. IMO the word “drive” is bearing quite a lot of weight to make that statement.

The unfortunate situation is that predictive inference explains nothing. It may validate a hypothesis based on a theory but the scientific literature is littered with such validations that later turned out to be false².

² Try this website to demonstrate your political point.

For explanation you need, at least, causal inference and that is much much harder. Causal inference requires a causal statistical model which is notoriously hard to get right.

My feeling is that some of the key indicators in Accelerate really are causal but others are just the equivalent of good gates.

Why Lognormal?

Thu, 19 Aug 2021 06:21:00 GMT

The log-normal distribution is sometimes used as a simple model for the distribution of latencies from real-world systems. By no means is it a perfect model and sometimes alternatives are better, especially when the load is far below maximum capacity but at high load I’ve empirically found it to be a reasonable fit.

It’s important to understand that models are just models. Under different situations the latency distributions may be completely different.

Real world data from proprietary production system under high load.

One reason that this is important is that the log-normal is long-tailed. This means that while the average response time under these conditions might be excellent, the top percentiles might be totally unacceptable so we end up having to oversize more than we might like.

I’ve often wondered why this might be the case and answers to this question are often very hand-wavey. Here I attempt a totally unscientific explanation.

Let’s take a standard result from queueing theory and say that the latency is a function of the load factor, defined as the ratio of arrival rate to service rate.

Latencies fly off to infinity when the load factor approaches 1. Makes sense because if the requests arrive faster than they can be serviced then the queue will grow until something else gives.

Now we might assume there is some dispersal in the load factor due to the natural burstiness of incoming traffic. We don’t want to model this noise with a normal distribution because our load factor is strictly between 0 and 1. A better choice is the beta distribution which has exactly that property and just so happens to nicely approximate the normal distribution if we choose the right parameters.

Now what happens to the distribution of latencies? Well, as luck would have it, the transformation of the beta distribution by the formula above gives us the so-called beta-prime distribution. And that approximates a LogNormal distribution surprisingly well, especially around the tail.
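A quick way to get a feel for this is to sample it. The Python sketch below is only illustrative: the beta parameters are made up (mean load factor 0.8) and the latency formula is assumed to be the usual elbow curve ρ/(1 − ρ), which is the transformation that turns a beta variable into a beta-prime one. It then compares the skewness of the latencies with the skewness of their logarithms. For an exactly log-normal variable the latter would be zero.

```python
import math
import random
import statistics

rng = random.Random(42)

# Assumed parameters: a beta distribution with mean 0.8, i.e. a system under high load.
ALPHA, BETA = 40.0, 10.0

load = [rng.betavariate(ALPHA, BETA) for _ in range(20000)]
# Assumed latency formula: rho / (1 - rho), which blows up as rho approaches 1.
latency = [rho / (1.0 - rho) for rho in load]

def skewness(xs):
    """Sample skewness: the mean cubed z-score."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

skew_latency = skewness(latency)
skew_log_latency = skewness([math.log(x) for x in latency])
print(f"skewness of latency:      {skew_latency:.2f}")
print(f"skewness of log(latency): {skew_log_latency:.2f}")
```

The latencies come out strongly right-skewed while their logarithms are close to symmetric, which is at least consistent with the log-normal being a reasonable stand-in under these assumptions.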

Beta-prime distribution compared to the Log Normal distribution by method of moments.

Absolute error tapers away.

What does this mean? Absolutely nothing and it’s wrong for all kinds of reasons. Nonetheless it might give a clue to the usefulness of the LogNormal distribution and its tail and why the Erlang/Gamma distributions sometimes are more indicative.

The intuition would be that the natural variance in the load produces latencies heavily skewed by the elbow curve of the latency formula, especially when under high load, producing the long tail. Under lower loads the skewing effect is less and the distribution loses its long tail, in fact becoming more like an Erlang distribution after all.

Source code for the graphs is here.

Load Balancing Strategies and their Distributions

Sat, 03 Apr 2021 16:27:00 GMT

Results of a simulation to compare four of the most well known load balancer strategies:

Round robin - requests are routed to each of the available servers in turn
Least occupied - requests are routed to the server with least current requests
Random - requests are routed to a randomly selected server
Random 2 - requests are routed to one of two randomly selected servers, where the chosen server has least current requests (see here and here)
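The four strategies can be sketched as a single selection function. This is a Python illustration of the descriptions above rather than the Julia code used for the simulation; the function name, the strategy labels and the example loads are my own.

```python
import random

def pick_server(strategy, loads, rng, turn=0):
    """Return the index of the server that receives the next request.

    `loads[i]` is the number of requests currently on server i and `turn`
    counts the requests routed so far (only round robin uses it).
    """
    n = len(loads)
    if strategy == "round_robin":
        return turn % n
    if strategy == "least_occupied":
        return min(range(n), key=lambda i: loads[i])
    if strategy == "random":
        return rng.randrange(n)
    if strategy == "random_2":
        a, b = rng.sample(range(n), 2)          # two distinct candidate servers
        return a if loads[a] <= loads[b] else b
    raise ValueError(f"unknown strategy: {strategy}")

rng = random.Random(0)
loads = [3, 0, 5, 2]                            # made-up current load per server
picks = {s: pick_server(s, loads, rng, turn=6)
         for s in ("round_robin", "least_occupied", "random", "random_2")}
print(picks)
```

Note that random 2 can never choose the single busiest server, because whichever other candidate it is paired with always wins the comparison.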

The experiment

Arrival rate: requests arrive according to a Poisson process with a fixed mean rate.
Service rate: request completions are distributed with a Log-Normal distribution (although any realistic distribution shows the same characteristics).
Load factor is the ratio of arrival rate to completion rate, ranging strictly from 0.0 to 1.0. If greater than 1 then the requests would accumulate without end.

We keep a record of the number of requests in the system for a single server. The Julia code is here.

The results

The difference in the distributions of the concurrent requests under the different strategies is clear.

Distribution of concurrent requests under different load balancing strategies. The requests are distributed over 20 servers with average arrival rate of 10 per step at 75% load factor.

Notice how remnants of the input and output distributions are still visible in the round robin strategy. The least occupied and random 2 strategies tend to concentrate the distribution around its average. On the other hand the random strategy seems to spread the distribution still further.

How does the average vary with load factor?

Change in average concurrent requests as load factor varies, keeping number of servers at 20.

All strategies perform progressively worse as load factor increases, as would be expected. However the least occupied and random 2 strategies are noticeably better than the others, even at higher load factors.

And how does it change with number of servers?

Change in average concurrent requests as number of servers varies, keeping load factor constant at 80%.

Clearly the best strategies require a minimum number of servers over which to spread the requests. That number is relatively small (in this example around 10) and very little improvement is observed with additional servers. The others are unaffected by number of servers.

Variance

The spread in concurrent requests will translate to a spread in latencies too. For distributed systems we are usually interested in maintaining predictable latencies and minimising long tails, so we want to minimise this spread. It’s visibly clear that least occupied is the best in this case as it has the narrowest distribution. Let’s have a look at the variance for each one, taking round robin as the base.

Comparison of variance of the distribution of concurrent requests under different load balancing strategies, keeping the number of servers at 20.

What is the relationship between number of servers and the variance?

Change in variance under different load balancing strategies as number of servers increases.

We can see that when there is only a single server, all strategies are the same (obviously). Both the round robin and random strategies have little or no effect on the variance, no matter how many servers there are in the cluster. Although both least occupied and random 2 fare better, the least occupied has a clear advantage here and seems to be able to capitalise on additional servers more effectively.

Shedding

We employ a shedding mechanism to help keep request distributions under control independently of the load balancing strategy. How does this affect the results? Here we apply a simple shedding of requests by disallowing more than 20 concurrent requests per server.

Distribution of concurrent requests under different load balancing strategies, shedding requests above 20 per server.

The mean of the worse strategies has actually gone down; however, much more shedding is being done. That means that server throughput (serviced requests) should be lower. On the other hand the average latencies will also be higher due to the higher load on the server.
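The shedding rule described above can be sketched as a per-server admission gate. This is only an illustration of the idea, not the mechanism used to produce the plots: the class name SheddingGate and its methods are made up for this example, and the limit of 20 is the one quoted in the text.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-server admission gate: at most `limit` requests in flight.
class SheddingGate {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    SheddingGate(int limit) {
        this.limit = limit;
    }

    // Returns true if the request is admitted, false if it is shed.
    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= limit) {
                return false; // server is full: shed the request
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Must be called once for every admitted request when it completes.
    void release() {
        inFlight.decrementAndGet();
    }

    public static void main(String[] args) {
        SheddingGate gate = new SheddingGate(20);
        int admitted = 0;
        int shed = 0;
        // 25 requests arrive and none complete: the last 5 are shed.
        for (int i = 0; i < 25; i++) {
            if (gate.tryAcquire()) {
                admitted++;
            } else {
                shed++;
            }
        }
        System.out.println("admitted=" + admitted + " shed=" + shed);
    }
}
```

Because shed requests never reach the server, a gate like this caps the concurrent count at the limit regardless of which load balancing strategy sits in front of it.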

Conclusion

As an aside, we can also calculate the entropy of the distribution and see (as shown in the graph above) that it has also decreased for the least occupied and random 2 strategies while it has actually increased for the random one. One interpretation of these results is that the random strategy introduces a little bit of new uncertainty into the system. On the other hand least occupied and random 2 actually remove uncertainty. This is the load balancer equivalent of Maxwell’s demon, applying work to each request in order to reduce its uncertainty.

Of all the strategies round robin and random are disastrous and either do nothing to improve the distribution of requests or actually make it worse. However, the least occupied and random 2 strategies are able to take advantage of multiple servers to not only reduce the mean but also reduce the variance across the cluster.

While the least occupied is slightly better in terms of the spread of requests, the random 2 has some other advantages. Firstly, it’s slightly simpler and therefore faster in practice because only 2 servers are checked for each request rather than all of them. More importantly, it avoids servers which are (re)starting receiving all the load immediately. This is useful when the server needs some time to warm up caches, etc.
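The difference between the two selection rules is easy to see in code. The following is a minimal sketch, assuming each server’s load is simply its current number of concurrent requests held in an array; the class and method names are invented for the example and this is not the simulation code behind the plots.

```java
import java.util.Random;

class RandomTwoSketch {
    // Least occupied: scan every server, pick the one with the fewest requests.
    static int leastOccupied(int[] load) {
        int best = 0;
        for (int i = 1; i < load.length; i++) {
            if (load[i] < load[best]) {
                best = i;
            }
        }
        return best;
    }

    // Random 2: sample two distinct servers, pick the less loaded of the pair.
    static int randomTwo(int[] load, Random rng) {
        int a = rng.nextInt(load.length);
        if (load.length == 1) {
            return a;
        }
        int b = rng.nextInt(load.length - 1);
        if (b >= a) {
            b++; // shift so that b is never equal to a
        }
        return load[a] <= load[b] ? a : b;
    }

    public static void main(String[] args) {
        int[] load = {7, 0, 12, 3, 9}; // the 0 could be a freshly restarted server
        Random rng = new Random(42);
        int[] hits = new int[load.length];
        for (int i = 0; i < 10_000; i++) {
            hits[randomTwo(load, rng)]++;
        }
        System.out.println("least occupied always picks server " + leastOccupied(load));
        System.out.println("random 2 picks per server: " + java.util.Arrays.toString(hits));
    }
}
```

With the load held fixed like this, least occupied sends every request to the idle server, whereas random 2 can only choose it when it happens to be one of the two sampled (2 chances in 5 here). That is the warm-up protection mentioned above, and the busiest server is still never chosen.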

Polynomial Chaos

Fri, 01 May 2020 17:00:09 GMT

Emily Gorcenski - Polynomial Chaos: A technique for modeling uncertainty - Polynomial chaos is a somewhat obscure technique that leverages a natural connection between probability distributions and orthogonal polynomial families. This talk demonstrates the technique and its applications.

(The Julia code for this blog is available in my GitHub notebooks repo and online here.)

This talk appeared recently in my YouTube recommendations and with a title like “Polynomial Chaos” I had to take a look. This is a summary of what I learnt mainly to help my own understanding.

Polynomial Chaos Expansion (PCE, also known as Wiener Chaos Expansion) is a technique introduced just before the Second World War by Norbert Wiener. The use of the word chaos is different from the way we understand it today and seems to come from its application to the statistical study of white noise.

In a very similar way to how the Fourier and Laplace transforms are related to the exponential functions, there are strong relationships between certain probability distributions and corresponding families¹ of orthogonal polynomials.

¹ Catalogued in the Wiener-Askey scheme.

I’m imagining it in a similar way to how a continuous oscillation can be parameterised by only its frequency and amplitude in the Fourier case, though I’m not sure how far that analogy goes. This plot of the polynomials themselves does seem a bit sine-wavey.

Also like the Fourier and Laplace versions, the transformed version of the distribution has many useful properties which can be used for similar purposes, like approximation and solving differential equations.

In these notes, and following closely the talk mentioned above, I’ll try and describe how you might approximate a general probability distribution using this technique.

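Polynomial chaos extends from the fact² that any stochastic variable (within reason) can be expanded in a system of orthogonal polynomials: $X = \sum_{i=0}^{\infty} k_i \Phi_i(\zeta)$, where the $\Phi_i$ are the polynomials, $\zeta$ is a random variable drawn from the distribution associated with them and the $k_i$ are constant coefficients.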

² The Kosambi–Karhunen–Loève theorem states that a stochastic process can be represented as an infinite linear combination of orthogonal functions, analogous to a Fourier series representation of a function on a bounded interval.

If the polynomials are chosen correctly then they can represent certain probability distributions very compactly. For example, for a normally distributed random variable the polynomials are Hermite polynomials, $H_i$, and the transformed random variable can be written $X = \mu H_0(\zeta) + \sigma H_1(\zeta)$, where $\mu$ and $\sigma$ are the mean and standard deviation respectively, which makes sense because $H_0(\zeta) = 1$ and $H_1(\zeta) = \zeta$.

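The relationship between the distribution and the polynomials can be seen most clearly in the definition of the inner product of the polynomials themselves. In this case the Hermite polynomial inner product is defined like this:

$$\langle H_i, H_j \rangle = \int_{-\infty}^{\infty} H_i(x)\, H_j(x)\, \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = i!\, \delta_{ij}$$

The factor $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is both the weighting function for the product and the density of the distribution itself.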

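We want to approximate a general probability distribution, $X$, by expanding it in terms of a chosen set of polynomials belonging to another distribution, say $\zeta$. To do this the trick is to express both $X$ and $\zeta$ in terms of the same uniform random variable, $u \sim U(0, 1)$, using the inverse cumulative distribution function of each:

$$X = h(u), \qquad \zeta = l(u)$$

Then use the Galerkin projection to compute the individual coefficients $k_i$ (here with the Hermite polynomials $H_i$ as the basis):

$$k_i = \frac{\langle X, H_i \rangle}{\langle H_i, H_i \rangle} = \frac{1}{\langle H_i, H_i \rangle} \int_0^1 h(u)\, H_i(l(u))\, du$$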

All this is verbatim from the talk. I also planned to transcribe her code (to Julia of course) but I found a better plan.

Dealing with the polynomials from scratch is pretty tedious so I looked for a package that would do it for me. As it turns out there is a Julia package called PolyChaos which does most of this. Looking through the documentation I didn’t see this actual use case so I did it myself.

Using the PolyChaos package we can easily define our Hermite polynomials. In PolyChaos they are called GaussOrthoPoly³:

³ The name Hermite is used for the variant of the Hermite polynomials used by physicists. I remember they appear as part of the study of the quantum linear harmonic oscillator which I studied in university.

using PolyChaos

op_gauss = GaussOrthoPoly(20)
H(i,x) = evaluate(i, x, op_gauss)

We also compute the inner (scalar) products for our polynomials. PolyChaos conveniently does this for us:

sp = computeSP2(op_gauss)

Then we define our inverse functions for testing:

using Distributions

inv_cdf(dist) = u -> quantile(dist, u)

h = inv_cdf(Exponential())
l = inv_cdf(Normal())

integrand(i) = u -> h(u)*H(i, l(u))

In this case the distribution we want to approximate is the Exponential distribution; its inverse CDF, h, is defined as a partial function. The Gaussian we will approximate it with is defined as the partial function l. Finally we define our integrand in terms of h(u), H and l(u), for a particular index, i.

Now we perform the integration, for which PolyChaos also has us covered.

int_op = Uniform01OrthoPoly(1000, addQuadrature=true)

We’ll truncate the approximation to p polynomials:

p = 21
ki = [integrate(integrand(i-1), int_op) / sp[i] for i in 1:p]

Then we can reconstitute the approximated distribution using 5000 Gaussian random variables, $\zeta$:

using Plots

ζ = randn(5000)                    # samples of the standard Gaussian
Σ = zeros(5000)
for i in 1:p
    Σ .+= ki[i] .* H(i-1, ζ)       # add the i-th term of the expansion in place
end
histogram(Σ, normalize=true)

Exponential distribution reconstituted as a sum of transformed Gaussian random variables.

With this same code we can now approximate any distribution using random variables drawn from a more manageable distribution of our choice. This would allow us to perform other transformations or analysis which may have been difficult in the original form. It may also be a faster alternative to sampling techniques, like Monte Carlo variants.

Stop Using @Autowired

Fri, 03 Apr 2020 12:25:00 GMT

The Spring Framework is one of the most widely used Java frameworks around. There is a lot of great stuff in the Spring eco-system so it’s a shame to see its flagship feature, a dependency injection container, being widely misused.

First some history.

Cheesy quote of the day: „You have to know the past to understand the present.“ — Carl Sagan

Spring has been around since the early-2000s and was conceived as an antidote to the messy J2EE situation at the time. Its principal (but certainly not its only) attraction was a strong focus on Inversion of Control (IoC) and its dependency injection framework emerged amongst a plethora of competing DI containers.

In the beginning we had bean descriptors written in XML (best forgotten) and then¹, when annotations became fashionable, and driven partly by competing frameworks like PicoContainer and Guice, (much of) the XML was replaced with @Autowired.

¹ In 2007, apparently.

² @Autowired and the standardised @Inject annotations are essentially the same and Spring supports both. I’ll consider them as synonyms in this post.

The strength of @Autowired (and, equally, @Inject²) is its simplicity: add the annotation and let the framework do the rest.

Even at that time there was a big debate about whether to use constructor or field injection, the former better by design and the latter simpler to apply, but in either case the annotation was required to be present somewhere in the class.

The rather unfortunate consequence of having to add Spring specific code to otherwise clean domain objects was considered a worthwhile trade-off. And since we’re using the annotation anyway why not just add it everywhere?

The annotation is no longer required on constructors (since Spring 4.3 a class with a single constructor is wired through it implicitly) and the trade-off is no longer worthwhile. Nonetheless millions of developers are dragging it into the 2020s needlessly, along with its disadvantages.

As an example consider this code:

class MyClass {
    @Autowired private Foo foo;
    @Autowired private Bar bar;

    ...
}

This can be converted to the following code without having to make any other changes. Spring will handle it just fine.

class MyClass {
    private final Foo foo;
    private final Bar bar;

    public MyClass(Foo foo, Bar bar) {
        this.foo = foo;
        this.bar = bar;
    }

    ...
}

Granted, there are a couple of extra lines of boilerplate code³ but there are strong arguments why the second option should be preferred and @Autowired should now be considered a code smell, especially on fields or accessors. This post outlines some of them.

³ If this is important to you then consider using Lombok to generate the appropriate constructor automatically, but note that that has its own trade-offs.

⁴ From maintainability issues to the IDE warnings about unassigned fields.

There are also stylistic reasons⁴ but I’ll skip them here and concentrate on pure, cold engineering. Much of the same reasoning can be applied to the use of the @Value annotation as well. The objections can be grouped into two main categories.

Firstly, it makes it impossible to use the final modifier. Using the final modifier on fields is an important feature for multiple reasons:

Compilation fails if you have not provided all the necessary dependencies, whereas @Autowired fails at runtime. Compile time guarantees are stronger and safer than runtime testing, they just are. It’s easier and quicker to fix compilation errors than it is to find runtime bugs. So much magic makes configuration problems really hard to debug at runtime. You will lose time over this.

The simple fact that a value cannot change after construction gives additional assurances about the behaviour of the code just by looking at it. It explicitly declares intent and can help avoid some very hard to find bugs.

Final fields are guaranteed to be synchronised between threads. If you don’t declare fields final, then you must cover thread-safety by some other means, or accept that you don’t have it.

The JVM adds extra care-taking to non-final fields to ensure correct memory ordering which is not needed in final fields. Additionally, there are plans to allow the JIT compiler to aggressively optimise code in the knowledge that a field value will not change, as currently happens with static final fields.

That should be enough by itself but there is another major group of objections, namely that it hides dependencies instead of making them explicit.

It’s so easy to add an @Autowired annotation to a field that the structure of the object graph almost inevitably becomes messy over time and, if we are not careful, can even lead to hard to maintain circular dependencies.

A long list of dependencies in a constructor is a signal that a class has too many responsibilities (violating the SRP) but it’s easy for the same number of annotations to go unnoticed inside a class with many fields.

@Autowired classes are needlessly harder to test. Either we have to bootstrap the entire framework (see below), make our fields public or use something like Mockito’s @InjectMocks, a good example of engineering a solution to a problem that could have been avoided altogether.

It’s a NullPointerException waiting to happen. If you construct an instance outside of Spring then the fields must be public or otherwise initialised using setters, breaking encapsulation. It also means it’s possible to create an object in an invalid state breaking the “make invalid state unrepresentable” advice.

Finally, it unnecessarily ties your code to the Spring framework binaries making migrations between frameworks⁵ and restructuring of the domain much harder. This also applies to @Service, @Component, etc. which can also be removed but that’s another story.

⁵ Unfortunately I’ve had to do several over the years: EJB -> Spring, Spring -> EJB, Spring Boot 1.x -> Spring Boot 2.x 😠
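To make the testing and invalid-state points concrete, here is a minimal sketch in plain Java with no Spring on the classpath. Foo, Bar and describe() are hypothetical stand-ins (the post never defines them), and the null checks are an addition for illustration; the only thing that matters is that the dependencies arrive through the constructor.

```java
import java.util.Objects;

class ConstructorInjectionSketch {
    // Hypothetical collaborators standing in for the post's Foo and Bar.
    interface Foo { String name(); }
    interface Bar { int count(); }

    static final class MyClass {
        private final Foo foo;
        private final Bar bar;

        // A single constructor: no annotation, and no way to build a
        // half-initialised instance.
        MyClass(Foo foo, Bar bar) {
            this.foo = Objects.requireNonNull(foo, "foo");
            this.bar = Objects.requireNonNull(bar, "bar");
        }

        String describe() {
            return foo.name() + " x" + bar.count();
        }
    }

    public static void main(String[] args) {
        // No container, no reflection: stubs are passed in directly.
        MyClass subject = new MyClass(() -> "widget", () -> 3);
        System.out.println(subject.describe()); // prints "widget x3"
    }
}
```

A unit test for such a class is just a constructor call with stubs or mocks as arguments, so neither @SpringBootTest nor @InjectMocks is needed.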

Having said all that, as always there are some notable exceptions to the general rule.

Autowiring with @Bean annotations can (and probably should) be used in Spring configuration classes, which you are likely already using if you’re using Spring. In this case @Autowired is still mostly unnecessary and again you can keep Spring stuff out of your domain. For example:

@Configuration
static class SomeConfiguration {

    @Bean
    public MyClass myClass(Foo foo, Bar bar) {
        return new MyClass(foo, bar);
    }
}

This may be useful if you want to perform some additional configuration and avoids the @PostConstruct nonsense. Again, the idea is to keep all the Spring based annotations inside the configurations and outside the domain classes. This way your domain classes will be clean and typically easier to test without the heavy machinery of…

@SpringBootTest automagically uses @Autowired to inject fully configured objects into your tests. A suitable constructor would be better but JUnit⁶ has its limitations. Take for example:

⁶ Another ubiquitous annotation magic wielding library

@RunWith(SpringJUnit4ClassRunner.class)
@ActiveProfiles("test")
@SpringBootTest(classes = ApplicationConfig.class)
public class BigOldServiceTestIT {

  @Autowired private BigOldService service;

  ...
}

There are probably more exceptions, there always are. I’ll add them as I think of them. On the other hand there are definitely more reasons not to; we haven’t even talked about anaemic domains and good OO design here. That’ll be for another post.

Technical Practices for Continuous Delivery

Mon, 03 Feb 2020 07:25:00 GMT

DORA recommends strengthening a core set of technical practices¹ to “drive”² Continuous Delivery, which in turn “drives” business performance.

¹ Go to the website and click on the “Technical Practices” node. Alternatively take a look at their book Accelerate which lays all of this out in detail.

² Causal inference is a stated assumption. It’s debatable whether this is the case but that’s for another time.

³ See, for example, Extreme Programming Annealed - Glenn Vanderburg

They clearly have internal relationships and, like XP³, there is a dependency graph of interlocking practices. For example, it’s difficult to imagine trunk-based development without some kind of version control. I was curious what it looked like so I gave it a first stab.

CD Dependency Web

Arrows represent a “supports” relationship. For example: “trunk-based development supports continuous integration”. Some comments:

Version control is at the root of the practices. This is obvious to any practitioner and hardly worth saying. Is anyone not using a VCS in 2020?

A loosely coupled architecture supports deployment automation. Difficulties arise with deployment automation when the codebase is overly monolithic or if distributed components are coupled. Monolithic codebases, even when properly modularised, can result in conflicts. Also overly coupled components result in so called “distributed monoliths” and require complicated deployment sequences. Here we need to distinguish deployment coupling from runtime coupling and talk more about contracts and DDD-style strategic integration patterns.

Contracts between distributed components and multiple teams can be nicely understood through #PromiseTheory which also gives us a model for scaling. A subject for another time.

Database change management supports deployment automation. If you’ve worked in projects without DB change management you’ll know that it can lead to many problems. Before tools such as Flyway and Liquibase were available maintaining database schema in line with the code base was a serious headache. Database changes had to be synchronised with code changes, often resulting in outages, cache problems and delays.

I’ve said that shifting left on security supports continuous testing. Testing is not just about features. How many of us have been stung by security concerns appearing late in the development cycle which could have been solved so much more easily if detected earlier? For example penetration testing is nearly always done as late as possible, for whatever reasons. Exposed services or plain text parameters in a development environment are not an issue but penetration tests flag them immediately.

Though not listed amongst the main practices, performance concerns appear in two separate guises. First, comprehensive monitoring and observability enables performance issues to be made visible and picked up quickly. Second, performance testing is part of any continuous testing strategy.

In common with security, performance concerns raised early in development can actually be an antidote to premature optimisation and lead to better design choices through real feedback. For example, if a query is correct but too slow under production load then that can be dealt with early rather than unnecessary DB scaling in production⁴. A box for shifting left on performance would fit beautifully between comprehensive monitoring and observability and continuous testing.

⁴ DB level scaling in the presence of slow queries often goes under the euphemism of “tuning”.

It’s no use building releasable binaries after every commit, multiple times a day, if you are going to deploy to production once a month. This breaks the feedback mechanism and will result in a call for hot-fixes. Hot fixes require separate branches, break TBD and require separate deployment pipelines. Rollback becomes more difficult because ALL the commits in the release will be rolled-back even if they are giving value.

In general I prefer to have a single binary for any version of the software. The corollary is that configuration should be done externally to the binary. There are two main ways to do that: (1) by externalising everything, for example in a properties file in a well-known location, or (2) by packaging configuration inside the binary for ALL environments and configuring a variable with the name of the configuration to load. The first is usually preferred; if nothing else, it means that worries about the security of production keys etc. can be separated from the management of the build itself.
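Option (1) might look something like the following in plain Java. It is only a sketch under assumptions of my choosing: the environment variable APP_CONFIG, the fallback path /etc/myapp/app.properties and the key db.url are all hypothetical names, not taken from any real project.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class ExternalConfigSketch {
    // Loads key=value settings from a file that lives outside the binary.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // APP_CONFIG is a hypothetical variable naming the file's location;
        // the fallback path is equally made up.
        String location = System.getenv()
                .getOrDefault("APP_CONFIG", "/etc/myapp/app.properties");
        Path file = Paths.get(location);
        if (!Files.exists(file)) {
            System.out.println("no configuration found at " + file);
            return;
        }
        Properties props = load(file);
        System.out.println("db.url=" + props.getProperty("db.url"));
    }
}
```

The same binary can then be promoted unchanged through every environment, with production secrets living only in the production file.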

Teams, Systems and Catastrophe

Sun, 26 Jan 2020 15:17:00 GMT

Many surprising discoveries were made in the last century about how groups of connected things behave and interact. I would like to present a couple of the simplest results which, I think, can help us understand some of the phenomena that we see in our daily lives.

You’ve probably seen the murmurations of starlings and the mesmerising swarming of bees, which are examples of this.

To avoid over-abstraction I’ll talk specifically about people connected through teams and distributed software systems connected by dependence, but the same ideas and mechanisms are widely applicable to many other situations.

Dependency Hell

The first one I want to cover was discovered by mathematicians Paul Erdős and Alfréd Rényi in the late 1950s¹ and concerns the way networks tend to join together as new connections are added.

¹ Presented in a pair of seminal papers: On Random Graphs. I [1959] and On the evolution of random graphs [1960].

They considered what would happen if you start with a fixed set of independent things and then progressively add new connections between them at random.

One might think that the connectivity on the whole would increase in proportion with the number of connections, but this is not the case. The connectivity is not only non-linear² but undergoes a phase transition³ at which nearly everything joins together very quickly. Let’s see it happening.

² Non-linear: The output changing in a way that is not proportional to the change in the input. Like the temperature of the shower when it goes from freezing cold to scalding hot after the tiniest of adjustments. We tend to expect linearity, either by nature or nurture, but non-linearity is the norm rather than the exception. To paraphrase Stanislaw Ulam: studying non-linearity is like studying non-elephants.

³ Phase-transition: When a collective undergoes a sudden structural reordering as some parameter of the system is gradually changed. Most often applied to thermo-dynamic systems but complex systems share many of their characteristics.

We start with 20 completely independent objects (people or system components) then start adding connections randomly. What you find is that initially nothing much happens: you have some pairs of things and a few threesomes.

Few connections

Adding a few more connections and we can see groups starting to coalesce.

Transition

However, add just a couple more connections and the different connected groups⁴ of the network quickly connect together to form a much larger (so-called giant) group.

⁴ These are called components in network jargon but that might be a bit confusing for software people for whom everything is a component!

Giant component

This happens every time. If we run the above scenario many many times with a thousand points and plot the size of the largest group, the tendency is remarkable.

Source code for generating these graphics with Julia can be found here.

The effect is surprisingly non-linear. In other words, as the number of random connections between things reaches a certain threshold then nearly all the things will be connected together. In the simplest model, the number of connections just needs to be greater than half the number of things.
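The experiment is small enough to sketch in a few lines. The plots in this post were generated with Julia; the Java below is an independent illustration that uses a union-find structure to track group sizes, with invented names throughout. As a simplification, endpoints are drawn independently, so the occasional self-connection or repeated connection is simply wasted.

```java
import java.util.Random;

class GiantComponentSketch {
    // Union-find over n nodes, tracking the size of each connected group.
    private final int[] parent;
    private final int[] size;
    private int largest = 1;

    GiantComponentSketch(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
            size[i] = 1;
        }
    }

    private int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving keeps trees shallow
            x = parent[x];
        }
        return x;
    }

    void connect(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return; // already in the same group
        }
        if (size[ra] < size[rb]) {
            int t = ra;
            ra = rb;
            rb = t;
        }
        parent[rb] = ra;
        size[ra] += size[rb];
        largest = Math.max(largest, size[ra]);
    }

    // Adds `edges` random connections between n nodes; returns the largest group size.
    static int largestGroup(int n, int edges, long seed) {
        Random rng = new Random(seed);
        GiantComponentSketch g = new GiantComponentSketch(n);
        for (int e = 0; e < edges; e++) {
            g.connect(rng.nextInt(n), rng.nextInt(n));
        }
        return g.largest;
    }

    public static void main(String[] args) {
        int n = 1000;
        for (int edges = 0; edges <= 1500; edges += 250) {
            System.out.printf("edges=%4d largest group=%d%n",
                    edges, largestGroup(n, edges, 7L));
        }
    }
}
```

With 1000 nodes the largest group should stay small while the number of connections is well below 500 and take in most of the nodes once it is well above, which is the threshold of half the number of things mentioned above.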

This morning I had to remove some oil from the engine of my car. I very very gently adjusted the sump bolt to allow the dark oil to escape gradually. Instead, it jumped from a dribble to a great stream of blackness all down my arm. That’s non-linearity.

For people networks this is the first part⁵ of the Kevin Bacon game: you are certainly connected to Kevin Bacon and indeed to everyone else on the planet.

⁵ For the other part see The dynamics of ‘small-world’ networks by Watts and Strogatz, which significantly reduces the number of ‘hops’ between you and Mr Bacon.

In the case of software dependencies the situation is less fun. The principle implies that the dependencies in our software are, unintuitively, highly transitive. That is, the chains of dependencies in our code and components tend to make everything depend on everything else. This is why a single fault can bring down an entire company’s infrastructure, and why we must be draconian when choosing our dependencies to stand any chance against this effect.

Dependencies are not only transitive but many are hidden and bi-directional. I describe this in more detail in my first ever talk Predictably Unpredictable.

Another lesson to be learnt from this scenario originated some years ago in a paper⁶ building on the originals mentioned above. It adds to the model a destructive process where connected groups are “burnt down” periodically. The key is that the building and burning down of the groups balances to a precarious and dynamic equilibrium⁷, so-called self-organised criticality.

⁶ See Erdos-Renyi random graphs + forest fires = self-organized criticality by Balazs Rath and Balint Toth

⁷ A nice interactive model Critically Inflammatory is available on the Complexity Explorables website for you to play with.

The take-away for systems designers is that we sometimes see outages in our systems and then “repair” them by scaling or using tools like circuit-breakers and time-outs. At the same time other connections are being added by new features and services. We find ourselves more often than not on the critical line between stable and unstable.

Tipping Points

Let’s look at a second model. This time, rather than taking completely independent objects and gradually connecting them together, we take a set of objects which are all connected together, but with varying weights.

This might represent the number of times that one person interrupts another person per day, or the number of requests from one service to another in a distributed system.

We then gradually increase all the weights using some common scaling factor.

This scenario was first proposed by Lord Robert May in a famous paper⁸ from 1972. He found⁹ that the interactions between the individual parts become unstable (that is move away from a balanced and steady equilibrium) at a certain sudden, non-linear transition.

⁸ See Will a large complex system be stable? by RM May, Nature 238 (5364), 413-414

⁹ See also this course on random matrices for an in-depth description of the reasoning.

We start with a network where all objects, this time just 10 of them, are connected to each other by differing strength interactions.

Complete weighted network

Using May’s model we can calculate the threshold where the loops and feedback in the network make the whole system unstable. As before, rather than try and study specific configurations, let’s run the example many times with different random networks of this type to see the tendency.
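For reference, the threshold is usually quoted in the following compact form. The symbols are introduced here for illustration and do not come from the plots: $n$ is the number of nodes, $C$ the fraction of possible connections that are present, $\sigma$ the standard deviation of the interaction strengths, and each node is assumed to damp itself with unit strength.

```latex
\sigma \sqrt{nC} < 1 \;\Rightarrow\; \text{stable}, \qquad
\sigma \sqrt{nC} > 1 \;\Rightarrow\; \text{unstable}
\qquad (\text{almost surely, as } n \to \infty)
```

For a fully connected network $C = 1$, so with $n = 10$ the tipping point would sit near $\sigma = 1/\sqrt{10} \approx 0.32$, blurred somewhat because 10 is a long way from the large-$n$ limit.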

Again the source code for generating these graphics with Julia can be found here.

May’s criterion for stability has drawn some controversy¹⁰ but the finding is still striking. There is a definite tipping point where network effects produce a phase shift in the dynamics of the system.

¹⁰ When will a large complex system be stable? by Joel E.Cohen and Charles M.Newman

Imagine the network being a team of 10 people each communicating freely with everyone else. The team members who work most closely together would have the strongest interactions. At the tipping point, a slight change in the situation or team dynamic might be amplified by the network effects of interactions with multiple people and could cause a cascade that affects the team as a whole. This is of course just a model and no doubt unrealistic in detail but it wouldn’t be the first time that a seemingly trivial change destabilised a team.

Alternatively, imagine the network being a distributed software system. If a peak in load is experienced and a component is pushed beyond its response threshold once again it can affect the entire system disproportionately.

As with the previous example, teams will naturally detect and counteract these tipping points - people with too many interactions will turn off Slack if they can’t keep up. Profilers will be brought to bear to optimise some code just enough to stop overloading the system. This has the result of leaving us constantly on the brink of catastrophe.

Wrap up

In this article there was no intention to make precise predictions about the dynamics of any particular team or system but rather demonstrate universal tendencies using basic ensemble techniques.

We’ve seen a couple of simple examples of how small increases in interactions, either in number or in strength, can have significant, non-linear effects for a system as a whole.

Furthermore, our natural reactions tend to balance the network effects and leave us at a critical point¹¹ where dangerous tipping points are continually at the doorstep. To a certain extent this explains the universal saying “if it ain’t broke, don’t fix it”.

¹¹ For a deeper look at self-organised criticality the book How Nature Works by Per Bak is the seminal work. For a nice summary of the current situation see Watkins, N.W., Pruessner, G., Chapman, S.C. et al. 25 Years of Self-organized Criticality: Concepts and Controversies. Space Sci Rev 198, 3–44 (2016).

The combination of these dynamics is why a single new team member can bring a working team to a standstill or a small change in load can wreak havoc on a seemingly well oiled system. Note that this doesn’t mean that we should avoid change, just that we need to bear in mind how the system might respond.

Strict minimisation of direct and indirect dependencies is not just about clean architectures. Removal of dependencies and working towards additional quality measures beyond the minimal “it works” moves systems further from the critical point and hence makes them considerably more stable. On the other hand, increasing code entropy and poor design choices will push a system towards the critical point, making them less stable, even if they continue working.

I’m aware that the software industry is not used to talking about systems in these terms, and some of these ideas could be considered technical and abstract. Nevertheless these examples are real and can help us increase our literacy with complex systems, our “lived, practical complexity”. They are just a couple of the many results of this type that could be applied more generally.

DIKUW for Programmers

Sun, 07 Jul 2019 18:30:00 GMT

One of the most interesting models that I’ve seen in recent months is from a talk on Systems Thinking by Dr. Russell Ackoff. It’s variously called the DIKW Pyramid (a mnemonic for Data, Information, Knowledge, Wisdom) or the Information Chain, amongst many other names. Dr. Ackoff also includes a category called “Understanding” which is avoided under some philosophical treatments but I’ll include it because I think it provides valuable insight. Hence and henceforth I’ll call it the DIKUW model.

It’s one of those models which is hard to describe but easy to grasp. In a nutshell it’s this:

Data⇒Information⇒Knowledge⇒Understanding⇒Wisdom

How each of these layers is interpreted varies. I’m going to try and approach it from three angles: as a programmer with programming constructs, with de Bono’s Mechanics of Mind model and, finally, from an AI perspective.

This first post treats it as a programmer.

Data

Imagine a random area of a memory chip full of transistors, each either on or off. One might suppose that those states mean something to someone but without any context it remains data and, actually, does not provide information at all.

That may seem counter intuitive because we are used to seeing data as information but, thinking about it, it can’t be. First, we don’t even know if it is a complete representation of anything. It’s just a state with no context. It might as well, in fact, be random¹.

¹ We may be able to apply techniques from the higher levels to extract more data from the data: average, variance, repeating patterns, etc. but until we apply the higher levels the data remains data.

Often, as programmers we confuse data with information and pay the price with bugs and errors, as we’ll see in a moment.

Information

Let’s say we have observed the state of our transistors somehow and we know which states represent either 0s or 1s, and we know it’s an 8-bit representation. Then our random piece of RAM might look like this:

01101100 01101001 01100110 01100101 11100010 10000110 10010000 01111011 
11100010 10000110 10010001 00110001 00100000 11100010 10001101 10110101 
11100010 10001000 10101000 00101110 11100010 10001000 10100111 00110011 
00100000 00110100 00111101 00101011 00101111 00101100 11000010 10101111 
00110001 00100000 00110000 00100000 00110001 11100010 10001000 10011000 
00101110 11100010 10001010 10010110 11000010 10101111 00110001 00100000 
00110000 00100000 00110001 11100010 10001000 10011000 00101110 11100010 
10001100 10111101 11100010 10001010 10000010 11100010 10001101 10110101 
01111101 00001101 00001010

So we have some information. Hurray! Or do we? Actually no. We’re still missing context. Is it part of the Chemical Brothers’ new album or an amusing cat video?

Let’s say someone tells me it’s text. OK, great, no problem. (Looks online for a binary-to-text converter.) This means we really do have some information, right?

lifeâ{â1 âµâ¨.â§3 4=+/,Â¯1 0 1â.âÂ¯1 0 1â.â½ââµ}

Oops, something is still wrong. Nobody told us the encoding. This is a common programming error, and it comes from confusing data with information. Let’s try with UTF-8 instead of ASCII.

life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}

Well, the encoding is now correct and we finally have some information (yeah!). However, it’s still gobbledygook, at least to me. Even so, just to get this far we needed to know:

1. how to observe some state (transistors on or off, in this case)
2. how to read that state (0s or 1s)
3. what it represents (image, text, audio, etc.)
4. how it’s represented, the format (UTF-8, ASCII, etc.)
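The same experiment can be sketched in a few lines of Python. This is only an illustration: it takes the first eight bytes of the dump above, and it assumes the online converter used a single-byte encoding such as Latin-1 (strict 7-bit ASCII would simply refuse bytes above 127). The helper name `bits_to_bytes` is made up for the example.

```python
# First eight bytes of the memory dump above, as 8-bit binary strings.
BITS = "01101100 01101001 01100110 01100101 11100010 10000110 10010000 01111011"


def bits_to_bytes(bits: str) -> bytes:
    """Turn whitespace-separated 8-bit groups into a bytes object."""
    return bytes(int(group, 2) for group in bits.split())


raw = bits_to_bytes(BITS)

# Same data, two different contexts.
# Assumption: the "wrong" converter used a single-byte encoding like Latin-1.
as_latin1 = raw.decode("latin-1")  # one character per byte: mojibake
as_utf8 = raw.decode("utf-8")      # three of the bytes combine into one arrow

print(repr(as_latin1))
print(as_utf8)
```

The bytes never change between the two calls; only the context we supply does.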

The good news is that information, like energy, can be manipulated and changed in form. We could convert it to base-7 or EBCDIC, if we wanted, and it wouldn’t lose its content. As long as we remember the context, the meta-data if you like, our information is safe. But what do we do with it?
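As a small illustration of changing form without losing content, here is a sketch that rewrites some bytes in base 7 and recovers them again. It assumes we treat the bytes as one big integer; the function names are hypothetical, and the byte count plays the role of the meta-data we have to remember.

```python
def to_base7(data: bytes) -> str:
    """Write the bytes, read as one big-endian integer, in base 7."""
    number = int.from_bytes(data, "big")
    digits = ""
    while number:
        number, remainder = divmod(number, 7)
        digits = str(remainder) + digits
    return digits or "0"


def from_base7(digits: str, length: int) -> bytes:
    """Recover the bytes; `length` is the context we must not forget."""
    return int(digits, 7).to_bytes(length, "big")


original = "life←".encode("utf-8")
encoded = to_base7(original)
restored = from_base7(encoded, len(original))

print(encoded)
print(restored.decode("utf-8"))
```

Without the length (and the fact that the digits are base 7, and that the bytes are UTF-8) the string of digits is back to being just data.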

Knowledge

We still have a problem. We have no idea what the information means. It looks like we have some symbols and someone probably knows what those symbols mean but I’m going to guess that most people don’t. This is beyond encoding and into the area of semiotics and language. The symbols are signifiers.

Now, let me tell you, that is actually a piece of APL code. If you know something about APL maybe you can follow what it’s doing. If not, then I can point you towards the APL wiki or the Wikipedia entry and maybe after (considerable) study you might be able to follow what it’s doing. You could even watch the video. If you can then hats off to you, sir! I personally have no idea.

As it is given, this is a recipe which solves a problem. But then we might ask ourselves: what problem?

Understanding

Even when we know what the program does we might ask ourselves: why did someone write this code in the first place? What’s its purpose? There are no comments, the variables are Greek (literally). The only clue is the name of the function…

In actual fact it is a famous one-liner for Conway’s Game of Life. If you don’t know what the Game of Life is, it’s a well-known programming toy from computer science, a cellular automaton which turns out to be Turing complete, that is, a universal computing machine. The person who wrote the code probably knew that.
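For readers who, like me, can’t read the APL, here is a rough Python sketch of what the one-liner appears to compute. It is my reading, not a translation checked by an APL expert: the grid is assumed to wrap around at the edges (the APL uses rotations), and a cell is alive in the next generation if the 3×3 block around it, including itself, sums to 3, or sums to 4 while the cell itself is alive.

```python
def life(grid):
    """One Game of Life generation on a wrap-around (toroidal) grid of 0s and 1s."""
    rows, cols = len(grid), len(grid[0])

    def block_sum(r, c):
        # Sum of the 3x3 block centred on (r, c), the cell itself included.
        return sum(
            grid[(r + dr) % rows][(c + dc) % cols]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
        )

    return [
        [
            1 if block_sum(r, c) == 3 or (block_sum(r, c) == 4 and grid[r][c]) else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A "blinker": three cells in a row flip between horizontal and vertical.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

for row in life(blinker):
    print("".join("#" if cell else "." for cell in row))
```

Knowing what each line does is still only knowledge; why the rule is interesting is the next level up.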

They also knew intimately how to solve the problem with the given language. That requires a deeper level of reasoning than merely learning the language. It means many hours of experimentation and practice. That is why we do katas and continuous practice to hone these skills.

Understanding why and how is very different from following the rules.

Wisdom

OK, so we understand universal computation and the game of life, the APL language, the text encodings, the bytes and the transistors. We might understand the concepts but how do we know when and if to apply them? When would I consider APL over some other language? When should I use a programming language at all, and when should I be playing with my children?

You might have noticed that the ideas become more abstract and general as we move along the chain. This last level is the most difficult of all and the subject of most debate. If understanding requires reflection then this requires still more, even a personal philosophy about one’s own values.

This has been just one trivial example of the many levels of knowledge or understanding behind anything we do. We are knowledge workers, using information every day. I wonder sometimes if we ever stop to really understand some of the things we do (he says, copy/pasting from Wikipedia) or whether we are content with just knowledge.

(That’s the first part. The next part will deal with applying mental models to the different stages to understand them better in general.)

The Hodgepodge Machine

Sat, 08 Jun 2019 15:16:00 GMT

“The pleasure of complexity… the inconceivable nature of nature” - Richard Feynman

Back in the 80s, my Dad used to work as a security guard at a geothermal energy research site and sometimes I would accompany him on his rounds. The facility was well funded and very advanced for its time and there were all kinds of computer equipment, which at that time was both fascinating and alien to me. There was also a collection of science and technology magazines and sometimes I would read them for hours.

The magazines I liked most were Computer Weekly and Scientific American. I liked them mostly for their programming sections. Back in those pre-internet days it was common for the articles to include listings, mostly in Fortran, BASIC and maybe Pascal and even Assembler¹. That’s how I learned to code.

¹ Yep, we used to type machine code by hand!

² Dewdney, A. (1988). COMPUTER RECREATIONS. Scientific American, 259(2), 104-107. Retrieved from http://www.jstor.org/stable/24989205

One article I remember particularly fondly was in Scientific American’s Computer Recreations section with the caption “The hodgepodge machine makes waves”². The date was August 1988. It described a set of rules which mimicked a type of chemical reaction, the so-called Belousov-Zhabotinsky reaction, and resulted in fantastic shapes and behaviours emerging almost like magic. I was able to reproduce that at the time on my Amstrad CPC even though it took ages to update each frame. It was one of my first programming successes.

A few weeks ago one of my colleagues at Codurance wrote an article called Nature in Code where she simulated biological interactions in code and showed how interesting phenomena emerge naturally. It reminded me of the Hodgepodge Machine and inspired me to try it again with more modern techniques (thank you Solange). So I sat down this morning and, with a little help from this blog post, wrote it in Julia. This is the result.

Compare that to this real reaction:

This ties in with recent work I’ve been doing on dynamical systems and complexity. The hodgepodge machine is a dynamical system: deterministic but, under certain conditions, totally unpredictable.

As Feynman said, “And it’s all really there…but you’ve got to stop and think about it to really get the pleasure about the complexity; the inconceivable [he chuckles] nature of nature.”

Code is on GitHub here.

Chaotic Waterwheel with Planck

Sun, 02 Jun 2019 16:22:00 GMT

(Update: Many thanks to shakiba for fixing this and adding it to the Planck.js homepage as an example.)

I’m just finishing Steven Strogatz’s Nonlinear Dynamics and Chaos course and one of the systems he discusses is the chaotic Malkus water wheel. This is a real physical setup devised to mimic the famous Lorenz equations.

“In the 1960s… a real system was needed to demonstrate that chaos and the butterfly effect were realities and not mere mathematical artefacts… W.V.R. Malkus, a mathematician at MIT, realized that the Lorenz-Equations can be transformed into the equations of motion of a waterwheel. This waterwheel was built at MIT in the 1970s and helped to convince the sceptical physicists of the reality of chaos” - taken from here
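To get a feel for the butterfly effect the quote refers to, here is a minimal sketch that integrates the Lorenz equations for two almost identical starting points and reports how far apart they drift. The parameter values (σ = 10, ρ = 28, β = 8/3), the Runge-Kutta integrator, the step size and the starting points are my assumptions, chosen because they are the textbook defaults; none of them come from the course or from the waterwheel itself.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (classic parameters assumed)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = f(state)
    k2 = f(shifted(state, k1, dt / 2))
    k3 = f(shifted(state, k2, dt / 2))
    k4 = f(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def max_separation(start_a, start_b, steps=5000, dt=0.01):
    """Largest distance reached between two trajectories over the run."""
    a, b = start_a, start_b
    worst = 0.0
    for _ in range(steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
        gap = sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
        worst = max(worst, gap)
    return worst


# Two starting points that differ by one part in a hundred million.
print(max_separation((1.0, 1.0, 1.0), (1.0, 1.0, 1.0 + 1e-8)))
```

Both runs are completely deterministic, yet the tiny initial difference grows until the two trajectories have nothing to do with each other, which is the behaviour the waterwheel makes visible.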

It consists of a stream of water feeding into multiple leaky cups mounted on a rotating wheel. The weight of the water in the cups produces chaotic behaviour, causing the wheel to rotate in different directions unpredictably. So today I thought I’d play with this idea and try to reproduce it in a 2D physics engine. I chose Planck, a JavaScript physics engine based on the Box2D implementation in C++.

It took a while to tune the size of the balls representing the water flow and the gaps in the cups which regulate the outflow. It’s far from perfect but it demonstrates the idea.

Anyway, here’s the result.
