What are non-persistent data structures

Time travel with persistent data structures

The year is 2025. Social platforms on the Internet have long since completely freed us from the need for a good memory. A large part of the communication is stored on servers for eternity and can be accessed at any time.

With a touch of nostalgia we remember the good old days around 2013 and ask ourselves why our favorite platform doesn't allow us to ask who our “friends” were back then.

Certainly not because that is not possible. Virtual “time travel” like this one is entirely possible and in many cases can even be implemented efficiently. Using the example of a friendship graph as it can be found in social platforms, with this article we will try to approach a certain class of data structures and to develop a comparatively simple solution to the problem.

In the following we will use the concept of persistent data structure use. It is not to be confused with the term persistence from data management. This is understood to be a data structure with read and update operations and the property that all update operations use the old version of the data structure. This allows us to look back into the past even after an update and to see all the old states, i.e. the complete history of the data structure.

In addition to this review functionality, persistent data structures allow, above all, to work without destructive update operations. In a previous article we explained why this type of programming is easier to understand and easier to lead to correct and more maintainable code. The minimization of side effects also offers enormous advantages for parallelism.

We now want to turn to our application example: We use C ++ 11 as the programming language here, but the ideas are completely language-independent and can be used universally. In particular, it is intended to show once again that concepts from functional programming can also be useful in imperative languages.

Our task is the following:

  • Manage a friendship graph with users and friend relationships.
  • Provide a function that returns a user's friends list.
  • Provide a function that initiates a (one-sided) friendship relationship.
  • Allow access to old versions of the graph as efficiently as possible.

First of all, of course, the question arises as to how we want to represent our graph in memory independently of any time travel. We will number our user (s) from (0) to (n - 1) and use an adjacency list display. A very simple implementation that still no Versioning supported, could look like this, for example:

The output is as expected:

This representation and the associated functions already provide the basic functionality that we need for our application. Our next goal is to make this data structure persistent.

A first attempt

We can actually implement a simple type of persistence very easily: We copy the complete data structure with each update and only change the copy. The changed code could then look like this, for example:

Of course, this variant has an obvious decisive disadvantage: We can now look into the past, but we pay a high price: Our memory consumption increases rapidly with the number of updates, since each time we copy an array that is linear in size in the Number of users is.

It is also clear that this has to be much better, because after all, only one value in this array is changed with each update. The rest remains exactly the same between two versions.

In the following, we will neglect the optimization of the friend lists themselves, since they use little memory compared to the large list of all users. Similar tricks can, however, also be used there.

When looking for a more memory-efficient design, a trick is almost always useful:

Indirection

At the moment our data structure for referencing the friend lists has exactly one level, so we can access a user's friends with a single pointer operation. We will now give up this demand. Instead, we decide to divide our array into (k) parts of equal size (level 2), which in turn are referenced from a higher-order array of size (k) (level 1):

It is clear that we can still do friend queries in constant time, this time we just need two pointer resolution instead of just one as before.

The more interesting question is what we now have to do with an update. First we have to copy the affected block in the level 2 array. We also need to copy the level 1 array as the pointers to the changed blocks change. The block size is approximately (\ frac {n} {k}), so overall we have to copy memory of the order of magnitude (\ mathcal {O} (k + \ frac {n} {k})).

Ideally, we therefore choose (k = \ sqrt {n}) so that only (\ mathcal {O} (\ sqrt {n})) memory has to be copied with each update. Since there are (10 ^ 6) users in our example, we decide on the block size (1000):

Through the indirection, we have now improved both the update time and the memory consumption of our data structure by the factor (\ sqrt {n}) without affecting the constant access time to the friends list. A real success! But can we do better?

More indirection

The question almost begs itself: Why exactly one level of indirection? Why not two or three or (m)? In fact, we can choose any (fixed) number of levels (m) so that updates with runtime and memory overhead (\ mathcal {O} (m \ cdot n ^ {\ frac {1} {m}})) are possible and anyway still guarantee constant access times.

At this point we realize that the simple data structure we have designed actually enables us to store the entire history of our graph in an efficient manner. We can do even more: We can also carry out update operations on old versions of the data structure and thus investigate what would have happened if ... In our time travel analogy, this would then be comparable to creating a parallel universe.

More indirection

If one thinks our previous approach even further, one might come up with the idea of ​​not using a constant number of levels, but a number depending on (n). Interestingly, for (m = \ log_2 n) you get a normal binary tree with update and access in (\ mathcal {O} (\ log n)), which of course is automatically persistent. During the update, we copy all nodes on the way from the root to the leaf exactly once.

"Real" implementations of persistent arrays, as briefly presented here, use various compromises and tricks to keep the runtime of the important operations as short as possible in reality. As an example, reference is made to the Clojure implementation of, which works with (m = \ log_ {32} n) and thus optimally uses the parallel processing of the CPU at word level and also the fast CPU cache.

outlook

Of course, not only graph data structures can be implemented persistently. Practically all important data structures such as lists, sorted sequences, search trees and hash-based associative data structures have persistent analogues. They are mainly used in functional languages.

For more information, see, for example, a blog article by Debasish Ghoshs and the excellent lecture by MIT professor Erik Demaine on the subject of persistent data structures.

The entire code of the examples shown above can of course, as always, be downloaded to try out!


About the author:

In addition to his studies, Niklas Baumstark develops software at Medilyse GmbH in the functional paradigm. His areas of expertise include system-level programming and algorithms.