Why it is impossible to detect cyclic arrays in pure PHP

Part 1 of series Detecting cyclic arrays in PHP.

This post is about a problem that might only be relevant to a small minority of PHP programmers who maybe want to write a custom serializer, or something else that requires recursively iterating an array. But, in trying to solve it, I found some interesting PHP weirdness that I think is worth highlighting.

As an aside:
If you are just looking for a reliable solution for detecting cyclic arrays – and you want to miss out on learning about some cool (but also complicated) PHP quirks – then skip to Part 2 of this series.

First of all, let’s define our terms. For the purpose of this post, we can espouse a very pragmatic definition: An array is cyclic if iterating all of its items and subitems does not terminate. Cyclic arrays are also sometimes called circular or recursive. Let’s look at some examples:

// Contains a reference to itself
$v = [1, 2, 3];
$v[1] = &$v;

// Contains a nested array that contains a reference to $x
$x = [1, [2, 3]];
$x[1][1] = &$x;

// Contains a nested array that contains a reference to an ancestor
$y = [1, [2, [3, 4]]];
$y[1][1][1] = &$y[1];

// Contains a nested array that contains a reference to itself
$z = [1, [2, 3]];
$z[1][1] = &$z[1];

I think you get the idea. All of these arrays contain a cycle at some level.

Attempting a pure PHP solution

Here is a simple algorithm for detecting cyclic arrays: For a given array, iterate all items as well as subitems and mark every array along the way. If at any point you encounter an already marked array, then you found a cycle and the input array is cyclic. This is how one could write it in PHP:

function is_cyclic(array &$array)
{
    $lastKey = array_key_last($array);
    if ($lastKey === null) {
        // Array is empty
        return false;
    }
    static $marker;
    if ($marker === null) {
        $marker = new stdClass();
    }
    if ($array[$lastKey] === $marker) {
        return true;
    }
    $array[] = $marker;
    foreach ($array as &$item) {
        if (is_array($item) && is_cyclic($item)) {
            array_pop($array); // Remove marker
            return true;
        }
    }
    array_pop($array); // Remove marker
    return false;
}

It is worth pointing out that the array argument must be passed by reference; or else we would be appending the marker to a copy and the algorithm would never terminate when given a recursive array as input. Keep this in mind because it will later come back to bite us. But at least for now, this function appears to work just fine (run code).

Breaking the pure PHP solution

Let me preface this section by saying that in PHP you cannot compare arrays for instance identity as you can do with objects. The only way to determine if two variables refer to the same array is to mark the array using one variable and then to check for the mark using the other variable. Thus, if we can find a critical flaw in this procedure, then it is impossible to reliably detect cyclic arrays in pure PHP.

In order to break our function from before, we have to feed it a specially crafted array. There are several ways to do it. The one I am showing you looks very unsuspecting and might actually appear in real code.

function craft_bomb() {
    $array = [1, [2, 3]];
    $array[1][1] = &$array;
    return $array;
}

$bomb = craft_bomb();
var_export(is_cyclic($bomb));

Will this print true or false? The answer is: neither. Instead, this code will never complete until you run out of memory or hit a timeout (run code). How can this be? The array is clearly recursive as var_dump will tell you:

$bomb = craft_bomb();
var_dump($bomb);
array(2) {
  [0]=>
  int(1)
  [1]=>
  array(2) {
    [0]=>
    int(2)
    [1]=>
    *RECURSION*
  }
}

Nothing out of the ordinary here. Just to make sure, let’s dump an array with the same structure, the only difference being that it is constructed in the local scope.

$local = [1, [2, 3]];
$local[1][1] = &$local;
var_dump($local);
array(2) {
  [0]=>
  int(1)
  [1]=>
  array(2) {
    [0]=>
    int(2)
    [1]=>
    *RECURSION*
  }
}

As expected, the dumps are identical (run code), and yet, one array will crash our program and the other one will not. There clearly is more to it than meets the eye.

What is going on inside the array

Matryoshka dolls
Photo by cottonbro from Pexels

Let’s do some printf debugging to get an idea about what is actually happening in our program. When we dump the function argument on every call, we get the following output (run code):

Call 1:
array(2) {
  [0]=>
  int(1)
  [1]=>
  array(2) {
    [0]=>
    int(2)
    [1]=>
    RECURSION
  }
}

Call 2:
array(2) {
  [0]=>
  int(2)
  [1]=>
  array(2) {
    [0]=>
    int(1)
    [1]=>
    RECURSION
  }
}

Call 3:
Same as call 1

Call 4:
Same as call 2
...

These two array dumps keep on alternating forever. Interestingly enough, we would get the exact same output if we were to pass the function argument by value instead of by reference, as I alluded to earlier. It suggests that there is some copying going on, which is strange because one cannot tell by just looking at the function code.

What is perhaps more insightful than the function argument, is how the original input array changes. When we dump it for each function call, we see something quite peculiar (run code):

Call 1:
array(2) {
  [0]=>
  int(1)
  [1]=>
  array(2) {
    [0]=>
    int(2)
    [1]=>
    RECURSION
  }
}

Call 2:
array(3) {
  [0]=>
  int(1)
  [1]=>
  &array(2) {
    [0]=>
    int(2)
    [1]=>
    array(2) {
      [0]=>
      int(1)
      [1]=>
      RECURSION
    }
  }
  [2]=>
  object(stdClass)#1 (0) {
  }
}

Call 3:
array(3) {
  [0]=>
  int(1)
  [1]=>
  &array(3) {
    [0]=>
    int(2)
    [1]=>
    &array(2) {
      [0]=>
      int(1)
      [1]=>
      array(2) {
        [0]=>
        int(2)
        [1]=>
        RECURSION
      }
    }
    [2]=>
    object(stdClass)#1 (0) {
    }
  }
  [2]=>
  object(stdClass)#1 (0) {
  }
}

...

That is quite a mouthful, but, we see that the dump of the array grows one level deeper on every function call. In addition to that, the memory usage of the script steadily increases over time (run code). This leads me to believe that the runtime is lazily materializing the cyclic array in memory as we go down the call chain. It would also explain the mechanism that makes our program run forever.

What we still do not know is why the array is behaving this way and why it is different from the locally constructed array.

Arrays cannot be trusted

Scorpion
Photo by Sharath G. from Pexels

Well, in general, arrays work as one would expect. However, when you perform a value assignment with an array that contains a reference, that predictability goes out the window. “Why?” you might ask. Take a look at this piece of code and think about what output you would expect (run code):

$a = 1;
$b = [&$a];
$c = $b;
$b[0] = 2;
echo "$b[0] $c[0]"; // 2 2

If you said 2 2 then you would be correct. But it might as well have been 2 1 depending on the rest of your program, as this example shows (run code):

$a = 1;
$b = [&$a];
$c = $b;
unset($a); // <--
$b[0] = 2;
echo "$b[0] $c[0]"; // 2 1

There is a section about this exact behavior in chapter 4 of the PHP Language Specification that deals with the subject of value assignments of arrays. To summarize, there are two key takeaways that are relevant for our problem:

1. Any items in the source array which are aliases (references) get assigned to the destination array either by value or by reference, as defined by a runtime-specific algorithm.

$a = 1;
$b = [&$a]; // $b[0] is an alias for $a
$c = $b;
// The last line is equivalent to either:
// $c = [&$a] or $c = [$a];

2. The actual copying of the array might get deferred until the array is mutated (copy on write).

$a = 1;
$b = [&$a];
$c = $b;
// The last line might be sort of equivalent to $c = &$b
// until the array is mutated through either $b or $c.

With these two rules we can rationalize why the function is_cyclic behaves differently for an array constructed in the local scope and an array returned by a function.

In the first case, no value assignments are ever performed with the array, which is why the function behaves as one would expect.

In the second case, the array is returned from the function by value and then assigned to a local variable; copying being deferred (rule 2). When we want to mark the array, the runtime will first make a copy which will receive the mark (rule 2). Performing the copy also involves copying the nested array, which is again deferred, and the cycle repeats. Every time we mark an array, a new copy gets made and the initial arrays grows one level deeper.

I know, this may be hard to wrap your head around, so I created a diagram that illustrates the abstract memory model for this scenario. It loosely follows the memory model notation from the PHP Language Specification.

The morale of this story

Try to avoid copying arrays that contain aliases. The behavior is runtime dependent and the behavior might not be what you expect. Quoting from the PHP Language Specification:

For portability, it is generally recommended that programs written in PHP should avoid performing value assignment with a right-hand side that is an array with one or more elements or sub-elements that have an alias relationship.

PHP Language Specification, https://phplang.org/spec/04-basic-concepts.html

Wrapping up

As I said at the beginning, if you are interested in a solution for reliably detecting cyclic arrays – which very simple, I promise – then please check out Part 2 of this series.

You can find all code samples from this post in this GitHub repository.

Related work:
Series: Detecting cyclic arrays in PHP

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.