Adrian Russell - Sequences 2: Collections

A Collection is Sequence with guarantees that its elements can be iterated through repeatedly, without consuming them, and also allows elements to be accessed by a specified index without needing to iterate to find them.

Lets have a look at Array again. We previously saw that we can iterate through the elements of an array using a for-in loop and each of the elements is supplied by the Iterator for the array. We can also access a single element in the array by using a zero-offset index for its position;

let array = ["a", "b", "c", "d", "e"]

let firstElement = array[0]     // contains "a"
let middleElement = array[2]    // contains "c"

This index access is defined in the Collection protocol.

protocol Collection<Element>: Sequence {
    subscript(position: Self.Index) -> Self.Element { get }

    var startIndex: Self.Index { get }

    var endIndex: Self.Index { get }  // The 'past the end' index.

    func index(after i: Self.Index) -> Self.Index

    func formIndex(after i: inout Self.Index)
}

Index

The index of a collection is the value which can be used to efficiently address a single element within the collection.

There are two properties on a collection to define the boundaries of the indices: startIndex is the index of the first element and endIndex which is the ‘past the end’ index (i.e. it does not correspond to an element in the collection but is the index after that of the final element.

Why, you may wonder, is the endIndex not just the index for the final element? It is general convention and can be seen in c++ and it allows for certain conveniences; if a collection is empty its endIndex will equal its startIndex which allows for the efficient constant-time implementation of the var isEmpty: Bool { get } property. This efficient implementation is why it is always preferable to use the property rather than querying the count to determine if a collection is empty.

The index does not necessarily denote a logical ordered position for the element within a collection. It also does not even denote the way which a developer would access the element. It exists purely for the iterator & subsequence functionality of Collection to allow that element to be efficiently accessed.

Let’s look at Set and Dictionary in the standard library to understand this better.

Set is an unordered collection which can only store unique elements (it cannot contain multiples of the same element). From a developer perspective, an element in a Set does not have a position; it is either in the set or its not. The generic methods to interact with the elements are insert(element), remove(element) and contains(element).

So what does Set use for its index? Well, the actual type is Set.Index and it look as follows:

extension Set {
    public struct Index {
        internal enum _Variant {
            case native(_HashTable.Index)
            case cocoa(__CocoaSet.Index)
        }
        internal var _variant: _Variant
    }
}

It has that _variant property which is an enum holding the index for the actual backing store of the set. It can either be backed by a cocoa NSet/CFSet or it can be a native set which is backed by a hash table.

So what is the index of the hash table? Well in a simplified manner it looks as follows:

extension _HashTable {
    internal struct Index {
        let bucket: Bucket
        let age: Int32

  
        internal init(bucket: Bucket, age: Int32) {
            self.bucket = bucket
            self.age = age
        }
    }
}

It is pointing to the bucket of the hash table which stores the element allowing it to be found without having to perform the hash on the element to find that bucket.

Dictionary is a unordered key-value store in which allows values/elements to be accessed by use of a hashable key. From the developer perspective that element is accessed though that key. It would seemingly make sense that the Element type for a dictionary is its element and the Index value is the type of the key.

That is not the case however, and there is a hint of that when we look at it being consumed by a for-in loop:

let dictionary = ["a": 1, "b": 2, "c": 3, "e": 5]
for (key, element) in dictionary {
    // consume key and element
}

The Element is in-fact a tuple which contains both the key and the element. So what is the Index? Well its going to look pretty familiar:

extension Dictionary {
    public struct Index {
        internal enum _Variant {
            case native(_HashTable.Index)
            case cocoa(__CocoaDictionary.Index)
        }
        internal var _variant: _Variant
    }
}

Yep, it’s pretty much exactly identical to Set including being backed by the same hash table.

A developer will likely never directly interact with the indices for elements in a dictionary. They exist for allowing the quick access to an element when it is being iterated. It may be developer facing such as the Integer index of an Array or iteration specific like for Dictionary. It is something you can also consider whenever making custom Collections; how can I make the Index to access an element as quickly as possible?

Subsequence

A subsequence is a contiguous subrange of a collection which will use the original collection as its backing store to avoid having to copy the contents into a new collection.

An important thing to note is that the Indices for the elements within a Subsequence exactly match those of the original sequence. It makes sense really, those indices are designed to allow for quickly accessing the elements and the subsequence is backed by the original collection so each of the elements would be accessed by those original indices.

This can cause some confusion for developers though, especially for types which use their indices for element access, such as Array.

The first element of an Array always has the index 0 and developers will often use this to access that first element. Array does offer the var first: Self.Element? { get } property to access the first element but it does return an optional which will need to be unwrapped to use. definitely safer but sometimes seen as an extra step, especially if the developers knows the array will not be empty.

This does present an issue when using the Subsequence of array; ArraySlice. The first element of an ArraySlice will likely not have its first index as 0 (unless the subsequence is a prefix of an array) and trying to access that element will cause an ‘Index out of bounds’ fatal error.

TITLE

BidirectionalCollection

A BidirectionalCollection extends the Collection protocol to allow for efficient reversed iteration.

The protocol adds methods for getting the index before a specified index.

protocol BidirectionalCollection<Element>: Collection {
    func index(before i: Self.Index) -> Self.Index
    func formIndex(before i: inout Self.Index)
}

A Bidirectional collection gets a further optimisation via an overloaded reversed() method. The implementation for Sequence and Collection both return a new array which contains all the elements in reverse order. BidirectionalCollection instead returns a ReversedCollection object which wraps the original collection without having to allocate a new array of perform the reverse. This can then be efficiently iterated through.

RandomAccessCollection

A RandomAccessCollection extends from BidirectionalCollection but does not add any new methods or requirements for the collection (beyond its Indicies & SubSequence being required to also conform to RandomAccessCollection).

That it does expect, however, is that calculating the distances between and traversing the indices of the collection is an efficient, constant-time (O(1)) operation.

This can be either be done by having the Index type conform to the Stridable protocol which allows the distance to be calculated or by implementing the index(_:offsetBy:) and distance(from:to:) methods to allow for efficient, constant-time calculation.

MutableCollection

The MutableCollection protocol specifies that a collection is mutable by allowing the element at an index to be replaced. It does not mean that a collection is resizable, nor that an element be removed; effectively a mutable collection allows the elements at indices to be replaced but not necessarily that the indices of the collection can change.

RangeReplaceableCollection

A RangeReplaceableCollection protocol can be considered the opposite/compliment of MutableCollection. It allows a range of indices (and their elements) to be added, removed, or replaced but does not necessarily guarantee that those indices will still exist when replaced, so you cannot directly get replace the value at a specific index, you can only replace the elements at an index or range of indices with elements and the collection will determine what indices they will use.

So a question arises. Why does RangeReplaceableCollection not inherit from MutableCollection? Well, the answer is String.

String is a BidirectionalCollection which conforms to RangeReplaceableCollection but not MutableCollection.

The reason for this is that not all displayable characters are the same length.

Character	UTF16	Unicode Scalers
e	[101]	[“\u{0065}”]
é	[233]	[“\u{00E9}”]
👨	[55357, 56424]	[“\u{0001F468}”]
👩	[55357, 56425]	[“\u{0001F469}”]
👦	[55357, 56422]	[“\u{0001F466}”]
👧	[55357, 56423]	[“\u{0001F467}”]
👨‍👩‍👦	[55357, 56424, 8205, 55357, 56425, 8205, 55357, 56422]	[“\u{0001F468}”, “\u{200D}”, “\u{0001F469}”, “\u{200D}”, “\u{0001F466}”]
👨‍👩‍👧	[55357, 56424, 8205, 55357, 56425, 8205, 55357, 56423]	[“\u{0001F468}”, “\u{200D}”, “\u{0001F469}”, “\u{200D}”, “\u{0001F467}”]
👨‍👩‍👧‍👦	[55357, 56424, 8205, 55357, 56425, 8205, 55357, 56423, 8205, 55357, 56422]	[“\u{0001F468}”, “\u{200D}”, “\u{0001F469}”, “\u{200D}”, “\u{0001F467}”, “\u{200D}”, “\u{0001F466}”]

In the later three emoji, you can see that they are actually composed of the individual face emojis above but joined by a “zero-width joiner” (\u{200D}) to make them into a single character.

This is also part of the reason why String does not use just an integer for its index. The position and size of a character can be widely different and the index is there for efficient access to that character.