Scala collections: how to pick the right one
Programming is understanding. Before writing a single line of code, you need to clearly know what you are dealing with — what a collection is, what is behind it, and how you can use it. Selecting the right Scala collection starts with exactly that — data structures, abstract data types, and asymptotic notation. These are the conceptual toolbox for navigating the Scala Collections Library and making informed choices.
The building blocks: what sits beneath every collection
A data structure has three main components: the data it holds, the relationships between elements, and the operations it supports—such as search, modification, and update. Operations are basically the most useful part. Scala collections are backed by various data structures — among them arrays, linked lists, hash tables, red-black trees, binary heaps, and radix-balanced tries. Understanding these structures and their performance characteristics is essential for navigating the Scala collections hierarchy.
Grouping by behavior, not by code: abstract data types
Abstract data types (ADTs) are a mechanism for grouping and classifying data structures. Many data structures have something similar — operations or data properties. ADT is how we group them. A List ADT, for example, a linked list and an array, can both be treated as a list-like data structure. They are similar in the sense that they are an ordered sequence of elements; you can take the head, the tail, or any other part. When we talk about ADTs, we primarily talk about semantics and interfaces. A clear understanding of ADTs allows you to approach any collection library with confidence.
One library, many layers: Scala collections through a CS lens
Scala collections are organized into two packages: immutable and mutable. The same ADT can have different implementations. For example, ArraySeq, ArrayBuffer, and StringBuilder are different Scala types — but they use the same data structure behind them and fulfill the same abstract data type. The Map ADT follows a similar pattern, with HashMap, LinkedHashMap, and VectorMap offering different implementations for specific requirements. Understanding the underlying data structure enables you to select the most appropriate collection.
Before you pick a collection, answer these questions
To pick a concrete collection, you first need to understand your application requirements. Six factors shape the decision:
- Domain data properties: understand the characteristics of the data — possible values, relationships, ordering, uniqueness.
- Desired operations: consider the types of operations that will be performed — insertion, deletion, retrieval, traversal.
- Performance characteristics: evaluate the expected time complexity and memory usage.
- Parallelism: consider whether the collection needs to support parallel operations.
- Thread safety: assess whether the data structure needs to be thread-safe to avoid data corruption or race conditions.
- Other requirements: consider mutability, immutability, scalability, or compatibility with existing code.
Even with the right questions in mind, two mistakes come up often. The first: using a specific implementation type in interface code — in function signatures and in traits. Using List everywhere is not ideal because List is not the fastest data structure in many cases, and the code is inflexible. Just pick the most suitable abstract thing. If you think your code should work with a sequence, pick Seq.
The second: relying on default implementations. The default implementation of Seq is just a List. But if you have a linear search scenario on data that doesn't change, ArraySeq is much more suitable. You put the data once, search many times — much faster than the default. And don't forget the Principle of Parsimony: the system should be as complex as it needs to solve the task. Choose the simplest solution that works.
Big O is not the whole story
Asymptotic notation is essential for evaluating the performance of data structure operations, including both runtime and memory usage. The idea behind this notation is to get rid of some details and focus on more abstract concepts when analyzing and comparing algorithms. Big O represents the worst-case scenario — our function is always bounded above by some other function. Big Omega is the best case — bounded below. Big Theta combines both. Scala's documentation provides a performance table using this notation. It's worth reading carefully before selecting an implementation.
Big O notation is quite abstract. But in practice, the constant factor C has a concrete value. It depends on many factors: the JVM version, the compiler, and the hardware. All of this influences the result. Array and linked-list linear search are both O(n). But on 2 million entries, the array version is several times faster. That's because modern hardware is designed for array-based data structures. The list is slower because of different constant factors, and each has a different running time.
QuickSort shows the same principle. It averages O(n log n) — a fast algorithm. But when the array is already sorted, and the partition element is chosen naively, we don't make progress in each iteration, and the running time becomes quadratic. That's the worst case. Insertion sort has quadratic time too, but on small datasets — because of constant factors — it outperforms QuickSort up to around 200 elements. That fact is used in many libraries: insertion sort for small input, QuickSort for the rest.
Scala's Vector is a subtler case. The documentation says Vector has effectively constant time, but a radix-balanced tree underlies it, so the time is logarithmic. The tree height is very low, so in practice the difference is negligible. But it is not truly O(1).
The bigger picture
Scala collections are more than a library feature. They are a structured model — a framework for reasoning about trade-offs in system design. With these concepts, you can navigate the library and make a sound decision. It's not always obvious — but that's the toolbox. And these concepts are not specific to Scala. They are not specific to any language. They are specific to computer science.
Author: Sergii Zubovych, Senior Scala Engineer at Intellias