An Introduction on How to Choose the Right Collection
Since all computational processes take time to execute, the most important thing to understand when we have more than one option we can pursue is the trade offs between them. Then we can select the most suitable implementation that ensures both functionality and performance quality.
Perhaps the most important consideration for collections of any kind is how the number of items stored in them affects their operations. We can describe this generally by using a big-O complexity notation. So when we say that an operation has a complexity of O(n), we mean that the underlying algorithm takes a linear amount of time proportional to the number of items in the collection. On the other hand, for a complexity of O(1), we have a constant time for its execution, regardless of the collection’s size.
With this concept in place, let's examine what Java collections can do for us and how they work.
What Can Collections Do?
In almost all programming languages, a collection is a way to group together values or their memory references in a way that makes it easy for us to add to, fetch, remove, or search them. In Java, the API specification defines the interfaces for all kinds of collections, including lists, sets and queues, all of which extend from a collection global interface with their particular operation contracts.
Here's a list of the interfaces and the most important methods they expose:
- Collection: Global interface for each collection: [add, remove, contains, size, clear, iterator] (available in all interfaces below)
- List: Indexed list of objects: [get, set]
- Set: Unique elements list of objects: 
- SortedSet: Sorted list of unique objects: [comparator, first, last, headSet, tailSet, subSet, comparator]
- Queue: FILO: [offer, poll, peek]
As we can see, certain collections work better in certain scenarios. Let's take a closer look at how each collection's implementations structure items to handle different situations.
An Introduction to Some Simple Collections
The simplest collection implementation available in the Java API is the ArrayList. This is the collection you'll want to use if the amount of data is small, if it can repeat, and if the order the data is stored in is irrelevant.
Given its indexed array structure, adding and getting elements have a constant complexity (O(1)) with the coast of a linear complexity (O(n)) to remove or search elements.
Another simple collection we can use is the Vector. A Vector has pretty much the same implementation as the ArrayList except that it includes the advent to be synchronized, which means that it is thread safe.
The complexity of adding elements in an ArrayList instance can be higher when reallocation is required. Since ArrayLists use arrays as their data repository, when they get full, another array must be created to handle new items, consuming more resources during this step.
Linked Structure Collections
Collections with linked data structures have the ability to keep the order in which elements were added when iterating over them. This is important when, for instance, you're handling some already sorted data from a database. When retrieving this kind of data, we don't want to lose the order in which elements were fetched from the repository, otherwise it wouldn't make sense to delegate this extra work to the database.
The simplest linked structure is the LinkedList, which can add and remove items with a constant complexity (O(1)) but has a linear complexity (O(n)) when getting and searching for them.
LinkedList collections implement the Queue interface as well, giving this concretion another important role in the Java API. Also, all Queue-exposed method implementations have a constant complexity (O(1)), giving this class all we need to work with FILO solutions.
So far, we can see that List interface concretions are never a good option if we need a collection used mostly to search items, since we will always need to pay the price of a linear complexity (O(n)).
Hash Structured Collections
Hash structured collections are the most interesting kind of implementations because they use maps as the underlying repository for their items. As you may know, a map is a data structure that maps objects to keys, and, in the case of hash structured collections, these keys are the items' hash codes.
You could be asking yourself why this is required and if it's worth the cost of the implementation and the bigger footprint. The answer is a huge and sonorous YES!! This is because it makes a collection's search complexity constant (O(1)). But, to do so, we must make sure all of the items are unique, given their equality contracts. That's why this kind of structure can only be used in Set interface concretions.
It's important to remember, however, that giving equality contracts is not a straightforward procedure. The way we do this will mostly depend on the business rules following the Java API specification.
The Java Object class, to which all other objects extend by default, has two important methods for supporting the definition of equality between any two instances. They are the equals and hashCode methods, which have the following mutual contract:
- Whenever the equals method results in equality for any two instances, then their hash codes must also be the same.
The hashCode method returns the in-memory address of any object by default. This way, two objects will only be considered equal if they are literally the same instance, which is not a very useful constraint. For most of the cases, we need to compare objects given their states, that is, their attributes' values.
You may think that just implementing the equals method would be enough for this, and in most cases, it is. But not for hash structured collections. As mentioned previously, elements are mapped by their hash codes when stored in this kind of collection. The important thing here is that this key is the main criterion to determine equality between items. If we don't follow the contract between those two methods, the business rule of the application could fail, even though your equals method has been appropriately overridden.
When discussing hash codes, it's important to note that if two objects have the same value, it doesn't make them equal. When we have this scenario, which is called hash collision, the values are stored using a linked structure in the same key. When this happens, we can see keys as a bucket map, in which objects with the same hash are stored together and checked for equality using the equals method. The consequence of this is that we can override the hashCode method to return a constant, and it'll work as expected. There will, however be a loss in performance, because linked structures will result in a higher complexity execution time for searches, compromising all the benefits that this structure can offer.
So, to make sure we are making use of all the features of a hash structured set, here is the rule of thumb:
- When overriding an equals method, do it also for hashCode and make sure their contract is being followed.
In the Java API, we have two implementations that use this kind of structure. The simpler one is the HashSet, which doesn't keep the order elements were added. If you need to keep items ordered, a LinkedHashSet should be used instead. Both have an add and search constant complexity (O(1)), but the former has iteration complexity given by O(h/n), which means that the more elements it has, the faster it will be (where h stands for capacity).
Tree Structured Collections
Tree structured collections are another important kind of Set interface implementation. Instead of using a hashCode contract, they use another mechanism to determine equality between elements, which also results in their most important feature: elements are sorted during the collection iteration process.
Tree structured collections need to implement the SortedSet interface and require that their elements implement the Comparator interface. This way, elements can define a rule to be compared with each other, allowing them to be sorted and considered equal in certain circumstances.
As a price for this, we have a logarithmic complexity (O(log n)) for add, remove, search and iterate operations. When sorted sets are required and we can’t delegate sorting to an external system, such as a database, this highest complexity can be ignored for a small number of elements.
The only implementation of this kind is the TreeSet.
The goal of this article is to present all these important considerations in just one place to make it easier to reference collection features at any time. Of course, there's always more to say about such a complex topic, but I believe that here I could do little more than scratch the surface when discussing the core implementation details of most of the commonly used collection types.