Logical operations over matrices of sorted numbers
Our e-commerce view engine (E.V.E.) works with matrices of sorted numbers at the lowest level of its application logic. I’d like to describe some of the algorithms we use to combine such matrices in logical operations.
Why do we keep the data in matrices in sorted order? There is a price to be paid when adding a new number to a matrix when sorted order is required - appending the number at the last position would be much faster. The reason is simple - when matrices contain only sorted data we can use much more efficient (faster) algorithms to combine them. We pay a price when updating matrices in order to gain better speed when combining them.
Let’s look at what a sample matrix looks like:
[[1, 10, 18, 24, 45, 89, 112, 254, 266, 312],
[1, 2, 4, 6, 18, 24, 29, 33],
[2, 3, 8, 10, 16, 18, 24, 26, 32]]
As you can see, it’s a two-dimensional array of numbers, with arrays of different lengths in the second dimension.
The result of our combination is going to be a java.lang.Iterator. The iterator allows us to perform “lazy” execution, so that the computation price is paid as you go and not upfront. There are situations when complete results are not necessary (for example when only the first page of records is required and the total record count is not).
Iterators produce series of monotonic numbers that can be the source of further logical operations. This fact allows us to nest iterators, so that outer iterators consume the output of inner iterators, creating complex computation trees.
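To illustrate this with a minimal, hypothetical sketch (the class and method names below are mine, not E.V.E.’s actual code): the iterator lazily intersects two sorted, duplicate-free number streams, and because it is itself a java.lang.Iterator it can consume the output of another instance, forming a computation tree.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: lazily intersects two sorted, duplicate-free
// Iterator<Integer> sources and is itself an Iterator, so instances can be
// nested into larger computation trees.
public class AndIterator implements Iterator<Integer> {
    private final Iterator<Integer> left;
    private final Iterator<Integer> right;
    private Integer next; // next matching value, or null when the iterator is exhausted

    public AndIterator(Iterator<Integer> left, Iterator<Integer> right) {
        this.left = left;
        this.right = right;
        advance();
    }

    // Walks both sources forward until they agree on a value or one of them runs out.
    private void advance() {
        next = null;
        if (!left.hasNext() || !right.hasNext()) {
            return;
        }
        int l = left.next();
        int r = right.next();
        while (true) {
            if (l == r) {
                next = l;
                return;
            } else if (l < r) {
                if (!left.hasNext()) return;
                l = left.next();
            } else {
                if (!right.hasNext()) return;
                r = right.next();
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public Integer next() {
        Integer result = next;
        advance();
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 10, 18, 24, 45);
        List<Integer> b = Arrays.asList(1, 2, 4, 6, 18, 24);
        List<Integer> c = Arrays.asList(2, 3, 8, 10, 18, 24);
        // nested iterators form a small computation tree: (a AND b) AND c
        Iterator<Integer> and = new AndIterator(new AndIterator(a.iterator(), b.iterator()), c.iterator());
        while (and.hasNext()) {
            System.out.println(and.next()); // prints 18 and 24
        }
    }
}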
Aside: performance results in this article were measured on JDK 8, Intel® Core™ i7-5600U CPU @ 2.60GHz, Ubuntu 18.04.
Logical conjunction (AND)
The input of the logical conjunction is a list of integer arrays (it could also easily be modified to accept an array of arrays, but a list of arrays is more suitable for our usage pattern). Integers in the arrays are monotonic (unique and increasing with position). The algorithm has statistically better performance if the input list of arrays is sorted by array size in ascending order.
We first initialize an array of counters that keeps pointers (indexes) into the individual arrays of the input list. The array of counters has the same size as the list, and all counters start at index zero.
The algorithm tries to find indexes in all of the input arrays that point to the same number. Look at the following example, where the shared numbers are aligned:
[[ 2, 4, 6, 8, 10, 12],
[ 3, 6, 9, 12],
[ 1, 4, 6, 7, 12]]
We see that the first match occurs at positions [3, 2, 3] and the second at [6, 4, 5] (counting positions from one). The produced result of the conjunction will be [6, 12] - only those numbers are present in all three input arrays.
One big advantage of this algorithm is the escape route that allows it to finish early when one of the examined arrays has no more numbers left while all other arrays have bigger numbers at their current “counter” positions. Let’s show this situation in an example:
[[1, 2, 3, 4, 5, 6 ],
[ 1 ],
[ 1, 5, 6, 7, 9, 10]]
We can see that the only shared number is at index position 0. After that we can immediately say there are no other shared numbers, even though there are lots of additional numbers in the first and third arrays. The escape route allows us to quickly reach the end of the computation.
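To make the description concrete, here is a minimal Java sketch of the idea (illustrative only - the class and method names are mine, and this is not the embedded E.V.E. implementation): one counter per input array, the counters chase a candidate value taken from the first array, and the loop exits as soon as any array runs out.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedArrayAnd {

    // Illustrative sketch: returns numbers present in every array of the input
    // list; each array must be sorted and duplicate-free.
    static List<Integer> and(List<int[]> inputs) {
        List<Integer> result = new ArrayList<>();
        int[] counters = new int[inputs.size()]; // one pointer per input array, starting at zero

        while (true) {
            // the candidate is the number the first array currently points at
            if (counters[0] >= inputs.get(0).length) {
                return result; // escape route: the first array is exhausted
            }
            int candidate = inputs.get(0)[counters[0]];
            boolean matchInAll = true;

            for (int i = 1; i < inputs.size(); i++) {
                int[] array = inputs.get(i);
                // move this array's counter forward while its numbers are smaller than the candidate
                while (counters[i] < array.length && array[counters[i]] < candidate) {
                    counters[i]++;
                }
                if (counters[i] >= array.length) {
                    return result; // escape route: this array is exhausted, no further match is possible
                }
                if (array[counters[i]] != candidate) {
                    matchInAll = false; // candidate is missing here, try a bigger candidate
                    break;
                }
            }

            if (matchInAll) {
                result.add(candidate);
            }
            counters[0]++; // advance the candidate array
        }
    }

    public static void main(String[] args) {
        List<int[]> inputs = Arrays.asList(
                new int[]{2, 4, 6, 8, 10, 12},
                new int[]{3, 6, 9, 12},
                new int[]{1, 4, 6, 7, 12});
        System.out.println(and(inputs)); // prints [6, 12]
    }
}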
The algorithm is also described, for example, on the GeeksforGeeks site. Its worst-case complexity is O(N1 + N2 + … + Nlast), i.e. the sum of the lengths of all input arrays.
Our implementation can be seen here:
Measured performance:
- 1m unique random numbers
- 10 arrays with minimum of 540k and maximum 900k numbers
- average AND performance is 47ms per computation with result array size around 46k
Performance improves when one of the randomly selected array sizes is close to 540k and degrades when all arrays are closer to the 900k threshold. The effect of the escape route heavily influences the final performance.
For a lower count of intersected arrays where there is a bigger difference in array lengths, there may be a better algorithm called Shotgun.
Logical disjunction (OR)
Disjunction is a much more expensive algorithm than conjunction - there is no fast escape route and we need to go through all numbers in all arrays and get rid of duplicates.
Example input:
[[ 2, 4, 6, 8, 10, 12],
[ 3, 6, 9, 12],
[ 1, 4, 6, 7, 12]]
Produces output:
[1, 2, 3, 4, 6, 7, 8, 9, 10, 12]
The implementation uses a so-called priority queue that holds the current element from each array in the input list (the size of the priority queue equals the size of the list and can only get smaller over time). Every item pushed to the priority queue is placed at the correct position according to the specified ordering.
The algorithm pops the top element from the priority queue and appends it to the result array. After processing the popped number, the counter of the array the popped number originates from is incremented and the next value from that array is pushed to the priority queue. In order to skip duplicate values we just remember the last returned number, and if the same number is picked up from the queue again, we simply increment the counter and push the next element to the queue without emitting anything.
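Again a minimal Java sketch of the described priority-queue merge (illustrative only, not the embedded implementation); each queue entry remembers which array it came from and which position it currently points at, wrapped in a small helper class I call Cursor.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class SortedArrayOr {

    // One entry per input array: which array it belongs to and which position it points at.
    private static final class Cursor {
        final int arrayIndex;
        final int position;
        Cursor(int arrayIndex, int position) {
            this.arrayIndex = arrayIndex;
            this.position = position;
        }
    }

    // Illustrative sketch: merges sorted, duplicate-free arrays into one sorted
    // result without duplicates.
    static List<Integer> or(List<int[]> inputs) {
        // the queue orders cursors by the number they currently point at
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> Integer.compare(inputs.get(a.arrayIndex)[a.position],
                                          inputs.get(b.arrayIndex)[b.position]));
        for (int i = 0; i < inputs.size(); i++) {
            if (inputs.get(i).length > 0) {
                queue.add(new Cursor(i, 0));
            }
        }

        List<Integer> result = new ArrayList<>();
        Integer last = null; // last returned number, used to skip duplicates
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int value = inputs.get(top.arrayIndex)[top.position];
            if (last == null || value != last) {
                result.add(value);
                last = value;
            }
            // advance the counter of the array the popped number originates from
            if (top.position + 1 < inputs.get(top.arrayIndex).length) {
                queue.add(new Cursor(top.arrayIndex, top.position + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> inputs = Arrays.asList(
                new int[]{2, 4, 6, 8, 10, 12},
                new int[]{3, 6, 9, 12},
                new int[]{1, 4, 6, 7, 12});
        System.out.println(or(inputs)); // prints [1, 2, 3, 4, 6, 7, 8, 9, 10, 12]
    }
}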
The algorithm is described in more detail as the Min heap algorithm on the GeeksforGeeks site and its worst-case time complexity is O(n·k·log k), where k is the number of arrays and n the (average) array length.
Our implementation can be seen here:
Measured performance:
- 100k unique random numbers
- 100 arrays with minimum of 6k and maximum 10k numbers
- average OR performance is 62ms per computation with result array size close to 100k
We can see that, relative to the amount of data processed, the performance of this operation is roughly 10x worse than that of the AND operation.
Logical negation (NOT)
The NOT operation accepts two arrays of numbers and produces an array of the numbers that are present in the second array and not present in the first array.
Example input:
[[ 3, 6, 9, 12],
[ 2, 4, 6, 8, 10, 12]]
Produces output:
[2, 4, 8, 10]
The algorithm picks a number from the subtracted (negated) array and iterates through the main array, emitting numbers as long as they are smaller than the picked one. If it reaches an equal number, it skips that number and picks the next number from the subtracted array. If it reaches a bigger number, it keeps picking next numbers from the subtracted array until it reaches an equal or bigger number, and then the process repeats.
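A minimal Java sketch of this walk (illustrative only, not the embedded implementation); for easier reading it iterates the main array and advances a pointer in the subtracted array, which is the same two-pointer idea viewed from the other side.

import java.util.ArrayList;
import java.util.List;

public class SortedArrayNot {

    // Illustrative sketch: returns numbers present in the main array but not in
    // the subtracted array; both arrays must be sorted and duplicate-free.
    static List<Integer> not(int[] subtracted, int[] main) {
        List<Integer> result = new ArrayList<>();
        int s = 0; // pointer into the subtracted array
        for (int m = 0; m < main.length; m++) {
            // move the subtracted pointer until it reaches a number >= the current main number
            while (s < subtracted.length && subtracted[s] < main[m]) {
                s++;
            }
            // emit the main number only when the subtracted array does not contain it
            if (s >= subtracted.length || subtracted[s] != main[m]) {
                result.add(main[m]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] subtracted = {3, 6, 9, 12};
        int[] main = {2, 4, 6, 8, 10, 12};
        System.out.println(not(subtracted, main)); // prints [2, 4, 8, 10]
    }
}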
Its complexity is O(M+N).
Our implementation can be seen here:
Measured performance:
- 1m unique random numbers
- 2 arrays with minimum of 600k and maximum 1m numbers
- average NOT performance is 6ms per computation with result array size close to 220k
Conclusion
This article lays the foundation for a follow-up article where we will look at how E.V.E. works internally. The algorithms mentioned above are used in the core of what E.V.E. does, and that’s why I wanted to describe them in detail. If you have any remarks or tips for a more effective solution, drop me a message on Twitter (@novoj) or e-mail novotnaci[at]gmail[dot]com.