Multi-dimensional aggregation and grouping are common forms of analysis traditionally done with SQL databases, OLAP (Online Analytical Processing) tools and/or Excel pivot tables. These traditional tools use a flat table representation, whereby tuples are mapped to measures without any nesting. Increasingly, users reach for general-purpose programming languages, such as R, Python, Scala and Haskell, for further offline analysis. These give us the opportunity to work with nested (inductive) structures, which functional languages like Haskell are particularly good at dealing with. Hopefully this post will prove that point.
Flat versus nested
Let’s start with an example flat dataset:
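The original table isn't reproduced here; a small hypothetical example of the kind of flat trade data meant might look like:

```
TradeId  Location  Year  Month  Day  Amount
1        London    2014      1    2   100.0
2        NewYork   2014      1    3    50.0
3        London    2014      2    5    75.0
```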
Before we model the above in Haskell, let’s import some modules and define some types:
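The post's actual imports and type definitions aren't shown here; a minimal sketch, with hypothetical field names, might read:

```haskell
import           Data.Map (Map)
import qualified Data.Map as Map

-- Simple type synonyms for the dimensions (hypothetical choices):
type TradeId  = Int
type Location = String
type Year     = Int
type Month    = Int
type Day      = Int

-- A row of the flat table as a nominal record type:
data Trade = Trade
  { tradeId       :: TradeId
  , tradeLocation :: Location
  , tradeYear     :: Year
  , tradeMonth    :: Month
  , tradeDay      :: Day
  , tradeAmount   :: Double
  } deriving (Show)
```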
It is idiomatic in Haskell to use a nominal record type for a row; we’ll call ours ‘Trade’. We can then model the flat table structure above as a list of Trade: [Trade]. Even better would be to also model the uniqueness constraint using a primary key: Map TradeId Trade.
The flat structure does make a good canonical representation, and we could always add additional structures for efficiency (e.g. indexes). However, a nested representation offers quite a few advantages:
Nesting makes grouping explicit and assigns priorities (i.e. semantics) to the different dimensions (which I concede you may or may not want). For example, a common hierarchy of dimensions is the time units: year, month and day.
For large cubes, nesting permits efficient lookup and filtering without needing additional indexes.
Nesting allows us to separate the grouping and aggregation phases. Because SQL databases do not permit nested structures, they must perform grouping and aggregation in a single logical phase. This is why ‘GROUP BY’ needs its own ‘HAVING’ clause.
A better monoid instance for Data.Map
Before we create a nested dataset from the above rows, we should fix the monoid instance for Data.Map. The newtype MMap below has a strictly more general monoid instance. The monoid behaviour of Data.Map can be recovered by using e.g. the First monoid from Data.Monoid. MMap deserves its own module and qualified import, but here I’ll just use the Data.Map API with explicit wrapping/unwrapping to and from MMap.
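The newtype and its instance aren't reproduced above; a sketch, written for a current GHC where Semigroup is a superclass of Monoid, might be:

```haskell
import           Data.Map (Map)
import qualified Data.Map as Map

newtype MMap k v = MMap { unMMap :: Map k v }
  deriving (Show, Eq)

-- Pointwise union: values under colliding keys are combined with (<>),
-- rather than Data.Map's left-biased choice.
instance (Ord k, Semigroup v) => Semigroup (MMap k v) where
  MMap a <> MMap b = MMap (Map.unionWith (<>) a b)

instance (Ord k, Semigroup v) => Monoid (MMap k v) where
  mempty = MMap Map.empty
```

Data.Map's left-biased behaviour is then recovered at the type MMap k (First v).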
Note that the complexity of unionWith is important. Unfortunately, Data.HashMap from the unordered-containers package states a complexity of O(n+m) for unionWith, which would always give us a quadratic implementation of mconcat.
To create a nested dataset, we’ll need a grouping operator. GHC calls this groupWith for lists, which it uses for the generalized list comprehensions extension (TransformListComp). It’s defined in GHC.Exts and the type signature is:
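For reference, GHC.Exts exports groupWith at the following type; it sorts the list by the key and then groups equal keys together:

```haskell
import GHC.Exts (groupWith)

-- groupWith :: Ord b => (a -> b) -> [a] -> [[a]]
buckets :: [[Int]]
buckets = groupWith (`div` 10) [12, 3, 14]  -- [[3], [12, 14]]
```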
Let’s define an equivalent version for MMap.
```haskell
-- | Grouping for potential aggregation.
-- Note: this implementation is sequential, but could be parallelized.
groupWith :: (Key k, Key k') => (v -> k') -> MMap k v -> MMap k' (MMap k v)
groupWith f = MMap . fmap MMap . Map.foldrWithKey group Map.empty . unMMap
  where
    group k v = Map.insertWith Map.union (f v) (Map.singleton k v)
```
This does require an initial MMap. A convenient choice here is MMap TradeId Trade, which models the primary key. It is not yet a grouping, as each element can only be a single Trade record. Perhaps a more satisfying choice, albeit less convenient in Haskell, would be to use an HList or extensible record for the keys. This would allow us to use compound keys with various associated operations on those keys.
Let’s go with groupWith and perform a grouping operation to create a level of nesting. For example, to group according to ‘Location’, we would use:
```
λ> :t groupWith tradeLocation
groupWith tradeLocation :: Key k => MMap k Trade -> MMap Location (MMap k Trade)
```
This would group our test data into the structure shown below:
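The grouped structure isn't reproduced here; schematically, with hypothetical data, it has the following shape:

```haskell
-- MMap Location (MMap TradeId Trade)
--   "London"  ↦  { 1 ↦ trade1, 3 ↦ trade3 }
--   "NewYork" ↦  { 2 ↦ trade2 }
```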
A more satisfying alternative to ‘groupWith’ is perhaps the similar ‘nest’ operator below, which splits up the key. However, going down this path really does require HLists or extensible records, as nested tuples quickly get ugly.
Another advantage of the ‘nest’ grouping operator is that it becomes easy to define its categorical dual: ‘unnest’. If MMap were an ‘indexed monad’, it would be the monadic join:
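The definitions aren't shown above; a sketch, repeating the MMap wrapper so it stands alone and writing Ord for the post's Key constraint, might be:

```haskell
import           Data.Map (Map)
import qualified Data.Map as Map

newtype MMap k v = MMap { unMMap :: Map k v }
  deriving (Show, Eq)

-- Split a compound key into two levels of nesting.
nest :: (Ord k1, Ord k2) => MMap (k1, k2) v -> MMap k1 (MMap k2 v)
nest = MMap . fmap MMap . Map.foldrWithKey group Map.empty . unMMap
  where
    group (k1, k2) v = Map.insertWith Map.union k1 (Map.singleton k2 v)

-- The dual: flatten two levels back into a compound key.
unnest :: (Ord k1, Ord k2) => MMap k1 (MMap k2 v) -> MMap (k1, k2) v
unnest = MMap . Map.foldrWithKey flatten Map.empty . unMMap
  where
    flatten k1 (MMap inner) acc =
      Map.foldrWithKey (\k2 v -> Map.insert (k1, k2) v) acc inner
```

Here unnest . nest is the identity, and nest . unnest is as well, provided no outer key maps to an empty inner map.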
With deep nestings, it becomes convenient to define an infix operator for our MMap type:
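One possible spelling (the post's actual operator isn't shown; this assumes the TypeOperators extension and a hypothetical name):

```haskell
{-# LANGUAGE TypeOperators #-}

infixr 5 :->
type k :-> v = MMap k v

-- A deeply nested cube type then reads as a path of dimensions:
-- Year :-> Month :-> Day :-> TradeId :-> Trade
```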
Let’s now compose groupWith with fmap, to obtain grouping at various levels:
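A sketch of such a composition, assuming a Functor instance for MMap (the underlying Data.Map is a Functor) and the hypothetical accessor names from earlier:

```haskell
-- Each fmap pushes the grouping one level deeper into the nesting:
timeDimensions :: Key k
               => MMap k Trade
               -> MMap Year (MMap Month (MMap Day (MMap k Trade)))
timeDimensions = fmap (fmap (groupWith tradeDay))
               . fmap (groupWith tradeMonth)
               . groupWith tradeYear
```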
Using the above grouping operators, we can now obtain nested groupings, such as ‘timeDimensions’ above. However, such an API can be difficult to use. We need to keep track of the type (level of nesting) within the expression. This could get awkward once we start including aggregations.
A nice alternative is to define a grouping function that also applies a function parameter to each resultant group:
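A sketch of such a function; the name groupBy follows the text below, and it relies on the Functor instance for MMap:

```haskell
-- Group by f, then apply g to every resulting group:
groupBy :: (Key k, Key k') => (v -> k') -> (MMap k v -> r) -> MMap k v -> MMap k' r
groupBy f g = fmap g . groupWith f
```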
This function parameter can be another grouping operation, to perform at the next level of nesting. We can therefore compose ‘groupBy’ to get a very readable (positional) definition of a grouping:
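For example, a three-dimensional grouping over the hypothetical accessors from earlier, still parameterized by what to do with each innermost group:

```haskell
myCube :: (MMap TradeId Trade -> r)
       -> MMap TradeId Trade
       -> MMap Year (MMap Month (MMap Location r))
myCube f = groupBy tradeYear
         $ groupBy tradeMonth
         $ groupBy tradeLocation f
```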
Such definitions are easily composed further, without worrying about the level of nesting (let GHC figure out the type!).
Of course, our grouping/cube definitions above still do not actually do any aggregation. They remain parameterized with a function to apply to the inner-most level of nesting.
To aggregate, we need to fold or reduce the elements (possibly inner groupings) of a map into a single value (not necessarily an atomic value). The function we need is ‘foldMap’, but I’m going to create a new fancy name for it:
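Calling it rollup here (a hypothetical name; the post's own choice is not shown):

```haskell
-- foldMap under a new name: reduce a map's elements into one monoidal value.
rollup :: Monoid m => (v -> m) -> MMap k v -> m
rollup f = foldMap f . unMMap
```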
Now we have the necessary operator to roll up the innermost grouping of our cube:
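Putting the pieces together, a self-contained sketch of such an aggregation (repeating minimal versions of the earlier definitions, with Ord standing in for the post's Key constraint and a trimmed-down Trade):

```haskell
{-# LANGUAGE DeriveFunctor #-}

import           Data.Map (Map)
import qualified Data.Map as Map
import           Data.Monoid (Sum (..))

newtype MMap k v = MMap { unMMap :: Map k v }
  deriving (Show, Eq, Functor)

data Trade = Trade
  { tradeYear   :: Int
  , tradeMonth  :: Int
  , tradeAmount :: Double
  } deriving (Show)

groupWith :: (Ord k, Ord k') => (v -> k') -> MMap k v -> MMap k' (MMap k v)
groupWith f = MMap . fmap MMap . Map.foldrWithKey group Map.empty . unMMap
  where
    group k v = Map.insertWith Map.union (f v) (Map.singleton k v)

groupBy :: (Ord k, Ord k') => (v -> k') -> (MMap k v -> r) -> MMap k v -> MMap k' r
groupBy f g = fmap g . groupWith f

rollup :: Monoid m => (v -> m) -> MMap k v -> m
rollup f = foldMap f . unMMap

-- Group by year, then month, then sum the amounts in each innermost group:
myCubeSum :: MMap Int Trade -> MMap Int (MMap Int (Sum Double))
myCubeSum = groupBy tradeYear
          $ groupBy tradeMonth
          $ rollup (Sum . tradeAmount)
```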
Applying the above aggregation ‘myCubeSum’ to our test data (of type ‘MMap TradeId Trade’) would yield the following structure:
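The result isn't reproduced here; one possible shape, with hypothetical values, would be:

```haskell
-- MMap Year (MMap Month (Sum Double))
--   2014 ↦ { 1 ↦ Sum 150.0, 2 ↦ Sum 75.0 }
```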
Note that aggregations can have types that look similar to unnesting/ungrouping, but of course they do not form isomorphisms with our grouping operators. In general, we cannot undo an aggregation.
Since MMap has the general monoid instance we defined earlier, our cube aggregations will also all have monoid instances (by induction). This means we can compute partial cubes for input chunks in parallel and then mconcat/reduce them to obtain the final cube.
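Schematically, assuming the input chunks have disjoint primary keys:

```haskell
-- mconcat (map myCubeSum chunks) == myCubeSum (mconcat chunks)
-- The left-hand side can evaluate each myCubeSum in parallel.
```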
I’m using the word “subcube” here to describe cubes built from aggregating larger/deeper grouping definitions.
Instead of creating a series of operators to aggregate over each successive level of nesting, we will use a positional style, as we did with the grouping definitions.
First, let’s create a new name for fmap:
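For instance (the actual name chosen in the post isn't shown; atEach is hypothetical):

```haskell
-- fmap under a dimension-traversal name: apply f inside one level of
-- nesting, keeping that dimension's keys intact.
atEach :: (v -> v') -> MMap k v -> MMap k v'
atEach = fmap
```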
We can now just specify in the term what we want to happen to each dimension as it appears in the type. For example, to roll up just the Year and Month dimensions, we can write the following:
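A sketch, written with plain fmap (substitute whatever name one gives it) and the hypothetical types from earlier; the days and trade ids are aggregated away while year and month keys are kept:

```haskell
rollupToMonths :: MMap Year (MMap Month (MMap Day (MMap TradeId Trade)))
               -> MMap Year (MMap Month (Sum Double))
rollupToMonths = fmap (fmap (rollup (rollup (Sum . tradeAmount))))
```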
I think the above is rather nice!
Note: the literate Haskell for this entire post can be found here.