Confusing function GroupBy
The sdk offers the following function in sdk/datastream.go:
// GroupBy a given field (identified through the field func), call reducer on
// each group, and returns a data stream in which each element is an aggregation
// result of the group. The optional param maxUniqueGroupValuesOptional can be
// supplied to optimize performance. It assumes the worst case (all values in the
// data stream are unique) if no maxUniqueGroupValuesOptional is configured.
func GroupBy[T, R CircuitVariable](
ds *DataStream[T],
reducer ReduceFunc[T, R],
reducerInit R,
field GetValueFunc[T],
maxUniqueGroupValuesOptional ...int,
) (*DataStream[R], error) {
// ...
}
It is unclear based only on the documentation comment what exactly this function is intended to do, in particular as it is not explained what is meant by a group.
From reading the implementation, it appears that the groups mentioned in the documentation refer to the elements of the data stream with the same value.
Let us go through the implementation.
if len(maxUniqueGroupValuesOptional) > 1 {
panic("invalid amount of optional params")
}
g := ds.api.g
values := make([]frontend.Variable, len(ds.underlying))
for i, v := range ds.underlying {
values[i] = field(v).Val
}
groupValues, err := computeGroupValuesHint(g, values, ds.toggles)
if err != nil {
return nil, err
}
So far, values
stores the underlying data from the data stream after applying field
, and the components of groupValues
are witnesses that are so far not constrained and can be set to anything by the prover. The hint assigning these values is implemented so as to return a list of unique components of values
, padded with zero. Note, however, that a malicious prover can assign something else as well.
The data stream that will eventually be returned is made up of data aggResults
and toggles aggResultToggles
, of length maxGroupValues
, which is the full length of the values if the user did not supply a length with the optional argument maxUniqueGroupValuesOptional
:
maxGroupValues := len(values)
if len(maxUniqueGroupValuesOptional) == 1 {
maxGroupValues = maxUniqueGroupValuesOptional[0]
}
aggResults := make([]R, maxGroupValues)
aggResultToggles := make([]frontend.Variable, maxGroupValues)
The following comment then explains why it is accepted that a malicious prover can assign different groupValues
than intended:
// Filter on each groupValue, then reduce using the user supplied function. Note
// that the groupValues are computed using hints. This is ok because the
// groupValues are used as predicates in the filter step. Assuming the
// outside-of-circuit-computed groupValues are all malicious (does not exist in
// the input data stream), then the filter step would result in empty data
// streams, and the reduce results would be all toggled off. This decision is
// made based on the premise of which it is accepted that we can't prove what's
// NOT provided to the circuit as inputs.
We then arrive at the actual logic to construct the results' data stream.
for i := 0; i < maxGroupValues; i++ {
vs := groupValues[i]
group := Filter(ds, func(current T) Uint248 {
v := field(current)
return newU248(g.IsZero(g.Sub(v.Val, vs)))
})
The original data stream is filtered to only have those entries enabled that match the current groupValues
component after applying field
.
Then the reduction is applied to this filtered data stream:
aggResults[i] = Reduce(group, reducerInit, reducer)
Finally, the toggle for the result is computed as described in the comment:
// only turn on toggles for agg result if the filtered group has at least 1 item
aggResultToggles[i] = g.Sub(1, g.IsZero(Count(group).Val))
}
return newDataStream(ds.api, aggResults, aggResultToggles), nil
Thus, one could describe the function as follows.
The prover can supply a list of up to maxUniqueGroupValuesOptional[0]
(or len(ds.underlying)
if maxUniqueGroupValuesOptional
is not passed) arbitrary values. The resulting data stream will then at component i
have the result of the reduction computed over as many copies of the i
th value supplied by the prover as there are in ds
. If there are no copies in ds
, then the component is toggled off.
Note that in particular, the prover can always choose values that do not occur, and thereby produce an empty (as in everything toggled off) result. Also, the prover can use values multiple times, and the aggregations will then also occur multiple times in the result.
Outside of a test, there is no usage example for GroupBy
in the codebase. It is unclear how GroupBy
should be used given that the caller does not get any information on which values ds
was filtered for, only the aggregations. It seems like it might be more useful to, for example, make aggResults[i]
a tuple containing both Reduce(group, reducerInit, reducer)
and vs
, so that the caller knows which value was filtered and reduced over. Perhaps it might also make sense to ensure that the values in groupValues
chosen by the prover are all unique. This might run into issues with the value zero used for padding, however, as it would be unclear how to distinguish between zero used as padding and zero used as one of the values.
We recommend to clarify intended functionality and usage of GroupBy
with additional documentation and possibly an example.