scikit-learn: A walk through of GroupKFold.split()

January 25, 2018

Suppose $X["groups"] = \begin{bmatrix} a \newline b \newline b \newline c \newline c \newline c \end{bmatrix}$ and n_splits=3.

Then GroupKFold.split(X, y, X["groups"]) eventually calls the _iter_test_indices method, which yields the indices of each test fold in turn. Here is its body, step by step.

# Parameter groups == X["groups"]
unique_groups, groups = np.unique(groups, return_inverse=True)

\[unique\_groups = \begin{bmatrix} a \\ b \\ c \end{bmatrix} \\ groups = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{bmatrix}\]

The new groups array is an integer encoding of X["groups"]: if X["groups"] has $n$ unique values, each entry of groups is a number from $0$ to $n-1$. Used as an index into any array of $n$ markers, it assigns one marker to every sample. E.g.

markers = np.array(['△', '○', '□'])
markers[[0, 1, 1, 2, 2, 2]]  # array(['△', '○', '○', '□', '□', '□'], dtype='<U1')

\[markers[groups] = \begin{bmatrix} △ \rightarrow a \\ ○ \rightarrow b \\ ○ \rightarrow b \\ □ \rightarrow c \\ □ \rightarrow c \\ □ \rightarrow c \end{bmatrix} \\\]

In particular, unique_groups[groups] == X["groups"] holds element-wise, so the original labels can always be recovered from the encoding.
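The encoding step can be checked in isolation. This is a minimal sketch, assuming the group column is available as a plain NumPy array; the name `raw_groups` is only illustrative and stands in for X["groups"].

```python
import numpy as np

# Group labels from the example above, one label per sample.
# `raw_groups` is an illustrative name standing in for X["groups"].
raw_groups = np.array(["a", "b", "b", "c", "c", "c"])

# return_inverse=True additionally returns, for every sample, the position
# of its label inside the sorted array of unique labels.
unique_groups, groups = np.unique(raw_groups, return_inverse=True)

print(unique_groups.tolist())  # ['a', 'b', 'c']
print(groups.tolist())         # [0, 1, 1, 2, 2, 2]

# Indexing any length-3 array with `groups` relabels the six samples.
markers = np.array(["△", "○", "□"])
relabelled = markers[groups]   # one marker per sample: △ ○ ○ □ □ □

# Indexing unique_groups itself recovers the original labels.
print(np.array_equal(unique_groups[groups], raw_groups))  # True
```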

n_groups = len(unique_groups)  # 3
 
# Weight groups by their number of occurrences
n_samples_per_group = np.bincount(groups)

\[n\_samples\_per\_group = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}\]
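To see what np.bincount does here, the sketch below compares it with an explicit count per integer code. The `manual` array is only an illustration and is not part of scikit-learn's code.

```python
import numpy as np

# Integer codes produced by np.unique(..., return_inverse=True) above.
groups = np.array([0, 1, 1, 2, 2, 2])

# bincount[i] is the number of times the integer i occurs in `groups`.
n_samples_per_group = np.bincount(groups)
print(n_samples_per_group.tolist())  # [1, 2, 3]

# The same counts computed explicitly, one group code at a time.
manual = np.array([(groups == code).sum() for code in range(groups.max() + 1)])
print(np.array_equal(manual, n_samples_per_group))  # True
```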

# Distribute the most frequent groups first
indices = np.argsort(n_samples_per_group)[::-1]
n_samples_per_group = n_samples_per_group[indices]

\[indices = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} \\ n\_samples\_per\_group = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}\]
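The descending sort can be reproduced on its own. In this small sketch, `sorted_counts` is an illustrative name; the original code reuses n_samples_per_group for the sorted array.

```python
import numpy as np

n_samples_per_group = np.array([1, 2, 3])

# argsort returns the positions that would sort the array in ascending
# order; reversing that result gives descending order.
indices = np.argsort(n_samples_per_group)[::-1]
sorted_counts = n_samples_per_group[indices]

print(indices.tolist())        # [2, 1, 0]
print(sorted_counts.tolist())  # [3, 2, 1]

# indices[k] is the group code of the k-th largest group, so the sorted
# counts can always be traced back to the group they belong to.
```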

# Total weight of each fold
n_samples_per_fold = np.zeros(self.n_splits)  # [0, 0, 0]

# Mapping from group index to fold index
group_to_fold = np.zeros(len(unique_groups))  # [0, 0, 0]

# Distribute samples by adding the largest weight to the lightest fold
for group_index, weight in enumerate(n_samples_per_group):
    lightest_fold = np.argmin(n_samples_per_fold)
    n_samples_per_fold[lightest_fold] += weight
    group_to_fold[indices[group_index]] = lightest_fold

group_index = 0; weight = 3
- lightest_fold = 0
- n_samples_per_fold[0] = 3
- group_to_fold[2] = 0
group_index = 1; weight = 2
- lightest_fold = 1
- n_samples_per_fold[1] = 2
- group_to_fold[1] = 1
group_index = 2; weight = 1
- lightest_fold = 2
- n_samples_per_fold[2] = 1
- group_to_fold[0] = 2

\[group\_to\_fold = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}\]
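The greedy loop is easier to appreciate with more groups than folds. The sketch below wraps the same steps in a hypothetical helper, `assign_groups_to_folds` (not a scikit-learn function), and uses integer arrays instead of the float arrays that np.zeros creates by default. The second call uses made-up group sizes purely for illustration.

```python
import numpy as np

def assign_groups_to_folds(n_samples_per_group, n_splits):
    """Hypothetical helper: largest groups first, each into the lightest fold."""
    n_samples_per_group = np.asarray(n_samples_per_group)
    indices = np.argsort(n_samples_per_group)[::-1]
    n_samples_per_fold = np.zeros(n_splits, dtype=int)
    group_to_fold = np.zeros(len(n_samples_per_group), dtype=int)
    for position, weight in enumerate(n_samples_per_group[indices]):
        # argmin returns the first fold with the smallest total so far.
        lightest_fold = np.argmin(n_samples_per_fold)
        n_samples_per_fold[lightest_fold] += weight
        group_to_fold[indices[position]] = lightest_fold
    return group_to_fold, n_samples_per_fold

# The example from this post: three groups, three folds.
g2f_a, sizes_a = assign_groups_to_folds([1, 2, 3], n_splits=3)
print(g2f_a.tolist(), sizes_a.tolist())  # [2, 1, 0] [3, 2, 1]

# Illustrative sizes: five groups spread over two folds.
g2f_b, sizes_b = assign_groups_to_folds([5, 4, 3, 2, 1], n_splits=2)
print(g2f_b.tolist(), sizes_b.tolist())  # [0, 1, 1, 0, 0] [8, 7]
```

In the second call the 15 samples end up split 8 to 7, which is the balancing effect of always adding the next-largest group to the lightest fold.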

indices = group_to_fold[groups]

Key step! group_to_fold plays the same role as the markers array above: indexing it with groups gives every sample the fold number of its group.

\[indices = group\_to\_fold[groups] = \begin{bmatrix} 2 \rightarrow a \\ 1 \rightarrow b \\ 1 \rightarrow b \\ 0 \rightarrow c \\ 0 \rightarrow c \\ 0 \rightarrow c \end{bmatrix} \\\]

for f in range(self.n_splits):
    yield np.where(indices == f)[0]  # note that `np.where` here returns a one-element tuple

The 1st split: f = 0, yield np.array([3, 4, 5])
The 2nd split: f = 1, yield np.array([1, 2])
The 3rd split: f = 2, yield np.array([0])
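The lookup and the np.where call can be tried directly on the arrays computed so far. This is a small sketch in which `fold_of_sample` is an illustrative name for what the original code calls indices.

```python
import numpy as np

groups = np.array([0, 1, 1, 2, 2, 2])  # integer group code per sample
group_to_fold = np.array([2, 1, 0])    # fold assigned to each group code

# Fancy indexing: every sample receives the fold of its group.
fold_of_sample = group_to_fold[groups]
print(fold_of_sample.tolist())  # [2, 1, 1, 0, 0, 0]

# With a single argument, np.where returns a tuple holding one index array
# per dimension, hence the trailing [0] for a 1-D input.
result = np.where(fold_of_sample == 0)
print(type(result).__name__, len(result))  # tuple 1

test_folds = [np.where(fold_of_sample == f)[0].tolist() for f in range(3)]
print(test_folds)  # [[3, 4, 5], [1, 2], [0]]
```

For a 1-D array, np.flatnonzero(fold_of_sample == f) gives the same index array without the tuple.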

# This is an abstract class, with `_iter_test_indices` being the method that subclasses implement
class BaseCrossValidator(with_metaclass(ABCMeta)):
    def split(self, X, y=None, groups=None):
        X, y, groups = indexable(X, y, groups)
        indices = np.arange(_num_samples(X))  # array([0, 1, 2, 3, 4, 5]) here
        for test_index in self._iter_test_masks(X, y, groups):
            train_index = indices[np.logical_not(test_index)]
            test_index = indices[test_index]
            yield train_index, test_index

    def _iter_test_masks(self, X=None, y=None, groups=None):
        """Generates boolean masks corresponding to test sets.
        By default, delegates to _iter_test_indices(X, y, groups)
        """
        for test_index in self._iter_test_indices(X, y, groups):
            test_mask = np.zeros(_num_samples(X), dtype=bool)  # the `np.bool` alias was removed in newer NumPy
            test_mask[test_index] = True
            yield test_mask

    def _iter_test_indices(self, X=None, y=None, groups=None):
        """Generates integer indices corresponding to test sets."""
        raise NotImplementedError

The 1st split:
- test_mask == np.array([False, False, False, True, True, True])
- train_index == np.array([0, 1, 2])
- test_index == np.array([3, 4, 5])
The 2nd split:
- test_mask == np.array([False, True, True, False, False, False])
- train_index == np.array([0, 3, 4, 5])
- test_index == np.array([1, 2])
The 3rd split:
- test_mask == np.array([True, False, False, False, False, False])
- train_index == np.array([1, 2, 3, 4, 5])
- test_index == np.array([0])
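The mask arithmetic in split() can be replayed without scikit-learn. The sketch below hard-codes the three test folds from the walkthrough; `n_samples`, `test_folds` and `splits` are illustrative names.

```python
import numpy as np

n_samples = 6
indices = np.arange(n_samples)  # array([0, 1, 2, 3, 4, 5])

# Test indices yielded by _iter_test_indices in the walkthrough above.
test_folds = [np.array([3, 4, 5]), np.array([1, 2]), np.array([0])]

splits = []
for test_index in test_folds:
    # Turn the integer indices into a boolean mask over all samples ...
    test_mask = np.zeros(n_samples, dtype=bool)
    test_mask[test_index] = True
    # ... so that the training set is simply the complement of the mask.
    train_index = indices[np.logical_not(test_mask)]
    splits.append((train_index.tolist(), indices[test_mask].tolist()))

for train, test in splits:
    print("train:", train, "test:", test)
# train: [0, 1, 2] test: [3, 4, 5]
# train: [0, 3, 4, 5] test: [1, 2]
# train: [1, 2, 3, 4, 5] test: [0]
```

Going through a mask means the training indices never have to be computed separately.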

P.S. Note that the output of the GroupKFold code shown here is fully determined by its input, so no random seed is needed.
