class kenchi.outlier_detection.clustering_based.MiniBatchKMeans(batch_size=100, contamination=0.1, init='k-means++', init_size=None, max_iter=100, max_no_improvement=10, n_clusters=8, n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0)

Bases: kenchi.outlier_detection.base.BaseOutlierDetector

Outlier detector using mini-batch K-means clustering.

Parameters:
  • batch_size (int, default 100) – Size of the mini batches.
  • contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
  • init (str or array-like, default 'k-means++') – Method for initialization. Valid string options are 'k-means++' and 'random'.
  • init_size (int, default None) – Number of samples to randomly sample for speeding up the initialization. If None, 3 * batch_size is used.
  • max_iter (int, default 100) – Maximum number of iterations.
  • max_no_improvement (int, default 10) – Control early stopping based on the number of consecutive mini batches that do not yield an improvement in the smoothed inertia. To disable convergence detection based on inertia, set max_no_improvement to None.
  • n_clusters (int, default 8) – Number of clusters.
  • n_init (int, default 3) – Number of initializations to perform.
  • random_state (int or RandomState instance, default None) – Seed of the pseudo-random number generator.
  • reassignment_ratio (float, default 0.01) – Control the fraction of the maximum number of counts for a center to be reassigned.
  • tol (float, default 0.0) – Tolerance to declare convergence.
anomaly_score_

array-like of shape (n_samples,) – Anomaly score for each training sample.

contamination_

float – Actual proportion of outliers in the data set.

threshold_

float – Threshold on the anomaly score, defined from contamination.
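
As a rough illustration of how contamination might translate into a threshold, the sketch below takes the (1 - contamination) quantile of the training anomaly scores and flags everything above it. The quantile rule, the helper name threshold_from_contamination, and the sample scores are assumptions made for illustration; they are not taken from the kenchi source.

```python
import numpy as np


def threshold_from_contamination(scores, contamination=0.1):
    """Hypothetical helper: (1 - contamination) quantile of the scores."""
    return np.percentile(scores, 100.0 * (1.0 - contamination))


# Made-up anomaly scores for ten training samples; the last one stands out.
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.3, 9.0])

threshold = threshold_from_contamination(scores, contamination=0.1)

# Samples scoring strictly above the threshold are treated as outliers.
is_outlier = scores > threshold

# Fraction of samples actually flagged, which may differ from the
# requested contamination when scores are tied.
actual_contamination = is_outlier.mean()
```

With tied scores the flagged fraction can differ from the requested value, which is one plausible reason for reporting contamination_ separately from contamination.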

Examples

>>> import numpy as np
>>> from kenchi.outlier_detection import MiniBatchKMeans
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = MiniBatchKMeans(n_clusters=1, random_state=0)
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
cluster_centers_

array-like of shape (n_clusters, n_features) – Coordinates of cluster centers.

inertia_

float – Value of the inertia criterion associated with the chosen partition.

labels_

array-like of shape (n_samples,) – Label of each point.
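
To see how the clustering attributes relate to an anomaly score, the sketch below fits scikit-learn's own MiniBatchKMeans (not the kenchi class) on the data from the example and scores each sample by its Euclidean distance to the nearest cluster center. That scoring rule is an assumption made for illustration; the source does not state how kenchi computes anomaly_score_.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])

# Plain scikit-learn estimator; n_init is passed explicitly so the call
# behaves the same across scikit-learn versions.
km = MiniBatchKMeans(n_clusters=1, n_init=3, random_state=0).fit(X)

centers = km.cluster_centers_   # shape (n_clusters, n_features)
labels = km.labels_             # index of the nearest center per sample

# transform() returns the distance from each sample to every center;
# the minimum over centers is the distance to the assigned center.
# Assumption: this distance serves as the anomaly score.
scores = km.transform(X).min(axis=1)

# Index of the sample that lies farthest from its cluster center.
most_anomalous = int(np.argmax(scores))
```

Under this assumption, the sum of squared scores corresponds to the quantity that inertia_ measures, so a single distant point can account for most of the inertia.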