class kenchi.outlier_detection.distance_based.KNN(aggregate=False, algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski', novelty=False, n_jobs=1, n_neighbors=20, p=2, metric_params=None)[source]
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Outlier detector using the k-nearest neighbors algorithm.
Parameters: - aggregate (bool, default False) – If True, return the sum of the distances from the k nearest neighbors as the anomaly score.
- algorithm (str, default 'auto') – Tree algorithm to use. Valid algorithms are ['kd_tree' | 'ball_tree' | 'auto'].
- contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- leaf_size (int, default 30) – Leaf size of the underlying tree.
- metric (str or callable, default 'minkowski') – Distance metric to use.
- novelty (bool, default False) – If True, predict, decision_function, and anomaly_score can be applied to new, unseen data rather than to the training data.
- n_jobs (int, default 1) – Number of jobs to run in parallel. If -1, the number of jobs is set to the number of CPU cores.
- n_neighbors (int, default 20) – Number of neighbors.
- p (int, default 2) – Power parameter for the Minkowski metric.
- metric_params (dict, default None) – Additional parameters passed to the requested metric.
Attributes: - anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
- n_neighbors_ (int) – Actual number of neighbors used for kneighbors queries.
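The contamination parameter ties the threshold_ attribute to the training scores: the threshold is chosen so that roughly a contamination fraction of the training data is flagged as outliers. The actual rule lives in BaseOutlierDetector; the quantile computation below is an illustrative assumption, not kenchi's exact code.

```python
import numpy as np

# Hypothetical sketch: pick the threshold as the (1 - contamination)
# quantile of the training anomaly scores, so that about a
# `contamination` fraction of samples falls above it.
scores = np.array([0.5, 0.7, 0.4, 0.6, 9.0, 0.5, 0.8, 0.3, 0.6, 0.5])
contamination = 0.1

threshold = np.percentile(scores, 100.0 * (1.0 - contamination))
# Samples scoring above the threshold are labeled -1 (outlier), else 1.
labels = np.where(scores > threshold, -1, 1)
```

With these toy scores, only the sample scoring 9.0 exceeds the threshold and is labeled -1.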
References
[1] Angiulli, F., and Pizzuti, C., "Fast outlier detection in high dimensional spaces," In Proceedings of PKDD, pp. 15-27, 2002.
[2] Ramaswamy, S., Rastogi, R., and Shim, K., "Efficient algorithms for mining outliers from large data sets," In Proceedings of SIGMOD, pp. 427-438, 2000.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import KNN
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = KNN(n_neighbors=3)
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
- X_ (array-like of shape (n_samples, n_features)) – Training data.
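The score this detector thresholds can be sketched in plain NumPy: the distance to the k-th nearest neighbor [2], or, with aggregate=True, the sum of the distances to the k nearest neighbors [1]. The helper knn_anomaly_score below is a hypothetical brute-force illustration, not kenchi's implementation (which uses a tree structure via the algorithm and leaf_size parameters).

```python
import numpy as np

def knn_anomaly_score(X, n_neighbors=20, aggregate=False):
    """Brute-force k-NN anomaly score: distance to the k-th nearest
    neighbor, or the sum of the k nearest distances if aggregate=True."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distance matrix (O(n^2) memory).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # A point is not its own neighbor: mask the diagonal before sorting.
    np.fill_diagonal(dist, np.inf)
    nearest = np.sort(dist, axis=1)[:, :n_neighbors]
    return nearest.sum(axis=1) if aggregate else nearest[:, -1]

# Same toy data as the doctest above: the last point is far from the rest.
X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])
scores = knn_anomaly_score(X, n_neighbors=3)  # index 9 scores highest
```

The isolated point at (1000, 1) gets a score two orders of magnitude larger than the in-line points, which is why fit_predict labels it -1.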
class kenchi.outlier_detection.distance_based.OneTimeSampling(contamination=0.1, metric='euclidean', novelty=False, n_subsamples=20, random_state=None, metric_params=None)[source]
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Outlier detector using one-time sampling.
Parameters: - contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- metric (str, default 'euclidean') – Distance metric to use.
- novelty (bool, default False) – If True, predict, decision_function, and anomaly_score can be applied to new, unseen data rather than to the training data.
- n_subsamples (int, default 20) – Number of random samples to be used.
- random_state (int, RandomState instance, default None) – Seed of the pseudo-random number generator.
- metric_params (dict, default None) – Additional parameters passed to the requested metric.
Attributes: - anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
- subsamples_ (array-like of shape (n_subsamples,)) – Indices of subsamples.
- S_ (array-like of shape (n_subsamples, n_features)) – Subset of the given training data.
References
[3] Sugiyama, M., and Borgwardt, K., "Rapid distance-based outlier detection via sampling," Advances in NIPS, pp. 467-475, 2013.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import OneTimeSampling
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = OneTimeSampling(n_subsamples=3, random_state=0)
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
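The idea behind one-time sampling [3] is that a single small random subsample S is enough to separate outliers: each point is scored by its distance to the nearest member of S, so inliers (which almost always have a sampled neighbor nearby) score low. The helper below is a hypothetical sketch of that scoring rule, not kenchi's implementation; in particular, how points inside S itself are scored is a detail the paper and the library may handle differently.

```python
import numpy as np

def one_time_sampling_score(X, n_subsamples=20, random_state=None):
    """Sketch of the sampling-based score of [3]: draw one random
    subsample S, then score each point by its distance to the nearest
    member of S (a point belonging to S scores 0 here)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(random_state)
    # One-time sampling: S is drawn once, without replacement.
    idx = rng.choice(len(X), size=n_subsamples, replace=False)
    S = X[idx]
    # Euclidean distance from every point to every subsampled point.
    dist = np.sqrt(((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1))
    return dist.min(axis=1)

X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])
scores = one_time_sampling_score(X, n_subsamples=3, random_state=0)
```

Scoring against only n_subsamples points makes this linear in the data size per query, which is the "rapid" part of the method; the trade-off is that the scores depend on which points the random_state happens to sample.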