class kenchi.outlier_detection.statistical.GMM(contamination=0.1, covariance_type='full', init_params='kmeans', max_iter=100, means_init=None, n_components=1, n_init=1, precisions_init=None, random_state=None, reg_covar=1e-06, tol=0.001, warm_start=False, weights_init=None)
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Outlier detector using Gaussian Mixture Models (GMMs).
Parameters:
- contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- covariance_type (str, default 'full') – String describing the type of covariance parameters to use. Valid options are ['full'|'tied'|'diag'|'spherical'].
- init_params (str, default 'kmeans') – Method used to initialize the weights, the means and the precisions. Valid options are ['kmeans'|'random'].
- max_iter (int, default 100) – Maximum number of iterations.
- means_init (array-like of shape (n_components, n_features), default None) – User-provided initial means.
- n_init (int, default 1) – Number of initializations to perform.
- n_components (int, default 1) – Number of mixture components.
- precisions_init (array-like, default None) – User-provided initial precisions.
- random_state (int or RandomState instance, default None) – Seed of the pseudo random number generator.
- reg_covar (float, default 1e-06) – Non-negative regularization added to the diagonal of covariance.
- tol (float, default 1e-03) – Tolerance to declare convergence.
- warm_start (bool, default False) – If True, the solution of the last fitting is used as initialization for the next call of fit.
- weights_init (array-like of shape (n_components,), default None) – User-provided initial weights.
Attributes:
- anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import GMM
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = GMM(random_state=0)
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
Attributes:
- converged_ (bool) – True when convergence was reached in fit, False otherwise.
- covariances_ (array-like) – Covariance of each mixture component.
- lower_bound_ (float) – Log-likelihood of the best fit of EM.
- means_ (array-like of shape (n_components, n_features)) – Mean of each mixture component.
- n_iter_ (int) – Number of steps used by the best fit of EM to reach convergence.
- precisions_ (array-like) – Precision matrix for each component in the mixture.
- precisions_cholesky_ (array-like) – Cholesky decomposition of the precision matrices of each mixture component.
- weights_ (array-like of shape (n_components,)) – Weight of each mixture component.
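How contamination might translate into threshold_ can be sketched with NumPy alone. The snippet below is an illustration, not kenchi's implementation: it assumes a single Gaussian component (the n_components=1 default), scores each sample by its negative log-likelihood, and assumes the threshold is the (1 - contamination) quantile of the training scores. The helper names gaussian_anomaly_score and threshold_from_contamination are hypothetical.

```python
import numpy as np


def gaussian_anomaly_score(X):
    """Negative log-likelihood of each row under one Gaussian fitted to X."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mean
    # Squared Mahalanobis distance of every sample to the fitted mean.
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    n_features = X.shape[1]
    return 0.5 * (maha + logdet + n_features * np.log(2.0 * np.pi))


def threshold_from_contamination(scores, contamination=0.1):
    """Assumed rule: the (1 - contamination) quantile of the scores."""
    return np.percentile(scores, 100.0 * (1.0 - contamination))


X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])
scores = gaussian_anomaly_score(X)
threshold = threshold_from_contamination(scores, contamination=0.1)
# Same label convention as the example: 1 for inliers, -1 for outliers.
labels = np.where(scores > threshold, -1, 1)
```

With ten samples and contamination=0.1, only the sample with the largest score exceeds the threshold, which here is the point at [1000., 1.].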
class kenchi.outlier_detection.statistical.HBOS(bins='auto', contamination=0.1, novelty=False)
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Histogram-based outlier detector.
Parameters:
- bins (int or str, default 'auto') – Number of histogram bins.
- contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- novelty (bool, default False) – If True, you can use predict, decision_function and anomaly_score on new unseen data and not on the training data.
Attributes:
- anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
- bin_edges_ (array-like) – Bin edges.
- data_max_ (array-like of shape (n_features,)) – Per feature maximum seen in the data.
- data_min_ (array-like of shape (n_features,)) – Per feature minimum seen in the data.
- hist_ (array-like) – Values of the histogram.
References
[1] Goldstein, M., and Dengel, A., “Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm,” KI: Poster and Demo Track, pp. 59-63, 2012.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import HBOS
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = HBOS()
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
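The idea behind a histogram-based score can be sketched as follows. This is a minimal illustration, not kenchi's code: it assumes one histogram per feature built with np.histogram and a score equal to the sum over features of the negative log of the bin density, in the spirit of [1]. The function name hbos_scores is hypothetical.

```python
import numpy as np


def hbos_scores(X, bins="auto"):
    """Sum over features of -log(histogram density at each sample)."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        # Locate the bin of each value; the last bin includes its right edge.
        idx = np.searchsorted(edges, X[:, j], side="right") - 1
        idx = np.clip(idx, 0, len(hist) - 1)
        # Training samples always fall in non-empty bins, so the log is finite.
        # Unseen data could hit empty bins and would need extra handling.
        scores -= np.log(hist[idx])
    return scores


X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])
scores = hbos_scores(X)
most_anomalous = int(np.argmax(scores))
```

Because each feature is scored independently, the cost grows linearly with the number of features, which is the main appeal of this kind of detector.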
class kenchi.outlier_detection.statistical.KDE(algorithm='auto', atol=0.0, bandwidth=1.0, breadth_first=True, contamination=0.1, kernel='gaussian', leaf_size=40, metric='euclidean', rtol=0.0, metric_params=None)
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Outlier detector using Kernel Density Estimation (KDE).
Parameters:
- algorithm (str, default 'auto') – Tree algorithm to use. Valid algorithms are ['kd_tree'|'ball_tree'|'auto'].
- atol (float, default 0.0) – Desired absolute tolerance of the result.
- bandwidth (float, default 1.0) – Bandwidth of the kernel.
- breadth_first (bool, default True) – If True, use a breadth-first approach to the problem. Otherwise use a depth-first approach.
- contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- kernel (str, default 'gaussian') – Kernel to use. Valid kernels are ['gaussian'|'tophat'|'epanechnikov'|'exponential'|'linear'|'cosine'].
- leaf_size (int, default 40) – Leaf size of the underlying tree.
- metric (str, default 'euclidean') – Distance metric to use.
- rtol (float, default 0.0) – Desired relative tolerance of the result.
- metric_params (dict, default None) – Additional parameters to be passed to the requested metric.
Attributes:
- anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
References
[2] Parzen, E., “On estimation of a probability density function and mode,” Ann. Math. Statist., 33(3), pp. 1065-1076, 1962.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import KDE
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = KDE()
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
Attributes:
- X_ (array-like of shape (n_samples, n_features)) – Training data.
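A kernel density score can be illustrated with a brute-force Gaussian kernel in NumPy. This is a sketch under stated assumptions, not kenchi's implementation: it assumes the anomaly score is the negative log-density, uses the 'gaussian' kernel with the Euclidean metric, and computes all pairwise distances directly instead of using a tree. The function name gaussian_kde_scores is hypothetical.

```python
import numpy as np


def gaussian_kde_scores(X, bandwidth=1.0):
    """Negative log-density of each training row under a Gaussian KDE."""
    n_samples, n_features = X.shape
    # Pairwise squared Euclidean distances, shape (n_samples, n_samples).
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    # Normalising constant of an isotropic Gaussian, times the sample count.
    norm = n_samples * (2.0 * np.pi * bandwidth ** 2) ** (n_features / 2.0)
    # Each row includes its own kernel (value 1), so the density is positive.
    density = kernel.sum(axis=1) / norm
    return -np.log(density)


X = np.array([
    [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
    [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
])
scores = gaussian_kde_scores(X, bandwidth=1.0)
most_anomalous = int(np.argmax(scores))
```

The isolated point only receives density from its own kernel, so it gets the largest score. A smaller bandwidth makes the estimate more local; a larger one smooths it.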
class kenchi.outlier_detection.statistical.SparseStructureLearning(alpha=0.01, assume_centered=False, contamination=0.1, enet_tol=0.0001, max_iter=100, mode='cd', tol=0.0001, apcluster_params=None)
Bases: kenchi.outlier_detection.base.BaseOutlierDetector
Outlier detector using sparse structure learning.
Parameters:
- alpha (float, default 0.01) – Regularization parameter.
- assume_centered (bool, default False) – If True, data are not centered before computation.
- contamination (float, default 0.1) – Proportion of outliers in the data set. Used to define the threshold.
- enet_tol (float, default 1e-04) – Tolerance for the elastic net solver used to calculate the descent direction. This parameter controls the accuracy of the search direction for a given column update, not of the overall parameter estimate. Only used for mode='cd'.
- max_iter (int, default 100) – Maximum number of iterations.
- mode (str, default 'cd') – Lasso solver to use: coordinate descent ('cd') or LARS ('lars').
- tol (float, default 1e-04) – Tolerance to declare convergence.
- apcluster_params (dict, default None) – Additional parameters passed to sklearn.cluster.affinity_propagation.
Attributes:
- anomaly_score_ (array-like of shape (n_samples,)) – Anomaly score for each training sample.
- contamination_ (float) – Actual proportion of outliers in the data set.
- threshold_ (float) – Threshold.
- labels_ (array-like of shape (n_features,)) – Label of each feature.
References
[3] Ide, T., Lozano, C., Abe, N., and Liu, Y., “Proximity-based anomaly detection using sparse structure learning,” In Proceedings of SDM, pp. 97-108, 2009.
Examples
>>> import numpy as np
>>> from kenchi.outlier_detection import SparseStructureLearning
>>> X = np.array([
...     [0., 0.], [1., 1.], [2., 0.], [3., -1.], [4., 0.],
...     [5., 1.], [6., 0.], [7., -1.], [8., 0.], [1000., 1.]
... ])
>>> det = SparseStructureLearning()
>>> det.fit_predict(X)
array([ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1])
Attributes:
- covariance_ (array-like of shape (n_features, n_features)) – Estimated covariance matrix.
featurewise_anomaly_score(X)
Compute the feature-wise anomaly scores for each sample.
Parameters: X (array-like of shape (n_samples, n_features)) – Data.
Returns: anomaly_score – Feature-wise anomaly scores for each sample.
Return type: array-like of shape (n_samples, n_features)
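One way such feature-wise scores can be defined for a Gaussian graphical model is the negative log-likelihood of each feature conditioned on all the others. The sketch below assumes that definition; whether kenchi computes exactly this quantity is an assumption, and the function name featurewise_scores and its location and precision arguments are hypothetical stand-ins for a fitted mean vector and precision matrix.

```python
import numpy as np


def featurewise_scores(X, location, precision):
    """Per-feature negative conditional log-likelihood.

    Returns an array of shape (n_samples, n_features).
    """
    diag = np.diag(precision)
    centered = X - location
    # (centered @ precision)[n, i] = sum_j precision[i, j] * centered[n, j],
    # which holds because the precision matrix is symmetric.
    proj = centered @ precision
    return 0.5 * np.log(2.0 * np.pi / diag) + 0.5 * proj ** 2 / diag


X_new = np.array([[1.0, 2.0]])
scores = featurewise_scores(X_new, location=np.zeros(2), precision=np.eye(2))
```

With an identity precision matrix the features are independent, so each score reduces to the negative log-density of a standard normal at that value.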
Attributes:
- graphical_model_ (networkx Graph) – Gaussian Graphical Model (GGM).
- isolates_ (array-like of shape (n_isolates,)) – Indices of isolates.
- location_ (array-like of shape (n_features,)) – Estimated location.
- n_iter_ (int) – Number of iterations run.
- partial_corrcoef_ (array-like of shape (n_features, n_features)) – Partial correlation coefficient matrix.
plot_graphical_model(**kwargs)
Plot the Gaussian Graphical Model (GGM).
Parameters:
- ax (matplotlib Axes, default None) – Target axes instance.
- figsize (tuple, default None) – Tuple denoting figure size of the plot.
- filename (str, default None) – If provided, save the current figure.
- random_state (int or RandomState instance, default None) – Seed of the pseudo random number generator.
- title (str, default 'GGM (n_clusters, n_features, n_isolates)') – Axes title. To disable, pass None.
- **kwargs (dict) – Other keywords passed to nx.draw_networkx.
Returns: ax – Axes on which the plot was drawn.
Return type: matplotlib Axes
plot_partial_corrcoef(**kwargs)
Plot the partial correlation coefficient matrix.
Parameters:
- ax (matplotlib Axes, default None) – Target axes instance.
- cbar (bool, default True) – If True, draw a colorbar.
- figsize (tuple, default None) – Tuple denoting figure size of the plot.
- filename (str, default None) – If provided, save the current figure.
- title (str, default 'Partial correlation') – Axes title. To disable, pass None.
- **kwargs (dict) – Other keywords passed to ax.pcolormesh.
Returns: ax – Axes on which the plot was drawn.
Return type: matplotlib Axes
Attributes:
- precision_ (array-like of shape (n_features, n_features)) – Estimated pseudo-inverse of the covariance matrix.
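The relationship between a precision matrix and partial correlation coefficients can be written in a few lines. The sketch below uses the standard identity rho_ij = -P_ij / sqrt(P_ii * P_jj) with a unit diagonal; it is an illustration and not necessarily how partial_corrcoef_ is computed internally. The function name partial_corrcoef_from_precision is hypothetical.

```python
import numpy as np


def partial_corrcoef_from_precision(precision):
    """Partial correlations: -P[i, j] / sqrt(P[i, i] * P[j, j]), unit diagonal."""
    d = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(d, d)
    # By convention each feature has partial correlation 1 with itself.
    np.fill_diagonal(partial, 1.0)
    return partial


precision = np.array([[2.0, -1.0], [-1.0, 2.0]])
partial = partial_corrcoef_from_precision(precision)
```

A zero entry in the precision matrix gives a zero partial correlation, which is why a sparse precision estimate corresponds to a sparse graph between features.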