climb.tool.impl.data_suite.utils package¶
Submodules¶
climb.tool.impl.data_suite.utils.data_utils module¶
- climb.tool.impl.data_suite.utils.data_utils.covariance_comparison(clean_array, noisy_array)[source]¶
> This function takes two arrays, one clean and one noisy, and returns a list of the indices of the features whose covariance in the noisy data is greater than in the clean data
- Parameters:
clean_array – The clean data array
noisy_array – The array of noisy data
- Returns:
a list of the indices of the features that have a covariance that is greater than 0.
- climb.tool.impl.data_suite.utils.data_utils.get_suspect_features(clean_corpus, test_dataset, alpha=0.05)[source]¶
> This function takes in a clean corpus and a test dataset, and returns a list of feature indices that are statistically different between the two
- Parameters:
clean_corpus – the clean corpus
test_dataset – the dataset you want to test for contamination
alpha – the significance level for the KS test.
- Returns:
A list of the indices of the suspicious features.
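The per-feature comparison described above can be sketched with a two-sample Kolmogorov-Smirnov test. This is a minimal, standard-library-only illustration, not the package's implementation: the names `ks_statistic`, `ks_pvalue` and `suspect_features_sketch` are hypothetical, the data are assumed to be lists of rows, and the p-value uses a common asymptotic approximation.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x so ties are handled together.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the p-value is effectively 1 for such small statistics
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def suspect_features_sketch(clean_rows, test_rows, alpha=0.05):
    """Return indices of features whose distributions differ at level alpha."""
    suspects = []
    for f in range(len(clean_rows[0])):
        clean_col = [row[f] for row in clean_rows]
        test_col = [row[f] for row in test_rows]
        d = ks_statistic(clean_col, test_col)
        if ks_pvalue(d, len(clean_col), len(test_col)) < alpha:
            suspects.append(f)
    return suspects


# Assumed toy data: feature 0 is identical, feature 1 is shifted in the test set.
clean = [[i / 100, i / 100] for i in range(100)]
test = [[i / 100, i / 100 + 5] for i in range(100)]
flagged = suspect_features_sketch(clean, test)
```

A smaller `alpha` makes the test stricter, so fewer features are flagged.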
- climb.tool.impl.data_suite.utils.data_utils.read_from_file(filename)[source]¶
> This function loads an object from a pickle file
- Parameters:
filename – the name of the file to read from
- Returns:
the object loaded from the pickle file.
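As a rough illustration of what loading from a pickle involves, the sketch below writes an object to a temporary file and reads it back with the standard `pickle` module. The helper name `read_pickle_sketch` and the example payload are hypothetical and are not taken from the package.

```python
import pickle
import tempfile
from pathlib import Path


def read_pickle_sketch(filename):
    """Load and return the object stored in a pickle file."""
    with open(filename, "rb") as handle:  # pickle files are opened in binary mode
        return pickle.load(handle)


# Round trip: write an object, then read it back.
path = Path(tempfile.mkdtemp()) / "example.pkl"
with open(path, "wb") as handle:
    pickle.dump({"features": [0, 3, 7]}, handle)

loaded = read_pickle_sketch(path)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.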
- climb.tool.impl.data_suite.utils.data_utils.return_diagonal(array, just_diag=True)[source]¶
> This function takes a 2D array and returns the diagonal of that array
- Parameters:
array – the array you want to return the diagonal of
just_diag – If True, the function returns the diagonal values of the array. If False, it returns the entire array. Defaults to True
- Returns:
The diagonal values of the array.
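The behaviour of the `just_diag` flag can be pictured with nested lists. This is only a sketch under the assumption that the input is a list of rows; `return_diagonal_sketch` is a hypothetical name, and the real function presumably works on array objects.

```python
def return_diagonal_sketch(array, just_diag=True):
    """Return the main diagonal of a 2D list, or the list itself if just_diag is False."""
    if not just_diag:
        return array
    size = min(len(array), len(array[0]))  # also copes with non-square input
    return [array[i][i] for i in range(size)]


# Assumed example: a 2x2 covariance-like matrix.
cov = [[1.0, 0.2], [0.2, 4.0]]
diagonal = return_diagonal_sketch(cov)
full = return_diagonal_sketch(cov, just_diag=False)
```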
- climb.tool.impl.data_suite.utils.data_utils.scaler(fg, bg, center=True)[source]¶
> This function takes two arrays, one of foreground data and one of background data, and returns two arrays, one of foreground data and one of background data, where the foreground data is scaled to the background data
- Parameters:
fg – foreground data
bg – background data
center – If True, the data will be centered before scaling. Defaults to True
- Returns:
The scaled foreground and background arrays.
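One plausible reading of "scaled to the background data" is that both arrays are standardised with the per-feature mean and standard deviation of the background. The sketch below assumes exactly that; the name `scale_to_background` and the list-of-rows input are hypothetical, and the package may scale differently.

```python
import statistics


def scale_to_background(fg, bg, center=True):
    """Scale fg and bg feature by feature using statistics computed on bg only."""
    columns = list(zip(*bg))  # one tuple of values per feature
    means = [statistics.fmean(col) if center else 0.0 for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero

    def transform(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

    return transform(fg), transform(bg)


# Assumed toy data with two features.
bg = [[0.0, 10.0], [2.0, 30.0]]
fg = [[3.0, 20.0]]
fg_scaled, bg_scaled = scale_to_background(fg, bg)
fg_uncentered, _ = scale_to_background(fg, bg, center=False)
```

Computing the statistics on the background only keeps the two arrays on the same scale, so shifts in the foreground remain visible after scaling.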
climb.tool.impl.data_suite.utils.graphics module¶
climb.tool.impl.data_suite.utils.helpers module¶
- climb.tool.impl.data_suite.utils.helpers.inlier_outlier_dicts(conformal_dict, suspect_features)[source]¶
For each feature, we create a dataframe that contains the true value, the lower bound, the upper bound, and the confidence interval. We then create a column called “outlier” that is True if the true value is not within the confidence interval. We use the CIs to build two dictionaries, one for the inliers and one for the outliers.
- Parameters:
conformal_dict – a dictionary of dataframes, where each dataframe contains the conformal prediction intervals for a given feature
suspect_features – a list of features that you want to check for outliers
- Returns:
A dictionary of inliers and a dictionary of outliers.
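The inlier/outlier split described above can be sketched without dataframes. Here each feature maps to a list of records, and the keys `true`, `lower` and `upper` are assumed names chosen for illustration; `split_inliers_outliers` is likewise hypothetical.

```python
def split_inliers_outliers(conformal_dict, suspect_features):
    """For each suspect feature, collect row indices inside and outside the interval."""
    inliers, outliers = {}, {}
    for feature in suspect_features:
        inliers[feature], outliers[feature] = [], []
        for idx, row in enumerate(conformal_dict[feature]):
            inside = row["lower"] <= row["true"] <= row["upper"]
            (inliers if inside else outliers)[feature].append(idx)
    return inliers, outliers


# Assumed toy intervals for two features; only "age" is treated as suspect.
conformal = {
    "age": [
        {"true": 30.0, "lower": 25.0, "upper": 35.0},
        {"true": 90.0, "lower": 20.0, "upper": 40.0},
    ],
    "bmi": [
        {"true": 22.0, "lower": 18.0, "upper": 25.0},
    ],
}
inlier_ids, outlier_ids = split_inliers_outliers(conformal, ["age"])
```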
- climb.tool.impl.data_suite.utils.helpers.sort_ci_vals(conformal_dict, inliers_dict, suspect_features, proportion=0.1)[source]¶
> This function takes in a dictionary of conformal inference results, a dictionary of inlier results, a list of suspect features, and a proportion of the data to be used for the analysis.
It then returns the indices of the data points with the smallest and largest confidence intervals, and a dataframe with the sorted confidence intervals.
- Parameters:
conformal_dict – a dictionary of dataframes, where each dataframe is the conformal intervals for a feature
inliers_dict – a dictionary of inlier ids for each feature
suspect_features – a list of features that are suspected to be problematic
proportion – the proportion of the data to use as certain and uncertain
- Returns:
the indices of the samples with the smallest and largest confidence intervals.
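The core idea, ranking samples by interval width and keeping the narrowest and widest fraction, can be sketched as follows. This simplified version handles a single set of intervals rather than one per suspect feature, and `split_by_interval_width` is a hypothetical name.

```python
def split_by_interval_width(intervals, proportion=0.1):
    """Return (ids with the smallest intervals, ids with the largest intervals).

    intervals maps a sample id to a (lower, upper) pair.
    """
    ordered = sorted(intervals, key=lambda sid: intervals[sid][1] - intervals[sid][0])
    k = max(1, int(len(ordered) * proportion))  # keep at least one sample per group
    return ordered[:k], ordered[-k:]


# Assumed toy data: sample i has an interval of width i + 1.
toy_intervals = {i: (0.0, float(i + 1)) for i in range(10)}
most_certain, least_certain = split_by_interval_width(toy_intervals, proportion=0.2)
```

Narrow intervals are read here as "certain" samples and wide intervals as "uncertain" ones, matching the parameter description above.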
- climb.tool.impl.data_suite.utils.helpers.sort_cis_all(conformal_dict, inliers_dict, suspect_features)[source]¶
- climb.tool.impl.data_suite.utils.helpers.sort_cis_synth(conformal_dict, inliers_dict, suspect_features, proportion=0.1)[source]¶
> This function takes a dictionary of conformal intervals, a dictionary of inlier ids, and a list of suspect features. It then creates a dataframe of the conformal intervals for the first suspect feature, and then adds the conformal intervals for the other suspect features to the dataframe. It then sorts the dataframe by the norm_interval column, and returns the ids of the top and bottom proportion of the dataframe
- Parameters:
conformal_dict – a dictionary of dataframes, where each dataframe is the conformal intervals for a feature
inliers_dict – a dictionary of inlier ids for each feature
suspect_features – a list of features that are suspected to be problematic
proportion – the proportion of the data to use as certain and uncertain
- Returns:
the indices of the samples with the smallest and largest confidence intervals.
climb.tool.impl.data_suite.utils.uncertainty_metrics module¶
- climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_deficet(true, lb, ub)[source]¶
> This function computes the average deficit and the proportion of cases in which the true value falls outside the confidence interval
- Parameters:
true – the true values of the parameters
lb – lower bound
ub – upper bound
- Returns:
The mean and the proportion of the deficit
- climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_excess(true, lb, ub)[source]¶
> This function computes the average excess of the true values over the lower and upper bounds
- Parameters:
true – the true values of the data
lb – lower bound
ub – upper bound
- Returns:
The mean and the proportion of excess
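The documentation does not spell out how excess and deficit are calculated. The sketch below assumes one common definition: deficit is the distance to the nearest bound for true values outside the interval, and excess is the distance to the nearest bound for true values inside it. The names `interval_excess` and `interval_deficit` are hypothetical.

```python
def interval_excess(true, lb, ub):
    """Mean margin to the nearest bound for points inside the interval, plus their share."""
    margins = [min(t - lo, hi - t) for t, lo, hi in zip(true, lb, ub) if lo <= t <= hi]
    mean = sum(margins) / len(margins) if margins else 0.0
    return mean, len(margins) / len(true)


def interval_deficit(true, lb, ub):
    """Mean gap to the nearest bound for points outside the interval, plus their share."""
    gaps = [
        min(abs(t - lo), abs(t - hi))
        for t, lo, hi in zip(true, lb, ub)
        if not lo <= t <= hi
    ]
    mean = sum(gaps) / len(gaps) if gaps else 0.0
    return mean, len(gaps) / len(true)


# Assumed toy data: two values inside [0, 10] and one outside.
true_vals = [1.0, 5.0, 12.0]
lower = [0.0, 0.0, 0.0]
upper = [10.0, 10.0, 10.0]
excess_mean, excess_share = interval_excess(true_vals, lower, upper)
deficit_mean, deficit_share = interval_deficit(true_vals, lower, upper)
```

Under this definition a large excess suggests intervals that are wider than necessary, while a large deficit suggests intervals that miss the true values by a wide margin.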
- climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_uncertainty_metrics(preds, lower_bound, upper_bound, true)[source]¶
It computes the uncertainty metrics for a given set of predictions, lower bounds, upper bounds, and true values
- Parameters:
preds – the predicted values
lower_bound – the lower bound of the prediction interval
upper_bound – The upper bound of the prediction interval.
true – the true values
- climb.tool.impl.data_suite.utils.uncertainty_metrics.perf_measure(y_actual, y_pred)[source]¶
> This function takes two lists of the same length, and returns a tuple of four numbers: TN, FP, FN, TP
- Parameters:
y_actual – the actual values of the target variable
y_pred – The predicted values
- Returns:
True Negative, False Positive, False Negative, True Positive
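The four counts can be computed in a single pass over the paired labels. This sketch assumes binary labels encoded as 0 and 1, which the documentation does not state; `confusion_counts` is a hypothetical name.

```python
def confusion_counts(y_actual, y_pred):
    """Return (TN, FP, FN, TP) for two equal-length sequences of 0/1 labels."""
    if len(y_actual) != len(y_pred):
        raise ValueError("y_actual and y_pred must have the same length")
    tn = fp = fn = tp = 0
    for actual, pred in zip(y_actual, y_pred):
        if actual == 1 and pred == 1:
            tp += 1
        elif actual == 0 and pred == 1:
            fp += 1
        elif actual == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


counts = confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1])
```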
- climb.tool.impl.data_suite.utils.uncertainty_metrics.process_results(wandb_dict, results, roc, uncert_metrics, excess, deficet, excess_all, deficet_all, name)[source]¶
> This function processes the results and stores them in a dict to log to wandb
- Parameters:
wandb_dict – a dictionary that will be used to log the results to wandb
results – a dictionary of dictionaries, where the keys are the names of the models and the values are dictionaries of the results of the model.
roc – the ROC AUC score
uncert_metrics – a dictionary of metrics that are calculated for the uncertainty
excess – excess of the interval for a specific model
deficet – deficit of the interval for a specific model
excess_all – proportion of excess
deficet_all – proportion of deficit
name – The name of the model.
- climb.tool.impl.data_suite.utils.uncertainty_metrics.test_ood(y_test_ids, idx_ordered)[source]¶
> This function takes the ordered list of indices and the true labels, then iterates through the indices, assigning the first x% of the indices to the “certain” class and the remainder to the “uncertain” class.
It then calculates performance metrics.
- Parameters:
y_test_ids – the true labels of the test set
idx_ordered – the indices of the test set, ordered by the distance to the nearest neighbor.
- Returns:
dictionary of metrics, ROC score