climb.tool.impl.data_suite.utils package

Submodules

climb.tool.impl.data_suite.utils.data_utils module

climb.tool.impl.data_suite.utils.data_utils.covariance_comparison(clean_array, noisy_array)[source]

> This function takes in two arrays, one clean and one noisy, and returns a list of the indices of the features that have a covariance that is greater than the covariance of the clean data

Parameters:
  • clean_array – The clean data array

  • noisy_array – The array of noisy data

Returns:

a list of the indices of the features that have a covariance that is greater than 0.

climb.tool.impl.data_suite.utils.data_utils.get_suspect_features(clean_corpus, test_dataset, alpha=0.05)[source]

> This function takes in a clean corpus and a test dataset, and returns a list of feature indices that are statistically different between the two

Parameters:
  • clean_corpus – the clean corpus

  • test_dataset – the dataset you want to test for contamination

  • alpha – the significance level for the KS test.

Returns:

A list of the indices of the suspect features.
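The per-feature test described above can be illustrated with a minimal sketch. It is not the package's implementation: the source states only that a KS test with significance level alpha is used. The helper names (`ks_2samp_asymptotic`, `suspect_features_sketch`), the row-major list-of-lists input, and the asymptotic p-value approximation are all assumptions made for illustration.

```python
import math


def ks_2samp_asymptotic(x, y):
    """Two-sample KS statistic D and an asymptotic p-value (hypothetical helper)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the two ECDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges poorly here; the p-value is effectively 1.
        return d, 1.0
    # Kolmogorov distribution tail, truncated after 100 terms.
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def suspect_features_sketch(clean_corpus, test_dataset, alpha=0.05):
    """Return indices of columns whose distributions differ at level alpha.

    Assumes both inputs are lists of rows with the same number of columns.
    """
    suspects = []
    for f in range(len(clean_corpus[0])):
        clean_col = [row[f] for row in clean_corpus]
        test_col = [row[f] for row in test_dataset]
        _, p_value = ks_2samp_asymptotic(clean_col, test_col)
        if p_value < alpha:
            suspects.append(f)
    return suspects
```

In this sketch a feature is flagged when its p-value falls below alpha, so a smaller alpha flags fewer features.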

climb.tool.impl.data_suite.utils.data_utils.read_from_file(filename)[source]

> This function loads a file from a pickle

Parameters:

filename – the name of the file to read from

Returns:

The object loaded from the pickle file.

climb.tool.impl.data_suite.utils.data_utils.return_diagonal(array, just_diag=True)[source]

> This function takes a 2D array and returns the diagonal of that array

Parameters:
  • array – the array you want to return the diagonal of

  • just_diag – If True, the function will return the diagonal values of the array. If False, it will return the entire array. Defaults to True

Returns:

The diagonal values of the array.

climb.tool.impl.data_suite.utils.data_utils.scaler(fg, bg, center=True)[source]

> This function takes an array of foreground data and an array of background data, and returns both arrays with the foreground data scaled to the background data

Parameters:
  • fg – foreground data

  • bg – background data

  • center – If True, the data will be centered before scaling. Defaults to True

Returns:

The transformed data.

climb.tool.impl.data_suite.utils.data_utils.write_to_file(contents, filename)[source]

> This function takes in a variable and a filename, and writes the variable to the filename as a pickle file.

Parameters:
  • contents – the data to be written to the file

  • filename – the name of the file to write to
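A pickle round trip of the kind that write_to_file and read_from_file describe can be sketched with the standard library alone. The function names below (`write_pickle`, `read_pickle`, `roundtrip`) are hypothetical stand-ins and are not the package's own code.

```python
import os
import pickle
import tempfile


def write_pickle(contents, filename):
    """Serialize any picklable object to the given path."""
    with open(filename, "wb") as fh:
        pickle.dump(contents, fh)


def read_pickle(filename):
    """Load and return the object stored at the given path."""
    with open(filename, "rb") as fh:
        return pickle.load(fh)


def roundtrip(contents):
    """Write contents to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "contents.pkl")
        write_pickle(contents, path)
        return read_pickle(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.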

climb.tool.impl.data_suite.utils.graphics module

climb.tool.impl.data_suite.utils.graphics.get_metrics_graphs(df)[source]
climb.tool.impl.data_suite.utils.graphics.get_mse_table(df, column=0)[source]
climb.tool.impl.data_suite.utils.graphics.plot_graph(mean_prop, std_prop, mean_var, std_var, mean_dist, std_dist, metric, ylabel)[source]

climb.tool.impl.data_suite.utils.helpers module

climb.tool.impl.data_suite.utils.helpers.inlier_outlier_dicts(conformal_dict, suspect_features)[source]

For each feature, we create a dataframe that contains the true value, the lower bound, the upper bound, and the confidence interval. We then create a column called “outlier” that is True if the true value is not within the confidence interval. We use the CIs to build two dictionaries, one for the inliers and one for the outliers.

Parameters:
  • conformal_dict – a dictionary of dataframes, where each dataframe contains the conformal prediction intervals for a given feature

  • suspect_features – a list of features that you want to check for outliers

Returns:

A dictionary of inliers and a dictionary of outliers.
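The inlier/outlier rule can be illustrated without dataframes. In the sketch below, each entry of conformal_dict is assumed to be a list of (true value, lower bound, upper bound) tuples, and the returned dictionaries map each feature to row positions. Both choices are assumptions for illustration, as is the name `split_inliers_outliers`; the real function works on dataframes whose column names are not given here.

```python
def split_inliers_outliers(conformal_dict, suspect_features):
    """Map each suspect feature to the row indices inside and outside its interval."""
    inliers, outliers = {}, {}
    for feature in suspect_features:
        rows = conformal_dict[feature]
        inside, outside = [], []
        for idx, (true_val, lower, upper) in enumerate(rows):
            # A row is an outlier when the true value lies outside [lower, upper].
            if lower <= true_val <= upper:
                inside.append(idx)
            else:
                outside.append(idx)
        inliers[feature] = inside
        outliers[feature] = outside
    return inliers, outliers
```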

climb.tool.impl.data_suite.utils.helpers.sort_ci_vals(conformal_dict, inliers_dict, suspect_features, proportion=0.1)[source]

> This function takes in a dictionary of conformal inference results, a dictionary of inlier results, a list of suspect features, and a proportion of the data to be used for the analysis.

It then returns the indices of the data points with the smallest and largest confidence intervals, and a dataframe with the sorted confidence intervals.

Parameters:
  • conformal_dict – a dictionary of dataframes, where each dataframe is the conformal intervals for a feature

  • inliers_dict – a dictionary of inlier ids for each feature

  • suspect_features – a list of features that are suspected to be problematic

  • proportion – the proportion of the data to use as certain and uncertain

Returns:

the indices of the samples with the smallest and largest confidence intervals.
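The idea of ranking samples by interval width and keeping the narrowest and widest fraction can be sketched as follows. The input shape (a dict mapping a sample id to a single (lower, upper) pair), the rounding of the cut-off, and the name `split_by_interval_width` are assumptions. How the package combines intervals across several suspect features is not shown here.

```python
def split_by_interval_width(intervals, proportion=0.1):
    """Return (narrowest ids, widest ids, all ids sorted by interval width)."""
    widths = {sid: upper - lower for sid, (lower, upper) in intervals.items()}
    ordered = sorted(widths, key=widths.get)
    # Keep at least one sample on each side, even for tiny inputs.
    k = max(1, int(len(ordered) * proportion))
    return ordered[:k], ordered[-k:], ordered
```

Narrow intervals are treated as the "certain" samples and wide intervals as the "uncertain" ones.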

climb.tool.impl.data_suite.utils.helpers.sort_cis_all(conformal_dict, inliers_dict, suspect_features)[source]
climb.tool.impl.data_suite.utils.helpers.sort_cis_synth(conformal_dict, inliers_dict, suspect_features, proportion=0.1)[source]

> This function takes a dictionary of conformal intervals, a dictionary of inlier ids, and a list of suspect features. It then creates a dataframe of the conformal intervals for the first suspect feature, and then adds the conformal intervals for the other suspect features to the dataframe. It then sorts the dataframe by the norm_interval column, and returns the ids of the top and bottom proportion of the dataframe

Parameters:
  • conformal_dict – a dictionary of dataframes, where each dataframe is the conformal intervals for a feature

  • inliers_dict – a dictionary of inlier ids for each feature

  • suspect_features – a list of features that are suspected to be problematic

  • proportion – the proportion of the data to use as certain and uncertain

Returns:

the indices of the samples with the smallest and largest confidence intervals.

climb.tool.impl.data_suite.utils.uncertainty_metrics module

climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_deficet(true, lb, ub)[source]

> This function computes the average deficit and the proportion of the time that the true value is outside the confidence interval

Parameters:
  • true – the true values of the parameters

  • lb – lower bound

  • ub – upper bound

Returns:

The mean deficit and the proportion of deficit

climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_excess(true, lb, ub)[source]

> This function computes the average excess of the true values over the lower and upper bounds

Parameters:
  • true – the true values of the data

  • lb – lower bound

  • ub – upper bound

Returns:

The mean and the proportion of excess
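One plausible reading of the two interval metrics above is sketched below. The exact definitions are assumptions, because this page states only that each function returns a mean and a proportion. Here deficit is taken to be the distance by which a true value misses the interval, and excess the distance from a covered true value to its nearest bound. The names `interval_deficit` and `interval_excess` are hypothetical.

```python
def interval_deficit(true, lb, ub):
    """Mean distance outside the interval, and the fraction of points outside it."""
    misses = []
    for t, lower, upper in zip(true, lb, ub):
        if t < lower:
            misses.append(lower - t)
        elif t > upper:
            misses.append(t - upper)
    mean = sum(misses) / len(misses) if misses else 0.0
    return mean, len(misses) / len(true)


def interval_excess(true, lb, ub):
    """Mean slack to the nearest bound for covered points, and the fraction covered."""
    slack = [
        min(t - lower, upper - t)
        for t, lower, upper in zip(true, lb, ub)
        if lower <= t <= upper
    ]
    mean = sum(slack) / len(slack) if slack else 0.0
    return mean, len(slack) / len(true)
```

Under these assumed definitions, a large deficit points to intervals that are too narrow and a large excess to intervals that are wider than needed.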

climb.tool.impl.data_suite.utils.uncertainty_metrics.compute_uncertainty_metrics(preds, lower_bound, upper_bound, true)[source]

It computes the uncertainty metrics for a given set of predictions, lower bounds, upper bounds, and true values

Parameters:
  • preds – the predicted values

  • lower_bound – the lower bound of the prediction interval

  • upper_bound – The upper bound of the prediction interval.

  • true – the true values

climb.tool.impl.data_suite.utils.uncertainty_metrics.perf_measure(y_actual, y_pred)[source]

> This function takes two lists of the same length, and returns a tuple of four numbers: TN, FP, FN, TP

Parameters:
  • y_actual – the actual values of the target variable

  • y_pred – The predicted values

Returns:

True Negative, False Positive, False Negative, True Positive
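The four counts can be computed in a single pass over the two lists. The sketch below assumes binary labels encoded as 0 and 1, which this page does not state, and uses the hypothetical name `confusion_counts`.

```python
def confusion_counts(y_actual, y_pred):
    """Return (TN, FP, FN, TP) for two equal-length lists of 0/1 labels."""
    tn = fp = fn = tp = 0
    for actual, pred in zip(y_actual, y_pred):
        if actual == 1 and pred == 1:
            tp += 1
        elif actual == 0 and pred == 1:
            fp += 1
        elif actual == 1 and pred == 0:
            fn += 1
        elif actual == 0 and pred == 0:
            tn += 1
    return tn, fp, fn, tp
```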

climb.tool.impl.data_suite.utils.uncertainty_metrics.process_results(wandb_dict, results, roc, uncert_metrics, excess, deficet, excess_all, deficet_all, name)[source]

> This function processes the results and stores them in a dict to log to wandb

Parameters:
  • wandb_dict – a dictionary that will be used to log the results to wandb

  • results – a dictionary of dictionaries, where the keys are the names of the models and the values are dictionaries of the results of the model.

  • roc – the ROC AUC score

  • uncert_metrics – a dictionary of metrics that are calculated for the uncertainty

  • excess – excess of the interval for a specific model

  • deficet – deficit of the interval for a specific model

  • excess_all – proportion of excess

  • deficet_all – proportion of deficit

  • name – The name of the model.

climb.tool.impl.data_suite.utils.uncertainty_metrics.test_ood(y_test_ids, idx_ordered)[source]

> This function takes the ordered list of indices and the true labels, then iterates through the indices, assigning the first x% of the indices to the “certain” class and the remainder to the “uncertain” class.

It then calculates performance metrics.

Parameters:
  • y_test_ids – the true labels of the test set

  • idx_ordered – the indices of the test set, ordered by the distance to the nearest neighbor.

Returns:

dictionary of metrics, ROC score

Module contents