climb.tool.impl.data_suite.data package

Submodules

climb.tool.impl.data_suite.data.data_loader module

climb.tool.impl.data_suite.data.data_loader.corrupt_data_func(data, feat_list, mean=0, variance=1, proportion=0.5, dist='normal')[source]

> This function takes a dataframe, a list of features to corrupt, and a noise distribution. It adds noise drawn from that distribution to the listed features for a proportion of the rows, then returns the corrupted data, the original data, a list of the corrupted data points, a list of the noise added, and a list of the indices of the corrupted data points.

Parameters:
  • data – the data you want to corrupt

  • feat_list – the list of features to corrupt

  • mean – the mean of the distribution you want to sample from. Defaults to 0

  • variance – the variance of the noise. Defaults to 1

  • proportion – the proportion of data that will be corrupted. Defaults to 0.5

  • dist – the distribution of the noise. Defaults to normal

Returns:

corrupt_data, data, corrupt_ids, noise, noise_id
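
The corruption step described above can be sketched without the library. This is a minimal illustration under stated assumptions, not the implementation of corrupt_data_func: the name corrupt_rows, the seed argument, the list-of-dicts input, and the handling of dist='uniform' are all hypothetical, and the real function works on a dataframe.

```python
import random


def corrupt_rows(rows, feat_list, mean=0.0, variance=1.0,
                 proportion=0.5, dist="normal", seed=0):
    """Hypothetical stand-in: add noise to `feat_list` in a share of rows."""
    rng = random.Random(seed)
    n_corrupt = int(len(rows) * proportion)
    # Pick which rows receive noise.
    corrupt_ids = sorted(rng.sample(range(len(rows)), n_corrupt))
    corrupted = [dict(row) for row in rows]  # copy, so `rows` stays intact
    noise = []
    for i in corrupt_ids:
        row_noise = {}
        for feat in feat_list:
            if dist == "normal":
                eps = rng.gauss(mean, variance ** 0.5)
            elif dist == "uniform":
                # Assumed convention: uniform noise with the same mean and
                # variance, which gives a half-width of sqrt(3 * variance).
                half = (3.0 * variance) ** 0.5
                eps = rng.uniform(mean - half, mean + half)
            else:
                raise ValueError(f"unsupported dist: {dist!r}")
            corrupted[i][feat] += eps
            row_noise[feat] = eps
        noise.append(row_noise)
    return corrupted, rows, corrupt_ids, noise


rows = [{"a": float(i), "b": 1.0} for i in range(10)]
corrupted, original, corrupt_ids, noise = corrupt_rows(rows, ["a"])
print(corrupt_ids)
```

Returning the noise alongside the corrupted rows makes it possible to check afterwards that every change to a row is exactly the recorded perturbation.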

climb.tool.impl.data_suite.data.data_loader.generate_synthetic_large(num_samples=1000)[source]

> This function generates random samples from a multivariate normal distribution. The signature exposes only num_samples, so the mean and covariance matrix are not passed in by the caller

Parameters:

num_samples – The number of samples to generate. Defaults to 1000

Returns:

A tuple of two numpy arrays for train and test

climb.tool.impl.data_suite.data.data_loader.generate_synthetic_small(num_samples=1000)[source]

> This function generates a random sample of data from a multivariate normal distribution with a specified mean and covariance matrix

Parameters:

num_samples – The number of samples to generate. Defaults to 1000

Returns:

A tuple of two numpy arrays for train and test
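
Both synthetic generators draw from a multivariate normal distribution and return a train/test pair. The sketch below shows that idea with the standard library only. The function names, the mean and covariance values, the test_fraction argument, and the seed are assumptions made for illustration; the documented functions return numpy arrays, and numpy.random.multivariate_normal does the sampling in one call.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L @ L.T == cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_mvn(mean, cov, num_samples=1000, test_fraction=0.5, seed=0):
    """Hypothetical helper: draw samples as x = mean + L z, then split them."""
    rng = random.Random(seed)
    L = cholesky(cov)
    dim = len(mean)
    samples = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # independent N(0, 1)
        x = [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
             for i in range(dim)]
        samples.append(x)
    n_test = int(num_samples * test_fraction)
    return samples[n_test:], samples[:n_test]  # (train, test)


train, test = sample_mvn([0.0, 1.0], [[1.0, 0.5], [0.5, 2.0]])
print(len(train), len(test))
```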

climb.tool.impl.data_suite.data.data_loader.load_adult_data(split_size=0.3)[source]

> This function loads the adult dataset, removes all the rows with missing values, and then splits the data into a training and test set

Parameters:

split_size – The proportion of the dataset to include in the test split. Defaults to 0.3

Returns:

X_train, X_test, y_train, y_test, X, y
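
As a rough picture of what the six return values contain, the sketch below drops incomplete rows and then splits what remains. It is an assumption-laden illustration rather than the body of load_adult_data: the helper name, the "?" missing-value marker, the seed, and the rounding of the test-set size are all hypothetical choices.

```python
import random


def drop_missing_and_split(rows, labels, split_size=0.3, missing="?", seed=0):
    """Hypothetical helper: remove incomplete rows, then split train/test."""
    # Keep only rows with no missing marker and no None entries.
    kept = [(row, label) for row, label in zip(rows, labels)
            if missing not in row and None not in row]
    X = [row for row, _ in kept]
    y = [label for _, label in kept]

    # Shuffle a copy so X and y keep their original order.
    shuffled = list(kept)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * split_size))
    test, train = shuffled[:n_test], shuffled[n_test:]

    X_train = [row for row, _ in train]
    y_train = [label for _, label in train]
    X_test = [row for row, _ in test]
    y_test = [label for _, label in test]
    return X_train, X_test, y_train, y_test, X, y


rows = [[i, "Private"] for i in range(9)] + [[9, "?"]]
labels = [0, 1] * 5
X_train, X_test, y_train, y_test, X, y = drop_missing_and_split(rows, labels)
print(len(X_train), len(X_test), len(X))
```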

climb.tool.impl.data_suite.data.data_loader.load_electric(path='electricity.arff')[source]

> This function loads the electric dataset from the file, encodes the class labels, and returns the training and test sets

Parameters:

path – the path to the dataset. Defaults to electricity.arff

Returns:

X_train, X_test, y_train, y_test
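
Encoding the class labels means mapping each distinct label to an integer. The sketch below shows one common convention, sorted label order, which is also how scikit-learn's LabelEncoder orders its classes. Whether load_electric uses this convention is an assumption, and the helper name and the "UP"/"DOWN" values are illustrative.

```python
def encode_labels(labels):
    """Hypothetical helper: map each distinct label to an integer code."""
    classes = sorted(set(labels))  # deterministic ordering of the classes
    mapping = {cls: code for code, cls in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


encoded, mapping = encode_labels(["UP", "DOWN", "UP", "UP"])
print(encoded, mapping)
```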

climb.tool.impl.data_suite.data.data_loader.load_synthetic_data(n_synthetic=1000, mean=0, noise_variance=0, dim='small', prop='0.5', dist='normal')[source]

> This function generates a synthetic dataset with a specified number of samples, mean, noise variance, dimensionality, proportion of noise, and distribution of noise

Parameters:
  • n_synthetic – number of samples to generate. Defaults to 1000

  • mean – mean of the noise distribution. Defaults to 0

  • noise_variance – the variance of the noise distribution. Defaults to 0

  • dim – “small” or “large”. Defaults to small

  • prop – proportion of data to corrupt. Defaults to '0.5' (given as a string in the signature)

  • dist – the distribution of the noise. Can be “normal” or “uniform”. Defaults to normal

Module contents