climb.tool.impl.data_suite.data package
Submodules
climb.tool.impl.data_suite.data.data_loader module
- climb.tool.impl.data_suite.data.data_loader.corrupt_data_func(data, feat_list, mean=0, variance=1, proportion=0.5, dist='normal')
> Corrupts the listed features of a dataframe with noise drawn from the specified distribution. Returns the corrupted data, the original data, a list of the corrupted data points, a list of the noise added to the data, and a list of the indices of the corrupted data points.
- Parameters:
data – the dataframe to corrupt
feat_list – the list of features to corrupt
mean – the mean of the noise distribution to sample from. Defaults to 0
variance – the variance of the noise. Defaults to 1
proportion – the proportion of the data to corrupt. Defaults to 0.5
dist – the distribution of the noise. Defaults to 'normal'
- Returns:
corrupt_data, data, corrupt_ids, noise, noise_id
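The corruption step described above can be illustrated with a minimal sketch. This is not the package's implementation: the helper name `corrupt_rows_sketch`, the `feat_idx` and `seed` arguments, the use of plain NumPy arrays instead of a dataframe, and the four-value return are all assumptions made for the example. It adds Gaussian noise to the chosen columns of a randomly selected proportion of rows.

```python
import numpy as np


def corrupt_rows_sketch(data, feat_idx, mean=0.0, variance=1.0, proportion=0.5, seed=0):
    """Add Gaussian noise to selected columns of a random subset of rows.

    Hypothetical helper for illustration only; not the package's code.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows = data.shape[0]
    n_corrupt = int(round(proportion * n_rows))

    # Pick the rows to corrupt, without replacement.
    corrupt_ids = np.sort(rng.choice(n_rows, size=n_corrupt, replace=False))

    # normal() expects a standard deviation, so convert the variance.
    noise = rng.normal(loc=mean, scale=np.sqrt(variance),
                       size=(n_corrupt, len(feat_idx)))

    # Work on a copy so the original data stays untouched.
    corrupt_data = data.copy()
    corrupt_data[np.ix_(corrupt_ids, feat_idx)] += noise
    return corrupt_data, data, corrupt_ids, noise


clean = np.zeros((10, 3))
corrupted, original, ids, noise = corrupt_rows_sketch(
    clean, feat_idx=[0, 2], proportion=0.3
)
```

Returning the row indices together with the noise makes it possible to check afterwards which entries were changed and by how much.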
- climb.tool.impl.data_suite.data.data_loader.generate_synthetic_large(num_samples=1000)
> Generates random samples from a multivariate normal distribution. The mean and covariance matrix are not parameters of this function; only the number of samples is configurable.
- Parameters:
num_samples – The number of samples to generate. Defaults to 1000
- Returns:
A tuple of two NumPy arrays: the training set and the test set
- climb.tool.impl.data_suite.data.data_loader.generate_synthetic_small(num_samples=1000)
> Generates random samples from a multivariate normal distribution. The mean and covariance matrix are not parameters of this function; only the number of samples is configurable.
- Parameters:
num_samples – The number of samples to generate. Defaults to 1000
- Returns:
A tuple of two NumPy arrays: the training set and the test set
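As a rough illustration of what the two generators above describe, the sketch below draws samples from a multivariate normal distribution and splits them into a train and a test array. The mean vector, the covariance matrix, the 50/50 split, and the names `sample_mvn_split_sketch`, `test_fraction`, and `seed` are assumptions chosen for the example; the documentation does not state the values the package uses.

```python
import numpy as np


def sample_mvn_split_sketch(num_samples=1000, test_fraction=0.5, seed=0):
    """Draw multivariate normal samples and split them into train and test.

    Hypothetical helper for illustration only; not the package's code.
    """
    rng = np.random.default_rng(seed)

    # Assumed 2-D mean and covariance, picked only for this example.
    mean = np.array([0.0, 1.0])
    cov = np.array([[1.0, 0.3],
                    [0.3, 2.0]])

    samples = rng.multivariate_normal(mean, cov, size=num_samples)

    # The samples are i.i.d., so a contiguous split needs no shuffling.
    n_test = int(num_samples * test_fraction)
    return samples[n_test:], samples[:n_test]


train, test = sample_mvn_split_sketch(num_samples=1000)
```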
- climb.tool.impl.data_suite.data.data_loader.load_adult_data(split_size=0.3)
> Loads the adult dataset, removes all rows with missing values, and splits the data into a training set and a test set.
- Parameters:
split_size – the proportion of the dataset to include in the test split. Defaults to 0.3
- Returns:
X_train, X_test, y_train, y_test, X, y
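The two preprocessing steps named above (dropping rows with missing values, then splitting by `split_size`) can be sketched in plain Python. The helper `drop_missing_and_split_sketch`, the `seed` argument, and the use of `None` as the missing-value marker are assumptions for illustration; the loader itself may rely on a dataframe library and a different marker.

```python
import random


def drop_missing_and_split_sketch(rows, labels, split_size=0.3, seed=0):
    """Drop incomplete rows, then split into train and test partitions.

    Hypothetical helper for illustration only; not the package's code.
    """
    # Keep only rows with no missing entry (None stands in for "missing").
    kept = [(row, y) for row, y in zip(rows, labels) if None not in row]

    # Shuffle reproducibly before splitting.
    random.Random(seed).shuffle(kept)

    # split_size is the fraction of the cleaned data used for the test set.
    n_test = int(round(split_size * len(kept)))
    test, train = kept[:n_test], kept[n_test:]

    X_train = [row for row, _ in train]
    y_train = [y for _, y in train]
    X_test = [row for row, _ in test]
    y_test = [y for _, y in test]
    return X_train, X_test, y_train, y_test


rows = [[i, i * 2] for i in range(10)] + [[None, 1]]
labels = [i % 2 for i in range(11)]
X_train, X_test, y_train, y_test = drop_missing_and_split_sketch(rows, labels)
```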
- climb.tool.impl.data_suite.data.data_loader.load_electric(path='electricity.arff')
> Loads the electricity dataset from the file at path, encodes the class labels, and returns the training and test sets.
- Parameters:
path – the path to the dataset. Defaults to 'electricity.arff'
- Returns:
X_train, X_test, y_train, y_test
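The label-encoding step mentioned above maps each class label to an integer. A minimal sketch follows, assuming string labels; the helper name `encode_labels_sketch` and the example label values are illustrative, and the loader may use a library encoder instead.

```python
def encode_labels_sketch(labels):
    """Map each distinct label to an integer index, in sorted label order.

    Hypothetical helper for illustration only; not the package's code.
    """
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


# Example label values are assumed for the illustration.
encoded, classes = encode_labels_sketch(["UP", "DOWN", "DOWN", "UP"])
```

Returning `classes` alongside the encoded values lets the caller map predictions back to the original labels.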
- climb.tool.impl.data_suite.data.data_loader.load_synthetic_data(n_synthetic=1000, mean=0, noise_variance=0, dim='small', prop='0.5', dist='normal')
> Generates a synthetic dataset from the given number of samples, noise mean, noise variance, dimensionality, proportion of data to corrupt, and noise distribution.
- Parameters:
n_synthetic – number of samples to generate. Defaults to 1000
mean – mean of the noise distribution. Defaults to 0
noise_variance – the variance of the noise distribution. Defaults to 0
dim – “small” or “large”. Defaults to “small”
prop – proportion of data to corrupt. Defaults to '0.5' (a string in the signature)
dist – the distribution of the noise. Can be “normal” or “uniform”. Defaults to “normal”