Verifies the functionality of Doppelgangers
verifyDoppelgangers.Rd
The user constructs a CSV file of training-validation set pairs, ideally with the number of doppelgangers between the training and validation sets increasing from one pair to the next. For each training-validation set pair, 12 models with different feature sets are trained: 10 with random feature sets and 2 with the feature sets of highest and lowest variance. If the validation accuracy of the 10 random models increases as the number of doppelgangers increases, we can conclude that the included doppelgangers are functional doppelgangers.
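The feature-set scheme described above can be sketched as follows. This is only an illustration of the idea, not the package's implementation: it assumes features are in rows and samples in columns, and the helper name `sketch_feature_sets` and the toy matrix are hypothetical.

```r
# Hypothetical helper: builds the 12 feature sets described above.
# Assumption: rows are features, columns are samples.
sketch_feature_sets <- function(data, feature_set_portion = 0.1,
                                num_random_feature_sets = 10,
                                seed_num = 2021) {
  set_size <- max(1, round(feature_set_portion * nrow(data)))

  # Rank features by their variance across samples
  feature_var <- apply(data, 1, var)
  ranked <- names(sort(feature_var, decreasing = TRUE))

  sets <- list(
    highest_variance = head(ranked, set_size),
    lowest_variance = tail(ranked, set_size)
  )

  # Draw the random feature sets reproducibly
  set.seed(seed_num)
  for (i in seq_len(num_random_feature_sets)) {
    sets[[paste0("random_", i)]] <- sample(rownames(data), set_size)
  }
  sets
}

# Toy data: 50 features x 6 samples
set.seed(1)
toy <- matrix(rnorm(300), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))
feature_sets <- sketch_feature_sets(toy)
length(feature_sets)  # 12 feature sets, each with 5 features
```

In this sketch the two variance-based sets act as reference points, while the random sets are the ones whose validation accuracy is compared across training-validation pairs.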
Usage
verifyDoppelgangers(
experiment_plan_filename,
raw_data,
meta_data,
feature_set_portion = 0.1,
seed_num = 2021,
separator = "\\.",
do_batch_corr = TRUE,
k = 5,
num_random_feature_sets = 10,
size_of_val_set = 8,
batch_corr_method = "ComBat",
neg_con_seed = 10
)
Arguments
- experiment_plan_filename
Name of the CSV file containing the experiment plan. The file has a header row with the names of the training-validation sets (e.g. "Doppel_0.train" or "Doppel_0.valid"). Each column (e.g. the "Doppel_0.train" column) lists the names of all samples included in that training or validation set.
- raw_data
Dataframe of count matrix before batch correction
- meta_data
Dataframe of meta data
- feature_set_portion
Proportion of variables to be used for feature set generation
- seed_num
Seed number for random feature set generation
- separator
The character separating the name of the training-validation pair (e.g. "Doppel_0") from the "train" or "valid" label. Each column name should have the format "Doppel_0.train" if "." is used as the separator.
- do_batch_corr
If FALSE, no batch correction is carried out
- k
k hyperparameter for KNN classification models
- num_random_feature_sets
Number of random feature sets for each training-validation set
- size_of_val_set
Size of each validation set. All validation sets are assumed to be the same size; this value is used for the binomial model.
- batch_corr_method
Batch correction method used. Only two options are accepted: "ComBat" or "ComBat_seq".
- neg_con_seed
Seed used for negative control
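As an illustration of the experiment plan layout expected by experiment_plan_filename, the sketch below writes a small plan to a temporary CSV file and splits the headers on the default separator. The sample names, the set names, and the padding of shorter columns with blank cells are assumptions made for this example, not documented behaviour.

```r
# Hypothetical sample names; replace with the sample names in your data.
plan_sets <- list(
  "Doppel_0.train" = c("S1", "S2", "S3", "S4"),
  "Doppel_0.valid" = c("S5", "S6"),
  "Doppel_2.train" = c("S1", "S2", "S3", "S4"),
  "Doppel_2.valid" = c("S7", "S8")
)

# Pad shorter columns with empty strings so every column has the same length
# (assumption: blank cells are acceptable for the shorter sets).
max_len <- max(lengths(plan_sets))
padded <- lapply(plan_sets, function(x) c(x, rep("", max_len - length(x))))
plan <- as.data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)

# Write the plan with one column per training or validation set
plan_file <- file.path(tempdir(), "experiment_plan.csv")
write.csv(plan, plan_file, row.names = FALSE)

# Read it back and split each header into set name and train/valid label
plan_read <- read.csv(plan_file, check.names = FALSE, stringsAsFactors = FALSE)
header_parts <- strsplit(colnames(plan_read), "\\.")
```

Each element of `header_parts` then holds a set name and its "train" or "valid" label, which is why the separator must not occur inside the set name itself.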
Details
Troubleshooting tips:
Ensure that none of the headers contain spaces.
If Excel is used for planning, save the spreadsheet as "CSV (MS-DOS) (*.csv)".
Use the exact labels "train" and "valid" (take note of capitalization).
Ensure the separator does not appear in the name of the training-validation set (e.g. "Doppel.0" is not allowed when "." is the separator).
Place the training and validation columns of each pair next to each other and leave no empty columns between them.
Refer to the example CSV file in the tutorial in the GitHub README.
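Several of these tips can be checked before running the function. The helper below is a hypothetical sketch (`check_plan_headers` is not part of the package) that inspects a vector of column headers and returns a description of each problem it finds; an empty result means no problems were detected.

```r
# Hypothetical helper: returns a character vector describing header problems.
check_plan_headers <- function(headers, separator = "\\.") {
  problems <- character(0)

  # Headers must not contain spaces
  if (any(grepl(" ", headers, fixed = TRUE))) {
    problems <- c(problems, "headers contain spaces")
  }

  # Each header must split into exactly a set name and a label
  parts <- strsplit(headers, separator)
  if (any(lengths(parts) != 2)) {
    problems <- c(problems,
                  "each header must split into one set name and one label")
    return(problems)
  }

  set_names <- vapply(parts, `[`, character(1), 1)
  labels <- vapply(parts, `[`, character(1), 2)

  # Labels are case-sensitive
  if (!all(labels %in% c("train", "valid"))) {
    problems <- c(problems, "labels must be exactly 'train' or 'valid'")
  }

  # Every set name needs both a train and a valid column
  for (nm in unique(set_names)) {
    if (!setequal(labels[set_names == nm], c("train", "valid"))) {
      problems <- c(problems,
                    paste0("set '", nm, "' lacks a train/valid pair"))
    }
  }
  problems
}

ok <- check_plan_headers(c("Doppel_0.train", "Doppel_0.valid"))
bad_separator <- check_plan_headers(c("Doppel.0.train", "Doppel.0.valid"))
bad_label <- check_plan_headers(c("Doppel_0.Train", "Doppel_0.valid"))
```

Here `ok` is empty, while `bad_separator` and `bad_label` each report at least one problem.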