## Shokri et al. attack
Let `target_model_fn()` return the target model architecture as a scikit-like classifier. The attack is white-box, meaning the attacker is assumed to know the architecture. Let `NUM_CLASSES` be the number of classes of the classification problem.
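For concreteness, a minimal sketch of such a function might look as follows; the MLP architecture and the constant values are hypothetical choices, not prescribed by mia:

```python
from sklearn.neural_network import MLPClassifier

# Hypothetical constants; pick values that fit your problem.
NUM_CLASSES = 10            # number of classes in the classification problem
SHADOW_DATASET_SIZE = 4000  # examples per half of each shadow dataset
NUM_MODELS = 20             # number of shadow models to train

def target_model_fn():
    # Any scikit-like classifier exposing fit and predict_proba works;
    # a small MLP is used here purely as an example.
    return MLPClassifier(hidden_layer_sizes=(128,), max_iter=50)
```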
First, the attacker needs to train several shadow models that mimic the target model, each on a different dataset sampled from the original data distribution. The following code snippet initializes a shadow model bundle and runs the training of the shadows. For each shadow model, `2 * SHADOW_DATASET_SIZE` examples are sampled without replacement from the full attacker’s dataset. Half of them will be used as control, and the other half for training the shadow model.
```python
from mia.estimators import ShadowModelBundle

# Create a bundle of shadow models.
smb = ShadowModelBundle(
    target_model_fn,
    shadow_dataset_size=SHADOW_DATASET_SIZE,
    num_models=NUM_MODELS,
)

# Train the shadow models and generate the attack training data.
X_shadow, y_shadow = smb.fit_transform(attacker_X_train, attacker_y_train)
```
`fit_transform` returns the attack data `X_shadow, y_shadow`. Each row in `X_shadow` is a concatenated vector consisting of the prediction vector of a shadow model for an example from the original dataset, and the example’s class (one-hot encoded). Its shape is hence `(2 * SHADOW_DATASET_SIZE * NUM_MODELS, 2 * NUM_CLASSES)`. Each label in `y_shadow` is zero if the corresponding example was “out” of the training dataset of the shadow model (control), or one if it was “in” the training dataset.
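To make the layout concrete, here is a small illustration (not part of mia) of how a single row of `X_shadow` is composed, using dummy values:

```python
import numpy as np

NUM_CLASSES = 3

# A shadow model's prediction vector for one example (dummy values).
prediction_vector = np.array([0.7, 0.2, 0.1])

# One-hot encoding of the example's true class.
true_class = 0
one_hot_class = np.eye(NUM_CLASSES)[true_class]

# The corresponding row of X_shadow: predictions followed by the class.
row = np.concatenate([prediction_vector, one_hot_class])  # shape: (2 * NUM_CLASSES,)
```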
mia provides a class to train a bundle of attack models, one model per class. `attack_model_fn` is supposed to return a scikit-like classifier that takes a vector of model predictions and returns whether an example with these predictions was in the training dataset, or out of it.
```python
from mia.estimators import AttackModelBundle

# Create a bundle of attack models, one per class.
amb = AttackModelBundle(attack_model_fn, num_classes=NUM_CLASSES)

# Train the attack models on the shadow data.
amb.fit(X_shadow, y_shadow)
```
In place of the `AttackModelBundle` one can use any binary classifier that takes examples of shape `(2 * NUM_CLASSES,)` (as explained above, the first half of an input is the prediction vector from a model, the second half is the one-hot encoded true class of the corresponding example).
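For instance, a minimal sketch of an `attack_model_fn` built on scikit-learn; the choice of logistic regression here is a hypothetical example, not a recommendation of the library:

```python
from sklearn.linear_model import LogisticRegression

def attack_model_fn():
    # Any scikit-like binary classifier works as an attack model.
    return LogisticRegression(max_iter=1000)
```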
To evaluate the attack, one must encode the data in the above-mentioned format. Let `target_model` be the target model, `data_in` the data (a tuple `(X, y)`) that was used in the training of the target model, and `data_out` the data that was not used in the training.
```python
import numpy as np

from mia.estimators import prepare_attack_data

# Encode the target model's predictions on members and non-members
# of its training set in the attack data format.
attack_test_data, real_membership_labels = prepare_attack_data(
    target_model, data_in, data_out
)

# Predict membership and compute the attack accuracy.
attack_guesses = amb.predict(attack_test_data)
attack_accuracy = np.mean(attack_guesses == real_membership_labels)
```
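Assuming `data_in` and `data_out` are of equal size, an attack accuracy close to 0.5 means the attack does no better than random guessing of membership; the further it is above 0.5, the more membership information the target model leaks.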