AI Security


  • Philip Skuddlik (Deutsche Bahn)
  • Djalel Benbouzid (Volkswagen Group)
  • Estelle Wang (Continental)
  • Håkan Jonsson (Zalando)


This page is not intended to be a comprehensive guide to AI security practices. The main purpose here is to make the reader aware of novel cybersecurity issues emerging from the usage of AI and data-based systems.

Overview: Attacks and Attack Mitigation

Attacks and harms towards ML models can roughly be placed into two categories:

  • Privacy (informational harms): Issues resulting from the leakage of confidential information that can be associated with individuals or organizations.

  • Security (behavioral harms): Issues resulting from an attacker being able to manipulate a model’s behavior and thus impact predictions and outcomes.

Further, both categories can be viewed with respect to five security challenges. The following sections list the possible attacks in each of these domains and the respective mitigation strategies that should be implemented.

Data Privacy: Model Inversion & Membership Inference

Attack Scenario

Membership inference involves inferring, from a sample of the model’s output, whether or not an individual’s data was contained in the data used to train the model. Through model inversion, an attacker might not only be able to tell whether a particular sample was used in training but also recreate the actual training data itself. If used by malicious third parties, such attacks could compromise the confidentiality of the model and violate the privacy of affected individuals.
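The simplest form of membership inference can be sketched as a confidence-threshold attack. The sketch assumes that the target model is overfit and therefore returns higher confidence on samples it was trained on. The function names, the threshold of 0.9, and all confidence values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def infer_membership(confidences: Sequence[float], threshold: float = 0.9) -> List[bool]:
    """Guess 'was in the training set' when the top-class confidence exceeds the threshold."""
    return [c > threshold for c in confidences]


def attack_accuracy(guesses: Sequence[bool], is_member: Sequence[bool]) -> float:
    """Fraction of samples for which the attacker's guess is correct."""
    correct = sum(g == m for g, m in zip(guesses, is_member))
    return correct / len(is_member)


# Hypothetical top-class confidences returned by a target model.
member_conf = [0.99, 0.97, 0.95, 0.80]      # samples that were in the training set
non_member_conf = [0.70, 0.92, 0.60, 0.55]  # samples that were not

guesses = infer_membership(member_conf + non_member_conf)
truth = [True] * 4 + [False] * 4
accuracy = attack_accuracy(guesses, truth)
print(f"attack accuracy: {accuracy:.2f}")
```

On a balanced set of members and non-members, an attack accuracy clearly above 0.5 suggests that the model’s outputs leak membership information.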

Mitigation Strategies

  • Differentially private protection: Adding noise to data or models by means of differential privacy in the model training stage.

  • Private Aggregation of Teacher Ensembles (PATE): Segmenting training data into multiple sets, each used to train an independent DNN model (a teacher). The teacher models then jointly train a student model by voting on labels. This ensures that the inference of the student model does not reveal the information of a particular training data set.
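The voting step of PATE can be sketched as a noisy majority vote: Laplace noise is added to the per-class vote counts before the winning label is passed on to the student. This is a minimal sketch, not a full PATE implementation. The function names and the noise scale are assumptions for illustration, and the teacher predictions are hard-coded instead of coming from trained models.

```python
import random
from collections import Counter
from typing import Sequence


def laplace_noise(rng: random.Random, scale: float) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(teacher_labels: Sequence[int], num_classes: int,
               scale: float, rng: random.Random) -> int:
    """Return the class with the highest noisy vote count."""
    counts = Counter(teacher_labels)  # missing classes count as 0
    noisy = [counts[c] + laplace_noise(rng, scale) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy[c])


# Hypothetical labels predicted by five teachers for one unlabeled sample.
teacher_labels = [1, 1, 1, 0, 2]
rng = random.Random(0)
student_label = noisy_vote(teacher_labels, num_classes=3, scale=1.0, rng=rng)
print("label passed to the student:", student_label)
```

A larger noise scale gives stronger privacy but makes it more likely that the student receives a label other than the true majority.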

Model Confidentiality: Model Extraction

Attack Scenario

An attacker may use model outputs to recreate the model itself. This has implications for privacy and security as well as for the intellectual property or proprietary business logic embodied in the model. The possible harms are numerous, and because models retain representations of their training data, model extraction is also a privacy concern.

Mitigation Strategies

  • Model watermarking: This technology embeds special recognition neurons into the original model in the model training stage. These neurons later enable a special input sample to check whether that model was obtained by stealing the original model.
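The verification step of watermarking can be sketched as follows. The sketch assumes the owner keeps a secret set of trigger inputs with pre-assigned labels and checks how often a suspect model reproduces them. How the watermark is embedded during training is not shown. The model interface, the trigger values, and the 0.9 threshold are hypothetical.

```python
from typing import Callable, Sequence

Model = Callable[[int], int]  # hypothetical interface: input id -> predicted label


def watermark_match_rate(predict: Model, trigger_inputs: Sequence[int],
                         trigger_labels: Sequence[int]) -> float:
    """Fraction of secret trigger inputs for which the model returns the secret label."""
    hits = sum(predict(x) == y for x, y in zip(trigger_inputs, trigger_labels))
    return hits / len(trigger_inputs)


def looks_stolen(predict: Model, trigger_inputs: Sequence[int],
                 trigger_labels: Sequence[int], threshold: float = 0.9) -> bool:
    """Flag a model that reproduces the watermark far more often than chance."""
    return watermark_match_rate(predict, trigger_inputs, trigger_labels) >= threshold


# Secret trigger set known only to the model owner (values are made up).
TRIGGER_INPUTS = [101, 202, 303, 404]
TRIGGER_LABELS = [7, 3, 7, 1]
_SECRET = dict(zip(TRIGGER_INPUTS, TRIGGER_LABELS))


def independent_model(x: int) -> int:
    """Stand-in for a model trained without the watermark."""
    return x % 10


def watermarked_model(x: int) -> int:
    """Stand-in for the original model or a copy of it: triggers map to secret labels."""
    return _SECRET.get(x, x % 10)


print(looks_stolen(watermarked_model, TRIGGER_INPUTS, TRIGGER_LABELS))
print(looks_stolen(independent_model, TRIGGER_INPUTS, TRIGGER_LABELS))
```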

Data Integrity: Poisoning

Attack Scenario

Model poisoning occurs when an adversary is able to inject malicious (possibly carefully crafted) samples into the training data in order to alter the behavior of the model at a later point in time. An attacker might, for example:

  • encourage the model to have more beneficial outcomes for particular individuals
  • train a model to intentionally discriminate against a group of people
  • erode general model performance

This attack does not require direct access to a deployed model; access to the training data is sufficient. Models that continuously retrain on new data are more vulnerable.

Mitigation Strategies

  • Training data filtering: Identifying possible poisoned data points based on label characteristics and filtering those points out during retraining. In essence, this means controlling the collection of training data.

  • Model Monitoring: Continuously monitoring model behavior to check for situations in which the model misbehaves. This includes the detection of input anomalies and domain shift, for example by means of Variational Autoencoders (VAE).

  • Ensemble analysis: By using multiple sub-models, the probability of the whole system being affected by poisoning attacks is reduced. The training data must be independent for each model.
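One possible heuristic for training data filtering is a nearest-neighbour label check: a point whose label disagrees with the majority label of its closest neighbours is treated as suspicious and dropped. This is a minimal sketch under the assumption that poisoned points are label-flipped and sit inside a cluster of clean points. The function name, the choice of k, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def knn_filter(points: Sequence[Point], labels: Sequence[int], k: int = 3) -> List[int]:
    """Return indices of points whose label matches the majority label of their k nearest neighbours."""
    keep = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        votes = [labels[j] for j in neighbours]
        majority = max(set(votes), key=votes.count)
        if majority == labels[i]:
            keep.append(i)
    return keep


# Toy data: two well-separated clusters plus one label-flipped point at index 4.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1,
          1, 1, 1, 1]

kept = knn_filter(points, labels)
print("indices kept for retraining:", kept)
```

The check is quadratic in the number of points and only catches poisoned samples that look inconsistent with their neighbourhood, so it complements rather than replaces monitoring.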

Model Robustness: Evasion

Attack Scenario

Evasion occurs when special input data is fed into an ML system that intentionally causes the system to misclassify that data. Deep learning systems can be easily affected by well-crafted input samples, which are called adversarial examples. These inputs look no different to human eyes but greatly affect the output of deep learning models. An attacker may also create adversarial examples to induce a specific model output. Since adversarial attacks are often transferable between models trained on similar data, black-box attacks are also possible. Models vulnerable to model extraction and inversion are thus also more vulnerable to evasion attacks.
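How a small, targeted perturbation flips a prediction can be sketched with the fast gradient sign method (FGSM), one well-known way to craft adversarial examples, applied here to a tiny logistic regression model. The weights, the input, and the step size eps are hypothetical, and a perturbation of this size on three features is far more visible than one spread over the many pixels of an image.

```python
import math
from typing import List, Sequence

# Hypothetical, already-trained logistic regression model.
W = [2.0, -3.0, 1.0]
B = 0.0


def predict_proba(x: Sequence[float]) -> float:
    """Probability of class 1."""
    z = sum(w * v for w, v in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: Sequence[float]) -> int:
    return int(predict_proba(x) >= 0.5)


def fgsm(x: Sequence[float], y: int, eps: float) -> List[float]:
    """Move each feature by eps in the direction that increases the cross-entropy loss."""
    error = predict_proba(x) - y   # derivative of the loss w.r.t. the logit
    grad = [error * w for w in W]  # derivative of the loss w.r.t. the input
    return [v + eps * ((g > 0) - (g < 0)) for v, g in zip(x, grad)]


x = [0.2, -0.1, 0.1]           # correctly classified as class 1
x_adv = fgsm(x, y=1, eps=0.2)  # no feature changes by more than eps

print("original prediction:", predict(x))
print("adversarial prediction:", predict(x_adv))
```

This sketch uses the model’s own gradient (a white-box attack). In the black-box setting described above, the attacker would compute the perturbation on a substitute model and rely on transferability.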

Mitigation Strategies

  • Network distillation: This technique works by chaining multiple DNNs in the model training stage, so that the classification result generated by one DNN is used for training the next DNN. This reduces the sensitivity of an AI model to small perturbations and improves model robustness.

  • Adversarial training: Works by adding adversarial examples to the training data set and retraining one or multiple times to generate a new model, which is then more resistant to these attack perturbations.

  • Adversarial example detection: Identify adversarial examples by adding an external detection model to the inference stage. Before an input sample arrives at the original model, the detection model determines whether the sample is adversarial.

  • Input Reconstruction: By deforming and then reconstructing input samples (e.g. with autoencoders), an adversarial input is less likely to still affect the normal classification function of models.
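Input reconstruction and adversarial example detection can be combined in one small sketch. Simple quantization (sometimes called feature squeezing) stands in here for the autoencoder mentioned above, and an input is flagged when the model’s score changes sharply after reconstruction. The model weights, the threshold, and the inputs are hypothetical, and features are assumed to lie in [0, 1].

```python
import math
from typing import List, Sequence


def squeeze(x: Sequence[float], levels: int = 2) -> List[float]:
    """Round each feature in [0, 1] to the nearest of `levels` evenly spaced values."""
    steps = levels - 1
    return [round(v * steps) / steps for v in x]


# Hypothetical, already-trained logistic regression model.
W = [10.0, -10.0, -10.0, 10.0]
B = 18.0


def score(x: Sequence[float]) -> float:
    """Probability of class 1."""
    z = sum(w * v for w, v in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


def is_suspicious(x: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag inputs whose score changes sharply once small perturbations are removed."""
    return abs(score(x) - score(squeeze(x))) > threshold


clean = [0.0, 1.0, 1.0, 0.0]
perturbed = [0.1, 0.9, 0.9, 0.1]  # small shift that pushes the score across 0.5

print("reconstructed:", squeeze(perturbed))
print("clean flagged:", is_suspicious(clean))
print("perturbed flagged:", is_suspicious(perturbed))
```

Quantization is only effective against perturbations smaller than the quantization step; a learned reconstruction such as an autoencoder can be less restrictive about the input format.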

Software and Hardware Security: Backdoor

Attack Scenario

A model with a backdoor responds in the same way as the original model on normal input, but on a specific input, the responses are controlled by the backdoor. Unlike traditional programs, a neural network model only consists of a set of parameters, without source code. Therefore, backdoors in AI models are harder to detect than in traditional programs. The response triggered by the backdoor can relate to any of the above-mentioned aspects.

Mitigation Strategies

  • White Hat or Red Team Hacking: White hat or red teams, either internal or provided by third parties, may be tasked with identifying and potentially remediating vulnerabilities.

  • Input pre-processing: This technique filters out inputs that can trigger backdoors, reducing the risk that a backdoor is activated and changes the model’s inference results.

  • Model pruning: The goal is to prune off neurons of the original model while keeping normal functions. Neurons constituting a backdoor can be removed, reducing the possibility of the backdoor working.
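One way to choose which neurons to prune can be sketched as follows. The sketch assumes that neurons belonging to a backdoor stay mostly inactive on clean inputs and fire only when the trigger is present, so the neurons with the lowest average activation on clean data are pruned first. The function names and the activation values are hypothetical; in practice the activations would be recorded from a layer of the real model.

```python
from typing import List, Sequence


def mean_activations(activations: Sequence[Sequence[float]]) -> List[float]:
    """Average activation per neuron; `activations` holds one row per clean sample."""
    n = len(activations)
    return [sum(column) / n for column in zip(*activations)]


def pruning_mask(clean_activations: Sequence[Sequence[float]], num_to_prune: int) -> List[int]:
    """Return a 0/1 mask per neuron, with 0 for the least active neurons on clean data."""
    means = mean_activations(clean_activations)
    order = sorted(range(len(means)), key=lambda i: means[i])
    pruned = set(order[:num_to_prune])
    return [0 if i in pruned else 1 for i in range(len(means))]


# Hypothetical activations of four neurons on three clean samples.
clean_activations = [
    [0.9, 0.00, 0.4, 0.7],
    [0.8, 0.01, 0.5, 0.6],
    [1.0, 0.00, 0.3, 0.9],
]

mask = pruning_mask(clean_activations, num_to_prune=1)
print("neuron mask:", mask)
```

Multiplying the layer output by the mask disables the pruned neurons. Accuracy on clean validation data should be checked after each pruning step to confirm that normal functions are preserved.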

Further readings

Last update: 2022.09.04, v0.1