Type: Article
Publication Date: 2021-01-01
Citations: 14
DOI: https://doi.org/10.1109/access.2021.3101289
Deep neural networks are being widely deployed for critical tasks. In many cases, pre-trained models are sourced from vendors who may have compromised the training pipeline to insert Trojan behaviors. These malicious behaviors can be triggered at the adversary's will, posing a serious security threat. To verify the integrity of a deep model, we propose a method that captures its fingerprint with adversarial perturbations. Inserting backdoors into a network alters its decision boundaries, which are effectively encoded by adversarial perturbations. Our proposed Trojan detection network learns features from adversarial patterns and their properties to encode the unknown trigger shape and the deviations in decision boundaries caused by backdoors. Our method works entirely without clean samples, or with a limited number of clean samples for improved performance. It also performs anomaly detection to identify the target class of a Trojaned network, is invariant to the trigger type, trigger size, and network architecture, and does not require any triggered samples. Experiments are performed on the MNIST, NIST-TrojAI, and Odysseus datasets, with 5,000 pre-trained models in total, making this the largest study to date on Trojan detection; our method achieves new state-of-the-art accuracy.
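The core intuition above, that adversarial perturbations encode a model's decision boundaries and therefore act as a fingerprint, can be illustrated with a minimal sketch. This is not the paper's detector: as an assumption, a tiny hand-written logistic model stands in for a deep network, and an FGSM-style gradient-sign step stands in for the paper's perturbation generation. The point is only that two models with different boundaries (e.g., one carrying an extra "trigger" weight) yield different perturbation patterns, which a downstream detection network could learn to separate.

```python
import numpy as np

def fgsm_perturbation(w, b, x, y, eps=0.1):
    """One FGSM-style step for a logistic model p = sigmoid(w.x + b), y in {0, 1}.

    Returns eps * sign(d loss / d x) for the cross-entropy loss; this sign
    pattern reflects the orientation of the model's decision boundary.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
    grad_x = (p - y) * w                    # gradient of cross-entropy w.r.t. x
    return eps * np.sign(grad_x)            # perturbation = fingerprint ingredient

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # one input sample

# Hypothetical clean vs. "Trojaned" model: the latter has a large weight of
# opposite sign on one feature, mimicking a boundary shifted by a trigger.
w_clean = np.array([1.0, -1.0, 0.5, 0.2])
w_trojan = np.array([1.0, -1.0, 0.5, -5.0])

d_clean = fgsm_perturbation(w_clean, 0.0, x, 1)
d_trojan = fgsm_perturbation(w_trojan, 0.0, x, 1)

# The perturbation patterns differ exactly where the boundaries differ.
print(np.any(d_clean != d_trojan))  # True
```

In the full method, such perturbations would be computed on the pre-trained network under test and fed, together with their properties, to the learned Trojan detection network.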