Research Publications

AI Safety and LLM Security

Red teaming, jailbreaks, malicious-use evaluation, and safety benchmarks.
2026

Muse Spark Safety and Preparedness Report

MSL et al.; Zifan Wang (adversarial robustness lead)

A comprehensive report for Muse Spark on preparedness, adversarial robustness, and model behaviors.

2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

B Wei, Z Che, N Li, UM Sehwag, J Götting, S Nedungadi, J Michael, ...

arXiv preprint arXiv:2510.27629, 2025

A red teaming roadmap towards system-level safety

Z Wang, CQ Knight, J Kritz, WE Primack, J Michael

arXiv preprint arXiv:2506.05376, 2025

Jailbreaking to jailbreak

J Kritz, V Robinson, R Vacareanu, B Varjavand, M Choi, B Gogov, ...

arXiv preprint arXiv:2502.09638, 2025

2024

Refusal-trained LLMs are easily jailbroken as browser agents

P Kumar, E Lau, S Vijayakumar, T Trinh, SR Team, E Chang, V Robinson, ...

arXiv preprint arXiv:2410.13886, 2024

LLM defenses are not robust to multi-turn human jailbreaks yet

N Li, Z Han, I Steneker, W Primack, R Goodside, H Zhang, Z Wang, ...

arXiv preprint arXiv:2408.15221, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning

N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...

arXiv preprint arXiv:2403.03218, 2024

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...

arXiv preprint arXiv:2402.04249, 2024

2023

Can LLMs Follow Simple Rules?

N Mu, S Chen, Z Wang, S Chen, D Karamardian, L Aljeraisy, B Alomair, ...

arXiv preprint arXiv:2311.04235, 2023

Transfer attacks and defenses for large language models on coding tasks

C Zhang, Z Wang, R Mangal, M Fredrikson, L Jia, C Pasareanu

arXiv preprint arXiv:2311.13445, 2023

Universal and transferable adversarial attacks on aligned language models

A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson

arXiv preprint arXiv:2307.15043, 2023

Representation Engineering: A Top-Down Approach to AI Transparency

A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...

arXiv preprint arXiv:2310.01405, 2023

Agents, Web, and Evaluation

Benchmarks, monitoring, data contamination, and web-agent guardrails.
2025

Reliable Weak-to-Strong Monitoring of LLM Agents

N Kale, CBC Zhang, K Zhu, A Aich, P Rodriguez, SR Team, CQ Knight, ...

arXiv preprint arXiv:2508.19461, 2025

Search-time data contamination

Z Han, M Mankikar, J Michael, Z Wang

arXiv preprint arXiv:2508.13180, 2025

WebGuard: Building a generalizable guardrail for web agents

B Zheng, Z Liao, S Salisbury, Z Liu, M Lin, Q Zheng, Z Wang, X Deng, ...

arXiv preprint arXiv:2507.14293, 2025

SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

X Deng, J Da, E Pan, YY He, C Ide, K Garg, N Lauffer, A Park, N Pasari, ...

arXiv preprint arXiv:2509.16941, 2025

Robustness and Certification

Certified robustness, robust training, and adversarial robustness for vision models.
2023

A recipe for improved certifiable robustness: Capacity and data

K Hu, K Leino, Z Wang, M Fredrikson

ICLR 2024, 2023

Is Certifying Robustness Still Worthwhile?

R Mangal, K Leino, Z Wang, K Hu, W Yu, C Pasareanu, A Datta, ...

arXiv preprint arXiv:2310.09361, 2023

Unlocking Deterministic Robustness Certification on ImageNet

K Hu, A Zou, Z Wang, K Leino, M Fredrikson

Advances in Neural Information Processing Systems 36, 42993-43011, 2023

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Z Wang, N Ding, T Levinboim, X Chen, R Soricut

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2023

2022

On the perils of cascading robust classifiers

R Mangal, Z Wang, C Zhang, K Leino, C Pasareanu, M Fredrikson

arXiv preprint arXiv:2206.00278, 2022

2021

Globally-Robust Neural Networks

K Leino, Z Wang, M Fredrikson

Proceedings of ICML 2021, 2021

Robust models are more interpretable because attributions look normal

Z Wang, M Fredrikson, A Datta

arXiv preprint arXiv:2103.11257, 2021

2020

Towards frequency-based explanation for robust CNN

Z Wang, Y Yang, A Shrivastava, V Rawal, Z Ding

arXiv preprint arXiv:2005.03141, 2020

Interpretability and Explainability

Attribution, counterfactuals, graph explanations, BERT influence flow, and explainability tooling.
2023

On the Feature Alignment of Deep Vision Models: Explainability and Robustness Connected at the Hip

Z Wang

Carnegie Mellon University, 2023

2022

Faithful explanations for deep graph models

Z Wang, Y Yao, C Zhang, H Zhang, Y Kang, C Joe-Wong, M Fredrikson, ...

arXiv preprint arXiv:2205.11850, 2022

Exploring Conceptual Soundness with TruLens

A Datta, M Fredrikson, K Leino, K Lu, S Sen, R Shih, Z Wang

NeurIPS 2021 Competitions and Demonstrations Track, 302-307, 2022

2021

Machine learning explainability and robustness: connected at the hip

A Datta, M Fredrikson, K Leino, K Lu, S Sen, Z Wang

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data …, 2021

Consistent counterfactuals for deep models

E Black, Z Wang, M Fredrikson, A Datta

ICLR 2022, 2021

2020

Influence Patterns for Explaining Information Flow in BERT

K Lu, Z Wang, P Mardziel, A Datta

arXiv preprint arXiv:2011.00740, 2020

Reconstructing Actions To Explain Deep Reinforcement Learning

X Chen, Z Wang, Y Fan, B Jin, P Mardziel, C Joe-Wong, A Datta

arXiv preprint arXiv:2009.08507, 2020

Smoothed Geometry for Robust Attribution

Z Wang, H Wang, S Ramkumar, M Fredrikson, P Mardziel, A Datta

Proceedings of NeurIPS 2020, 2020

Interpreting interpretations: Organizing attribution methods by criteria

Z Wang, P Mardziel, A Datta, M Fredrikson

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2020

Score-CAM: Score-weighted visual explanations for convolutional neural networks

H Wang, Z Wang, M Du, F Yang, Z Zhang, S Ding, P Mardziel, X Hu

Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2020

Formal Methods and Systems

Neuro-symbolic reasoning, SMT grounding, and practical ML systems.
2024

VeriSplit: Secure and Practical Offloading of Machine Learning Inferences across IoT Devices

H Zhang, Z Wang, M Dhamankar, M Fredrikson, Y Agarwal

arXiv preprint arXiv:2406.00586, 2024

2023

Learning modulo theories

M Fredrikson, K Lu, S Vijayakumar, S Jha, V Ganesh, Z Wang

arXiv preprint arXiv:2301.11435, 2023

Grounding neural inference with satisfiability modulo theories

Z Wang, S Vijayakumar, K Lu, V Ganesh, S Jha, M Fredrikson

Advances in Neural Information Processing Systems 36, 22794-22806, 2023