Research Publications

AI Safety and LLM Security

Red teaming, jailbreaks, malicious-use evaluation, and safety benchmarks.
2026

Muse Spark Safety and Preparedness Report

MSL et al.; Zifan Wang (adversarial robustness lead)

A comprehensive report for Muse Spark on preparedness, adversarial robustness, and model behaviors.

2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

B Wei, Z Che, N Li, UM Sehwag, J Götting, S Nedungadi, J Michael, ...

arXiv preprint arXiv:2510.27629, 2025

A red teaming roadmap towards system-level safety

Z Wang, CQ Knight, J Kritz, WE Primack, J Michael

arXiv preprint arXiv:2506.05376, 2025

Jailbreaking to jailbreak

J Kritz, V Robinson, R Vacareanu, B Varjavand, M Choi, B Gogov, ...

arXiv preprint arXiv:2502.09638, 2025

2024

Refusal-trained LLMs are easily jailbroken as browser agents

P Kumar, E Lau, S Vijayakumar, T Trinh, SR Team, E Chang, V Robinson, ...

arXiv preprint arXiv:2410.13886, 2024

LLM defenses are not robust to multi-turn human jailbreaks yet

N Li, Z Han, I Steneker, W Primack, R Goodside, H Zhang, Z Wang, ...

arXiv preprint arXiv:2408.15221, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning

N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...

arXiv preprint arXiv:2403.03218, 2024

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...

arXiv preprint arXiv:2402.04249, 2024

2023

Can LLMs Follow Simple Rules?

N Mu, S Chen, Z Wang, S Chen, D Karamardian, L Aljeraisy, B Alomair, ...

arXiv preprint arXiv:2311.04235, 2023

Transfer attacks and defenses for large language models on coding tasks

C Zhang, Z Wang, R Mangal, M Fredrikson, L Jia, C Pasareanu

arXiv preprint arXiv:2311.13445, 2023

Universal and transferable adversarial attacks on aligned language models

A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson

arXiv preprint arXiv:2307.15043, 2023

Representation Engineering: A Top-Down Approach to AI Transparency

A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...

arXiv preprint arXiv:2310.01405, 2023

Agents, Web, and Evaluation

Benchmarks, monitoring, data contamination, and web-agent guardrails.
2025

Reliable Weak-to-Strong Monitoring of LLM Agents

N Kale, CBC Zhang, K Zhu, A Aich, P Rodriguez, SR Team, CQ Knight, ...

arXiv preprint arXiv:2508.19461, 2025

Search-time data contamination

Z Han, M Mankikar, J Michael, Z Wang

arXiv preprint arXiv:2508.13180, 2025

WebGuard: Building a generalizable guardrail for web agents

B Zheng, Z Liao, S Salisbury, Z Liu, M Lin, Q Zheng, Z Wang, X Deng, ...

arXiv preprint arXiv:2507.14293, 2025

SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

X Deng, J Da, E Pan, YY He, C Ide, K Garg, N Lauffer, A Park, N Pasari, ...

arXiv preprint arXiv:2509.16941, 2025

Robustness and Certification

Certified robustness, robust training, and adversarial robustness for vision models.
2023

A recipe for improved certifiable robustness: Capacity and data

K Hu, K Leino, Z Wang, M Fredrikson

ICLR 2024, 2023

Is Certifying Robustness Still Worthwhile?

R Mangal, K Leino, Z Wang, K Hu, W Yu, C Pasareanu, A Datta, ...

arXiv preprint arXiv:2310.09361, 2023

Unlocking Deterministic Robustness Certification on ImageNet

K Hu, A Zou, Z Wang, K Leino, M Fredrikson

Advances in Neural Information Processing Systems 36, 42993-43011, 2023

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Z Wang, N Ding, T Levinboim, X Chen, R Soricut

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2023

2022

On the perils of cascading robust classifiers

R Mangal, Z Wang, C Zhang, K Leino, C Pasareanu, M Fredrikson

arXiv preprint arXiv:2206.00278, 2022

2021

Globally-Robust Neural Networks

K Leino, Z Wang, M Fredrikson

Proceedings of ICML 2021, 2021

Robust models are more interpretable because attributions look normal

Z Wang, M Fredrikson, A Datta

arXiv preprint arXiv:2103.11257, 2021

2020

Towards frequency-based explanation for robust CNN

Z Wang, Y Yang, A Shrivastava, V Rawal, Z Ding

arXiv preprint arXiv:2005.03141, 2020

Interpretability and Explainability

Attribution, counterfactuals, graph explanations, BERT influence flow, and explainability tooling.
2023

On the Feature Alignment of Deep Vision Models: Explainability and Robustness Connected at the Hip

Z Wang

Carnegie Mellon University, 2023

2022

Faithful explanations for deep graph models

Z Wang, Y Yao, C Zhang, H Zhang, Y Kang, C Joe-Wong, M Fredrikson, ...

arXiv preprint arXiv:2205.11850, 2022

Exploring Conceptual Soundness with TruLens

A Datta, M Fredrikson, K Leino, K Lu, S Sen, R Shih, Z Wang

NeurIPS 2021 Competitions and Demonstrations Track, 302-307, 2022

2021

Machine learning explainability and robustness: connected at the hip

A Datta, M Fredrikson, K Leino, K Lu, S Sen, Z Wang

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data …, 2021

Consistent counterfactuals for deep models

E Black, Z Wang, M Fredrikson, A Datta

ICLR 2022, 2021

2020

Influence Patterns for Explaining Information Flow in BERT

K Lu, Z Wang, P Mardziel, A Datta

arXiv preprint arXiv:2011.00740, 2020

Reconstructing Actions To Explain Deep Reinforcement Learning

X Chen, Z Wang, Y Fan, B Jin, P Mardziel, C Joe-Wong, A Datta

arXiv preprint arXiv:2009.08507, 2020

Smoothed Geometry for Robust Attribution

Z Wang, H Wang, S Ramkumar, M Fredrikson, P Mardziel, A Datta

Proceedings of NeurIPS 2020, 2020

Interpreting interpretations: Organizing attribution methods by criteria

Z Wang, P Mardziel, A Datta, M Fredrikson

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2020

Score-CAM: Score-weighted visual explanations for convolutional neural networks

H Wang, Z Wang, M Du, F Yang, Z Zhang, S Ding, P Mardziel, X Hu

Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2020

Formal Methods and Systems

Neuro-symbolic reasoning, SMT grounding, and practical ML systems.
2024

VeriSplit: Secure and Practical Offloading of Machine Learning Inferences across IoT Devices

H Zhang, Z Wang, M Dhamankar, M Fredrikson, Y Agarwal

arXiv preprint arXiv:2406.00586, 2024

2023

Learning modulo theories

M Fredrikson, K Lu, S Vijayakumar, S Jha, V Ganesh, Z Wang

arXiv preprint arXiv:2301.11435, 2023

Grounding neural inference with satisfiability modulo theories

Z Wang, S Vijayakumar, K Lu, V Ganesh, S Jha, M Fredrikson

Advances in Neural Information Processing Systems 36, 22794-22806, 2023