Back to Projects
Adversarial Prompt Detection in LLMs preview

Project

Adversarial Prompt Detection in LLMs

Built a real-time adversarial prompt detection pipeline using fine-tuned transformer models for identifying jailbreak and prompt-injection style attacks. The project demonstrates practical AI safety engineering with measurable performance and deployment potential.

Built a real-time adversarial prompt filter for GPT-4 using fine-tuned RoBERTa and DistilBERT models.
Evaluated the system on roughly 100K prompts and achieved strong accuracy and recall in malicious prompt detection.
Used AWS SageMaker for training and fine-tuning to harden the pipeline against evolving jailbreak methods.
LLMsSecurityPyTorch