Project
Vision Models for Multimodal Interaction
Worked on vision-language functionality for region-based image understanding and composition. The project focused on controllable generation, fine-tuning, and practical multimodal interaction pipelines.
Developed region-based image understanding workflows for precise, bounding-box-driven interaction.
Implemented image composition and model-guided editing, tested on 500+ images.
Applied supervised fine-tuning with multimodal data to improve success rates on composition tasks.
Computer Vision, Multimodal, AWS