Computer Vision

[Technical Report for CVPR’s 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework that constrains the output logits of frozen Multimodal Foundation Models (MMFMs), forcing them to reason before responding with structured outputs that downstream APIs can parse and use. The approach achieved the second-highest score on the hidden test set for Phase 2 and the third-highest score overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference.
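To illustrate what constraining output logits can look like in general, the following is a minimal sketch and not the report's implementation. Everything in it is a hypothetical stand-in: the toy vocabulary `VOCAB`, the hand-written state list `ALLOWED_PER_STEP`, and the function `fake_model_logits`, which plays the role of a frozen model. A real system would use the model's own tokenizer and derive the allowed tokens from a schema or grammar. The sketch places a "reasoning" field before the "answer" field, which is one plausible way to force reasoning before the response.

```python
import json
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ['{', '}', ':', ',', '"reasoning"', '"answer"', '"yes"', '"no"', 'hello']

# Hypothetical state machine: the set of tokens allowed at each decoding step.
# It only admits outputs shaped like {"reasoning":<value>,"answer":<value>}.
VALUES = {'"yes"', '"no"'}
ALLOWED_PER_STEP = [
    {'{'}, {'"reasoning"'}, {':'}, VALUES, {','},
    {'"answer"'}, {':'}, VALUES, {'}'},
]


def fake_model_logits(step):
    """Stand-in for a frozen model: it always prefers the free-text token."""
    scores = {'hello': 5.0, '"no"': 2.0, '"yes"': 1.0}
    return [scores.get(tok, 0.0) for tok in VOCAB]


def mask_logits(logits, allowed):
    """Set the logit of every disallowed token to -inf; model weights are untouched."""
    return [x if VOCAB[i] in allowed else -math.inf for i, x in enumerate(logits)]


def generate(constrained=True):
    """Greedy decoding, optionally with the per-step logit mask applied."""
    out = []
    for step, allowed in enumerate(ALLOWED_PER_STEP):
        logits = fake_model_logits(step)
        if constrained:
            logits = mask_logits(logits, allowed)
        best = max(range(len(logits)), key=logits.__getitem__)
        out.append(VOCAB[best])
    return ''.join(out)


if __name__ == '__main__':
    print(generate(constrained=False))      # free-form text, not valid JSON
    print(json.loads(generate()))           # parses into a dict with both fields
```

In this toy setting the unconstrained stand-in model emits only its preferred free-text token, while the masked version can emit nothing but a parseable object. The model's preferences still decide among the tokens the mask leaves available.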