Abstract
Test-time prompt tuning for vision-language models (VLMs) is attracting attention because of its ability to learn from unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to be poorly calibrated, which casts doubt on their reliability and trustworthiness. Calibration of test-time prompt tuning in VLMs therefore deserves more attention. To this end, we propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts in order to calibrate test-time prompt tuning in VLMs. In introducing these constraints, we make the following contributions. First, we uncover new insights into the suboptimal calibration performance of existing methods that rely on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective way to obtain textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. The results indicate that our method consistently outperforms the prior state of the art, significantly reducing the overall average calibration error. Our method also surpasses zero-shot calibration performance on fine-grained classification tasks.
Orthogonality Constraint
Textual features with lower cosine similarity (i.e., greater angular separation) between them lead to improved calibration, as indicated by a lower Expected Calibration Error (ECE). Armed with this insight, we propose to impose orthogonalization constraints on the textual features by enforcing angular distance between them. The resulting improvement in text feature separation allows us to utilize the feature space effectively.
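A minimal sketch of how such an orthogonality penalty could be computed is shown below. It assumes the penalty is the squared Frobenius distance between the Gram matrix of the L2-normalized class text features and the identity matrix. That is a common way to encourage orthogonality, and it is used here for illustration rather than as the exact loss used in O-TPT. The function name `orthogonality_penalty` and the toy inputs are hypothetical.

```python
import numpy as np


def orthogonality_penalty(text_features: np.ndarray) -> float:
    """Squared Frobenius distance between the cosine-similarity (Gram)
    matrix of the row vectors and the identity matrix.

    text_features: array of shape (num_classes, dim), one row per class.
    Assumed form of the penalty; not taken from the source.
    """
    norms = np.linalg.norm(text_features, axis=1, keepdims=True)
    unit = text_features / np.clip(norms, 1e-12, None)  # L2-normalize each row
    gram = unit @ unit.T                                 # pairwise cosine similarities
    identity = np.eye(gram.shape[0])
    return float(np.sum((gram - identity) ** 2))


# Toy inputs: three mutually orthogonal features vs. three identical ones.
orthogonal_features = np.eye(3, 8)
collapsed_features = np.tile(np.array([[1.0, 1.0, 0.0, 0.0]]), (3, 1))

print(orthogonality_penalty(orthogonal_features))  # close to 0
print(orthogonality_penalty(collapsed_features))   # close to 6 (six off-diagonal ones)
```

In a test-time tuning loop, a term of this kind would typically be added, with a weighting coefficient, to the prompt-tuning objective so that gradients push the class text features apart in angle.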

Comparison of angular optimization
Interestingly, C-TPT, which applies dispersion in the L2 space, also struggles to calibrate, showing higher cosine similarities in the challenging cases where TPT fails, as illustrated in the accompanying figure. In contrast, our method's orthogonalization constraint consistently produces text features with much lower and more consistent cosine similarities than CLIP initialization, resulting in better calibration overall. Our orthogonalization method enforces angular distance between feature pairs, fully utilizing the hyperspherical space. Promoting orthogonality thus enhances feature separation, leading to distinct class boundaries and improved calibration.
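The difference between L2 dispersion and angular separation can be seen in a toy example. The sketch below is not the paper's experiment. It assumes that L2 dispersion means the mean Euclidean distance of the features from their centroid, and it compares that quantity with the mean pairwise cosine similarity. Both helper names and the toy vectors are hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(features: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)  # mask out self-similarity
    return float(sim[off_diagonal].mean())


def l2_dispersion(features: np.ndarray) -> float:
    """Mean Euclidean distance of the rows from their centroid
    (assumed definition of dispersion in L2 space)."""
    centroid = features.mean(axis=0)
    return float(np.linalg.norm(features - centroid, axis=1).mean())


# Features that differ only in length: far apart in L2, identical in angle.
same_direction = np.array([[1.0, 0.0], [5.0, 0.0], [9.0, 0.0]])
# Mutually orthogonal unit features: maximally separated in angle.
orthogonal = np.eye(3)

for name, feats in [("same direction", same_direction), ("orthogonal", orthogonal)]:
    print(name, l2_dispersion(feats), mean_pairwise_cosine(feats))
```

In this toy case the collinear set has the larger L2 dispersion while its pairwise cosine similarity stays at 1, which illustrates why a dispersion objective in L2 space does not by itself guarantee angular separation.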

Comparison of calibration performance with CLIP-ViT-B/16 backbone
Using the CLIP-ViT-B/16 backbone, our method achieves an average ECE of 4.21, outperforming C-TPT at 5.13 and Robust Adapt's best result of 4.66.

Comparison of calibration performance with CLIP-RN50 backbone
When applied to the CLIP-RN50 backbone, our approach reduces ECE to 5.45, a substantial improvement over the 6.19 ECE achieved by C-TPT. Notably, our method also surpasses the zero-shot calibration performance, showing lower ECE on both backbones.
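For readers unfamiliar with the metric, the sketch below shows the standard binned ECE computation: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence in each bin is averaged, weighted by the fraction of samples in the bin. The choice of 10 bins and the function name are assumptions for illustration, and the evaluation code behind the numbers above may differ in its details. The function returns a fraction in [0, 1]; multiplying by 100 gives a percentage-style value.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE with equal-width bins over (0, 1].

    confidences: predicted top-1 probabilities, one per sample.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Overconfident toy model: 95% confidence but only 75% accuracy.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
# Calibrated toy model: 75% confidence and 75% accuracy.
print(expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0]))
```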

BibTeX
If you like our work, please consider citing us.
@InProceedings{Sharifdeen_2025_CVPR,
author = {Sharifdeen, Ashshak and Munir, Muhammad Akhtar and Baliah, Sanoojan and Khan, Salman and Khan, Muhammad Haris},
title = {O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {19942-19951}
}