Abstract
The deployment of vision-language models (VLMs) in dermatology is hindered by a trilemma of high computational cost, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a "frozen perception, adaptive reasoning" paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) loss, an objective that combines focal re-weighting, semantic alignment, and calibration. On the ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
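The abstract describes the CFA loss as a combination of focal re-weighting, semantic alignment, and calibration. A minimal scalar sketch of such a combined objective is shown below; the function names, term weights, and exact forms of the alignment and calibration terms are illustrative assumptions, not the paper's actual formulation.

```python
import math

def focal_weight(p_true: float, gamma: float = 2.0) -> float:
    # Focal re-weighting: easy examples (p_true near 1) are down-weighted,
    # which counteracts the dominance of head classes in long-tailed data.
    return (1.0 - p_true) ** gamma

def cfa_loss_sketch(p_true: float, align_sim: float, conf: float,
                    lambda_a: float = 0.5, lambda_c: float = 0.1) -> float:
    """Illustrative combination of the three terms named in the abstract.

    p_true    - model probability for the ground-truth class
    align_sim - similarity between visual region and clinical text embedding
                (assumed in [0, 1])
    conf      - the model's reported confidence, penalized when it drifts
                from p_true (a crude calibration proxy)
    """
    focal = focal_weight(p_true) * -math.log(p_true)   # focal cross-entropy
    align = 1.0 - align_sim                            # semantic alignment gap
    calib = (conf - p_true) ** 2                       # calibration penalty
    return focal + lambda_a * align + lambda_c * calib
```

Under this sketch, a well-classified, well-aligned, well-calibrated example incurs near-zero loss, while hard, misaligned, or overconfident examples dominate the gradient.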
| Original language | English |
|---|---|
| Title of host publication | The IEEE International Conference on Multimedia & Expo 2026 |
| Subtitle of host publication | ICME 2026 |
| Publisher | IEEE Press |
| Chapter | 1 |
| Pages | 1-6 |
| Number of pages | 6 |
| Publication status | Published - 5 Jul 2026 |
| Event | The IEEE International Conference on Multimedia & Expo 2026 (ICME 2026), Bangkok, Thailand. Duration: 5 Jul 2026 → 9 Jul 2026. https://2026.ieeeicme.org/ |
Conference
| Conference | The IEEE International Conference on Multimedia & Expo 2026 |
|---|---|
| Country/Territory | Thailand |
| City | Bangkok |
| Period | 5/07/26 → 9/07/26 |
| Internet address | https://2026.ieeeicme.org/ |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 3: Good Health and Well-being