Solak, Ahmet
2025-07-10
2025-07-10
2025
1863-1703
1863-1711
https://doi.org/10.1007/s11760-025-04268-4
https://hdl.handle.net/20.500.13091/10155
Monkeypox has re-emerged as a global public health threat, particularly in regions lacking extensive laboratory infrastructure. To address the urgent need for rapid, non-invasive diagnosis, we introduce a hybrid deep-learning framework that fuses an instance-normalized vision transformer (IN-ViT) with ResNet-50. Our approach first applies instance normalization within each transformer encoder to stabilize per-patch feature statistics, then concatenates these global contextual embeddings with ResNet-50's locally extracted features via a lightweight multilayer perceptron. We evaluate performance on the publicly available Monkeypox Skin Lesion Dataset, comprising 3192 augmented images of monkeypox, chickenpox, and measles lesions, partitioned into 70% training, 10% validation, and 20% test sets. Against standalone baselines (VGG-16, VGG-19, ResNet-50, and a standard ViT), our IN-ViT + ResNet-50 ensemble achieves 96.26% accuracy, 96.35% precision, 96.26% recall, and 96.24% F1-score, an improvement of at least 1% over the prior state of the art. Crucially, the model sustains real-time inference (approximately 30 ms per image on a Tesla T4 GPU) and can be readily deployed in telemedicine or point-of-care screening. These results demonstrate that combining fine-grained instance normalization with feature-level fusion yields a robust and interpretable diagnostic tool.
Future work will explore federated learning for cross-site generalization, advanced data-augmentation regimes to mitigate class imbalance, and clinical validation across diverse patient populations.
en
info:eu-repo/semantics/closedAccess
Deep Learning
Ensemble Learning
Instance Normalization
Monkeypox Detection
Vision Transformer
Ensemble-Based Hybrid Deep Learning for Monkeypox Detection: Merging Instance-Normalized Transformers With CNNs for Enhanced Diagnostic Precision
Article
10.1007/s11760-025-04268-4
2-s2.0-105007452201