
Surface defect classification: leveraging transformer and transfer learning models for enhanced precision in industrial applications

  • Junqing Yang
  • Anwar P. P. Abdul Majeed*
  • Muhammad Ateeq
  • Zaid Omar
  • Rozita Jailani
  • Rabiu Muazu Musa
  • Yang Luo
  • Nafrizuan Mat Yahya

*Corresponding author for this work

  • University College London
  • Sunway University
  • Xi'an Jiaotong-Liverpool University
  • School of IoT
  • Universiti Teknologi Malaysia
  • Universiti Teknologi MARA
  • Universiti Malaysia Terengganu
  • Universiti Malaysia Pahang Al-Sultan Abdullah

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

This study compares the performance, training efficiency, and interpretability of Vision Transformer (ViT) and Convolutional Neural Network (CNN) architectures for automated classification of surface defects in hot-rolled steel strips using transfer learning.

The authors fine-tuned ViT and CNN models pretrained on ImageNet to classify six types of surface defects in the NEU Surface Defect Database (1800 images). Performance was assessed on test sets under two classification strategies: using a fully connected layer directly, and combining the pretrained models with additional classifiers, namely k-Nearest Neighbors (KNN), Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM). The architectures evaluated were ViT, DenseNet201, InceptionV3, VGG16, and VGG19. Model interpretability was analyzed using Grad-CAM for CNNs and attention maps for ViT. Training efficiency was assessed by model training time and inference speed, and classification performance by accuracy, F1-score, confusion matrices, and ROC curves.

Both CNN and ViT models achieved high accuracy, with DenseNet201 and ViT reaching 100% accuracy when using a fully connected classifier. ViT demonstrated superior feature extraction capabilities, as revealed by attention maps, highlighting its potential for complex defect patterns. ViT and DenseNet201 required significantly longer training time when using a fully connected classifier (270.49 s and 178.02 s, respectively), whereas pairing pretrained feature extractors with alternative classifiers such as SVM or Logistic Regression drastically reduced training time (e.g., 3.42 s for SVM on ViT and 0.36 s for LR), demonstrating the efficiency of combining deep feature representations with lightweight classifiers. VGG16 and VGG19 underperformed with KNN classifiers (79.6% and 83.3% accuracy, respectively). ROC curves confirmed near-perfect classification across all models.

These findings suggest ViT's higher capacity for industrial surface defect classification. Although training with a fully connected classifier takes longer, leveraging pretrained feature extractors with classifiers like SVM or Logistic Regression significantly reduces training time while maintaining high accuracy. ViT's ability to capture complex defect patterns more effectively suggests its potential for improved defect analysis in industrial applications.
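The second classification strategy in the abstract (a lightweight classifier on top of features from a frozen pretrained model) and the reported metrics can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The feature vectors are synthetic stand-ins for real ViT or CNN embeddings, and the function names are hypothetical. The use of macro averaging for the F1-score is an assumption, since the abstract does not state which averaging was used, and the class names come from the public description of the NEU database rather than from the abstract.

```python
import math
import random
from collections import Counter

# Class names from the public NEU dataset description (not listed in the abstract).
CLASSES = ["crazing", "inclusion", "patches",
           "pitted_surface", "rolled_in_scale", "scratches"]

def make_features(n_per_class, dim=8, noise=0.5, seed=0):
    """Synthetic stand-in for embeddings from a frozen pretrained backbone."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(len(CLASSES)):
        for _ in range(n_per_class):
            # Each class forms a cluster offset along its own axis.
            X.append([rng.gauss(10.0 if d == label else 0.0, noise)
                      for d in range(dim)])
            y.append(label)
    return X, y

def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def macro_f1(cm):
    """Unweighted mean of per-class F1 (averaging mode is an assumption)."""
    scores = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(len(cm))) - tp
        fn = sum(cm[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# "Training" k-NN is just storing the features, which is why it is cheap.
train_X, train_y = make_features(n_per_class=20, seed=0)
test_X, test_y = make_features(n_per_class=5, seed=1)

pred = [knn_predict(train_X, train_y, x) for x in test_X]
cm = confusion_matrix(test_y, pred, len(CLASSES))
print(f"accuracy={accuracy(cm):.3f}  macro-F1={macro_f1(cm):.3f}")
```

In a real pipeline, `make_features` would be replaced by a forward pass of the defect images through the pretrained network with its classification head removed. The classifier and metric code would not need to change.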

Original language: English
Pages (from-to): 4141-4152
Number of pages: 12
Journal: International Journal of Advanced Manufacturing Technology
Volume: 139
Issue number: 7-8
Publication status: Published - Aug 2025

Keywords

  • Convolutional Neural Network
  • Hot-rolled steel
  • Surface defect classification
  • Vision Transformer
