GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution

1Shahjalal University of Science and Technology 2Hanyang University 3Harvard University 4University of Cambridge
*Equal Contributions (Co-First Authors) #Corresponding author

Abstract

Reliable facial expression learning (FEL) involves effectively learning distinctive facial expression characteristics for more reliable, unbiased, and accurate predictions in real-life settings. However, current systems struggle with FEL tasks because of the variance in people's facial expressions arising from their unique facial structures, movements, tones, and demographics. Biased and imbalanced datasets compound this challenge, leading to wrong and biased prediction labels. To tackle these challenges, we introduce GReFEL, which leverages Vision Transformers and a facial geometry-aware, anchor-based reliability balancing module to combat imbalanced data distributions, bias, and uncertainty in facial expression learning. By integrating local and global data with anchors that learn different facial data points and structural features, our approach adjusts biased and mislabeled emotions caused by intra-class disparity, inter-class similarity, and scale sensitivity, resulting in comprehensive, accurate, and reliable facial expression predictions. Our model outperforms current state-of-the-art methodologies, as demonstrated by extensive experiments on various datasets.

This work was previously known as ARBEx; the extended version is published as GReFEL.

Architecture

Pipeline of GReFEL.

Heavy Augmentation is applied to the input images, and the Data Refinement method selects a training batch with a balanced class distribution for each epoch. The Window-Based Cross-Attention ViT framework uses multi-level feature extraction and integration to produce embeddings (feature vectors). A Linear Reduction Layer reduces the feature vector size for faster modeling. An MLP predicts the primary labels, and confidence is calculated from the label distribution. Reliability balancing receives the embeddings and processes them in two branches. The first branch places trainable anchors in the embedding space and improves the prediction probabilities by measuring similarities between embeddings and anchors. The second branch uses multi-head self-attention values to compute a label correction and confidence. A weighted average of the two branches gives the final label correction. Using the label correction, the primary label distribution, and the confidence, the final corrected label distribution is computed, making the model more reliable.
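
The sketch below illustrates the reliability-balancing idea in PyTorch. The module name, the number of anchors per class, the max-probability confidence, and the fusion weight alpha are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReliabilityBalancing(nn.Module):
    """Minimal sketch of anchor- and attention-based label correction (assumed design)."""

    def __init__(self, embed_dim=128, num_classes=8, anchors_per_class=10,
                 num_heads=4, alpha=0.5):
        super().__init__()
        # Trainable anchors that learn representative points per class in the embedding space.
        self.anchors = nn.Parameter(
            torch.randn(num_classes, anchors_per_class, embed_dim))
        # Multi-head self-attention branch for a second label correction.
        self.mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.attn_head = nn.Linear(embed_dim, num_classes)
        self.alpha = alpha  # weight for mixing the two corrections (assumed value)

    def anchor_correction(self, z):
        # Similarity of each embedding to every anchor via negative squared distance.
        # z: (B, D); anchors: (C, K, D)
        d = torch.cdist(z, self.anchors.flatten(0, 1))                  # (B, C*K)
        sim = (-d.pow(2)).reshape(z.size(0), *self.anchors.shape[:2])   # (B, C, K)
        # Best-matching anchor per class, then softmax over classes.
        return F.softmax(sim.max(dim=-1).values, dim=-1)                # (B, C)

    def attention_correction(self, z):
        # Treat each embedding as a one-token sequence for self-attention.
        h, _ = self.mhsa(z.unsqueeze(1), z.unsqueeze(1), z.unsqueeze(1))
        return F.softmax(self.attn_head(h.squeeze(1)), dim=-1)          # (B, C)

    def forward(self, z, primary_logits):
        primary = F.softmax(primary_logits, dim=-1)
        confidence = primary.max(dim=-1, keepdim=True).values  # simple max-prob confidence (assumed)
        correction = (self.alpha * self.anchor_correction(z)
                      + (1 - self.alpha) * self.attention_correction(z))
        # Low-confidence primary predictions lean more on the correction term.
        corrected = confidence * primary + (1 - confidence) * correction
        return corrected / corrected.sum(dim=-1, keepdim=True)


# Usage sketch
if __name__ == "__main__":
    rb = ReliabilityBalancing()
    z = torch.randn(4, 128)        # embeddings from the ViT + reduction layer
    logits = torch.randn(4, 8)     # primary MLP predictions
    print(rb(z, logits).shape)     # torch.Size([4, 8])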

Reliability Balancing

Observation of confidence probability distributions in GReFEL on the Aff-Wild2 dataset.

Eight different emotions (Neutral, Anger, Fear, Disgust, Happiness, Sadness, Surprise, and Other) are represented by the columns under each image, in that order. The Primary Distribution (PD) is the initial prediction, while the Corrected Distribution (CD) is the prediction after Reliability Balancing. The correct label after reliability balancing is marked in green, and the inaccurate primary prediction label is marked in yellow.
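
For illustration only, the toy snippet below shows how such a figure is read: the argmax of a primary distribution (PD) can point to the wrong emotion, while the corrected distribution (CD) after reliability balancing recovers the right one. The probability values are made up for this example and are not taken from the paper.

EMOTIONS = ["Neutral", "Anger", "Fear", "Disgust",
            "Happiness", "Sadness", "Surprise", "Other"]

# Hypothetical distributions for a single image (not real results).
pd = [0.22, 0.05, 0.04, 0.03, 0.30, 0.20, 0.10, 0.06]  # primary distribution
cd = [0.35, 0.05, 0.04, 0.03, 0.28, 0.12, 0.08, 0.05]  # corrected distribution

print("PD label:", EMOTIONS[pd.index(max(pd))])  # Happiness (mislabel, yellow)
print("CD label:", EMOTIONS[cd.index(max(cd))])  # Neutral (correct, green)
print("PD confidence (max prob):", max(pd))      # 0.30
print("CD confidence (max prob):", max(cd))      # 0.35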

BibTeX

@InProceedings{wasi2024GReFEL,
    author    = {Azmine Toushik Wasi and Taki Hasan Rafi and Raima Islam and Karlo Šerbetar and Dong-Kyu Chae},
    title     = {GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024}
}