BIG-MOE: BYPASS ISOLATED GATING MOE FOR GENERALIZED MULTIMODAL FACE ANTI-SPOOFING
Yingjie Ma, Zitong Yu, Xun Lin, Weicheng Xie, Linlin Shen
College of Computer Science and Software Engineering, Shenzhen University; Great Bay University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing
Abstract
In the domain of facial recognition security, multimodal Face Anti-Spoofing (FAS) is essential for countering presentation attacks. However, existing technologies encounter challenges due to modality biases and imbalances, as well as domain shifts. Our research introduces a Mixture of Experts (MoE) model to address these issues effectively. We identified three limitations in traditional MoE approaches to multimodal FAS: (1) Coarse-grained experts' inability to capture nuanced spoofing indicators; (2) Gated networks' susceptibility to input noise affecting decision-making; (3) MoE's sensitivity to prompt tokens leading to overfitting with conventional learning methods. To mitigate these, we propose the Bypass Isolated Gating MoE (BIG-MoE) framework, featuring: (1) Fine-grained experts for enhanced detection of subtle spoofing cues; (2) An isolation gating mechanism to counteract input noise; (3) A novel differential convolutional prompt bypass enriching the gating network with critical local features, thereby improving perceptual capabilities. Extensive experiments on four benchmark datasets demonstrate significant generalization improvement in the multimodal FAS task. The code is released at https://github.com/murInJ/BIGMoE.
Index Terms: Face Anti-Spoofing, Multimodal, Prompt Learning, Mixture of Experts
\section*{1. INTRODUCTION}
Face Recognition (FR) technology, celebrated for its efficiency and accuracy in applications such as security surveillance and mobile payments, now confronts escalating security threats from sophisticated face rendering attacks. Traditional FR systems struggle to discern these attacks, which include printed photos, video playback, and 3D masks, underscoring the urgent need for robust security measures.
To counter these threats, the research community has turned to Face Anti-Spoofing (FAS) techniques, which differentiate between genuine and spoofed faces [3]. Multimodal FAS methods [4, 5, 6, 7, 2, 8, 9], integrating information from RGB images, depth maps, and infrared images, have emerged as a promising line of defense; however, they still struggle with modality biases and imbalances, as well as domain shifts, which limit their generalization to unseen scenarios.
The Mixture of Experts (MoE) model, adept at handling complex data distributions, decomposes a large network into specialized smaller networks, reducing computational load through sparse activation and enhancing model generalization [13, 14, 15, 16]. This architecture excels in multi-task and multi-modal learning scenarios, especially with high-dimensional and heterogeneous data [17]. MoE has also shown excellent results for sparse representations in FAS tasks [18, 19, 20]. Building on this, our research integrates fine-grained experts [1] into the MoE framework for multimodal FAS tasks, improving the capture of detailed data features crucial for FAS performance [15, 16]. To counteract the vulnerability to input noise, we propose an Isolation Gating Mechanism, depicted in Fig. 1, which processes input vectors so as to robustly suppress the influence of noise on routing decisions.
We propose BIG-MoE, a novel multimodal FAS architecture that pioneers the application of MoE with fine-grained experts, enabling more effective extraction of subtle cues and integration of multimodal features.
The BIG-MoE framework features an Isolated Gating Mechanism to shield the model against input noise and includes a convolutional prompt bypass, which fortifies the gating network with essential cues, thereby enhancing the model’s robustness against overfitting and noise.
Extensive experiments demonstrate the reliability and superior performance of BIG-MoE for generalized multimodal FAS.
\section*{2. METHODOLOGY}
As shown in Fig. 2, our proposed Bypass Isolated Gating MoE (BIG-MoE) framework is fundamentally composed of a pre-trained Vision Transformer (ViT), coupled with a sophisticated prompt generation module, the Convolutional Prompt Bypass (CPB), and the Isolated Gating Mechanism Adapter (IGMA). Input data is transformed into visual prompt tokens by the prompt generation module, which are then enhanced by the CPB module. Concurrently, the input is processed through the ViT Encoder and IGMA, with the latter leveraging the
CPB’s visual prompts to augment gating perception. The aggregated outputs from both modules are fed into a classifier, and the predictions are refined by cross-entropy loss during backpropagation.
Traditional MoE architectures are constrained by routing overhead in fine-grained expert partitioning. The PEER [1] architecture, however, employs the Product Key Retrieval (PKR) technique to efficiently identify and retrieve the top-$k$ experts from a large pool for a $d$-dimensional input vector $\mathbf{x}$: low-dimensional sub-keys are used to construct a product key set, and inner-product calculations over these sub-keys select experts, thus reducing computational load while preserving accuracy. The gating network, parameterized by $\theta$, refines the expert outputs, incorporating a noise term $\epsilon$ during training, to yield the final gating decision $g(\mathbf{x})$.
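The product-key idea can be illustrated with a small sketch. The following is a simplified NumPy illustration, not the paper's implementation: an expert pool of size $N^2$ is indexed by pairs of sub-keys, so only $2N$ inner products (plus a small candidate merge) are needed instead of $N^2$; the shapes and the helper name `product_key_retrieval` are our own.

```python
import numpy as np

def product_key_retrieval(q, K1, K2, k=2):
    """PEER-style product key retrieval (illustrative sketch).

    q      : (d,) query vector, split into halves q1 and q2
    K1, K2 : (N, d//2) sub-key tables; expert (i, j) owns key [K1[i]; K2[j]]
    Scores factorize as s(i, j) = <q1, K1[i]> + <q2, K2[j]>, so the top-k
    of the N*N experts is found from only 2N inner products plus a k*k merge.
    """
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2 :]
    s1, s2 = K1 @ q1, K2 @ q2            # (N,) scores per sub-key table
    top1 = np.argsort(s1)[-k:]           # shortlist in each sub-space
    top2 = np.argsort(s2)[-k:]
    # the exact top-k lies inside the k*k candidate grid of shortlisted pairs
    cand = [(float(s1[i] + s2[j]), int(i), int(j)) for i in top1 for j in top2]
    cand.sort(reverse=True)
    return cand[:k]                      # [(score, i, j), ...], best first
```

Because the score of expert $(i, j)$ is a sum of two independent sub-scores, any pair in the global top-$k$ must use a sub-key from each shortlist, so the retrieval is exact despite the shortcut.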
The sensitivity of the gating network to input noise escalates as the number of fine-grained experts grows, and despite the introduction of training noise, this fails to optimize performance or fully exploit multimodal processing. We attribute this to the gating network's constrained feature perception caused by its low-dimensional sub-keys, which harms noise robustness. To address this, we introduce an Isolation Gating Mechanism (IGM) that separates the expert-processed vector from the gating vector, enabling a more nuanced nonlinear transformation that reduces the impact of noise and enhances system performance efficiently.
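As a rough illustration of the isolation idea (the projections, the tanh nonlinearity, and the expert format below are our assumptions, not the paper's exact layer), the routing decision can be computed from a separately projected vector so that the expert-input path and the gating path do not share a representation:

```python
import numpy as np

def isolated_gating(x, W_gate, W_expert_in, experts, k=2):
    """Isolation-gating sketch: the gating vector g and the expert-input
    vector v come from independent projections of x, so noise entering one
    path is not directly propagated to the other (illustrative only)."""
    g = np.tanh(W_gate @ x)              # gating vector: its own nonlinearity
    v = W_expert_in @ x                  # expert-input vector: isolated path
    logits = np.array([e["key"] @ g for e in experts])
    top = np.argsort(logits)[-k:]        # activate only the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                         # softmax over activated experts
    return sum(wi * (experts[int(i)]["W"] @ v) for wi, i in zip(w, top))
```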
Previous research has shown that routing selection in MoE models is sensitive to prompt tokens [16], which can introduce noise and limit the effectiveness of Prompt Learning when applied to MoE. To address this, we developed the CPB for the IGMA, utilizing Central Difference Convolution (CDC) [2] to enhance the extraction of local spoofing cues.
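Central Difference Convolution itself has a compact formulation: the output is a vanilla convolution minus a weighted central-difference term, which can be computed with one extra 1x1 convolution. A minimal PyTorch sketch (the class name is ours, and `theta=0.7` is the common default from the CDC literature, not a value stated in this paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC2d(nn.Module):
    """Central Difference Convolution (after Yu et al. [2]).
    theta = 0 recovers a plain convolution."""

    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)               # vanilla convolution term
        if self.theta == 0:
            return out
        # the central-difference term equals a 1x1 conv whose weight is the
        # spatial sum of the original kernel, applied to the center pixel
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        return out - self.theta * F.conv2d(x, w_sum)
```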
The CPB process begins by concatenating the multimodal inputs along the channel dimension to create clue prompts. With a preset probability, entire modal images are masked (set to zero), and this masking is integrated into the prompts as supplemental data. Static task-related prompts are acquired concurrently. These prompts are merged to form a comprehensive input prompt. Each layer's prompt is combined with the perceptive vector, forming an integrated perceptive vector that is input to the gating network. This fusion enhances perceptual stability, particularly for composite features exhibiting substantial representational variance.
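The modality-masking step can be sketched as follows; the masking probability `p`, the three-modality signature, and the function name are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def mask_modalities(rgb, depth, ir, p=0.3, rng=None):
    """Clue-prompt construction sketch: with probability p (a hypothetical
    rate), an entire modal image is zeroed before the modalities are
    concatenated along the channel dimension."""
    rng = rng or np.random.default_rng()
    kept = [m * (rng.random() >= p) for m in (rgb, depth, ir)]  # drop whole modalities
    return np.concatenate(kept, axis=0)  # (C_rgb + C_depth + C_ir, H, W)
```

Masking whole modalities (rather than individual pixels) forces the gating network not to over-rely on any single input stream.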
The PKR method is employed to partition the perceptive vector into two sub-spaces, avoiding interference from prompt semantics and enhancing perception stability. The combined perceptive vector is then processed through the attention mechanism, generating a new prompt for the next layer.
The Efficient Channel Attention (ECA) module within the CPB enriches IGMA with supplemental perceptive information, blending insights across layers to reduce gating sensitivity and bolster the model's performance and stability.
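ECA is a lightweight, well-known module: channel descriptors from global average pooling are mixed by a small 1-D convolution and squashed into per-channel weights. A minimal PyTorch sketch (kernel size 3 is a typical default; how the CPB wires ECA into IGMA is not reproduced here):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: pooled channel descriptors are mixed by
    a cheap 1-D convolution and squashed into per-channel weights."""

    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # (B, C) global average pool
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel mixing
        w = self.sigmoid(y)                        # weights in (0, 1)
        return x * w[:, :, None, None]             # re-weight each channel
```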
\section*{3. EXPERIMENT}
\subsection*{3.1. Data and Evaluation Metrics}
In this study, we followed the MMDG’s Protocols 1 and 3 [12], applying a Leave-One-Out (LOO) test on fixed modalities: S (SURF) [24], P (PADISI USC) [23], C (CeFA) [22], and W (WMCA) [25]. Performance was measured using Half Total Error Rate (HTER) and Area Under the Receiver Operating Characteristic Curve (AUC).
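For reference, HTER is simply the average of the false acceptance rate and the false rejection rate at a decision threshold. A minimal sketch (the "higher score means live" convention is our assumption):

```python
import numpy as np

def hter(scores_live, scores_spoof, threshold):
    """Half Total Error Rate at a fixed threshold."""
    frr = np.mean(scores_live < threshold)    # live faces rejected as spoof
    far = np.mean(scores_spoof >= threshold)  # spoof faces accepted as live
    return 0.5 * (far + frr)
```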
\subsection*{3.2. Implementation Details}
All input images were standardized in size, segmented into patches, and fed into the ViT, whose token hidden dimension is 768. We trained the model using the Adam optimizer (with weight decay) over 100 epochs with a batch size of 32. The classifier was a single fully connected layer reducing the class-token output from 768 to 2. The model was based on a ViT-Base pretrained on ImageNet, with the IGMA structure featuring 2 activated experts per head, 1600 experts in total, a hidden dimension of 8, and a 64-dimensional CPB.
\subsection*{3.3. Cross-testing Results}
Sufficient Source Domains Scenario. The results in Table 1 highlight our model's state-of-the-art performance (3 out of 4 sub-protocols). In particular, in the 'PSW' setting, our model lowers HTER and raises AUC markedly relative to the baseline. These improvements are a testament to the BIG-MoE architecture's superiority in handling generalized multimodal FAS tasks, indicating the excellent generalization capacity of our model across unseen scenarios.
Fig. 3: Ablation study on expert numbers and activations. (a) HTER with Varying Numbers of Activated Experts. (b) HTER with Different Total Expert Counts. The ablation study investigates the impact of expert count and activation on model performance, providing insights into the optimal configuration for expert utilization in the model.
Limited Source Domains Scenario. The results of 'PS → CW' in Table 2 likewise demonstrate our model's superior generalization under limited source domains, with enhanced multimodal generalization over 'ViT (Baseline)'. Our model leads in state-of-the-art AUC and HTER, highlighting its outstanding ability to generalize in limited-source-domain scenarios.
\subsection*{3.4. Ablation Study}
To validate the rationality and effectiveness of BIG-MoE, we conducted careful ablation experiments. These evaluated the impact of prompt settings on model performance, comparing BIG-MoE with a Vision Transformer, a coarse-grained MoE (ST-MoE), and a fine-grained MoE (PEER) to highlight the advantages of our CPB and IGMA. We tested prompt settings with individual prompt components and with the full setup. All prompt configurations improved performance, substantiating the rationality and effectiveness of our approach and demonstrating BIG-MoE's potential to enhance model capabilities, providing insights for future work.
Impact of Experts’ Granularity. Fig. 3 indicates that while moderate increases in IGMA granularity enhance the Adapter’s performance, overly fine granularity can lead to a decline in effectiveness. This suggests a critical trade-off: granularity must be judiciously adjusted to maximize system performance, underscoring the need for a balanced granularity strategy in model optimization.
Effectiveness of IGMA. The 'w/ IGMA+CPB' results in Table 3 indicate that, with the help of perceptual cues, IGMA achieves a clear HTER reduction and AUC increase over the results of PEER [1]. The integration of fine-grained experts with cues in the IGMA framework maximizes performance, surpassing the benefits of prompts alone; the framework is indispensable for achieving optimal results.
Effectiveness of CPB. The results from 'w/ IGMA' to 'BIG-MoE' in Table 3 delineate the significant performance enhancement attributable to each prompt element, thereby validating our design rationale. These findings not only demonstrate the synergistic effects across modalities and features, but also highlight the substantial refinement in cue detection and decision-making capabilities afforded by an optimal prompt combination.

Table 3: Ablation results on the proposed BIG-MoE (protocol CPW → S).

\begin{tabular}{lcc}
Method & HTER(\%) & AUC(\%) \\
\hline
ViT [30] (Baseline) & 20.88 & 84.77 \\
w/ ST MoE [15] & 14.31 & 88.69 \\
w/ PEER [1] & 22.34 & 84.97 \\
w/ IGMA & 21.12 & 85.50 \\
w/ IGMA+CPB (With ) & 20.55 & 88.41 \\
w/ IGMA+CPB (With ) & 10.44 & 93.87 \\
BIG-MoE (Ours) & & \\
\end{tabular}

Fig. 4: t-SNE visualization when respectively tested on CeFA, PADISI, SURF, and WMCA domains.
\subsection*{3.5. Visualization and Analysis}
t-SNE was used for dimensionality reduction and visualization of the learned features, effectively comparing the ViT baseline and BIG-MoE. Fig. 4 illustrates BIG-MoE's stronger class separation, aided by the CPB in capturing fine feature differences. However, variations in feature representation across training datasets indicate that optimizing multi-domain training samples is still needed to further enhance cross-domain generalizability.
\section*{4. CONCLUSION}
This paper introduces BIG-MoE, integrating the Isolated Gating Mechanism Adapter and Convolutional Prompt Bypass for generalized multimodal face anti-spoofing (FAS). The former detects subtle spoofing cues with fine-grained experts and efficient key retrieval, while the latter extracts local features and boosts model perception via attention mechanisms. Our method demonstrates superior performance in generalized multimodal FAS through extensive experiments. Future work will focus on improving MoE’s generalization with limited samples and in multimodal settings.
Acknowledgement. This work was supported by the National Natural Science Foundation of China (Grant No. 62306061, 82261138629, 62276170), the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515140037, 2023A1515010688), the Open Fund of the National Engineering Laboratory for Big Data System Computing Technology (Grant No. SZU-BDSC-OF2024-02), and the Guangdong Provincial Key Laboratory under Grant 2023B1212060076.
\section*{5. REFERENCES}
[1] Xu Owen He, “Mixture of a million experts,” arXiv preprint arXiv:2407.04153, 2024.
[2] Zitong Yu, Yunxiao Qin, Xiaobai Li, Zezheng Wang, Chenxu Zhao, Zhen Lei, and Guoying Zhao, “Multimodal face anti-spoofing based on central difference networks,” in CVPR, 2020, pp. 650-651.
[3] Zitong Yu, Yunxiao Qin, Xiaobai Li, Chenxu Zhao, Zhen Lei, and Guoying Zhao, “Deep learning for face anti-spoofing: A survey,” TPAMI, vol. 45, no. 5, pp. 5609-5631, 2022.
[4] Anjith George and Sébastien Marcel, “Cross modal focal loss for rgbd face anti-spoofing,” in CVPR, 2021, pp. 7882-7891.
[5] Anjith George and Sébastien Marcel, “Learning one class representations for face presentation attack detection using multi-channel convolutional neural networks,” TIFS, vol. 16, pp. 361-375, 2020.
[6] Ajian Liu and Yanyan Liang, "Ma-vit: Modality-agnostic vision transformers for face anti-spoofing," arXiv preprint arXiv:2304.07549, 2023.
[7] Ajian Liu, Zichang Tan, Zitong Yu, Chenxu Zhao, Jun Wan, Yanyan Liang, Zhen Lei, Du Zhang, Stan Z Li, and Guodong Guo, "Fm-vit: Flexible modal vision transformers for face anti-spoofing," TIFS, vol. 18, pp. 4775-4786, 2023.
[8] Zitong Yu, Rizhao Cai, Yawen Cui, Ajian Liu, and Changsheng Chen, “Visual prompt flexible-modal face anti-spoofing,” arXiv preprint arXiv:2307.13958, 2023.
[9] Zitong Yu, Rizhao Cai, Yawen Cui, Xin Liu, Yongjian Hu, and Alex C Kot, “Rethinking vision transformer and masked autoencoder in multimodal face anti-spoofing,” IJCV, pp. 1-22, 2024.
[10] Zitong Yu, Ajian Liu, Chenxu Zhao, Kevin HM Cheng, Xu Cheng, and Guoying Zhao, “Flexible-modal face anti-spoofing: A benchmark,” in CVPR, 2023, pp. 6346-6351.
[11] Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng, "Provable dynamic fusion for low-quality multimodal data," in ICML. PMLR, 2023, pp. 41753-41769.
[12] Xun Lin, Shuai Wang, Rizhao Cai, Yizhong Liu, Ying Fu, Wenzhong Tang, Zitong Yu, and Alex Kot, “Suppress and rebalance: Towards generalized multi-modal face anti-spoofing,” in CVPR, 2024, pp. 211-221.
[13] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby, “Scaling vision with sparse mixture of experts,” NeurIPS, vol. 34, pp. 8583-8595, 2021.
[14] William Fedus, Barret Zoph, and Noam Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," JMLR, vol. 23, no. 120, pp. 1-39, 2022.
[15] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus, “St-moe: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022.
[16] Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You, “Openmoe: An early effort on open mixture-of-experts language models,” arXiv preprint arXiv:2402.01739, 2024.
[17] Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, and Bo Li, "Multi-task dense prediction via mixture of low-rank experts," in CVPR, 2024, pp. 27927-27937.
[18] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Ran Yi, Shouhong Ding, and Lizhuang Ma, “Adaptive mixture of experts learning for generalizable face anti-spoofing,” in ACM MM, 2022, pp. 6009-6018.
[19] Chenqi Kong, Anwei Luo, Song Xia, Yi Yu, Haoliang Li, and Alex C Kot, “Moe-ffd: Mixture of experts for generalized and parameter-efficient face forgery detection,” arXiv preprint arXiv:2404.08452, 2024.
[20] Ajian Liu, "Ca-moeit: Generalizable face anti-spoofing via dual cross-attention and semi-fixed mixture-of-expert," IJCV, pp. 1-14, 2024.
[21] Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, and Huchuan Lu, “Visual prompt multi-modal tracking,” in CVPR, 2023, pp. 9516-9526.
[22] Ajian Liu, Zichang Tan, Jun Wan, Sergio Escalera, Guodong Guo, and Stan Z Li, “Casia-surf cefa: A benchmark for multi-modal cross-ethnicity face antispoofing,” in ICCV, 2021, pp. 1179-1187.
[23] Mohammad Rostami, Leonidas Spinoulas, Mohamed Hussein, Joe Mathai, and Wael Abd-Almageed, “Detection and continual learning of novel face presentation attacks,” in ICCV, 2021, pp. 14851-14860.
[24] Shifeng Zhang, Ajian Liu, Jun Wan, Yanyan Liang, Guodong Guo, Sergio Escalera, Hugo Jair Escalante, et al., "Casia-surf: A large-scale multi-modal benchmark for face anti-spoofing," IEEE T-BIOM, vol. 2, no. 2, pp. 182-193, 2020.
[25] Anjith George, Zohreh Mostaani, David Geissbuhler, Olegs Nikisins, André Anjos, and Sébastien Marcel, "Biometric face presentation attack detection with multi-channel convolutional neural network," TIFS, vol. 15, pp. 42-55, 2019.
[26] Yunpei Jia, Jie Zhang, Shiguang Shan, and Xilin Chen, “Single-side domain generalization for face antispoofing,” in CVPR, 2020, pp. 8484-8493.
[27] Zhuo Wang, Zezheng Wang, Zitong Yu, Weihong Deng, Jiahong Li, Tingting Gao, and Zhongyuan Wang, “Domain generalization via shuffled style assembly for face anti-spoofing,” in CVPR, 2022, pp. 4123-4133.
[28] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Xuequan Lu, Ran Yi, Shouhong Ding, and Lizhuang Ma, “Instance-aware domain generalization for face antispoofing,” in CVPR, 2023, pp. 20453-20463.
[29] Hsin-Ping Huang, Deqing Sun, Yaojie Liu, Wen-Sheng Chu, Taihong Xiao, Jinwei Yuan, Hartwig Adam, and Ming-Hsuan Yang, "Adaptive transformers for robust few-shot cross-domain face anti-spoofing," in ECCV. Springer, 2022, pp. 37-54.
[30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.