Audio-Visual Kinship Verification: a New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach
Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications, such as finding missing children, organizing family photos, and criminal investigations. Over the past decade, many efforts have been devoted to improving verification performance using faces alone, without other biometric information such as the speaking voice. In this paper, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. Since there is still no standard public audio-visual kinship dataset, we first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, that consists of talking-face videos of family members under various scenarios. Based on this dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep learning-based fusion method named Unified Adaptive Adversarial Multimodal Learning (UAAML). It consists of an adversarial network and an attention module built on unified multimodal features. First, modality adversarial learning eliminates cross-modality variations by confusing the discriminator. Second, the attention module quantifies the importance of kinship-related features. The overall multimodal fusion network is trained in a Siamese fashion to encourage compactness among kin pairs and separation of non-kin pairs. Experiments show that audio (voice) information is complementary to facial features and useful for kinship verification. Further, the proposed fusion method outperforms baseline methods. In addition, we evaluate human kinship verification ability on a subset of TALKIN-Family. The results indicate that humans achieve higher accuracy when they have access to both faces and voices. The machine learning methods can effectively and efficiently outperform human ability. Finally, we discuss future work and research opportunities with the TALKIN-Family dataset.
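As a rough illustration of the attention-weighted fusion and Siamese training objective mentioned in the abstract, the sketch below fuses a face embedding and a voice embedding with softmax weights and scores a pair with a contrastive loss. This is not the UAAML implementation: the abstract does not specify the attention form or the loss, so the scalar modality scores, the contrastive loss, the margin value, and all function names here are assumptions chosen for illustration, and the modality adversarial component is omitted entirely.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(face_feat, voice_feat, face_score, voice_score):
    """Hypothetical fusion: weight each modality by a softmax over two scalar scores."""
    w_face, w_voice = softmax([face_score, voice_score])
    return [w_face * f + w_voice * v for f, v in zip(face_feat, voice_feat)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Assumed Siamese objective: pull kin pairs together, push non-kin pairs beyond the margin."""
    d = euclidean(emb_a, emb_b)
    if is_kin:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In this sketch, a kin pair with identical fused embeddings incurs zero loss, and a non-kin pair is penalized only while its distance is smaller than the margin, which is one common way to express compactness for kin and separation for non-kin.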
Email Address of Submitting Author: xiaoting.firstname.lastname@example.org
Submitting Author's Institution: University of Oulu
Submitting Author's Country: Finland
Read the peer-reviewed publication in IEEE Transactions on Cybernetics