Abstract
Transformers have been widely recognized as powerful tools for a broad
range of tasks, such as Natural Language Processing (NLP), Computer
Vision (CV), and Speech Recognition (SR), owing to their state-of-the-art
multi-head attention mechanism. Motivated by their rich architectural
variants and strong capacity for modeling input data, this survey begins
with the principal Transformer architectures, proceeds to an
investigation of their statistical mechanisms and inference, and then
reviews their applications to these dominant tasks. The underlying
statistical mechanisms merit study at a deeper level; accordingly, this
survey focuses on the mathematical foundations of Transformers and uses
these principles to analyze the reasons for their excellent performance
in many recognition scenarios.