Recommender System Architecture: Caterpie Generated by Blastoise
Shanzai CL
Published on: 2021-08-12
Abstract
The motivation for this paper is to measure how much performance can be gained in real application scenarios when each Transformer layer is allowed its own architecture and hyper-parameter settings, without regard to resource-consumption constraints. First, we build a search space from the components of the evolutionary version of DynaBERT and, through neural architecture search (NAS), obtain a new Transformer consisting of six new transformer blocks in the encoder and six in the decoder. We integrate a series of recent techniques into this new Transformer and develop it into a new model, BLASTOISE. Second, in our experiments BLASTOISE consistently improves over the Transformer on four data sets: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B, and it consistently outperforms the latest models. We then use the encoder of BLASTOISE as a BERT-style model (CATERPIE); extensive experiments on four benchmark data sets show that it consistently outperforms the latest sequential models. Finally, we test the model on a recommendation data set and show that it is effective.
Keywords
BERT; DynaBERT; Transformer; Recommender system
Introduction
The work in this paper examines whether simply stacking identical sub-modules is the best design, by allowing the sub-modules inside the Transformer to differ. We use an evolutionary algorithm based on tournament selection, starting from the hand-designed DynaBERT network and combining it with a network search algorithm to obtain an evolutionary version of DynaBERT. Using the components of this DynaBERT as the search space, NAS initially yields a new Transformer consisting of six new transformer blocks in the encoder and six in the decoder. The new transformer structure obtained through DynaBERT achieves better performance than the original Transformer. To speed up model training, we adopt a low-rank matrix approximation to implement a new self-attention mechanism. At the same time, an optimized multi-head integration method extracts common information and shares it across all attention heads, so that each head can focus on capturing unique information. We use the LAMB optimizer, and we adopt RealFormer, a simple residual-attention-layer Transformer architecture.
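To illustrate the low-rank attention mentioned above, the following is a minimal PyTorch sketch of a Linformer-style self-attention layer that projects the key/value sequence length down to a small rank k. The class name, the projection length k, and all hyper-parameters are illustrative assumptions, not the exact mechanism used in BLASTOISE.

```python
# Minimal sketch of low-rank (Linformer-style) self-attention in PyTorch.
# Assumption: the rank k and the sequence-length projections E and F are
# illustrative; the paper does not specify its exact factorization.
import math
import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, seq_len: int, k: int = 64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Low-rank projections that compress the sequence dimension from
        # seq_len down to k, so attention costs O(n * k) instead of O(n^2).
        self.E = nn.Linear(seq_len, k, bias=False)  # compresses keys
        self.F = nn.Linear(seq_len, k, bias=False)  # compresses values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the length must match seq_len above.
        b, n, d = x.shape
        q = self.q_proj(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        # Project the length dimension of keys and values down to rank k.
        k = self.E(k.transpose(-1, -2)).transpose(-1, -2)  # (b, h, k, d_head)
        v = self.F(v.transpose(-1, -2)).transpose(-1, -2)  # (b, h, k, d_head)
        attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(self.d_head), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)

# Example: a batch of 2 sequences of length 128 with model width 512.
layer = LowRankSelfAttention(d_model=512, n_heads=8, seq_len=128, k=64)
y = layer(torch.randn(2, 128, 512))  # -> shape (2, 128, 512)
```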
In deep learning, a model usually reuses the same parameters for all inputs. Mixture-of-experts (MoE) models depart from this and select different parameters for each incoming example. The result is a sparsely activated model whose parameter count is enormous but whose computational cost stays constant. Although MoE has achieved some notable successes, its complexity, communication costs, and unstable training have prevented widespread adoption. Google addressed these problems with the Switch Transformer, simplifying the MoE routing algorithm and designing intuitively improved models that reduce communication and computing costs. We integrated this technique into our previously proposed model, SQUIRTLE, and developed it into a new model, BLASTOISE. The architecture found in our experiments shows consistent improvement over the Transformer on four recognized language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech, and LM1B, and it consistently outperforms the latest models. We then used the encoder of BLASTOISE as a BERT-style model (CATERPIE); extensive experiments on four benchmark data sets showed that it consistently outperforms the latest sequential models. Finally, we tested the model on a recommendation data set and showed that it is effective.
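To make the routing idea concrete, below is a minimal PyTorch sketch of Switch-style top-1 expert routing. The module name and expert sizes are illustrative, and capacity limits plus the auxiliary load-balancing loss of the real Switch Transformer are omitted; this is a sketch of the technique, not the implementation used in BLASTOISE.

```python
# Minimal sketch of Switch-style top-1 expert routing in PyTorch.
# Assumptions: no capacity factor, no load-balancing loss, simple MLP experts.
import torch
import torch.nn as nn

class SwitchFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        tokens = x.reshape(-1, d)                        # (b*n, d)
        probs = torch.softmax(self.router(tokens), -1)   # routing probabilities
        gate, expert_idx = probs.max(dim=-1)             # top-1 routing per token
        out = torch.zeros_like(tokens)
        # Each token is processed by exactly one expert, so parameters grow
        # with n_experts while per-token compute stays roughly constant.
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(b, n, d)
```

Because each token activates only one expert, adding experts increases the parameter count while keeping per-token compute roughly constant, which is the sparsity property described above.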