DSpace@MIT


Searching for Efficient Multi-Stage Vision Transformers

Author(s)
Liao, Yi-Lun
Download
Thesis PDF (1.621 MB)
Advisor
Sze, Vivienne
Karaman, Sertac
Terms of use
In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Vision Transformer (ViT) demonstrates that the Transformer architecture from natural language processing can be applied to image classification and achieve performance comparable to convolutional neural networks (CNNs), which have been studied in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques from CNNs. To this end, we propose to incorporate two such techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction to decrease sequence lengths for deeper layers and utilize a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and use its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To train the super-network efficiently, we propose to sample and train multiple sub-networks with a single forward-backward pass for each batch of examples. After training the super-network, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate the effectiveness of ViT-ResNAS. Compared to the original DeiT, ViT-ResNAS-Tiny achieves 8.6% higher accuracy than DeiT-Ti with slightly more multiply-accumulate operations (MACs), and ViT-ResNAS-Small achieves accuracy similar to DeiT-B while requiring 6.3× fewer MACs and delivering 3.7× higher throughput. Additionally, ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than other strong ViT baselines such as PVT and PiT, as well as high-performance CNNs like RegNet and ResNet-RS.
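The residual spatial reduction described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's exact operations: the 2×2 average pooling, the channel duplication on the shortcut path, and the random stand-in weights are all simplifying assumptions, and the attention/FFN layers that surround the reduction in the real architecture are omitted.

```python
import numpy as np

def residual_spatial_reduction(x, h, w, rng):
    """Sketch: halve the token grid in each spatial dimension (4x shorter
    sequence) while doubling the channel width, with a skip connection
    around the learned transform."""
    n, c = x.shape
    assert n == h * w
    grid = x.reshape(h, w, c)
    # 2x2 average pooling over the token grid: n tokens -> n // 4 tokens
    pooled = grid.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    pooled = pooled.reshape(-1, c)                       # (n // 4, c)
    # learned transform that doubles the channel width (random stand-in weights)
    w_main = rng.standard_normal((c, 2 * c)) / np.sqrt(c)
    # skip path: duplicate channels so the shortcut matches the new width
    shortcut = np.concatenate([pooled, pooled], axis=1)  # (n // 4, 2c)
    return shortcut + pooled @ w_main                    # residual connection

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # 8x8 grid of 32-dim tokens
out = residual_spatial_reduction(tokens, 8, 8, rng)
print(out.shape)  # (16, 64): 4x fewer tokens, 2x wider channels
```

The skip connection is the point: without it, the reduction layer must learn the downsampled representation from scratch, which the abstract notes destabilizes training as networks get deeper.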
Date issued
2021-09
URI
https://hdl.handle.net/1721.1/140187
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
