Low-cost Agents with Language Perception and Dynamic Inference
Author(s)
Pan, Bowen
Thesis PDF (27.99 MB)
Advisor
Oliva, Aude
Abstract
Designing efficient artificial intelligence agents presents significant challenges, particularly in terms of learning and inference costs. Traditional agents often suffer from high learning expenses due to their limited ability to generalize across diverse tasks and environments. Recent advances in large language models (LLMs) have shown strong generalization capabilities by leveraging high-level abstractions of the world through language. In this thesis, we propose leveraging language as a perceptual representation to enable LLM-based agents to perform vision-language navigation tasks with reduced data collection costs. We demonstrate that language not only facilitates the generation of efficient synthetic data but also serves as a bridge to minimize domain gaps between different environments. However, transformer-based agents are burdened with high inference costs, especially when handling long-horizon visual content. To mitigate this, we introduce two strategies: (1) reducing visual input redundancy through dynamic token selection, and (2) accelerating model inference using a memory-efficient Mixture of Experts (MoE) architecture. Together, these approaches offer a robust framework for enhancing both learning and inference efficiency in LLM agents.
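The abstract's first inference-efficiency strategy, dynamic token selection, can be illustrated with a minimal sketch: score each visual token for saliency and keep only the top-k before passing them to the transformer. The function name, the linear scorer, and all shapes below are illustrative assumptions, not the thesis's actual method.

```python
import numpy as np

def select_tokens(tokens: np.ndarray, k: int, w: np.ndarray) -> np.ndarray:
    """Keep the k highest-scoring visual tokens.

    tokens: (n, d) array of patch-token embeddings
    w:      (d,) scoring weights -- a stand-in for a learned saliency scorer
    """
    scores = tokens @ w              # one saliency score per token
    keep = np.argsort(scores)[-k:]   # indices of the top-k tokens
    keep.sort()                      # preserve the original token order
    return tokens[keep]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))    # e.g. 14x14 patch tokens from a ViT
w = rng.standard_normal(64)
reduced = select_tokens(tokens, k=49, w=w) # keep 25% of the tokens
print(reduced.shape)                       # (49, 64)
```

Because downstream attention cost grows with sequence length, dropping 75% of the tokens in this sketch shrinks the quadratic attention term by roughly 16x, which is the kind of saving the abstract's redundancy-reduction strategy targets.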
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology