Low-cost Agents with Language Perception and Dynamic Inference
Author(s)
Pan, Bowen
Thesis PDF (27.99 MB)
Advisor
Oliva, Aude
Abstract
Designing efficient artificial intelligence agents presents significant challenges, particularly in terms of learning and inference costs. Traditional agents often suffer from high learning expenses due to their limited ability to generalize across diverse tasks and environments. Recent advances in large language models (LLMs) have shown strong generalization capabilities by leveraging high-level abstractions of the world through language. In this thesis, we propose leveraging language as a perceptual representation to enable LLM-based agents to perform vision-language navigation tasks with reduced data collection costs. We demonstrate that language not only facilitates the generation of efficient synthetic data but also serves as a bridge to minimize domain gaps between different environments. However, transformer-based agents are burdened with high inference costs, especially when handling long-horizon visual content. To mitigate this, we introduce two strategies: (1) reducing visual input redundancy through dynamic token selection, and (2) accelerating model inference using a memory-efficient Mixture of Experts (MoE) architecture. Together, these approaches offer a robust framework for enhancing both learning and inference efficiency in LLM agents.
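The abstract's first inference-efficiency strategy, dynamic token selection, can be illustrated with a minimal sketch: score each visual token for saliency and keep only the top-k before passing them to the transformer. The function name, the linear scorer, and all shapes below are illustrative assumptions, not the thesis's actual method.

```python
import numpy as np

def select_tokens(tokens: np.ndarray, k: int, w: np.ndarray) -> np.ndarray:
    """Keep the k highest-scoring visual tokens.

    tokens: (n, d) array of patch-token embeddings
    w:      (d,) scoring weights -- a stand-in for a learned saliency scorer
    """
    scores = tokens @ w              # one saliency score per token
    keep = np.argsort(scores)[-k:]   # indices of the top-k tokens
    keep.sort()                      # preserve the original token order
    return tokens[keep]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))    # e.g. 14x14 patch tokens from a ViT
w = rng.standard_normal(64)
reduced = select_tokens(tokens, k=49, w=w) # keep 25% of the tokens
print(reduced.shape)                       # (49, 64)
```

Because downstream attention cost grows with sequence length, dropping 75% of the tokens in this sketch shrinks the quadratic attention term by roughly 16x, which is the kind of saving the abstract's redundancy-reduction strategy targets.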
Date issued
2024-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology