MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Visual-Language-Action

Anonymous Submission

MAIN-VLA Framework Overview

Overview of MAIN-VLA. IA and ESA align instructions and visuals into sparse, actionable primitives during training. At inference, top-K token pruning filters redundancy for efficient, low-latency embodied behavior.

Abstract

Despite significant progress in Visual-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex, dynamic environments with unpredictable real-time interactions, such as 3D open worlds and large-scale PvP games. To address this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and higher inference efficiency.
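As a rough illustration of the top-K token pruning mentioned above, the sketch below keeps the K visual tokens that receive the most attention. The scoring rule is an assumption, since the text here does not specify it: each visual token is scored by its mean attention weight across heads and query positions. The function names `token_scores` and `topk_prune` are hypothetical and are not taken from MAIN-VLA. The procedure introduces no learned weights; K is a fixed inference-time budget.

```python
from typing import List, Sequence, Tuple


def token_scores(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Score each visual token by its mean received attention.

    attn[h][q][n] is the weight that query position q in head h places
    on visual token n. Averaging over heads and queries is an assumed
    scoring rule, not one stated in the source.
    """
    rows = [row for head in attn for row in head]
    n_tokens = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_tokens)]


def topk_prune(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the k highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 1 <= k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank indices by descending score; sorted() is stable, so ties go
    # to the token that appears earlier in the sequence.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])  # restore sequence order for the kept tokens
    return [tokens[i] for i in keep], keep
```

With six visual tokens and K = 2, only the two tokens that attract the most attention are passed on to later layers, so the remaining computation scales with K rather than with the full visual sequence length.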

Benchmark on Game for Peace

We establish a taxonomy of six atomic tasks that encapsulate the complete lifecycle of a battle royale match at an intermediate difficulty level (Gold and Silver tiers).

Precision Parachuting

Controlling descent trajectory to land within a minimal radius of a designated waypoint.

Resource Scavenging

Exploring the environment to identify and pick up essential loot such as weapons and armor.

Combat Engagement

Detecting adversaries and managing recoil to inflict lethal damage in encounters.

Teammate Revival

Identifying and reviving knocked-down teammates to restore their combat status.

Vehicle Acquisition

Locating and boarding available vehicles to secure strategic mobility across the battlefield.

Strategic Rotation

Navigating towards the shrinking safe zone while avoiding obstacles under strict time constraints.