D7z Menu V2 Link May 2026
With the proliferation of food delivery platforms, the need for accurate, automated menu digitization has never been greater. A typical menu is not merely a list of text; it is a complex document containing multi-modal information: dish names (text), descriptions (semantic), prices (numerical), and dietary labels (icons).
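To make the target of digitization concrete, the following is a minimal sketch of how the four kinds of information listed above might be represented once extracted. The class and field names (`Dish`, `Section`, `Menu`, `dietary_labels`) are illustrative assumptions, not the schema used by D7Z-Menu V2, and the grouping of dishes into sections is likewise assumed for the example.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class Dish:
    """One menu item; field names are hypothetical, chosen for illustration."""
    name: str                                   # dish name (text)
    description: str = ""                       # description (semantic)
    price: Decimal = Decimal("0")               # price (numerical)
    dietary_labels: List[str] = field(default_factory=list)  # icons mapped to tags


@dataclass
class Section:
    """A group of dishes under a heading (assumed structure)."""
    title: str
    dishes: List[Dish] = field(default_factory=list)


@dataclass
class Menu:
    sections: List[Section] = field(default_factory=list)

    def total_dishes(self) -> int:
        # Count dishes across every section.
        return sum(len(section.dishes) for section in self.sections)


# A made-up example menu, not data from any benchmark.
menu = Menu(sections=[
    Section(title="Starters", dishes=[
        Dish("Tomato Soup", "Roasted tomatoes with basil",
             Decimal("6.50"), ["vegan"]),
    ]),
])
```

Storing the price as a `Decimal` rather than a `float` avoids rounding surprises when prices are later summed or compared, which is one reason a plain list of text strings falls short for ordering systems.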
Abstract
The digitization of menu images remains a critical challenge in Document Intelligence, primarily due to the complex spatial layouts, diverse typography, and implicit semantic hierarchies (e.g., dishes nested under sections with pricing attributes). Existing Vision-Language Models (VLMs) often struggle with "hallucination" in zero-shot settings or fail to preserve the exact spatial hierarchies required for automated ordering systems. This paper introduces D7Z-Menu V2, a novel framework that utilizes a Decoder-Driven Zero-Refinement mechanism. Unlike traditional OCR-pipeline approaches, D7Z-Menu V2 treats menu parsing as a conditional generation task constrained by a structural grammar schema. We demonstrate that by shifting the refinement burden entirely to the decoder phase, without external retrieval augmentation, our model achieves state-of-the-art performance on the MenuOCR benchmark, significantly reducing structural errors while maintaining semantic integrity.
We compare against: