ByteDance and USTC jointly proposed DocPedia, a large multimodal document model
DocPedia, a multimodal document model jointly developed by ByteDance and the University of Science and Technology of China (USTC), breaks through the resolution limits of existing models, handling document images at up to 2560×2560. By contrast, leading multimodal large models in the industry, such as LLaVA and MiniGPT-4, process images at a resolution of 336×336, which is too low to parse high-resolution document images. The research team took a new approach to address this shortcoming of existing models.
According to the team, DocPedia can not only accurately identify information in images but also draw on its knowledge base to answer questions according to user needs, demonstrating its ability to understand high-resolution multimodal documents.
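To see why the 336×336 inputs used by LLaVA- and MiniGPT-4-style encoders are too coarse for document pages, a minimal sketch (not from the DocPedia paper; the file name `sample_page.png` is hypothetical, and Pillow is assumed to be installed) works out the downscale factor and what it does to typical text sizes:

```python
from PIL import Image

ENCODER_SIZE = 336   # input resolution of LLaVA / MiniGPT-4 style vision encoders
DOC_SIZE = 2560      # resolution DocPedia is reported to handle

# Squeezing a full document page into the encoder input shrinks it by this factor.
scale = DOC_SIZE / ENCODER_SIZE
print(f"Downscale factor: {scale:.1f}x")          # ~7.6x

# A typical 12 px glyph on the original page drops below 2 px,
# leaving too few pixels for the encoder to recover the text.
print(f"12 px text becomes ~{12 / scale:.1f} px")  # ~1.6 px

# What a 336x336 model actually "sees" of a high-resolution page:
page = Image.open("sample_page.png")               # hypothetical input file
thumb = page.resize((ENCODER_SIZE, ENCODER_SIZE))
thumb.save("sample_page_336.png")
```

At roughly 7.6× downscaling, body text collapses to one or two pixels per glyph, which is why models capped at 336×336 cannot read dense document images, whatever their language ability.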