

Humanity has automated the prose of science, but the pictures have remained a manual struggle, until now.
While Large Language Models (LLMs) have successfully automated the "thinking and writing" phases of the research lifecycle (from literature reviews to hypothesis generation), a critical labor-intensive bottleneck persists in the visual communication of these ideas.
This represents a crucial evolution in autonomous-agent architectures, moving from text-based reasoning to multi-modal creation.
The Hook: The Visual Bottleneck
For researchers, the transition from a technical breakthrough to a high-fidelity methodology diagram remains a manual struggle. Legacy tools like TikZ, PowerPoint, or Python-PPTX require significant manual effort and often lack the expressiveness needed for the specialized icons and custom graph topologies common in modern AI publications. Scientists frequently find themselves spending hours on graphic design rather than core discovery, highlighting a clear gap in the path toward fully autonomous research.
The Problem
Current autonomous AI scientists can draft an entire manuscript in minutes, yet they struggle to generate illustrations that adhere to the rigorous standards of top-tier academic venues.
The Solution: What is PaperBanana?
PaperBanana is an Agentic Framework designed to bridge the gap between technical descriptions and professional-grade visuals. We formalize the task of automated academic illustration as learning a mapping from a source context C and a communicative intent I to a visual representation V, optionally augmented by a set of reference examples R:

f : (C, I, R) → V
The framework operates as a Reference-Driven, collaborative multi-agent system. It is powered by a hybrid stack: Gemini-3-Pro serves as the VLM judge and agentic backbone, while Nano-Banana-Pro functions as the specialized image generation model.
By leveraging an Agentic Workflow that incorporates Self-Critique, PaperBanana transforms raw methodology sections and figure captions into Publication-Ready assets.
Technical Deep Dive: The 4-Agent Workflow
PaperBanana orchestrates a specialized team to ensure every illustration meets scholarly standards for both content fidelity and visual aesthetics.
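The orchestration of the four agents described below can be sketched as follows. This is a minimal, hypothetical harness: the agent function signatures and the `Figure` container are assumptions for illustration, not the framework's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Figure:
    description: str          # textual plan produced by the Planning Agent
    image: Optional[bytes] = None  # rendered output from the Rendering Agent

def run_pipeline(context: str, intent: str,
                 retrieve: Callable, plan: Callable,
                 render: Callable, review: Callable,
                 max_rounds: int = 3) -> Figure:
    refs = retrieve(context, intent)                 # 1. Retrieval Agent
    figure = Figure(description=plan(context, intent, refs))  # 2. Planning Agent
    for _ in range(max_rounds):                      # 4. Review Agent closes the loop
        figure.image = render(figure.description)    # 3. Rendering Agent
        feedback = review(figure, context)
        if feedback is None:                         # no issues found: done
            break
        figure.description += f"\nRevision note: {feedback}"
    return figure
```

Each stage is passed in as a callable, so the same skeleton works whether the stages are backed by a VLM, an image model, or a code generator.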
1. Retrieval Agent: Finding the Right References
The workflow begins with “generative retrieval” from a dedicated reference set of 292 valid samples. Rather than matching by topic alone, the Retrieval Agent prioritizes:
- Visual Structure: Distinguishing between sequential pipelines and hierarchical architectures.
- Diagram Type: Identifying whether the user needs a flow chart, a plot, or a schematic.
By identifying references whose logical composition matches the user’s intent, the agent provides a concrete structural foundation for the downstream generation.
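A simple way to picture this prioritization is a weighted match over structural metadata. The field names (`diagram_type`, `visual_structure`) and weights below are assumptions for illustration, not the framework's actual retrieval schema.

```python
# Rank reference figures by how well their structural metadata matches
# the user's intent; diagram type is weighted above visual structure.
def score_reference(ref: dict, intent: dict) -> int:
    score = 0
    if ref.get("diagram_type") == intent.get("diagram_type"):
        score += 2   # flow chart vs. plot vs. schematic
    if ref.get("visual_structure") == intent.get("visual_structure"):
        score += 1   # sequential pipeline vs. hierarchical architecture
    return score

def retrieve(references: list, intent: dict, k: int = 3) -> list:
    # Return the k references whose logical composition best matches the intent.
    return sorted(references, key=lambda r: score_reference(r, intent),
                  reverse=True)[:k]
```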
2. Planning Agent: Deciding Content and Style
The Planning Agent acts as the cognitive core, translating raw data into a detailed textual description via in-context learning. This phase incorporates the Stylist Agent, which traverses the entire reference collection to automatically synthesize a comprehensive Aesthetic Guideline.
Similar to how Neural World Models plan physical interactions, the Planning Agent constructs a visual strategy before pixels are rendered.
This guideline defines community standards for:
- Color palettes (e.g., maximizing readability and professional tone)
- Typography (consistent font usage)
- Layout (hierarchical organization)
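One minimal way to sketch this synthesis is a majority vote over style attributes observed across the reference set. The attribute names below are illustrative assumptions, not the framework's actual guideline schema.

```python
from collections import Counter

# Tally style attributes across the reference collection and keep the
# dominant choice for each dimension as the community standard.
def synthesize_guideline(references: list) -> dict:
    guideline = {}
    for field in ("palette", "font", "layout"):
        counts = Counter(r[field] for r in references if field in r)
        if counts:
            guideline[field] = counts.most_common(1)[0][0]
    return guideline
```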
3. Rendering Agent: Bringing the Image to Life
The Rendering Agent (or Visualizer) transforms the stylistically optimized description into visual output.
- For methodology diagrams, it utilizes Nano-Banana-Pro to synthesize complex shapes and icons.
- For statistical plots, the agent pivots to a code-based paradigm, generating executable Matplotlib code.
Precise Visualization
Generating code instead of pixels prevents “numerical hallucinations,” where bars or data points might be drawn at inaccurate heights relative to axis ticks. This ensures the mathematical precision required for data visualization.
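A minimal example of the code-based paradigm: because the bar heights come straight from the data, they cannot drift relative to the axis ticks the way pixels from an image model can. The scores below are made-up illustrative values, not benchmark numbers.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display required
import matplotlib.pyplot as plt

# Illustrative data only; each bar is drawn at exactly its data value.
scores = {"Method A": 48.0, "Method B": 55.5, "Ours": 63.2}

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(scores.keys(), scores.values(), color="#4C72B0")
ax.set_ylabel("Overall score (%)")
ax.bar_label(bars, fmt="%.1f")  # annotate each bar with its exact value
fig.tight_layout()
fig.savefig("benchmark.png", dpi=300)
```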
4. Review Agent: The Power of Iterative Self-Critique
To combat "visual hallucinations" and logical inconsistencies in graph topology, PaperBanana employs the Review Agent as a critic in a closed-loop refinement mechanism.
The process runs for up to T = 3 rounds of feedback. The Review Agent inspects the generated image against the source context to identify factual misalignments or glitches, prompting the Rendering Agent to regenerate until the illustration meets high academic standards.
This self-critique loop is a prime example of the “Agentic Inflection,” where systems verify and correct their own outputs to ensure reliability.
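A concrete topology check of this kind can be sketched as a set comparison between the arrows stated in the source context and those detected in the rendered figure. Extracting the rendered edges would require a VLM call in practice; the function below only shows the comparison step, and its interface is an assumption.

```python
from typing import Optional

# Compare the graph topology the source context demands against the
# topology detected in the rendered image; return feedback or None if clean.
def critique_topology(stated_edges: set, rendered_edges: set) -> Optional[str]:
    missing = stated_edges - rendered_edges    # arrows the figure failed to draw
    spurious = rendered_edges - stated_edges   # arrows the figure invented
    if not missing and not spurious:
        return None  # passes review: no regeneration needed
    notes = []
    if missing:
        notes.append(f"missing arrows: {sorted(missing)}")
    if spurious:
        notes.append(f"spurious arrows: {sorted(spurious)}")
    return "; ".join(notes)
```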
Performance & Benchmarking
To evaluate our framework, we introduced PaperBananaBench, a dataset comprising 584 valid samples (292 test cases and 292 reference cases) curated from NeurIPS 2025 publications. Using a VLM-as-a-Judge methodology (Gemini-3-Pro), we compared PaperBanana against leading baselines.
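A VLM-as-a-Judge evaluation typically scores each figure on a fixed rubric and aggregates the per-criterion scores. The criteria names below echo the metrics mentioned in this post (e.g., conciseness), but the exact rubric and the mean aggregation are assumptions for illustration.

```python
# Hypothetical rubric: the judge VLM returns one score per criterion,
# and the overall score is their unweighted mean.
CRITERIA = ("faithfulness", "conciseness", "readability", "aesthetics")

def overall_score(judge_scores: dict) -> float:
    return sum(judge_scores[c] for c in CRITERIA) / len(CRITERIA)
```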
Benchmark Results
PaperBanana demonstrates significant improvements across all metrics, with the largest gains in Conciseness and Overall Score.
Performance varied by domain:
- Agent & Reasoning diagrams: Highest overall score (69.9%).
- Vision & Perception: Most challenging (52.1%).
Our results indicate that while PaperBanana excels in aesthetics, fine-grained connectivity (such as specific arrow directions and source-target node matching) remains a frontier for current models.
Future Implications
PaperBanana acts as a critical bridge in the “Autonomous Research Lifecycle,” shifting the paradigm from AI that merely “thinks and writes” to AI that “visualizes and communicates.”
The current framework produces raster (PNG) images, which are inherently difficult to edit. The next frontier in this research involves the transition to vector graphics (SVG/AI) for “infinite scalability.” We envision the development of GUI Agents capable of autonomously operating professional design software like Adobe Illustrator. This would enable the production of fully editable, professional-grade vector illustrations with zero human intervention.
Conclusion
PaperBanana “democratizes access” to high-quality visual tools, enabling researchers to communicate complex discoveries without requiring specialized design expertise. By automating the most tedious aspects of figure creation, it accelerates the pace of scientific dissemination.
However, researchers must maintain rigorous human oversight. Users should view these agents as collaborators and remain vigilant against fine-grained “visual hallucinations” to ensure the absolute scientific integrity of every published illustration.
As models like Kimi k2.5 and frameworks like PaperBanana evolve, the toolkit for the autonomous scientist grows steadily more powerful.


