SketchAgent: Language-Driven Sequential Sketch Generation


MIT, Stanford

What is SketchAgent?

SketchAgent is a drawing system that leverages LLMs to generate sequential sketches. 🖌️
Given a text prompt, it produces a sequence of strokes that are rendered to the canvas, transforming language into visual concepts! 🎨

What Can You Do with It?

🖌️ Interactive Sketching

Through our interface, users can collaborate with SketchAgent to draw together, stroke by stroke, engaging in a fun and creative drawing experience. Check out our demo!

🖌️ Chat-Based Sketch Editing

Users can also iteratively edit their sketches through a chat dialogue!

Additionally, the strong prior of LLMs can be leveraged to perform various types of sketch editing, such as animating sketches with CSS!

How does it work?


We leverage an off-the-shelf pretrained multimodal LLM to draw sketches based on natural language instructions.
Although these models can produce SVGs through direct prompting, the results often appear "mechanical," with uniform and overly precise shapes that lack the organic qualities of human sketches.

Furthermore, while these models excel in textual tasks, they often struggle with fine-grained spatial reasoning, limiting their effectiveness for sketch editing.
To overcome these challenges, we introduce an intuitive sketching language that combines a grid canvas with Bézier curve processing.
  1. The canvas is a grid numbered 1–50 along both the x-axis and the y-axis. Each cell is uniquely identified by its x and y coordinates (e.g., the bottom-left cell is x1y1).
  2. In our sketching language, a sketch is defined as a sequence of n ordered strokes, where each stroke is defined by a sequence of m cell coordinates on the grid.
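As an illustration of the two points above, here is a minimal Python sketch of how a textual sketch could be turned into numeric grid coordinates. Only the cell-label form `x1y1` and the 1–50 range come from the description; the helper names `parse_cell` and `parse_sketch`, the list-of-lists layout, and the sample strokes are assumptions made for this example, not the system's actual output format.

```python
import re

GRID_SIZE = 50  # the canvas is numbered 1-50 on both axes

# A cell label looks like "x1y1": an x index followed by a y index.
CELL_PATTERN = re.compile(r"x(\d+)y(\d+)")


def parse_cell(label: str) -> tuple[int, int]:
    """Convert a cell label such as 'x1y1' into integer (x, y) grid coordinates."""
    match = CELL_PATTERN.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not a cell label: {label!r}")
    x, y = int(match.group(1)), int(match.group(2))
    if not (1 <= x <= GRID_SIZE and 1 <= y <= GRID_SIZE):
        raise ValueError(f"cell {label!r} is outside the 1-{GRID_SIZE} grid")
    return x, y


def parse_sketch(strokes: list[list[str]]) -> list[list[tuple[int, int]]]:
    """Parse a sketch: n ordered strokes, each a sequence of m cell labels."""
    return [[parse_cell(label) for label in stroke] for stroke in strokes]


# Hypothetical sketch with n = 2 strokes (a roof-like arc and a base line).
sketch = [
    ["x10y10", "x25y30", "x40y10"],
    ["x10y10", "x40y10"],
]
parsed = parse_sketch(sketch)
```

Keeping strokes as ordered lists preserves the drawing sequence, which matters because the sketch is generated and rendered stroke by stroke.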

How do we process the textual strokes so that the sketch appears more natural?
We treat the specified (x,y) coordinates as a set of desired points sampled along the curve, and fit a smooth Bézier curve to them.
Here is how a cubic Bézier curve is defined mathematically:

\[ B(t) = (1 - t)^3P_0 + 3(1-t)^2tP_1 + 3(1-t)t^2P_2 + t^3P_3, \]

where the set \(P=\{P_0, P_1, P_2, P_3\}\) is often referred to as the curve's control points, and \(t\in[0,1]\) is a parameter that moves the point along the curve from \(P_0\) at \(t=0\) to \(P_3\) at \(t=1\).
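The definition above translates directly into code. The following is a small Python sketch that evaluates \(B(t)\) for 2D control points; the function name `cubic_bezier` and the sample control points are chosen for illustration only.

```python
Point = tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3."""
    s = 1.0 - t
    # The four weights are non-negative and sum to 1 for t in [0, 1].
    w0, w1, w2, w3 = s**3, 3 * s * s * t, 3 * s * t * t, t**3
    x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
    y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
    return (x, y)


# Illustrative control points for a symmetric arch.
control = [(10.0, 10.0), (20.0, 40.0), (30.0, 40.0), (40.0, 10.0)]

# Sample the curve at t = 0, 0.25, 0.5, 0.75, 1.
samples = [cubic_bezier(*control, i / 4) for i in range(5)]
```

The curve starts at \(P_0\) and ends at \(P_3\), while \(P_1\) and \(P_2\) pull the curve toward them without generally lying on it.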
To fit a curve to the specified coordinates, the model also specifies, for each point, the value of the \(t\) parameter at which the curve should reach it.
Using the provided coordinates and \(t\) values, we fit a cubic Bézier curve to the sampled points by solving a system of linear equations with least squares, where the unknowns are the control points \(P=\{P_0,P_1,P_2,P_3\}\):

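\[ P = \text{argmin}_P \|AP - B\|, \]

where each row of \(A\) holds the four cubic Bernstein weights \((1-t)^3,\ 3(1-t)^2t,\ 3(1-t)t^2,\ t^3\) evaluated at one point's \(t\) value, and the matching row of \(B\) (the matrix of targets, not the curve \(B(t)\)) is that point's specified \((x,y)\) coordinates.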
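The least-squares fit can be sketched in a few lines of dependency-free Python. This is an illustrative version, not the authors' implementation: it builds \(A\) from the cubic Bernstein weights at each \(t\) value and solves the normal equations \(A^\top A P = A^\top B\) with Gauss-Jordan elimination. The function names are hypothetical, and in practice a library routine such as `numpy.linalg.lstsq` would be the more usual choice.

```python
def bernstein_row(t: float) -> list[float]:
    """One row of A: the four cubic Bernstein weights at parameter t."""
    s = 1.0 - t
    return [s**3, 3 * s * s * t, 3 * s * t * t, t**3]


def solve_linear(m: list[list[float]], rhs: list[list[float]]) -> list[list[float]]:
    """Solve m @ X = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: need at least 4 distinct t values")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    width = len(rhs[0])
    return [[aug[i][n + k] / aug[i][i] for k in range(width)] for i in range(n)]


def fit_cubic_bezier(points, ts):
    """Return 4 control points minimising ||A P - B|| for the given points and t values."""
    a = [bernstein_row(t) for t in ts]
    # Normal equations: (A^T A) P = A^T B, with B holding the target (x, y) rows.
    ata = [[sum(row[i] * row[j] for row in a) for j in range(4)] for i in range(4)]
    atb = [[sum(row[i] * pt[k] for row, pt in zip(a, points)) for k in range(2)]
           for i in range(4)]
    return [tuple(row) for row in solve_linear(ata, atb)]


# Round-trip check with illustrative data: sample a known curve, then refit it.
true_control = [(10.0, 10.0), (20.0, 40.0), (30.0, 40.0), (40.0, 10.0)]
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
points = [
    tuple(sum(w * p[k] for w, p in zip(bernstein_row(t), true_control)) for k in range(2))
    for t in ts
]
fitted = fit_cubic_bezier(points, ts)
```

When the sampled points lie exactly on a cubic Bézier curve, as in this round trip, the fit recovers its control points up to floating-point error; for points that do not, it returns the closest cubic in the least-squares sense, which is what smooths out the stroke.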

After processing the agent's output into vector graphics, we render the strokes onto the canvas to form the final sketch. The overall process is as follows:
SketchAgent, a frozen multimodal LLM, takes three inputs: (1) a system prompt detailing the sketching language guidelines, (2) a user prompt with task-specific instructions (e.g., "Draw a shark"), and (3) a blank canvas for sketching.
Based on the task, the agent generates a textual response representing the sequence of strokes. These strokes are processed into vector graphics and rendered onto the canvas. The canvas can then be reused in two ways: (1) it can be fed back into the model with updated prompts for additional tasks or edits, or (2) it can be accessed by a human user for collaborative sketching.
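To make the "processed into vector graphics and rendered onto the canvas" step concrete, here is a hedged Python sketch that converts fitted control points (in grid coordinates) into SVG path data. The cell size `CELL_PX`, the choice to draw through cell centres, the stroke styling, and all function names are assumptions for this example. The y-axis is flipped because the grid's x1y1 cell is at the bottom-left, whereas the SVG origin is at the top-left.

```python
GRID_SIZE = 50  # grid cells per axis, as in the sketching language
CELL_PX = 10    # hypothetical pixel size of one grid cell


def to_pixels(point):
    """Map a grid coordinate to the pixel centre of its cell, flipping the y-axis."""
    x, y = point
    px = (x - 0.5) * CELL_PX
    py = (GRID_SIZE - y + 0.5) * CELL_PX
    return px, py


def bezier_to_svg_path(control_points):
    """Build SVG path data (M ... C ...) for one cubic Bezier curve."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = [to_pixels(p) for p in control_points]
    return f"M {x0:g} {y0:g} C {x1:g} {y1:g}, {x2:g} {y2:g}, {x3:g} {y3:g}"


def render_svg(curves):
    """Wrap each curve in a <path> element and return a complete SVG document."""
    size = GRID_SIZE * CELL_PX
    paths = "\n".join(
        f'  <path d="{bezier_to_svg_path(c)}" fill="none" stroke="black" stroke-width="2"/>'
        for c in curves
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">\n'
        f"{paths}\n</svg>"
    )


# One illustrative stroke spanning the four corners of the grid.
svg = render_svg([[(1, 1), (1, 50), (50, 50), (50, 1)]])
```

Because the output is plain SVG text, the same canvas could in principle be rasterized and fed back to the model or edited further by a human collaborator.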

BibTeX


@misc{vinker2024sketchagentlanguagedrivensequentialsketch,
  title={SketchAgent: Language-Driven Sequential Sketch Generation},
  author={Yael Vinker and Tamar Rott Shaham and Kristine Zheng and Alex Zhao and Judith E Fan and Antonio Torralba},
  year={2024},
  eprint={2411.17673},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.17673},
}