Simon Willison’s Weblog: Gemini Bounding Box Visualization

Source URL: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/#atom-everything
Source: Simon Willison’s Weblog
Title: Gemini Bounding Box Visualization

Feedly Summary: Gemini Bounding Box Visualization
Here’s another fun tool I built with the help of Claude 3.5 Sonnet.
I was browsing through Google’s Gemini documentation while researching how different multi-modal LLM APIs work when I stumbled across this note in the vision documentation:

You can ask the model for the coordinates of bounding boxes for objects in images. For object detection, the Gemini model has been trained to provide these coordinates as relative widths or heights in range [0,1], scaled by 1000 and converted to an integer. Effectively, the coordinates given are for a 1000×1000 version of the original image, and need to be converted back to the dimensions of the original image.

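To make that conversion concrete, here's a minimal JavaScript sketch (mine, not from Google's docs), assuming the [ymin, xmin, ymax, xmax] ordering requested in the prompt below:

// Map a Gemini box ([ymin, xmin, ymax, xmax], each value in 0-1000)
// back to pixel coordinates on the original image.
function toPixelBox([ymin, xmin, ymax, xmax], imageWidth, imageHeight) {
  return {
    left: (xmin / 1000) * imageWidth,
    top: (ymin / 1000) * imageHeight,
    width: ((xmax - xmin) / 1000) * imageWidth,
    height: ((ymax - ymin) / 1000) * imageHeight,
  };
}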
I tried a few prompts using their Python library and got back what looked like bounding boxes!
>>> import google.generativeai as genai
>>> genai.configure(api_key="…")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- [200, 90, 745, 527]
- [300, 610, 904, 937]
But how to verify them? I fired up Claude 3.5 Sonnet and got it to start iterating on Artifacts to try and visualize those co-ordinates against the original image.
After a lot of fiddling around, I built a tool where I could paste in co-ordinates, select an image and see those boxes rendered on top of that image.
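The rendering step itself is straightforward. Here's a sketch of it (my reconstruction, not the tool's actual source), drawing each box over the image with the 2D canvas API and reusing the toPixelBox helper from above:

// Draw the image onto a canvas at its natural size, then outline
// each returned box. Assumes the <img> has finished loading.
function drawBoxes(canvas, image, boxes) {
  canvas.width = image.naturalWidth;
  canvas.height = image.naturalHeight;
  const ctx = canvas.getContext("2d");
  ctx.drawImage(image, 0, 0);
  ctx.strokeStyle = "red";
  ctx.lineWidth = 4;
  for (const box of boxes) {
    const {left, top, width, height} = toPixelBox(box, image.naturalWidth, image.naturalHeight);
    ctx.strokeRect(left, top, width, height);
  }
}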
I figured it would be neat to have a tool where I could select an image, enter a prompt and have that prompt run through Gemini Pro and any bounding boxes displayed on the original image.
It turns out the Gemini API supports CORS! I wrote about this as a feature of the Anthropic Claude API the other day. It means you can build browser applications that can talk directly to LLMs, using the user’s own API key.
So I built this tool, which lets you select a photo, enter a prompt, paste in your Gemini key (which is stashed in localStorage – it is only stored in your browser) and see both the response from the prompt and your image with that bounding box visualized:
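For illustration, here's a stripped-down sketch of that browser flow (my own, not the tool's actual source; the function and storage-key names are made up), using the @google/generative-ai module loaded from esm.run:

import { GoogleGenerativeAI } from "https://esm.run/@google/generative-ai";
// The key lives only in the browser's localStorage.
let apiKey = localStorage.getItem("GEMINI_API_KEY");
if (!apiKey) {
  apiKey = window.prompt("Gemini API key:");
  localStorage.setItem("GEMINI_API_KEY", apiKey);
}
const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro-latest" });
// CORS support on the Gemini endpoint is what lets this request go
// straight from the page, with no server-side proxy.
async function describeImage(base64Image, promptText) {
  const result = await model.generateContent([
    { inlineData: { data: base64Image, mimeType: "image/jpeg" } },
    promptText,
  ]);
  return result.response.text();
}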

The code was almost all written by Claude. I started out with this extensive prompting sequence to help build the original tool for opening an image and pasting in those bounding box coordinates.
Once I had that working (I had to manually edit the code to fix some issues with the way the coordinates were interpreted) I used my favourite prompting trick to build the combined tool that called the Gemini API: I found this example that calls the @google/generative-ai API from a browser, pasted the full example into Claude along with my previous bounding box visualization tool and had it combine them to achieve the desired result.

Based on that example text, build me an HTML page with Vanilla JS that loads the Gemini API from esm.run – it should have a file input and a textarea and a submit button – you attach an image, enter a prompt and then click the button and it does a Gemini prompt with that image and prompt and injects the returned result into a div on the page

Then this follow-up prompt:

now incorporate the logic from this tool (I pasted in that HTML too), such that when the response is returned from the prompt the image is displayed with any rendered bounding boxes

Here’s the finished source code, after I tweaked it to store the API key in localStorage and increased the width of the rendered bounding boxes.
Tags: vision-llms, claude-3-5-sonnet, gemini, ai, llms, claude-artifacts, ai-assisted-programming, generative-ai, projects

AI Summary and Description: Yes

**Summary:**
The text discusses a tool developed for visualizing bounding box coordinates generated by Google’s Gemini model in response to image prompts. It highlights the capabilities of Gemini’s API for object detection and incorporates user interactivity in a web application setting. The insights provided are significant for developers and professionals interested in integrating AI, particularly LLMs, with web applications for practical uses such as image analysis and object recognition.

**Detailed Description:**
The text describes a practical project that demonstrates the application of multi-modal large language models (LLMs) in the field of AI-driven image analysis and object detection. The project involves utilizing Google’s Gemini model to generate bounding box coordinates for objects detected in images, specifically in this case, goats in a photograph. Key points and implications include:

– **Object Detection Using Gemini:**
– The Gemini model can produce bounding box coordinates in a normalized format that requires conversion for use with original image dimensions.
– Example coordinates from a returned response show the model identifying and locating multiple objects (here, two goats) within a single image.

– **Tool Development:**
– The author describes how they built a visualization tool to display the bounding boxes directly on the input image.
– The first step involved decoding the coordinates and rendering them, which required some troubleshooting and manual edits to the generated code to fix how the coordinates were interpreted.

– **Integration with Web Applications:**
– The project uses vanilla JavaScript to create a user-friendly interface that lets users upload images, enter prompts, and view results.
– The Gemini API’s CORS support means browser-based applications can call it directly, with no server-side proxy required.

– **Use of Generative AI in Programming:**
– The involvement of Claude 3.5 Sonnet in building the tool showcases how generative AI can help programmers work with complex APIs and enhance productivity.
– The tool’s design, which stores user API keys in the browser’s localStorage, raises security considerations alongside its usability benefits.

– **Final Tool Features:**
– The final product allows for live interaction, combining image processing with user prompts, demonstrating the real-world applications of LLMs and AI in processing multimedia data.

**Implications for Professionals:**
– This project exemplifies the convergence of AI, image processing, and web technology, offering valuable lessons on building responsive AI-integrated tools.
– Security and privacy considerations arise in terms of how API keys are handled. While local storage offers convenience, it may introduce vulnerabilities if not managed correctly.
– The insights gained can be leveraged to develop more sophisticated applications that exploit LLM capabilities, extending beyond image bounding boxes to other data types and interaction forms in cloud-based environments. Professionals in AI, cloud security, and infrastructure can learn from these practical applications of the latest LLMs, particularly in user-interactive settings.