Simon Willison’s Weblog: Running prompts against images and PDFs with Google Gemini

Source URL: https://simonwillison.net/2024/Oct/23/prompt-gemini/#atom-everything
Source: Simon Willison’s Weblog
Title: Running prompts against images and PDFs with Google Gemini

Feedly Summary: Running prompts against images and PDFs with Google Gemini
New TIL. I’ve been experimenting with the Google Gemini APIs for running prompts against images and PDFs (in preparation for finally adding multi-modal support to LLM) – here are my notes on how to send images or PDF files to their API using curl and the base64 -i macOS command.
I figured out the curl incantation first and then got Claude to build me a Bash script that I can execute like this:
prompt-gemini 'extract text' example-handwriting.jpg
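
The curl and base64 approach described above can be sketched roughly as follows. This is not the author's script. The function names (mime_type_for, build_payload, send_prompt), the GEMINI_MODEL and GOOGLE_API_KEY variables, and the default model name are all assumptions made for illustration. The endpoint and JSON shape follow the public Gemini generateContent REST API rather than anything stated in this summary, so check them against the current documentation before relying on them.

```shell
#!/usr/bin/env bash
# Sketch only, not the author's prompt-gemini script.
# All function and variable names below are hypothetical.

# Guess a MIME type from the file extension (assumption: these cover the use case).
mime_type_for() {
  case "${1##*.}" in
    jpg|jpeg) echo "image/jpeg" ;;
    png)      echo "image/png" ;;
    pdf)      echo "application/pdf" ;;
    *)        echo "application/octet-stream" ;;
  esac
}

# Print a JSON request body: one text part plus one inline base64-encoded file part.
build_payload() {
  local prompt="$1" file="$2" data escaped
  # Reading from stdin works with both macOS and GNU base64; tr removes line wrapping.
  data=$(base64 < "$file" | tr -d '\n')
  # Minimal JSON escaping: backslashes and double quotes only (newlines are not handled).
  escaped=$(printf '%s' "$prompt" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"contents":[{"parts":[{"text":"%s"},{"inline_data":{"mime_type":"%s","data":"%s"}}]}]}\n' \
    "$escaped" "$(mime_type_for "$file")" "$data"
}

# POST the body to the API. Assumes GOOGLE_API_KEY is set; the model name is an assumed default.
send_prompt() {
  local model="${GEMINI_MODEL:-gemini-1.5-flash-latest}"
  build_payload "$1" "$2" | curl -s \
    "https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent?key=${GOOGLE_API_KEY}" \
    -H 'Content-Type: application/json' \
    -d @-
}
```

With these functions loaded, send_prompt 'extract text' example-handwriting.jpg would mirror the invocation shown above. The response is JSON, and the generated text is expected under candidates[0].content.parts[0].text. The payload is piped to curl on stdin rather than passed as an argument because a base64-encoded image or PDF can exceed the shell's argument-length limit.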

Playing with this is really fun. The Gemini models charge less than 1/10th of a cent per image, so it’s really inexpensive to try them out.
Tags: vision-llms, gemini, llm, bash, ai, llms, ai-assisted-programming, google, generative-ai, ocr, projects

AI Summary and Description: Yes

Summary: The text describes experiments with the Google Gemini APIs, undertaken in preparation for adding multi-modal support to LLM, the author's command-line tool. It covers sending images and PDFs to the API with curl for tasks such as text extraction, and notes that the cost is less than 1/10th of a cent per image.

Detailed Description: The content describes a personal project that uses the Google Gemini APIs to run prompts against images and PDF documents. It is relevant to professionals working in AI and LLM security because it shows a low-cost, scriptable way to apply vision-capable models to tasks such as OCR.

– **Multimodal Support**: The author is preparing to add multi-modal features to LLM, the author's command-line tool, so that prompts can include images and PDFs as well as text.
– **API Functionality**: The experiments use curl, together with the macOS base64 -i command to encode files, to send images and PDFs to the Gemini API. The example shown extracts text from a handwriting image.
– **Cost Overview**: The content notes that using Gemini models is quite economical, costing less than 1/10th of a cent per image, which is relevant for budgeting in AI projects.
– **Practical Implementation**: After working out the curl command by hand, the author had Claude generate a Bash script, prompt-gemini, that takes a prompt and a file as arguments, a workflow many developers might find valuable.

Overall, the insights provided can benefit AI developers and security professionals exploring how to leverage AI APIs for image processing and text extraction, ensuring that they are aware of the tools available and their operational costs.