Simon Willison tries out Gemini 1.5 Pro on video and suggests its 1M-token context size opens up powerful new opportunities using video prompts
Willison’s experience with Gemini 1.5 Pro extracting structured output from video prompts, and his reaction to it, parallel my own experience using GPT-4 Vision to extract structure from images of heirloom recipes (often handwritten and horribly mangled). In his words:
… I’m pretty astonished by this.
… I find those results pretty astounding.
The ability to analyze video like this feels SO powerful. Being able to take a 20 second video of a bookshelf and get back a JSON array of those books is just the first thing I thought to try.
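Whichever model produces it, the "JSON array of those books" still arrives as plain text that has to be parsed before anything else can use it. Here is a minimal sketch of that step in Python. It makes no model API call, and several things in it are my own assumptions rather than anything from Willison's post: the function name `parse_book_list`, the `title` and `authors` keys, and the guess that the reply may be wrapped in a Markdown code fence or surrounded by a sentence or two of prose.

```python
import json
import re

# Three backticks, built programmatically so this snippet can itself
# live inside a fenced code block.
FENCE = "`" * 3


def parse_book_list(response_text: str) -> list[dict]:
    """Extract a list of book records from a model's text reply.

    Assumes (hypothetically) that the model was asked for a JSON array
    of objects with "title" and "authors" keys.
    """
    # Unwrap a Markdown code fence if the reply contains one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, response_text, re.DOTALL)
    payload = fenced.group(1) if fenced else response_text

    # Keep only the outermost [...] span, in case prose surrounds the array.
    start, end = payload.find("["), payload.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")

    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    books = json.loads(payload[start:end + 1])

    # Drop anything that is not an object with a non-empty title.
    cleaned = []
    for entry in books:
        if isinstance(entry, dict) and str(entry.get("title", "")).strip():
            cleaned.append({
                "title": str(entry["title"]).strip(),
                "authors": str(entry.get("authors", "")).strip(),
            })
    return cleaned
```

The same defensive parsing would apply to recipe fields pulled from an image: treat the reply as untrusted text, find the JSON, and discard entries that lack the one field you cannot do without.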