Making AI Productive pt 3 - Models that reason over images and video
The role of vision models in generative AI assisted workflows
💡What’s New: Understanding open-source AI with images and video.
🤔 Opinion: The world is better with competition - so why so little in AI?
🛠️ Tools & Data: Tools and data to speed your AI transformation
💡What’s new
Last week we looked at AI for text and the applications for working with words. This week we look at AI models for working on images.
It’s fascinating to me that no matter how complex your work, on a day-to-day basis you’re using the same type of inputs as every other human. That is, most of the time you’re making decisions based on images. If you drive a truck, you’re watching the road. If you’re a geochemical analyst, you’re probably looking at a screen. If you’re a mine engineer, you’re probably reviewing graphs.
The point of image AI is to speed up, simplify, and automate as much of that work as possible. How exactly does it do that? To answer that, it’s helpful to review how images work.
For humans, images are a set of dots laid out in two dimensions, horizontal and vertical. We don’t typically “think” of images in that X,Y sense. The way we see our world is intuitive because we’ve been seeing it since we were babies.

Our eyes convert 3D to 2D. Each dot registers as a specific color with intensity.
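To make the "dots in two dimensions" idea concrete, here is a minimal sketch in plain Python. It assumes the common convention of storing each dot as three numbers (red, green, blue) from 0 to 255. The tiny `image` and the helper names `pixel_at` and `brightness` are made up for illustration and do not come from any library.

```python
# A tiny 3x2 "image": rows of (red, green, blue) dots, each channel 0-255.
# The values are invented for illustration only.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]


def pixel_at(img, x, y):
    """Return the color of the dot at column x (horizontal), row y (vertical)."""
    return img[y][x]


def brightness(pixel):
    """Average the three color channels into a single intensity value."""
    r, g, b = pixel
    return (r + g + b) / 3


height = len(image)    # number of rows (vertical dots)
width = len(image[0])  # number of columns (horizontal dots)

print(width, height)
print(pixel_at(image, 2, 0))
print(brightness(pixel_at(image, 1, 1)))
```

A real photo is the same structure, only with millions of dots instead of six.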
Our eyes and brains are always at work and learning when they’re open. The discrete dots that we see moment by moment are converted into electrical and chemical signals, which physically change the brain. As we see images of the world, we “learn” its patterns, that is, the way things ought to look and behave.
Training AI works much the same. It’s just that computers can process trillions of these dots much faster than we can. The pattern recognition that takes humans a lifetime to develop can now be done in minutes and hours with a computer. (There are differences, but that’s neither here nor there as far as our jobs are concerned.)
There are many tricks our brains perform with these dots to perceive patterns. AI has advanced enough to perform these learning tricks as well. Here are some of them.
Segment - identify logical divisions.

Use cases: data extraction, labeling, masking
See the Hugging Face Segmentation Space to learn more. Try some experiments here and here and here.
Classify - assign an image, or even specific pixels within it, to a class.

Use cases: object detection, data labeling, navigation, finding anomalies.
See Hugging Face Image Classification to learn more. See an example on satellite / drone imagery here.
Translate - convert an image to a related form in a separate data type, notably text.

Use cases: summarize (a set of) images, associate objects with their actions and behaviors, and even characterize an image with an identifiable trait. Note: These models are able to work in either direction (see below).
See Hugging Face Image to Text to learn more. Try some experiments here and here.
Predict - generate an image from text or from previous elements in a sequence.

Use cases: creating images (e.g., for marketing and movies) and even predicting the next item in a sequence.
See Hugging Face Image Generation to learn more. Try some experiments here and here.
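For a feel of what "segment" and "classify" mean at the level of dots, here is a toy sketch in plain Python. It uses threshold segmentation, a classical technique that involves no learning, and a hand-written brightness rule standing in for a trained classifier. Modern models learn far richer rules from data. The `gray` grid, the threshold of 128, and the function names are all assumptions made for illustration.

```python
# Toy grayscale image: each number is one dot's intensity (0 = dark, 255 = bright).
# The values are invented for illustration only.
gray = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
    [9, 10, 14, 215],
]


def segment(img, threshold=128):
    """Threshold segmentation: label each dot 1 (bright region) or 0 (dark region)."""
    return [[1 if value >= threshold else 0 for value in row] for row in img]


def classify(img, threshold=128):
    """Give the whole image one label based on its average intensity."""
    values = [value for row in img for value in row]
    mean = sum(values) / len(values)
    return "mostly bright" if mean >= threshold else "mostly dark"


mask = segment(gray)    # one label per dot
label = classify(gray)  # one label for the whole image

for row in mask:
    print(row)
print(label)
```

The difference to notice is the shape of the answer: segmentation returns a label for every dot, while classification returns a single label for the whole image.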
While these are interesting by themselves, they are more useful when we combine and automate them with some engineering and our own data. They then become automated driving, drone navigation, and satellite earth observation.
The most interesting use cases are those that involve reasoning between several information types, e.g., language + images. We’ll explore that in the next post where we look at breaking down your work into the data components you use, and then finding models that can facilitate the discrete tasks that make up your work.
It’s enough for now to know that these tasks comprise the bulk of modern image AI. For an expanded list of AI for images, check out Hugging Face’s task list.
– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
Related news
OpenAI announces its Sora Model for making videos from text.
DARPA progress with drones could lead to commercial application in mining.
Mineral multinationals are setting up in-house AI centers.
Google is now in business to help the energy industry with earth observation.
If MIT can use Maps to classify agricultural land use, why not minerals?
A recent piece explaining image segmentation for mineral processing.
An example of a high-growth startup moving to open-source AI.
Saudi Arabia is angling to become a high-tech mineral hub.
Axora AI software wins a mining cost-savings competition.
🤔 Opinion
If there’s one thing every consumer agrees on when it comes to their shopping, it’s that competition is good. Competition all but guarantees better service and lower prices. And right now there’s not enough competition in AI.
The world needs alternatives to ChatGPT. From what I have experienced, OpenAI’s inference prices are prohibitively expensive. They preclude anyone outside of a major organization from developing use cases of significant size. That’s because the algorithms that enable large AI models must process huge amounts of data to be useful. Also, the end-use applications themselves will see consumer usage on the order of millions of AI inference requests each minute. Without lower inference costs, it’s not practical for smaller teams to consider building and fielding the next generation of valuable LLM applications. (Maybe OpenAI knows this.)
Then there’s the problem of OpenAI’s infrastructure: It’s fragile. When OpenAI makes a major product announcement, as it did last week with the Sora model release, its APIs become unreliable.
That’s why I want to introduce you to some competition. Spend a few minutes to consider Together and Groq. You will be glad you did. Here’s a short video I made to help you get going.
🛠️ Tools and Data
Open-source tools to speed your AI transformation
QGPT Agent: https://github.com/momaabna/QGPTAgent A plugin for QGIS that allows users to interact with QGIS using natural language commands.
LHRS Bot: https://github.com/NJU-LHRS/LHRS-Bot AI that leverages globally available volunteer geographic information and remote sensing images and possesses the capability for sophisticated reasoning.
GeoGalactica: https://huggingface.co/geobrain-ai/geogalactica Scientific large language model in geoscience.
PointLLM: https://github.com/openrobotlab/pointllm A model capable of understanding colored point clouds of objects, types, geometric structures, and appearance without concerns for ambiguous depth, occlusion, or viewpoint dependency.
Thanks for reading! Want me to look into a particular topic? Email your suggestions and I will dig.