jina-vlm: Seeing like an AI with vision language models

Learn about vision language models (VLMs), what jina-vlm can do, how to use it, and best practices.


Vision-language models (VLMs) are generative language models that take texts and images as input and generate new text as output. They’re designed for visual question-answering (VQA), an application of multimodal AI that fuses image processing and natural language processing (NLP) in ways that would have been impossible a few years ago. Jina’s new model, jina-vlm, breaks new ground for VQA with state-of-the-art semantic processing of images such as scans, slides, and diagrams, which combine text and visual elements to convey information, unlike natural scenes such as photographs.

jina-vlm opens up new application areas:

  • Accessibility: Creating descriptions of images for visually impaired people.
  • Indexing: Generating detailed textual information to improve search for images and documents that contain images.
  • Querying: Matching natural language queries to images through direct analysis of image content.
  • Visual content analysis: Automatically inspecting images for classification and analysis.
  • Retrieval-augmented generation (RAG): Using visual information as a knowledge source in RAG systems.

This article will introduce you to VLMs and show you how to use jina-vlm to bring multimodal AI to your own solutions.

Vision-language models

jina-vlm is a decoder-only language model with additional features and training to answer questions or create general descriptions of images.

We adapted the open-weight Qwen3-1.7B-Base language model as the backbone of jina-vlm. What makes it a vision-language model is a custom fine-tuned image encoder, derived from the SigLIP2-So400M/14-384 model, that converts images into a form the language model can understand. We then trained the combined model specifically for VQA and description generation.

To understand how this works, think about how language models “read” your prompts. Input texts are split into tokens (words or parts of words), which are replaced with embedding vectors from a lookup table and then fed into a neural network. The model never sees the actual letters you typed.

As seen in the image, each unique token is substituted with a number that, in turn, corresponds to an embedding vector stored in the model’s vocabulary. This enables the model to process whatever text you give it, even if the words aren’t in its stored vocabulary.
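The lookup described above can be sketched in a few lines of Python. The vocabulary, token IDs, and vectors here are invented for illustration; real models use tens of thousands of tokens and high-dimensional vectors.

```python
# Toy illustration of how a language model "reads" text: tokens become
# integer IDs, and each ID selects a row in an embedding table. The
# vocabulary and vectors below are made up for demonstration.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
embedding_table = [
    [0.1, 0.3],  # "the"
    [0.7, 0.2],  # "cat"
    [0.4, 0.9],  # "sat"
    [0.0, 0.0],  # "<unk>" for out-of-vocabulary tokens
]

def embed(text: str) -> list[list[float]]:
    """Replace each token with its embedding vector, as a model would."""
    token_ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]
    return [embedding_table[i] for i in token_ids]

print(embed("the cat sat"))
# The model only ever sees these vectors, never the letters themselves.
```

Note that an unknown word still produces a vector (the `<unk>` row here), which is why a model can process text whose words it has never stored, but cannot inspect their spelling.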

This is why language models struggle to count letters in obscure words, like this example with Claude Haiku 4.5:

This model correctly recognizes what Qaqortoq is; it knows that the word means a town in Greenland, and it can tell you all about it, but it has no idea which actual letters compose the name. There are ways to supplement language models to address this shortcoming, so some models (like ChatGPT 5, Gemini, and Claude Sonnet) don’t generally make this kind of mistake.

However, this architectural shortcoming makes it possible to create multimodal language models.

The actual input to generative language models is a sequence of semantic embeddings, and there’s no reason those embeddings have to come from texts. They could come from anything, as long as they’re compatible with the model’s training. We can make any language model handle images, provided we can generate vectors for those images in the same embedding space as the tokens.

jina-vlm contains an image embedding model and adapter trained to do just that. The image (with some preprocessing into overlapping patches) is the input to the model, and it outputs a sequence of embeddings in the same semantic embedding space as the language model’s text token embeddings. These embeddings are then treated just like the text input embeddings. The full language model has access to your text and to the image at the same time, and it treats both as the same kind of input.
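Schematically, the fusion amounts to concatenating the two embedding sequences before they reach the decoder. This sketch uses made-up two-dimensional vectors purely to show the shape of the idea; it is not the model’s actual code.

```python
# Schematic sketch of VLM input fusion: image patch embeddings (from the
# vision encoder and adapter) are concatenated with text token embeddings
# into one flat sequence for the decoder. Values are illustrative only.
text_embeddings = [[0.2, 0.5], [0.8, 0.1]]               # e.g. an instruction
image_embeddings = [[0.4, 0.4], [0.6, 0.3], [0.9, 0.7]]  # patch vectors

def build_input_sequence(image_embs: list, text_embs: list) -> list:
    # The decoder receives a single sequence; it cannot tell which
    # vectors came from pixels and which came from text tokens.
    return image_embs + text_embs

sequence = build_input_sequence(image_embeddings, text_embeddings)
print(len(sequence))  # 5 vectors, all in the same embedding space
```

Because both modalities land in one sequence, the attention mechanism treats image patches and text tokens identically, which is exactly what lets the model answer text questions about pixels.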

Figure 3 is a schematic of the jina-vlm model, showing the dual track processing of text and images together.

This architecture has some important benefits. Language models learn a lot of implicit information about the world from their training data, and the VLM architecture can use that information to help understand the content of images. Furthermore, some of the bias problems present in other multimodal architectures are absent in this one, notably the media “gap” that troubles CLIP-style models.

What can jina-vlm do?

jina-vlm has three overlapping core competencies:

  • Describing images and generating captions.
  • Understanding informationally rich images.
  • Answering questions about the content of images.

Describing and captioning pictures

Providing good descriptions of images can be a boon for accessibility. There are increasing legal and informal requirements to provide alternative text (alt-text) for digitized images on the internet. Search engine optimization (SEO) strategies in particular often demand high-quality image captions and alt-text. Manually composing such descriptions can be expensive and time consuming.

jina-vlm can help. For example:

You can input this image to jina-vlm with the text:

  • Input Instruction: Describe this image in detail.

The result is the following:

  • Output: The image captures a vibrant scene of a white plate filled with a multitude of colorful jelly beans. The jelly beans, in hues of green, yellow, orange, purple, red, and blue, are scattered across the plate, creating a rainbow-like effect. The plate itself is placed on a gray surface, providing a stark contrast to the colorful jelly beans. The image does not contain any discernible text. The relative positions of the jelly beans vary, with some clustered together and others spread out, giving the image a sense of depth and texture. The overall composition of the image is simple yet striking, with the colorful jelly beans being the central focus.

This description is a little wordy but quite detailed.

jina-vlm treats the text you give it as a question or instruction, and some prompt engineering may improve the result a lot. For example, using the same picture of jelly beans:

  • Input Instruction: Describe this image. Be concise.
  • Output: A white plate is filled with a variety of colorful jelly beans, including green, yellow, red, purple, and blue, arranged in a scattered manner.

Or:

  • Input Instruction: Provide a caption for this picture.
  • Output: A white plate full of colorful jelly beans.

Asking specifically for a caption tends to produce very short results. Most captions are brief, so jina-vlm is trained to generate brief captions.

You can also use jina-vlm to identify and list items in images:

As a relatively small model, jina-vlm has limits to its object recognition. For common categories, it’s very good. For example:

It’s able to handle content like “red cars”, but ask jina-vlm a question involving detailed categories and it can become unreliable very quickly. For example, using the same image:

  • Input Instruction: What model of car is this?
  • Output: The car in the image is a classic French car, specifically a Citroën DS. This model is known for its distinctive rounded body shape, which was a hallmark of the DS series. The DS was produced from 1955 to 1969 and was a popular choice for its combination of style, performance, and practicality. It was designed to be a luxury car that could be driven on city streets as well as on the roads, making it a versatile vehicle for its time. The DS was also notable for its innovative features, such as the rear spoiler and the front grille with the Citroën emblem.

This answer is largely hallucinated. The car in the picture is one of the most recognizable cars in history: a Citroën 2CV, manufactured from 1948 to 1990. The Citroën DS was made from 1955 to 1975 and looked like this:

In short, when pressed for details beyond its capacity, jina-vlm hallucinates without warning.

It’s especially important to keep this in mind with images of people:

Tom Cruise is famous enough, and present enough in the training data, for jina-vlm to recognize him. But, using the same picture:

  • Input Instruction: Provide a caption for this picture.
  • Output: A man with a green shirt and a big smile.

This caption never mentions Tom Cruise, even though it’s the most important fact about the picture. Asking for a caption produces very little hallucination but also very little information.

In contrast, let’s do the same thing with this image:

This is terribly wrong. Ronald Reagan looks nothing like Richard Nixon.

When you ask jina-vlm to identify people, it’s likely to hallucinate because it doesn’t have very deep world knowledge. Only people who appear many times in its training data are likely to be correctly identified.

jina-vlm is a small language model (SLM). It has very poor world knowledge and is at its best with superficially visible things. As with all language models, it’s prone to error and hallucination when it moves outside of its capabilities and will not give you any warning when it does.

jina-vlm hallucinates much less when asked to write a caption. For example, using the same picture of Richard Nixon:

  • Input Instruction: Provide a caption for this picture.
  • Output: A man in a suit sits in a chair with a bookshelf behind him.

The most effective way to reduce hallucination is to prompt the model to say less. Any use or deployment of jina-vlm should take this into account.

Understanding informational content

jina-vlm truly shines at understanding informational graphics and synthetic images, like diagrams, graphs, screenshots, scans of printed pages, advertising copy, and presentation slides. The examples below show only part of jina-vlm’s range.

Charts

jina-vlm can report on the contents of all sorts of data presentations.

Bar charts:

Pie charts:

Area charts:

Line graphs:

Tables:

Multilingual content

jina-vlm’s multilingual support means that it can recognize languages and describe informational content in a different language from its presentation:

It can also list the languages when there’s more than one:

jina-vlm may answer in a different language than you asked for, if the image content is in another language or strongly associated with it:

To fix this, specify the language in the query. With the same image:

  • Input Instruction: Summarize the content of this document in English.
  • Output: The image is a promotional poster for a KFC Thailand event. The poster is predominantly red with white and black text. At the top, there is a logo of KFC, which is a red and white square with a black silhouette of Colonel Sanders. Below the logo, there are three young men, presumably the main participants or hosts of the event. They are dressed in casual attire, with two of them wearing red jackets and one in a white shirt with a black stripe.

    The central part of the poster features a large text in white that reads, “เปิดตัวใหม่! ร่วมกับ KFC Thailand พบกับความสนุกสนานที่ไม่เหมือนใคร!” which translates to “Launching New! Join KFC Thailand and experience something unique!” Below this, there is a smaller text that says, “ทุกคนมีส่วนร่วม! วันนี้ 8 พฤศจิกายน 64 เวลา 12:00 น.” indicating that the event is happening on November 8, 2021, at 12:00 PM.

    The bottom section of the poster includes social media icons and a Facebook page link, suggesting that the event will be live on Facebook. There are also icons of a chicken drumstick and a bucket, which are associated with KFC, reinforcing the brand’s identity.

    The overall design of the poster is vibrant and eye-catching, with a clear focus on the event’s theme and the brand’s presence. The use of red and white colors is typical for KFC branding, which is known for its bold and energetic visual style. The poster is designed to attract attention and inform viewers about the upcoming event.

Complex information displays

We specifically trained jina-vlm to understand complex information displays and to be able to summarize them in natural language.

Infographics:

Technical and educational diagrams:

Maps:

Screenshots

jina-vlm has been trained to parse and understand screenshots:

It can even parse video games:

Scans

jina-vlm excels at materials like scans and images of printed text:

Pre-digital materials

Robust support for scans also makes pre-digital materials accessible to digital search technologies:

Social media

Since much of the internet is “five giant websites filled with screenshots of text from the other four,” jina-vlm is especially helpful in processing the kind of content that the web is full of:

jina-vlm can sometimes even explain memes:

Visual question-answering

“A picture is worth a thousand words.” As cliché as that saying is, multimodal AI can prove it.

VQA is the domain where jina-vlm excels and the one with the most potential for new and revolutionary applications. jina-vlm answers questions about the content of images just as a language model answers questions about texts. This means it can serve as part of a RAG system that assembles responses from mixtures of images and texts stored in a database.

The infographic below is very information rich, and with jina-vlm you can ask natural language questions about its content:

Or, using the same image:

  • Input Instruction: What is the crew module of the SLS called?
  • Output: Orion spacecraft

This capability makes it possible to query formal schematics and diagrams:

With the same image:

  • Input Instruction: How many bathrooms are there?
  • Output: 4

If you use photographs or natural scenes, jina-vlm is typically good with high-visibility characteristics:

But if you’re asking about finer, less visible details, the answers might get spotty:

Thank you, Captain Obvious.

jina-vlm is also not well-suited to resolving optical illusions:

You can ask questions about text in photographs:

And jina-vlm is ideally suited to understand traditional magazine and newspaper advertisements:

Visual question-answering also works cross-lingually:

Using the same image:

  • Input Instruction: How much do grapes cost per kilo?
  • Output: The cost of grapes per kilogram is 3.98.

It answers questions from manual pages:

It can also answer questions about the content of posters and artistic works:

We showed in the previous section that jina-vlm is not very good at general face recognition, outside of a few very famous people. However, it can use the text in images to answer questions it can’t answer from just a picture:

Best practices

jina-vlm is a generative language model. It’s not a database. It’s prone to error and hallucination. It’s very important to take that into account in your use case.

As with all generative language models, we very strongly recommend you only use jina-vlm in highly error-tolerant scenarios.

To minimize errors, we offer the following advice:

Test your prompts

If you depend on getting certain kinds of results consistently, you should do thorough quality testing on your queries. Getting a few good results with one prompt does not guarantee the same quality of result with different images. You can minimize this risk by manually reviewing the results of queries over a broad collection of images that are genuinely representative of what you plan to process.
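One lightweight way to organize this review is a small harness that runs one prompt over your whole image set and collects the answers for a human to scan. The `query_vlm` function below is a placeholder for however you call the model (API or local); the stub just echoes so the sketch runs standalone.

```python
# Minimal harness for reviewing one prompt across a representative image
# set. Replace `query_vlm` with a real call to jina-vlm; the stub here
# echoes its inputs so the harness is self-contained.
def query_vlm(image_path: str, prompt: str) -> str:
    return f"[response for {image_path}]"  # placeholder for a real call

def review_prompt(prompt: str, image_paths: list[str]) -> dict[str, str]:
    """Collect one response per image so a human can review them together."""
    return {path: query_vlm(path, prompt) for path in image_paths}

results = review_prompt(
    "Describe this image. Be concise.",
    ["chart.png", "scan.png", "photo.jpg"],
)
for path, answer in results.items():
    print(f"{path}: {answer}")
```

The point is the workflow, not the code: every candidate prompt gets checked against the full variety of images you expect in production, not just the one image it happened to work on.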

Specify the language

Add to your query the language you expect the answer to be in. Like all language models, jina-vlm cannot easily distinguish between your instructions and the data it’s supposed to apply them to. The only way it can decide which language to answer in is to look at the languages it recognizes in its input or to follow your explicit instructions.

Don’t ask leading questions

It’s very easy to get jina-vlm to hallucinate by asking leading questions. For example:

Prompt for short answers

jina-vlm can be very wordy sometimes. Using words like “briefly” and “concise” reduces the size of the output.

Compare:


And with the same image:

  • Input Instruction: What kind of vehicle is this? Be very concise.
  • Output: This is a Massey Ferguson tractor.

Only ask questions relevant to the image

jina-vlm will not always be able to tell that your question is irrelevant to the content and can easily hallucinate answers. For example:

With some planning, you might be able to reduce this effect by using a decision tree: Ask multiple questions, each one dependent on the answer to the previous one. Using the same Picasso painting above, we would never have asked about the language of the text if we had asked this question first:

  • Input Instruction: Is there any written language in this image? Be very concise.
  • Output: No

Think like a language model

We trained jina-vlm to attend to the image given to it. This means that no matter what you ask it, its answer will be a result of the picture it sees. Anthropomorphizing language models is a major hazard for AI users, but it’s easy to see how jina-vlm “thinks” if we ask it questions that call for reflection. For example:

You can ask jina-vlm anything, even questions like, “How are you feeling?” and it will answer in a way that is, if possible, attentive to the image.

jina-vlm has no filter, safety training, or reliable source of facts, and it only knows that what it sees in the image you give it matters. For example:

But:

This is not a bug, it’s a feature. jina-vlm will rely on what you give it to ground what it says. It’s up to you to decide what to give it.

As they say,

And,

Getting started with jina-vlm

jina-vlm is available via a public OpenAI-compatible API at https://api-beta-vlm.jina.ai. To use this API, get a free Jina API key at the Jina website. It comes with 10 million free tokens to try out Jina by Elastic’s models. Additional tokens are available for purchase.

You can find instructions for accessing the jina-vlm API on the Jina AI website, and there’s a web-based sandbox for trying out images and requests.
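Since the API is OpenAI-compatible, a request can be sketched with the standard `openai` client: the image goes in as a base64 data URL alongside the text instruction. The model name `"jina-vlm"`, the `/v1` path, and the exact payload shape are assumptions here; check the Jina AI documentation for the authoritative parameters.

```python
# Sketch of a request to an OpenAI-compatible VLM endpoint. The payload
# shape follows the standard chat-completions multimodal format; model
# name and URL path are assumptions, not confirmed values.
import base64

def build_messages(image_bytes: bytes, prompt: str) -> list[dict]:
    """Package an image and a text instruction as one chat message."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": prompt},
        ],
    }]

messages = build_messages(b"\x89PNG...", "Provide a caption for this picture.")

# With an API key from jina.ai, the call itself would look roughly like:
# from openai import OpenAI
# client = OpenAI(base_url="https://api-beta-vlm.jina.ai/v1", api_key="...")
# reply = client.chat.completions.create(model="jina-vlm", messages=messages)
# print(reply.choices[0].message.content)
```

The same message-building helper works for any of the prompting patterns shown earlier, from captioning to decision-tree VQA.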

You can also download the model from Hugging Face and run it locally for noncommercial purposes. For commercial on-site deployment, contact Elastic sales.

For more technical information about the model, including pricing information, visit the jina-vlm model page. You can also read the release paper on arXiv.org.

