Using Instructor to Return Typed Data from Ollama

Getting data out of a large language model can be frustrating. We rarely want free text as the final format, so we end up parsing whatever comes back into a more useful structure, and we typically have limited control over what that structure looks like.

Many approaches to this essentially bolt extra text onto the prompt to guide the model into responding with JSON, which is easier to parse. For example, with Ollama and llama2:

import requests
import json

response = requests.post('http://localhost:11434/api/generate', json={
  "model": "llama2",
  "prompt": "List 5 cities from around the world and their countries, with a short description as json",
  "stream": False
})

print(json.loads(response.content)['response'])

{
"cities": [
    {
        "city": "Paris",
        "country": "France",
        "description": "The City of Light is famous for its art, fashion, and cuisine."
    },
    ...
]
}

This JSON response lists 5 cities from around the world and their corresponding countries. Each city is described with a short paragraph highlighting some of its notable features.

But as we can see, even then we may end up with broken formatting, a cheery ‘Sure, here you go’ before our JSON, or a trailing explanation after it, all of which we need to skip over. Using function calling or Ollama’s JSON format mitigates this to some degree, but it doesn’t guarantee that the structure we get back is always the same.
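To make the "skip over" step concrete, here is a minimal sketch of pulling the first JSON object out of a chatty reply using only the standard library. The helper name `extract_first_json` and the sample `reply` string are my own illustrations, not part of Ollama or Instructor.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object found in text, skipping any chatter around it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # not valid JSON from this brace, so try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Hypothetical reply with a preamble and a trailing explanation
reply = (
    "Sure, here you go!\n"
    '{"cities": [{"city": "Paris", "country": "France"}]}\n'
    "This JSON response lists cities and their countries."
)
data = extract_first_json(reply)
print(data["cities"][0]["city"])
```

This only recovers something that parses; it says nothing about whether the keys are the ones we wanted. The next request uses Ollama’s JSON format instead.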

import requests
import json

response = requests.post('http://localhost:11434/api/generate', json={
  "model": "llama2",
  "prompt": "List 5 cities from around the world and their countries, with a short description",
  "stream": False,
  "format": "json"
})

print(json.loads(response.content)['response'])

Across different runs, this returned both of the following:

{
    "cities": [
        "Tokyo", "Japan",
        "New York City", "USA",
        "Paris", "France",
        "London", "UK",
        "Beijing", "China"
    ]
}

{ 
    "@type": "ListItem", "position": 1, "name": "Tokyo", "itemType": "City", "description": "Capital city of Japan, known for its vibrant culture, cutting-edge technology, and historic landmarks"
}

In the first, we get a flat list of cities and countries with no descriptions; the second contains only a single city, with a description.
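Without a schema, the only way to catch this drift is a hand-written shape check. The sketch below shows roughly what that looks like; `has_expected_shape`, `REQUIRED_KEYS`, and the shortened sample strings are hypothetical, chosen to mirror the two responses above alongside the shape we actually wanted.

```python
import json

# Hypothetical target shape: every city needs these keys
REQUIRED_KEYS = {"name", "country", "description"}


def has_expected_shape(raw: str) -> bool:
    """True only if raw is JSON shaped like {"cities": [{name, country, description}, ...]}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    cities = data.get("cities") if isinstance(data, dict) else None
    if not isinstance(cities, list) or not cities:
        return False
    # every entry must be an object carrying all required keys
    return all(isinstance(c, dict) and REQUIRED_KEYS.issubset(c) for c in cities)


flat_list = '{"cities": ["Tokyo", "Japan", "Paris", "France"]}'
single_item = '{"@type": "ListItem", "position": 1, "name": "Tokyo", "description": "Capital city of Japan"}'
wanted = '{"cities": [{"name": "Tokyo", "country": "Japan", "description": "Capital city of Japan"}]}'

for raw in (flat_list, single_item, wanted):
    print(has_expected_shape(raw))
```

Both responses from the model are valid JSON, yet only the third string passes. Writing and maintaining checks like this by hand is the chore a schema-driven approach removes.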

Instructor

Instructor, from Jason Liu, is a library that makes it easier to get structured data like this in the language of your choice. Even better, it supports many providers, such as OpenAI and Claude, and now local models through Ollama. I’ll be using the Python version to demonstrate how to get data back as Pydantic models.

First, let’s consider the same simple request to Ollama for details on cities from around the world, this time using the OpenAI client library. This doesn’t do anything particularly clever and doesn’t even return JSON.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, but unused
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {
            "role": "user",
            "content": "List 5 cities from around the world and their countries, with a short description",
        }
    ],
)

print(response.choices[0].message.content)

Next, let’s modify this to work via Instructor and Pydantic. We define a Pydantic model for a city and another for a list of cities. We then use Instructor to patch the OpenAI client so it adds a response_model parameter to chat completions. Finally, we invoke the chat completion using our “Cities” model as the response model.

from typing import List
from openai import OpenAI
from pydantic import BaseModel
import instructor


class City(BaseModel):
    name: str
    country: str

class Cities(BaseModel):
    cities: List[City]

# enables `response_model` in create call
client = instructor.from_openai(
    OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required, but unused
    ),
    mode=instructor.Mode.JSON,
)

city_list = client.chat.completions.create(
    model="llama2",
    messages=[
        {
            "role": "user",
            "content": "List 5 cities from around the world and their countries",
        }
    ],
    response_model=Cities,
)

print(city_list)

cities=[City(name='New York', country='USA'),City(name='Tokyo', country='Japan'),City(name='London', country='United Kingdom'),City(name='Paris', country='France'),City(name='Sydney', country='Australia')]

Now we have data in a format we desire, minus any parsing code that we may have had to write previously. Perfect.
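As a rough illustration of what that buys us, the sketch below works with the same two models without calling Ollama at all. It assumes Pydantic v2 (`model_dump`, `model_validate`), and the `city_list` built here by hand is only a stand-in for the object returned by the Instructor call.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class City(BaseModel):
    name: str
    country: str


class Cities(BaseModel):
    cities: List[City]


# Stand-in for the object returned by the instructor call (no model request is made here)
city_list = Cities(
    cities=[City(name="Tokyo", country="Japan"), City(name="Paris", country="France")]
)

# Typed attribute access instead of dictionary lookups
for city in city_list.cities:
    print(f"{city.name} ({city.country})")

# Convert back to plain dicts and lists when needed
as_dict = city_list.model_dump()

# The same models reject data in the wrong shape, such as the flat list seen earlier
try:
    Cities.model_validate({"cities": ["Tokyo", "Japan"]})
    shape_ok = True
except ValidationError:
    shape_ok = False
print(shape_ok)
```

The validation step is the important part: a response in the wrong shape raises an error rather than slipping through as loosely structured data.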

In my tutorial on this, I extended it a little further so it can parse structured data scraped from websites. Take a look at this video if that interests you.

