We use a combination of natural language prompting and structured data validation to perform qualitative content analysis at quantitative scale. This approach allows us to create sophisticated classification schemes that combine the nuanced understanding of LLMs with strict data validation.

The key components of our classification strategy are:

  1. Natural Language Codebooks: We define our classification criteria in natural language, allowing for rich, nuanced instructions that LLMs can understand and apply consistently.
  2. Structured Output Enforcement: Using Pydantic models, we strictly define the shape and validation rules for our classification outputs (see the sketch after this list).
  3. Type Safety: We ensure that all classifications conform to our predefined schemas while allowing the flexibility of natural language interpretation.

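To illustrate the second and third points, here is a minimal sketch of what such a Pydantic model can enforce; the model and field names are illustrative, not part of the library:

from pydantic import BaseModel, Field

class AlarmLevel(BaseModel):
    """Rate how alarming a political news item is."""
    score: int = Field(ge=1, le=10, description="1 = routine politics, 10 = severe crisis")
    rationale: str = Field(description="One-sentence justification for the score")

# An LLM response that does not parse into this shape, or whose score falls
# outside 1-10, raises pydantic.ValidationError instead of slipping through.
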
Here’s an example of how these components work together:

text = "Lets do something very bad, says very unstrustworthy president"

instruction = "On a 1-10 scale how how much this articles indicates something bad happening in politics"

int_value = xclass.classify("int", instruction, text)

# print(int_value)
# 9

Or with a Pydantic model (note that the model docstring and field descriptions serve as the prompt instructions, so the second argument is left empty):

from typing import List

from pydantic import BaseModel, Field

class MyClassifier(BaseModel):
    """
    We want to classify our content with this model. We are focusing on political news.
    We need the most relevant keywords, which are the Topics.

    Work with news-related categories. The relevance level should accurately filter out
    anecdotal content and focus on geopolitics.

    A 1 might be the opinion of a celebrity on an issue and a 10 an invasion of
    one country into another.
    """
    keywords: List[str] = Field(description="The keywords relevant to this text. Only keywords semantically relevant to a content system")
    relevance_level: int = Field(description="On a 1-10 scale, how relevant the content is")
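
The model can then be passed to the same classify call. A minimal usage sketch, assuming the first argument accepts a Pydantic model class in place of the "int" type name (the exact call shape is inferred from the first example, not documented here):

result = xclass.classify(MyClassifier, "", text)

# print(result)
# Illustrative output:
# MyClassifier(keywords=['president', 'politics'], relevance_level=9)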