A Practical Zero-Shot Multimodal Classifier

Most classifiers ship with a fixed label list. They perform well inside that box and fail silently outside it. We took that as a challenge and built a classifier that starts from the opposite assumption: the world is open. New product names appear, new categories form, and inputs arrive in multiple languages.
At JigsawStack, we treat open-vocabulary classification as a language task. The classifier accepts text or images and arbitrary label strings, including long, descriptive ones. It is built on top of Small Vision Language Models, and the result is practical zero-shot classification that works on everything from multilingual support tickets to real-world images, without retraining.
Where It Slaps 🔥
1) Crosslingual Text Classification
// npm install jigsawstack
import { JigsawStack } from "jigsawstack";

const jigsaw = JigsawStack({ apiKey: "your-api-key" });

const response = await jigsaw.classification({
  "dataset": [
    {
      "type": "text",
      "value": "Necesito un reembolso; me cobraron dos veces."
    },
    {
      "type": "text",
      "value": "सबमिट बटन काम नहीं कर रहा है।"
    }
  ],
  "labels": [
    {
      "type": "text",
      "value": "billing issue"
    },
    {
      "type": "text",
      "value": "feature request"
    },
    {
      "type": "text",
      "value": "bug"
    }
  ]
});
Result
{
  "success": true,
  "predictions": [
    "billing issue",
    "bug"
  ],
  "_usage": {
    "input_tokens": 145,
    "output_tokens": 14,
    "inference_time_tokens": 1386,
    "total_tokens": 1545
  }
}
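For plain-text inputs, the request shape above is repetitive. Here is a minimal sketch of two convenience helpers. `buildTextRequest` and `pairPredictions` are hypothetical names, not part of the JigsawStack SDK, and the pairing assumes what the sample result suggests: `predictions[i]` belongs to `dataset[i]`.

```javascript
// Hypothetical helpers; not part of the JigsawStack SDK.

// Wrap plain strings in the { type, value } shape used by the request above.
function buildTextRequest(texts, labels) {
  const toTextItem = (value) => ({ type: "text", value });
  return {
    dataset: texts.map(toTextItem),
    labels: labels.map(toTextItem),
  };
}

// Pair each input with its prediction.
// Assumption: predictions come back in the same order as the dataset.
function pairPredictions(texts, predictions) {
  if (texts.length !== predictions.length) {
    throw new Error("Expected one prediction per input");
  }
  return texts.map((text, i) => ({ text, prediction: predictions[i] }));
}

const tickets = [
  "Necesito un reembolso; me cobraron dos veces.",
  "सबमिट बटन काम नहीं कर रहा है।",
];
const request = buildTextRequest(tickets, ["billing issue", "feature request", "bug"]);

// With the client from the example above:
// const response = await jigsaw.classification(request);
// const rows = pairPredictions(tickets, response.predictions);
```

The length check is deliberate: if the response ever contained fewer predictions than inputs, silently zipping the two arrays would attach labels to the wrong tickets.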
2) Moderation & Intent Classification
const response = await jigsaw.classification({
  "dataset": [
    {
      "type": "text",
      "value": "sydney sweeney is mid"
    },
    {
      "type": "text",
      "value": "I'm thrilled to announce that I am part of JigsawStack"
    }
  ],
  "labels": [
    {
      "type": "text",
      "value": "hate"
    },
    {
      "type": "text",
      "value": "harassment"
    },
    {
      "type": "text",
      "value": "spam"
    }
  ]
});
Result
{
  "success": true,
  "predictions": [
    "hate",
    "spam"
  ],
  "_usage": {
    "input_tokens": 127,
    "output_tokens": 12,
    "inference_time_tokens": 995,
    "total_tokens": 1134
  }
}
3) Tagging (Multi-label) Classification
const response = await jigsaw.classification({
  "dataset": [
    {
      "type": "text",
      "value": "I bought the AMD Ryzen 7 5700G to build a PC that could handle daily tasks, work, light gaming, and even some editing all without needing a dedicated graphics card right away. And honestly, I’m really impressed with how it performs. Right out of the box, this processor feels fast, stable, and efficient. With 8 cores and 16 threads, it handles multitasking like a champ. I can run multiple apps, edit videos, use development tools, and stream without lag or freezing. Everything is smooth and responsive."
    }
  ],
  "labels": [
    {
      "type": "text",
      "value": "hardware"
    },
    {
      "type": "text",
      "value": "powered by AMD"
    },
    {
      "key": "product review",
      "type": "text",
      "value": "This is a review from a user which talks about their recent computer hardware purchase."
    }
  ],
  "multiple_labels": true
});
Result
{
  "success": true,
  "predictions": [
    [
      "hardware",
      "powered by AMD",
      "product review"
    ]
  ],
  "_usage": {
    "input_tokens": 256,
    "output_tokens": 20,
    "inference_time_tokens": 1243,
    "total_tokens": 1519
  }
}
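Two details of this example are easy to miss. The third label pairs a short `key` with a long descriptive `value`, and the prediction reports the `key`. With `multiple_labels` set, each entry in `predictions` is an array rather than a single string. The sketch below builds keyed labels and normalizes both response shapes; the helper names are hypothetical, and the shape assumptions come only from the sample results in this post.

```javascript
// Hypothetical helpers; not part of the JigsawStack SDK.

// Build a text label. If a description is given, it becomes the label's
// "value" and the short name becomes its "key", as in the example above.
function textLabel(name, description) {
  return description === undefined
    ? { type: "text", value: name }
    : { key: name, type: "text", value: description };
}

// Normalize predictions so every input maps to an array of labels.
// Assumption (from the sample results): single-label calls return one string
// per input, multi-label calls return one array per input.
function toLabelArrays(predictions) {
  return predictions.map((p) => (Array.isArray(p) ? p : [p]));
}

const labels = [
  textLabel("hardware"),
  textLabel("powered by AMD"),
  textLabel(
    "product review",
    "This is a review from a user which talks about their recent computer hardware purchase."
  ),
];

// Both shapes end up as arrays of arrays.
const single = toLabelArrays(["billing issue", "bug"]);
const multi = toLabelArrays([["hardware", "powered by AMD", "product review"]]);
```

Normalizing early means downstream code (tag storage, filtering, analytics) only ever has to handle one shape, whichever mode the request used.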
4) IRL/Context-Aware Image Classification
Input image:
Classification:
const response = await jigsaw.classification({
  "dataset": [
    {
      "type": "image",
      "value": "https://images.unsplash.com/photo-1580655653885-65763b2597d0?q=80&w=1740&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
    }
  ],
  "labels": [
    {
      "type": "text",
      "value": "This image is likely taken on the west coast of US"
    },
    {
      "type": "text",
      "value": "यह अमेरिका का हॉलीवुड है"
    }
  ],
  "multiple_labels": true
});
Result
{
  "success": true,
  "predictions": [
    [
      "This image is likely taken on the west coast of US",
      "यह अमेरिका का हॉलीवुड है"
    ]
  ],
  "_usage": {
    "input_tokens": 145,
    "output_tokens": 39,
    "inference_time_tokens": 4235,
    "total_tokens": 4419
  }
}
5) Image Classification with Image as a Label Reference
Input Image:
Reference images as labels:

The two reference images above (shown side by side) are passed as labels, and each shows a poker hand. The input image shows a hand where the player holds a royal flush, the best and rarest hand in poker: the ace, king, queen, jack, and ten of the same suit. The task is to classify the player's hand based on the reference images:
const response = await jigsaw.classification({
  "dataset": [
    {
      "type": "image",
      "value": "https://images.unsplash.com/photo-1655008109440-df3a58567cee?q=80&w=2726&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
    }
  ],
  "labels": [
    {
      "key": "This is a straight", // image on the left
      "type": "image",
      "value": "https://plus.unsplash.com/premium_photo-1694781503979-f2c7f599d269?q=80&w=870&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
    },
    {
      "key": "This is a royal flush", // image on the right
      "type": "image",
      "value": "https://plus.unsplash.com/premium_photo-1671683370315-87306b0faf90?q=80&w=1744&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
    }
  ]
});
Result
{
  "success": true,
  "predictions": [
    "This is a royal flush"
  ],
  "_usage": {
    "input_tokens": 217,
    "output_tokens": 14,
    "inference_time_tokens": 5748,
    "total_tokens": 5979
  }
}
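Every result in this post carries a `_usage` block, so token consumption can be tracked across calls. The sketch below sums the five sample blocks; `sumUsage` is a hypothetical helper, not an SDK function. In these samples, `total_tokens` equals the sum of the other three fields, and the two image requests report noticeably more `inference_time_tokens` (4235 and 5748) than the text requests (995 to 1386).

```javascript
// Hypothetical helper; not part of the JigsawStack SDK.

// Sum the "_usage" blocks from several responses, field by field.
function sumUsage(usages) {
  const total = {
    input_tokens: 0,
    output_tokens: 0,
    inference_time_tokens: 0,
    total_tokens: 0,
  };
  for (const usage of usages) {
    for (const field of Object.keys(total)) {
      total[field] += usage[field];
    }
  }
  return total;
}

// The "_usage" blocks from the five sample results in this post.
const usages = [
  { input_tokens: 145, output_tokens: 14, inference_time_tokens: 1386, total_tokens: 1545 },
  { input_tokens: 127, output_tokens: 12, inference_time_tokens: 995, total_tokens: 1134 },
  { input_tokens: 256, output_tokens: 20, inference_time_tokens: 1243, total_tokens: 1519 },
  { input_tokens: 145, output_tokens: 39, inference_time_tokens: 4235, total_tokens: 4419 },
  { input_tokens: 217, output_tokens: 14, inference_time_tokens: 5748, total_tokens: 5979 },
];

const totals = sumUsage(usages);
// totals: { input_tokens: 890, output_tokens: 99,
//           inference_time_tokens: 13607, total_tokens: 14596 }

// Check that total_tokens is the sum of the other three fields in each sample.
const consistent = usages.every(
  (u) => u.input_tokens + u.output_tokens + u.inference_time_tokens === u.total_tokens
);
// consistent: true
```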
Not Just Novel, but Smart
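Open-set safety: you can include a label such as “unknown/none of the above” so the classifier is not forced into a bad match, which helps avoid false positives.
Label agility: labels are supplied with each request, so adding, renaming, or elaborating on one needs no retraining; the classifier is built to generalize to real-world inputs.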
Multilingual by construction: prompts and labels can be in the user’s language; we support mixed scripts and cross-lingual matching.
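The open-set point can be sketched as a pair of small helpers: one appends a fallback label to any label list, the other separates inputs that matched a real label from those that fell through. The helper names and the exact fallback string are assumptions for illustration, not SDK features, and the sketch assumes single-label predictions (one string per input).

```javascript
// Hypothetical helpers; not part of the JigsawStack SDK.

const FALLBACK = "unknown/none of the above";

// Append the fallback label unless the caller already supplied it.
function withFallback(labels) {
  const hasFallback = labels.some((label) => label.value === FALLBACK);
  return hasFallback ? labels : [...labels, { type: "text", value: FALLBACK }];
}

// Split single-label results into inputs that matched a real label and
// inputs that landed on the fallback (for example, to route to human review).
function splitUnknown(inputs, predictions) {
  const matched = [];
  const unmatched = [];
  inputs.forEach((input, i) => {
    const row = { input, prediction: predictions[i] };
    (predictions[i] === FALLBACK ? unmatched : matched).push(row);
  });
  return { matched, unmatched };
}

const labels = withFallback([
  { type: "text", value: "hate" },
  { type: "text", value: "spam" },
]);

// Illustrative predictions only; these values are made up, not real API output.
const { matched, unmatched } = splitUnknown(
  ["first post", "second post"],
  ["spam", FALLBACK]
);
```

Without a fallback, a benign input still has to land on one of the supplied labels; with it, such inputs can be collected separately instead of being counted as hits.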
Zero-shot multimodal classification isn’t magic, nor is it a feat achieved by training on every label in the world; rather, it’s language-conditioned decision-making with honest uncertainty. By treating labels as language and leveraging SLMs, we get a classifier that generalizes to new labels without a retraining cycle.
👥 Join the JigsawStack Community
Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!





