The GenAI Playbook by Yasir Gaji Part 2: From Chatbot to Inference Engine
5 January 2026

In Part 1, we explored the theory: the layers of AI, how Transformers work, and the difference between “Generative” and “Agentic” systems. We established that while ChatGPT is a fantastic product, the real power for engineers lies in the Model.
But how do you move from chatting with a model to building with one?
In Part 2, we stop talking theory. We will build a minimal, reliable inference API. We will take a raw Generative Model (Google Gemini) and wrap it in code to perform a specific business task: Analyzing the Global Gold Market.
The “Hello World” of GenAI Engineering
When most people start with GenAI, they stay in the “Chat Loop”:
- Open ChatGPT.
- Type a prompt.
- Read the answer.
This is Manual Inference. It doesn’t scale.
To build a product (like TheGoldMetrics, a financial intelligence tool I am building), we need Programmatic Inference. We need a function that accepts data (inputs), processes it through the LLM (reasoning), and returns a structured result (output), all without a human in the loop.
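As a rough sketch of that shape, programmatic inference is an async function with typed input and typed output. Every name below (`PriceInput`, `AnalysisOutput`, `ModelCall`, `runInference`, `echoModel`) is hypothetical and not taken from TheGoldMetrics. The model call is passed in as a parameter so the example runs without an API key:

```typescript
// Hypothetical types for illustration; none of these names come from the article's codebase.
interface PriceInput {
  asset: string;
  priceUsd: number;
}

interface AnalysisOutput {
  asset: string;
  analysis: string;
  generatedAt: string; // ISO timestamp
}

// Any function that turns a prompt into text: a real LLM client or a stub.
type ModelCall = (prompt: string) => Promise<string>;

// Inputs -> reasoning (the model call) -> structured output, with no human in the loop.
async function runInference(input: PriceInput, callModel: ModelCall): Promise<AnalysisOutput> {
  const prompt = `Analyse this price. Asset: ${input.asset}. Price (USD): ${input.priceUsd}.`;
  const analysis = await callModel(prompt);
  return {
    asset: input.asset,
    analysis: analysis.trim(),
    generatedAt: new Date().toISOString(),
  };
}

// A stub stands in for the LLM so the sketch is runnable offline.
const echoModel: ModelCall = async (prompt) => `stubbed analysis for: ${prompt}`;
```

Swapping `echoModel` for a real client changes nothing for the caller, which is what makes the function schedulable and testable.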
The Architecture of an Inference Engine
A reliable GenAI API consists of three layers:
- The Context Layer (Data): The facts the AI needs to know (that it wasn’t trained on).
- The Instruction Layer (Prompt): The persona and rules that guide the behaviour.
- The Client Layer (Inference): The actual code that calls the model.
Let’s build these layers using TypeScript and Google Gemini.
Layer 1: The Context (Grounding the Model)
If I ask a standard model, “What is the price of gold right now?”, it will either hallucinate or apologise, because its training data ends at a cutoff date and contains no live prices. To build a useful engine, we must inject Context.
For the TheGoldMetrics codebase, I built a dedicated data adapter that connects to a reliable financial data provider, MetalPriceAPI.
I then encapsulated this logic in a provider class to keep our code clean. Here is an instance of the implementation:
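A minimal sketch of what such a provider class could look like. The class name, the type names, the endpoint URL, and the response shape are assumptions based on my reading of MetalPriceAPI's public documentation, so check them against the current docs before relying on them. In particular, I assume that with `base=USD` the `XAU` rate is quoted as ounces per dollar, which is why the code inverts it. The HTTP call is injected so the class can be exercised without network access:

```typescript
// Assumed response shape: rates are quoted per 1 unit of the base currency,
// so with base USD, rates.XAU is troy ounces of gold per US dollar.
interface LatestRatesResponse {
  success: boolean;
  base: string;
  timestamp: number; // Unix seconds
  rates: Record<string, number>;
}

interface GoldQuote {
  pricePerOunceUsd: number;
  asOf: string; // ISO timestamp
}

// Injected HTTP function so a fake can replace the network in tests.
// On Node 18+, a live version could be:
//   const liveFetcher: JsonFetcher = async (url) => (await fetch(url)).json();
type JsonFetcher = (url: string) => Promise<LatestRatesResponse>;

class GoldPriceProvider {
  private readonly apiKey: string;
  private readonly fetchJson: JsonFetcher;
  // Assumed endpoint; confirm it against the provider's documentation.
  private readonly baseUrl = "https://api.metalpriceapi.com/v1/latest";

  constructor(apiKey: string, fetchJson: JsonFetcher) {
    this.apiKey = apiKey;
    this.fetchJson = fetchJson;
  }

  async getGoldQuote(): Promise<GoldQuote> {
    const url = `${this.baseUrl}?api_key=${encodeURIComponent(this.apiKey)}&base=USD&currencies=XAU`;
    const data = await this.fetchJson(url);
    const rate = data.rates?.XAU;
    if (!data.success || typeof rate !== "number" || rate <= 0) {
      throw new Error("Unexpected price response");
    }
    return {
      pricePerOunceUsd: 1 / rate, // invert ounces-per-dollar into dollars-per-ounce
      asOf: new Date(data.timestamp * 1000).toISOString(),
    };
  }
}
```

Failing loudly on a malformed response matters here: a silent default would hand the model a wrong "fact" to reason about.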
I didn’t ask the AI to guess the price, and I didn’t scrape websites. Instead, I gave it the price and asked it to analyse it. This is the foundation of RAG (Retrieval-Augmented Generation): fetch facts first, then generate text.
Layer 2: The Instruction (Prompt Engineering as Code)
In a chat interface, you might type: “Write a tweet about gold.” In an inference API, that is too vague. We need reliability. We treat the Prompt as a Function Signature.
We define a “System Instruction” that locks the model into a specific role. For instance:
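An illustrative sketch of such an instruction, paired with a typed prompt builder. The wording of the rules, the 280-character limit, and the names `ANALYST_SYSTEM_INSTRUCTION`, `PromptArgs`, and `buildUserPrompt` are hypothetical and not the actual prompt used in TheGoldMetrics:

```typescript
// Hypothetical system instruction; the wording is illustrative only.
const ANALYST_SYSTEM_INSTRUCTION = [
  "You are a senior commodities analyst covering the global gold market.",
  "Rules:",
  "1. Use only the price data supplied in the user message. Never invent figures.",
  "2. Respond in at most 280 characters.",
  "3. Use a neutral, factual tone. Do not give investment advice.",
  "4. If the supplied data is missing or malformed, reply exactly: INSUFFICIENT_DATA",
].join("\n");

// The "function signature" idea: typed arguments in, one prompt string out.
interface PromptArgs {
  pricePerOunceUsd: number;
  asOf: string; // ISO timestamp of the quote
}

function buildUserPrompt(args: PromptArgs): string {
  return [
    "Context (live data, authoritative):",
    `- Gold spot price: ${args.pricePerOunceUsd.toFixed(2)} USD per troy ounce`,
    `- As of: ${args.asOf}`,
    "Task: Write a short market note about this price.",
  ].join("\n");
}
```

Keeping the instruction in a constant and the data in typed arguments means the prompt can be versioned, reviewed, and unit-tested like any other code.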
Notice we aren’t just asking for text; we are defining Constraints. This turns a creative writing tool into a far more predictable data processor.
Layer 3: The Inference Client (The Code)
Now, we wire it together. We send the Context + Instruction to the model.
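A minimal sketch of that wiring against the Gemini REST API, without any SDK. The endpoint path, the `x-goog-api-key` header, the request field names, and the model id `gemini-2.5-flash` reflect my understanding of the public `generateContent` documentation and are assumptions to verify; the article does not name a model version. The function names and the short inline prompt are hypothetical. The HTTP transport is injected so the sketch runs offline:

```typescript
// Assumed model id and endpoint; verify both against the current Gemini API docs.
const GEMINI_MODEL = "gemini-2.5-flash";
const GEMINI_URL =
  `https://generativelanguage.googleapis.com/v1beta/models/${GEMINI_MODEL}:generateContent`;

interface GeminiResponse {
  candidates?: { content?: { parts?: { text?: string }[] } }[];
}

// Injectable transport: (url, headers, body) -> parsed JSON.
// On Node 18+, a live version could be:
//   const live: PostJson = async (url, headers, body) =>
//     (await fetch(url, { method: "POST", headers, body: JSON.stringify(body) })).json();
type PostJson = (
  url: string,
  headers: Record<string, string>,
  body: unknown,
) => Promise<GeminiResponse>;

function buildRequestBody(systemInstruction: string, userPrompt: string) {
  return {
    system_instruction: { parts: [{ text: systemInstruction }] },
    contents: [{ role: "user", parts: [{ text: userPrompt }] }],
    generationConfig: { temperature: 0 }, // low temperature for more repeatable output
  };
}

function extractText(response: GeminiResponse): string {
  const parts = response.candidates?.[0]?.content?.parts ?? [];
  const text = parts.map((p) => p.text ?? "").join("").trim();
  if (!text) throw new Error("Model returned no text");
  return text;
}

// Stateless: everything the model needs arrives as arguments on every call.
async function analyseGoldPrice(
  pricePerOunceUsd: number,
  apiKey: string,
  postJson: PostJson,
): Promise<string> {
  const system =
    "You are a gold market analyst. Use only the supplied price. Reply in under 280 characters.";
  const user =
    `Gold spot price: ${pricePerOunceUsd.toFixed(2)} USD per troy ounce. Write a short market note.`;
  const response = await postJson(
    GEMINI_URL,
    { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    buildRequestBody(system, user),
  );
  return extractText(response);
}
```

Setting the temperature to 0 makes output more repeatable but does not guarantee identical responses, so downstream code should still validate what comes back.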
This function is the Inference API. It is Deterministic (mostly), Stateless (doesn’t remember previous calls), and Reliable. You can hook this function up to a Cron Job, a frontend button, or a Slack bot.
Why This Matters
We just transitioned from “AI User” to “AI Engineer.”
- We didn’t train a model. (Too expensive).
- We didn’t fine-tune. (Unnecessary for this task).
- We Engineered Context.
By controlling the inputs (Live Gold Price) and the instructions (Analyst Persona), we forced a general-purpose model (Gemini) to become a specialised domain expert.
Conclusion
Now that we have a function that thinks, how do we give it hands? A brain in a jar is useful, but an agent that can act is revolutionary.
In Part 3, we will explore Automation and Tool Use. We will take this Inference Engine and connect it to the real world, automatically posting to X (Twitter) without human intervention. We are entering the era of the Autonomous Agent.
Stay tuned for The GenAI Playbook by Yasir Gaji — Part 3.
I welcome questions, requests for clarification, and how-tos. Criticism and corrections are welcome as well. Thank you.
Also published on Medium.