The GenAI Playbook by Yasir Gaji — Part 5: RAG & The Art of Sniper Precision

30 January 2026

In Part 4, I wrote about how I built a conversational AI that could answer questions about gold through the CAG (Context-Augmented Generation) pipeline. But it had a problem: it was frozen in time. Ask it “Why did gold spike today?” and it might hallucinate or refuse, because it had no access to current events.

In this part I’ll cover the two upgrades I made to improve the tool:

RAG (Retrieval-Augmented Generation): giving our AI access to real-time news
Efficient Job Scheduling: automating everything while being cost effective

The Problem with “Smart” AI

Large Language Models have a knowledge cutoff. They can’t browse the web in the traditional sense and they don’t know what happened yesterday. For a financial tool, this is fatal.

The solution is RAG (Retrieval-Augmented Generation): instead of hoping the model “knows” something, we retrieve relevant information and inject it into the prompt.

User Question → Embed → Search Vector DB → Retrieve Docs → Fetch Live Prices → Inject into Prompt → LLM → Generate Response

Building the RAG Pipeline

Step 1: The Database Schema

Since I’m using Supabase, I set up a table to store news articles with their vector embeddings and enabled the pgvector extension by running this code in the Supabase SQL editor:

The embedding vector has 3072 dimensions, which matches the output of Google's models/gemini-embedding-001 model. Since I’m using Gemini for this project, it’s the best-suited dimension.

Step 2: The Vector Search Function

After creating the table and enabling the vector extension in Supabase, I created an RPC (Remote Procedure Call) function for similarity search:

This function performs a vector similarity search to find news articles semantically similar to a user’s question. Here’s the breakdown:

What It Does:

Takes the embedding of the user’s question (a 3072-dimensional vector)
Compares it against all stored article embeddings in the database
Calculates cosine similarity between the query and each article
Filters out low matches below the threshold (0.1 in our case)
Returns the top 3 most similar articles with their metadata

The Math:

1-(embedding <=> query_embedding) as similarity

<=> is pgvector's cosine distance operator (0 = identical, 2 = opposite)
1 - distance converts to similarity (1 = identical, 0 = unrelated, -1 = opposite)
So if distance is 0.3, similarity is 0.7 (70% similar)
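To make the arithmetic concrete, here is a small sketch of the same calculation outside the database. The function names and the toy 3-dimensional vectors are illustrative assumptions; pgvector does this work in SQL, and real embeddings in this project have 3072 dimensions.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the value pgvector's <=> operator returns: 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

// Toy vectors, chosen so the three boundary cases are easy to see.
const queryVec = [1, 0, 0];
console.log(cosineDistance(queryVec, [1, 0, 0]));  // 0 (same direction)
console.log(cosineDistance(queryVec, [0, 1, 0]));  // 1 (orthogonal)
console.log(cosineDistance(queryVec, [-1, 0, 0])); // 2 (opposite direction)
```

Subtracting each distance from 1 gives similarities of 1, 0, and -1 respectively, which is the conversion the search function applies.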

Example:

User asks: “Why did gold prices rise today?”

1. Question gets embedded: [0.23, -0.45, 0.67, ...] (3072 numbers)

2. Database compares against stored articles:

“Fed raises interest rates” → similarity: 0.82
“Gold hits new high amid inflation” → similarity: 0.91 ✅
“Stock market crashes” → similarity: 0.35

3. Returns top 3 matches above 0.1 threshold

4. Chat API injects these into the prompt

It’s essentially semantic search: it finds articles by meaning, not just keywords.
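The filter-and-rank step from the example can be sketched in a few lines. This is an in-memory illustration under assumptions, not the SQL function itself: the `ScoredArticle` type and `topMatches` helper are hypothetical names, and the similarity scores are the ones from the example above.

```typescript
interface ScoredArticle {
  title: string;
  similarity: number; // 1 - cosine distance, as computed by the search function
}

// Keep matches above the threshold, best first, capped at matchCount.
function topMatches(
  scored: ScoredArticle[],
  matchThreshold: number,
  matchCount: number
): ScoredArticle[] {
  return scored
    .filter((article) => article.similarity > matchThreshold) // filter() copies, so the input is not mutated
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}

const exampleScores: ScoredArticle[] = [
  { title: "Fed raises interest rates", similarity: 0.82 },
  { title: "Gold hits new high amid inflation", similarity: 0.91 },
  { title: "Stock market crashes", similarity: 0.35 },
];

const topThree = topMatches(exampleScores, 0.1, 3);
console.log(topThree.map((m) => m.title));
// [ 'Gold hits new high amid inflation', 'Fed raises interest rates', 'Stock market crashes' ]
```

With the permissive 0.1 threshold all three articles pass, so the ranking alone decides what the LLM sees first.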

Note: The <=> operator calculates cosine distance. To get similarity (-1 to 1, where 1 means identical), we compute 1 - distance.

Step 3: News Ingestion Service

This service fetches RSS (Really Simple Syndication) feeds, generates embeddings, and stores them:

This service is the knowledge base builder for TGM’s RAG implementation. It continuously populates our knowledge base with embedded news articles, making semantic search possible. Without it, we’d be searching an empty database.

Production Consideration: Parallel Processing

The code above processes feeds sequentially. For production, one could parallelize with Promise.all():

The tradeoff: parallel is faster but may hit API rate limits. Sequential is slower but safer, which is why I used it.
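As a rough sketch of the two approaches side by side: `ingestFeed` below is a hypothetical stand-in for the real per-feed work (fetch, embed, store), and the feed URLs are placeholders.

```typescript
// Hypothetical stand-in for the real per-feed work (fetch, embed, store).
async function ingestFeed(feedUrl: string): Promise<string> {
  await Promise.resolve(); // placeholder for network and embedding calls
  return `ingested:${feedUrl}`;
}

// Sequential: one feed at a time, which is gentler on API rate limits.
async function ingestSequentially(feedUrls: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const url of feedUrls) {
    results.push(await ingestFeed(url));
  }
  return results;
}

// Parallel: every feed starts at once. Promise.all rejects as soon as any feed fails.
async function ingestInParallel(feedUrls: string[]): Promise<string[]> {
  return Promise.all(feedUrls.map((url) => ingestFeed(url)));
}

const demoFeeds = ["https://example.com/feed-a.xml", "https://example.com/feed-b.xml"];
ingestInParallel(demoFeeds).then((done) => console.log(done));
```

Both functions return results in the same order as the input list. If one bad feed should not abort the whole batch, Promise.allSettled() is the usual alternative to Promise.all().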

Step 4: RAG Retrieval in the Chat Route

Now I wired it into the chat endpoint:

This is where RAG comes alive. When a user asks a question like “Why is gold up by 7+% today?”, the chat route does the following:

Convert their question to a vector: it uses the same models/gemini-embedding-001 model used during ingestion. This ensures “semantic compatibility”, so questions and articles live in the same vector space.
Search the knowledge base for similar articles: the match_news RPC calculates cosine similarity between the question vector and all stored article vectors, then returns the 3 most similar articles above our 0.1 threshold. All of this happens in milliseconds thanks to pgvector.
Inject those articles into the LLM’s prompt and generate: it formats the results as markdown links, such as [Gold hits record high](https://...), which become the prompt’s {news_context} variable. If there are no matches, we default to “No recent news available” to avoid confusion. The LLM receives the prompt with the injected news articles and can now say: “Gold prices are currently trading at $5,185.926 driven primarily by…” (citing sources below). The sources come from the database, not the model’s training data.
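The formatting step in the last item could look roughly like this. The `NewsMatch` shape and `formatNewsContext` name are assumptions for illustration; only the markdown-link format and the fallback string come from the description above.

```typescript
interface NewsMatch {
  title: string;
  url: string;
}

const NO_NEWS_FALLBACK = "No recent news available";

// Turn retrieved articles into markdown links, one per line.
// An empty result falls back to an explicit "no news" message.
function formatNewsContext(articles: NewsMatch[]): string {
  if (articles.length === 0) return NO_NEWS_FALLBACK;
  return articles.map((a) => `[${a.title}](${a.url})`).join("\n");
}

console.log(
  formatNewsContext([
    { title: "Gold hits record high", url: "https://example.com/gold-record" },
  ])
);
// [Gold hits record high](https://example.com/gold-record)
```

The explicit fallback matters: an empty string in the prompt gives the model nothing to anchor on, while a clear "no news" message tells it not to invent a cause.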

Step 5: The Prompt Template

For the tool to work well, the LLM has to receive the retrieved news as context:

This establishes a clearer identity and boundaries. It’s also critical because, without explicit instructions, the LLM might:

Ignore the provided news
Hallucinate reasons instead
Not cite sources

This way I’m teaching it: “The answer is in the context I gave you. Use it.”
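A minimal sketch of such a template is shown here. The wording of the instructions and the {question} placeholder are hypothetical; only the {news_context} placeholder name is taken from the chat route described earlier.

```typescript
// Hypothetical template text. Only the {news_context} name comes from the article.
const PROMPT_TEMPLATE = [
  "You are a gold market assistant.",
  "Answer using ONLY the news in the context below.",
  "If the context does not explain the move, say so instead of guessing.",
  "Cite the sources you used as markdown links.",
  "",
  "Recent news:",
  "{news_context}",
  "",
  "User question: {question}",
].join("\n");

function buildPrompt(newsContext: string, question: string): string {
  const values: Record<string, string> = {
    news_context: newsContext,
    question,
  };
  // Single pass, so placeholder-like text inside the inputs is not expanded again.
  return PROMPT_TEMPLATE.replace(
    /\{(news_context|question)\}/g,
    (_match: string, key: string) => values[key]
  );
}

console.log(
  buildPrompt(
    "[Gold hits record high](https://example.com/gold-record)",
    "Why did gold prices rise today?"
  )
);
```

Each instruction line targets one of the failure modes listed above: ignoring the news, inventing reasons, and skipping citations.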

Tuning RAG: The Threshold Tradeoff

I’ve been mentioning the 0.1 threshold value, and in the chat endpoint you’d notice this line: match_threshold: 0.1. A threshold of 0.1 is permissive.

Now I could have selected other values like the ones in this table:

But I chose 0.1 because:

Gold news is niche: most articles in our database are already relevant.
Returning something is better than returning nothing for UX.
The LLM can ignore irrelevant context.

For a general-purpose RAG system, you’d want a higher threshold (0.5–0.6).
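To see the tradeoff in numbers, this sketch counts how many of the three example articles from earlier survive at different thresholds. The `countAbove` helper and the 0.85 value are illustrative assumptions.

```typescript
// Similarity scores from the earlier example (Fed, Gold, Stock market).
const exampleSimilarities = [0.82, 0.91, 0.35];

// How many articles would the search return at a given threshold?
function countAbove(similarities: number[], threshold: number): number {
  return similarities.filter((s) => s > threshold).length;
}

for (const threshold of [0.1, 0.5, 0.85]) {
  console.log(`threshold ${threshold}: ${countAbove(exampleSimilarities, threshold)} match(es)`);
}
// threshold 0.1: 3 match(es)
// threshold 0.5: 2 match(es)
// threshold 0.85: 1 match(es)
```

A stricter threshold trims loosely related articles, at the risk of returning nothing on a quiet news day.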

Article Freshness

In my first implementation of match_news there was a gap: it retrieved articles but did not account for their age. A 6-month-old article could surface for today’s question. The database search was “Semantically Smart” but “Chronologically Stupid”: it looked for the best match in meaning, even if that match was news from 2023. In financial markets, old news is worse than no news.

To fix this, I modified the match_news function:

This ensures users get current context and not stale news. (Note: This strictly limits the tool to recent events; for historical analysis, I would remove this filter or use a recency decay function instead.)
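Both options mentioned in the note, a hard cutoff and a recency decay, can be sketched in application code. The 48-hour window and 24-hour half-life below are assumed values for illustration, and the helper names are hypothetical; the real filter lives inside the match_news SQL function.

```typescript
interface DatedMatch {
  title: string;
  similarity: number;
  publishedAt: Date;
}

const HOUR_MS = 60 * 60 * 1000;

// Hard cutoff: drop anything older than maxAgeHours.
function filterRecent(items: DatedMatch[], now: Date, maxAgeHours: number): DatedMatch[] {
  const cutoff = now.getTime() - maxAgeHours * HOUR_MS;
  return items.filter((m) => m.publishedAt.getTime() >= cutoff);
}

// Soft alternative: halve the similarity score every halfLifeHours.
function decayedScore(m: DatedMatch, now: Date, halfLifeHours: number): number {
  const ageHours = Math.max(0, (now.getTime() - m.publishedAt.getTime()) / HOUR_MS);
  return m.similarity * Math.pow(0.5, ageHours / halfLifeHours);
}

const nowExample = new Date("2026-01-30T12:00:00Z");
const datedMatches: DatedMatch[] = [
  { title: "Gold hits new high", similarity: 0.8, publishedAt: new Date("2026-01-29T12:00:00Z") },
  { title: "Old rally recap", similarity: 0.9, publishedAt: new Date("2025-07-30T12:00:00Z") },
];

console.log(filterRecent(datedMatches, nowExample, 48).map((m) => m.title)); // [ 'Gold hits new high' ]
console.log(decayedScore(datedMatches[0], nowExample, 24)); // one half-life old, so the score is halved
```

The hard cutoff is simpler and matches the "recent events only" behavior. The decay keeps older articles available but pushes them down the ranking, which suits historical questions better.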

Automating Everything: The Cron Challenge

Now I have three recurring jobs:

Daily Post: TGM tweets daily market price updates at 2 PM WAT
Record Price: TGM saves the gold price at 9 PM WAT
Ingest News: TGM fetches RSS feeds every 4 hours

The naive approach would be to create three separate Vercel cron jobs, but this becomes expensive and harder to maintain over time, given that the project is progressing and more cron jobs are likely. Each cron job on Vercel is another endpoint to monitor, another potential failure point, and another line item as we scale.

So I chose the dispatcher pattern. It offers a cleaner and more cost-effective solution: one scheduled endpoint that orchestrates multiple jobs based on time logic.

Benefits:

Cost efficiency: Single cron instead of multiple endpoints
Centralized control: All scheduling logic in one place
Easier debugging: One log stream and one health check
Flexibility: Change schedules without redeploying multiple endpoints

The Dispatcher Pattern

Instead of multiple crons, I created one endpoint that decides what to run based on the current time:
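The time logic at the heart of such a dispatcher can be sketched as a pure function. This is an illustration under assumptions rather than the production endpoint: it assumes the endpoint is hit once per hour, that the 4-hourly ingest runs at 00, 04, 08, and so on in WAT, and the job names are hypothetical labels.

```typescript
type JobName = "dailyPost" | "recordPrice" | "ingestNews";

// WAT (West Africa Time) is UTC+1 with no daylight saving time.
function watHour(now: Date): number {
  return (now.getUTCHours() + 1) % 24;
}

// Decide which jobs are due, assuming one invocation per hour.
function jobsDueAt(now: Date): JobName[] {
  const hour = watHour(now);
  const due: JobName[] = [];
  if (hour === 14) due.push("dailyPost");     // 2 PM WAT
  if (hour === 21) due.push("recordPrice");   // 9 PM WAT
  if (hour % 4 === 0) due.push("ingestNews"); // 00, 04, 08, ... WAT (assumed slots)
  return due;
}

console.log(jobsDueAt(new Date("2026-01-30T13:05:00Z"))); // 14:05 WAT -> [ 'dailyPost' ]
console.log(jobsDueAt(new Date("2026-01-30T10:00:00Z"))); // 11:00 WAT -> []
```

Keeping the decision in a pure function of the current time makes the schedule easy to unit test without waiting for the clock.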

The Service Layer

Notice that I import functions rather than make HTTP requests. This is intentional:

Why direct function calls instead of internal fetch()?

No cold start latency for a second serverless invocation
No network overhead
Better error handling and stack traces
Simpler to test
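A rough sketch of what calling service functions in-process can look like is given below. The `runJobs` helper and the two stub services are hypothetical; in a real project the services would be imported from their own modules.

```typescript
type Job = () => Promise<string>;

interface JobResult {
  job: string;
  ok: boolean;
  detail: string;
}

// Run each job in-process. A failure is recorded but does not stop the others.
async function runJobs(jobs: Array<{ name: string; run: Job }>): Promise<JobResult[]> {
  const results: JobResult[] = [];
  for (const { name, run } of jobs) {
    try {
      results.push({ job: name, ok: true, detail: await run() });
    } catch (err) {
      const detail = err instanceof Error ? err.message : String(err);
      results.push({ job: name, ok: false, detail });
    }
  }
  return results;
}

// Hypothetical service functions standing in for the real imports.
const recordPrice: Job = async () => "price saved";
const ingestNews: Job = async () => {
  throw new Error("feed timeout");
};

runJobs([
  { name: "recordPrice", run: recordPrice },
  { name: "ingestNews", run: ingestNews },
]).then((results) => console.log(results));
```

Because the call is a plain function invocation, the thrown error arrives with its original message and stack trace instead of being flattened into an HTTP status code.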

The Cron-Job.org Solution

Then I used Cron-Job.org as a free external cron job scheduler.

Step 1: Remove Vercel Crons

Step 2: Create the Cron Job and configure secrets

Create a new cron job with these settings:

Why This Works

Free: Unlimited cron jobs on the free tier
Reliable: 99.9% uptime, dedicated cron infrastructure
Simple: No code changes needed, just configuration
Flexible: Easy to pause/resume or adjust schedules via web UI

Alternative: GitHub Actions

If you prefer keeping everything in your repo, GitHub Actions is another solid option with 2,000 free minutes/month for private repos (unlimited for public). However, cron-job.org is simpler since it doesn’t require committing workflow files or managing GitHub secrets.

Conclusion

What I’ve built so far:

Of course, this is just the tip of the iceberg. What I’ve built is a robust ‘Stage 1’ RAG system. The holy grail lies in Tool & Function Calling and Multi-Hop Reasoning, which transform the AI from a passive reader of news into an active researcher, trader, and consultant that can autonomously query the database, verify facts, reason through complex market scenarios, and decide when to buy and sell, rather than relying on a hard-coded retrieval step.

Key Takeaways

RAG isn’t magic: it’s just “search + inject into prompt”
Thresholds matter: tune based on your domain’s specificity
Freshness matters: stale context is worse than no context
Platform limits are solvable: External schedulers like cron-job.org are a free escape hatch
Direct function calls > internal fetch: simpler, faster, and more reliable.

I expect questions, corrections, and criticisms. Share. Thank you.

Also published on Medium.