The GenAI Playbook by Yasir Gaji — Part 5: RAG & The Art of Sniper Precision
30 January 2026

In Part 4, I wrote about how I built a conversational AI that could answer questions about gold using a CAG (Context-Augmented Generation) pipeline. But it had a problem: it was frozen in time. Ask it “Why did gold spike today?” and it might hallucinate or refuse, because it had no access to current events.
In this part, I cover the two upgrades I made to improve the tool:
- RAG (Retrieval-Augmented Generation): giving our AI access to real-time news
- Efficient Job Scheduling: automating everything while being cost effective
The Problem with “Smart” AI
Large Language Models have a knowledge cutoff. They can’t browse the web in the traditional sense, so they don’t know what happened yesterday. For a financial tool, this is fatal.
The solution is RAG (Retrieval-Augmented Generation): instead of hoping the model “knows” something, we retrieve relevant information and inject it into the prompt.
User Question → Embed → Search Vector DB → Retrieve Docs → Fetch Live Prices → Inject into Prompt → LLM → Generate Response
Building the RAG Pipeline
Step 1: The Database Schema
Since I’m using Supabase, I set up a table to store news articles with their vector embeddings and enabled the pgvector extension by running this code in the Supabase SQL editor:
The embedding vector uses 3072 dimensions, which matches the output of Google's models/gemini-embedding-001 model. Since I’m using Gemini for this project, it is the natural choice.
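Because the column is fixed at 3072 dimensions, pgvector rejects an insert whose vector has a different length. A small guard on the application side can catch that earlier. This is a minimal sketch; `EMBEDDING_DIM` and `assertEmbeddingDimension` are hypothetical names, not part of the original code:

```typescript
// Hypothetical constant: matches the 3072-dimension column described above.
const EMBEDDING_DIM = 3072;

// Returns the vector unchanged, or throws if its length differs from the expected dimension.
function assertEmbeddingDimension(
  embedding: number[],
  expected: number = EMBEDDING_DIM
): number[] {
  if (embedding.length !== expected) {
    throw new Error(`Expected ${expected} dimensions, got ${embedding.length}`);
  }
  return embedding;
}
```

Calling this right after the embedding request and before the database insert would turn a model or configuration mismatch into an immediate, readable error.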
Step 2: The Vector Search Function
After creating the table and enabling the vector extension in Supabase, I created an RPC (Remote Procedure Call) function for similarity search:
This function performs a vector similarity search to find news articles semantically similar to a user’s question. Here’s the breakdown:
What It Does:
- Takes the user question’s embedding (a 3072-dimensional vector)
- Compares it against all stored article embeddings in the database
- Calculates cosine similarity between the query and each article
- Filters out low matches below the threshold (0.1 in our case)
- Returns the top 3 most similar articles with their metadata
The Math:
1 - (embedding <=> query_embedding) as similarity
- <=> is pgvector's cosine distance operator (0 = identical, 2 = opposite)
- 1 - distance converts to similarity (1 = identical, 0 = unrelated, -1 = opposite)
- So if distance is 0.3, similarity is 0.7 (70% similar)
Example:
User asks: “Why did gold prices rise today?”
1. The question gets embedded: [0.23, -0.45, 0.67, ...] (3072 numbers)
2. The database compares it against stored articles:
   - “Fed raises interest rates” → similarity: 0.82
   - “Gold hits new high amid inflation” → similarity: 0.91 ✅
   - “Stock market crashes” → similarity: 0.35
3. Returns the top 3 matches above the 0.1 threshold
4. The chat API injects these into the prompt
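The filtering and ranking step in this example can be sketched in a few lines. This is an in-memory illustration of the logic only, not the SQL function itself; `ScoredArticle` and `topMatches` are hypothetical names, and the similarity scores are the ones from the example above:

```typescript
interface ScoredArticle {
  title: string;
  similarity: number;
}

// Keep articles above the threshold, sort best-first, and return at most `limit` of them.
function topMatches(
  articles: ScoredArticle[],
  threshold: number = 0.1,
  limit: number = 3
): ScoredArticle[] {
  return articles
    .filter((article) => article.similarity > threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}

const matches = topMatches([
  { title: "Fed raises interest rates", similarity: 0.82 },
  { title: "Gold hits new high amid inflation", similarity: 0.91 },
  { title: "Stock market crashes", similarity: 0.35 },
]);
// All three clear the 0.1 threshold, so all three come back, best match first.
```

With a stricter threshold such as 0.5, the “Stock market crashes” article would be dropped and only two matches would remain.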
It’s essentially semantic search: it finds articles by meaning and not just keywords.
Note: The <=> operator calculates cosine distance. To get similarity (1 = identical, lower = less similar), we compute 1 - distance.
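To make the arithmetic concrete, here is a minimal sketch of the same calculation in plain TypeScript. It follows the standard definition of cosine distance (1 minus the cosine of the angle between two vectors); the function names are mine, and tiny 2-dimensional vectors stand in for the real 3072-dimensional ones:

```typescript
// Cosine distance: 1 - cos(angle between a and b).
// Note: a zero-length vector produces NaN, because the denominator is 0.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors `1 - (embedding <=> query_embedding)` from the SQL expression.
function cosineSimilarity(a: number[], b: number[]): number {
  return 1 - cosineDistance(a, b);
}

const sameDirection = cosineDistance([1, 0], [1, 0]); // 0: identical direction
const orthogonal = cosineDistance([1, 0], [0, 1]); // 1: unrelated
const opposite = cosineDistance([1, 0], [-1, 0]); // 2: opposite direction
```

In practice, embeddings of natural-language text rarely point in opposite directions, which is why scores usually land between 0 and 1.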
Step 3: News Ingestion Service
This service fetches RSS (Really Simple Syndication) feeds, generates embeddings, and stores them:
This service is the knowledge base builder for TGM’s RAG implementation. It continuously populates our knowledge base with embedded news articles, making semantic search possible. Without it, we’d be searching an empty database.
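The shape of such a service can be sketched with its three steps passed in as functions. Everything named here (`FeedItem`, `IngestDeps`, `ingestFeed`) is hypothetical, and embedding the title plus summary is my assumption, since the original service is not reproduced in this post:

```typescript
// Hypothetical shapes for illustration only.
interface FeedItem {
  title: string;
  link: string;
  summary: string;
}

interface StoredArticle extends FeedItem {
  embedding: number[];
}

interface IngestDeps {
  fetchItems: (feedUrl: string) => Promise<FeedItem[]>; // e.g. an RSS parser
  embed: (text: string) => Promise<number[]>; // e.g. a call to the embedding model
  store: (article: StoredArticle) => Promise<void>; // e.g. an insert into the news table
}

// Fetch one feed, embed each item, and store it. Returns the number of stored articles.
async function ingestFeed(feedUrl: string, deps: IngestDeps): Promise<number> {
  const items = await deps.fetchItems(feedUrl);
  let stored = 0;
  for (const item of items) {
    // Assumption: title plus summary is the text that gets embedded.
    const embedding = await deps.embed(`${item.title}\n${item.summary}`);
    await deps.store({ ...item, embedding });
    stored++;
  }
  return stored;
}
```

Passing the dependencies in keeps the loop testable with fakes, without touching the network or the database.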
Production Consideration: Parallel Processing
The code above processes feeds sequentially. For production, one could parallelize with Promise.all():
The tradeoff: parallel is faster but may hit API rate limits. Sequential is slower but safer, which is why I used it.
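The two approaches differ by only a few lines. In this sketch, `FeedWorker` is a hypothetical stand-in for whatever function ingests a single feed and returns how many articles it stored:

```typescript
// Hypothetical per-feed worker; stands in for the real fetch + embed + store logic.
type FeedWorker = (feedUrl: string) => Promise<number>;

// Sequential: one feed at a time, so only one feed's API calls are in flight.
async function ingestSequential(feeds: string[], worker: FeedWorker): Promise<number> {
  let total = 0;
  for (const feed of feeds) {
    total += await worker(feed);
  }
  return total;
}

// Parallel: every feed starts immediately, so all of them hit the embedding API at once.
async function ingestParallel(feeds: string[], worker: FeedWorker): Promise<number> {
  const counts = await Promise.all(feeds.map((feed) => worker(feed)));
  return counts.reduce((sum, n) => sum + n, 0);
}
```

One more difference worth knowing: Promise.all() rejects as soon as any single feed fails, so a parallel version usually needs per-feed error handling inside the worker.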
Step 4: RAG Retrieval in the Chat Route
Next, I wired it into the chat endpoint:
This is where RAG comes alive. When a user asks a question like “Why is gold up by 7+% today”, the endpoint:
- Converts the question to a vector using the same models/gemini-embedding-001 model used during ingestion. This ensures “semantic compatibility”: questions and articles live in the same vector space.
- Searches the knowledge base for similar articles. The match_news RPC calculates cosine similarity between the question vector and all stored article vectors, then returns the 3 most similar articles above our 0.1 threshold. All of this happens in milliseconds thanks to pgvector.
- Injects those articles into the LLM’s prompt and generates a response. It formats the results as markdown links such as [Gold hits record high](https://...), which become part of the prompt’s {news_context} variable. If there are no matches, we default to “No recent news available” to avoid confusion.
This way the LLM receives the prompt with the injected news articles and can now say: “Gold prices are currently trading at $5,185.926 driven primarily by…” (citing sources below), and the sources come from the database, not the model’s training data.
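The three steps can be sketched as follows. This is not the original route code: `RpcClient` is a minimal stand-in for the Supabase client's `rpc` method, the `title` and `url` fields and the `match_count` parameter name are assumptions, and falling back to the empty-context message on an RPC error is my choice for the sketch:

```typescript
interface NewsMatch {
  title: string;
  url: string;
  similarity: number;
}

// Minimal stand-in covering only the one client method used here.
interface RpcClient {
  rpc(
    fn: string,
    args: Record<string, unknown>
  ): PromiseLike<{ data: NewsMatch[] | null; error: { message: string } | null }>;
}

// Format matches as markdown links, or return the default message when there are none.
function formatNewsContext(matches: NewsMatch[]): string {
  if (matches.length === 0) return "No recent news available";
  return matches.map((m) => `- [${m.title}](${m.url})`).join("\n");
}

async function retrieveNewsContext(
  question: string,
  embed: (text: string) => Promise<number[]>,
  client: RpcClient
): Promise<string> {
  // 1. Embed the question with the same model used at ingestion time.
  const queryEmbedding = await embed(question);
  // 2. Ask the database for the closest articles.
  const { data, error } = await client.rpc("match_news", {
    query_embedding: queryEmbedding,
    match_threshold: 0.1,
    match_count: 3, // assumed parameter name
  });
  // 3. Format for the prompt. Assumption: an RPC error degrades to "no news" instead of failing the chat.
  if (error || !data) return formatNewsContext([]);
  return formatNewsContext(data);
}
```

The returned string is what would be substituted for {news_context} in the prompt.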

Step 5: The Prompt Template
For the tool to work well, the LLM has to receive the retrieved news as context:
This establishes a much clearer identity and set of boundaries. It’s also critical because, without explicit instructions, the LLM might:
- Ignore the provided news
- Hallucinate reasons instead
- Fail to cite sources
This way I’m teaching it: “The answer is in the context I gave you. Use it.”
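As an illustration of the idea, here is a minimal sketch of a template with a {news_context} placeholder and a function that fills it. The template wording is hypothetical (the original prompt is not reproduced in this post), and `fillTemplate` is a name I made up:

```typescript
// Hypothetical template text, for illustration only.
const PROMPT_TEMPLATE = [
  "You are a gold market assistant.",
  "Answer using ONLY the news context below. If it does not contain the answer, say so.",
  "Cite the sources you used as markdown links.",
  "",
  "News context:",
  "{news_context}",
  "",
  "Question: {question}",
].join("\n");

// Replace each {name} placeholder with its value; unknown placeholders are left untouched.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (placeholder: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : placeholder
  );
}

const prompt = fillTemplate(PROMPT_TEMPLATE, {
  news_context: "- [Gold hits record high](https://example.com/gold)",
  question: "Why did gold prices rise today?",
});
```

Because the substitution is a single pass over the template, braces that happen to appear inside a news headline are not expanded a second time.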
Tuning RAG: The Threshold Tradeoff
I’ve been mentioning the 0.1 threshold value, and in the chat endpoint you’d notice this line: match_threshold: 0.1. A threshold of 0.1 is deliberately permissive.
I could have selected other values, like the ones in this table:

But I chose 0.1 because:
- Gold news is niche, so most articles in our database are already relevant.
- Returning something is better than returning nothing for UX.
- The LLM can ignore irrelevant context.
For a general-purpose RAG system, you’d want a higher threshold (0.5–0.6).
Article Freshness
My first implementation of match_news had a gap: it retrieved articles but did not account for their age. A 6-month-old article could surface for today’s question, so the database search was “Semantically Smart” but “Chronologically Stupid.” It looked for the best match in meaning, even if that match was news from 2023, and in financial markets old news is worse than no news.
To fix this, I modified the match_news function:
This ensures users get current context and not stale news. (Note: This strictly limits the tool to recent events; for historical analysis, I would remove this filter or use a recency decay function instead.)
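The two options mentioned here, a hard cutoff and a recency decay, can be compared side by side. This is an application-side illustration rather than the modified SQL function; `DatedMatch`, `maxAgeHours`, and `halfLifeHours` are hypothetical names and parameters:

```typescript
interface DatedMatch {
  title: string;
  similarity: number;
  publishedAt: Date;
}

const HOUR_MS = 60 * 60 * 1000;

// Hard cutoff: drop anything older than maxAgeHours.
function filterFresh(matches: DatedMatch[], now: Date, maxAgeHours: number): DatedMatch[] {
  return matches.filter(
    (m) => now.getTime() - m.publishedAt.getTime() <= maxAgeHours * HOUR_MS
  );
}

// Soft alternative: exponential decay that halves the score every halfLifeHours.
function decayedScore(match: DatedMatch, now: Date, halfLifeHours: number): number {
  const ageHours = Math.max(0, now.getTime() - match.publishedAt.getTime()) / HOUR_MS;
  return match.similarity * Math.pow(0.5, ageHours / halfLifeHours);
}
```

With a 24-hour half-life, an article published exactly one day ago keeps half of its similarity score, so it can still surface if nothing fresher is relevant.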
Automating Everything: The Cron Challenge
Now I have three recurring jobs:
- Daily Post: TGM tweets daily market price updates at 2 PM WAT
- Record Price: TGM saves the gold price at 9 PM WAT
- Ingest News: TGM fetches RSS feeds every 4 hours
The naive approach would be to create three separate Vercel cron jobs, but this becomes expensive and harder to maintain over time, given that the project is progressing and more cron jobs are likely to follow. Each cron job on Vercel is another endpoint to monitor, another potential failure point, and another line item as we scale.
So I chose the dispatcher pattern, which offers a cleaner and more cost-effective solution: one scheduled endpoint that orchestrates multiple jobs based on time logic.
Benefits:
- Cost efficiency: Single cron instead of multiple endpoints
- Centralized control: All scheduling logic in one place
- Easier debugging: One log stream and one health check
- Flexibility: Change schedules without redeploying multiple endpoints
The Dispatcher Pattern
Instead of multiple crons, I created one endpoint that decides what to run based on the current time:
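The time logic at the heart of such an endpoint can be sketched as a pure function. This assumes the endpoint is triggered once an hour and that the 4-hourly news ingestion runs at hours divisible by 4 in WAT; both are my assumptions, and `JobName` and `jobsDue` are hypothetical names:

```typescript
type JobName = "dailyPost" | "recordPrice" | "ingestNews";

const WAT_OFFSET_HOURS = 1; // West Africa Time is UTC+1, with no daylight saving

// Decide which jobs are due for the hour in which `now` falls.
function jobsDue(now: Date): JobName[] {
  const hourWat = (now.getUTCHours() + WAT_OFFSET_HOURS) % 24;
  const due: JobName[] = [];
  if (hourWat === 14) due.push("dailyPost"); // 2 PM WAT
  if (hourWat === 21) due.push("recordPrice"); // 9 PM WAT
  if (hourWat % 4 === 0) due.push("ingestNews"); // assumed slots: 00, 04, 08, ... WAT
  return due;
}
```

Keeping the decision separate from the job execution makes the schedule easy to unit test with fixed dates, and adding a fourth job becomes a one-line change.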
The Service Layer
Notice that I import functions rather than make HTTP requests. This is intentional:
Why direct function calls instead of internal fetch()?
- No cold start latency for a second serverless invocation
- No network overhead
- Better error handling and stack traces
- Simpler to test
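A minimal sketch of what direct calls can look like inside the dispatcher follows. The per-job try/catch, which lets the remaining jobs run when one fails, is my addition for the sketch; `Job`, `JobResult`, and `runJobs` are hypothetical names:

```typescript
type Job = () => Promise<void>;

interface JobResult {
  name: string;
  ok: boolean;
  error?: string;
}

// Run each due job by calling its function directly; one failure does not stop the rest.
async function runJobs(jobs: Array<{ name: string; run: Job }>): Promise<JobResult[]> {
  const results: JobResult[] = [];
  for (const job of jobs) {
    try {
      await job.run();
      results.push({ name: job.name, ok: true });
    } catch (err) {
      // The original stack trace is still available on `err` for logging.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ name: job.name, ok: false, error: message });
    }
  }
  return results;
}
```

Because each job is just an imported function, a test can pass in fakes and assert on the returned results without starting a server.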
The Cron-Job.org Solution
Then I used Cron-Job.org as a free external cron job scheduler.
Step 1: Remove Vercel Crons
Step 2: Create the Cron Job and configure secrets
Create a new cron job with these settings:

Why This Works
- Free: Unlimited cron jobs on the free tier
- Reliable: 99.9% uptime, dedicated cron infrastructure
- Simple: No code changes needed, just configuration
- Flexible: Easy to pause/resume or adjust schedules via web UI
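Since the dispatcher endpoint is now called from outside Vercel, it needs to check the secret configured in Step 2. This is a minimal sketch, assuming the scheduler is set up to send the secret as an `Authorization: Bearer <secret>` header; the header scheme and the function name are my assumptions:

```typescript
// Returns true only when the header is exactly "Bearer <secret>".
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  secret: string
): boolean {
  if (!secret) return false; // refuse everything if the secret is not configured
  return authorizationHeader === `Bearer ${secret}`;
}
```

The endpoint would call this first and return a 401 response when it yields false, so random visitors cannot trigger the jobs.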
Alternative: GitHub Actions
If you prefer keeping everything in your repo, GitHub Actions is another solid option with 2,000 free minutes/month for private repos (unlimited for public). However, cron-job.org is simpler since it doesn’t require committing workflow files or managing GitHub secrets.
Conclusion
What I’ve built so far:
Of course, this is just the tip of the iceberg. What I’ve built is a robust ‘Stage 1’ RAG system. The holy grail lies in Tool & Function Calling and Multi-Hop Reasoning, which transform the AI from a passive reader of news into an active researcher, trader, and consultant: one that autonomously queries the database, verifies facts, reasons through complex market scenarios, and decides when to buy and sell, rather than relying on a hard-coded retrieval step.
Key Takeaways
- RAG isn’t magic: it’s just “search + inject into prompt”
- Thresholds matter: tune based on your domain’s specificity
- Freshness matters: stale context is worse than no context
- Platform limits are solvable: External schedulers like cron-job.org are a free escape hatch
- Direct function calls > internal fetch: simpler, faster, and more reliable.
I expect questions, corrections, and criticisms. Share. Thank you.
Also published on Medium.