Building an AI Shopping Assistant from Scratch

When we set out to build Spark — Starseek's AI shopping assistant — we had a clear goal and almost no playbook. The goal: let people describe what they want in plain English and get product recommendations that actually make sense. The playbook: nonexistent, because nobody had done this the way we wanted to do it.

Most AI-powered product recommendations work in one of two ways. Either they're collaborative filtering ("people who bought X also bought Y") or they're keyword extraction ("you said 'blue dress,' here are blue dresses"). Both are useful. Neither is what we were after.

We wanted something that could hold a conversation, understand nuance, search across multiple platforms in real time, and learn from the interaction. Basically, we wanted to build the shopping assistant that would put every other product recommendation system to shame. Here's how we approached it.

The RAG pipeline

At Spark's core is a Retrieval-Augmented Generation (RAG) pipeline. If you're not familiar, RAG is a pattern that combines a large language model's conversational ability with a search system that retrieves relevant context. The LLM doesn't try to remember every product in our catalog. Instead, we retrieve relevant products from our search system and pass them to the model, which uses them to construct a helpful response.

Here's the flow:

User sends a message ("I need a warm jacket for hiking in Colorado in November")
We generate an embedding vector from the message using Google's gemini-embedding-001 model (1024 dimensions)
The vector and the original query text go to our OpenSearch index, which runs a hybrid search: k-NN cosine similarity on the vector for semantic matching plus BM25 on the text for keyword matching
Top results come back as product context
The full conversation history + product context goes to Google Gemini
Gemini generates a response that references specific products, explains trade-offs, and asks follow-up questions
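
The steps above can be sketched as a single function. This is a minimal illustration rather than Spark's actual code: embed, search, and generate are hypothetical callables standing in for the embedding model, the OpenSearch hybrid query, and the Gemini call, and the prompt layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class Product:
    product_id: str
    title: str


@dataclass
class Session:
    # Alternating "user: ..." / "assistant: ..." lines for this conversation.
    history: list[str] = field(default_factory=list)


def answer(
    message: str,
    session: Session,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[str, Sequence[float], int], list[Product]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Run one retrieve-then-generate turn (all three callables are placeholders)."""
    session.history.append(f"user: {message}")
    vector = embed(message)                    # message -> embedding vector
    products = search(message, vector, top_k)  # hybrid search, top results
    context = "\n".join(f"- {p.title} ({p.product_id})" for p in products)
    prompt = (
        "Conversation so far:\n" + "\n".join(session.history)
        + "\n\nCandidate products:\n" + context
    )
    reply = generate(prompt)                   # LLM writes the response
    session.history.append(f"assistant: {reply}")
    return reply
```

Passing the three stages in as callables keeps the turn logic testable without network access; each one can be swapped for a stub.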

This sounds straightforward on paper. In practice, every step has pitfalls.

Embeddings that understand shopping intent

The embedding model is the foundation of everything. If "warm jacket for hiking in Colorado in November" and "insulated waterproof men's shell" don't land near each other in vector space, the whole system falls apart.

We spent weeks evaluating embedding models. The challenge with shopping queries is that they mix functional requirements ("waterproof"), contextual information ("Colorado in November"), activity context ("hiking"), and implicit preferences (the user probably wants something rugged, not a fashion parka). Generic text embeddings handle some of this, but they're not optimized for product understanding.

We landed on Google's Gemini embedding model with 1024-dimensional vectors, paired with careful prompt engineering on the query side. Before we embed a user's message, we expand it using conversation context. If they said "I'm going to Colorado for a hiking trip" three messages ago and now say "what about jackets?", we embed the enriched query, not just the three-word follow-up.
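
As a rough illustration of query expansion, the sketch below simply prepends the most recent user turns to the new message before embedding. The function name, the max_turns parameter, and plain concatenation are all assumptions; the real expansion step may well be smarter (for example, an LLM rewrite of the query).

```python
def enrich_query(user_turns: list[str], message: str, max_turns: int = 3) -> str:
    """Prepend recent user turns so a short follow-up carries its context."""
    # Guard max_turns <= 0 explicitly: user_turns[-0:] would return the whole list.
    recent = user_turns[-max_turns:] if max_turns > 0 else []
    return " ".join([*recent, message])


turns = [
    "I'm going to Colorado for a hiking trip",
    "How cold does it get in November?",
    "Good to know",
]
enriched = enrich_query(turns, "what about jackets?")
# `enriched` now mentions Colorado, hiking, and November alongside "jackets",
# and that combined string is what would be embedded.
```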

Hybrid search: the best of both worlds

Pure vector search is great for semantic understanding but can miss exact matches. If someone asks for "Patagonia Nano Puff," you want the Patagonia Nano Puff, not the semantically similar North Face ThermoBall. Pure keyword search (BM25) handles exact names well but misses intent.

Our hybrid approach in OpenSearch combines both. We run k-NN vector search and BM25 text matching in parallel, then merge the results with tunable weights. For queries that look like specific product names, we boost BM25. For exploratory queries ("something cozy for working from home"), we boost k-NN. The classifier that detects query type is simple — mostly based on whether the query contains brand names or model numbers — but it makes a meaningful difference in result quality.
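
A minimal sketch of the idea, done client-side in plain Python rather than through OpenSearch's own hybrid query machinery. The brand list, the digit-based model-number heuristic, the 0.7/0.3 weights, and min-max normalization are all illustrative assumptions, not the values we run in production.

```python
import re

KNOWN_BRANDS = {"patagonia", "nike", "arc'teryx"}  # hypothetical brand list
HAS_DIGIT_TOKEN = re.compile(r"\b\w*\d\w*\b")       # crude model-number signal


def looks_like_product_lookup(query: str) -> bool:
    """True if the query names a known brand or contains a token with a digit."""
    q = query.lower()
    return any(brand in q for brand in KNOWN_BRANDS) or bool(HAS_DIGIT_TOKEN.search(q))


def search_weights(query: str) -> tuple[float, float]:
    """Return (bm25_weight, knn_weight); boost BM25 for lookups, k-NN otherwise."""
    return (0.7, 0.3) if looks_like_product_lookup(query) else (0.3, 0.7)


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge(bm25: dict[str, float], knn: dict[str, float], query: str) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores, best first; missing scores count as 0."""
    w_bm25, w_knn = search_weights(query)
    b, k = min_max(bm25), min_max(knn)
    combined = {
        doc: w_bm25 * b.get(doc, 0.0) + w_knn * k.get(doc, 0.0)
        for doc in b.keys() | k.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Normalizing before combining matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized weighted sum would let BM25 dominate regardless of the weights.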

The personalization layer

Here's where it gets interesting. The RAG pipeline gives us relevant products. The personalization layer makes them personally relevant.

Every user interaction feeds into a taste profile. We track category affinities (you browse a lot of outdoor gear), brand affinities (you tend to click on Patagonia and Arc'teryx), price sensitivity (the products you view mostly fall in the $80-150 range), and style signals (you favor earth tones and minimal branding). These signals get combined into a user vector that modifies search results at query time.

The ranking pipeline has four stages:

Nomination — multiple nominators (k-NN, co-visitation, trending, affinity-based, exploration) each contribute candidate products
Scoring — a weighted linear model scores each candidate across similarity, recency, popularity, persona match, price fit, and exploration
Re-ranking — we enforce diversity constraints (no more than 3 products from the same brand, collapse product variants, suppress recently purchased items, inject exploration candidates)
Caching — results are cached per surface type (homepage, product detail page, category page, cart) in Redis
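
To make the scoring stage concrete, here is a hedged sketch of a weighted linear scorer over the six features named above. The weight values and the shape of the price_fit function are assumptions chosen for illustration; only the feature names come from our pipeline.

```python
# Hypothetical weights: the feature names match the scoring stage,
# but these particular values are made up for the example.
WEIGHTS = {
    "similarity": 0.40,
    "recency": 0.10,
    "popularity": 0.10,
    "persona_match": 0.20,
    "price_fit": 0.15,
    "exploration": 0.05,
}


def price_fit(price: float, low: float, high: float) -> float:
    """1.0 inside the user's usual price range, falling linearly to 0.0
    once the price is a full range-width outside it (an assumed shape)."""
    if low <= price <= high:
        return 1.0
    width = high - low
    if width <= 0:
        return 0.0
    distance = (low - price) if price < low else (price - high)
    return max(0.0, 1.0 - distance / width)


def score(features: dict[str, float]) -> float:
    """Weighted linear combination; features not supplied count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


candidate = {
    "similarity": 0.9,
    "persona_match": 0.5,
    "price_fit": price_fit(185.0, low=80.0, high=150.0),  # 0.5: halfway out of range
}
candidate_score = score(candidate)
```

A linear model like this is easy to debug: each feature's contribution to the final score can be read off directly.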

The re-ranking step is critical and often overlooked. Without it, a user who looked at three Nike shoes would see nothing but Nike for the rest of their session. The exploration injection ensures the system doesn't just reinforce existing preferences — it occasionally shows you something outside your pattern that you might love.
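
The re-ranking constraints can be sketched as one pass over the scored candidates. This is an illustration under assumptions: the Candidate fields, the variant_group key, and the "one exploration item every N slots" injection rule are hypothetical, and in this sketch exploration items deliberately bypass the brand cap.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    product_id: str
    brand: str
    variant_group: str  # hypothetical key shared by color/size variants
    score: float


def rerank(
    candidates: list[Candidate],
    purchased_ids: set[str],
    exploration: list[Candidate],
    max_per_brand: int = 3,
    explore_every: int = 5,
) -> list[Candidate]:
    """Apply diversity rules to scored candidates, then inject exploration items."""
    per_brand: Counter[str] = Counter()
    seen_groups: set[str] = set()
    ranked: list[Candidate] = []
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        if c.product_id in purchased_ids:       # suppress recent purchases
            continue
        if c.variant_group in seen_groups:      # collapse product variants
            continue
        if per_brand[c.brand] >= max_per_brand:  # cap products per brand
            continue
        per_brand[c.brand] += 1
        seen_groups.add(c.variant_group)
        ranked.append(c)

    # Insert one exploration candidate after every `explore_every` ranked items.
    explore = iter(exploration)
    result: list[Candidate] = []
    for position, c in enumerate(ranked, start=1):
        result.append(c)
        if position % explore_every == 0:
            extra = next(explore, None)
            if extra is not None:
                result.append(extra)
    return result
```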

Streaming and session management

Nobody wants to wait 5 seconds for a shopping recommendation. We stream Gemini's responses token by token, so the user sees the reply take shape in real time instead of staring at a spinner. The session state (conversation history, extracted preferences, product context) lives in Redis with automatic TTL cleanup.

Rate limiting was a non-obvious challenge. LLM calls are expensive, and a user rapidly sending messages can rack up costs fast. We settled on 20 requests per minute for chat messages and 60 per minute for session operations, with graceful degradation (the AI acknowledges it's being rate-limited rather than silently failing).
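
One way to implement limits like these is a per-user sliding window. The sketch below keeps timestamps in process memory to stay self-contained; a production version would more plausibly keep the counters in Redis next to the session state. The class name, the handler, and the wording of the rate-limit reply are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        while hits and now - hits[0] >= self.window:  # drop expired timestamps
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


chat_limiter = SlidingWindowLimiter(limit=20)     # chat messages per minute
session_limiter = SlidingWindowLimiter(limit=60)  # session operations per minute

RATE_LIMIT_REPLY = "I'm getting a lot of messages at once. Give me a moment and try again."


def handle_chat(user_id: str, message: str, respond: Callable[[str], str],
                limiter: SlidingWindowLimiter = chat_limiter) -> str:
    """Degrade gracefully: answer with a notice instead of failing silently."""
    if not limiter.allow(user_id):
        return RATE_LIMIT_REPLY
    return respond(message)
```

Injecting the clock makes the limiter easy to test without sleeping, and rejected calls are not recorded, so a throttled user recovers as soon as the oldest request ages out of the window.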

What we'd do differently

If we were starting over, we'd invest in evaluation infrastructure earlier. We spent too long relying on vibes-based quality assessment ("does this response feel good?") before building systematic evaluation. Now we have test suites of query-response pairs with expected product categories and quality scores, but we wish we'd had them from day one.
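
A test suite of this kind can start very small. The sketch below checks only the "expected product categories" half of the idea: for each query, did the system return at least one product in an expected category? The EvalCase type, the recommend callable, and the hit-rate metric are illustrative assumptions, not a description of our actual harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    expected_categories: set[str]


def category_hit_rate(
    cases: list[EvalCase],
    recommend: Callable[[str], list[str]],
) -> float:
    """Fraction of cases where a returned category is among the expected ones.

    `recommend` is a placeholder that maps a query to the categories of the
    products the system returned for it.
    """
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case.expected_categories & set(recommend(case.query))
    )
    return hits / len(cases)


cases = [
    EvalCase("warm jacket for hiking in Colorado in November", {"outerwear"}),
    EvalCase("size 13 Nike Pegasus 41", {"running shoes"}),
]
```

Even a coarse metric like this catches regressions that vibes-based checks miss, such as a prompt change that quietly shifts results into the wrong category.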

We'd also explore fine-tuning earlier. We built a pipeline using the SIMMC dataset (Situated and Interactive Multimodal Conversations) for fine-tuning Gemini on shopping-specific conversations. The results are promising: the fine-tuned model is noticeably better at asking relevant follow-up questions and understanding shopping-specific context. But we treated fine-tuning as an optimization instead of a foundation. In hindsight, even a small amount of domain-specific fine-tuning early on would have saved us months of prompt engineering iteration.

Where we are now

Spark handles thousands of conversations and gets better every week. The hybrid search returns relevant results for everything from "size 13 Nike Pegasus 41" to "my girlfriend likes cottage-core but not in an over-the-top way." The personalization layer meaningfully improves recommendations by the third session.

But we're still early. The gap between Spark and a truly great human shopping assistant is still real — especially for categories that require deep domain expertise (electronics, skincare, wine). Closing that gap is what we're working on next, and it's some of the most interesting engineering work we've ever done.

If this kind of problem excites you, we're hiring. Check our careers page.