Overview
ChatGPT Search is not a traditional search engine. It does not return a list of ten blue links and leave you to figure out which one has the answer. Instead, it reads pages on your behalf, pulls out relevant information, and synthesizes a direct answer with citations.
Under the hood, this is a multi-stage retrieval-augmented generation (RAG) system. It combines a fine-tuned version of GPT with real-time web retrieval, running through seven distinct stages before producing a response. The system uses a specialized orchestrator model called Thinky to manage the search process, and the main GPT model only comes in at the final stage to generate the answer.
This post walks through each stage of the pipeline, explains how ChatGPT decides what to search for, how it picks sources, and how the final answer gets assembled.
Stage 1: Query Classification
Not every query needs a web search. Asking ChatGPT to explain recursion or write a sorting function can be answered from its training data alone. Searching the web for those would just waste time and money.
So the first thing that happens is a lightweight classifier model evaluates the query in milliseconds. This model (internally called snc-pg-sw-3cls-ev3, part of the "sonic" system) returns three probability scores:
| Score | Threshold | What It Means |
|---|---|---|
| no_search_prob | 0.2 | Query is answerable from training data. Skip web search. |
| simple_search_prob | 0.0 | A single search round is enough. |
| complex_search_prob | 0.4 | Multiple search rounds needed. Triggers recursive loop. |
If the no_search_prob exceeds its threshold, the query goes straight to GPT with no web retrieval. Otherwise, it moves to Stage 2. The classifier also takes user message history and context into account, so the same query can be classified differently depending on the conversation.
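The routing described above can be sketched as a simple threshold check. This is a minimal reconstruction: the threshold values come from the table, but the function name, branch labels, and structure are invented for illustration.

```python
# Hypothetical sketch of Stage 1 routing. Only the threshold values come
# from the table above; names and structure are illustrative.
THRESHOLDS = {
    "no_search_prob": 0.2,       # above this: skip web search entirely
    "complex_search_prob": 0.4,  # above this: trigger the recursive loop
}

def route_query(scores: dict) -> str:
    """Map the classifier's probability scores to a pipeline branch."""
    if scores["no_search_prob"] > THRESHOLDS["no_search_prob"]:
        return "answer_from_training_data"
    if scores["complex_search_prob"] > THRESHOLDS["complex_search_prob"]:
        return "recursive_search"  # multiple rounds, up to three
    return "single_search"         # one search round is enough

print(route_query({"no_search_prob": 0.9, "simple_search_prob": 0.05,
                   "complex_search_prob": 0.05}))
# -> answer_from_training_data
```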
Stage 2: Search Query Generation
Once the classifier decides a web search is needed, a specialized model called Thinky (internally alpha.sonic_thinky_v1) takes over. Thinky is the real orchestrator of ChatGPT Search. It does not generate the final answer. It plans and executes the search strategy.
Thinky generates two types of queries:
- Keyword queries: Traditional search terms sent to the search engine (Bing). Short, direct, similar to what you would type into Google.
- Semantic queries: Longer queries (~15 words on average) with intent-weighted vocabulary. These shift embedding vectors toward user intent rather than matching exact keywords.
For example, if you ask "best running shoes for 2026," the semantic query might be something like "running shoes 2026 top best list awards", where about 20% of the weight targets the product itself and 80% targets quality indicators.
For complex queries, Thinky enters a recursive planner loop. It searches, reads what comes back, identifies information gaps, and fires off another round of queries. This can repeat up to three times.
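The search-read-refine cycle can be sketched as a bounded loop. Here `search`, `make_queries`, and `find_gaps` are hypothetical stand-ins for Thinky's internal steps; only the three-round cap comes from the article.

```python
# Sketch of the recursive planner loop, assuming injected stand-in
# functions for Thinky's internal steps (hypothetical interfaces).
MAX_ROUNDS = 3  # the loop can repeat up to three times

def planner_loop(question, search, make_queries, find_gaps):
    evidence = []
    queries = make_queries(question, gaps=None)
    for _ in range(MAX_ROUNDS):
        for q in queries:
            evidence.extend(search(q))        # fire a round of queries
        gaps = find_gaps(question, evidence)  # read results, spot gaps
        if not gaps:                          # enough information gathered
            break
        queries = make_queries(question, gaps=gaps)  # refine and repeat
    return evidence
```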
Stage 3: Initial Candidate Filtering
Each search query typically returns about 5 SERPs (search engine results pages), yielding 40-50 candidate URLs. Thinky does not fetch all of them. That would be too slow and too expensive.
Instead, Thinky filters these candidates down to 10-20 pages using only the metadata available from the SERP results:
- SERP metadata: Title tags and meta descriptions from the search results.
- Domain authority signals: Whether the domain is known and trusted in the relevant field.
- Structured data elements: Schema markup, rich snippets, and other structured signals.
This is a fast, cheap filter. No pages are actually loaded at this point. It is purely based on what the search engine already knows about each URL.
Stage 4: Page Fetch and Content Chunking
This is where things get real. The system actually fetches the 10-20 filtered pages from the web. This stage is the primary latency bottleneck in the entire pipeline.
There is a hard timeout of roughly 2 seconds per page. If a page takes longer than that to respond, it gets skipped. Pages with slow Time-to-First-Byte (TTFB) risk being truncated mid-content. This means your page's server response time directly affects whether ChatGPT can even read your content.
Once a page is loaded, its content gets split into 128-token chunks using straightforward splitting logic. These chunks are the atomic units that get scored and compared in the next stage.
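A minimal sketch of the chunking step. The article does not name the tokenizer, so whitespace tokens stand in for model tokens here; the 128-token chunk size is from the article.

```python
# Fixed-size chunking sketch. Whitespace splitting is a stand-in for the
# real (unspecified) tokenizer; only the 128-token size is from the article.
CHUNK_SIZE = 128

def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    tokens = text.split()  # stand-in tokenizer
    return [" ".join(tokens[i:i + size])
            for i in range(0, len(tokens), size)]

page = "word " * 300          # a 300-token page
print(len(chunk_text(page)))  # -> 3 chunks (128 + 128 + 44 tokens)
```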
ChatGPT can also use the mclick function to scrape 3-10 diverse sources in parallel. If a page fails to load during this process, the system automatically skips it and retrieves information from other selected sources.
Stage 5: Semantic Embedding and Scoring
All the 128-token chunks from the fetched pages now go through GPU-accelerated embedding models. Each chunk gets converted into a numerical vector (an embedding) that represents its meaning.
These chunk embeddings are then scored against the semantic query embedding using cosine similarity. This measures how closely each chunk's meaning aligns with what the user is actually looking for. The entire embedding and scoring step completes in about 100-200 milliseconds across all chunks.
This is different from traditional keyword matching. A chunk about "athletic footwear performance reviews" would score well against a query about "best running shoes" even though none of the exact words match. That is the whole point of semantic search: matching meaning, not words.
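The scoring step boils down to cosine similarity between vectors. A toy sketch with 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the values below are invented):

```python
# Cosine-similarity scoring of chunk embeddings against the semantic query
# embedding, with toy 3-dimensional vectors standing in for real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

query_vec = [0.9, 0.1, 0.0]  # "best running shoes" (invented values)
chunks = {
    "athletic footwear reviews": [0.8, 0.2, 0.1],  # close in meaning
    "pasta recipes":             [0.0, 0.1, 0.9],  # unrelated
}
ranked = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]),
                reverse=True)
print(ranked[0])  # -> athletic footwear reviews
```

No exact words are shared between the query and the winning chunk label; the vectors carry the match.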
Stage 6: Deep-Read Page Selection
Now Thinky receives the top-scoring chunk from each candidate page. Think of this as an "audition chunk": a short preview of the best content each page has to offer.
Using these audition chunks plus the original SERP metadata, Thinky selects 3-5 pages for final synthesis. These are the selection criteria:
- Relevance: Direct alignment with the query terms and context.
- Authority: Reputation of the domain within the relevant field.
- Content quality: Clarity, depth, and how comprehensive the content is.
- Recency: Publication date, especially important for time-sensitive topics.
- Diversity: Multiple perspectives and viewpoints, not just one angle.
- Accessibility: Publicly available content without restrictive paywalls.
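One plausible way to combine these criteria is a weighted score per candidate page. The article does not publish any weights or formula, so the numbers and page names below are purely illustrative.

```python
# Illustrative weighted combination of the six selection criteria.
# Weights and signal values are invented; nothing here is documented.
WEIGHTS = {"relevance": 0.30, "authority": 0.20, "quality": 0.20,
           "recency": 0.15, "diversity": 0.05, "accessibility": 0.10}

def page_score(signals: dict) -> float:
    """Each signal is normalized to [0, 1]; higher is better."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

candidates = {
    "open-review-site": {"relevance": 0.9, "authority": 0.8, "quality": 0.9,
                         "recency": 0.7, "diversity": 0.5,
                         "accessibility": 1.0},
    "paywalled-review": {"relevance": 0.9, "authority": 0.9, "quality": 0.9,
                         "recency": 0.9, "diversity": 0.5,
                         "accessibility": 0.1},
}
best = max(candidates, key=lambda p: page_score(candidates[p]))
print(best)  # -> open-review-site (the paywall drags the other page down)
```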
Stage 7: Context Assembly and Answer Generation
This is where the main GPT model finally enters the picture. The selected 3-5 pages contribute context through several channels:
- Sliding windows: ~300 words of context around the top-scoring chunks from each selected page.
- VIP summaries: High-authority domains (like Forbes, BBC, Reuters) get pre-synthesized summaries that bypass standard chunk selection.
- Direct API data: Content from partners like Reddit is retrieved directly via API.
- Cached indexes: Internal cached content for news and other verticals.
GPT receives roughly 5-6K tokens of curated context plus any programmatic results (weather data, stock prices, sports scores from data partners). It then generates the final answer with inline citations linking back to the source pages.
One important detail: the system tells GPT it performed the entire retrieval itself. GPT has no awareness that a separate model (Thinky) handled all the searching and filtering. It just sees the curated context and writes the answer.
Non-Deterministic Output
Because GPT uses temperature-based token selection, answers are probabilistic, not deterministic. Running the same query twice can produce different cited sources, different emphasis, and different phrasing. There is no single fixed "ranking" for any query.
Personalization
The classifier model receives user message history and context vectors. This means Thinky generates different search queries based on conversation context. Two users asking the same question will get different search strategies based on their prior conversation. There is no universal "search ranking" the way Google has one.
Inside a Real ChatGPT Search Response
To understand what actually happens under the hood, we intercepted a real XHR response from ChatGPT for the query "find best running shoes for daily use". Here is every data point in that response, explained.
The Message Tree
Every ChatGPT search response is a tree of messages, not a flat list. Each message has a unique UUID, a parent pointer, and a children array. The root node is always called client-created-root and has no message content. Here is the exact chain from the intercepted response:
| Order | Role | Content Type | What It Does |
|---|---|---|---|
| 0 | root | null | Empty anchor node. No message content. |
| 1 | system | text | Hidden system prompt. is_visually_hidden_from_conversation: true |
| 2 | system | text | Developer message rebase. rebase_developer_message: true |
| 3 | user | text | The actual query: "find best running shoes for daily use" |
| 4 | assistant | model_editable_context | Empty context slot. Model can write scratchpad data here. |
| 5 | system | text | Hidden injection with empty citations: [] and content_references: []. Sent to assistant. |
| 6 | assistant | code | The search call: search("find best running shoes for daily use") with recipient: "web" |
| 7 | assistant | text | Final answer with inline citations. real_author: tool:web |
A few things stand out. The model used is gpt-5-3, confirmed in both model_slug and resolved_model_slug fields. The default_model_slug is set to "auto", meaning ChatGPT automatically selected this model for the query.
Message #6 is particularly interesting. The assistant generates a content_type: "code" message containing the literal function call search("...") with recipient: "web". This is how the model triggers the search pipeline. The response comes back as message #7 with real_author: "tool:web" in the metadata, confirming the web tool executed and returned results.
User Message Metadata
The user message (node #3) carries more metadata than you would expect:
- request_id: A UUID that ties together every message in this turn. All 8 messages share the same request_id.
- turn_exchange_id: A separate UUID grouping messages into a single turn exchange. Different from request_id.
- resolved_model_slug: The actual model used (gpt-5-3), as opposed to what the user selected (auto).
- selected_github_repos: An empty array in this case, but indicates ChatGPT can connect to GitHub repositories for code-related queries.
- weight: Set to 1.0 for real messages, 0.0 for hidden system messages. This controls whether the message contributes to the conversation context window.
Content References: The Citation System
The final answer (message #7) contains a content_references array in its metadata. This is how ChatGPT maps inline citation markers to actual sources. There are two distinct types:
Type 1: grouped_webpages
These are standard web citations. In the intercepted response, the answer text contains the marker citeturn0search1turn0search2. The content_reference maps this to actual sources:
{
"type": "grouped_webpages",
"matched_text": "citeturn0search1turn0search2",
"start_idx": 217,
"end_idx": 249,
"items": [{
"title": "Runna's Top 5 Daily Running Shoes",
"url": "support.runna.com/en/articles/...",
"attribution": "support.runna.com",
"supporting_websites": [{
"title": "Best Running Shoes 2025",
"url": "philipc91.sg-host.com/..."
}],
"refs": [
{ "turn_index": 0, "ref_type": "search", "ref_index": 1 },
{ "turn_index": 0, "ref_type": "search", "ref_index": 2 }
]
}]
}

The matched_text string is what appears in the raw model output. The UI replaces it with clickable citation bubbles. The start_idx and end_idx are character positions in the answer text. The refs array uses a structured reference system: turn_index (which conversation turn), ref_type (search vs product), and ref_index (which result within that turn).
Notice the supporting_websites field. ChatGPT groups multiple sources under a primary citation. The first source becomes the main attribution, and corroborating sources are nested as supporting evidence.
Type 2: products (Shopping Connector)
For commercial queries, ChatGPT returns structured product data through a shopping connector. The raw answer text contains a special marker:
products{"selections":[
["turn0product3","Adidas Adizero Evo SL"],
["turn0product4","Nike Vomero Plus"],
["turn0product0","On Cloud 6 Versa"],
["turn0product2","New Balance Fresh Foam Roav"],
["turn0product5","Nike Uplift SC"]
]}

The UI replaces this marker with product cards. Each product in the products array contains a rich data structure:
| Field | Example Value | Purpose |
|---|---|---|
| id | 7051740109208549025 | Unique product identifier from shopping provider |
| title | Adidas Adizero Evo SL | Display name |
| price | $150.00 | Primary listing price |
| rating | 4.8 | Aggregate review score |
| num_reviews | 15,647 | Total review count across merchants |
| image_urls | 3 OpenAI-proxied URLs | Product images served through images.openai.com |
| cite | turn0product3 | Citation key linking product to inline references |
| merchants | adidas + others | Summary of available merchants |
The Offers Array: Multi-Merchant Comparison
Each product contains an offers array with data from multiple retailers. For the Adidas Adizero Evo SL, the response included 4 offers:
| Merchant | Price | Availability | Tag |
|---|---|---|---|
| adidas | $150.00 | In stock, delivery $4.99 | - |
| Foot Locker | $150.00 | In stock, free delivery | - |
| DICK'S Sporting Goods | $149.99 | In stock, free delivery | Best price |
| Finish Line | $150.00 | In stock, free delivery | - |
Each offer includes a price_details object with base, shipping, tax, and total fields. The tag field is used for labels like "Best price". The provider field shows "p2" for all offers, which refers to a specific shopping data provider.
The debug_info on each offer reveals internal routing: source: "p2", p: "1012" for the primary merchant and p: "1000" for secondary merchants. The offers_see_more_boundary: 1 field controls how many offers are shown by default before a "see more" button.
Product Lookup Keys
Each product carries a product_lookup_key containing a JSON-encoded string with all the data needed to re-fetch the product. Inside it:
- request_query: The exact query used to find this product (e.g. "Adidas Adizero Evo SL").
- all_ids: A mapping of provider to product IDs. For example, {"p2": ["7051740109208549025"]}.
- metadata_sources: Which providers contributed metadata (["p1", "p3"]), separate from the provider that returned offers (p2).
- id_to_token_map: A base64-encoded token containing encoded product identifiers, geographic data (gl: "us", hl: "en"), and the original search query.
- merchant_hints: An array like ["adidas"] that hints at the primary merchant for this product.
Rating Grouped Citations
Products also include a rating_grouped_citation field. This groups together all the merchant pages that contributed to the aggregate rating. For the Adizero Evo SL (15,647 reviews, 4.8 rating), the citation groups Foot Locker, DICK'S, adidas.com, and Finish Line as supporting sources for that rating number.
Response Timing
The top-level response includes create_time and update_time as Unix timestamps. In this response, the gap between them is roughly 20 seconds. That is the total time from when the user hit enter to when the final answer finished streaming. The title field ("Best Daily Running Shoes") is auto-generated by ChatGPT and used in the sidebar conversation list.
The Answer Text Itself
The raw answer text (before the UI processes it) contains a mix of markdown, citation markers, and product markers. It looks something like this:
Here are some of the **best running shoes** ...
citeturn0search1turn0search2
products{"selections":[["turn0product3","Adidas..."]]}
### 1️⃣ Adidas Adizero Evo SL
- Lightweight and responsive
- Uses **Lightstrike Pro foam**
...
citeturn0search1

The citation markers follow a strict pattern: cite + turn{N} + search{M} or product{M}. Multiple citations can be concatenated (like citeturn0search1turn0search2). The UI parses these markers, maps them to the content_references array, and renders them as interactive citation elements.
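That marker grammar can be parsed with a short regular expression, producing the same shape as the refs array shown earlier. A sketch (real responses wrap markers in special delimiter characters, omitted here):

```python
# Parse the documented marker pattern: cite + turn{N} + (search|product){M},
# possibly concatenated. Delimiter characters from real responses omitted.
import re

MARKER = re.compile(r"turn(\d+)(search|product)(\d+)")

def parse_refs(marker: str) -> list[dict]:
    return [{"turn_index": int(t), "ref_type": kind, "ref_index": int(i)}
            for t, kind, i in MARKER.findall(marker)]

print(parse_refs("citeturn0search1turn0search2"))
# -> [{'turn_index': 0, 'ref_type': 'search', 'ref_index': 1},
#     {'turn_index': 0, 'ref_type': 'search', 'ref_index': 2}]
```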
Why the Raw Response Is Huge but the UI Shows So Little
The raw XHR response for this single query is roughly 50KB of JSON containing 8 messages. The user sees one answer with some product cards and citation bubbles. Here is what gets filtered out and why.
6 of 8 messages are hidden
The UI applies several filtering rules to decide which messages to show:
| Message | Shown? | Why |
|---|---|---|
| client-created-root | No | Empty anchor node with no message content |
| system (hidden prompt) | No | is_visually_hidden_from_conversation: true, weight: 0.0 |
| system (developer rebase) | No | is_visually_hidden_from_conversation: true, weight: 0.0 |
| user query | Yes | Rendered as the user's chat bubble |
| assistant (model_editable_context) | No | Internal scratchpad. Content type is not renderable. |
| system (citation injection) | No | is_visually_hidden_from_conversation: true, recipient: "assistant" |
| assistant (search call) | Transformed | The raw search("...") code message becomes a "Searching the web..." loading animation |
| assistant (final answer) | Yes | Rendered with rich formatting, but heavily transformed |
The filtering happens through three metadata flags working together:
- is_visually_hidden_from_conversation: When true, the UI skips the message entirely. Used for system prompts and context injections.
- weight: Messages with 0.0 do not contribute to the conversation context window. They exist in the tree but are invisible to both the user and the model's context.
- content_type: Messages with type model_editable_context are internal data structures, not renderable text.
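Put together, the flags yield a simple visibility filter. A sketch assuming message dicts shaped like the intercepted response; for simplicity it also drops the search "code" message, which the real UI transforms into a loading animation rather than hiding.

```python
# Sketch of the UI's message-filtering rules. Message dicts mirror fields
# seen in the intercepted response; the structure here is an assumption.
RENDERABLE_TYPES = {"text"}

def visible_messages(messages: list[dict]) -> list[dict]:
    shown = []
    for m in messages:
        meta = m.get("metadata", {})
        if meta.get("is_visually_hidden_from_conversation"):
            continue  # system prompts and context injections
        if m.get("content_type") not in RENDERABLE_TYPES:
            continue  # root node, scratchpads, the search code call
        shown.append(m)
    return shown

turn = [
    {"content_type": None},                                        # root
    {"content_type": "text",
     "metadata": {"is_visually_hidden_from_conversation": True}},  # system
    {"content_type": "text", "metadata": {}},                      # user
    {"content_type": "model_editable_context", "metadata": {}},    # scratch
    {"content_type": "code", "metadata": {}},                      # search
    {"content_type": "text", "metadata": {}},                      # answer
]
print(len(visible_messages(turn)))  # -> 2 (user query and final answer)
```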
The answer text gets heavily transformed
Even the one message that does get shown (the final answer) goes through major transformations before the user sees it:
- Citation markers become clickable bubbles. The raw text citeturn0search1turn0search2 becomes two small numbered citation buttons that show source details on hover.
- Product markers become rich cards. The raw products{"selections":[...]} block gets replaced with a horizontal carousel of product cards showing images, prices, ratings, and merchant links.
- Markdown gets rendered. The raw **bold**, ### headings, and - list items get converted into styled HTML.
- URLs get tracking parameters. Every URL in safe_urls has ?utm_source=chatgpt.com appended for click attribution.
- The alt field provides fallback text. Each content_reference has an alt field with a plain-text/markdown version for accessibility and non-JS rendering.
Product data is the biggest hidden payload
The bulk of the 50KB response is product data the user never directly sees. Each product carries:
- Multiple merchant offers (4-5 per product), each with full price breakdowns, availability status, delivery estimates, and checkout readiness flags.
- product_lookup_key containing a base64-encoded token with geographic targeting (gl: "us"), the original search query, and provider-specific catalog IDs. This is used for re-fetching and variant selection.
- debug_info on every offer with internal provider routing codes (source: "p2", p: "1012").
- rating_grouped_citation linking the aggregate star rating to specific merchant review pages.
- analytics_meta with a product_event_uuid for tracking product impressions and clicks.
The user sees a clean product card with a title, image, price, and star rating. Behind that card is roughly 5-8KB of JSON per product powering merchant comparison, checkout flows, and analytics.
The offers_see_more_boundary controls visibility
Each product has an offers_see_more_boundary field (set to 1 in this response). This tells the UI to show only the first merchant offer by default, hiding the rest behind a "See more options" expansion. For the Adidas Adizero Evo SL, the user initially sees only the adidas.com offer at $150.00. But the response already contains offers from Foot Locker, DICK'S Sporting Goods, and Finish Line, pre-loaded and ready to show when the user clicks.
Where ChatGPT Gets Shopping Data
The product cards, prices, merchant offers, and images in the response do not come from the same pipeline as the web search results. ChatGPT runs a separate shopping connector that pulls structured product data from multiple upstream providers. And the biggest surprise: the dominant source is not Bing.
Google Shopping is the primary source
In a joint study, Semrush and Search Engine Land analyzed over 43,000 ChatGPT carousel products across 10 industry verticals and compared them against 200,000+ organic shopping results. Their key findings:
- 83% of ChatGPT carousel products were strong matches in Google's top 40 organic shopping results.
- 45.8% had an exact title match in Google Shopping.
- For Bing, the match rate was only 11%. Only 70 products (0.16%) were found exclusively in Bing.
The mechanism: when a shopping query is detected, ChatGPT generates hidden "shopping fan-out queries" (averaging ~7 words each) that are sent to Google Shopping's organic index. Researchers confirmed this by intercepting network requests and decoding Base64-encoded strings in the id_to_token_map parameter. You can see this token in our XHR response inside the product_lookup_key. Per Semrush's findings, "the top ChatGPT product was included in Google Shopping's first 3 results 75% of the time."
This is not a formal partnership. Google and OpenAI are competitors in this space (OpenAI's Agentic Commerce Protocol vs. Google's Universal Commerce Protocol). ChatGPT queries Google's shopping index as a data source the same way it queries Bing's web index.
All product data sources
| Source | How It Works | Share |
|---|---|---|
| Google Shopping | ChatGPT sends hidden shopping fan-out queries to Google's organic shopping index. Products, prices, and merchant data are pulled from Google Shopping feeds. Not a formal partnership. | ~83% |
| Bing Merchant Center | Microsoft's product index. Any feed submitted to Bing Merchant Center is automatically eligible. No ad spend required. Despite the Microsoft partnership, Bing is a secondary source for products. | ~11% |
| OpenAI Merchants Portal | Direct feed submission at developers.openai.com/commerce. Merchants upload CSV, TSV, XML, or JSON feeds with 15-minute refresh cycles. Early access in US and Western Europe. | Growing |
| Schema.org/Product | OAI-SearchBot crawls websites and extracts structured product data from JSON-LD markup. Does not execute JavaScript. | Supplementary |
| Platform Integrations | Shopify (1M+ stores, "Shop with ChatGPT" toggle), Etsy (US sellers via Offsite Ads), Salesforce Commerce Cloud (Agentforce), Instacart (grocery), Target (beta app), Walmart (announced). | Varies |
The provider codes in the XHR
Every offer in the response carries a debug_info object with a source field. In our intercepted response, every offer shows source: "p2". The product also has metadata_sources: ["p1", "p3"] in its lookup data. These are internal provider codes that OpenAI has not documented publicly.
- p2 supplies the actual merchant offers (prices, availability, checkout URLs). All 4 offers for the Adizero Evo SL come from p2. Given the 83% correlation with Google Shopping results, this is very likely the Google Shopping index. The p sub-field distinguishes individual merchants: 1012 for the brand's own store (adidas.com), 1000 for third-party retailers (Foot Locker, DICK'S, Finish Line).
- p1 and p3 appear only in metadata_sources. They contribute product metadata (images, titles, specs, aggregate ratings) but not pricing or availability. These could be Bing Merchant Center and OpenAI's own crawled data respectively. This is how ChatGPT can show aggregate ratings from 15,647 reviews across multiple retailers while pulling live prices from Google Shopping.
The product_lookup_key contains a base64-encoded token with geographic targeting (gl: "us", hl: "en") and the original search query. When you click a product card or expand offers, ChatGPT uses this key to re-fetch fresh pricing from the provider without running a new search.
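Decoding such a base64 token is straightforward. This sketch assumes a JSON payload; the actual token format inside id_to_token_map is undocumented, and the sample payload below is invented to mirror the fields described above.

```python
# Round-trip a base64-encoded JSON payload shaped like the documented
# fields (gl, hl, query). The real token format is an assumption.
import base64
import json

def decode_token(token: str) -> dict:
    """Decode a base64 JSON payload, tolerating stripped '=' padding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.b64decode(padded))

sample = base64.b64encode(json.dumps(
    {"gl": "us", "hl": "en", "q": "Adidas Adizero Evo SL"}).encode()).decode()
print(decode_token(sample))
# -> {'gl': 'us', 'hl': 'en', 'q': 'Adidas Adizero Evo SL'}
```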
Commerce partners and checkout
OpenAI co-developed the Agentic Commerce Protocol (ACP) with Stripe for in-ChatGPT checkout. The payment ecosystem includes Stripe (primary), PayPal, Worldpay, Adyen, Fiserv, and Mastercard (Verifiable Intent trust layer). Named retail partners include Shopify, Instacart, Target, Walmart, and Salesforce Commerce Cloud merchants like L'Oreal, Pandora, and Saks.
However, Instant Checkout adoption has been minimal. According to early reports, only around 12 of Shopify's merchants actually went live with the integration. Reported issues included low conversion rates, limited merchant adoption, and gaps in sales tax collection. OpenAI has since shifted focus from in-ChatGPT checkout to routing users to merchant websites or connected apps. The 4% transaction fee on completed ACP purchases remains the stated monetization model.
Shopping intent detection
Not every query triggers the shopping connector. ChatGPT runs two separate pipelines:
- Shopping pipeline: Activated when the query signals purchase intent (e.g. "best running shoes under $100", "compare coffee makers"). Issues shorter, focused shopping fan-out queries (~7 words, ~1.16 queries per prompt) to Google Shopping and other product indexes. Renders product cards with images, pricing, and merchant offers.
- Web search pipeline: Activated for informational queries. Issues longer search fan-outs (~12 words each, ~2.4 queries per prompt). Pulls from narrative web content and synthesizes text responses with inline citations.
Our query "find best running shoes for daily use" triggered both pipelines simultaneously. The response contains web citations (grouped_webpages from Runna and philipc91.sg-host.com) and product cards (5 shoes with merchant offers). The classifier determined this was a query with both informational and commercial intent.
Product ranking: no paid placements
All product results are organic. There is no ad auction or pay-to-rank system. OpenAI ranks products by: relevance to the query, availability, price, star ratings and review quality, whether the merchant is the primary seller, and whether Instant Checkout is enabled (confirmed ranking boost). Critically, according to Semrush's research, the order of products shown closely mirrors how they are ranked by the upstream source (i.e., Google Shopping ranking carries through). According to OpenAI, the Shopping Research feature uses a reinforcement-trained variant of GPT-5 mini.
What the user actually sees
Reproduction of a real ChatGPT search response for "find best running shoes for daily use," rendered using ChatGPT's actual design system colors and layout patterns. Product images served from OpenAI's CDN.
🏃 Best Daily Running Shoes
1️⃣ Adidas Adizero Evo SL
- Lightweight and responsive
- Good for daily runs + faster workouts
- Uses Lightstrike Pro foam for energy return
- Best if you want a fast-feeling training shoe
2️⃣ Nike Vomero Plus
- Very cushioned with ZoomX foam
- Comfortable for long runs or walking all day
- Highly breathable upper
- Good if you prefer maximum softness
3️⃣ On Cloud 6 Versa
- Stylish everyday running shoe
- Uses CloudTec cushioning to absorb impact
- Popular for running + casual daily wear
4️⃣ New Balance Fresh Foam Roav
- Affordable and comfortable
- Fresh Foam midsole for soft cushioning
- Great for beginners or casual runners
5️⃣ Nike Uplift SC
- Budget-friendly daily trainer
- Lightweight with breathable upper
- Good for walking, gym, or light running
💡 If you want the safest picks (most runners like these):
- Best overall: Nike Pegasus series / Vomero
- Most comfortable: Hoka Clifton / Vomero style shoes
- Best beginner shoe: Brooks Ghost / New Balance Fresh Foam
These "daily trainers" usually handle 70-80% of weekly running mileage, so comfort and durability matter most.support.runna.com
✅ Quick tip before buying
- For running daily → get cushioned shoes
- For running + walking + casual wear → pick versatile trainers
- Always leave ~0.5-1 cm toe space to avoid toe pain.
✔️ If you want, tell me:
- Budget
- Running distance (2-3 km vs 10 km)
- Running or mostly walking
I can suggest the top 3 best shoes specifically for you.
Where ChatGPT Gets Its Data
ChatGPT Search pulls data from three main sources:
1. Microsoft Bing Index
Bing is the primary search index. When ChatGPT runs a search query, it calls Bing's API using the search(query, recency_days) function. Bing's own crawler (Bingbot) independently discovers and indexes web content. This is the workhorse of the system.
2. OpenAI's Own Index (OAI-SearchBot)
OpenAI maintains a supplementary search index built by its own crawler called OAI-SearchBot. Unlike GPTBot (which crawls for AI training data), OAI-SearchBot crawls specifically for search results. It does not collect training data.
Key limitations of OAI-SearchBot:
- Cannot render JavaScript. Unlike Googlebot, it only sees what is present in the initial HTML response. If your content is client-rendered, OAI-SearchBot will not see it.
- Targeted crawling. It does not attempt to index the entire web. It focuses on discovering relevant resources for user queries.
- Independent of GPTBot. Blocking OAI-SearchBot in your robots.txt does not block GPTBot, and vice versa. They are controlled separately.
3. Premium Data Partners
OpenAI has partnerships with news organizations and data providers for structured, real-time content. This includes breaking news, sports scores, financial data, and weather. These sources get special treatment in the pipeline: their content is often pre-synthesized and cached, giving them priority in the answer generation stage.
The Three OpenAI Crawlers
| Crawler | Purpose | Collects Training Data? |
|---|---|---|
| GPTBot | Crawls for AI model training | Yes |
| OAI-SearchBot | Builds search index for ChatGPT | No |
| ChatGPT-User | Live browsing when user requests it | No |
What This Means for Your Content
Understanding the pipeline changes how you think about visibility in AI search. Here are the practical takeaways:
Server speed matters
With a 2-second page fetch timeout, slow servers will get your content skipped. Server-side rendering (SSR) is essential since OAI-SearchBot cannot render JavaScript. If your content loads via client-side JS, it is invisible to the search index.
Write for the audition chunk
Your content gets split into 128-token chunks. The top-scoring chunk is what Thinky uses to decide if your page is worth reading in full. If your best content is buried after 2,000 words of filler, the audition chunk will be weak and your page will get passed over. Put your most relevant, information-dense content early and often.
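A crude self-check: split your page text the same way and see which chunk a key fact lands in, using whitespace tokens as a stand-in for the real (unspecified) tokenizer.

```python
# Where does a key fact land in the 128-token chunk sequence?
# Whitespace tokens approximate the real tokenizer (an assumption).
def chunks_of(text: str, size: int = 128) -> list[str]:
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

article = ("Filler intro sentence. " * 50) + "Key fact: the shoe weighs 210g."
first = next(i for i, c in enumerate(chunks_of(article)) if "Key fact" in c)
print(first)  # -> 1: 150 tokens of filler push the key fact out of chunk 0
```

If the fact does not appear until chunk 1 or later, the opening chunk carries only filler, and that is what auditions for Thinky.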
SERP metadata still matters
The initial filtering (Stage 3) uses title tags, meta descriptions, and structured data from the SERP results. Good SEO fundamentals are still the first gate your content must pass through. Structured data (schema markup) gives you an advantage at this stage.
There is no single ranking
Because Thinky generates personalized search queries based on conversation context, and because GPT's output is probabilistic, there is no fixed ranking. Two users asking the same question will see different sources. This makes monitoring your AI visibility more important than chasing a position number.
Manage your crawler access
Make sure your robots.txt allows OAI-SearchBot. Blocking it means your content will not appear in ChatGPT's search results, even if it appears in Bing. And remember: allowing OAI-SearchBot does not give OpenAI permission to use your content for training. That is controlled separately by GPTBot.
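For example, a robots.txt that keeps the site indexable for ChatGPT Search while opting out of training crawls (user-agent tokens per OpenAI's crawler documentation):

```
# Allow ChatGPT Search to index the site
User-agent: OAI-SearchBot
Allow: /

# Opt out of AI training crawls
User-agent: GPTBot
Disallow: /
```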
Sources
- QueryBurst – "ChatGPT Is a Search Engine. Here's How It Works." (primary source for 7-stage pipeline details, classifier model names, Thinky architecture)
- Dejan Marketing – "How ChatGPT Search Results Work" (source selection criteria, mclick function, citation format)
- Semrush – "What Is ChatGPT Search & How Does It Work?" (OAI-SearchBot crawler, Bing partnership, auto-search triggering)
- High Voltage SEO – "Understanding the New AI-Powered Search Engine" (RAG architecture overview, 4-step query processing workflow)
- Daydream – "How OpenAI Crawls and Indexes Your Website" (three OpenAI crawlers, JavaScript rendering limitations, robots.txt controls)
- ByteByteGo – "The Architecture Behind Atlas: OpenAI's New ChatGPT-based Browser" (OWL architecture, Chromium integration, agent mode)
- TechTarget – "ChatGPT Search: Details About OpenAI's Search Engine" (fine-tuned GPT-4o, o1-preview distillation)
- OpenAI – "Overview of OpenAI Crawlers" (official documentation on GPTBot, OAI-SearchBot, ChatGPT-User)
- Yoast – "What is ChatGPT Search (and how does it use Bing data)?" (Bing integration details, query rewriting)
- arXiv:2312.10997 – "Retrieval-Augmented Generation for Large Language Models: A Survey" (RAG methodology, indexing/retrieval/generation phases)
- OpenAI – "Introducing Shopping Research in ChatGPT" (organic product rankings, no paid placements, shopping intent detection)
- OpenAI – "Buy It in ChatGPT: Instant Checkout and the Agentic Commerce Protocol" (Stripe partnership, ACP protocol, merchant feed pipeline, 4% transaction fee)
- OpenAI Help Center – "Shopping with ChatGPT Search" (product card rendering, merchant offers, price comparison)
- OpenAI Developers – Product Feed Specification (CSV/TSV/XML/JSON feed formats, 15-minute refresh, Merchants Portal)
- Semrush – "ChatGPT Searches Google Shopping" (43,000+ product study, Google Shopping as primary product data source)
- Search Engine Land – "83% of ChatGPT Carousel Products From Google Shopping" (shopping query fan-outs, 83% Google Shopping correlation, product carousel analysis)
Track Your AI Search Visibility
See exactly how ChatGPT and other AI search engines handle your brand's queries. Monitor query fanouts, track citations, and understand where your content appears.
Try the Query Fanouts Extension
