My platform already serves a huge catalog: 10M+ products and 14M+ manuals across 150,000 brands. But users keep requesting manuals we don't have yet, usually for products too recent for the team to have added them. I want to source those automatically, without any manual step. What approaches would you suggest?
A few automated options, starting with the easiest:
- Ask an LLM with web access. Have Gemini, or a similar model, search and return likely PDF links for the product and brand.
- Turn on web crawling. Let the model crawl the web for the manual instead of relying on its memory.
Start there: give the model the product and brand name and see what URLs come back.
I tried that, and after more testing it doesn't reliably return actual PDFs even with search enabled. It seems like Gemini and the crawler avoid linking to PDFs directly, and I'm getting a lot of hallucinated URLs that just return 404.
Right, general-purpose AI search won't give you direct PDF links reliably, and it invents URLs when it can't find them. Go to the source instead: scrape the search results page yourself, through scraping proxies.
- Search the way a human would. Use filetype:pdf plus the product and brand name as your first query, and read the actual results page (the SERP).
- Collect every candidate. Keep all the PDF links the page returns, not just one guess from a model.
- Then parse and rank. Download each PDF, parse it to identify its type, and return the best match.
The proxy gives you the unfiltered results page, so you're working with PDFs that actually exist, not URLs the model invented.
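Here is a minimal Python sketch of the query and collection steps, assuming you already have the results page HTML as a string. Fetching it through your proxy is left out because that call depends on the provider. The names build_query and extract_pdf_links are illustrative, and real SERP markup often wraps result links in redirects, so treat the link filter as a starting point.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def build_query(brand: str, product: str) -> str:
    """Build the first search query: filetype:pdf plus brand and product."""
    return f"filetype:pdf {brand} {product}".strip()


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_pdf_links(serp_html: str, limit: int = 10) -> list[str]:
    """Return up to `limit` unique absolute links whose path ends in .pdf."""
    collector = _LinkCollector()
    collector.feed(serp_html)
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        if len(links) >= limit:
            break
        parsed = urlparse(href)
        # Skip relative links, mailto:, javascript:, and similar.
        if parsed.scheme not in ("http", "https"):
            continue
        # Match on the path only, so query strings do not hide the extension.
        if not parsed.path.lower().endswith(".pdf"):
            continue
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

Every link it returns came from the page itself, so nothing is invented. Whether each one still downloads is a separate check.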
That's exactly what I built, with Bright Data. The pipeline now:
- Fetches up to 10 PDF links from the scraped SERP.
- Checks that each one actually downloads: the URL isn't blocked, there's no anti-bot wall, and the file really loads.
- Parses each PDF with AI to categorize it: user manual, installation manual, and so on.
- Ranks them against the criteria I defined for that brand, removes duplicates, and returns the best document plus the next 5 strongest candidates.
The user gets the right manual with zero human work.
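A minimal Python sketch of the download check, dedupe, and ranking steps, under stated assumptions: the files are already downloaded, the AI categorization has already produced a doc_type label, and the per-brand criteria are reduced to a hypothetical priority list of document types. The names Candidate, looks_like_pdf, and rank_candidates are illustrative, not the production code.

```python
import hashlib
from dataclasses import dataclass


def looks_like_pdf(data: bytes) -> bool:
    """True if the downloaded bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")


@dataclass(frozen=True)
class Candidate:
    url: str
    doc_type: str   # label from the categorization step, e.g. "user_manual"
    content: bytes  # raw downloaded bytes


def rank_candidates(candidates, type_priority, runners_up=5):
    """Drop non-PDFs and duplicates, then order by the brand's type priority.

    `type_priority` is a hypothetical stand-in for the per-brand criteria:
    a list of doc types, most wanted first. Unknown types sort last.
    Returns (best, next_best_list), or (None, []) if nothing survives.
    """
    seen_hashes = set()
    unique = []
    for cand in candidates:
        if not looks_like_pdf(cand.content):
            continue  # blocked page, anti-bot wall, or a non-PDF response
        digest = hashlib.sha256(cand.content).hexdigest()
        if digest in seen_hashes:
            continue  # same file served from a different URL
        seen_hashes.add(digest)
        unique.append(cand)

    def score(cand):
        if cand.doc_type in type_priority:
            return type_priority.index(cand.doc_type)
        return len(type_priority)

    ordered = sorted(unique, key=score)  # stable sort: ties keep input order
    if not ordered:
        return None, []
    return ordered[0], ordered[1 : 1 + runners_up]
```

Hashing the bytes catches the same file mirrored on several sites. Checking the header catches HTML error pages that come back with a .pdf URL.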