My platform already serves a huge catalog: 10M+ products and 14M+ manuals across 150,000 brands. But users keep requesting manuals we haven't referenced yet, usually products too recent for the team to have added. I want to source those automatically, with no manual step. What approaches would you suggest?

A few automated routes, easiest first:

  • Ask an LLM with web access. Have Gemini, or a similar model, search and hand back likely PDF links for the product and brand.
  • Turn on web crawling. Let the model crawl the open web for the manual instead of relying on its memory.

Start there: point the model at the product plus brand and see what URLs come back.

I tried that. After digging in, it doesn't reliably surface actual PDFs, even with search enabled. Gemini and the crawler seem to avoid returning direct PDF links, and I'm getting a lot of hallucinated URLs that just 404.

Right, general-purpose AI search won't hand you raw PDFs reliably, and it invents URLs when it can't find them. Go to the source instead: scrape the results page yourself through scraping proxies.

  • Query like a human would. Run filetype:pdf plus the product and brand name as the first query, and read the real SERP.
  • Collect every candidate. Pull all the PDF links the page returns, not just the model's single guess.
  • Then parse and rank. Download each PDF, parse it to categorize the type, and return the best match.

The proxy gives you the unfiltered results page, so you're working from PDFs that actually exist instead of ones the model hoped existed.
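
A minimal sketch of the first two steps, assuming you already have the results page HTML back from the proxy. The function names (build_query, extract_pdf_links) are hypothetical, and the /url?q= unwrapping is an assumption about how some result pages wrap their links, so adjust it to the markup you actually receive.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


def build_query(product: str, brand: str) -> str:
    """Compose the first-pass query: filetype filter plus brand and product."""
    return f'filetype:pdf "{brand}" "{product}"'


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def _unwrap(href: str) -> str:
    # Assumption: some result pages wrap the target as /url?q=<target>.
    parsed = urlparse(href)
    if parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href


def extract_pdf_links(serp_html: str, limit: int = 10) -> list[str]:
    """Return up to `limit` unique http(s) links whose path ends in .pdf."""
    collector = _LinkCollector()
    collector.feed(serp_html)
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        url = _unwrap(href)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue
        if not parsed.path.lower().endswith(".pdf"):
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
        if len(links) == limit:
            break
    return links
```

Filtering on the URL path is only a first pass: some manuals are served from URLs without a .pdf extension, and some .pdf URLs return an HTML block page, so each candidate still needs a download check.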

That's exactly what I built, with Bright Data. The pipeline now:

  • Fetch up to 10 PDF links from the scraped SERP.
  • Check each one downloads: URL not blocked, not bot-gated, the file actually loads.
  • Parse with AI to categorize: user manual, installation manual, and so on.
  • Rank against the criteria I hold for that brand, dedupe, and surface the best document plus the next 5 I judge strongest.

The user gets the right manual with zero manual sourcing.
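
For illustration only, the download check, dedupe, and ranking steps could be sketched like this. It is a simplified stand-in, not the production pipeline: Candidate, looks_like_pdf, and select_documents are hypothetical names, the doc_type label is assumed to come from the AI categorization step, and the per-brand criteria are reduced to an ordered list of preferred document types.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    content: bytes  # downloaded response body
    doc_type: str   # label from the categorization step, e.g. "user manual"


def looks_like_pdf(content: bytes) -> bool:
    """Strict check: a PDF starts with %PDF-, while block pages are usually HTML."""
    return content.startswith(b"%PDF-")


def select_documents(candidates, preferred_types, extra=5):
    """Return (best, runners_up) after validation, dedupe, and ranking."""
    rank = {doc_type: i for i, doc_type in enumerate(preferred_types)}
    seen = set()
    valid = []
    for cand in candidates:
        if not looks_like_pdf(cand.content):
            continue  # blocked, bot-gated, or not a real PDF
        digest = hashlib.sha256(cand.content).hexdigest()
        if digest in seen:
            continue  # same file served from another URL
        seen.add(digest)
        valid.append(cand)
    # list.sort is stable, so the original SERP order breaks ties.
    valid.sort(key=lambda cand: rank.get(cand.doc_type, len(rank)))
    if not valid:
        return None, []
    return valid[0], valid[1:1 + extra]
```

Hashing the file bytes catches the same manual mirrored on several hosts, which comparing URLs alone would miss.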
