How I build prospect lists that are actually true
The full method I use to build prospect lists that hold up: parallel sources, verified emails, honest confidence scoring and an evidence URL on every record.
Most guides like this hold something back. The method gets vague around step four and the final paragraph turns into a pitch. This one does not, and here is why. Nothing in how I build prospect lists is secret. The protection is not the knowledge, it is the work: the unglamorous plumbing, the email verification nobody skips until they do, the hundredth record checked by hand. People read methods like this, nod along, and never build the thing. So I can give away the keys and sleep fine. The moat is doing the work, not knowing the steps.
Here is the problem the method solves. Bought lists are wrong in a specific way: they tell you where a business is registered, not where it shows up. The clearest example I have built for is Vivify Venues, who hire out school halls and pitches to community groups. Their best prospects were groups already renting space at schools that were not Vivify customers. A dance class meets in a school hall every Tuesday, but its registered address is the organiser's front room. No data provider indexes the hall. The signal lives in scattered places: club finders, activity directories, Facebook pages, league fixture grids. That pattern is everywhere in B2B. Your best prospects are visible on the open web, but the thing that qualifies them is not in anyone's database.
So you build the database yourself. Here is the whole method, in order.
The method
1. Decide what makes a record true before you collect anything
Write down the qualifying signal in one sentence. For Vivify it was "evidence that this group meets at this venue". Not "sports clubs near this postcode", which is a different and much worse list. If you cannot state the signal in one sentence, stop. Everything downstream, the sources you pick, the confidence scoring, the evidence you keep, is built around that sentence. Skip this and you will collect a large list of nothing in particular, very efficiently.
2. Fan out to several sources in parallel
One source lies by omission. Google Places is good at organisations with premises and terrible at a netball club that meets in a borrowed hall. So I run sources in parallel: Google Places for the obvious candidates, search results via DataForSEO using queries tagged with the venue name, and Facebook page data harvested via Apify, because half of community life exists only on Facebook.
The published numbers from the first live Vivify run make the case. The search-results source added 27 groups that Google Places alone had missed, 5 of them with venue-confirmed evidence. Run one source and you are choosing not to know about roughly a quarter of your market: 27 of the 107 groups that run returned.
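As a rough sketch of the fan-out, not the production pipeline: the three source functions below are hypothetical stand-ins that return canned data. Real versions would call Google Places, DataForSEO and Apify, and those calls are not shown here. Threads are used on the assumption that the sources are network-bound.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source functions. Each takes a venue name and returns a list
# of candidate dicts. The canned data stands in for real API calls.
def search_places(venue):
    return [{"name": "Hall Road Badminton Club", "source": "places"}]

def search_serp(venue):
    return [{"name": "Tuesday Dance Class", "source": "serp"}]

def search_facebook(venue):
    return [{"name": "Saturday Netball", "source": "facebook"}]

SOURCES = [search_places, search_serp, search_facebook]

def fan_out(venue):
    """Run every source concurrently and return all candidates in one list."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(source, venue) for source in SOURCES]
        results = []
        for future in futures:
            try:
                results.extend(future.result())
            except Exception as exc:
                # One failing source should not sink the whole search.
                print(f"source failed: {exc}")
        return results
```

Each candidate keeps a `source` field so later steps can tell which source found what.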
3. Give every record a deterministic ID at the door
Sources do not share an ID scheme. Google gives you a place ID, a search result gives you a URL, Facebook gives you a page. Any record arriving without a place ID gets a synthetic one, generated deterministically from what the record is, so the same group from the same source always produces the same ID. This sounds like bookkeeping. It is the difference between a clean merge and records silently vanishing when the pipeline relinks results later. The Vivify run came through with zero orphaned records, and that boring ID discipline is the reason.
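One way to generate such an ID is to hash a normalised key. This is a minimal sketch: the choice of fields (source, name, URL), the `syn_` prefix and the 16-character length are assumptions for illustration, not the scheme the Vivify system uses.

```python
import hashlib
import re

def normalise(text):
    """Lowercase and collapse anything that is not a letter or digit to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def record_id(record):
    """Return the source's place ID if present, else a deterministic synthetic ID."""
    if record.get("place_id"):
        return record["place_id"]
    # Assumed key fields: which source found it, what it is called, where it was found.
    key = "|".join([
        record["source"],
        normalise(record["name"]),
        normalise(record.get("url", "")),
    ])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"syn_{digest[:16]}"
```

Because the key is normalised before hashing, differences in case, spacing and punctuation in a name do not produce a new ID, so a record seen on a later run links back to the same row.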
4. Enrich from the live web, not the cache
Every record then goes through the same chain: scrape the organisation's website for current contact details, fall back to search results for groups with no site, pull contact data from Facebook, geocode the location. The point is to enrich from what is live today, not from whatever a data vendor cached two years ago. Stale enrichment is how you end up emailing a treasurer who quit in 2023.
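The chain can be sketched as an ordered list of steps, each filling only the fields that are still empty. The step functions are hypothetical placeholders. A real chain would scrape the website, query search results, read Facebook data and call a geocoder.

```python
def enrich(record, steps):
    """Run each step in order, filling only fields that are still empty."""
    enriched = dict(record)
    for step in steps:
        try:
            found = step(enriched) or {}
        except Exception:
            # A dead website or a failed lookup should not stop the chain.
            continue
        for field, value in found.items():
            if value and not enriched.get(field):
                enriched[field] = value
    return enriched
```

Ordering the steps from most to least trusted means the website's own contact details win over anything found by a fallback.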
5. Verify every email before you trust it
An unverified email address is a guess with a domain on it. Verification happens inside the pipeline, before the record reaches the master table, not as a panicked cleanup the night before a campaign. The reason is not politeness. Bounces poison the domain you send from, and a poisoned sending domain costs you far more than the list was ever worth.
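A sketch of the gate, assuming the actual mailbox check is done by an external verification service. Here `verify` is a hypothetical callable standing in for that service, and the regular expression is only a cheap shape check, not verification in itself.

```python
import re

# Cheap shape check: something@something.something with no spaces.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def gate_email(record, verify):
    """Keep the email only if it passes the shape check and the verifier accepts it."""
    email = (record.get("email") or "").strip().lower()
    checked = dict(record)
    if email and EMAIL_SHAPE.match(email) and verify(email):
        checked["email"] = email
        checked["email_verified"] = True
    else:
        # Drop the address rather than carry a guess into the master table.
        checked["email"] = None
        checked["email_verified"] = False
    return checked
```

The record itself survives a failed check. Only the address is dropped, so a group with a bad email can still be reached another way.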
6. Classify with a cheap model against a controlled vocabulary
Each record gets an AI pass that classifies what the organisation does. The trick is to never ask the model an open question. "What is this organisation?" produces four hundred creative synonyms for football club. Instead the model picks from a fixed list of activity types you defined up front. That turns classification from an essay into a multiple-choice exam, which is exactly the kind of exam a cheap, fast model passes reliably. Save the expensive models for problems that deserve them.
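In code, the multiple-choice framing comes down to two small pieces: a prompt that lists the options, and a parser that refuses anything off the list. The vocabulary below is an invented example and the model call itself is left out. Both function names are hypothetical.

```python
# Example vocabulary only. Define your own list up front.
ACTIVITY_TYPES = ["football", "netball", "dance", "martial arts", "fitness", "other"]

def build_prompt(description):
    """Ask a closed question: pick exactly one option from the fixed list."""
    options = "\n".join(f"- {activity}" for activity in ACTIVITY_TYPES)
    return (
        "Classify the organisation below. Answer with exactly one option "
        "from the list and nothing else.\n"
        f"{options}\n\nOrganisation: {description}"
    )

def parse_label(model_output):
    """Accept the answer only if it is in the vocabulary, otherwise fall back to 'other'."""
    label = model_output.strip().lower().rstrip(".")
    return label if label in ACTIVITY_TYPES else "other"
```

The parser matters as much as the prompt. Even a well-behaved model occasionally answers off-list, and the fallback keeps that answer out of the category column.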
7. Score confidence honestly and store the evidence
Every record gets a confidence tier. I use three: confirmed, where the evidence text names the venue; likely, where web or social activity matches; and proximity-only, where the group is nearby but no venue evidence was found. Crucially, the evidence URL is stored on the record, so a salesperson can click through and see exactly why the system believes what it believes.
Honest scoring means most records do not make the top tier. That is the point. A list where everything is marked confirmed is a list where confirmed means nothing, and sales will work that out within a week.
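The three tiers reduce to a short function. This is a simplified sketch: the field names `evidence_text` and `activity_match` are assumptions, and a real matcher would need to handle abbreviations and alternative spellings of the venue name.

```python
def confidence_tier(record, venue_name):
    """Return 'confirmed', 'likely' or 'proximity-only' for one record."""
    evidence = (record.get("evidence_text") or "").lower()
    if venue_name.lower() in evidence:
        # The evidence text names the venue.
        return "confirmed"
    if record.get("activity_match"):
        # Web or social activity matches, but the venue is not named.
        return "likely"
    # Nearby, with no venue evidence found.
    return "proximity-only"
```

The default is the lowest tier, so a record has to earn its way up rather than be demoted.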
8. Dedupe and merge into one master record
Then the cleanup: junk emails stripped, dead websites removed, phone and mobile numbers separated, a quality score assigned, and duplicates across sources merged into a single master record with the evidence retained. One organisation, one record, every claim on it traceable to a source. The master table is the asset. The searches are just how it gets fed.
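A minimal sketch of the merge, assuming an earlier linking step has already given duplicates a shared `master_id` (a hypothetical field name). The first non-empty value wins for each field, and every evidence URL and source name is kept.

```python
def merge_records(records):
    """Collapse records sharing a master_id into one record, keeping all evidence."""
    master = {}
    for rec in records:
        key = rec["master_id"]
        entry = master.setdefault(key, {"master_id": key, "evidence": [], "sources": []})
        for field, value in rec.items():
            if field in ("master_id", "evidence_url", "source"):
                continue
            # First non-empty value wins, and later sources only fill gaps.
            if value and not entry.get(field):
                entry[field] = value
        if rec.get("evidence_url") and rec["evidence_url"] not in entry["evidence"]:
            entry["evidence"].append(rec["evidence_url"])
        if rec.get("source") and rec["source"] not in entry["sources"]:
            entry["sources"].append(rec["source"])
    return list(master.values())
```

Feeding records in order of source trust decides which value wins when two sources disagree.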
9. Track cost per source, per search
The system records real spend for each source on every search and shows it in the interface and the exports. Where a source will not tell you the real figure, store an estimate and label it as an estimate, because a guess dressed up as an actual is how cost reports lose their credibility. The first live Vivify run returned 107 hirer groups, 89 of them new to the database, for a total data cost of £0.14. Those figures come from the system's own search and cost records. When data costs a fraction of a penny per evidence-backed prospect, the cost tracking is not bureaucracy, it is the proof the whole approach works.
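A sketch of how a cost record might carry the estimate flag through to a summary. The `CostEntry` type and its field names are hypothetical, and no per-source amounts are implied.

```python
from dataclasses import dataclass

@dataclass
class CostEntry:
    source: str
    amount_gbp: float
    is_estimate: bool  # True when the source does not report the real figure

def summarise(entries, prospects):
    """Total the spend, work out pence per prospect and list the estimated sources."""
    total = sum(entry.amount_gbp for entry in entries)
    return {
        "total_gbp": round(total, 4),
        "pence_per_prospect": round(total * 100 / prospects, 3),
        "estimated_sources": [e.source for e in entries if e.is_estimate],
    }
```

With the published totals, £0.14 across 107 groups works out at about 0.13p per group, which is where "a fraction of a penny" comes from.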
Where this goes wrong
I have made, or watched, every one of these mistakes.
Trusting one source. The Places-only version of the Vivify search was 27 groups lighter and nobody would ever have known. Missing records do not announce themselves.
Skipping verification because the list looks clean. The list always looks clean. The bounce rate after the first send does not.
Open-ended classification. Free-text categories breed until your dropdown has Football, football club, Football (5-a-side) and Footy as separate entries. Controlled vocabulary from day one.
Confidence inflation. Marking proximity-only records as confirmed makes the demo look better and destroys trust in the data within days. Tier honestly and let the top tier be small.
Ignoring infrastructure limits. A multi-source search takes longer than a web request is allowed to live. Gateways will kill long-running requests, so the pipeline accepts the job immediately and the front end polls until it finishes. Learn this from a guide rather than from a wall of timeout errors at 11pm.
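The accept-then-poll pattern can be sketched in a few lines. This version keeps jobs in an in-memory dict and runs the pipeline in a thread, which is an assumption made for illustration. A real deployment would more likely use a database table or a queue, and the polling would happen over HTTP from the front end.

```python
import threading
import time
import uuid

JOBS = {}  # job_id -> {"status": ..., "result": ...}; in-memory for illustration only

def submit_search(run_pipeline, venue):
    """Register the job, start it in the background and return its ID at once."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            # Store the result before flipping the status, so a poller
            # never sees "done" with an empty result.
            JOBS[job_id]["result"] = run_pipeline(venue)
            JOBS[job_id]["status"] = "done"
        except Exception as exc:
            JOBS[job_id]["result"] = str(exc)
            JOBS[job_id]["status"] = "failed"

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def poll(job_id, interval=0.05, timeout=5.0):
    """Check the job until it leaves the 'running' state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = JOBS[job_id]
        if job["status"] != "running":
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

The request that starts the search returns in milliseconds, so no gateway timeout applies to it. Only the short status checks travel over the web request path.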
Deduping without deterministic IDs. Fuzzy matching on names is where good records go to die. Stable IDs first, fuzzy matching only as a last resort.
What it takes
Honestly: a few solid days of building if you have wired APIs, a workflow tool and a database together before, and meaningfully longer if you have not. The skills are not exotic, they are API plumbing, basic data modelling and enough prompt discipline to keep a model on a fixed vocabulary. The scarce ingredient is stamina. The method only pays off if you verify the boring things, check real records by hand, and keep doing it after the novelty wears off. That is why I am relaxed about publishing it. Most people will not do the work, and the few who will were always going to be fine.
If you would rather just have the list, and the system that keeps it true, book a consultation and I will build it for you.
Frequently asked questions
- Why are bought prospect lists so often wrong?
  Because they index where a business is registered, not where it shows up. For many B2B markets the qualifying signal lives in scattered open-web sources such as directories, Facebook pages and fixture lists that no standard data provider covers. The only reliable list is one built from those sources directly, with every record verified and carrying the evidence for why it qualifies.
- How do you keep AI classification of prospect data accurate?
  Never ask the model an open question. Define a controlled vocabulary of categories up front and have the model pick from that fixed list. That turns classification into a multiple-choice exam, which cheap, fast models pass reliably, and it stops free-text categories breeding hundreds of near-duplicate labels.
- What does multi-source prospect discovery cost to run?
  Very little at the data layer. The first live run of the system I built for Vivify Venues returned 107 evidence-backed prospects, 89 of them new to their database, for a total data cost of £0.14 across Google Places, search results and Facebook, according to the system's own cost records. The real cost is the build and the upkeep, not the per-search spend.
- What is a confidence tier on a prospect record?
  An honest label for how strong the evidence is. I use three: confirmed, where the evidence text names the qualifying signal directly; likely, where web or social activity matches; and proximity-only, where the prospect fits the profile but no direct evidence was found. Each record keeps its evidence URL so anyone can check why it was scored that way, and an honest system keeps the top tier small.