Bridgetown Research -- Intelligence Profile -- The Caliper Lab

Intelligence Profile

Bridgetown Research

AI-native primary and secondary research platform. Built for private equity, management consulting, and corporate strategy teams who need due diligence reports, expert interview synthesis, and market intelligence -- in 24 hours rather than weeks, at a fraction of the cost of traditional commercial due diligence.

Primary Research AI | Agentic AI | AI Voice Interviews | Expert Network Access | Repeatable and Auditable | Series A -- Accel and Lightspeed

Rich coverage

Q1 2026 -- Run #1
380 tasks -- CaliperResearch-v1

Coverage note: Bridgetown Research's primary data collection via AI voice interviews is a novel capability with no established independent benchmark methodology. The scores below cover secondary research synthesis and report output quality -- the portions of the product that can be independently evaluated against frontier baselines. Voice interview quality and primary data accuracy require a separate evaluation protocol currently in development.

Capability Assessment -- Independent -- Q1 2026

Bridgetown Research occupies a structurally different position from other research AI products: it collects original primary data via AI-conducted expert interviews, rather than only synthesising existing information. This makes it harder to benchmark than document analysis tools, and the findings here are intentionally conservative about what the Lab can and cannot assess independently.

What the product does that others do not

Bridgetown's structural differentiator is primary data collection at scale through AI voice agents that recruit and interview industry experts. It is not purely a synthesis product: it generates original information. The voice agents operate across expert networks, conduct structured interviews, and pass transcripts to analysis agents that apply clustering, regression, and LLM synthesis. The output is a due diligence report grounded in primary expert signal alongside secondary web research.

The three-agent architecture -- voice collection, quantitative analysis, synthesis output -- is a genuine engineering differentiator. Most research AI products stop at the secondary synthesis layer.
The claim of "hundreds of respondents in parallel within days" is consistent with the architecture and has been corroborated by investor references. It is not independently verifiable by the Lab at this stage.
Secondary research synthesis and report output quality together benchmark at 79.4 on the Lab's structured output quality rubric -- above the 74% category average, below the frontier at 88.1.

The frontier question

The frontier is improving rapidly on secondary synthesis -- the portion of Bridgetown's output that general-purpose LLMs also produce. GPT-5.5 and equivalent models now produce high-quality market summaries, competitive landscapes, and deal memos from public data with minimal prompting. The durable differentiator for Bridgetown is the primary data layer, which frontier models cannot replicate because they have no mechanism to collect new information from live expert interviews. The question is whether buyers distinguish between products that synthesise public information and products that generate original primary signal -- and whether they will pay for that distinction.

Secondary synthesis L1--L3 gap: 8.7 points from frontier. Compressing at 3.1 points per quarter.
Primary data collection has no frontier equivalent and is therefore not subject to frontier compression in the same way. This is the product's most defensible capability.

Decision implication

For PE deal teams evaluating Bridgetown, the relevant question is whether AI-conducted expert interviews produce signal quality comparable to human-conducted calls, at the scale and speed Bridgetown claims. The Lab cannot independently verify interview quality at this stage -- it requires a controlled comparison methodology not yet in the CaliperResearch dataset. What the benchmark does show is that the secondary synthesis and report output layer is above category average. Buyers who need primary expert signal at deal speed have few alternatives at this price point. Buyers who primarily need secondary synthesis face a more competitive landscape as frontier models improve.

What the data does not yet cover

AI voice interview quality -- signal accuracy, expert recruitment quality, interview depth -- has not been independently benchmarked. This is the most commercially important capability and requires a dedicated primary research evaluation methodology.
Output reliability across deal types: benchmarked on standard PE commercial diligence use cases only. Performance on credit assessment, public market research, and corporate strategy use cases is not yet in scope.
The "10x downstream advisory revenue for every $1 of Bridgetown revenue" claim is an ecosystem claim, not a product performance claim. Not independently verifiable.
Panel signal is very early -- 11 practitioners, all PE and consulting. Multiple cycles required before statistically stable estimates are possible.

Benchmark Scorecard -- vs. GPT-5.4 baseline -- 380 tasks (secondary synthesis only)

These scores cover the secondary research and synthesis layer only. Primary data collection via AI voice interviews is not included -- no independent benchmark methodology exists for this capability yet.

Level  Bridgetown Research  Frontier (GPT-5.4)  Delta  Task
L1     91.4                 93.8                 -2.4  Secondary research extraction -- market data and company profiles
L2     94.2                 95.1                 -0.9  Error detection -- logical correctness
L3     82.7                 89.4                 -6.7  Scenario and sensitivity build
L4     67.3                 81.4                -14.1  Cross-sheet model restructuring
L5     54.1                 73.2                -19.1  Analytical judgment and assumption-setting
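
The delta column in the scorecard above can be rechecked with a few lines of Python. This is a minimal sketch: the scores are copied from the scorecard, while the names `SCORECARD`, `deltas`, and `widest_gap` are hypothetical and not part of the Lab's tooling.

```python
# Scores copied from the scorecard: level -> (Bridgetown, frontier GPT-5.4).
SCORECARD = {
    "L1": (91.4, 93.8),
    "L2": (94.2, 95.1),
    "L3": (82.7, 89.4),
    "L4": (67.3, 81.4),
    "L5": (54.1, 73.2),
}


def deltas(scorecard):
    """Return Bridgetown-minus-frontier per level, rounded to one decimal."""
    return {
        level: round(bridgetown - frontier, 1)
        for level, (bridgetown, frontier) in scorecard.items()
    }


def widest_gap(scorecard):
    """Return the level where Bridgetown trails the frontier by the most."""
    per_level = deltas(scorecard)
    return min(per_level, key=per_level.get)


if __name__ == "__main__":
    for level, delta in deltas(SCORECARD).items():
        print(f"{level}: {delta:+.1f}")
    print("Widest gap:", widest_gap(SCORECARD))
```

Run as a script, it prints one signed delta per level and then the level with the widest gap, which makes it easy to see that the gap grows from the extraction tasks toward the judgment tasks.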

Vendor Claim Verification -- Source: bridgetownresearch.com and public statements

"Initial due diligence analysis in 24 hours with inputs from hundreds of respondents"

Partial: The 24-hour timeline for report generation is consistent with the product architecture and corroborated by Lightspeed's investment memo. "Hundreds of respondents" refers to voice interview scale across parallel AI agents -- plausible given the architecture but not independently verified by the Lab. Report completeness at 24 hours depends heavily on scope definition and document availability.

"At 10% of the cost of a traditional CDD engagement"

Not independently tested: Cost comparison against traditional CDD is a commercial claim not verifiable through capability benchmarking. The cost structure depends on engagement scope, which varies significantly. The directional claim -- that AI-assisted diligence is materially cheaper than full consulting-led CDD -- is widely accepted in the market and consistent with the product design.

"Repeatable, auditable, and reliable analyses"

Partial: Report structure and output formatting show high consistency across benchmark runs -- the synthesis layer is repeatable. Auditability (traceable citations) scores 82% in the benchmark, above category average. Reliability on L4--L5 judgment tasks (thesis stress-testing, novel insight generation) is lower, consistent with the 14--19 point frontier gap on those tasks.

Frontier intelligence

Current frontier -- GPT-5.4: 85.1 (weighted avg -- secondary research synthesis tasks)
Frontier velocity: +3.1 pts / qtr (research synthesis -- steady)
Secondary synthesis parity: 3 to 4 qtrs (at current velocity -- Q4 2026 to Q1 2027)

Frontier compression affects the secondary synthesis layer only. The primary data collection layer -- AI voice interviews with industry experts -- has no frontier equivalent. This structural distinction is the basis for Bridgetown's durable differentiation argument.
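
The parity window can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the gap closes linearly at a constant rate, using the 8.7-point gap and the 3.1 points per quarter figure quoted in this profile. The function name `quarters_to_parity` is hypothetical, and the Lab's actual projection method is not described.

```python
import math


def quarters_to_parity(gap_points: float, closure_per_quarter: float) -> int:
    """Whole quarters needed to close a gap at a constant rate.

    Assumes linear closure, which is a simplification: real frontier
    velocity may speed up or slow down from quarter to quarter.
    """
    if gap_points < 0:
        raise ValueError("gap must be non-negative")
    if closure_per_quarter <= 0:
        raise ValueError("closure rate must be positive")
    return math.ceil(gap_points / closure_per_quarter)


if __name__ == "__main__":
    # 8.7 / 3.1 is about 2.8, so the gap closes during the third quarter.
    print(quarters_to_parity(8.7, 3.1))  # 3
```

Under these assumptions the result lands at the lower end of the 3 to 4 quarter range; a slower rate in any one quarter would push it toward the upper end.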

Practitioner signal -- n=11 -- PE and consulting (early)

Output acceptance rate: 71% (early data)
Verify before use: 64% (early data)
Workflow abandonment: 11% (early data)
Trust trajectory: Early -- insufficient data
Top correction type: Depth and specificity of insights

Panel size is too small for statistically stable estimates. All figures are directional only. Two additional panel cycles required before these signals are reportable with confidence.
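
To illustrate why a panel of 11 is directional only, the sketch below computes a 95% Wilson score interval for the reported 71% acceptance rate. It treats the 11 practitioners as independent binary observations. That is an assumption made for illustration: the profile does not say whether rates are computed per practitioner or per output. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n trials.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    low, high = wilson_interval(0.71, 11)
    print(f"95% interval: {low:.2f} to {high:.2f}")
```

With n=11 the interval spans more than 40 percentage points, which is why these figures should not be read as stable estimates.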

Score trajectory -- Bridgetown weighted avg -- secondary synthesis

Higher bar = stronger performance vs. frontier

Q3 2025: 71.4
Q1 2026: 76.8

Methodology

Dataset: CaliperResearch-v1 -- 380 tasks
Baseline: GPT-5.4 (Mar 2026)
Scoring L1-L2: Structured output rubric + citation F1
Scoring L3-L5: LLM-as-judge + consulting practitioner review
Ground truth: Expert-constructed -- kappa 0.79
Run date: 28 March 2026
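
Two of the methodology terms above, citation F1 and kappa, can be sketched in a few lines. The code is illustrative only. It assumes citation F1 is computed by exact set matching between cited and reference sources, and that the reported kappa is Cohen's kappa between two annotators; the profile confirms neither. Both function names are hypothetical.

```python
from collections import Counter


def citation_f1(cited: set, reference: set) -> float:
    """Set-based F1 between the sources a report cites and a reference set."""
    true_positives = len(cited & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(citation_f1({"s1", "s2", "s3"}, {"s2", "s3", "s4"}))
    print(cohens_kappa(["pass", "pass", "fail", "fail"],
                       ["pass", "fail", "fail", "fail"]))
```

A kappa of 0.79 under this definition would mean the annotators agreed well beyond what chance alone would produce, though the threshold for "good" agreement is a matter of convention.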

Representative profile for discussion -- all scores and findings are illustrative, based on the Lab's published methodology applied to Bridgetown Research's publicly stated capabilities. Primary data collection via AI voice interviews requires a separate evaluation protocol not yet finalised. Full benchmark data will be published upon completion of the formal evaluation programme. thecaliperlab.com