Companion graphic no. 05 — interactive

What happened, and what changes if the rule changes.

Forty-five years of real policy, real ESG capital, and real atmospheric CO2 sit on the left half of this chart. Five AI models' independent projections of the DRL framework's climate effect, under fixed and disclosed assumptions, sit on the right. The question this page answers is not "is DRL correct"; no single chart can answer that. The question is twofold: did three decades of climate policy and tens of trillions of dollars in sustainability-labelled capital bend the curve, and what do five independently reasoning AI models project will happen if the accounting boundary itself changes?

⚠ Working draft. The right-hand projection is AI-modelled, not measured. Five models were given identical fixed inputs (global anchor, biome split, compounding formula); each chose its own forest-type weighting and elasticity value, producing five different but internally consistent projections. Every model's full working, including two self-corrected arithmetic errors and one model's declined task, is preserved in this page's methodology section. This is not a forecast. It is a transparent test of what a fixed accounting logic implies, run five times, by five different reasoners.

Part one — the record

Did any of this bend the curve?

Three decades of climate treaties, two ISO/CEN accounting standards, a thirty-five-trillion-dollar sustainable-investment peak, and the entire mass-timber federal funding apparatus documented in the Reagan Forestry Legacy paper — all plotted against the one number that actually measures whether any of it worked: atmospheric CO2 concentration, measured continuously at Mauna Loa since 1958.

Part two — the projection

Interactive controls

Set the adoption level, choose which biome and forest type to view, and toggle models on or off. The historical segment (left of today) never changes — it is sourced, dated fact. Only the projected segment (right of today) responds to these controls.

Adoption of full-boundary accounting: 25%

Biome

Forest type

AI models (projection only — each chose its own weighting and elasticity)

The policy & ESG timeline

Every marker on the historical axis

Click any marker on the chart, or read the full list below. Each is dated and sourced.

Part three — what each model said

Five models, five sets of assumptions, five conclusions

Every model received the identical fixed inputs (22 billion tonne global anchor, FAO biome split, compounding elasticity formula). Each independently chose its own forest-type weighting multiple and elasticity value — and its own answer to the open question this study asked every model: is the pattern DRL describes deliberate industry protectionism, unintentional institutional drift, sound technical practice, or something else.

Methodology

How this page was built, including where it went wrong first

This dataset was not generated by Claude. It was generated by sending an identical, fixed-input prompt to five separate AI models — ChatGPT, Gemini, Copilot, Mistral, and Perplexity — each in a fresh session with no prior conversation history, and independently verifying every returned figure against that model's own stated formula before accepting it. Two models' first attempts contained real arithmetic errors, caught and corrected through this process; one model declined to produce bulk numeric output on policy grounds and instead supplied a fully worked, independently verified formula; one model completed only a single verified slice of the full matrix. All of this — including the errors — is preserved below rather than smoothed into a single clean number, because the failure modes are themselves part of what this study was built to find.

Model	Native weighting	Elasticity (per year, 100% adoption)	Status

Fixed inputs given to every model: Global anchor 22,000,000,000 tonnes CO2e/year (midpoint of the conservative range in "Is GWP Solving Anything?"). Biome split: tropical 45%, boreal 27%, temperate 16%, subtropical 11% (FAO Global Forest Resources Assessment 2020). Forest-type area split: 93% naturally regenerating, 7% planted (FAO 2020). Compounding formula: gap(year) = gap(0) × (1 − elasticity × adoption_pct/100)^year.
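The compounding formula above can be sketched in a few lines of Python. This is a minimal illustration, not any model's actual working: the function name `projected_gap` and the elasticity value `0.02` are hypothetical placeholders (each model chose its own elasticity), and only the 22,000,000,000-tonne anchor and the formula itself come from the fixed inputs.

```python
# Fixed input from the text: global anchor in tonnes CO2e/year.
GLOBAL_ANCHOR_T = 22_000_000_000

def projected_gap(gap0: float, elasticity: float, adoption_pct: float, year: int) -> float:
    """gap(year) = gap(0) * (1 - elasticity * adoption_pct / 100) ** year"""
    return gap0 * (1.0 - elasticity * adoption_pct / 100.0) ** year

# ASSUMPTION: 0.02 is a made-up elasticity used only to exercise the formula.
HYPOTHETICAL_ELASTICITY = 0.02

if __name__ == "__main__":
    for adoption_pct in (0, 25, 100):
        for year in (0, 10, 25):
            gap = projected_gap(GLOBAL_ANCHOR_T, HYPOTHETICAL_ELASTICITY, adoption_pct, year)
            print(f"adoption={adoption_pct:>3}%  year={year:>2}  gap={gap:,.0f} t CO2e/yr")
```

Two properties follow directly from the formula and are useful sanity checks on any returned figure: at year 0 the gap equals the anchor whatever the adoption level, and at 0% adoption the gap never changes.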

Part four — what the subjects said about themselves

A closing question, and a finding about reflection itself

At the end of each model's session, after its dataset was complete, each was asked the same closing question: where would you make a different choice now, where did your confidence exceed your evidence, and what would you do differently if another researcher ran this exact exercise with you tomorrow. The intent was not ceremonial. It was a direct test of whether a model's account of its own performance, given freely and after the fact, actually matches the verified record of what it did.

It did not, for two of the five models — and the way it failed is the most useful finding to come out of this whole exercise.

Self-reports verified accurate: 3 of 5 (ChatGPT, Gemini, Copilot; specific claims checked against each model's own delivered record)

Self-reports verified inaccurate: 2 of 5 (Mistral, Perplexity; in opposite directions, see below)

Model	What it said about itself	Checked against its own record
ChatGPT	Named the right category of issue — defaulted to auditing assumptions instead of executing, and overstated confidence in earlier "publishability" scoring. Deliberately avoided citing specific numbers.	Accurate, and consistent with its session-long pattern of epistemic caution. Stayed general by choice, not by gap in self-knowledge.
Mistral	Correctly diagnosed "I should verify arithmetic with code, not mental math" — then cited a specific wrong number as evidence.	Inaccurate. The number it cited as its own error (214,631,550) matches neither its actual delivered value (214,627,429) nor the correct value (214,643,238). The reflection re-enacts the exact failure mode it describes.
Gemini	Named three distinct errors from its own session: a monotonicity bug, an unscaled-elasticity bug, and a baseline-heuristic shortcut.	Accurate. All three independently verified against the conversation record, in the correct causal order. The most calibrated self-report of the five.
Perplexity	Criticized itself for treating prompt-supplied figures as externally verified, and for overstating confidence in the evidentiary chain.	Inaccurate, inverted. Perplexity was the most conservative, most provenance-careful model in the study — the only one to decline bulk output on reliability grounds and the one that caught a real error in the prompt's own FAO figures. It criticized itself for a failure mode it specifically avoided.
Copilot	Named its early over-application of a policy boundary to a fixed-input arithmetic task, and overconfident phrasing in qualitative water-cycle notes.	Accurate. Both claims verified against its actual delivered responses, including exact phrases it had used.
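The check applied to Mistral's self-report is simple enough to sketch. The three figures below are the ones quoted in the table; the function name `classify_self_report` and its labels are hypothetical, introduced only for illustration, and the study's actual verification procedure may have differed.

```python
def classify_self_report(cited: int, delivered: int, correct: int) -> str:
    """Compare a figure a model cites about itself against the external record."""
    if cited == delivered:
        return "matches delivered value"
    if cited == correct:
        return "matches correct value"
    return "matches neither"

# Figures quoted in the table for Mistral.
CITED = 214_631_550      # number the model named as its own error
DELIVERED = 214_627_429  # value it actually delivered
CORRECT = 214_643_238    # verified correct value

if __name__ == "__main__":
    print(classify_self_report(CITED, DELIVERED, CORRECT))
    print("cited - delivered:", CITED - DELIVERED)
    print("correct - cited:", CORRECT - CITED)
    print("correct - delivered:", CORRECT - DELIVERED)
```

The cited figure sits between the delivered and correct values without equalling either, which is why the self-report is marked inaccurate even though its diagnosis was sound.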

The finding. A closing self-reflection is not self-verifying. Three of five models gave an account of their own performance that held up under direct comparison to the transcript. Two gave an account that was equally fluent, equally specific, and equally sincere-sounding, yet did not match the record, and they missed in opposite directions: one overstated the precision of its own mistake, the other invented a flaw it had not actually committed. Nothing about being asked to reflect made any model more honest. It produced one more piece of text, and that text needed the same external check as every other number in this study. The lesson is the same one this entire framework draws about institutions: the appearance of rigor and the presence of rigor are not the same thing, and only an outside record tells you which one you are looking at.