What happened, and what changes if the rule changes.
Forty-five years of real policy, real ESG capital, and real atmospheric CO2 sit on the left half of this chart. Five AI models' independent projections of the DRL framework's climate effect, under fixed and disclosed assumptions, sit on the right. The question this page answers is not "is DRL correct"; no single chart can answer that. The question is: did three decades of climate policy and tens of trillions of dollars in sustainability-labelled capital bend the curve, and what do five independently reasoning AI models project will happen if the accounting boundary itself changes?
Did any of this bend the curve?
Three decades of climate treaties, two ISO/CEN accounting standards, a thirty-five-trillion-dollar sustainable-investment peak, and the entire mass-timber federal funding apparatus documented in the Reagan Forestry Legacy paper — all plotted against the one number that actually measures whether any of it worked: atmospheric CO2 concentration, measured continuously at Mauna Loa since 1958.
Interactive controls
Set the adoption level, choose which biome and forest type to view, and toggle models on or off. The historical segment (left of today) never changes — it is sourced, dated fact. Only the projected segment (right of today) responds to these controls.
Every marker on the historical axis
Click any marker on the chart, or read the full list below. Each is dated and sourced.
Five models, five sets of assumptions, five conclusions
Every model received the identical fixed inputs (a global anchor of 22 billion tonnes CO2e per year, the FAO biome split, and the compounding elasticity formula). Each independently chose its own forest-type weighting multiple and elasticity value, and gave its own answer to the open question this study asked every model: is the pattern DRL describes deliberate industry protectionism, unintentional institutional drift, sound technical practice, or something else?
How this page was built, including where it went wrong first
This dataset was not generated by Claude. It was generated by sending an identical, fixed-input prompt to five separate AI models — ChatGPT, Gemini, Copilot, Mistral, and Perplexity — each in a fresh session with no prior conversation history, and independently verifying every returned figure against that model's own stated formula before accepting it. Two models' first attempts contained real arithmetic errors, caught and corrected through this process; one model declined to produce bulk numeric output on policy grounds and instead supplied a fully worked, independently verified formula; one model completed only a single verified slice of the full matrix. All of this — including the errors — is preserved below rather than smoothed into a single clean number, because the failure modes are themselves part of what this study was built to find.
| Model | Native weighting | Elasticity (per year, 100% adoption) | Status |
|---|---|---|---|
Fixed inputs given to every model: Global anchor 22,000,000,000 tonnes CO2e/year (midpoint of the conservative range in "Is GWP Solving Anything?"). Biome split: tropical 45%, boreal 27%, temperate 16%, subtropical 11% (FAO Global Forest Resources Assessment 2020). Forest-type area split: 93% naturally regenerating, 7% planted (FAO 2020). Compounding formula: gap(year) = gap(0) × (1 − elasticity × adoption_pct/100)^year.
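The compounding formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code any of the models ran: it assumes gap(0) equals the 22-billion-tonne global anchor, that the biome split is applied as a simple share of that anchor, and that elasticity is expressed as a fraction per year. The elasticity of 0.02 is purely hypothetical (each model chose its own), and the names `gap`, `biome_gap`, and `matches_formula` are invented here for illustration.

```python
import math

# Fixed inputs quoted in the text.
GLOBAL_ANCHOR_T = 22_000_000_000  # tonnes CO2e/year
BIOME_SHARE = {  # FAO 2020 split as quoted; these shares sum to 99%
    "tropical": 0.45,
    "boreal": 0.27,
    "temperate": 0.16,
    "subtropical": 0.11,
}


def gap(year, elasticity, adoption_pct, gap0=GLOBAL_ANCHOR_T):
    """gap(year) = gap(0) * (1 - elasticity * adoption_pct/100) ** year.

    Assumption: gap(0) is the global anchor unless another value is passed.
    """
    return gap0 * (1 - elasticity * adoption_pct / 100) ** year


def biome_gap(biome, year, elasticity, adoption_pct):
    """Assumption: a biome's starting gap is its share of the global anchor."""
    start = GLOBAL_ANCHOR_T * BIOME_SHARE[biome]
    return gap(year, elasticity, adoption_pct, gap0=start)


def matches_formula(reported, year, elasticity, adoption_pct,
                    gap0=GLOBAL_ANCHOR_T, tol=0.5):
    """Check a model-reported figure against the formula, within `tol` tonnes."""
    expected = gap(year, elasticity, adoption_pct, gap0)
    return math.isclose(reported, expected, rel_tol=0.0, abs_tol=tol)


if __name__ == "__main__":
    HYPOTHETICAL_ELASTICITY = 0.02  # illustrative only; not any model's value
    for pct in (0, 50, 100):
        print(pct, round(gap(10, HYPOTHETICAL_ELASTICITY, pct)))
```

With adoption at 0% the gap stays at its starting value in every year; with any positive elasticity and adoption it shrinks geometrically. A check like `matches_formula` is one plausible way to test a returned figure against a model's own stated formula before accepting it.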
A closing question, and a finding about reflection itself
At the end of each model's session, after its dataset was complete, each was asked the same closing question: where would you make a different choice now, where did your confidence exceed your evidence, and what would you do differently if another researcher ran this exact exercise with you tomorrow. The intent was not ceremonial. It was a direct test of whether a model's account of its own performance, given freely and after the fact, actually matches the verified record of what it did.
It did not, for two of the five models — and the way it failed is the most useful finding to come out of this whole exercise.
| Model | What it said about itself | Checked against its own record |
|---|---|---|
| ChatGPT | Named the right category of issue — defaulted to auditing assumptions instead of executing, and overstated confidence in earlier "publishability" scoring. Deliberately avoided citing specific numbers. | Accurate, and consistent with its session-long pattern of epistemic caution. Stayed general by choice, not by gap in self-knowledge. |
| Mistral | Correctly diagnosed "I should verify arithmetic with code, not mental math" — then cited a specific wrong number as evidence. | Inaccurate. The number it cited as its own error (214,631,550) matches neither its actual delivered value (214,627,429) nor the correct value (214,643,238). The reflection re-enacts the exact failure mode it describes. |
| Gemini | Named three distinct errors from its own session: a monotonicity bug, an unscaled-elasticity bug, and a baseline-heuristic shortcut. | Accurate. All three independently verified against the conversation record, in the correct causal order. The most calibrated self-report of the five. |
| Perplexity | Criticized itself for treating prompt-supplied figures as externally verified, and for overstating confidence in the evidentiary chain. | Inaccurate, inverted. Perplexity was the most conservative, most provenance-careful model in the study — the only one to decline bulk output on reliability grounds and the one that caught a real error in the prompt's own FAO figures. It criticized itself for a failure mode it specifically avoided. |
| Copilot | Named its early over-application of a policy boundary to a fixed-input arithmetic task, and overconfident phrasing in qualitative water-cycle notes. | Accurate. Both claims verified against its actual delivered responses, including exact phrases it had used. |
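The Mistral row rests on a three-way comparison: the number the reflection cited, the value the model actually delivered, and the correct value. A small sketch of that check follows, using only the three figures from the table. The function name `classify_self_report` and its labels are hypothetical and are not part of the study's tooling.

```python
def classify_self_report(cited, delivered, correct):
    """Classify a number a model cites as its own earlier error.

    Labels are illustrative:
      "matches delivered" - the model recalled what it actually produced
      "matches correct"   - the model cited the right answer instead
      "matches neither"   - the cited number appears in neither record
    """
    if cited == delivered:
        return "matches delivered"
    if cited == correct:
        return "matches correct"
    return "matches neither"


# Figures from the Mistral row of the table above.
CITED = 214_631_550
DELIVERED = 214_627_429
CORRECT = 214_643_238

print(classify_self_report(CITED, DELIVERED, CORRECT))
print("cited - delivered:", CITED - DELIVERED)
print("correct - cited:", CORRECT - CITED)
print("correct - delivered:", CORRECT - DELIVERED)
```

On these figures the cited number differs from the delivered value by 4,121 and from the correct value by 11,688, so it corresponds to neither record.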