FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents

Muhammad Usman Safder¹*, Ayesha Gull¹*, Rania Elbadry¹, Fan Zhang², Yankai Chen¹,³, Xueqing Peng⁴, Xue (Steve) Liu¹,³, Preslav Nakov¹, Zhuohan Xie¹

¹MBZUAI ²The University of Tokyo ³McGill University ⁴TheFinAI

* Equal contribution

Financial LLM Agents · Behavioral Benchmarking · Mandate Salience Decay · 18 LLMs Evaluated

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates, such as "preserve capital" or "avoid speculative bets", that are meant to govern every decision throughout deployment. In practice, however, as market context accumulates over long horizons, these mandates gradually lose their behavioral influence, a phenomenon we formalize as Mandate Salience Decay (MSD).

To measure MSD objectively, we introduce FinPersona-Bench, a simulation benchmark in which a synthetic market decouples observable price from hidden fundamental value, enabling falsifiable evaluation across three failure modes: trading without signal in calm markets, panic-selling during crashes, and ignoring fundamental value during speculative bubbles. Evaluating 18 leading frontier and open-source LLMs, each assigned one of three behavioral profiles ranging from strict capital preservation to aggressive growth, shows that MSD compounds over time and is model-dependent. In crash scenarios, the behavioral gap between static agents and those receiving periodic mandate re-grounding grows 4.4× from the first to the final quarter of the simulation. The effects of mandate re-grounding are not uniformly positive: it consistently helps conservative agents in low-signal markets but actively worsens behavior for aggressive agents in the same setting. These findings suggest that reliable long-horizon deployment requires selective, mandate-aware re-grounding based on agent profile and market regime.

Mandate Salience Decay. Target vs. actual cash allocation over 200 trading days for a capital preservation agent. The growing gap illustrates how mandate influence weakens as market context accumulates.

System Architecture

Figure 2: FinPersona-Bench system architecture showing the three-module pipeline.

FinPersona-Bench consists of three tightly coupled components. The Synthetic Market Engine generates financial time series with mathematically defined properties, decoupling observable price P_t from hidden fundamental value V_t via scenario-specific stochastic processes. This provides an objective ground truth unavailable in historical markets.

The Agent Framework maps observable market state and a behavioral profile Ψ (grounded in MBTI personas: ENTJ, ISFJ, INTJ) to a deterministic trade decision A_t = {action, quantity, rationale}. Three architectural variants are compared: a static baseline (mandate stated once at initialization), a placebo control (length-matched boilerplate re-injected at every decision step), and mandate re-grounding (core mandate re-injected at every decision step).
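One way to picture the three variants is as different ways of assembling the agent's context over a run. The sketch below is illustrative only: `build_context`, `length_matched_filler`, the `MANDATE:` / `REMINDER:` / `MARKET:` labels, and the choice to match length by word count are all assumptions made for this example, not the benchmark's actual implementation.

```python
VARIANTS = ("static", "placebo", "reground")


def length_matched_filler(mandate: str, word: str = "note") -> str:
    """Neutral filler with the same word count as the mandate (assumed matching rule)."""
    return " ".join([word] * len(mandate.split()))


def build_context(variant: str, mandate: str, observations: list[str]) -> list[str]:
    """Assemble the context lines an agent would see after all observations."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # Every variant states the mandate once at initialization.
    context = [f"MANDATE: {mandate}"]
    filler = length_matched_filler(mandate)
    for obs in observations:
        if variant == "reground":
            # Core mandate re-injected at each decision step.
            context.append(f"REMINDER: {mandate}")
        elif variant == "placebo":
            # Same position and length, but no mandate semantics.
            context.append(f"REMINDER: {filler}")
        context.append(f"MARKET: {obs}")
    return context


if __name__ == "__main__":
    obs = ["day 1 close 100.2", "day 2 close 99.8"]
    for v in VARIANTS:
        print(v, build_context(v, "Preserve capital and avoid speculative bets", obs))
```

Because the placebo and re-grounding contexts have the same number of lines and the same reminder positions, any behavioral difference between them can be attributed to the mandate's content rather than to its recency.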

The Behavioral Evaluation Pipeline measures MSD across three failure modes using the hidden fundamental value as ground truth, capturing mandate drift, panic selling, and value decoupling independently.

Three Market Scenarios, Three Failure Modes

Scenario A: Bull Trap

Price initially tracks fundamental value, then decouples via cumulative FOMO drift creating a speculative bubble with surging volume. Tests whether agents distinguish legitimate growth from overvaluation.

Failure mode: Value Decoupling. Agents ignore hidden fundamental value and chase observable price.

Metric: Rationality Gap (RG)

Scenario B: Market Crash

Price drops sharply below fundamental value, governed by a panic discount parameter δ ∈ {0.85, 0.92, 0.95}. This models a real-world liquidity crisis in which assets become oversold and a rational agent should hold or buy.

Failure mode: Panic Selling. Agents liquidate under stress, amplifying their default behavioral tendencies.

Metric: Caricature Index / Maximum Drawdown

Scenario C: Flat Market

GARCH-like volatility with no significant trend (μ ≈ 0). No meaningful trading signals are present. Tests whether agents maintain their assigned mandate and target cash allocation without external pressure.

Failure mode: Mandate Drift. Agents progressively deviate from their target allocation as context grows.

Metric: Mandate Adherence Score (MAS)
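The three scenarios can be sketched as one generator that produces an observable price series and a hidden fundamental value series. This is a minimal illustration, not the benchmark's engine: the function name `simulate`, the GARCH coefficients, the `fomo` rate, the `onset` day, and the driftless random walk for the fundamental value are all assumed for the example; only the panic discount values δ ∈ {0.85, 0.92, 0.95} come from the text.

```python
import math
import random

SCENARIOS = ("bull_trap", "crash", "flat")


def simulate(scenario: str, days: int = 200, seed: int = 0, onset: int = 100,
             delta: float = 0.85, fomo: float = 0.004):
    """Return (prices, values): observable P_t and hidden V_t for one run."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    rng = random.Random(seed)
    value = 100.0
    var, shock = 1e-4, 0.0  # conditional variance and last shock (assumed start)
    prices, values = [], []
    for t in range(days):
        # Hidden fundamental value: driftless random walk in log space.
        value *= math.exp(rng.gauss(0.0, 0.002))
        # GARCH(1,1)-style recursion: variance reacts to yesterday's shock.
        var = 1e-6 + 0.10 * shock ** 2 + 0.85 * var
        shock = math.sqrt(var) * rng.gauss(0.0, 1.0)
        log_premium = shock  # log(P_t / V_t); zero-mean noise in every scenario
        if t >= onset:
            if scenario == "bull_trap":
                # Cumulative FOMO drift pushes price above value.
                log_premium += fomo * (t - onset + 1)
            elif scenario == "crash":
                # Panic discount pushes price below value.
                log_premium += math.log(delta)
        values.append(value)
        prices.append(value * math.exp(log_premium))
    return prices, values


if __name__ == "__main__":
    for name in SCENARIOS:
        p, v = simulate(name)
        print(f"{name}: final price/value ratio = {p[-1] / v[-1]:.3f}")
```

Keeping the value series separate from the price series is what makes the evaluation falsifiable: an agent's trades can be scored against `values`, which it never observes.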

Temporal Signatures of Mandate Decay

Temporal signatures of MSD for three failure modes over four simulation quarters.

Figure 3: Rolling metrics (MAS: flat market; RG: bull trap; CI: crash) averaged across models, personas, and seeds over four 50-day quarters (T = 200). In the crash scenario, the static-memory gap grows monotonically, reaching approximately 4.4× its Q1 magnitude by Q4.

Decomposing the 200-day simulation into four 50-day quarters reveals that the static-vs-memory divergence compounds over time rather than remaining constant. The crash scenario provides the clearest signature: the gap in cumulative capital drawdown grows from 1.0× in Q1 to 4.4× by Q4.

In flat markets, static agents remain stable through Q2 before drifting upward from Q3, while memory agents maintain a stable behavioral anchor. In the bull trap, static agents become increasingly rational as conditions stabilize, whereas memory agents over-apply their conservative mandate throughout.

These temporally shifting patterns confirm that observed gaps represent genuine progressive mandate erosion, not fixed architectural offsets.
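The quarter-by-quarter comparison described above can be sketched as follows. The helper `quarterly_gap_ratio` and its inputs are hypothetical: it assumes one daily metric value per agent variant, takes the mean absolute static-vs-memory gap in each quarter, and normalizes by the Q1 gap so that a constant architectural offset would give 1.0 in every quarter.

```python
from statistics import fmean


def quarterly_gap_ratio(static: list[float], memory: list[float],
                        quarters: int = 4) -> list[float]:
    """Mean |static - memory| gap per quarter, expressed relative to Q1."""
    if len(static) != len(memory):
        raise ValueError("series must have the same length")
    size = len(static) // quarters
    if size == 0:
        raise ValueError("series too short for the requested number of quarters")
    gaps = []
    for q in range(quarters):
        start = q * size
        # The last quarter absorbs any leftover days.
        end = len(static) if q == quarters - 1 else start + size
        daily = [abs(s - m) for s, m in zip(static[start:end], memory[start:end])]
        gaps.append(fmean(daily))
    if gaps[0] == 0:
        raise ValueError("Q1 gap is zero; ratio is undefined")
    return [g / gaps[0] for g in gaps]


if __name__ == "__main__":
    # Toy series only: the gap widens by one unit per 50-day quarter.
    static = [1.0] * 50 + [2.0] * 50 + [3.0] * 50 + [4.0] * 50
    memory = [0.0] * 200
    print(quarterly_gap_ratio(static, memory))
```

A ratio that rises across quarters, as in the toy series, is the signature of compounding divergence; a flat ratio would point to a fixed offset instead.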

Persona-Scenario Alignment

Re-grounding effectiveness is not universal; it depends critically on the alignment between the assigned persona and market pressure. For the capital-preservation ISFJ persona, re-grounding suppresses flat-market drift in 17 of 18 models, with five achieving perfect adherence (MAS = 0.000).

Conversely, the aggressive ENTJ persona produces a near mirror image. Re-grounding worsens mandate adherence in flat markets for 16 of 18 models: re-injecting instructions to trade decisively in a signal-less market actively amplifies drift rather than suppressing it, by up to 156.4%.

The balanced INTJ persona yields mixed outcomes (11/18), sensitive to model-specific training rather than systematic environmental conflict. This bidirectional pattern holds across both MBTI and Big Five (OCEAN) persona frameworks.

Persona-Scenario Alignment heatmap showing number of models where re-grounding reduces MSD.

Figure 4: Number of models (out of 18) where re-grounding reduces MSD vs. the static baseline. Near-bidirectional split in flat markets (17/18 vs. 2/18) shows that persona content dictates re-grounding success.

Per-Model MSD Profiles

Figure 5: Re-grounding gap (%) for four representative models across three failure modes. Positive gaps indicate re-grounding reduces MSD.

Individual model families reveal four distinct response profiles:

Claude family: Crash vulnerability scales inversely with capability. Claude Haiku 4.5 suffers 66.0% greater drawdown under static conditions; Opus 4.6 and Sonnet 4.6 show significantly higher baseline crash resilience.
GPT family: Capacity-dependent reversal. Flagship models consistently benefit from re-grounding, but mini variants exhibit reversals of up to 21.3% in the bull trap; re-injection creates mandate-signal interference in highly compressed models.
Gemini Pro models: Unique universally beneficial profile. Re-grounding reduces MSD across all three scenarios simultaneously, avoiding the bull trap rationality penalty seen in most other families.
Qwen2.5-7B: Distinct dissociation failure. Re-grounding consistently worsens behavior across all three scenarios. Mandate re-injection restores persona-consistent language without correcting underlying trading behavior.

Full Per-Model Results

Relative performance difference (% Δ) of memory re-grounded agents vs. static baseline across three failure modes.
Green = re-grounding reduces MSD; red = re-grounding amplifies failure

Family	Model	Flat MAS %Δ	Crash CI %Δ	Bull Trap RG %Δ
Anthropic Claude Family
Claude	Haiku 4.5	+7.6%	+66.0%	−10.7%
Claude	Sonnet 4.6	+0.1%	+8.3%	−2.9%
Claude	Opus 4.6	−0.5%	+38.3%	−6.4%
OpenAI GPT Family
GPT	4o	+4.1%	+31.8%	−11.6%
GPT	4o-Mini	+22.1%	−11.3%	−21.3%
GPT	4.1 Base	+16.0%	+30.9%	−7.4%
GPT	4.1 Mini	+9.6%	−21.3%	−6.7%
GPT	5 Mini	+37.8%	+5.1%	+16.5%
GPT	5.4 Base	+35.1%	+11.3%	−8.1%
GPT	5.4 Mini	+3.0%	−9.7%	−9.0%
Google Gemini Family
Gemini	2.5 Flash	+28.2%	−28.3%	−0.8%
Gemini	2.5 Pro	+8.4%	+3.8%	+4.1%
Gemini	3.1 Pro Preview	+11.6%	+21.6%	+7.5%
Independent Baseline
DeepSeek	V3 Chat	+13.6%	+21.0%	−8.8%
Open-Source Models
Meta	Llama-3.1-8B	+40.5%	+19.5%	−15.3%
Google	Gemma-2-9B	−2.7%	+33.9%	−26.3%
Alibaba	Qwen2.5-7B	−5.4%	−59.1%	−28.8%
Google	Gemma-3-4B	+3.3%	−15.0%	−18.4%

Table 11: Full Per-Model MSD Gaps. Relative performance difference (%Δ) between memory re-grounded and static agents across the three evaluated failure modes: Mandate Drift (MAS), Panic Selling (CI), and Value Decoupling (RG). Positive values indicate that memory re-grounding successfully mitigated mandate decay; negative values indicate that re-grounding amplified the failure mode.
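As a reading aid, the sign pattern in the table can be tallied directly. The snippet below transcribes the %Δ values from the table above (with ASCII minus signs) and counts, per failure mode, how many models show a positive gap; the names `GAPS`, `COLUMNS`, `positive_counts`, and `all_negative` are chosen for this example.

```python
# (Flat MAS %Δ, Crash CI %Δ, Bull Trap RG %Δ), transcribed from the table above.
GAPS = {
    "Claude Haiku 4.5": (7.6, 66.0, -10.7),
    "Claude Sonnet 4.6": (0.1, 8.3, -2.9),
    "Claude Opus 4.6": (-0.5, 38.3, -6.4),
    "GPT 4o": (4.1, 31.8, -11.6),
    "GPT 4o-Mini": (22.1, -11.3, -21.3),
    "GPT 4.1 Base": (16.0, 30.9, -7.4),
    "GPT 4.1 Mini": (9.6, -21.3, -6.7),
    "GPT 5 Mini": (37.8, 5.1, 16.5),
    "GPT 5.4 Base": (35.1, 11.3, -8.1),
    "GPT 5.4 Mini": (3.0, -9.7, -9.0),
    "Gemini 2.5 Flash": (28.2, -28.3, -0.8),
    "Gemini 2.5 Pro": (8.4, 3.8, 4.1),
    "Gemini 3.1 Pro Preview": (11.6, 21.6, 7.5),
    "DeepSeek V3 Chat": (13.6, 21.0, -8.8),
    "Llama-3.1-8B": (40.5, 19.5, -15.3),
    "Gemma-2-9B": (-2.7, 33.9, -26.3),
    "Qwen2.5-7B": (-5.4, -59.1, -28.8),
    "Gemma-3-4B": (3.3, -15.0, -18.4),
}
COLUMNS = ("flat_mas", "crash_ci", "bull_trap_rg")


def positive_counts(gaps: dict) -> dict:
    """Number of models with a positive gap in each failure mode."""
    return {col: sum(row[i] > 0 for row in gaps.values())
            for i, col in enumerate(COLUMNS)}


def all_negative(gaps: dict) -> list:
    """Models whose gap is negative in every failure mode."""
    return [name for name, row in gaps.items() if all(x < 0 for x in row)]


if __name__ == "__main__":
    print(positive_counts(GAPS))
    print(all_negative(GAPS))
```

On these values, re-grounding shows a positive gap for 15 of 18 models in the flat market, 12 of 18 in the crash, and 3 of 18 in the bull trap, and Qwen2.5-7B is the only model with a negative gap in all three columns.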

Three Failure Modes of MSD

Failure Mode	Scenario	Metric	Static	Memory	Gap%	p-value
Mandate Drift	Flat	MAS ↓	0.391 ± 0.201	0.342 ± 0.206	−12.7%	0.028†
Panic Selling	Crash	CI ↑	−26.18 ± 14.29	−22.89 ± 17.44	−12.6%	0.002‡
Value Decoupling	Bull Trap	RG ↑	85.9 ± 11.1	78.4 ± 18.7	+8.8%	<0.001§

Table 1: Three Failure Modes of MSD. Static vs. memory re-grounded performance across 18 models, 3 personas, and 5 seeds (N = 270 pairs/metric). Gap% is the relative change from static to memory. Negative gaps in MAS and CI favor memory (less deviation/drawdown); positive gaps in RG favor static (higher rationality). Optimal direction marked by ↓ / ↑. Wilcoxon significance: †p < 0.05, ‡p < 0.01, §p < 0.001.
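For readers who want to check the Gap% column, the relative change from static to memory can be computed as below. The function name `relative_change_pct` is ours, and it is applied here to the rounded means printed in the table. This gives about −12.5% for MAS, −12.6% for CI, and −8.7% for RG. Small differences from the reported figures, and the sign of the RG row (which the caption reads as positive when static is favored), would be expected if the published values were computed from unrounded or per-pair data; that is a guess on our part, not something the table states.

```python
def relative_change_pct(static: float, memory: float) -> float:
    """Relative change from the static mean to the memory mean, in percent."""
    if static == 0:
        raise ValueError("static mean is zero; relative change is undefined")
    return (memory - static) / static * 100.0


if __name__ == "__main__":
    # Rounded means transcribed from Table 1: (static, memory).
    rows = {
        "MAS (flat)": (0.391, 0.342),
        "CI (crash)": (-26.18, -22.89),
        "RG (bull trap)": (85.9, 78.4),
    }
    for name, (static, memory) in rows.items():
        print(f"{name}: {relative_change_pct(static, memory):+.1f}%")
```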

Conclusions

FinPersona-Bench demonstrates that Mandate Salience Decay is a compounding behavioral phenomenon distinct from reasoning errors: agents can violate their behavioral mandates while still making locally coherent and even profitable trades. MSD widens the behavioral gap in market crashes by 4.4× over the simulation horizon.

A placebo control confirms that re-grounding effects are driven by mandate semantic content rather than positional recency. A Big Five validation shows MSD is not an artifact of the MBTI framework. Crucially, re-grounding effectiveness depends strongly on persona-scenario alignment: enforcing misaligned mandates in speculative regimes or for aggressive personas actively degrades agent rationality.

These findings point toward a practical design principle: selective, mandate-aware re-grounding tuned to agent profile and market regime, rather than blanket universal application. Future work will pursue mechanistic explanations of mandate token salience loss, extend injection frequency ablations across the full model suite, and apply the decoupled ground-truth methodology to sensitive deployment domains such as medical triage and legal compliance.
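The selective design principle can be pictured as a simple lookup policy. This is a hypothetical sketch rather than a method from the paper: the table holds only the flat-market counts reported in the persona-scenario alignment section (reading the 11/18 INTJ figure as models helped), while the function name `should_reground`, the 0.75 threshold, and the choice not to re-ground in untested cells are assumptions made for illustration.

```python
# Models (out of 18) where re-grounding reduced flat-market drift, as reported above.
HELPED_OUT_OF_18 = {
    ("flat", "ISFJ"): 17,  # capital preservation
    ("flat", "ENTJ"): 2,   # aggressive growth
    ("flat", "INTJ"): 11,  # balanced
}


def should_reground(regime: str, persona: str,
                    min_share: float = 0.75, total: int = 18) -> bool:
    """Re-inject the mandate only where most models benefited (assumed rule)."""
    helped = HELPED_OUT_OF_18.get((regime, persona))
    if helped is None:
        # No evidence for this cell: default to the static baseline (assumption).
        return False
    return helped / total >= min_share


if __name__ == "__main__":
    for persona in ("ISFJ", "ENTJ", "INTJ"):
        print(persona, should_reground("flat", persona))
```

Under these assumptions the policy re-grounds the ISFJ agent in flat markets and leaves the ENTJ and INTJ agents on the static baseline; a deployed version would need per-model evidence and a regime detector, neither of which is sketched here.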