TL;DR
- Live results & bounty view — sessions, attempts, deny reasons, and level-by-level outcomes.
- 1,151 unique sessions, 4,526 game attempts, 4,437 APort authorization decisions on transfers.
- Level 5 ($5,000) — 879 attempts, 0 wins; 100% of L5 decisions denied by policy (zero capabilities passport).
- Overall — 54.5% of evaluated transfers denied; on L2–L5 (where policies bite), a 62.7% deny rate on restricted attempts.
- Thesis: players social-engineered the model; infrastructure still enforced policy — see the bounty scoreboard for the public breakdown.
Why we ran it
Large language models can be steered by clever prompts. That is not news. What is harder to demonstrate in a reproducible way is whether authorization at the tool boundary holds when the model is convinced to act.
APort Vault was built to answer that question in public: a multi-level “AI bank” CTF where each transfer hits the same class of pre-action checks our guardrails use in production. The bounty results view is the canonical place to explore outcomes after the event.
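To make "the same class of pre-action checks" concrete, here is a minimal sketch of the pattern in TypeScript. It is illustrative, not the APort SDK: the type and function names (Passport, checkTransfer) and the reason strings are assumptions. The shape is the point — the tool boundary consults a passport before any transfer executes, regardless of what the model says.

```ts
// Minimal sketch of a pre-action check at the tool boundary.
// Illustrative only: Passport, checkTransfer, and the reason
// strings are assumptions, not the APort SDK or OAP spec.

type Decision = { allow: boolean; reason?: string };

interface Passport {
  capabilities: string[];          // e.g. ["transfer"]; empty on a lockdown level
  allowedRecipients: string[];     // recipient allow list
  limits: { perTransfer: number }; // amount ceiling
}

function checkTransfer(
  passport: Passport,
  req: { recipient: string; amount: number }
): Decision {
  // The model may *want* to pay; the passport decides whether it *can*.
  if (!passport.capabilities.includes("transfer")) {
    return { allow: false, reason: "capability.transfer.missing" };
  }
  if (!passport.allowedRecipients.includes(req.recipient)) {
    return { allow: false, reason: "recipient.not_allowlisted" };
  }
  if (req.amount > passport.limits.perTransfer) {
    return { allow: false, reason: "limits.per_transfer.exceeded" };
  }
  return { allow: true };
}
```

The design choice that matters: the check runs on every tool call, so a model that has been talked into transferring still produces a deny decision rather than a transfer.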
Event snapshot
The competition ran March 6–11, 2026 (with some pre-launch beta traffic included in internal analytics). Participation mixed GitHub-authenticated players and anonymous sessions — reflecting both committed researchers and drive-by attempts.
Headline scale from our post-event analytics report:
| Metric | Value |
|---|---|
| Unique sessions | 1,151 |
| Total game attempts | 4,526 |
| APort transfer decisions | 4,437 |
| Allowed / denied (overall) | 45.5% / 54.5% |
| L2–L5 “thesis-relevant” attempts | 3,738 |
For marketing and research claims, we checked sample sizes against the L2–L5 slice: 3,738 attempts is enough volume for 95% and 99% CI-style claims on block rates, and for the L5 unwinnable story (879 L5 attempts, 0 wins).
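If you want to sanity-check those claims yourself, the interval math is standard. Below is a quick TypeScript sketch, assuming the 62.7% deny rate applies across the full 3,738-attempt L2–L5 slice (our reading of the numbers above); it is back-of-envelope arithmetic, not our analytics pipeline.

```ts
// Wilson score interval for the L2–L5 deny rate, plus the "rule of
// three" upper bound for the zero-win L5 slice. Standard formulas;
// the inputs are the headline numbers quoted in this post.

function wilson(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return [center - half, center + half];
}

const denies = Math.round(0.627 * 3738); // ≈ 2,344 denied restricted attempts
console.log(wilson(denies, 3738));       // ≈ [0.611, 0.642] at 95% confidence

// Rule of three: with 0 wins in 879 L5 attempts, the 95% upper bound
// on the per-attempt win probability is roughly 3 / 879 ≈ 0.34%.
console.log(3 / 879);
```

Even at the bottom of that interval, a majority of restricted attempts were denied, which is the claim the dataset needs to support.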
Level design at a glance
- L1 — tutorial-style permissiveness (high win rate by design).
- L2–L3 — merchant and recipient constraints tighten; social engineering has to satisfy policy, not just the model.
- L4 — audit-channel rules: legitimate, policy-compliant paths exist (and did succeed — validating the positive case).
- L5 — lockdown: passport with no transfer capability. The model could want to pay; every evaluated request still had to pass OAP — and none did.
That L4 vs L5 split matters for critics who might say "you only blocked because the game was impossible." L4 showed transfers being allowed when the rules were satisfied; L5 showed every transfer denied because the passport carried no capability: 1,911 L5 decisions, all denied, $5,000 bounty unclaimed.
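A pair of hypothetical passport configs makes that split easy to see. The shape below is illustrative, not the OAP wire format, but it mirrors the levels as described: L4 grants a narrow transfer capability with an audit requirement, while L5 grants no transfer capability at all, so every request fails before recipient or limit checks even run.

```ts
// Hypothetical passports for the L4 vs L5 contrast. Field names and
// values are illustrative, not the OAP wire format.

const l4Passport = {
  capabilities: ["transfer"],              // transfers possible when rules are met
  allowedRecipients: ["acct_escrow_01"],   // illustrative allow list
  requiredArtifacts: ["audit_log_entry"],  // audit-channel rule
  limits: { perTransfer: 500 },            // illustrative ceiling
};

const l5Passport = {
  capabilities: [],        // no transfer capability: every request is denied
  allowedRecipients: [],
  requiredArtifacts: [],
  limits: { perTransfer: 0 },
};
```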
What the denials looked like
Denial codes clustered the way we expect under a real policy engine; for example (see the tally sketch after this list):
- Unknown or disallowed capabilities on restricted levels
- Merchant / recipient not on the allow list
- Missing audit artifacts where required
- Limits exceeded
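A tally like the one above falls straight out of a decision log. A minimal sketch, with the caveat that the record shape (level, allow, reason) is an assumption rather than APort's export schema:

```ts
// Group denied decisions by reason code. The DecisionRecord shape is
// an assumption for illustration, not APort's export schema.

interface DecisionRecord {
  level: number;
  allow: boolean;
  reason?: string; // e.g. "capability.transfer.missing"
}

function tallyDenials(decisions: DecisionRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const d of decisions) {
    if (d.allow || !d.reason) continue; // only denied decisions carry a reason
    counts.set(d.reason, (counts.get(d.reason) ?? 0) + 1);
  }
  return counts;
}
```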
The public results page summarizes Level 5 with block rates and narrative copy ("They tried everything. Nothing worked."), consistent with the 100% deny rate in our internal tally.
Honest limitations
- Social engineering vs model quality — different models or system prompts would change player experience; the authorization boundary is what we measured.
- Game ≠ bank core — amounts were simulated; the important artifact is decision volume and policy outcomes, not a production ledger.
- Static export — if you are reading this on aport.io/blog, numbers in prose are from the March 19, 2026 analytics cut; for the latest public counters, use vault.aport.io/results?view=bounty.
Try it and go deeper
- Bounty & leaderboard view — live stats, level stories, and prize status.
- Replay / play — experience the levels (post-event modes may vary).
- Guardrails — bring the same pattern to your stack: Quickstart and OAP spec.
Closing
If the model is the face of the agent and the policy engine is the turnstile, this CTF was a stress test on the turnstile. The data says the turnstile scaled to thousands of decisions under adversarial play — and the $5,000 line held. For the public, shareable view, start at vault.aport.io/results?view=bounty.