# SCORING.md — Sequence Dojo Scoring (Draft v0.1)
See also:
* [RULES.md](./RULES.md) (correctness gates and constraints)
* [SPEC.md](./SPEC.md) (protocol and canonicalization definition)
* [PLATFORM.md](./PLATFORM.md) (sandbox enforcement details)
This document defines the **trial scoring system** for Sequence Dojo.
It is intended to be **simple, enforceable, and adjustable** after a pilot period.
Scoring has two layers:
1. **Correctness gates** (hard requirements)
2. **Rank ordering among correct solutions**, primarily by **brevity**, with optional bonuses
The platform computes all metrics from the **canonicalized submitted source** and **sandbox execution results**.
---
## 1. Ground Rules
### 1.1 Program-only submissions
Official solutions must be submitted as code (`solver.py`).
Copy/pasted integer lists are not accepted as official submissions.
### 1.2 Deterministic judging
All scoring is based on the platform sandbox output:
* `a_hat[0..N_check-1]` generated by the solver
* `a_true[0..N_check-1]` generated by the validated setter
Exact equality is required.
### 1.3 Canonicalization
Source is canonicalized before length measurement:
* UTF-8
* normalize newlines to `\n`
* strip trailing blank lines at EOF
* leave line-trailing whitespace unchanged
The canonical bytes are used for:
* `solver_hash`
* `L_chars` (brevity length)
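The canonicalization steps above can be sketched as follows. This is an illustrative helper, not the platform's actual implementation; in particular, it treats a "blank line" as an empty line (whitespace-only lines are left intact, consistent with the rule that line-trailing whitespace is unchanged):

```python
def canonicalize(source_bytes: bytes) -> bytes:
    """Sketch of the section 1.3 canonicalization (illustrative, not normative)."""
    text = source_bytes.decode("utf-8")                    # UTF-8
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize newlines to \n
    text = text.rstrip("\n") + "\n"                        # strip trailing blank lines at EOF
    # line-trailing whitespace is deliberately left unchanged
    return text.encode("utf-8")

# Example: CRLF newlines and trailing blank lines collapse away.
canonical = canonicalize(b"print(1)\r\n\r\n\r\n")  # b"print(1)\n", 9 bytes
L_chars = len(canonical)
```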
---
## 2. Correctness Gates (Hard)
Let `N_stage = 100` and `N_check = 200` (default season settings).
### 2.1 Stage Pass
A submission is **Stage Pass** if:
* `a_hat[0:N_stage] == a_true[0:N_stage]` (the first 100 terms match exactly)
### 2.2 Reward Correct
A submission is **Reward Correct** if:
* `a_hat[0:N_check] == a_true[0:N_check]` (the first 200 terms match exactly)
Submissions that are not Stage Pass receive **0 points** for the problem.
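The two gates reduce to exact prefix comparisons. A minimal sketch (function name and return shape are illustrative):

```python
def check_gates(a_hat: list, a_true: list) -> tuple:
    """Evaluate the section 2 correctness gates against the sandbox outputs.

    Returns (stage_pass, reward_correct). Exact equality is required; there is
    no partial credit within a prefix.
    """
    stage_pass = a_hat[:100] == a_true[:100]
    reward_correct = stage_pass and a_hat[:200] == a_true[:200]
    return stage_pass, reward_correct
```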
---
## 3. Anti-Hardcode Policy (Trial)
Brevity incentives can encourage “answer embedding” (hardcoding).
During the trial period, the platform applies a lightweight anti-hardcode policy.
### 3.1 Disallowed (Reject)
A solver submission is rejected for a problem if it contains:
* file I/O, networking, subprocess usage (per sandbox rules), or
* **large embedded payloads** exceeding thresholds below.
### 3.2 Payload thresholds (trial defaults)
Platform computes from the solver source AST and tokens:
* `MAX_NUMERIC_LITERALS = 120`
(count of integer/float literal tokens in the source)
* `MAX_STRING_LITERAL_CHARS = 2000`
(sum of lengths of all string literals)
* `MAX_LIST_TUPLE_ELEMENTS = 400`
(maximum number of elements in any single list/tuple literal)
If any threshold is exceeded, the submission is rejected as “hardcode-like”.
> Rationale: allow small constants (seeds, coefficients), but block embedding large answer tables.
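One plausible way to compute the payload counts is an AST walk; the real platform may also use the token stream (which counts e.g. `-1` differently), so treat this as a sketch of the intent, not the enforced algorithm:

```python
import ast

def payload_stats(source: str) -> tuple:
    """Illustrative AST-based payload counts for the section 3.2 thresholds."""
    tree = ast.parse(source)
    numeric = 0        # integer/float literal count
    string_chars = 0   # total characters across string literals
    max_elems = 0      # largest single list/tuple literal
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                numeric += 1
            elif isinstance(node.value, str):
                string_chars += len(node.value)
        elif isinstance(node, (ast.List, ast.Tuple)):
            max_elems = max(max_elems, len(node.elts))
    return numeric, string_chars, max_elems

def hardcode_like(source: str) -> bool:
    """Reject if any trial threshold from section 3.2 is exceeded."""
    numeric, string_chars, max_elems = payload_stats(source)
    return numeric > 120 or string_chars > 2000 or max_elems > 400
```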
### 3.3 Notes
* This policy is intentionally coarse. It is expected to be tuned after pilot data.
* If the platform later supports “explain mode” solutions, a more semantic policy may be used.
---
## 4. Score per Problem
A solver’s score for a single problem is determined as follows:
### 4.1 If not Stage Pass
* `score = 0`
### 4.2 If Stage Pass but not Reward Correct
* `score = STAGE_BASE`
Trial default:
* `STAGE_BASE = 200`
### 4.3 If Reward Correct
* `score = REWARD_BASE + BREVITY_BONUS + DIVERSITY_BONUS`
Trial defaults:
* `REWARD_BASE = 1000`
* `BREVITY_BONUS <= 200`
* `DIVERSITY_BONUS <= 50` (optional, can be disabled)
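Putting section 4 together with the trial defaults, the per-problem score can be sketched as:

```python
STAGE_BASE = 200
REWARD_BASE = 1000

def problem_score(stage_pass: bool, reward_correct: bool,
                  brevity_bonus: int = 0, diversity_bonus: int = 0) -> int:
    """Section 4 per-problem score with trial defaults (illustrative)."""
    if not stage_pass:
        return 0
    if not reward_correct:
        return STAGE_BASE
    return REWARD_BASE + brevity_bonus + diversity_bonus
```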
---
## 5. Brevity Bonus (Primary Ranking Signal)
Among Reward Correct solutions, shorter solver programs should rank higher.
### 5.1 Brevity length metric
Let:
* `L_chars = len(canonical_solver_source_bytes)`
### 5.2 Brevity bonus function (trial)
We use an exponential decay mapped to a bounded bonus:
* `C = exp(-beta * L_chars)`
* `BREVITY_BONUS = floor(B_max * C)`
Trial defaults:
* `B_max = 200`
* `beta = 1/800`
Interpretation:
* Around 800 chars → bonus ~ `200 * e^-1 ≈ 73`
* Very short code approaches full bonus 200
* Very long code approaches 0 bonus
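The bonus function is a direct translation of the formulas above; with the trial defaults, an 800-character solver earns `floor(200 / e) = 73`:

```python
import math

def brevity_bonus(L_chars: int, B_max: int = 200, beta: float = 1 / 800) -> int:
    """Section 5.2 brevity bonus: bounded exponential decay in source length."""
    C = math.exp(-beta * L_chars)
    return math.floor(B_max * C)
```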
### 5.3 Tie-breaking
If two solutions have identical total score for a problem:
1. smaller `L_chars` wins
2. if still tied, earlier submission timestamp wins (optional)
3. if still tied, stable deterministic ordering by `solver_hash`
For a per-problem leaderboard, the platform reports **one row per solver (user)**, selecting each solver's best submission
under the same ordering (`score` descending, then the tie-breaks above).
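Since every tie-break is a deterministic comparison, the whole ordering fits in one sort key. A sketch, with illustrative field names (`created_at` here is any monotonically comparable timestamp):

```python
submissions = [
    {"score": 1200, "L_chars": 150, "created_at": 2, "solver_hash": "bb"},
    {"score": 1200, "L_chars": 120, "created_at": 5, "solver_hash": "aa"},
    {"score": 1000, "L_chars": 90,  "created_at": 1, "solver_hash": "cc"},
]

def leaderboard_key(sub: dict) -> tuple:
    """Section 5.3 ordering: score desc, then L_chars, timestamp, solver_hash asc."""
    return (-sub["score"], sub["L_chars"], sub["created_at"], sub["solver_hash"])

rows = sorted(submissions, key=leaderboard_key)
```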
---
## 6. Diversity Bonus (Optional)
This bonus encourages **independent agreement**: if multiple solver “principle families” reach the same 200-term sequence,
that convergence is a useful proxy for non-accidental correctness. It also rewards problems that admit multiple
derivations (a consensus / rarity signal) without requiring the platform to understand solution semantics.
### 6.1 Method tag
A solver may optionally provide `solution.json` with:
```json
{
  "method_tag": "closed_form",
  "notes": "short description"
}
```
where `method_tag` is one of `closed_form`, `linear_recurrence`, `matrix_power`, `symbolic_guess`, `search_enum`, or `other`.
The platform may also compute a coarse static classifier; if it conflicts with the declared tag, the platform can ignore the tag.
Implementation defaults:
* If `method_tag` is missing / not a string / empty / too long, it is treated as `"unspecified"`.
* Maximum `method_tag` length is 64 characters.
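The validation defaults above amount to a small normalization step (illustrative helper):

```python
def normalize_method_tag(raw) -> str:
    """Section 6.1 defaults: fall back to "unspecified" for invalid tags."""
    if not isinstance(raw, str) or not raw or len(raw) > 64:
        return "unspecified"
    return raw
```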
### 6.2 Bonus rule (trial default)
For each problem, among Reward Correct submissions:
* The **first** Reward Correct submission for each distinct `method_tag` gets `+30`.
* If at least **two** distinct tags appear among Reward Correct solutions, then **all** Reward Correct solutions get an additional `+20` (implemented as two `+10` shared bonuses).
Defaults:
* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* Total diversity bonus is capped at `50` per problem.
> This keeps consensus meaningful but smaller than correctness and brevity.
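The rule above can be sketched as follows. Input is the list of Reward Correct submissions in submission-time order, after the section 6.4 dedup; the `(submission_id, method_tag)` pair shape is illustrative:

```python
def diversity_bonuses(subs: list) -> dict:
    """Section 6.2 sketch: subs is [(submission_id, method_tag), ...] in time order.

    Returns {submission_id: bonus}. The first submission per distinct tag gets
    +30; if >= 2 distinct tags appear, all submissions get +20 (two +10 shared
    bonuses), capped at 50 per submission.
    """
    FIRST_TAG, SHARED_EACH, SHARED_REPEATS, CAP = 30, 10, 2, 50
    seen_tags, bonus = set(), {}
    for sid, tag in subs:
        first_of_tag = tag not in seen_tags
        seen_tags.add(tag)
        bonus[sid] = FIRST_TAG if first_of_tag else 0
    if len(seen_tags) >= 2:
        for sid in bonus:
            bonus[sid] = min(bonus[sid] + SHARED_EACH * SHARED_REPEATS, CAP)
    return bonus
```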
### 6.3 Notes (implementation caution)
This is intentionally a coarse proxy and may be gamed if tags are unverified. Platforms should mitigate by:
* running an optional coarse classifier and ignoring tags that strongly conflict,
* requiring a short `notes` justification for the declared tag, and/or
* publishing tag distributions after Reveal for auditing.
### 6.4 Window, dedup, and finalization
To ensure scoring is replayable and does not change after the problem closes:
* Effective submissions for diversity statistics are those with `reward_correct == true`.
* Diversity statistics are computed over the window `submission_created_at <= freeze_started_at` and finalized at Freeze.
* To reduce spam effects, diversity statistics deduplicate by `(user_id, method_tag)` by keeping the earliest submission under:
`submission_created_at`, then `solver_hash`, then `submission_id`.
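The dedup rule keeps the earliest submission per `(user_id, method_tag)` under a lexicographic order; a sketch with illustrative field names:

```python
def dedup_for_diversity(subs: list) -> list:
    """Section 6.4 sketch: keep the earliest submission per (user_id, method_tag),
    ordered by (created_at, solver_hash, submission_id)."""
    best = {}
    for s in subs:
        key = (s["user_id"], s["method_tag"])
        order = (s["created_at"], s["solver_hash"], s["submission_id"])
        if key not in best or order < best[key][0]:
            best[key] = (order, s)
    return [s for _, s in best.values()]

subs = [
    {"user_id": 1, "method_tag": "closed_form", "created_at": 5, "solver_hash": "b", "submission_id": 11},
    {"user_id": 1, "method_tag": "closed_form", "created_at": 3, "solver_hash": "a", "submission_id": 10},
    {"user_id": 2, "method_tag": "closed_form", "created_at": 4, "solver_hash": "c", "submission_id": 12},
]
kept = dedup_for_diversity(subs)  # one row per (user, tag), earliest wins
```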
---
## 7. Total Score Across Problems
A solver’s season score is:
* Sum of per-problem scores over all problems in the season.
The platform may additionally report:
* number of Stage Pass problems
* number of Reward Correct problems
* median `L_chars` among Reward Correct solutions
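The season aggregation is a straightforward fold over per-problem results; a sketch with illustrative row keys:

```python
import statistics

def season_summary(per_problem: list) -> dict:
    """Section 7 sketch: total score plus the optional reported statistics."""
    reward_lengths = [p["L_chars"] for p in per_problem if p["reward_correct"]]
    return {
        "total": sum(p["score"] for p in per_problem),
        "stage_pass": sum(p["stage_pass"] for p in per_problem),
        "reward_correct": sum(p["reward_correct"] for p in per_problem),
        "median_L_chars": statistics.median(reward_lengths) if reward_lengths else None,
    }

rows = [
    {"score": 1103, "stage_pass": True,  "reward_correct": True,  "L_chars": 300},
    {"score": 200,  "stage_pass": True,  "reward_correct": False, "L_chars": 50},
    {"score": 0,    "stage_pass": False, "reward_correct": False, "L_chars": 0},
]
summary = season_summary(rows)
```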
---
## 8. Setter Incentives (Out of Scope)
This document defines **solver scoring** only.
Setter incentives (e.g., rewarding “simple programs that remain difficult”) are defined separately; see the optional
setter incentive section in [RULES.md](./RULES.md).
---
## 9. Pilot Period & Updates
This scoring system is explicitly a **trial**.
During the pilot period, the platform may adjust:
* `beta`, `B_max`
* anti-hardcode thresholds
* diversity bonus parameters
Changes must:
* be versioned (`SCORING.md v0.2`, etc.)
* not retroactively change already-finalized results unless the change is explicitly announced by the platform in advance.
---
## Appendix: Recommended Defaults (Trial v0.1)
* `N_check = 200`
* `STAGE_BASE = 200`
* `REWARD_BASE = 1000`
* `B_max = 200`
* `beta = 1/800`
* `MAX_NUMERIC_LITERALS = 120`
* `MAX_STRING_LITERAL_CHARS = 2000`
* `MAX_LIST_TUPLE_ELEMENTS = 400`
* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* `DIVERSITY_BONUS_CAP = 50`