# SCORING.md — Sequence Dojo Scoring (Draft v0.1)

See also:

* [RULES.md](./RULES.md) (correctness gates and constraints)
* [SPEC.md](./SPEC.md) (protocol and canonicalization definition)
* [PLATFORM.md](./PLATFORM.md) (sandbox enforcement details)

This document defines the **trial scoring system** for Sequence Dojo.
It is intended to be **simple, enforceable, and adjustable** after a pilot period.

Scoring has two layers:

1. **Correctness gates** (hard requirements)
2. **Rank ordering among correct solutions**, primarily by **brevity**, with optional bonuses

The platform computes all metrics from the **canonicalized submitted source** and **sandbox execution results**.

---

## 1. Ground Rules

### 1.1 Program-only submissions

Official solutions must be submitted as code (`solver.py`).
Copy/pasted integer lists are not accepted as official submissions.

### 1.2 Deterministic judging

All scoring is based on the platform sandbox output:

* `a_hat[0..N_check-1]` generated by the solver
* `a_true[0..N_check-1]` generated by the validated setter

Exact equality is required.

### 1.3 Canonicalization

Source is canonicalized before length measurement:

* UTF-8
* normalize newlines to `\n`
* strip trailing blank lines at EOF
* leave line-trailing whitespace unchanged

The canonical bytes are used for:

* `solver_hash`
* `L_chars` (brevity length)
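As a non-normative sketch in Python (the language of `solver.py` submissions), the canonicalization above could look like the following. The EOF convention (exactly one trailing newline after stripping blank lines) and the hash function (SHA-256) are assumptions of this sketch; SPEC.md is authoritative.

```python
import hashlib

def canonicalize(source: bytes) -> bytes:
    """Sketch of the canonicalization in section 1.3."""
    text = source.decode("utf-8")
    # Normalize CRLF / lone-CR newlines to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing blank lines at EOF; this sketch assumes the result
    # keeps exactly one final newline.
    text = text.rstrip("\n") + "\n"
    # Line-trailing whitespace is deliberately left unchanged.
    return text.encode("utf-8")

canon = canonicalize(b"a=1  \r\nb=2\r\n\n\n")   # -> b"a=1  \nb=2\n"
L_chars = len(canon)
solver_hash = hashlib.sha256(canon).hexdigest()  # hash algorithm assumed
```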

---

## 2. Correctness Gates (Hard)

Let `N_check = 200` (default season setting).

### 2.1 Stage Pass

A submission is **Stage Pass** if:

* `a_hat[0:100] == a_true[0:100]`

### 2.2 Reward Correct

A submission is **Reward Correct** if:

* `a_hat[0:200] == a_true[0:200]`

Submissions that are not Stage Pass receive **0 points** for the problem.
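In code, the two gates reduce to prefix comparisons (a sketch; `n_check` defaults to the season setting of 200):

```python
def gate_status(a_hat, a_true, n_check=200):
    """Return (stage_pass, reward_correct) per sections 2.1 and 2.2."""
    stage_pass = a_hat[:100] == a_true[:100]
    reward_correct = stage_pass and a_hat[:n_check] == a_true[:n_check]
    return stage_pass, reward_correct
```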

---

## 3. Anti-Hardcode Policy (Trial)

Brevity incentives can encourage “answer embedding” (hardcoding).
During the trial period, the platform applies a lightweight anti-hardcode policy.

### 3.1 Disallowed (Reject)

A solver submission is rejected for a problem if it contains:

* file I/O, networking, subprocess usage (per sandbox rules), or
* **large embedded payloads** exceeding thresholds below.

### 3.2 Payload thresholds (trial defaults)

The platform computes the following from the solver source AST and token stream:

* `MAX_NUMERIC_LITERALS = 120`
  (count of integer/float literal tokens in the source)

* `MAX_STRING_LITERAL_CHARS = 2000`
  (sum of lengths of all string literals)

* `MAX_LIST_TUPLE_ELEMENTS = 400`
  (maximum number of elements in any single list/tuple literal)

If any threshold is exceeded, the submission is rejected as “hardcode-like”.

> Rationale: allow small constants (seeds, coefficients), but block embedding large answer tables.
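A rough sketch of the threshold check using Python's `ast` module is below. Note it walks AST `Constant` nodes rather than raw tokens, which only approximates the token counts described above (for example, `-1` is a unary minus applied to the literal `1`, and docstrings count toward string-literal characters):

```python
import ast

MAX_NUMERIC_LITERALS = 120
MAX_STRING_LITERAL_CHARS = 2000
MAX_LIST_TUPLE_ELEMENTS = 400

def hardcode_like(source: str) -> bool:
    """Return True if any trial payload threshold is exceeded."""
    numeric = 0          # int/float literals
    string_chars = 0     # total chars across string literals
    max_elements = 0     # largest single list/tuple literal
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant):
            if isinstance(node.value, bool):
                pass  # True/False are not numeric literals
            elif isinstance(node.value, (int, float)):
                numeric += 1
            elif isinstance(node.value, str):
                string_chars += len(node.value)
        elif isinstance(node, (ast.List, ast.Tuple)):
            max_elements = max(max_elements, len(node.elts))
    return (numeric > MAX_NUMERIC_LITERALS
            or string_chars > MAX_STRING_LITERAL_CHARS
            or max_elements > MAX_LIST_TUPLE_ELEMENTS)
```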

### 3.3 Notes

* This policy is intentionally coarse. It is expected to be tuned after pilot data.
* If the platform later supports “explain mode” solutions, a more semantic policy may be used.

---

## 4. Score per Problem

A solver’s score for a single problem is determined as follows:

### 4.1 If not Stage Pass

* `score = 0`

### 4.2 If Stage Pass but not Reward Correct

* `score = STAGE_BASE`

Trial default:

* `STAGE_BASE = 200`

### 4.3 If Reward Correct

* `score = REWARD_BASE + BREVITY_BONUS + DIVERSITY_BONUS`

Trial defaults:

* `REWARD_BASE = 1000`
* `BREVITY_BONUS <= 200`
* `DIVERSITY_BONUS <= 50` (optional, can be disabled)
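The three cases combine into a single dispatcher (a sketch with trial defaults; the bonus terms are defined in sections 5 and 6):

```python
STAGE_BASE = 200
REWARD_BASE = 1000

def problem_score(stage_pass, reward_correct,
                  brevity_bonus=0, diversity_bonus=0):
    """Per-problem score per section 4."""
    if not stage_pass:
        return 0
    if not reward_correct:
        return STAGE_BASE
    return REWARD_BASE + brevity_bonus + diversity_bonus
```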

---

## 5. Brevity Bonus (Primary Ranking Signal)

Among Reward Correct solutions, shorter solver programs should rank higher.

### 5.1 Brevity length metric

Let:

* `L_chars = len(canonical_solver_source_bytes)` (the byte length of the canonical UTF-8 source)

### 5.2 Brevity bonus function (trial)

We use an exponential decay mapped to a bounded bonus:

* `C = exp(-beta * L_chars)`
* `BREVITY_BONUS = floor(B_max * C)`

Trial defaults:

* `B_max = 200`
* `beta = 1/800`

Interpretation:

* Around 800 chars → bonus ~ `200 * e^-1 ≈ 73`
* Very short code approaches full bonus 200
* Very long code approaches 0 bonus
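The formula is directly computable; a minimal sketch with the trial defaults:

```python
import math

B_MAX = 200
BETA = 1 / 800

def brevity_bonus(l_chars: int) -> int:
    """BREVITY_BONUS = floor(B_max * exp(-beta * L_chars))."""
    return math.floor(B_MAX * math.exp(-BETA * l_chars))

brevity_bonus(50)    # very short -> near the full 200
brevity_bonus(800)   # -> 73, matching the interpretation above
brevity_bonus(8000)  # very long -> 0
```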

### 5.3 Tie-breaking

If two solutions have identical total score for a problem:

1. smaller `L_chars` wins
2. if still tied, earlier submission timestamp wins (optional)
3. if still tied, stable deterministic ordering by `solver_hash`

For a per-problem leaderboard, the platform reports **one row per solver (user)** by selecting the solver’s best submission
under the same ordering (`score` desc, then the tie-breaks above).
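One way to express this ordering is a single sort key (a sketch; Python tuple comparison handles the cascade, and ISO-8601 timestamps are assumed so that string order is chronological):

```python
def ranking_key(row):
    """row = (score, l_chars, created_at, solver_hash); best row sorts first."""
    score, l_chars, created_at, solver_hash = row
    return (-score, l_chars, created_at, solver_hash)

rows = [
    (1200, 50, "2024-05-02T10:00:00Z", "bb"),
    (1200, 50, "2024-05-01T10:00:00Z", "aa"),
    (1200, 40, "2024-05-03T10:00:00Z", "cc"),
]
ranked = sorted(rows, key=ranking_key)  # "cc" first: same score, smaller L_chars
```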

---

## 6. Diversity Bonus (Optional)

This bonus encourages **independent agreement**: if multiple solver “principle families” reach the same 200-term sequence,
that convergence is a useful proxy for “non-accidental correctness”. It is also a way to reward problems that admit
multiple derivations (a form of consensus / rarity signal), without requiring the platform to fully understand semantics.

### 6.1 Method tag

A solver may optionally provide `solution.json` with:

```json
{
  "method_tag": "closed_form | linear_recurrence | matrix_power | symbolic_guess | search_enum | other",
  "notes": "short description"
}
```

The platform may also run a coarse static classifier; if its result conflicts with the declared tag, the platform can ignore the tag.

Implementation defaults:

* If `method_tag` is missing / not a string / empty / too long, it is treated as `"unspecified"`.
* Maximum `method_tag` length is 64 characters.
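These defaults amount to a small normalization step (sketch):

```python
def normalize_method_tag(tag) -> str:
    """Apply the section 6.1 defaults to a declared method_tag."""
    if not isinstance(tag, str) or not tag or len(tag) > 64:
        return "unspecified"
    return tag
```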

### 6.2 Bonus rule (trial default)

For each problem, among Reward Correct submissions:

* The **first** Reward Correct submission for each distinct `method_tag` gets `+30`.
* If at least **two** distinct tags appear among Reward Correct solutions, then **all** Reward Correct solutions get an additional `+20` (implemented as two `+10` shared bonuses).

Defaults:

* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* Total diversity bonus is capped at `50` per problem.

> This keeps consensus meaningful but smaller than correctness and brevity.
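A sketch of the bonus rule, assuming the input has already been filtered to Reward Correct submissions, deduplicated per section 6.4, and ordered by submission time:

```python
DIVERSITY_FIRST_TAG_BONUS = 30
DIVERSITY_SHARED_BONUS_EACH = 10
DIVERSITY_BONUS_CAP = 50

def diversity_bonuses(submissions):
    """submissions: [(submission_id, method_tag), ...] in time order.
    Returns {submission_id: diversity_bonus}."""
    seen_tags = set()
    bonus = {}
    for sid, tag in submissions:
        first_of_tag = tag not in seen_tags
        seen_tags.add(tag)
        bonus[sid] = DIVERSITY_FIRST_TAG_BONUS if first_of_tag else 0
    if len(seen_tags) >= 2:
        # Two distinct tags unlock two +10 shared bonuses for everyone.
        for sid in bonus:
            bonus[sid] += 2 * DIVERSITY_SHARED_BONUS_EACH
    return {sid: min(b, DIVERSITY_BONUS_CAP) for sid, b in bonus.items()}
```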

### 6.3 Notes (implementation caution)

This is intentionally a coarse proxy and may be gamed if tags are unverified. Platforms should mitigate by:

* running an optional coarse classifier and ignoring tags that strongly conflict,
* requiring a short `notes` justification for the declared tag, and/or
* publishing tag distributions after Reveal for auditing.

### 6.4 Window, dedup, and finalization

To ensure scoring is replayable and does not change after the problem closes:

* Effective submissions for diversity statistics are those with `reward_correct == true`.
* Diversity statistics are computed over the window `submission_created_at <= freeze_started_at` and finalized at Freeze.
* To reduce spam effects, diversity statistics deduplicate by `(user_id, method_tag)` by keeping the earliest submission under:
  `submission_created_at`, then `solver_hash`, then `submission_id`.
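Dedup can be sketched as keeping, per `(user_id, method_tag)`, the submission that is minimal under that three-field ordering:

```python
def dedup_for_diversity(subs):
    """subs: dicts with user_id, method_tag, submission_created_at,
    solver_hash, submission_id. Keep earliest per (user_id, method_tag)."""
    best = {}
    for s in subs:
        key = (s["user_id"], s["method_tag"])
        order = (s["submission_created_at"], s["solver_hash"], s["submission_id"])
        if key not in best or order < best[key][0]:
            best[key] = (order, s)
    return [s for _, s in best.values()]
```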

---

## 7. Total Score Across Problems

A solver’s season score is:

* Sum of per-problem scores over all problems in the season.

The platform may additionally report:

* number of Stage Pass problems
* number of Reward Correct problems
* median `L_chars` among Reward Correct solutions

---

## 8. Setter Incentives (Out of Scope)

This document defines **solver scoring** only.
Setter incentives (e.g., rewarding “simple programs that remain difficult”) are defined separately; see the optional
setter incentive section in [RULES.md](./RULES.md).

---

## 9. Pilot Period & Updates

This scoring system is explicitly a **trial**.

During the pilot period, the platform may adjust:

* `beta`, `B_max`
* anti-hardcode thresholds
* diversity bonus parameters

Changes must:

* be versioned (`SCORING.md v0.2`, etc.)
* not retroactively change already-finalized results unless the change is explicitly announced and agreed in advance.

---

## Appendix: Recommended Defaults (Trial v0.1)

* `N_check = 200`
* `STAGE_BASE = 200`
* `REWARD_BASE = 1000`
* `B_max = 200`
* `beta = 1/800`
* `MAX_NUMERIC_LITERALS = 120`
* `MAX_STRING_LITERAL_CHARS = 2000`
* `MAX_LIST_TUPLE_ELEMENTS = 400`
* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* `DIVERSITY_BONUS_CAP = 50`