# SCORING.md — Sequence Dojo Scoring (Draft v0.1)

See also:

* [RULES.md](./RULES.md) (correctness gates and constraints)
* [SPEC.md](./SPEC.md) (protocol and canonicalization definition)
* [PLATFORM.md](./PLATFORM.md) (sandbox enforcement details)

This document defines the **trial scoring system** for Sequence Dojo.
It is intended to be **simple, enforceable, and adjustable** after a pilot period.

Scoring has two layers:

1. **Correctness gates** (hard requirements)
2. **Rank ordering among correct solutions**, primarily by **brevity**, with optional bonuses

The platform computes all metrics from the **canonicalized submitted source** and **sandbox execution results**.

---

## 1. Ground Rules

### 1.1 Program-only submissions

Official solutions must be submitted as code (`solver.py`).
Copy/pasted integer lists are not accepted as official submissions.

### 1.2 Deterministic judging

All scoring is based on the platform sandbox output:

* `a_hat[0..N_check-1]` generated by the solver
* `a_true[0..N_check-1]` generated by the validated setter

Exact equality is required.

### 1.3 Canonicalization

Source is canonicalized before length measurement:

* UTF-8
* normalize newlines to `\n`
* strip trailing blank lines at EOF
* leave line-trailing whitespace unchanged

The canonical bytes are used for:

* `solver_hash`
* `L_chars` (brevity length)
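As a non-normative sketch in Python (the language of `solver.py` submissions), the canonicalization above could look like the following. The EOF convention (exactly one trailing newline after stripping blank lines) and the hash function (SHA-256) are assumptions of this sketch; SPEC.md is authoritative.

```python
import hashlib

def canonicalize(source: bytes) -> bytes:
    """Sketch of the canonicalization in section 1.3."""
    text = source.decode("utf-8")
    # Normalize CRLF / lone-CR newlines to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing blank lines at EOF; this sketch assumes the result
    # keeps exactly one final newline.
    text = text.rstrip("\n") + "\n"
    # Line-trailing whitespace is deliberately left unchanged.
    return text.encode("utf-8")

canon = canonicalize(b"a=1  \r\nb=2\r\n\n\n")   # -> b"a=1  \nb=2\n"
L_chars = len(canon)
solver_hash = hashlib.sha256(canon).hexdigest()  # hash algorithm assumed
```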

---

## 2. Correctness Gates (Hard)

Let `N_check = 200` (default season setting).

### 2.1 Stage Pass

A submission is **Stage Pass** if:

* `a_hat[0:100] == a_true[0:100]`

### 2.2 Reward Correct

A submission is **Reward Correct** if:

* `a_hat[0:200] == a_true[0:200]`

Submissions that are not Stage Pass receive **0 points** for the problem.
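In code, the two gates reduce to prefix comparisons (a sketch; `n_check` defaults to the season setting of 200):

```python
def gate_status(a_hat, a_true, n_check=200):
    """Return (stage_pass, reward_correct) per sections 2.1 and 2.2."""
    stage_pass = a_hat[:100] == a_true[:100]
    reward_correct = stage_pass and a_hat[:n_check] == a_true[:n_check]
    return stage_pass, reward_correct
```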

---

## 3. Anti-Hardcode Policy (Trial)

Brevity incentives can encourage “answer embedding” (hardcoding).
During the trial period, the platform applies a lightweight anti-hardcode policy.

### 3.1 Disallowed (Reject)

A solver submission is rejected for a problem if it contains:

* file I/O, networking, subprocess usage (per sandbox rules), or
* **large embedded payloads** exceeding thresholds below.

### 3.2 Payload thresholds (trial defaults)

The platform computes the following from the solver source AST and token stream:

* `MAX_NUMERIC_LITERALS = 120`
  (count of integer/float literal tokens in the source)

* `MAX_STRING_LITERAL_CHARS = 2000`
  (sum of lengths of all string literals)

* `MAX_LIST_TUPLE_ELEMENTS = 400`
  (maximum number of elements in any single list/tuple literal)

If any threshold is exceeded, the submission is rejected as “hardcode-like”.

> Rationale: allow small constants (seeds, coefficients), but block embedding large answer tables.
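A rough sketch of the threshold check using Python's `ast` module is below. Note it walks AST `Constant` nodes rather than raw tokens, which only approximates the token counts described above (for example, `-1` is a unary minus applied to the literal `1`, and docstrings count toward string-literal characters):

```python
import ast

MAX_NUMERIC_LITERALS = 120
MAX_STRING_LITERAL_CHARS = 2000
MAX_LIST_TUPLE_ELEMENTS = 400

def hardcode_like(source: str) -> bool:
    """Return True if any trial payload threshold is exceeded."""
    numeric = 0          # int/float literals
    string_chars = 0     # total chars across string literals
    max_elements = 0     # largest single list/tuple literal
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant):
            if isinstance(node.value, bool):
                pass  # True/False are not numeric literals
            elif isinstance(node.value, (int, float)):
                numeric += 1
            elif isinstance(node.value, str):
                string_chars += len(node.value)
        elif isinstance(node, (ast.List, ast.Tuple)):
            max_elements = max(max_elements, len(node.elts))
    return (numeric > MAX_NUMERIC_LITERALS
            or string_chars > MAX_STRING_LITERAL_CHARS
            or max_elements > MAX_LIST_TUPLE_ELEMENTS)
```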

### 3.3 Notes

* This policy is intentionally coarse. It is expected to be tuned after pilot data.
* If the platform later supports “explain mode” solutions, a more semantic policy may be used.

---

## 4. Score per Problem

A solver’s score for a single problem is determined as follows:

### 4.1 If not Stage Pass

* `score = 0`

### 4.2 If Stage Pass but not Reward Correct

* `score = STAGE_BASE`

Trial default:

* `STAGE_BASE = 200`

### 4.3 If Reward Correct

* `score = REWARD_BASE + BREVITY_BONUS + DIVERSITY_BONUS`

Trial defaults:

* `REWARD_BASE = 1000`
* `BREVITY_BONUS <= 200`
* `DIVERSITY_BONUS <= 50` (optional, can be disabled)
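The three cases combine into a single dispatcher (a sketch with trial defaults; the bonus terms are defined in sections 5 and 6):

```python
STAGE_BASE = 200
REWARD_BASE = 1000

def problem_score(stage_pass, reward_correct,
                  brevity_bonus=0, diversity_bonus=0):
    """Per-problem score per section 4."""
    if not stage_pass:
        return 0
    if not reward_correct:
        return STAGE_BASE
    return REWARD_BASE + brevity_bonus + diversity_bonus
```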

---

## 5. Brevity Bonus (Primary Ranking Signal)

Among Reward Correct solutions, shorter solver programs should rank higher.

### 5.1 Brevity length metric

Let:

* `L_chars = len(canonical_solver_source_bytes)` (the byte length of the canonical UTF-8 source)

### 5.2 Brevity bonus function (trial)

We use an exponential decay mapped to a bounded bonus:

* `C = exp(-beta * L_chars)`
* `BREVITY_BONUS = floor(B_max * C)`

Trial defaults:

* `B_max = 200`
* `beta = 1/800`

Interpretation:

* Around 800 chars → bonus ~ `200 * e^-1 ≈ 73`
* Very short code approaches full bonus 200
* Very long code approaches 0 bonus
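The formula is directly computable; a minimal sketch with the trial defaults:

```python
import math

B_MAX = 200
BETA = 1 / 800

def brevity_bonus(l_chars: int) -> int:
    """BREVITY_BONUS = floor(B_max * exp(-beta * L_chars))."""
    return math.floor(B_MAX * math.exp(-BETA * l_chars))

brevity_bonus(50)    # very short -> near the full 200
brevity_bonus(800)   # -> 73, matching the interpretation above
brevity_bonus(8000)  # very long -> 0
```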

### 5.3 Tie-breaking

If two solutions have identical total score for a problem:

1. smaller `L_chars` wins
2. if still tied, earlier submission timestamp wins (optional)
3. if still tied, stable deterministic ordering by `solver_hash`

For a per-problem leaderboard, the platform reports **one row per solver (user)** by selecting the solver’s best submission
under the same ordering (`score` desc, then the tie-breaks above).
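One way to express this ordering is a single sort key (a sketch; Python tuple comparison handles the cascade, and ISO-8601 timestamps are assumed so that string order is chronological):

```python
def ranking_key(row):
    """row = (score, l_chars, created_at, solver_hash); best row sorts first."""
    score, l_chars, created_at, solver_hash = row
    return (-score, l_chars, created_at, solver_hash)

rows = [
    (1200, 50, "2024-05-02T10:00:00Z", "bb"),
    (1200, 50, "2024-05-01T10:00:00Z", "aa"),
    (1200, 40, "2024-05-03T10:00:00Z", "cc"),
]
ranked = sorted(rows, key=ranking_key)  # "cc" first: same score, smaller L_chars
```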

---

## 6. Diversity Bonus (Optional)

This bonus encourages **independent agreement**: if multiple solver “principle families” reach the same 200-term sequence,
that convergence is a useful proxy for “non-accidental correctness”. It is also a way to reward problems that admit
multiple derivations (a form of consensus / rarity signal), without requiring the platform to fully understand semantics.

### 6.1 Method tag

A solver may optionally provide `solution.json` with:

```json
{
  "method_tag": "closed_form | linear_recurrence | matrix_power | symbolic_guess | search_enum | other",
  "notes": "short description"
}
```

The platform may also run a coarse static classifier; if its result conflicts with the declared tag, the platform can ignore the tag.

Implementation defaults:

* If `method_tag` is missing / not a string / empty / too long, it is treated as `"unspecified"`.
* Maximum `method_tag` length is 64 characters.
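These defaults amount to a small normalization step (sketch):

```python
def normalize_method_tag(tag) -> str:
    """Apply the section 6.1 defaults to a declared method_tag."""
    if not isinstance(tag, str) or not tag or len(tag) > 64:
        return "unspecified"
    return tag
```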

### 6.2 Bonus rule (trial default)

For each problem, among Reward Correct submissions:

* The **first** Reward Correct submission for each distinct `method_tag` gets `+30`.
* If at least **two** distinct tags appear among Reward Correct solutions, then **all** Reward Correct solutions get an additional `+20` (implemented as two `+10` shared bonuses).

Defaults:

* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* Total diversity bonus is capped at `50` per problem.

> This keeps consensus meaningful but smaller than correctness and brevity.
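A sketch of the bonus rule, assuming the input has already been filtered to Reward Correct submissions, deduplicated per section 6.4, and ordered by submission time:

```python
DIVERSITY_FIRST_TAG_BONUS = 30
DIVERSITY_SHARED_BONUS_EACH = 10
DIVERSITY_BONUS_CAP = 50

def diversity_bonuses(submissions):
    """submissions: [(submission_id, method_tag), ...] in time order.
    Returns {submission_id: diversity_bonus}."""
    seen_tags = set()
    bonus = {}
    for sid, tag in submissions:
        first_of_tag = tag not in seen_tags
        seen_tags.add(tag)
        bonus[sid] = DIVERSITY_FIRST_TAG_BONUS if first_of_tag else 0
    if len(seen_tags) >= 2:
        # Two distinct tags unlock two +10 shared bonuses for everyone.
        for sid in bonus:
            bonus[sid] += 2 * DIVERSITY_SHARED_BONUS_EACH
    return {sid: min(b, DIVERSITY_BONUS_CAP) for sid, b in bonus.items()}
```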

### 6.3 Notes (implementation caution)

This is intentionally a coarse proxy and may be gamed if tags are unverified. Platforms should mitigate by:

* running an optional coarse classifier and ignoring tags that strongly conflict,
* requiring a short `notes` justification for the declared tag, and/or
* publishing tag distributions after Reveal for auditing.

### 6.4 Window, dedup, and finalization

To ensure scoring is replayable and does not change after the problem closes:

* Effective submissions for diversity statistics are those with `reward_correct == true`.
* Diversity statistics are computed over the window `submission_created_at <= freeze_started_at` and finalized at Freeze.
* To reduce spam effects, diversity statistics deduplicate by `(user_id, method_tag)` by keeping the earliest submission under:
  `submission_created_at`, then `solver_hash`, then `submission_id`.
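Dedup can be sketched as keeping, per `(user_id, method_tag)`, the submission that is minimal under that three-field ordering:

```python
def dedup_for_diversity(subs):
    """subs: dicts with user_id, method_tag, submission_created_at,
    solver_hash, submission_id. Keep earliest per (user_id, method_tag)."""
    best = {}
    for s in subs:
        key = (s["user_id"], s["method_tag"])
        order = (s["submission_created_at"], s["solver_hash"], s["submission_id"])
        if key not in best or order < best[key][0]:
            best[key] = (order, s)
    return [s for _, s in best.values()]
```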

---

## 7. Total Score Across Problems

A solver’s season score is:

* Sum of per-problem scores over all problems in the season.

The platform may additionally report:

* number of Stage Pass problems
* number of Reward Correct problems
* median `L_chars` among Reward Correct solutions

---

## 8. Setter Incentives (Out of Scope)

This document defines **solver scoring** only.
Setter incentives (e.g., rewarding “simple programs that remain difficult”) are defined separately; see the optional
setter incentive section in [RULES.md](./RULES.md).

---

## 9. Pilot Period & Updates

This scoring system is explicitly a **trial**.

During the pilot period, the platform may adjust:

* `beta`, `B_max`
* anti-hardcode thresholds
* diversity bonus parameters

Changes must:

* be versioned (`SCORING.md v0.2`, etc.)
* not retroactively change already-finalized results unless the change is explicitly announced and agreed in advance.

---

## Appendix: Recommended Defaults (Trial v0.1)

* `N_check = 200`
* `STAGE_BASE = 200`
* `REWARD_BASE = 1000`
* `B_max = 200`
* `beta = 1/800`
* `MAX_NUMERIC_LITERALS = 120`
* `MAX_STRING_LITERAL_CHARS = 2000`
* `MAX_LIST_TUPLE_ELEMENTS = 400`
* `DIVERSITY_FIRST_TAG_BONUS = 30`
* `DIVERSITY_SHARED_BONUS_EACH = 10`
* `DIVERSITY_SHARED_BONUS_REPEATS_IF_AT_LEAST_TWO_TAGS = 2`
* `DIVERSITY_BONUS_CAP = 50`