# Sequence Dojo — Competition Rules (Draft v0.3)

See also:

* [SPEC.md](./SPEC.md) (protocol and artifact formats)
* [SCORING.md](./SCORING.md) (ranking and anti-hardcode rules)
* [REVEAL.md](./REVEAL.md) (lifecycle phases and what becomes public)
* [PLATFORM.md](./PLATFORM.md) (implementation guidance)

## 0. Purpose

Sequence Dojo is a competition for **programmatic inference** of integer sequences under partial disclosure.

* A **Setter** provides a deterministic program that generates an integer sequence.
* The **Platform** validates the setter program and publishes a **commitment hash** plus partially disclosed terms.
* A **Solver** submits a deterministic program that reconstructs the first `N` terms (default `N=200`).
* The Platform judges solutions by **exact equality** in a controlled environment.

This competition is designed to be:

* **reproducible** (platform-defined environment and canonicalization)
* **verifiable** (platform-generated hash commitment and later reveal)
* **robust** (no manual copy/paste of long integer lists)

---

## 1. Roles

### 1.1 Setter

* Submits a problem package (`problem.json` + `setter.py`) to the Platform.
* Must comply with all safety, determinism, and resource constraints.

### 1.2 Solver

* Submits a solution package (`solver.py`) to the Platform.
* Must comply with safety and resource constraints.

### 1.3 Platform (Judge)

* Validates setter submissions.
* Publishes only after validation passes.
* Computes and publishes commitment hashes and disclosure data.
* Executes solver submissions and issues verdicts.
* Reveals setter code after judging closes.

---

## 2. Definitions

### 2.1 Sequence

Each problem defines an integer sequence:
`a_0, a_1, a_2, ...`
All terms are Python integers (`int`).

### 2.2 Check Length

Unless specified otherwise, the Platform checks:

* `N_check = 200` terms, i.e. `a_0..a_199`.

### 2.3 Disclosure (Default)

The Platform discloses:

* the first 50 **odd-index** terms:
  `a_1, a_3, ..., a_99`

The Platform does **not** disclose even-index terms.
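
Concretely, the default disclosure covers exactly these 50 indices (an informative sketch of the index set, not additional policy):

```python
# Informative: the 50 indices disclosed under the default policy.
disclosed_indices = list(range(1, 100, 2))  # 1, 3, 5, ..., 99
assert len(disclosed_indices) == 50
```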

---

## 3. Submission Types

### 3.1 Setter Submission Package

A setter submits a package containing:

#### `problem.json`

Minimal required fields:

```json
{
  "title": "Trial 002",
  "interface": "seq",
  "N_check": 200
}
```

* `interface`: `"seq"` or `"gen"`
* `N_check`: defaults to `200` if omitted

#### `setter.py`

Must implement **exactly one** interface:

**Option A**

```python
def seq(n: int) -> int:
    ...
```

**Option B**

```python
def gen(N: int) -> list[int]:
    ...
```

Return values must be Python `int`.
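
For illustration only, a minimal Option A setter might look like this (the rule shown is an arbitrary toy example, not an official problem):

```python
# setter.py — toy Option A setter (illustrative only, not an official problem).
def seq(n: int) -> int:
    # Arbitrary example rule: a_n = n^2 + n + 41 (pure, deterministic, no imports).
    return n * n + n + 41
```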

### 3.2 Solver Submission Package

A solver submits a package containing:

#### `solver.py`

Recommended interface:

```python
def solver() -> list[int]:
    ...
```

It must return exactly `N_check` integers:
`[a_0, a_1, ..., a_{N_check-1}]`

> Submissions must be programs; pasting long integer lists is not an accepted submission format.
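
A matching solver for the toy setter above could be (again only a sketch; `N_CHECK = 200` mirrors the default `N_check`):

```python
# solver.py — toy solver returning exactly N_check terms (illustrative only).
N_CHECK = 200  # must equal the problem's N_check

def solver() -> list[int]:
    # Reconstruct a_0 .. a_{N_CHECK-1} from the inferred rule.
    return [n * n + n + 41 for n in range(N_CHECK)]
```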

---

## 4. Mandatory Constraints

### 4.1 Purity & Determinism

Setter programs must be **pure and deterministic**:

* no file I/O
* no network access
* no subprocess execution
* no system time, clock, or environment variable reads
* no external state

Randomness is allowed only if **fully deterministic** (fixed seed, reproducible output).
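
For example, a setter can derive pseudo-random-looking terms from a fixed seed without importing anything, which also keeps it inside the §4.2 whitelist (a toy sketch):

```python
# Toy sketch: fixed-seed pseudo-randomness with no imports at all.
def seq(n: int) -> int:
    state = 12345  # fixed seed: output depends only on n, never on the environment
    for _ in range(n + 1):
        state = (1103515245 * state + 12345) % (1 << 31)  # classic LCG step
    return state % 1000
```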

Solver programs must follow the same sandbox safety constraints.

### 4.2 Dependency Whitelist

Allowed imports:

* `sympy`, `math`, `fractions`, `itertools`

Disallowed imports include (not exhaustive):

* `os`, `pathlib`, `subprocess`, `socket`, `requests`, `time`, `datetime`

The Platform enforces this using static scans and runtime import interception.
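
As an informative sketch, runtime interception might hook Python's import machinery like this (a real sandbox also needs process-level isolation; this is only one layer):

```python
# Sketch: runtime import interception via a custom __import__ hook.
# A real sandbox also needs process-level isolation; this is only one layer.
import builtins

ALLOWED = {"sympy", "math", "fractions", "itertools"}
_real_import = builtins.__import__

def _guarded_import(name, *args, **kwargs):
    if name.split(".")[0] not in ALLOWED:
        raise ImportError(f"import of {name!r} is not whitelisted")
    return _real_import(name, *args, **kwargs)

builtins.__import__ = _guarded_import
```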

### 4.3 Resource Limits (Default Season Settings)

* Setter must generate `a_0..a_{N_check-1}` within **1 second** on the Platform standard machine.
* Setter code length ≤ **100 lines** (Platform-defined effective line counting).
* Setter UTF-8 character count ≤ **5000**.
* Platform may apply comparable limits to solver programs.

> All limits and counting rules must be published and stable for the duration of a season.

---

## 5. Platform Validation & Publication (Commit–Reveal)

### 5.1 Validation Gates (Setter)

A problem is published only if it passes all gates:

1. **Static Gate**: line/char limits, banned imports/symbols.
2. **Sandbox Gate**: no I/O, no network, no subprocess; runtime import whitelist.
3. **Performance Gate**: generate first `N_check` terms within 1 second.
4. **Determinism Gate**: repeated runs produce identical outputs.

If any gate fails, the submission is rejected and not published.
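
For instance, the Determinism Gate can be as simple as comparing repeated runs (a sketch; `generate_terms` is a hypothetical callable producing the first `N_check` terms):

```python
# Sketch: determinism gate — repeated runs must produce identical output.
def determinism_gate(generate_terms, runs: int = 3) -> bool:
    outputs = [generate_terms() for _ in range(runs)]
    return all(out == outputs[0] for out in outputs)
```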

### 5.2 Canonicalization (for Hashing)

The Platform canonicalizes `setter.py` before hashing:

* UTF-8 encoding
* normalize newlines to `\n`
* strip trailing blank lines at EOF
* leave line-trailing whitespace unchanged
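
One possible implementation of this policy (a sketch; it assumes the canonical form keeps exactly one trailing newline):

```python
# Sketch: canonicalization per the policy above (one possible reading).
def canonicalize(source_text: str) -> bytes:
    text = source_text.replace("\r\n", "\n").replace("\r", "\n")  # normalize newlines
    text = text.rstrip("\n") + "\n"  # strip trailing blank lines, keep one final newline
    return text.encode("utf-8")      # UTF-8 bytes are what gets hashed
```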

### 5.3 Commitment Hash

The Platform computes:

* `P_hash = SHA-256(canonical_setter.py_bytes)`

**Setters must not self-report hashes.** The Platform-generated `P_hash` is authoritative.
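
Computing the commitment over the canonical bytes is then a single SHA-256 digest (sketch):

```python
# Sketch: the commitment is a SHA-256 digest of the canonical bytes.
import hashlib

def commitment_hash(canonical_bytes: bytes) -> str:
    return hashlib.sha256(canonical_bytes).hexdigest()  # published as P_hash
```

After Reveal (§5.5), anyone can recompute this digest from the revealed `setter.py` and compare it against the published `P_hash`.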

### 5.4 Published Problem Record

After validation, the Platform publishes a record containing at least:

* `problem_id`, `title`
* `P_hash`
* `interface`, `N_check`
* disclosure data (default: the first 50 odd-index terms)

The Platform must generate disclosure values directly from the validated setter output (no manual transcription).

### 5.5 Reveal

After the problem closes, the Platform reveals `setter.py` (and canonicalization policy) so anyone can verify:

* `SHA-256(canonical(setter.py)) == P_hash`

---

## 6. Judging & Scoring

### 6.1 Ground Truth

The Platform generates ground truth `a_true[0..N_check-1]` from the validated setter program.
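
For example, the Platform might derive ground truth from whichever interface the setter declares (a sketch; `setter_module` is a hypothetical handle to the sandboxed module):

```python
# Sketch: derive a_true from whichever interface setter.py declares.
def ground_truth(setter_module, n_check: int = 200) -> list[int]:
    if hasattr(setter_module, "gen"):
        return setter_module.gen(n_check)                   # Option B
    return [setter_module.seq(n) for n in range(n_check)]   # Option A
```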

### 6.2 Solver Output

The Platform runs the solver program to obtain `a_hat[0..N_check-1]`.

### 6.3 Verdict

* **Stage Pass**: first 100 terms match exactly
  `a_hat[0:100] == a_true[0:100]`
* **Reward**: first 200 terms match exactly
  `a_hat[0:200] == a_true[0:200]`

The Platform should report:

* pass/fail
* earliest mismatch index (if any)
* expected value vs actual value at mismatch
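
A sketch of the verdict computation (informative; the report field names are illustrative):

```python
# Sketch: verdicts plus the earliest-mismatch report.
def judge(a_hat: list[int], a_true: list[int]) -> dict:
    mismatch = next(
        (i for i, (x, y) in enumerate(zip(a_hat, a_true)) if x != y), None
    )
    return {
        "stage_pass": a_hat[:100] == a_true[:100],
        "reward": a_hat[:200] == a_true[:200],
        "first_mismatch": mismatch,  # None if all checked terms match
        "expected": a_true[mismatch] if mismatch is not None else None,
        "actual": a_hat[mismatch] if mismatch is not None else None,
    }
```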

---

## 7. Setter Incentives (Optional Rule)

A platform may introduce setter rewards to encourage **stable, extrapolatable** problems.

Intended target (informative):

* Problems where matching the first 100 terms usually means a solver has captured the true generative principle,
  so Stage passers tend to also reach 200 (the distribution of longest correct prefix lengths, defined in 7.2, is concentrated near 200).
* This discourages “trap problems” that are easy to fit up to 100 but fail beyond 100.

### 7.1 Recommended minimum rule (stage→reward consistency)
One recommended rule:

* Let `S100` be solvers who pass Stage (first 100 correct),
* Let `S200` be solvers who receive Reward (first 200 correct).

Setter reward condition (piecewise):

* If `|S100| <= 3`, require `S100 ⊆ S200` (every Stage passer also receives the Reward).
* If `|S100| >= 4`, require `|S200| / |S100| >= 0.9`.
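
As a sketch, the condition could be checked like this (solver identifiers are illustrative; note `S200 ⊆ S100` always holds, since Reward implies Stage):

```python
# Sketch: stage→reward consistency condition from 7.1.
def stage_reward_consistent(s100: set, s200: set) -> bool:
    if len(s100) <= 3:
        return s100 <= s200              # every Stage passer must also reach Reward
    return len(s200) / len(s100) >= 0.9  # at least 90% of Stage passers reach Reward
```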

### 7.2 Optional published diagnostics (extrapolation profile)
Platforms may additionally publish a per-problem diagnostic after Reveal to make “extrapolatability” concrete.
One simple metric is the **longest correct prefix length** for each Stage passer:

* For each solver in `S100`, define `L` as the largest `k` in `[100, 200]` with `a_hat[0:k] == a_true[0:k]` (see the sketch below).
* Publish summary statistics of `L` over `S100` (e.g., median, 10th percentile, histogram bins).
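
A minimal computation of `L`, assuming the judged term lists are available (informative sketch):

```python
# Sketch: longest correct prefix length L for one Stage passer.
def longest_correct_prefix(a_hat: list[int], a_true: list[int]) -> int:
    limit = min(200, len(a_hat), len(a_true))
    k = 0
    while k < limit and a_hat[k] == a_true[k]:
        k += 1
    return k  # at least 100 for any Stage passer
```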

This does not change solver verdicts; it is intended for transparency, postmortems, and optional setter incentives.

### 7.3 Optional quantitative setter reward (brevity × difficulty)
Some platforms may want a **numerical** setter reward score that increases when:

* the setter program is shorter (in canonical bytes), and
* the problem is harder (fewer solvers reach Reward), while still being extrapolatable (not a trap).

This section is informative and defines one compatible approach.

#### 7.3.1 Brevity factor (setter length)
Let:

* `L_set = len(canonical_setter.py_bytes)`

Define a bounded brevity factor:

* `B_set = exp(-beta_set * L_set)` in `(0, 1]`

`beta_set` is a season parameter (example scale: `beta_set = 1/800` matches the solver brevity scale).

#### 7.3.2 Difficulty factor (participation-adjusted)
Let:

* `T` be the set of solver submissions for the problem (within the season window),
* `R = |S200|` (Reward Correct count),
* `U = |T|` (submission count).

Define a difficulty factor that avoids rewarding “nobody tried” problems:

* Require `U >= U_min` (with `U_min >= 2`, so the denominator below is positive) and `R >= 1` for any setter reward to apply.
* `D = clamp( log((U + 1) / (R + 1)) / log((U + 1) / 2), 0, 1 )`

Intuition:

* If many people try (`U` large) but few reach 200 (`R` small), `D` approaches 1.
* If most submissions reach 200, `D` approaches 0.

All parameters (`U_min`) and counting rules must be published for the season.

#### 7.3.3 Extrapolatability factor (anti-trap)
Reuse the stage→reward consistency idea as a multiplier:

* `E = 0` if the recommended minimum rule (7.1) fails.
* Otherwise `E = clamp(|S200| / max(1, |S100|), 0, 1)`.

This makes the setter reward collapse to 0 for classic “100 fits, 200 fails” traps.

#### 7.3.4 Consensus / rarity boost (optional)
Let `K` be the number of distinct solving method tags among Reward Correct solutions (see `SCORING.md`):

* `K = |{ method_tag(s) : s is a Reward Correct submission }|`

Define a small multiplicative boost:

* `C = 1 + gamma * log2(max(1, K))`

where `gamma` is a season parameter (e.g., `gamma = 0.1`).

#### 7.3.5 Final setter reward score (example)
An example per-problem setter reward score:

* `setter_score = SETTER_BASE * B_set * D * E * C`

with `SETTER_BASE` and all parameters published for the season.
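
Putting 7.3.1–7.3.4 together, a platform might compute the example score like this (a sketch; the default values of `beta_set`, `gamma`, `U_min`, and `SETTER_BASE` are illustrative season settings, not normative):

```python
# Sketch: example setter reward score combining 7.3.1–7.3.4.
# All default parameter values below are illustrative, not normative.
import math

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def setter_score(L_set: int, U: int, R: int, n100: int, n200: int, K: int,
                 rule_7_1_ok: bool,
                 beta_set: float = 1 / 800, gamma: float = 0.1,
                 U_min: int = 5, SETTER_BASE: float = 100.0) -> float:
    if U < U_min or R < 1:
        return 0.0                                        # eligibility gate (7.3.2)
    B_set = math.exp(-beta_set * L_set)                   # brevity factor (7.3.1)
    D = clamp(math.log((U + 1) / (R + 1)) / math.log((U + 1) / 2), 0.0, 1.0)  # difficulty (7.3.2)
    E = clamp(n200 / max(1, n100), 0.0, 1.0) if rule_7_1_ok else 0.0          # anti-trap (7.3.3)
    C = 1 + gamma * math.log2(max(1, K))                  # consensus boost (7.3.4)
    return SETTER_BASE * B_set * D * E * C
```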

---

## 8. Transparency Requirements

For each season, the Platform must publish and keep stable:

* Python version, library versions (e.g., sympy)
* standard machine class/spec
* timing method (wall-clock vs. CPU time)
* line/character counting rules
* canonicalization policy used for hashing

---

## 9. Disputes

* Platform sandbox results are authoritative.
* After reveal, any participant may reproduce results locally to verify integrity.
* If a discrepancy is found, the Platform must publish a postmortem including:

  * canonicalization details
  * environment versions
  * reproduction steps

---

## 10. Versioning

This document is **Draft v0.3**. Breaking changes require a new season or an explicit version bump. Non-breaking clarifications may be issued at any time but must not change outcomes of already-published problems.