# Sequence Dojo — Competition Rules (Draft v0.3)

See also:

* [SPEC.md](./SPEC.md) (protocol and artifact formats)
* [SCORING.md](./SCORING.md) (ranking and anti-hardcode rules)
* [REVEAL.md](./REVEAL.md) (lifecycle phases and what becomes public)
* [PLATFORM.md](./PLATFORM.md) (implementation guidance)

## 0. Purpose

Sequence Dojo is a competition for **programmatic inference** of integer sequences under partial disclosure.

* A **Setter** provides a deterministic program that generates an integer sequence.
* The **Platform** validates the setter program and publishes a **commitment hash** plus partial disclosed terms.
* A **Solver** submits a deterministic program that reconstructs the first `N` terms (default `N=200`).
* The Platform judges solutions by **exact equality** in a controlled environment.

This competition is designed to be:

* **reproducible** (platform-defined environment and canonicalization)
* **verifiable** (platform-generated hash commitment and later reveal)
* **robust** (no manual copy/paste of long integer lists)

---

## 1. Roles

### 1.1 Setter

* Submits a problem package (`problem.json` + `setter.py`) to the Platform.
* Must comply with all safety, determinism, and resource constraints.

### 1.2 Solver

* Submits a solution package (`solver.py`) to the Platform.
* Must comply with safety and resource constraints.

### 1.3 Platform (Judge)

* Validates setter submissions.
* Publishes only after validation passes.
* Computes and publishes commitment hashes and disclosure data.
* Executes solver submissions and issues verdicts.
* Reveals setter code after judging closes.

---

## 2. Definitions

### 2.1 Sequence

Each problem defines an integer sequence:
`a_0, a_1, a_2, ...`
All terms are Python integers (`int`).

### 2.2 Check Length

Unless specified otherwise, the Platform checks:

* `N_check = 200` terms, i.e. `a_0..a_199`.

### 2.3 Disclosure (Default)

The Platform discloses:

* the first 50 **odd-index** terms:
  `a_1, a_3, ..., a_99`

The Platform does **not** disclose even-index terms.
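
The default disclosure can be generated mechanically from the checked prefix. A minimal sketch — the `a_n = n*n` rule is a placeholder, not a real problem:

```python
def disclosed_terms(terms: list[int]) -> list[int]:
    """Default disclosure: the first 50 odd-index terms a_1, a_3, ..., a_99."""
    return [terms[i] for i in range(1, 100, 2)]

# Placeholder sequence a_n = n*n, generated to N_check = 200 terms:
full = [n * n for n in range(200)]
public = disclosed_terms(full)  # 50 values: a_1, a_3, ..., a_99
```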

---

## 3. Submission Types

### 3.1 Setter Submission Package

A setter submits a package containing:

#### `problem.json`

Minimal fields (only `title` and `interface` are strictly required):

```json
{
  "title": "Trial 002",
  "interface": "seq",
  "N_check": 200
}
```

* `interface`: `"seq"` or `"gen"`
* `N_check`: defaults to `200` if omitted

#### `setter.py`

Must implement **exactly one** interface:

**Option A**

```python
def seq(n: int) -> int:
```

**Option B**

```python
def gen(N: int) -> list[int]:
```

Return values must be Python `int`.
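
For illustration only (not an official sample problem), a hypothetical `setter.py` using Option A:

```python
# Hypothetical setter implementing the Option A ("seq") interface.
# The generating rule (triangular numbers) is illustrative, not a real problem.
def seq(n: int) -> int:
    return n * (n + 1) // 2
```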

### 3.2 Solver Submission Package

A solver submits a package containing:

#### `solver.py`

Recommended interface:

```python
def solver() -> list[int]:
```

It must return exactly `N_check` integers:
`[a_0, a_1, ..., a_{N_check-1}]`

> Program-only submissions are mandatory; pasting long integer lists is not an official submission format.
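
For illustration, a hypothetical `solver.py` whose inferred rule (triangular numbers) is made up for this sketch:

```python
N_CHECK = 200  # default check length

def solver() -> list[int]:
    # Hypothetical inference: the solver believes a_n = n(n+1)/2.
    return [n * (n + 1) // 2 for n in range(N_CHECK)]
```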

---

## 4. Mandatory Constraints

### 4.1 Purity & Determinism

Setter programs must be **pure and deterministic**:

* no file I/O
* no network access
* no subprocess execution
* no system time, clock, or environment variable reads
* no external state

Randomness is allowed only if **fully deterministic** (fixed seed, reproducible output).

Solver programs must follow the same sandbox safety constraints.

### 4.2 Dependency Whitelist

Allowed imports:

* `sympy`, `math`, `fractions`, `itertools`

Disallowed imports include (not exhaustive):

* `os`, `pathlib`, `subprocess`, `socket`, `requests`, `time`, `datetime`

The Platform enforces this using static scans and runtime import interception.
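
The static-scan half can be sketched with the standard `ast` module. This is an assumption-laden sketch — the real policy (e.g. how relative imports or `__import__` calls are treated) is platform-defined, and a real platform pairs it with runtime interception:

```python
import ast

ALLOWED = {"sympy", "math", "fractions", "itertools"}

def banned_imports(source: str) -> list[str]:
    """Return top-level module names imported outside the whitelist."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in ALLOWED:
                bad.append(root)
    return bad
```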

### 4.3 Resource Limits (Default Season Settings)

* Setter must generate `a_0..a_{N_check-1}` within **1 second** on the Platform standard machine.
* Setter code length ≤ **100 lines** (Platform-defined effective line counting).
* Setter UTF-8 character count ≤ **5000**.
* Platform may apply comparable limits to solver programs.

> All limits and counting rules must be published and stable for the duration of a season.

---

## 5. Platform Validation & Publication (Commit–Reveal)

### 5.1 Validation Gates (Setter)

A problem is published only if it passes all gates:

1. **Static Gate**: line/char limits, banned imports/symbols.
2. **Sandbox Gate**: no I/O, no network, no subprocess; runtime import whitelist.
3. **Performance Gate**: generate first `N_check` terms within 1 second.
4. **Determinism Gate**: repeated runs produce identical outputs.

If any gate fails, the submission is rejected and not published.

### 5.2 Canonicalization (for Hashing)

The Platform canonicalizes `setter.py` before hashing:

* UTF-8 encoding
* normalize newlines to `\n`
* strip trailing blank lines at EOF
* leave line-trailing whitespace unchanged

### 5.3 Commitment Hash

The Platform computes:

* `P_hash = SHA-256(canonical_setter.py_bytes)`

**Setters must not self-report hashes.** The Platform-generated `P_hash` is authoritative.
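
The canonicalization steps in 5.2 plus the hash in 5.3 can be sketched as follows. One detail this sketch assumes (it is a platform policy choice, not stated above): the canonical form ends with exactly one `\n`:

```python
import hashlib

def canonicalize(source: str) -> bytes:
    # Normalize newlines to \n, strip trailing blank lines at EOF,
    # leave line-trailing whitespace unchanged, encode as UTF-8.
    text = source.replace("\r\n", "\n").replace("\r", "\n")
    lines = text.split("\n")
    while lines and lines[-1] == "":
        lines.pop()
    return ("\n".join(lines) + "\n").encode("utf-8")

def p_hash(source: str) -> str:
    """P_hash = SHA-256 over the canonical bytes, as a hex digest."""
    return hashlib.sha256(canonicalize(source)).hexdigest()
```

With this policy, CRLF endings and trailing blank lines do not change `P_hash`.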

### 5.4 Published Problem Record

After validation, the Platform publishes a record containing at least:

* `problem_id`, `title`
* `P_hash`
* `interface`, `N_check`
* disclosure data (default: odd-index first 50)

The Platform must generate disclosure values directly from the validated setter output (no manual transcription).

### 5.5 Reveal

After the problem closes, the Platform reveals `setter.py` (and canonicalization policy) so anyone can verify:

* `SHA-256(canonical(setter.py)) == P_hash`

---

## 6. Judging & Scoring

### 6.1 Ground Truth

The Platform generates ground truth `a_true[0..N_check-1]` from the validated setter program.

### 6.2 Solver Output

The Platform runs the solver program to obtain `a_hat[0..N_check-1]`.

### 6.3 Verdict

* **Stage Pass**: first 100 terms match exactly
  `a_hat[0:100] == a_true[0:100]`
* **Reward**: first 200 terms match exactly
  `a_hat[0:200] == a_true[0:200]`

The Platform should report:

* pass/fail
* earliest mismatch index (if any)
* expected value vs actual value at mismatch
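
The verdict and mismatch report can be sketched as:

```python
def judge(a_hat: list[int], a_true: list[int]) -> dict:
    """Compare solver output with ground truth; report verdicts and
    the earliest mismatch (index, expected, actual), if any."""
    mismatch = next(
        (i for i, (x, y) in enumerate(zip(a_hat, a_true)) if x != y), None
    )
    return {
        "stage_pass": a_hat[:100] == a_true[:100],
        "reward": a_hat[:200] == a_true[:200],
        "first_mismatch": mismatch,
        "expected": None if mismatch is None else a_true[mismatch],
        "actual": None if mismatch is None else a_hat[mismatch],
    }
```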

---

## 7. Setter Incentives (Optional Rule)

A platform may introduce setter rewards to encourage **stable, extrapolatable** problems.

Intended target (informative):

* Problems where matching the first 100 terms usually means the solver has captured the true generative principle,
  so Stage passers tend also to reach 200 (the distribution of “how far you stay correct” is concentrated near 200).
* This discourages “trap problems”: sequences that are easy to fit up to term 100 but diverge beyond it.

### 7.1 Recommended minimum rule (stage→reward consistency)
One recommended rule:

* Let `S100` be solvers who pass Stage (first 100 correct),
* Let `S200` be solvers who receive Reward (first 200 correct).

Setter reward condition (piecewise):

* If `|S100| <= 3`, require `S100 ⊆ S200` (all stage passers get reward).
* If `|S100| >= 4`, require `|S200| / |S100| >= 0.9`.
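
The piecewise condition can be checked directly on the two solver sets:

```python
def setter_reward_ok(s100: set[str], s200: set[str]) -> bool:
    """Stage→reward consistency rule from 7.1."""
    if len(s100) <= 3:
        return s100 <= s200  # every Stage passer also gets Reward
    return len(s200) / len(s100) >= 0.9
```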

### 7.2 Optional published diagnostics (extrapolation profile)
Platforms may additionally publish a per-problem diagnostic after Reveal to make “extrapolatability” concrete.
One simple metric is the **longest correct prefix length** for each Stage passer:

* For each solver in `S100`, define `L = max k in [100..200]` such that `a_hat[0:k] == a_true[0:k]`.
* Publish summary statistics of `L` over `S100` (e.g., median, 10th percentile, histogram bins).

This does not change solver verdicts; it is intended for transparency, postmortems, and optional setter incentives.
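
For a solver that has already passed Stage, the longest-correct-prefix metric reduces to scanning forward from index 100:

```python
def longest_correct_prefix(a_hat: list[int], a_true: list[int]) -> int:
    """Largest k in [100..200] with a_hat[0:k] == a_true[0:k],
    assuming the solver already passed Stage (first 100 correct)."""
    k = 100
    while k < 200 and a_hat[k] == a_true[k]:
        k += 1
    return k
```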

### 7.3 Optional quantitative setter reward (brevity × difficulty)
Some platforms may want a **numerical** setter reward score that increases when:

* the setter program is shorter (in canonical bytes), and
* the problem is harder (fewer solvers reach Reward), while still being extrapolatable (not a trap).

This section is informative and defines one compatible approach.

#### 7.3.1 Brevity factor (setter length)
Let:

* `L_set = len(canonical_setter.py_bytes)`

Define a bounded brevity factor:

* `B_set = exp(-beta_set * L_set)` in `(0, 1]`

`beta_set` is a season parameter (example scale: `beta_set = 1/800` matches the solver brevity scale).

#### 7.3.2 Difficulty factor (participation-adjusted)
Let:

* `T` be the set of solver submissions for the problem (within the season window),
* `R = |S200|` (Reward Correct count),
* `U = |T|` (submission count).

Define a difficulty factor that avoids rewarding “nobody tried” problems:

* Require `U >= U_min` and `R >= 1` for any setter reward to apply.
* `D = clamp( log((U + 1) / (R + 1)) / log((U + 1) / 2), 0, 1 )`

Intuition:

* If many people try (`U` large) but few reach 200 (`R` small), `D` approaches 1.
* If most submissions reach 200, `D` approaches 0.

All parameters (`U_min`) and counting rules must be published for the season.
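
The difficulty factor can be sketched as follows; `U_MIN = 3` is an assumed season parameter, not a value fixed by these rules:

```python
import math

U_MIN = 3  # assumed season parameter

def difficulty(U: int, R: int) -> float:
    """Participation-adjusted difficulty factor D from 7.3.2.
    Returns 0.0 when the eligibility preconditions fail."""
    if U < U_MIN or R < 1:
        return 0.0
    d = math.log((U + 1) / (R + 1)) / math.log((U + 1) / 2)
    return max(0.0, min(1.0, d))
```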

#### 7.3.3 Extrapolatability factor (anti-trap)
Reuse the stage→reward consistency idea as a multiplier:

* `E = 0` if the recommended minimum rule (7.1) fails.
* Otherwise `E = clamp(|S200| / max(1, |S100|), 0, 1)`.

This makes the setter reward collapse to 0 for classic “100 fits, 200 fails” traps.

#### 7.3.4 Consensus / rarity boost (optional)
Let `K` be the number of distinct solving method tags among Reward Correct solutions (see `SCORING.md`):

* `K = |{method_tag among Reward Correct submissions}|`

Define a small multiplicative boost:

* `C = 1 + gamma * log2(max(1, K))`

where `gamma` is a season parameter (e.g., `gamma = 0.1`).

#### 7.3.5 Final setter reward score (example)
An example per-problem setter reward score:

* `setter_score = SETTER_BASE * B_set * D * E * C`

with `SETTER_BASE` and all parameters published for the season.
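
Putting the factors together; `BETA_SET` and `SETTER_BASE` are assumed season parameters, and `d`, `e`, `c` are the factors from 7.3.2–7.3.4, passed in precomputed:

```python
import math

BETA_SET = 1 / 800   # assumed season parameter (brevity scale)
SETTER_BASE = 100.0  # assumed season parameter

def brevity(canonical_len: int) -> float:
    """B_set = exp(-beta_set * L_set), in (0, 1]."""
    return math.exp(-BETA_SET * canonical_len)

def setter_score(canonical_len: int, d: float, e: float, c: float) -> float:
    """setter_score = SETTER_BASE * B_set * D * E * C (7.3.5)."""
    return SETTER_BASE * brevity(canonical_len) * d * e * c
```

Because the factors multiply, a trap problem (`E = 0`) scores zero regardless of brevity or difficulty.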

---

## 8. Transparency Requirements

For each season, the Platform must publish and keep stable:

* Python version, library versions (e.g., sympy)
* standard machine class/spec
* timing method (wall-clock vs CPU time)
* line/character counting rules
* canonicalization policy used for hashing

---

## 9. Disputes

* Platform sandbox results are authoritative.
* After reveal, any participant may reproduce results locally to verify integrity.
* If a discrepancy is found, the Platform must publish a postmortem including:

  * canonicalization details
  * environment versions
  * reproduction steps

---

## 10. Versioning

This document is **Draft v0.3**. Breaking changes require a new season or an explicit version bump. Non-breaking clarifications may be issued at any time but must not change outcomes of already-published problems.