# PLATFORM.md — Sequence Dojo Platform Implementation Guide (Draft v0.3)

This document is for maintainers and implementers of the **Sequence Dojo platform**. It describes how the platform should validate setter submissions, generate commitments, publish problems, sandbox execution, judge solvers, and reveal setters under a **platform-enforced Commit–Reveal protocol**.

> Design goals: **reproducible**, **verifiable**, **safe**, and **low-friction**.

See also:

* [SPEC.md](./SPEC.md) (normative protocol and artifact formats)
* [RULES.md](./RULES.md) (participant-facing constraints)
* [REVEAL.md](./REVEAL.md) (lifecycle phases and disclosure)

---

## 1. Platform Responsibilities

The Platform must:

1. **Validate** setter packages before publication
2. **Canonicalize** and compute `P_hash` (SHA-256) for the setter source
3. **Generate disclosure data** from true outputs (no manual transcription)
4. **Publish** a public problem record (`published.json`)
5. **Execute and judge** solver submissions in a sandbox
6. **Reveal** setter source after the problem closes, enabling third-party verification of `P_hash`
7. Provide **clear diagnostics** on failures (static gate, sandbox, runtime, mismatch)

---

## 2. Standard Interfaces

### 2.1 Setter interfaces

The platform must support exactly one of the following per season (recommended: `seq`):

* `seq(n: int) -> int`
* `gen(N: int) -> list[int]`

### 2.2 Solver interface (recommended)

* `solver() -> list[int]` returning exactly `N_check` integers

> Keep the solver interface fixed and simple; avoid allowing “paste-a-list” solutions.
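
For concreteness, here is a minimal, purely illustrative pair of functions matching the recommended interfaces (the sequence rule is a placeholder, not a real problem):

```python
def seq(n: int) -> int:
    """Setter interface: return the n-th term (0-indexed)."""
    return n * n + 1  # placeholder rule for illustration


def solver() -> list[int]:
    """Solver interface: return exactly N_check integers."""
    N_CHECK = 200  # fixed per season
    return [n * n + 1 for n in range(N_CHECK)]
```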

---

## 3. Artifact Formats

### 3.1 Setter package

Expected files:

* `problem.json`
* `setter.py`

### 3.2 Solver package

Expected files:

* `solver.py`
* optional `solution.json`

### 3.3 Published problem record

The platform outputs a `published.json` containing:

* `problem_id`, `title`
* `P_hash`
* `interface`, `N_check`
* `disclosure` (default: odd-index first 50)
* `timestamp`
* `platform` metadata (versions + canonicalization policy)
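
An illustrative shape (field names follow the list above; values are placeholders, and the normative schema lives in SPEC.md):

```json
{
  "problem_id": "example-001",
  "title": "Example Problem",
  "P_hash": "<64 hex chars of SHA-256>",
  "interface": "seq",
  "N_check": 200,
  "disclosure": { "rule": "odd_first_50", "values": ["<generated>"] },
  "timestamp": "2025-01-01T00:00:00Z",
  "platform": {
    "version": "0.3",
    "canonicalization": "utf8 / LF / strip trailing blank lines"
  }
}
```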

---

## 4. Validation Pipeline (Gates)

A setter submission is published **only if it passes all gates**. Fail fast with clear error messages.

### Gate A — Static validation (no execution)

**Inputs**: `setter.py`, `problem.json`

Checks:

1. **Line limit**: effective lines ≤ 100
2. **Character limit**: UTF-8 char count ≤ 5000
3. **AST parse**: parse must succeed
4. **Import whitelist**:

   * allowed: `sympy`, `math`, `fractions`, `itertools`
   * disallowed (non-exhaustive): `os`, `pathlib`, `subprocess`, `socket`, `requests`, `time`, `datetime`
5. **Dangerous builtins** (recommend reject if referenced):

   * `open`, `eval`, `exec`, `compile`, `__import__`, `input`
6. **Suspicious patterns** (recommend reject if present):

   * attribute access to `__dict__`, `__class__`, `__mro__`, `__subclasses__`
   * `globals()`, `locals()`
   * `getattr`/`setattr` on unknown targets
   * `importlib` usage

**Outputs**:

* pass/fail
* list of violations with line/column if possible

> Implementation hint: use Python `ast` to scan `Import`, `ImportFrom`, `Name`, `Attribute`, `Call`.
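
A minimal sketch of such a scanner, assuming the whitelist and banned-name sets above (a production gate would also enforce the line/character limits and report every violation with location):

```python
import ast

ALLOWED_IMPORTS = {"sympy", "math", "fractions", "itertools"}
BANNED_NAMES = {"open", "eval", "exec", "compile", "__import__", "input",
                "globals", "locals"}
BANNED_ATTRS = {"__dict__", "__class__", "__mro__", "__subclasses__"}


def static_check(source: str) -> list[str]:
    """Return a list of violation strings; empty means Gate A passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"E_STATIC_AST_PARSE: {e}"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                if name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(
                        f"E_STATIC_IMPORT_FORBIDDEN: {name} (line {node.lineno})")
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            violations.append(
                f"E_STATIC_DANGEROUS_BUILTIN: {node.id} (line {node.lineno})")
        elif isinstance(node, ast.Attribute) and node.attr in BANNED_ATTRS:
            # error code here is illustrative, not from the §10 list
            violations.append(
                f"E_STATIC_SUSPICIOUS_PATTERN: {node.attr} (line {node.lineno})")
    return violations
```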

---

### Gate B — Sandbox safety validation (controlled execution)

**Goal**: ensure the code cannot perform file I/O, network access, or subprocess execution, and cannot import forbidden modules at runtime.

Recommended baseline approach (portable, “good enough”):

* Run code in a **separate process**
* Use **resource limits**:

  * CPU time limit
  * wall time limit
  * memory limit
* Use a **restricted import hook**:

  * allow only whitelist modules
  * deny everything else
* Provide a **restricted builtins** dict:

  * remove/replace `open`, `eval`, `exec`, `compile`, `__import__`, etc.

Stronger approach (optional, more robust):

* containerized sandbox (Docker/Firejail/nsjail) with no network, read-only FS, limited syscalls

Minimum requirements:

* No file writes
* No network access
* No subprocess execution
* No clock/time APIs (or ensure they return fixed values)

**Outputs**:

* pass/fail
* if fail: indicate which capability was attempted (e.g., forbidden import, file access)
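
A sketch of the baseline approach's import hook and restricted builtins (deliberately simplified; see §12.1 for why this alone is not bulletproof, and note that the `exec` call itself belongs in a separate, resource-limited process):

```python
import builtins

ALLOWED_MODULES = {"sympy", "math", "fractions", "itertools"}


def _restricted_import(name, globals=None, locals=None, fromlist=(), level=0):
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"E_SANDBOX_FORBIDDEN_IMPORT: {name}")
    return __import__(name, globals, locals, fromlist, level)


def run_restricted(source: str) -> dict:
    """Execute setter source with a whitelisted builtins dict."""
    safe = {k: getattr(builtins, k) for k in (
        "abs", "divmod", "enumerate", "int", "len", "list", "max",
        "min", "pow", "range", "sum", "zip")}
    safe["__import__"] = _restricted_import  # import statements route here
    env = {"__builtins__": safe}
    exec(source, env)
    return env  # caller then looks up seq/gen in env
```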

---

### Gate C — Performance validation (1 second)

**Goal**: the setter must generate `a[0..N_check-1]` within the time limit on the platform's standard machine.

Procedure:

1. Load the setter in the sandbox
2. Generate first `N_check` terms (default 200)
3. Measure runtime (see §7 timing policy)
4. Fail if runtime > 1 second

Record:

* wall time
* CPU time (if available)
* peak RSS memory (optional but useful)
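
An in-process sketch of the measurement (in practice the call runs inside the sandboxed child):

```python
import time


def performance_gate(seq, n_check: int = 200, limit_s: float = 1.0) -> dict:
    """Time end-to-end generation of the first n_check terms."""
    t0 = time.perf_counter()
    terms = [seq(n) for n in range(n_check)]  # for the gen(N) form: gen(n_check)
    wall = time.perf_counter() - t0
    return {"ok": wall <= limit_s, "wall_time_s": wall, "terms": terms}
```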

---

### Gate D — Determinism validation

**Goal**: repeated runs must produce identical results.

Procedure:

1. Run generation twice in the same environment (fresh process recommended)
2. Compare the full `N_check` list
3. Fail if any mismatch

This catches:

* non-fixed random seeds
* time/environment dependence
* nondeterministic iteration order / hashing dependence (less likely in pure numeric code, but still possible)
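
A sketch of the comparison, where `generate` stands for the platform's internal wrapper returning the full list (fresh processes are preferable, as noted; in-process repetition is shown for brevity):

```python
def determinism_gate(generate, n_check: int = 200) -> dict:
    """Run generation twice and fail on the first differing index."""
    first, second = generate(n_check), generate(n_check)
    for i, (x, y) in enumerate(zip(first, second)):
        if x != y:
            return {"ok": False, "code": "E_NONDETERMINISTIC_OUTPUT", "index": i}
    return {"ok": True}
```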

---

## 5. Canonicalization & Hashing (Commit)

The platform must produce a stable `P_hash` using a **documented canonicalization policy**.

### 5.1 Canonicalization policy (recommended)

Given `setter.py` raw bytes:

1. Decode as UTF-8 (reject if invalid)
2. Normalize newlines to `\n`
3. Strip trailing blank lines at end-of-file
4. Keep all other bytes exactly as-is (do not reformat code)

The canonical bytes are then hashed:

* `P_hash = SHA-256(canonical_bytes)`

> The platform must publish this policy in `published.json.platform.canonicalization`.
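
The policy above, expressed as a sketch (step 3 is interpreted here as leaving exactly one final newline):

```python
import hashlib


def canonicalize(raw: bytes) -> bytes:
    text = raw.decode("utf-8")                             # 1. reject invalid UTF-8
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # 2. normalize newlines
    text = text.rstrip("\n") + "\n"                        # 3. strip trailing blank lines
    return text.encode("utf-8")                            # 4. no other changes


def p_hash(raw: bytes) -> str:
    return hashlib.sha256(canonicalize(raw)).hexdigest()
```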

### 5.2 Why canonicalization

Without canonicalization, the same logical source can hash differently due to:

* CRLF vs LF
* trailing newlines
* packaging artifacts

Canonicalization makes hash verification robust and predictable.

---

## 6. Disclosure Generation (No Human Copy/Paste)

After Gates A–D pass:

1. Generate `a_true = [a_0..a_{N_check-1}]`
2. Build disclosure values by rule:

   * default: `odd_first_50 = [a_1, a_3, …, a_99]`
3. Store disclosure into `published.json`

The platform must never accept manually entered disclosure lists.
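
A sketch of the default rule, derived directly from generated output (indices 1, 3, …, 99 yield exactly 50 values):

```python
def build_disclosure(a_true: list[int]) -> dict:
    """Derive odd_first_50 from ground truth; never from typed-in values."""
    assert len(a_true) >= 100
    return {"rule": "odd_first_50", "values": a_true[1:100:2]}
```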

---

## 7. Timing & Resource Accounting

### 7.1 Timing

Pick one timing definition and publish it for the season:

* **Wall time** is recommended (closer to user experience).
* Measure inside the sandbox process.

Implementation notes:

* decide whether to exclude compilation/import overhead, and apply the choice consistently
* simplest: measure end-to-end generation of all `N_check` terms inside the sandbox

### 7.2 Limits

At minimum enforce:

* wall time ≤ 1 second for `N_check=200` (setter)
* memory ≤ platform-defined cap (recommend: 256–1024 MB)
* recursion depth: default Python limit is fine; optionally cap to prevent abuse
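
On POSIX platforms the caps can be applied in the sandbox child before executing the setter; a sketch (the 512 MB figure is just an example within the recommended range):

```python
import resource


def apply_limits(cpu_seconds: int = 2, mem_bytes: int = 512 * 1024 * 1024) -> None:
    """Call inside the child process (e.g., via subprocess preexec_fn)."""
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
```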

---

## 8. Judging Solver Submissions

### 8.1 Load & run solver

Execute `solver.py` in the same sandbox policy used for setters (or stricter).

Call:

* `solver()` → list of ints of length `N_check`

Validate:

* type is list
* length is exactly `N_check`
* all elements are Python `int`

### 8.2 Compare to ground truth

Compute:

* `first_mismatch_index` (if any)
* stage pass: match first 100
* reward: match first 200

Return a structured verdict object:

```json
{
  "ok": false,
  "stage_pass": true,
  "reward": false,
  "first_mismatch": {
    "index": 145,
    "expected": 2918000611027443,
    "got": 2917990611027443
  }
}
```
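
A sketch that produces the verdict shape above, assuming both lists have length `N_check` = 200:

```python
def judge(got: list[int], truth: list[int]) -> dict:
    mismatch = next(
        (i for i, (g, t) in enumerate(zip(got, truth)) if g != t), None)
    verdict = {
        "ok": mismatch is None,
        "stage_pass": mismatch is None or mismatch >= 100,  # first 100 match
        "reward": mismatch is None,                         # all 200 match
    }
    if mismatch is not None:
        verdict["first_mismatch"] = {
            "index": mismatch,
            "expected": truth[mismatch],
            "got": got[mismatch],
        }
    return verdict
```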

---

## 9. Reveal Procedure

After the problem is closed (or per policy):

1. Publish the original `setter.py` (or the canonicalized version—choose one and document it)
2. Publish the canonicalization policy
3. (Optional) publish validation logs (gate results, runtime, determinism checks)

Third parties should be able to verify:

* `SHA-256(canonicalize(setter.py)) == P_hash`
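
With the §5 canonicalization sketch in hand, third-party verification is a one-liner:

```python
def verify_reveal(setter_path: str, published_p_hash: str) -> bool:
    """Recompute the commitment from revealed source (reuses p_hash from §5)."""
    with open(setter_path, "rb") as f:
        return p_hash(f.read()) == published_p_hash
```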

---

## 10. Error Codes & Diagnostics (Recommended)

Use stable, machine-readable error codes.

Examples:

### Static gate errors

* `E_STATIC_LINE_LIMIT`
* `E_STATIC_CHAR_LIMIT`
* `E_STATIC_IMPORT_FORBIDDEN`
* `E_STATIC_DANGEROUS_BUILTIN`
* `E_STATIC_AST_PARSE`

### Sandbox/runtime errors

* `E_SANDBOX_FORBIDDEN_IMPORT`
* `E_SANDBOX_IO_ATTEMPT`
* `E_SANDBOX_SUBPROCESS_ATTEMPT`
* `E_TIMEOUT`
* `E_OOM`

### Interface errors

* `E_INTERFACE_MISSING`
* `E_INTERFACE_BAD_RETURN_TYPE`
* `E_INTERFACE_BAD_LENGTH`
* `E_INTERFACE_NON_INT_ELEMENT`

### Determinism

* `E_NONDETERMINISTIC_OUTPUT`

### Judging mismatch

* `E_MISMATCH`

Diagnostics should include:

* which gate failed
* the offending module/symbol if applicable
* earliest mismatch index for judging

---

## 11. Minimal CLI Workflow (Suggested)

A minimal working CLI typically includes:

### `validate <setter_pack_dir>`

* run Gates A–D
* print pass/fail + diagnostics

### `publish <setter_pack_dir> --out published.json`

* run validate
* canonicalize + compute `P_hash`
* generate disclosure
* write `published.json`

### `judge <published.json> <solver_pack_dir>`

* load the setter by `problem_id` (or stored ground truth)
* run solver
* compare and output verdict JSON

### `reveal <published.json> --out reveal_dir`

* export setter source + metadata
* optionally export logs
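
A possible `argparse` wiring for these four subcommands (handlers are placeholders to be filled in):

```python
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(prog="dojo")
    sub = parser.add_subparsers(dest="cmd", required=True)

    sub.add_parser("validate").add_argument("setter_pack_dir")

    publish = sub.add_parser("publish")
    publish.add_argument("setter_pack_dir")
    publish.add_argument("--out", default="published.json")

    judge = sub.add_parser("judge")
    judge.add_argument("published_json")
    judge.add_argument("solver_pack_dir")

    reveal = sub.add_parser("reveal")
    reveal.add_argument("published_json")
    reveal.add_argument("--out", default="reveal_dir")

    args = parser.parse_args()
    # dispatch: run Gates A–D, publish, judge, or reveal based on args.cmd


if __name__ == "__main__":
    main()
```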

---

## 12. Implementation Notes & Tradeoffs

### 12.1 Sandboxing in Python is hard

A pure-Python “restricted builtins” approach is not bulletproof against adversarial code. For early prototypes it is acceptable, but for public competitions consider process- or container-level sandboxes.

### 12.2 Keep the interface small

Favor `seq(n)` or `solver()` and fixed `N_check`. This reduces complexity and avoids ambiguous I/O contracts.

### 12.3 Treat all disclosure as generated artifacts

Trial 002 showed how easy it is to corrupt a problem by copying large integers manually. This platform must prevent that class of failure entirely.

---

## 13. Season Configuration (Recommended)

A `season.toml` / `config.json` can fix:

* allowed imports
* banned modules/builtins
* `N_check`, disclosure rule
* time/memory limits
* canonicalization policy
* judging thresholds (100/200)

The platform should embed the relevant configuration in `published.json.platform` for auditability.
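
An illustrative `config.json` (key names are suggestions, not a fixed schema):

```json
{
  "allowed_imports": ["sympy", "math", "fractions", "itertools"],
  "banned_builtins": ["open", "eval", "exec", "compile", "__import__", "input"],
  "n_check": 200,
  "disclosure_rule": "odd_first_50",
  "wall_time_limit_s": 1.0,
  "memory_limit_mb": 512,
  "canonicalization": "utf8 / LF / strip trailing blank lines",
  "judging_thresholds": { "stage_pass": 100, "reward": 200 }
}
```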