# PLATFORM.md — Sequence Dojo Platform Implementation Guide (Draft v0.3)
This document is for maintainers and implementers of the **Sequence Dojo platform**. It describes how the platform should validate setter submissions, generate commitments, publish problems, sandbox execution, judge solvers, and reveal setters under a **platform-enforced Commit–Reveal protocol**.
> Design goals: **reproducible**, **verifiable**, **safe**, and **low-friction**.
See also:
* [SPEC.md](./SPEC.md) (normative protocol and artifact formats)
* [RULES.md](./RULES.md) (participant-facing constraints)
* [REVEAL.md](./REVEAL.md) (lifecycle phases and disclosure)
---
## 1. Platform Responsibilities
The Platform must:
1. **Validate** setter packages before publication
2. **Canonicalize** and compute `P_hash` (SHA-256) for the setter source
3. **Generate disclosure data** from true outputs (no manual transcription)
4. **Publish** a public problem record (`published.json`)
5. **Execute and judge** solver submissions in a sandbox
6. **Reveal** setter source after the problem closes, enabling third-party verification of `P_hash`
7. Provide **clear diagnostics** on failures (static gate, sandbox, runtime, mismatch)
---
## 2. Standard Interfaces
### 2.1 Setter interfaces
The platform must support exactly one of the following per season (recommended: `seq`):
* `seq(n: int) -> int`
* `gen(N: int) -> list[int]`
### 2.2 Solver interface (recommended)
* `solver() -> list[int]` returning exactly `N_check` integers
> Keep the solver interface fixed and simple; avoid allowing “paste-a-list” solutions.
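As an illustrative sketch (not a real problem), a setter exposing `seq` and a matching solver might look like the following; the sequence rule and the 0-indexing convention are assumptions made for the example.

```python
# setter.py — illustrative seq(n) setter (0-indexing assumed for the example)
def seq(n: int) -> int:
    # Placeholder rule: sum of squares 0^2 + 1^2 + ... + n^2
    return n * (n + 1) * (2 * n + 1) // 6


# solver.py — illustrative solver() returning exactly N_check terms
N_CHECK = 200  # assumed to match the published N_check

def solver() -> list[int]:
    # A real solver reconstructs the hidden rule; here we simply reuse the same formula.
    return [k * (k + 1) * (2 * k + 1) // 6 for k in range(N_CHECK)]
```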
---
## 3. Artifact Formats
### 3.1 Setter package
Expected files:
* `problem.json`
* `setter.py`
### 3.2 Solver package
Expected files:
* `solver.py`
* optional `solution.json`
### 3.3 Published problem record
The platform outputs a `published.json` containing:
* `problem_id`, `title`
* `P_hash`
* `interface`, `N_check`
* `disclosure` (default: odd-index first 50)
* `timestamp`
* `platform` metadata (versions + canonicalization policy)
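An illustrative `published.json` with the fields listed above; all values are placeholders, the `values` list is truncated, and the nested `disclosure` layout is an assumption (only the field names above are specified).

```json
{
  "problem_id": "sd-2025-001",
  "title": "Example problem",
  "P_hash": "9b74c9897bac770ffc029102a200c5de1c1835420b6b9942dd4f1b3a7bd3e236",
  "interface": "seq",
  "N_check": 200,
  "disclosure": {
    "rule": "odd_first_50",
    "values": [1, 5, 14]
  },
  "timestamp": "2025-06-01T12:00:00Z",
  "platform": {
    "version": "0.3.0",
    "canonicalization": "utf8-lf-strip-trailing-blank-lines"
  }
}
```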
---
## 4. Validation Pipeline (Gates)
A setter submission is published **only if it passes all gates**. Fail fast with clear error messages.
### Gate A — Static validation (no execution)
**Inputs**: `setter.py`, `problem.json`
Checks:
1. **Line limit**: effective lines ≤ 100
2. **Character limit**: UTF-8 char count ≤ 5000
3. **AST parse**: parse must succeed
4. **Import whitelist**:
* allowed: `sympy`, `math`, `fractions`, `itertools`
* disallowed (non-exhaustive): `os`, `pathlib`, `subprocess`, `socket`, `requests`, `time`, `datetime`
5. **Dangerous builtins** (recommend reject if referenced):
* `open`, `eval`, `exec`, `compile`, `__import__`, `input`
6. **Suspicious patterns** (recommend reject if present):
* attribute access to `__dict__`, `__class__`, `__mro__`, `__subclasses__`
* `globals()`, `locals()`
* `getattr`/`setattr` on unknown targets
* `importlib` usage
**Outputs**:
* pass/fail
* list of violations with line/column if possible
> Implementation hint: use Python `ast` to scan `Import`, `ImportFrom`, `Name`, `Attribute`, `Call`.
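A minimal sketch of such a static gate, assuming the whitelist and limits above; error codes follow §10 where they exist, and the suspicious-attribute code is a hypothetical addition.

```python
import ast

ALLOWED_IMPORTS = {"sympy", "math", "fractions", "itertools"}
DANGEROUS_BUILTINS = {"open", "eval", "exec", "compile", "__import__", "input",
                      "globals", "locals", "getattr", "setattr"}
SUSPICIOUS_ATTRS = {"__dict__", "__class__", "__mro__", "__subclasses__"}

def static_gate(source: str) -> list[str]:
    """Return a list of violation messages; an empty list means Gate A passes."""
    violations = []
    # Simplified limits: counts every line/character ("effective lines" may exclude blanks/comments).
    if len(source.splitlines()) > 100:
        violations.append("E_STATIC_LINE_LIMIT")
    if len(source) > 5000:
        violations.append("E_STATIC_CHAR_LIMIT")
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return violations + [f"E_STATIC_AST_PARSE: {exc}"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for mod in modules:
            if mod.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"E_STATIC_IMPORT_FORBIDDEN: {mod} (line {node.lineno})")
        if isinstance(node, ast.Name) and node.id in DANGEROUS_BUILTINS:
            violations.append(f"E_STATIC_DANGEROUS_BUILTIN: {node.id} (line {node.lineno})")
        if isinstance(node, ast.Attribute) and node.attr in SUSPICIOUS_ATTRS:
            violations.append(f"E_STATIC_SUSPICIOUS_ATTR: {node.attr} (line {node.lineno})")
    return violations
```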
---
### Gate B — Sandbox safety validation (controlled execution)
**Goal**: ensure the code cannot perform I/O/network/subprocess and cannot import forbidden modules at runtime.
Recommended baseline approach (portable, “good enough”):
* Run code in a **separate process**
* Use **resource limits**:
* CPU time limit
* wall time limit
* memory limit
* Use a **restricted import hook**:
* allow only whitelist modules
* deny everything else
* Provide a **restricted builtins** dict:
* remove/replace `open`, `eval`, `exec`, `compile`, `__import__`, etc.
Stronger approach (optional, more robust):
* containerized sandbox (Docker/Firejail/nsjail) with no network, read-only FS, limited syscalls
Minimum requirements:
* No file writes
* No network access
* No subprocess execution
* No clock/time APIs (or ensure they return fixed values)
**Outputs**:
* pass/fail
* if fail: indicate which capability was attempted (e.g., forbidden import, file access)
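A minimal sketch of the baseline approach above (restricted builtins plus an import hook and POSIX resource limits). It is meant to run inside a disposable worker process and, as §12.1 notes, it is not bulletproof against adversarial code.

```python
import builtins
import resource

ALLOWED_IMPORTS = {"sympy", "math", "fractions", "itertools"}
REMOVED_BUILTINS = {"open", "eval", "exec", "compile", "input", "__import__"}

_real_import = builtins.__import__

def _restricted_import(name, globals=None, locals=None, fromlist=(), level=0):
    # Deny anything outside the whitelist at runtime (Gate B complement to Gate A).
    if name.split(".")[0] not in ALLOWED_IMPORTS:
        raise ImportError(f"E_SANDBOX_FORBIDDEN_IMPORT: {name}")
    return _real_import(name, globals, locals, fromlist, level)

def run_sandboxed(source: str, entry: str, arg):
    """Execute `entry(arg)` from submitted source under restricted builtins and limits."""
    # POSIX resource limits: CPU seconds and address space (values are illustrative).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024, 512 * 1024 * 1024))
    safe_builtins = {k: v for k, v in vars(builtins).items() if k not in REMOVED_BUILTINS}
    safe_builtins["__import__"] = _restricted_import
    namespace = {"__builtins__": safe_builtins}
    exec(compile(source, "<submission>", "exec"), namespace)  # exec'd by the platform harness
    return namespace[entry](arg)
```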
---
### Gate C — Performance validation (1 second)
**Goal**: the setter must generate `a[0..N_check-1]` within the time limit on the platform's standard machine.
Procedure:
1. Load the setter in the sandbox
2. Generate first `N_check` terms (default 200)
3. Measure runtime (see §7 timing policy)
4. Fail if runtime > 1 second
Record:
* wall time
* CPU time (if available)
* peak RSS memory (optional but useful)
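A minimal sketch of this check, using wall time per §7 and a hypothetical `generate_terms(source, n)` helper that produces the first `n` terms inside the sandbox. Note that the `time` ban applies to submitted code, not to the platform harness.

```python
import time

TIME_LIMIT_S = 1.0
N_CHECK = 200

def performance_gate(generate_terms, setter_source: str) -> dict:
    """Gate C: measure wall time for producing the first N_CHECK terms."""
    start = time.perf_counter()
    terms = generate_terms(setter_source, N_CHECK)  # runs inside the sandbox process
    wall = time.perf_counter() - start
    if wall > TIME_LIMIT_S:
        return {"ok": False, "error": "E_TIMEOUT", "wall_time_s": wall}
    return {"ok": True, "wall_time_s": wall, "terms": terms}
```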
---
### Gate D — Determinism validation
**Goal**: repeated runs must produce identical results.
Procedure:
1. Run generation twice in the same environment (fresh process recommended)
2. Compare the full `N_check` list
3. Fail if any mismatch
This catches:
* non-fixed random seeds
* time/environment dependence
* nondeterministic iteration order / hashing dependence (less likely in pure numeric code, but still possible)
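A minimal sketch of the determinism check, reusing the same hypothetical `generate_terms` helper; in practice each call should go through a fresh sandbox process.

```python
def determinism_gate(generate_terms, setter_source: str, n_check: int = 200) -> dict:
    """Gate D: two generations of the first n_check terms must be identical."""
    first = generate_terms(setter_source, n_check)   # ideally in a fresh process
    second = generate_terms(setter_source, n_check)  # ideally in a fresh process
    if first != second:
        index = next((i for i, (a, b) in enumerate(zip(first, second)) if a != b),
                     min(len(first), len(second)))
        return {"ok": False, "error": "E_NONDETERMINISTIC_OUTPUT", "first_divergence": index}
    return {"ok": True}
```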
---
## 5. Canonicalization & Hashing (Commit)
The platform must produce a stable `P_hash` using a **documented canonicalization policy**.
### 5.1 Canonicalization policy (recommended)
Given `setter.py` raw bytes:
1. Decode as UTF-8 (reject if invalid)
2. Normalize newlines to `\n`
3. Strip trailing blank lines at end-of-file
4. Keep all other bytes exactly as-is (do not reformat code)
The canonical bytes are then hashed:
* `P_hash = SHA-256(canonical_bytes)`
> The platform must publish this policy in `published.json.platform.canonicalization`.
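A minimal sketch of this policy and the commitment hash; normalizing to exactly one final newline is one reasonable reading of step 3 and should itself be documented.

```python
import hashlib

def canonicalize(raw: bytes) -> bytes:
    """Apply §5.1: UTF-8, LF newlines, no trailing blank lines."""
    text = raw.decode("utf-8")                              # step 1: raises on invalid UTF-8
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # step 2: normalize newlines to \n
    text = text.rstrip("\n") + "\n"                         # step 3: exactly one final newline
    return text.encode("utf-8")

def compute_p_hash(raw: bytes) -> str:
    """P_hash = SHA-256 over the canonical bytes, as a hex digest."""
    return hashlib.sha256(canonicalize(raw)).hexdigest()
```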
### 5.2 Why canonicalization
Without canonicalization, the same logical source can hash differently due to:
* CRLF vs LF
* trailing newlines
* packaging artifacts
Canonicalization makes hash verification robust and predictable.
---
## 6. Disclosure Generation (No Human Copy/Paste)
After Gates A–D pass:
1. Generate `a_true = [a_0..a_{N_check-1}]`
2. Build disclosure values by rule:
* default: `odd_first_50 = [a_1, a_3, …, a_99]`
3. Store disclosure into `published.json`
The platform must never accept manually entered disclosure lists.
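A minimal sketch of the default rule (0-indexed `a_true`, the 50 odd-index values `a_1, a_3, …, a_99`); the returned dict shape is illustrative.

```python
def build_disclosure(a_true: list[int]) -> dict:
    """Default disclosure: the 50 odd-index terms a_1, a_3, ..., a_99."""
    odd_first_50 = a_true[1:100:2]       # indices 1, 3, ..., 99
    assert len(odd_first_50) == 50, "requires N_check >= 100"
    return {"rule": "odd_first_50", "values": odd_first_50}
```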
---
## 7. Timing & Resource Accounting
### 7.1 Timing
Pick one timing definition and publish it for the season:
* **Wall time** is recommended (closer to user experience).
* Measure inside the sandbox process.
Implementation notes:
* decide whether compilation/import overhead is excluded (pick one policy and apply it consistently)
* simplest: measure end-to-end generation of the first `N_check` terms inside the sandbox
### 7.2 Limits
At minimum enforce:
* wall time ≤ 1 second for `N_check=200` (setter)
* memory ≤ platform-defined cap (recommend: 256–1024 MB)
* recursion depth: default Python limit is fine; optionally cap to prevent abuse
---
## 8. Judging Solver Submissions
### 8.1 Load & run solver
Execute `solver.py` in the same sandbox policy used for setters (or stricter).
Call:
* `solver()` → list of ints of length `N_check`
Validate:
* type is list
* length is exactly `N_check`
* all elements are Python `int`
### 8.2 Compare to ground truth
Compute:
* `first_mismatch_index` (if any)
* stage pass: match first 100
* reward: match first 200
Return a structured verdict object:
```json
{
  "ok": false,
  "stage_pass": true,
  "reward": false,
  "first_mismatch": {
    "index": 145,
    "expected": 2918000611027443,
    "got": 2917990611027443
  }
}
```
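A minimal sketch of the comparison step that produces this verdict, assuming the length and type checks from §8.1 already passed; here `ok` is read as "all `N_check` terms match", consistent with the example above.

```python
def judge(solver_output: list[int], a_true: list[int]) -> dict:
    """Compare solver output against ground truth and build the verdict object."""
    first_mismatch = None
    for i, (got, expected) in enumerate(zip(solver_output, a_true)):
        if got != expected:
            first_mismatch = {"index": i, "expected": expected, "got": got}
            break
    stage_pass = first_mismatch is None or first_mismatch["index"] >= 100
    reward = first_mismatch is None           # all N_check (default 200) terms match
    verdict = {"ok": reward, "stage_pass": stage_pass, "reward": reward}
    if first_mismatch is not None:
        verdict["first_mismatch"] = first_mismatch
    return verdict
```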
---
## 9. Reveal Procedure
After the problem is closed (or per policy):
1. Publish the original `setter.py` (or the canonicalized version—choose one and document it)
2. Publish the canonicalization policy
3. (Optional) publish validation logs (gate results, runtime, determinism checks)
Third parties should be able to verify:
* `SHA-256(canonicalize(setter.py)) == P_hash`
---
## 10. Error Codes & Diagnostics (Recommended)
Use stable, machine-readable error codes.
Examples:
### Static gate errors
* `E_STATIC_LINE_LIMIT`
* `E_STATIC_CHAR_LIMIT`
* `E_STATIC_IMPORT_FORBIDDEN`
* `E_STATIC_DANGEROUS_BUILTIN`
* `E_STATIC_AST_PARSE`
### Sandbox/runtime errors
* `E_SANDBOX_FORBIDDEN_IMPORT`
* `E_SANDBOX_IO_ATTEMPT`
* `E_SANDBOX_SUBPROCESS_ATTEMPT`
* `E_TIMEOUT`
* `E_OOM`
### Interface errors
* `E_INTERFACE_MISSING`
* `E_INTERFACE_BAD_RETURN_TYPE`
* `E_INTERFACE_BAD_LENGTH`
* `E_INTERFACE_NON_INT_ELEMENT`
### Determinism
* `E_NONDETERMINISTIC_OUTPUT`
### Judging mismatch
* `E_MISMATCH`
Diagnostics should include:
* which gate failed
* the offending module/symbol if applicable
* earliest mismatch index for judging
---
## 11. Minimal CLI Workflow (Suggested)
A minimal working CLI typically includes:
### `validate <setter_pack_dir>`
* run Gates A–D
* print pass/fail + diagnostics
### `publish <setter_pack_dir> --out published.json`
* run validate
* canonicalize + compute `P_hash`
* generate disclosure
* write `published.json`
### `judge <published.json> <solver_pack_dir>`
* load setter by problem_id (or stored ground truth)
* run solver
* compare and output verdict JSON
### `reveal <published.json> --out reveal_dir`
* export setter source + metadata
* optionally export logs
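A minimal argparse skeleton for these four subcommands; only the argument wiring is shown, and the dispatch to the gates, publishing, and judging steps sketched earlier is left as a comment.

```python
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(prog="dojo")
    sub = parser.add_subparsers(dest="command", required=True)

    p_validate = sub.add_parser("validate")
    p_validate.add_argument("setter_pack_dir")

    p_publish = sub.add_parser("publish")
    p_publish.add_argument("setter_pack_dir")
    p_publish.add_argument("--out", default="published.json")

    p_judge = sub.add_parser("judge")
    p_judge.add_argument("published_json")
    p_judge.add_argument("solver_pack_dir")

    p_reveal = sub.add_parser("reveal")
    p_reveal.add_argument("published_json")
    p_reveal.add_argument("--out", dest="reveal_dir", required=True)

    args = parser.parse_args()
    # Dispatch to the corresponding pipeline steps (validate/publish/judge/reveal) here.
    print(f"command: {args.command}")

if __name__ == "__main__":
    main()
```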
---
## 12. Implementation Notes & Tradeoffs
### 12.1 Sandboxing in Python is hard
A pure-Python “restricted builtins” approach is not bulletproof against adversarial code. For early prototypes it is acceptable, but for public competitions consider process- or container-level sandboxes.
### 12.2 Keep the interface small
Favor `seq(n)` or `solver()` and fixed `N_check`. This reduces complexity and avoids ambiguous I/O contracts.
### 12.3 Treat all disclosure as generated artifacts
Trial 002 showed how easy it is to corrupt a problem by copying large integers manually. This platform must prevent that class of failure entirely.
---
## 13. Season Configuration (Recommended)
A `season.toml` / `config.json` can fix:
* allowed imports
* banned modules/builtins
* N_check, disclosure rule
* time/memory limits
* canonicalization policy
* judging thresholds (100/200)
The platform should embed the relevant configuration in `published.json.platform` for auditability.
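An illustrative `config.json` covering the fields above; key names and values are placeholders rather than a fixed schema.

```json
{
  "allowed_imports": ["sympy", "math", "fractions", "itertools"],
  "banned_builtins": ["open", "eval", "exec", "compile", "__import__", "input"],
  "N_check": 200,
  "disclosure_rule": "odd_first_50",
  "limits": {"wall_time_s": 1.0, "memory_mb": 512},
  "canonicalization": "utf8-lf-strip-trailing-blank-lines",
  "judging": {"stage_pass_prefix": 100, "reward_prefix": 200}
}
```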