Rules table
A rules table is a content-addressed list of additional patterns, conditional on feature sets. It captures what the static templates miss but real projects emit consistently.
File format
json
{
  "version": "sha256:72fd0c323cc1",
  "rules": [
    { "features": ["python"], "patterns": [".env"] },
    { "features": ["node"], "patterns": [".env.local"] },
    { "features": ["python", "docker"], "patterns": [".dockerignore.local"] }
  ]
}

- version is sha256(canonical_json(rules))[:12], prefixed with "sha256:". Edit the rules and the version changes deterministically.
- features are matched as a subset: a rule with ["python", "docker"] fires when both features are present in the fingerprint.
- patterns are sorted alphabetically inside each entry; entries are sorted by features.
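The version and subset-matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: canonical_json, table_version, and matching_patterns are hypothetical helpers, and the canonical form used here (sorted entries, sorted keys, compact separators) is an assumption, so the resulting hash is not expected to reproduce the version string shown in the example file.

```python
import hashlib
import json


def canonical_json(rules):
    # Assumption: sort patterns inside each entry, sort entries by features,
    # then serialize with sorted keys and compact separators.
    normalized = sorted(
        (
            {"features": list(r["features"]), "patterns": sorted(r["patterns"])}
            for r in rules
        ),
        key=lambda r: r["features"],
    )
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))


def table_version(rules):
    # First 12 hex characters of the SHA-256 digest, prefixed with "sha256:".
    digest = hashlib.sha256(canonical_json(rules).encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]


def matching_patterns(rules, fingerprint):
    # A rule fires when all of its features are present in the fingerprint.
    found = set()
    for rule in rules:
        if set(rule["features"]) <= fingerprint:
            found.update(rule["patterns"])
    return sorted(found)


rules = [
    {"features": ["python"], "patterns": [".env"]},
    {"features": ["node"], "patterns": [".env.local"]},
    {"features": ["python", "docker"], "patterns": [".dockerignore.local"]},
]

print(table_version(rules))
print(matching_patterns(rules, frozenset({"python", "docker"})))
```

Because the hash is taken over a normalized form, reordering the entries leaves the version unchanged, while adding or removing a pattern changes it.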
Loading
python
from pathlib import Path
from occam_gitignore_core import JsonRulesTable
rules = JsonRulesTable.from_file(Path("data/rules_table.json"))
rules.version() # "sha256:72fd0c323cc1"
rules.extras_for(frozenset({"python"})) # tuple of Rule(...)

Mining a new table
The occam-gitignore-training package mines a rules table from JSONL records — one record per repo — describing the files listed and the .gitignore rules the repo actually used. The pipeline:
- Fingerprint each record's file list (or use a declared feature set).
- Group records by feature.
- Single-feature rules: emit a pattern for feature F iff its support in F-bearing repos clears min_support and the pattern is not already covered by F's template.
- Pair rules: for feature pairs (A, B), emit a pattern iff:
  - support among {A,B} repos clears min_pair_support,
  - the same support is >= min_pair_lift × max(support_A, support_B),
  - it's not already emitted as a single-feature rule.
- Render the result with to_payload(...): content-addressed, sorted, stable.
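The single-feature and pair steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the occam-gitignore-training code: pattern_support and mine_rules are hypothetical names, records are taken as (declared feature set, patterns used) tuples so fingerprinting is skipped, templates is a plain dict from feature to the patterns its template already covers, and "already emitted as a single-feature rule" is read as "emitted for A or for B". The min_pair_support value is supplied by the caller because no default is stated here.

```python
from collections import Counter
from itertools import combinations


def pattern_support(records, required):
    # Fraction of repos carrying all `required` features that use each pattern.
    repos = [patterns for features, patterns in records if required <= features]
    if not repos:
        return {}
    counts = Counter(p for patterns in repos for p in patterns)
    return {p: c / len(repos) for p, c in counts.items()}


def mine_rules(records, templates, *, min_pair_support,
               min_support=0.5, min_pair_lift=1.5):
    features = sorted({f for feats, _ in records for f in feats})
    single = {f: pattern_support(records, frozenset({f})) for f in features}

    rules, emitted = [], {}
    # Single-feature rules: enough support, and not already in the template.
    for f in features:
        covered = templates.get(f, set())
        pats = sorted(p for p, s in single[f].items()
                      if s >= min_support and p not in covered)
        emitted[f] = set(pats)
        if pats:
            rules.append({"features": [f], "patterns": pats})

    # Pair rules: enough support among {A,B} repos, and enough lift over
    # the better of the two single-feature supports.
    for a, b in combinations(features, 2):
        pats = []
        for p, s in pattern_support(records, frozenset({a, b})).items():
            baseline = max(single[a].get(p, 0.0), single[b].get(p, 0.0))
            if (s >= min_pair_support
                    and s >= min_pair_lift * baseline
                    and p not in emitted[a] and p not in emitted[b]):
                pats.append(p)
        if pats:
            rules.append({"features": [a, b], "patterns": sorted(pats)})
    return rules


# Toy dataset: 3 python-only, 2 python+docker, 3 docker-only repos.
records = [
    (frozenset({"python"}), {".env", "__pycache__/"}),
    (frozenset({"python"}), {".env", "__pycache__/"}),
    (frozenset({"python"}), {".env", "__pycache__/"}),
    (frozenset({"python", "docker"}), {".dockerignore.local", "__pycache__/"}),
    (frozenset({"python", "docker"}), {".dockerignore.local", ".env", "__pycache__/"}),
    (frozenset({"docker"}), set()),
    (frozenset({"docker"}), set()),
    (frozenset({"docker"}), set()),
]
templates = {"python": {"__pycache__/"}, "docker": set()}

mined = mine_rules(records, templates, min_pair_support=0.5)
```

In this toy dataset, .env is emitted for python alone (support 4/5), __pycache__/ is dropped because the python template already covers it, and .dockerignore.local is emitted only for the pair: its support is 2/5 under either single feature but 2/2 among repos that have both.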
bash
uv run occam-gitignore-train mine \
  --records dataset.jsonl \
  --templates data/templates \
  --output data/rules_table.json

Why mining is conservative
MineConfig defaults are deliberately strict:
- min_support = 0.5 (a pattern must appear in at least half the repos)
- min_repos_per_feature = 2
- min_pair_lift = 1.5 (the pair must explain the pattern more than either feature alone)
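To make the lift threshold concrete, here is a small worked check. The function clears_pair_lift is a hypothetical helper written for illustration, and the support values are made up; only the 1.5 default comes from the list above.

```python
def clears_pair_lift(pair_support, support_a, support_b, min_pair_lift=1.5):
    # The pair's support must be at least min_pair_lift times the better
    # of the two single-feature supports.
    return pair_support >= min_pair_lift * max(support_a, support_b)


# Supports 0.4 and 0.3 alone: the pair needs at least 1.5 * 0.4 = 0.6.
strong_pair = clears_pair_lift(0.9, 0.4, 0.3)

# Supports 0.7 and 0.3 alone: the pair would need 1.5 * 0.7 = 1.05,
# which no support value can reach, so the pattern is never a pair rule.
weak_pair = clears_pair_lift(0.9, 0.7, 0.3)

print(strong_pair, weak_pair)
```

One consequence of the 1.5 default is that a pattern already present in more than two thirds of the repos for either feature can never qualify as a pair rule, since support cannot exceed 1.0.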
Occam: prefer not emitting over emitting noise. A noisy rules table burns precision in the benchmark and confuses users.