Creating the Puns 205 Dataset
Overview
puns_205.json is a curated dataset of 205 fill-in-the-blank pun sentences designed
for contextual cloze testing — measuring whether language models can be primed toward
straight or punny completions depending on the surrounding context.
Each entry contains a sentence with a blank (___) that admits both a literal (“straight”)
completion and a pun (“punny”) completion. The dataset includes model-derived completion
frequencies, classifications, and discriminativeness tiers that characterize how ambiguous
each joke is to language models.
The final benchmark artifact is contextual_cloze_tests_100.json: 100 contrastive test
prompts (50 pairs) where each pair tests the same target joke under straight-primed and
funny-primed context.
Stage 1: Raw Collection & Rating
Scripts: rawdata/analyze_jokes.py, rawdata/repair_jokes.py, rawdata/apply_edits.py
Starting from a pool of ~1,000 candidate cloze pun sentences (rawdata/raw_samples.json),
three LLM providers rated each joke on quality (0-10):
- Gemini — rated in batches of 25 via analyze_jokes.py
- Anthropic (Claude) — rated in the same batch pipeline
- Claude Code — additional rating pass for consensus scoring
Low-scoring jokes (avg < 7.0) were sent to Claude Sonnet for repair via repair_jokes.py,
which attempted to improve the pun while preserving the fill-in-the-blank structure.
Hand edits were applied via apply_edits.py to remove duplicates and fix sentence wording.
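As a rough illustration of the repair threshold, a minimal sketch follows; the actual layout of the three provider ratings across the rawdata/ files is not shown here, and the example values are illustrative only:

```python
def needs_repair(ratings: list[float], threshold: float = 7.0) -> bool:
    """Jokes whose average rating falls below 7.0 are routed to repair_jokes.py."""
    return sum(ratings) / len(ratings) < threshold

# Hypothetical merged view: one dict per joke holding its three provider ratings.
jokes = [
    {"id": 12, "ratings": [8, 7, 9]},  # avg 8.0 -> kept as-is
    {"id": 47, "ratings": [6, 5, 7]},  # avg 6.0 -> sent for repair
]
to_repair = [j["id"] for j in jokes if needs_repair(j["ratings"])]
print(to_repair)  # -> [47]
```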
Stage 2: Building puns_205.json
Script: rawdata/build_puns_205.py
The top 205 jokes were selected by merging ratings from all three providers:
- Merge top205_full.json with annotated_205_gemini.json and annotated_205_anthropic.json
- Compute consensus_score = gemini_rating + claudecode_rating
- Sort by consensus_score DESC, then anthropic_rating DESC, then sentence length DESC
- Assign sequential index (0-204) and preserve original id
Output fields at this stage: index, id, sentence, straight, punny, explanation,
gemini_rating, anthropic_rating, claudecode_rating, consensus_score.
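A minimal sketch of the ranking step (field names follow the puns_205.json schema below; the merge of the three annotation files is omitted):

```python
def rank_jokes(merged: list[dict]) -> list[dict]:
    """Sort by consensus_score DESC, then anthropic_rating DESC, then sentence length DESC."""
    for joke in merged:
        joke["consensus_score"] = joke["gemini_rating"] + joke["claudecode_rating"]
    ranked = sorted(
        merged,
        key=lambda j: (j["consensus_score"], j["anthropic_rating"], len(j["sentence"])),
        reverse=True,
    )
    # Assign sequential index (0-204) while preserving the original id field.
    for i, joke in enumerate(ranked):
        joke["index"] = i
    return ranked
```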
Stage 3: Completion Collection
Script: collect_completions.py (phase: collect) / annotate_completions.py (phase: collect)
For each of the 205 jokes, the sentence is truncated before ___ and sent to 5 Llama
models via Together.ai:
| Nickname | Model ID |
|---|---|
| 3b | meta-llama/Llama-3.2-3B-Instruct-Turbo |
| 8b | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo |
| 70b | meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo |
| 3.3-70b | meta-llama/Llama-3.3-70B-Instruct-Turbo |
| 405b | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo |
Per model per joke:
- 20 diverse samples at temperature=0.8, n=20
- 1 greedy sample at temperature=0
Total: 205 jokes x 5 models x (20+1) = 21,525 completions. The first word of each completion is extracted and lowercased, then aggregated into frequency counts across the 100 diverse samples per joke (5 models x 20).
Checkpoint: raw_completions.json
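A simplified sketch of the per-joke sampling, assuming Together.ai's OpenAI-compatible chat endpoint is used; the exact prompt wording, max_tokens, and client choice in collect_completions.py are assumptions:

```python
import os
from openai import OpenAI  # Together.ai exposes an OpenAI-compatible endpoint

client = OpenAI(api_key=os.environ["TOGETHER_API_KEY"],
                base_url="https://api.together.xyz/v1")

def sample_completions(model_id: str, sentence: str) -> list[str]:
    """Collect 20 diverse samples (temperature=0.8, n=20) plus 1 greedy sample for one joke."""
    prefix = sentence.split("___")[0].rstrip()  # truncate the sentence before the blank
    words: list[str] = []
    for temperature, n in [(0.8, 20), (0.0, 1)]:
        resp = client.chat.completions.create(
            model=model_id,
            # The prompt wording used by the actual script is an assumption.
            messages=[{"role": "user",
                       "content": f"Complete the sentence with a single word: {prefix}"}],
            temperature=temperature,
            n=n,
            max_tokens=5,
        )
        for choice in resp.choices:
            # Keep only the first word, lowercased, as described in the aggregation step.
            words.append(choice.message.content.strip().split()[0].lower().strip('.,!?"\''))
    return words
```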
Stage 4: Classification
Script: collect_completions.py (phase: classify) / annotate_completions.py (phase: classify)
For each joke, candidate words not already in the known straight or punny lists are
classified by Claude Sonnet into three categories:
- straight — a sensible literal completion without wordplay
- funny — exploits a double meaning or pun in context
- other — doesn’t fit well, is a filler word, or is nonsensical
Classification is done in batches of 10 jokes per API call. Known words are pre-classified: existing straight words → “straight”, existing punny words → “funny”.
Checkpoint: classified_completions.json
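A sketch of the pre-classification and batching logic; classify_batch_with_claude stands in for the actual Claude Sonnet request and is a hypothetical placeholder:

```python
def preclassify(joke: dict, candidates: list[str]) -> tuple[dict[str, str], list[str]]:
    """Known words are pre-classified: existing straight words -> "straight", punny -> "funny"."""
    known = {w: "straight" for w in joke["straight"]}
    known.update({w: "funny" for w in joke["punny"]})
    labels = {w: known[w] for w in candidates if w in known}
    unknown = [w for w in candidates if w not in known]
    return labels, unknown

def classify_batch_with_claude(batch: list[dict]) -> list[dict]:
    """Placeholder for the real Claude Sonnet request (one call per 10-joke batch)."""
    raise NotImplementedError

def classify_all(jokes: list[dict], candidates_by_index: dict[int, list[str]],
                 batch_size: int = 10) -> dict[int, dict[str, str]]:
    labels_by_index: dict[int, dict[str, str]] = {}
    pending = []  # jokes whose unknown candidates still need a Sonnet label
    for joke in jokes:
        labels, unknown = preclassify(joke, candidates_by_index[joke["index"]])
        labels_by_index[joke["index"]] = labels
        if unknown:
            pending.append({"index": joke["index"], "sentence": joke["sentence"], "words": unknown})
    for start in range(0, len(pending), batch_size):
        for result in classify_batch_with_claude(pending[start:start + batch_size]):
            labels_by_index[result["index"]].update(result["labels"])
    return labels_by_index
```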
Stage 5: Discriminative Ranking
Script: rank_discriminative.py / annotate_completions.py (phase: rank)
For each joke, the 100 model samples are tallied by classification to compute:
- p_straight = fraction of samples classified as straight
- p_funny = fraction classified as funny
- p_other = fraction classified as other
- balance = min(p_straight, p_funny) / max(p_straight, p_funny)
  - 1.0 = perfectly split between straight and funny
  - 0.0 = completely one-sided
Each joke is assigned a tier based on these thresholds:
| Tier | Condition | Count |
|---|---|---|
| balanced | balance >= 0.3 and p_funny >= 5% | 42 |
| leaning | 0.05 < balance < 0.3 and p_funny >= 5% | 51 |
| funny_dominated | p_funny > 50% and balance <= 0.05 | 24 |
| straight_dominated | p_funny < 5% (everything else) | 88 |
Per-model funny rates are also computed, showing which models are more likely to produce pun completions.
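A sketch of the tally and tier assignment, encoding the thresholds from the table above; the precedence among the tier conditions is an assumption where the table leaves it implicit:

```python
from collections import Counter

def assign_tier(labels: list[str]) -> dict:
    """Compute p_straight / p_funny / p_other, balance, and the cloze tier for one joke."""
    counts = Counter(labels)  # labels are "straight", "funny", or "other", one per sample
    total = len(labels)       # 100 diverse samples per joke
    p_straight = counts["straight"] / total
    p_funny = counts["funny"] / total
    p_other = counts["other"] / total
    top = max(p_straight, p_funny)
    balance = min(p_straight, p_funny) / top if top > 0 else 0.0

    if p_funny < 0.05:
        tier = "straight_dominated"
    elif balance >= 0.3:
        tier = "balanced"
    elif balance > 0.05:
        tier = "leaning"
    elif p_funny > 0.5:
        tier = "funny_dominated"
    else:
        tier = "straight_dominated"  # the table's "everything else" catch-all
    return {"p_straight": p_straight, "p_funny": p_funny, "p_other": p_other,
            "balance": balance, "tier": tier}
```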
Stage 6: Contrastive Test Construction
Script: build_cloze_tests.py
The test builder uses tier-aware triple selection designed so that target jokes (C) are ones where models naturally default to the straight completion, while context jokes (A, B) provide clear pun/straight contrasts for priming:
1. Partition 205 jokes by cloze_tier
2. C_pool (targets): 50 from 88 straight_dominated jokes
- Models default to straight completions on these jokes
- Funny context must "unlock" the pun — measuring true context sensitivity
3. AB_pool (context): 100 from 131 candidates
(42 balanced + 51 leaning + 38 remaining straight_dominated)
- These jokes have clear straight/funny word pairs for effective priming
4. Exclude 24 funny_dominated (always puns, poor context contrast)
5. Shuffle both pools deterministically (seed=42)
6. Form 50 triples: A = AB[2i], B = AB[2i+1], C = C_pool[i]
7. For each triple, emit 2 tests (straight-primed and funny-primed)
Prefill logic:
- Straight prefill: joke["straight"][0] (highest-frequency straight word)
- Funny prefill: joke["punny"][0] (highest-frequency pun word)
- Context: fill(A, word) + " " + fill(B, word) + " " + truncate(C)
The 24 funny_dominated jokes produce puns regardless of context, so they provide no contrast and are excluded. The remaining 31 jokes from the AB candidate pool (131 - 100 = 31) are also dropped to keep the context pool at exactly 100.
Output: contextual_cloze_tests_100.json (100 entries = 50 pairs x 2)
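A condensed sketch of the triple formation and prompt construction, assuming fill and truncate helpers like those named in the prefill logic; the exact concatenation and field handling in build_cloze_tests.py may differ:

```python
import random

def fill(joke: dict, kind: str) -> str:
    """Replace the blank with the joke's highest-frequency straight or punny word."""
    return joke["sentence"].replace("___", joke[kind][0])

def truncate(joke: dict) -> str:
    """Cut the target sentence just before the blank."""
    return joke["sentence"].split("___")[0].rstrip()

def build_tests(ab_pool: list[dict], c_pool: list[dict]) -> list[dict]:
    ab, targets = ab_pool[:], c_pool[:]
    rng = random.Random(42)   # deterministic shuffle, seed=42
    rng.shuffle(ab)
    rng.shuffle(targets)
    tests = []
    for i, c in enumerate(targets[:50]):
        a, b = ab[2 * i], ab[2 * i + 1]
        for kind, label in (("straight", "straight"), ("punny", "funny")):
            tests.append({
                "pair_id": i,
                "type": label,
                "prompt": f"{fill(a, kind)} {fill(b, kind)} {truncate(c)}",
                "expected_completion": c[kind],
                "contrast_completion": c["punny" if kind == "straight" else "straight"],
            })
    return tests  # 50 pairs x 2 = 100 test entries
```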
Dataset Schema
puns_205.json
Each of the 205 entries contains:
| Field | Type | Description |
|---|---|---|
| index | int | Sequential position (0-204) |
| id | int | Original joke ID from raw collection |
| sentence | string | Fill-in-the-blank sentence with ___ |
| straight | string[] | Non-pun completions, ordered by model frequency |
| punny | string[] | Pun completions, ordered by model frequency |
| explanation | string | Why the pun works (from initial rating) |
| gemini_rating | int | Gemini quality score (0-10) |
| anthropic_rating | int | Anthropic quality score (0-10) |
| claudecode_rating | int | Claude Code quality score (0-10) |
| consensus_score | int | gemini_rating + claudecode_rating |
| cloze_balance | float | Discriminativeness: min(p_str, p_fun) / max(p_str, p_fun) |
| cloze_p_straight | float | Fraction of model samples classified straight |
| cloze_p_funny | float | Fraction classified funny |
| cloze_p_other | float | Fraction classified other |
| cloze_tier | string | One of: balanced, leaning, funny_dominated, straight_dominated |
| cloze_model_funny_rate | object | Per-model funny completion rates {3b, 8b, 70b, 3.3-70b, 405b} |
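A short usage sketch, assuming puns_205.json is a top-level JSON array of these entries:

```python
import json
from collections import Counter

with open("puns_205.json") as f:
    puns = json.load(f)

# Tier distribution across the 205 jokes.
print(Counter(p["cloze_tier"] for p in puns))

# The most evenly split joke in the balanced tier, with its top completions.
most_balanced = max((p for p in puns if p["cloze_tier"] == "balanced"),
                    key=lambda p: p["cloze_balance"])
print(most_balanced["sentence"], most_balanced["straight"][0], most_balanced["punny"][0])
```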
contextual_cloze_tests_100.json
Each of the 100 test entries contains:
| Field | Type | Description |
|---|---|---|
| pair_id | int | Which of 50 pairs (0-49) |
| type | string | “straight” or “funny” — the priming context type |
| prompt | string | Filled A + filled B + truncated C |
| expected_completion | string[] | Completions matching the priming direction |
| contrast_completion | string[] | Completions for the opposite direction |
| joke_a_index | int | Index of context joke A in puns_205.json |
| joke_b_index | int | Index of context joke B in puns_205.json |
| joke_c_index | int | Index of target joke C in puns_205.json |
| joke_c_sentence | string | Full sentence of joke C (with ___) |
| joke_a_tier | string | Cloze tier of joke A |
| joke_b_tier | string | Cloze tier of joke B |
| joke_c_tier | string | Cloze tier of joke C |
| joke_c_balance | float | Discriminativeness balance of joke C |
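A minimal evaluation sketch against this file; get_first_word is a hypothetical helper, not part of the dataset scripts:

```python
import json

def get_first_word(model: str, prompt: str) -> str:
    """Hypothetical helper: query the model and return its first completion word, lowercased."""
    raise NotImplementedError

def evaluate(model: str, path: str = "contextual_cloze_tests_100.json") -> float:
    """Fraction of the 100 tests where the model follows the priming direction."""
    with open(path) as f:
        tests = json.load(f)
    hits = 0
    for test in tests:
        word = get_first_word(model, test["prompt"])
        if word in [w.lower() for w in test["expected_completion"]]:
            hits += 1
    return hits / len(tests)
```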
Tier Distribution
- balanced: 42 jokes (balance >= 0.3, p_funny >= 5%)
- leaning: 51 jokes (0.05 < balance < 0.3, p_funny >= 5%)
- straight_dominated: 88 jokes (p_funny < 5%)
- funny_dominated: 24 jokes (p_funny > 50%, balance <= 0.05)

Total: 205 jokes
In the contrastive tests:
- C pool (targets): 50 straight_dominated jokes (models default to straight completions)
- AB pool (context): 100 from 42 balanced + 51 leaning + 38 straight_dominated remainder
- Excluded: 24 funny_dominated + 31 overflow from AB candidates = 55 jokes