Voice typing for prompt engineering on Mac

THE CASE

Prompt engineering is voice-density work

Halopen is a native macOS dictation app for prompt engineers — hold the function key, speak the system prompt, the eval rubric, or the long-context spec, release; the verbatim text lands in the Anthropic Console, the OpenAI Playground, the ChatGPT and Claude desktop apps, Cursor’s chat panel, or any markdown file where a prompt lives. Cursor-context biasing handles model names like claude-opus-4-7 and snake_case hyperparameters in the right idiom.

Prompt engineering is one of the rare disciplines where the artifact and the process both reward more words. A good system prompt has more rules than a bad one. A good eval rubric names more dimensions and more pass/fail conditions than a bad one. A good long-context prompt names more constraints, more format requirements, and more negative instructions than a bad one. Density is the craft. The prompts that work in production are thousands of words; the rubrics that ship are dozens of paragraphs; the long-context prompts include the document, the task, the format, and the prohibition on every category of error the model has been seen to make.

That density costs something to write. Every additional rule in a system prompt is a sentence the engineer had to type. Every additional eval dimension is a paragraph. Every negative constraint in a long-context prompt is a clause that competes with the next clause for keyboard time. The friction shows up most clearly on the third revision: by the time you’re rewriting the system prompt for the third time, the marginal cost of a typed rule starts to feel like enough to skip the rule entirely. The model then makes the error the skipped rule would have prevented, and you add the rule back. The loop is the skipped rule, the typing tax, and the regression.

Voice removes the typing tax. Speaking the third revision of a prompt is no harder than speaking the first: the rules accrete, the rubric gets the dimension that catches the failure mode, and the long-context prompt gets the negative constraint the model needed. The artifact gets denser without the writer paying the density tax. That’s the wedge for prompt engineering specifically: Halopen lets the prompts that ship be the prompts the engineer would have written if writing were free, and prompt engineering rewards exactly those prompts.

SYSTEM PROMPTS

How to dictate a system prompt

System prompts have predictable structure. The role definition opens. The rules follow. The output format gets named. The examples land at the bottom. The negative constraints — the things the model should never do — sit between the rules and the examples or at the very end as a final guardrail.

Spoken, that structure becomes a paragraph rhythm. The role definition is one breath. The rules are a list of breaths, one rule per breath. The output format is one breath. The examples are one breath each. The negatives are one breath each. The whole prompt arrives at the cursor in the same shape the structure already has.

Real Halopen session

Dictating a system prompt for an LLM-backed customer support classifier. Anthropic Console open in Safari, system-prompt input field focused. fn key down, multi-paragraph take with intentional pauses.

"You are a customer support classifier for Halopen — a native Mac dictation app. Each input is one customer email or chat message. Your job is to classify the message into exactly one of these categories: pricing, install, accuracy, privacy, refund, feature_request, bug_report, other.

Rules. Read the entire message before classifying. Prefer the most specific applicable category. If the message names a pricing tier and asks a non-pricing question, classify by the question, not the tier. If the message describes a transcription mistake, classify as accuracy unless the customer is asking for their money back, which is refund. Trial-related questions are pricing. Hotkey conflicts are bug_report.

Output format. Return a single JSON object with two keys: category (one of the eight strings above) and confidence (a float from 0 to 1). No explanation, no markdown, no surrounding text. The output is parsed directly.

Negatives. Do not invent a category not in the list. Do not return multiple categories. Do not return a confidence below 0.2 — if the message is too ambiguous to classify above 0.2, classify as other with the appropriate low confidence."

· The full prompt is roughly 230 words. Spoken, it took about ninety seconds in two takes (one for role + rules, one for output + negatives).
· The eight category names land verbatim. Halopen's cursor-context biasing reads the existing prompt content and prefers the snake_case identifiers already in the buffer.
· The negative constraints ("do not invent a category") are the half that typed versions skip. Spoken, they survive.
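
The output contract in the prompt above is strict enough to check mechanically. What follows is a minimal Python sketch of such a check, assuming the caller already holds the model's raw reply as a string. The function name `parse_classification` and the constant names are illustrative only; they are not part of Halopen or of any SDK.

```python
import json

# Category list copied from the dictated system prompt above.
CATEGORIES = {
    "pricing", "install", "accuracy", "privacy",
    "refund", "feature_request", "bug_report", "other",
}
MIN_CONFIDENCE = 0.2  # floor stated in the "Negatives" section


def parse_classification(raw: str) -> dict:
    """Parse one model reply and enforce the stated output contract.

    Raises ValueError when the reply breaks any rule from the prompt.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers markdown fences, explanations, or any surrounding text.
        raise ValueError(f"reply is not bare JSON: {exc}") from exc

    if not isinstance(data, dict) or set(data) != {"category", "confidence"}:
        raise ValueError("expected exactly the keys 'category' and 'confidence'")

    category, confidence = data["category"], data["confidence"]
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"category not in the list: {category!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not MIN_CONFIDENCE <= confidence <= 1:
        raise ValueError(f"confidence out of range: {confidence}")
    return {"category": category, "confidence": float(confidence)}
```

Each check maps to one spoken rule, so a rejected reply points back at the clause the model ignored.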

EVAL RUBRICS

How to dictate an eval rubric

Eval rubrics have one structure repeated per dimension: dimension name, what passing looks like, what failing looks like, an example of each. The repetition is the surface tax — typing the same structure five times for five dimensions is exactly the work that prompt engineers leave half-finished under typing cost. Voice closes the gap.

The pattern is dimension + pass criterion + fail criterion + example pair. Each dimension is one short block of paragraphs, about ninety seconds spoken, so a five-dimension rubric comes to under ten minutes.
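
One way to keep that repeated structure honest is to treat each dimension as a small record. The Python sketch below is an illustration under assumptions: the class name `RubricDimension`, its field names, and the helper `missing_parts` are hypothetical, and the rendered layout simply mirrors the spoken labels used in the session that follows.

```python
from dataclasses import dataclass


@dataclass
class RubricDimension:
    """One rubric dimension, with the example pair as required fields."""
    name: str
    pass_criterion: str
    fail_criterion: str
    example_pass: str
    example_fail: str

    def render(self) -> str:
        """Render the dimension in spoken order, one paragraph per part."""
        parts = [
            ("Dimension", self.name),
            ("Pass", self.pass_criterion),
            ("Fail", self.fail_criterion),
            ("Example pass", self.example_pass),
            ("Example fail", self.example_fail),
        ]
        return "\n\n".join(f"{label}. {text}" for label, text in parts)


def missing_parts(dim: RubricDimension) -> list[str]:
    """Name the fields left blank, for instance a skipped example pair."""
    return [field for field, value in vars(dim).items() if not value.strip()]
```

Running `missing_parts` over every dimension before an eval run is a cheap way to spot the rubric entries that were left without examples.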

Real Halopen session

Building an eval rubric for transcription quality on the Halopen edge function. Markdown file open in Cursor, fn key down for one of the five dimensions.

"Dimension. Verbatim preservation of contractions and intensifiers.

Pass. The transcript preserves contractions exactly as spoken — 'we're', 'don't', 'it's', 'I'd' — and preserves intensifiers — 'really', 'actually', 'just', 'literally' — without dropping or substituting. Spoken 'we're not adding queso fresco' arrives as 'we're not adding queso fresco', not 'we are not adding queso fresco' or 'we're not adding cheese'.

Fail. The transcript expands contractions ('we're' becoming 'we are'), drops intensifiers ('we're really not adding queso fresco' becoming 'we're not adding queso fresco'), or substitutes regional vocabulary ('queso fresco' becoming 'cheese').

Example pass. Input audio: 'Bro, we're really not adding queso fresco, one kinda cheese.' Transcript: 'Bro, we're really not adding queso fresco, one kinda cheese.' All contractions preserved, all intensifiers preserved, all regional vocabulary preserved.

Example fail. Input audio: 'Bro, we're really not adding queso fresco.' Transcript: 'We are not adding cheese.' Three failures: contraction expanded, intensifier dropped, regional vocabulary substituted."

· Roughly 220 words for one dimension. Spoken, ninety seconds; typed, four to five minutes if the engineer types the example pair instead of skipping it.
· The example pair is the half that typed versions skip. Without the example pair, the rubric is interpretation-dependent. With it, the rubric is mechanical to apply.
· Five dimensions follow the same structure. The whole rubric arrives in about ten minutes spoken. Typed, it would take an afternoon and the example pairs would mostly be missing.

LONG-CONTEXT PROMPTS

How to dictate long-context prompts

Long-context prompts pair a document (the context) with a task (the prompt). The document is usually pasted; the prompt is usually typed. The prompt is where the constraint clauses live, and where typing cost compresses the prompt away from its full form. Voice keeps the prompt in its full form.

The pattern is task statement + format requirement + constraint list + negative constraints + tie-breaker. The tie-breaker is the clause that tells the model what to do when two interpretations of a constraint are both valid. It’s the part that typed long-context prompts most often skip; spoken, it costs nothing to include.
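
The same pattern can be written down as a small assembler that refuses to build a prompt without a tie-breaker. This is a hedged Python sketch, not a prescribed format: the function name `build_long_context_prompt`, its parameters, and the section labels are assumptions chosen to match the spoken labels in the session below.

```python
def build_long_context_prompt(
    document: str,
    task: str,
    output_format: str,
    constraints: list[str],
    negatives: list[str],
    tie_breaker: str,
) -> str:
    """Join the pasted document and the task spec in the pattern's order."""
    # Design choice for this sketch: the tie-breaker is mandatory,
    # because it is the clause most likely to be skipped.
    if not tie_breaker.strip():
        raise ValueError("tie-breaker is empty")

    sections = [document.strip(), task.strip(), "Format. " + output_format.strip()]
    # Constraint lists are optional; empty ones are left out entirely.
    if constraints:
        sections.append("Constraints. " + " ".join(c.strip() for c in constraints))
    if negatives:
        sections.append("Negatives. " + " ".join(n.strip() for n in negatives))
    sections.append("Tie-breaker. " + tie_breaker.strip())
    return "\n\n".join(sections)
```

The document goes first so that a task statement such as "the emails above" still reads correctly.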

Real Halopen session

Long-context prompt for summarizing customer support emails into a weekly digest. Pasted ten emails as context, fn key down for the task spec.

"Summarize the ten customer emails above into a single weekly digest in the format below.

Format. Three sections: Pricing questions, Bug reports, Feature requests. Under each section, a bulleted list of one-sentence summaries. Under each bullet, a sub-bullet with the customer email date and the message word count.

Constraints. Group near-duplicate questions under one bullet, with the count of customers who asked it. Preserve any customer-named technical detail verbatim — model names, terminal apps, hotkey choices. Do not include personally identifying information (name, email address) in the digest.

Tie-breaker. If a single email touches two sections, prefer the section the customer's primary ask falls under. If the primary ask is genuinely ambiguous, place it under Feature requests rather than Bug reports: bug reports require an actual reproducible failure, not just an absence of expected behavior."

· Roughly 180 words. Spoken, sixty seconds. The tie-breaker clause runs the last twenty seconds; typing it usually feels like over-engineering and gets skipped.
· The model produced a digest that respected the format, the constraints, and the tie-breaker on the first run. Without the tie-breaker, the second run would have been a correction round about which section the ambiguous email belonged in.

PROMPT ITERATION

The third-revision multiplier

The single biggest place voice pulls ahead in prompt engineering is the third revision. The first version of a prompt costs roughly the same to type or speak; the writer is composing the structure from scratch and most of the time goes into thinking, not into pressing keys. The second version is where typing starts to feel slow — every paragraph already exists in the previous version, and rewriting it just to add one new constraint feels disproportionate. The third version is where typed iteration usually compresses: the writer rewrites the changes as deltas, the prompt drifts toward incoherence, and the model’s behavior gets unpredictable in a way that’s hard to attribute to any single revision.

Voice keeps the third revision coherent. Each revision is a fresh take; speaking the prompt from scratch with the new constraint added is the same effort as speaking the first version was. The prompt that ships at revision twelve is the prompt the engineer would have written at revision twelve if writing were free. The constraint clauses that accumulated through eleven revisions all survive. The model reads the dense version.

The same multiplier shows up in eval rubrics. The first dimension is composition; the fifth dimension is repetition; the tenth dimension is repetition the writer didn’t want to do typed. Spoken, the tenth dimension costs the same as the first. The rubric ends up with ten dimensions instead of three, and the eval catches more failure modes because more dimensions are scored.

WHERE HALOPEN LANDS PROMPTS

Every prompt-input surface

Halopen lands text at the cursor through the macOS Accessibility API. Every prompt-input surface on Mac is fair game:

The Anthropic Console system-prompt and user-prompt inputs in Safari, Chrome, Arc, or any Mac browser.
The OpenAI Playground prompt inputs in Safari, Chrome, Arc, or any Mac browser.
The ChatGPT chat input — both web and Mac-native desktop app.
The Claude desktop app’s chat input.
The chat panel in Cursor for prompt-iteration sessions against the model directly.
The terminal where you run Claude Code or Aider for the agent-prompt patterns.
The markdown file in your editor where the system prompt lives in source control.
The eval framework’s rubric input — yaml, json, or markdown — wherever it’s edited.

The full directory of Mac AI coding surfaces Halopen lands prompts in sits at the hub. The pattern across all of them is the same: the input is a standard Mac text field; Halopen lands the verbatim text at the cursor; the agent or the model reads what was said.

UNDER THE HOOD

Why this works for technical vocabulary

Prompt engineering’s vocabulary is brutal: model names like claude-opus-4-7-20260201, hyperparameter names in snake_case, JSON Schema keys, OpenAPI fragments, regex patterns, the names of every tool the model can call. Halopen handles them through cursor-context biasing — the transcription engine reads the buffer of the current input and prefers tokens that match the casing and structure of the existing text. Model names already in the prompt land verbatim on the next dictation; new model names get caught by the live partial and spelled out in flight; snake_case stays snake_case; PascalCase stays PascalCase.
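
Halopen's biasing happens inside its transcription engine, and its internals are not described here. Purely as an illustration of the idea, the toy Python post-processor below collects identifier-like tokens from the buffer and snaps matching runs of spoken words onto them. The regex, the function names, and the `max_words` parameter are all assumptions made for this sketch.

```python
import re

# Identifier-like tokens joined by "_" or "-", e.g. bug_report, gpt-4o-mini.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)+")


def _key(text: str) -> str:
    """Reduce text to lowercase letters and digits for comparison."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def bias_to_buffer(transcript: str, buffer: str, max_words: int = 4) -> str:
    """Rewrite word runs in the transcript that match buffer identifiers."""
    known = {_key(tok): tok for tok in IDENTIFIER.findall(buffer)}
    words = transcript.split()
    out, i = [], 0
    while i < len(words):
        # Try the longest run of words first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            span = " ".join(words[i:i + n])
            core = span.rstrip(".,;:!?")
            if re.search(r"[^\w\s-]", core):
                continue  # punctuation inside the run: not one identifier
            match = known.get(_key(core))
            if match:
                out.append(match + span[len(core):])  # keep trailing punctuation
                i += n
                break
        else:
            out.append(words[i])  # no identifier starts at this word
            i += 1
    return " ".join(out)
```

With a buffer that already contains `feature_request` and `bug_report`, the spoken phrase "bug report, not feature request." is rewritten with both identifiers restored; words with no match pass through unchanged.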

The full engineering walk-through — AVAudioEngine, CGEventTap, SFSpeechRecognizer, the transcription edge function, the live partial self-correction — is in the Claude Code flagship. The privacy posture — audio held only while the function key is held, never retained, every cloud call in the local audit log — is in the verbatim manifesto. The set of /for/ entities lists every surface this works in.

QUESTIONS

Frequently asked questions

What's the best Mac dictation app for prompt engineering?

Halopen. Prompt engineering is voice-density work — system prompts run thousands of words, eval rubrics need explicit pass/fail criteria for every dimension, long-context prompts include the document plus the constraint plus the format plus the negative constraints. Typing all of that compresses every clause; speaking it preserves every clause. Halopen lands the verbatim text at the cursor in any prompt-input field on Mac.

How do I dictate system prompts on Mac?

Open the playground, the API doc tester, or the editor where the system prompt lives. Hold the function key, speak the role definition, the rules, the output format, the constraints, the examples — release. The full system prompt lands at the cursor verbatim. Halopen's hold-to-talk window is generous enough for a multi-section prompt; release between sections and hold again to continue from the cursor.

Can I voice-type prompts to ChatGPT and Claude on Mac?

Yes. The chat input on chat.openai.com, on console.anthropic.com, and in any LLM playground is a standard Mac text field; Halopen lands the prompt at the cursor on release. Same hotkey, same behavior, no per-platform configuration. The Mac-native ChatGPT desktop app and the Claude desktop app both expose their text inputs the same way.

Is voice typing better for prompt iteration than typing?

Yes — and the multiplier widens with revision count. The first version of a prompt is similar effort either way; the third version is where voice pulls ahead, because every revision lets you say the constraint clause that the previous version was missing. Typing the third revision recompresses the prompt; speaking the third revision keeps adding precision.

How do I dictate an eval rubric on Mac?

Speak the rubric in its natural structure — the dimension being scored, the rule for a passing score, the rule for a failing score, the example of each. Halopen's hold-to-talk window covers a multi-dimension rubric in one continuous take. Pauses become sentence breaks; longer pauses become paragraph breaks. The rubric arrives at the cursor in the format eval frameworks read.

Does Halopen handle technical vocabulary in prompt engineering — model names, hyperparameters, API field names?

Yes. Cursor-context biasing reads the tokens already in your prompt buffer and prefers technical vocabulary around the cursor. Model names like claude-opus-4-7 and gpt-4o-mini land verbatim. Hyperparameter names — temperature, top_p, max_tokens — land in the right snake_case. For unusual identifiers, the live partial catches the misread and you spell it out in flight.

Will my prompts be sent somewhere I don't expect?

No. Audio streams to Halopen's transcription edge function only while you hold the function key, only for the seconds you're holding, and is not retained after the transcript returns. The text Halopen lands at your cursor is the only output that comes back to your machine. The audit log on disk shows every cloud call by timestamp and byte count. Your prompts go where you send them (to ChatGPT, to Claude, to whichever model) and nowhere else.

Open the playground. Hold the function key. Speak the system prompt the way you’d brief a coworker.

The third revision is the one that ships. Voice is the modality that keeps the third revision coherent. The dense prompt, the multi-dimension rubric, the long-context spec with the tie-breaker clause — all of them are the artifact prompt engineering rewards, and all of them compress under typing cost. Halopen lands the dense version at the cursor.
