PR Notes · Reviewer Context
extract_screen_text: active-display OCR for the Altic MCP server

New MCP tool that captures the active display and returns its visible text via macOS Vision OCR,
with an optional macOS 27 Foundation Models visual-summary mode. Ported from the sibling branch
feat/extract-screen-text with a Swift compile-bug fix and the empty package.json cruft dropped.
The altic-studio skill and README are updated so the model knows when and how to use it.

The server could already screenshot the active display (capture_active_screen) but could not
read the text on it. This PR adds extract_screen_text, which captures the display containing the
frontmost app, runs VNRecognizeTextRequest (Vision) OCR over it, and returns structured JSON
(text, line count, average confidence, screenshot path). When include_visual_summary=true, it
additionally asks a macOS 27 Foundation Models language model to describe the screen — gracefully degrading to
OCR-only with a visual_error when that capability is unavailable.

| Aspect | Value | Notes |
|---|---|---|
| Default mode | OCR only | Fast, deterministic, no FM dependency |
| Opt-in mode | + Visual summary | macOS 27 Foundation Models |
| Return type | JSON string | Matches house _json/_error style |

Provenance — this is a port, not a fresh write
A complete working implementation already existed on the sibling branch feat/extract-screen-text
(its closed PR was #5). The Python wrapper and tests were
copied verbatim; the Swift helper was copied with one bug fix. Reviewers comparing against that branch should
focus on the two intentional deviations below.

| Decision | Rationale | Alternative rejected |
|---|---|---|
| Port from feat/extract-screen-text | A proven, test-covered implementation already existed; re-deriving it risked drift and wasted effort. | Write a fresh implementation on this branch. |
| Fix undefined OCRTool() in the Swift FM path | The reference used LanguageModelSession(model: model, tools: [OCRTool()]), but OCRTool is defined nowhere. On a macOS 27 toolchain (where FoundationModels imports), this fails to compile. Changed to LanguageModelSession(model: model). | Leave as-is (confirmed to break swiftc on this exact machine). |
| Ship OCR + visual summary | User chose full parity with the reference over an OCR-only subset. | Drop FM summary, visual_prompt, and visual_* fields. |
| Skip package.json / package-lock.json | This is a Python/uv project; the reference branch's empty JS manifests are accidental cruft. | Mirror the reference branch exactly. |
| Python-side truncation (max_chars, default 20000, capped 200000) | Keeps the Swift helper simple (always returns full text) and bounds payload size at the tool boundary. | Truncate inside Swift. |
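
The Python-side truncation decision can be illustrated with a minimal sketch. It is not the code in tools/screen_text.py: the function name `truncate_payload` is hypothetical, and clamping the limit stands in for the Field(ge=1, le=200000) validation that server.py applies. Only the field names and the 20000 / 200000 limits come from these notes.

```python
DEFAULT_MAX_CHARS = 20_000   # default stated in these notes
MAX_CHARS_CAP = 200_000      # upper bound stated in these notes


def truncate_payload(helper_payload: dict, max_chars: int = DEFAULT_MAX_CHARS) -> dict:
    """Hypothetical helper: bound the OCR text at the tool boundary."""
    # Assumption: clamp into [1, MAX_CHARS_CAP]; the real server validates instead.
    limit = max(1, min(max_chars, MAX_CHARS_CAP))
    full_text = helper_payload.get("text", "")
    served = full_text[:limit]
    return {
        **helper_payload,
        "text": served,
        "length_chars": len(full_text),        # full length before any cut
        "returned_length_chars": len(served),  # length actually served
        # True if Python cut the text OR the helper already flagged truncation.
        "truncated": len(served) < len(full_text)
        or bool(helper_payload.get("truncated", False)),
    }
```

Keeping the cut in Python means the Swift helper never needs to know about the limit, which matches the rationale in the table.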

| File | State | What & why |
|---|---|---|
| tools/screen_text.py | NEW | Python wrapper: resolves output path, runs Swift helper (timeout=90), parses JSON, truncates text, normalizes payload. Mirrors tools/screenshot.py / clipboard.py conventions. |
| tools/scripts/extract-screen-text.swift | NEW | ScreenCaptureKit capture + Vision OCR + gated macOS 27 FM summary. Contains the OCRTool() fix. |
| skills/altic-studio/scripts/extract-screen-text.swift | NEW | Byte-identical mirror of the fixed Swift helper (the skill keeps its own script copies). |
| tests/test_screen_text.py | NEW | 7 tests: invocation shape, visual-option passthrough, truncation, FM-unavailable, subprocess error, invalid JSON, server registration. |
| server.py | MOD | Added screen_text import + @mcp.tool() extract_screen_text(...) registration after capture_active_screen. |
| skills/altic-studio/SKILL.md | MOD | New Mode B2 section, capability/tool-list/command-template entries, operational rule, permissions. |
| skills/altic-studio/scripts/README.md | MOD | Example invocation line for the new Swift script. |
| README.md | MOD | Feature bullet, skill listing, Screen Recording + macOS 27 permission notes, smoke-test section. |

The two Swift files are kept identical on purpose; a reviewer change to one must be applied to both.

Visual summary mode (include_visual_summary=true)

| Condition | Behavior | Result fields |
|---|---|---|
| macOS 27 + FoundationModels available | LanguageModelSession summarizes the captured image | visual_summary, visual_model_available=true, visual_model_source |
| < macOS 27 or FM modules absent | Degrade to OCR-only, no throw | visual_model_available=false, visual_error explains why |
| FM call throws at runtime | Caught; OCR text still returned | visual_error carries the localized error |
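
The three rows above can be read as a small decision procedure on the returned JSON. The sketch below is one way a caller might classify a result. The function name and the returned labels are hypothetical, and it assumes the visual_* keys are simply absent when no summary was requested. Only the field names come from the table.

```python
import json


def describe_visual_outcome(result_json: str) -> str:
    """Hypothetical caller-side helper: classify the visual-summary outcome."""
    payload = json.loads(result_json)
    if payload.get("visual_summary"):
        return "summary"          # FM produced a description of the screen
    if payload.get("visual_model_available") is False:
        return "unavailable"      # degraded to OCR-only; visual_error explains why
    if payload.get("visual_error"):
        return "runtime_error"    # FM call threw; OCR text is still present
    return "ocr_only"             # assumption: no visual_* keys means not requested
```

In every branch the OCR `text` field is still usable, which is the point of the graceful-degradation design.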

| Check | Result | Detail |
|---|---|---|
| pytest tests/test_screen_text.py -q | 7 passed | All new-tool unit tests, subprocess mocked |
| pytest -q (full suite) | 39 passed | No regressions across the repo |
| import server registration | True | 'extract_screen_text' in mcp._tool_manager._tools |
| swiftc -typecheck extract-screen-text.swift | exit 0 | Run on macOS 27.0, so the FoundationModels branch was compiled, which directly confirms the OCRTool() fix |
| End-to-end on real screen | not run | Requires interactively granting Screen Recording permission; documented in README smoke tests |

    # targeted
    $ uv run pytest tests/test_screen_text.py -q
    ....... [100%]
    7 passed in 0.63s

    # full suite
    $ uv run pytest -q
    ....................................... [100%]
    39 passed in 0.43s

    # macOS 27 — FoundationModels path compiled, OCRTool fix confirmed
    $ swiftc -typecheck tools/scripts/extract-screen-text.swift
    typecheck exit: 0
    $ sw_vers | grep ProductVersion
    ProductVersion: 27.0

Known uncertainty
No live capture/OCR run was performed (needs Screen Recording grant + a foreground window with text), and the Foundation Models runtime summary path was not executed — only type-checked. Visual-summary output quality is therefore unverified.

| Area | Risk | Why look here |
|---|---|---|
| Swift FM path (generateFoundationVisualSummary) | medium | The deviation from the reference; verify the LanguageModelSession(model:) API and Attachment(imageURL:) usage against the macOS 27 SDK you target. |
| Two identical Swift copies | low | No automated check enforces parity; future edits must touch both tools/scripts/ and skills/altic-studio/scripts/. |
| subprocess timeout (90s) | low | FM summarization can be slow; confirm 90s is enough on cold start, or surface a clearer timeout error. |
| Truncation semantics | low | truncated reflects Python-side cut OR a Swift-provided flag; length_chars is the full length, returned_length_chars the served length. |
| Best reviewer entry point | n/a | Start at tools/screen_text.py (contract) → server.py (signature/limits) → Swift helper (capture/OCR/FM). |
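
The parity risk in the table ("no automated check enforces parity") could be closed with a small hash comparison. This PR does not include such a check. The sketch below is a suggestion only, and the helper names `file_digest` and `copies_match` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copies_match(primary: Path, mirror: Path) -> bool:
    """True when both files are byte-identical (compared by digest)."""
    return file_digest(primary) == file_digest(mirror)
```

A pytest case could then assert `copies_match(Path("tools/scripts/extract-screen-text.swift"), Path("skills/altic-studio/scripts/extract-screen-text.swift"))`, so an edit to only one copy fails the suite.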
# Plan: Add extract_screen_text MCP tool + update the skill
## Context
The repo (altic-mcp, a Python 3.13 / FastMCP server for macOS automation) can capture
the active display as an image (capture_active_screen) but cannot read the text on screen.
We want a new tool, extract_screen_text, that captures the active display and returns
the visible text via macOS Vision OCR — with an optional macOS 27 Foundation Models "visual
summary" mode for higher-level UI understanding. The skill (altic-studio) and README must be
updated so the model knows when/how to use it.
A complete, working reference implementation already exists on the sibling branch
feat/extract-screen-text (10 files, 856 insertions). We are on feat/text-extraction and will
port that work — with two deviations: (1) fix a latent compile bug in the Foundation Models
path, and (2) skip the empty package.json / package-lock.json files (the project is Python/uv,
those are cruft).
## Scope decisions (confirmed with user)
- OCR + visual summary — full parity with the reference branch, including
include_visual_summary / visual_prompt.
- Skip package.json and package-lock.json.
## Files to create
1. tools/screen_text.py (new, Python wrapper)
Mirror feat/extract-screen-text:tools/screen_text.py verbatim. Follows existing wrapper
conventions (compare tools/screenshot.py, tools/clipboard.py):
- DEFAULT_VISUAL_PROMPT constant.
- _error(), _json(), _script_path() (uses SCRIPTS_PREFIX from tools/constants.py),
_default_output_path() (writes to tempfile.gettempdir()/altic-mcp-screen-text).
- extract_screen_text(output_path, max_chars=20000, include_visual_summary=False,
visual_prompt=DEFAULT_VISUAL_PROMPT): resolves/creates output path, runs the Swift
helper via subprocess.run(..., timeout=90), parses the JSON stdout, truncates text to
max_chars in Python, returns a normalized JSON payload (action, screenshot_path, text,
length_chars, returned_length_chars, truncated, line_count, average_confidence, plus
visual_* keys when present).
- Error handling: Error: ... strings for subprocess failure, invalid JSON, timeout, generic.
2. tools/scripts/extract-screen-text.swift (new, Swift helper)
Mirror feat/extract-screen-text:..., reusing capture-active-screen.swift display-selection
logic and adding:
- recognizeText(in:) using VNRecognizeTextRequest (Vision).
- Optional FM visual summary gated behind @available(macOS 27.0, *) and
#if canImport(FoundationModels) && canImport(_Vision_FoundationModels).
- Emits one JSON object on stdout via JSONEncoder with .withoutEscapingSlashes.
- CLI args: <output_path> [include_visual_summary] [visual_prompt].
BUG FIX (deviation): in generateFoundationVisualSummary, change
let session = LanguageModelSession(model: model, tools: [OCRTool()])
to
let session = LanguageModelSession(model: model)
(OCRTool() is undefined; this breaks compile on a macOS 27 toolchain.)
3. skills/altic-studio/scripts/extract-screen-text.swift (new, mirror)
Copy the fixed Swift file so the two copies stay identical.
4. tests/test_screen_text.py (new)
Mirror feat/extract-screen-text:tests/test_screen_text.py — pytest + monkeypatch over
screen_text.subprocess.run. Covers default args/swift invocation shape, visual-summary
options passthrough, Python-side truncation, visual-unavailable reporting, subprocess error,
invalid JSON, and test_server_exposes_extract_screen_text.
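
The error-handling contract in item 1 (Error: ... strings for subprocess failure, invalid JSON, and timeout) can be sketched as follows. This is an illustration under assumptions, not the ported wrapper: `run_helper`, the injectable `runner` parameter, and the exact error wording are hypothetical. Only the 90-second timeout and the "Error: ..." convention come from the plan.

```python
import json
import subprocess
from typing import Callable

HELPER_TIMEOUT_SECONDS = 90  # timeout stated in the plan


def run_helper(command: list[str], runner: Callable = subprocess.run) -> str:
    """Hypothetical sketch: run the Swift helper, return JSON text or 'Error: ...'."""
    try:
        completed = runner(
            command,
            capture_output=True,
            text=True,
            check=True,  # non-zero exit raises CalledProcessError
            timeout=HELPER_TIMEOUT_SECONDS,
        )
        payload = json.loads(completed.stdout)
    except subprocess.TimeoutExpired:
        return f"Error: screen text helper timed out after {HELPER_TIMEOUT_SECONDS}s"
    except subprocess.CalledProcessError as exc:
        return f"Error: screen text helper failed: {(exc.stderr or '').strip()}"
    except json.JSONDecodeError:
        return "Error: screen text helper returned invalid JSON"
    return json.dumps(payload)
```

Passing the runner as a parameter is a stand-in for the monkeypatch over screen_text.subprocess.run that the tests use. Either way, no Swift process needs to start during unit tests.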
## Files to modify
5. server.py — add screen_text to the from tools import (...) block (alpha order, after safari);
register the tool right after capture_active_screen (before add_screen_glow), mirroring
feat/extract-screen-text:server.py with Field(default=..., ge=1, le=200000) constraints.
6. skills/altic-studio/SKILL.md — apply the SKILL.md diff from the reference branch:
- Intro list: add line 6 (MCP screen text mode); amend Swift utility scripts sentence.
- Mode A capabilities list: add extract-screen-text.swift entry + Swift command template.
- Mode B (Chrome) tool list: add extract_screen_text.
- New Mode B2: Screen Text and Visual Understanding (MCP) with tool, args, workflow rules.
- Operational Rules + Permissions Checklist additions.
7. skills/altic-studio/scripts/README.md — add the example invocation line.
8. README.md — feature bullet, skill listing, Screen Recording / macOS 27 permission updates,
"Manual Smoke Tests For Screen Text Tools" section.
## Not doing
- No new Python dependencies (Vision/FoundationModels are native; called via Swift subprocess).
- Skip package.json / package-lock.json.
## Verification
1. uv run pytest tests/test_screen_text.py -v ; uv run pytest -q (full suite).
2. Server registration via test_server_exposes_extract_screen_text; optionally import server.
3. swiftc -typecheck tools/scripts/extract-screen-text.swift (OCRTool fix prevents known error).
4. End-to-end manual on macOS w/ Screen Recording: OCR-only, FM summary on macOS 27, and the
FM-unavailable fallback; plus a direct script invocation.