PR Notes · Reviewer Context

Add extract_screen_text — active-display OCR for the Altic MCP server

New MCP tool that captures the active display and returns its visible text via macOS Vision OCR, with an optional macOS 27 Foundation Models visual-summary mode. Ported from the sibling branch feat/extract-screen-text with a Swift compile-bug fix and the empty package.json cruft dropped. The altic-studio skill and README are updated so the model knows when and how to use it.

8 files | +~430 insertions | Python 3.13 · FastMCP · Swift | 39 / 39 tests pass | swiftc typecheck: exit 0 | macOS-only · manual E2E pending

01  Reviewer Digest

The server could already screenshot the active display (capture_active_screen) but could not read the text on it. This PR adds extract_screen_text, which captures the display containing the frontmost app, runs VNRecognizeTextRequest (Vision) OCR over it, and returns structured JSON (text, line count, average confidence, screenshot path). When include_visual_summary=true, it additionally asks a macOS 27 Foundation Models language model to describe the screen — gracefully degrading to OCR-only with a visual_error when that capability is unavailable.

Default mode: OCR only (fast, deterministic, no FM dependency)
Opt-in mode: + visual summary (macOS 27 Foundation Models)
Return type: JSON string (matches house _json/_error style)

Provenance — this is a port, not a fresh write

A complete working implementation already existed on the sibling branch feat/extract-screen-text (its closed PR was #5). The Python wrapper and tests were copied verbatim; the Swift helper was copied with one bug fix. Reviewers comparing against that branch should focus on the two intentional deviations below.

02  Key Decisions

| Decision | Rationale | Alternative rejected |
| --- | --- | --- |
| Port from feat/extract-screen-text | A proven, test-covered implementation already existed; re-deriving it risked drift and wasted effort. | Write a fresh implementation on this branch. |
| Fix undefined OCRTool() in the Swift FM path | The reference used LanguageModelSession(model: model, tools: [OCRTool()]), but OCRTool is defined nowhere. On a macOS 27 toolchain (where FoundationModels imports), this fails to compile. Changed to LanguageModelSession(model: model). | Leave as-is (would break swiftc on this exact machine; confirmed). |
| Ship OCR + visual summary | User chose full parity with the reference over an OCR-only subset. | Drop FM summary, visual_prompt, and visual_* fields. |
| Skip package.json / package-lock.json | This is a Python/uv project; the reference branch's empty JS manifests are accidental cruft. | Mirror the reference branch exactly. |
| Python-side truncation (max_chars, default 20000, capped at 200000) | Keeps the Swift helper simple (always returns full text) and bounds payload size at the tool boundary. | Truncate inside Swift. |
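
The truncation decision can be illustrated with a minimal sketch. This is not the code in tools/screen_text.py: the function name apply_truncation and the helper_truncated parameter are hypothetical, and the in-function clamp stands in for the Field(ge=1, le=200000) constraint that server.py applies. It only shows how the three reported fields could relate to one another.

```python
def apply_truncation(text: str, max_chars: int = 20000,
                     helper_truncated: bool = False) -> dict:
    """Bound the served text and report both the full and served lengths."""
    # Assumption: clamp to the documented 1..200000 range here; the real tool
    # enforces the range through Field constraints at registration time.
    limit = max(1, min(max_chars, 200000))
    served = text[:limit]
    return {
        "text": served,
        "length_chars": len(text),             # full OCR length
        "returned_length_chars": len(served),  # length actually returned
        # True if Python cut the text OR the helper reported its own cut.
        "truncated": helper_truncated or len(served) < len(text),
    }
```

Under these assumptions a caller can compare length_chars with returned_length_chars to decide whether a second call with a larger max_chars is worthwhile.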

03  Impact Map — Files Changed

| File | State | What & why |
| --- | --- | --- |
| tools/screen_text.py | NEW | Python wrapper: resolves output path, runs Swift helper (timeout=90), parses JSON, truncates text, normalizes payload. Mirrors tools/screenshot.py / clipboard.py conventions. |
| tools/scripts/extract-screen-text.swift | NEW | ScreenCaptureKit capture + Vision OCR + gated macOS 27 FM summary. Contains the OCRTool() fix. |
| skills/altic-studio/scripts/extract-screen-text.swift | NEW | Byte-identical mirror of the fixed Swift helper (the skill keeps its own script copies). |
| tests/test_screen_text.py | NEW | 7 tests: invocation shape, visual-option passthrough, truncation, FM-unavailable, subprocess error, invalid JSON, server registration. |
| server.py | MOD | Added screen_text import + @mcp.tool() extract_screen_text(...) registration after capture_active_screen. |
| skills/altic-studio/SKILL.md | MOD | New Mode B2 section, capability/tool-list/command-template entries, operational rule, permissions. |
| skills/altic-studio/scripts/README.md | MOD | Example invocation line for the new Swift script. |
| README.md | MOD | Feature bullet, skill listing, Screen Recording + macOS 27 permission notes, smoke-test section. |

The two Swift files are kept identical on purpose — a reviewer change to one must be applied to both.
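
Since nothing in the repo enforces that parity, a small check along the following lines could be run locally or wired into a test. This is a sketch, not part of the PR: the function name copies_identical is hypothetical, and the two paths are taken from the table above and assume the script is run from the repository root.

```python
import filecmp
from pathlib import Path

# Paths as listed in the files-changed table; assumes the repo root as cwd.
SWIFT_COPIES = (
    Path("tools/scripts/extract-screen-text.swift"),
    Path("skills/altic-studio/scripts/extract-screen-text.swift"),
)


def copies_identical(first: Path, second: Path) -> bool:
    """Return True only when both files exist and match byte for byte."""
    if not (first.is_file() and second.is_file()):
        return False
    # shallow=False forces a content comparison instead of trusting os.stat.
    return filecmp.cmp(first, second, shallow=False)


if __name__ == "__main__":
    ok = copies_identical(*SWIFT_COPIES)
    print("identical" if ok else "Swift helper copies differ or are missing")
    raise SystemExit(0 if ok else 1)
```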

04  Data Flow & Degradation Path

MCP call (extract_screen_text) → screen_text.py (subprocess, timeout 90s) → Swift helper (capture + OCR) → JSON stdout (text + metadata) → Normalized JSON (truncated to max_chars)
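
The subprocess step of this flow can be sketched roughly as follows. It is an illustration under stated assumptions, not the wrapper's actual code: the function name run_helper, the injectable runner parameter (added here only so the sketch can be exercised without macOS), and the exact error wording are all hypothetical. Only the 90 second timeout and the "Error: ..." string convention come from this PR.

```python
import json
import subprocess


def run_helper(command: list[str], timeout: float = 90,
               runner=subprocess.run) -> str:
    """Run the helper command; return a JSON string or an 'Error: ...' string."""
    try:
        completed = runner(command, capture_output=True, text=True,
                           timeout=timeout, check=True)
        payload = json.loads(completed.stdout)
    except subprocess.TimeoutExpired:
        # A dedicated message makes a slow FM cold start easy to spot.
        return f"Error: screen text extraction timed out after {timeout}s"
    except subprocess.CalledProcessError as exc:
        return f"Error: helper failed: {(exc.stderr or '').strip()}"
    except json.JSONDecodeError:
        return "Error: helper returned invalid JSON"
    return json.dumps(payload)
```

In the real wrapper the command would point at the Swift helper and the payload would be truncated and normalized before being serialized; those steps are left out to keep the failure paths visible.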

Visual-summary branch (only when include_visual_summary=true)

| Condition | Behavior | Result fields |
| --- | --- | --- |
| macOS 27 + FoundationModels available | LanguageModelSession summarizes the captured image | visual_summary, visual_model_available=true, visual_model_source |
| < macOS 27 or FM modules absent | Degrade to OCR-only, no throw | visual_model_available=false, visual_error explains why |
| FM call throws at runtime | Caught; OCR text still returned | visual_error carries the localized error |
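
From the caller's side, the three rows above could be told apart with a check like the one below. This is a hedged sketch: visual_status and its three labels are hypothetical names, and it assumes the visual_* keys are simply absent when no summary was requested.

```python
def visual_status(payload: dict) -> str:
    """Classify a result payload as 'summary', 'degraded', or 'ocr_only'."""
    if payload.get("visual_summary"):
        return "summary"    # FM produced a description of the screen
    if payload.get("visual_error") or payload.get("visual_model_available") is False:
        return "degraded"   # summary requested, but only OCR text came back
    return "ocr_only"       # assumption: no visual_* keys means not requested
```

In every branch the OCR text field is still expected to be usable, so a caller would typically log the degraded case rather than treat it as a failure.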

05  Actions Taken

  1. Explored the codebase (read-only): 2 parallel Explore agents covering tool registration patterns, the screenshot tool, and the altic-studio skill structure
  2. Discovered the sibling reference branch: feat/extract-screen-text has the full impl + tests; verified the current branch lacked it
  3. Wrote & got approval on a plan (Plan Mode): confirmed scope (OCR + summary) and the package.json skip via AskUserQuestion
  4. Ported Python wrapper + tests verbatim: git show feat/extract-screen-text:<path> → tools/screen_text.py, tests/test_screen_text.py
  5. Ported Swift helper and applied the OCRTool fix: edited line 131; copied the fixed file to the skill scripts dir (diff -q → identical)
  6. Wired the tool into server.py: import in alpha order; registration block after capture_active_screen
  7. Updated skill + docs: SKILL.md Mode B2, scripts/README.md, README.md feature/permissions/smoke-tests
  8. Verified: pytest (targeted + full), server import, swiftc -typecheck

06  Verification

| Check | Result | Detail |
| --- | --- | --- |
| pytest tests/test_screen_text.py -q | 7 passed | All new-tool unit tests, subprocess mocked |
| pytest -q (full suite) | 39 passed | No regressions across the repo |
| import server registration | True | 'extract_screen_text' in mcp._tool_manager._tools |
| swiftc -typecheck extract-screen-text.swift | exit 0 | Run on macOS 27.0, so the FoundationModels branch was compiled; this directly confirms the OCRTool() fix |
| End-to-end on real screen | not run | Requires interactively granting Screen Recording permission; documented in README smoke tests |

# targeted
$ uv run pytest tests/test_screen_text.py -q
.......                                          [100%]
7 passed in 0.63s

# full suite
$ uv run pytest -q
.......................................          [100%]
39 passed in 0.43s

# macOS 27 — FoundationModels path compiled, OCRTool fix confirmed
$ swiftc -typecheck tools/scripts/extract-screen-text.swift
typecheck exit: 0
$ sw_vers | grep ProductVersion
ProductVersion: 27.0

Known uncertainty

No live capture/OCR run was performed (needs Screen Recording grant + a foreground window with text), and the Foundation Models runtime summary path was not executed — only type-checked. Visual-summary output quality is therefore unverified.

07  Review Focus

| Area | Risk | Why look here |
| --- | --- | --- |
| Swift FM path (generateFoundationVisualSummary) | medium | The deviation from the reference; verify the LanguageModelSession(model:) API and Attachment(imageURL:) usage against the macOS 27 SDK you target. |
| Two identical Swift copies | low | No automated check enforces parity; future edits must touch both tools/scripts/ and skills/altic-studio/scripts/. |
| subprocess timeout (90s) | low | FM summarization can be slow; confirm 90s is enough on cold start, or surface a clearer timeout error. |
| Truncation semantics | low | truncated reflects a Python-side cut OR a Swift-provided flag; length_chars is the full length, returned_length_chars the served length. |

Best reviewer entry point: start at tools/screen_text.py (contract) → server.py (signature/limits) → Swift helper (capture/OCR/FM).

08  Full Plan (approved, verbatim)

Approved Plan Mode plan
# Plan: Add extract_screen_text MCP tool + update the skill

## Context
The repo (altic-mcp, a Python 3.13 / FastMCP server for macOS automation) can capture
the active display as an image (capture_active_screen) but cannot read the text on screen.
We want a new tool, extract_screen_text, that captures the active display and returns
the visible text via macOS Vision OCR — with an optional macOS 27 Foundation Models "visual
summary" mode for higher-level UI understanding. The skill (altic-studio) and README must be
updated so the model knows when/how to use it.

A complete, working reference implementation already exists on the sibling branch
feat/extract-screen-text (10 files, 856 insertions). We are on feat/text-extraction and will
port that work — with two deviations: (1) fix a latent compile bug in the Foundation Models
path, and (2) skip the empty package.json / package-lock.json files (the project is Python/uv,
those are cruft).

## Scope decisions (confirmed with user)
- OCR + visual summary — full parity with the reference branch, including
  include_visual_summary / visual_prompt.
- Skip package.json and package-lock.json.

## Files to create
1. tools/screen_text.py (new, Python wrapper)
   Mirror feat/extract-screen-text:tools/screen_text.py verbatim. Follows existing wrapper
   conventions (compare tools/screenshot.py, tools/clipboard.py):
   - DEFAULT_VISUAL_PROMPT constant.
   - _error(), _json(), _script_path() (uses SCRIPTS_PREFIX from tools/constants.py),
     _default_output_path() (writes to tempfile.gettempdir()/altic-mcp-screen-text).
   - extract_screen_text(output_path, max_chars=20000, include_visual_summary=False,
     visual_prompt=DEFAULT_VISUAL_PROMPT): resolves/creates output path, runs the Swift
     helper via subprocess.run(..., timeout=90), parses the JSON stdout, truncates text to
     max_chars in Python, returns a normalized JSON payload (action, screenshot_path, text,
     length_chars, returned_length_chars, truncated, line_count, average_confidence, plus
     visual_* keys when present).
   - Error handling: Error: ... strings for subprocess failure, invalid JSON, timeout, generic.

2. tools/scripts/extract-screen-text.swift (new, Swift helper)
   Mirror feat/extract-screen-text:..., reusing capture-active-screen.swift display-selection
   logic and adding:
   - recognizeText(in:) using VNRecognizeTextRequest (Vision).
   - Optional FM visual summary gated behind @available(macOS 27.0, *) and
     #if canImport(FoundationModels) && canImport(_Vision_FoundationModels).
   - Emits one JSON object on stdout via JSONEncoder with .withoutEscapingSlashes.
   - CLI args: <output_path> [include_visual_summary] [visual_prompt].
   BUG FIX (deviation): in generateFoundationVisualSummary, change
     let session = LanguageModelSession(model: model, tools: [OCRTool()])
   to
     let session = LanguageModelSession(model: model)
   (OCRTool() is undefined; this breaks compile on a macOS 27 toolchain.)

3. skills/altic-studio/scripts/extract-screen-text.swift (new, mirror)
   Copy the fixed Swift file so the two copies stay identical.

4. tests/test_screen_text.py (new)
   Mirror feat/extract-screen-text:tests/test_screen_text.py — pytest + monkeypatch over
   screen_text.subprocess.run. Covers default args/swift invocation shape, visual-summary
   options passthrough, Python-side truncation, visual-unavailable reporting, subprocess error,
   invalid JSON, and test_server_exposes_extract_screen_text.

## Files to modify
5. server.py — add screen_text to the from tools import (...) block (alpha order, after safari);
   register the tool right after capture_active_screen (before add_screen_glow), mirroring
   feat/extract-screen-text:server.py with Field(default=..., ge=1, le=200000) constraints.

6. skills/altic-studio/SKILL.md — apply the SKILL.md diff from the reference branch:
   - Intro list: add line 6 (MCP screen text mode); amend Swift utility scripts sentence.
   - Mode A capabilities list: add extract-screen-text.swift entry + Swift command template.
   - Mode B (Chrome) tool list: add extract_screen_text.
   - New Mode B2: Screen Text and Visual Understanding (MCP) with tool, args, workflow rules.
   - Operational Rules + Permissions Checklist additions.

7. skills/altic-studio/scripts/README.md — add the example invocation line.

8. README.md — feature bullet, skill listing, Screen Recording / macOS 27 permission updates,
   "Manual Smoke Tests For Screen Text Tools" section.

## Not doing
- No new Python dependencies (Vision/FoundationModels are native; called via Swift subprocess).
- Skip package.json / package-lock.json.

## Verification
1. uv run pytest tests/test_screen_text.py -v ; uv run pytest -q (full suite).
2. Server registration via test_server_exposes_extract_screen_text; optionally import server.
3. swift -typecheck tools/scripts/extract-screen-text.swift (OCRTool fix prevents known error).
4. End-to-end manual on macOS w/ Screen Recording: OCR-only, FM summary on macOS 27, and the
   FM-unavailable fallback; plus a direct script invocation.