PR Notes · Reviewer Context

Add `extract_screen_text` — active-display OCR for the Altic MCP server

New MCP tool that captures the active display and returns its visible text via macOS Vision OCR, with an optional macOS 27 Foundation Models visual-summary mode. Ported from the sibling branch feat/extract-screen-text with a Swift compile-bug fix and without the empty package.json / package-lock.json cruft. The altic-studio skill and README are updated so the model knows when and how to use it.

8 files · +~430 insertions · Python 3.13 · FastMCP · Swift · 39/39 tests pass · swiftc typecheck: exit 0 · macOS-only · manual E2E pending

01 Reviewer Digest

The server could already screenshot the active display (capture_active_screen) but could not read the text on it. This PR adds extract_screen_text, which captures the display containing the frontmost app, runs VNRecognizeTextRequest (Vision) OCR over it, and returns structured JSON (text, line count, average confidence, screenshot path). When include_visual_summary=true, it additionally asks a macOS 27 Foundation Models language model to describe the screen — gracefully degrading to OCR-only with a visual_error when that capability is unavailable.

Aspect	Setting	Notes
Default mode	OCR only	Fast, deterministic, no FM dependency
Opt-in mode	OCR + visual summary	Requires macOS 27 Foundation Models
Return type	JSON string	Matches the house `_json`/`_error` style

Provenance — this is a port, not a fresh write

A complete working implementation already existed on the sibling branch feat/extract-screen-text (its closed PR was #5). The Python wrapper and tests were copied verbatim; the Swift helper was copied with one bug fix. Reviewers comparing against that branch should focus on the two intentional deviations below.

02 Key Decisions

Decision	Rationale	Alternative rejected
Port from `feat/extract-screen-text`	A proven, test-covered implementation already existed; re-deriving it risked drift and wasted effort.	Write a fresh implementation on this branch.
Fix undefined `OCRTool()` in the Swift FM path	The reference used `LanguageModelSession(model: model, tools: [OCRTool()])` — `OCRTool` is defined nowhere. On a macOS 27 toolchain (where `FoundationModels` imports), this fails to compile. Changed to `LanguageModelSession(model: model)`.	Leave as-is (would break `swiftc` on this exact machine — confirmed).
Ship OCR + visual summary	User chose full parity with the reference over an OCR-only subset.	Drop the FM summary, `visual_prompt`, and the `visual_*` fields.
Skip `package.json` / `package-lock.json`	This is a Python/uv project; the reference branch's empty JS manifests are accidental cruft.	Mirror the reference branch exactly.
Python-side truncation (`max_chars`, default 20000, capped at 200000)	Keeps the Swift helper simple (it always returns the full text) and bounds payload size at the tool boundary.	Truncate inside Swift.
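The truncation decision can be sketched as a small pure function. `truncate_text` is a hypothetical name, and clamping `max_chars` in Python is an assumption (in this PR the server enforces the 1..200000 range through `Field` constraints); the length fields follow the semantics described under Review Focus.

```python
MAX_CHARS_DEFAULT = 20_000
MAX_CHARS_CAP = 200_000


def truncate_text(text: str, max_chars: int = MAX_CHARS_DEFAULT,
                  swift_truncated: bool = False) -> dict:
    """Cut OCR text at the tool boundary and report both lengths.

    Sketch only: the clamp to [1, 200000] mirrors the server-side Field
    bounds and is an assumption about the wrapper, not its actual code.
    """
    limit = max(1, min(max_chars, MAX_CHARS_CAP))
    served = text[:limit]
    return {
        "text": served,
        "length_chars": len(text),             # full OCR text length
        "returned_length_chars": len(served),  # length actually served
        # True if Python cut the text OR the Swift helper flagged a cut.
        "truncated": len(served) < len(text) or swift_truncated,
    }
```

Keeping the cut in Python means the Swift helper never needs to know about payload limits, and both length fields stay available to the caller.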

03 Impact Map — Files Changed

File	State	What & why
`tools/screen_text.py`	NEW	Python wrapper: resolves output path, runs Swift helper (`timeout=90`), parses JSON, truncates text, normalizes payload. Mirrors `tools/screenshot.py` / `clipboard.py` conventions.
`tools/scripts/extract-screen-text.swift`	NEW	ScreenCaptureKit capture + Vision OCR + gated macOS 27 FM summary. Contains the `OCRTool()` fix.
`skills/altic-studio/scripts/extract-screen-text.swift`	NEW	Byte-identical mirror of the fixed Swift helper (the skill keeps its own script copies).
`tests/test_screen_text.py`	NEW	7 tests: invocation shape, visual-option passthrough, truncation, FM-unavailable, subprocess error, invalid JSON, server registration.
`server.py`	MOD	Added `screen_text` import + `@mcp.tool() extract_screen_text(...)` registration after `capture_active_screen`.
`skills/altic-studio/SKILL.md`	MOD	New Mode B2 section, capability/tool-list/command-template entries, operational rule, permissions.
`skills/altic-studio/scripts/README.md`	MOD	Example invocation line for the new Swift script.
`README.md`	MOD	Feature bullet, skill listing, Screen Recording + macOS 27 permission notes, smoke-test section.

The two Swift files are kept identical on purpose — a reviewer change to one must be applied to both.

04 Data Flow & Degradation Path

`extract_screen_text` (MCP call) → `tools/screen_text.py` (subprocess, timeout 90 s) → Swift helper (capture + OCR) → JSON on stdout (text + metadata) → normalized JSON (text truncated to `max_chars`)

Visual-summary branch (only when `include_visual_summary=true`)

Condition	Behavior	Result fields
macOS 27 + FoundationModels available	LanguageModelSession summarizes the captured image	`visual_summary`, `visual_model_available=true`, `visual_model_source`
< macOS 27 or FM modules absent	Degrade to OCR-only, no throw	`visual_model_available=false`, `visual_error` explains why
FM call throws at runtime	Caught; OCR text still returned	`visual_error` carries the localized error
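Forwarding the visual fields only when the helper emitted them keeps OCR-only responses free of `visual_*` keys, matching the plan's "plus visual_* keys when present". `merge_visual_fields` is a hypothetical helper, and the key list is taken from the table above.

```python
# Keys from the degradation table above; the helper may emit any subset.
VISUAL_KEYS = ("visual_summary", "visual_model_available",
               "visual_model_source", "visual_error")


def merge_visual_fields(payload: dict, helper_output: dict) -> dict:
    """Copy only the visual_* keys the Swift helper actually emitted."""
    for key in VISUAL_KEYS:
        if key in helper_output:
            payload[key] = helper_output[key]
    return payload
```

Because the degradation happens inside the helper, the Python side never has to decide whether Foundation Models was available; it just forwards what it was given.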

05 Actions Taken

- Explored the codebase (read-only): 2 parallel Explore agents covering tool registration patterns, the screenshot tool, and the altic-studio skill structure
- Discovered the sibling reference branch: feat/extract-screen-text has the full implementation + tests; verified the current branch lacked it
- Wrote and got approval on a plan (Plan Mode): confirmed scope (OCR + summary) and the package.json skip via AskUserQuestion
- Ported the Python wrapper + tests verbatim: git show feat/extract-screen-text:<path> → tools/screen_text.py, tests/test_screen_text.py
- Ported the Swift helper and applied the OCRTool fix: edited line 131; copied the fixed file to the skill scripts dir (diff -q → identical)
- Wired the tool into server.py: import in alpha order; registration block after capture_active_screen
- Updated skill + docs: SKILL.md Mode B2, scripts/README.md, README.md feature/permissions/smoke-tests
- Verified: pytest (targeted + full), server import, swiftc -typecheck

06 Verification

Check	Result	Detail
`pytest tests/test_screen_text.py -q`	7 passed	All new-tool unit tests, subprocess mocked
`pytest -q` (full suite)	39 passed	No regressions across the repo
`import server` registration	True	`'extract_screen_text' in mcp._tool_manager._tools`
`swiftc -typecheck extract-screen-text.swift`	exit 0	Run on macOS 27.0, so the FoundationModels branch was compiled — directly confirms the `OCRTool()` fix
End-to-end on real screen	not run	Requires interactively granting Screen Recording permission; documented in README smoke tests

# targeted
$ uv run pytest tests/test_screen_text.py -q
.......                                          [100%]
7 passed in 0.63s

# full suite
$ uv run pytest -q
.......................................          [100%]
39 passed in 0.43s

# macOS 27 — FoundationModels path compiled, OCRTool fix confirmed
$ swiftc -typecheck tools/scripts/extract-screen-text.swift
typecheck exit: 0
$ sw_vers | grep ProductVersion
ProductVersion: 27.0

Known uncertainty

No live capture/OCR run was performed (needs Screen Recording grant + a foreground window with text), and the Foundation Models runtime summary path was not executed — only type-checked. Visual-summary output quality is therefore unverified.

07 Review Focus

Area	Risk	Why look here
Swift FM path (`generateFoundationVisualSummary`)	medium	The deviation from the reference; verify the `LanguageModelSession(model:)` API and `Attachment(imageURL:)` usage against the macOS 27 SDK you target.
Two identical Swift copies	low	No automated check enforces parity — future edits must touch both `tools/scripts/` and `skills/altic-studio/scripts/`.
`subprocess` timeout (90s)	low	FM summarization can be slow; confirm 90s is enough on cold start, or surface a clearer timeout error.
Truncation semantics	low	`truncated` is true if either the Python side cut the text or the Swift helper set its own truncation flag; `length_chars` is the full length and `returned_length_chars` the served length.
Best reviewer entry point	—	Start at `tools/screen_text.py` (contract) → `server.py` (signature/limits) → Swift helper (capture/OCR/FM).

08 Full Plan (approved, verbatim)

Approved Plan Mode plan

# Plan: Add extract_screen_text MCP tool + update the skill

## Context
The repo (altic-mcp, a Python 3.13 / FastMCP server for macOS automation) can capture
the active display as an image (capture_active_screen) but cannot read the text on screen.
We want a new tool, extract_screen_text, that captures the active display and returns
the visible text via macOS Vision OCR — with an optional macOS 27 Foundation Models "visual
summary" mode for higher-level UI understanding. The skill (altic-studio) and README must be
updated so the model knows when/how to use it.

A complete, working reference implementation already exists on the sibling branch
feat/extract-screen-text (10 files, 856 insertions). We are on feat/text-extraction and will
port that work — with two deviations: (1) fix a latent compile bug in the Foundation Models
path, and (2) skip the empty package.json / package-lock.json files (the project is Python/uv,
those are cruft).

## Scope decisions (confirmed with user)
- OCR + visual summary — full parity with the reference branch, including
  include_visual_summary / visual_prompt.
- Skip package.json and package-lock.json.

## Files to create
1. tools/screen_text.py (new, Python wrapper)
   Mirror feat/extract-screen-text:tools/screen_text.py verbatim. Follows existing wrapper
   conventions (compare tools/screenshot.py, tools/clipboard.py):
   - DEFAULT_VISUAL_PROMPT constant.
   - _error(), _json(), _script_path() (uses SCRIPTS_PREFIX from tools/constants.py),
     _default_output_path() (writes to tempfile.gettempdir()/altic-mcp-screen-text).
   - extract_screen_text(output_path, max_chars=20000, include_visual_summary=False,
     visual_prompt=DEFAULT_VISUAL_PROMPT): resolves/creates output path, runs the Swift
     helper via subprocess.run(..., timeout=90), parses the JSON stdout, truncates text to
     max_chars in Python, returns a normalized JSON payload (action, screenshot_path, text,
     length_chars, returned_length_chars, truncated, line_count, average_confidence, plus
     visual_* keys when present).
   - Error handling: Error: ... strings for subprocess failure, invalid JSON, timeout, generic.

2. tools/scripts/extract-screen-text.swift (new, Swift helper)
   Mirror feat/extract-screen-text:..., reusing capture-active-screen.swift display-selection
   logic and adding:
   - recognizeText(in:) using VNRecognizeTextRequest (Vision).
   - Optional FM visual summary gated behind @available(macOS 27.0, *) and
     #if canImport(FoundationModels) && canImport(_Vision_FoundationModels).
   - Emits one JSON object on stdout via JSONEncoder with .withoutEscapingSlashes.
   - CLI args: <output_path> [include_visual_summary] [visual_prompt].
   BUG FIX (deviation): in generateFoundationVisualSummary, change
     let session = LanguageModelSession(model: model, tools: [OCRTool()])
   to
     let session = LanguageModelSession(model: model)
   (OCRTool() is undefined; this breaks compile on a macOS 27 toolchain.)

3. skills/altic-studio/scripts/extract-screen-text.swift (new, mirror)
   Copy the fixed Swift file so the two copies stay identical.

4. tests/test_screen_text.py (new)
   Mirror feat/extract-screen-text:tests/test_screen_text.py — pytest + monkeypatch over
   screen_text.subprocess.run. Covers default args/swift invocation shape, visual-summary
   options passthrough, Python-side truncation, visual-unavailable reporting, subprocess error,
   invalid JSON, and test_server_exposes_extract_screen_text.

## Files to modify
5. server.py — add screen_text to the from tools import (...) block (alpha order, after safari);
   register the tool right after capture_active_screen (before add_screen_glow), mirroring
   feat/extract-screen-text:server.py with Field(default=..., ge=1, le=200000) constraints.

6. skills/altic-studio/SKILL.md — apply the SKILL.md diff from the reference branch:
   - Intro list: add line 6 (MCP screen text mode); amend Swift utility scripts sentence.
   - Mode A capabilities list: add extract-screen-text.swift entry + Swift command template.
   - Mode B (Chrome) tool list: add extract_screen_text.
   - New Mode B2: Screen Text and Visual Understanding (MCP) with tool, args, workflow rules.
   - Operational Rules + Permissions Checklist additions.

7. skills/altic-studio/scripts/README.md — add the example invocation line.

8. README.md — feature bullet, skill listing, Screen Recording / macOS 27 permission updates,
   "Manual Smoke Tests For Screen Text Tools" section.

## Not doing
- No new Python dependencies (Vision/FoundationModels are native; called via Swift subprocess).
- Skip package.json / package-lock.json.

## Verification
1. uv run pytest tests/test_screen_text.py -v ; uv run pytest -q (full suite).
2. Server registration via test_server_exposes_extract_screen_text; optionally import server.
3. swift -typecheck tools/scripts/extract-screen-text.swift (OCRTool fix prevents known error).
4. End-to-end manual on macOS w/ Screen Recording: OCR-only, FM summary on macOS 27, and the
   FM-unavailable fallback; plus a direct script invocation.

Built from the live coding-session context and local Git state (no external transcript file was provided). Branch feat/text-extraction · Altic MCP.