Signal Isolation

NLP pipeline for content-word delta analysis (Signal Isolation Layer).

See Interpretive Provenance Chain (IPC) and Hashing System for related provenance concepts.

app/signal_isolation.py

Signal Isolation Layer for the Axis Descriptor Lab.

Why a dedicated module?

The signal isolation pipeline transforms raw LLM output text into a filtered set of content lemmas so that meaningful lexical pivots can be surfaced without structural noise. Centralising this logic in one module keeps main.py focused on routing and ensures the NLP pipeline is independently testable.

Pipeline

The module applies a five-step pipeline. Steps 1 to 4 run on each text independently; step 5 compares the two resulting lemma sets:

  1. Tokenise — split text into word tokens using NLTK’s word_tokenize, which splits sentences with Punkt and then applies a Penn Treebank-style word tokeniser. Lowercase all tokens and discard any that contain no alphabetic characters (punctuation, numbers).

  2. Lemmatise — reduce inflected forms to their base lemma using the WordNet lemmatiser. A two-pass heuristic is used: try verb lemmatisation first (catches “carries” → “carry”, “failing” → “fail”), then fall back to the default noun lemmatisation (“figures” → “figure”).

  3. Filter stopwords — remove English function words (articles, auxiliaries, pronouns, conjunctions) using NLTK’s stopwords corpus.

  4. Collect into a set — deduplicate the remaining content lemmas.

  5. Compute delta — set-difference the two lemma sets to find words that were added or removed.
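Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the module's code: so that it runs without NLTK data, it swaps in a regex tokeniser, two tiny lookup tables, and a short stopword set (all hypothetical names) where the module uses word_tokenize, the WordNet lemmatiser, and NLTK's stopwords corpus. The fallback rule shown (try the noun lookup only when the verb lookup leaves the token unchanged) is an assumption about how the two-pass heuristic decides.

```python
import re

# Toy stand-ins (assumptions) so the sketch runs without NLTK data.
TOY_STOPWORDS = frozenset({"a", "an", "and", "are", "be", "is", "it", "the", "to"})
TOY_VERB_LEMMAS = {"carries": "carry", "failing": "fail", "is": "be", "are": "be"}
TOY_NOUN_LEMMAS = {"figures": "figure", "signals": "signal"}


def toy_tokenise(text):
    """Split into word and punctuation tokens (stand-in for word_tokenize)."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_lemmatise(token):
    """Verb lookup first; fall back to the noun lookup if nothing changed."""
    lemma = TOY_VERB_LEMMAS.get(token, token)
    if lemma == token:
        lemma = TOY_NOUN_LEMMAS.get(token, token)
    return lemma


def extract_content_lemmas_sketch(text):
    # Step 1: tokenise, lowercase, keep tokens with at least one letter.
    tokens = [tok.lower() for tok in toy_tokenise(text)]
    tokens = [tok for tok in tokens if any(ch.isalpha() for ch in tok)]
    # Step 2: lemmatise each surviving token.
    lemmas = [toy_lemmatise(tok) for tok in tokens]
    # Steps 3 and 4: drop stopwords and deduplicate by building a set.
    return {lemma for lemma in lemmas if lemma not in TOY_STOPWORDS}
```

Because stopword filtering happens after lemmatisation in this ordering, an inflected auxiliary such as "is" is removed only if its lemma ("be") is itself in the stopword set.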

Design principles (from the specification)

  • Deterministic: same input text always produces the same lemma set.

  • Transparent: every step is inspectable; no hidden inference.

  • No axis attribution: the pipeline does not know which axis caused a word to appear.

  • No embeddings: operates strictly at the lexical level.

  • No TF-IDF (Phase 1): results are sorted alphabetically, not by corpus rarity. TF-IDF sorting is reserved for a future phase.

NLTK data requirements

This module requires three NLTK data packages:

  • punkt_tab — Punkt sentence tokeniser models, which word_tokenize uses to split sentences before word-level tokenisation

  • stopwords — English stopword list (179 words; the exact count depends on the installed corpus version)

  • wordnet — lemmatiser database (WordNet 3.0)

These resources are validated explicitly at call time; the module does not download them during import. Bootstrap them during environment setup by running python tools/bootstrap_nltk.py.
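A call-time check of this kind might look like the sketch below. It is an illustration under stated assumptions, not the module's implementation: the function names and the resource paths are hypothetical, and the lookup function is passed in so the logic can be exercised without NLTK installed. In real use the lookup would be nltk.data.find, which raises LookupError when a resource is absent.

```python
from typing import Callable

# Assumed NLTK resource paths (category/name) for the three packages.
REQUIRED_RESOURCES = {
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
}


def missing_resources(find: Callable[[str], object]) -> list[str]:
    """Return the names of required packages that `find` cannot locate."""
    missing = []
    for name, path in REQUIRED_RESOURCES.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing


def require_resources(find: Callable[[str], object]) -> None:
    """Raise a descriptive error instead of downloading anything."""
    missing = missing_resources(find)
    if missing:
        raise RuntimeError(
            "Missing NLTK data: " + ", ".join(missing)
            + ". Run: python tools/bootstrap_nltk.py"
        )
```

With NLTK available, the check would be invoked as require_resources(nltk.data.find). Failing loudly here, rather than downloading on demand, keeps imports free of network access.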

app.signal_isolation.extract_content_lemmas(text)

Run the full signal isolation pipeline on a text string.

Pipeline steps (applied in order):

  1. Tokenise — NLTK word_tokenize, lowercase, filter non-alpha.

  2. Lemmatise — WordNet, verb-then-noun fallback.

  3. Filter stopwords — remove NLTK English stopwords.

  4. Collect into a set — deduplicate remaining content lemmas.

Parameters:

text (str) – Raw input text (e.g. an LLM-generated paragraph).

Returns:

Unique content lemmas extracted from the text. Empty set if the text is empty or contains only stopwords and non-alphabetic tokens.

Return type:

set[str]

app.signal_isolation.compute_delta(baseline_text, current_text)

Compute the content-word delta between two texts.

Runs the signal isolation pipeline (extract_content_lemmas) on both texts, then computes set differences to find words that were added or removed.

Parameters:

  • baseline_text (str) – The reference text (A) — typically the stored baseline output.

  • current_text (str) – The comparison text (B) — typically the latest generated output.

Returns:

A 2-tuple of:

  • removed — content lemmas present in A but absent from B, sorted alphabetically.

  • added — content lemmas present in B but absent from A, sorted alphabetically.

Return type:

tuple[list[str], list[str]]
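The delta step itself reduces to two set differences followed by an alphabetical sort. The sketch below illustrates that shape only; compute_delta_sketch and whitespace_lemmas are hypothetical names, and the extraction function is passed in as a parameter so the example runs without NLTK (the real function calls extract_content_lemmas directly).

```python
def compute_delta_sketch(baseline_text, current_text, extract):
    """Return (removed, added) lemma lists, each sorted alphabetically."""
    baseline = extract(baseline_text)  # lemma set for text A
    current = extract(current_text)    # lemma set for text B
    removed = sorted(baseline - current)  # in A but not in B
    added = sorted(current - baseline)    # in B but not in A
    return removed, added


def whitespace_lemmas(text):
    """Hypothetical stand-in extractor: lowercase whitespace-split words."""
    return set(text.lower().split())


removed, added = compute_delta_sketch(
    "signal carry weight", "signal carry figure", whitespace_lemmas
)
```

Sorting the differences is what makes the result deterministic: set iteration order is not guaranteed, whereas the sorted lists are the same for the same pair of inputs.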