Signal Isolation¶
NLP pipeline for content-word delta analysis (Signal Isolation Layer).
See Interpretive Provenance Chain (IPC) and Hashing System for related provenance concepts.
app/signal_isolation.py¶
Signal Isolation Layer for the Axis Descriptor Lab.
Why a dedicated module?¶
The signal isolation pipeline transforms raw LLM output text into a filtered
set of content lemmas so that meaningful lexical pivots can be surfaced
without structural noise. Centralising this logic in one module keeps
main.py focused on routing and ensures the NLP pipeline is independently
testable.
Pipeline¶
The module applies a five-step pipeline. The first four steps run on each text independently; the fifth compares the two resulting lemma sets:
Tokenise: split text into word tokens using NLTK’s Penn Treebank tokeniser (word_tokenize). Lowercase all tokens and discard any that contain no alphabetic characters (punctuation, numbers).
Lemmatise: reduce inflected forms to their base lemma using the WordNet lemmatiser. A two-pass heuristic is used: try verb lemmatisation first (catches “carries” → “carry”, “failing” → “fail”), then fall back to the default noun lemmatisation (“figures” → “figure”).
Filter stopwords: remove English function words (articles, auxiliaries, pronouns, conjunctions) using NLTK’s stopwords corpus.
Collect into a set: deduplicate the remaining content lemmas.
Compute delta: set-difference the two lemma sets to find words that were added or removed.
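The first four steps can be sketched roughly as follows. This is an illustrative sketch, not the module's actual source: the names `lemmatise_two_pass`, `content_lemmas` and `nltk_components` are hypothetical, the components are passed in as arguments so the logic can be exercised without NLTK data installed, and the "non-alphabetic" filter follows the wording above (a token is dropped only if it contains no letters at all).

```python
from typing import Callable, Collection, Iterable

Lemmatise = Callable[[str, str], str]  # (token, part_of_speech) -> lemma


def lemmatise_two_pass(token: str, lemmatise: Lemmatise) -> str:
    """Verb-first lemmatisation with a noun fallback (hypothetical helper)."""
    as_verb = lemmatise(token, "v")
    if as_verb != token:
        # The verb pass changed the token, e.g. "carries" -> "carry".
        return as_verb
    # Otherwise fall back to the noun reading, e.g. "figures" -> "figure".
    return lemmatise(token, "n")


def content_lemmas(
    text: str,
    tokenise: Callable[[str], Iterable[str]],
    lemmatise: Lemmatise,
    stopwords: Collection[str],
) -> set[str]:
    """Steps 1 to 4: tokenise, lemmatise, filter stopwords, collect into a set."""
    tokens = (t.lower() for t in tokenise(text))
    # Assumption: keep any token that contains at least one alphabetic character.
    tokens = (t for t in tokens if any(c.isalpha() for c in t))
    lemmas = (lemmatise_two_pass(t, lemmatise) for t in tokens)
    return {lemma for lemma in lemmas if lemma not in stopwords}


def nltk_components():
    """Build the real components; NLTK is imported lazily and its data must be installed."""
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    wnl = WordNetLemmatizer()
    return (
        word_tokenize,
        lambda token, pos: wnl.lemmatize(token, pos=pos),
        frozenset(stopwords.words("english")),
    )


# Toy stand-ins so the control flow can be checked without NLTK.
TOY_LEMMAS = {("carries", "v"): "carry", ("figures", "n"): "figure"}


def toy_lemmatise(token: str, pos: str) -> str:
    return TOY_LEMMAS.get((token, pos), token)


toy_result = content_lemmas("The figures carries 42 !", str.split, toy_lemmatise, {"the"})
print(sorted(toy_result))  # ['carry', 'figure']
```

With NLTK and its data available, `content_lemmas(text, *nltk_components())` would run the same logic against the real tokeniser, lemmatiser and stopword list.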
Design principles (from the specification)¶
Deterministic: same input text always produces the same lemma set.
Transparent: every step is inspectable; no hidden inference.
No axis attribution: the pipeline does not know which axis caused a word to appear.
No embeddings: operates strictly at the lexical level.
No TF-IDF (Phase 1): results are sorted alphabetically, not by corpus rarity. TF-IDF sorting is reserved for a future phase.
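To make the determinism and alphabetical-ordering principles concrete, the delta step can be sketched on two already-extracted lemma sets. The function name `lemma_delta` and the sample sets are hypothetical; only the (removed, added) ordering and the alphabetical sort are taken from this page.

```python
def lemma_delta(
    baseline_lemmas: set[str], current_lemmas: set[str]
) -> tuple[list[str], list[str]]:
    """Return (removed, added), each sorted alphabetically."""
    # Plain set differences: no weighting, no embeddings, no corpus statistics.
    removed = sorted(baseline_lemmas - current_lemmas)
    added = sorted(current_lemmas - baseline_lemmas)
    return removed, added


# Hypothetical lemma sets for illustration only.
baseline = {"figure", "carry", "weight"}
current = {"figure", "fail", "burden"}

removed, added = lemma_delta(baseline, current)
print(removed)  # ['carry', 'weight']
print(added)    # ['burden', 'fail']
```

Sorting the set differences is what makes the output order reproducible, since Python sets have no guaranteed iteration order.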
NLTK data requirements¶
This module requires three NLTK data packages:
punkt_tab: Punkt sentence-tokeniser models, loaded by word_tokenize
stopwords: English stopword list (179 words)
wordnet: lemmatiser database (WordNet 3.0)
These resources are validated explicitly at call time rather than being
downloaded during module import. Environment preparation should bootstrap
them up front via python tools/bootstrap_nltk.py.
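A call-time check of this kind might look like the sketch below. It is an assumption about the general shape, not the module's actual code: `REQUIRED_RESOURCES`, `missing_nltk_resources` and `require_nltk_resources` are hypothetical names, and the resource paths are the conventional NLTK locations for these packages. The lookup function is injectable so the logic can be tested without NLTK; by default it uses `nltk.data.find`, which raises `LookupError` for a missing resource.

```python
from typing import Callable, Optional

# Assumed mapping from package name to its conventional NLTK resource path.
REQUIRED_RESOURCES = {
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
    "wordnet": "corpora/wordnet",
}


def missing_nltk_resources(
    find: Optional[Callable[[str], object]] = None,
) -> list[str]:
    """Return the names of required packages that cannot be located."""
    if find is None:
        # Imported lazily so that importing this module never touches NLTK data.
        from nltk.data import find as nltk_find

        find = nltk_find
    missing = []
    for package, path in REQUIRED_RESOURCES.items():
        try:
            find(path)
        except LookupError:
            missing.append(package)
    return missing


def require_nltk_resources(find: Optional[Callable[[str], object]] = None) -> None:
    """Raise a descriptive error instead of downloading anything implicitly."""
    missing = missing_nltk_resources(find)
    if missing:
        raise RuntimeError(
            "Missing NLTK data: " + ", ".join(missing)
            + ". Run: python tools/bootstrap_nltk.py"
        )
```

Failing loudly at call time keeps imports free of side effects and avoids network access at import time.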
- app.signal_isolation.extract_content_lemmas(text)[source]¶
Run the full signal isolation pipeline on a text string.
Pipeline steps (applied in order):
Tokenise: NLTK word_tokenize, lowercase, filter non-alpha.
Lemmatise: WordNet, verb-then-noun fallback.
Filter stopwords: remove NLTK English stopwords.
Collect into a set: deduplicate remaining content lemmas.
- app.signal_isolation.compute_delta(baseline_text, current_text)[source]¶
Compute the content-word delta between two texts.
Runs the signal isolation pipeline (extract_content_lemmas) on both texts, then computes set differences to find words that were added or removed.
- Parameters:
baseline_text: the baseline text (A).
current_text: the text compared against the baseline (B).
- Returns:
A 2-tuple of:
removed: content lemmas present in baseline_text (A) but absent from current_text (B), sorted alphabetically.
added: content lemmas present in current_text (B) but absent from baseline_text (A), sorted alphabetically.
- Return type: