interegular-0.3.3/LICENSE.txt

MIT License

Copyright (c) 2019 @MegaIng and @qntm

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

interegular-0.3.3/PKG-INFO

Metadata-Version: 2.1
Name: interegular
Version: 0.3.3
Summary: a regex intersection checker
Home-page: https://github.com/MegaIng/regex_intersections
Download-URL: https://github.com/MegaIng/interegular/tarball/master
Author: MegaIng
Author-email: MegaIng
License: MIT
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt

interegular-0.3.3/README.md

# Interegular

***regex intersection checker***

A library to check a subset of python regexes for intersections. Based on [greenery](https://github.com/qntm/greenery) by [@qntm](https://github.com/qntm). Adapted for [lark-parser](https://github.com/lark-parser/lark).

The primary difference from the `greenery` library is that `interegular` is focused on speed and compatibility with python `re` syntax, whereas `greenery` has a way to reconstruct a regex from an FSM, which `interegular` lacks.

## Interface

| Function | Usage |
| -------- | ----- |
| `compare_regexes(*regexes: str)` | Takes a series of regexes as strings and returns a Generator of all intersections as `(str, str)` |
| `parse_pattern(regex: str)` | Parses a regex as string to a `Pattern` object |
| `interegular.compare_patterns(*patterns: Pattern)` | Takes a series of regexes as patterns and returns a Generator of all intersections as `(Pattern, Pattern)` |
| `Pattern` | A class representing a parsed regex (intermediate representation) |
| `REFlags` | An enum representing the flags a regex can have |
| `FSM` | A class representing a fully parsed regex. (Has many useful members) |
| `Pattern.with_flags(added: REFlags, removed: REFlags)` | A function to change the flags that are applied to a regex |
| `Pattern.to_fsm() -> FSM` | A function to create a `FSM` object from the Pattern |
| `Comparator` | A class to compare a group of Patterns |

## What is supported?

Most normal python-regex syntax is supported. But because of the backend that is used (finite state machines), some things can not be implemented. This includes:

- Backwards references (`\1`, `(?P=open)`)
- Conditional Matching (`(?(1)a|b)`)
- Some cases of lookaheads/lookbacks (you have to try out which work and which don't)
  - A word of warning: This is currently not correctly handled, and some things might parse, but not work correctly. I am currently working on this.

Some things are simply not implemented and will be implemented in the future:

- Some flags (Progress: `ims` out of `aiLmsux`)
- Some cases of lookaheads/lookbacks (you have to try out which work and which don't)

## TODO

- Docs
- More tests
- Checks that the syntax is correctly handled
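## Example

A minimal usage sketch (the regexes here are arbitrary demo inputs; the trailing comment shows the expected result, it is not part of the API):

```python
import interegular

# "[a-z]+" and "abc" can both match the string "abc"; "[0-9]+" overlaps with neither.
for a, b in interegular.compare_regexes(r"[a-z]+", r"abc", r"[0-9]+"):
    print("These regexes overlap:", a, b)
# -> These regexes overlap: [a-z]+ abc
```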
interegular-0.3.3/interegular/__init__.py

"""
A package to compare python-style regexes and test if they have intersections.

Based on the `greenery`-package by @qntm, adapted and specialized for `lark-parser`
"""
from typing import Iterable, Tuple

from interegular.fsm import FSM
from interegular.patterns import Pattern, parse_pattern, REFlags, Unsupported, InvalidSyntax
from interegular.comparator import Comparator
from interegular.utils import logger

__all__ = ['FSM', 'Pattern', 'Comparator', 'parse_pattern', 'compare_patterns', 'compare_regexes',
           '__version__', 'REFlags', 'Unsupported', 'InvalidSyntax']


def compare_regexes(*regexes: str) -> Iterable[Tuple[str, str]]:
    """
    Checks the regexes for intersections. Returns all pairs it found
    """
    c = Comparator({r: parse_pattern(r) for r in regexes})
    return c.check(regexes)


def compare_patterns(*ps: Pattern) -> Iterable[Tuple[Pattern, Pattern]]:
    """
    Checks the Patterns for intersections. Returns all pairs it found
    """
    c = Comparator({p: p for p in ps})
    return c.check(ps)


__version__ = "0.3.3"
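# Illustrative sketch (not part of the package): how the two entry points above
# relate. parse_pattern builds the intermediate Pattern object once, and
# compare_patterns checks the parsed objects directly, so a pattern does not
# have to be re-parsed when it is compared against many others. The regexes
# are arbitrary demo inputs.
from interegular import parse_pattern, compare_patterns

p1 = parse_pattern(r"a|b")
p2 = parse_pattern(r"[ab]")
p3 = parse_pattern(r"c")
# Only the first pair can match a common string ("a" or "b").
assert list(compare_patterns(p1, p2, p3)) == [(p1, p2)]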
interegular-0.3.3/interegular/comparator.py

from dataclasses import dataclass
from itertools import combinations
from typing import Any, Dict, Iterable, Set, FrozenSet, Optional, Tuple

from interegular import InvalidSyntax, REFlags
from interegular.fsm import FSM, Alphabet, anything_else
from interegular.patterns import Pattern, Unsupported, parse_pattern
from interegular.utils import logger, soft_repr


@dataclass
class ExampleCollision:
    """
    Captures the full text of an example collision between two regexes.

    `main_text` is the part that actually gets captured by the two regexes
    `prefix` is the part that is potentially needed for lookbehinds
    `postfix` is the part that is potentially needed for lookaheads
    """
    prefix: str
    main_text: str
    postfix: str

    def format_multiline(self, intro: str = "Example Collision: ", indent: str = "",
                         force_pointer: bool = False) -> str:
        """
        Formats this example somewhat similar to a python syntax error.

        - intro is added on the first line
        - indent is added on the second line

        The three parts of the example are concatenated and `^` is used to
        underline the main text.

            ExampleCollision(prefix='a', main_text='cd', postfix='ef').format_multiline()

        leads to

            Example Collision: acdef
                                ^^

        This function escapes characters where necessary to stay readable.

        If `force_pointer` is False, the function will not produce the second
        line if only `main_text` is set.
        """
        if len(intro) < len(indent):
            raise ValueError("Can't have intro be shorter than indent")
        prefix = soft_repr(self.prefix)
        main_text = soft_repr(self.main_text)
        postfix = soft_repr(self.postfix)
        text = f"{prefix}{main_text}{postfix}"
        if force_pointer or len(text) != len(main_text):
            whitespace = ' ' * (len(intro) - len(indent) + len(prefix))
            pointers = '^' * len(main_text)
            return f"{intro}{text}\n{indent}{whitespace}{pointers}"
        else:
            return f"{intro}{text}"

    @property
    def full_text(self):
        return self.prefix + self.main_text + self.postfix


class Comparator:
    """
    A class that represents the main interface for comparing a list of regexes
    to each other.

    It expects a dictionary of arbitrary labels mapped to `Pattern` instances,
    but there is a utility function `from_regexes` to create the instances from
    regex strings.

    The main interface functions all expect the arbitrary labels to be given,
    which then get mapped to the correct `Pattern` and/or `FSM` instance.

    There is a utility function `mark(a, b)` which allows marking pairs that
    shouldn't be checked again by `check`.
    """

    def __init__(self, patterns: Dict[Any, Pattern]):
        self._patterns = patterns
        self._marked_pairs: Set[FrozenSet[Any]] = set()
        if not patterns:
            # `isdisjoint` can not be called anyway, so we don't need to create a valid state
            return
        self._alphabet = Alphabet.union(*(p.get_alphabet(REFlags(0)) for p in patterns.values()))[0]
        prefix_postfix_s = [p.prefix_postfix for p in patterns.values()]
        self._prefix_postfix = max(p[0] for p in prefix_postfix_s), max(p[1] for p in prefix_postfix_s)
        self._fsms: Dict[Any, Optional[FSM]] = {}
        self._know_pairs: Dict[Tuple[Any, Any], bool] = {}

    def get_fsm(self, a: Any) -> Optional[FSM]:
        if a not in self._fsms:
            try:
                self._fsms[a] = self._patterns[a].to_fsm(self._alphabet, self._prefix_postfix)
            except Unsupported as e:
                self._fsms[a] = None
                logger.warning(f"Can't compile Pattern to fsm for {a}\n {repr(e)}")
            except KeyError:
                self._fsms[a] = None  # In case it was thrown away in `from_regexes`
        return self._fsms[a]

    def isdisjoint(self, a: Any, b: Any) -> bool:
        if (a, b) not in self._know_pairs:
            fa, fb = self.get_fsm(a), self.get_fsm(b)
            if fa is None or fb is None:
                self._know_pairs[a, b] = True  # We can't know. Assume they are disjoint
            else:
                self._know_pairs[a, b] = fa.isdisjoint(fb)
        return self._know_pairs[a, b]

    def check(self, keys: Optional[Iterable[Any]] = None, skip_marked: bool = False) -> Iterable[Tuple[Any, Any]]:
        if keys is None:
            keys = self._patterns
        for a, b in combinations(keys, 2):
            if skip_marked and self.is_marked(a, b):
                continue
            if not self.isdisjoint(a, b):
                yield a, b

    def get_example_overlap(self, a: Any, b: Any, max_time: Optional[float] = None) -> ExampleCollision:
        pa, pb = self._patterns[a], self._patterns[b]
        needed_pre = max(pa.prefix_postfix[0], pb.prefix_postfix[0])
        needed_post = max(pa.prefix_postfix[1], pb.prefix_postfix[1])
        # We use the optimal alphabet here instead of the general one since that
        # massively improves performance by every metric.
        alphabet = pa.get_alphabet(REFlags(0)).union(pb.get_alphabet(REFlags(0)))[0]
        fa, fb = pa.to_fsm(alphabet, (needed_pre, needed_post)), pb.to_fsm(alphabet, (needed_pre, needed_post))
        intersection = fa.intersection(fb)
        if max_time is None:
            max_iterations = None
        else:
            # We calculate an approximation for the value of max_iterations
            # that makes sure this function finishes in under max_time seconds.
            # These values will heavily depend on CPU, python version, exact patterns
            # and probably more factors, but this should generally be in the correct
            # ballpark.
            max_iterations = int((max_time - 0.09) / (1.4e-6 * len(alphabet)))
        try:
            text = next(intersection.strings(max_iterations))
        except StopIteration:
            raise ValueError(f"No overlap between {a} and {b} exists")
        text = ''.join(c if c != anything_else else '?' for c in text)
        if needed_post > 0:
            return ExampleCollision(text[:needed_pre], text[needed_pre:-needed_post], text[-needed_post:])
        else:
            return ExampleCollision(text[:needed_pre], text[needed_pre:], '')

    def is_marked(self, a: Any, b: Any) -> bool:
        return frozenset({a, b}) in self._marked_pairs

    @property
    def marked_pairs(self):
        return self._marked_pairs

    def count_marked_pairs(self):
        return len(self._marked_pairs)

    def mark(self, a: Any, b: Any):
        self._marked_pairs.add(frozenset({a, b}))

    @classmethod
    def from_regexes(cls, regexes: Dict[Any, str]):
        patterns = {}
        for k, r in regexes.items():
            try:
                patterns[k] = parse_pattern(r)
            except (Unsupported, InvalidSyntax) as e:
                logger.warning(f"Can't compile regex to Pattern for {k}\n {repr(e)}")
        return cls(patterns)
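# Illustrative sketch (not part of the package): the intended Comparator
# workflow -- build it from raw regex strings, list the colliding pairs, render
# an example collision, and mark pairs so later checks can skip them. The
# labels "word", "abc" and "num" are arbitrary dictionary keys chosen for the demo.
from interegular.comparator import Comparator

comp = Comparator.from_regexes({"word": r"[a-z]+", "abc": r"abc", "num": r"[0-9]+"})
for a, b in comp.check(skip_marked=True):
    example = comp.get_example_overlap(a, b, max_time=5)
    print(f"{a} and {b} overlap, e.g.:")
    print(example.format_multiline(intro="  ", indent="  "))
    comp.mark(a, b)  # don't report this pair again on the next check(skip_marked=True)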
interegular-0.3.3/interegular/fsm.py

"""
Finite state machine library, extracted from `greenery.fsm` and adapted by MegaIng
"""
from collections import defaultdict, deque
from functools import total_ordering
from typing import Any, Set, Dict, Union, NewType, Mapping, Tuple, Iterable

from interegular.utils import soft_repr


class _Marker(BaseException):
    pass


@total_ordering
class _AnythingElseCls:
    """
    This is a surrogate symbol which you can use in your finite state machines
    to represent "any symbol not in the official alphabet". For example, if your
    state machine's alphabet is {"a", "b", "c", "d", fsm.anything_else}, then
    you can pass "e" in as a symbol and it will be converted to
    fsm.anything_else, then follow the appropriate transition.
    """

    def __str__(self):
        return "anything_else"

    def __repr__(self):
        return "anything_else"

    def __lt__(self, other):
        return False

    def __eq__(self, other):
        return self is other

    def __hash__(self):
        return hash(id(self))


# We use a class instance because that gives us control over how the special
# value gets serialised. Otherwise this would just be `object()`.
anything_else = _AnythingElseCls() def nice_char_group(chars: Iterable[Union[str, _AnythingElseCls]]): out = [] current_range = [] for c in sorted(chars): if c is not anything_else and current_range and ord(current_range[-1]) + 1 == ord(c): current_range.append(c) continue if len(current_range) >= 2: out.append(f"{soft_repr(current_range[0])}-{soft_repr(current_range[-1])}") else: out.extend(map(soft_repr, current_range)) current_range = [c] if len(current_range) >= 2: out.append(f"{soft_repr(current_range[0])}-{soft_repr(current_range[-1])}") else: out.extend(map(soft_repr, current_range)) return ','.join(out) State = NewType("State", int) TransitionKey = NewType("TransitionKey", int) class Alphabet(Mapping[Any, TransitionKey]): @property def by_transition(self): return self._by_transition def __str__(self): out = [] width = 0 for tk, symbols in sorted(self._by_transition.items()): out.append((nice_char_group(symbols), str(tk))) if len(out[-1][0]) > width: width = len(out[-1][0]) return '\n'.join(f"{a:{width}} | {b}" for a, b in out) def __repr__(self): return f"{type(self).__name__}({self._symbol_mapping!r})" def __len__(self) -> int: return len(self._symbol_mapping) def __iter__(self): return iter(self._symbol_mapping) def __init__(self, symbol_mapping: Dict[Union[str, _AnythingElseCls], TransitionKey]): self._symbol_mapping = symbol_mapping by_transition = defaultdict(list) for s, t in self._symbol_mapping.items(): by_transition[t].append(s) self._by_transition = dict(by_transition) def __getitem__(self, item): if item not in self._symbol_mapping: if anything_else in self._symbol_mapping: return self._symbol_mapping[anything_else] else: return None else: return self._symbol_mapping[item] def __contains__(self, item): return item in self._symbol_mapping def union(*alphabets: 'Alphabet') -> 'Tuple[Alphabet, Tuple[Dict[TransitionKey, TransitionKey], ...]]': all_symbols = frozenset().union(*(a._symbol_mapping.keys() for a in alphabets)) symbol_to_keys = {symbol: tuple(a[symbol] for a in alphabets) for symbol in all_symbols} keys_to_symbols = defaultdict(list) for symbol, keys in symbol_to_keys.items(): keys_to_symbols[keys].append(symbol) keys_to_key = {k: i for i, k in enumerate(keys_to_symbols)} result = Alphabet({symbol: keys_to_key[keys] for keys, symbols in keys_to_symbols.items() for symbol in symbols}) new_to_old_mappings = [{} for _ in alphabets] for keys, new_key in keys_to_key.items(): for old_key, new_to_old in zip(keys, new_to_old_mappings): new_to_old[new_key] = old_key return result, tuple(new_to_old_mappings) @classmethod def from_groups(cls, *groups): return Alphabet({s: TransitionKey(i) for i, group in enumerate(groups) for s in group}) def intersect(self, other: 'Alphabet') -> 'Tuple[Alphabet, Tuple[Dict[TransitionKey, TransitionKey], ...]]': all_symbols = frozenset(self._symbol_mapping).intersection(other._symbol_mapping) symbol_to_keys = {symbol: tuple(a[symbol] for a in (self, other)) for symbol in all_symbols} keys_to_symbols = defaultdict(list) for symbol, keys in symbol_to_keys.items(): keys_to_symbols[keys].append(symbol) keys_to_key = {k: i for i, k in enumerate(keys_to_symbols)} result = Alphabet({symbol: keys_to_key[keys] for keys, symbols in keys_to_symbols.items() for symbol in symbols}) old_to_new_mappings = [defaultdict(list) for _ in (self, other)] new_to_old_mappings = [{} for _ in (self, other)] for keys, new_key in keys_to_key.items(): for old_key, old_to_new, new_to_old in zip(keys, old_to_new_mappings, new_to_old_mappings): 
old_to_new[old_key].append(new_key) new_to_old[new_key] = old_key return result, tuple(new_to_old_mappings) def copy(self): return Alphabet(self._symbol_mapping.copy()) class OblivionError(Exception): """ This exception is thrown while `crawl()`ing an FSM if we transition to the oblivion state. For example while crawling two FSMs in parallel we may transition to the oblivion state of both FSMs at once. This warrants an out-of-bound signal which will reduce the complexity of the new FSM's map. """ pass class FSM: """ A Finite State Machine or FSM has an alphabet and a set of states. At any given moment, the FSM is in one state. When passed a symbol from the alphabet, the FSM jumps to another state (or possibly the same state). A map (Python dictionary) indicates where to jump. One state is nominated as a starting state. Zero or more states are nominated as final states. If, after consuming a string of symbols, the FSM is in a final state, then it is said to "accept" the string. This class also has some pretty powerful methods which allow FSMs to be concatenated, alternated between, multiplied, looped (Kleene star closure), intersected, and simplified. The majority of these methods are available using operator overloads. """ alphabet: Alphabet initial: State states: Set[State] finals: Set[State] map: Dict[State, Dict[TransitionKey, State]] def __setattr__(self, name, value): """Immutability prevents some potential problems.""" raise Exception("This object is immutable.") def __init__(self, alphabet: Alphabet, states, initial, finals, map, *, __no_validation__=False): """ `alphabet` is an iterable of symbols the FSM can be fed. `states` is the set of states for the FSM `initial` is the initial state `finals` is the set of accepting states `map` may be sparse (i.e. it may omit transitions). In the case of omitted transitions, a non-final "oblivion" state is simulated. """ if not __no_validation__: # Validation. Thanks to immutability, this only needs to be carried out once. if not isinstance(alphabet, Alphabet): raise TypeError("Expected an Alphabet instance") if not initial in states: raise Exception("Initial state " + repr(initial) + " must be one of " + repr(states)) if not finals.issubset(states): raise Exception("Final states " + repr(finals) + " must be a subset of " + repr(states)) for state in map.keys(): for symbol in map[state]: if not map[state][symbol] in states: raise Exception( "Transition for state " + repr(state) + " and symbol " + repr(symbol) + " leads to " + repr( map[state][symbol]) + ", which is not a state") # Initialise the hard way due to immutability. self.__dict__["alphabet"] = alphabet self.__dict__["states"] = frozenset(states) self.__dict__["initial"] = initial self.__dict__["finals"] = frozenset(finals) self.__dict__["map"] = map def accepts(self, input: str): """ Test whether the present FSM accepts the supplied string (iterable of symbols). Equivalently, consider `self` as a possibly-infinite set of strings and test whether `string` is a member of it. This is actually mainly used for unit testing purposes. If `fsm.anything_else` is in your alphabet, then any symbol not in your alphabet will be converted to `fsm.anything_else`. 
""" state = self.initial for symbol in input: if anything_else in self.alphabet and not symbol in self.alphabet: symbol = anything_else transition = self.alphabet[symbol] # Missing transition = transition to dead state if not (state in self.map and transition in self.map[state]): return False state = self.map[state][transition] return state in self.finals def __contains__(self, string): """ This lets you use the syntax `"a" in fsm1` to see whether the string "a" is in the set of strings accepted by `fsm1`. """ return self.accepts(string) def reduce(self): """ A result by Brzozowski (1963) shows that a minimal finite state machine equivalent to the original can be obtained by reversing the original twice. """ return self.reversed().reversed() def __repr__(self): string = "fsm(" string += "alphabet = " + repr(self.alphabet) string += ", states = " + repr(self.states) string += ", initial = " + repr(self.initial) string += ", finals = " + repr(self.finals) string += ", map = " + repr(self.map) string += ")" return string def __str__(self): rows = [] # top row row = ["", "name", "final?"] # TODO maybe rework this to show transition groups instead of individual symbols row.extend(soft_repr(symbol) for symbol in sorted(self.alphabet)) rows.append(row) # other rows for state in self.states: row = [] if state == self.initial: row.append("*") else: row.append("") row.append(str(state)) if state in self.finals: row.append("True") else: row.append("False") for symbol, transition in sorted(self.alphabet.items()): if state in self.map and transition in self.map[state]: row.append(str(self.map[state][transition])) else: row.append("") rows.append(row) # column widths colwidths = [] for x in range(len(rows[0])): colwidths.append(max(len(str(rows[y][x])) for y in range(len(rows))) + 1) # apply padding for y in range(len(rows)): for x in range(len(rows[y])): rows[y][x] = rows[y][x].ljust(colwidths[x]) # horizontal line rows.insert(1, ["-" * colwidth for colwidth in colwidths]) return "".join("".join(row) + "\n" for row in rows) def concatenate(*fsms): """ Concatenate arbitrarily many finite state machines together. """ if len(fsms) == 0: return epsilon(Alphabet({})) alphabet, new_to_old = Alphabet.union(*[fsm.alphabet for fsm in fsms]) last_index, last = len(fsms) - 1, fsms[-1] def connect_all(i, substate): """ Take a state in the numbered FSM and return a set containing it, plus (if it's final) the first state from the next FSM, plus (if that's final) the first state from the next but one FSM, plus... """ result = {(i, substate)} while i < last_index and substate in fsms[i].finals: i += 1 substate = fsms[i].initial result.add((i, substate)) return result # Use a superset containing states from all FSMs at once. # We start at the start of the first FSM. If this state is final in the # first FSM, then we are also at the start of the second FSM. And so on. initial = set() if len(fsms) > 0: initial.update(connect_all(0, fsms[0].initial)) initial = frozenset(initial) def final(state): """If you're in a final state of the final FSM, it's final""" for (i, substate) in state: if i == last_index and substate in last.finals: return True return False def follow(current, new_transition): """ Follow the collection of states through all FSMs at once, jumping to the next FSM if we reach the end of the current one TODO: improve all follow() implementations to allow for dead metastates? 
""" next = set() for (i, substate) in current: fsm = fsms[i] if substate in fsm.map and new_to_old[i][new_transition] in fsm.map[substate]: next.update(connect_all(i, fsm.map[substate][new_to_old[i][new_transition]])) if not next: raise OblivionError return frozenset(next) return crawl(alphabet, initial, final, follow) def __add__(self, other): """ Concatenate two finite state machines together. For example, if self accepts "0*" and other accepts "1+(0|1)", will return a finite state machine accepting "0*1+(0|1)". Accomplished by effectively following non-deterministically. Call using "fsm3 = fsm1 + fsm2" """ return self.concatenate(other) def star(self): """ If the present FSM accepts X, returns an FSM accepting X* (i.e. 0 or more Xes). This is NOT as simple as naively connecting the final states back to the initial state: see (b*ab)* for example. """ alphabet = self.alphabet initial = {self.initial} def follow(state, transition): next = set() for substate in state: if substate in self.map and transition in self.map[substate]: next.add(self.map[substate][transition]) # If one of our substates is final, then we can also consider # transitions from the initial state of the original FSM. if substate in self.finals \ and self.initial in self.map \ and transition in self.map[self.initial]: next.add(self.map[self.initial][transition]) if not next: raise OblivionError return frozenset(next) def final(state): return any(substate in self.finals for substate in state) base = crawl(alphabet, initial, final, follow) base.__dict__['finals'] = base.finals | {base.initial} return base def times(self, multiplier): """ Given an FSM and a multiplier, return the multiplied FSM. """ if multiplier < 0: raise Exception("Can't multiply an FSM by " + repr(multiplier)) alphabet = self.alphabet # metastate is a set of iterations+states initial = {(self.initial, 0)} def final(state): """If the initial state is final then multiplying doesn't alter that""" for (substate, iteration) in state: if substate == self.initial \ and (self.initial in self.finals or iteration == multiplier): return True return False def follow(current, transition): next = [] for (substate, iteration) in current: if iteration < multiplier \ and substate in self.map \ and transition in self.map[substate]: next.append((self.map[substate][transition], iteration)) # final of self? merge with initial on next iteration if self.map[substate][transition] in self.finals: next.append((self.initial, iteration + 1)) if len(next) == 0: raise OblivionError return frozenset(next) return crawl(alphabet, initial, final, follow) def __mul__(self, multiplier): """ Given an FSM and a multiplier, return the multiplied FSM. """ return self.times(multiplier) def union(*fsms): """ Treat `fsms` as a collection of arbitrary FSMs and return the union FSM. Can be used as `fsm1.union(fsm2, ...)` or `fsm.union(fsm1, ...)`. `fsms` may be empty. """ return parallel(fsms, any) def __or__(self, other): """ Alternation. Return a finite state machine which accepts any sequence of symbols that is accepted by either self or other. Note that the set of strings recognised by the two FSMs undergoes a set union. Call using "fsm3 = fsm1 | fsm2" """ return self.union(other) def intersection(*fsms): """ Intersection. Take FSMs and AND them together. That is, return an FSM which accepts any sequence of symbols that is accepted by both of the original FSMs. Note that the set of strings recognised by the two FSMs undergoes a set intersection operation. 
Call using "fsm3 = fsm1 & fsm2" """ return parallel(fsms, all) def __and__(self, other): """ Treat the FSMs as sets of strings and return the intersection of those sets in the form of a new FSM. `fsm1.intersection(fsm2, ...)` or `fsm.intersection(fsm1, ...)` are acceptable. """ return self.intersection(other) def symmetric_difference(*fsms): """ Treat `fsms` as a collection of sets of strings and compute the symmetric difference of them all. The python set method only allows two sets to be operated on at once, but we go the extra mile since it's not too hard. """ return parallel(fsms, lambda accepts: (accepts.count(True) % 2) == 1) def __xor__(self, other): """ Symmetric difference. Returns an FSM which recognises only the strings recognised by `self` or `other` but not both. """ return self.symmetric_difference(other) def everythingbut(self): """ Return a finite state machine which will accept any string NOT accepted by self, and will not accept any string accepted by self. This is more complicated if there are missing transitions, because the missing "dead" state must now be reified. """ alphabet = self.alphabet initial = {0: self.initial} def follow(current, transition): next = {} if 0 in current and current[0] in self.map and transition in self.map[current[0]]: next[0] = self.map[current[0]][transition] return next # state is final unless the original was def final(state): return not (0 in state and state[0] in self.finals) return crawl(alphabet, initial, final, follow) def isdisjoint(self, other: 'FSM') -> bool: alphabet, new_to_old = self.alphabet.intersect(other.alphabet) initial = (self.initial, other.initial) # dedicated function accepts a "superset" and returns the next "superset" # obtained by following this transition in the new FSM def follow(current, transition): ss, os = current if ss in self.map and new_to_old[0][transition] in self.map[ss]: sn = self.map[ss][new_to_old[0][transition]] else: sn = None if os in other.map and new_to_old[1][transition] in other.map[os]: on = other.map[os][new_to_old[1][transition]] else: on = None if not sn or not on: raise OblivionError return sn, on def final(state): if state[0] in self.finals and state[1] in other.finals: # We found a situation where we are in an final state in both fsm raise _Marker try: crawl_hash_no_result(alphabet, initial, final, follow) except _Marker: return False else: return True def reversed(self): """ Return a new FSM such that for every string that self accepts (e.g. "beer", the new FSM accepts the reversed string ("reeb"). """ alphabet = self.alphabet # Start from a composite "state-set" consisting of all final states. # If there are no final states, this set is empty and we'll find that # no other states get generated. initial = frozenset(self.finals) # Speed up follow by pre-computing reverse-transition map reverse_map = {} for state, transition_map in self.map.items(): for transition, next_state in transition_map.items(): if (next_state, transition) not in reverse_map: reverse_map[(next_state, transition)] = set() reverse_map[(next_state, transition)].add(state) # Find every possible way to reach the current state-set # using this symbol. def follow(current, transition): next_states = set() for state in current: next_states.update(reverse_map.get((state, transition), set())) if not next_states: raise OblivionError return frozenset(next_states) # A state-set is final if the initial state is in it. def final(state): return self.initial in state # Man, crawl() is the best! 
        return crawl(alphabet, initial, final, follow)
        # Do not reduce() the result, since reduce() calls us in turn

    def __reversed__(self):
        """
        Return a new FSM such that for every string that self accepts (e.g.
        "beer"), the new FSM accepts the reversed string ("reeb").
        """
        return self.reversed()

    def islive(self, state):
        """A state is "live" if a final state can be reached from it."""
        seen = {state}
        reachable = [state]
        i = 0
        while i < len(reachable):
            current = reachable[i]
            if current in self.finals:
                return True
            if current in self.map:
                for transition in self.map[current]:
                    next = self.map[current][transition]
                    if next not in seen:
                        reachable.append(next)
                        seen.add(next)
            i += 1
        return False

    def empty(self):
        """
        An FSM is empty if it recognises no strings. An FSM may be arbitrarily
        complicated and have arbitrarily many final states while still recognising
        no strings because those final states may all be inaccessible from the
        initial state. Equally, an FSM may be non-empty despite having an empty
        alphabet if the initial state is final.
        """
        return not self.islive(self.initial)

    def strings(self, max_iterations=None):
        """
        Generate strings (lists of symbols) that this FSM accepts. Since there may
        be infinitely many of these we use a generator instead of constructing a
        static list. Strings will be sorted in order of length and then lexically.
        This procedure uses arbitrary amounts of memory but is very fast. There
        may be more efficient ways to do this, that I haven't investigated yet.
        You can use this in list comprehensions.

        `max_iterations` controls how many attempts will be made to generate
        strings. For complex FSMs it can take minutes to actually find something.
        If this isn't acceptable, provide a value for `max_iterations`. As a rough
        guide, expect about 0.15 seconds per 10_000 iterations per 10 symbols.
        """

        # Many FSMs have "dead states". Once you reach a dead state, you can no
        # longer reach a final state. Since many strings may end up here, it's
        # advantageous to constrain our search to live states only.
        livestates = set(state for state in self.states if self.islive(state))

        # We store a list of tuples. Each tuple consists of an input string and the
        # state that this input string leads to. This means we don't have to run the
        # state machine from the very beginning every time we want to check a new
        # string.
        # We use a deque instead of a list since we append to the end and pop from
        # the beginning
        strings = deque()

        # Initial entry (or possibly not, in which case this is a short one)
        cstate = self.initial
        cstring = []
        if cstate in livestates:
            if cstate in self.finals:
                yield cstring
            strings.append((cstring, cstate))

        # Fixed point calculation
        i = 0
        while strings:
            (cstring, cstate) = strings.popleft()
            i += 1
            if cstate in self.map:
                for transition in sorted(self.map[cstate]):
                    nstate = self.map[cstate][transition]
                    if nstate in livestates:
                        for symbol in sorted(self.alphabet.by_transition[transition]):
                            nstring = cstring + [symbol]
                            if nstate in self.finals:
                                yield nstring
                            strings.append((nstring, nstate))
            if max_iterations is not None and i > max_iterations:
                raise ValueError(f"Couldn't find an example within {max_iterations} iterations")

    def __iter__(self):
        """
        This allows you to do `for string in fsm1` as a list comprehension!
        """
        return self.strings()

    def equivalent(self, other):
        """
        Two FSMs are considered equivalent if they recognise the same strings.
        Or, to put it another way, if their symmetric difference recognises no
        strings.
""" return (self ^ other).empty() def __eq__(self, other): """ You can use `fsm1 == fsm2` to determine whether two FSMs recognise the same strings. """ return self.equivalent(other) def different(self, other): """ Two FSMs are considered different if they have a non-empty symmetric difference. """ return not (self ^ other).empty() def __ne__(self, other): """ Use `fsm1 != fsm2` to determine whether two FSMs recognise different strings. """ return self.different(other) def difference(*fsms): """ Difference. Returns an FSM which recognises only the strings recognised by the first FSM in the list, but none of the others. """ return parallel(fsms, lambda accepts: accepts[0] and not any(accepts[1:])) def __sub__(self, other): return self.difference(other) def cardinality(self): """ Consider the FSM as a set of strings and return the cardinality of that set, or raise an OverflowError if there are infinitely many """ num_strings = {} def get_num_strings(state): # Many FSMs have at least one oblivion state if self.islive(state): if state in num_strings: if num_strings[state] is None: # "computing..." # Recursion! There are infinitely many strings recognised raise OverflowError(state) return num_strings[state] num_strings[state] = None # i.e. "computing..." n = 0 if state in self.finals: n += 1 if state in self.map: for transition in self.map[state]: n += get_num_strings(self.map[state][transition]) * len(self.alphabet.by_transition[transition]) num_strings[state] = n else: # Dead state num_strings[state] = 0 return num_strings[state] return get_num_strings(self.initial) def __len__(self): """ Consider the FSM as a set of strings and return the cardinality of that set, or raise an OverflowError if there are infinitely many """ return self.cardinality() def issubset(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a subset of `other`... `self` recognises no strings which `other` doesn't. """ return (self - other).empty() def __le__(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a subset of `other`... `self` recognises no strings which `other` doesn't. """ return self.issubset(other) def ispropersubset(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a proper subset of `other`. """ return self <= other and self != other def __lt__(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a strict subset of `other`. """ return self.ispropersubset(other) def issuperset(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a superset of `other`. """ return (other - self).empty() def __ge__(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a superset of `other`. """ return self.issuperset(other) def ispropersuperset(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a proper superset of `other`. """ return self >= other and self != other def __gt__(self, other): """ Treat `self` and `other` as sets of strings and see if `self` is a strict superset of `other`. """ return self.ispropersuperset(other) def copy(self): """ For completeness only, since `set.copy()` also exists. FSM objects are immutable, so I can see only very odd reasons to need this. 
""" return FSM( alphabet=self.alphabet.copy(), states=self.states.copy(), initial=self.initial, finals=self.finals.copy(), map=self.map.copy(), __no_validation__=True, ) def derive(self, input): """ Compute the Brzozowski derivative of this FSM with respect to the input string of symbols. If any of the symbols are not members of the alphabet, that's a KeyError. If you fall into oblivion, then the derivative is an FSM accepting no strings. """ try: # Consume the input string. state = self.initial for symbol in input: if not symbol in self.alphabet: if not anything_else in self.alphabet: raise KeyError(symbol) symbol = anything_else # Missing transition = transition to dead state if not (state in self.map and self.alphabet[symbol] in self.map[state]): raise OblivionError state = self.map[state][self.alphabet[symbol]] # OK so now we have consumed that string, use the new location as the # starting point. return FSM( alphabet=self.alphabet, states=self.states, initial=state, finals=self.finals, map=self.map, __no_validation__=True, ) except OblivionError: # Fell out of the FSM. The derivative of this FSM is the empty FSM. return null(self.alphabet) def null(alphabet): """ An FSM accepting nothing (not even the empty string). This is demonstrates that this is possible, and is also extremely useful in some situations """ return FSM( alphabet=alphabet, states={0}, initial=0, finals=set(), map={ 0: dict([(transition, 0) for transition in alphabet.by_transition]), }, __no_validation__=True, ) def epsilon(alphabet): """ Return an FSM matching an empty string, "", only. This is very useful in many situations """ return FSM( alphabet=alphabet, states={0}, initial=0, finals={0}, map={}, __no_validation__=True, ) def parallel(fsms, test): """ Crawl several FSMs in parallel, mapping the states of a larger meta-FSM. To determine whether a state in the larger FSM is final, pass all of the finality statuses (e.g. [True, False, False] to `test`. """ alphabet, new_to_old = Alphabet.union(*[fsm.alphabet for fsm in fsms]) initial = {i: fsm.initial for (i, fsm) in enumerate(fsms)} # dedicated function accepts a "superset" and returns the next "superset" # obtained by following this transition in the new FSM def follow(current, new_transition, fsm_range=tuple(enumerate(fsms))): next = {} for i, f in fsm_range: old_transition = new_to_old[i][new_transition] if i in current \ and current[i] in f.map \ and old_transition in f.map[current[i]]: next[i] = f.map[current[i]][old_transition] if not next: raise OblivionError return next # Determine the "is final?" condition of each substate, then pass it to the # test to determine finality of the overall FSM. def final(state, fsm_range=tuple(enumerate(fsms))): accepts = [i in state and state[i] in fsm.finals for (i, fsm) in fsm_range] return test(accepts) return crawl(alphabet, initial, final, follow) def crawl_hash_no_result(alphabet, initial, final, follow): unvisited = {initial} visited = set() while unvisited: state = unvisited.pop() visited.add(state) # add to finals final(state) # compute map for this state for transition in alphabet.by_transition: try: new = follow(state, transition) except OblivionError: # Reached an oblivion state. Don't list it. continue else: if new not in visited: unvisited.add(new) def crawl(alphabet, initial, final, follow): """ Given the above conditions and instructions, crawl a new unknown FSM, mapping its states, final states and transitions. Return the new FSM. 
interegular-0.3.3/interegular/patterns.py

"""
Allows the parsing of python-style regexes to FSMs.

Main access point is `parse_pattern(str) -> Pattern`. Most other classes are
internal and should not be used.
"""
from abc import abstractmethod, ABC
from dataclasses import dataclass
from enum import Flag, auto
from textwrap import indent
from typing import Iterable, FrozenSet, Optional, Tuple, Union

from interegular.fsm import FSM, anything_else, epsilon, Alphabet
from interegular.utils.simple_parser import SimpleParser, nomatch, NoMatch

__all__ = ['parse_pattern', 'Pattern', 'Unsupported', 'InvalidSyntax', 'REFlags']


class Unsupported(Exception):
    pass


class InvalidSyntax(Exception):
    pass


class REFlags(Flag):
    CASE_INSENSITIVE = I = auto()
    MULTILINE = M = auto()
    SINGLE_LINE = S = auto()


_flags = {
    'i': REFlags.I,
    'm': REFlags.M,
    's': REFlags.S,
}


def _get_flags(plus: str) -> REFlags:
    res = REFlags(0)
    for c in plus:
        try:
            res |= _flags[c]
        except KeyError:
            raise Unsupported(f"Flag {c} is not implemented")
    return res


def _combine_flags(base: REFlags, added: REFlags, removed: REFlags):
    base |= added
    base &= ~removed
    # TODO: Check for incorrect combinations (aLu)
    return base


@dataclass(frozen=True)
class _BasePattern(ABC):
    __slots__ = '_alphabet_cache', '_prefix_cache', '_lengths_cache'

    @abstractmethod
    def to_fsm(self, alphabet=None, prefix_postfix=None, flags=None) -> FSM:
        raise NotImplementedError

    @abstractmethod
    def _get_alphabet(self, flags: REFlags) -> Alphabet:
        raise NotImplementedError

    def get_alphabet(self, flags: REFlags) -> Alphabet:
        if not hasattr(self, '_alphabet_cache'):
            super(_BasePattern, self).__setattr__('_alphabet_cache', {})
        if flags not in self._alphabet_cache:
            self._alphabet_cache[flags] = self._get_alphabet(flags)
        return self._alphabet_cache[flags]

    @abstractmethod
    def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]:
        raise NotImplementedError

    @property
    def prefix_postfix(self) -> Tuple[int, Optional[int]]:
        """Returns the number of dots that have to be pre-/postfixed to support look(aheads|backs)"""
        if not hasattr(self, '_prefix_cache'):
            super(_BasePattern, self).__setattr__('_prefix_cache', self._get_prefix_postfix())
        return self._prefix_cache

    @abstractmethod
    def _get_lengths(self) -> Tuple[int, Optional[int]]:
        raise NotImplementedError

    @property
    def lengths(self) -> Tuple[int, Optional[int]]:
        """Returns the minimum and maximum length that this pattern can match
        (maximum can be None for infinite length)"""
        if not hasattr(self, '_lengths_cache'):
            super(_BasePattern, self).__setattr__('_lengths_cache', self._get_lengths())
        return self._lengths_cache

    @abstractmethod
    def simplify(self) -> '_BasePattern':
        raise
NotImplementedError class _Repeatable(_BasePattern, ABC): pass @dataclass(frozen=True) class _CharGroup(_Repeatable): """Represents the smallest possible pattern that can be matched: A single char. Direct port from the lego module""" chars: FrozenSet[str] negated: bool __slots__ = 'chars', 'negated' def _get_alphabet(self, flags: REFlags) -> Alphabet: if flags & REFlags.CASE_INSENSITIVE: relevant = {*map(str.lower, self.chars), *map(str.upper, self.chars)} else: relevant = self.chars return Alphabet.from_groups(relevant, {anything_else}) def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: return 0, 0 def _get_lengths(self) -> Tuple[int, Optional[int]]: return 1, 1 def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: if alphabet is None: alphabet = self.get_alphabet(flags) if prefix_postfix is None: prefix_postfix = self.prefix_postfix if prefix_postfix != (0, 0): raise ValueError("Can not have prefix/postfix on CharGroup-level") insensitive = flags & REFlags.CASE_INSENSITIVE flags &= ~REFlags.CASE_INSENSITIVE flags &= ~REFlags.SINGLE_LINE if flags: raise Unsupported(flags) if insensitive: chars = frozenset({*(c.lower() for c in self.chars), *(c.upper() for c in self.chars)}) else: chars = self.chars # State: 0 is initial, 1 is final # If negated, make a singular FSM accepting any other characters if self.negated: mapping = { 0: {alphabet[symbol]: 1 for symbol in set(alphabet) - chars}, } # If normal, make a singular FSM accepting only these characters else: mapping = { 0: {alphabet[symbol]: 1 for symbol in chars}, } return FSM( alphabet=alphabet, states={0, 1}, initial=0, finals={1}, map=mapping, ) def simplify(self) -> '_CharGroup': return self def _combine_char_groups(*groups: _CharGroup, negate): pos = set().union(*(g.chars for g in groups if not g.negated)) neg = set().union(*(g.chars for g in groups if g.negated)) if neg: return _CharGroup(frozenset(neg - pos), not negate) else: return _CharGroup(frozenset(pos - neg), negate) @dataclass(frozen=True) class __DotCls(_Repeatable): def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: if alphabet is None: alphabet = self.get_alphabet(flags) if flags is None or not flags & REFlags.SINGLE_LINE: symbols = set(alphabet) - {'\n'} else: symbols = alphabet return FSM( alphabet=alphabet, states={0, 1}, initial=0, finals={1}, map={0: {alphabet[sym]: 1 for sym in symbols}}, ) def _get_alphabet(self, flags: REFlags) -> Alphabet: if flags & REFlags.SINGLE_LINE: return Alphabet.from_groups({anything_else}) else: return Alphabet.from_groups({anything_else}, {'\n'}) def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: return 0, 0 def _get_lengths(self) -> Tuple[int, Optional[int]]: return 1, 1 def simplify(self) -> '__DotCls': return self @dataclass(frozen=True) class __EmptyCls(_BasePattern): def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: if alphabet is None: alphabet = self.get_alphabet(flags) return epsilon(alphabet) def _get_alphabet(self, flags: REFlags) -> Alphabet: return Alphabet.from_groups({anything_else}) def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: return 0, 0 def _get_lengths(self) -> Tuple[int, Optional[int]]: return 0, 0 def simplify(self) -> '__EmptyCls': return self _DOT = __DotCls() _EMPTY = __EmptyCls() _NONE = _CharGroup(frozenset(""), False) _ALL = _CharGroup(frozenset(""), True) _CHAR_GROUPS = { 'w': _CharGroup(frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"), False), 'W': 
_CharGroup(frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"), True), 'd': _CharGroup(frozenset("0123456789"), False), 'D': _CharGroup(frozenset("0123456789"), True), 's': _CharGroup(frozenset(" \t\n\r\f\v"), False), 'S': _CharGroup(frozenset(" \t\n\r\f\v"), True), 'a': _CharGroup(frozenset("\a"), False), 'b': _CharGroup(frozenset("\b"), False), 'f': _CharGroup(frozenset("\f"), False), 'n': _CharGroup(frozenset("\n"), False), 'r': _CharGroup(frozenset("\r"), False), 't': _CharGroup(frozenset("\t"), False), 'v': _CharGroup(frozenset("\v"), False), } @dataclass(frozen=True) class _Repeated(_BasePattern): """Represents a repeated pattern. `base` can be matched from `min` to `max` times. `max` may be None to signal infinite""" base: _Repeatable min: int max: Optional[int] def __str__(self): return f"Repeated[{self.min}:{self.max if self.max is not None else ''}]:\n" \ f"{indent(str(self.base), ' ')}" def _get_alphabet(self, flags: REFlags) -> Alphabet: return self.base.get_alphabet(flags) def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: return self.base.prefix_postfix def _get_lengths(self) -> Tuple[int, Optional[int]]: l, h = self.base.lengths return l * self.min, (h * self.max if None not in (h, self.max) else None) def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: if alphabet is None: alphabet = self.get_alphabet(flags) if prefix_postfix is None: prefix_postfix = self.prefix_postfix if prefix_postfix != (0, 0): raise ValueError("Can not have prefix/postfix on CharGroup-level") unit = self.base.to_fsm(alphabet, (0, 0), flags=flags) mandatory = unit * self.min if self.max is None: optional = unit.star() else: optional = unit.copy() optional.__dict__['finals'] |= {optional.initial} optional *= (self.max - self.min) return mandatory + optional def simplify(self) -> '_Repeated': return self.__class__(self.base.simplify(), self.min, self.max) _ALL_STAR = _Repeated(_ALL, 0, None) @dataclass(frozen=True) class _NonCapturing: """Represents a lookahead/lookback. Matches `inner` without 'consuming' anything. Can be negated. Only valid inside a `_Concatenation`""" inner: _BasePattern backwards: bool negate: bool __slots__ = 'inner', 'backwards', 'negate' def get_alphabet(self, flags: REFlags) -> Alphabet: return self.inner.get_alphabet(flags) def simplify(self) -> '_NonCapturing': return self.__class__(self.inner.simplify(), self.backwards, self.negate) @dataclass(frozen=True) class _Concatenation(_BasePattern): """Represents multiple Patterns that have to be match in a row. Can contain `_NonCapturing`""" parts: Tuple[Union[_BasePattern, _NonCapturing], ...] __slots__ = 'parts', def __str__(self): return "Concatenation:\n" + "\n".join(indent(str(p), ' ') for p in self.parts) def _get_alphabet(self, flags: REFlags) -> Alphabet: return Alphabet.union(*(p.get_alphabet(flags) for p in self.parts))[0] def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: pre = 0 # What is the longest a lookback could stick out over the beginning? off = 0 # How many chars have been consumed, e.g what is the minimum length? for p in self.parts: if not isinstance(p, _NonCapturing): off += p.lengths[0] elif p.backwards: a, b = p.inner.lengths if a != b: raise InvalidSyntax(f"lookbacks have to have fixed length {(a, b)}") req = a - off if req > pre: pre = req post = 0 off = 0 for p in reversed(self.parts): if not isinstance(p, _NonCapturing): off += p.lengths[0] elif not p.backwards: a, b = p.inner.lengths if b is None: req = a - off # TODO: is this correct? 
else: req = b - off if req > post: post = req return pre, post def _get_lengths(self) -> Tuple[int, Optional[int]]: low, high = 0, 0 for p in self.parts: if not isinstance(p, _NonCapturing): pl, ph = p.lengths low += pl high = high + ph if None not in (high, ph) else None return low, high def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: if alphabet is None: alphabet = self.get_alphabet(flags) if prefix_postfix is None: prefix_postfix = self.prefix_postfix if prefix_postfix[0] < self.prefix_postfix[0] or prefix_postfix[1] < self.prefix_postfix[1]: raise Unsupported("Group can not have lookbacks/lookaheads that go beyond the group bounds.") all_ = _ALL.to_fsm(alphabet) all_star = all_.star() fsm_parts = [] current = [all_.times(prefix_postfix[0])] for part in self.parts: if isinstance(part, _NonCapturing): inner = part.inner.to_fsm(alphabet, (0, 0), flags) if part.backwards: raise Unsupported("lookbacks are not implemented") else: # try: # inner.cardinality() # except OverflowError: # raise NotImplementedError("Can not deal with infinite length lookaheads") fsm_parts.append((None, current)) fsm_parts.append((part, inner)) current = [] else: current.append(part.to_fsm(alphabet, (0, 0), flags)) current.append(all_.times(prefix_postfix[1])) result = FSM.concatenate(*current) for m, f in reversed(fsm_parts): if m is None: result = FSM.concatenate(*f, result) else: assert isinstance(m, _NonCapturing) and not m.backwards if m.negate: result = result.difference(f + all_star) # TODO: This does not feel right... else: result = result.intersection(f + all_star) return result def simplify(self) -> '_Concatenation': return self.__class__(tuple(p.simplify() for p in self.parts)) @dataclass(frozen=True) class Pattern(_Repeatable): options: Tuple[_BasePattern, ...] 
added_flags: REFlags = REFlags(0) removed_flags: REFlags = REFlags(0) def __str__(self): return "Pattern:\n" + "\n".join(indent(str(o), ' ') for o in self.options) def _get_alphabet(self, flags: REFlags) -> Alphabet: flags = _combine_flags(flags, self.added_flags, self.removed_flags) return Alphabet.union(*(p.get_alphabet(flags) for p in self.options))[0] def _get_lengths(self) -> Tuple[int, Optional[int]]: low, high = None, 0 for o in self.options: ol, oh = o.lengths if low is None or ol < low: low = ol if oh is None or (high is not None and oh > high): high = oh return low, high def _get_prefix_postfix(self) -> Tuple[int, Optional[int]]: pre, post = 0, 0 for o in self.options: opre, opost = o.prefix_postfix if opre > pre: pre = opre if opost is None or (post is not None and opost > post): post = opost return pre, post def to_fsm(self, alphabet=None, prefix_postfix=None, flags=REFlags(0)) -> FSM: flags = _combine_flags(flags, self.added_flags, self.removed_flags) if alphabet is None: alphabet = self.get_alphabet(flags) if prefix_postfix is None: prefix_postfix = self.prefix_postfix return FSM.union(*(o.to_fsm(alphabet, prefix_postfix, flags) for o in self.options)) def with_flags(self, added: REFlags, removed: REFlags = REFlags(0)) -> 'Pattern': return self.__class__(self.options, added, removed) def simplify(self) -> 'Pattern': if len(self.options) == 1: o = self.options[0] if isinstance(o, _Concatenation) and len(o.parts) == 1 and isinstance(o.parts[0], Pattern): p: Pattern = o.parts[0].simplify() f = _combine_flags(_combine_flags(REFlags(0), self.added_flags, self.removed_flags), p.added_flags, p.removed_flags) return p.with_flags(f) return self.__class__(tuple(o.simplify() for o in self.options), self.added_flags, self.removed_flags) class _ParsePattern(SimpleParser[Pattern]): SPECIAL_CHARS_STANDARD: FrozenSet[str] = frozenset({ '+', '?', '*', '.', '$', '^', '\\', '(', ')', '[', '|' }) SPECIAL_CHARS_INNER: FrozenSet[str] = frozenset({ '\\', ']' }) RESERVED_ESCAPES: FrozenSet[str] = frozenset({ 'u', 'U', 'A', 'Z', 'b', 'B' }) def __init__(self, data: str): super(_ParsePattern, self).__init__(data) self.flags = None def parse(self): try: return super(_ParsePattern, self).parse() except NoMatch: raise InvalidSyntax def start(self): self.flags = None p = self.pattern() if self.flags is not None: p = p.with_flags(self.flags) return p def pattern(self): options = [self.conc()] while self.static_b('|'): options.append(self.conc()) return Pattern(tuple(options)) def conc(self): parts = [] while True: try: parts.append(self.obj()) except nomatch: break return _Concatenation(tuple(parts)) def obj(self): if self.static_b("("): return self.group() return self.repetition(self.atom()) def group(self): if self.static_b("?"): return self.extension_group() else: p = self.pattern() self.static(")") return self.repetition(p) def extension_group(self): c = self.any() if c in 'aiLmsux-': self.index -= 1 added_flags = self.multiple('aiLmsux', 0, None) if self.static_b('-'): removed_flags = self.multiple('aiLmsux', 1, None) else: removed_flags = '' if self.static_b(':'): p = self.pattern() p = p.with_flags(_get_flags(added_flags), _get_flags(removed_flags)) self.static(")") return self.repetition(p) elif removed_flags != '': raise nomatch else: self.static(')') self.flags = _get_flags(added_flags) return _EMPTY elif c == ':': p = self.pattern() self.static(")") return self.repetition(p) elif c == 'P': if self.static_b('<'): self.multiple('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_', 1, 
class _ParsePattern(SimpleParser[Pattern]):
    SPECIAL_CHARS_STANDARD: FrozenSet[str] = frozenset({
        '+', '?', '*', '.', '$', '^', '\\', '(', ')', '[', '|'
    })
    SPECIAL_CHARS_INNER: FrozenSet[str] = frozenset({
        '\\', ']'
    })
    RESERVED_ESCAPES: FrozenSet[str] = frozenset({
        'u', 'U', 'A', 'Z', 'b', 'B'
    })

    def __init__(self, data: str):
        super(_ParsePattern, self).__init__(data)
        self.flags = None

    def parse(self):
        try:
            return super(_ParsePattern, self).parse()
        except NoMatch:
            raise InvalidSyntax

    def start(self):
        self.flags = None
        p = self.pattern()
        if self.flags is not None:
            p = p.with_flags(self.flags)
        return p

    def pattern(self):
        options = [self.conc()]
        while self.static_b('|'):
            options.append(self.conc())
        return Pattern(tuple(options))

    def conc(self):
        parts = []
        while True:
            try:
                parts.append(self.obj())
            except nomatch:
                break
        return _Concatenation(tuple(parts))

    def obj(self):
        if self.static_b("("):
            return self.group()
        return self.repetition(self.atom())

    def group(self):
        if self.static_b("?"):
            return self.extension_group()
        else:
            p = self.pattern()
            self.static(")")
            return self.repetition(p)

    def extension_group(self):
        c = self.any()
        if c in 'aiLmsux-':
            self.index -= 1
            added_flags = self.multiple('aiLmsux', 0, None)
            if self.static_b('-'):
                removed_flags = self.multiple('aiLmsux', 1, None)
            else:
                removed_flags = ''
            if self.static_b(':'):
                p = self.pattern()
                p = p.with_flags(_get_flags(added_flags), _get_flags(removed_flags))
                self.static(")")
                return self.repetition(p)
            elif removed_flags != '':
                raise nomatch
            else:
                self.static(')')
                self.flags = _get_flags(added_flags)
                return _EMPTY
        elif c == ':':
            p = self.pattern()
            self.static(")")
            return self.repetition(p)
        elif c == 'P':
            if self.static_b('<'):
                self.multiple('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_', 1, None)
                self.static('>')
                p = self.pattern()
                self.static(")")
                return self.repetition(p)
            elif self.static_b('='):
                raise Unsupported("Group references are not implemented")
        elif c == '#':
            while not self.static_b(')'):
                self.any()
            return _EMPTY  # a comment group matches the empty string
        elif c == '=':
            p = self.pattern()
            self.static(")")
            return _NonCapturing(p, False, False)
        elif c == '!':
            p = self.pattern()
            self.static(")")
            return _NonCapturing(p, False, True)
        elif c == '<':
            c = self.any()
            if c == '=':
                p = self.pattern()
                self.static(")")
                return _NonCapturing(p, True, False)
            elif c == '!':
                p = self.pattern()
                self.static(")")
                return _NonCapturing(p, True, True)
        elif c == '(':
            raise Unsupported("Conditional matching is not implemented")
        else:
            raise InvalidSyntax(
                f"Unknown group-extension: {c!r} "
                f"(Context: {self.data[self.index - 3:self.index + 5]!r})")

    def atom(self):
        if self.static_b("["):
            return self.repetition(self.chargroup())
        elif self.static_b("\\"):
            return self.repetition(self.escaped())
        elif self.static_b("."):
            return self.repetition(_DOT)
        elif self.static_b("$"):
            raise Unsupported("'$'")
        elif self.static_b("^"):
            raise Unsupported("'^'")
        else:
            c = self.any_but(*self.SPECIAL_CHARS_STANDARD)
            return self.repetition(_CharGroup(frozenset({c}), False))

    def repetition(self, base: _Repeatable):
        if self.static_b("*"):
            if self.static_b("?"):  # lazy modifiers don't change the matched language
                pass
            return _Repeated(base, 0, None)
        elif self.static_b("+"):
            if self.static_b("?"):
                pass
            return _Repeated(base, 1, None)
        elif self.static_b("?"):
            if self.static_b("?"):
                pass
            return _Repeated(base, 0, 1)
        elif self.static_b("{"):
            try:
                n = self.number()
            except nomatch:
                n = 0
            if self.static_b(','):
                try:
                    m = self.number()
                except nomatch:
                    m = None
            else:
                m = n
            self.static("}")
            if self.static_b('?'):
                pass
            return _Repeated(base, n, m)
        else:
            return base

    def number(self) -> int:
        return int(self.multiple("0123456789", 1, None))
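    # A sketch of the repetition handling above (the example pattern is an
    # assumption for illustration): for r"a{2,5}?", atom() consumes 'a', then
    # repetition() turns "{2,5}" into _Repeated(base, 2, 5); the trailing '?'
    # is consumed and dropped, since laziness changes match priority but not
    # the set of accepted strings.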
    def escaped(self, inner=False):
        if self.static_b("x"):
            n = self.multiple("0123456789abcdefABCDEF", 2, 2)
            c = chr(int(n, 16))
            return _CharGroup(frozenset({c}), False)
        if self.static_b("0"):
            n = self.multiple("01234567", 1, 2)
            c = chr(int(n, 8))
            return _CharGroup(frozenset({c}), False)
        if self.anyof_b('N', 'p', 'P', 'u', 'U'):
            raise Unsupported('regex module unicode properties are not supported.')
        if not inner:
            try:
                n = self.multiple("01234567", 3, 3)
            except nomatch:
                pass
            else:
                c = chr(int(n, 8))
                return _CharGroup(frozenset({c}), False)
            try:
                self.multiple("0123456789", 1, 2)
            except nomatch:
                pass
            else:
                raise Unsupported("Group references are not implemented")
        else:
            try:
                n = self.multiple("01234567", 1, 3)
            except nomatch:
                pass
            else:
                c = chr(int(n, 8))
                return _CharGroup(frozenset({c}), False)
        if not inner:
            try:
                c = self.anyof(*self.RESERVED_ESCAPES)
            except nomatch:
                pass
            else:
                raise Unsupported(f"Escape \\{c} is not implemented")
        try:
            c = self.anyof(*_CHAR_GROUPS)
        except nomatch:
            pass
        else:
            return _CHAR_GROUPS[c]
        c = self.any_but("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
        if c.isalpha():
            raise nomatch
        return _CharGroup(frozenset(c), False)

    def chargroup(self):
        if self.static_b("^"):
            negate = True
        else:
            negate = False
        groups = []
        while True:
            try:
                groups.append(self.chargroup_inner())
            except nomatch:
                break
        self.static("]")
        if len(groups) == 1:
            f = tuple(groups)[0]
            return _CharGroup(f.chars, negate ^ f.negated)
        elif len(groups) == 0:
            return _CharGroup(frozenset({}), negate)
        else:
            return _combine_char_groups(*groups, negate=negate)

    def chargroup_inner(self) -> _CharGroup:
        start = self.index
        if self.static_b('\\'):
            base = self.escaped(True)
        else:
            base = _CharGroup(frozenset(self.any_but(*self.SPECIAL_CHARS_INNER)), False)
        if self.static_b('-'):
            if self.static_b('\\'):
                end = self.escaped(True)
            elif self.peek_static(']'):
                return _combine_char_groups(base, _CharGroup(frozenset('-'), False), negate=False)
            else:
                end = _CharGroup(frozenset(self.any_but(*self.SPECIAL_CHARS_INNER)), False)
            if len(base.chars) != 1 or len(end.chars) != 1:
                raise InvalidSyntax(f"Invalid Character-range: {self.data[start:self.index]}")
            low, high = ord(*base.chars), ord(*end.chars)
            if low > high:
                raise InvalidSyntax(f"Invalid Character-range: {self.data[start:self.index]}")
            return _CharGroup(frozenset(chr(i) for i in range(low, high + 1)), False)
        return base


def parse_pattern(pattern: str) -> Pattern:
    p = _ParsePattern(pattern)
    out = p.parse()
    out = out.simplify()
    return out
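# A minimal usage sketch of the public entry point (the regex and inputs are
# assumed examples; `accepts` is provided by the FSM class):
if __name__ == '__main__':
    fsm = parse_pattern(r"[a-c]+").to_fsm()
    print(fsm.accepts("abcb"))  # True: every char falls inside [a-c]
    print(fsm.accepts("abd"))   # False: 'd' is outside the class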
interegular-0.3.3/interegular/py.typed

interegular-0.3.3/interegular/utils/__init__.py
import logging
from typing import Iterable

logger = logging.getLogger('interegular')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.CRITICAL)


def soft_repr(s: str):
    """
    Joins together the repr of each char in the string, while throwing away
    the quotes around each repr.

    This for example turns `"'\""` into `'"` instead of `\'"` like the normal
    repr would do.
    """
    return ''.join(repr(c)[1:-1] for c in str(s))

interegular-0.3.3/interegular/utils/simple_parser.py
"""
A small util to simplify the creation of parsers for simple
context-free grammars.
"""
from abc import ABC, abstractmethod
from collections import defaultdict
from functools import wraps
from types import FunctionType, MethodType
from typing import Generic, TypeVar, Optional, List

__all__ = ['nomatch', 'NoMatch', 'SimpleParser']


class nomatch(BaseException):
    def __init__(self):
        pass


class NoMatch(ValueError):
    def __init__(self, data: str, index: int, expected: List[str]):
        self.data = data
        self.index = index
        self.expected = expected
        super(NoMatch, self).__init__(
            f"Can not match at index {index}. Got {data[index:index + 5]!r},"
            f" expected any of {expected}.\n"
            f"Context(data[-10:+10]): {data[index - 10: index + 10]!r}")


T = TypeVar('T')


def _wrap_reset(m):
    @wraps(m)
    def w(self, *args, **kwargs):
        p = self.index
        try:
            return m(self, *args, **kwargs)
        except nomatch:
            self.index = p
            raise
    return w


class SimpleParser(Generic[T], ABC):
    def __init__(self, data: str):
        self.data = data
        self.index = 0
        self._expected = defaultdict(list)

    def __init_subclass__(cls, **kwargs):
        # Wrap every public method so the parser position is reset whenever
        # the method fails with `nomatch` (i.e. automatic backtracking).
        for n, v in cls.__dict__.items():
            if isinstance(v, FunctionType) and not n.startswith('_'):
                setattr(cls, n, _wrap_reset(v))

    def parse(self) -> T:
        try:
            result = self.start()
        except nomatch:
            raise NoMatch(self.data, max(self._expected),
                          self._expected[max(self._expected)]) from None
        if self.index < len(self.data):
            raise NoMatch(self.data, max(self._expected),
                          self._expected[max(self._expected)])
        return result

    @abstractmethod
    def start(self) -> T:
        raise NotImplementedError

    def peek_static(self, expected: str) -> bool:
        l = len(expected)
        if self.data[self.index:self.index + l] == expected:
            return True
        else:
            self._expected[self.index].append(expected)
            return False

    def static(self, expected: str):
        length = len(expected)
        if self.data[self.index:self.index + length] == expected:
            self.index += length
        else:
            self._expected[self.index].append(expected)
            raise nomatch

    def static_b(self, expected: str) -> bool:
        l = len(expected)
        if self.data[self.index:self.index + l] == expected:
            self.index += l
            return True
        else:
            self._expected[self.index].append(expected)
            return False

    def anyof(self, *strings: str) -> str:
        for s in strings:
            if self.static_b(s):
                return s
        else:
            raise nomatch

    def anyof_b(self, *strings: str) -> bool:
        for s in strings:
            if self.static_b(s):
                return True
        else:
            return False

    def any(self, length: int = 1) -> str:
        if self.index + length <= len(self.data):
            res = self.data[self.index:self.index + length]
            self.index += length
            return res
        else:
            self._expected[self.index].append(f"<any {length} chars>")
            raise nomatch

    def any_but(self, *strings, length: int = 1) -> str:
        if self.index + length <= len(self.data):
            res = self.data[self.index:self.index + length]
            if res not in strings:
                self.index += length
                return res
            else:
                self._expected[self.index].append(f"<any {length} chars but {strings}>")
                raise nomatch
        else:
            self._expected[self.index].append(f"<any {length} chars but {strings}>")
            raise nomatch

    def multiple(self, chars: str, mi: int, ma: Optional[int]) -> str:
        result = []
        try:
            for off in range(mi):
                if self.data[self.index + off] in chars:
                    result.append(self.data[self.index + off])
                else:
                    self._expected[self.index + off].extend(chars)
                    raise nomatch
        except IndexError:
            raise nomatch
        self.index += mi
        if ma is None:
            try:
                while True:
                    if self.data[self.index] in chars:
                        result.append(self.data[self.index])
                        self.index += 1
                    else:
                        self._expected[self.index].extend(chars)
                        break
            except IndexError:
                pass
        else:
            try:
                for _ in range(ma - mi):
                    if self.data[self.index] in chars:
                        result.append(self.data[self.index])
                        self.index += 1
                    else:
                        self._expected[self.index].extend(chars)
                        break
            except IndexError:
                pass
        return ''.join(result)
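# A minimal sketch of how SimpleParser is meant to be subclassed (the grammar
# here is an assumed toy example, not part of the library):
#
#     class IntListParser(SimpleParser[list]):
#         def start(self) -> list:
#             values = [self.number()]
#             while self.static_b(','):
#                 values.append(self.number())
#             return values
#
#         def number(self) -> int:
#             return int(self.multiple('0123456789', 1, None))
#
#     IntListParser("1,2,30").parse()   # -> [1, 2, 30]
#
# Because __init_subclass__ wraps the public methods, a failed number() call
# resets self.index before `nomatch` propagates, giving backtracking without
# any bookkeeping in the subclass.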
interegular-0.3.3/interegular.egg-info/SOURCES.txt
LICENSE.txt
README.md
setup.cfg
setup.py
interegular/__init__.py
interegular/comparator.py
interegular/fsm.py
interegular/patterns.py
interegular/py.typed
interegular.egg-info/PKG-INFO
interegular.egg-info/SOURCES.txt
interegular.egg-info/dependency_links.txt
interegular.egg-info/top_level.txt
interegular/utils/__init__.py
interegular/utils/simple_parser.py
tests/test_comparator.py
tests/test_patterns.py

interegular-0.3.3/interegular.egg-info/dependency_links.txt

interegular-0.3.3/interegular.egg-info/top_level.txt
interegular

interegular-0.3.3/setup.cfg
[metadata]
description-file = README.md
license_file = LICENSE.txt

[bdist_wheel]
python-tag = py37

[egg_info]
tag_build =
tag_date = 0

interegular-0.3.3/setup.py
from pathlib import Path
from setuptools import setup
import re

__version__, = re.findall('__version__ = "(.*)"',
                          open('interegular/__init__.py').read())

with open(Path(__file__).with_name('README.md')) as f:
    long = f.read()

setup(
    name='interegular',
    version=__version__,
    packages=['interegular', 'interegular.utils'],
    package_data={'interegular': ['py.typed']},
    install_requires=[],
    python_requires=">=3.7",
    author='MegaIng',
    author_email='MegaIng',
    description="a regex intersection checker",
    long_description=long,
    long_description_content_type='text/markdown',
    license="MIT",
    url='https://github.com/MegaIng/regex_intersections',
    download_url='https://github.com/MegaIng/interegular/tarball/master',
    classifiers=[
        "Operating System :: OS Independent",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3 :: Only",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
    ],
)

interegular-0.3.3/tests/test_comparator.py
from time import perf_counter

import pytest

from interegular import parse_pattern, Comparator

REGEX_TO_TEST = {
    "A": "a+",
    "B": "[ab]+",
    "C": "b+",
    'OP': '[+*]|[?](?![a-z])',
    'RULE_MODIFIERS': '(!|![?]?|[?]!?)(?=[_a-z])',
    'EXIT_TAG': '#(?:[ \t]+)?(?i:exit)',
    'COMMENT': '#[ \t]*(?!if|ifdef|else|elif|endif|define|set|unset|error|exit)[^\n]+|(;|//)[^\n]*'
}


@pytest.fixture
def comp(request):
    return Comparator.from_regexes({name: REGEX_TO_TEST[name] for name in request.param})


basic_collisions = [
    pytest.param(("A", "B"), (("A", "B"),), id="AB"),
    pytest.param(("A", "B", "C"), (("A", "B"), ("B", "C")), id="ABC"),
    pytest.param(("A", "C"), (), id="AC"),
    pytest.param(("OP", "RULE_MODIFIERS"), (("OP", "RULE_MODIFIERS"),), id="LOOKAHEAD"),
]
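# For orientation (reading off the regexes in REGEX_TO_TEST above): "A" = a+
# and "B" = [ab]+ both accept the string "a", so the Comparator reports the
# pair ("A", "B"); "A" = a+ and "C" = b+ share no string, so the "AC" case
# expects no collision at all.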
"B"), ("B", "C")), id="ABC"), pytest.param(("A", "C"), (), id="AC"), pytest.param(("OP", "RULE_MODIFIERS"), (("OP", "RULE_MODIFIERS"),), id="LOOKAHEAD"), ] @pytest.mark.parametrize("comp, expected", basic_collisions, indirect=['comp']) def test_check(comp, expected): expected = set(expected) for collision in comp.check(): assert collision in expected expected.remove(collision) assert not expected @pytest.mark.parametrize("comp, expected", basic_collisions, indirect=['comp']) def test_example(comp, expected): for collision in expected: example = comp.get_example_overlap(*collision) assert comp.get_fsm(collision[0]).accepts(example.full_text), repr(example) assert comp.get_fsm(collision[1]).accepts(example.full_text), repr(example) @pytest.mark.parametrize("comp, expected", [ pytest.param(('EXIT_TAG', 'COMMENT'), (('EXIT_TAG', 'COMMENT'),), id="SLOW_EXAMPLE") ], indirect=['comp']) def test_slow_example(comp, expected): for collision in expected: start = perf_counter() assert not comp.isdisjoint(*collision) try: example = comp.get_example_overlap(*collision, 0.5) assert comp.get_fsm(collision[0]).accepts(example) assert comp.get_fsm(collision[1]).accepts(example) except ValueError: pass end = perf_counter() assert end - start < 1 def test_empty(): comp = Comparator({}) assert comp.marked_pairs == set() assert comp.count_marked_pairs() == 0 for a, b in comp.check(): assert False, "We can't get here" ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1685100422.0 interegular-0.3.3/tests/test_patterns.py0000666000000000000000000001060014434113606015402 0ustar00import unittest from typing import Iterable from interegular import parse_pattern from interegular.patterns import InvalidSyntax, Unsupported class SyntaxTestCase(unittest.TestCase): def parse_unsupported(self, s: str): with self.assertRaises(Unsupported): parse_pattern(s).to_fsm() def parse_invalid_syntax(self, s: str): with self.assertRaises(InvalidSyntax): parse_pattern(s).to_fsm() def parse_valid(self, re: str, targets: Iterable[str] = (), non_targets: Iterable[str] = ()): fsm = parse_pattern(re).to_fsm() for s in targets: self.assertTrue(fsm.accepts(s), f"{re!r} does not match {s!r}") for s in non_targets: self.assertFalse(fsm.accepts(s), f"{re!r} does match {s!r}") def test_basic_syntax(self): self.parse_valid("a", ("a",), ("", "aa", "b")) self.parse_valid("a+", ("a", "aa", "aaaaa"), ("", "b", "ab")) self.parse_valid("a*", ("", "a", "aa", "aaaaa"), ("b", "ab")) self.parse_valid("a{2,10}", ("a" * 2, "a" * 5, "a" * 10), ("a" * 1, "a" * 11, "b", "ab")) self.parse_valid("a{,10}", ("a" * 1, "a" * 2, "a" * 5, "a" * 10), ("a" * 11, "b", "ab")) self.parse_valid("a{2,}", ("a" * 2, "a" * 5, "a" * 10, "a" * 11), ("a" * 1, "b", "ab")) self.parse_unsupported("\\1") self.parse_invalid_syntax("(") self.parse_invalid_syntax(")") self.parse_invalid_syntax("\\g") def test_groups(self): self.parse_valid("(ab)", ("ab",), ("", "a", "b", "abb")) self.parse_valid("(?:ab)", ("ab",), ("", "a", "b", "abb")) self.parse_valid("(?Pab)", ("ab",), ("", "a", "b", "abb")) self.parse_unsupported("(?P=start)") self.parse_invalid_syntax("(?g)") def test_char_group(self): self.parse_valid("[a-h]", (*"abcdef",), ("", "aa", *"ijk")) self.parse_valid("[^a-h]", (*"ijk?0\n",), ("", "aa", *"abcdef")) self.parse_invalid_syntax("[a-A]") self.parse_invalid_syntax("[\\w-A]") self.parse_valid(r"[\w]", (*"abcdef012_",), ("", "..", *".*?",)) self.parse_valid(r"[\W]", (*".*?",), ("", "..", *"abcdef012_")) self.parse_valid(r"[^\w]", (*".*?",), ("", "..", 
*"abcdef012_")) self.parse_valid(r"[^\W]", (*"abcdef012_",), ("", "..", *".*?",)) self.parse_valid(r"[\wa-c]", (*"abcdef012_",), ("", "..", *".*?",)) self.parse_valid(r"[\Wa-c]", (*"abc.*?",), ("", "..", *"def012_")) self.parse_valid(r"[^\wa-c]", (*".*?",), ("", "..", *"abcdef012_")) self.parse_valid(r"[^\Wa-c]", (*"def012_",), ("", "..", *"abc.*?",)) self.parse_valid(r"[\wa-c?]", (*"abcdef012_?",), ("", "..", *".*",)) self.parse_valid(r"[\Wa-c?]", (*"abc.*?",), ("", "..", *"def012_")) self.parse_valid(r"[^\wa-c?]", (*".*",), ("", "..", *"abcdef012_?")) self.parse_valid(r"[^\Wa-c?]", (*"def012_",), ("", "..", *"abc.*?",)) w = "abc" d = "012" o = ".*?" self.parse_valid(r"[\w\d]", w + d, o) self.parse_valid(r"[\w\D]", w + d + o, "") self.parse_valid(r"[\W\d]", d + o, w) self.parse_valid(r"[\W\D]", o, w + d) self.parse_valid(r"[^\w\d]", o, w + d) self.parse_valid(r"[^\W\d]", w, d + o) self.parse_valid(r"[^\w\D]", "", w + d + o) self.parse_valid(r"[^\W\D]", w + d, o) def test_looks(self): self.parse_valid("(?=ab)...", ("ab?",), ("cd?", "ab")) self.parse_valid("(?!ab)...", ("cd?", "aaa"), ("ab?", "", "ab")) self.parse_unsupported("(?<=ab)") self.parse_unsupported("(?