Index Classes

NgramIndex

N-gram based index for fast fuzzy search.

Constructor

NgramIndex(
    ngram_size: int = 3,
    min_similarity: float = 0.0,
    min_ngram_ratio: float = 0.0,
    normalize: bool = False
)

Parameters:

  • ngram_size: Size of each n-gram (1-32)
  • min_similarity: Minimum similarity score a result must reach to be returned
  • min_ngram_ratio: Minimum n-gram overlap ratio a string must reach to be considered a candidate
  • normalize: If True, lowercase text so that matching is case-insensitive
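
To make ngram_size, min_ngram_ratio, and normalize concrete, here is a minimal sketch of n-gram extraction and an overlap ratio. It is not the library's implementation: the helper names char_ngrams and overlap_ratio are hypothetical, and it assumes unpadded character n-grams and a ratio defined as shared n-grams divided by the query's n-grams.

```python
def char_ngrams(text: str, n: int = 3, normalize: bool = False) -> set:
    """Return the set of character n-grams of `text` (assumption: no padding)."""
    if normalize:
        text = text.lower()
    if len(text) < n:
        # Strings shorter than n yield themselves as a single gram (or nothing).
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_ratio(query: str, candidate: str, n: int = 3,
                  normalize: bool = False) -> float:
    """Fraction of the query's n-grams that also occur in the candidate."""
    query_grams = char_ngrams(query, n, normalize)
    candidate_grams = char_ngrams(candidate, n, normalize)
    if not query_grams:
        return 0.0
    return len(query_grams & candidate_grams) / len(query_grams)


# "hello" and "help" share only the trigram "hel", one of three query trigrams.
ratio = overlap_ratio("hello", "help")
# With normalize=True, "Hello" and "hello" produce identical trigram sets.
case_insensitive = overlap_ratio("Hello", "hello", normalize=True)
```

Under these assumptions, a candidate would be kept only when its ratio is at least min_ngram_ratio, so raising the threshold trades recall for fewer candidates to score.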

Methods

add

add(text: str) -> int

Add a string to the index. Returns the assigned ID.

add_with_data

add_with_data(text: str, data: int | None = None) -> int

Add a string with optional associated data. Returns the assigned ID.

add_all

add_all(texts: Iterable[str]) -> None

Add multiple strings.

search

search(
    query: str,
    algorithm: str = "jaro_winkler",
    min_similarity: float = 0.0,
    limit: int | None = None
) -> list[SearchResult]

Search for similar strings.

Returns: List of SearchResult(id, text, score, distance, data)
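
The sketch below shows one plausible reading of how min_similarity and limit shape the returned list. It is an illustration only: filter_and_rank is a hypothetical helper, the SearchResult class is a local stand-in mirroring the fields listed above, and ordering by descending score is an assumption that this page does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchResult:
    # Local stand-in with the fields documented for SearchResult.
    id: int
    text: str
    score: float
    distance: Optional[int] = None
    data: Optional[int] = None


def filter_and_rank(results: List[SearchResult],
                    min_similarity: float = 0.0,
                    limit: Optional[int] = None) -> List[SearchResult]:
    """Drop results below the threshold, sort best-first, then truncate."""
    kept = [r for r in results if r.score >= min_similarity]
    ranked = sorted(kept, key=lambda r: r.score, reverse=True)
    return ranked if limit is None else ranked[:limit]


candidates = [
    SearchResult(id=2, text="ample", score=0.60),
    SearchResult(id=0, text="apple", score=0.95),
    SearchResult(id=1, text="apply", score=0.80),
]
top = filter_and_rank(candidates, min_similarity=0.7, limit=1)
```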

batch_search

batch_search(
    queries: list[str],
    algorithm: str = "jaro_winkler",
    min_similarity: float = 0.0,
    limit: int | None = None
) -> list[list[SearchResult]]

Search for multiple queries in parallel.

contains

contains(query: str) -> bool

Check whether an exact match for query exists in the index.

compress / decompress

compress() -> None
decompress() -> None

compress() compresses the posting lists to reduce memory use; decompress() restores them to their uncompressed form.
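
This page does not say which compression scheme is used. As background only, the sketch below shows one common way to shrink a posting list (the sorted IDs of strings containing a given n-gram): store the gaps between consecutive IDs as variable-length integers. The function names are hypothetical and the scheme is an assumption, not the library's documented format.

```python
def compress_postings(ids: list) -> bytes:
    """Delta-encode sorted IDs, then write each gap as a base-128 varint."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(ids):
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits at a time; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decompress_postings(data: bytes) -> list:
    """Invert compress_postings: decode varint gaps and re-accumulate IDs."""
    ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep reading this gap
        else:
            current += gap
            ids.append(current)
            gap = 0
            shift = 0
    return ids


postings = [3, 7, 8, 300, 100000]
packed = compress_postings(postings)
restored = decompress_postings(packed)
```

The trade-off in any such scheme is that a compressed list must be decoded before it can be scanned, which is presumably why the index exposes explicit compress() and decompress() calls.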

is_compressed

is_compressed() -> bool

Check whether the index is currently compressed.

save / load

save(path: str) -> None
load(path: str) -> NgramIndex  # class method

save() writes the index to disk at path. load() is a class method that reads a saved index from path and returns it.


BkTree

BK-tree index for edit distance queries.

Constructor

BkTree(algorithm: str = "levenshtein")

Parameters:

  • algorithm: Distance algorithm ("levenshtein" or "damerau_levenshtein")

Methods

add / add_all

add(text: str) -> None
add_all(texts: Iterable[str]) -> None

Add strings to the tree.

search

search(query: str, max_distance: int) -> list[SearchResult]

Find strings within edit distance threshold.

Note

Consider using search_similarity() instead for consistency with NgramIndex and HybridIndex APIs.
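
For readers unfamiliar with the data structure, the following is a minimal, self-contained BK-tree sketch showing how a max_distance query prunes the tree. It is not this library's implementation: TinyBkTree and levenshtein are hypothetical names, and results here are plain (text, distance) tuples rather than SearchResult objects.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class TinyBkTree:
    """Each node is (text, children), with children keyed by edit distance."""

    def __init__(self):
        self.root = None

    def add(self, text: str) -> None:
        if self.root is None:
            self.root = (text, {})
            return
        node = self.root
        while True:
            d = levenshtein(text, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (text, {})
                return
            node = child

    def search(self, query: str, max_distance: int) -> list:
        matches = []
        stack = [self.root] if self.root is not None else []
        while stack:
            text, children = stack.pop()
            d = levenshtein(query, text)
            if d <= max_distance:
                matches.append((text, d))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_distance, d + max_distance] can contain matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(matches, key=lambda m: (m[1], m[0]))


tree = TinyBkTree()
for word in ["book", "books", "cake", "boo", "cape", "cart"]:
    tree.add(word)
near = tree.search("bool", 1)
```

The pruning step is what makes the structure attractive for small max_distance values: whole subtrees are skipped without computing any distances inside them.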

search_similarity

search_similarity(
    query: str,
    min_similarity: float,
    limit: int | None = None
) -> list[SearchResult]

Find strings above similarity threshold. Recommended over search() for API consistency.

The similarity is computed as: 1 - (distance / max(len(query), len(match)))
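
As a worked instance of the formula above, the sketch below converts an edit distance into a similarity score. The helper name similarity_from_distance is hypothetical, and returning 1.0 when both strings are empty is an assumption made to avoid division by zero.

```python
def similarity_from_distance(distance: int, query: str, match: str) -> float:
    """Apply 1 - distance / max(len(query), len(match))."""
    longest = max(len(query), len(match))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1.0 - distance / longest


# "kitten" -> "sitting" has Levenshtein distance 3 and the longer string
# has 7 characters, so the similarity is 1 - 3/7.
score = similarity_from_distance(3, "kitten", "sitting")
```

Because the denominator is the longer of the two lengths, the same edit distance yields a higher similarity for longer strings.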

save / load

save(path: str) -> None
load(path: str) -> BkTree  # class method

HybridIndex

Combined N-gram and similarity index.

Constructor

HybridIndex(
    ngram_size: int = 3,
    min_ngram_ratio: float = 0.0,
    normalize: bool = False
)

Methods

Same as NgramIndex: add, add_all, search, batch_search, contains.


SchemaBuilder

Build multi-field matching schemas.

Methods

add_field

add_field(
    name: str,
    weight: float = 1.0,
    algorithm: str = "jaro_winkler"
) -> SchemaBuilder

Add a field to the schema.

build

build() -> Schema

Build the schema.


SchemaIndex

Index for multi-field record matching.

Constructor

SchemaIndex(schema: Schema, records: list[dict])

Methods

search

search(
    query: dict,
    limit: int | None = None,
    min_similarity: float = 0.0
) -> list[SchemaSearchResult]

Search for matching records.

Returns: List of SchemaSearchResult(id, score, record, field_scores)
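
This page does not define how score is derived from field_scores. One common choice, shown here purely as an assumption, is a weighted average using the weight given to each field in add_field. The helper name combined_score is hypothetical.

```python
def combined_score(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field scores (assumed combination rule)."""
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(field_scores[name] * weights[name]
                       for name in field_scores)
    return weighted_sum / total_weight


# A "name" field weighted twice as heavily as "city".
overall = combined_score(
    field_scores={"name": 0.9, "city": 0.5},
    weights={"name": 2.0, "city": 1.0},
)
```

Under this rule, a field with weight 2.0 pulls the combined score toward its own value twice as strongly as a field with weight 1.0.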


Result Types

SearchResult

from dataclasses import dataclass

@dataclass
class SearchResult:
    id: int           # Index ID
    text: str         # Matched text
    score: float      # Similarity score
    distance: int | None  # Edit distance (if applicable)
    data: int | None  # Associated data

MatchResult

@dataclass
class MatchResult:
    text: str    # Matched text
    score: float # Similarity score

SchemaSearchResult

@dataclass
class SchemaSearchResult:
    id: int                    # Record ID
    score: float               # Combined score
    record: dict               # Full record
    field_scores: dict[str, float]  # Per-field scores