Skip to content

Performance

FuzzyRust is designed for high performance at scale.

Benchmarks

Single Pair Comparison

Library Jaro-Winkler Levenshtein
FuzzyRust 1.0x 1.0x
RapidFuzz 0.9x 0.8x
jellyfish 2.5x 3.0x
python-Levenshtein - 2.0x

Lower is better. RapidFuzz is slightly faster for single pairs.

Batch Operations (10K pairs)

Library Time
FuzzyRust 1.0x
RapidFuzz 5-10x
Pure Python 100x+

FuzzyRust excels at batch operations due to Rayon parallelization.

Index Search (1M strings, find top 10)

Method Time
NgramIndex ~1ms
BkTree ~5ms
Linear scan ~500ms

Index structures provide 100-2000x speedup.

Optimization Techniques

FuzzyRust uses several optimization techniques:

1. Rust Core

All algorithms are implemented in Rust with zero-copy string handling.

2. Parallel Processing

Batch operations use Rayon for automatic parallelization:

# Automatically parallelized across all cores
results = index.batch_search(queries, min_similarity=0.8)

3. SIMD Acceleration

Myers bit-parallel algorithm for Levenshtein distance processes 64 characters in parallel.

4. Early Termination

Bounded distance calculations stop early when threshold is exceeded:

# Returns None immediately if distance > 3
fr.levenshtein_bounded("hello", "world", max_distance=3)

5. N-gram Candidate Filtering

Index structures filter candidates before expensive similarity calculations.

6. Memory-Efficient Indices

Posting list compression reduces memory usage by 50-70%:

index.compress()  # Compress after building

Scaling Guidelines

Dataset Size Recommended Approach
< 1K Direct comparison
1K - 10K find_best_matches()
10K - 100K NgramIndex
100K - 1M NgramIndex + compression
> 1M Batch API + SNM deduplication

Memory Usage

Approximate memory per 1M strings (avg 20 chars):

Structure Memory
Raw strings ~20 MB
NgramIndex (uncompressed) ~100 MB
NgramIndex (compressed) ~40 MB
BkTree ~80 MB

Tips for Best Performance

1. Use Appropriate Algorithms

# Fast for names
fr.jaro_winkler(name1, name2)

# Fast for short codes
fr.hamming(code1, code2)

# Use bounded for filtering
fr.levenshtein_bounded(s1, s2, max_distance=2)

2. Pre-filter When Possible

# Filter before fuzzy matching
candidates = [s for s in strings if s[0] == query[0]]
matches = fr.find_best_matches(query, candidates)

3. Use Batch Operations

# Bad: Loop with single calls
for q in queries:
    results.append(fr.find_best_matches(q, targets))

# Good: Single batch call
all_results = index.batch_search(queries, min_similarity=0.8)

4. Choose Right Index

# For similarity threshold queries
index = fr.NgramIndex(ngram_size=3)

# For edit distance queries
tree = fr.BkTree(algorithm="levenshtein")

5. Compress Large Indices

index = fr.NgramIndex(ngram_size=3)
index.add_all(large_dataset)
index.compress()  # Reduce memory by 50-70%

Profiling

Enable debug output for performance analysis:

import logging
logging.getLogger("fuzzyrust").setLevel(logging.DEBUG)

Scale Limits and Complexity

Function Complexity Reference

Function Complexity Max Recommended Records
df_dedupe() O(N^2) 50K
df_match_pairs() O(N^2) 50K
df_dedupe_snm() O(N * W * log N) 10M
df_find_pairs(method="snm") O(N * W * log N) 10M
df_join() O(N * M) N=100K, M=1M
df_match_records() O(N * log M) N=1M, M=10M
SchemaIndex.search() O(log N) 10M
SchemaIndex.batch_search() O(Q * log N) Q=100K, N=10M

Where: - N, M = number of records - W = window_size for SNM - Q = number of queries

Memory Limits

Structure Memory per 1M records
SchemaIndex (4 fields) ~2 GB
NgramIndex ~500 MB
BkTree ~400 MB

Warning: Avoid These Patterns

# BAD: O(N^2) for large datasets - will take hours!
result = frp.df_dedupe(df_1m_rows, columns=["name"])

# GOOD: O(N * W * log N) - completes in minutes
result = frp.df_dedupe_snm(df_1m_rows, columns=["name"], window_size=50)
# BAD: Loop with individual calls
for query in queries:
    results.append(index.search(query))

# GOOD: Single batch call (parallelized)
results = index.batch_search(queries)

See Large-Scale Fuzzy Matching Guide for detailed optimization strategies