Performance¶
FuzzyRust is designed for high performance at scale.
Benchmarks¶
Single Pair Comparison¶
| Library | Jaro-Winkler | Levenshtein |
|---|---|---|
| FuzzyRust | 1.0x | 1.0x |
| RapidFuzz | 0.9x | 0.8x |
| jellyfish | 2.5x | 3.0x |
| python-Levenshtein | - | 2.0x |
Lower is better. RapidFuzz is slightly faster for single pairs.
Batch Operations (10K pairs)¶
| Library | Time |
|---|---|
| FuzzyRust | 1.0x |
| RapidFuzz | 5-10x |
| Pure Python | 100x+ |
FuzzyRust excels at batch operations due to Rayon parallelization.
Index Search (1M strings, find top 10)¶
| Method | Time |
|---|---|
| NgramIndex | ~1ms |
| BkTree | ~5ms |
| Linear scan | ~500ms |
Index structures provide 100-2000x speedup.
Optimization Techniques¶
FuzzyRust uses several optimization techniques:
1. Rust Core¶
All algorithms are implemented in Rust with zero-copy string handling.
2. Parallel Processing¶
Batch operations use Rayon for automatic parallelization:
# Automatically parallelized across all cores
results = index.batch_search(queries, min_similarity=0.8)
3. SIMD Acceleration¶
Myers bit-parallel algorithm for Levenshtein distance processes 64 characters in parallel.
4. Early Termination¶
Bounded distance calculations stop early when threshold is exceeded:
5. N-gram Candidate Filtering¶
Index structures filter candidates before expensive similarity calculations.
6. Memory-Efficient Indices¶
Posting list compression reduces memory usage by 50-70%:
Scaling Guidelines¶
| Dataset Size | Recommended Approach |
|---|---|
| < 1K | Direct comparison |
| 1K - 10K | find_best_matches() |
| 10K - 100K | NgramIndex |
| 100K - 1M | NgramIndex + compression |
| > 1M | Batch API + SNM deduplication |
Memory Usage¶
Approximate memory per 1M strings (avg 20 chars):
| Structure | Memory |
|---|---|
| Raw strings | ~20 MB |
| NgramIndex (uncompressed) | ~100 MB |
| NgramIndex (compressed) | ~40 MB |
| BkTree | ~80 MB |
Tips for Best Performance¶
1. Use Appropriate Algorithms¶
# Fast for names
fr.jaro_winkler(name1, name2)
# Fast for short codes
fr.hamming(code1, code2)
# Use bounded for filtering
fr.levenshtein_bounded(s1, s2, max_distance=2)
2. Pre-filter When Possible¶
# Filter before fuzzy matching
candidates = [s for s in strings if s[0] == query[0]]
matches = fr.find_best_matches(query, candidates)
3. Use Batch Operations¶
# Bad: Loop with single calls
for q in queries:
results.append(fr.find_best_matches(q, targets))
# Good: Single batch call
all_results = index.batch_search(queries, min_similarity=0.8)
4. Choose Right Index¶
# For similarity threshold queries
index = fr.NgramIndex(ngram_size=3)
# For edit distance queries
tree = fr.BkTree(algorithm="levenshtein")
5. Compress Large Indices¶
index = fr.NgramIndex(ngram_size=3)
index.add_all(large_dataset)
index.compress() # Reduce memory by 50-70%
Profiling¶
Enable debug output for performance analysis:
Scale Limits and Complexity¶
Function Complexity Reference¶
| Function | Complexity | Max Recommended Records |
|---|---|---|
df_dedupe() |
O(N^2) | 50K |
df_match_pairs() |
O(N^2) | 50K |
df_dedupe_snm() |
O(N * W * log N) | 10M |
df_find_pairs(method="snm") |
O(N * W * log N) | 10M |
df_join() |
O(N * M) | N=100K, M=1M |
df_match_records() |
O(N * log M) | N=1M, M=10M |
SchemaIndex.search() |
O(log N) | 10M |
SchemaIndex.batch_search() |
O(Q * log N) | Q=100K, N=10M |
Where: - N, M = number of records - W = window_size for SNM - Q = number of queries
Memory Limits¶
| Structure | Memory per 1M records |
|---|---|
| SchemaIndex (4 fields) | ~2 GB |
| NgramIndex | ~500 MB |
| BkTree | ~400 MB |
Warning: Avoid These Patterns¶
# BAD: O(N^2) for large datasets - will take hours!
result = frp.df_dedupe(df_1m_rows, columns=["name"])
# GOOD: O(N * W * log N) - completes in minutes
result = frp.df_dedupe_snm(df_1m_rows, columns=["name"], window_size=50)
# BAD: Loop with individual calls
for query in queries:
results.append(index.search(query))
# GOOD: Single batch call (parallelized)
results = index.batch_search(queries)
See Large-Scale Fuzzy Matching Guide for detailed optimization strategies