Large-Scale Fuzzy Matching Guide¶

This guide covers best practices for fuzzy matching operations on large datasets (100K to 10M+ records).

When to Use What¶

Use this decision tree to choose the right function:

Dataset Size	Recommended Functions	Notes
< 10K rows	`df_dedupe()`, `df_join()`	Simple and fast, O(N^2) acceptable
10K - 100K rows	`df_dedupe()` or `df_dedupe_snm()`	Consider SNM for better scaling
100K - 1M rows	`df_dedupe_snm()` with blocking	Use blocking key to partition data
> 1M rows	`df_dedupe_snm()` with blocking + larger window	Essential optimizations required

Memory Estimation¶

Approximate memory requirements for multi-field matching:

Records	Columns	Avg Field Length	Estimated Memory	Time (approx)
100K	4	30 chars	~500 MB	10-30s
1M	4	30 chars	~2-4 GB	2-5 min
10M	4	30 chars	~20-40 GB	20-60 min

Rule of thumb: Allocate ~100 bytes per record per indexed field.

Complexity Comparison¶

Understanding algorithm complexity is crucial for large-scale operations:

O(N^2) vs O(N log N)¶

df_dedupe(): O(N^2) - compares every pair, avoid for > 100K records
df_dedupe_snm(): O(N * W * log N) where W = window_size

For 1M records: - O(N^2) = 1,000,000,000,000 comparisons (impractical) - O(N * W * log N) with W=50 = ~1,000,000,000 comparisons (manageable)

Window Size Recommendations¶

The window_size parameter in SNM controls accuracy vs speed tradeoff:

Dataset Size	Recommended Window	Accuracy Trade-off
< 100K	10-20	Good
100K - 500K	20-50	Good
500K - 1M	50-100	Moderate
> 1M	100-200	Lower (use blocking)

Larger windows catch more duplicates but take longer. When accuracy drops, use blocking instead of increasing window size.

Blocking Key Strategies¶

What is Blocking?¶

Blocking partitions records before comparison, reducing complexity from O(N^2) to O(B * (N/B)^2) where B is the number of blocks. With good blocking, this approaches O(N).

from fuzzyrust import polars as frp

# Block by category - only compare within same category
result = frp.df_dedupe_snm(
    df,
    columns=["name", "manufacturer"],
    blocking_key="category",  # Use category column as block key
    min_similarity=0.85,
    window_size=30,
)

Recommended Blocking Keys for Industrial Data¶

Data Type	Blocking Key	Example
Part numbers	First 3-4 characters	"PN-001234" -> "PN-0"
Names	First character (normalized)	"Bosch" -> "B"
Manufacturer	Exact match (pre-normalized)	"BOSCH"
Category	Exact match	"Bearings"
Composite	category + first_char(name)	"Bearings-B"

Custom Blocking Keys¶

Create blocking keys with any Python callable:

# Block by first letter of name
result = frp.df_dedupe_snm(
    df,
    columns=["name"],
    blocking_key=lambda df: df["name"].str.slice(0, 1).str.to_uppercase(),
    min_similarity=0.85,
)

# Composite blocking key
result = frp.df_dedupe_snm(
    df,
    columns=["name", "manufacturer"],
    blocking_key=lambda df: (
        df["category"] + "-" + df["manufacturer"].str.slice(0, 2)
    ),
    min_similarity=0.85,
)

Best Practices¶

1. Normalize Data First¶

Always normalize strings before matching:

import polars as pl

df = df.with_columns([
    pl.col("name").str.to_lowercase().str.strip_chars().alias("name_norm"),
    pl.col("manufacturer").str.to_uppercase().alias("mfr_norm"),
])

# Match on normalized columns
result = frp.df_dedupe_snm(df, columns=["name_norm", "mfr_norm"])

2. Pre-filter Categories¶

Match within categories when possible:

# Process each category separately
results = []
for category in df["category"].unique().to_list():
    subset = df.filter(pl.col("category") == category)
    deduped = frp.df_dedupe_snm(
        subset,
        columns=["name"],
        min_similarity=0.85,
    )
    results.append(deduped)

final = pl.concat(results)

3. Use Appropriate Algorithms per Field¶

Different field types benefit from different algorithms:

Field Type	Recommended Algorithm	Reason
Part numbers	`levenshtein`	Exact edit distance for codes
Names/Descriptions	`jaro_winkler`	Handles typos, prefix bonus
Manufacturer	`jaro_winkler` or exact	Often standardized
Category	`jaccard` or exact	Token-based matching

result = frp.df_dedupe_snm(
    df,
    columns=["part_number", "name", "manufacturer"],
    algorithms={
        "part_number": "levenshtein",
        "name": "jaro_winkler",
        "manufacturer": "jaro_winkler",
    },
    weights={
        "part_number": 3.0,  # Most important
        "name": 2.0,
        "manufacturer": 1.0,
    },
    min_similarity=0.85,
)

4. Use Batch Operations¶

Never loop over individual searches:

# BAD: Loop with individual searches
for query in queries:
    index.search(query)  # Creates overhead per query

# GOOD: Single batch call
index.batch_search(queries)  # Parallelized internally

5. Monitor Memory¶

For very large datasets, process in chunks:

# Process 100K rows at a time
chunk_size = 100_000
results = []

for i in range(0, len(df), chunk_size):
    chunk = df.slice(i, chunk_size)
    deduped = frp.df_dedupe_snm(
        chunk,
        columns=["name"],
        min_similarity=0.85,
    )
    results.append(deduped)

Performance Comparison¶

Deduplication Methods¶

Method	Complexity	10K rows	100K rows	1M rows
`df_dedupe()`	O(N^2)	2s	3 min	5+ hours
`df_dedupe_snm()`	O(NWlog N)	0.5s	5s	1 min
`df_dedupe_snm()` + blocking	O(B(N/B)W*log(N/B))	0.3s	2s	20s

Times are approximate and depend on hardware and data characteristics.

Index Search Methods¶

For querying against a reference dataset:

Method	Build Time	Query Time	Memory
`SchemaIndex`	~1s/100K	~1ms/query	~200MB/100K
`NgramIndex`	~2s/100K	~0.5ms/query	~100MB/100K
`HybridIndex`	~3s/100K	~0.3ms/query	~300MB/100K

Common Pitfalls¶

1. Using O(N^2) Methods on Large Data¶

# WRONG: Will take hours for 1M rows
result = frp.df_dedupe(df_1m_rows, columns=["name"])

# RIGHT: Use SNM for O(N log N)
result = frp.df_dedupe_snm(df_1m_rows, columns=["name"], window_size=50)

2. Window Size Too Small¶

A small window may miss duplicates that are far apart after sorting:

# May miss some duplicates
result = frp.df_dedupe_snm(df, columns=["name"], window_size=5)

# Better coverage
result = frp.df_dedupe_snm(df, columns=["name"], window_size=30)

3. No Blocking on Very Large Data¶

For > 1M rows, always use blocking:

# Slow without blocking
result = frp.df_dedupe_snm(df_10m, columns=["name"], window_size=100)

# Much faster with blocking
result = frp.df_dedupe_snm(
    df_10m,
    columns=["name"],
    blocking_key="category",
    window_size=30,
)