Multi-Field Matching
FuzzyRust supports matching records across multiple fields with weighted scoring.
Overview
When matching records with multiple fields (name, address, phone, etc.), you can:
- Use different algorithms per field
- Assign weights to fields based on importance
- Get a combined similarity score
SchemaBuilder
Define a matching schema:
import fuzzyrust as fr

schema = (
    fr.SchemaBuilder()
    .add_field("name", weight=2.0, algorithm="jaro_winkler")
    .add_field("address", weight=1.0, algorithm="levenshtein")
    .add_field("phone", weight=0.5, algorithm="jaro")
    .build()
)
Field Parameters
| Parameter | Description | Default |
|---|---|---|
| name | Field name (must match record keys) | Required |
| weight | Importance weight (higher = more important) | 1.0 |
| algorithm | Similarity algorithm to use | "jaro_winkler" |
Available Algorithms
- jaro_winkler - Best for names
- jaro - Similar to Jaro-Winkler without prefix bonus
- levenshtein - Edit distance based
- damerau_levenshtein - Edit distance with transpositions
- ngram - N-gram based similarity (trigram, n=3)
- jaccard - Jaccard similarity (n-gram based)
- cosine - Cosine similarity
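To make the n-gram based entries more concrete, here is a minimal pure-Python sketch of trigram Jaccard similarity. It illustrates the general technique only: the helper names `char_ngrams` and `ngram_jaccard` are hypothetical, and FuzzyRust's own implementation may differ (for example in padding or case handling).

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in text (no padding)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Assumption: two strings too short to produce any n-gram count as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "123 Main" has 6 trigrams, all of which also occur in "123 Main St" (9 trigrams).
print(round(ngram_jaccard("123 Main St", "123 Main"), 3))  # 6 shared / 9 distinct
```

Because the score depends only on which n-grams are shared, this family of measures tends to tolerate truncated or reordered text better than pure edit distance.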
SchemaIndex
Build and search an index of records:
# Sample records
records = [
    {"name": "John Smith", "address": "123 Main St", "phone": "555-1234"},
    {"name": "Jane Doe", "address": "456 Oak Ave", "phone": "555-5678"},
    {"name": "Bob Wilson", "address": "789 Pine Rd", "phone": "555-9012"},
]

# Build index
index = fr.SchemaIndex(schema, records)

# Search
query = {"name": "Jon Smith", "address": "123 Main"}
results = index.search(query, limit=5, min_similarity=0.7)
for r in results:
    print(f"Score: {r.score:.3f}")
    print(f" Name: {r.record['name']}")
    print(f" Address: {r.record['address']}")
Scoring
The final score is a weighted average of the per-field similarity scores:
score = Σ(field_score × weight) / Σ(weight)
Example
With schema:
- name: weight=2.0
- address: weight=1.0
And field scores:
- name: 0.95
- address: 0.80
Final score = (0.95 × 2.0 + 0.80 × 1.0) / (2.0 + 1.0) = 0.90
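The arithmetic above can be reproduced with a few lines of plain Python. This is a sketch of the weighted-average calculation only, not FuzzyRust's internal code; the function name `weighted_average` is hypothetical.

```python
def weighted_average(field_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-field similarity scores using the schema weights."""
    total_weight = sum(weights[field] for field in field_scores)
    if total_weight == 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(score * weights[field] for field, score in field_scores.items())
    return weighted_sum / total_weight


weights = {"name": 2.0, "address": 1.0}
field_scores = {"name": 0.95, "address": 0.80}
print(f"{weighted_average(field_scores, weights):.2f}")  # 0.90
```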
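Missing Fields
Fields missing from either the query or the record are excluded from scoring. For example, the query in the SchemaIndex example above omits phone, so only name and address are scored.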
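One way to picture this behavior is a scorer that skips any field absent on either side and divides by the weights of the fields it actually compared. The sketch below assumes that reading of "excluded from scoring"; it is not taken from the library. `combined_score` is a hypothetical name, and `difflib.SequenceMatcher` is only a standard-library stand-in for the per-field algorithms.

```python
from difflib import SequenceMatcher
from typing import Optional


def combined_score(query: dict, record: dict, weights: dict[str, float]) -> Optional[float]:
    """Weighted average over the fields present in both query and record."""
    weighted_sum = 0.0
    total_weight = 0.0
    for field, weight in weights.items():
        left, right = query.get(field), record.get(field)
        if left is None or right is None:
            continue  # Missing on either side: the field and its weight are skipped.
        # Stand-in similarity in [0, 1]; a real schema would use the field's algorithm.
        similarity = SequenceMatcher(None, left, right).ratio()
        weighted_sum += weight * similarity
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


weights = {"name": 2.0, "address": 1.0, "phone": 0.5}
record = {"name": "John Smith", "address": "123 Main St", "phone": "555-1234"}

# The query has no address or phone, so only the name contributes.
print(combined_score({"name": "John Smith"}, record, weights))  # 1.0
```

Under this assumption a sparse query is not penalized for the fields it leaves out. Returning `None` when no field overlaps is a design choice of the sketch, not documented behavior.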
With Polars
Use multi-field matching in fuzzy joins:
import polars as pl
from fuzzyrust import polars as frp

df1 = pl.DataFrame({
    "first": ["John", "Jane"],
    "last": ["Smith", "Doe"],
    "city": ["NYC", "LA"]
})
df2 = pl.DataFrame({
    "fname": ["Jon", "Janet"],
    "lname": ["Smith", "Doe"],
    "location": ["New York", "Los Angeles"]
})

# Multi-column fuzzy join
result = frp.df_join(
    df1, df2,
    left_on=["first", "last"],
    right_on=["fname", "lname"],
    min_similarity=0.8,
    algorithm="jaro_winkler"
)
Best Practices
- Weight important fields higher - Name fields typically matter more than phone numbers
- Choose algorithms per field type:
  - Names: jaro_winkler
  - Addresses: levenshtein or ngram
  - Codes/IDs: jaro (exact prefix matching helps)
- Normalize data first - Uppercase, trim whitespace, standardize formats
- Start with higher thresholds - Easier to lower than raise
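The normalization step can be as simple as a small helper applied to every record before indexing and to every query before searching. The following is a minimal sketch using only the standard library. The function names are hypothetical, and treating the `phone` key as digits-only is an assumption about the data, not a FuzzyRust requirement.

```python
import re


def normalize_text(value: str) -> str:
    """Trim, collapse runs of whitespace, and uppercase."""
    return re.sub(r"\s+", " ", value.strip()).upper()


def normalize_phone(value: str) -> str:
    """Keep digits only so formatting differences do not affect the score."""
    return re.sub(r"\D", "", value)


def normalize_record(record: dict[str, str]) -> dict[str, str]:
    """Apply the phone rule to the 'phone' key and the text rule elsewhere."""
    return {
        key: normalize_phone(value) if key == "phone" else normalize_text(value)
        for key, value in record.items()
    }


print(normalize_record({"name": "  john   smith ", "phone": "(555) 123-4567"}))
# {'name': 'JOHN SMITH', 'phone': '5551234567'}
```

Applying the same function to both sides of the comparison matters more than the exact rules chosen.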
Multi-Field Matching at Scale
Memory Considerations for SchemaIndex
SchemaIndex stores all records in memory. For large datasets:
| Records | Fields | Avg Field Length | Estimated Memory |
|---|---|---|---|
| 100K | 4 | 30 chars | ~200 MB |
| 1M | 4 | 30 chars | ~2 GB |
| 10M | 4 | 30 chars | ~20 GB |
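The rows above scale linearly at roughly 2 KB per record (200 MB / 100K), so a rough estimate for other sizes can be extrapolated as below. This is a back-of-the-envelope sketch derived only from the table: the per-record constant applies to records of about 4 fields and 30 characters each, and real usage will vary with field lengths and index internals.

```python
# Read off the table: ~200 MB for 100K records of 4 fields x ~30 chars.
BYTES_PER_RECORD = 2_000


def estimate_index_memory_gb(n_records: int, bytes_per_record: int = BYTES_PER_RECORD) -> float:
    """Linear extrapolation of index memory in decimal gigabytes."""
    return n_records * bytes_per_record / 1e9


for n in (100_000, 1_000_000, 5_000_000):
    print(f"{n:>10,} records -> ~{estimate_index_memory_gb(n):.1f} GB")
```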
Algorithm Recommendations for Industrial Data
| Field Type | Recommended Algorithm | Why |
|---|---|---|
| Part Numbers | levenshtein | Exact edit distance for codes |
| Names/Descriptions | jaro_winkler | Handles typos, prefix bonus |
| Manufacturer | jaro_winkler or exact | Often standardized |
| Category | exact or jaccard | Token-based matching |
Batch Search Pattern
For searching many queries against a large index:
import fuzzyrust as fr

# Define schema
schema = (
    fr.SchemaBuilder()
    .add_field("name", algorithm="jaro_winkler", weight=2.0)
    .add_field("part_number", algorithm="levenshtein", weight=3.0)
    .build()
)

# Build index once
# (records: an iterable of dicts with "name" and "part_number" keys)
index = fr.SchemaIndex(schema)
for record in records:
    index.add(record)

# Batch search (single Rust call, parallelized)
queries = [{"name": "...", "part_number": "..."} for _ in range(1000)]
all_results = index.batch_search(
    queries,
    min_similarity=0.8,
    limit=5,
)
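When the query list itself is very large, one option is to feed it to the batch call in fixed-size chunks so that only one chunk of results is held at a time. The sketch below is generic Python, not part of FuzzyRust: `chunked` and `search_in_chunks` are hypothetical helpers, and it assumes the batch call returns one result per query, in query order. A stand-in function replaces the real index so the example runs on its own.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def search_in_chunks(
    batch_search: Callable[[list[dict]], list],
    queries: Iterable[dict],
    chunk_size: int = 1000,
) -> Iterator[tuple[dict, object]]:
    """Run batch_search per chunk and pair each query with its result.

    Assumption: batch_search returns one result per query, in query order.
    """
    for chunk in chunked(queries, chunk_size):
        yield from zip(chunk, batch_search(chunk))


# Stand-in for the real batch call so the sketch runs without an index.
def fake_batch_search(chunk: list[dict]) -> list[list[str]]:
    return [[query["name"].upper()] for query in chunk]


queries = [{"name": f"part {i}"} for i in range(5)]
for query, matches in search_in_chunks(fake_batch_search, queries, chunk_size=2):
    print(query["name"], "->", matches)
```

With a real index, the callable could be `lambda chunk: index.batch_search(chunk, min_similarity=0.8, limit=5)`, reusing the call from the pattern above.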
Large-Scale Recommendations
For datasets larger than 100K records:
- Use df_dedupe_snm() instead of df_dedupe() - O(N log N) vs O(N^2)
- Add blocking keys - Partition data before comparison
- Process in chunks - Avoid memory pressure
- Pre-normalize data - Reduce comparison overhead
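To illustrate the blocking idea, the sketch below groups records by a cheap key and generates candidate pairs only within each group, which avoids comparing every record with every other. It is plain Python rather than FuzzyRust API, and the key used here (first letter of the name) is a hypothetical choice; a real key should be picked so that true matches rarely land in different blocks.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Iterator


def blocking_key(record: dict) -> str:
    """Hypothetical key: first letter of the trimmed, uppercased name."""
    return record["name"].strip().upper()[:1]


def candidate_pairs(records: Iterable[dict]) -> Iterator[tuple[dict, dict]]:
    """Yield record pairs that share a blocking key."""
    blocks: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"name": "John Smith"},
    {"name": "Jon Smith"},
    {"name": "Jane Doe"},
    {"name": "Bob Wilson"},
]
pairs = list(candidate_pairs(records))
print(len(pairs))  # 3 pairs inside the "J" block, versus 6 pairs across all records
```

Only the surviving pairs would then be scored with the full multi-field comparison.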
See the Large-Scale Fuzzy Matching Guide for detailed strategies.