Amazon India Duplicate Product Detection — ML System Design

Amazon Duplicate Listing Detection

ML System Design NLP Entity Resolution Scalability Expert

The Challenge: Identifying Semantic Duplicates for Telugu Products

You're working for Amazon's India marketplace. Telugu sellers often list the same product with different names like "Andhra Pickles" vs "Guntur Pickle" vs "Traditional Avakaya". How would you design a system to identify and merge these duplicate listings in a large database (e.g., 50 million products, 100K new daily) considering linguistic nuances, seller behavior, and scalability?

Initial Thoughts & Clarifications

Definition of "Duplicate": What constitutes a duplicate? Same underlying product from different sellers? Same seller with minor variations? How do we handle legitimate product variants (e.g., spice level, size) vs. true duplicates?
Data Sources: What product attributes are available? (Title, description, images, brand, seller ID, price, category, customer reviews, item specifics like weight/ingredients).
Linguistic Challenges: How to handle mixed scripts (Telugu + English), transliteration variations, synonyms, regional terms (Guntur vs. general Andhra), and semantic similarity?
Scale: 50M existing, 100K new daily – implies need for efficient candidate generation (blocking) and real-time capabilities for new listings.
Business Goal: Improve customer experience (easier search, less confusion), better catalog quality, fair seller competition, accurate analytics. What are the trade-offs (e.g., precision vs. recall for merging)?
Seller Behavior: Are some duplicates intentional (to gain more visibility) vs. accidental? Does this affect the approach?
Evaluation: How will the system's performance be measured? (Accuracy of merges, impact on search, seller feedback).
Existing Systems: Is there any current rudimentary duplicate detection? What are its limitations?

Framework to Consider (Duplicate Detection / Entity Resolution System):

Problem Definition & Scope:
- Define what constitutes a "duplicate" vs. a "variant" vs. "distinct but similar."
- Specify goals (e.g., merge true duplicates, link variants).
Data Understanding & Preprocessing (General):
- Analyze product attributes (text, images, structured data).
- General text normalization: Language detection, script conversion (e.g., Telugu to Roman), transliteration handling, stemming/lemmatization, stop-word removal, phonetic encoding.
- General image preprocessing and attribute normalization.
Data Preparation for Model Training:
- Detailed labeling strategy: Positive, negative, hard-negative pair generation.
- Feature extraction for model inputs: Specifics for text (tokenization, padding), images (resizing, augmentation), structured data (normalization, encoding).
- Train/validation/test splits: Handling data leakage, temporal splits, class imbalance strategies (sampling, weighting).
- Data augmentation techniques for model robustness.
Candidate Generation / Blocking:
- Goal: Drastically reduce pairwise comparisons from O(N²) to something manageable.
- Methods: Category-based blocking, LSH (MinHash for text, SimHash for images), attribute-based blocking (brand, key ingredients), phonetic blocking.
Feature Engineering for Pairwise Comparison (if not end-to-end):
- Textual Similarity: TF-IDF + Cosine, Word/Sentence Embeddings (e.g., multilingual BERT, FastText) + Cosine, Edit Distance (Levenshtein, Jaro-Winkler).
- Image Similarity: CNN-based embeddings (e.g., ResNet, ViT) + Cosine.
- Structured Attribute Similarity: Match scores for brand, weight, key ingredients, price (normalized).
- Seller Information: Same seller vs. different sellers.
- Behavioral Signals (Advanced): Co-view/co-purchase data, review similarity.
Similarity Scoring & Classification Model:
- Combine features into a similarity score or feed into a classifier (e.g., Logistic Regression, SVM, Random Forest, Gradient Boosting, Siamese Network for end-to-end learning).
- Output: Probability of being a duplicate.
Clustering / Transitive Closure:
- If A=B and B=C, then A=C. Use graph-based clustering (e.g., Connected Components) on high-confidence pairs to find all duplicates in a set.
Evaluation:
- Pairwise metrics: Precision, Recall, F1-score for duplicate classification.
- Cluster metrics: Purity, Rand Index, Normalized Mutual Information.
- Business metrics (via A/B testing): Impact on search relevance, conversion, customer satisfaction, seller complaints.
Scalability & Deployment:
- Batch processing for existing catalog cleanup (e.g., Spark, EMR).
- Real-time/near real-time processing for new listings (e.g., Kafka + Flink/Spark Streaming, API-based checks). Efficient indexing (FAISS for embeddings).
- Model serving infrastructure. Latency vs. accuracy trade-offs.
Rollout Strategy & Risk Management:
- Phased rollout (e.g., by category, region), dark launch/shadow mode.
- Canary releases, A/B testing for deployment.
- Rollback plans and incident response. Seller communication.
Human-in-the-Loop & Continuous Improvement:
- Mechanism for manual review of low-confidence predictions or disputed merges.
- Feedback loop to retrain models with corrected labels.
- Monitor for adversarial seller behavior and adapt.
Ethical Considerations & Bias Mitigation:
- Identify and mitigate linguistic bias, popularity bias, presentation bias.
- Ensure fairness for sellers (especially smaller or regional ones).
- Regular audits for bias and fairness. Transparency in decision-making where possible.

Simulated Conversation

Round 1: Problem Understanding & Linguistic Complexity

Interviewer 1 (I1 - Principal Engineer): Let's jump right in. Telugu sellers on Amazon India are creating duplicate listings for what appear to be the same product – examples like "Andhra Pickles" vs "Guntur Pickle" vs "Traditional Avakaya". At Amazon's scale, we're talking millions of products. How do you approach this duplicate detection problem?

Candidate (C): This is a fascinating and challenging entity resolution problem, especially with the added linguistic and cultural complexity of regional Indian products. My approach would be multi-staged, starting with deep data understanding, moving to candidate generation, detailed data preparation for model training, feature engineering, modeling, and finally, scalable deployment and evaluation.

The core challenge is that these aren't just exact string matches; they are semantic duplicates or near-duplicates often masked by variations in naming, transliteration, regional specificity, and seller descriptions. We need to understand what data attributes we have for each product listing: titles, descriptions, images, seller information, brand, category, price, customer reviews, and any structured attributes like weight or ingredients.

Solid Opening: Candidate frames the problem correctly as entity resolution and acknowledges linguistic/cultural nuances. Mentions data prep early.

Interviewer 2 (I2 - Category Manager): (interrupting) You said "semantic variants" – but do you actually understand the specific context here? "Avakaya" is a specific type of mango pickle. "Guntur" refers to a region in Andhra Pradesh famous for its particularly spicy pickles, so "Guntur Pickle" often implies a Guntur chili-based pickle, which could be mango or something else. "Andhra Pickles" is a very broad term for any pickle from Andhra. These could be genuinely different products with distinct customer expectations. How do you differentiate between true duplicates, legitimate product variants that should perhaps be grouped, and entirely different products?

C: That's an excellent and critical clarification, thank you. You're absolutely right; domain knowledge of Indian cuisine and regional specialties is paramount here. My initial statement needs refinement. The system must distinguish between:

True Duplicates: The exact same underlying product (e.g., same brand, same specific type like "Mango Avakaya", same ingredients, same weight) listed with slightly different titles or by different sellers. Example: "Priya Mango Avakaya Pickle 500g" vs. "Priya Foods Avakai Mango Pickle 0.5kg". These should be merged or a canonical version chosen.
Product Variants: Products that share a base but differ in a key, customer-relevant attribute.
- Examples: "Guntur Spicy Mango Pickle" vs. "Mild Mango Pickle (Andhra Style)". These are both mango pickles but differ in spice level/origin indicated. These might be grouped under a parent product with selectable variants rather than merged into one.
- "Avakaya (with Garlic)" vs. "Avakaya (without Garlic)".
Distinct but Related Products: Products within the same broad category but clearly different. Example: "Mango Avakaya" vs. "Gongura Pickle" vs. "Tomato Pickle". These are not duplicates. "Guntur Pickle" could be a Guntur Mango Pickle (variant of Avakaya) or a Guntur Red Chili Pickle (distinct product).

My approach would need a hierarchical classification:

Broad Category Classification: Is this a "Mango Pickle," "Lemon Pickle," "Mixed Vegetable Pickle"? This uses titles, descriptions, and potentially image analysis or ingredient lists.
Sub-type/Variant Identification: Within "Mango Pickle," identify attributes like "Avakaya," "Guntur style," "spice level," "key ingredients (garlic/no garlic)."
True Duplicate Detection: For items identified as the same sub-type/variant (e.g., "Guntur Spicy Mango Avakaya 500g"), then apply more stringent checks for true duplication across sellers or minor title variations.

This requires building a product ontology or knowledge graph, at least for key categories like pickles, that understands these relationships and attributes. Input from category managers like yourself would be invaluable for defining these rules and attributes.

Strong Domain Adaptation: Candidate quickly incorporates the domain knowledge, refines definitions, and proposes a hierarchical approach incorporating an ontology.

I1 (Principal Engineer): Good recognition of the complexity. And yes, your mention of architecture is key. Amazon India has, let's say, 50 million products in the catalog, with 100,000 new listings added daily. Before diving deep into the model, what's your high-level technical architecture for such a system, keeping in mind both real-time needs for new listings and batch processing for catalog cleanup?

C: For that scale, a multi-stage pipeline is essential, focusing on efficiency for candidate generation and accuracy for pairwise comparison.

High-Level Technical Architecture:

I. Data Ingestion & Preprocessing (Streaming & Batch):

New listings: Streamed via Kafka or Kinesis.
Existing catalog: Batch processed from product database/data lake.
Initial Preprocessing: Basic text cleaning, language detection, image normalization for hashing/initial blocking. Detailed preprocessing for model features happens later.

II. Stage 1: Candidate Generation / Blocking (to reduce N² comparisons):

Real-time (for new listings):
- Generate blocking keys based on normalized title tokens, primary category, brand (if known), phonetic keys.
- Query an inverted index (e.g., Elasticsearch, Solr) or a specialized blocking index (e.g., LSH Forest) for potential matches from the existing catalog.
Batch (for catalog cleanup):
- Techniques like Locality Sensitive Hashing (LSH) with MinHash for textual similarity, or perceptual hashing (pHash/aHash) for images.
- Category-based blocking.
- Attribute-based blocking (brand, manufacturer part number).

(Detailed feature extraction and model-specific data preparation for the candidate pairs will be discussed next)

III. Stage 2: Feature Engineering & Pairwise Classification / Similarity Scoring:

For each candidate pair (product A, product B) from the blocking stage:
- Extract detailed multi-modal features for each product (as we'll detail in data prep).
- A trained classifier (e.g., Siamese Network, XGBoost) takes these features (or embeddings from them) and outputs a probability of being a duplicate/variant.

IV. Stage 3: Clustering & Merging Decision:

Use graph-based clustering (e.g., Connected Components on pairs above a high confidence threshold) to find transitive relationships and form duplicate sets.
Apply business rules for merging or linking variants.

Technology Stack Considerations:

Streaming: Kafka, Flink/Spark Streaming.
Batch Processing: Spark, AWS EMR/Dataproc.
Indexing: Elasticsearch for blocking; FAISS or ScaNN for ANN search on embeddings.
Model Serving: SageMaker, Kubeflow.
Databases: Product catalog (DynamoDB/Cassandra), feature store, graph DB (Neptune).

Scalable Architecture Outline: Candidate proposes a well-reasoned multi-stage pipeline addressing both real-time and batch needs, with appropriate technology considerations, and tees up the data prep discussion.

Round 2: Deep Technical Dive - NLP & Multilingual Challenges

I1 (Principal Engineer): Let's get more technical on the NLP front. As you know, Telugu sellers often mix scripts in their titles and descriptions – for example, "అవకాయ Pickle" (Telugu script for Avakaya + English word Pickle). They also use various English transliterations like "Avakaya", "Avvakaya", or "Avakai". How would your system handle this script mixing and the phonetic/transliteration variations robustly, especially for the blocking and feature engineering stages?

C: This is a significant challenge requiring a sophisticated text preprocessing and normalization pipeline, which is a key part of our data preparation for the model.

Handling Mixed Scripts & Transliteration Variations:

Language & Script Detection:
- For each product title/description, first detect the languages and scripts present (e.g., using libraries like langdetect or a custom classifier). This helps apply appropriate normalization rules.
Script Conversion / Romanization:
- Convert all Telugu script text to a standardized Romanized form (e.g., using ISO 15919 or a consistent internal transliteration scheme). For example, "అవకాయ" would consistently become "avakāya" or "aavakaaya". This allows comparison with English transliterations provided by sellers.
```
# Example: Romanization (conceptual)
# from indic_transliteration import sanscript
# from indic_transliteration.schemes import ITRANS, TELUGU
# romanized_text = sanscript.transliterate(telugu_text, TELUGU, ITRANS) # Or other Roman scheme
```
Normalization of Transliterations:
- Sellers might use "Avakaya", "Aavakai", "Avvakaya". We need to normalize these to a canonical form. This can be done through:
  - Rule-based normalization: Define common phonetic mappings (e.g., 'aa' -> 'a', 'v' -> 'w' sometimes, double consonants to single).
  - Learned normalization: Train a sequence-to-sequence model (like a small transformer) on pairs of variant transliterations and their canonical forms, if such data can be curated.
Phonetic Encoding / Matching:
- Once text is in a common script (e.g., Romanized), apply phonetic algorithms. Standard Soundex is poor for Indian languages.
  - For English parts: Double Metaphone or a more modern phonetic algorithm.
  - For Romanized Telugu: Develop or use existing Indic phonetic algorithms (e.g., based on mapping to a common phonetic representation like IPA, or algorithms designed for Dravidian/Indic languages). These would understand that 'k' and 'kh' are phonetically related, or that 'అ' (a) and 'ఆ' (ā) are short/long versions of the same vowel sound.
  - This generates phonetic keys that can be used for blocking or as features.
```
# Conceptual Telugu Phonetic Hashing
# def get_telugu_phonetic_key(romanized_telugu_word):
#     # Apply rules for Telugu phonology:
#     # e.g., handle vowel length, consonant clusters, aspiration
#     # map similar sounds (like variations of 'cha') to a common key
#     key = romanized_telugu_word.lower()
#     key = key.replace('aa', 'a').replace('ee', 'i').replace('oo', 'u')
#     # ... more complex rules based on Telugu phonetics ...
#     return key
```
Character & N-gram Level Features:
- Use character n-grams (e.g., 3-grams, 4-grams) as features. These are robust to minor misspellings and transliteration variations. TF-IDF can be applied to these character n-grams.
Multilingual Embeddings:
- Use pre-trained multilingual models like mBERT, XLM-Roberta, or IndicBERT. These models are trained on many languages (including Telugu and English) and can often map semantically similar words/phrases in different scripts or transliterations close together in the embedding space. This is powerful for semantic similarity features.

The pipeline would be: Detect Script -> Convert to Standard Romanized Telugu + Keep English as is -> Normalize common English/Telugu transliterations -> Generate Phonetic Keys -> Create Character N-grams -> Generate Multilingual Embeddings. These normalized/featurized texts are then used for blocking and similarity calculations.

Deep NLP Expertise: Candidate details a robust pipeline for handling complex multilingual and multi-script text, including transliteration, phonetic matching, and leveraging advanced embeddings.

I1 (Principal Engineer): (pressing harder) You mentioned phonetic algorithms. Standard Soundex, as you know, was primarily designed for English surnames and is quite crude. "Avakaya" and "Avakai" would likely get different Soundex codes or very different phonetic keys with simple rules. How would you actually build or select a phonetic algorithm that's robust enough for complex Telugu phonetics, including consonant clusters and vowel length distinctions that are semantically important?

C: You're absolutely right. Standard Soundex is inadequate. My apologies for the oversimplification earlier. For robust Telugu phonetic matching, a more sophisticated approach is needed:

Leverage Indic NLP Libraries:
- There are specialized libraries and research in Indic NLP that provide tools for more accurate transliteration and phonetic analysis for languages like Telugu. I'd explore libraries like `iNLTK`, `indic-nlp-library`, or `CLTK (Classical Language Toolkit)` which might have modules for Telugu phonetics or syllable structures.
- These libraries often have built-in knowledge of Brahmic script properties, vowel signs (matras), consonant conjuncts (samyuktaksharas), which are crucial for correct phonetic representation.

Phonetic Transcription to a Standardized System (like IPA):

The most rigorous approach would be to convert both the Telugu script and its Romanized transliterations into the International Phonetic Alphabet (IPA). This provides a universal standard for representing sounds.

Once in IPA, we can compute phonetic similarity using weighted edit distances (e.g., Levenshtein distance where substitution costs are based on phonetic feature similarity – e.g., 'p' to 'b' is a smaller cost than 'p' to 's'). Tools exist for grapheme-to-phoneme conversion for many languages, and for Telugu, this might involve rule-based systems or ML models trained on Telugu pronunciation dictionaries.

# Conceptual: Telugu word to IPA
# def telugu_word_to_ipa(word_in_telugu_script_or_romanized):
#     # This would use a sophisticated G2P (Grapheme-to-Phoneme) engine
#     #  trained or configured for Telugu.
#     # For example, "అవకాయ" -> /aʋakaːja/
#     # "Avakai" (common translit) -> /aʋakaj/ (or similar)
#     ipa_representation = g2p_telugu_engine(word_in_telugu_script_or_romanized)
#     return ipa_representation

# def phonetic_feature_edit_distance(ipa1, ipa2):
#     # Calculates edit distance based on articulatory features of phonemes
#     # (e.g., place of articulation, manner, voicing for consonants; height, backness for vowels)
#     # A 'p' /p/ vs 'b' /b/ is one feature diff (voicing).
#     # A 'ka' /ka/ vs 'ga' /ga/ is one feature diff.
#     # A 'kai' /kaj/ vs 'kaya' /kaːja/ involves vowel length and glide diff.
#     pass

Custom Rule-Based Phonetic Hashing for Telugu:
- Develop a set of rules specifically for Telugu phonology:
  - Normalize vowel lengths (e.g., treat 'a' and 'aa' similarly for initial hashing, but retain difference for fine-grained similarity).
  - Group similar consonants (e.g., voiced/unvoiced pairs like k/g, c/j, t/d, p/b; aspirates with non-aspirates like kh/k).
  - Handle common sandhi (euphonic changes) if they affect spelling.
  - Standardize representation of consonant conjuncts.
Train a Phonetic Similarity Model:
- If we can curate a dataset of (Telugu word variant 1, Telugu word variant 2, phonetic_similarity_score), we could train a character-level Siamese neural network or a similar model to learn a mapping from spelling variations to a shared phonetic embedding space. The similarity score could be derived from human judgments or IPA-based distances.

For "Avakaya" vs. "Avakai": IPA might be /aʋakaːja/ vs. /aʋakaj/. The difference is in the final vowel length (aː vs a) and the presence/absence of the final /a/ glide. A good phonetic distance metric would score these as very similar, much more so than Soundex. My custom rules would also aim to normalize such common suffix variations typical in transliterations.

Advanced Phonetic Handling: Candidate recovers well, detailing sophisticated approaches like IPA conversion, phonetic feature distance, and custom Indic phonetic algorithms, demonstrating specialized NLP knowledge.

I2 (Category Manager): Interesting on the phonetics. But here's a business reality: sellers often deliberately create slight variations in titles and descriptions to avoid duplicate detection and get more visibility – for example, "Homemade Avakaya Pickle by Grandma Lakshmi" versus "Traditional Avakaya Pickle - Grandma Lakshmi's Authentic Recipe". They are gaming the system. How does your approach catch these kinds of adversarial, semantically very similar but textually slightly different duplicates?

C: This adversarial behavior is a common challenge in marketplaces. My system needs to be robust against such gaming. While strong semantic text similarity (from multilingual embeddings) will help, we need to incorporate features less susceptible to minor textual manipulation and also look at behavioral signals.

Catching Adversarial Duplicates:

Stronger Reliance on Non-Textual & Hard-to-Manipulate Features:
- Image Similarity: This is crucial. Sellers might change text, but often use the same or very similar product images. CNN-based image embeddings (e.g., from ResNet, ViT, or specialized product image models like Facebook's GrokNet if an equivalent could be trained) compared using cosine similarity would be a strong signal. We'd need to be robust to minor image edits (cropping, brightness changes).
- Structured Attributes (if available and reliable): If sellers provide EAN/UPC codes, manufacturer part numbers, precise weight (e.g., 485g vs. 500g could be a slight variant, but 500g vs. 500g is a strong match), or detailed, standardized ingredient lists. These are harder to falsify consistently.
- Price (Normalized): If two listings have very similar descriptions and images, and their price per unit (e.g., price per 100g) is nearly identical, it's a strong indicator, even if titles are tweaked.
Semantic Text Similarity (Beyond Keywords):
- Advanced embeddings from models like mBERT or sentence transformers can capture that "Homemade Avakaya Pickle by Grandma Lakshmi" and "Traditional Avakaya Pickle - Grandma Lakshmi's Authentic Recipe" are semantically extremely close, even if keyword TF-IDF might differ more. The presence of key entities like "Avakaya Pickle" and "Grandma Lakshmi" would be captured.
Seller Behavior Analysis:
- Same Seller, Multiple Similar Listings: If the same seller lists multiple products with highly similar images, attributes, and semantic descriptions within a short timeframe, flag for review.
- Rapid Listing Variations: Monitor sellers who frequently de-list and re-list items with minor changes.
Graph-Based Detection:
- Construct a product graph where nodes are products and edges represent high similarity along one or more dimensions (text, image, seller).
- Look for dense clusters of products from the same seller that are all highly similar to each other but just under individual pairwise duplicate thresholds. These "near-duplicate clouds" can indicate deliberate obfuscation.
- Use community detection algorithms on this graph.
Customer Behavior Signals (More complex to integrate but powerful):
- If customers frequently view Product A then immediately view and purchase Product B (and they are very similar), it might suggest B is the canonical or preferred version, or they are perceived as interchangeable.
- High rate of "Report incorrect information" or "Is this the same as X?" from customers on product pages.
Human Review for Adversarial Patterns:
- Use machine learning to flag suspicious patterns of listing behavior (e.g., many similar items from one seller with minor name tweaks). These flagged sellers/listings can be prioritized for human review to identify new adversarial tactics, which can then be used to update the ML models or rules.

The system needs to learn to down-weight easily gamed features (like slight title fluff) and up-weight more stable signals like image content and core semantic meaning when adversarial behavior is suspected.

Addressing Adversarial Behavior: Candidate proposes a multi-pronged defense using robust features (images, structured data), advanced NLP, seller behavior analysis, and graph methods.

Round 3: Data Preparation for Model Training

I1 (Principal Engineer): You've mentioned a Siamese network model, and we've discussed handling linguistic complexities. Before we detail the model architecture itself, let's focus on the lifeblood of any supervised model: the training data. You touched on scalable generation, but could you elaborate on the entire data preparation pipeline for training this model? Specifically, how do you construct the (product A, product B, label) pairs or triplets, what features are extracted for each product before they enter the Siamese towers, and how do you handle common data challenges like imbalance and creating robust train/val/test splits?

C: Absolutely. High-quality, well-prepared training data is paramount for the success of the Siamese network, or any classification model we use for pairwise comparison. Here's how I'd approach it:

Comprehensive Data Preparation Pipeline for Model Training:

1. Labeling Strategy & Pair/Triplet Generation:

We need to generate (Product A, Product B, Label) pairs, or (Anchor, Positive, Negative) triplets if using triplet loss.

Positive Pairs (Label=1 - Duplicates):
- High-Confidence Rules: Generated from exact matches on normalized titles + same brand + very similar image embeddings (e.g., cosine sim > 0.98) + price within +/- 5%. Also, products with identical UPC/EAN codes but different ASINs.
- Manually Merged Data: Leverage any existing data where catalog teams have manually identified and merged duplicates. This is high-quality signal.
- Self-Supervision (Augmentation): For a given product, create an augmented version (e.g., slight text paraphrasing, minor image changes) and consider it a positive pair with the original. This helps learn robustness.
- Seller-Confirmed Duplicates: If sellers flag their own listings as duplicates or confirm system suggestions.
Negative Pairs (Label=0 - Non-Duplicates):
- Easy Negatives: Random pairs of products from vastly different fine-grained categories (e.g., "Avakaya Pickle" vs. "Mobile Phone"). These help the model learn gross differences quickly.
- Hard Negatives: These are crucial for learning fine distinctions.
  - Products within the same fine-grained category but with clearly distinct key attributes identified through NLP (e.g., "Mango Pickle" vs. "Lemon Pickle").
  - Pairs that are "close" in the embedding space from a previous model iteration but are known non-duplicates (e.g., different variants like "Spicy Guntur Mango Pickle" vs. "Mild Andhra Mango Pickle").
  - Products that were candidate pairs from the blocking stage but were confirmed as non-duplicates by human annotators.
Triplet Generation (for Triplet Loss):
- For an "Anchor" product, find a "Positive" (true duplicate) and a "Negative" (non-duplicate, ideally a hard negative). The goal is to train the model such that `distance(Anchor, Positive) + margin < distance(Anchor, Negative)`.
Active Learning & Weak Supervision: As discussed, these will be ongoing sources for new labeled pairs, especially for ambiguous cases.

2. Feature Extraction for Individual Product Inputs (Per Tower):

For each product (Product A or Product B) that goes into one of the Siamese network's towers, we extract a multi-modal feature set:

Textual Features (for Text Encoder):
- Input: Concatenated normalized title and key phrases from the description.
- Cleaning & Normalization: Lowercasing, removal of special characters/HTML. Script conversion (Telugu to Romanized standard) and transliteration normalization (as discussed in NLP round).
- Tokenization: Using a multilingual tokenizer compatible with the chosen transformer model (e.g., SentencePiece, WordPiece for mBERT).
- Padding/Truncation: Pad sequences to a fixed maximum length (e.g., 128 or 256 tokens) and truncate longer sequences. Generate attention masks.
Image Features (for Image Encoder):
- Input: Primary product image.
- Normalization:
  - Resize to a fixed input size required by the CNN (e.g., 224x224 or 299x299).
  - Pixel value scaling (e.g., to [0, 1] range).
  - Normalization using mean and standard deviation statistics from the dataset the pre-trained CNN was trained on (e.g., ImageNet stats).
- Training-Time Augmentation: To make the model robust to variations:
  - Random horizontal flips.
  - Random rotations (small angles).
  - Color jitter (brightness, contrast, saturation).
  - Random minor crops and resizing.
Structured Attribute Features (for Attribute Encoder):
- Numeric Attributes (e.g., price, weight, pack_count):
  - Handling Missing Values: Impute with mean/median/mode for that category, or use a special "missing" flag/embedding.
  - Normalization: Z-score normalization or Min-Max scaling to bring them to a similar range.
- Categorical Attributes (e.g., brand, fine-grained_category_id, seller_tier):
  - Handling Missing Values: Assign a special "unknown" category.
  - Encoding: If the attribute encoder is a simple MLP, use one-hot encoding for low-cardinality attributes. For high-cardinality attributes (like brand_id), learn an embedding vector for each category using an `nn.Embedding` layer.
- Boolean/Binary Attributes (e.g., is_handmade, contains_garlic): Encode as 0/1.

The output of each encoder (text, image, attribute) for a product will be an embedding, which are then fused to create a single multi-modal embedding for that product within its tower.

3. Train/Validation/Test Splits:

Challenge: Data Leakage. Randomly splitting pairs can lead to leakage if different listings of the same underlying true product entity appear in both train and test sets (e.g., ProductA-ProductB in train, ProductA-ProductC in test, where A,B,C are all duplicates of each other). This would inflate validation/test scores.
Strategy: Group-Based Splitting.
- First, identify "true entity clusters" using high-confidence rules or existing merged ASIN groups.
- Ensure that all listings belonging to one true entity cluster are entirely within one split (train, validation, or test). This prevents leakage.
- Split these entity clusters randomly into train/val/test (e.g., 70%/15%/15%).
Temporal Split (for validation/test): Optionally, use older data for training and more recent data for validation/testing to better simulate how the model will perform on new, unseen listings over time.

4. Handling Class Imbalance:

The number of true non-duplicate pairs will vastly outnumber true duplicate pairs.

Sampling Strategies for Pair Generation:
- During training batch creation, we can control the ratio of positive to negative pairs. E.g., for every positive pair, sample K negative pairs (where K might be 1-5, instead of the true much larger ratio).
- Prioritize sampling hard negatives over easy negatives to make training more effective.
Weighted Loss Function: Assign a higher weight to the minority class (duplicates) in the binary cross-entropy loss function. The weight can be inversely proportional to class frequency.
Evaluation Metrics: Focus on metrics robust to imbalance, like Precision, Recall, F1-score (especially for the positive class), and Area Under the Precision-Recall Curve (AUC-PR). Accuracy is misleading here.

This detailed data preparation pipeline ensures that the model receives clean, well-structured, and representative data, which is crucial for learning effective similarity representations.

Comprehensive Data Preparation: Candidate details a thorough pipeline covering labeling, multi-modal feature extraction, robust train/val/test splitting to avoid leakage, and strategies for handling class imbalance. This is a critical and well-addressed section.

Round 4: Machine Learning Model Architecture

I1 (Principal Engineer): That's a very thorough data preparation plan. Now, let's dive into the model you proposed earlier: the Siamese Network. Walk me through its architecture in more detail, how the different encoders and the comparison mechanism work, and the choice of loss function.

C: Certainly. Given our multi-modal input (text, images, structured attributes) for each product, a Siamese Network architecture is well-suited. It's designed to learn an embedding space where similar items are mapped closely together and dissimilar items are mapped further apart.

Siamese Network Architecture:

The core idea is to have two identical "towers" (sub-networks) that process each product in a pair independently using the same set of weights. The outputs of these towers (product embeddings) are then compared.

Input Product Representation (Per Tower):
- As detailed in our data preparation discussion, each product (Product A, Product B) will have its textual data (tokenized sequences), image data (normalized pixel arrays), and structured attributes (normalized/encoded).
Encoders within Each Tower (Shared Weights):
- Text Encoder:
  - Architecture: A pre-trained multilingual transformer model (e.g., mBERT, XLM-Roberta, or a distilled version like DistilmBERT for speed). We'd typically use the output embedding of the `[CLS]` token or an average of token embeddings from the last hidden layer.
  - Fine-tuning: The weights of this transformer can be fine-tuned during the Siamese network training to adapt it to our specific product domain and similarity task.
  - Output: A fixed-size text embedding vector (e.g., 768 dimensions for BERT-base).
- Image Encoder:
  - Architecture: A pre-trained Convolutional Neural Network (CNN) like ResNet50, EfficientNet, or Vision Transformer (ViT), with its final classification layer removed.
  - Fine-tuning: Similar to the text encoder, the CNN weights can be fine-tuned.
  - Output: A fixed-size image embedding vector (e.g., 2048 dimensions for ResNet50 before the FC layer).
- Structured Attribute Encoder:
  - Architecture: This could be a simpler Multi-Layer Perceptron (MLP).
    - Numeric features can be directly fed into the MLP.
    - Categorical features would first pass through `nn.Embedding` layers to get dense vector representations, which are then concatenated with numeric features and fed to the MLP.
  - Output: A fixed-size attribute embedding vector.
Fusion Layer (Per Tower):
- The individual embeddings (text, image, attributes) for a single product need to be combined into a unified multi-modal product embedding.
- Method: Concatenate the text, image, and attribute embeddings. This combined vector is then passed through one or more dense (fully connected) layers with activation functions (e.g., ReLU) to allow the model to learn interactions between modalities and project to a final shared embedding space dimension (e.g., 256 or 512 dimensions).
- This final vector, let's call it `E_product`, is the output of one tower for one product.
Similarity Computation / Interaction Layer (Comparing Tower Outputs):
- Once we have the final embeddings `E_A` for Product A and `E_B` for Product B from the two towers:
- Option 1 (Distance-based, for Contrastive/Triplet Loss):
  - Calculate a distance metric, e.g., Euclidean distance or Cosine distance: `D(E_A, E_B)`. Cosine similarity is often preferred for high-dimensional embeddings.
- Option 2 (Interaction-based, for Binary Classification):
  - Combine `E_A` and `E_B` in a more complex way to capture their interaction. A common approach is to concatenate `E_A`, `E_B`, their element-wise difference `|E_A - E_B|`, and their element-wise product `E_A * E_B`.
  - This combined interaction vector is then passed through a few dense layers terminating in a single sigmoid output neuron. This neuron predicts the probability that Product A and Product B are duplicates (a score between 0 and 1).
Loss Function:
- Contrastive Loss (if using distance-based comparison): Aims to pull positive pairs (duplicates) closer in the embedding space and push negative pairs (non-duplicates) further apart than a defined margin.
```
Loss = y * D(E_A, E_B)^2 + (1-y) * max(0, margin - D(E_A, E_B))^2
```
  (where `y=1` for positive pairs, `y=0` for negative pairs).
- Triplet Loss (if using distance-based comparison and (Anchor, Positive, Negative) triplets): Aims to ensure the anchor is closer to the positive than to the negative by at least a margin.
```
Loss = max(0, D(E_Anchor, E_Positive)^2 - D(E_Anchor, E_Negative)^2 + margin)
```
- Binary Cross-Entropy Loss (if using interaction-based binary classification): This is standard for binary classification tasks. Used with the sigmoid output. We can incorporate class weights here to handle imbalance.

The choice between distance-based loss (Contrastive/Triplet) and interaction-based classification depends on whether the primary goal is to learn a versatile embedding space for various downstream tasks (favoring distance-based) or to optimize purely for the pairwise duplicate classification task (favoring interaction-based, which often performs slightly better for the direct task).


# Conceptual Siamese architecture using PyTorch-like pseudocode (interaction-based)
# class ProductEncoderTower(nn.Module):
#     def __init__(self):
#         super().__init__()
#         self.text_encoder = MultilingualBERTEncoder() # Outputs text_embedding
#         self.image_encoder = ResNetEncoder()         # Outputs image_embedding
#         self.attr_encoder = AttributesEncoder()      # Outputs attr_embedding
#         self.fusion_mlp = nn.Sequential(nn.Linear(text_dim + img_dim + attr_dim, hidden_dim1), 
#                                      nn.ReLU(), 
#                                      nn.Linear(hidden_dim1, final_embedding_dim)) # E.g., final_embedding_dim = 256
#     def forward(self, product_text_input, product_image_input, product_attr_input):
#         txt_emb = self.text_encoder(product_text_input)
#         img_emb = self.image_encoder(product_image_input)
#         atr_emb = self.attr_encoder(product_attr_input)
#         combined_initial_emb = torch.cat((txt_emb, img_emb, atr_emb), dim=1)
#         final_product_embedding = self.fusion_mlp(combined_initial_emb)
#         return final_product_embedding

# class SiameseDuplicateClassifier(nn.Module):
#     def __init__(self):
#         super().__init__()
#         self.product_tower = ProductEncoderTower() # Shared tower
#         # Interaction layers to compare two product embeddings
#         # Input to comparison_mlp is 4 * final_embedding_dim if concatenating E_A, E_B, |E_A-E_B|, E_A*E_B
#         self.comparison_mlp = nn.Sequential(nn.Linear(4 * final_embedding_dim, hidden_comp_dim), 
#                                           nn.ReLU(), 
#                                           nn.Linear(hidden_comp_dim, 1), 
#                                           nn.Sigmoid()) 
#     def forward(self, product1_data, product2_data):
#         # product1_data and product2_data are tuples/dicts of (text_input, image_input, attr_input)
#         emb1 = self.product_tower(*product1_data)
#         emb2 = self.product_tower(*product2_data)
        
#         # Interaction features
#         diff = torch.abs(emb1 - emb2)
#         prod_ew = emb1 * emb2
#         interaction_vec = torch.cat((emb1, emb2, diff, prod_ew), dim=1)
        
#         return self.comparison_mlp(interaction_vec)

This architecture allows for end-to-end training and leverages powerful pre-trained models for different modalities, adapting them to the specific nuances of product similarity.

Sophisticated Model Architecture: Candidate details a suitable Siamese network, covering encoders for each modality, fusion, comparison mechanisms, and appropriate loss functions. Pseudocode clarifies the structure.

Round 5: Scalability & Real-time Implementation

I1 (Principal Engineer): Let's talk scale again. 50 million existing products, 100K new listings daily. Your pairwise comparison approach, even after some initial filtering, can still be computationally intensive if the candidate sets are large. The classic O(N²) problem is a killer. How do you ensure your system is computationally feasible for both the massive batch cleanup and the near real-time checking of new listings? What specific techniques in blocking are critical?

C: You're absolutely right, naive pairwise comparison is out of the question. Efficient blocking is the cornerstone of making this scalable.

Multi-Level Blocking Strategy for Scalability:

The goal is to progressively reduce the number of pairs that need detailed, expensive feature computation and model scoring.

Level 1: Coarse-Grained Attribute Blocking (Initial Broad Cut):
- Category Blocking: This is the most impactful first step. Only compare products within the same fine-grained leaf category (e.g., "Mango Pickles," not just "Pickles" or "Groceries"). This relies on a good product categorization system. If a product is in "Pickles," we compare it to other "Pickles." This can reduce comparisons by orders of magnitude (e.g., 100x-1000x).
- Brand Blocking (where applicable): If brand information is reliable, only compare products of the same brand or where one brand is missing (to catch unbranded versions of branded products). Be cautious, as sellers misspell brands or list them generically.
Level 2: Content-Based Approximate Similarity Blocking (Hashing):
- LSH (Locality Sensitive Hashing) with MinHash for Text:
  - Convert product titles (and optionally key description phrases) into sets of shingles (e.g., character 3-grams or 5-grams after normalization).
  - Apply MinHash to these sets to create compact signatures.
  - Use LSH techniques (e.g., banding) to hash these signatures into buckets. Products falling into the same LSH bucket are considered candidates. This is very efficient for finding textually similar items.
- Perceptual Hashing for Images (e.g., pHash, aHash, dHash):
  - Generate compact hashes for product images. Products with very similar image hashes (low Hamming distance) are strong candidates. This can catch duplicates even if text differs significantly.
- Phonetic Blocking Keys: Use the Indic phonetic algorithms discussed earlier to generate phonetic keys for key product name tokens. Products sharing phonetic keys are candidates.
Level 3: Embedding-Based Approximate Nearest Neighbor (ANN) Search (Semantic Blocking):
- For all products, pre-compute the multi-modal embeddings (text, image, attributes) using our trained `ProductEncoderTower`.
- Store these embeddings in an ANN index (e.g., FAISS, ScaNN, HNSW).
- For a new product (or each product in batch mode), query the ANN index to find the top-K most similar products in the embedding space. These become candidates. This captures semantic similarity even if keywords or exact phrasing differs.

Combining Blocking Strategies:

A product pair becomes a candidate for detailed comparison if it passes through any of these relevant blocking stages, or a combination (e.g., same category AND similar LSH hash). The union of candidate pairs from these methods forms the set for detailed pairwise classification.

Real-time vs. Batch Adaptation:

Real-time (New Listings): The new product is processed. Its blocking keys and a quickly generated embedding are created. Query pre-built indexes (Elasticsearch for keyword/attribute blocks, LSH Forest, FAISS index for embeddings) of the existing catalog. This should return a small set of candidates (e.g., top 50-100) very quickly.
Batch (Catalog Cleanup): Can afford more comprehensive blocking. For example, an all-pairs LSH comparison within categories, or building a graph where edges are created by blocking rule matches, then running clustering. Spark is well-suited for this. All existing product embeddings are already pre-computed and stored.

This multi-level blocking ensures that we dramatically reduce the number of pairs from N² to something closer to O(N * k_avg_candidates), which is computationally feasible.

Scalable Blocking Design: Candidate details a sophisticated multi-level blocking strategy combining attribute, hashing, and embedding-based methods, and considers real-time vs. batch differences.

I1 (Principal Engineer): Good blocking strategy. But what about the cold start problem for categories? New, niche product categories emerge constantly on Amazon, especially in diverse markets like India – for example, "Millets-based Pickles," "Organic Single-Origin Avakaya," or "Sugar-Free Telugu Sweets." Your category-based blocking, which is a powerful first filter, would fail for products in these new or poorly defined categories as they wouldn't have many existing products to be blocked against. How do you handle duplicate detection for products in novel or sparsely populated categories?

C: That's a critical point for a dynamic catalog. If a product falls into a new or very sparse category, relying solely on category-based blocking will indeed miss many potential duplicates that might exist under slightly different (but semantically similar) new category names, or even miscategorized older products.

Handling Novel/Sparse Categories:

Adaptive & Hierarchical Category Blocking:
- Our product categorization system needs to be hierarchical. If a product is in a leaf category with very few items (e.g., < N products, where N is a threshold like 50), the blocking logic should automatically "fall back" to comparing it against products in its parent category or even sibling categories that are semantically close.
- Example: If "Organic Single-Origin Guntur Mango Avakaya" is a new sub-category, it should still be compared against products in "Guntur Mango Avakaya" and "Organic Mango Pickles."
Semantic Category Matching / Expansion:
- Use embeddings for category names/descriptions themselves. If a new product's assigned category "Millets-Avakaya" is novel, find existing categories that are semantically closest (e.g., "Avakaya," "Mixed Grain Pickles," "Health Pickles") and include products from those in the candidate generation pool.
- This requires a system to map new, seller-provided category strings to our internal taxonomy or find closest matches.
Increased Reliance on Content-Based Blocking (LSH, Embeddings) for Novel Categories:
- For products in sparse/new categories, the LSH (on text) and ANN search (on multi-modal embeddings) stages become even more critical. These methods don't rely on pre-defined category blocks and can find similar items across the entire catalog (or a very broad super-category) if configured to do so, albeit with higher computational cost if the search space isn't narrowed.
- We might have a two-pass blocking: first, strict category blocking; if few candidates are found and the category is sparse, then a second, broader search using embeddings or relaxed LSH across related super-categories.
Dynamic Category Creation & Clustering for Orphaned Products:
- Periodically, run clustering algorithms (e.g., on product title/description embeddings) on products that are in very small categories or haven't matched any duplicates for a long time ("orphans").
- These clusters can help identify emerging product types or de facto new categories that sellers are creating. Human review can then formalize these into the taxonomy. This helps the category blocking improve over time.
Zero-Shot or Few-Shot Category Assignment for New Listings:
- For incoming new listings, if the seller provides a category string that doesn't map to our existing taxonomy, use a text classification model (trained on existing product titles/descriptions and their categories) to predict the most likely existing fine-grained category. This helps place it correctly for blocking. If confidence is low, it can be flagged for "broader blocking search."

The system needs to be adaptive, falling back to broader similarity searches when fine-grained category information is insufficient or unreliable, and continuously learning and updating its category understanding.

Addressing Cold-Start Categories: Candidate proposes robust strategies including hierarchical blocking, semantic category matching, and dynamic category creation, showing foresight in system design.

Round 6: System Rollout Strategy & Risk Management

I1 (Principal Engineer): Deploying a system that directly impacts millions of listings and potentially merges products from thousands of sellers is inherently risky. A mistake at scale could be very damaging. How would you plan the rollout of this duplicate detection system to minimize disruption and manage these risks effectively?

C: You're absolutely right. A "big bang" deployment is out of the question. We need a carefully orchestrated, phased rollout strategy with strong monitoring and rollback capabilities.

Phased Rollout Strategy & Risk Management:

Phase 0: Internal Validation & Shadow Mode:
- Shadow Mode Deployment: Deploy the system to run in parallel with existing processes (if any) or on production data, but without taking any automated actions (no actual merges). It only logs its intended decisions (e.g., "would merge A and B," "would flag C as potential duplicate of D").
- Internal Audits: A dedicated team (including category managers and data scientists) thoroughly audits the shadow mode outputs for a diverse set of categories, focusing on:
  - Accuracy of proposed merges (especially for high-impact products).
  - Potential for incorrect merges of variants.
  - Impact on known high-value ASINs.
- Refine Thresholds & Rules: Based on audit findings, refine the model's confidence thresholds for automated actions, business rules for variant handling, and blocking logic. Iterate until internal audit results meet a high-quality bar.
Phase 1: Limited Rollout to Low-Risk Categories/Sellers (Canary Release):
- Select Pilot Categories: Start with a few well-understood, lower-risk product categories where duplicate definitions are clearer and the impact of an error is less catastrophic. Avoid highly nuanced categories like handmade goods initially.
- Small Seller Segment: Potentially, also pilot with a small, cooperative group of sellers who have opted-in or are known for good catalog practices.
- Enable Automated Actions (High-Confidence Only): For this pilot group, enable automated merging only for pairs where the system has extremely high confidence (e.g., >99.5% probability) and multiple signals agree (text, image, attributes, brand). All other potential duplicates are flagged for human review.
- Intensive Monitoring: Closely monitor actual merge accuracy, seller feedback/complaints from the pilot group, impact on search and sales for the affected ASINs, and system performance metrics (latency, errors).
Phase 2: Gradual Expansion (A/B Testing within Categories):
- Category-by-Category Expansion: Based on successful pilot, gradually expand to more categories, prioritizing those with high duplicate rates but manageable complexity.
- A/B Testing of Merging Logic: Within a category, we can run A/B tests comparing customer experience metrics (search CTR, conversion) for users seeing the deduplicated catalog versus the original.
- Iterative Threshold Adjustment: Continuously adjust merge confidence thresholds based on performance in each category.
Phase 3: Full Rollout with Continuous Monitoring & Refinement:
- Once confident in system stability and accuracy across a broad range of categories, move to a wider rollout.
- Even at full rollout, maintain the ability to quickly disable automated merging for specific categories or globally if systemic issues arise.

Risk Management & Mitigation:

Robust Rollback Plan: Have a well-tested mechanism to revert merges if widespread errors are detected. This includes un-merging ASINs and restoring original listing data.
Human Review Queue: Maintain a robust human review system for medium-confidence predictions, disputed merges, and anomalous merges.
Clear Communication with Sellers: Proactively communicate upcoming changes to sellers. Provide clear guidelines on listing products, variants, and disputing merges.
Impact Analysis Dashboard: Develop dashboards to track the impact of merges on key metrics in real-time.
Incident Response Plan: Have a clear plan for responding to bad merges, including identification, notification, and rollback.

The key is to be cautious, data-driven, and iterative, prioritizing the minimization of negative impact on customers and sellers throughout the rollout process.

Methodical Rollout Plan: Candidate details a comprehensive, phased rollout strategy including shadow mode, canary releases, and A/B testing, with strong emphasis on risk mitigation and seller communication.

Round 7: Business Logic & Edge Cases

I2 (Category Manager): Here's a tricky business scenario. A seller has legitimate product variants: "Mild Avakaya", "Medium Spicy Avakaya", and "Extra Hot Avakaya". They are all "Avakaya" from the same brand, same size. Your algorithm, focusing on base product similarity, might flag them as duplicates. But customers specifically search for and expect these spice level variations. How does your system handle such nuanced cases to avoid incorrectly merging distinct, desirable variants?

C: This is where a purely similarity-based duplicate detection system can fail without incorporating understanding of legitimate product variations. We need attribute-aware duplicate and variant detection.

Attribute-Aware Duplicate/Variant Handling:

Structured Attribute Extraction & Taxonomy:
- For relevant categories like "Pickles," we need to define a taxonomy of key differentiating attributes. For Avakaya, this would include "Spice Level" (e.g., Mild, Medium, Hot, Extra Hot), "Key Ingredient Variation" (e.g., With Garlic, Without Garlic, With Fenugreek), "Regional Style" (e.g., Guntur, Godavari, Coastal Andhra).
- Use NLP techniques (NER, rule-based pattern matching, or a trained attribute extraction model) to identify these attribute values from product titles and descriptions. E.g., extract "Mild" from "Mild Avakaya".

Variant vs. Duplicate Classification Logic:

After the initial similarity model identifies a pair of products as highly similar (e.g., "Mild Avakaya 500g" and "Medium Spicy Avakaya 500g" from the same brand), a secondary classification step or rule engine kicks in:


# Conceptual Classification Logic
# def classify_product_relationship(product1_data, product2_data, base_similarity_score):
#     if base_similarity_score < DUPLICATE_THRESHOLD_LOW:
#         return "DIFFERENT_PRODUCTS"

#     attrs1 = extract_key_attributes(product1_data.title, product1_data.description)
#     attrs2 = extract_key_attributes(product2_data.title, product2_data.description)

#     if attrs1['base_pickle'] != attrs2['base_pickle'] or product1_data.brand != product2_data.brand:
#         return "NEEDS_REVIEW_DIFFERENT_BASE" 

#     differing_variant_attributes = []
#     for attr_name in VALID_VARIANT_DIMENSIONS_FOR_CATEGORY['avakaya_pickle']:
#         val1 = attrs1.get(attr_name)
#         val2 = attrs2.get(attr_name)
#         if val1 != val2 and (val1 is not None or val2 is not None): # Ensure actual difference
#             differing_variant_attributes.append(attr_name)
            
#     if len(differing_variant_attributes) > 0 and base_similarity_score > VARIANT_SIMILARITY_THRESHOLD_HIGH:
#         return "PRODUCT_VARIANTS"
#     elif len(differing_variant_attributes) == 0 and base_similarity_score > DUPLICATE_THRESHOLD_HIGH:
#         return "TRUE_DUPLICATES"
#     else:
#         return "UNCERTAIN_NEEDS_REVIEW"

Business Rules Integration:
- Maintain a "variant dimension whitelist" for each product category.
- If two products are highly similar but differ only along these whitelisted variant dimensions, they are classified as "Variants" and should be grouped under a parent ASIN, not merged into one.
- If they are highly similar and have no differences in variant dimensions, they are "True Duplicates."
Learning from Customer Search Behavior:
- Analyze search queries. If customers frequently search for "mild avakaya" or "avakaya no garlic," it validates these as important variant dimensions that need to be preserved.
Seller Interface for Variants:
- Provide sellers with clear tools to list product variations correctly under a single parent product, reducing the creation of separate pseudo-duplicate listings for variants.

This attribute-aware layer ensures that legitimate product diversity desired by customers is maintained, while still catching true duplicates.

Handling Product Variants: Candidate proposes a robust solution involving structured attribute extraction, a variant dimension whitelist, and integrating business rules to differentiate true duplicates from legitimate variants.

I2 (Category Manager): Smart attribute-aware approach. But let's add another layer of complexity. Same seller, same base product (e.g., Priya's Mango Avakaya), but it's listed with different package sizes: 200g, 500g, 1kg. Technically, these are variants. But what if the pricing isn't directly proportional? Sometimes smaller packs have premium pricing per unit due to packaging costs, or conversely, large packs have bulk discounts. How does your system decide if these are just size variants to be grouped, or if significantly different per-unit pricing makes them effectively "different products" from a customer value or seller strategy perspective?

C: That's an excellent point where pure attribute matching isn't enough; financial and perceived value aspects come into play.

Package Size & Pricing Strategy Handling:

For products that are identical except for package size, the decision to treat them as simple size variants or distinct offerings depends heavily on the price per unit consistency.

Extract Normalized Size & Price: Standardize sizes (e.g., to grams) and extract prices.
Calculate Price Per Unit (PPU): E.g., price per 100g.
Analyze PPU Consistency / Relationship:
- Scenario A: Proportional or Expected Non-Linearity: If PPUs show a consistent pattern (bulk discount, small pack premium), these are likely legitimate size variants.
```
# Conceptual PPU analysis
# def check_ppu_consistency(ppu_values_with_sizes):
#     # Sort by size
#     # Check for monotonic decrease/increase in PPU or low coefficient of variation
#     # Thresholds tuned per category
#     pass
```
  Action: Group as size variants.
- Scenario B: Highly Inconsistent or Irrational PPU: If PPU is erratic (e.g., 500g cheaper per unit than 1kg), this could indicate data quality issues, different offerings (e.g., gift pack), or pricing errors. Action: Flag for review. Be cautious about automatically merging.
Integrate with Business Rules & Category Norms: PPU consistency checks can be tuned based on category norms.

So, if products are identical in all but size, and their Price Per Unit follows a reasonable pattern, they are treated as size variants. If PPU is erratic, they are flagged.

Sophisticated Size/Price Handling: Candidate introduces Price Per Unit (PPU) analysis and considers expected non-linearities to differentiate true size variants from potentially distinct offerings.

Round 8: Model Evaluation

I1 (Principal Engineer): How do you robustly evaluate the end-to-end performance of your duplicate detection system? Traditional machine learning metrics like precision and recall for pairwise classification are important, but they might not fully capture the business impact on catalog quality or customer experience. What's your comprehensive evaluation framework?

C: Evaluating an entity resolution system like this requires a multi-layered approach, looking at technical model performance, cluster quality, and, importantly, impact on key business metrics.

Multi-Layered Evaluation Strategy:

1. Pairwise Classification Model Metrics (Offline):

On a labeled test set of product pairs:
- Precision: `TP / (TP + FP)`. Critical to minimize incorrect merges.
- Recall: `TP / (TP + FN)`. Important to minimize missed duplicates.
- F1-Score / F-beta Score: To balance Precision and Recall, possibly weighing Precision higher.
- AUC-PR (Area Under Precision-Recall Curve): More informative for imbalanced datasets.

2. Cluster-Level Quality Metrics (Offline):

After pairwise classification and graph clustering to form sets of duplicates:
- Purity, Normalized Mutual Information (NMI), Adjusted Rand Index (ARI): Compare formed clusters against ground truth clusters.
- Manual audit of a sample of generated duplicate clusters by domain experts.

3. Business Impact Metrics (Online A/B Testing):

This is crucial. A/B test the impact of showing a deduplicated catalog versus the original.

Control Group: Users see catalog/search as is.
Treatment Group: Users see catalog/search where identified duplicates are merged/canonicalized.

Key metrics to compare:

Search Experience: Search CTR, Search-to-Detail Page CTR, Search Abandonment Rate.
Conversion & Sales: Overall Conversion Rate, AOV, GMV.
Customer Satisfaction & Catalog Quality: Customer complaints related to product confusion, return rates, CSAT scores, "Report incorrect product information" flags.
Seller Impact: Sales distribution among sellers, seller complaints about incorrect merges (critical guardrail).

4. Operational Metrics for the System Itself (To monitor, not as primary evaluation):

Latency of real-time checks, throughput of batch processing, computational cost, rate of human review needed.

Offline metrics guide model development, but online A/B testing validates true business impact. The ideal outcome is improved search and conversion metrics, and better CSAT, without unduly harming sellers.

Comprehensive Evaluation: Candidate covers technical ML metrics, cluster quality, and crucially, business impact metrics measured via A/B testing. Operational metrics are mentioned for monitoring.

I1 (Principal Engineer): Good framework for evaluation. But here's a specific challenge. You deploy your system, and the technical metrics (Precision/Recall on held-out annotated data) look great, say 95% precision. However, you start getting a significant number of complaints from legitimate Telugu sellers, primarily in Telugu, saying your system is incorrectly merging their genuinely different regional product variations, which your annotators (who might not all be Telugu food experts) missed. How do you debug this cultural/linguistic gap in your system's performance and improve it?

C: This is a classic "model performs well on test data but fails on real-world nuances" problem. A 95% precision might still mean 5 out of 100 merges are wrong, which at scale is many incorrect merges.

Debugging Cultural/Linguistic Gaps:

Prioritize and Systematically Analyze Seller Complaints:
- Translation & Categorization: Accurately translate and categorize complaints by reason (e.g., "merged different spice levels," "merged different regional preparations").
- Root Cause Analysis: For each valid incorrect merge, analyze why the model failed. Which features contributed? (Use SHAP/LIME). Was it text normalization, image similarity, missing attributes, or ontology gaps?
Augment Training & Evaluation Data with Cultural Expertise:
- Hire/Consult Native Telugu Speakers with Food/Cultural Knowledge: Involve them in reviewing disputed merges, creating new high-quality labeled data (especially hard negatives/positives for nuanced cases), and refining the product ontology for Telugu regional products.
- Update Annotation Guidelines: Make guidelines more specific about Telugu product variations.
Refine Features & Model Based on Error Analysis:
- Feature Engineering: Improve transliteration, NER for distinguishing terms, etc.
- Model Retraining: Retrain with augmented/corrected data.
- Adjust Confidence Thresholds: For nuanced categories, use higher thresholds for automated merging and send more to human review by cultural experts.
Implement a "Seller Dispute" Mechanism with Feedback Loop: Allow sellers to easily flag incorrect merges. These become high-priority inputs for review and retraining.
A/B Test Changes on Seller Segments: Test model updates aimed at fixing cultural gaps on Telugu seller segments first.

The key is a tight feedback loop between seller complaints, cultural expert review, data augmentation, and model retraining, focusing on linguistic and domain-specific error patterns.

Addressing Cultural/Linguistic Errors: Candidate details a strong process involving native speaker review, targeted data augmentation, model retraining, and a seller feedback loop to handle nuanced errors.

Round 9: Ethical Considerations & Bias Mitigation

I2 (Category Manager): We've touched on seller fairness regarding sales impact. Beyond that, what other ethical considerations or potential biases particularly concern you in a system like this, especially given the linguistic diversity and the varied technical capabilities of sellers in India? How would you proactively try to identify and mitigate these?

C: This is a critical area. An automated system making decisions about product visibility and identity can inadvertently perpetuate or even amplify existing biases if not designed and monitored carefully.

Ethical Considerations & Bias Mitigation:

Linguistic Bias:
- Concern: System might perform better for dominant languages/scripts (e.g., English) than for regional scripts or non-standard transliterations, disadvantaging sellers less proficient in English.
- Mitigation: Diverse training data including various scripts/transliterations; robust NLP preprocessing; use of multilingual models (e.g., IndicBERT); regular bias audits for different language/script groups.
Popularity Bias / Rich-Get-Richer:
- Concern: System might favor merging into popular ASINs, reducing smaller sellers' visibility.
- Mitigation: Fair "canonical" selection rules (not just oldest/highest-selling); mindful feature engineering (avoiding features too correlated with popularity); ensure clear visibility of "other sellers" on canonical ASINs.
Bias from Annotators:
- Concern: Annotators might have biases (brand familiarity, seller names).
- Mitigation: Diverse annotation team with regional expertise; clear, objective guidelines; inter-annotator agreement checks; bias training for annotators.
Accessibility & Technical Barriers for Sellers:
- Concern: System might penalize sellers with fewer resources for perfect listings.
- Mitigation: Seller education and easy-to-use listing tools; design system robust to partially incomplete listings.
Transparency & Explainability (where feasible):
- Concern: Sellers confused by unexplained merges.
- Mitigation: Provide high-level reasons for merges if disputed; maintain a clear and fair dispute process.

Proactive identification of potential biases through audits, diverse team involvement, and a commitment to fairness in training data and model design are key. This is an ongoing effort.

Holistic Ethical Framework: Candidate discusses multiple dimensions of bias and proposes concrete mitigation strategies focusing on data diversity, robust NLP, fair processes, and seller support.

Round 10: Advanced Technical Challenge & Trade-offs

I1 (Principal Engineer): Here's your final technical challenge. You need to handle real-time duplicate detection for new product listings. A seller uploads "హస్తనిర్మిత అవకాయ" (handmade avakaya in Telugu script). Your system needs to detect potential duplicates from the existing 50 million products and respond with high-confidence candidates within, say, a 100-millisecond latency budget for the product upload API to give quick feedback to the seller. Walk me through your real-time architecture and the trade-offs you'd make.

C: A 100ms latency budget for duplicate detection against a 50M product catalog is extremely challenging and requires a highly optimized, multi-tiered approach.

Real-time Duplicate Detection Architecture (Sub-100ms):

A. Offline Pre-computation & Indexing (Foundation):

Product Embeddings Index: Pre-compute and store multi-modal embeddings (from our `ProductEncoderTower`) for all 50M existing products in a low-latency ANN index (FAISS, ScaNN).
Blocking Key / Inverted Index: Create an inverted index (Elasticsearch) based on normalized title tokens, brand, category IDs, phonetic keys.

B. Real-time Inference Pipeline for a New Product:

Ultra-Fast Preprocessing (e.g., < 10-20ms): Language/script detection, rapid Romanization, key token extraction, phonetic key generation, category prediction.
Tiered Candidate Generation (e.g., < 20-30ms):
- Tier 1 (Exact/Phonetic Key Lookup): Query inverted index. Very fast, highly relevant.
- Tier 2 (ANN Search): Generate a quick text embedding for the new product (distilled/quantized model). Query ANN index for top-K nearest neighbors.
(Goal: Small set of high-potential candidates, e.g., 10-50).
Lightweight Pairwise Scoring (e.g., < 30-40ms):
- Use a highly optimized, simpler scoring model (e.g., shallow tree ensemble, distilled Siamese comparator, or weighted sum of key similarities like Jaro-Winkler on titles, image hash distance if available quickly).
- Compare new product's quick features/embedding against pre-computed full embeddings/features of candidates.
Thresholding & Response (e.g., < 10ms): Apply high-confidence threshold. Return top 1-3 potential duplicates. If none, respond "No immediate duplicates found; further checks pending."

C. Asynchronous Full Analysis (Post-Upload):

The new product listing is queued for the more comprehensive batch duplicate detection pipeline using the full Siamese network. Seller notified later if more duplicates found.

Latency vs. Accuracy Trade-off:

For sub-100ms, we explicitly trade some accuracy (recall) for speed in the real-time check. The goal is to catch obvious duplicates quickly. The thorough batch process handles higher recall.

Real-time System Design: Candidate details a sophisticated real-time architecture with pre-computation, tiered candidate generation, lightweight scoring, and an asynchronous full analysis, correctly addressing the latency constraints.

I1 (Principal Engineer): That's a very aggressive 100ms budget. Your embedding generation for the new product's text and image alone might take 50ms or more. How do you make the hard trade-off if you simply cannot meet the 100ms for a high-quality check? What's your fallback or tiered response strategy within that API call?

C: You're right, hitting 100ms for a comprehensive check including fresh embedding generation is incredibly tough.

Tiered Response Strategy within the API Call (Aggressive Latency Budget):

Level 0: Identical Match Check (e.g., < 5-10ms): Compute a robust hash of critical, normalized fields. Check against a pre-computed hash index. Catches only purely identical listings.
Level 1: Blocking Key & Rule-Based Heuristics (e.g., add < 20-30ms): If no Level 0 match, query inverted index with tokens. Apply lightweight rule-based scoring on these few candidates (e.g., title edit distance, price proximity). Avoids embedding generation.
Level 2: Cached/Approximate Embedding Similarity (e.g., add < 30-40ms, if Level 0 & 1 inconclusive):
- Title Embedding Lookup (if title common): Check cache (Redis) for existing title embedding. Use this to query ANN index.
- Highly Distilled/Quantized Text Embedding: If novel title, use extremely fast, less accurate distilled text model for on-the-fly embedding. Query ANN.
- Skip Image for Real-Time API: Omit on-the-fly image similarity. Use fast perceptual hash if anything.
- Score top ANN candidates using their pre-computed full embeddings against the new product's quick/cached embedding.
Timeout & Asynchronous Follow-up: Strict timeout (e.g., 90ms). If high-confidence duplicates found, return them. Else, API responds "No immediate high-confidence duplicates found. Detailed check pending." Full analysis triggered asynchronously.

Trade-off Acceptance: For synchronous API, accept lower recall for low latency, catching blatant ones. Overall high recall via asynchronous check. Manages seller expectations.

Graceful Degradation & Tiered Latency: Candidate provides a very practical, tiered approach to meet aggressive latency, explicitly trading off recall in real-time for speed, and deferring full analysis. This is crucial for system design.

I1 (Principal Engineer): One final thought experiment: If you had to make a very hard trade-off for that real-time detection API (sub-100ms) – would you choose to optimize for 99% precision (very few false positives shown to seller immediately) but only 50% recall (missing half the actual duplicates in that quick check), OR 85% precision (more false positives shown, but still decent) with 70% recall (catching more duplicates quickly)? Which way would you lean and why?

C: That's a classic precision/recall trade-off. For the immediate, sub-100ms synchronous API response shown to the seller during upload, I would lean towards prioritizing higher Precision, even if it means lower Recall. So, I'd prefer something closer to the 99% Precision / 50% Recall option, with a crucial caveat.

Reasoning for Prioritizing Precision in Real-Time Seller Feedback:

Seller Trust & User Experience (during upload): Showing incorrect potential duplicates (low precision) erodes trust. False positives are disruptive. High precision means suggestions are more likely valid and actionable.
Actionability of Real-Time Feedback: The purpose is to prevent immediate creation of blatant duplicates. Catching 50% of true duplicates with high confidence is valuable at this stage.
Asynchronous Safety Net for Recall: The crucial caveat is that missed duplicates (due to prioritizing precision) will be caught by the more comprehensive asynchronous batch processing later. We don't permanently lose the chance.
Cost of False Positives in Real-Time: Incorrectly flagging a unique item as a duplicate can halt listing, require disputes, and make sellers feel the platform is unfair.

So, the strategy is:

Real-time (sub-100ms API): Optimize for very high precision (high confidence threshold, strict rules). Accept lower recall.
Asynchronous (Backend): Optimize for higher recall using the full multi-modal model.

This balances immediate seller UX with long-term catalog quality.

Strategic Trade-off Justification: Candidate makes a clear choice prioritizing precision for real-time seller-facing feedback, justifying it with user experience and trust, while relying on asynchronous processes for recall. This is a mature system design perspective.

Interview Conclusion

I1 (Principal Engineer): Outstanding technical depth throughout this discussion. You've demonstrated a strong understanding of multilingual NLP, the complexities of building scalable ML systems for entity resolution, and crucial aspects like data preparation and ethical considerations.

I2 (Category Manager): I'm particularly impressed by how you consistently handled the cultural nuances of Telugu products and the seller ecosystem considerations. You clearly understand that technical excellence must serve both customer and seller needs.

C: Thank you very much. This was a very stimulating and challenging case, and I appreciate the opportunity to discuss it in such detail.

I1 (Principal Engineer): You've shown exactly the kind of strategic and technical leadership we are looking for. We'd definitely like to move you to the next steps.

What to Learn from This Case

Embrace Nuance & Domain Knowledge: Recognize that "duplicate" is not a simple binary; understand variants, regional specifics, and cultural context.
Multi-Stage System Design: For complex problems, break down the solution into manageable stages (preprocessing, blocking, featurization, classification, clustering).
Detailed Data Preparation is Crucial: Thoroughly explain labeling, feature extraction for all modalities (text, image, structured), train/val/test splits (handling leakage), class imbalance, and data augmentation.
Advanced NLP for Linguistic Challenges: Demonstrate proficiency in handling mixed scripts, transliterations, phonetic variations, and semantic similarity.
Scalability is Paramount: Address O(N²) complexity with intelligent blocking. Design for batch and real-time processing.
Robust Training Data Strategy: Detail how to create labeled data at scale (bootstrapping, active learning, weak supervision, hard negative mining).
Consider Adversarial Behavior: Acknowledge and design for users trying to game the system.
Balance Technical Trade-offs: Justify choices like latency vs. accuracy, precision vs. recall under different constraints.
Plan for Rollout & Risk Management: Articulate a phased rollout strategy with risk mitigation.
Multi-faceted Evaluation: Go beyond standard ML metrics. Include cluster quality and business KPIs via A/B testing.
Iterative Improvement & Human-in-the-Loop: Emphasize continuous learning and incorporating human feedback.
Address Ethical Considerations & Bias: Proactively discuss potential biases and outline mitigation strategies.
Think End-to-End & Cross-Functionally: Cover the lifecycle from problem definition to deployment and operational integration.
Handle Edge Cases and Business Constraints: Address legitimate variants, pricing complexities, and seller relations.
Communicate Complex Ideas Clearly: Explain technical concepts and their business implications effectively.
Strategic Thinking: Make reasoned decisions when faced with conflicting objectives.