Amazon Duplicate Listing Detection
The Challenge: Identifying Semantic Duplicates for Telugu Products
You're working for Amazon's India marketplace. Telugu sellers often list the same product with different names like "Andhra Pickles" vs "Guntur Pickle" vs "Traditional Avakaya". How would you design a system to identify and merge these duplicate listings in a large database (e.g., 50 million products, 100K new daily) considering linguistic nuances, seller behavior, and scalability?
Initial Thoughts & Clarifications
- Definition of "Duplicate": What constitutes a duplicate? Same underlying product from different sellers? Same seller with minor variations? How do we handle legitimate product variants (e.g., spice level, size) vs. true duplicates?
- Data Sources: What product attributes are available? (Title, description, images, brand, seller ID, price, category, customer reviews, item specifics like weight/ingredients).
- Linguistic Challenges: How to handle mixed scripts (Telugu + English), transliteration variations, synonyms, regional terms (Guntur vs. general Andhra), and semantic similarity?
- Scale: 50M existing, 100K new daily – implies need for efficient candidate generation (blocking) and real-time capabilities for new listings.
- Business Goal: Improve customer experience (easier search, less confusion), better catalog quality, fair seller competition, accurate analytics. What are the trade-offs (e.g., precision vs. recall for merging)?
- Seller Behavior: Are some duplicates intentional (to gain more visibility) vs. accidental? Does this affect the approach?
- Evaluation: How will the system's performance be measured? (Accuracy of merges, impact on search, seller feedback).
- Existing Systems: Is there any current rudimentary duplicate detection? What are its limitations?
- Problem Definition & Scope:
- Define what constitutes a "duplicate" vs. a "variant" vs. "distinct but similar."
- Specify goals (e.g., merge true duplicates, link variants).
- Data Understanding & Preprocessing (General):
- Analyze product attributes (text, images, structured data).
- General text normalization: Language detection, script conversion (e.g., Telugu to Roman), transliteration handling, stemming/lemmatization, stop-word removal, phonetic encoding.
- General image preprocessing and attribute normalization.
- Data Preparation for Model Training:
- Detailed labeling strategy: Positive, negative, hard-negative pair generation.
- Feature extraction for model inputs: Specifics for text (tokenization, padding), images (resizing, augmentation), structured data (normalization, encoding).
- Train/validation/test splits: Handling data leakage, temporal splits, class imbalance strategies (sampling, weighting).
- Data augmentation techniques for model robustness.
- Candidate Generation / Blocking:
- Goal: Drastically reduce pairwise comparisons from O(N²) to something manageable.
- Methods: Category-based blocking, LSH (MinHash for text, SimHash for images), attribute-based blocking (brand, key ingredients), phonetic blocking.
- Feature Engineering for Pairwise Comparison (if not end-to-end):
- Textual Similarity: TF-IDF + Cosine, Word/Sentence Embeddings (e.g., multilingual BERT, FastText) + Cosine, Edit Distance (Levenshtein, Jaro-Winkler).
- Image Similarity: CNN-based embeddings (e.g., ResNet, ViT) + Cosine.
- Structured Attribute Similarity: Match scores for brand, weight, key ingredients, price (normalized).
- Seller Information: Same seller vs. different sellers.
- Behavioral Signals (Advanced): Co-view/co-purchase data, review similarity.
- Similarity Scoring & Classification Model:
- Combine features into a similarity score or feed into a classifier (e.g., Logistic Regression, SVM, Random Forest, Gradient Boosting, Siamese Network for end-to-end learning).
- Output: Probability of being a duplicate.
- Clustering / Transitive Closure:
- If A=B and B=C, then A=C. Use graph-based clustering (e.g., Connected Components) on high-confidence pairs to find all duplicates in a set.
- Evaluation:
- Pairwise metrics: Precision, Recall, F1-score for duplicate classification.
- Cluster metrics: Purity, Rand Index, Normalized Mutual Information.
- Business metrics (via A/B testing): Impact on search relevance, conversion, customer satisfaction, seller complaints.
- Scalability & Deployment:
- Batch processing for existing catalog cleanup (e.g., Spark, EMR).
- Real-time/near real-time processing for new listings (e.g., Kafka + Flink/Spark Streaming, API-based checks). Efficient indexing (FAISS for embeddings).
- Model serving infrastructure. Latency vs. accuracy trade-offs.
- Rollout Strategy & Risk Management:
- Phased rollout (e.g., by category, region), dark launch/shadow mode.
- Canary releases, A/B testing for deployment.
- Rollback plans and incident response. Seller communication.
- Human-in-the-Loop & Continuous Improvement:
- Mechanism for manual review of low-confidence predictions or disputed merges.
- Feedback loop to retrain models with corrected labels.
- Monitor for adversarial seller behavior and adapt.
- Ethical Considerations & Bias Mitigation:
- Identify and mitigate linguistic bias, popularity bias, presentation bias.
- Ensure fairness for sellers (especially smaller or regional ones).
- Regular audits for bias and fairness. Transparency in decision-making where possible.
Simulated Conversation
Round 1: Problem Understanding & Linguistic Complexity
The core challenge is that these aren't just exact string matches; they are semantic duplicates or near-duplicates often masked by variations in naming, transliteration, regional specificity, and seller descriptions. We need to understand what data attributes we have for each product listing: titles, descriptions, images, seller information, brand, category, price, customer reviews, and any structured attributes like weight or ingredients.
- True Duplicates: The exact same underlying product (e.g., same brand, same specific type like "Mango Avakaya", same ingredients, same weight) listed with slightly different titles or by different sellers. Example: "Priya Mango Avakaya Pickle 500g" vs. "Priya Foods Avakai Mango Pickle 0.5kg". These should be merged or a canonical version chosen.
- Product Variants: Products that share a base but differ in a key, customer-relevant attribute.
- Examples: "Guntur Spicy Mango Pickle" vs. "Mild Mango Pickle (Andhra Style)". These are both mango pickles but differ in spice level/origin indicated. These might be grouped under a parent product with selectable variants rather than merged into one.
- "Avakaya (with Garlic)" vs. "Avakaya (without Garlic)".
- Distinct but Related Products: Products within the same broad category but clearly different. Example: "Mango Avakaya" vs. "Gongura Pickle" vs. "Tomato Pickle". These are not duplicates. "Guntur Pickle" could be a Guntur Mango Pickle (variant of Avakaya) or a Guntur Red Chili Pickle (distinct product).
My approach would need a hierarchical classification:
- Broad Category Classification: Is this a "Mango Pickle," "Lemon Pickle," "Mixed Vegetable Pickle"? This uses titles, descriptions, and potentially image analysis or ingredient lists.
- Sub-type/Variant Identification: Within "Mango Pickle," identify attributes like "Avakaya," "Guntur style," "spice level," "key ingredients (garlic/no garlic)."
- True Duplicate Detection: For items identified as the same sub-type/variant (e.g., "Guntur Spicy Mango Avakaya 500g"), then apply more stringent checks for true duplication across sellers or minor title variations.
This requires building a product ontology or knowledge graph, at least for key categories like pickles, that understands these relationships and attributes. Input from category managers like yourself would be invaluable for defining these rules and attributes.
High-Level Technical Architecture:
I. Data Ingestion & Preprocessing (Streaming & Batch):
- New listings: Streamed via Kafka or Kinesis.
- Existing catalog: Batch processed from product database/data lake.
- Initial Preprocessing: Basic text cleaning, language detection, image normalization for hashing/initial blocking. Detailed preprocessing for model features happens later.
II. Stage 1: Candidate Generation / Blocking (to reduce N² comparisons):
- Real-time (for new listings):
- Generate blocking keys based on normalized title tokens, primary category, brand (if known), phonetic keys.
- Query an inverted index (e.g., Elasticsearch, Solr) or a specialized blocking index (e.g., LSH Forest) for potential matches from the existing catalog.
- Batch (for catalog cleanup):
- Techniques like Locality Sensitive Hashing (LSH) with MinHash for textual similarity, or perceptual hashing (pHash/aHash) for images.
- Category-based blocking.
- Attribute-based blocking (brand, manufacturer part number).
(Detailed feature extraction and model-specific data preparation for the candidate pairs will be discussed next)
III. Stage 2: Feature Engineering & Pairwise Classification / Similarity Scoring:
- For each candidate pair (product A, product B) from the blocking stage:
- Extract detailed multi-modal features for each product (as we'll detail in data prep).
- A trained classifier (e.g., Siamese Network, XGBoost) takes these features (or embeddings from them) and outputs a probability of being a duplicate/variant.
IV. Stage 3: Clustering & Merging Decision:
- Use graph-based clustering (e.g., Connected Components on pairs above a high confidence threshold) to find transitive relationships and form duplicate sets.
- Apply business rules for merging or linking variants.
Technology Stack Considerations:
- Streaming: Kafka, Flink/Spark Streaming.
- Batch Processing: Spark, AWS EMR/Dataproc.
- Indexing: Elasticsearch for blocking; FAISS or ScaNN for ANN search on embeddings.
- Model Serving: SageMaker, Kubeflow.
- Databases: Product catalog (DynamoDB/Cassandra), feature store, graph DB (Neptune).
Round 2: Deep Technical Dive - NLP & Multilingual Challenges
Handling Mixed Scripts & Transliteration Variations:
- Language & Script Detection:
- For each product title/description, first detect the languages and scripts present (e.g., using libraries like
langdetector a custom classifier). This helps apply appropriate normalization rules.
- For each product title/description, first detect the languages and scripts present (e.g., using libraries like
- Script Conversion / Romanization:
- Convert all Telugu script text to a standardized Romanized form (e.g., using ISO 15919 or a consistent internal transliteration scheme). For example, "అవకాయ" would consistently become "avakāya" or "aavakaaya". This allows comparison with English transliterations provided by sellers.
# Example: Romanization (conceptual) # from indic_transliteration import sanscript # from indic_transliteration.schemes import ITRANS, TELUGU # romanized_text = sanscript.transliterate(telugu_text, TELUGU, ITRANS) # Or other Roman scheme
- Convert all Telugu script text to a standardized Romanized form (e.g., using ISO 15919 or a consistent internal transliteration scheme). For example, "అవకాయ" would consistently become "avakāya" or "aavakaaya". This allows comparison with English transliterations provided by sellers.
- Normalization of Transliterations:
- Sellers might use "Avakaya", "Aavakai", "Avvakaya". We need to normalize these to a canonical form. This can be done through:
- Rule-based normalization: Define common phonetic mappings (e.g., 'aa' -> 'a', 'v' -> 'w' sometimes, double consonants to single).
- Learned normalization: Train a sequence-to-sequence model (like a small transformer) on pairs of variant transliterations and their canonical forms, if such data can be curated.
- Sellers might use "Avakaya", "Aavakai", "Avvakaya". We need to normalize these to a canonical form. This can be done through:
- Phonetic Encoding / Matching:
- Once text is in a common script (e.g., Romanized), apply phonetic algorithms. Standard Soundex is poor for Indian languages.
- For English parts: Double Metaphone or a more modern phonetic algorithm.
- For Romanized Telugu: Develop or use existing Indic phonetic algorithms (e.g., based on mapping to a common phonetic representation like IPA, or algorithms designed for Dravidian/Indic languages). These would understand that 'k' and 'kh' are phonetically related, or that 'అ' (a) and 'ఆ' (ā) are short/long versions of the same vowel sound.
- This generates phonetic keys that can be used for blocking or as features.
# Conceptual Telugu Phonetic Hashing # def get_telugu_phonetic_key(romanized_telugu_word): # # Apply rules for Telugu phonology: # # e.g., handle vowel length, consonant clusters, aspiration # # map similar sounds (like variations of 'cha') to a common key # key = romanized_telugu_word.lower() # key = key.replace('aa', 'a').replace('ee', 'i').replace('oo', 'u') # # ... more complex rules based on Telugu phonetics ... # return key
- Once text is in a common script (e.g., Romanized), apply phonetic algorithms. Standard Soundex is poor for Indian languages.
- Character & N-gram Level Features:
- Use character n-grams (e.g., 3-grams, 4-grams) as features. These are robust to minor misspellings and transliteration variations. TF-IDF can be applied to these character n-grams.
- Multilingual Embeddings:
- Use pre-trained multilingual models like mBERT, XLM-Roberta, or IndicBERT. These models are trained on many languages (including Telugu and English) and can often map semantically similar words/phrases in different scripts or transliterations close together in the embedding space. This is powerful for semantic similarity features.
The pipeline would be: Detect Script -> Convert to Standard Romanized Telugu + Keep English as is -> Normalize common English/Telugu transliterations -> Generate Phonetic Keys -> Create Character N-grams -> Generate Multilingual Embeddings. These normalized/featurized texts are then used for blocking and similarity calculations.
- Leverage Indic NLP Libraries:
- There are specialized libraries and research in Indic NLP that provide tools for more accurate transliteration and phonetic analysis for languages like Telugu. I'd explore libraries like `iNLTK`, `indic-nlp-library`, or `CLTK (Classical Language Toolkit)` which might have modules for Telugu phonetics or syllable structures.
- These libraries often have built-in knowledge of Brahmic script properties, vowel signs (matras), consonant conjuncts (samyuktaksharas), which are crucial for correct phonetic representation.
- Phonetic Transcription to a Standardized System (like IPA):
- The most rigorous approach would be to convert both the Telugu script and its Romanized transliterations into the International Phonetic Alphabet (IPA). This provides a universal standard for representing sounds.
- Once in IPA, we can compute phonetic similarity using weighted edit distances (e.g., Levenshtein distance where substitution costs are based on phonetic feature similarity – e.g., 'p' to 'b' is a smaller cost than 'p' to 's'). Tools exist for grapheme-to-phoneme conversion for many languages, and for Telugu, this might involve rule-based systems or ML models trained on Telugu pronunciation dictionaries.
# Conceptual: Telugu word to IPA # def telugu_word_to_ipa(word_in_telugu_script_or_romanized): # # This would use a sophisticated G2P (Grapheme-to-Phoneme) engine # # trained or configured for Telugu. # # For example, "అవకాయ" -> /aʋakaːja/ # # "Avakai" (common translit) -> /aʋakaj/ (or similar) # ipa_representation = g2p_telugu_engine(word_in_telugu_script_or_romanized) # return ipa_representation # def phonetic_feature_edit_distance(ipa1, ipa2): # # Calculates edit distance based on articulatory features of phonemes # # (e.g., place of articulation, manner, voicing for consonants; height, backness for vowels) # # A 'p' /p/ vs 'b' /b/ is one feature diff (voicing). # # A 'ka' /ka/ vs 'ga' /ga/ is one feature diff. # # A 'kai' /kaj/ vs 'kaya' /kaːja/ involves vowel length and glide diff. # pass
- Custom Rule-Based Phonetic Hashing for Telugu:
- Develop a set of rules specifically for Telugu phonology:
- Normalize vowel lengths (e.g., treat 'a' and 'aa' similarly for initial hashing, but retain difference for fine-grained similarity).
- Group similar consonants (e.g., voiced/unvoiced pairs like k/g, c/j, t/d, p/b; aspirates with non-aspirates like kh/k).
- Handle common sandhi (euphonic changes) if they affect spelling.
- Standardize representation of consonant conjuncts.
- Develop a set of rules specifically for Telugu phonology:
- Train a Phonetic Similarity Model:
- If we can curate a dataset of (Telugu word variant 1, Telugu word variant 2, phonetic_similarity_score), we could train a character-level Siamese neural network or a similar model to learn a mapping from spelling variations to a shared phonetic embedding space. The similarity score could be derived from human judgments or IPA-based distances.
For "Avakaya" vs. "Avakai": IPA might be /aʋakaːja/ vs. /aʋakaj/. The difference is in the final vowel length (aː vs a) and the presence/absence of the final /a/ glide. A good phonetic distance metric would score these as very similar, much more so than Soundex. My custom rules would also aim to normalize such common suffix variations typical in transliterations.
Catching Adversarial Duplicates:
- Stronger Reliance on Non-Textual & Hard-to-Manipulate Features:
- Image Similarity: This is crucial. Sellers might change text, but often use the same or very similar product images. CNN-based image embeddings (e.g., from ResNet, ViT, or specialized product image models like Facebook's GrokNet if an equivalent could be trained) compared using cosine similarity would be a strong signal. We'd need to be robust to minor image edits (cropping, brightness changes).
- Structured Attributes (if available and reliable): If sellers provide EAN/UPC codes, manufacturer part numbers, precise weight (e.g., 485g vs. 500g could be a slight variant, but 500g vs. 500g is a strong match), or detailed, standardized ingredient lists. These are harder to falsify consistently.
- Price (Normalized): If two listings have very similar descriptions and images, and their price per unit (e.g., price per 100g) is nearly identical, it's a strong indicator, even if titles are tweaked.
- Semantic Text Similarity (Beyond Keywords):
- Advanced embeddings from models like mBERT or sentence transformers can capture that "Homemade Avakaya Pickle by Grandma Lakshmi" and "Traditional Avakaya Pickle - Grandma Lakshmi's Authentic Recipe" are semantically extremely close, even if keyword TF-IDF might differ more. The presence of key entities like "Avakaya Pickle" and "Grandma Lakshmi" would be captured.
- Seller Behavior Analysis:
- Same Seller, Multiple Similar Listings: If the same seller lists multiple products with highly similar images, attributes, and semantic descriptions within a short timeframe, flag for review.
- Rapid Listing Variations: Monitor sellers who frequently de-list and re-list items with minor changes.
- Graph-Based Detection:
- Construct a product graph where nodes are products and edges represent high similarity along one or more dimensions (text, image, seller).
- Look for dense clusters of products from the same seller that are all highly similar to each other but just under individual pairwise duplicate thresholds. These "near-duplicate clouds" can indicate deliberate obfuscation.
- Use community detection algorithms on this graph.
- Customer Behavior Signals (More complex to integrate but powerful):
- If customers frequently view Product A then immediately view and purchase Product B (and they are very similar), it might suggest B is the canonical or preferred version, or they are perceived as interchangeable.
- High rate of "Report incorrect information" or "Is this the same as X?" from customers on product pages.
- Human Review for Adversarial Patterns:
- Use machine learning to flag suspicious patterns of listing behavior (e.g., many similar items from one seller with minor name tweaks). These flagged sellers/listings can be prioritized for human review to identify new adversarial tactics, which can then be used to update the ML models or rules.
The system needs to learn to down-weight easily gamed features (like slight title fluff) and up-weight more stable signals like image content and core semantic meaning when adversarial behavior is suspected.
Round 3: Data Preparation for Model Training
(product A, product B, label) pairs or triplets, what features are extracted for each product before they enter the Siamese towers, and how do you handle common data challenges like imbalance and creating robust train/val/test splits?Comprehensive Data Preparation Pipeline for Model Training:
1. Labeling Strategy & Pair/Triplet Generation:
We need to generate (Product A, Product B, Label) pairs, or (Anchor, Positive, Negative) triplets if using triplet loss.
- Positive Pairs (Label=1 - Duplicates):
- High-Confidence Rules: Generated from exact matches on normalized titles + same brand + very similar image embeddings (e.g., cosine sim > 0.98) + price within +/- 5%. Also, products with identical UPC/EAN codes but different ASINs.
- Manually Merged Data: Leverage any existing data where catalog teams have manually identified and merged duplicates. This is high-quality signal.
- Self-Supervision (Augmentation): For a given product, create an augmented version (e.g., slight text paraphrasing, minor image changes) and consider it a positive pair with the original. This helps learn robustness.
- Seller-Confirmed Duplicates: If sellers flag their own listings as duplicates or confirm system suggestions.
- Negative Pairs (Label=0 - Non-Duplicates):
- Easy Negatives: Random pairs of products from vastly different fine-grained categories (e.g., "Avakaya Pickle" vs. "Mobile Phone"). These help the model learn gross differences quickly.
- Hard Negatives: These are crucial for learning fine distinctions.
- Products within the same fine-grained category but with clearly distinct key attributes identified through NLP (e.g., "Mango Pickle" vs. "Lemon Pickle").
- Pairs that are "close" in the embedding space from a previous model iteration but are known non-duplicates (e.g., different variants like "Spicy Guntur Mango Pickle" vs. "Mild Andhra Mango Pickle").
- Products that were candidate pairs from the blocking stage but were confirmed as non-duplicates by human annotators.
- Triplet Generation (for Triplet Loss):
- For an "Anchor" product, find a "Positive" (true duplicate) and a "Negative" (non-duplicate, ideally a hard negative). The goal is to train the model such that `distance(Anchor, Positive) + margin < distance(Anchor, Negative)`.
- Active Learning & Weak Supervision: As discussed, these will be ongoing sources for new labeled pairs, especially for ambiguous cases.
2. Feature Extraction for Individual Product Inputs (Per Tower):
For each product (Product A or Product B) that goes into one of the Siamese network's towers, we extract a multi-modal feature set:
- Textual Features (for Text Encoder):
- Input: Concatenated normalized title and key phrases from the description.
- Cleaning & Normalization: Lowercasing, removal of special characters/HTML. Script conversion (Telugu to Romanized standard) and transliteration normalization (as discussed in NLP round).
- Tokenization: Using a multilingual tokenizer compatible with the chosen transformer model (e.g., SentencePiece, WordPiece for mBERT).
- Padding/Truncation: Pad sequences to a fixed maximum length (e.g., 128 or 256 tokens) and truncate longer sequences. Generate attention masks.
- Image Features (for Image Encoder):
- Input: Primary product image.
- Normalization:
- Resize to a fixed input size required by the CNN (e.g., 224x224 or 299x299).
- Pixel value scaling (e.g., to [0, 1] range).
- Normalization using mean and standard deviation statistics from the dataset the pre-trained CNN was trained on (e.g., ImageNet stats).
- Training-Time Augmentation: To make the model robust to variations:
- Random horizontal flips.
- Random rotations (small angles).
- Color jitter (brightness, contrast, saturation).
- Random minor crops and resizing.
- Structured Attribute Features (for Attribute Encoder):
- Numeric Attributes (e.g., price, weight, pack_count):
- Handling Missing Values: Impute with mean/median/mode for that category, or use a special "missing" flag/embedding.
- Normalization: Z-score normalization or Min-Max scaling to bring them to a similar range.
- Categorical Attributes (e.g., brand, fine-grained_category_id, seller_tier):
- Handling Missing Values: Assign a special "unknown" category.
- Encoding: If the attribute encoder is a simple MLP, use one-hot encoding for low-cardinality attributes. For high-cardinality attributes (like brand_id), learn an embedding vector for each category using an `nn.Embedding` layer.
- Boolean/Binary Attributes (e.g., is_handmade, contains_garlic): Encode as 0/1.
- Numeric Attributes (e.g., price, weight, pack_count):
The output of each encoder (text, image, attribute) for a product will be an embedding, which are then fused to create a single multi-modal embedding for that product within its tower.
3. Train/Validation/Test Splits:
- Challenge: Data Leakage. Randomly splitting pairs can lead to leakage if different listings of the same underlying true product entity appear in both train and test sets (e.g., ProductA-ProductB in train, ProductA-ProductC in test, where A,B,C are all duplicates of each other). This would inflate validation/test scores.
- Strategy: Group-Based Splitting.
- First, identify "true entity clusters" using high-confidence rules or existing merged ASIN groups.
- Ensure that all listings belonging to one true entity cluster are entirely within one split (train, validation, or test). This prevents leakage.
- Split these entity clusters randomly into train/val/test (e.g., 70%/15%/15%).
- Temporal Split (for validation/test): Optionally, use older data for training and more recent data for validation/testing to better simulate how the model will perform on new, unseen listings over time.
4. Handling Class Imbalance:
The number of true non-duplicate pairs will vastly outnumber true duplicate pairs.
- Sampling Strategies for Pair Generation:
- During training batch creation, we can control the ratio of positive to negative pairs. E.g., for every positive pair, sample K negative pairs (where K might be 1-5, instead of the true much larger ratio).
- Prioritize sampling hard negatives over easy negatives to make training more effective.
- Weighted Loss Function: Assign a higher weight to the minority class (duplicates) in the binary cross-entropy loss function. The weight can be inversely proportional to class frequency.
- Evaluation Metrics: Focus on metrics robust to imbalance, like Precision, Recall, F1-score (especially for the positive class), and Area Under the Precision-Recall Curve (AUC-PR). Accuracy is misleading here.
This detailed data preparation pipeline ensures that the model receives clean, well-structured, and representative data, which is crucial for learning effective similarity representations.
Round 4: Machine Learning Model Architecture
Siamese Network Architecture:
The core idea is to have two identical "towers" (sub-networks) that process each product in a pair independently using the same set of weights. The outputs of these towers (product embeddings) are then compared.
- Input Product Representation (Per Tower):
- As detailed in our data preparation discussion, each product (Product A, Product B) will have its textual data (tokenized sequences), image data (normalized pixel arrays), and structured attributes (normalized/encoded).
- Encoders within Each Tower (Shared Weights):
- Text Encoder:
- Architecture: A pre-trained multilingual transformer model (e.g., mBERT, XLM-Roberta, or a distilled version like DistilmBERT for speed). We'd typically use the output embedding of the `[CLS]` token or an average of token embeddings from the last hidden layer.
- Fine-tuning: The weights of this transformer can be fine-tuned during the Siamese network training to adapt it to our specific product domain and similarity task.
- Output: A fixed-size text embedding vector (e.g., 768 dimensions for BERT-base).
- Image Encoder:
- Architecture: A pre-trained Convolutional Neural Network (CNN) like ResNet50, EfficientNet, or Vision Transformer (ViT), with its final classification layer removed.
- Fine-tuning: Similar to the text encoder, the CNN weights can be fine-tuned.
- Output: A fixed-size image embedding vector (e.g., 2048 dimensions for ResNet50 before the FC layer).
- Structured Attribute Encoder:
- Architecture: This could be a simpler Multi-Layer Perceptron (MLP).
- Numeric features can be directly fed into the MLP.
- Categorical features would first pass through `nn.Embedding` layers to get dense vector representations, which are then concatenated with numeric features and fed to the MLP.
- Output: A fixed-size attribute embedding vector.
- Architecture: This could be a simpler Multi-Layer Perceptron (MLP).
- Text Encoder:
- Fusion Layer (Per Tower):
- The individual embeddings (text, image, attributes) for a single product need to be combined into a unified multi-modal product embedding.
- Method: Concatenate the text, image, and attribute embeddings. This combined vector is then passed through one or more dense (fully connected) layers with activation functions (e.g., ReLU) to allow the model to learn interactions between modalities and project to a final shared embedding space dimension (e.g., 256 or 512 dimensions).
- This final vector, let's call it `E_product`, is the output of one tower for one product.
- Similarity Computation / Interaction Layer (Comparing Tower Outputs):
- Once we have the final embeddings `E_A` for Product A and `E_B` for Product B from the two towers:
- Option 1 (Distance-based, for Contrastive/Triplet Loss):
- Calculate a distance metric, e.g., Euclidean distance or Cosine distance: `D(E_A, E_B)`. Cosine similarity is often preferred for high-dimensional embeddings.
- Option 2 (Interaction-based, for Binary Classification):
- Combine `E_A` and `E_B` in a more complex way to capture their interaction. A common approach is to concatenate `E_A`, `E_B`, their element-wise difference `|E_A - E_B|`, and their element-wise product `E_A * E_B`.
- This combined interaction vector is then passed through a few dense layers terminating in a single sigmoid output neuron. This neuron predicts the probability that Product A and Product B are duplicates (a score between 0 and 1).
- Loss Function:
- Contrastive Loss (if using distance-based comparison): Aims to pull positive pairs (duplicates) closer in the embedding space and push negative pairs (non-duplicates) further apart than a defined margin.
(where `y=1` for positive pairs, `y=0` for negative pairs).Loss = y * D(E_A, E_B)^2 + (1-y) * max(0, margin - D(E_A, E_B))^2 - Triplet Loss (if using distance-based comparison and (Anchor, Positive, Negative) triplets): Aims to ensure the anchor is closer to the positive than to the negative by at least a margin.
Loss = max(0, D(E_Anchor, E_Positive)^2 - D(E_Anchor, E_Negative)^2 + margin) - Binary Cross-Entropy Loss (if using interaction-based binary classification): This is standard for binary classification tasks. Used with the sigmoid output. We can incorporate class weights here to handle imbalance.
- Contrastive Loss (if using distance-based comparison): Aims to pull positive pairs (duplicates) closer in the embedding space and push negative pairs (non-duplicates) further apart than a defined margin.
The choice between distance-based loss (Contrastive/Triplet) and interaction-based classification depends on whether the primary goal is to learn a versatile embedding space for various downstream tasks (favoring distance-based) or to optimize purely for the pairwise duplicate classification task (favoring interaction-based, which often performs slightly better for the direct task).
# Conceptual Siamese architecture using PyTorch-like pseudocode (interaction-based)
# class ProductEncoderTower(nn.Module):
# def __init__(self):
# super().__init__()
# self.text_encoder = MultilingualBERTEncoder() # Outputs text_embedding
# self.image_encoder = ResNetEncoder() # Outputs image_embedding
# self.attr_encoder = AttributesEncoder() # Outputs attr_embedding
# self.fusion_mlp = nn.Sequential(nn.Linear(text_dim + img_dim + attr_dim, hidden_dim1),
# nn.ReLU(),
# nn.Linear(hidden_dim1, final_embedding_dim)) # E.g., final_embedding_dim = 256
# def forward(self, product_text_input, product_image_input, product_attr_input):
# txt_emb = self.text_encoder(product_text_input)
# img_emb = self.image_encoder(product_image_input)
# atr_emb = self.attr_encoder(product_attr_input)
# combined_initial_emb = torch.cat((txt_emb, img_emb, atr_emb), dim=1)
# final_product_embedding = self.fusion_mlp(combined_initial_emb)
# return final_product_embedding
# class SiameseDuplicateClassifier(nn.Module):
# def __init__(self):
# super().__init__()
# self.product_tower = ProductEncoderTower() # Shared tower
# # Interaction layers to compare two product embeddings
# # Input to comparison_mlp is 4 * final_embedding_dim if concatenating E_A, E_B, |E_A-E_B|, E_A*E_B
# self.comparison_mlp = nn.Sequential(nn.Linear(4 * final_embedding_dim, hidden_comp_dim),
# nn.ReLU(),
# nn.Linear(hidden_comp_dim, 1),
# nn.Sigmoid())
# def forward(self, product1_data, product2_data):
# # product1_data and product2_data are tuples/dicts of (text_input, image_input, attr_input)
# emb1 = self.product_tower(*product1_data)
# emb2 = self.product_tower(*product2_data)
# # Interaction features
# diff = torch.abs(emb1 - emb2)
# prod_ew = emb1 * emb2
# interaction_vec = torch.cat((emb1, emb2, diff, prod_ew), dim=1)
# return self.comparison_mlp(interaction_vec)
This architecture allows for end-to-end training and leverages powerful pre-trained models for different modalities, adapting them to the specific nuances of product similarity.
Round 5: Scalability & Real-time Implementation
Multi-Level Blocking Strategy for Scalability:
The goal is to progressively reduce the number of pairs that need detailed, expensive feature computation and model scoring.
- Level 1: Coarse-Grained Attribute Blocking (Initial Broad Cut):
- Category Blocking: This is the most impactful first step. Only compare products within the same fine-grained leaf category (e.g., "Mango Pickles," not just "Pickles" or "Groceries"). This relies on a good product categorization system. If a product is in "Pickles," we compare it to other "Pickles." This can reduce comparisons by orders of magnitude (e.g., 100x-1000x).
- Brand Blocking (where applicable): If brand information is reliable, only compare products of the same brand or where one brand is missing (to catch unbranded versions of branded products). Be cautious, as sellers misspell brands or list them generically.
- Level 2: Content-Based Approximate Similarity Blocking (Hashing):
- LSH (Locality Sensitive Hashing) with MinHash for Text:
- Convert product titles (and optionally key description phrases) into sets of shingles (e.g., character 3-grams or 5-grams after normalization).
- Apply MinHash to these sets to create compact signatures.
- Use LSH techniques (e.g., banding) to hash these signatures into buckets. Products falling into the same LSH bucket are considered candidates. This is very efficient for finding textually similar items.
- Perceptual Hashing for Images (e.g., pHash, aHash, dHash):
- Generate compact hashes for product images. Products with very similar image hashes (low Hamming distance) are strong candidates. This can catch duplicates even if text differs significantly.
- Phonetic Blocking Keys: Use the Indic phonetic algorithms discussed earlier to generate phonetic keys for key product name tokens. Products sharing phonetic keys are candidates.
- LSH (Locality Sensitive Hashing) with MinHash for Text:
- Level 3: Embedding-Based Approximate Nearest Neighbor (ANN) Search (Semantic Blocking):
- For all products, pre-compute the multi-modal embeddings (text, image, attributes) using our trained `ProductEncoderTower`.
- Store these embeddings in an ANN index (e.g., FAISS, ScaNN, HNSW).
- For a new product (or each product in batch mode), query the ANN index to find the top-K most similar products in the embedding space. These become candidates. This captures semantic similarity even if keywords or exact phrasing differs.
Combining Blocking Strategies:
A product pair becomes a candidate for detailed comparison if it passes through any of these relevant blocking stages, or a combination (e.g., same category AND similar LSH hash). The union of candidate pairs from these methods forms the set for detailed pairwise classification.
Real-time vs. Batch Adaptation:
- Real-time (New Listings): The new product is processed. Its blocking keys and a quickly generated embedding are created. Query pre-built indexes (Elasticsearch for keyword/attribute blocks, LSH Forest, FAISS index for embeddings) of the existing catalog. This should return a small set of candidates (e.g., top 50-100) very quickly.
- Batch (Catalog Cleanup): Can afford more comprehensive blocking. For example, an all-pairs LSH comparison within categories, or building a graph where edges are created by blocking rule matches, then running clustering. Spark is well-suited for this. All existing product embeddings are already pre-computed and stored.
This multi-level blocking ensures that we dramatically reduce the number of pairs from N² to something closer to O(N * k_avg_candidates), which is computationally feasible.
Handling Novel/Sparse Categories:
- Adaptive & Hierarchical Category Blocking:
- Our product categorization system needs to be hierarchical. If a product is in a leaf category with very few items (e.g., < N products, where N is a threshold like 50), the blocking logic should automatically "fall back" to comparing it against products in its parent category or even sibling categories that are semantically close.
- Example: If "Organic Single-Origin Guntur Mango Avakaya" is a new sub-category, it should still be compared against products in "Guntur Mango Avakaya" and "Organic Mango Pickles."
- Semantic Category Matching / Expansion:
- Use embeddings for category names/descriptions themselves. If a new product's assigned category "Millets-Avakaya" is novel, find existing categories that are semantically closest (e.g., "Avakaya," "Mixed Grain Pickles," "Health Pickles") and include products from those in the candidate generation pool.
- This requires a system to map new, seller-provided category strings to our internal taxonomy or find closest matches.
- Increased Reliance on Content-Based Blocking (LSH, Embeddings) for Novel Categories:
- For products in sparse/new categories, the LSH (on text) and ANN search (on multi-modal embeddings) stages become even more critical. These methods don't rely on pre-defined category blocks and can find similar items across the entire catalog (or a very broad super-category) if configured to do so, albeit with higher computational cost if the search space isn't narrowed.
- We might have a two-pass blocking: first, strict category blocking; if few candidates are found and the category is sparse, then a second, broader search using embeddings or relaxed LSH across related super-categories.
- Dynamic Category Creation & Clustering for Orphaned Products:
- Periodically, run clustering algorithms (e.g., on product title/description embeddings) on products that are in very small categories or haven't matched any duplicates for a long time ("orphans").
- These clusters can help identify emerging product types or de facto new categories that sellers are creating. Human review can then formalize these into the taxonomy. This helps the category blocking improve over time.
- Zero-Shot or Few-Shot Category Assignment for New Listings:
- For incoming new listings, if the seller provides a category string that doesn't map to our existing taxonomy, use a text classification model (trained on existing product titles/descriptions and their categories) to predict the most likely existing fine-grained category. This helps place it correctly for blocking. If confidence is low, it can be flagged for "broader blocking search."
The system needs to be adaptive, falling back to broader similarity searches when fine-grained category information is insufficient or unreliable, and continuously learning and updating its category understanding.
Round 6: System Rollout Strategy & Risk Management
Phased Rollout Strategy & Risk Management:
- Phase 0: Internal Validation & Shadow Mode:
- Shadow Mode Deployment: Deploy the system to run in parallel with existing processes (if any) or on production data, but without taking any automated actions (no actual merges). It only logs its intended decisions (e.g., "would merge A and B," "would flag C as potential duplicate of D").
- Internal Audits: A dedicated team (including category managers and data scientists) thoroughly audits the shadow mode outputs for a diverse set of categories, focusing on:
- Accuracy of proposed merges (especially for high-impact products).
- Potential for incorrect merges of variants.
- Impact on known high-value ASINs.
- Refine Thresholds & Rules: Based on audit findings, refine the model's confidence thresholds for automated actions, business rules for variant handling, and blocking logic. Iterate until internal audit results meet a high-quality bar.
- Phase 1: Limited Rollout to Low-Risk Categories/Sellers (Canary Release):
- Select Pilot Categories: Start with a few well-understood, lower-risk product categories where duplicate definitions are clearer and the impact of an error is less catastrophic. Avoid highly nuanced categories like handmade goods initially.
- Small Seller Segment: Potentially, also pilot with a small, cooperative group of sellers who have opted-in or are known for good catalog practices.
- Enable Automated Actions (High-Confidence Only): For this pilot group, enable automated merging only for pairs where the system has extremely high confidence (e.g., >99.5% probability) and multiple signals agree (text, image, attributes, brand). All other potential duplicates are flagged for human review.
- Intensive Monitoring: Closely monitor actual merge accuracy, seller feedback/complaints from the pilot group, impact on search and sales for the affected ASINs, and system performance metrics (latency, errors).
- Phase 2: Gradual Expansion (A/B Testing within Categories):
- Category-by-Category Expansion: Based on successful pilot, gradually expand to more categories, prioritizing those with high duplicate rates but manageable complexity.
- A/B Testing of Merging Logic: Within a category, we can run A/B tests comparing customer experience metrics (search CTR, conversion) for users seeing the deduplicated catalog versus the original.
- Iterative Threshold Adjustment: Continuously adjust merge confidence thresholds based on performance in each category.
- Phase 3: Full Rollout with Continuous Monitoring & Refinement:
- Once confident in system stability and accuracy across a broad range of categories, move to a wider rollout.
- Even at full rollout, maintain the ability to quickly disable automated merging for specific categories or globally if systemic issues arise.
Risk Management & Mitigation:
- Robust Rollback Plan: Have a well-tested mechanism to revert merges if widespread errors are detected. This includes un-merging ASINs and restoring original listing data.
- Human Review Queue: Maintain a robust human review system for medium-confidence predictions, disputed merges, and anomalous merges.
- Clear Communication with Sellers: Proactively communicate upcoming changes to sellers. Provide clear guidelines on listing products, variants, and disputing merges.
- Impact Analysis Dashboard: Develop dashboards to track the impact of merges on key metrics in real-time.
- Incident Response Plan: Have a clear plan for responding to bad merges, including identification, notification, and rollback.
The key is to be cautious, data-driven, and iterative, prioritizing the minimization of negative impact on customers and sellers throughout the rollout process.
Round 7: Business Logic & Edge Cases
Attribute-Aware Duplicate/Variant Handling:
- Structured Attribute Extraction & Taxonomy:
- For relevant categories like "Pickles," we need to define a taxonomy of key differentiating attributes. For Avakaya, this would include "Spice Level" (e.g., Mild, Medium, Hot, Extra Hot), "Key Ingredient Variation" (e.g., With Garlic, Without Garlic, With Fenugreek), "Regional Style" (e.g., Guntur, Godavari, Coastal Andhra).
- Use NLP techniques (NER, rule-based pattern matching, or a trained attribute extraction model) to identify these attribute values from product titles and descriptions. E.g., extract "Mild" from "Mild Avakaya".
- Variant vs. Duplicate Classification Logic:
- After the initial similarity model identifies a pair of products as highly similar (e.g., "Mild Avakaya 500g" and "Medium Spicy Avakaya 500g" from the same brand), a secondary classification step or rule engine kicks in:
# Conceptual Classification Logic # def classify_product_relationship(product1_data, product2_data, base_similarity_score): # if base_similarity_score < DUPLICATE_THRESHOLD_LOW: # return "DIFFERENT_PRODUCTS" # attrs1 = extract_key_attributes(product1_data.title, product1_data.description) # attrs2 = extract_key_attributes(product2_data.title, product2_data.description) # if attrs1['base_pickle'] != attrs2['base_pickle'] or product1_data.brand != product2_data.brand: # return "NEEDS_REVIEW_DIFFERENT_BASE" # differing_variant_attributes = [] # for attr_name in VALID_VARIANT_DIMENSIONS_FOR_CATEGORY['avakaya_pickle']: # val1 = attrs1.get(attr_name) # val2 = attrs2.get(attr_name) # if val1 != val2 and (val1 is not None or val2 is not None): # Ensure actual difference # differing_variant_attributes.append(attr_name) # if len(differing_variant_attributes) > 0 and base_similarity_score > VARIANT_SIMILARITY_THRESHOLD_HIGH: # return "PRODUCT_VARIANTS" # elif len(differing_variant_attributes) == 0 and base_similarity_score > DUPLICATE_THRESHOLD_HIGH: # return "TRUE_DUPLICATES" # else: # return "UNCERTAIN_NEEDS_REVIEW"
- After the initial similarity model identifies a pair of products as highly similar (e.g., "Mild Avakaya 500g" and "Medium Spicy Avakaya 500g" from the same brand), a secondary classification step or rule engine kicks in:
- Business Rules Integration:
- Maintain a "variant dimension whitelist" for each product category.
- If two products are highly similar but differ only along these whitelisted variant dimensions, they are classified as "Variants" and should be grouped under a parent ASIN, not merged into one.
- If they are highly similar and have no differences in variant dimensions, they are "True Duplicates."
- Learning from Customer Search Behavior:
- Analyze search queries. If customers frequently search for "mild avakaya" or "avakaya no garlic," it validates these as important variant dimensions that need to be preserved.
- Seller Interface for Variants:
- Provide sellers with clear tools to list product variations correctly under a single parent product, reducing the creation of separate pseudo-duplicate listings for variants.
This attribute-aware layer ensures that legitimate product diversity desired by customers is maintained, while still catching true duplicates.
Package Size & Pricing Strategy Handling:
For products that are identical except for package size, the decision to treat them as simple size variants or distinct offerings depends heavily on the price per unit consistency.
- Extract Normalized Size & Price: Standardize sizes (e.g., to grams) and extract prices.
- Calculate Price Per Unit (PPU): E.g., price per 100g.
- Analyze PPU Consistency / Relationship:
- Scenario A: Proportional or Expected Non-Linearity: If PPUs show a consistent pattern (bulk discount, small pack premium), these are likely legitimate size variants.
Action: Group as size variants.# Conceptual PPU analysis # def check_ppu_consistency(ppu_values_with_sizes): # # Sort by size # # Check for monotonic decrease/increase in PPU or low coefficient of variation # # Thresholds tuned per category # pass - Scenario B: Highly Inconsistent or Irrational PPU: If PPU is erratic (e.g., 500g cheaper per unit than 1kg), this could indicate data quality issues, different offerings (e.g., gift pack), or pricing errors. Action: Flag for review. Be cautious about automatically merging.
- Scenario A: Proportional or Expected Non-Linearity: If PPUs show a consistent pattern (bulk discount, small pack premium), these are likely legitimate size variants.
- Integrate with Business Rules & Category Norms: PPU consistency checks can be tuned based on category norms.
So, if products are identical in all but size, and their Price Per Unit follows a reasonable pattern, they are treated as size variants. If PPU is erratic, they are flagged.
Round 8: Model Evaluation
Multi-Layered Evaluation Strategy:
1. Pairwise Classification Model Metrics (Offline):
- On a labeled test set of product pairs:
- Precision: `TP / (TP + FP)`. Critical to minimize incorrect merges.
- Recall: `TP / (TP + FN)`. Important to minimize missed duplicates.
- F1-Score / F-beta Score: To balance Precision and Recall, possibly weighing Precision higher.
- AUC-PR (Area Under Precision-Recall Curve): More informative for imbalanced datasets.
2. Cluster-Level Quality Metrics (Offline):
- After pairwise classification and graph clustering to form sets of duplicates:
- Purity, Normalized Mutual Information (NMI), Adjusted Rand Index (ARI): Compare formed clusters against ground truth clusters.
- Manual audit of a sample of generated duplicate clusters by domain experts.
3. Business Impact Metrics (Online A/B Testing):
This is crucial. A/B test the impact of showing a deduplicated catalog versus the original.
- Control Group: Users see catalog/search as is.
- Treatment Group: Users see catalog/search where identified duplicates are merged/canonicalized.
Key metrics to compare:
- Search Experience: Search CTR, Search-to-Detail Page CTR, Search Abandonment Rate.
- Conversion & Sales: Overall Conversion Rate, AOV, GMV.
- Customer Satisfaction & Catalog Quality: Customer complaints related to product confusion, return rates, CSAT scores, "Report incorrect product information" flags.
- Seller Impact: Sales distribution among sellers, seller complaints about incorrect merges (critical guardrail).
4. Operational Metrics for the System Itself (To monitor, not as primary evaluation):
- Latency of real-time checks, throughput of batch processing, computational cost, rate of human review needed.
Offline metrics guide model development, but online A/B testing validates true business impact. The ideal outcome is improved search and conversion metrics, and better CSAT, without unduly harming sellers.
Debugging Cultural/Linguistic Gaps:
- Prioritize and Systematically Analyze Seller Complaints:
- Translation & Categorization: Accurately translate and categorize complaints by reason (e.g., "merged different spice levels," "merged different regional preparations").
- Root Cause Analysis: For each valid incorrect merge, analyze why the model failed. Which features contributed? (Use SHAP/LIME). Was it text normalization, image similarity, missing attributes, or ontology gaps?
- Augment Training & Evaluation Data with Cultural Expertise:
- Hire/Consult Native Telugu Speakers with Food/Cultural Knowledge: Involve them in reviewing disputed merges, creating new high-quality labeled data (especially hard negatives/positives for nuanced cases), and refining the product ontology for Telugu regional products.
- Update Annotation Guidelines: Make guidelines more specific about Telugu product variations.
- Refine Features & Model Based on Error Analysis:
- Feature Engineering: Improve transliteration, NER for distinguishing terms, etc.
- Model Retraining: Retrain with augmented/corrected data.
- Adjust Confidence Thresholds: For nuanced categories, use higher thresholds for automated merging and send more to human review by cultural experts.
- Implement a "Seller Dispute" Mechanism with Feedback Loop: Allow sellers to easily flag incorrect merges. These become high-priority inputs for review and retraining.
- A/B Test Changes on Seller Segments: Test model updates aimed at fixing cultural gaps on Telugu seller segments first.
The key is a tight feedback loop between seller complaints, cultural expert review, data augmentation, and model retraining, focusing on linguistic and domain-specific error patterns.
Round 9: Ethical Considerations & Bias Mitigation
Ethical Considerations & Bias Mitigation:
- Linguistic Bias:
- Concern: System might perform better for dominant languages/scripts (e.g., English) than for regional scripts or non-standard transliterations, disadvantaging sellers less proficient in English.
- Mitigation: Diverse training data including various scripts/transliterations; robust NLP preprocessing; use of multilingual models (e.g., IndicBERT); regular bias audits for different language/script groups.
- Popularity Bias / Rich-Get-Richer:
- Concern: System might favor merging into popular ASINs, reducing smaller sellers' visibility.
- Mitigation: Fair "canonical" selection rules (not just oldest/highest-selling); mindful feature engineering (avoiding features too correlated with popularity); ensure clear visibility of "other sellers" on canonical ASINs.
- Bias from Annotators:
- Concern: Annotators might have biases (brand familiarity, seller names).
- Mitigation: Diverse annotation team with regional expertise; clear, objective guidelines; inter-annotator agreement checks; bias training for annotators.
- Accessibility & Technical Barriers for Sellers:
- Concern: System might penalize sellers with fewer resources for perfect listings.
- Mitigation: Seller education and easy-to-use listing tools; design system robust to partially incomplete listings.
- Transparency & Explainability (where feasible):
- Concern: Sellers confused by unexplained merges.
- Mitigation: Provide high-level reasons for merges if disputed; maintain a clear and fair dispute process.
Proactive identification of potential biases through audits, diverse team involvement, and a commitment to fairness in training data and model design are key. This is an ongoing effort.
Round 10: Advanced Technical Challenge & Trade-offs
Real-time Duplicate Detection Architecture (Sub-100ms):
A. Offline Pre-computation & Indexing (Foundation):
- Product Embeddings Index: Pre-compute and store multi-modal embeddings (from our `ProductEncoderTower`) for all 50M existing products in a low-latency ANN index (FAISS, ScaNN).
- Blocking Key / Inverted Index: Create an inverted index (Elasticsearch) based on normalized title tokens, brand, category IDs, phonetic keys.
B. Real-time Inference Pipeline for a New Product:
- Ultra-Fast Preprocessing (e.g., < 10-20ms): Language/script detection, rapid Romanization, key token extraction, phonetic key generation, category prediction.
- Tiered Candidate Generation (e.g., < 20-30ms):
- Tier 1 (Exact/Phonetic Key Lookup): Query inverted index. Very fast, highly relevant.
- Tier 2 (ANN Search): Generate a quick text embedding for the new product (distilled/quantized model). Query ANN index for top-K nearest neighbors.
- Lightweight Pairwise Scoring (e.g., < 30-40ms):
- Use a highly optimized, simpler scoring model (e.g., shallow tree ensemble, distilled Siamese comparator, or weighted sum of key similarities like Jaro-Winkler on titles, image hash distance if available quickly).
- Compare new product's quick features/embedding against pre-computed full embeddings/features of candidates.
- Thresholding & Response (e.g., < 10ms): Apply high-confidence threshold. Return top 1-3 potential duplicates. If none, respond "No immediate duplicates found; further checks pending."
C. Asynchronous Full Analysis (Post-Upload):
- The new product listing is queued for the more comprehensive batch duplicate detection pipeline using the full Siamese network. Seller notified later if more duplicates found.
Latency vs. Accuracy Trade-off:
For sub-100ms, we explicitly trade some accuracy (recall) for speed in the real-time check. The goal is to catch obvious duplicates quickly. The thorough batch process handles higher recall.
Tiered Response Strategy within the API Call (Aggressive Latency Budget):
- Level 0: Identical Match Check (e.g., < 5-10ms): Compute a robust hash of critical, normalized fields. Check against a pre-computed hash index. Catches only purely identical listings.
- Level 1: Blocking Key & Rule-Based Heuristics (e.g., add < 20-30ms): If no Level 0 match, query inverted index with tokens. Apply lightweight rule-based scoring on these few candidates (e.g., title edit distance, price proximity). Avoids embedding generation.
- Level 2: Cached/Approximate Embedding Similarity (e.g., add < 30-40ms, if Level 0 & 1 inconclusive):
- Title Embedding Lookup (if title common): Check cache (Redis) for existing title embedding. Use this to query ANN index.
- Highly Distilled/Quantized Text Embedding: If novel title, use extremely fast, less accurate distilled text model for on-the-fly embedding. Query ANN.
- Skip Image for Real-Time API: Omit on-the-fly image similarity. Use fast perceptual hash if anything.
- Score top ANN candidates using their pre-computed full embeddings against the new product's quick/cached embedding.
- Timeout & Asynchronous Follow-up: Strict timeout (e.g., 90ms). If high-confidence duplicates found, return them. Else, API responds "No immediate high-confidence duplicates found. Detailed check pending." Full analysis triggered asynchronously.
Trade-off Acceptance: For synchronous API, accept lower recall for low latency, catching blatant ones. Overall high recall via asynchronous check. Manages seller expectations.
Reasoning for Prioritizing Precision in Real-Time Seller Feedback:
- Seller Trust & User Experience (during upload): Showing incorrect potential duplicates (low precision) erodes trust. False positives are disruptive. High precision means suggestions are more likely valid and actionable.
- Actionability of Real-Time Feedback: The purpose is to prevent immediate creation of blatant duplicates. Catching 50% of true duplicates with high confidence is valuable at this stage.
- Asynchronous Safety Net for Recall: The crucial caveat is that missed duplicates (due to prioritizing precision) will be caught by the more comprehensive asynchronous batch processing later. We don't permanently lose the chance.
- Cost of False Positives in Real-Time: Incorrectly flagging a unique item as a duplicate can halt listing, require disputes, and make sellers feel the platform is unfair.
So, the strategy is:
- Real-time (sub-100ms API): Optimize for very high precision (high confidence threshold, strict rules). Accept lower recall.
- Asynchronous (Backend): Optimize for higher recall using the full multi-modal model.
Interview Conclusion
What to Learn from This Case
- Embrace Nuance & Domain Knowledge: Recognize that "duplicate" is not a simple binary; understand variants, regional specifics, and cultural context.
- Multi-Stage System Design: For complex problems, break down the solution into manageable stages (preprocessing, blocking, featurization, classification, clustering).
- Detailed Data Preparation is Crucial: Thoroughly explain labeling, feature extraction for all modalities (text, image, structured), train/val/test splits (handling leakage), class imbalance, and data augmentation.
- Advanced NLP for Linguistic Challenges: Demonstrate proficiency in handling mixed scripts, transliterations, phonetic variations, and semantic similarity.
- Scalability is Paramount: Address O(N²) complexity with intelligent blocking. Design for batch and real-time processing.
- Robust Training Data Strategy: Detail how to create labeled data at scale (bootstrapping, active learning, weak supervision, hard negative mining).
- Consider Adversarial Behavior: Acknowledge and design for users trying to game the system.
- Balance Technical Trade-offs: Justify choices like latency vs. accuracy, precision vs. recall under different constraints.
- Plan for Rollout & Risk Management: Articulate a phased rollout strategy with risk mitigation.
- Multi-faceted Evaluation: Go beyond standard ML metrics. Include cluster quality and business KPIs via A/B testing.
- Iterative Improvement & Human-in-the-Loop: Emphasize continuous learning and incorporating human feedback.
- Address Ethical Considerations & Bias: Proactively discuss potential biases and outline mitigation strategies.
- Think End-to-End & Cross-Functionally: Cover the lifecycle from problem definition to deployment and operational integration.
- Handle Edge Cases and Business Constraints: Address legitimate variants, pricing complexities, and seller relations.
- Communicate Complex Ideas Clearly: Explain technical concepts and their business implications effectively.
- Strategic Thinking: Make reasoned decisions when faced with conflicting objectives.