You are given a dataset (e.g., a Pandas DataFrame) representing customer transactions, with fields like customer_id, transaction_date, and transaction_amount. Write a Python function to identify customers who have made purchases in three consecutive months.
The function should return a list or set of customer_ids for these customers.
Example Input DataFrame (transactions_df):
# Assume transaction_date is already in datetime format or can be converted.data = {
'customer_id': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C1', 'C3', 'C3', 'C3', 'C3'],
'transaction_date': [
'2023-01-10', '2023-01-20', '2023-02-05', '2023-03-15', # C1: Jan, Feb, Mar'2023-01-01', '2023-03-01', # C2: Jan, Mar (not consecutive)'2023-05-10', # C1: May (breaks sequence)'2023-11-15', '2023-12-20', '2024-01-05', '2024-02-01'# C3: Nov, Dec, Jan (consecutive!) then Feb
],
'transaction_amount': [100, 50, 75, 200, 30, 40, 60, 90, 10, 20, 50]
}
# transactions_df = pd.DataFrame(data)# transactions_df['transaction_date'] = pd.to_datetime(transactions_df['transaction_date'])
Expected Output:{'C1', 'C3'} (or a list ['C1', 'C3'])
Constraints:
Input is a Pandas DataFrame.
transaction_date column can be parsed into datetime objects.
Function Signature (Python):
importpandasaspdfromtypingimportList, SetclassSolution:deffind_consecutive_monthly_purchasers(self, transactions_df:pd.DataFrame) ->Set[str]:# Your code herepass
Related Python Concepts
Pandas DataFramespd.to_datetime()Datetime Properties (.dt.year, .dt.month, .dt.to_period('M'))groupby().apply().unique().sort_values()Set OperationsIteration over GroupsHandling Year BoundariesWindow Functions (Advanced)
Hint
The core of this problem is processing data per customer and then checking for a sequence in their purchase months.
Date Handling: First, ensure your transaction_date column is a proper datetime type. Then, extract a representation of the month (e.g., year-month like "2023-01" or Pandas Period objects like Period('2023-01', 'M')).
Per Customer: You need to analyze each customer's purchase history independently. groupby('customer_id') is the way to go.
Unique Purchase Months: Within each customer's group, you only care if they made a purchase in a given month, not how many. So, get the unique months of purchase for each customer. Make sure these are sorted chronologically.
Checking for Consecutiveness (3 months):
Iterate through the sorted list of unique purchase months for a customer.
For each month m1 in the list, you need to check if m1 + 1 month (let's call it m2) and m1 + 2 months (m3) are also present in that customer's list of unique purchase months.
Period Objects: If you convert year-month to pd.Period objects (e.g., df['month_period'] = df['transaction_date'].dt.to_period('M')), adding integers to them correctly handles month and year rollovers (e.g., Period('2023-12', 'M') + 1 gives Period('2024-01', 'M')). This simplifies checking for consecutive months.
If you use (year, month) tuples: You'll need custom logic to increment months and handle year changes (e.g., (2023, 12) + 1 month -> (2024, 1)).
Collecting Results: Keep a set of customer_ids that satisfy the condition.
Solution: Consecutive Monthly Purchases
This problem requires careful data preparation, grouping, and sequential pattern detection within each group. Using Pandas Period objects can greatly simplify handling month arithmetic across year boundaries.
classSolution:deffind_consecutive_monthly_purchasers(self, transactions_df:pd.DataFrame) ->Set[str]:iftransactions_df.empty:returnset()
df=transactions_df.copy()
# 1. Data Preparation: Ensure transaction_date is datetime and extract month periodifnotpd.api.types.is_datetime64_any_dtype(df['transaction_date']):df['transaction_date'] =pd.to_datetime(df['transaction_date'])
# Create a 'month_period' column (e.g., Period('2023-01', 'M'))# This handles year-month correctly and allows for easy arithmetic with months.df['month_period'] =df['transaction_date'].dt.to_period('M')
consecutive_purchasers=set()
# 2. Group by customer_idforcustomer_id, customer_groupindf.groupby('customer_id'):# 3. Identify Unique Monthly Purchases for the customer, sortedunique_purchase_months=sorted(customer_group['month_period'].unique())
# We need at least 3 unique purchase months to check for 3 consecutiveiflen(unique_purchase_months) <3:continue# 4. Check for Consecutive Months# Iterate up to the third-to-last month to allow checking for two subsequent monthsforiinrange(len(unique_purchase_months) -2):month1=unique_purchase_months[i]
month2_expected=month1+1# pd.Period addition handles month/year rolloversmonth3_expected=month1+2# Check if the actual next two unique months match the expected consecutive monthsif (unique_purchase_months[i+1] ==month2_expectedandunique_purchase_months[i+2] ==month3_expected):consecutive_purchasers.add(customer_id)
break# Found 3 consecutive for this customer, no need to check further for themreturnconsecutive_purchasers# Example Usage:# sol = Solution()# result = sol.find_consecutive_monthly_purchasers(transactions_df_initial.copy())# print(f"\nCustomers with 3 consecutive monthly purchases: {result}")# Expected: {'C1', 'C3'}
Complexity Analysis:
Let N be the total number of transactions and C be the number of unique customers. Let Mmax be the maximum number of unique purchase months for any single customer.
Converting dates to datetime: O(N)
Converting dates to periods: O(N)
groupby('customer_id'): Typically O(N) as it iterates through the data.
Inside the loop for each customer:
.unique() on month periods for a customer: Roughly O(Mc) where Mc is transactions for that customer.
sorted() on unique months: O(Mu log Mu) where Mu is unique months for that customer (Mu <= Mc).
Inner loop for checking consecutive: O(Mu).
The dominant part after the initial O(N) processing is the groupby and the operations within each group. If Mmax is small, the per-customer processing is fast. In the worst case, if one customer has almost all transactions, the sorting of unique months for that customer might be the bottleneck after initial O(N) steps. Overall, it's roughly bounded by O(N + C * Mmax log Mmax) or often simplified to O(N log Mmax) if C is not too large or O(N log N) in a very skewed scenario where M_max is large. If sorting the entire DataFrame by customer and date first (O(N log N)), then iterating, the unique month extraction and check per customer is efficient. Using .dt.to_period('M') is generally efficient.
Space Complexity:O(N) for storing the month_period column if added to the DataFrame. O(C + M_max) for storing unique months per customer in the worst case during iteration, and for the consecutive_purchasers set.
Key Discussion Points for an Interview:
Date Handling: Emphasize the conversion to datetime objects and then to a representation of month (year-month). Using pd.Period('M') is elegant for month arithmetic.
Grouping Strategy:groupby('customer_id') is essential.
Uniqueness and Sorting: For each customer, you need unique, sorted purchase months.
Consecutive Check Logic:
Explain how you iterate through the sorted unique months.
The core logic: if month[i+1] == month[i] + 1_period AND month[i+2] == month[i] + 2_periods.
Mention how pd.Period addition simplifies handling year boundaries automatically. If not using Periods (e.g., (year, month) tuples), you'd need to explain the logic for incrementing months and handling year rollovers (e.g., (2023, 12) -> (2024, 1)).
Efficiency: While a loop within .apply() on groups is shown, for very large datasets, one might explore more vectorized approaches if simple, or acknowledge that the current approach is clear and handles the logic well for typical data sizes. The main cost is often the initial data prep and sorting if not already sorted.
Alternative for Consecutive Check (Advanced): One could calculate the difference between consecutive sorted month periods. If there are two consecutive differences of '1 month', then a 3-month sequence exists.
# Inside the customer group loop, after getting unique_purchase_months:# if len(unique_purchase_months) >= 3:# month_series = pd.Series(unique_purchase_months)# diffs = month_series.diff().dt.n # Get difference in terms of periods (months)# # Check for two consecutive 1s in the differences# for j in range(2, len(diffs)): # Start from the 3rd month (2nd diff)# if diffs.iloc[j] == 1 and diffs.iloc[j-1] == 1:# consecutive_purchasers.add(customer_id)# break
This alternative using .diff() is more Pandas-idiomatic within the group processing.