Pandas: Handling Duplicates & Missing Values
Given a Pandas DataFrame:
- Write a function (or explain the methods) to identify and remove duplicate rows.
- Write a function (or explain the methods) to identify and remove rows with missing values (NaNs):
- in a specific column.
- if any column has a NaN.
- Briefly mention how you might impute (fill) missing values instead of dropping them.
Explanation: Handling Duplicates and Missing Values in Pandas
Handling duplicate rows and missing values are crucial steps in data cleaning and preprocessing before any analysis or model building.
import pandas as pd
import numpy as np
# Sample DataFrame for demonstration
data = {
'col_A': [1, 2, 2, 3, 4, 5, 5, 6],
'col_B': ['x', 'y', 'y', 'z', np.nan, 'a', 'a', 'b'],
'col_C': [100, 200, 200, np.nan, 400, 500, 500, np.nan]
}
df_initial = pd.DataFrame(data)
print("Original DataFrame:\n", df_initial)
1. Handling Duplicate Rows
Pandas provides methods to both identify and remove duplicate rows.
a. Identifying Duplicate Rows: df.duplicated()
The df.duplicated() method returns a boolean Series indicating whether each row is a duplicate of a previous row. By default, it considers all columns for identifying duplicates and marks all occurrences except the first as duplicates.
df = df_initial.copy() # Work on a copy
# Identify which rows are duplicates (marks second, third, etc. occurrences as True)
duplicate_mask = df.duplicated()
print("\nBoolean mask for duplicated rows (first occurrence is False):\n", duplicate_mask)
# To see the actual duplicate rows (excluding their first occurrence)
actual_duplicates = df[df.duplicated()]
print("\nActual duplicate rows (excluding first occurrences):\n", actual_duplicates)
# Keep='last' marks all but the last occurrence as duplicate
duplicates_keep_last = df[df.duplicated(keep='last')]
print("\nDuplicates keeping the last occurrence:\n", duplicates_keep_last)
# Keep=False marks ALL occurrences of duplicates as True
all_duplicate_occurrences = df[df.duplicated(keep=False)]
print("\nAll occurrences of rows that are duplicated:\n", all_duplicate_occurrences)
# Consider duplicates based on a subset of columns
duplicates_subset = df[df.duplicated(subset=['col_A', 'col_B'], keep=False)]
print("\nDuplicates based on 'col_A' and 'col_B':\n", duplicates_subset)
b. Removing Duplicate Rows: df.drop_duplicates()
The df.drop_duplicates() method returns a DataFrame with duplicate rows removed. Key parameters:
subset: List of column labels to consider for identifying duplicates. By default, uses all columns.keep: Determines which duplicate(s) to keep.'first': (Default) Keep the first occurrence.'last': Keep the last occurrence.False: Drop all duplicates.
inplace: IfTrue, modifies the DataFrame in place and returnsNone. Default isFalse(returns a new DataFrame).
df = df_initial.copy()
# Remove duplicates, keeping the first occurrence (default)
df_no_duplicates_first = df.drop_duplicates()
print("\nDataFrame after dropping duplicates (keeping first):\n", df_no_duplicates_first)
# Remove duplicates, keeping the last occurrence
df_no_duplicates_last = df.drop_duplicates(keep='last')
print("\nDataFrame after dropping duplicates (keeping last):\n", df_no_duplicates_last)
# Remove duplicates based on specific columns
df_no_duplicates_subset = df.drop_duplicates(subset=['col_A'], keep='first')
print("\nDataFrame after dropping duplicates based on 'col_A' (keeping first):\n", df_no_duplicates_subset)
# Modify DataFrame in place
# df.drop_duplicates(inplace=True)
2. Handling Missing Values (NaN)
Missing values are typically represented as np.nan (Not a Number) in Pandas.
a. Identifying Missing Values: df.isnull() or df.isna()
These methods return a boolean DataFrame of the same shape as the input, where True indicates a missing value (NaN) and False indicates a non-missing value. .isna() is an alias for .isnull().
df = df_initial.copy()
missing_values_mask = df.isnull()
print("\nBoolean mask for missing values (isnull):\n", missing_values_mask)
# Count missing values per column
missing_counts_per_column = df.isnull().sum() # Summing True (1) and False (0)
print("\nMissing value counts per column:\n", missing_counts_per_column)
# Check if any value in a row is missing
rows_with_any_missing = df[df.isnull().any(axis=1)]
print("\nRows with at least one missing value:\n", rows_with_any_missing)
b. Removing Rows with Missing Values: df.dropna()
The df.dropna() method is used to remove rows (or columns) containing missing values. Key parameters:
axis:0or'index'to drop rows,1or'columns'to drop columns. Default is0.how:'any': (Default) Drop the row/column if any NA values are present.'all': Drop the row/column only if all values are NA.
subset: List of column/row labels to consider when looking for NaNs (only applicable ifaxis=0for columns oraxis=1for rows).thresh: Integer, require that many non-NA values to keep the row/column.inplace: IfTrue, modifies the DataFrame in place. Default isFalse.
df = df_initial.copy()
# Drop rows if ANY column has a NaN
df_dropped_any_nan_row = df.dropna() # Default: axis=0, how='any'
print("\nDataFrame after dropping rows with any NaN:\n", df_dropped_any_nan_row)
# Drop rows if ALL values in that row are NaN
df_dropped_all_nan_row = df.dropna(how='all')
print("\nDataFrame after dropping rows where all values are NaN (no change here):\n", df_dropped_all_nan_row)
# Drop rows with NaN in a specific column ('col_B')
df_dropped_nan_in_col_b = df.dropna(subset=['col_B'])
print("\nDataFrame after dropping rows with NaN in 'col_B':\n", df_dropped_nan_in_col_b)
# Drop columns if they contain any NaN
df_dropped_any_nan_col = df.dropna(axis=1, how='any')
print("\nDataFrame after dropping columns with any NaN:\n", df_dropped_any_nan_col)
3. Imputing (Filling) Missing Values: df.fillna()
Instead of dropping missing values, you often want to fill them with a specific value (e.g., 0, mean, median, mode, or a value from a previous/next row).
df = df_initial.copy()
# Fill all NaNs with 0
df_filled_zero = df.fillna(0)
print("\nDataFrame with NaNs filled with 0:\n", df_filled_zero)
# Fill NaNs in 'col_B' with a placeholder string
df['col_B'] = df['col_B'].fillna('Unknown')
print("\nDataFrame with NaNs in 'col_B' filled with 'Unknown':\n", df)
# Fill NaNs in 'col_C' with the mean of that column
mean_col_c = df['col_C'].mean()
df['col_C'] = df['col_C'].fillna(mean_col_c)
print("\nDataFrame with NaNs in 'col_C' filled with its mean:\n", df)
# Forward fill (fill NaN with the value from the previous row)
# df_ffill = df_initial.copy().fillna(method='ffill')
# Backward fill (fill NaN with the value from the next row)
# df_bfill = df_initial.copy().fillna(method='bfill')
Key Discussion Points for an Interview:
- Parameters of methods: Show you understand parameters like
keepandsubsetfordrop_duplicates(), andaxis,how,subset,threshfordropna(). inplace=True: Mention its use and the fact that it modifies the DataFrame directly and returnsNone. It's often recommended to avoidinplace=Truein production code for clarity and to prevent accidental data modification, preferring reassignment (df = df.dropna()).- Consequences of Dropping vs. Filling: Discuss when it's appropriate to drop data (e.g., if missingness is random and affects a small percentage of data, or if a row/column is mostly empty) versus when to impute (e.g., to preserve data points, using mean/median for numerical, mode for categorical, or more advanced imputation techniques).
- Chaining: Pandas operations can often be chained for concise code, e.g.,
df.isnull().sum().sort_values(ascending=False)to see missing value counts in descending order.