Write a Python function to validate an email address using regular expressions. The function should return True if the email address is valid according to a common pattern, and False otherwise.
A common (simplified) pattern for an email is: username@domain.extension
username: Can contain letters (a-z, A-Z), numbers (0-9), periods (.), underscores (_), percent signs (%), plus signs (+), and hyphens (-).
domain: Can contain letters, numbers, and hyphens (-). It typically consists of one or more parts separated by dots.
extension: The top-level domain (TLD) usually consists of 2 or more letters (e.g., .com, .org, or the .uk in .co.uk).
Examples:
Input: "test.user+label@example.com" Output: True
Input: "user@sub.domain.co.uk" Output: True
Input: "invalid_email@" Output: False
Input: "@domain.com" Output: False
Input: "user@domain" Output: False (missing .extension)
Input: "user@domain.c" Output: False (extension too short)
Constraints:
The input will be a string.
Function Signature (Python):
import re

class Solution:
    def is_valid_email(self, email_address: str) -> bool:
        # Your code here
        pass
Break down the email structure into parts and define a regex for each part. Remember to escape special characters like . with a backslash (\.) if you mean a literal dot.
Username Part:
What characters are allowed? (e.g., a-zA-Z0-9._%+-)
How many of these characters? (At least one: +)
'@' Symbol: This is a literal character.
Domain Name Part:
What characters are allowed? (e.g., a-zA-Z0-9.-)
How many? (At least one: +)
The domain can have subdomains (e.g., sub.domain).
Dot before TLD: A literal dot \..
Top-Level Domain (TLD) / Extension Part:
What characters are allowed? (Usually letters: a-zA-Z)
How many? (Typically 2 or more: {2,})
Anchors: Use ^ to match the beginning of the string and $ to match the end, so that the entire string must match the pattern. With re.fullmatch() the anchors are redundant, because it already requires the whole string to match.
Python's re module: Functions like re.fullmatch(pattern, string) will return a match object if the entire string matches the pattern, or None otherwise.
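The difference between the matching functions can be seen in a small sketch. The helper name compare is hypothetical, and the pattern is deliberately written without ^ and $ so that only the choice of function decides how the ends of the string are treated.

```python
import re

# Pattern without anchors, so the function alone decides what "match" means.
PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"


def compare(text: str) -> dict:
    """Report whether each re function accepts `text` (hypothetical helper)."""
    return {
        "search": re.search(PATTERN, text) is not None,        # match anywhere
        "match": re.match(PATTERN, text) is not None,          # match at the start
        "fullmatch": re.fullmatch(PATTERN, text) is not None,  # match the whole string
    }


print(compare("user@example.com"))
print(compare("user@example.com trailing"))
print(compare("see user@example.com"))
```

Only re.fullmatch() rejects both the string with trailing text and the string with leading text, which is why it suits whole-string validation.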
Solution: Email Validation with Regex
The Goal: We want to write a Python function that can look at a string and tell us if it looks like a valid email address (e.g., "name@example.com"). We'll use "regular expressions" (regex), which are special patterns that describe sequences of characters.
Creating a regex that perfectly matches all theoretically valid email addresses according to official standards (RFCs) is incredibly complex. For interviews, a reasonably robust pattern that covers common email formats is usually sufficient.
Approach: Using Python's re module and a Regex Pattern
The process involves:
Defining a regular expression pattern that describes the structure of a common email address.
Using a function from Python's re module (like re.fullmatch()) to test if the input email string conforms to this pattern.
Constructing the Regex Pattern:
A common pattern structure is username@domain.extension.
^ : Matches the beginning of the string.
Username part: [a-zA-Z0-9._%+-]+
[a-zA-Z0-9._%+-]: A character class allowing lowercase letters, uppercase letters, digits, period, underscore, percent, plus, or hyphen.
+: Matches one or more occurrences of the preceding character class.
@ : Matches the literal "@" symbol.
Domain name part: [a-zA-Z0-9.-]+
[a-zA-Z0-9.-]: A character class allowing letters, digits, period, or hyphen. Note that the hyphen is usually placed at the end or beginning of a character class, or escaped, to avoid being interpreted as a range.
+: Matches one or more occurrences. This allows for subdomains (e.g., mail.example).
\. : Matches a literal dot (period). The backslash escapes the dot, as . by itself is a special regex character meaning "any character".
Extension (TLD) part: [a-zA-Z]{2,}
[a-zA-Z]: Allows only letters for the TLD.
{2,}: Matches two or more occurrences of letters (e.g., "com", "org", "co", "info").
$ : Matches the end of the string.
Combining these gives the pattern: r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$". The r"" denotes a raw string, which is good practice for regex patterns to avoid issues with backslashes.
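As a sketch of how the pieces fit together, the pattern can be assembled from named parts. The constant names (USERNAME, DOMAIN, TLD, EMAIL_PATTERN) are illustrative choices, not part of the problem statement.

```python
import re

USERNAME = r"[a-zA-Z0-9._%+-]+"  # one or more allowed username characters
DOMAIN = r"[a-zA-Z0-9.-]+"       # letters, digits, dots, and hyphens
TLD = r"[a-zA-Z]{2,}"            # two or more letters

# ^ and $ mirror the pattern in the text; re.fullmatch() makes them redundant.
EMAIL_PATTERN = "^" + USERNAME + "@" + DOMAIN + r"\." + TLD + "$"

print(EMAIL_PATTERN)
print(bool(re.fullmatch(EMAIL_PATTERN, "test.user+label@example.com")))
print(bool(re.fullmatch(EMAIL_PATTERN, "user@domain.c")))
```

Naming the parts keeps each character class readable on its own and makes it easier to swap one part out later.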
import re

class Solution:
    def is_valid_email(self, email_address: str) -> bool:
        # Regular expression for validating an email:
        # ^                  : Start of string
        # [a-zA-Z0-9._%+-]+  : Username part (one or more allowed characters)
        # @                  : Literal "@" symbol
        # [a-zA-Z0-9.-]+     : Domain name part (one or more allowed characters)
        # \.                 : Literal dot (escaped)
        # [a-zA-Z]{2,}       : Top-level domain (TLD), 2 or more letters
        # $                  : End of string
        pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

        # re.fullmatch() checks if the entire string matches the pattern.
        # It returns a match object if there is a match, None otherwise.
        return re.fullmatch(pattern, email_address) is not None

# Example Usage:
# sol = Solution()
# emails_to_test = [
#     "test.user+label@example.com",
#     "user@sub.domain.co.uk",
#     "simple@domain.com",
#     "invalid_email@",
#     "@domain.com",
#     "user@domain",
#     "user@domain.c",
#     "user@domain..com",  # Accepted by this regex: the domain class includes '.', so consecutive dots pass
# ]
# for email in emails_to_test:
#     print(f"'{email}': {sol.is_valid_email(email)}")
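One caveat is worth checking by hand: because the domain character class [a-zA-Z0-9.-] includes the dot, an input such as "user@domain..com" is accepted by the simple pattern. The sketch below compares it with a stricter variant that matches the domain label by label; the names SIMPLE, STRICT, LABEL, and the two helper functions are hypothetical, and the stricter pattern is still only an approximation of real hostname rules.

```python
import re

SIMPLE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# A label starts and ends with a letter or digit and is at most 63 characters.
LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
STRICT = re.compile(r"[a-zA-Z0-9._%+-]+@(?:" + LABEL + r"\.)+[a-zA-Z]{2,}")


def is_valid_simple(email: str) -> bool:
    return SIMPLE.fullmatch(email) is not None


def is_valid_strict(email: str) -> bool:
    return STRICT.fullmatch(email) is not None


for email in ["user@sub.domain.co.uk", "user@domain..com",
              "user@-domain.com", "user@domain-.com"]:
    print(f"{email!r}: simple={is_valid_simple(email)}, strict={is_valid_strict(email)}")
```

The stricter variant rejects empty labels and labels that begin or end with a hyphen, at the cost of a pattern that is harder to explain in an interview.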
Complexity Analysis:
Time Complexity: For most practical regex patterns and input strings, the time complexity of matching with Python's re module (which uses a backtracking NFA engine) can be considered roughly O(L) on average, where L is the length of the input string. However, poorly constructed regex patterns (especially those with nested quantifiers and backtracking, known as "catastrophic backtracking") can lead to exponential time complexity in worst-case scenarios. The provided pattern is generally well-behaved.
Space Complexity: O(1) for this specific implementation if we don't count the storage for the pattern string itself or the input string. The regex engine might use some space during matching, but it's typically not proportional to the input length for simple patterns.
Key Takeaways for Interviews:
Acknowledge Regex Complexity: Start by mentioning that a truly RFC-compliant email regex is very complex and usually not expected in an interview. State that you'll provide a pattern for common cases.
Break Down the Pattern: Explain each part of your regex (username, @, domain, TLD, anchors). This shows you understand how it works.
Use re.fullmatch() or Anchors: Emphasize the need to match the entire string. re.fullmatch() is ideal for this. re.match() already anchors at the start of the string, so if you use it, make sure the pattern is anchored at the end; note that $ also matches just before a trailing newline, so \Z is the stricter choice. re.search() finds a match anywhere, so it's usually not what you want for full string validation.
Raw Strings (r""): Use raw strings for regex patterns to avoid issues with backslashes being interpreted as Python escape sequences.
Discuss Limitations: Be ready to discuss what your regex doesn't cover (e.g., quoted usernames, IP addresses as domains, internationalized domains and TLDs, comments in emails, etc.). This shows a deeper understanding. For example, the pattern [a-zA-Z0-9.-]+ for the domain allows consecutive dots (..) or a domain label starting or ending with -, which are invalid. A more robust domain part might be ([a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,} but this is getting much more complex.
Alternative (No Regex): Briefly, one could use string methods (split('@'), check parts), but it becomes much more cumbersome and error-prone than a well-crafted regex for this task.
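For comparison, here is a minimal sketch of the string-method alternative, assuming the same simplified rules as the regex above. The function name is_valid_email_no_regex and the two character-set constants are hypothetical.

```python
import string

USERNAME_CHARS = set(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + ".-")


def is_valid_email_no_regex(email_address: str) -> bool:
    # Exactly one "@", with a non-empty username before it.
    username, at, rest = email_address.partition("@")
    if not at or not username or "@" in rest:
        return False
    if not set(username) <= USERNAME_CHARS:
        return False

    # Split on the last dot: domain before it, TLD after it.
    domain, dot, tld = rest.rpartition(".")
    if not dot or not domain:
        return False
    if not set(domain) <= DOMAIN_CHARS:
        return False

    # TLD: two or more ASCII letters.
    return len(tld) >= 2 and all(c in string.ascii_letters for c in tld)


print(is_valid_email_no_regex("test.user+label@example.com"))
print(is_valid_email_no_regex("user@domain.c"))
```

Each rule the regex expresses in a few characters becomes an explicit branch here, which illustrates why the regex version is usually the more compact answer.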