An email id has three parts:

  1. Name:
  2. Domain name:
  3. Top Level Domain:

For almost all email ids the following rules will apply:

  • Apart from letters and numbers, the name can contain only an underscore or a dot, and not at the beginning or end of the name
  • Domain names should be 2 or more characters long and can include, or consist entirely of, numbers
  • Domain names can be accompanied by subdomains too. Like this:

To match all such valid email ids in a given file, we can write the regex as follows:

  • For name: ([a-z]+[a-z0-9]*[\._]?[a-z0-9]+)


    [a-z]+ : The name should start with one or more letters (+ means one or more)

    [a-z0-9]* : Then it can have zero or more (* means zero or more) letters and numbers

    [\._]? : Then a dot or an underscore can appear at most once (? means zero or one)

    [a-z0-9]+ : The name must end with one or more letters or numbers

    The whole name section is grouped so that it is treated as a single unit
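
    The name pattern can be tried on its own before building the full regex. The following is only a minimal sketch: the sample names, the name_re variable and the check_name helper are made up for illustration. The bracket is written as [_.] here because inside a POSIX bracket expression a dot is already literal, while a backslash would itself be matched as a character. grep -x requires the whole line to match.

```shell
# Hypothetical helper: succeeds only if the whole string matches the name pattern.
name_re='[a-z]+[a-z0-9]*[_.]?[a-z0-9]+'

check_name() {
  # -E: extended regex, -x: match the entire line, -q: no output, status only
  printf '%s\n' "$1" | LC_ALL=C grep -Exq "$name_re"
}

# Made-up sample names, not taken from any real data
for n in "john_doe" "john99" "_john" "john_" "a.b.c"; do
  if check_name "$n"; then
    echo "$n -> match"
  else
    echo "$n -> no match"
  fi
done
```

    With this pattern only the first two samples should match: a leading or trailing separator is rejected, and so is a second separator, because [_.]? allows at most one.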

  • For domain name including subdomains and the dot following the domain name:



    To match subdomains we write the following regex: ([a-z0-9]+\.)* : Letters and numbers, one or more times, followed by a dot. The entire thing is grouped, and this group can appear zero or more times.

    The subdomains should lead to the domain name and the final dot, so the regex becomes:

    (([a-z0-9]+\.)*[a-z0-9]{2,}\.)+

    [a-z0-9]{2,}\. : The domain name must be two or more characters long, can consist of letters or numbers, and is followed by a dot. The final + means the whole group should appear one or more times.
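
    The domain part, including its trailing dot, can be checked in isolation the same way. This is a sketch under assumptions: the sample strings, the domain_re variable and the check_domain helper are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: succeeds only if the whole string matches the domain pattern.
domain_re='(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+'

check_domain() {
  # -x forces the pattern to cover the entire line, trailing dot included
  printf '%s\n' "$1" | LC_ALL=C grep -Exq "$domain_re"
}

# Made-up samples: plain domain, subdomain, digits only, too short, missing dot
for d in "example." "mail.example." "123." "x." "example"; do
  if check_domain "$d"; then
    echo "$d -> match"
  else
    echo "$d -> no match"
  fi
done
```

    The first three samples should match. "x." fails because the domain name needs at least two characters, and "example" fails because the pattern expects the dot that leads into the top level domain.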

  • Top Level Domain: [a-z]{2,}

    Top level domains should contain only letters and should be at least two characters long.

Let's look at the entire regex now:

    ([a-z]+[a-z0-9]*[_\.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}

Echoing some email ids and testing our regex:

naveed@comquest:~$ echo -e "\\" | grep -E "([a-z]+[a-z0-9]*[_\.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}"
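
To pull the matching ids out of a file rather than an echoed string, grep -o prints only the matched part of each line. The following is a rough sketch, not a definitive script: the temporary file, its contents (made-up addresses on the reserved example.com and example.org domains) and the variable names are all assumptions for illustration.

```shell
# Full pattern: name, @, (subdomains + domain + dot)+, top level domain.
email_re='([a-z]+[a-z0-9]*[_.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}'

# Hypothetical input file with invented addresses
sample_file=$(mktemp)
printf '%s\n' \
  'contact: john_doe@example.com' \
  'subdomain: jane99@mail.example.org' \
  'no address on this line' \
  'broken: _bob@example.com.' > "$sample_file"

# -o: print only the matching text, -E: extended regular expressions
matches=$(LC_ALL=C grep -oE "$email_re" "$sample_file")
printf '%s\n' "$matches"

rm -f "$sample_file"
```

Because the pattern is not anchored, grep matches anywhere inside a line. On the last sample line it should therefore still report bob@example.com, skipping the leading underscore and the trailing dot instead of rejecting the line. Adding -x, or ^ and $ anchors, is one way to insist that a whole line is a valid id.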