An email id contains 3 parts:

  1. Name: "abc_x10" in abc_x10@abc.in
  2. Domain name: "abc" in abc_x10@abc.in
  3. Top Level Domain: "in" in abc_x10@abc.in

For almost all email ids the following rules will apply:

  • Besides letters and numbers, the name can contain only an underscore or a dot, and never at the beginning or end of the name
  • Domain names should be 2 or more characters long and can include, or consist entirely of, numbers
  • Domain names can be preceded by subdomains too, like this: abc_x10@ab.cd.abc.in

To match all such valid email ids in a given file, we can build the regex as follows:

  • For name: ([a-z]+[a-z0-9]*[\._]?[a-z0-9]+)

    Explanation

    [a-z]+ : Name should start with one or more letters (+ means one or more)

    [a-z0-9]* : Then it can have zero or more (* means zero or more) letters and numbers

    [\._]? : Then a dot or an underscore can appear at most once (? means zero or one)

    [a-z0-9]+ : Name should end with one or more letters or numbers

    The whole name section is grouped so it is treated as a single unit

  • For domain name including subdomains and the dot following the domain name:

    (([a-z0-9]+\.)*[a-z0-9]{2,}\.)+

    Explanation

    Subdomains have the following pattern: ab.a2.bc.de. To match this we write the following regex: ([a-z0-9]+\.)* : Letters and numbers, one or more times, followed by a dot; the whole thing is grouped, and this group can appear zero or more times.

    The subdomains should lead to the domain name and the final dot, so the regex becomes:

    (([a-z0-9]+\.)*[a-z0-9]{2,}\.)+

    The domain name should be two or more characters long and can consist of letters or numbers. The final + means the whole group should appear one or more times.

  • Top Level Domain: [a-z]{2,}

    Top level domains should contain only letters and should be at least two characters long.

Let's look at the entire regex now:

([a-z]+[a-z0-9]*[_\.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}
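As a rough sketch of how this regex could pull addresses out of a file, the snippet below writes two sample lines to a temporary file and prints only the matching parts. The sample text and variable names are made up for illustration, and -o is a GNU/BSD grep option rather than something POSIX requires. The backslash inside the brackets is dropped here because a dot inside a bracket expression is already literal.

```shell
# Full regex from above, with [_.] written without the backslash
email_re='([a-z]+[a-z0-9]*[_.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}'

# Hypothetical sample file (not from the original text)
sample_file=$(mktemp)
printf '%s\n' \
    'contact: abc_x10@abc.in, backup: a2bc.xyz@a.bb.123.fr' \
    'no address on this line' > "$sample_file"

# -E enables extended regex, -o prints each match on its own line
found=$(grep -Eo "$email_re" "$sample_file")
printf '%s\n' "$found"
# abc_x10@abc.in
# a2bc.xyz@a.bb.123.fr

rm -f "$sample_file"
```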

Echoing some email ids and testing our regex:

naveed@comquest:~$ echo -e "a_mb1@a.bc.abc.com\na2bc.xyz@a.bb.123.fr\na.123@abc.com.sg" | grep -E "([a-z]+[a-z0-9]*[_\.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}"
a_mb1@a.bc.abc.com
a2bc.xyz@a.bb.123.fr
a.123@abc.com.sg
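One thing to keep in mind: without anchors, grep -E reports a line as soon as any part of it matches, so a line like _abc@abc.in would still be printed because abc@abc.in inside it matches. A minimal sketch of a stricter check, assuming we want the whole line to be an email id, is to add -x. The function name is_valid_email and the test addresses are hypothetical.

```shell
email_re='([a-z]+[a-z0-9]*[_.]?[a-z0-9]+)@(([a-z0-9]+\.)*[a-z0-9]{2,}\.)+[a-z]{2,}'

# Hypothetical helper: succeeds only if the whole argument matches the regex
is_valid_email() {
    # -x forces the entire line to match, -q suppresses output
    printf '%s\n' "$1" | grep -Eqx "$email_re"
}

for addr in abc_x10@abc.in _abc@abc.in abc.@abc.in abc_x10@a.in; do
    if is_valid_email "$addr"; then
        echo "valid:   $addr"
    else
        echo "invalid: $addr"
    fi
done
# valid:   abc_x10@abc.in
# invalid: _abc@abc.in    (name starts with an underscore)
# invalid: abc.@abc.in    (name ends with a dot)
# invalid: abc_x10@a.in   (domain name is only one character long)
```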

Pattern space and hold space are buffers where sed stores data. Since sed processes one line at a time, the line(s) currently being processed are stored in the pattern space.

Why did I write line(s) when I just said sed processes one line at a time? Because certain sed commands, like N, append subsequent lines to the pattern space, so it can hold more than one line. When you do the following:

my-linux:~$ cat myfile.txt | sed -n '2p'

What is happening is:

-n - suppresses automatic printing of the pattern space
p - prints the pattern space (the 2 in front restricts it to line 2)

The above command therefore prints only the 2nd line of the file. The pattern space can hold more than one line if commands like N, G and H are used.

If you want to see the raw pattern space use the command l as below:

my-linux:~$ echo -e "linux\nubuntu\nsed" | sed -n '2l'
ubuntu$

The l command displays the 2nd line as ubuntu$, where the trailing $ marks the end of the pattern space.
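To actually see more than one line sitting in the pattern space, l can be combined with N. This is only a small sketch: the $!N form (run N on every line except the last) is my own choice here so that the last line is not left without a partner, and printf is used instead of echo -e for portability.

```shell
# $!N : on every line except the last, append the next input line to the pattern space
# l   : print the pattern space in raw form
n_demo=$(printf 'linux\nubuntu\nsed\n' | sed -n '$!N;l')
printf '%s\n' "$n_demo"
# linux\nubuntu$
# sed$
```

The first output line shows two input lines joined by \n inside a single pattern space.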

Hold space can be assumed to be empty until we explicitly add something to it. Now consider the sed commands G and h.

G - Append a newline to the contents of the pattern space, and then append the contents of the hold space to that of the pattern space.

h - (hold) Replace the contents of the hold space with the contents of the pattern space.

Let's see how these two work:

my-linux:~$ echo -e "linux\nubuntu\nsed" | sed -n "G;h;l"
linux\n$
ubuntu\nlinux\n$
sed\nubuntu\nlinux\n$

Let’s see what is happening line by line:

When sed reads line no. 1 into the pattern space, the G command runs first: it appends a newline (\n) and then the contents of the hold space (which is still empty). Then h replaces the contents of the hold space with the contents of the pattern space. So after the first line, l prints the pattern space as linux\n$ (the trailing $ is only the end marker added by l), and the hold space also contains linux\n. Next, when line no. 2 is read into the pattern space, G appends a newline and then the contents of the hold space (linux\n). Then h copies the pattern space to the hold space, so at this point both contain ubuntu\nlinux\n. Similarly, at the end of the third line the pattern space contains sed\nubuntu\nlinux\n.
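The walkthrough can be checked by printing the pattern space both before and after G on every line. This is just a sketch using the same commands as above, with an extra l placed in front of G.

```shell
# l : show the pattern space as read from input
# G : append newline + hold space
# l : show the pattern space again after G
# h : copy the pattern space into the hold space
trace=$(printf 'linux\nubuntu\nsed\n' | sed -n 'l;G;l;h')
printf '%s\n' "$trace"
# linux$
# linux\n$
# ubuntu$
# ubuntu\nlinux\n$
# sed$
# sed\nubuntu\nlinux\n$
```

Each pair of output lines belongs to one input line: the first is the pattern space before G, the second is after G has pulled in the hold space.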

In place of l if we use p in the above command this is what we get:

my-linux:~$ echo -e "linux\nubuntu\nsed" | sed -n "G;h;p"
linux

ubuntu
linux

sed
ubuntu
linux

Notice that all the newline characters (\n), which l printed literally, are now output as actual line breaks.

We can slightly modify the above command to reverse the lines of input, like this:

my-linux:~$ echo -e "linux\nubuntu\nsed" | sed -n '1!G;h;$p'
sed
ubuntu
linux

1!G means execute G on every line except line number 1

$p means print the pattern space only when the last line is being processed.
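Here is a small sketch that wraps the reversal in a shell function so it can be reused and checked. The function name reverse_lines is hypothetical. The sed script is kept in single quotes because inside double quotes the shell would expand $p as a variable, and an interactive bash may try to treat !G as a history expansion.

```shell
# Hypothetical helper name; the sed script is the reversal script discussed above
reverse_lines() {
    sed -n '1!G;h;$p'
}

reversed=$(printf 'linux\nubuntu\nsed\n' | reverse_lines)
printf '%s\n' "$reversed"
# sed
# ubuntu
# linux

# Reversing twice should give back the original order
restored=$(printf 'linux\nubuntu\nsed\n' | reverse_lines | reverse_lines)
printf '%s\n' "$restored"
# linux
# ubuntu
# sed
```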

This is the tar command to extract a compressed archive:

tar -xz -C <destination_folder> -f <tar_file>

If we want to download files from a remote server over SSH we can use tar to compress and archive on the fly like this:

local$ tar -xz -C ~/ -f <(ssh naveed@remote_ip "tar -zc -C ~/ <folder_to_download>")

To transfer files from the local to the remote computer over SSH:

local$ tar -cz -C ~/ <folder_to_transfer> | ssh naveed@remote_ip "tar -zx -C ~/"

Note: the <(cmd) construct is process substitution, a feature of bash (also available in zsh and ksh) that plain POSIX sh does not support, so it may not work on older or minimal systems. It runs a program, sends its output to a pipe, and substitutes that pipe into the command as if it were a file. The -C flag makes tar change to the given directory before it reads or writes files.
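The same streaming idea can be tried without a remote machine. The sketch below is only a local stand-in: both ends of the pipe run on the same computer, the directory and file names are made up, and -f - is spelled out so that tar explicitly writes the archive to stdout on one side and reads it from stdin on the other.

```shell
# Hypothetical source and destination directories
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir "$src/project"
echo 'hello' > "$src/project/note.txt"

# Left side:  archive "project" relative to $src and write the compressed stream to stdout
# Right side: read the stream from stdin and unpack it under $dest
tar -czf - -C "$src" project | tar -xzf - -C "$dest"

copied=$(cat "$dest/project/note.txt")
echo "$copied"
# hello

rm -rf "$src" "$dest"
```

Over SSH, the right-hand tar would simply run on the other machine, as in the commands above.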

To delete blank lines from a file:

Using sed

sed -i '/^[[:space:]]*$/d' my_file

Using awk

awk -i inplace NF my_file

NOTE: The -i flag makes sed edit the file in place, and -i inplace makes awk do the same (this needs GNU awk 4.1 or later). So if you just want to try a command without modifying the file, leave out -i for sed and -i inplace for awk.

In my experience awk does the best job in deleting blank lines from a file.
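A quick way to compare the two on the same input, without touching any real file, is to run both without the in-place flags on a throwaway file. The sample content below is made up: it contains an empty line, a line of spaces and a line holding only a tab.

```shell
# Hypothetical sample file with three kinds of blank lines
sample=$(mktemp)
printf 'one\n\n   \ntwo\n\t\nthree\n' > "$sample"

# Dry runs: the file itself is left unchanged, results go to stdout
sed_out=$(sed '/^[[:space:]]*$/d' "$sample")
awk_out=$(awk NF "$sample")

printf '%s\n' "$sed_out"
# one
# two
# three

rm -f "$sample"
```

awk NF works because NF (the number of fields) is 0 on an empty or whitespace-only line, which awk treats as false, so only non-blank lines are printed.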