Regular Expression

A regular expression is a smart way to formulate the rules of a language. Broadly, there are two kinds of languages:

  1. natural languages
  2. formal or regular languages.

Natural languages are human languages. We speak them without caring about the grammar; the grammar is formulated afterwards, by analyzing the language.

Regular languages, by contrast, are built systematically from rules, and we write those rules down as regular expressions.

So every regular expression is basically a rule that defines a language. Any language needs an alphabet, and the symbols of the alphabet are combined to form the strings of the language. For instance, a(b|c) is a regular expression over the symbols a, b and c. It says: a, followed by either b or c. The language of this regular expression is therefore

L = {ab, ac}.
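As a quick check, the language of a(b|c) can be enumerated in Python. This is only a minimal sketch: the candidate list is made up for illustration, and re.fullmatch is used so that the whole string, not just a part of it, has to fit the rule.

```python
import re

# Hypothetical candidate strings over the alphabet {a, b, c}
candidates = ["ab", "ac", "a", "abc", "bc", "aa"]

# fullmatch succeeds only if the entire string fits the pattern
language = [s for s in candidates if re.fullmatch(r"a(b|c)", s)]
print(language)  # ['ab', 'ac']
```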

A natural language like English has twenty-six letters, AB…Z, and countless rules for combining them into countless expressions. Regular languages have hard and fast rules. Natural languages also have rules, but those rules may be violated in exceptional cases.

Using Regular Expression for Language Processing

We can use regular expressions to process text. They let us search and substitute under fairly complex conditions.

Simple string look-up

By convention, a regular expression is written between forward slashes. The simplest pattern is a plain string. For example,

/is/ is a regular expression that finds the string “is”. The match may be a separate word, as in “He is going to university”, or part of a word, as in “Pakistani” in “I am a Pakistani”.

Complete word look-up

To search for a string as a standalone word, we can use the notation \b, which marks a word boundary. /is\b/ makes sure that the match ends at the end of a word. So it will not match inside “Pakistan” or “Istanbul”, but it will still match at the end of words like “chemoprophylaxis” and “megasporogenesis”. So we need a boundary at the start of the pattern as well. The right pattern to find “is” as a complete word is

/\bis\b/
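The difference between the three patterns can be seen by counting matches in Python. This is a small sketch with a made-up sentence; note that in Python the slashes are dropped and the pattern is written as a raw string.

```python
import re

# Hypothetical sample sentence
text = "He is in Pakistan; this is an analysis."

anywhere = re.findall(r"is", text)        # also inside Pakistan, this, analysis
word_end = re.findall(r"is\b", text)      # only where a word ends in "is"
whole_word = re.findall(r"\bis\b", text)  # only the standalone word "is"

print(len(anywhere), len(word_end), len(whole_word))  # 5 4 2
```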

Selecting one from a range of characters

Let’s say you are counting how often the word “he” occurs in a text. At the start of a sentence it is written with a capital “H”, as “He”, and elsewhere as “he”. To cover both forms we may write the following regular expression.

/\b[Hh]e\b/

So, in short, the notation [] selects one character from a set. Say you are trying to list all the vehicle numbers that end in the digits 2222. Suppose every vehicle number has three capital letters followed by four digits. The regular expression for this scenario would be

/[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]2222/

Listing each and every character within [] is quite cumbersome, so there is a shorter way: a range. We can write the same pattern as /[A-Z][A-Z][A-Z]2222/.
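Both uses of [] can be tried out in Python. The sentence and the vehicle numbers below are invented for illustration only.

```python
import re

# Hypothetical sentence: only the standalone words "He" and "he" should match
text = "He said he would help. The heir agreed."
pronouns = re.findall(r"\b[Hh]e\b", text)
print(pronouns)  # ['He', 'he']

# Hypothetical vehicle numbers: three capital letters followed by 2222
plates = ["ABC2222", "XYZ2222", "AB12222", "abc2222", "LHR1234"]
hits = [p for p in plates if re.fullmatch(r"[A-Z][A-Z][A-Z]2222", p)]
print(hits)  # ['ABC2222', 'XYZ2222']
```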

 

Not Selecting one from a range of characters

Inside the [] notation, a caret ^ placed right after the opening bracket acts as “not”. For instance, /[A-Z]123/ means any capital letter of the English alphabet followed by 123, and /[^A-Z]123/ means any character other than a capital letter (a digit, a symbol, or even a lowercase letter) followed by 123.
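A short Python sketch makes the point that [^A-Z] excludes only what is listed in the brackets. The sample strings are assumptions chosen to cover a capital letter, a lowercase letter, a digit and a symbol.

```python
import re

# Hypothetical samples: capital letter, lowercase letter, digit, symbol
samples = ["A123", "a123", "7123", "#123"]

plain = [s for s in samples if re.fullmatch(r"[A-Z]123", s)]
negated = [s for s in samples if re.fullmatch(r"[^A-Z]123", s)]

print(plain)    # ['A123']
print(negated)  # ['a123', '7123', '#123']
```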

{} in regular expression

The following regular expression contains unnecessary repetition.

/[A-Z][A-Z][A-Z]2222/

There is a shorthand for this too: a count in braces.

/[A-Z]{3}2222/

The repetition of 2 can be removed in the same way, giving /[A-Z]{3}2{4}/

We can also specify the range of repetition as

/ab{3}/ means abbb.
/ab{2,5}/ means the b may be repeated 2, 3, 4 or 5 times, i.e. abb, abbb, abbbb and abbbbb are all legal words for this regular expression.
/ab{3,}/ means the b is repeated at least three times, so abbb, abbbb, abbbbb, abbbbbb, … are all legal words.
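The three forms of {} can be compared by testing strings with a known number of b’s. The helper b_counts below is a hypothetical name introduced only for this sketch; it reports which counts from 1 to 7 a pattern accepts.

```python
import re

def b_counts(pattern):
    """Return the numbers n (1..7) for which 'a' + n b's fully matches."""
    return [n for n in range(1, 8) if re.fullmatch(pattern, "a" + "b" * n)]

exact = b_counts(r"ab{3}")
bounded = b_counts(r"ab{2,5}")
at_least = b_counts(r"ab{3,}")

print(exact)     # [3]
print(bounded)   # [2, 3, 4, 5]
print(at_least)  # [3, 4, 5, 6, 7]

# The vehicle-number pattern from above, written with counts
plate_ok = bool(re.fullmatch(r"[A-Z]{3}2{4}", "ABC2222"))
print(plate_ok)  # True
```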

Use of () with {}

By default, {} repeats only the single character just before it. The regular expression /ab{5}/ means a followed by b repeated five times. What if we want to repeat the pair ab five times, as in ababababab? In that case we use parentheses: /(ab){5}/.
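The effect of the parentheses can be confirmed in Python. The test strings are built with string multiplication so the number of repetitions is easy to see.

```python
import re

five_bs = "a" + "b" * 5   # abbbbb
five_abs = "ab" * 5       # ababababab

single = bool(re.fullmatch(r"ab{5}", five_bs))       # {5} applies to b only
grouped = bool(re.fullmatch(r"(ab){5}", five_abs))   # {5} applies to the pair ab
mismatch = bool(re.fullmatch(r"(ab){5}", five_bs))   # abbbbb is not five pairs

print(single, grouped, mismatch)  # True True False
```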

period . in regular expression.

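The period . in a regular expression stands for any single character (in most tools, any character except a newline). So for the RE /c.t/, any character can fill the slot between c and t: cat, cbt, c t, c7t and c&t are all legal. Of course, ct is illegal, because the period must match exactly one character; an empty slot is not allowed.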
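A small Python sketch of the same list, with two extra cases added as assumptions: the empty slot ("ct") and a newline in the slot, which Python’s re module does not match with . unless the re.DOTALL flag is given.

```python
import re

# The examples from the text, plus "ct" and a newline in the middle
samples = ["cat", "cbt", "c t", "c7t", "c&t", "ct", "c\nt"]

legal = [s for s in samples if re.fullmatch(r"c.t", s)]
print(legal)  # ['cat', 'cbt', 'c t', 'c7t', 'c&t']

# With DOTALL the period also accepts a newline
with_newline = bool(re.fullmatch(r"c.t", "c\nt", flags=re.DOTALL))
print(with_newline)  # True
```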

Question mark ? in regular expression.

The question mark ? makes the preceding character optional. For instance, the word “colour” has two spellings: Americans write “color”, whereas in British English it is spelled “colour”. To find both versions in a text we need to make the character “u” optional. The question mark ? comes to the rescue in this situation.

/colou?r/

By default the question mark applies only to the character right before it. We may add () to apply it to several characters at once; for example, /Nov(ember)?/ matches both “Nov” and “November”.
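Both cases can be sketched in Python. The sentence and the month strings are made up for illustration; the grouped pattern Nov(ember)? is just one possible example of an optional group.

```python
import re

# Optional single character: both spellings are found
text = "The colour red and the color blue."
spellings = re.findall(r"colou?r", text)
print(spellings)  # ['colour', 'color']

# Optional group: the whole of "ember" is either present or absent
months = ["Nov", "November", "Novem"]
accepted = [m for m in months if re.fullmatch(r"Nov(ember)?", m)]
print(accepted)  # ['Nov', 'November']
```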

Kleene star * and Kleene plus + operators

The * operator means zero or more repetitions of the preceding character, and + means one or more.

/ba*/

It means b followed by zero or more a’s, so b, ba, baa, baaa, baaaaaa, baaaaaaaaaaaaaa, … are all valid.

/ba+/

It means b followed by at least one a, so ba, baa, baaa, baaaa, … are valid but “b” alone is invalid.
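The one difference between the two operators, whether a lone b is accepted, shows up directly in Python. The sample strings are assumptions picked to include that edge case.

```python
import re

# Hypothetical samples, including the edge case "b" with no a at all
samples = ["b", "ba", "baaa", "a", "bb"]

star = [s for s in samples if re.fullmatch(r"ba*", s)]
plus = [s for s in samples if re.fullmatch(r"ba+", s)]

print(star)  # ['b', 'ba', 'baaa']
print(plus)  # ['ba', 'baaa']
```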

caret ^ and dollar sign $ in regular expression

We already saw that the caret ^ acts as “not” inside []. It has a different meaning at the start of a pattern.

/^Pakistan/

Here the word Pakistan is matched only if it appears at the very start of the text being searched.

/Pakistan$/

Here the word Pakistan is matched only if it is the last word in the text. In other words, ^ anchors the match to the beginning of the text and $ anchors it to the end.

/^Pakistan$/

Here the match is anchored at both the beginning and the end of the text, so it succeeds only if “Pakistan” is the entire text.
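The three anchored patterns can be compared on a few sample texts in Python. The texts are invented for this sketch, and re.search is used because it looks for the pattern anywhere unless an anchor restricts it.

```python
import re

# Hypothetical texts: Pakistan at the start, at the end, and on its own
texts = ["Pakistan is a country", "I live in Pakistan", "Pakistan"]

at_start = [t for t in texts if re.search(r"^Pakistan", t)]
at_end = [t for t in texts if re.search(r"Pakistan$", t)]
only = [t for t in texts if re.search(r"^Pakistan$", t)]

print(at_start)  # ['Pakistan is a country', 'Pakistan']
print(at_end)    # ['I live in Pakistan', 'Pakistan']
print(only)      # ['Pakistan']
```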

Backslashed Characters

As we saw, *, + and . have special meanings in regular expressions. What if we want to match them literally? As in many programming languages, we put a backslash in front of them. So . means any character and \. means an actual dot. Here is a table of backslashed characters.

[Table: backslashed characters]
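The effect of the backslash can be sketched in Python with two of the special characters discussed so far. The sample strings are assumptions made for illustration.

```python
import re

text = "Pi is about 3.14, not 3x14."

# Unescaped: the period stands for any character, so 3x14 is found too
loose = re.findall(r"3.14", text)
# Escaped: only a literal dot is accepted
strict = re.findall(r"3\.14", text)

print(loose)   # ['3.14', '3x14']
print(strict)  # ['3.14']

# Unescaped * means "zero or more 2s, then 3"; escaped \* is a literal star
star_op = re.findall(r"2*3", "2*3 = 6")
star_literal = re.findall(r"2\*3", "2*3 = 6")

print(star_op)       # ['3']
print(star_literal)  # ['2*3']
```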

Shorthands for common sets of characters

The most common sets of characters have short forms. For example, [0-9] can be written as \d. Here is the complete set.

[Table: short forms for common sets of characters]
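A few of these short forms, as Python’s re module supports them, can be sketched as follows: \d for digits, \w for word characters (letters, digits and underscore) and \s for whitespace. The sample text is an assumption; for plain ASCII text like this, \d behaves the same as [0-9].

```python
import re

text = "Room 42, floor 7"

digits_short = re.findall(r"\d+", text)
digits_long = re.findall(r"[0-9]+", text)
words = re.findall(r"\w+", text)
pieces = re.split(r"\s+", text)

print(digits_short)  # ['42', '7']
print(digits_long)   # ['42', '7']
print(words)         # ['Room', '42', 'floor', '7']
print(pieces)        # ['Room', '42,', 'floor', '7']
```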

() and \ for referring back in regular expression.

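We can use () and \ together to refer back to text that was matched earlier in the pattern. Each pair of parentheses stores what it matched in a numbered memory, and \1, \2, … refer to those memories.

/I am a (man) of (words) and he is also a \1 of \2/

Here \1 refers to man and \2 refers to words. This regular expression searches for the following string in the text:

“I am a man of words and he is also a man of words”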

If I write it as

/I am a (man) of (words) and he is also a \2 of \1/

then the two references are swapped, and the following string is searched for in the text:

“I am a man of words and he is also a words of man”
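Back-references behave the same way in Python’s re module. The sketch below checks a pattern with \1 and \2 in order and the same pattern with them swapped; the last example, which finds an accidentally doubled word, is an extra illustration that is not part of the text above.

```python
import re

in_order = r"I am a (man) of (words) and he is also a \1 of \2"
swapped = r"I am a (man) of (words) and he is also a \2 of \1"

text_a = "I am a man of words and he is also a man of words"
text_b = "I am a man of words and he is also a words of man"

print(bool(re.search(in_order, text_a)))  # True
print(bool(re.search(swapped, text_b)))   # True
print(bool(re.search(swapped, text_a)))   # False

# Extra illustration: \1 must repeat whatever the group actually matched
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test")
print(doubled)  # ['is']
```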

Substitution in regular expression

s/colour/color/

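This expression substitutes “color” for the word “colour” in the text. (In tools such as sed or Perl, add a trailing g, as in s/colour/color/g, to replace every occurrence on a line rather than just the first.)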

s/([0-9]+)/<\1>/

The above regular expression would enclose the numbers (wherever found in the text)  with angle brackets.

s/(.*) is (.*)/\1 are \2/

Here the word “is” is replaced with “are” and everything else remains the same. \1 refers to whatever comes before “is” and \2 refers to whatever comes after it.
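The first two substitutions can be sketched with Python’s re.sub, which replaces every occurrence by default. The input sentences are assumptions made for illustration.

```python
import re

# s/colour/color/ : a plain string replacement
recoloured = re.sub(r"colour", "color", "The colour of my colour pencil")
print(recoloured)  # The color of my color pencil

# s/([0-9]+)/<\1>/ : \1 puts back the number that the group matched
bracketed = re.sub(r"([0-9]+)", r"<\1>", "I have 35 apples and 4 pears")
print(bracketed)  # I have <35> apples and <4> pears
```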

Substitution Syntax in Python.

Line 1: imports Python’s library for regular expressions, re.

Line 3: a sample input.

Line 4: re.sub(search pattern, replacement pattern, text) is the library function that searches the given text and replaces what it finds. The () brackets store what they match in memory, and \g<1>, \g<2> refer back to that memory: \g<1> to the first group and \g<2> to the second.

Line 6: prints the output, which is “They are good boys.”
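A minimal script consistent with the notes above might look like this. The variable names and the input sentence are assumptions, chosen only so that the line numbers and the printed result agree with the description.

```python
import re

text = "They is good boys."  # assumed sample input
result = re.sub(r"(.*) is (.*)", r"\g<1> are \g<2>", text)  # groups 1 and 2 are reused

print(result)  # They are good boys.
```

In Python the replacement could also be written with \1 and \2; the \g<1> form is handy when a digit follows the reference directly.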

That’s it. Enjoy!