Sunday, 4 August 2013

Tutorial: Regular Expressions in Python with Examples

[Figure: visualization of a regular expression that finds the date "1st May 1994"]


1. Easy : Let's begin

We will extract the book name and the author name from a line taken from a Project Gutenberg eBook:
The Project Gutenberg eBook, The Art of War, by Sun Tzu
  1. Identify Gutenberg eBook.


    A Gutenberg eBook starts with the phrase "The Project Gutenberg eBook".
    Python Code:
    import re
    
    line1="The Project Gutenberg eBook, The Art of War, by Sun Tzu"
    
    re.findall('The Project Gutenberg eBook',line1)
    
    >>['The Project Gutenberg eBook']
    
    
    
  2. Find book name


    The first one was very easy. Now let's try to find the book name. Looking at the text, we can conclude that:
    • The book name comes after the Gutenberg phrase "The Project Gutenberg eBook" and a comma, and it ends with a comma. So the pattern we are looking for is #GutenbergPhrase,space#BookName, .
    • The book name is a sequence of one or more alphanumeric characters and spaces (the patterns below also allow ":" and "'").
    We can write the regular expression with a combination of the metacharacters " [ ], + , | , \s , \w " to find the book name.
    re.findall(r'The Project Gutenberg eBook,[A-Z|a-z|0-9|\s|:|\']+,',line1)
    >>['The Project Gutenberg eBook, The Art of War,']
    
    re.findall(r'The Project Gutenberg eBook,[A-Z|a-z|0-9| |:|\']+,',line1)
    >>['The Project Gutenberg eBook, The Art of War,']
    
    re.findall(r'The Project Gutenberg eBook,[\w| |:|\']+,',line1)
    >>['The Project Gutenberg eBook, The Art of War,']
    
    Let's break it down.
    1. "The Project Gutenberg eBook" finds the same string in the text.
    2. ",[A-Z|a-z|0-9|\s|:|\']+," , ",[A-Z|a-z|0-9| |:|\']+," and ",[\w| |:|\']+," each match a sequence of alphanumeric characters, spaces, ":" and "'" enclosed by commas, e.g. ", The Art of War," or ",Zealot: The Life and Times of Jesus of Nazareth ,".
    • [ ] : Set of possible character matches.
    • + : Matches the preceding pattern element one or more times.
    • | : Separates alternate possibilities. Inside [ ] it is an ordinary character, so the sets above would work the same without it, apart from no longer accepting a literal "|".
    • \s : Matches a whitespace character, which includes space, \t, \r and \n.
    • \w : Matches an alphanumeric character, including "_".
    • \ : Treats the next character as a literal character. Here we have used it for ' .
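One detail worth checking: inside [ ] the | character is matched literally rather than acting as an alternation operator. A minimal sketch comparing the set used above with the same set written without pipes; the variable names and the sample string are made up for illustration:

```python
import re

# Inside [ ] the pipe is an ordinary character, so this set also accepts '|'
with_pipes = r"[A-Z|a-z|0-9|\s|:|']+"
# The same set written without the pipes
without_pipes = r"[A-Za-z0-9\s:']+"

sample = "War|Peace"
print(re.findall(with_pipes, sample))     # ['War|Peace']
print(re.findall(without_pipes, sample))  # ['War', 'Peace']

# Outside a set, | separates alternatives
print(re.findall(r"War|Peace", sample))   # ['War', 'Peace']
```

For the Gutenberg line both sets give the same result, because that line contains no "|".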
    Okay, we found the string that contains the book name, but not the book name alone. When the pattern has no group, re.findall() returns the string that matches the entire pattern. We need only the part matched by [\w| |:|\']+ to get the book name, and we can use the metacharacter ( ) for that.
    re.findall(r'The Project Gutenberg eBook,([\w| |:|\']+),',line1)
    >>[' The Art of War']
    
    
    • ( ) : Matches whatever regular expression is inside the parentheses and marks the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched again later in the pattern by referring to the group number with \1, \2, etc.
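A short sketch of both uses of a group, assuming we want the book name and the author name from the same line in one pass. It uses re.search(), which looks for a match anywhere in the string, and the repeated-word sentence is a made-up sample:

```python
import re

line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"

# Two groups: group 1 is the book name, group 2 is the author name
m = re.search(r"eBook, ([\w :']+), by ([\w ]+)$", line1)
print(m.group(0))  # eBook, The Art of War, by Sun Tzu
print(m.group(1))  # The Art of War
print(m.group(2))  # Sun Tzu

# \1 must match the same text that group 1 matched: here, a repeated word
repeated = re.findall(r"\b(\w+) \1\b", "this is is a test")
print(repeated)    # ['is']
```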
    Try the same regular expression with the following inputs.
    The Project Gutenberg eBook,,
    The Project Gutenberg eBook, by Sun Tzu
    The Project Gutenberg , The Art of War, by Sun Tzu
    
    If you wonder why it fails to find the book name in the 3rd input, even though the book name is there, you need to know how a regular expression works. The engine matches the characters and metacharacters of the regular expression against the characters of the string in sequence. If any character of the string fails to match along the way, that attempt fails, and if no starting position succeeds, no match is found. In the 3rd input the word "eBook" is missing, so the literal phrase never matches.
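The three inputs can be checked in one loop. This is only a sketch with the same pattern as above; the names pattern, inputs and results are chosen for illustration:

```python
import re

pattern = r"The Project Gutenberg eBook,([\w| |:|\']+),"
inputs = [
    "The Project Gutenberg eBook,,",                       # nothing between the commas
    "The Project Gutenberg eBook, by Sun Tzu",             # no closing comma
    "The Project Gutenberg , The Art of War, by Sun Tzu",  # "eBook" is missing
]

results = [re.findall(pattern, text) for text in inputs]
for text, found in zip(inputs, results):
    print(repr(text), "->", found)
```

All three calls return an empty list: the first because + needs at least one character, the second because no comma follows the name, and the third because the literal phrase does not occur.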
  3. Find author's name


    Now, this one should be easy for you. The author's name is at the end of the line and follows the word "by". The metacharacter $ matches at the end of the line.
    re.findall(r'by([\w| ]+)$',line1)
    >>[' Sun Tzu']
    
    Note:
    The function re.match() returns no result with the same regular expression because it checks for a match only at the beginning of the string, while re.findall() and re.search() check for a match anywhere in the string.
    m=re.match(r'by([\w| ]+)$',line1)
    m.group(0)
    AttributeError: 'NoneType' object has no attribute 'group'
    m=re.match(r'.*by([\w| ]+)',line1)
    m.group(0)
    >>'The Project Gutenberg eBook, The Art of War, by Sun Tzu'
    m.group(1)
    >>' Sun Tzu'
    
    You must have noticed that the regular expression contains the following new metacharacters.
    • . : Normally matches any character except a newline.
    • * : Matches the preceding pattern element zero or more times. Compare it with + .
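To see the three functions side by side, and the difference between * and +, here is a small sketch. The string "a ab abb" is a made-up sample:

```python
import re

line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"
pattern = r"by([\w ]+)$"

print(re.match(pattern, line1))            # None: the line does not start with "by"
print(re.search(pattern, line1).group(1))  # ' Sun Tzu'
print(re.findall(pattern, line1))          # [' Sun Tzu']

# * accepts zero repetitions of "b", + needs at least one
star = re.findall(r"ab*", "a ab abb")
plus = re.findall(r"ab+", "a ab abb")
print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']
```

Checking the result of re.match() against None before calling group() avoids the AttributeError shown in the note.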

2. Intermediate

Let's try to extract the eBook ID, the release date and the last updated date. Consider the following text, and assume it is stored in the variable "data".
The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994  [eBook #132]
[Last updated: January 14, 2012]
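One way to set up the variable, assuming the three lines are simply joined with newline characters (the original post does not show how data was built):

```python
import re

# The three sample lines joined into one string
data = ("The Project Gutenberg eBook, The Art of War, by Sun Tzu\n"
        "Release Date: 1st May 1994  [eBook #132]\n"
        "[Last updated: January 14, 2012]")

print(data)
```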
  1. Find ebook Id


    re.findall(r'[ebook #\d+]',data)
    >>['e', ' ', 'o', 'e', ' ', 'e', 'b', 'e', ' ', 'e', 'o', 'o', 'k', ' ', 'e', ' ', ' ', 'o', ' ', ' ', 'b', ' ', ' ', 'e', 'e', 'e', ' ', 'e', ' ', '1', ' ', ' ', '1', '9', '9', '4', ' ', ' ', 'e', 'o', 'o', 'k', ' ', '#', '1', '3', '2', ' ', 'e', ' ', ' ', '1', '4', ' ', '2', '0', '1', '2']
    So, what went wrong? It interprets [ ] as a metacharacter and looks for a single character that is 'e', 'b', 'o', 'k', ' ', '#', '+' or a digit. We need to put "\" before [ and ] so that they are interpreted as ordinary characters and not as metacharacters.
    re.findall(r'\[eBook #(\d+)\]',data)
    >>['132']
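When the literal text comes from somewhere else, the standard function re.escape() can add the backslashes. A minimal sketch, using only the second line of the sample text; the names prefix, suffix and ids are chosen for illustration:

```python
import re

data = "Release Date: 1st May 1994  [eBook #132]"

# re.escape puts a backslash in front of characters that are special in a regex
prefix = re.escape("[eBook #")
suffix = re.escape("]")

ids = re.findall(prefix + r"(\d+)" + suffix, data)
print(ids)  # ['132']
```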
    
  2. Find Release Date

    We will try to search for "Release Date: 1st May 1994" in the text.
    re.findall(r'Release Date:\s*\d{1,2}st\s*[A-Z|a-z]{3,9}\s*\d{2,4}',data)
    >>['Release Date: 1st May 1994']
    
    re.findall(r'Release Date:\s*(\d{1,2}\w{1,2}\s*[A-Z|a-z]{3,9}\s*\d{2,4})',data)
    >>['1st May 1994']
    
    re.findall(r'Release Date:\s*(\d{1,2}(st|nd|rd|th)\s*[A-Z|a-z]{3,9}\s*\d{2,4})',data)
    >>[('1st May 1994', 'st')]
    
    Hmm.. Did you find it complex? It is not. Let's break it down.
    1. Release Date:\s* looks for "Release Date:" in the text. \s* accounts for zero or more whitespace characters.
    2. After "Release Date:" we look for the day: 1st, 2nd, ..., 15th etc. \d{1,2}st matches one or two digits followed by "st", which is enough to match 1st.
      To also allow "nd", "rd" and "th" we can use "|" inside the group metacharacter "( )", as in \d{1,2}(st|nd|rd|th).
      A simpler but looser way is to accept any one or two word characters with \d{1,2}\w{1,2} .
      {M,N} denotes a minimum of M and a maximum of N repetitions.
    3. Now we look for the month name. Its length is between 3 ("May") and 9 ("September") letters, so [A-Z|a-z]{3,9} matches a sequence of 3 to 9 letters. Here it matches "May".
    4. At the end, \d{2,4} looks for the year. Here it matches 1994.
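The third call returns a tuple because the pattern has two groups, and re.findall() then returns one tuple per match. A sketch of two alternatives, assuming we want only the date or its separate parts: a non-capturing group (?: ) and named groups (?P<name> ). The group names day, month and year are chosen for illustration:

```python
import re

data = "Release Date: 1st May 1994  [eBook #132]"

# (?:...) groups the alternatives without creating a capture group
pattern = r"Release Date:\s*(\d{1,2}(?:st|nd|rd|th)\s*[A-Za-z]{3,9}\s*\d{2,4})"
dates = re.findall(pattern, data)
print(dates)  # ['1st May 1994']

# Named groups give each part of the date its own label
named = r"(?P<day>\d{1,2})(?:st|nd|rd|th)\s*(?P<month>[A-Za-z]{3,9})\s*(?P<year>\d{2,4})"
m = re.search(named, data)
print(m.group("day"), m.group("month"), m.group("year"))  # 1 May 1994
```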
  3. FYI: Get a datetime object from a string in Python

    from datetime import datetime
    
    re.sub("(st|nd|rd|th)",",","1st May 1994")
    >>'1, May 1994'
    datetime.strptime('1, May 1994','%d, %B %Y')
    >>datetime.datetime(1994, 5, 1, 0, 0)
    
    # let's try this out
    re.sub("(st|nd|rd|th)",",","05th August 2013")
    >> '05, Augu, 2013'
    
    # we need a correction
    re.sub(r"(\d+)(st|nd|rd|th)",r"\g<1>","05th August 2013")
    >> '05 August 2013'
    (st|nd|rd|th) also replaces the "st" in August. So we add (\d+) to make sure that st, nd, rd or th comes right after the day number, and we replace the entire match with the day number alone. To refer to the day number we use \g<1>, which refers to the first group of the regex.
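The two steps can be wrapped in one function. This is a sketch only: parse_ordinal_date is a hypothetical helper name, and %B assumes English month names (the default C locale):

```python
import re
from datetime import datetime

def parse_ordinal_date(text):
    """Hypothetical helper: turn a string like '1st May 1994' into a datetime."""
    # Drop the suffix only when it directly follows the day digits
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\g<1>", text)
    return datetime.strptime(cleaned, "%d %B %Y")

print(parse_ordinal_date("1st May 1994"))      # 1994-05-01 00:00:00
print(parse_ordinal_date("05th August 2013"))  # 2013-08-05 00:00:00
```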
  4. Find Last updated date

    It should be easy for you now. Let's give it a try.
    re.findall(r"Last updated:\s*\w{3,9}\s*\d{1,2},\s*\d{2,4}",data)
    >>['Last updated: January 14, 2012']
    re.findall(r"Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})",data)
    >>['January 14, 2012']
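The captured string can be converted to a datetime in the same way as the release date. A minimal sketch, using only the last line of the sample text and assuming English month names for %B; the variable name updated is chosen for illustration:

```python
import re
from datetime import datetime

data = "[Last updated: January 14, 2012]"

m = re.search(r"Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})", data)
# group(1) is 'January 14, 2012', which fits the format "%B %d, %Y"
updated = datetime.strptime(m.group(1), "%B %d, %Y")
print(updated.date())  # 2012-01-14
```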
References:
1. http://docs.python.org/2/library/re.html
2. http://www.gutenberg.org/ebooks/132
3. http://en.wikipedia.org/wiki/Regular_expression
