Visualization of the regular expression used to find the date "1st May 1994"
1. Easy: Let's begin
We will find the book name and the author name in a line taken from a Gutenberg ebook:
The Project Gutenberg eBook, The Art of War, by Sun Tzu
Identify Gutenberg eBook.
A Gutenberg ebook starts with the phrase "The Project Gutenberg eBook".
Python code:
import re
line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"
re.findall('The Project Gutenberg eBook', line1)
>>['The Project Gutenberg eBook']
Find book name
The first one was very easy. Now let's try to find the book name. After observing the text, we can conclude that:
- The book name starts after the Gutenberg phrase "The Project Gutenberg eBook" and a comma, and it ends with a comma. So the pattern we are looking for is #GutenbergPhrase,space#BookName, .
- The book name may contain a sequence of one or more alphanumeric characters, spaces, colons and apostrophes.
# r'...' is a raw string, so backslashes reach the regular expression unchanged
re.findall(r'The Project Gutenberg eBook,[A-Za-z0-9\s:\']+,', line1)
>>['The Project Gutenberg eBook, The Art of War,']
re.findall(r'The Project Gutenberg eBook,[A-Za-z0-9 :\']+,', line1)
>>['The Project Gutenberg eBook, The Art of War,']
re.findall(r'The Project Gutenberg eBook,[\w :\']+,', line1)
>>['The Project Gutenberg eBook, The Art of War,']
Let's break it down.
- "The Project Gutenberg eBook" finds the same string in the text.
- ",[A-Za-z0-9\s:\']+," , ",[A-Za-z0-9 :\']+," and ",[\w :\']+," each match a sequence of alphanumeric characters, spaces, colons and apostrophes enclosed by commas, e.g. ", The Art of War," or ", Zealot: The Life and Times of Jesus of Nazareth,".
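- [ ] : A set of characters; matches any single character listed inside it.
- + : Matches the preceding pattern element one or more times.
- | : Separates alternative possibilities, as in (st|nd|rd|th). Inside [ ] it is an ordinary character, so the members of a set are simply listed one after another.
- \s : Matches a whitespace character, which includes space, \t, \r and \n.
- \w : Matches an alphanumeric character, including "_".
- \ : Treats the next character as a literal character. Here we have used it before ' so that the quote is read as part of the pattern.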
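The difference between | outside and inside a set is easy to check. The following is a minimal sketch; the string `sample` is made up for illustration and does not come from the ebook.

```python
import re

# Hypothetical sample text, not taken from the ebook.
sample = "cat|dog"

# Outside a set, | separates alternatives: match "cat" or "dog".
alternatives = re.findall(r"cat|dog", sample)

# Inside a set, | is just one more character the set may match.
with_pipe = re.findall(r"[a-z|]+", sample)

# Without | in the set, each match stops at the pipe.
without_pipe = re.findall(r"[a-z]+", sample)

print(alternatives)   # ['cat', 'dog']
print(with_pipe)      # ['cat|dog']
print(without_pipe)   # ['cat', 'dog']
```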
re.findall(r'The Project Gutenberg eBook,([\w :\']+),', line1)
>>[' The Art of War']
- ( ) : Matches whatever regular expression is inside the parentheses and marks the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched again later in the pattern by referring to the group's sequence number: \1, \2, etc. in Python.
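To make the two uses of a group concrete, here is a small sketch. It uses re.search() so that the whole match and the group can be read from the same match object, then shows a backreference with \1. The string `repeated` is a made-up example.

```python
import re

line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"

# Retrieve the group after the match has been performed.
m = re.search(r"The Project Gutenberg eBook,([\w :']+),", line1)
whole = m.group(0)           # everything the pattern matched
book = m.group(1).strip()    # only the group; strip() drops the leading space

print(whole)  # The Project Gutenberg eBook, The Art of War,
print(book)   # The Art of War

# Backreference: \1 must repeat exactly what group 1 matched.
repeated = "the the art of of war"   # hypothetical text with doubled words
doubled = re.findall(r"\b(\w+) \1\b", repeated)
print(doubled)  # ['the', 'of']
```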
Now try the same expression on the following three inputs:
The Project Gutenberg eBook,,
The Project Gutenberg eBook, by Sun Tzu
The Project Gutenberg , The Art of War, by Sun Tzu
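These inputs can be run through the book-name expression in one go. This is a sketch under the assumption that the inputs are three separate strings; the names `pattern`, `inputs` and `results` are only for illustration. None of them yields a book name: the first two have no title enclosed by commas, and the third does not contain the exact Gutenberg phrase.

```python
import re

pattern = r"The Project Gutenberg eBook,([\w :']+),"

inputs = [
    "The Project Gutenberg eBook,,",
    "The Project Gutenberg eBook, by Sun Tzu",
    "The Project Gutenberg , The Art of War, by Sun Tzu",
]

# findall() returns an empty list when nothing matches.
results = [re.findall(pattern, text) for text in inputs]

for text, found in zip(inputs, results):
    print(repr(text), "->", found)
```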
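If you wonder why it fails to find the book name in the 3rd input, you need to know how a regular expression works. The software matches the characters and metacharacters of the regular expression against the characters of the string in sequence. If at any point a character of the string does not match the corresponding character or metacharacter of the expression, that attempt is abandoned, and when no starting position works it declares that no match was found. In the 3rd input the word "eBook" is missing from the phrase, so the match fails before the book name is ever reached.
Find author's name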
Now, this one should be easy for you. The author's name is at the end of the line and follows the word "by".
re.findall(r'by([\w ]+)$', line1)
>>[' Sun Tzu']
Note: The function re.match() returns no result with the same regular expression, because it only checks for a match at the beginning of the string, while re.findall() and re.search() check for a match anywhere in the string.
m = re.match(r'by([\w ]+)$', line1)
m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'
m = re.match(r'.*by([\w ]+)', line1)
m.group(0)
>>'The Project Gutenberg eBook, The Art of War, by Sun Tzu'
m.group(1)
>>' Sun Tzu'
You must have noticed that these regular expressions contain the following new metacharacters.
- . : Normally matches any character except a newline.
- * : Matches the preceding pattern element zero or more times. Compare it with + .
- $ : Matches at the end of the string (or just before a trailing newline).
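The difference between re.match(), re.search() and re.findall() can be seen side by side. The sketch below also guards against the None result that caused the AttributeError; the variable names are only for illustration.

```python
import re

line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"
pattern = r"by([\w ]+)$"

at_start = re.match(pattern, line1)     # None: the line does not start with "by"
anywhere = re.search(pattern, line1)    # first match found anywhere in the line
all_hits = re.findall(pattern, line1)   # list with the text of group 1 per match

# Check for None before calling .group() to avoid the AttributeError.
author = anywhere.group(1).strip() if anywhere else None

print(at_start)   # None
print(author)     # Sun Tzu
print(all_hits)   # [' Sun Tzu']
```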
2. Intermediate
Let's try to extract the ebook Id, the release date and the last updated date. Consider the following text, and assume it is stored in the variable "data".
The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994 [eBook #132]
[Last updated: January 14, 2012]
Find ebook Id
re.findall(r'[ebook #\d+]', data)
>>['e', ' ', 'o', 'e', ' ', 'e', 'b', 'e', ' ', 'e', 'o', 'o', 'k', ' ', 'e', ' ', ' ', 'o', ' ', ' ', 'b', ' ', ' ', 'e', 'e', 'e', ' ', 'e', ' ', '1', ' ', ' ', '1', '9', '9', '4', ' ', ' ', 'e', 'o', 'o', 'k', ' ', '#', '1', '3', '2', ' ', 'e', ' ', ' ', '1', '4', ' ', '2', '0', '1', '2']
So, what went wrong? Well, it interprets "[ ]" as a metacharacter and looks for single characters that match 'e', 'b', 'o', 'k', ' ', '#', '+' or any digit. We need to put "\" before [ and ] so that they are interpreted as literal characters, not as metacharacters.
re.findall(r'\[eBook #(\d+)\]', data)
>>['132']
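When a literal piece of text contains several special characters, the standard library function re.escape() can add the backslashes for you. The sketch below compares it with escaping by hand; `data` is shortened to the one relevant line, and the variable names are only for illustration.

```python
import re

data = "Release Date: 1st May 1994 [eBook #132]"

# Escaped by hand: \[ and \] are literal brackets.
by_hand = re.findall(r"\[eBook #(\d+)\]", data)

# re.escape() escapes every special character in a literal string.
prefix = re.escape("[eBook #")
suffix = re.escape("]")
generated = re.findall(prefix + r"(\d+)" + suffix, data)

ebook_id = int(by_hand[0])   # convert the matched digits to a number

print(by_hand)    # ['132']
print(generated)  # ['132']
print(ebook_id)   # 132
```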
Find Release Date
We will try to search for "Release Date: 1st May 1994" in the text.
# match the whole phrase, with a fixed "st" suffix
re.findall(r'Release Date:\s*\d{1,2}st\s*[A-Za-z]{3,9}\s*\d{2,4}', data)
>>['Release Date: 1st May 1994']
# capture only the date, allowing any one or two word characters as suffix
re.findall(r'Release Date:\s*(\d{1,2}\w{1,2}\s*[A-Za-z]{3,9}\s*\d{2,4})', data)
>>['1st May 1994']
# allow only the suffixes st, nd, rd and th; the inner group is returned as well
re.findall(r'Release Date:\s*(\d{1,2}(st|nd|rd|th)\s*[A-Za-z]{3,9}\s*\d{2,4})', data)
>>[('1st May 1994', 'st')]
Hmm.. Did you find it complex? It is not. Let's break it down.
- Release Date:\s* looks for "Release Date:" in the text. \s* accounts for zero or more whitespace characters.
- After "Release Date:" we look for the day: 1st, 2nd, .., 15th etc. \d{1,2}st matches one or two digits followed by "st", which is enough to match "1st".
To also allow "nd", "rd" and "th", we can use "|" with the group metacharacter "( )", as in \d{1,2}(st|nd|rd|th).
A simpler, though looser, way is to accept any one or two word characters after the digits with \d{1,2}\w{1,2} .
{M,N} denotes the minimum M and the maximum N match count.
- Now we look for the month name. The length of a month name is between 3 and 9 letters, so [A-Za-z]{3,9} matches a sequence of 3 to 9 letters. Here it matches "May".
- At the end, \d{2,4} looks for the year. Here it matches 1994.
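The same breakdown can be written into the pattern itself. The sketch below uses the re.VERBOSE flag, which ignores unescaped whitespace in the pattern and allows comments, and named groups of the form (?P<name>...). The group names and the line breaks inside `data` are assumptions made for illustration.

```python
import re

# Sample text from this section; the line breaks are an assumption.
data = """The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994 [eBook #132]
[Last updated: January 14, 2012]"""

release_date = re.compile(r"""
    Release\ Date:\s*            # literal label (space escaped), optional whitespace
    (?P<day>\d{1,2})             # day of the month: one or two digits
    (?:st|nd|rd|th)\s*           # ordinal suffix, grouped but not captured
    (?P<month>[A-Za-z]{3,9})\s*  # month name: 3 to 9 letters
    (?P<year>\d{2,4})            # year: two to four digits
""", re.VERBOSE)

m = release_date.search(data)
parts = m.groupdict()   # named groups as a dictionary

print(parts)  # {'day': '1', 'month': 'May', 'year': '1994'}
```

Writing (?:...) instead of (...) for the suffix keeps it out of the results, which avoids the extra 'st' seen in the tuple output.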
Find Last updated date
It should be easy for you now. Let's give it a try.
re.findall(r"Last updated:\s*\w{3,9}\s*\d{1,2},\s*\d{2,4}", data)
>>['Last updated: January 14, 2012']
re.findall(r"Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})", data)
>>['January 14, 2012']
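The three extractions from this section can be collected in one place. The function below is only a sketch: the name extract_metadata, the dictionary keys and the line breaks inside `data` are assumptions, while the patterns are the ones developed above.

```python
import re

def extract_metadata(text):
    """Return the ebook id, release date and last-updated date found in text."""
    patterns = {
        "ebook_id": r"\[eBook #(\d+)\]",
        "release_date": r"Release Date:\s*(\d{1,2}\w{1,2}\s*[A-Za-z]{3,9}\s*\d{2,4})",
        "last_updated": r"Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})",
    }
    found = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, text)
        found[name] = m.group(1) if m else None  # None when a field is missing
    return found

# Sample text from this section; the line breaks are an assumption.
data = """The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994 [eBook #132]
[Last updated: January 14, 2012]"""

info = extract_metadata(data)
print(info)
# {'ebook_id': '132', 'release_date': '1st May 1994', 'last_updated': 'January 14, 2012'}
```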
FYI: Get a datetime object from a string in Python
import re
from datetime import datetime
re.sub("(st|nd|rd|th)", ",", "1st May 1994")
>>'1, May 1994'
datetime.strptime('1, May 1994', '%d, %B %Y')
>>datetime.datetime(1994, 5, 1, 0, 0)
# Let's try this on another date.
re.sub("(st|nd|rd|th)", ",", "05th August 2013")
>>'05, Augu, 2013'
# The "st" in "August" was replaced too, so we need a correction.
re.sub(r"(\d+)(st|nd|rd|th)", r"\g<1>", '05th August, 2013')
>>'05 August, 2013'
(st|nd|rd|th) also replaces the "st" in "August". So we add (\d+) to check that st, nd, rd or th comes right after the day number, and we replace the entire match with the day number alone. To refer to the day number we use \g<1>, which refers to the first group of the regex.
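Putting the substitution and the parsing together gives a small helper. This is a sketch rather than a complete date parser: the function name parse_ordinal_date is made up, and %B assumes English month names, which depends on the locale in use.

```python
import re
from datetime import datetime

def parse_ordinal_date(text):
    """Parse dates such as '1st May 1994' into a datetime object."""
    # Drop the suffix only when it directly follows the day digits,
    # so the "st" in "August" is left alone.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\g<1>", text)
    # %d = day, %B = full month name (assumes an English locale), %Y = year
    return datetime.strptime(cleaned, "%d %B %Y")

first = parse_ordinal_date("1st May 1994")
second = parse_ordinal_date("05th August 2013")

print(first)   # 1994-05-01 00:00:00
print(second)  # 2013-08-05 00:00:00
```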