Sunday, 4 August 2013

Tutorial: Regular Expressions in Python with Examples

Regular expression visualization
Visualization of regular expression to find date "1st May 1994"


1. Easy : Let's begin

We will extract the book name and the author's name from a line taken from a Project Gutenberg ebook.
The Project Gutenberg eBook, The Art of War, by Sun Tzu
  1. Identify Gutenberg eBook.


    A Gutenberg ebook starts with the phrase "The Project Gutenberg eBook".
    Python Code:
    import re
    
    line1="The Project Gutenberg eBook, The Art of War, by Sun Tzu"
    
    re.findall('The Project Gutenberg eBook',line1);
    
    >>['The Project Gutenberg eBook']
    
    
    
  2. Find book name


    The first one was very easy. Now let's try to find the book name. After observing the text we can conclude that:
    • The book name starts after the Gutenberg phrase "The Project Gutenberg eBook" followed by a comma, and it ends with a comma. So the pattern we are looking for is #GutenbergPhrase,space#BookName, .
    • The book name may contain a sequence of one or more alphanumeric characters and spaces.
    We can write the regular expression using a combination of the metacharacters " [ ], + , | ,\s ,\w " to find the book name.
    re.findall('The Project Gutenberg eBook,[A-Z|a-z|0-9|\s|:|\']+,',line1);
    >>['The Project Gutenberg eBook, The Art of War,']
    
    re.findall('The Project Gutenberg eBook,[A-Z|a-z|0-9| |:|\']+,',line1);
    >>['The Project Gutenberg eBook, The Art of War,']
    
    re.findall('The Project Gutenberg eBook,[\w| |:|\']+,',line1);
    >>['The Project Gutenberg eBook, The Art of War,']
    
    Let's break it down.
    1. "The Project Gutenberg eBook" matches the same string in the text.
    2. ",[A-Z|a-z|0-9|\s|:|\']+," , " ,[A-Z|a-z|0-9| |:|\']+," and ",[\w| |:|\']+," match a sequence of alphanumeric characters, spaces, colons and apostrophes enclosed by commas, e.g. ", The Art of War," or ",Zealot: The Life and Times of Jesus of Nazareth ," etc.
    • [ ] : Set of possible character matches.
    • + : Matches the preceding pattern element one or more times.
    • | : Separates alternate possibilities. Note that inside [ ] it is only a literal "|", so the sets above also accept a pipe character; [\w :\']+ matches the same book names without it.
    • \s : Matches a whitespace character, which includes space, \t, \r, \n.
    • \w : Matches an alphanumeric character, including "_".
    • \ : Treats the next character as a literal character. Here we have used it for ' .
    Okay, we found a string which contains the book name, but not only the book name. The function re.findall() returns the string that matches the entire pattern. We need only the part matched by [\w| |:|\']+ to get the book name. We can use the metacharacter ( ) for that.
    re.findall('The Project Gutenberg eBook,([\w| |:|\']+),',line1);
    >>[' The Art of War']
    
    
    • ( ) : Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the pattern by referencing the group's sequence number with \1, \2, etc. When the pattern contains a group, re.findall() returns the contents of the group instead of the whole match.
    Try the same regular expression with the following inputs.
    The Project Gutenberg eBook,,
    The Project Gutenberg eBook, by Sun Tzu
    The Project Gutenberg , The Art of War, by Sun Tzu
    
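    If you wonder why it fails to find the book name in the 3rd input, you need to know how a regular expression works. The engine matches the characters and metacharacters of the regular expression against the characters of the string in sequence. If at any point a character of the string does not match the corresponding character or metacharacter of the pattern, that attempt fails; when no starting position in the string yields a complete match, no match is reported. In the 3rd input the literal word "eBook" is missing, so the pattern can never match.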
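To check the three inputs quickly, here is a minimal sketch that runs the grouped pattern against all of them. It assumes a slightly tidied version of the pattern: a raw string, and a character set without the "|" separators (inside [ ] they would only match a literal pipe). The name BOOK_NAME is made up for this example.

```python
import re

# Tidied variant of the pattern above: raw string, no "|" inside the set.
BOOK_NAME = re.compile(r"The Project Gutenberg eBook,([\w :']+),")

inputs = [
    "The Project Gutenberg eBook, The Art of War, by Sun Tzu",
    "The Project Gutenberg eBook,,",
    "The Project Gutenberg eBook, by Sun Tzu",
    "The Project Gutenberg , The Art of War, by Sun Tzu",
]

# findall() returns the group contents, or an empty list when nothing matches.
results = [BOOK_NAME.findall(text) for text in inputs]

for text, found in zip(inputs, results):
    print(repr(text), "->", found)
```

Only the first line yields a book name. The second has nothing between the commas (+ needs at least one character), the third has no closing comma, and the fourth lacks the word "eBook".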
  3. Find author's name


    Now, this one should be easy for you. The author's name is at the end of the line and it follows the word "by".
    re.findall('by([\w| ]+)$',line1);
    >>[' Sun Tzu']
    
    Note:
    The function re.match() returns no result with the same regular expression because it checks for a match only at the beginning of the string, while re.findall() and re.search() check for a match anywhere in the string.
    m=re.match('by([\w| ]+)$',line1);
    m.group(0)
    AttributeError: 'NoneType' object has no attribute 'group'
    m=re.match('.*by([\w| ]+)',line1);
    m.group(0)
    >>'The Project Gutenberg eBook, The Art of War, by Sun Tzu'
    m.group(1)
    >>' Sun Tzu'
    
    You must have noticed that the regular expression contains the following new metacharacters.
    • . : Normally matches any character except a newline.
    • * : Matches the preceding pattern element zero or more times. Compare it with + .
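The difference between re.match() and re.search() from the note above can be checked directly. This is a small sketch that reuses line1 from the earlier examples; the pattern is written as a raw string without the "|" in the set, and the variable names are only for illustration.

```python
import re

line1 = "The Project Gutenberg eBook, The Art of War, by Sun Tzu"
pattern = r"by([\w ]+)$"

# match() only tries position 0, and the line does not start with "by".
match_result = re.match(pattern, line1)

# search() scans forward until the pattern fits somewhere in the string.
search_result = re.search(pattern, line1)

# Guard against None before calling group(), then trim the leading space.
author = search_result.group(1).strip() if search_result else None

print(match_result)  # None
print(author)        # Sun Tzu
```

Checking the result for None before calling group() avoids the AttributeError shown above.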

2. Intermediate

Let's try to extract the ebook ID, the release date and the last updated date. Consider the following text. Please assume the text is in the variable "data".
The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994  [eBook #132]
[Last updated: January 14, 2012]
  1. Find ebook Id


    re.findall('[ebook #\d+]',data)
    >>['e', ' ', 'o', 'e', ' ', 'e', 'b', 'e', ' ', 'e', 'o', 'o', 'k', ' ', 'e', ' ', ' ', 'o', ' ', ' ', 'b', ' ', ' ', 'e', 'e', 'e', ' ', 'e', ' ', '1', ' ', ' ', '1', '9', '9', '4', ' ', ' ', 'e', 'o', 'o', 'k', ' ', '#', '1', '3', '2', ' ', 'e', ' ', ' ', '1', '4', ' ', '2', '0', '1', '2']
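    So, what went wrong? Well, it interprets '[ ]' as a metacharacter and looks for a single character which is one of 'e', 'b', 'o', 'k', ' ', '#', '+' or any digit. We need to put "\" before [ and ] so that they are interpreted as literal characters, not as a metacharacter. Note also that matching is case-sensitive, so the pattern must spell "eBook" exactly as it appears in the text.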
    re.findall('\[eBook #(\d+)\]',data)
    >>['132']
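Here is the same step as a self-contained sketch. It defines the variable data with the three lines quoted above, shows what the unescaped set matches on a short string, and converts the captured ID to an integer. The variable names char_matches and ebook_id are just for this example.

```python
import re

data = """The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994  [eBook #132]
[Last updated: January 14, 2012]"""

# Unescaped brackets define a character set, so every match is one character.
# The capital "B" is skipped because the set only lists lowercase "b".
char_matches = re.findall(r"[ebook #\d+]", "eBook #132")

# Escaped brackets are literal; the group captures only the digits.
m = re.search(r"\[eBook #(\d+)\]", data)
ebook_id = int(m.group(1)) if m else None

print(char_matches)
print(ebook_id)  # 132
```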
    
  2. Find Release Date

    We will try to search for "Release Date: 1st May 1994" in the text.
    re.findall('Release Date:\s*\d{1,2}st\s*[A-Z|a-z]{3,9}\s*\d{2,4}',data)
    >>['Release Date: 1st May 1994']
    
    re.findall('Release Date:\s*(\d{1,2}\w{1,2}\s*[A-Z|a-z]{3,9}\s*\d{2,4})',data)
    >>['1st May 1994']
    
    re.findall('Release Date:\s*(\d{1,2}(st|nd|rd|th)\s*[A-Z|a-z]{3,9}\s*\d{2,4})',data)
    >>[('1st May 1994', 'st')]
    
    Hmm.. Did you find it complex? It is not. Let's break it down.
    1. Release Date:\s* looks for "Release Date:" in the text. \s* accounts for zero or more whitespace characters.
    2. After "Release Date:" we look for the day: 1st, 2nd, .., 15th etc. \d{1,2}st matches a number of minimum length 1 and maximum length 2 followed by "st". The "st" is there to match 1st.
      To also allow "nd", "rd" and "th" we can use "|" with the group metacharacter "( )", as used here: \d{1,2}(st|nd|rd|th). Because this adds a second group, re.findall() returns a tuple with both groups for each match.
      A simpler but looser way would be to look for any one or two word characters with \d{1,2}\w{1,2} .
      {M,N} denotes the minimum M and the maximum N match count.
    3. Now we look for the month name. The length of a month name is between 3 and 9 characters. [A-Z|a-z]{3,9} matches a sequence of 3 to 9 letters. Here it will match "May".
    4. At the end, \d{2,4} looks for the year. Here it will match 1994.
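The breakdown above can also be written into the pattern itself. The sketch below is one possible way, not the only one: it uses re.VERBOSE so each piece sits on its own commented line, and named groups (day, month, year are names chosen for this example) so the parts come back separately. The suffix group is made non-capturing with (?: ).

```python
import re

RELEASE_DATE = re.compile(
    r"""
    Release\ Date:\s*          # literal label, then optional whitespace
    (?P<day>\d{1,2})           # day of month: one or two digits
    (?:st|nd|rd|th)            # ordinal suffix, not captured
    \s*
    (?P<month>[A-Za-z]{3,9})   # month name: 3 to 9 letters
    \s*
    (?P<year>\d{2,4})          # year: two to four digits
    """,
    re.VERBOSE,
)

m = RELEASE_DATE.search("Release Date: 1st May 1994  [eBook #132]")
parts = m.groupdict() if m else {}
print(parts)  # {'day': '1', 'month': 'May', 'year': '1994'}
```

In verbose mode a plain space in the pattern is ignored, which is why the space in "Release Date" is escaped.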
  3. FYI : Get a datetime object from a string in Python

    from datetime import datetime
    
    re.sub("(st|nd|rd|th)",",","1st May 1994")
    >>'1, May 1994'
    datetime.strptime('1, May 1994','%d, %B %Y')
    >>datetime.datetime(1994, 5, 1, 0, 0)
    
    #lets try this out 
    re.sub("(st|nd|rd|th)",",","05th August 2013")
    >> '05, Augu, 2013'
    
    #we need correction.
    re.sub("(\d+)(st|nd|rd|th)","\g<1>",'05th  August, 2013')
    >> '05  August, 2013' 
    (st|nd|rd|th) also replaces the "st" in August. So we add (\d+) to make sure that st, nd, rd or th comes right after the day number, and we replace the entire match with the day number. To refer to the day number we use \g<1>, which refers to the first group of the regex.
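Putting the two steps together, a small helper could look like the sketch below. The function name parse_ordinal_date is hypothetical, and the helper assumes English month names (the default C locale) and the day, month, year order used in the examples above.

```python
import re
from datetime import datetime


def parse_ordinal_date(text):
    """Parse dates such as '1st May 1994' or '05th  August, 2013'."""
    # Drop the suffix only when it directly follows digits, so the "st"
    # inside "August" is left alone.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\g<1>", text)
    # Replace commas with spaces and collapse runs of whitespace.
    cleaned = " ".join(cleaned.replace(",", " ").split())
    return datetime.strptime(cleaned, "%d %B %Y")


first = parse_ordinal_date("1st May 1994")
second = parse_ordinal_date("05th  August, 2013")
print(first)   # 1994-05-01 00:00:00
print(second)  # 2013-08-05 00:00:00
```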
  4. Find Last updated date

    It should be easy for you now. Let's give it a try.
    re.findall("Last updated:\s*\w{3,9}\s*\d{1,2},\s*\d{2,4}",data)
    >>['Last updated: January 14, 2012']
    re.findall("Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})",data)
    >>['January 14, 2012']
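As with the release date, the captured string can be turned into a datetime object. This is a minimal sketch that assumes the "Month day, year" layout shown above and English month names; data is redefined here so the block runs on its own.

```python
import re
from datetime import datetime

data = """The Project Gutenberg eBook, The Art of War, by Sun Tzu
Release Date: 1st May 1994  [eBook #132]
[Last updated: January 14, 2012]"""

m = re.search(r"Last updated:\s*(\w{3,9}\s*\d{1,2},\s*\d{2,4})", data)

# %B is the full month name, %d the day of the month, %Y the 4-digit year.
last_updated = datetime.strptime(m.group(1), "%B %d, %Y") if m else None
print(last_updated)  # 2012-01-14 00:00:00
```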
References:
1. http://docs.python.org/2/library/re.html
2. http://www.gutenberg.org/ebooks/132
3. http://en.wikipedia.org/wiki/Regular_expression

Wednesday, 10 July 2013

Infographics of Histogram Of Oriented Gradients Descriptor/Features



The Histogram of Oriented Gradients (HOG) feature descriptor was first introduced by Navneet Dalal and Bill Triggs. Their work was focused on pedestrian detection. Since then, HOG has been used extensively for object detection in computer vision, for several reasons. First, it is easy to use with discriminative classifiers such as support vector machines. Second, HOG tries to capture the shape of an object from its edges (gradients). Therefore, HOG gives good results when identifying an object against a cluttered background without using any segmentation algorithm.

The research project "Inverting and Visualizing Features for Object Detection" by MIT was very helpful for visualizing what HOG features look like. There is an online tool on the project site where you can upload an image and visualize its HOG features. There is also a good example of people detection using HOG in OpenCV. However, I remained puzzled about how a HOG descriptor can be used with a classification algorithm. For that, I first needed to understand how to construct a HOG descriptor. As a result of my learning I created an infographic to help you understand the main idea.
In the above infographic, I used the HOG descriptor proposed by Dalal and Triggs for pedestrian detection as the illustration. But you can construct a HOG descriptor for any object once you understand the basic idea.
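To make the basic idea a bit more concrete, here is a rough sketch of the first building block only: the orientation histogram of a single cell. It assumes 9 unsigned orientation bins over 0 to 180 degrees, as in Dalal and Triggs' pedestrian detector, and it leaves out the vote interpolation and the block normalization of the full descriptor. The function name cell_histogram and the tiny 4x4 input are made up for illustration.

```python
import math


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(cell), len(cell[0])
    # Border pixels are skipped because central differences need neighbours.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical difference
            magnitude = math.hypot(gx, gy)
            # Fold the angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: dark on the left, bright on the right.
cell = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_histogram(cell)
print(hist)
```

For the vertical edge all gradients point horizontally, so the whole weight lands in the first bin. A full descriptor would concatenate such histograms over many cells before handing the vector to a classifier.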

P.S
Online tool to visualize HOG
"Inverting and Visualizing Features for Object Detection" - http://web.mit.edu/vondrick/ihog/






Monday, 10 June 2013

Grub entry for Ubuntu and Windows 8 side by side installation

1. Add the following to /etc/grub.d/40_custom
    sudo gedit /etc/grub.d/40_custom
    /etc/grub.d/40_custom
    menuentry "Windows8 UEFI"{

            search --set=root --file /EFI/Microsoft/Boot/bootmgfw.efi

            chainloader /EFI/Microsoft/Boot/bootmgfw.efi

    } 

 
2. Execute the following command
    sudo update-grub
 

Wednesday, 1 May 2013

One script to install and optimize Nginx,mysql & php-fpm for 512Mb VPS


Just one script to install and optimize Nginx, MySQL, PHP5 and php-fpm for a 512MB VPS/cloud instance.

Running a website on a virtual private server (I am using a DigitalOcean VPS, $5 per month) or on an Amazon EC2 instance is cool. But if you are doing it for the first time, setting up and optimizing a VPS is a tricky task. You are required to install all the servers and configure them yourself. With default settings, MySQL and Apache will not work well together on the same 512MB VPS. Under heavy load MySQL crashes and your site goes down. That's why I wrote a script which installs all the required servers and fine-tunes them. I chose the Nginx web server over Apache because of its small footprint and lower memory consumption.

Git Repository 

 https://github.com/junedmunshi/SampleCode/blob/master/web/vpsSetup/

OS

Tested on vanilla Ubuntu 12.04.

Install and setup

chmod +x vpsSetup.sh
chmod +x start.sh
chmod +x stop.sh
sudo ./vpsSetup.sh

To start server

sudo ./start.sh

To stop server

sudo ./stop.sh

But before you run the script

Before you blindly run the script, you should look at the configuration files for Nginx and modify them according to your needs. Everything should work off the shelf, except that you need to modify a few parameters in ./nginx/default . Read the Nginx section below.
Moreover, you can also change the configuration for MySQL and PHP if the given configuration files do not suit your needs. Read the MySQL and PHP sections below.

What will the script do?

It will,
1) Add required debian repository

2) Install all required packages ( nginx mysql-server mysql-client memcached php5 php-apc php-auth php-net-smtp php-net-socket php-pear php5-curl php5-gd php5-mcrypt php5-mysql php5-fpm php5-memcached php5-tidy vsftpd )

3) Take backup of original mysql configuration file. ( /etc/mysql/my.cnf -> /etc/mysql/my.cnf.org)

4) Optimise mysql by patching configuration file from ./mysql/my.cnf to /etc/mysql/my.cnf

5) Apply the same steps as 3 and 4 for nginx and php5
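The backup-then-patch steps (3 and 4) can be pictured with the short Python sketch below. It is not the code from vpsSetup.sh, which is a shell script; the function name backup_and_patch is hypothetical, and the demo works on throwaway files in a temporary directory rather than on /etc/mysql/my.cnf.

```python
import shutil
import tempfile
from pathlib import Path


def backup_and_patch(tuned, target):
    """Copy target to target + '.org', then overwrite target with tuned."""
    target = Path(target)
    backup = target.with_name(target.name + ".org")
    # Keep only the first backup so a second run cannot overwrite the original.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(tuned, target)
    return backup


# Demonstration on temporary files that stand in for the real configuration.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    original = tmp / "my.cnf"
    tuned = tmp / "my.cnf.tuned"
    original.write_text("key_buffer = 16M\n")
    tuned.write_text("key_buffer = 16K\n")

    backup = backup_and_patch(tuned, original)
    backup_text = backup.read_text()
    patched_text = original.read_text()

print(backup_text, patched_text)
```

Keeping the .org copy makes it easy to restore the stock configuration if a service refuses to start with the tuned file.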

Files Detail

  • vpsSetup.sh : Installation script
  • mysql/my.cnf : MySQL server configuration file tuned for 512Mb server running mysql and nginx
  • nginx/nginx.conf : Nginx server configuration file tuned for 512Mb server running mysql and nginx
  • nginx/default : Configuration file for your web application. It will be copied to /etc/nginx/sites-available .
  • php5/fpm/php.ini : See the PHP section below
  • php5/fpm/php-fpm.conf : See the PHP section below
  • php5/fpm/pool.d/www.conf : Tuned for 512Mb server.

If services do not start after applying optimal settings

  1. If Nginx does not start, there may be something wrong with either /etc/nginx/nginx.conf or sites-available/default.
    Run sudo service nginx start and look for errors.
  2. If MySQL does not start, please check the MySQL troubleshooting guide: http://junedmunshiblog.blogspot.in/2013/05/troubleshooting-mysql.html

Nginx

1) nginx/default
It is your web application's configuration file. The current settings are configured for PHP, FastCGI and CakePHP.
Things to check before you deploy:
port
server_name : your domain name
root : path to your website code
access_log and error_log : very useful for debugging your website code (not nginx)
location : controls behaviour for specific locations such as "/" or "*.js, *.css, *.jpg" (any assets)
Note:
1) The following line is not required if you are not using CakePHP. It ensures that images, CSS etc. load properly from a theme/plugin without generating a "path not found" error.
try_files $uri $uri/ /../plugins/$1/webroot/$2/$3 /../View/Themed/$2/webroot/$3/$4 ;
2) You may want to turn off access_log and log_not_found for images, XML etc. files once your site functions properly.
2) nginx.conf

MySQL

my.cnf
1 You may consider further tuning the following parameters under the '[mysqld]' section if the current settings are not the best fit for you.
   key_buffer = 16K      (Default is 16M)   
   max_allowed_packet = 1M  (Default is 16M)    
   thread_stack = 64K        (Default is 192K)   
   thread_cache_size  = 4    (Default is 8)   
2 InnoDB
  Disable it if not needed: uncomment #skip-innodb
  Fine-tune it if you are using it.
   innodb_buffer_pool_size = 16M         (Default is 128M)  
   innodb_additional_mem_pool_size = 2M   

PHP

1) ./php5/fpm/php.ini
The only difference between the original and the patched file is shown below.
;cgi.fix_pathinfo=1 -> cgi.fix_pathinfo=0
2) php5/fpm/php-fpm.conf
3) php5/fpm/pool.d/www.conf
php_admin_value[memory_limit] = You can increase if you have more RAM
php_value[upload_max_filesize] = Change it otherwise leave it if you don't care
php_value[max_execution_time] = Change it otherwise leave it if you don't care
user = www-data 
group = www-data  



Troubleshooting MySQL job failed to start


MySQL start fails

$ sudo service mysql start
start: Job failed to start

Troubleshoot

  1. Try to find out what is wrong.
     
    • Start mysqld manually in verbose mode

      $ sudo mysqld --verbose --user=root

    • Check dmesg

      $ dmesg


  2. If mysqld starts manually (Step 1) without any error but `sudo service mysql start` still fails, check that the following settings are correct in your /etc/mysql/my.cnf:

    # The MySQL server
    [mysqld]
    user = mysql
    pid-file = /var/run/mysqld/mysqld.pid
    socket = /var/run/mysqld/mysqld.sock
    port = 3306
    basedir = /usr
    datadir = /var/lib/mysql
    tmpdir = /tmp
    lc-messages-dir = /usr/share/mysql



MySQL crashes under heavy load on Webserver.

 Simulate heavy traffic

 $ ab -n 1000 -c 100 http://yourdomainname/

         ab : Apache HTTP server benchmarking tool
         -n number of requests
         -c concurrent requests

 General cause

  1. Generally MySQL crashes because it runs out of memory.

  2. There could be other reasons as well. Run dmesg or look at the MySQL logs. The paths of the MySQL logs are specified in my.cnf.

Sunday, 7 April 2013

Visualizing Words


Ever wondered how words are connected? Draw a graph to visualize it. Try different layouts.
Download Code:
https://github.com/junedmunshi/SampleCode/tree/master/AI/visualizeWordNet

Required:
  • Python
  • networkx 
  • nltk 
  • nltk.corpus : wordnet 
  • pygraphviz

Checkout subdirectory from github

It is not possible to check out a specific subdirectory from git, as it does not support sub-repositories. However, git supports submodules, so you can add another project's repository as a sub-repository of your project. In that case you are required to create a separate repository for each project.

There is another way to check out a subdirectory without creating separate repositories if you are using GitHub. GitHub now supports svn checkout for git repositories. Therefore, you can check out subdirectories the same way you would with svn.

For example, I created a repository where I put the sample code of the experiments I do. Instead of creating a repository for each piece of sample code, I prefer to organize them logically in a hierarchical directory structure. Code related to artificial intelligence goes into the AI directory, PHP code goes into the web development directory. Now, if you are interested in checking out the sudoku solver, which is inside the AI directory, execute the following command.

svn co https://github.com/junedmunshi/SampleCode/trunk/AI/SudokuSolver