Question: Regex Refinement


I'm trying to parse addresses out of blocks of text, and have the following expression to do so:

/\d+\s(?:[sewnSEWN]\.?\s)?[\d\w]+\s(?:(?:[\d\w]+\s){0,3})?\w+\.?/

It currently parses addresses such as:

300 E. Randolph St. Chicago, IL >> Returns 300 E. Randolph St.

5553 Bay Shore Drive >> Returns input

23 Joseph E Lowery Boulevard >> Returns input

513 Martin Luther King Jr Boulevard >> Returns input

This is exactly what I want. Since this is the first expression I have ever written, I was wondering whether there is a way to shorten or refine it a little.

asked Sep 13, 2013 in Java Interview Questions by anonymous
edited Sep 12, 2013
0 votes
18 views




2 Answers

0 votes

I don't know which regex implementation you are using, so translate this to your language's syntax as needed.

\w = [a-zA-Z0-9_], which already includes the digits, so [\d\w] is the same as \w.
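The character-class claim is easy to check empirically. The sketch below assumes Python's re module (the question does not say which engine is in use) and probes an arbitrary alphabet of printable ASCII characters; in most engines \w also matches the underscore.

```python
import re
import string

# Arbitrary probe alphabet: letters, digits, punctuation and a space.
candidates = string.ascii_letters + string.digits + string.punctuation + " "

# Characters accepted by each class when matched on their own.
matched_by_w = {c for c in candidates if re.fullmatch(r"\w", c)}
matched_by_dw = {c for c in candidates if re.fullmatch(r"[\d\w]", c)}

# Adding \d to a class that already contains \w changes nothing.
print(matched_by_w == matched_by_dw)

# Digits and the underscore are word characters; a period is not.
print("7" in matched_by_w, "_" in matched_by_w, "." in matched_by_w)
```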

Note that (?:(?:\w+\s){0,3})? is the same as (?:\w+\s){0,3}: the inner group can already match zero times, so wrapping it in another optional group adds nothing.

You can also fold the \w+\s that precedes it into the repeated group and change the range from {0,3} to {1,4}.
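To see that these two rewrites do not change what the word-matching part accepts, one can compare the original fragment with the shortened one. This is a minimal sketch assuming Python's re module; the helper accepted_word_counts and the placeholder text "word " are made up for illustration.

```python
import re

# Word-matching part of the original expression, copied from the question.
ORIGINAL_FRAGMENT = r"[\d\w]+\s(?:(?:[\d\w]+\s){0,3})?"
# The same part after dropping [\d\w], the redundant optional group,
# and folding the leading word into the repetition.
SIMPLIFIED_FRAGMENT = r"(?:\w+\s){1,4}"

def accepted_word_counts(fragment, max_words=6):
    """Return the word counts n for which `fragment` fully matches n words,
    each followed by a single space (hypothetical helper)."""
    accepted = []
    for n in range(max_words + 1):
        text = "word " * n
        if re.fullmatch(fragment, text):
            accepted.append(n)
    return accepted

print(accepted_word_counts(ORIGINAL_FRAGMENT))
print(accepted_word_counts(SIMPLIFIED_FRAGMENT))
```

Both fragments accept between one and four space-terminated words, which is why the shorter form can replace the longer one.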

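Here is the resulting pattern, laid out in free-spacing style (the x flag in engines that support it) and annotated against your last example. I don't know much about your format, so after it I list what I find odd.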

/
  \d+                     # 513
  \s
  (?:[sewnSEWN]\.?\s)?    # optional direction, e.g. E. (absent in this example)
  (?:\w+\s){1,4}          # Martin Luther King Jr
  \w+                     # Boulevard
  \.?                     # optional trailing period, as in St.
/

    The spaces are restricted to a single whitespace character (\s). Does it need to be that strict? Perhaps you want \s+.
    If I understand you correctly, the portion after the optional N/S/E/W direction must contain at least two and at most five words separated by whitespace. Is this a correct interpretation?
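Both points can be tried out directly. The following is only a sketch, assuming Python's re module with re.VERBOSE as the free-spacing mode; build_address_pattern and extract are hypothetical helpers, and the inputs with a doubled space, a single-word street, and six words are invented test cases, not taken from the question.

```python
import re

def build_address_pattern(space=r"\s"):
    """Compile the shortened expression, using `space` as the separator token.

    Pass r"\s+" to tolerate runs of whitespace (hypothetical helper).
    """
    return re.compile(rf"""
        \d+                           # house number, e.g. 513
        {space}
        (?:[sewnSEWN]\.?{space})?     # optional direction, e.g. E.
        (?:\w+{space}){{1,4}}         # one to four words plus separator
        \w+                           # final word, e.g. Boulevard
        \.?                           # optional trailing period
    """, re.VERBOSE)

def extract(pattern, text):
    """Return the first address-like match in `text`, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

strict = build_address_pattern()          # single \s between parts
lenient = build_address_pattern(r"\s+")   # one or more whitespace characters

# A doubled space defeats the strict version but not the lenient one.
print(extract(strict, "5553  Bay Shore Drive"))
print(extract(lenient, "5553  Bay Shore Drive"))

# Word-count bounds: fewer than two words fails, more than five is cut off.
print(extract(strict, "12 Broadway"))
print(extract(strict, "9 One Two Three Four Five Six"))
```

Whether the lenient separator is appropriate depends on how clean the input text is; the two-to-five word bound is a property of the expression itself and applies to both variants.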
 

answered Sep 13, 2013 by anonymous
edited Sep 12, 2013

...