apache htaccess rules for multiple if/or rules including user agent, cookie, uri and filename

asked Sep 22, 2013 in APACHE by rajesh
0 votes

1 Answer

0 votes
Because of the way rules are processed in .htaccess, there's simply no way to express nested if/or logic with the kind of construct or parsing you'd use in a programming language. I had similar questions in the past, and had so much difficulty getting a complete answer that when I finally did, I wrote it down so I could find it again later. Here's what I wrote to myself:
## After quite a bit of puzzlement and seemingly maddeningly
##  vague documentation, I finally figured out exactly how mod_rewrite's
##  [OR] flag really works: In mod_rewrite there's not really any
##  "precedence"; RewriteCond's are simply processed sequentially.
##  Without any modification, the default is to AND _everything_.
##  Including the [OR] modifier on some RewriteCond's creates a
##  two-level expression with only ANDs at the outer/upper level and
##  only ORs at the inner/lower level. Thus
##  RewriteCond a [OR]
##  RewriteCond b
##  RewriteCond c [OR]
##  RewriteCond d
##  RewriteCond e [OR]
##  RewriteCond f [OR]
##  RewriteCond g
##  is equivalent to the boolean expression
##  ((a OR b) AND (c OR d) AND (e OR f OR g))
## There's _no_ way to have ANDs at the _lower/inner_ level and ORs
##  at the _upper/outer_ level; such constructs can only be implemented with
##  either multiple rulesets (and unavoidable duplication), or the
##  introduction of intermediate environment variables.
## Thus the only advantages of [OR] over a | in an RE are increased
##  clarity/maintainability, and the possibility of checking against
##  unrelated variables. REs with lots of |, on the other hand, are
##  assumed to be much faster.
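The intermediate environment variable approach mentioned in the notes can be sketched roughly as below. This is only an illustration, not a tested ruleset: the variable name MATCHED and the four conditions are placeholders standing in for (a AND b) OR (c AND d), and the final guard against /index.html is an assumption added to keep the rule from firing on its own target.

```apache
RewriteEngine On

# Inner AND group 1 (placeholder conditions): set a flag, rewrite nothing
RewriteCond %{HTTP_USER_AGENT} BotOne
RewriteCond %{REQUEST_URI} ^/?$
RewriteRule ^ - [E=MATCHED:1]

# Inner AND group 2 (placeholder conditions): set the same flag
RewriteCond %{HTTP_USER_AGENT} OtherBot
RewriteCond %{REQUEST_FILENAME} utm_source
RewriteRule ^ - [E=MATCHED:1]

# Outer OR: act if either group set the flag
RewriteCond %{ENV:MATCHED} =1
RewriteCond %{REQUEST_URI} !^/index\.html$
RewriteRule ^ /index.html [L]
```

Each flag-setting rule uses "-" as its substitution, so it changes nothing about the request and only records that its group matched.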
If I understand your need correctly, the whole thing can be thought of as one giant conditional with the blocks connected not by subsidiary 'if' clauses but rather by AND, like this:
((- HTTP_USER_AGENT includes BotOne
- or HTTP_USER_AGENT includes OtherBot
- or HTTP_COOKIE user_id != 1)
(- REQUEST_URI is "/" main directory
- or REQUEST_FILENAME includes "utm_source"
- or REQUEST_FILENAME includes "utm_medium"
- or REQUEST_FILENAME includes "utm_campaign" and "utm_content")
(- REQUEST_FILENAME doesn't include "/blog/"
- or REQUEST_FILENAME doesn't include "gif"
- or REQUEST_FILENAME doesn't include "jpg"))
- RewriteRule all files to index.html
The biggest complication I see is the rule about both "utm_campaign" and "utm_content", because as far as I know Regular Expressions (even complex Perl-style ones like those in a .htaccess) don't handle unspecified order at all well. If you know the strings will always be in the same order, you could compose an RE something like "utm_campaign.*utm_content". If the order really is unspecified, to meet your specification exactly you'll need two RewriteCond lines, one for each possible order, something like this:
RewriteCond %{REQUEST_FILENAME} utm_campaign.*utm_content [OR]
RewriteCond %{REQUEST_FILENAME} utm_content.*utm_campaign
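A possible single-condition alternative, offered as a sketch rather than something tested here: Apache 2.x compiles mod_rewrite patterns with PCRE, which supports lookahead assertions, so both substrings can be required in either order. The choice of REQUEST_FILENAME as the test string is an assumption carried over from the pseudo-rules.

```apache
# Each (?=...) is a zero-width lookahead evaluated from the start of the
# string, so the condition holds only if both substrings occur somewhere,
# in either order.
RewriteCond %{REQUEST_FILENAME} ^(?=.*utm_campaign)(?=.*utm_content)
```

The two-line [OR] form is easier to read at a glance. The lookahead form mainly helps when more than two strings must all be present, since the number of possible orderings grows quickly.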
It seems to me some of your REs don't express exactly the same thing your pseudo-rules actually say. For example:
REQUEST_FILENAME includes "utm_source"
should become
RewriteCond %{REQUEST_FILENAME} utm_source
whereas
RewriteCond %{REQUEST_FILENAME} ^utm_source
actually implements
REQUEST_FILENAME **startswith** utm_source
Also, I'd allow for weird browsers that send the root as nothing at all, something like below (also note there's no separate upper and lower case versions of '/', so the [NC] just gives you a slight performance hit for no good reason). And note you need beginning ('^') and ending ('$') of string anchors, else you'll match things like "/xxx/yyy/zzz" too since they contain a slash:
RewriteCond %{REQUEST_URI} ^/?$ [OR]
Finally, only match the part of the string you care about; there's no need to match the rest of the string (and in fact trying to match the rest of the string often causes weird unnecessary errors). In other words, the presence of ".*" in a .htaccess RE usually indicates some kind of unnecessary weirdness that at best takes a bite out of performance and at worst masks some errors. Rather than saying "utm_source.*" just say "utm_source".
At first glance, your logic with the multiple conditions looks right to me (fortunate, because there are so many ways to get complex conditions like these jumbled). So if it doesn't work, I'd suspect other problems with the rules (especially the Regular Expressions) rather than a logic/precedence error. (Also, my guess is the problems have several different causes rather than just one common root cause, so fixing one problem is not likely to fix all the others too.)
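Putting the pieces together, the whole pseudo-rule block might look roughly like the sketch below. It follows the pseudo-rules literally and is untested. The cookie pattern assumes an ordinary "name=value; name=value" Cookie header, and the /index.html target is taken from the "RewriteRule all files to index.html" line. If the utm_ values actually arrive in the query string, %{QUERY_STRING} would be the variable to test instead of REQUEST_FILENAME.

```apache
RewriteEngine On

# Group 1: any one of these is enough
RewriteCond %{HTTP_USER_AGENT} BotOne [OR]
RewriteCond %{HTTP_USER_AGENT} OtherBot [OR]
# True when no cookie "user_id=1" is present (assumed cookie format)
RewriteCond %{HTTP_COOKIE} !(^|;\s*)user_id=1(;|$)

# Group 2: ANDed with group 1; any one of these is enough
RewriteCond %{REQUEST_URI} ^/?$ [OR]
RewriteCond %{REQUEST_FILENAME} utm_source [OR]
RewriteCond %{REQUEST_FILENAME} utm_medium [OR]
RewriteCond %{REQUEST_FILENAME} utm_campaign.*utm_content [OR]
RewriteCond %{REQUEST_FILENAME} utm_content.*utm_campaign

# Group 3: ANDed with groups 1 and 2. As written in the pseudo-rules,
# this group fails only when the filename contains all three strings.
RewriteCond %{REQUEST_FILENAME} !/blog/ [OR]
RewriteCond %{REQUEST_FILENAME} !gif [OR]
RewriteCond %{REQUEST_FILENAME} !jpg

RewriteRule ^ /index.html [L]
```

The last condition in each group carries no [OR], which is what closes the group and ANDs it with the next one.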
answered Sep 22, 2013 by rajesh