Regular expressions

This page will be organized in the near future. It's presently a dump of what I've sent out to people asking how to create regular expressions for IE URL Lock.

This page documents the typical regular expression patterns that work best with the URL Lock.

A typical regular expression looks like the following:

^http(s)?://www\.thisdomain\.com(/|$)

That will permit http://www.thisdomain.com, http://www.thisdomain.com/, https://www.thisdomain.com/index.html, and any path on that web server, though it won't permit http://thisdomain.com/ as it requires the "www." at the beginning. If you want to make the "www." optional, then the following should work:

^http(s)?://(www\.)?thisdomain\.com(/|$)

It's mostly a matter of putting the optional parts in () and adding a ? after the ) so that it matches zero or one instance of the contents of the (). It's good to begin URL-matching regular expressions with ^ to prevent a URL such as http://malicious.web.site.com/https://www.thisdomain.com/ from working. The (/|$) at the end tells the regular expression matcher that either a / must appear after www.thisdomain.com or www.thisdomain.com must be the end of the URL (the $ matches the end of the URL). The | is used as an "or" operation.

It's important to use the period carefully: if you want to match a literal period, as in microsoft.com, escape it with a backslash so that it's microsoft\.com. An unescaped period matches any character in regular expressions, so something like ^http://www.microsoft.com(/|$) will also let in sites such as http://wwwwmicrosoft.com or http://www-microsoft.com/
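The behavior described above can be checked with a short script. This is only a sketch: it uses Python's re module as a stand-in, which is an assumption, since the library inside IE URL Lock may behave differently in places. The helper name allowed and the two deliberately flawed patterns (UNANCHORED and UNESCAPED) are made up for illustration. It uses re.search, which looks for a match anywhere in the URL, to show why the leading ^ matters.

```python
import re

# Patterns copied from the text above.
WWW_REQUIRED = r"^http(s)?://www\.thisdomain\.com(/|$)"
WWW_OPTIONAL = r"^http(s)?://(www\.)?thisdomain\.com(/|$)"

# Deliberately flawed variants, for comparison only (hypothetical names).
UNANCHORED = r"http(s)?://www\.thisdomain\.com(/|$)"  # missing the leading ^
UNESCAPED = r"^http://www.microsoft.com(/|$)"         # periods not escaped


def allowed(pattern: str, url: str) -> bool:
    """Return True if the pattern matches somewhere in the URL."""
    return re.search(pattern, url) is not None


# The "www." is required by the first pattern and optional in the second.
print(allowed(WWW_REQUIRED, "http://www.thisdomain.com"))              # True
print(allowed(WWW_REQUIRED, "https://www.thisdomain.com/index.html"))  # True
print(allowed(WWW_REQUIRED, "http://thisdomain.com/"))                 # False
print(allowed(WWW_OPTIONAL, "http://thisdomain.com/"))                 # True

# Without ^, the allowed URL can be smuggled into the path of another site.
smuggled = "http://malicious.web.site.com/https://www.thisdomain.com/"
print(allowed(UNANCHORED, smuggled))    # True
print(allowed(WWW_REQUIRED, smuggled))  # False

# An unescaped period matches any character, not just a literal period.
print(allowed(UNESCAPED, "http://wwwwmicrosoft.com"))   # True
print(allowed(UNESCAPED, "http://www-microsoft.com/"))  # True
```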

The regular expression library that IE URL Lock is using has issues with .* when the period is not enclosed in parentheses. That is, (.)* works, but .* does not.

When specifying the host name, escape periods with a backslash; otherwise, any character could stand in for the period, possibly giving access to another web site. Instead of "http(s)://www.website.com/.*", it should be "http(s)?://www\.website\.com/(.)*". I put a ? after the (s) so that http also works, though if you only want to allow https, then "https://www\.website\.com/(.)*" should also work.

For the slash after the web site, another trick can be used to allow "https://www.website.com" in addition to "https://www.website.com/". To do that, it's possible to use "^http(s)?://www\.website\.com(/|$)", which does several things. The ^ at the beginning ensures that the match starts at the beginning of the URL, and the (/|$) at the end matches either when the URL ends right after website.com or when a / follows it. When a / follows, it doesn't care what comes after it, which is similar to the desired effect of (.)* at the end.
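To compare the two endings side by side, here is a minimal sketch, again assuming Python's re module as a stand-in for the library that IE URL Lock uses. The helper allowed, the pattern NO_TERMINATOR, and the host www.website.com.evil.example are hypothetical and only there to show what the (/|$) ending guards against.

```python
import re

# Ending that requires a slash after the host name.
WITH_TRAILING_SLASH = r"^http(s)?://www\.website\.com/(.)*"
# Ending that accepts either a slash or the end of the URL.
SLASH_OR_END = r"^http(s)?://www\.website\.com(/|$)"
# Hypothetical pattern with no ending at all, for comparison only.
NO_TERMINATOR = r"^http(s)?://www\.website\.com"


def allowed(pattern: str, url: str) -> bool:
    """Return True if the pattern matches somewhere in the URL."""
    return re.search(pattern, url) is not None


# The bare host name only passes with the (/|$) ending.
print(allowed(WITH_TRAILING_SLASH, "https://www.website.com"))  # False
print(allowed(SLASH_OR_END, "https://www.website.com"))         # True

# Anything after the slash is accepted by both endings.
print(allowed(WITH_TRAILING_SLASH, "https://www.website.com/some/page.html"))  # True
print(allowed(SLASH_OR_END, "https://www.website.com/some/page.html"))         # True

# With no ending, a longer host name that merely starts with the allowed
# one gets through; (/|$) rejects it.
lookalike = "https://www.website.com.evil.example/"
print(allowed(NO_TERMINATOR, lookalike))  # True
print(allowed(SLASH_OR_END, lookalike))   # False
```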

If you are trying to access website.com without the www, then you will need to construct the regular expression as ^http(s)?://(www\.)?website\.com(/|$). If there are other servers in website.com besides www that you want to provide access to, then you can either list them with | between them, such as ^http(s)?://((www|anotherserver|ftp|one\.with\.a\.subdomain)\.)?website\.com(/|$), or allow them all with a blanket pattern such as ^http(s)?://(([-a-zA-Z0-9])+\.)*website\.com(/|$), which ensures that each subdomain component and the server name contain at least one character and are separated by exactly one period. The latter also allows website.com without the www or any subdomain.
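The difference between the listed and the blanket form can be tried out as follows. As before, this is a sketch that assumes Python's re module behaves closely enough to the library in IE URL Lock. The helper allowed and the test host names (mail, a.b.c, evilwebsite.com) are made up for illustration.

```python
import re

# Patterns copied from the text above.
LISTED = (r"^http(s)?://((www|anotherserver|ftp|one\.with\.a\.subdomain)\.)?"
          r"website\.com(/|$)")
BLANKET = r"^http(s)?://(([-a-zA-Z0-9])+\.)*website\.com(/|$)"


def allowed(pattern: str, url: str) -> bool:
    """Return True if the pattern matches somewhere in the URL."""
    return re.search(pattern, url) is not None


# The listed form accepts only the named servers (or no server at all).
print(allowed(LISTED, "http://website.com"))                           # True
print(allowed(LISTED, "https://ftp.website.com/"))                     # True
print(allowed(LISTED, "https://one.with.a.subdomain.website.com/x"))   # True
print(allowed(LISTED, "https://mail.website.com/"))                    # False

# The blanket form accepts any server or subdomain chain under website.com.
print(allowed(BLANKET, "https://mail.website.com/"))  # True
print(allowed(BLANKET, "https://a.b.c.website.com"))  # True
print(allowed(BLANKET, "https://website.com"))        # True

# Empty components and look-alike domains are still rejected.
print(allowed(BLANKET, "https://a..website.com/"))   # False
print(allowed(BLANKET, "https://.website.com/"))     # False
print(allowed(BLANKET, "https://evilwebsite.com/"))  # False
```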