Regular Expressions for Pentesters

What is a Regular Expression?
A regular expression is a pattern used to match character combinations in strings. It is also known as a regex or regexp.

Which programming languages support Regular Expressions?

  1. Java
  2. Awk
  3. JavaScript
  4. Perl
  5. Python
  6. PHP, and more

Benefits of using Regular Expressions
Regular expressions help in several ways. They are used to validate user input to an application, such as an email address, a name, or a phone number. They also help in finding a word or string in a file or paragraph.

History of Regular Expressions
Regular expressions originated in 1951, when the mathematician Stephen Cole Kleene described regular languages using a mathematical notation he called regular sets. They came into popular use from 1968 for two purposes:

  1. Pattern matching in a text editor
  2. Lexical analysis in a compiler

In the 1970s, various forms of regular expressions were built into Unix programs such as vi, awk, sed, and expr. In the 1980s, more complex regexes appeared in Perl, derived from an earlier regex library.

Using Regex for pentesting

A regex is a pattern search over a pile of data. You define a pattern for the word or string that you want to find. Every pattern is different, because each has its own specific meaning when extracting data. For example, suppose you have 1000 IP addresses and want to find the ones whose first octet starts with “100”.

We can write the expression as “^100.*”.

Here “^” anchors the match to the start of the line, so nothing may come before “100”, and “.*” allows anything to follow it.
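As a rough sketch, the pattern can be applied to a list of addresses in Python. The sample addresses are made up for illustration, and the stricter variant “^100\.” is an addition not found in the text above: it escapes the dot so that the first octet must be exactly 100.

```python
import re

# Hypothetical sample data standing in for the 1000 addresses.
addresses = ["100.12.3.4", "10.100.0.1", "192.168.100.7", "100.64.0.9", "1000.1.1.1"]

loose = re.compile(r"^100.*")    # pattern from the text: anything starting with "100"
strict = re.compile(r"^100\.")   # assumption: escaped dot, first octet exactly 100

loose_hits = [ip for ip in addresses if loose.search(ip)]
strict_hits = [ip for ip in addresses if strict.search(ip)]

print(loose_hits)   # includes the malformed "1000.1.1.1"
print(strict_hits)  # only addresses whose first octet is 100
```

The loose pattern also accepts the malformed entry “1000.1.1.1”, which is the kind of small gap a pentester looks for in validation patterns.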

So how does this benefit the pentester?

Think of a network audit where the pentester has to check for “allow any any” in a router configuration. A simple regex can do this instead of reading through hundreds of firewall rules by hand, which saves a lot of time. It also matters to the pentester how user input is validated: a wrong regex lets an attacker bypass the pattern and execute malicious code.
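A minimal sketch of such an audit check is shown here. The rule syntax in `config` is invented for illustration, since real router and firewall syntax varies by vendor, and the pattern would need adjusting to match it.

```python
import re

# Hypothetical rule set; not the syntax of any particular vendor.
config = """\
rule 10 allow tcp 10.0.0.0/8 any eq 443
rule 20 allow any any
rule 30 deny udp any 192.168.1.0/24
rule 40 ALLOW  any   any
"""

# \s+ tolerates extra spacing; IGNORECASE tolerates "ALLOW".
pattern = re.compile(r"\ballow\s+any\s+any\b", re.IGNORECASE)

risky = [
    (lineno, line.strip())
    for lineno, line in enumerate(config.splitlines(), start=1)
    if pattern.search(line)
]

for lineno, line in risky:
    print(f"line {lineno}: {line}")
```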

A regex is also commonly used to validate the email address a user enters. An email address is written in the format “name@domain.com”, so for the “name” part we use “[\w._%+-]”, which allows word characters plus the special characters “._%+-”. Next comes “@[\w.-]”, which allows word characters, “.” and “-” for the domain. Finally, the ending such as “.com” is a dot followed by 2 to 4 letters from a to z or A to Z, for which we write “\.[a-zA-Z]{2,4}”.
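The pieces described above can be assembled into one pattern, sketched here in Python. The “+” quantifiers and the use of `fullmatch` to check the whole string are assumptions, since the text only gives the individual character classes. The helper name `looks_like_email` is hypothetical.

```python
import re

# Character classes from the text; the "+" quantifiers are an assumption.
EMAIL = re.compile(r"[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}")

def looks_like_email(value: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return EMAIL.fullmatch(value) is not None

for candidate in ["john.doe+test@example.com", "user@@example.com",
                  "no-at-sign.com", "user@example.museum"]:
    print(candidate, looks_like_email(candidate))
```

The last candidate is rejected because its ending has more than 4 letters, which shows how the {2,4} limit can turn away legitimate addresses.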

Part of a pentester's work is to find loopholes in the application, and improper sanitization through a RegExp can lead to the execution of malicious code. If the application only blocks the characters required for SQL injection, the pentester can use URL encoding to bypass the RegExp.
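The URL encoding bypass can be illustrated with a small sketch. The blacklist pattern and the `is_blocked` helper are hypothetical. The scenario assumes the filter runs on the raw request value while a later layer decodes it.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical blacklist: single quotes, semicolons, and SQL comment markers.
BLACKLIST = re.compile(r"[';]|--")

def is_blocked(value: str) -> bool:
    return BLACKLIST.search(value) is not None

payload = "' OR 19=19"
encoded = quote(payload, safe="")   # percent-encode every reserved character

print(is_blocked(payload))          # the raw payload is caught
print(is_blocked(encoded))          # the encoded form slips past the filter
print(unquote(encoded) == payload)  # decoding later restores the payload
```

The safer design is to decode the input first and validate the decoded value, preferably with an allow-list rather than a blacklist.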

For example, suppose the RegExp is set to only

"[0-9]{5}" or "\d{5}" or "[[:digit:]]{5}"

which is meant to accept only the digits 0-9 with a length of exactly 5. Because the pattern has no “^” and “$” anchors, it matches any input that contains five consecutive digits anywhere. The query “' OR 19=19” is blocked by the RegExp, but when the pentester enters “90210'” or “alert(0x42)57732” or “10118alert(0x42)”, the malicious content bypasses the filter and can compromise the website.
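The difference between the unanchored check and a whole-string check can be sketched as follows. The function names `naive_check` and `strict_check` are hypothetical, and the example uses “\d{5}” because Python's `re` module does not support the POSIX class “[[:digit:]]”.

```python
import re

FIVE_DIGITS = re.compile(r"\d{5}")

def naive_check(value: str) -> bool:
    # search() succeeds if five digits appear anywhere in the input.
    return FIVE_DIGITS.search(value) is not None

def strict_check(value: str) -> bool:
    # fullmatch() succeeds only if the whole input is exactly five digits.
    return FIVE_DIGITS.fullmatch(value) is not None

inputs = ["' OR 19=19", "90210'", "alert(0x42)57732", "10118alert(0x42)", "90210"]
for value in inputs:
    print(repr(value), naive_check(value), strict_check(value))
```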

When we exploit a Cross-Site Scripting (XSS) bug, we first work out which RegExp is blocking the values the pentester enters into the input field. For example, suppose the RegExp is set to “/(script)/gi”, which blocks the “script” keyword. However you type the keyword, it will be blocked, whether as “SCRIPT”, “SCRipt”, or “scrIPT”, because the “i” flag makes the match case-insensitive. The question is how to execute an XSS attack when the “script” keyword is blocked. There are other XSS payloads the pentester can use, such as:

<img src=x onerror=prompt(1);>

which does not contain the “script” keyword and so easily bypasses the filter.
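A short sketch of this keyword filter in Python follows, with `re.IGNORECASE` standing in for the “i” flag. The `is_blocked` helper is hypothetical.

```python
import re

# Case-insensitive blacklist for the "script" keyword.
SCRIPT_FILTER = re.compile(r"script", re.IGNORECASE)

def is_blocked(value: str) -> bool:
    return SCRIPT_FILTER.search(value) is not None

payloads = [
    "<script>alert(1)</script>",
    "<SCRipt>alert(1)</SCRipt>",
    "<img src=x onerror=prompt(1);>",
]
results = [is_blocked(p) for p in payloads]
print(results)  # the event-handler payload is not caught
```

Blocking one keyword leaves event handlers such as `onerror` untouched, which is why output encoding is generally preferred over keyword blacklists.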


Vulnerable Example                           Bypass Example
(^a|a$)                                      %20a%20
http                                         httP
a.*                                          a%0Ab
(a+)+                                        aaaaaaaaaaaaaaaaaaa!
a{1,5}                                       aaaaaa (6 times)
[A-z] = [a-zA-Z] + [\]^_`                    aaa[\]^_`aaa
a'\s+\d                                      a'5
a[^\n]*$                                     a\n?a\r?
a\s(not [whitespace]|and)\sb                 a not b
a||b                                         any_string
(a |b)c                                      ac
a[digit]b                                    aab
[SYSTEM|PUBLIC] or (a-z123)                  SYSTEM or abcdef
(\d{1})=\1                                   1!=2
a(?#some comment about wildcards:/)(\w*)b    affb
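A few rows of the table can be checked directly in Python, as a rough sketch. The keys in `checks` are labels made up for this example. Each entry is True when the bypass behaves as the table suggests under Python's `re` semantics, and other regex engines may differ.

```python
import re

checks = {
    # A case-sensitive "http" blacklist does not see "httP".
    "case": re.search(r"http", "httP") is None,
    # [A-z] also covers the six characters between "Z" and "a".
    "range": re.fullmatch(r"[A-z]+", "aaa[\\]^_`aaa") is not None,
    # "." stops at a newline, so the text after it is never examined.
    "newline": re.match(r"a.*", "a\nb").group() == "a",
    # An unanchored upper bound still matches inside a longer run.
    "bound": re.search(r"a{1,5}", "aaaaaa") is not None,
    # \s+ demands whitespace, so the payload without it is not flagged.
    "whitespace": re.search(r"a'\s+\d", "a'5") is None,
}

for name, bypassed in checks.items():
    print(name, bypassed)
```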

About Author:
Sumit Lakra is an information security enthusiast working as an information security consultant @ISECURION, with an interest in application and mobile app security.
