# Regex

Regex (regular expression) is an advanced search pattern syntax that can be used to search for both specific and non-specific data, as well as to enhance the quality of your code.

Please note that if something simpler can do the job, one should use it instead of regex. Using regex where it is not required can break things.

<https://regexr.com> and [https://regex101.com](https://regex101.com/) are very helpful online tools to build regex queries.

To identify a pattern, the following pattern formation elements (regex structures) are used:

* **Character Classes**: List of characters that can appear in the pattern. Character classes are defined by square brackets around the list
  * [x] \[a-z] - all small letters
  * [x] \[A-Z] - all capital letters
  * [x] \[a-zA-Z] - both small and capital letters
  * [x] \[0-9] - all digits
  * [x] \[afgh] - elements only from this list of a,f,g and h
  * [x] \[^a-d] - ^ means negation when it is the first character inside square brackets, i.e. any character other than lowercase a, b, c or d
  * [x] \[\[:alnum:]] - All alphanumeric characters
* **Meta Characters, Anchors and Escape characters**: They have special meaning within regex; escape sequences start with \\
  * [x] / - delimiter marking the start or end of an expression in some tools and languages (e.g. JavaScript, sed)
  * [x] ^ - Start of the line
  * [x] $ - End of the line
  * [x] \w - any word character (letter, digit or underscore) and \W - any non-word character
  * [x] \s - whitespace and \S - non-whitespace
  * [x] \d - digits and \D - not digits
  * [x] \b - word boundary (backspace character when used inside square brackets) and \B - non word boundary
  * [x] a|b - match either a or b. | - means match the expression before or after |
  * [x] \N or **.** - matches any character other than newline (\N is available in Perl/PCRE-style engines, not in every flavor)
  * [x] \\\  - matching \ itself
  * [x] \\\* - matching \*
  * [x] \\. - matching dot
  * [x] \n - match newline
  * [x] \t - match a tab character
  * [x] \r - match carriage return
  * [x] \0 - to match null character
* **Occurrences:** They specify how many times the preceding pattern should be matched
  * [x] {1,3} - Define a range: the first number is the minimum count, and the second is the maximum count
  * [x] {4} - Exact number of times the preceding pattern should be matched
  * [x] \+ - Match one (1) or more of the specified preceding pattern, \* - Match zero (0) or more of the specified preceding pattern
  * [x] ? - Match zero (0) or one (1) of the specified preceding pattern.
* **Quantifiers:** Combining occurrences with the previous two regex structures forms quantifiers, as shown in the figure below.

  <figure><img src="https://275986271-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPaRFhO7J6sRJrjn8Haee%2Fuploads%2FoUGJObuVfKdZaduk4cWS%2Fimage.png?alt=media&#x26;token=ec1826f3-8cc9-4b35-a968-cbeda76f1356" alt=""><figcaption><p>Source: <a href="https://regex101.com/">https://regex101.com/</a></p></figcaption></figure>
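
The structures listed above can be tried out in most regex engines. Below is a minimal sketch using Python's built-in `re` module; the sample string and variable names are made up for illustration. Python's `re` does not support `\N` or POSIX classes such as `[[:alnum:]]`, so those are left out here.

```python
import re

# Made-up sample string, used only for illustration.
text = "Order 66 shipped to agent_007 on 2024-01-15"

# Character class + quantifier: runs of one or more lowercase letters.
lowercase_runs = re.findall(r"[a-z]+", text)

# \d with an exact count: exactly four consecutive digits.
four_digits = re.findall(r"\d{4}", text)

# \b word boundaries: only standalone two-digit numbers.
# "007" is skipped because "_" counts as a word character.
standalone_pairs = re.findall(r"\b\d{2}\b", text)

# \S is the opposite of \s: runs of non-whitespace characters.
tokens = re.findall(r"\S+", text)

# ? makes the preceding "u" optional; | picks either alternative.
spellings = re.findall(r"colou?r", "color colour")
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(lowercase_runs)    # ['rder', 'shipped', 'to', 'agent', 'on']
print(four_digits)       # ['2024']
print(standalone_pairs)  # ['66', '01', '15']
print(len(tokens))       # 7
print(spellings, pets)
```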

Note: a capture group is written with (). For example, to capture whatever lies between two words in WORD1 aasdkjaslkj WORD2 asdka WORD3, say from WORD1 to WORD2, the regex can use the capture group (.\*?) as in WORD1(.\*?)WORD2. Here . matches any character except newline, \* lets it repeat any number of times including zero, and ? makes the match lazy, so it stops at the first position where the rest of the pattern can match.
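
The effect of the lazy ? is easiest to see next to its greedy counterpart. The sketch below uses Python's `re` module on the sample string from the note; ending the pattern with `WORD\d` instead of a fixed `WORD2` is a variation introduced here only to make the difference visible.

```python
import re

text = "WORD1 aasdkjaslkj WORD2 asdka WORD3"

# Lazy: (.*?) expands one character at a time and stops at the first WORD<digit>.
lazy = re.search(r"WORD1(.*?)WORD\d", text)

# Greedy: (.*) grabs as much as possible, so it runs up to the last WORD<digit>.
greedy = re.search(r"WORD1(.*)WORD\d", text)

print(repr(lazy.group(1)))    # ' aasdkjaslkj '
print(repr(greedy.group(1)))  # ' aasdkjaslkj WORD2 asdka '
```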

Examples:

1. To search for the Gmail ID of, say, raghav, when there are many raghavs (like <raghav1@gmail.co>, <raghav5@gmail.com>) and the addresses may also belong to other providers such as Yahoo, Outlook etc., the regex query can be as follows:

```regex
raghav\d*@\w{2,}\.\w{1,}
```

<figure><img src="https://275986271-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPaRFhO7J6sRJrjn8Haee%2Fuploads%2FUZs2JDFfzOyougSoZCHL%2Fimage.png?alt=media&#x26;token=6c327030-c665-4b7d-908d-9e0745aea10e" alt=""><figcaption></figcaption></figure>

Here, raghav is matched literally and may be followed by zero or more digits, hence \d\*. Next comes the @ of the email address, then the domain name, which can be any word of 2 or more characters, followed by a dot, and finally a word of 1 or more characters for the top-level domain.
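
As a rough check of this query outside an online tester, here is a small Python sketch. The input string is invented for illustration (only the first two addresses come from the example above), and the Gmail-only pattern is an assumed variation rather than part of the original query.

```python
import re

# Invented sample input; the yahoo and outlook addresses are hypothetical.
text = "raghav1@gmail.co, raghav5@gmail.com, raghav@yahoo.com, ragini2@outlook.com"

# The query from above: any provider, as long as the user part is raghav + digits.
any_provider = re.findall(r"raghav\d*@\w{2,}\.\w{1,}", text)

# Assumed variation: pin the domain name to gmail to exclude other providers.
gmail_only = re.findall(r"raghav\d*@gmail\.\w+", text)

print(any_provider)  # ['raghav1@gmail.co', 'raghav5@gmail.com', 'raghav@yahoo.com']
print(gmail_only)    # ['raghav1@gmail.co', 'raghav5@gmail.com']
```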

2. To search for an Aadhaar card number, which is a set of 12 digits written as three groups of four separated by spaces, the regex query can be:

```regex
/\d{4}\s\d{4}\s\d{4}/g
```
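
The leading / and trailing /g in the query above are delimiter and global-flag syntax used by some tools. In Python the bare pattern is passed to `re.findall`, which already returns every match. A minimal sketch, with made-up numbers that are not real Aadhaar IDs:

```python
import re

# Made-up 12-digit numbers for illustration only.
text = "ID: 1234 5678 9012 and 2345 6789 0123, ref 12345678"

# Same pattern as above, without the /.../g wrapper.
aadhaar_like = re.findall(r"\d{4}\s\d{4}\s\d{4}", text)

print(aadhaar_like)  # ['1234 5678 9012', '2345 6789 0123']
```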

3. To search for a credit card number whose groups of four digits are optionally separated by a space or a dash, the regex query can be:

```regex
\d{4}\s?\-?\d{4}\s?\-?\d{4}\s?\-?\d{4}
```
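
Because `\s?` and `\-?` are each optional, this query also accepts a space followed by a dash, or no separator at all. The Python sketch below shows that behaviour and compares it with a stricter variant using the character class `[\s-]?`. The stricter pattern is a suggested alternative, not part of the original query, and the card numbers are commonly used test values.

```python
import re

original = re.compile(r"\d{4}\s?\-?\d{4}\s?\-?\d{4}\s?\-?\d{4}")
# Assumed stricter variant: at most one separator, either whitespace or a dash.
stricter = re.compile(r"\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}")

samples = [
    "4111 1111 1111 1111",     # spaces
    "4111-1111-1111-1111",     # dashes
    "4111111111111111",        # no separator
    "4111 -1111 -1111 -1111",  # space and dash together
]

# fullmatch requires the whole string to match the pattern.
original_hits = [bool(original.fullmatch(s)) for s in samples]
stricter_hits = [bool(stricter.fullmatch(s)) for s in samples]

print(original_hits)  # [True, True, True, True]
print(stricter_hits)  # [True, True, True, False]
```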

4. To extract IP addresses from an IIS log file, one can use regex101.com to form the query, grep -Po to extract the matches and awk to print only the required information, as shown below:

```bash
grep -Po '\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s[GPHOT]' iis-sample-logs | awk '{print $1}'
```

<figure><img src="https://275986271-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPaRFhO7J6sRJrjn8Haee%2Fuploads%2FkAe5JrkAR16JBf8cibXv%2Fimage.png?alt=media&#x26;token=25bb0eca-2a90-4e66-87ec-3067fd551ff3" alt=""><figcaption></figcaption></figure>

Note: awk treats whitespace as the field separator by default
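
For comparison, the same extraction can be sketched in Python, where the capture group takes over the job of awk. The log lines below are simplified, made-up entries and do not reproduce the real contents of iis-sample-logs.

```python
import re

# Simplified, made-up log lines for illustration only.
sample_log = """\
#Fields: date time s-ip cs-method
2024-01-15 10:00:01 192.168.1.10 GET /index.html - 80
2024-01-15 10:00:02 10.0.0.7 POST /login - 443
"""

# Same pattern as in the grep command; [GPHOT] matches the first letter of
# an HTTP method such as GET, POST, HEAD, OPTIONS or TRACE.
pattern = re.compile(r"\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s[GPHOT]")

# With a single capture group, findall returns only the captured IP addresses.
ips = pattern.findall(sample_log)

print(ips)  # ['192.168.1.10', '10.0.0.7']
```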
