Regular Expressions Made Easy

/ Updated on March 27, 2023 / PHP /

Regular expressions are a powerful tool for searching, extracting, and manipulating strings. However, not all PHP developers know how to use them. This simple tutorial is intended for everyone who wants to get started with regular expressions in PHP.

PHP regex

PHP has several built-in functions for dealing with regular expressions. We'll examine the PCRE functions (preg_match, preg_replace, …), which use a Perl-compatible regular expression syntax.

Please note that the POSIX regex functions (ereg, eregi, ereg_replace, …) were deprecated in PHP 5.3 and completely removed in PHP 7.
Ok, let's begin.

What is a regular expression? It is a string that defines a pattern (a "mask") for the data to search for or replace. Such a string can contain characters with a special meaning: meta characters, anchors, character classes, quantifiers, and group modifiers. Let's examine the most commonly used ones (please note that this list is not comprehensive; for full information, consult the documentation).

Meta characters

Meta characters are characters with special meanings. If you need to use one of these meta characters as plain text, you have to "escape" it. To do this, put a backslash "\" before it.

  • . - Any character. By default, it doesn't match the newline character (\n). If you need to match newlines too, use the "/s" pattern modifier (see below).
  • [abc] - Character class (range). Matches a single character if it is in the specified set. The set can be given as a list of allowed characters and as ranges written with the dash "-" symbol. For example, [a-f] matches any letter from "a" to "f". One more example: [0-9a-z_] matches any digit, lowercase Latin letter, or underscore.
  • [^abc] - Negated character class. Matches a single character if it is NOT in the specified set.
  • (some expression) - Group. Enclosed in parentheses (). You can use groups for further replacing or for capturing structured information. By default, groups are accessible by their numeric indexes, but it is possible to capture named groups using the syntax (?<name>some expression), see our example below.
  • (php|ruby) - Alternation. Variants are separated by the | symbol. This group will match the string "php" or the string "ruby".
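
The meta characters above can be tried out with preg_match, which returns 1 when the pattern is found and 0 when it is not. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```php
<?php
// "." matches any single character except a newline
$dotMatches = preg_match('/c.t/', 'cat');     // 1
$dotNewline = preg_match('/c.t/', "c\nt");    // 0

// An escaped "." matches only a literal dot
$literalDot = preg_match('/3\.14/', '3x14');  // 0

// Character class with ranges, and a negated class
$isHex    = preg_match('/^[0-9a-f]+$/', 'c0ffee');  // 1
$nonDigit = preg_match('/[^0-9]/', '12345');        // 0

// Alternation inside a numbered group, followed by a named group
preg_match('/(php|ruby) (?<version>[0-9.]+)/', 'Running php 8.2 here', $m);
echo $m[1], ' ', $m['version'], "\n";  // prints "php 8.2"
```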

Anchors

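Anchors are special zero-width tokens that match positions in the searched text rather than characters:

  • ^ - Start of string (or start of line in multiline mode, see the "/m" modifier below).
  • $ - End of string, or just before a final newline (or end of line in multiline mode).
  • \b - Word boundary. Matches the position between a word character (\w, see below) and a non-word character, or the start or end of the string if it borders a word character.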
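
A short sketch of how the anchors behave, using made-up sample text. It assumes the "/m" modifier, which is described later in this article, to show the difference between "start of string" and "start of line".

```php
<?php
$text = "php is fun\nruby is fun too";

// Without /m, ^ matches only at the very start of the subject
$startOnly = preg_match_all('/^\w+/', $text, $first);    // 1 match: "php"

// With /m, ^ also matches right after each newline
$perLine = preg_match_all('/^\w+/m', $text, $lines);     // 2 matches: "php", "ruby"

// $ matches at the end of the subject
$endsWithToo = preg_match('/too$/', $text);              // 1

// \b makes sure "fun" is a whole word, not a part of "funny"
$inFunny   = preg_match('/\bfun\b/', 'a funny day');     // 0
$wholeWord = preg_match('/\bfun\b/', 'so much fun!');    // 1
```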

Predefined character classes

There are several predefined character classes, which can be used in regular expressions, the most used are:

  • \s - White space. Includes a space, a tab, a carriage return, and a line feed.
  • \d - Digit. Includes numbers [0-9]
  • \w - Word. Includes ASCII characters [A-Za-z0-9_]. If the Unicode modifier /u is specified, also includes Unicode letters.

Every character class listed above has an "opposite" class, which will match all characters except the characters from the base class. Just uppercase the letter:

  • \S - Not white space
  • \D - Not digit
  • \W - Not word
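
The classes above and their opposites can be combined with the standard preg_match_all, preg_split, and preg_replace functions. This is only an illustration; the input strings are invented.

```php
<?php
$line = "Order 66:  3 items,\ttotal 42 USD";

// \d+ finds every run of digits
preg_match_all('/\d+/', $line, $numbers);
// $numbers[0] is ['66', '3', '42']

// \s+ splits on any run of spaces or tabs
$parts = preg_split('/\s+/', $line);
// $parts is ['Order', '66:', '3', 'items,', 'total', '42', 'USD']

// \D removes everything that is not a digit
$digitsOnly = preg_replace('/\D/', '', '+1 (555) 010-9999');
// $digitsOnly is '15550109999'

// \W+ replaces every run of non-word characters
$slug = preg_replace('/\W+/', '-', 'Hello, World!');
// $slug is 'Hello-World-'
```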

Number quantifiers

Number quantifiers are used to specify how many times the previous character or expression should occur. They can be as follows:

  • * - 0 or more
  • + - 1 or more
  • ? - 0 or 1
  • {5} - Exactly 5 times
  • {5,} - 5 or more repetitions
  • {5,10} - from 5 to 10 occurrences
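
As a quick illustration of the quantifiers listed above, here are a few checks against made-up strings. The patterns are anchored with ^ and $ so that the whole string has to match.

```php
<?php
// ? makes the previous character optional
$us = preg_match('/^colou?r$/', 'color');    // 1
$uk = preg_match('/^colou?r$/', 'colour');   // 1

// * allows zero occurrences, + requires at least one
$star = preg_match('/^ab*c$/', 'ac');        // 1
$plus = preg_match('/^ab+c$/', 'ac');        // 0

// {5} means exactly five digits
$fiveOk  = preg_match('/^\d{5}$/', '12345'); // 1
$fiveBad = preg_match('/^\d{5}$/', '1234');  // 0

// {2,3} means two or three, {3,} means three or more
$rangeBad = preg_match('/^a{2,3}$/', 'aaaa');   // 0
$atLeast  = preg_match('/^\d{3,}$/', '123456'); // 1
```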

The "*" and "+" quantifiers (like the other quantifiers) are "greedy": they match as many characters as possible. To make a quantifier "not greedy" (lazy), put the "?" modifier right after it. The example below shows the difference.

We have the sample text to parse:

<p>Sample paragraph 1</p><p>Sample paragraph 2</p><p>Sample paragraph 3</p>

Now let's look at the regex: <p>.*</p>

This regex will match the whole sample text! If you need to break your sample into paragraphs and process them separately, you can specify the "?" modifier to make this regex not greedy: <p>.*?</p>
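
The difference is easy to check in PHP. The sketch below runs both variants against the sample text with preg_match_all; the capturing group in the second pattern is an addition for illustration, used to pull out the paragraph text.

```php
<?php
$html = '<p>Sample paragraph 1</p><p>Sample paragraph 2</p><p>Sample paragraph 3</p>';

// Greedy: .* runs to the last </p>, so there is a single match covering everything
$greedyCount = preg_match_all('/<p>.*<\/p>/', $html, $greedy);

// Lazy: .*? stops at the nearest </p>, so each paragraph is matched separately
$lazyCount = preg_match_all('/<p>(.*?)<\/p>/', $html, $lazy);

echo $greedyCount, ' vs ', $lazyCount, "\n";  // prints "1 vs 3"
print_r($lazy[1]);                            // the three paragraph texts
```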

Pattern modifiers

Pattern modifiers are used to specify additional options for regular expressions. The following pattern modifiers are supported:

  • /i - Enables case-insensitive comparison.
  • /s - Single line mode. If specified, the dot metacharacter "." matches every character, including newlines. Without this modifier, the dot does not match newlines.
  • /m - Multiline mode. If specified, changes the behavior of ^ and $ metacharacters from "start of string" to "start of line", and from "end of string" to "end of line", accordingly. Has no effect if there are no newline characters (\n) in a subject string.

In older PHP versions, there also was the /e modifier, which was used with preg_replace for inline PHP code evaluation. It was deprecated in PHP 5.5 and removed in PHP 7; use preg_replace_callback instead.
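
The effect of each modifier can be seen by running the same pattern with and without it. The sketch below uses invented strings. The last statement shows preg_replace_callback, the standard PHP function that covers what the removed /e modifier was used for; the doubling logic is just an example.

```php
<?php
// /i: case-insensitive comparison
$caseSensitive   = preg_match('/php/', 'I like PHP');   // 0
$caseInsensitive = preg_match('/php/i', 'I like PHP');  // 1

// /s: lets "." match newlines too
$twoLines = "<b>line 1\nline 2</b>";
$withoutS = preg_match('/<b>.*<\/b>/', $twoLines);      // 0
$withS    = preg_match('/<b>.*<\/b>/s', $twoLines);     // 1

// /m: ^ and $ match at the start and end of every line
$lineCount = preg_match_all('/^\d+$/m', "10\n20\nabc\n30", $nums);
// $lineCount is 3, $nums[0] is ['10', '20', '30']

// Instead of /e: compute the replacement in a callback
$doubled = preg_replace_callback('/\d+/', function (array $match): string {
    return (string) ($match[0] * 2);
}, 'a1 b20');
// $doubled is 'a2 b40'
```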

You may also know that other programming languages, such as C# or Java, have an option to "compile" a regular expression. This can improve performance, because the same compiled regex can be applied to multiple strings without repeating the compilation step. There is no such modifier in PHP because it isn't needed: every compiled regex is cached internally after its first use:

This extension maintains a global per-thread cache of compiled regular expressions (up to 4096). (Source: Introduction to PCRE)

PHP regex functions

Ok, now you know something about regular expressions. Let's sum it up and look at some real examples. PHP has several functions for working with Perl-compatible regular expressions; the most commonly used are preg_match, preg_match_all, preg_replace, preg_replace_callback, and preg_split.

All of these functions take a pattern parameter, which consists of the following parts separated by a delimiter (a forward slash / is the most commonly used one):
/regular_expression/pattern_modifiers.

Sometimes it can be more convenient to use another char like # or @ as a delimiter:
#regular_expression#pattern_modifiers.

You need to escape your delimiter if it is used in the regular expression. So, if your regex contains many slashes, it is better to choose another delimiter.
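
Here is a small sketch of how the choice of delimiter affects escaping. The URL and the text are made up. The second part uses preg_quote, a standard PHP function not mentioned above, which escapes special characters in literal text; its second argument is the delimiter to escape as well.

```php
<?php
$url = 'https://example.com/docs/php/regex';

// With "/" as the delimiter, every slash inside the pattern must be escaped
$withSlash = preg_match('/^https:\/\/example\.com\/docs\//', $url);  // 1

// With "#" as the delimiter, the slashes can stay as they are
$withHash = preg_match('#^https://example\.com/docs/#', $url);       // 1

// preg_quote escapes regex meta characters and the given delimiter
$literal = 'price (USD) 9.99/unit';
$quoted  = preg_quote($literal, '/');
// $quoted is 'price \(USD\) 9\.99\/unit'

$found = preg_match('/' . $quoted . '/', 'Total price (USD) 9.99/unit today');  // 1
```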

Ok, now some example patterns:

  • /<title>([^>]*)<\/title>/si - will match the title tag of a webpage
  • /\d{1,2}\/\d{1,2}\/\d{4}/ - will match a date in the format dd/mm/yyyy
  • /\w+@([\w_-]+\.)+[a-z]{2,}/si - will match a typical email address (a simplified check, not a full validator)
  • #^(https?:)?//#si - will match URLs that start with http://, https://, or // (absolute, usually external, URLs)
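
The following sketch tries patterns like the ones listed above against made-up sample strings. Note that the date pattern needs a backslash before every "d", and the URL pattern needs balanced parentheses; the forms used here are written that way.

```php
<?php
// Title tag (the /i modifier lets <title> match <TITLE> as well)
$page = '<html><head><TITLE>My Page</TITLE></head></html>';
preg_match('/<title>([^>]*)<\/title>/si', $page, $title);
// $title[1] is 'My Page'

// Date in the format dd/mm/yyyy
preg_match('/\d{1,2}\/\d{1,2}\/\d{4}/', 'Due 27/03/2023.', $date);
// $date[0] is '27/03/2023'

// Email address
preg_match('/\w+@([\w_-]+\.)+[a-z]{2,}/si', 'Write to admin@mail.example.com now', $email);
// $email[0] is 'admin@mail.example.com'

// Absolute (external) URL versus a site-relative path
$pattern     = '#^(https?:)?//#si';
$isHttps     = preg_match($pattern, 'https://example.com');     // 1
$isProtoless = preg_match($pattern, '//cdn.example.com/x.js');  // 1
$isRelative  = preg_match($pattern, '/about.html');             // 0
```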

Working with group references

As you already know, parts of a regex match can be captured into groups. You can refer to these groups later with group references (backreferences). A group reference is the group number preceded by the character "$" or "\": in a replacement string you can write $1 or \1, and inside the pattern itself you write \1.

Let's look at the example. This example will change all HTML links in the variable $s to the links that will open in the new window by adding the target="_blank" rel="noopener":

<?php

//initialize the variable with HTML having several sample links
$s = '<a href="http://www.php.net">PHP web site</a> ';
$s .= '<a href="http://www.wmtips.com">Webmaster Tips</a> ';
$s .= '<a href="http://www.google.com">Google</a>';

//add target="_blank" rel="noopener" to each link
$s = preg_replace('/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si','<a href="$1" target="_blank" rel="noopener">$2</a>',$s);

//output the result
echo $s;

?>

In this example, we assumed that HTML attribute values can be enclosed in either single or double quotes, so we used the [\'"] character class (note that, because the pattern is inside a single-quoted PHP string, we escaped the single quote with a backslash).
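
Group references are not limited to rebuilding links. As a further sketch with invented input, the first statement reorders the parts of a date using $1, $2, and $3 in the replacement string, and the second uses \1 inside the pattern itself to find a word that is immediately repeated.

```php
<?php
// References in the replacement string: turn dd/mm/yyyy into yyyy-mm-dd
$iso = preg_replace('/(\d{2})\/(\d{2})\/(\d{4})/', '$3-$2-$1', 'Updated on 27/03/2023');
// $iso is 'Updated on 2023-03-27'

// Reference inside the pattern: \1 must repeat whatever group 1 matched
$fixed = preg_replace('/\b(\w+) \1\b/i', '$1', 'This is is a test test.');
// $fixed is 'This is a test.'

echo $iso, "\n", $fixed, "\n";
```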

Parsing the site contents with PHP

Let's write our example script that will grab contents from some webpage, parse it with regular expressions and display the parsed data. Let's take Y Combinator News as an example. We'll extract posted links, their text, and the score rating. After that, we'll order the results by the score.

First, we need to view the HTML page source and find the data blocks we are interested in. The example HTML block looks as follows:

<span class="titleline"><a href="https://github.com/valeriansaliou/sonic">An alternative to Elasticsearch that runs on a few MBs of RAM</a><span class="sitebit comhead"> (<a href="from?site=github.com/valeriansaliou"><span class="sitestr">github.com/valeriansaliou</span></a>)</span></span></td></tr><tr><td colspan="2"></td><td class="subtext"><span class="subline">
<span class="score" id="score_33315237">206 points</span>

So, we have all the necessary data in this block and are ready to implement our grabber PHP script:

<?php

 //Regex example for article "Regular expressions made easy"
 //Copyright (c) www.wmtips.com, 2022
 echo '<style>body {padding: 10px; font: 1.1em/1.6 Roboto, sans-serif;}</style>
 <p>Regex example for the article "<a href="https://www.wmtips.com/php/regular-expressions-made-easy/">Regular expressions made easy</a>".<br>
 Extracting links from news.ycombinator.com and ordering them by the score</p><hr>';

 $url = 'https://news.ycombinator.com';
 //grab contents of the web page into $s variable
 //please note that file_get_contents is the simplest way to make an HTTP GET request and may be blocked by site owners
 $s = file_get_contents($url);
 if (!$s)
  die('Cannot get website contents!');

 //extract all links and their scores
 if (preg_match_all('/<span class="titleline"><a href="(?<url>.*?)">(?<title>[^<>]*)<\/a>.*?<span class="score"[^<>]*>(?<score>\d+) points<\/span>/si',$s,$matches,PREG_SET_ORDER))
 {
  //Now we have the following groups:
  //['url'] - the url of each posted link
  //['title'] - its title
  //['score'] - the score

  //sort results by the score in descending order
  usort($matches,function ($a1,$a2)
  {
   //compare the scores as integers; swapping the operands gives descending order
   return (int)$a2['score'] <=> (int)$a1['score'];
  });

  //iterate through the results and output them
  foreach ($matches as $m)
   echo "{$m['score']} &nbsp; <a href=\"{$m['url']}\">{$m['title']}</a><br>\n";
 }

?>


Please note that the Y Combinator page format may change in the future, in which case this script may stop working.
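
Because the live page can change or be unavailable, it can help to check the pattern offline against a saved fragment first. The sketch below does that with an abridged copy of the sample block shown earlier plus a second, hypothetical item (the example.com link and its score are made up). It needs no network access.

```php
<?php
// Two items: an abridged copy of the sample block and an invented one
$s = <<<'HTML'
<span class="titleline"><a href="https://github.com/valeriansaliou/sonic">An alternative to Elasticsearch that runs on a few MBs of RAM</a></span></td></tr>
<span class="score" id="score_33315237">206 points</span>
<span class="titleline"><a href="https://example.com/post">A hypothetical second story</a></span></td></tr>
<span class="score" id="score_1">512 points</span>
HTML;

// Same pattern as in the grabber script
$pattern = '/<span class="titleline"><a href="(?<url>.*?)">(?<title>[^<>]*)<\/a>.*?<span class="score"[^<>]*>(?<score>\d+) points<\/span>/si';
$count = preg_match_all($pattern, $s, $matches, PREG_SET_ORDER);

// Sort by score in descending order
usort($matches, function (array $a, array $b): int {
    return (int) $b['score'] <=> (int) $a['score'];
});

foreach ($matches as $m) {
    echo $m['score'], ' ', $m['title'], ' (', $m['url'], ")\n";
}
```

If the output lists both items with the 512-point one first, the pattern and the sorting work as intended for this fragment.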

I hope this simple tutorial was interesting and useful for you. As you can see, regular expressions can be very efficient for parsing and extracting string data.

