Regular expressions

From MobileRead

A regular expression is a sequence of characters that describes a pattern, so that a search can match patterns instead of just exact text.

Overview

Regular expressions are often abbreviated as RegEx, RegExp, or even just RE. These expressions are most often used in the search and replace commands of a text editor or word processor. They extend the regular text search capabilities so that items can be found by a recognizable pattern. They are related to, but not the same thing as, wildcards, which allow finding groups of items that share some characters.

There are several forms of regular expressions. They work from the same principles, but the syntax may differ, so it can be confusing to try one type of RE when you are used to a different system. PCRE (Perl Compatible Regular Expressions) is a RegEx syntax modeled on the regular expressions of the Perl programming language. The IEEE has defined, as part of the POSIX standard, BRE (Basic RegEx) and ERE (Extended RegEx); ERE broadly extends BRE with additional operators such as + and ? and needs fewer backslashes. MS Word also uses its own form of regular expressions, which is tied to its use of wildcards in the search field. Other RegEx systems exist, several of them based on Unix implementations. Python also has built-in support for RE through its re module.

Wildcards

Wildcards (also called wild cards) are special characters used to expand the meaning of a search. * is often used as a wildcard to mean any number of characters. It is used in the command line interface to help select multiple files; for example, *.epub would find all files that have a .epub extension. Another wildcard is ?, which matches any single character. If you had two words, cat and cut, the search argument c?t would match both. If you do not want the wildcard meaning of a character, you precede it with a \, so \? means that a real ? is being searched for. While wildcards can match lots of items, they often match too many things (which is called being greedy), so a way to restrict the matches is needed.
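The wildcard behavior described above can be tried out with Python's standard fnmatch module. This is only a small sketch: the file names and word lists are made up for illustration. Note that fnmatch does not use \ for escaping; a literal ? is written as [?] instead.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration.
files = ["novel.epub", "notes.txt", "guide.epub"]

# "*" matches any number of characters, so "*.epub" selects the EPUB files.
epubs = [name for name in files if fnmatchcase(name, "*.epub")]

# "?" matches exactly one character, so "c?t" matches cat and cut,
# but not coat (two characters) or ct (none).
words = ["cat", "cut", "coat", "ct"]
three_letter = [w for w in words if fnmatchcase(w, "c?t")]

# fnmatch has no backslash escape; a real "?" is written as "[?]".
literal_hit = fnmatchcase("why?", "why[?]")
literal_miss = fnmatchcase("whys", "why[?]")

print(epubs)
print(three_letter)
print(literal_hit, literal_miss)
```

fnmatchcase is used here rather than fnmatch so that the comparison is case-sensitive on every operating system.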

RE Wildcards

In the context of RE, the wildcards may be similar to the ones above or separately defined. In most implementations a . (period) is a wildcard similar to the ? above: it matches any single character, usually excluding a New Line. The behavior of a RE wildcard may be modified with other characters to limit or change its scope. A * in RE matches zero or more occurrences of the item in front of it, so the combination .* behaves like the * wildcard above.
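As a rough illustration of how . and * behave, here is a sketch using Python's re module. Other RE systems may differ in detail, and the word lists are invented for the example.

```python
import re

# "." matches exactly one character, but not a newline by default.
dot_words = ["cat", "cut", "coat", "c\nt"]
dot_hits = [w for w in dot_words if re.fullmatch(r"c.t", w)]

# "*" applies to the item in front of it: "o*" is zero or more letters o.
star_words = ["ct", "cot", "coot", "cat"]
star_hits = [w for w in star_words if re.fullmatch(r"co*t", w)]

# ".*" together means "any number of any characters",
# which is what the plain * wildcard means on the command line.
any_words = ["ct", "coat", "cart", "cab"]
any_hits = [w for w in any_words if re.fullmatch(r"c.*t", w)]

print(dot_hits)
print(star_hits)
print(any_hits)
```

re.fullmatch requires the whole word to match the pattern, which keeps the comparison with file-name wildcards fair.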

A popular RE wildcard is (.+?) where:

. : any character
+ : any number of times (but at least one)
? : but as few as possible, as long as the rest of the expression still matches
( ) : and put it in a group, so it can be referred to as \1, \2, etc.
Related patterns:

.* : greedy; matches as many characters as possible
.*? : lazy; matches as few characters as possible
[0-9]+ : one or more digits; a stricter choice when the value is always numeric
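The difference between greedy and lazy matching, and the use of groups in a replacement, can be sketched with Python's re module. The sample strings below are assumptions chosen for illustration, not taken from any particular book file.

```python
import re

line = "<b>one</b> and <b>two</b>"

# Greedy: .+ runs to the last </b> it can find on the line.
greedy = re.search(r"<b>(.+)</b>", line).group(1)

# Lazy: .+? stops at the first </b> that lets the pattern match.
lazy = re.search(r"<b>(.+?)</b>", line).group(1)

# With a lazy group, each bold span is found separately.
all_bold = re.findall(r"<b>(.+?)</b>", line)

# Groups are numbered left to right and reused as \1, \2 in the replacement.
# [0-9]+ restricts the second group to digits.
swapped = re.sub(r"(.+?)=([0-9]+)", r"\2=\1", "width=600")

print(greedy)
print(lazy)
print(all_bold)
print(swapped)
```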

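Suppose you have a text file in which every line ends with a hard line break and an empty line separates the paragraphs. Searching for (.+)\r\n and replacing it with \1 removes the line break at the end of every non-empty line, which joins the lines of each paragraph and allows proper paragraph wrapping. Because .+ needs at least one character, the empty lines are not matched, so the paragraphs stay separated. The line break is removed rather than replaced, so if the lines do not already end with a space, use \1 followed by a space as the replacement.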
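The same search and replace can be sketched in Python. The function name join_wrapped_lines and its joiner parameter are hypothetical helpers introduced for this example, and the sample text is invented; the pattern and the \1 replacement are the ones given above.

```python
import re


def join_wrapped_lines(text: str, joiner: str = "") -> str:
    # Hypothetical helper: replace "non-empty line + CRLF" with the line
    # itself, optionally followed by joiner (assumed to be plain text).
    return re.sub(r"(.+)\r\n", r"\1" + joiner, text)


# Invented sample: two paragraphs separated by an empty line.
sample = (
    "The quick brown\r\n"
    "fox jumps.\r\n"
    "\r\n"
    "A new paragraph\r\n"
    "starts here.\r\n"
)

plain = join_wrapped_lines(sample)        # replacement is exactly \1
spaced = join_wrapped_lines(sample, " ")  # replacement is \1 plus a space

print(repr(plain))
print(repr(spaced))
```

With the plain \1 replacement the words at the line ends run together ("brownfox"), which is why the variant with a space is usually the safer choice. The spaced variant leaves a trailing space at the end of each paragraph, which can be stripped afterwards if it matters.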

Cheat sheets

For more information
