Perl Compatible Regular Expressions Manual

PCRE was originally written for the Exim MTA, but is now used by many high-profile open source projects, including Python, Apache, PHP, KDE, Postfix, Analog, and nmap. Other interesting projects using PCRE include Ferite, Onyx, Hypermail, and Askemos.


     PCRE - Perl-compatible regular expressions


     The PCRE library is a set of functions that implement  regu-
     lar  expression  pattern  matching using the same syntax and
     semantics as Perl, with just a few differences. The  current
     implementation  of  PCRE  (release 4.x) corresponds approxi-
     mately with Perl 5.8, including support  for  UTF-8  encoded
     strings.    However,  this  support  has  to  be  explicitly
     enabled; it is not the default.

     PCRE is written in C and released as a C library. However, a
     number  of  people  have  written wrappers and interfaces of
     various kinds. A C++ class is included  in  these  contribu-
     tions,  which  can  be found in the Contrib directory at the
     primary FTP site, which is:

     Details of exactly which Perl  regular  expression  features
     are  and  are  not  supported  by PCRE are given in separate
     documents. See the pcrepattern and pcrecompat pages.

     Some features of PCRE can be included, excluded, or  changed
     when  the library is built. The pcre_config() function makes
     it possible for a client  to  discover  which  features  are
     available.  Documentation  about  building  PCRE for various
     operating systems can be found in the  README  file  in  the
     source distribution.


     The user documentation for PCRE has been  split  up  into  a
     number  of  different sections. In the "man" format, each of
     these is a separate "man page". In the HTML format, each  is
     a  separate  page,  linked from the index page. In the plain
     text format, all the sections are concatenated, for ease  of
     searching. The sections are as follows:

       pcre              this document
       pcreapi           details of PCRE's native API
       pcrebuild         options for building PCRE
       pcrecallout       details of the callout feature
       pcrecompat        discussion of Perl compatibility
       pcregrep          description of the pcregrep command
       pcrepattern       syntax and semantics of supported
                           regular expressions
       pcreperform       discussion of performance issues
       pcreposix         the POSIX-compatible API
       pcresample        discussion of the sample program
       pcretest          the pcretest testing command

     In addition, in the "man" and HTML formats, there is a short
     page for each library function, listing its arguments and
     results.


     There are some size limitations in PCRE but it is hoped that
     they will never in practice be relevant.

     The maximum length of a  compiled  pattern  is  65539  (sic)
     bytes  if PCRE is compiled with the default internal linkage
     size of 2. If you want to process regular  expressions  that
     are  truly  enormous,  you can compile PCRE with an internal
     linkage size of 3 or 4 (see the README file  in  the  source
     distribution  and  the pcrebuild documentation for details).
     In these cases the limit is substantially larger. However,
     the speed of execution will be slower.

     All values in repeating quantifiers must be less than 65536.
     The maximum number of capturing subpatterns is 65535.

     There is no limit to the  number  of  non-capturing  subpat-
     terns,  but  the  maximum  depth  of nesting of all kinds of
     parenthesized subpattern, including  capturing  subpatterns,
     assertions, and other types of subpattern, is 200.

     The maximum length of a subject string is the largest  posi-
     tive number that an integer variable can hold. However, PCRE
     uses recursion to handle subpatterns and indefinite  repeti-
     tion.  This  means  that the available stack space may limit
     the size of a subject string that can be processed  by  cer-
     tain patterns.


     Starting at release 3.3, PCRE has had some support for char-
     acter  strings  encoded in the UTF-8 format. For release 4.0
     this has been greatly extended to cover most common
     requirements.

     In order to process UTF-8 strings, you must build PCRE to
     include  UTF-8  support  in  the code, and, in addition, you
     must call pcre_compile() with  the  PCRE_UTF8  option  flag.
     When  you  do this, both the pattern and any subject strings
     that are matched against it are  treated  as  UTF-8  strings
     instead of just strings of bytes.

     If you compile PCRE with UTF-8 support, but do not use it at
     run  time,  the  library will be a bit bigger, but the addi-
     tional run time overhead is limited to testing the PCRE_UTF8
     flag in several places, so should not be very large.

     The following comments apply when PCRE is running in UTF-8
     mode:

     1. When you set the PCRE_UTF8 flag, the  strings  passed  as
     patterns  and  subjects are checked for validity on entry to
     the relevant  functions.  If  an  invalid  UTF-8  string  is
     passed,  an  error  return is given. In some situations, you
     may already know that your strings are valid, and  therefore
     want  to  skip these checks in order to improve performance.
     If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
     run  time,  PCRE  assumes  that the pattern or subject it is
     given (respectively) contains only  valid  UTF-8  codes.  In
     this  case, it does not diagnose an invalid UTF-8 string. If
     you  pass   an   invalid   UTF-8   string   to   PCRE   when
     PCRE_NO_UTF8_CHECK  is  set, the results are undefined. Your
     program may crash.

     2. In a pattern, the escape sequence \x{...}, where the con-
     tents  of  the  braces is a string of hexadecimal digits, is
     interpreted as a UTF-8 character whose code  number  is  the
     given  hexadecimal  number, for example: \x{1234}. If a non-
     hexadecimal digit appears between the braces,  the  item  is
     not  recognized.  This escape sequence can be used either as
     a literal, or within a character class.

     3. The original hexadecimal escape sequence, \xhh, matches a
     two-byte UTF-8 character if the value is greater than 127.

     4. Repeat quantifiers apply to  complete  UTF-8  characters,
     not to individual bytes, for example: \x{100}{3}.

     5. The dot metacharacter matches one UTF-8 character instead
     of a single byte.

     6. The escape sequence \C can be used to match a single byte
     in UTF-8 mode, but its use can lead to some strange effects.

     7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W
     correctly test characters of any code value, but the charac-
     ters that PCRE recognizes as digits, spaces, or word charac-
     ters  remain  the  same  set as before, all with values less
     than 256.

     8. Case-insensitive  matching  applies  only  to  characters
     whose  values  are  less than 256. PCRE does not support the
     notion of "case" for higher-valued characters.

     9. PCRE does not support the use of Unicode tables and  pro-
     perties or the Perl escapes \p, \P, and \X.


     Philip Hazel
     University Computing Service,
     Cambridge CB2 3QG, England.
     Phone: +44 1223 334714

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     This document describes the optional features of  PCRE  that
     can  be  selected when the library is compiled. They are all
     selected, or deselected, by providing options to the config-
     ure  script  which  is run before the make command. The com-
     plete list of options  for  configure  (which  includes  the
     standard  ones  such  as  the  selection of the installation
     directory) can be obtained by running

       ./configure --help

     The following sections describe certain options whose  names
     begin  with  --enable  or  --disable. These settings specify
     changes to the defaults for the configure  command.  Because
     of  the  way  that  configure  works, --enable and --disable
     always come in pairs, so  the  complementary  option  always
     exists as well, but as it specifies the default, it is not
     described.


     To build PCRE with support for UTF-8 character strings, add


     to the configure command. Of itself, this does not make PCRE
     treat  strings as UTF-8. As well as compiling PCRE with this
     option, you also have to set the PCRE_UTF8 option when
     you call the pcre_compile() function.


     By default, PCRE treats character 10 (linefeed) as the  new-
     line  character.  This  is  the  normal newline character on
     Unix-like systems. You can compile PCRE to use character  13
     (carriage return) instead by adding


     to the configure command. For completeness there is  also  a
     --enable-newline-is-lf  option,  which  explicitly specifies
     linefeed as the newline character.


     The PCRE building process uses libtool to build both  shared
     and  static  Unix libraries by default. You can suppress one
     of these by adding one of


     to the configure command, as required.


     When PCRE is called through the  POSIX  interface  (see  the
     pcreposix  documentation),  additional  working  storage  is
     required for holding the pointers  to  capturing  substrings
     because  PCRE requires three integers per substring, whereas
     the POSIX interface provides only  two.  If  the  number  of
     expected  substrings  is  small,  the  wrapper function uses
     space on the stack, because this is faster than  using  mal-
     loc()  for  each call. The default threshold above which the
     stack is no longer used is 10; it can be changed by adding a
     setting such as


     to the configure command.


     Internally, PCRE has a  function  called  match()  which  it
     calls  repeatedly  (possibly  recursively) when performing a
     matching operation. By limiting the  number  of  times  this
     function  may  be  called,  a  limit  can  be  placed on the
     resources used by a single call to  pcre_exec().  The  limit
     can  be  changed  at  run  time, as described in the pcreapi
     documentation. The default is 10 million, but  this  can  be
     changed by adding a setting such as


     to the configure command.


     Within a compiled pattern, offset values are used  to  point
     from  one  part  to  another  (for  example, from an opening
     parenthesis to an  alternation  metacharacter).  By  default
     two-byte  values  are  used  for these offsets, leading to a
     maximum size for a compiled pattern of around 64K.  This  is
     sufficient  to  handle  all  but the most gigantic patterns.
     Nevertheless, some people do want to process  enormous  pat-
     terns,  so  it is possible to compile PCRE to use three-byte
     or four-byte offsets by adding a setting such as


     to the configure command. The value given must be 2,  3,  or
     4.  Using  longer  offsets  slows down the operation of PCRE
     because it has to load additional bytes when handling them.

     If you build PCRE with an increased link size, test  2  (and
     test 5 if you are using UTF-8) will fail. Part of the output
     of these tests is a representation of the compiled  pattern,
     and this changes with the link size.

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     #include <pcre.h>

     pcre *pcre_compile(const char *pattern, int options,
          const char **errptr, int *erroffset,
          const unsigned char *tableptr);

     pcre_extra *pcre_study(const pcre *code, int options,
          const char **errptr);

     int pcre_exec(const pcre *code, const pcre_extra *extra,
          const char *subject, int length, int startoffset,
          int options, int *ovector, int ovecsize);

     int pcre_copy_named_substring(const pcre *code,
          const char *subject, int *ovector,
          int stringcount, const char *stringname,
          char *buffer, int buffersize);

     int pcre_copy_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber, char *buffer,
          int buffersize);

     int pcre_get_named_substring(const pcre *code,
          const char *subject, int *ovector,
          int stringcount, const char *stringname,
          const char **stringptr);

     int pcre_get_stringnumber(const pcre *code,
          const char *name);

     int pcre_get_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber,
          const char **stringptr);

     int pcre_get_substring_list(const char *subject,
          int *ovector, int stringcount, const char ***listptr);

     void pcre_free_substring(const char *stringptr);

     void pcre_free_substring_list(const char **stringptr);

     const unsigned char *pcre_maketables(void);

     int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
          int what, void *where);

     int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

     int pcre_config(int what, void *where);

     char *pcre_version(void);

     void *(*pcre_malloc)(size_t);

     void (*pcre_free)(void *);

     int (*pcre_callout)(pcre_callout_block *);


     PCRE has its own native API,  which  is  described  in  this
     document.  There  is  also  a  set of wrapper functions that
     correspond to the POSIX regular expression API.   These  are
     described in the pcreposix documentation.

     The native API function prototypes are defined in the header
     file  pcre.h,  and  on  Unix  systems  the library itself is
     called libpcre.a, so can be accessed by adding -lpcre to the
     command  for  linking  an  application  which  calls it. The
     header file defines the macros PCRE_MAJOR and PCRE_MINOR  to
     contain the major and minor release numbers for the library.
     Applications can use these to include support for different
     releases.

     The functions pcre_compile(), pcre_study(), and  pcre_exec()
     are  used  for compiling and matching regular expressions. A
     sample program that demonstrates the simplest way  of  using
     them  is  given in the file pcredemo.c. The pcresample docu-
     mentation describes how to run it.

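     There are convenience functions for extracting captured
     substrings from a matched subject string. They are:

       pcre_copy_substring()
       pcre_copy_named_substring()
       pcre_get_substring()
       pcre_get_named_substring()
       pcre_get_substring_list()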

     pcre_free_substring()  and  pcre_free_substring_list()   are
     also provided, to free the memory used for extracted
     strings.

     The function pcre_maketables() is used (optionally) to build
     a  set of character tables in the current locale for passing
     to pcre_compile().

     The function pcre_fullinfo() is used to find out information
     about a compiled pattern; pcre_info() is an obsolete version
     which returns only some of the available information, but is
     retained   for   backwards   compatibility.    The  function
     pcre_version() returns a pointer to a string containing  the
     version of PCRE and its date of release.

     The global variables  pcre_malloc  and  pcre_free  initially
     contain the entry points of the standard malloc() and free()
     functions respectively. PCRE  calls  the  memory  management
     functions  via  these  variables,  so  a calling program can
     replace them if it  wishes  to  intercept  the  calls.  This
     should be done before calling any PCRE functions.

     The global variable pcre_callout initially contains NULL. It
     can be set by the caller to a "callout" function, which PCRE
     will then call at specified points during a matching  opera-
     tion. Details are given in the pcrecallout documentation.


     The PCRE functions can be used in  multi-threading  applica-
     tions, with the proviso that the memory management functions
     pointed to by pcre_malloc and  pcre_free,  and  the  callout
     function pointed to by pcre_callout, are shared by all
     threads.

     The compiled form of a regular  expression  is  not  altered
     during  matching, so the same compiled pattern can safely be
     used by several threads at once.


     int pcre_config(int what, void *where);

     The function pcre_config() makes  it  possible  for  a  PCRE
     client  to  discover  which optional features have been com-
     piled into the PCRE library. The pcrebuild documentation has
     more details about these optional features.

     The first argument for pcre_config() is an integer, specify-
     ing  which information is required; the second argument is a
     pointer to a variable into which the information is  placed.
     The following information is available:


     The output is an integer that is set to one if UTF-8 support
     is available; otherwise it is set to zero.


     The output is an integer that is set to  the  value  of  the
     code  that  is  used for the newline character. It is either
     linefeed (10) or carriage return (13), and  should  normally
     be the standard character for your operating system.


     The output is an integer that contains the number  of  bytes
     used  for  internal linkage in compiled regular expressions.
     The value is 2, 3, or 4. Larger values allow larger  regular
     expressions  to be compiled, at the expense of slower match-
     ing. The default value of 2 is sufficient for  all  but  the
     most  massive patterns, since it allows the compiled pattern
     to be up to 64K in size.


     The output is an integer that contains the  threshold  above
     which  the POSIX interface uses malloc() for output vectors.
     Further details are given in the pcreposix documentation.


     The output is an integer that gives the  default  limit  for
     the   number  of  internal  matching  function  calls  in  a
     pcre_exec()  execution.  Further  details  are  given   with
     pcre_exec() below.


     pcre *pcre_compile(const char *pattern, int options,
          const char **errptr, int *erroffset,
          const unsigned char *tableptr);

     The function pcre_compile() is called to compile  a  pattern
     into  an internal form. The pattern is a C string terminated
     by a binary zero, and is passed in the argument  pattern.  A
     pointer  to  a  single  block of memory that is obtained via
     pcre_malloc is returned. This contains the compiled code and
     related  data.  The  pcre  type  is defined for the returned
     block; this is a typedef for a structure whose contents  are
     not  externally  defined. It is up to the caller to free the
     memory when it is no longer required.

     Although the compiled code of a PCRE regex  is  relocatable,
     that is, it does not depend on memory location, the complete
     pcre data block is not fully relocatable,  because  it  con-
     tains  a  copy of the tableptr argument, which is an address
     (see below).

     The options argument contains independent bits  that  affect
     the  compilation.  It  should  be  zero  if  no  options are
     required. Some of the options, in particular, those that are
     compatible  with Perl, can also be set and unset from within
     the pattern (see the detailed description of regular expres-
     sions  in the pcrepattern documentation). For these options,
     the contents of the options argument specifies their initial
     settings  at  the  start  of  compilation and execution. The
     PCRE_ANCHORED option can be set at the time of  matching  as
     well as at compile time.

     If errptr is NULL, pcre_compile() returns NULL  immediately.
     Otherwise, if compilation of a pattern fails, pcre_compile()
     returns NULL, and sets the variable pointed to by errptr  to
     point  to a textual error message. The offset from the start
     of  the  pattern  to  the  character  where  the  error  was
     discovered   is   placed  in  the  variable  pointed  to  by
     erroffset, which must not be NULL. If it  is,  an  immediate
     error is given.

     If the final  argument,  tableptr,  is  NULL,  PCRE  uses  a
     default  set  of character tables which are built when it is
     compiled, using the default C  locale.  Otherwise,  tableptr
     must  be  the result of a call to pcre_maketables(). See the
     section on locale support below.

     This code fragment shows a typical straightforward call to
     pcre_compile():

       pcre *re;
       const char *error;
       int erroffset;
       re = pcre_compile(
         "^A.*Z",          /* the pattern */
         0,                /* default options */
         &error,           /* for error message */
         &erroffset,       /* for error offset */
         NULL);            /* use default character tables */

     The following option bits are defined:


     If this bit is set, the pattern is forced to be  "anchored",
     that is, it is constrained to match only at the first match-
     ing point in the string which is being searched  (the  "sub-
     ject string"). This effect can also be achieved by appropri-
     ate constructs in the pattern itself, which is the only  way
     to do it in Perl.


     If this bit is set, letters in the pattern match both  upper
     and  lower  case  letters.  It  is  equivalent  to Perl's /i
     option, and it can be changed within a  pattern  by  a  (?i)
     option setting.


     If this bit is set, a dollar metacharacter  in  the  pattern
     matches  only at the end of the subject string. Without this
     option, a dollar also matches immediately before  the  final
     character  if it is a newline (but not before any other new-
     lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if
     PCRE_MULTILINE is set. There is no equivalent to this option
     in Perl, and no way to set it within a pattern.


     If this bit is set, a dot metacharacter in the pattern
     matches all characters, including newlines. Without it, new-
     lines are excluded. This option is equivalent to  Perl's  /s
     option,  and  it  can  be changed within a pattern by a (?s)
     option setting. A negative class such as [^a] always matches
     a newline character, independent of the setting of this
     option.


     If this bit is set, whitespace data characters in  the  pat-
     tern  are  totally  ignored  except when escaped or inside a
     character class. Whitespace does not include the VT  charac-
     ter  (code 11). In addition, characters between an unescaped
     # outside a character class and the next newline  character,
     inclusive, are also ignored. This is equivalent to Perl's /x
     option, and it can be changed within a  pattern  by  a  (?x)
     option setting.

     This option makes it possible  to  include  comments  inside
     complicated patterns.  Note, however, that this applies only
     to data characters. Whitespace characters may  never  appear
     within special character sequences in a pattern, for example
     within the sequence (?( which introduces a conditional
     subpattern.


     This option was invented in  order  to  turn  on  additional
     functionality of PCRE that is incompatible with Perl, but it
     is currently of very little use. When set, any backslash  in
     a  pattern  that is followed by a letter that has no special
     meaning causes an error, thus reserving  these  combinations
     for  future  expansion.  By default, as in Perl, a backslash
     followed by a letter with no special meaning is treated as a
     literal.  There  are at present no other features controlled
     by this option. It can also be set by a (?X) option  setting
     within a pattern.


     By default, PCRE treats the subject string as consisting  of
     a  single "line" of characters (even if it actually contains
     several newlines). The "start  of  line"  metacharacter  (^)
     matches  only  at the start of the string, while the "end of
     line" metacharacter ($) matches  only  at  the  end  of  the
     string,    or   before   a   terminating   newline   (unless
     PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

     When PCRE_MULTILINE is set, the "start of line" and "end
     of  line"  constructs match immediately following or immedi-
     ately before any newline  in  the  subject  string,  respec-
     tively,  as  well  as  at  the  very  start and end. This is
     equivalent to Perl's /m option, and it can be changed within
     a  pattern  by  a  (?m) option setting. If there are no "\n"
     characters in a subject string, or no occurrences of ^ or  $
     in a pattern, setting PCRE_MULTILINE has no effect.


     If this option is set, it disables the use of numbered  cap-
     turing  parentheses  in the pattern. Any opening parenthesis
     that is not followed by ? behaves as if it were followed  by
     ?:  but  named  parentheses  can still be used for capturing
     (and they acquire numbers in the usual  way).  There  is  no
     equivalent of this option in Perl.


     This option inverts the "greediness" of the  quantifiers  so
     that  they  are  not greedy by default, but become greedy if
     followed by "?". It is not compatible with Perl. It can also
     be set by a (?U) option setting within the pattern.


     This option causes PCRE to regard both the pattern  and  the
     subject  as  strings  of UTF-8 characters instead of single-
     byte character strings. However, it  is  available  only  if
     PCRE  has  been  built to include UTF-8 support. If not, the
     use of this option provokes an error. Details  of  how  this
     option  changes  the behaviour of PCRE are given in the sec-
     tion on UTF-8 support in the main pcre page.


     When PCRE_UTF8 is set, the validity  of  the  pattern  as  a
     UTF-8  string  is automatically checked. If an invalid UTF-8
     sequence of bytes is found, pcre_compile() returns an error.
     If you already know that your pattern is valid, and you want
     to skip this check for performance reasons, you can set  the
     PCRE_NO_UTF8_CHECK  option.  When  it  is set, the effect of
     passing an invalid UTF-8 string as a pattern  is  undefined.
     It  may  cause  your program to crash.  Note that there is a
     similar option  for  suppressing  the  checking  of  subject
     strings passed to pcre_exec().


     pcre_extra *pcre_study(const pcre *code, int options,
          const char **errptr);

     When a pattern is going to be  used  several  times,  it  is
     worth  spending  more time analyzing it in order to speed up
     the time taken for matching. The function pcre_study() takes
     a  pointer  to  a compiled pattern as its first argument. If
     studying the pattern produces additional information that
     will  help speed up matching, pcre_study() returns a pointer
     to a pcre_extra block, in which the study_data field  points
     to the results of the study.

     The value returned by pcre_study() can be passed
     directly  to pcre_exec(). However, the pcre_extra block also
     contains other fields that can be set by the  caller  before
     the  block is passed; these are described below. If studying
     the pattern does not  produce  any  additional  information,
     pcre_study() returns NULL. In that circumstance, if the cal-
     ling program wants to pass  some  of  the  other  fields  to
     pcre_exec(), it must set up its own pcre_extra block.

     The second argument contains option  bits.  At  present,  no
     options  are  defined  for  pcre_study(),  and this argument
     should always be zero.

     The third argument for pcre_study()  is  a  pointer  for  an
     error  message.  If  studying  succeeds  (even if no data is
     returned), the variable it points to is set to NULL.  Other-
     wise it points to a textual error message. You should there-
     fore  test  the  error  pointer  for  NULL   after   calling
     pcre_study(), to be sure that it has run successfully.

     This is a typical call to pcre_study():

       pcre_extra *pe;
       pe = pcre_study(
         re,             /* result of pcre_compile() */
         0,              /* no options exist */
         &error);        /* set to NULL or points to a message */

     At present, studying a  pattern  is  useful  only  for  non-
     anchored  patterns  that do not have a single fixed starting
     character. A bitmap of possible starting characters is
     created.


     PCRE handles caseless matching, and determines whether char-
     acters  are  letters, digits, or whatever, by reference to a
     set of tables. When running in UTF-8 mode, this applies only
     to characters with codes less than 256. The library contains
     a default set of tables that is created  in  the  default  C
     locale  when  PCRE  is compiled. This is used when the final
     argument of pcre_compile() is NULL, and  is  sufficient  for
     many applications.

     An alternative set of tables can, however, be supplied. Such
     tables  are built by calling the pcre_maketables() function,
     which has no arguments, in the relevant locale.  The  result
     can  then be passed to pcre_compile() as often as necessary.
     For example, to build and use tables  that  are  appropriate
     for  the French locale (where accented characters with codes
     greater than 128 are treated as letters), the following code
     could be used:

       setlocale(LC_CTYPE, "fr");
       tables = pcre_maketables();
       re = pcre_compile(..., tables);

     The  tables  are  built  in  memory  that  is  obtained  via
     pcre_malloc.  The  pointer that is passed to pcre_compile is
     saved with the compiled pattern, and  the  same  tables  are
     used via this pointer by pcre_study() and pcre_exec(). Thus,
     for any single pattern, compilation, studying  and  matching
     all happen in the same locale, but different patterns can be
     compiled in different locales. It is the caller's  responsi-
     bility  to  ensure  that  the  memory  containing the tables
     remains available for as long as it is needed.


     int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
          int what, void *where);

     The pcre_fullinfo() function  returns  information  about  a
     compiled pattern. It replaces the obsolete pcre_info() function,
     which is nevertheless retained for backwards compatibility
     (and is documented below).

     The first argument for pcre_fullinfo() is a pointer  to  the
     compiled  pattern.  The  second  argument  is  the result of
     pcre_study(), or NULL if the pattern was  not  studied.  The
     third  argument  specifies  which  piece  of  information is
     required, and the fourth argument is a pointer to a variable
     to  receive  the data. The yield of the function is zero for
     success, or one of the following negative numbers:

       PCRE_ERROR_NULL       the argument code was NULL
                             the argument where was NULL
       PCRE_ERROR_BADMAGIC   the "magic number" was not found
       PCRE_ERROR_BADOPTION  the value of what was invalid

     Here is a typical call of  pcre_fullinfo(),  to  obtain  the
     length of the compiled pattern:

       int rc;
       unsigned long int length;
       rc = pcre_fullinfo(
         re,               /* result of pcre_compile() */
         pe,               /* result of pcre_study(), or NULL */
         PCRE_INFO_SIZE,   /* what is required */
         &length);         /* where to put the data */

     The possible values for the third argument  are  defined  in
     pcre.h, and are as follows:


       PCRE_INFO_BACKREFMAX

     Return the number of the highest back reference in the  pat-
     tern.  The  fourth argument should point to an int variable.
     Zero is returned if there are no back references.


       PCRE_INFO_CAPTURECOUNT

     Return the number of capturing subpatterns in  the  pattern.
     The fourth argument should point to an int variable.


       PCRE_INFO_FIRSTBYTE

     Return information about  the  first  byte  of  any  matched
     string,  for a non-anchored pattern. (This option used to be
     called PCRE_INFO_FIRSTCHAR; the old name is still recognized
     for backwards compatibility.)

     If there is a fixed first byte, e.g. from a pattern such  as
     (cat|cow|coyote),  it  is returned in the integer pointed to
     by where. Otherwise, if either

     (a) the pattern was compiled with the PCRE_MULTILINE option,
     and every branch starts with "^", or

     (b) every  branch  of  the  pattern  starts  with  ".*"  and
     PCRE_DOTALL is not set (if it were set, the pattern would be
     anchored),

     -1 is returned, indicating that the pattern matches only  at
     the  start  of  a subject string or after any newline within
     the string. Otherwise -2 is returned. For anchored patterns,
     -2 is returned.
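
     The three kinds of result described above can be told  apart
     with  a  small  helper.  This is only a sketch: the enum and
     the function name are hypothetical and are not part of PCRE;
     the  argument  is assumed to be the integer that was written
     through the where pointer.

```c
#include <assert.h>

/* Hypothetical names, not part of PCRE. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte itself     */
    FIRST_BYTE_LINE_START,  /* -1: start of subject or after a newline  */
    FIRST_BYTE_NONE         /* -2: nothing recorded, or anchored        */
};

/* Classify the integer stored by the first-byte query. */
enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2);    /* only -2, -1 and byte values are described */
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

     A caller might use FIRST_BYTE_FIXED to scan the subject  for
     that  byte  before  calling  pcre_exec();  that  use  is  an
     illustration, not something this document prescribes.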


       PCRE_INFO_FIRSTTABLE

     If the pattern was studied, and this resulted  in  the  con-
     struction of a 256-bit table indicating a fixed set of bytes
     for the first byte in any matching string, a pointer to  the
     table  is  returned.  Otherwise NULL is returned. The fourth
     argument should point to an unsigned char * variable.
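
     A 256-bit table occupies 32 bytes.  The  following  sketch
     tests  whether  a  given byte is in the set. The bit layout
     used here (byte value c at bit c % 8 of element c / 8) is an
     assumption  that  this document does not state, and the fun-
     ction name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout (not stated in this document): 32 bytes, where byte
   value c is represented by bit (c % 8) of element (c / 8). */
int first_table_allows(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] >> (c % 8)) & 1;
}
```

     If the pointer that was returned is NULL, no table was built
     and the helper must not be called.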


       PCRE_INFO_LASTLITERAL

     Return the value of the rightmost  literal  byte  that  must
     exist  in  any  matched  string, other than at its start, if
     such a byte has been recorded. The  fourth  argument  should
     point  to  an  int variable. If there is no such byte, -1 is
     returned. For anchored patterns,  a  last  literal  byte  is
     recorded  only  if  it follows something of variable length.
     For example, for the pattern /^a\d+z\d+/ the returned  value
     is "z", but for /^a\dz\d/ the returned value is -1.


       PCRE_INFO_NAMECOUNT
       PCRE_INFO_NAMEENTRYSIZE
       PCRE_INFO_NAMETABLE

     PCRE supports the use of named as well as numbered capturing
     parentheses. The names are just an additional way of identi-
     fying the parentheses,  which  still  acquire  a  number.  A
     caller  that  wants  to extract data from a named subpattern
     must convert the name to a number in  order  to  access  the
     correct  pointers  in  the  output  vector  (described  with
     pcre_exec() below). In order to do this, it must  first  use
     these  three  values  to  obtain  the name-to-number mapping
     table for the pattern.

     The  map  consists  of  a  number  of  fixed-size   entries.
     PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and
     PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both
     of  these return an int value. The entry size depends on the
     length of the longest name.  PCRE_INFO_NAMETABLE  returns  a
     pointer to the first entry of the table (a pointer to char).
     The first two bytes of each entry are the number of the cap-
     turing parenthesis, most significant byte first. The rest of
     the entry is the corresponding name,  zero  terminated.  The
     names  are  in alphabetical order. For example, consider the
     following pattern (assume PCRE_EXTENDED  is  set,  so  white
     space - including newlines - is ignored):

       (?P<date> (?P<year>(\d\d)?\d\d) -
       (?P<month>\d\d) - (?P<day>\d\d) )

     There are four named subpatterns,  so  the  table  has  four
     entries,  and  each  entry in the table is eight bytes long.
     The table is as follows, with non-printing  bytes  shown  in
     hex, and undefined bytes shown as ??:

       00 01 d  a  t  e  00 ??
       00 05 d  a  y  00 ?? ??
       00 04 m  o  n  t  h  00
       00 02 y  e  a  r  00 ??

     When writing code to extract data  from  named  subpatterns,
     remember  that the length of each entry may be different for
     each compiled pattern.
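
     The table layout described above can be searched with a  few
     lines  of  C.  This  is a minimal sketch, not PCRE's own code:
     lookup_name() is a hypothetical helper, and its  three  table
     arguments  are  assumed  to be the values obtained for the name
     table pointer, the entry count and the entry size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a name table laid out as described above.
   Returns the capturing parenthesis number, or -1 if the name is absent. */
int lookup_name(const char *table, int name_count, int entry_size,
                const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        /* Bytes 0-1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return number;
    }
    return -1;
}
```

     The search is linear for simplicity. Because the  names  are
     in alphabetical order, a binary search would also work.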


       PCRE_INFO_OPTIONS

     Return a copy of the options with which the pattern was com-
     piled.  The fourth argument should point to an unsigned long
     int variable. These option bits are those specified  in  the
     call  to  pcre_compile(),  modified  by any top-level option
     settings within the pattern itself.

     A pattern is automatically anchored by PCRE if  all  of  its
     top-level alternatives begin with one of the following:

       ^     unless PCRE_MULTILINE is set
       \A    always
       \G    always
       .*    if PCRE_DOTALL is set and there are no back
               references to the subpattern in which .* appears

     For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the
     options returned by pcre_fullinfo().


       PCRE_INFO_SIZE

     Return the size of the compiled pattern, that is, the  value
     that  was  passed as the argument to pcre_malloc() when PCRE
     was getting memory in which to place the compiled data.  The
     fourth argument should point to a size_t variable.


       PCRE_INFO_STUDYSIZE

     Return the size  of  the  data  block  pointed  to  by   the
     study_data  field  in a pcre_extra block. That is, it is the
     value that was passed to pcre_malloc() when PCRE was getting
     memory into which to place the data created by pcre_study().
     The fourth argument should point to a size_t variable.


     int pcre_info(const pcre *code, int *optptr,
          int *firstcharptr);

     The pcre_info() function is now obsolete because its  inter-
     face  is  too  restrictive  to return all the available data
     about  a  compiled  pattern.   New   programs   should   use
     pcre_fullinfo()  instead.  The  yield  of pcre_info() is the
     number of capturing subpatterns, or  one  of  the  following
     negative numbers:

       PCRE_ERROR_NULL       the argument code was NULL
       PCRE_ERROR_BADMAGIC   the "magic number" was not found

     If the optptr argument is not NULL, a copy  of  the  options
     with which the pattern was compiled is placed in the integer
     it points to (see PCRE_INFO_OPTIONS above).

     If the pattern is not anchored and the firstcharptr argument
     is  not  NULL, it is used to pass back information about the
     first character of any matched string (see  the  description
     of the first-byte information for pcre_fullinfo() above).


     int pcre_exec(const pcre *code, const pcre_extra *extra,
          const char *subject, int length, int startoffset,
          int options, int *ovector, int ovecsize);

     The function pcre_exec() is called to match a subject string
     against  a pre-compiled pattern, which is passed in the code
     argument. If the pattern has been studied, the result of the
     study should be passed in the extra argument.

     Here is an example of a simple call to pcre_exec():

       int rc;
       int ovector[30];
       rc = pcre_exec(
         re,             /* result of pcre_compile() */
         NULL,           /* we didn't study the pattern */
         "some string",  /* the subject string */
         11,             /* the length of the subject string */
         0,              /* start at offset 0 in the subject */
         0,              /* default options */
         ovector,        /* vector for substring information */
         30);            /* number of elements in the vector */

     If the extra argument is  not  NULL,  it  must  point  to  a
     pcre_extra  data  block.  The  pcre_study() function returns
     such a block (when it doesn't return NULL), but you can also
     create  one for yourself, and pass additional information in
     it. The fields in the block are as follows:

       unsigned long int flags;
       void *study_data;
       unsigned long int match_limit;
       void *callout_data;

     The flags field is a bitmap  that  specifies  which  of  the
     other fields are set. The flag bits are:

       PCRE_EXTRA_STUDY_DATA
       PCRE_EXTRA_MATCH_LIMIT
       PCRE_EXTRA_CALLOUT_DATA


     Other flag bits should be set to zero. The study_data  field
     is   set  in  the  pcre_extra  block  that  is  returned  by
     pcre_study(), together with the appropriate  flag  bit.  You
     should  not  set this yourself, but you can add to the block
     by setting the other fields.

     The match_limit field provides a means  of  preventing  PCRE
     from  using  up a vast amount of resources when running pat-
     terns that are not going to match, but  which  have  a  very
     large  number  of  possibilities  in their search trees. The
     classic example is the  use  of  nested  unlimited  repeats.
     Internally,  PCRE  uses  a  function called match() which it
     calls  repeatedly  (sometimes  recursively).  The  limit  is
     imposed  on the number of times this function is called dur-
     ing a match, which has the effect of limiting the amount  of
     recursion and backtracking that can take place. For patterns
     that are not anchored, the count starts from zero  for  each
     position in the subject string.

     The default limit for the library can be set  when  PCRE  is
     built;  the  built-in  default  is 10 million, which handles
     all but the most extreme cases. You can reduce  the  default
     by  supplying  pcre_exec()  with a pcre_extra block in which
     match_limit   is   set   to    a    smaller    value,    and
     PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the
     limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

     The callout_data field is used in conjunction with the "cal-
     lout"  feature,  which is described in the pcrecallout docu-
     mentation.

     The PCRE_ANCHORED option can be passed in the options  argu-
     ment,   whose   unused   bits  must  be  zero.  This  limits
     pcre_exec() to matching at the first matching position. How-
     ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or
     turned out to be anchored by virtue of its contents, it can-
     not be made unanchored at matching time.

     When PCRE_UTF8 was set at compile time, the validity of  the
     subject  as  a  UTF-8 string is automatically checked. If an
     invalid  UTF-8  sequence  of  bytes  is  found,  pcre_exec()
     returns  the  error  PCRE_ERROR_BADUTF8. If you already know
     that your subject is valid, and you want to skip this  check
     for  performance reasons, you can set the PCRE_NO_UTF8_CHECK
     option when calling pcre_exec(). When this  option  is  set,
     the  effect  of passing an invalid UTF-8 string as a subject
     is undefined. It may cause your program to crash.

     There are also three further options that can be set only at
     matching time:


       PCRE_NOTBOL

     The first character of the string is not the beginning of  a
     line,  so  the  circumflex  metacharacter  should  not match
     before it. Setting this without PCRE_MULTILINE  (at  compile
     time) causes circumflex never to match.


       PCRE_NOTEOL

     The end of the string is not the end of a line, so the  dol-
     lar  metacharacter should not match it nor (except in multi-
     line mode) a newline immediately  before  it.  Setting  this
     without PCRE_MULTILINE (at compile time) causes dollar never
     to match.


       PCRE_NOTEMPTY

     An empty string is not considered to be  a  valid  match  if
     this  option  is  set. If there are alternatives in the pat-
     tern, they are tried. If  all  the  alternatives  match  the
     empty  string,  the  entire match fails. For example, if the
     pattern

       a?b?

     is applied to a string not beginning with  "a"  or  "b",  it
     matches  the  empty string at the start of the subject. With
     PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
     further into the string for occurrences of "a" or "b".

     Perl has no direct equivalent of PCRE_NOTEMPTY, but it  does
     make  a  special case of a pattern match of the empty string
     within its split() function, and when using the /g modifier.
     It  is possible to emulate Perl's behaviour after matching a
     null string by first trying the  match  again  at  the  same
     offset  with  PCRE_NOTEMPTY  set,  and then if that fails by
     advancing the starting offset  (see  below)  and  trying  an
     ordinary match again.

     The subject string is passed to pcre_exec() as a pointer  in
     subject,  a length in length, and a starting offset in star-
     toffset. Unlike the pattern string, the subject may  contain
     binary  zero  bytes.  When  the starting offset is zero, the
     search for a match starts at the beginning of  the  subject,
     and this is by far the most common case.

     If the pattern was compiled with the PCRE_UTF8  option,  the
     subject  must  be  a sequence of bytes that is a valid UTF-8
     string. As  described  above,  validity  is  checked  unless
     PCRE_NO_UTF8_CHECK  is  set;  with  that  option set, PCRE's
     behaviour on an invalid string is not defined.

     A non-zero starting offset  is  useful  when  searching  for
     another  match  in  the  same subject by calling pcre_exec()
     again after a previous success.  Setting startoffset differs
     from  just  passing  over  a  shortened  string  and setting
     PCRE_NOTBOL in the case of a pattern that  begins  with  any
     kind of lookbehind. For example, consider the pattern

       \Biss\B


     which finds occurrences of "iss" in the middle of words. (\B
     matches only if the current position in the subject is not a
     word boundary.) When applied to the string "Mississippi" the
     first  call  to  pcre_exec()  finds the first occurrence. If
     pcre_exec() is called again with just the remainder  of  the
     subject, namely "issippi", it does not match, because \B  is
     always false at the start of the subject, which is deemed to
     be  a  word  boundary. However, if pcre_exec() is passed the
     entire string again, but with startoffset set to 4, it finds
     the  second  occurrence  of "iss" because it is able to look
     behind the starting point to discover that it is preceded by
     a letter.

     If a non-zero starting offset is passed when the pattern  is
     anchored, one attempt to match at the given offset is tried.
     This can only succeed if the pattern does  not  require  the
     match to be at the start of the subject.

     In general, a pattern matches a certain portion of the  sub-
     ject,  and  in addition, further substrings from the subject
     may be picked out by parts of  the  pattern.  Following  the
     usage  in  Jeffrey Friedl's book, this is called "capturing"
     in what follows, and the phrase  "capturing  subpattern"  is
     used for a fragment of a pattern that picks out a substring.
     PCRE supports several other kinds of  parenthesized  subpat-
     tern that do not cause substrings to be captured.

     Captured substrings are returned to the caller via a  vector
     of  integer  offsets whose address is passed in ovector. The
     number of elements in the vector is passed in ovecsize.  The
     first two-thirds of the vector is used to pass back captured
     substrings, each substring using a  pair  of  integers.  The
     remaining  third  of  the  vector  is  used  as workspace by
     pcre_exec() while matching capturing subpatterns, and is not
     available for passing back information. The length passed in
     ovecsize should always be a multiple of three. If it is not,
     it is rounded down.

     When a match has been successful, information about captured
     substrings is returned in pairs of integers, starting at the
     beginning of ovector, and continuing up to two-thirds of its
     length  at  the  most. The first element of a pair is set to
     the offset of the first character in a  substring,  and  the
     second is set to the offset of the first character after the
     end of a substring. The first  pair,  ovector[0]  and  ovec-
     tor[1],  identify  the portion of the subject string matched
     by the entire pattern. The next pair is used for  the  first
     capturing  subpattern,  and  so  on.  The  value returned by
     pcre_exec() is the number of pairs that have  been  set.  If
     there  are no capturing subpatterns, the return value from a
     successful match is 1, indicating that just the  first  pair
     of offsets has been set.

     Some convenience functions are provided for  extracting  the
     captured substrings as separate strings. These are described
     in the following section.

     It is possible for a capturing  subpattern  number  n+1   to
     match  some  part  of  the subject when subpattern n has not
     been used at all.  For  example,  if  the  string  "abc"  is
     matched  against the pattern (a|(z))(bc) subpatterns 1 and 3
     are matched, but 2 is not. When this  happens,  both  offset
     values corresponding to the unused subpattern are set to -1.
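
     The offset pairs can be used directly to copy  a  substring.
     The  sketch  below  shows  one  way  to  do  so,  treating a
     negative  offset  as "unset". copy_capture() is a hypothet-
     ical helper written for this illustration;  it  is  not  the
     library's pcre_copy_substring(), and its return codes (-1 and
     -2) are invented here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy captured substring n out of subject.
   pair_count is the positive value returned by a successful match.
   Returns the substring length, -1 if the substring is unset or out
   of range, or -2 if the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int pair_count,
                 int n, char *buffer, int buffer_size)
{
    int start, end, length;
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (n < 0 || n >= pair_count)
        return -1;                  /* no such pair was set */
    start = ovector[2 * n];         /* offset of first character */
    end = ovector[2 * n + 1];       /* offset just past the last one */
    if (start < 0 || end < start)
        return -1;                  /* unset subpattern: offsets are -1 */
    length = end - start;
    if (length + 1 > buffer_size)
        return -2;                  /* no room for text plus final zero */
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

     For the example above ("abc" matched  against  (a|(z))(bc)),
     the  pairs would be 0,3 for the whole match, 0,1 for subpat-
     tern 1, -1,-1 for subpattern 2, and 1,3  for  subpattern  3.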

     If a capturing subpattern is matched repeatedly, it  is  the
     last  portion  of  the  string  that  it  matched  that gets
     returned.

     If the vector is too small to hold  all  the  captured  sub-
     strings,  it is used as far as possible (up to two-thirds of
     its length), and the function returns a value  of  zero.  In
     particular,  if  the  substring offsets are not of interest,
     pcre_exec() may be called with ovector passed  as  NULL  and
     ovecsize  as  zero.  However,  if  the pattern contains back
     references and the ovector isn't big enough to remember  the
     related  substrings,  PCRE  has to get additional memory for
     use during matching. Thus it is usually advisable to  supply
     an ovector.

     Note that pcre_info() can be used to find out how many  cap-
     turing  subpatterns  there  are  in  a compiled pattern. The
     smallest size for ovector that will  allow  for  n  captured
     substrings,  in  addition  to  the  offsets of the substring
     matched by the whole pattern, is (n+1)*3.
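
     The sizing rule, and the rounding down of a  size  that  is
     not  a  multiple of three, amount to the following arithme-
     tic. Both function names are hypothetical.

```c
#include <assert.h>

/* Smallest ovector size (in ints) that can return the whole match
   plus capture_count captured substrings: (n+1)*3. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize ints can return.
   Only the first two-thirds holds pairs, and a size that is not a
   multiple of three is rounded down. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

     For instance, the 30-element vector in the pcre_exec() exam-
     ple  above  has  room for the whole match plus nine captured
     substrings.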

     If pcre_exec() fails, it returns a negative number. The fol-
     lowing are defined in the header file:

       PCRE_ERROR_NOMATCH        (-1)

     The subject string did not match the pattern.

       PCRE_ERROR_NULL           (-2)

     Either code or subject was passed as NULL,  or  ovector  was
     NULL and ovecsize was not zero.

       PCRE_ERROR_BADOPTION      (-3)

     An unrecognized bit was set in the options argument.

       PCRE_ERROR_BADMAGIC       (-4)

     PCRE stores a 4-byte "magic number" at the start of the com-
     piled  code,  to  catch  the  case  when it is passed a junk
     pointer. This is the error it gives when  the  magic  number
     isn't present.


       PCRE_ERROR_UNKNOWN_NODE   (-5)

     While running the pattern match, an unknown item was encoun-
     tered in the compiled pattern. This error could be caused by
     a bug in PCRE or by overwriting of the compiled pattern.

       PCRE_ERROR_NOMEMORY       (-6)

     If a pattern contains back references, but the ovector  that
     is  passed  to pcre_exec() is not big enough to remember the
     referenced substrings, PCRE gets a block of  memory  at  the
     start  of  matching to use for this purpose. If the call via
     pcre_malloc() fails, this error  is  given.  The  memory  is
     freed at the end of matching.


       PCRE_ERROR_NOSUBSTRING    (-7)

     This   error   is   used   by   the   pcre_copy_substring(),
     pcre_get_substring(),  and  pcre_get_substring_list()  func-
     tions (see below). It is never returned by pcre_exec().


       PCRE_ERROR_MATCHLIMIT     (-8)

     The recursion and backtracking limit, as  specified  by  the
     match_limit  field  in a pcre_extra structure (or defaulted)
     was reached. See the description above.

       PCRE_ERROR_CALLOUT        (-9)

     This error is never generated by pcre_exec() itself.  It  is
     provided  for  use by callout functions that want to yield a
     distinctive error code. See  the  pcrecallout  documentation
     for details.

       PCRE_ERROR_BADUTF8       (-10)

     A string that contains an invalid UTF-8  byte  sequence  was
     passed as a subject.


     int pcre_copy_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber, char *buffer,
          int buffersize);

     int pcre_get_substring(const char *subject, int *ovector,
          int stringcount, int stringnumber,
          const char **stringptr);

     int pcre_get_substring_list(const char *subject,
          int *ovector, int stringcount, const char ***listptr);

     Captured substrings can be accessed directly  by  using  the
     offsets returned by pcre_exec() in ovector. For convenience,
     the functions  pcre_copy_substring(),  pcre_get_substring(),
     and  pcre_get_substring_list()  are  provided for extracting
     captured  substrings  as  new,   separate,   zero-terminated
     strings.  These functions identify substrings by number. The
     next section describes functions for extracting  named  sub-
     strings.   A  substring  that  contains  a  binary  zero  is
     correctly extracted and has a further zero added on the end,
     but the result is not, of course, a C string.

     The first three arguments are the  same  for  all  three  of
     these  functions:   subject  is the subject string which has
     just been successfully matched, ovector is a pointer to  the
     vector  of  integer  offsets that was passed to pcre_exec(),
     and stringcount is the number of substrings that  were  cap-
     tured by the match, including the substring that matched the
     entire regular expression. This is  the  value  returned  by
     pcre_exec() if it is  greater  than  zero.  If  pcre_exec()
     returned zero, indicating that it ran out of space in  ovec-
     tor,  the  value passed as stringcount should be the size of
     the vector divided by three.

     The functions pcre_copy_substring() and pcre_get_substring()
     extract a single substring, whose number is given as string-
     number. A value of zero extracts the substring that  matched
     the entire pattern, while higher values extract the captured
     substrings. For pcre_copy_substring(), the string is  placed
     in  buffer,  whose  length is given by buffersize, while for
     pcre_get_substring() a new block of memory is  obtained  via
     pcre_malloc,  and its address is returned via stringptr. The
     yield of the function is  the  length  of  the  string,  not
     including the terminating zero, or one of

       PCRE_ERROR_NOMEMORY       (-6)

     The buffer was too small for pcre_copy_substring(),  or  the
     attempt to get memory failed for pcre_get_substring().


       PCRE_ERROR_NOSUBSTRING    (-7)

     There is no substring whose number is stringnumber.

     The pcre_get_substring_list() function extracts  all  avail-
     able  substrings  and builds a list of pointers to them. All
     this is done in a single block of memory which  is  obtained
     via pcre_malloc. The address of the memory block is returned
     via listptr, which is also the start of the list  of  string
     pointers.  The  end of the list is marked by a NULL pointer.
     The yield of the function is zero if all went well, or

       PCRE_ERROR_NOMEMORY       (-6)

     if the attempt to get the memory block failed.

     When any of these functions encounter a  substring  that  is
     unset, which can happen when capturing subpattern number n+1
     matches some part of the subject, but subpattern n  has  not
     been  used  at all, they return an empty string. This can be
     distinguished  from  a  genuine  zero-length  substring   by
     inspecting the appropriate offset in ovector, which is nega-
     tive for unset substrings.

     The  two  convenience  functions  pcre_free_substring()  and
     pcre_free_substring_list()  can  be  used to free the memory
     returned by  a  previous  call  of  pcre_get_substring()  or
     pcre_get_substring_list(),  respectively.  They  do  nothing
     more than call the function pointed to by  pcre_free,  which
     of  course  could  be called directly from a C program. How-
     ever, PCRE is used in some situations where it is linked via
     a  special  interface  to another programming language which
     cannot use pcre_free directly; it is for  these  cases  that
     the functions are provided.


     int pcre_copy_named_substring(const pcre *code,
          const char *subject, int *ovector,
          int stringcount, const char *stringname,
          char *buffer, int buffersize);

     int pcre_get_stringnumber(const pcre *code,
          const char *name);

     int pcre_get_named_substring(const pcre *code,
          const char *subject, int *ovector,
          int stringcount, const char *stringname,
          const char **stringptr);

     To extract a substring by name, you first have to  find  the
     associated   number.   This   can   be   done   by   calling
     pcre_get_stringnumber(). The first argument is the  compiled
     pattern,  and  the second is the name. For example, for this
     pattern

       ab(?P<xxx>\d+)...


     the number of the subpattern called "xxx" is  1.  Given  the
     number,  you can then extract the substring directly, or use
     one of the functions described in the previous section.  For
     convenience,  there are also two functions that do the whole
     job.

     Most of the  arguments  of  pcre_copy_named_substring()  and
     pcre_get_named_substring()  are  the  same  as those for the
     functions that  extract  by  number,  and  so  are  not  re-
     described here. There are just two differences.

     First, instead of a substring number, a  substring  name  is
     given.  Second,  there  is  an  extra argument, given at the
     start, which is a pointer to the compiled pattern.  This  is
     needed  in order to gain access to the name-to-number trans-
     lation table.

     These functions  call  pcre_get_stringnumber(),  and  if  it
     succeeds,    they   then   call   pcre_copy_substring()   or
     pcre_get_substring(), as appropriate.

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     int (*pcre_callout)(pcre_callout_block *);

     PCRE provides a feature called "callout", which is  a  means
     of  temporarily passing control to the caller of PCRE in the
     middle of pattern matching. The caller of PCRE  provides  an
     external  function  by putting its entry point in the global
     variable pcre_callout. By default,  this  variable  contains
     NULL, which disables all calling out.

     Within a regular expression, (?C) indicates  the  points  at
     which  the external function is to be called. Different cal-
     lout points can be identified by putting a number less  than
     256  after  the  letter  C.  The default value is zero.  For
     example, this pattern has two callout points:

       (?C1)\dabc(?C2)def


     During matching, when PCRE  reaches  a  callout  point  (and
     pcre_callout  is  set), the external function is called. Its
     only argument is a pointer to  a  pcre_callout_block.   This
     contains the following variables:

       int          version;
       int          callout_number;
       int         *offset_vector;
       const char  *subject;
       int          subject_length;
       int          start_match;
       int          current_position;
       int          capture_top;
       int          capture_last;
       void        *callout_data;

     The version field  is  an  integer  containing  the  version
     number of the block format. The current version is zero. The
     version number may change in future if additional fields are
     added,  but  the  intention  is  never  to remove any of the
     existing fields.

     The callout_number field contains the number of the callout,
     as compiled into the pattern (that is, the number after ?C).

     The offset_vector field  is  a  pointer  to  the  vector  of
     offsets  that  was  passed by the caller to pcre_exec(). The
     contents can be inspected in  order  to  extract  substrings
     that  have  been  matched  so  far,  in  the same way as for
     extracting substrings after a match has completed.

     The subject and subject_length fields contain copies of  the
     values that were passed to pcre_exec().

     The start_match field contains the offset within the subject
     at  which  the current match attempt started. If the pattern
     is not anchored, the callout function may be called  several
     times for different starting points.

     The current_position field contains the  offset  within  the
     subject of the current match pointer.

     The capture_top field contains one more than the  number  of
     the  highest  numbered captured substring so far. If no sub-
     strings have been captured, the value of capture_top is one.

     The capture_last field  contains  the  number  of  the  most
     recently captured substring.

     The callout_data field contains a value that  is  passed  to
     pcre_exec()  by  the  caller  specifically so that it can be
     passed back in callouts. It is passed  in  the  callout_data
     field  of the pcre_extra data structure. If no such data was
     passed, the value of callout_data in a pcre_callout_block is
     NULL.  There is a description of the pcre_extra structure in
     the pcreapi documentation.


     The callout function returns an integer.  If  the  value  is
     zero,  matching  proceeds as normal. If the value is greater
     than zero, matching fails at the current  point,  but  back-
     tracking  to test other possibilities goes ahead, just as if
     a lookahead assertion had failed. If the value is less  than
     zero,  the  match  is abandoned, and pcre_exec() returns the
     negative value.

     Negative values should normally be chosen from  the  set  of
     PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH
     forces a standard "no  match"  failure.   The  error  number
     PCRE_ERROR_CALLOUT is reserved for use by callout functions;
     it will never be used by PCRE itself.
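
     The three kinds of return value can be seen  in  a  callout
     function  such  as  the  one  sketched  below. So that it is
     self-contained, the sketch declares  its  own  stand-in  for
     the  block  (callout_block_sketch)  and  for  the error code
     (SKETCH_ERROR_CALLOUT, using the value -9 listed in the pcre-
     api  section);  real  code  would  use  the definitions from
     pcre.h. The policy itself (a call budget passed  in  callout_
     data,  and  a  veto of callout 1 at offset 0) is invented for
     illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the block described above, declared locally so that the
   sketch compiles without pcre.h. Real code would use the header's type. */
typedef struct callout_block_sketch {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} callout_block_sketch;

#define SKETCH_ERROR_CALLOUT (-9)   /* value listed for PCRE_ERROR_CALLOUT */

/* Hypothetical policy: callout_data points to an int holding the number
   of callouts still permitted. */
int budget_callout(callout_block_sketch *block)
{
    int *remaining;
    assert(block != NULL);
    remaining = (int *)block->callout_data;
    if (remaining == NULL)
        return 0;                       /* no data passed: proceed as normal */
    if (*remaining <= 0)
        return SKETCH_ERROR_CALLOUT;    /* negative: abandon the whole match */
    (*remaining)--;
    if (block->callout_number == 1 && block->start_match == 0)
        return 1;                       /* positive: fail here, may backtrack */
    return 0;                           /* zero: matching proceeds as normal */
}
```

     With the real types, a function of this shape would be  in-
     stalled by storing its address in pcre_callout.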

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     This document describes the differences  in  the  ways  that
     PCRE  and  Perl  handle regular expressions. The differences
     described here are with respect to Perl 5.8.

     1. PCRE does  not  allow  repeat  quantifiers  on  lookahead
     assertions. Perl permits them, but they do not mean what you
     might think. For example, (?!a){3} does not assert that  the
     next  three characters are not "a". It just asserts that the
     next character is not "a" three times.

     2. Capturing subpatterns that occur inside  negative  looka-
     head  assertions  are  counted,  but  their  entries  in the
     offsets vector are never set. Perl sets its numerical  vari-
     ables  from  any  such  patterns that are matched before the
     assertion fails to match something (thereby succeeding), but
     only  if  the negative lookahead assertion contains just one
     branch.

     3. Though binary zero characters are supported in  the  sub-
     ject  string,  they  are  not  allowed  in  a pattern string
     because it is passed as a normal  C  string,  terminated  by
     zero. The escape sequence "\0" can be used in the pattern to
     represent a binary zero.

     4. The following Perl escape sequences  are  not  supported:
     \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-
     mented by Perl's general string-handling and are not part of
     its pattern matching engine. If any of these are encountered
     by PCRE, an error is generated.

     5. PCRE does support the \Q...\E  escape  for  quoting  sub-
     strings. Characters in between are treated as literals. This
     is slightly different from Perl in that $  and  @  are  also
     handled  as  literals inside the quotes. In Perl, they cause
     variable interpolation (but of course  PCRE  does  not  have
     variables). Note the following examples:

         Pattern            PCRE matches      Perl matches

         \Qabc$xyz\E        abc$xyz           abc followed by the
                                                contents of $xyz
         \Qabc\$xyz\E       abc\$xyz          abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

     In PCRE, the \Q...\E mechanism is not  recognized  inside  a
     character class.

     8. Fairly obviously, PCRE does not support the (?{code}) and
     (?p{code})  constructions. However, there is some experimen-
     tal support for recursive patterns using the non-Perl  items
     (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"
     feature allows an external function to be called during pat-
     tern matching.

     9. There are some differences that are  concerned  with  the
     settings  of  captured  strings  when  part  of a pattern is
     repeated. For example, matching "aba"  against  the  pattern
     /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set
     to "b".

     10. PCRE  provides  some  extensions  to  the  Perl  regular
     expression facilities:

     (a) Although lookbehind assertions must match  fixed  length
     strings,  each  alternative branch of a lookbehind assertion
     can match a different length of string. Perl  requires  them
     all to have the same length.

     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not
     set,  the  $  meta-character matches only at the very end of
     the string.

     (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
     with no special meaning is faulted.

     (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-
     tion  quantifiers  is inverted, that is, by default they are
     not greedy, but if followed by a question mark they are.

     (e) PCRE_ANCHORED can be used to force a pattern to be tried
     only at the first matching position in the subject string.

     (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
     PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl
     equivalents.

     (g) The (?R), (?number), and (?P>name) constructs allow  for
     recursive  pattern  matching.  (Perl  can do this using the
     (?p{code}) construct, which PCRE cannot support.)

     (h) PCRE supports  named  capturing  substrings,  using  the
     Python syntax.

     (i) PCRE supports the  possessive  quantifier  "++"  syntax,
     taken from Sun's Java package.

     (j) The (R) condition, for  testing  recursion,  is  a  PCRE
     extension.

     (k) The callout facility is PCRE-specific.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     The syntax and semantics of  the  regular  expressions  sup-
     ported  by PCRE are described below. Regular expressions are
     also described in the Perl documentation and in a number  of
     other  books,  some  of which have copious examples. Jeffrey
     Friedl's  "Mastering  Regular  Expressions",  published   by
     O'Reilly,  covers them in great detail. The description here
     is intended as reference documentation.

     The basic operation of PCRE is on strings of bytes. However,
     there  is  also  support for UTF-8 character strings. To use
     this support you must build PCRE to include  UTF-8  support,
     and  then call pcre_compile() with the PCRE_UTF8 option. How
     this affects the pattern matching is  mentioned  in  several
     places  below.  There is also a summary of UTF-8 features in
     the section on UTF-8 support in the main pcre page.

     A regular expression is a pattern that is matched against  a
     subject string from left to right. Most characters stand for
     themselves in a pattern, and match the corresponding charac-
     ters in the subject. As a trivial example, the pattern

       The quick brown fox

     matches a portion of a subject string that is  identical  to
     itself.  The  power  of  regular  expressions comes from the
     ability to include alternatives and repetitions in the  pat-
     tern.  These  are encoded in the pattern by the use of meta-
     characters, which do not stand for  themselves  but  instead
     are interpreted in some special way.

     There are two different sets of meta-characters: those  that
     are  recognized anywhere in the pattern except within square
     brackets, and those that are recognized in square  brackets.
     Outside square brackets, the meta-characters are as follows:

       \      general escape character with several uses
       ^      assert start of string (or line, in multiline mode)
       $      assert end of string (or line, in multiline mode)
       .      match any character except newline (by default)
       [      start character class definition
       |      start of alternative branch
       (      start subpattern
       )      end subpattern
       ?      extends the meaning of (
              also 0 or 1 quantifier
              also quantifier minimizer
       *      0 or more quantifier
       +      1 or more quantifier
              also "possessive quantifier"
       {      start min/max quantifier

     Part of a pattern that is in square  brackets  is  called  a
     "character  class".  In  a  character  class  the only meta-
     characters are:

       \      general escape character
       ^      negate the class, but only if the first character
       -      indicates character range
       [      POSIX character class (only if followed by POSIX
                syntax)
       ]      terminates the character class

     The following sections describe  the  use  of  each  of  the
     meta-characters.


     The backslash character has several uses. Firstly, if it  is
     followed  by  a  non-alphameric character, it takes away any
     special  meaning  that  character  may  have.  This  use  of
     backslash  as  an  escape  character applies both inside and
     outside character classes.

     For example, if you want to match a * character,  you  write
     \*  in the pattern.  This escaping action applies whether or
     not the following character would otherwise  be  interpreted
     as  a meta-character, so it is always safe to precede a non-
     alphameric with backslash to  specify  that  it  stands  for
     itself. In particular, if you want to match a backslash, you
     write \\.

     If a pattern is compiled with the PCRE_EXTENDED option, whi-
     tespace in the pattern (other than in a character class) and
     characters between a # outside a  character  class  and  the
     next  newline  character  are ignored. An escaping backslash
     can be used to include a whitespace or # character  as  part
     of the pattern.

     If you want to remove the special meaning from a sequence of
     characters, you can do so by putting them between \Q and \E.
     This is different from Perl in that $ and @ are  handled  as
     literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $
     and @ cause variable interpolation. Note the following exam-
     ples:

       Pattern            PCRE matches   Perl matches

       \Qabc$xyz\E        abc$xyz        abc followed by the
                                           contents of $xyz
       \Qabc\$xyz\E       abc\$xyz       abc\$xyz
       \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

     The \Q...\E sequence is recognized both inside  and  outside
     character classes.

     A second use of backslash provides a way  of  encoding  non-
     printing  characters  in patterns in a visible manner. There
     is no restriction on the appearance of non-printing  charac-
     ters,  apart from the binary zero that terminates a pattern,
     but when a pattern is being prepared by text editing, it  is
     usually  easier to use one of the following escape sequences
     than the binary character it represents:

       \a        alarm, that is, the BEL character (hex 07)
       \cx       "control-x", where x is any character
       \e        escape (hex 1B)
       \f        formfeed (hex 0C)
       \n        newline (hex 0A)
       \r        carriage return (hex 0D)
       \t        tab (hex 09)
       \ddd      character with octal code ddd, or backreference
       \xhh      character with hex code hh
       \x{hhh..} character with hex code hhh... (UTF-8 mode only)

     The precise effect of \cx is as follows: if  x  is  a  lower
     case  letter,  it  is converted to upper case. Then bit 6 of
     the character (hex 40) is inverted.  Thus  \cz  becomes  hex
     1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
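
     As a minimal sketch of the \cx rule just described, the
     following Python function applies the two steps in order.
     The name control_escape is hypothetical and is not part of
     PCRE; the function only reproduces the arithmetic.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx would produce for the character x."""
    if "a" <= x <= "z":
        x = x.upper()        # a lower case letter is converted to upper case
    return ord(x) ^ 0x40     # then bit 6 (hex 40) is inverted

for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
# \cz -> hex 1A
# \c{ -> hex 3B
# \c; -> hex 7B
```

     The printed values agree with the three cases given in the
     paragraph above.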

     After \x, from zero  to  two  hexadecimal  digits  are  read
     (letters  can be in upper or lower case). In UTF-8 mode, any
     number of hexadecimal digits may appear between \x{  and  },
     but  the value of the character code must be less than 2**31
     (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If
     characters  other than hexadecimal digits appear between \x{
     and }, or if there is no terminating }, this form of  escape
     is  not  recognized.  Instead, the initial \x will be inter-
     preted as a basic  hexadecimal  escape,  with  no  following
     digits, giving a byte whose value is zero.

     Characters whose value is less than 256 can  be  defined  by
     either  of  the  two  syntaxes  for \x when PCRE is in UTF-8
     mode. There is no difference in the way  they  are  handled.
     For example, \xdc is exactly the same as \x{dc}.

     After \0 up to two further octal digits are  read.  In  both
     cases,  if  there are fewer than two digits, just those that
     are present are used. Thus the  sequence  \0\x\07  specifies
     two binary zeros followed by a BEL character (code value 7).
     Make sure you supply two digits after the  initial  zero  if
     the character that follows is itself an octal digit.

     The handling of a backslash followed by a digit other than 0
     is  complicated.   Outside  a character class, PCRE reads it
     and any following digits as a decimal number. If the  number
     is  less  than  10, or if there have been at least that many
     previous capturing left parentheses in the  expression,  the
     entire  sequence is taken as a back reference. A description
     of how this works is given later, following  the  discussion
     of parenthesized subpatterns.

     Inside a character  class,  or  if  the  decimal  number  is
     greater  than  9 and there have not been that many capturing
     subpatterns, PCRE re-reads up to three octal digits  follow-
     ing  the  backslash,  and  generates  a single byte from the
     least significant 8 bits of the value. Any subsequent digits
     stand for themselves.  For example:

       \040   is another way of writing a space
       \40    is the same, provided there are fewer than 40
                 previous capturing subpatterns
       \7     is always a back reference
       \11    might be a back reference, or another way of
                 writing a tab
       \011   is always a tab
       \0113  is a tab followed by the character "3"
       \113   might be a back reference, otherwise the
                 character with octal code 113
       \377   might be a back reference, otherwise
                 the byte consisting entirely of 1 bits
       \81    is either a back reference, or a binary zero
                 followed by the two characters "8" and "1"

     Note that octal values of 100 or greater must not be  intro-
     duced  by  a  leading zero, because no more than three octal
     digits are ever read.
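
     The decision rule for a backslash followed by a non-zero
     digit can be sketched as below. This is an illustration of
     the rule as worded above, not PCRE's own code; the function
     name, its arguments, and the tuple it returns are all
     hypothetical.

```python
def interpret_backslash_digits(digits: str, captures_so_far: int,
                               in_class: bool = False):
    """Classify the decimal digits that follow a backslash.

    Assumes digits is non-empty, all decimal, and starts with 1-9.
    """
    if not in_class:
        number = int(digits)                  # read every digit as one decimal number
        if number < 10 or number <= captures_so_far:
            return ("backreference", number)
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = (int(octal, 8) & 0xFF) if octal else 0
    return ("byte", value, digits[len(octal):])   # leftover digits are literals

print(interpret_backslash_digits("7", 0))     # ('backreference', 7)
print(interpret_backslash_digits("11", 3))    # ('byte', 9, '')
print(interpret_backslash_digits("81", 0))    # ('byte', 0, '81')
```

     With eleven or more earlier capturing subpatterns, the same
     call for "11" would report a back reference instead of a
     tab.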

     All the sequences that define a single byte value or a  sin-
     gle  UTF-8 character (in UTF-8 mode) can be used both inside
     and outside character classes. In addition, inside a charac-
     ter  class,  the sequence \b is interpreted as the backspace
     character (hex 08). Outside a character class it has a  dif-
     ferent meaning (see below).

     The third use of backslash is for specifying generic charac-
     ter types:

       \d     any decimal digit
       \D     any character that is not a decimal digit
       \s     any whitespace character
       \S     any character that is not a whitespace character
       \w     any "word" character
       \W     any "non-word" character

     Each pair of escape sequences partitions the complete set of
     characters  into  two  disjoint  sets.  Any  given character
     matches one, and only one, of each pair.

     In UTF-8 mode, characters with values greater than 255 never
     match \d, \s, or \w, and always match \D, \S, and \W.

     For compatibility with Perl, \s does not match the VT  char-
     acter (code 11).  This makes it  different  from  the  POSIX
     "space" class. The \s characters are HT  (9),  LF  (10),  FF
     (12), CR (13), and space (32).

     A "word" character is any letter or digit or the  underscore
     character,  that  is,  any  character which can be part of a
     Perl "word". The definition of letters and  digits  is  con-
     trolled  by PCRE's character tables, and may vary if locale-
     specific matching is taking place (see "Locale  support"  in
     the pcreapi page). For example, in the "fr" (French) locale,
     some character codes greater than 128 are used for  accented
     letters, and these are matched by \w.

     These character type sequences can appear  both  inside  and
     outside  character classes. They each match one character of
     the appropriate type. If the current matching  point  is  at
     the end of the subject string, all of them fail, since there
     is no character to match.

     The fourth use of backslash is  for  certain  simple  asser-
     tions. An assertion specifies a condition that has to be met
     at a particular point in  a  match,  without  consuming  any
     characters  from  the subject string. The use of subpatterns
     for more complicated  assertions  is  described  below.  The
     backslashed assertions are

       \b     matches at a word boundary
       \B     matches when not at a word boundary
       \A     matches at start of subject
       \Z     matches at end of subject or before newline at end
       \z     matches at end of subject
       \G     matches at first matching position in subject

     These assertions may not appear in  character  classes  (but
     note  that  \b has a different meaning, namely the backspace
     character, inside a character class).

     A word boundary is a position in the  subject  string  where
     the current character and the previous character do not both
     match \w or \W (i.e. one matches \w and  the  other  matches
     \W),  or the start or end of the string if the first or last
     character matches \w, respectively.

     The \A, \Z, and \z assertions differ  from  the  traditional
     circumflex  and  dollar  (described below) in that they only
     ever match at the very start and end of the subject  string,
     whatever options are set. Thus, they are independent of mul-
     tiline mode.

     They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL
     options.  If the startoffset argument of pcre_exec() is non-
     zero, indicating that matching is to start at a point  other
     than  the  beginning of the subject, \A can never match. The
     difference between \Z and \z is that  \Z  matches  before  a
     newline  that is the last character of the string as well as
     at the end of the string, whereas \z  matches  only  at  the
     end.

     The \G assertion is true  only  when  the  current  matching
     position is at the start point of the match, as specified by
     the startoffset argument of pcre_exec(). It differs from  \A
     when  the  value  of  startoffset  is  non-zero.  By calling
     pcre_exec() multiple times with appropriate  arguments,  you
     can mimic Perl's /g option, and it is in this kind of imple-
     mentation where \G can be useful.

     Note, however, that PCRE's  interpretation  of  \G,  as  the
     start of the current match, is subtly different from Perl's,
     which defines it as the end of the previous match. In  Perl,
     these  can  be  different when the previously matched string
     was empty. Because PCRE does just one match at  a  time,  it
     cannot reproduce this behaviour.
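
     The repeated-call technique can be pictured with the loop
     below. It is only a sketch: Python's re module stands in
     for PCRE, and the pos argument of match() plays the part of
     startoffset combined with a pattern that begins with \G.
     The helper name and the sample pattern are hypothetical.

```python
import re

def match_all_from_start(pattern: str, subject: str) -> list:
    """Collect consecutive matches, each forced to begin where the last one ended."""
    compiled = re.compile(pattern)
    pos = 0
    found = []
    while pos < len(subject):
        m = compiled.match(subject, pos)   # anchored at pos, like a leading \G
        if m is None or m.end() == pos:    # stop on failure or on an empty match
            break
        found.append(m.group())
        pos = m.end()                      # the next attempt starts here
    return found

tokens = match_all_from_start(r"\d+,?", "12,345,x,6")
print(tokens)   # ['12,', '345,']
```

     The loop stops at "x" because the match must begin exactly
     at the current offset; the later "6" is never reached.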

     If all the alternatives of a  pattern  begin  with  \G,  the
     expression  is  anchored to the starting match position, and
     the "anchored" flag is set in the compiled  regular  expres-
     sion.


     Outside a character class, in the default matching mode, the
     circumflex  character  is an assertion which is true only if
     the current matching point is at the start  of  the  subject
     string.  If  the startoffset argument of pcre_exec() is non-
     zero, circumflex  can  never  match  if  the  PCRE_MULTILINE
     option is unset. Inside a character class, circumflex has an
     entirely different meaning (see below).

     Circumflex need not be the first character of the pattern if
     a  number of alternatives are involved, but it should be the
     first thing in each alternative in which it appears  if  the
     pattern is ever to match that branch. If all possible alter-
     natives start with a circumflex, that is, if the pattern  is
     constrained to match only at the start of the subject, it is
     said to be an "anchored" pattern. (There are also other con-
     structs that can cause a pattern to be anchored.)

     A dollar character is an assertion which is true only if the
     current  matching point is at the end of the subject string,
     or immediately before a newline character that is  the  last
     character in the string (by default). Dollar need not be the
     last character of the pattern if a  number  of  alternatives
     are  involved,  but it should be the last item in any branch
     in which it appears.  Dollar has no  special  meaning  in  a
     character class.

     The meaning of dollar can be changed so that it matches only
     at   the   very   end   of   the   string,  by  setting  the
     PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not
     affect the \Z assertion.

     The meanings of the circumflex  and  dollar  characters  are
     changed  if  the  PCRE_MULTILINE option is set. When this is
     the case,  they  match  immediately  after  and  immediately
     before an internal newline character, respectively, in addi-
     tion to matching at the start and end of the subject string.
     For  example, the pattern /^abc$/ matches the subject string
     "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-
     quently,  patterns  that  are  anchored  in single line mode
     because all branches start with ^ are not anchored in multi-
     line  mode,  and a match for circumflex is possible when the
     startoffset  argument  of  pcre_exec()  is   non-zero.   The
     PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is
     set.

     Note that the sequences \A, \Z, and \z can be used to  match
     the  start  and end of the subject in both modes, and if all
     branches of a pattern start with \A it is  always  anchored,
     whether PCRE_MULTILINE is set or not.
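
     The effect of multiline mode on circumflex and dollar can be
     tried out as follows. This sketch uses Python's re module
     as a stand-in for PCRE, on the assumption that its
     re.MULTILINE flag behaves like PCRE_MULTILINE for these two
     characters.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)                # ^ and $ only at the ends
multi_line = re.search(r"^abc$", subject, re.MULTILINE)   # also around inner newlines
before_final_newline = re.search(r"abc$", "abc\n")        # $ before a last newline

print(single_line)                   # None
print(multi_line.span())             # (4, 7)
print(before_final_newline.span())   # (0, 3)
```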


     Outside a character class, a dot in the pattern matches  any
     one character in the subject, including a non-printing char-
     acter, but not (by default) newline.  In UTF-8 mode,  a  dot
     matches  any  UTF-8  character, which might be more than one
     byte  long,  except  (by  default)  for  newline.   If   the
     PCRE_DOTALL  option is set, dots match newlines as well. The
     handling of dot is entirely independent of the  handling  of
     circumflex and dollar, the only relationship being that they
     both involve newline characters. Dot has no special  meaning
     in a character class.
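
     A short check of the dot rules, again using Python's re
     module as a stand-in and assuming that re.DOTALL corresponds
     to PCRE_DOTALL:

```python
import re

subject = "abc a\x01c a\nc"

default_hits = re.findall(r"a.c", subject)             # dot skips only the newline
dotall_hits = re.findall(r"a.c", subject, re.DOTALL)   # dot matches newline as well

print(default_hits)   # ['abc', 'a\x01c']
print(dotall_hits)    # ['abc', 'a\x01c', 'a\nc']
```

     The non-printing character hex 01 is matched in both cases;
     only the newline depends on the option.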


     Outside a character class, the escape  sequence  \C  matches
     any  one  byte, both in and out of UTF-8 mode. Unlike a dot,
     it always matches a newline. The feature is provided in Perl
     in  order  to match individual bytes in UTF-8 mode.  Because
     it breaks up UTF-8 characters into  individual  bytes,  what
     remains  in  the string may be a malformed UTF-8 string. For
     this reason it is best avoided.

     PCRE does not allow \C to appear  in  lookbehind  assertions
     (see below), because in UTF-8 mode it makes it impossible to
     calculate the length of the lookbehind.


     An opening square bracket introduces a character class, ter-
     minated  by  a  closing  square  bracket.  A  closing square
     bracket on its own is  not  special.  If  a  closing  square
     bracket  is  required as a member of the class, it should be
     the first data character in the class (after an initial cir-
     cumflex, if present) or escaped with a backslash.

     A character class matches a single character in the subject.
     In  UTF-8 mode, the character may occupy more than one byte.
     A matched character must be in the set of characters defined
     by the class, unless the first character in the class defin-
     ition is a circumflex, in which case the  subject  character
     must not be in the set defined by the class. If a circumflex
     is actually required as a member of the class, ensure it  is
     not the first character, or escape it with a backslash.

     For example, the character class [aeiou] matches  any  lower
     case vowel, while [^aeiou] matches any character that is not
     a lower case vowel. Note that a circumflex is  just  a  con-
     venient  notation for specifying the characters which are in
     the class by enumerating those that are not. It  is  not  an
     assertion:  it  still  consumes a character from the subject
     string, and fails if the current pointer is at  the  end  of
     the string.

     In UTF-8 mode, characters with values greater than  255  can
     be  included  in a class as a literal string of bytes, or by
     using the \x{ escaping mechanism.

     When caseless matching  is  set,  any  letters  in  a  class
     represent  both their upper case and lower case versions, so
     for example, a caseless [aeiou] matches "A" as well as  "a",
     and  a caseless [^aeiou] does not match "A", whereas a case-
     ful version would. PCRE does not support the concept of case
     for characters with values greater than 255.

     The newline character is never treated in any special way in
     character  classes,  whatever the setting of the PCRE_DOTALL
     or PCRE_MULTILINE options is. A  class  such  as  [^a]  will
     always match a newline.

     The minus (hyphen) character can be used to specify a  range
     of  characters  in  a  character  class.  For example, [d-m]
     matches any letter between d and m, inclusive.  If  a  minus
     character  is required in a class, it must be escaped with a
     backslash or appear in a position where it cannot be  inter-
     preted as indicating a range, typically as the first or last
     character in the class.

     It is not possible to have the literal character "]" as  the
     end  character  of  a  range.  A  pattern such as [W-]46] is
     interpreted as a class of two characters ("W" and "-")  fol-
     lowed by a literal string "46]", so it would match "W46]" or
     "-46]". However, if the "]" is escaped with a  backslash  it
     is  interpreted  as  the end of range, so [W-\]46] is inter-
     preted as a single class containing a range followed by  two
     separate characters. The octal or hexadecimal representation
     of "]" can also be used to end a range.
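
     The two readings of the bracket can be compared directly.
     The sketch below relies on Python's re module parsing these
     two classes the same way as described here; it is not a
     test of PCRE itself.

```python
import re

unescaped = re.compile(r"[W-]46]")    # class of "W" and "-", then literal "46]"
escaped = re.compile(r"[W-\]46]")     # one class: range W to ], plus "4" and "6"

print(bool(unescaped.fullmatch("W46]")))   # True
print(bool(unescaped.fullmatch("-46]")))   # True
print(bool(escaped.fullmatch("Y")))        # True, Y lies between W and ]
print(bool(escaped.fullmatch("-")))        # False, hyphen is not in the range
```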

     Ranges  operate  in  the  collating  sequence  of  character
     values.  They  can  also  be  used  for characters specified
     numerically, for example [\000-\037]. In UTF-8 mode,  ranges
     can  include  characters  whose values are greater than 255,
     for example [\x{100}-\x{2ff}].

     If a range that  includes  letters  is  used  when  caseless
     matching  is set, it matches the letters in either case. For
     example, [W-c] is  equivalent  to  [][\^_`wxyzabc],  matched
     caselessly,  and if character tables for the "fr" locale are
     in use, [\xc8-\xcb] matches accented E  characters  in  both
     cases.

     The character types \d, \D, \s, \S,  \w,  and  \W  may  also
     appear  in  a  character  class, and add the characters that
     they match to the class. For example, [\dABCDEF] matches any
     hexadecimal  digit.  A  circumflex  can conveniently be used
     with the upper case character types to specify a  more  res-
     tricted set of characters than the matching lower case type.
     For example, the class [^\W_] matches any letter  or  digit,
     but not underscore.
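
     Both classes mentioned above can be exercised with Python's
     re module. The re.ASCII flag is an assumption made here so
     that \d and \w cover only ASCII characters, which is closer
     to PCRE's default character tables.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

print([c for c in "7CGc" if hex_digit.fullmatch(c)])         # ['7', 'C']
print([c for c in "a9_-" if letter_or_digit.fullmatch(c)])   # ['a', '9']
```

     Note that [\dABCDEF] is case-sensitive as written, so the
     lower case "c" is rejected unless caseless matching is set.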

     All non-alphameric characters other than \,  -,  ^  (at  the
     start)  and  the  terminating ] are non-special in character
     classes, but it does no harm if they are escaped.


     Perl supports the  POSIX  notation  for  character  classes,
     which  uses names enclosed by [: and :] within the enclosing
     square brackets. PCRE also supports this notation. For exam-
     ple,

       [01[:alpha:]%]

     matches "0", "1", any alphabetic character, or "%". The sup-
     ported class names are

       alnum    letters and digits
       alpha    letters
       ascii    character codes 0 - 127
       blank    space or tab only
       cntrl    control characters
       digit    decimal digits (same as \d)
       graph    printing characters, excluding space
       lower    lower case letters
       print    printing characters, including space
       punct    printing characters, excluding letters and digits
       space    white space (not quite the same as \s)
       upper    upper case letters
       word     "word" characters (same as \w)
       xdigit   hexadecimal digits

     The "space" characters are HT (9),  LF  (10),  VT  (11),  FF
     (12),  CR  (13),  and  space  (32).  Notice  that  this list
     includes the VT character (code 11). This makes "space" dif-
     ferent  to  \s, which does not include VT (for Perl compati-
     bility).

     The name "word" is a Perl extension, and "blank"  is  a  GNU
     extension from Perl 5.8. Another Perl extension is negation,
     which is indicated by a ^ character  after  the  colon.  For
     example,

       [12[:^digit:]]

     matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also
     recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
     "collating element", but these are  not  supported,  and  an
     error is given if they are encountered.

     In UTF-8 mode, characters with values greater  than  255  do
     not match any of the POSIX character classes.


     Vertical bar characters are  used  to  separate  alternative
     patterns. For example, the pattern

       gilbert|sullivan

     matches either "gilbert" or "sullivan". Any number of alter-
     natives  may  appear,  and an empty alternative is permitted
     (matching the empty string).   The  matching  process  tries
     each  alternative in turn, from left to right, and the first
     one that succeeds is used. If the alternatives are within  a
     subpattern  (defined  below),  "succeeds" means matching the
     rest of the main pattern as well as the alternative  in  the
     subpattern.
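
     The left-to-right rule and the meaning of "succeeds" inside
     a subpattern can be seen in a small experiment. Python's re
     module is used as a stand-in for PCRE, and the patterns are
     made up for the purpose.

```python
import re

# At top level the first alternative that matches is used, even if a
# later one would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "a" is tried first but the following "c" then
# fails, so the second alternative "ab" is the one that succeeds.
inside_group = re.match(r"(a|ab)c", "abc").group(1)

print(first_wins, inside_group)   # a ab
```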


     The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,
     PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from
     within the pattern by a  sequence  of  Perl  option  letters
     enclosed between "(?" and ")". The option letters are

       i  for PCRE_CASELESS
       m  for PCRE_MULTILINE
       s  for PCRE_DOTALL
       x  for PCRE_EXTENDED

     For example, (?im) sets caseless, multiline matching. It  is
     also possible to unset these options by preceding the letter
     with a hyphen, and a combined setting and unsetting such  as
     (?im-sx),  which sets PCRE_CASELESS and PCRE_MULTILINE while
     unsetting PCRE_DOTALL and PCRE_EXTENDED, is also  permitted.
     If  a  letter  appears both before and after the hyphen, the
     option is unset.

     When an option change occurs at  top  level  (that  is,  not
     inside  subpattern  parentheses),  the change applies to the
     remainder of the pattern that follows.   If  the  change  is
     placed  right  at  the  start of a pattern, PCRE extracts it
     into the global options (and it will therefore  show  up  in
     data extracted by the pcre_fullinfo() function).

     An option change within a subpattern affects only that  part
     of the current pattern that follows it, so

       (a(?i)b)c

     matches  abc  and  aBc  and  no  other   strings   (assuming
     PCRE_CASELESS  is  not used).  By this means, options can be
     made to have different settings in different  parts  of  the
     pattern.  Any  changes  made  in one alternative do carry on
     into subsequent branches within  the  same  subpattern.  For
     example,

       (a(?i)b|c)

     matches "ab", "aB", "c", and "C", even though when  matching
     "C" the first branch is abandoned before the option setting.
     This is because the effects of  option  settings  happen  at
     compile  time. There would be some very weird behaviour oth-
     erwise.

     The PCRE-specific options PCRE_UNGREEDY and  PCRE_EXTRA  can
     be changed in the same way as the Perl-compatible options by
     using the characters U and X  respectively.  The  (?X)  flag
     setting  is  special in that it must always occur earlier in
     the pattern than any of the additional features it turns on,
     even when it is at top level. It is best put at the start.


     Subpatterns are delimited by parentheses  (round  brackets),
     which can be nested.  Marking part of a pattern as a subpat-
     tern does two things:

     1. It localizes a set of alternatives. For example, the pat-
     tern

       cat(aract|erpillar|)

     matches one of the words "cat",  "cataract",  or  "caterpil-
     lar".  Without  the  parentheses, it would match "cataract",
     "erpillar" or the empty string.

     2. It sets up the subpattern as a capturing  subpattern  (as
     defined  above).   When the whole pattern matches, that por-
     tion of the subject string that matched  the  subpattern  is
     passed  back  to  the  caller  via  the  ovector argument of
     pcre_exec(). Opening parentheses are counted  from  left  to
     right (starting from 1) to obtain the numbers of the captur-
     ing subpatterns.

     For example, if the string "the red king" is matched against
     the pattern

       the ((red|white) (king|queen))

     the captured substrings are "red king", "red",  and  "king",
     and are numbered 1, 2, and 3, respectively.

     The fact that plain parentheses fulfil two functions is  not
     always  helpful.  There are often times when a grouping sub-
     pattern is required without a capturing requirement.  If  an
     opening  parenthesis  is  followed  by a question mark and a
     colon, the subpattern does not do any capturing, and is  not
     counted  when computing the number of any subsequent captur-
     ing subpatterns. For  example,  if  the  string  "the  white
     queen" is matched against the pattern

       the ((?:red|white) (king|queen))

     the captured substrings are "white queen" and  "queen",  and
     are  numbered  1 and 2. The maximum number of capturing sub-
     patterns is 65535, and the maximum depth of nesting  of  all
     subpatterns, both capturing and non-capturing, is 200.
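
     The numbering in the two examples above can be confirmed
     with a short script. Python's re module is used here in
     place of pcre_exec() and its ovector; it is assumed to
     number capturing parentheses in the same left-to-right way.

```python
import re

capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing.groups())       # ('red king', 'red', 'king')
print(non_capturing.groups())   # ('white queen', 'queen')
```

     With (?: in place of the second opening parenthesis, "queen"
     moves from number 3 to number 2.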

     As a  convenient  shorthand,  if  any  option  settings  are
     required  at  the  start  of a non-capturing subpattern, the
     option letters may appear between the "?" and the ":".  Thus
     the two patterns

       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)

     match exactly the same set of strings.  Because  alternative
     branches  are  tried from left to right, and options are not
     reset until the end of the subpattern is reached, an  option
     setting  in  one  branch does affect subsequent branches, so
     the above patterns match "SUNDAY" as well as "Saturday".


     Identifying capturing parentheses by number is  simple,  but
     it  can be very hard to keep track of the numbers in compli-
     cated regular expressions. Furthermore, if an expression  is
     modified,  the  numbers  may change. To help with the diffi-
     culty, PCRE supports the naming  of  subpatterns,  something
     that  Perl does not provide. The Python syntax (?P<name>...)
     is used. Names consist of alphanumeric characters and under-
     scores, and must be unique within a pattern.

     Named capturing parentheses are still allocated  numbers  as
     well  as  names.  The  PCRE  API provides function calls for
     extracting the name-to-number translation table from a  com-
     piled  pattern. For further details see the pcreapi documen-
     tation.

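     The same (?P<name>...) syntax can be tried in Python, where
     it originated. In this sketch the groupindex attribute of a
     compiled Python pattern plays the role of the name-to-number
     table; it is not the PCRE API, and the date pattern is only
     an illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups are still numbered from left to right.
name_to_number = dict(date.groupindex)

match = date.fullmatch("2003-02-03")
by_name = match.group("month")
by_number = match.group(2)
print(name_to_number, by_name, by_number)
```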

     Repetition is specified by quantifiers, which can follow any
     of the following items:

       a literal data character
       the . metacharacter
       the \C escape sequence
       escapes such as \d that match single characters
       a character class
       a back reference (see next section)
       a parenthesized subpattern (unless it is an assertion)

     The general repetition quantifier specifies  a  minimum  and
     maximum  number  of  permitted  matches,  by  giving the two
     numbers in curly brackets (braces), separated  by  a  comma.
     The  numbers  must be less than 65536, and the first must be
     less than or equal to the second. For example:

       z{2,4}

     matches "zz", "zzz", or "zzzz". A closing brace on  its  own
     is not a special character. If the second number is omitted,
     but the comma is present, there is no upper  limit;  if  the
     second number and the comma are both omitted, the quantifier
     specifies an exact number of required matches. Thus

       [aeiou]{3,}

     matches at least 3 successive vowels,  but  may  match  many
     more, while

       \d{8}

     matches exactly 8 digits.  An  opening  curly  bracket  that
     appears  in a position where a quantifier is not allowed, or
     one that does not match the syntax of a quantifier, is taken
     as  a literal character. For example, {,6} is not a quantif-
     ier, but a literal string of four characters.
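
     The three quantifier forms can be exercised as follows,
     again with Python's re module standing in for PCRE. The
     subject strings are made up for the illustration. One
     difference is worth noting: Python treats {,6} as a
     quantifier meaning {0,6}, so the literal-string rule
     described above is not shown here.

```python
import re

# {2,4}: between two and four repeats.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three repeats, no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight repeats.
eight_digits = bool(re.fullmatch(r"\d{8}", "20030203"))
seven_digits = bool(re.fullmatch(r"\d{8}", "2003020"))

print(bounded, vowels, eight_digits, seven_digits)
```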

     In UTF-8 mode, quantifiers apply to UTF-8 characters  rather
     than  to  individual  bytes.  Thus,  for example, \x{100}{2}
     matches two UTF-8 characters, each of which  is  represented
     by a two-byte sequence.

     The quantifier {0} is permitted, causing the  expression  to
     behave  as  if the previous item and the quantifier were not
     present.

     For convenience (and  historical  compatibility)  the  three
     most common quantifiers have single-character abbreviations:

       *    is equivalent to {0,}
       +    is equivalent to {1,}
       ?    is equivalent to {0,1}

     It is possible to construct infinite loops  by  following  a
     subpattern  that  can  match no characters with a quantifier
     that has no upper limit, for example:

       (a?)*

     Earlier versions of Perl and PCRE used to give an  error  at
     compile  time  for such patterns. However, because there are
     cases where this  can  be  useful,  such  patterns  are  now
     accepted,  but  if  any repetition of the subpattern does in
     fact match no characters, the loop is forcibly broken.

     By default, the quantifiers  are  "greedy",  that  is,  they
     match  as much as possible (up to the maximum number of per-
     mitted times), without causing the rest of  the  pattern  to
     fail. The classic example of where this gives problems is in
     trying to match comments in C programs. These appear between
     the  sequences /* and */ and within the sequence, individual
     * and / characters may appear. An attempt to  match  C  com-
     ments by applying the pattern

       /\*.*\*/

     to the string

       /* first comment */  not comment  /* second comment */

     fails, because it matches the entire  string  owing  to  the
     greediness of the .*  item.

     However, if a quantifier is followed by a question mark,  it
     ceases  to be greedy, and instead matches the minimum number
     of times possible, so the pattern

       /\*.*?\*/

     does the right thing with the C comments. The meaning of the
     various  quantifiers is not otherwise changed, just the pre-
     ferred number of matches.  Do not confuse this use of  ques-
     tion  mark  with  its  use as a quantifier in its own right.
     Because it has two uses, it can sometimes appear doubled, as
     in

       \d??\d

     which matches one digit by preference, but can match two  if
     that is the only way the rest of the pattern matches.
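
     The difference between greedy and minimizing quantifiers can
     be seen in a small sketch. It uses Python's re module, which
     is assumed to treat ? after a quantifier in the same way as
     PCRE does by default.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first */ each time.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the
# match (here, reaching the end of the subject) requires it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, prefer_one, forced_two)
```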

     If the PCRE_UNGREEDY option is set (an option which  is  not
     available  in  Perl),  the  quantifiers  are  not  greedy by
     default, but individual ones can be made greedy by following
     them  with  a  question mark. In other words, it inverts the
     default behaviour.

     When a parenthesized subpattern is quantified with a minimum
     repeat  count  that is greater than 1 or with a limited max-
     imum, more store is required for the  compiled  pattern,  in
     proportion to the size of the minimum or maximum.

     If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
     option (equivalent to Perl's /s) is set, thus allowing the .
     to match  newlines,  the  pattern  is  implicitly  anchored,
     because whatever follows will be tried against every charac-
     ter position in the subject string, so there is no point  in
     retrying  the overall match at any position after the first.
     PCRE normally treats such a pattern as though it  were  pre-
     ceded by \A.

     In cases where it is known that the subject string  contains
     no  newlines,  it  is  worth setting PCRE_DOTALL in order to
     obtain this optimization, or alternatively using ^ to  indi-
     cate anchoring explicitly.

     However, there is one situation where the optimization  can-
     not  be  used. When .*  is inside capturing parentheses that
     are the subject of a backreference elsewhere in the pattern,
     a match at the start may fail, and a later one succeed. Con-
     sider, for example:

       (.*)abc\1

     If the subject is "xyz123abc123"  the  match  point  is  the
     fourth  character.  For  this  reason, such a pattern is not
     implicitly anchored.
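
     The match point in this case can be confirmed with Python's
     re module, used here only as a stand-in with the same
     back reference semantics. Python reports a zero-based
     offset, so the fourth character is offset 3.

```python
import re

# No attempt starting at offsets 0, 1 or 2 can succeed, because the
# text after "abc" must repeat whatever (.*) captured.
match = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = match.start()
repeated = match.group(1)
print(match_start, repeated)
```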

     When a capturing subpattern is repeated, the value  captured
     is the substring that matched the final iteration. For exam-
     ple, after

       (tweedle[dume]{3}\s*)+

     has matched "tweedledum tweedledee" the value  of  the  cap-
     tured  substring  is  "tweedledee".  However,  if  there are
     nested capturing  subpatterns,  the  corresponding  captured
     values  may  have been set in previous iterations. For exam-
     ple, after

       (a|(b))+

     matches "aba" the value of the second captured substring  is
     "b".

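     Both cases can be reproduced in a short sketch with Python's
     re module, which is assumed here to keep captured values
     across iterations in the same way as PCRE.

```python
import re

# A repeated group reports the substring from its final iteration.
last_iteration = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                              "tweedledum tweedledee").group(1)

# The inner group took part only in the second iteration, but its
# value is still set after the third.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration, nested)
```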

     With both maximizing and minimizing repetition,  failure  of
     what  follows  normally  causes  the repeated item to be re-
     evaluated to see if a different number of repeats allows the
     rest  of  the  pattern  to  match. Sometimes it is useful to
     prevent this, either to change the nature of the  match,  or
     to cause it to fail earlier than it otherwise might, when the
     author of the pattern knows there is no  point  in  carrying
     on.

     Consider, for example, the pattern \d+foo  when  applied  to
     the subject line

       123456bar

     After matching all 6 digits and then failing to match "foo",
     the normal action of the matcher is to try again with only 5
     digits matching the \d+ item, and then with 4,  and  so  on,
     before  ultimately  failing. "Atomic grouping" (a term taken
     from Jeffrey Friedl's book) provides the means for  specify-
     ing  that once a subpattern has matched, it is not to be re-
     evaluated in this way.

     If we use atomic grouping  for  the  previous  example,  the
     matcher  would give up immediately on failing to match "foo"
     the  first  time.  The  notation  is  a  kind   of   special
     parenthesis, starting with (?> as in this example:

       (?>\d+)foo

     This kind of parenthesis "locks up" the  part of the pattern
     it  contains once it has matched, and a failure further into
     the pattern is prevented from backtracking  into  it.  Back-
     tracking  past  it to previous items, however, works as nor-
     mal.

     An alternative description is that a subpattern of this type
     matches  the  string  of  characters that an identical stan-
     dalone pattern would match, if anchored at the current point
     in the subject string.

     Atomic grouping subpatterns are not  capturing  subpatterns.
     Simple  cases such as the above example can be thought of as
     a maximizing repeat that must swallow everything it can. So,
     while both \d+ and \d+? are prepared to adjust the number of
     digits they match in order to make the rest of  the  pattern
     match, (?>\d+) can only match an entire sequence of digits.

     Atomic groups in general can of course  contain  arbitrarily
     complicated  subpatterns,  and  can be nested. However, when
     the subpattern for an atomic group is just a single repeated
     item,  as in the example above, a simpler notation, called a
     "possessive quantifier" can be used.  This  consists  of  an
     additional  +  character  following a quantifier. Using this
     notation, the previous example can be rewritten as

       \d++foo

     Possessive quantifiers are always greedy; the setting of the
     PCRE_UNGREEDY option is ignored. They are a convenient nota-
     tion for the simpler forms of atomic group.  However,  there
     is  no  difference in the meaning or processing of a posses-
     sive quantifier and the equivalent atomic group.

     The possessive quantifier syntax is an extension to the Perl
     syntax. It originates in Sun's Java package.

     When a pattern contains an unlimited repeat inside a subpat-
     tern  that  can  itself  be  repeated an unlimited number of
     times, the use of an atomic group is the only way  to  avoid
     some  failing  matches  taking  a very long time indeed. The
     pattern

       (\D+|<\d+>)*[!?]

     matches an unlimited number of substrings that  either  con-
     sist  of  non-digits,  or digits enclosed in <>, followed by
     either ! or ?. When it matches, it runs quickly. However, if
     it is applied to

       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

     it takes a long  time  before  reporting  failure.  This  is
     because the string can be divided between the two repeats in
     a large number of ways, and all have to be tried. (The exam-
     ple  used  [!?]  rather  than a single character at the end,
     because both PCRE and Perl have an optimization that  allows
     for  fast  failure  when  a  single  character is used. They
     remember the last single character that is  required  for  a
     match,  and  fail early if it is not present in the string.)
     If the pattern is changed to

       ((?>\D+)|<\d+>)*[!?]

     sequences of non-digits cannot be broken, and  failure  hap-
     pens quickly.


     Outside a character class, a backslash followed by  a  digit
     greater  than  0  (and  possibly  further  digits) is a back
     reference to a capturing subpattern earlier (that is, to its
     left)  in  the  pattern,  provided there have been that many
     previous capturing left parentheses.

     However, if the decimal number following  the  backslash  is
     less  than  10,  it is always taken as a back reference, and
     causes an error only if there are not  that  many  capturing
     left  parentheses in the entire pattern. In other words, the
     parentheses that are referenced need not be to the  left  of
     the  reference  for  numbers  less  than 10. See the section
     entitled "Backslash" above for further details of  the  han-
     dling of digits following a backslash.

     A back reference matches whatever actually matched the  cap-
     turing subpattern in the current subject string, rather than
     anything matching the subpattern itself (see "Subpatterns as
     subroutines" below for a way of doing that). So the pattern

       (sens|respons)e and \1ibility

     matches "sense and sensibility" and "response and  responsi-
     bility",  but  not  "sense  and  responsibility". If caseful
     matching is in force at the time of the back reference,  the
     case of letters is relevant. For example,

       ((?i)rah)\s+\1

     matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even
     though  the  original  capturing subpattern is matched case-
     lessly.

     Back references to named subpatterns use the  Python  syntax
     (?P=name). We could rewrite the above example as follows:

       (?P<p1>(?i)rah)\s+(?P=p1)

     There may be more than one back reference to the  same  sub-
     pattern.  If  a  subpattern  has not actually been used in a
     particular match, any back references to it always fail. For
     example, the pattern

       (a|(bc))\2

     always fails if it starts to match  "a"  rather  than  "bc".
     Because  there  may  be many capturing parentheses in a pat-
     tern, all digits following the backslash are taken  as  part
     of a potential back reference number. If the pattern contin-
     ues with a digit character, some delimiter must be  used  to
     terminate the back reference. If the PCRE_EXTENDED option is
     set, this can be whitespace.  Otherwise an empty comment can
     be used.

     A back reference that occurs inside the parentheses to which
     it  refers  fails when the subpattern is first used, so, for
     example, (a\1) never matches.  However, such references  can
     be useful inside repeated subpatterns. For example, the pat-
     tern

       (a|b\1)+

     matches any number of "a"s and also "aba", "ababbaa" etc. At
     each iteration of the subpattern, the back reference matches
     the character string corresponding to  the  previous  itera-
     tion.  In  order  for this to work, the pattern must be such
     that the first iteration does not need  to  match  the  back
     reference.  This  can  be  done using alternation, as in the
     example above, or by a quantifier with a minimum of zero.
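
     Several of the back reference rules in this section can be
     checked with Python's re module as a stand-in. Two
     adjustments are assumptions of this sketch: the caseless
     option is written as (?i:...) because Python does not accept
     (?i) in the middle of a pattern, and a reference inside its
     own parentheses is left out because Python rejects it at
     compile time.

```python
import re

# \1 matches what the group matched, not what it could match.
pairing = re.compile(r"(sens|respons)e and \1ibility")
same_word = [s for s in ("sense and sensibility",
                         "response and responsibility",
                         "sense and responsibility")
             if pairing.fullmatch(s)]

# The group is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: bool(rah.fullmatch(s))
               for s in ("rah rah", "RAH RAH", "RAH rah")}

# A back reference to a group that took no part in the match fails.
unset = re.compile(r"(a|(bc))\2")
via_bc = bool(unset.fullmatch("bcbc"))
via_a = unset.match("abc")

print(same_word, rah_results, via_bc, via_a)
```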


     An assertion is  a  test  on  the  characters  following  or
     preceding  the current matching point that does not actually
     consume any characters. The simple assertions coded  as  \b,
     \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-
     plicated assertions are coded as subpatterns. There are  two
     kinds:  those that look ahead of the current position in the
     subject string, and those that look behind it.

     An assertion subpattern is matched in the normal way, except
     that  it  does not cause the current matching position to be
     changed. Lookahead assertions start with  (?=  for  positive
     assertions and (?! for negative assertions. For example,

       \w+(?=;)

     matches a word followed by a semicolon, but does not include
     the semicolon in the match, and

       foo(?!bar)

     matches any occurrence of "foo"  that  is  not  followed  by
     "bar". Note that the apparently similar pattern

       (?!foo)bar

     does not find an occurrence of "bar"  that  is  preceded  by
     something other than "foo"; it finds any occurrence of "bar"
     whatsoever, because the assertion  (?!foo)  is  always  true
     when  the  next  three  characters  are  "bar". A lookbehind
     assertion is needed to achieve this effect.
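
     The lookahead examples, and the pitfall with (?!foo)bar, can
     be tried out as follows. Python's re module is used as a
     stand-in; the subject strings are invented for the sketch,
     and the lookbehind form (?<!foo)bar is included only to show
     the contrast.

```python
import re

# The semicolon must follow, but is not part of the match.
word = re.search(r"\w+(?=;)", "total = sum;").group()

# Offsets of each "foo" that is not followed by "bar".
foo_offsets = [m.start() for m in
               re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo) looks ahead at "bar" itself, so it can never fail here.
pitfall = re.search(r"(?!foo)bar", "foobar")

# A negative lookbehind tests what precedes "bar".
behind_rejects = re.search(r"(?<!foo)bar", "foobar")
behind_accepts = re.search(r"(?<!foo)bar", "bazbar")

print(word, foo_offsets, pitfall.start(), behind_rejects, behind_accepts.start())
```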

     If you want to force a matching failure at some point  in  a
     pattern,  the  most  convenient  way  to  do it is with (?!)
     because an empty string always matches, so an assertion that
     requires there not to be an empty string must always fail.

     Lookbehind assertions start with (?<=  for  positive  asser-
     tions and (?<! for negative assertions. For example,

       (?<!foo)bar

     does find an occurrence of "bar" that  is  not  preceded  by
     "foo". The contents of a lookbehind assertion are restricted
     such that all the strings  it  matches  must  have  a  fixed
     length.  However, if there are several alternatives, they do
     not all have to have the same fixed length. Thus

       (?<=bullock|donkey)

     is permitted, but

       (?<!dogs?|cats?)

     causes an error at compile time. Branches  that  match  dif-
     ferent length strings are permitted only at the top level of
     a lookbehind assertion. This is an extension  compared  with
     Perl  (at  least  for  5.8),  which requires all branches to
     match the same length of string. An assertion such as

       (?<=ab(c|de))

     is not permitted, because its single  top-level  branch  can
     match two different lengths, but it is acceptable if rewrit-
     ten to use two top-level branches:

       (?<=abc|abde)

     The implementation of lookbehind  assertions  is,  for  each
     alternative,  to  temporarily move the current position back
     by the fixed width and then  try  to  match.  If  there  are
     insufficient  characters  before  the  current position, the
     match is deemed to fail.

     PCRE does not allow the \C escape (which  matches  a  single
     byte  in  UTF-8  mode)  to  appear in lookbehind assertions,
     because it makes it impossible to calculate  the  length  of
     the lookbehind.

     Atomic groups can be used  in  conjunction  with  lookbehind
     assertions  to  specify efficient matching at the end of the
     subject string. Consider a simple pattern such as

       abcd$

     when applied to a long string that does not  match.  Because
     matching  proceeds  from  left  to right, PCRE will look for
     each "a" in the subject and then see if what follows matches
     the rest of the pattern. If the pattern is specified as

       ^.*abcd$

     the initial .* matches the entire string at first, but  when
     this  fails  (because  there  is no following "a"), it back-
     tracks to match all but the last character, then all but the
     last  two  characters,  and so on. Once again the search for
     "a" covers the entire string, from right to left, so we  are
     no better off. However, if the pattern is written as

       ^(?>.*)(?<=abcd)

     or, equivalently,

       ^.*+(?<=abcd)

     there can be no backtracking for the .* item; it  can  match
     only  the entire string. The subsequent lookbehind assertion
     does a single test on the last four characters. If it fails,
     the match fails immediately. For long strings, this approach
     makes a significant difference to the processing time.

     Several assertions (of any sort) may  occur  in  succession.
     For example,

       (?<=\d{3})(?<!999)foo

     matches "foo" preceded by three digits that are  not  "999".
     Notice  that each of the assertions is applied independently
     at the same point in the subject string. First  there  is  a
     check that the previous three characters are all digits, and
     then there is a check that the same three characters are not
     "999".   This  pattern  does not match "foo" preceded by six
     characters, the first three of which are digits and the last
     three of which are not "999". For example, it doesn't match
     "123abcfoo". A pattern to do that is

       (?<=\d{3}...)(?<!999)foo

     This time the first assertion looks  at  the  preceding  six
     characters,  checking  that  the first three are digits, and
     then the second assertion checks that  the  preceding  three
     characters are not "999".
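
     The difference between the two patterns shows up when both
     are run over the same subjects. This is a sketch using
     Python's re module as a stand-in; the subject strings other
     than "123abcfoo" are invented for the comparison.

```python
import re

# Both assertions inspect the three characters just before "foo".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now reaches back six characters.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
same_point_hits = [s for s in subjects if same_point.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]

print(same_point_hits, six_back_hits)
```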

     Assertions can be nested in any combination. For example,

       (?<=(?<!foo)bar)baz

     matches an occurrence of "baz" that  is  preceded  by  "bar"
     which in turn is not preceded by "foo", while

       (?<=\d{3}(?!999)...)foo

     is another pattern which matches  "foo"  preceded  by  three
     digits and any three characters that are not "999".

     Assertion subpatterns are not capturing subpatterns, and may
     not  be  repeated,  because  it makes no sense to assert the
     same thing several times. If any kind of assertion  contains
     capturing  subpatterns  within it, these are counted for the
     purposes of numbering the capturing subpatterns in the whole
     pattern.   However,  substring capturing is carried out only
     for positive assertions, because it does not make sense  for
     negative assertions.


     It is possible to cause the matching process to obey a  sub-
     pattern  conditionally  or to choose between two alternative
     subpatterns, depending on the result  of  an  assertion,  or
     whether  a previous capturing subpattern matched or not. The
     two possible forms of conditional subpattern are

       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)

     If the condition is satisfied, the yes-pattern is used; oth-
     erwise  the  no-pattern  (if  present) is used. If there are
     more than two alternatives in the subpattern, a compile-time
     error occurs.

     There are three kinds of condition. If the text between  the
     parentheses  consists of a sequence of digits, the condition
     is satisfied if the capturing subpattern of that number  has
     previously  matched.  The  number must be greater than zero.
     Consider  the  following  pattern,   which   contains   non-
     significant white space to make it more readable (assume the
     PCRE_EXTENDED option) and to divide it into three parts  for
     ease of discussion:

       ( \( )?    [^()]+    (?(1) \) )

     The first part matches an optional opening parenthesis,  and
     if  that character is present, sets it as the first captured
     substring. The second part matches one  or  more  characters
     that  are  not  parentheses. The third part is a conditional
     subpattern that tests whether the first set  of  parentheses
     matched or not. If they did, that is, if the subject started
     with an opening parenthesis, the condition is true,  and  so
     the  yes-pattern  is  executed  and a closing parenthesis is
     required. Otherwise, since no-pattern is  not  present,  the
     subpattern  matches  nothing.  In  other words, this pattern
     matches a sequence of non-parentheses,  optionally  enclosed
     in parentheses.
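
     The numbered condition can be tried with Python's re module,
     which supports the same (?(1)...) form; re.VERBOSE is
     assumed to correspond to PCRE_EXTENDED for this pattern. The
     subject strings are invented for the sketch.

```python
import re

optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# A closing parenthesis is required only if an opening one was captured.
subjects = ("abc", "(abc)", "(abc", "abc)")
accepted = [s for s in subjects if optional_parens.fullmatch(s)]

print(accepted)
```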

     If the condition is the string (R), it  is  satisfied  if  a
     recursive  call  to the pattern or subpattern has been made.
     At "top level", the condition is  false.   This  is  a  PCRE
     extension.  Recursive  patterns  are  described  in the next
     section.

     If the condition is not a sequence of digits or (R), it must
     be  an assertion.  This may be a positive or negative looka-
     head or lookbehind assertion. Consider this  pattern,  again
     containing  non-significant  white  space,  and with the two
     alternatives on the second line:

       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

     The condition is a positive lookahead assertion that matches
     an optional sequence of non-letters followed by a letter. In
     other words, it tests for  the  presence  of  at  least  one
     letter  in the subject. If a letter is found, the subject is
     matched against  the  first  alternative;  otherwise  it  is
     matched  against the second. This pattern matches strings in
     one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
     letters and dd are digits.


     The sequence (?# marks the start of a comment which  contin-
     ues  up  to the next closing parenthesis. Nested parentheses
     are not permitted. The characters that  make  up  a  comment
     play no part in the pattern matching at all.

     If the PCRE_EXTENDED option is set, an unescaped # character
     outside  a character class introduces a comment that contin-
     ues up to the next newline character in the pattern.


     Consider the problem of matching a  string  in  parentheses,
     allowing  for  unlimited nested parentheses. Without the use
     of recursion, the best that can be done is to use a  pattern
     that  matches  up  to some fixed depth of nesting. It is not
     possible to handle an arbitrary nesting depth. Perl has pro-
     vided  an  experimental facility that allows regular expres-
     sions to recurse (amongst other things).  It  does  this  by
     interpolating  Perl  code in the expression at run time, and
     the code can refer to the expression itself. A Perl  pattern
     to solve the parentheses problem can be created like this:

       $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

     The (?p{...}) item interpolates Perl code at run  time,  and
     in  this  case refers recursively to the pattern in which it
     appears. Obviously, PCRE cannot support the interpolation of
     Perl  code.  Instead,  it  supports  some special syntax for
     recursion of the entire pattern,  and  also  for  individual
     subpattern recursion.

     The special item that consists of (? followed  by  a  number
     greater  than  zero and a closing parenthesis is a recursive
     call of the subpattern of the given number, provided that it
     occurs inside that subpattern. (If not, it is a "subroutine"
     call, which is described in the next section.)  The  special
     item  (?R) is a recursive call of the entire regular expres-
     sion.

     For example, this PCRE pattern solves the nested parentheses
     problem  (assume  the  PCRE_EXTENDED  option  is set so that
     white space is ignored):

       \( ( (?>[^()]+) | (?R) )* \)

     First it matches an opening parenthesis. Then it matches any
     number  of substrings which can either be a sequence of non-
     parentheses, or a recursive  match  of  the  pattern  itself
     (that  is  a  correctly  parenthesized  substring).  Finally
     there is a closing parenthesis.
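
     Python's re module has no recursion item, so the following
     sketch is not a regular expression at all. It is a
     hand-written recognizer, under the hypothetical name
     match_parens(), that accepts the same strings as the pattern
     above when matching is anchored at a given offset. The
     function calling itself stands in for (?R).

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced group at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                      # \( must match first
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_parens(subject, pos)   # recursive call, like (?R)
            if pos is None:
                return None
        else:
            pos += 1                     # one non-parenthesis character
    # The loop stopped either at ")" or at the end of the subject.
    return pos + 1 if pos < len(subject) else None

balanced = match_parens("(ab(cd)ef)")
unbalanced = match_parens("(ab(cd")
prefix_only = match_parens("(a(b)c)x")
print(balanced, unbalanced, prefix_only)
```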

     If this were part of a larger pattern, you would not want to
     recurse the entire pattern, so instead you could use this:

       ( \( ( (?>[^()]+) | (?1) )* \) )

     We have put the pattern into  parentheses,  and  caused  the
     recursion  to refer to them instead of the whole pattern. In
     a larger pattern, keeping track of parenthesis  numbers  can
     be   tricky.   It  may  be  more  convenient  to  use  named
     parentheses instead. For this, PCRE uses (?P>name), which is
     an  extension  to the Python syntax that PCRE uses for named
     parentheses (Perl does not provide  named  parentheses).  We
     could rewrite the above example as follows:

       (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

     This particular example pattern  contains  nested  unlimited
     repeats,  and  so  the  use  of atomic grouping for matching
     strings of non-parentheses is important  when  applying  the
     pattern to strings that do not match. For example, when this
     pattern is applied to

       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

     it yields "no match" quickly. However, if atomic grouping is
     not used, the match runs for a very long time indeed because
     there are so many different ways the +  and  *  repeats  can
     carve  up  the  subject,  and  all  have to be tested before
     failure can be reported.

     At the end of a match, the values set for any capturing sub-
     patterns are those from the outermost level of the recursion
     at which the subpattern value is set.  If you want to obtain
     intermediate  values,  a  callout  function can be used (see
     below and the pcrecallout  documentation).  If  the  pattern
     above is matched against

       (ab(cd)ef)

     the value for the capturing parentheses is  "ef",  which  is
     the  last  value  taken  on  at the top level. If additional
     parentheses are added, giving

       \( ( ( (?>[^()]+) | (?R) )* ) \)
          ^                        ^
          ^                        ^

     the string they capture is "ab(cd)ef", the contents  of  the
     top  level  parentheses. If there are more than 15 capturing
     parentheses in a pattern, PCRE has to obtain extra memory to
     store  data  during  a  recursion,  which  it  does by using
     pcre_malloc, freeing it  via  pcre_free  afterwards.  If  no
     memory   can   be   obtained,   the  match  fails  with  the
     PCRE_ERROR_NOMEMORY error.

     Do not confuse the (?R) item with the condition  (R),  which
     tests  for  recursion.  Consider this pattern, which matches
     text in angle brackets, allowing for arbitrary nesting. Only
     digits are allowed in nested brackets (that is, when recurs-
     ing), whereas any characters  are  permitted  at  the  outer
     level.

       < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

     In this pattern, (?(R) is the start of a conditional subpat-
     tern,  with two different alternatives for the recursive and
     non-recursive cases. The (?R) item is the  actual  recursive
     call.


     If the syntax for a recursive subpattern  reference  (either
     by  number  or  by  name) is used outside the parentheses to
     which it refers, it operates like a subroutine in a program-
     ming  language. An earlier example pointed out that the pat-
     tern

       (sens|respons)e and \1ibility

     matches "sense and sensibility" and "response and  responsi-
     bility",  but not "sense and responsibility". If instead the
     pattern

       (sens|respons)e and (?1)ibility

     is used, it does match "sense and responsibility" as well as
     the other two strings. Such references must, however, follow
     the subpattern to which they refer.
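
     Python's re module does not provide the (?1) item, but the
     distinction can still be sketched. Repeating the text of the
     subpattern, as a subroutine call in effect does, is
     contrasted below with a back reference. The variable names
     are hypothetical, and splicing the subpattern in as a string
     is only an approximation of what PCRE does.

```python
import re

stem = r"(?:sens|respons)"

# Back reference: the second word must repeat the first stem.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine-like: the subpattern itself is applied a second time.
subroutine_like = re.compile(stem + r"e and " + stem + r"ibility")

mixed = "sense and responsibility"
with_back_reference = bool(back_reference.fullmatch(mixed))
with_subroutine = bool(subroutine_like.fullmatch(mixed))
print(with_back_reference, with_subroutine)
```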


     Perl has a  feature  whereby  using  the  sequence  (?{...})
     causes  arbitrary  Perl  code  to be obeyed in the middle of
     matching a  regular  expression.  This  makes  it  possible,
     amongst  other  things, to extract different substrings that
     match the same pair of parentheses when there is  a  repeti-
     tion.

     PCRE provides a similar feature, but  of  course  it  cannot
     obey  arbitrary  Perl code. The feature is called "callout".
     The caller of PCRE provides an external function by  putting
     its  entry  point  in  the global variable pcre_callout.  By
     default, this variable contains  NULL,  which  disables  all
     calling out.

     Within a regular expression, (?C) indicates  the  points  at
     which  the external function is to be called. If you want to
     identify different callout points, you can put a number less
     than 256 after the letter C. The default value is zero.  For
     example, this pattern has two callout points:

       (?C1)abc(?C2)def

     During matching, when PCRE  reaches  a  callout  point  (and
     pcre_callout is set), the external function is called. It is
     provided with the number of the  callout,  and,  optionally,
     one  item  of  data  originally  supplied  by  the caller of
     pcre_exec(). The callout  function  may  cause  matching  to
     backtrack,  or to fail altogether. A complete description of
     the interface to the callout function is given in the  pcre-
     callout documentation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     Certain items that may appear in regular expression patterns
     are  more efficient than others. It is more efficient to use
     a character class like [aeiou] than a  set  of  alternatives
     such  as  (a|e|i|o|u). In general, the simplest construction
     that provides the required behaviour  is  usually  the  most
     efficient.  Jeffrey  Friedl's book contains a lot of discus-
     sion about optimizing regular expressions for efficient per-
     formance.

     When a pattern begins with .*  not  in  parentheses,  or  in
     parentheses that are not the subject of a backreference, and
     the PCRE_DOTALL option is set,  the  pattern  is  implicitly
     anchored  by PCRE, since it can match only at the start of a
     subject string. However, if PCRE_DOTALL  is  not  set,  PCRE
     cannot  make  this optimization, because the . metacharacter
     does not then match a newline, and  if  the  subject  string
     contains  newlines, the pattern may match from the character
     immediately following one of them instead of from  the  very
     start. For example, the pattern

       (.*) second

     matches the subject "first\nand second" (where \n stands for
     a newline character), with the match starting at the seventh
     character. In order to do this, PCRE has to retry the  match
     starting after every newline in the subject.

     If you are using such a pattern with subject strings that do
     not  contain  newlines,  the best performance is obtained by
     setting PCRE_DOTALL, or starting the  pattern  with  ^.*  to
     indicate  explicit anchoring. That saves PCRE from having to
     scan along the subject looking for a newline to restart at.

     Beware of patterns that contain nested  indefinite  repeats.
     These  can  take a long time to run when applied to a string
     that does not match. Consider the pattern fragment

       (a+)*

     This can match "aaaa" in 16 different ways, and this  number
     increases  very  rapidly  as  the string gets longer. (The *
     repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
     those  cases other than 0, the + repeats can match different
     numbers of times.) When the remainder of the pattern is such
     that  the entire match is going to fail, PCRE has in princi-
     ple to try every possible variation, and this  can  take  an
     extremely long time.

     An optimization catches some of the more simple  cases  such
     as

       (a+)*b

     where a literal character follows. Before embarking  on  the
     standard matching procedure, PCRE checks that there is a "b"
     later in the subject string, and if there is not,  it  fails
     the  match  immediately. However, when there is no following
     literal this optimization cannot be used. You  can  see  the
     difference by comparing the behaviour of

       (a+)*\d

     with the pattern above. The former gives  a  failure  almost
     instantly  when  applied  to a whole line of "a" characters,
     whereas the latter takes an appreciable  time  with  strings
     longer than about 20 characters.
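
     The growth described above can be made concrete  by  brute-
     force counting. The following is only a sketch, and it rests
     on one assumption about what counts as a "way": an  ordered
     list of + repeat lengths, each at least 1, whose sum  is  at
     most the number of "a" characters available. The  name
     count_ways is made up for this illustration and is not part
     of PCRE.

```c
#include <stdio.h>

/* Hypothetical helper, not part of PCRE.
   Counts the ways (a+)* can consume a prefix of `remaining` 'a'
   characters. The initial 1 is the choice to stop here (for the
   outermost call, the * repeat matching 0 times). Each loop pass
   lets the next + repeat take `len` characters and then counts
   what can follow. */
unsigned long count_ways(unsigned remaining)
{
    unsigned long ways = 1;
    unsigned len;

    for (len = 1; len <= remaining; len++)
        ways += count_ways(remaining - len);

    return ways;
}

/* Print the count for subjects of 1..max_len characters. */
void print_growth(unsigned max_len)
{
    unsigned n;

    for (n = 1; n <= max_len; n++)
        printf("%u characters: %lu ways\n", n, count_ways(n));
}
```

     Under that counting rule count_ways(4) is 16, and the  count
     doubles with every extra character, so count_ways(20) is
     1048576. That is consistent with the  remark  that  strings
     longer than about 20 characters take an appreciable time.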

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions.

     #include <pcreposix.h>

     int regcomp(regex_t *preg, const char *pattern,
          int cflags);

     int regexec(regex_t *preg, const char *string,
          size_t nmatch, regmatch_t pmatch[], int eflags);

     size_t regerror(int errcode, const regex_t *preg,
          char *errbuf, size_t errbuf_size);

     void regfree(regex_t *preg);


     This set of functions provides a POSIX-style API to the PCRE
     regular  expression  package.  See the pcreapi documentation
     for a description of the native API,  which  contains  addi-
     tional functionality.

     The functions described here are just wrapper functions that
     ultimately  call  the  PCRE native API. Their prototypes are
     defined in the pcreposix.h header file, and on Unix  systems
     the library itself is called pcreposix.a, so can be accessed
     by adding -lpcreposix to the command for linking an applica-
     tion  which  uses them. Because the POSIX functions call the
     native ones, it is also necessary to add -lpcre.

     I have implemented only those option bits that can  be  rea-
     sonably  mapped  to  PCRE  native  options. In addition, the
     options REG_EXTENDED and  REG_NOSUB  are  defined  with  the
     value zero. They have no effect, but since programs that are
     written to the POSIX interface often use them, this makes it
     easier to slot in PCRE as a replacement library. Other POSIX
     options are not even defined.

     When PCRE is called via these functions, it is only the  API
     that is POSIX-like in style. The syntax and semantics of the
     regular expressions themselves are still those of Perl, sub-
     ject  to  the  setting of various PCRE options, as described
     below. "POSIX-like in style" means that the API approximates
     to  the  POSIX definition; it is not fully POSIX-compatible,
     and in multi-byte encoding domains it is probably even  less
     compatible.

     The header for these functions is supplied as pcreposix.h to
     avoid  any  potential  clash  with other POSIX libraries. It
     can, of course, be renamed or aliased as regex.h,  which  is
     the "correct" name. It provides two structure types, regex_t
     for compiled internal forms, and  regmatch_t  for  returning
     captured  substrings.  It  also defines some constants whose
     names start with "REG_"; these are used for setting  options
     and identifying error codes.


     The function regcomp() is called to compile a  pattern  into
     an  internal form. The pattern is a C string terminated by a
     binary zero, and is passed in the argument pattern. The preg
     argument  is  a pointer to a regex_t structure which is used
     as a base for storing information about the compiled
     expression.

     The argument cflags is either zero, or contains one or  more
     of the bits defined by the following macros:

       REG_ICASE

     The PCRE_CASELESS option  is  set  when  the  expression  is
     passed for compilation to the native function.

       REG_NEWLINE

     The PCRE_MULTILINE option is  set  when  the  expression  is
     passed  for  compilation  to  the native function. Note that
     this  does  not  mimic  the  defined  POSIX  behaviour   for
     REG_NEWLINE (see the following section).

     In the absence of these flags, no options are passed to  the
     native  function.  This means that the regex is compiled with
     PCRE default semantics. In particular, the  way  it  handles
     newline  characters  in  the subject string is the Perl way,
     not the POSIX way. Note that setting PCRE_MULTILINE has only
     some  of  the effects specified for REG_NEWLINE. It does not
     affect the way newlines are matched by . (they aren't) or by
     a negative class such as [^a] (they are).

     The yield of regcomp() is zero on success, and non-zero oth-
     erwise.  The preg structure is filled in on success, and one
     member of the structure  is  public:  re_nsub  contains  the
     number  of  capturing subpatterns in the regular expression.
     Various error codes are defined in the header file.
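
     As a rough illustration of regcomp() and the public  re_nsub
     member, here is a minimal sketch. The USE_PCREPOSIX switch
     and the function name count_captures are  assumptions  made
     for this example only; neither is defined by PCRE.  Without
     the switch the sketch builds against the  system  regex.h.
     REG_EXTENDED is passed because programs written to the POSIX
     interface usually do so, and pcreposix.h defines it as zero.

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch for this sketch. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX implementation */
#endif

/* Hypothetical helper: compile `pattern` and return the number of
   capturing subpatterns, or -1 if regcomp() reports an error. */
long count_captures(const char *pattern, int cflags)
{
    regex_t re;
    long nsub;

    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;               /* nothing to free after a failure */

    nsub = (long)re.re_nsub;     /* the one public member of regex_t */
    regfree(&re);
    return nsub;
}
```

     For the pattern (a)(b(c)) this returns 3; an  unterminated
     pattern such as a lone ( yields -1.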


     This area is not simple, because POSIX and  Perl  take  dif-
     ferent  views  of things.  It is not possible to get PCRE to
     obey POSIX semantics, but then PCRE was never intended to be
     a POSIX engine. The following table lists the different pos-
     sibilities for matching newline characters in PCRE:

                               Default   Change with

       . matches newline          no     PCRE_DOTALL
       newline matches [^a]       yes    not changeable
       $ matches \n at end        yes    PCRE_DOLLARENDONLY
       $ matches \n in middle     no     PCRE_MULTILINE
       ^ matches \n in middle     no     PCRE_MULTILINE

     This is the equivalent table for POSIX:

                               Default   Change with

       . matches newline          yes      REG_NEWLINE
       newline matches [^a]       yes      REG_NEWLINE
       $ matches \n at end        no       REG_NEWLINE
       $ matches \n in middle     no       REG_NEWLINE
       ^ matches \n in middle     no       REG_NEWLINE

     PCRE's behaviour is the same as Perl's, except that there is
     no  equivalent  for PCRE_DOLLARENDONLY in Perl. In both PCRE
     and Perl, there is no way  to  stop  newline  from  matching
     [^a].

     The default POSIX newline handling can be obtained  by  set-
     ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
     to make PCRE behave exactly as for the REG_NEWLINE action.
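
     The POSIX table above can be checked against a system POSIX
     regex implementation. The sketch below  deliberately
     includes the system regex.h rather  than  pcreposix.h,  and
     assumes such an implementation is  available.  The  helper
     name posix_matches is made up for this example.  Built
     against pcreposix.h instead, the results would follow  the
     PCRE table and not the POSIX one.

```c
#include <stddef.h>
#include <regex.h>   /* system POSIX regex, deliberately not pcreposix.h */

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in
   `subject`, 0 if it does not, and -1 if the pattern fails to
   compile. REG_NOSUB is used because no offsets are needed. */
int posix_matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

     With the subject "a\nb", the pattern a.b matches  by
     default and stops matching when REG_NEWLINE is given, while
     ^b behaves the other way round.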


     The function regexec() is called  to  match  a  pre-compiled
     pattern  preg against a given string, which is terminated by
     a zero byte, subject to the options in eflags. These can be:

       REG_NOTBOL

     The PCRE_NOTBOL option is set when  calling  the  underlying
     PCRE matching function.

       REG_NOTEOL

     The PCRE_NOTEOL option is set when  calling  the  underlying
     PCRE matching function.

     The portion of the string that was  matched,  and  also  any
     captured  substrings,  are returned via the pmatch argument,
     which points to  an  array  of  nmatch  structures  of  type
     regmatch_t,  containing  the  members rm_so and rm_eo. These
     contain the offset to the first character of each  substring
     and  the offset to the first character after the end of each
     substring, respectively.  The  0th  element  of  the  vector
     relates  to  the  entire portion of string that was matched;
     subsequent elements relate to the capturing  subpatterns  of
     the  regular  expression.  Unused  entries in the array have
     both structure members set to -1.

     A successful match yields a zero return; various error codes
     are  defined in the header file, of which REG_NOMATCH is the
     "expected" failure code.
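
     The use of pmatch, rm_so and rm_eo might look something like
     the following sketch.  The USE_PCREPOSIX  switch,  the
     MAX_GROUPS limit and the function name capture_group are all
     assumptions made for this illustration; none of them  comes
     from PCRE.

```c
#include <stddef.h>
#include <string.h>

/* USE_PCREPOSIX is a hypothetical build switch for this sketch. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX implementation */
#endif

#define MAX_GROUPS 10    /* arbitrary limit chosen for the example */

/* Hypothetical helper: copy substring number `group` of the first
   match into `out` (truncating to fit). Returns 0 on success and
   -1 on a compile error, no match, or an unused group. */
int capture_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    size_t len;
    int rc;

    if (group >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, subject, MAX_GROUPS, pm, 0);
    regfree(&re);

    /* Unused entries have both members set to -1. */
    if (rc != 0 || pm[group].rm_so == -1)
        return -1;

    /* rm_eo is the offset of the first character after the end. */
    len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, subject + pm[group].rm_so, len);
    out[len] = '\0';
    return 0;
}
```

     Group 0 gives the whole matched portion; higher  numbers
     give the capturing subpatterns in order.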


     The regerror()  function  maps  a  non-zero  errorcode  from
     either  regcomp()  or  regexec()  to a printable message. If
     preg is not NULL, the error should have arisen from the  use
     of  that structure. A message terminated by a binary zero is
     placed in errbuf. The length of the message,  including  the
     zero,  is  limited to errbuf_size. The yield of the function
     is the size of buffer needed to hold the whole message.
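
     A minimal sketch of  turning  a  regcomp()  failure  into  a
     message follows. The USE_PCREPOSIX switch and the  function
     name compile_error are assumptions made for  this  example,
     not part of PCRE. The exact message text depends on  which
     library is in use.

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch for this sketch. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX implementation */
#endif

/* Hypothetical helper: try to compile `pattern`. On success return 0.
   On failure place a zero-terminated message in `errbuf` (limited to
   `errbuf_size` bytes) and return the size regerror() reports as
   needed for the whole message. */
size_t compile_error(const char *pattern, char *errbuf, size_t errbuf_size)
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return 0;
    }

    /* The error arose from the use of `re`, so it is passed as preg. */
    return regerror(rc, &re, errbuf, errbuf_size);
}
```

     A caller can compare the  return  value  with  errbuf_size:
     a larger value means the message was truncated.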


     Compiling a regular expression causes memory to be allocated
     and  associated  with  the preg structure. The function reg-
     free() frees all such memory, after which preg may no longer
     be used as a compiled expression.


     Philip Hazel
     University Computing Service,
     Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.

     PCRE - Perl-compatible regular expressions


     A simple, complete demonstration program, to get you started
     with  using  PCRE, is supplied in the file pcredemo.c in the
     PCRE distribution.

     The program compiles the  regular  expression  that  is  its
     first argument, and matches it against the subject string in
     its second argument. No PCRE options are  set,  and  default
     character tables are used. If matching succeeds, the program
     outputs the portion of the subject  that  matched,  together
     with the contents of any captured substrings.

     If the -g option is given on the command line,  the  program
     then  goes on to check for further matches of the same regu-
     lar expression in the same subject string. The  logic  is  a
     little  bit tricky because of the possibility of matching an
     empty string. Comments in the code explain what is going on.
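
     The general shape of such a loop can be sketched  with  the
     POSIX-style wrapper functions. This is a simplification and
     not the logic pcredemo.c itself uses: after an empty  match
     it just steps forward one character. The USE_PCREPOSIX
     switch and the function name count_matches are  assumptions
     made for this illustration.

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch for this sketch. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX implementation */
#endif

/* Hypothetical helper: count successive non-overlapping matches of
   `pattern` in `subject`; returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t pm[1];
    const char *p = subject;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, pm, eflags) == 0) {
        count++;
        if (pm[0].rm_eo == pm[0].rm_so) {
            /* Empty match: without special handling the loop would
               find the same empty string at the same place forever. */
            if (p[pm[0].rm_eo] == '\0')
                break;
            p += pm[0].rm_eo + 1;
        } else {
            p += pm[0].rm_eo;    /* resume just after this match */
        }
        eflags = REG_NOTBOL;     /* p is no longer the true start */
    }

    regfree(&re);
    return count;
}
```

     Applied to the pattern cat|dog and the subject "the dog sat
     on the cat", this finds two matches.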

     On a Unix system that has PCRE installed in /usr/local,  you
     can  compile  the demonstration program using a command like

       gcc -o pcredemo pcredemo.c -I/usr/local/include \
           -L/usr/local/lib -lpcre

     Then you can run simple tests like this:

       ./pcredemo 'cat|dog' 'the cat sat on the mat'
       ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

     Note that there is a much more comprehensive  test  program,
     called  pcretest,  which  supports  many more facilities for
     testing  regular  expressions  and  the  PCRE  library.  The
     pcredemo program is provided as a simple coding example.

     On some operating systems (e.g.  Solaris)  you  may  get  an
     error like this when you try to run pcredemo:

       a.out: fatal: open failed: No such file or directory

     This is caused by the way shared library  support  works  on
     those systems. You need to add


     to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.


I've been studying this code for 3 days and it seems to me that some of the routines are incorrect.

incorrect portions

Can you point out the portions that are incorrect?