19.0.0 released Nov 08, 2023
Match regex

Purpose: Find, or find and replace patterns in strings using regex (regular expressions).

match-regex <pattern> in <target> \
    [ \
        ( replace-with <replace pattern> \
            result [ define ] <result> \
            [ result-length [ define ] <result length> ] )  \
            [ status [ define ] <status> ] )  \
    | \
        ( status [ define ] <status> \
            [ case-insensitive [ <case-insensitive> ] ] \
            [ single-match [ <single-match> ] ] \
            [ utf8 [ <utf8> ] ] ) \
    ] \
    [ cache ] \
    [ clear-cache <clear cache> )

match-regex searches <target> string for regex <pattern>. If "replace-with" is specified, then instance(s) of <pattern> in <target> are replaced with <replace pattern> string, and the result is stored in <result> string, which must be different from <target>. <result> is allocated memory.

The number of found or found/replaced patterns can be obtained in number <status> variable. In case of replacement(s), the length of <result> can be obtained in <result length> in optional "result-length" clause, which can be created with optional "define". Note that <result length> is the length in bytes.

If "replace-with" is not specified, then the number of matched <pattern>s within <target> is given in <status> number variable, which in this case must be specified.

If "case-insensitive" is used without optional boolean expression <case-insensitive>, or if <case-insensitive> evaluates to true, then searching for pattern is case insensitive. By default, it is case sensitive.

If "single-match" is specified without optional boolean expression <single-match>, or if <single-match> evaluates to true, then only the very first occurrence of <pattern> in <target> is processed. Otherwise, all occurrences are processed.

If "utf8" is used, then the pattern itself and all data strings used for matching are treated as UTF-8 strings.

<result> and <status> variables can be created within the statement with "define" clause.

If the pattern is bad (i.e. <pattern> is not a correct regular expression), Vely will error out with a message.
PCRE2 and Posix Regex
By default, PCRE2 regex syntax (Perl-compatible Regular Expressions v2) is used. To use extended regex syntax (Posix ERE), specify "--posix-regex" when building your application with vv. See more below in Limitations about differences.
Caching and performance
If "cache" clause is used, then regex compilation of <pattern> will be done only once and saved for future use. There is a significant performance benefit when match-regex executes repeatedly with "cache" (such as in case of web applications or in any kind of loop). If <pattern> changes and you need to recompile it once in a while, use "clear-cache" clause. <clear cache> is a "bool" variable; the regex cache is cleared if it is true, and stays if it is false. For example:
bool cl_c;
if (q == 0) cl_c = true; else cl_c = false;
match-regex ps in look_in replace-with "Yes it is \\1!" result res cache clear-cache cl_c

In this case, when "q" is 0, cache will be cleared, and the pattern in variable "ps" will be recompiled. In all other cases, the last computed regex stays the same.

While every pattern is different, when using cache even a relatively small one was shown in tests to speed up the match-regex by about 500%, or 5x faster. Use cache whenever possible as it brings parsing performance close to its theoretical limits.
Subexpressions and back-referencing
Subexpressions are referenced via a backslash followed by a number. Because C treats backslash followed by a number as an octal number, you must use double backslash (\\). For example:
match-regex "(good).*(day)" \
    in "Hello, good day!" \
    replace-with "\\2 \\1" \
    result res

will produce string "Hello, day good!" as a result in "res" variable. Each subexpression is within () parenthesis, so for instance "day" in the above pattern is the 2nd subexpression, and is back-referenced as \\2 in replacement.

There can be a maximum of 23 subexpressions.

Note that backreferences to non-existing subexpressions are ignored - for example \\4 when there are only 3 subexpressions. Vely is "smart" about using two digits and differentiating between \\1 and \\10 for instance - it takes into account the actual number of subexpressions and their validity, and selects a proper subexpression even when two digits are used in a backreference.
Lookaheads and lookbehinds
match-regex supports syntax for lookaheads (i.e. "(?=...)" and "(?!...)") and lookbehinds (i.e. "(?<=...)" and "(?<!...)"). See PCRE2 pattern matching for more details. For instance, the following matches "bar" only if preceded by "foo":
match-regex "\\w*(?<=foo)bar" in "foobar" status st single-match

and the following matches "foo" if followed by "bar":
match-regex "\\w*foo(?=bar)" in "foofoo" status st single-match

Limitations
If you are using older versions of PCRE2 (10.36 or earlier) such as by default in Debian 10 or Ubuntu 18, then instead of PCRE2, an Extended Regex Expressions (ERE) from the  built-in Linux regex library is used, due to older PCRE2 versions having name conflicts with other libraries. In this case, "utf8" clause will have no effect, and lookaheads/lookbehinds functionality will not work; possibly a few others as well, however for the most part the two are compatible. If your system uses an older PCRE2 library, you can upgrade to 10.37 or later to use PCRE2.

In any case, if you need to use Posix ERE instead of PCRE2 (for compatibility, to reduce memory footprint or some other reason), you can use "--posix-regex" option of vv; the same limitations as above apply. Note that PCRE2 (the default) is generally faster than ERE.

You can use predefined symbol "VV_C_POSIXREGEX" to determine in your code if it's compiled with "--posix-regex", for instance:
#ifndef VV_C_POSIXREGEX
    match-regex "^([^\\s]+).*$" in ores replace-with "\\1" result define fr
#else
    match-regex "^([^[:space:]]+).*$" in ores replace-with "\\1" result define fr
#endif
In this case, "\s" is used with PCRE2, and [:space:] with ERE (i.e. posix regex).

Examples
Use match-regex statement to find out if a string matches a pattern, for example:
match-regex "SOME (.*) IS (.*)" in "WOW SOME THING IS GOOD HERE" status define st

In this case, the first parameter ("SOME (.*) IS (.*)") is a pattern and you're matching it with string ("WOW SOME THING IS GOOD HERE"). Since there is a match, status variable (defined on the fly as integer) "st" will be "1" (meaning one match was found) - in general it will contain the number of patterns matched.

Search for patterns and replace them by using replace-with clause, for example:
match-regex "SOME (.*) IS ([^ ]+)" in "WOW SOME THING IS GOOD HERE FOR SURE" replace-with "THINGS ARE \\2 YES!" result define res status define st

In this case, the result from replacement will be in a new string variable "res" specified with the result clause, and it will be
WOW THINGS ARE GOOD YES! HERE FOR SURE

The above demonstrates a typical use of subexpressions in regex (meaning "()" statements) and their referencing with "\\1", "\\2" etc. in the order in which they appear. Consult regex documentation for more information.  Status variable specified with status clause ("st" in this example) will contain the number of patterns matched and replaced. As is the case with all Vely statements, you could use "define" clause to create a result/output variable within the statement, or omit it if it's created elsewhere.

Matching is by default case sensitive. Use "case-insensitive" clause to change it to case insensitive, for instance:
match-regex "SOME (.*) IS (.*)" in "WOW some THING IS GOOD HERE" status define st case-insensitive

In the above case, the pattern would not be found without "case-insensitive" clause because "SOME" and "some" would not match. This clause works the same in matching-only as well as replacing strings.

If you want to match only the first occurrence of a pattern, use "single-match" option:
match-regex "SOME ([^ ]+) IS ([^ ]+)" in "WOW SOME THING IS GOOD HERE AND SOME STUFF IS GOOD TOO" status define st single-match

In this case there would be two matches by default ("SOME THING IS GOOD" and "SOME STUFF IS GOOD") but only the first would be found. This clause works the same for replacing as well - only the first occurrence would be replaced.
See also
Regex
match-regex    
See all
documentation


You are free to copy, redistribute and adapt this web page (even commercially), as long as you give credit and provide a dofollow link back to this page - see full license at CC-BY-4.0. Copyright (c) 2019-2023 Dasoftver LLC. Vely and elephant logo are trademarks of Dasoftver LLC. The software and information on this web site are provided "AS IS" and without any warranties or guarantees of any kind. Icons from table-icons.io copyright PaweĊ‚ Kuna, licensed under MIT license.