Regex Extract

Parse Cribl Compatible

Synopsis

Extracts named fields from text using regular expressions with named capture groups.

Schema

regex_extract:
- field: <ident>
- regex: <string>
- additional_regex: <string[]>
- description: <text>
- field_name_format: <string>
- if: <script>
- ignore_failure: <boolean>
- ignore_missing: <boolean>
- max_exec: <integer>
- on_failure: <processor[]>
- on_success: <processor[]>
- overwrite_existing: <boolean>
- tag: <string>

Configuration

| Field | Required | Default | Description |
|---|---|---|---|
| field | Y | - | Field containing text to extract from |
| regex | Y | - | Regular expression with named capture groups |
| additional_regex | N | - | Additional patterns to match after primary regex |
| description | N | - | Explanatory note |
| field_name_format | N | - | Template for formatting extracted field names (${name}) |
| if | N | - | Condition to run |
| ignore_failure | N | false | Continue on regex match failures |
| ignore_missing | N | false | Continue if source field doesn't exist |
| max_exec | N | 100 | Maximum number of matches to process |
| on_failure | N | - | See Handling Failures |
| on_success | N | - | See Handling Success |
| overwrite_existing | N | false | Replace existing fields instead of converting to array |
| tag | N | - | Identifier |

Details

The processor supports dynamic field naming using _NAME_n and _VALUE_n capture group pairs, field name formatting, and handling of multiple matches.

Golang regular expressions with named capture groups are used to extract fields.

warning

Complex regular expressions on large texts may impact performance.

Each named group becomes a field in the output. Special _NAME_n and _VALUE_n pairs allow dynamic field naming based on extracted content.

note

The _NAME_n and _VALUE_n pairs must use matching indices, e.g. _NAME_0 with _VALUE_0.
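The pairing rule can be illustrated with a small Python sketch. This is not the processor's implementation: `extract_fields` is a hypothetical helper, it handles only the first match, and it uses Python's `(?P<name>...)` group syntax rather than the `(?<name>...)` form used in this document's patterns.

```python
import re

def extract_fields(text, pattern):
    """Hypothetical helper: turn the named groups of the first match into fields."""
    match = re.search(pattern, text)
    if match is None:
        return {}
    groups = match.groupdict()
    fields = {}
    for name, value in groups.items():
        if name.startswith("_NAME_"):
            # The value group must carry the same index as the name group.
            index = name[len("_NAME_"):]
            paired = groups.get("_VALUE_" + index)
            if value is not None and paired is not None:
                fields[value] = paired
        elif not name.startswith("_VALUE_") and value is not None:
            # An ordinary named group becomes a field of the same name.
            fields[name] = value
    return fields

static = extract_fields("metric1=23 metric2=42", r"metric1=(?P<metric1>\d+)")
dynamic = extract_fields("key=value", r"(?P<_NAME_0>[^=]+)=(?P<_VALUE_0>.+)")
print(static)   # {'metric1': '23'}
print(dynamic)  # {'key': 'value'}
```

A `_NAME_0` group without a `_VALUE_0` partner produces no field in this sketch, which is one plausible reading of the matching-index requirement.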

Multiple regex patterns, array conversion for duplicate fields, field name templating, and match count limiting are also supported.

Field names are automatically sanitized to remove invalid characters, but field_name_format should still produce valid field names. When overwrite_existing is set to false, duplicate matches for the same field are converted to an array.
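As a rough illustration of name templating and duplicate handling, the Python sketch below models the two settings. The helpers `format_name` and `store` are hypothetical, and `string.Template` is assumed to be a close enough stand-in for the `${name}` placeholder. Sanitization is left out because the source does not list which characters count as invalid.

```python
from string import Template

def format_name(name, field_name_format=None):
    """Hypothetical helper: apply a ${name}-style template to a field name."""
    if field_name_format is None:
        return name
    return Template(field_name_format).substitute(name=name)

def store(event, name, value, overwrite_existing=False):
    """Hypothetical helper: write a value, collecting duplicates into an array."""
    if overwrite_existing or name not in event:
        event[name] = value
    elif isinstance(event[name], list):
        event[name].append(value)
    else:
        # Second value for this name: convert the scalar into an array.
        event[name] = [event[name], value]

appended = {}
replaced = {}
for value in ["1", "2", "3"]:
    key = format_name("value", "${name}_field")
    store(appended, key, value)
    store(replaced, key, value, overwrite_existing=True)

print(appended)  # {'value_field': ['1', '2', '3']}
print(replaced)  # {'value_field': '3'}
```

In this model, overwriting keeps only the last value written, while the default behaviour preserves every match in order.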

warning

Set max_exec carefully when a pattern matches many times in one input: it caps the number of matches that are processed.

Consider using ignore_failure when regex patterns might not match all inputs.
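The effect of a match limit can be sketched in Python as follows. `extract_limited` is a hypothetical helper that assumes matches are taken in order and that everything past the limit is simply skipped. It uses Python's `(?P<name>...)` group syntax.

```python
import re
from itertools import islice

def extract_limited(text, pattern, max_exec=100):
    """Hypothetical helper: collect named groups from at most max_exec matches."""
    collected = {}
    # islice stops consuming the match iterator once max_exec matches are seen.
    for match in islice(re.finditer(pattern, text), max_exec):
        for name, value in match.groupdict().items():
            collected.setdefault(name, []).append(value)
    # A field matched once stays a scalar; repeated fields become arrays.
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in collected.items()
    }

message = "value=1 value=2 value=3 value=4 value=5"
limited = extract_limited(message, r"value=(?P<value>\d+)", max_exec=3)
single = extract_limited(message, r"value=(?P<value>\d+)", max_exec=1)
print(limited)  # {'value': ['1', '2', '3']}
print(single)   # {'value': '1'}
```

A low limit bounds the work done per input but silently drops later matches, so the value is a trade-off between cost and completeness.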

Examples

Basic

Extracting a numeric value with a static field name...

{
"message": "metric1=23 metric2=42"
}
regex_extract:
- field: message
- regex: "metric1=(?<metric1>\\d+)"

creates a new field:

{
"message": "metric1=23 metric2=42",
"metric1": "23"
}

Complex Logs

Extracting multiple fields from a structured log...

{
"message": "462559d4a487[471]: 172.23.0.6 - - [26/Feb/2024:20:22:38 +0000] \"GET /path HTTP/1.1\" 200 87533"
}
regex_extract:
- field: message
- regex: "^(?<container_id>[^\\s]+)\\[(?<process_id>\\d+)\\]:\\s+(?<remote_host>[^\\s]+)\\s+(?<remote_user>-)\\s+(?<auth_user>-)\\s+\\[(?<timestamp>[^\\]]+)\\]\\s+\"(?<request_method>\\w+)\\s+(?<requested_url>[^\\s]+)\\s+(?<http_version>[^\"]+)\"\\s+(?<status>\\d+)\\s+(?<bytes>.+)$"

yields HTTP log components:

{
"message": "462559d4a487[471]: 172.23.0.6...",
"container_id": "462559d4a487",
"process_id": "471",
"remote_host": "172.23.0.6",
"request_method": "GET",
"requested_url": "/path",
"http_version": "HTTP/1.1",
"status": "200",
"bytes": "87533"
}

Dynamic Fields

Extracting key-value pairs as dynamic fields...

{
"message": "name=\"John Doe\" age=30 email=\"john@example.com\""
}
regex_extract:
- field: message
- regex: "(?<_NAME_0>[^\\s=]+)=\"?(?<_VALUE_0>(?<=\")[^\"]*|[^\\s\"]+)"

creates new fields based on the extracted names:

{
"message": "name=\"John Doe\" age=30 email=\"john@example.com\"",
"name": "John Doe",
"age": "30",
"email": "john@example.com"
}

Formatting

Formatting extracted field names...

{
"message": "key=value"
}
regex_extract:
- field: message
- regex: "(?<_NAME_0>[^=]+)=(?<_VALUE_0>.+)"
- field_name_format: "${name}_field"

adds suffixes:

{
"message": "key=value",
"key_field": "value"
}

Multi-Match

Extracting multiple matches with array conversion...

{
"message": "value=1 value=2 value=3 value=4 value=5"
}
regex_extract:
- field: message
- regex: "value=(?<value>\\d+)"
- max_exec: 3

creates an array of up to max_exec matches:

{
"message": "value=1 value=2 value=3 value=4 value=5",
"value": ["1", "2", "3"]
}

Structured Data

Using multiple regexes with structured data...

{
"message": "<134>1 2020-12-22T17:06:08Z CORP_INT_NLB CheckPoint 18160 - [action:\"Accept\"; conn_direction:\"Internal\"]"
}
regex_extract:
- field: message
- regex: "\\[(?<__fields>.*?)\\]"
- additional_regex:
- "(?<_NAME_0>[^:]+):\"(?<_VALUE_0>[^\"]+)\""

extracts nested key-value pairs:

{
"message": "<134>1 2020-12-22T17:06:08Z...",
"action": "Accept",
"conn_direction": "Internal"
}