Data Extractors

TRECO provides a plugin-based extraction system that parses HTTP responses and stores the extracted data in variables.

Overview

Extractors allow you to capture data from HTTP responses for use in subsequent requests. The extraction system supports multiple formats and uses a plugin architecture for extensibility.

Basic Syntax

extract:
  variable_name:
    type: extractor_type
    pattern: "extraction_pattern"

All extracted variables are stored in the execution context and can be accessed in later states using the format {{ state_name.variable_name }}.

Available Extractors

JSONPath (jpath)

Extract data from JSON responses using JSONPath expressions.

Type names: jpath, jsonpath, json_path

Syntax:

extract:
  token:
    type: jpath
    pattern: "$.access_token"

Common Patterns:

# Root level field
pattern: "$.field_name"

# Nested field
pattern: "$.user.profile.email"

# Array element
pattern: "$.items[0].id"

# All elements in array
pattern: "$.items[*].id"

# Filter by condition
pattern: "$.users[?(@.active==true)].name"

Example:

states:
  login:
    request: |
      POST /api/login HTTP/1.1
      Content-Type: application/json

      {"username": "user", "password": "pass"}

    extract:
      access_token:
        type: jpath
        pattern: "$.access_token"
      refresh_token:
        type: jpath
        pattern: "$.refresh_token"
      user_id:
        type: jpath
        pattern: "$.user.id"
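
To make the path semantics above concrete, here is a minimal Python sketch that resolves a small JSONPath subset (root fields, nested fields, and numeric indices) against a parsed response body. The helper `resolve_simple_jsonpath` is hypothetical and is not TRECO's implementation; wildcards and filters are deliberately left out.

```python
import json
import re

# Hypothetical helper, not part of TRECO: handles only `.field` and `[index]` steps.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")

def resolve_simple_jsonpath(document, path):
    """Walk `document` along `path`; return None when any step is missing."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    position = 1
    while position < len(path):
        match = _STEP.match(path, position)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {position}: {path!r}")
        field, index = match.groups()
        try:
            current = current[field] if field is not None else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None  # no match yields None, as the Best Practices section describes
        position = match.end()
    return current

body = json.loads('{"access_token": "abc123", "user": {"id": 42}, "items": [{"id": 7}]}')
token = resolve_simple_jsonpath(body, "$.access_token")
user_id = resolve_simple_jsonpath(body, "$.user.id")
first_item = resolve_simple_jsonpath(body, "$.items[0].id")
missing = resolve_simple_jsonpath(body, "$.refresh_token")
```

Returning None for a missing field, rather than raising, keeps the sketch in line with the documented behavior for patterns that do not match.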

XPath (xpath)

Extract data from XML/HTML responses using XPath expressions.

Type names: xpath, xml_path, html_path

Syntax:

extract:
  csrf_token:
    type: xpath
    pattern: '//input[@name="csrf"]/@value'

Common Patterns:

# Element by ID
pattern: '//*[@id="element-id"]'

# Input value by name
pattern: '//input[@name="field_name"]/@value'

# Link href
pattern: '//a[@class="link"]/@href'

# Text content
pattern: '//div[@class="message"]/text()'

# Meta tag content
pattern: '//meta[@name="csrf-token"]/@content'

Example:

states:
  get_form:
    request: |
      GET /form HTTP/1.1
      Host: {{ config.host }}

    extract:
      csrf_token:
        type: xpath
        pattern: '//input[@name="csrf_token"]/@value'
      form_action:
        type: html_path
        pattern: '//form/@action'
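
A rough way to try such expressions outside TRECO is Python's standard `xml.etree.ElementTree`. This is only a sketch under two assumptions: the page is well-formed XML, and the attribute is read in Python because ElementTree supports a limited XPath subset that has no trailing `/@value` step. TRECO's own XPath engine may behave differently.

```python
import xml.etree.ElementTree as ET

# Sample page invented for illustration; it must be well-formed XML for ElementTree.
page = """<html><body>
<form action="/submit" method="post">
  <input type="hidden" name="csrf_token" value="tok-123"/>
</form>
</body></html>"""

root = ET.fromstring(page)

# Select the element with an attribute predicate, then read the attribute in Python.
csrf_input = root.find(".//input[@name='csrf_token']")
csrf_token = csrf_input.get("value") if csrf_input is not None else None

form = root.find(".//form")
form_action = form.get("action") if form is not None else None
```

Real-world HTML is often not well-formed, so a lenient HTML parser would be needed for arbitrary pages.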

Regex (regex)

Extract data using regular expressions with capture groups.

Type names: regex, re, regexp

Syntax:

extract:
  session_id:
    type: regex
    pattern: "SESSION=([A-Z0-9]+)"

The first capture group () is returned as the extracted value.

Common Patterns:

# Cookie value
pattern: "SESSIONID=([a-zA-Z0-9]+)"

# Bearer token
pattern: 'Bearer ([a-zA-Z0-9._-]+)'

# UUID
pattern: '([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})'

# Number
pattern: 'balance["\s:]+(\d+\.?\d*)'

# Between quotes
pattern: '"token":"([^"]+)"'

Example:

states:
  get_session:
    request: |
      GET /api/session HTTP/1.1
      Host: {{ config.host }}

    extract:
      session_id:
        type: regex
        pattern: 'session_id=([a-f0-9]{32})'
      auth_code:
        type: re
        pattern: 'code=([A-Z0-9]+)'
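
The first-capture-group rule can be checked quickly with Python's `re` module. The helper name `extract_first_group` and the sample response are made up for this sketch, and whether TRECO searches headers, body, or both is not stated here, so the example simply searches one combined string.

```python
import re

def extract_first_group(text, pattern):
    """Return capture group 1 of the first match, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Sample response invented for illustration.
response_text = (
    "HTTP/1.1 200 OK\r\n"
    "Set-Cookie: session_id=0123456789abcdef0123456789abcdef; Path=/\r\n"
    "\r\n"
    "redirect?code=AB12CD"
)

session_id = extract_first_group(response_text, r"session_id=([a-f0-9]{32})")
auth_code = extract_first_group(response_text, r"code=([A-Z0-9]+)")
no_match = extract_first_group(response_text, r"missing=(\d+)")
```

Note that a pattern without any capture group would make `match.group(1)` raise an IndexError in this sketch, which is one reason to always include parentheses around the value you want.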

Boundary (boundary)

Extract data between left and right delimiters. This is a simpler alternative to regex for common patterns.

Type names: boundary, between, delimited

Syntax:

extract:
  token:
    type: boundary
    pattern: '"token":"|||"'

The pattern uses ||| as a separator between the left and right boundaries.

Special Markers:

^ - Beginning of line (for left boundary)
$ - End of line (for right boundary)

Common Patterns:

# Between delimiters
pattern: '"token":"|||"'

# Until end of line
pattern: 'Authorization: |||$'

# From beginning of line
pattern: '^|||: value'

# HTML attribute value
pattern: 'value="|||"'

# JSON field value
pattern: '"balance":|||,'

Example:

states:
  parse_response:
    request: |
      GET /api/data HTTP/1.1
      Host: {{ config.host }}

    extract:
      api_key:
        type: boundary
        pattern: '"api_key":"|||"'
      auth_header:
        type: between
        pattern: 'X-Auth-Token: |||$'
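
One plausible reading of the boundary syntax is sketched below in Python. It assumes that a bare `^` as the left boundary and a bare `$` as the right boundary are the only special markers, and that every other boundary is matched literally. The function `extract_between` is hypothetical and may differ from TRECO's actual matching rules.

```python
import re

def extract_between(text, pattern, separator="|||"):
    """Return the shortest text between the left and right boundaries, or None."""
    if separator not in pattern:
        raise ValueError(f"pattern must contain {separator!r}")
    left, right = pattern.split(separator, 1)
    # Assumption: only a bare ^ or $ is special; everything else is literal text.
    left_re = "^" if left == "^" else re.escape(left)
    right_re = "$" if right == "$" else re.escape(right)
    match = re.search(f"{left_re}(.*?){right_re}", text, re.MULTILINE)
    # Drop a trailing carriage return left over from CRLF line endings.
    return match.group(1).rstrip("\r") if match else None

# Sample text invented for illustration.
text = 'X-Auth-Token: abc123\n\n{"api_key":"k-789","balance":150,"x":1}'

api_key = extract_between(text, '"api_key":"|||"')
auth_header = extract_between(text, "X-Auth-Token: |||$")
balance = extract_between(text, '"balance":|||,')
header_name = extract_between(text, "^|||: abc123")
```

The non-greedy `(.*?)` makes the extraction stop at the first occurrence of the right boundary, which matches the intuition of "between the nearest delimiters".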

Header (header)

Extract values from HTTP response headers (case-insensitive).

Type names: header, headers, http_header

Syntax:

extract:
  request_id:
    type: header
    pattern: "X-Request-Id"

Common Headers:

# Custom auth header
pattern: "X-Auth-Token"

# Request ID
pattern: "X-Request-Id"

# Content type
pattern: "Content-Type"

# Location (for redirects)
pattern: "Location"

# Rate limit info
pattern: "X-RateLimit-Remaining"

Example:

states:
  get_auth:
    request: |
      POST /api/auth HTTP/1.1
      Host: {{ config.host }}

    extract:
      auth_token:
        type: header
        pattern: "X-Auth-Token"
      rate_limit:
        type: headers
        pattern: "X-RateLimit-Remaining"
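
Case-insensitive header lookup can be sketched in a few lines of Python. The function `get_header` and the raw response are illustrative only; how TRECO handles repeated headers is not specified here, so this sketch returns the first occurrence.

```python
def get_header(raw_response, name):
    """Return the first value of header `name`, compared case-insensitively."""
    head, _, _ = raw_response.partition("\r\n\r\n")
    for line in head.split("\r\n")[1:]:  # skip the status line
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == name.lower():
            return value.strip()
    return None

# Sample response invented for illustration.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "x-auth-token: abc123\r\n"
    "X-RateLimit-Remaining: 42\r\n"
    "\r\n"
    "{}"
)

auth_token = get_header(raw, "X-Auth-Token")
rate_limit = get_header(raw, "x-ratelimit-remaining")
absent = get_header(raw, "Location")
```

Header values come back as strings in this sketch, so a numeric header such as the rate limit would need an explicit conversion before any arithmetic.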

JWT (jwt)

Decode and extract data from JSON Web Tokens (JWTs). Useful for extracting user information, checking token expiration, and validating token structure during API security testing.

Type names: jwt

Extract Specific Claims:

extract:
  user_id:
    type: jwt
    source: "{{ access_token }}"
    claim: sub

  user_role:
    type: jwt
    source: "{{ access_token }}"
    claim: role

  email:
    type: jwt
    source: "{{ access_token }}"
    claim: email

Extract JWT Parts:

extract:
  # Get entire payload
  jwt_payload:
    type: jwt
    source: "{{ token }}"
    part: payload

  # Get header (algorithm, type, etc.)
  jwt_header:
    type: jwt
    source: "{{ token }}"
    part: header

  # Get signature
  jwt_signature:
    type: jwt
    source: "{{ token }}"
    part: signature

Validation Checks:

extract:
  # Check if token has expired
  is_expired:
    type: jwt
    source: "{{ token }}"
    check: expired

  # Get algorithm (HS256, RS256, etc.)
  algorithm:
    type: jwt
    source: "{{ token }}"
    check: algorithm

  # Check if structure is valid
  is_valid:
    type: jwt
    source: "{{ token }}"
    check: valid

With Signature Verification:

extract:
  verified_payload:
    type: jwt
    source: "{{ token }}"
    part: payload
    verify: true
    secret: "{{ jwt_secret }}"
    algorithms: ["HS256", "HS512"]

Common JWT Claims:

sub - Subject (usually user ID)
iss - Issuer
aud - Audience
exp - Expiration timestamp
nbf - Not Before timestamp
iat - Issued At timestamp
jti - JWT ID
role, roles - User role(s)
permissions - User permissions
email, username - User identity
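
To show what the claim, part, and check options operate on, here is a standard-library Python sketch that splits a compact JWT, reads its claims, tests the exp claim, and verifies an HS256 signature. All function names are hypothetical, `make_hs256_token` exists only to build a demo token, and none of this is TRECO's implementation.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url_decode(segment):
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def b64url_encode(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def decode_jwt(token):
    """Split a compact JWT into (header, payload, signature) without verifying it."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload, signature_b64

def is_expired(payload, now=None):
    """True when an exp claim is present and the current time has reached it."""
    now = time.time() if now is None else now
    return "exp" in payload and payload["exp"] <= now

def verify_hs256(token, secret):
    """Recompute the HMAC-SHA256 signature and compare it in constant time."""
    signing_input, _, signature_b64 = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url_encode(expected), signature_b64)

def make_hs256_token(payload, secret):
    """Demo-only helper that builds a signed token to feed the functions above."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"

token = make_hs256_token({"sub": "user-1", "role": "admin", "exp": 1000}, "demo-secret")
header, payload, _ = decode_jwt(token)
algorithm = header["alg"]
user_role = payload.get("role")
```

Decoding without verification is enough to read claims, which is why an unverified payload should never be trusted for authorization decisions.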

Security Testing Example:

states:
  analyze_jwt:
    request: |
      GET /api/protected HTTP/1.1
      Authorization: Bearer {{ token }}

    extract:
      algorithm:
        type: jwt
        source: "{{ token }}"
        check: algorithm

      is_expired:
        type: jwt
        source: "{{ token }}"
        check: expired

      user_role:
        type: jwt
        source: "{{ token }}"
        claim: role

    logger:
      on_state_leave: |
        {% if algorithm == 'none' %}
          🚨 CRITICAL: JWT uses 'none' algorithm!
        {% elif algorithm == 'HS256' %}
          ⚠ WARNING: JWT uses symmetric algorithm
        {% endif %}
        {% if is_expired %}
          🚨 Token is expired but still accepted!
        {% endif %}

Extractor Summary

Type	Aliases	Best For
`jpath`	`jsonpath`, `json_path`	JSON API responses
`xpath`	`xml_path`, `html_path`	HTML forms, XML responses
`regex`	`re`, `regexp`	Complex patterns, mixed content
`boundary`	`between`, `delimited`	Simple text extraction
`header`	`headers`, `http_header`	Response headers
`cookie`	`cookies`, `set_cookie`, `set-cookie`	Session cookies, tokens
`jwt`	(none)	JWT analysis, claims extraction

Using Extracted Variables

Extracted variables are stored in the context and can be accessed in templates:

states:
  login:
    extract:
      token:
        type: jpath
        pattern: "$.token"

  use_token:
    request: |
      GET /api/data HTTP/1.1
      Authorization: Bearer {{ login.token }}

Variable Naming

Use lowercase with underscores: user_id, auth_token
Avoid reserved words: config, thread, context
Be descriptive: access_token not t

Accessing Variables

# From previous state
{{ state_name.variable_name }}

# From current state (in logger)
{{ variable_name }}

# From config
{{ config.host }}

# Thread info (in race states)
{{ thread.id }}
{{ thread.count }}
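
The lookup rules above amount to walking a nested context by dotted name. The Python sketch below imitates that with a tiny placeholder renderer; it is a stand-in for illustration only, handles nothing beyond `{{ a.b }}` substitution, and the `render` function and context layout are assumptions rather than TRECO internals.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def render(template, context):
    """Replace {{ a.b }} placeholders by walking nested dicts in `context`."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # a KeyError here means the variable was never stored
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)

# Assumed context layout: one dict per state, plus a config dict.
context = {
    "config": {"host": "example.test"},
    "login": {"token": "abc123"},
}

request = render(
    "GET /api/data HTTP/1.1\nHost: {{ config.host }}\nAuthorization: Bearer {{ login.token }}",
    context,
)
```

Namespacing variables by state name is what lets two states extract a variable called token without overwriting each other.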

Creating Custom Extractors

You can create custom extractors by implementing the BaseExtractor interface:

from treco.http.extractor.base import BaseExtractor, register_extractor

@register_extractor('custom', aliases=['my_extractor'])
class CustomExtractor(BaseExtractor):
    """Custom extractor for specific data formats."""

    def extract(self, response, pattern):
        """
        Extract data from response.

        Args:
            response: ResponseProtocol object
            pattern: Extraction pattern string

        Returns:
            Extracted value or None if not found
        """
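        # Your extraction logic goes here. As a placeholder, return the text
        # that follows `pattern` on the first line containing it.
        content = response.text
        for line in content.splitlines():
            if pattern in line:
                return line.split(pattern, 1)[1].strip()
        return None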

The @register_extractor decorator automatically registers your extractor with the specified type name and aliases.
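
For readers curious how a decorator like this can map a type name and its aliases to a class, here is a self-contained Python sketch of the registry pattern. It is a guess at the general technique, not TRECO's source: `_REGISTRY`, `get_extractor`, `FakeResponse`, and `PrefixExtractor` are all hypothetical names.

```python
_REGISTRY = {}

def register_extractor(name, aliases=()):
    """Class decorator that records a class under its name and every alias."""
    def decorator(cls):
        for key in (name, *aliases):
            _REGISTRY[key.lower()] = cls
        return cls
    return decorator

def get_extractor(type_name):
    """Instantiate the extractor registered for `type_name` (hypothetical lookup)."""
    try:
        return _REGISTRY[type_name.lower()]()
    except KeyError:
        raise ValueError(f"Extractor type not found: {type_name}") from None

class FakeResponse:
    """Stand-in for a response object exposing a `text` attribute."""
    def __init__(self, text):
        self.text = text

@register_extractor("prefix", aliases=["after_prefix"])
class PrefixExtractor:
    def extract(self, response, pattern):
        # Return whatever follows `pattern` on the first line that starts with it.
        for line in response.text.splitlines():
            if line.startswith(pattern):
                return line[len(pattern):].strip()
        return None

extractor = get_extractor("after_prefix")
value = extractor.extract(FakeResponse("status: ok\nticket: T-42"), "ticket:")
```

Because registration happens when the class definition runs, the module defining a custom extractor has to be imported before its type name can be resolved, which matches the troubleshooting advice about importing custom extractors before use.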

Best Practices

Choose the right extractor: Use JSONPath for JSON, XPath for HTML, regex for complex patterns
Be specific with patterns: Avoid overly broad patterns that might match wrong data
Handle missing data: Extractors return None if the pattern doesn’t match
Test patterns: Verify patterns work with actual response data
Use aliases: Different teams may prefer different naming conventions

Troubleshooting

Pattern not matching:

Check the response content type
Verify the pattern syntax
Use verbose mode to see the actual response
Test pattern with sample data

Wrong data extracted:

Make patterns more specific
Use capture groups correctly in regex
Check for multiple matches (first match is used)

Extractor type not found:

Check spelling and aliases
Ensure you’re using a valid type name
Custom extractors must be imported before use
