Data Extractors
===============

TRECO provides a powerful, plugin-based extraction system for parsing HTTP responses and extracting data into variables.

Overview
--------

Extractors allow you to capture data from HTTP responses for use in subsequent requests. The extraction system supports multiple formats and uses a plugin architecture for extensibility.

Basic Syntax
~~~~~~~~~~~~

.. code-block:: yaml

   extract:
     variable_name:
       type: extractor_type
       pattern: "extraction_pattern"

All extracted variables are stored in the execution context and can be accessed in later states using the format ``{{ state_name.variable_name }}``.

Available Extractors
--------------------

JSONPath (jpath)
~~~~~~~~~~~~~~~~

Extract data from JSON responses using JSONPath expressions.

**Type names:** ``jpath``, ``jsonpath``, ``json_path``

**Syntax:**

.. code-block:: yaml

   extract:
     token:
       type: jpath
       pattern: "$.access_token"

**Common Patterns:**

.. code-block:: yaml

   # Root level field
   pattern: "$.field_name"
   
   # Nested field
   pattern: "$.user.profile.email"
   
   # Array element
   pattern: "$.items[0].id"
   
   # All elements in array
   pattern: "$.items[*].id"
   
   # Filter by condition
   pattern: "$.users[?(@.active==true)].name"

**Example:**

.. code-block:: yaml

   states:
     login:
       request: |
         POST /api/login HTTP/1.1
         Content-Type: application/json
         
         {"username": "user", "password": "pass"}
       
       extract:
         access_token:
           type: jpath
           pattern: "$.access_token"
         refresh_token:
           type: jpath
           pattern: "$.refresh_token"
         user_id:
           type: jpath
           pattern: "$.user.id"

XPath (xpath)
~~~~~~~~~~~~~

Extract data from XML/HTML responses using XPath expressions.

**Type names:** ``xpath``, ``xml_path``, ``html_path``

**Syntax:**

.. code-block:: yaml

   extract:
     csrf_token:
       type: xpath
       pattern: '//input[@name="csrf"]/@value'

**Common Patterns:**

.. code-block:: yaml

   # Element by ID
   pattern: '//*[@id="element-id"]'
   
   # Input value by name
   pattern: '//input[@name="field_name"]/@value'
   
   # Link href
   pattern: '//a[@class="link"]/@href'
   
   # Text content
   pattern: '//div[@class="message"]/text()'
   
   # Meta tag content
   pattern: '//meta[@name="csrf-token"]/@content'

**Example:**

.. code-block:: yaml

   states:
     get_form:
       request: |
         GET /form HTTP/1.1
         Host: {{ config.host }}
       
       extract:
         csrf_token:
           type: xpath
           pattern: '//input[@name="csrf_token"]/@value'
         form_action:
           type: html_path
           pattern: '//form/@action'

Regex (regex)
~~~~~~~~~~~~~

Extract data using regular expressions with capture groups.

**Type names:** ``regex``, ``re``, ``regexp``

**Syntax:**

.. code-block:: yaml

   extract:
     session_id:
       type: regex
       pattern: "SESSION=([A-Z0-9]+)"

The first capture group ``()`` is returned as the extracted value.

**Common Patterns:**

.. code-block:: yaml

   # Cookie value
   pattern: "SESSIONID=([a-zA-Z0-9]+)"
   
   # Bearer token
   pattern: 'Bearer ([a-zA-Z0-9._-]+)'
   
   # UUID
   pattern: '([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})'
   
   # Number
   pattern: 'balance["\s:]+(\d+\.?\d*)'
   
   # Between quotes
   pattern: '"token":"([^"]+)"'

**Example:**

.. code-block:: yaml

   states:
     get_session:
       request: |
         GET /api/session HTTP/1.1
         Host: {{ config.host }}
       
       extract:
         session_id:
           type: regex
           pattern: 'session_id=([a-f0-9]{32})'
         auth_code:
           type: re
           pattern: 'code=([A-Z0-9]+)'

Boundary (boundary)
~~~~~~~~~~~~~~~~~~~

Extract data between left and right delimiters. Simpler alternative to regex for common patterns.

**Type names:** ``boundary``, ``between``, ``delimited``

**Syntax:**

.. code-block:: yaml

   extract:
     token:
       type: boundary
       pattern: '"token":"|||"'

The pattern uses ``|||`` as a separator between the left and right boundaries.

**Special Markers:**

* ``^`` - Beginning of line (for left boundary)
* ``$`` - End of line (for right boundary)

**Common Patterns:**

.. code-block:: yaml

   # Between delimiters
   pattern: '"token":"|||"'
   
   # Until end of line
   pattern: 'Authorization: |||$'
   
   # From beginning of line
   pattern: '^|||: value'
   
   # HTML attribute value
   pattern: 'value="|||"'
   
   # JSON field value
   pattern: '"balance":|||,'

**Example:**

.. code-block:: yaml

   states:
     parse_response:
       request: |
         GET /api/data HTTP/1.1
         Host: {{ config.host }}
       
       extract:
         api_key:
           type: boundary
           pattern: '"api_key":"|||"'
         auth_header:
           type: between
           pattern: 'X-Auth-Token: |||$'

Header (header)
~~~~~~~~~~~~~~~

Extract values from HTTP response headers (case-insensitive).

**Type names:** ``header``, ``headers``, ``http_header``

**Syntax:**

.. code-block:: yaml

   extract:
     request_id:
       type: header
       pattern: "X-Request-Id"

**Common Headers:**

.. code-block:: yaml

   # Custom auth header
   pattern: "X-Auth-Token"
   
   # Request ID
   pattern: "X-Request-Id"
   
   # Content type
   pattern: "Content-Type"
   
   # Location (for redirects)
   pattern: "Location"
   
   # Rate limit info
   pattern: "X-RateLimit-Remaining"

**Example:**

.. code-block:: yaml

   states:
     get_auth:
       request: |
         POST /api/auth HTTP/1.1
         Host: {{ config.host }}
       
       extract:
         auth_token:
           type: header
           pattern: "X-Auth-Token"
         rate_limit:
           type: headers
           pattern: "X-RateLimit-Remaining"

Cookie (cookie)
~~~~~~~~~~~~~~~

Extract cookie values from Set-Cookie response headers.

**Type names:** ``cookie``, ``cookies``, ``set_cookie``, ``set-cookie``

**Syntax:**

.. code-block:: yaml

   extract:
     session:
       type: cookie
       pattern: "session_id"

**Example:**

.. code-block:: yaml

   states:
     login:
       request: |
         POST /login HTTP/1.1
         Host: {{ config.host }}
         Content-Type: application/json
         
         {"username": "user", "password": "pass"}
       
       extract:
         session_id:
           type: cookie
           pattern: "SESSIONID"
         csrf_cookie:
           type: set-cookie
           pattern: "csrf_token"
         tracking_id:
           type: cookies
           pattern: "_tracking"

JWT (jwt)
~~~~~~~~~

Decode and extract data from JSON Web Tokens (JWT). Perfect for extracting user information, checking token expiration, and validating JWT structure in API security testing.

**Type names:** ``jwt``

**Extract Specific Claims:**

.. code-block:: yaml

   extract:
     user_id:
       type: jwt
       source: "{{ access_token }}"
       claim: sub
     
     user_role:
       type: jwt
       source: "{{ access_token }}"
       claim: role
     
     email:
       type: jwt
       source: "{{ access_token }}"
       claim: email

**Extract JWT Parts:**

.. code-block:: yaml

   extract:
     # Get entire payload
     jwt_payload:
       type: jwt
       source: "{{ token }}"
       part: payload
     
     # Get header (algorithm, type, etc.)
     jwt_header:
       type: jwt
       source: "{{ token }}"
       part: header
     
     # Get signature
     jwt_signature:
       type: jwt
       source: "{{ token }}"
       part: signature

**Validation Checks:**

.. code-block:: yaml

   extract:
     # Check if token has expired
     is_expired:
       type: jwt
       source: "{{ token }}"
       check: expired
     
     # Get algorithm (HS256, RS256, etc.)
     algorithm:
       type: jwt
       source: "{{ token }}"
       check: algorithm
     
     # Check if structure is valid
     is_valid:
       type: jwt
       source: "{{ token }}"
       check: valid

**With Signature Verification:**

.. code-block:: yaml

   extract:
     verified_payload:
       type: jwt
       source: "{{ token }}"
       part: payload
       verify: true
       secret: "{{ jwt_secret }}"
       algorithms: ["HS256", "HS512"]

**Common JWT Claims:**

- ``sub`` - Subject (usually user ID)
- ``iss`` - Issuer  
- ``aud`` - Audience
- ``exp`` - Expiration timestamp
- ``nbf`` - Not Before timestamp
- ``iat`` - Issued At timestamp
- ``jti`` - JWT ID
- ``role``, ``roles`` - User role(s)
- ``permissions`` - User permissions
- ``email``, ``username`` - User identity

**Security Testing Example:**

.. code-block:: yaml

   states:
     analyze_jwt:
       request: |
         GET /api/protected HTTP/1.1
         Authorization: Bearer {{ token }}
       
       extract:
         algorithm:
           type: jwt
           source: "{{ token }}"
           check: algorithm
         
         is_expired:
           type: jwt
           source: "{{ token }}"
           check: expired
         
         user_role:
           type: jwt
           source: "{{ token }}"
           claim: role
       
       logger:
         on_state_leave: |
           {% if algorithm == 'none' %}
             🚨 CRITICAL: JWT uses 'none' algorithm!
           {% elif algorithm == 'HS256' %}
             ⚠ WARNING: JWT uses symmetric algorithm
           {% endif %}
           {% if is_expired %}
             🚨 Token is expired but still accepted!
           {% endif %}

Extractor Summary
-----------------

.. list-table::
   :header-rows: 1
   :widths: 15 25 60

   * - Type
     - Aliases
     - Best For
   * - ``jpath``
     - ``jsonpath``, ``json_path``
     - JSON API responses
   * - ``xpath``
     - ``xml_path``, ``html_path``
     - HTML forms, XML responses
   * - ``regex``
     - ``re``, ``regexp``
     - Complex patterns, mixed content
   * - ``boundary``
     - ``between``, ``delimited``
     - Simple text extraction
   * - ``header``
     - ``headers``, ``http_header``
     - Response headers
   * - ``cookie``
     - ``cookies``, ``set_cookie``, ``set-cookie``
     - Session cookies, tokens
   * - ``jwt``
     - 
     - JWT token analysis, claims extraction

Using Extracted Variables
-------------------------

Extracted variables are stored in the context and can be accessed in templates:

.. code-block:: yaml

   states:
     login:
       extract:
         token:
           type: jpath
           pattern: "$.token"
     
     use_token:
       request: |
         GET /api/data HTTP/1.1
         Authorization: Bearer {{ login.token }}

Variable Naming
~~~~~~~~~~~~~~~

* Use lowercase with underscores: ``user_id``, ``auth_token``
* Avoid reserved words: ``config``, ``thread``, ``context``
* Be descriptive: ``access_token`` not ``t``

Accessing Variables
~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

   # From previous state
   {{ state_name.variable_name }}
   
   # From current state (in logger)
   {{ variable_name }}
   
   # From config
   {{ config.host }}
   
   # Thread info (in race states)
   {{ thread.id }}
   {{ thread.count }}

Creating Custom Extractors
--------------------------

You can create custom extractors by implementing the ``BaseExtractor`` interface:

.. code-block:: python

   from treco.http.extractor.base import BaseExtractor, register_extractor
   
   @register_extractor('custom', aliases=['my_extractor'])
   class CustomExtractor(BaseExtractor):
       """Custom extractor for specific data formats."""
       
       def extract(self, response, pattern):
           """
           Extract data from response.
           
           Args:
               response: ResponseProtocol object
               pattern: Extraction pattern string
           
           Returns:
               Extracted value or None if not found
           """
           # Your extraction logic here
           content = response.text
           # ... process content using pattern ...
           return extracted_value

The ``@register_extractor`` decorator automatically registers your extractor with the specified type name and aliases.

Best Practices
--------------

1. **Choose the right extractor**: Use JSONPath for JSON, XPath for HTML, regex for complex patterns
2. **Be specific with patterns**: Avoid overly broad patterns that might match wrong data
3. **Handle missing data**: Extractors return ``None`` if pattern doesn't match
4. **Test patterns**: Verify patterns work with actual response data
5. **Use aliases**: Different teams may prefer different naming conventions

Troubleshooting
---------------

**Pattern not matching:**

1. Check the response content type
2. Verify the pattern syntax
3. Use verbose mode to see actual response
4. Test pattern with sample data

**Wrong data extracted:**

1. Make patterns more specific
2. Use capture groups correctly in regex
3. Check for multiple matches (first match is used)

**Extractor type not found:**

1. Check spelling and aliases
2. Ensure you're using a valid type name
3. Custom extractors must be imported before use

See Also
--------

* :doc:`configuration` - YAML configuration reference
* :doc:`templates` - Template syntax and filters
* :doc:`examples` - Real-world attack examples