hansken-extraction-plugin-s…/0.6.0/_sources/dev/concepts/hql_lite.md.txt
2022-08-05 09:22:07 +02:00


# HQL-Lite
## Overview
HQL-Lite is a query language derived from Hansken's full HQL Human. HQL stands for Hansken Query Language and can be
used to search or match traces. Since not all elements of full HQL can be used in the context of an extraction,
extraction plugins use HQL-Lite, a lightweight version of HQL. This document describes the usage of HQL-Lite in the
context of extraction plugins.
## How does Hansken work?
- Let's say we have a Hansken image `hansken_image1` with 10 pdf files and 5 jpegs.
- And our Hansken contains 2 tools:
    - PdfPlugin
    - JpegTool
.. note:: All plugins are Hansken tools, but not all Hansken tools are plugins. Some tools are included in Hansken core.
Let's look at a (simplified) pseudocode example of the inner workings of Hansken:
```python
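# Simplified pseudocode: every tool is offered every datastream of every new trace.
for trace in new_traces:
    for datastream in trace:  # a trace holds one or more datastreams
        for tool in hansken_tools:
            # Only hand the trace to a tool that says it can process it.
            if tool.can_this_tool_process_the_provided_trace(trace, datastream):
                tool.process_the_trace(trace, datastream)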
```
So in this example we know the following:
- `new_traces` has:
    - 10 pdf files
    - 5 jpeg files
- `hansken_tools` contains:
    - PdfPlugin
    - JpegTool
So the question here is: how do we prevent traces from being processed by incompatible tools?
The answer is the `tool.can_this_tool_process_the_provided_trace()` part of the pseudocode.
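
To make the loop concrete, here is a minimal runnable sketch of the scenario above. The `Trace` and `Tool` classes and the extension-based check are hypothetical stand-ins chosen for illustration; they are not Hansken's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Hypothetical, heavily simplified trace: a name, an extension and its datastreams.
    name: str
    extension: str
    datastreams: list = field(default_factory=lambda: ["raw"])


class Tool:
    # Hypothetical tool that only accepts raw datastreams of one file extension.
    def __init__(self, name, extension):
        self.name = name
        self.extension = extension
        self.processed = []

    def can_this_tool_process_the_provided_trace(self, trace, datastream):
        return datastream == "raw" and trace.extension == self.extension

    def process_the_trace(self, trace, datastream):
        self.processed.append(trace.name)


new_traces = [Trace(f"doc{i}.pdf", "pdf") for i in range(10)]
new_traces += [Trace(f"img{i}.jpeg", "jpeg") for i in range(5)]
hansken_tools = [Tool("PdfPlugin", "pdf"), Tool("JpegTool", "jpeg")]

checks = 0
for trace in new_traces:
    for datastream in trace.datastreams:
        for tool in hansken_tools:
            checks += 1  # the check runs for every trace/datastream/tool combination
            if tool.can_this_tool_process_the_provided_trace(trace, datastream):
                tool.process_the_trace(trace, datastream)

print(checks)
for tool in hansken_tools:
    print(tool.name, len(tool.processed))
```

Note that the check itself runs for every combination of trace, datastream and tool, while processing only happens for the combinations that pass it.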
### What does `can_this_tool_process_the_provided_trace()` do?
Hansken actually contains many more tools/plugins than these 2, and instead of 15 files/traces, we usually deal with
millions.
.. note:: If each trace has 1 extra second of overhead, 1 million traces would take roughly 11.6 days of extra CPU time (1,000,000 seconds divided by 86,400 seconds per day).
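
The overhead figure in the note is simple arithmetic. The helper below is only an illustration of that calculation; the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def extra_cpu_days(trace_count, overhead_seconds_per_trace):
    """Return the extra CPU time in days caused by a fixed per-trace overhead."""
    return trace_count * overhead_seconds_per_trace / SECONDS_PER_DAY


# 1 million traces with 1 second of overhead each
print(round(extra_cpu_days(1_000_000, 1.0), 1))
```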
#### Matchers to the rescue
To reduce the unnecessary overhead of processing all traces (even the ones the tool cannot actually process), Hansken
implements the concept of a `matcher` for each tool. This _matcher_ checks the _trace_ for _"matching
conditions"_ that would allow the tool to process it.
Sometimes these _matching conditions_ can be as simple as a specific `filename` or `extension`, but are often more
elaborate in the sense that they check multiple factors that require some intimate knowledge of Hansken.
## What is HQL-Lite?
HQL-Lite is a language based on HQL (Hansken Query Language) that allows plugin developers to write _matchers_ for
Hansken Extraction Plugins. It could be said that HQL-Lite contains a subset of HQL features, plus some HQL-Lite unique
features that are only interesting for _matchers_.
.. note::
   Even though the HQL-Lite query is part of the plugin, it is compiled and stored in Hansken during
   startup to improve performance.
### Why not just use HQL for plugins?
HQL was designed to search for traces stored in the Elasticsearch database. As such, some of its features are tightly
coupled to the Elasticsearch implementation, making it difficult to re-implement them for plugins.
Also, even though HQL is more complex than the requirements for _matching_ in plugins, a couple of minor features that
are absolutely necessary for _matching_ are not implemented in HQL, as they don't make much sense from a search point of
view. This is because HQL was designed to be used with _finished extractions_ with all the traces stored in the
database, while HQL-Lite was designed for _active extractions_.
### HQL-Lite syntax
| Matcher | Syntax | remarks |
|----------|------------------------|--------------------------------------------------------------------------------------------------------|
| All | `""` | an empty string translates to match for __all__ traces |
| And | `foo:1 AND bar:2` | the case-sensitive `AND` operator behaves like a logical AND of 2 conditions |
| Not | `NOT foo` or `-foo` | the case-sensitive `NOT` or `-` negates the expression that follows |
| Range    | `foo>1` or `1<=foo<10` | a numeric range check with a lower and/or an upper bound                                           |
| Or | `foo:1 OR bar:2` | the case-sensitive `OR` operator behaves like a logical OR of 2 conditions |
| Data | `$data.foo:1` | see `$data` section below |
| DataType | `$data.type:raw` | this query matches against the type of the current datastream |
| Types | `type:email` | this query checks if the trace contains a certain trace type as defined in the Hansken trace model |
There are also a couple of general guidelines that apply to all matchers:
- Equals/not equals:
    - `:` or `=`: The most basic "left equals right" statement. Both forms are valid.
    - `!=`: The opposite of equals, not equals. Note that `!:` is __NOT__ supported.
- Wildcards:
    - `?`: Matches any single character. E.g. `foo:r?w` will match `raw` and `row`, but not `rowing`.
    - `*`: Matches any number of characters, including none. E.g. `foo:r*` will match `r`, `ra`, `raw` and `raaaaaw`,
      but not `aw`.
- Exact match: By surrounding a value with quotes, we tell the parser that it is a single value. This is especially
  helpful for values that might contain separators. E.g. `foo:'hello hql-lite'`.
- CSV: Currently only the `type` query supports multiple values to check against. E.g. `type:email,chatMessage` will only
  return `true` if both types exist for this trace.
- `()` grouping: You can group statements by putting brackets around them. E.g. `foo:1 AND (bar:2 OR bla:3)`, which
  translates to `foo:1` plus at least one of the statements in the brackets.
- Escaping `\"\\ .\t\r\n:=><!()~/,[]{}`: Some characters are used internally by HQL-Lite and need to be escaped if they
  are used in the value side of the key-value pair. They can be escaped by prepending `\\` to the
  character(s). Examples: `foo:/bar` should be `foo:\\/bar`, `foo:foo bar` should be `foo:foo\\ bar`, and a value
  containing a dot is written as `foo.bar:foo\\.bar`, etc.
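
The wildcard rules above can be illustrated with a small sketch that translates a wildcard value into a regular expression. This is not how Hansken implements HQL-Lite; the function names are hypothetical, and the sketch deliberately ignores escaping and quoting.

```python
import re


def wildcard_to_regex(pattern):
    """Translate `?` and `*` wildcards into an equivalent regular expression."""
    parts = []
    for char in pattern:
        if char == "?":
            parts.append(".")    # exactly one character
        elif char == "*":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(char))  # everything else is matched literally
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, value):
    """Return True if the whole value matches the wildcard pattern (case-sensitive)."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


for value in ["raw", "row", "rowing"]:
    print("r?w", value, wildcard_match("r?w", value))
for value in ["r", "ra", "raw", "raaaaaw", "aw"]:
    print("r*", value, wildcard_match("r*", value))
```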
#### $data matchers
In Hansken, a trace can have multiple :ref:`datastreams <datastreams>`. The exact content of said datastreams is
discussed elsewhere, but the basic idea is that a trace can have multiple representations. For example, a trace might
have a `raw` datastream, but after we identify that the raw bytes contain a __text__ file, we might add a separate
datastream `text`.
.. note::
   The `process()` method of each plugin is called for each datastream of each trace. This is explained
   in [How does Hansken work?](#how-does-hansken-work). Consequently, the same property can exist for more than one
   datastream. For example, a trace might have both a `data.raw.size` and a `data.text.size` property. The same
   property can appear multiple times because it can have a different meaning for each datastream:
   - `data.raw.size`: the size of the raw stream in bytes
   - `data.text.size`: the number of bytes in the text representation of the raw stream
If we want to check if either of these properties is not empty by using a `$data` matcher, we do:
```text
$data.size>0
```
##### When is it useful to use a $data matcher?
For example, there is a simple plugin called `LetterCountPlugin` that counts the letters in text-based datastreams.
So to match on these text-based datastreams, we have 2 choices:
- List all the possibilities
    - Which is too tedious, and not very flexible when new types are supported
- Match on a common property
    - More compact, but sometimes difficult to find a common property
In this case we might match on mimeType, which we know is `text/plain` or `text/x-log` for 2 of the types we want to match:
```text
$data.mimeType=text\\/*
```
This will match the following:
- `data.text.mimeType=text\\/plain`
- `data.text.mimeType=text\\/not\\ plain`
- `data.pdf.mimeType=text\\/encoded`
- `data.foo.mimeType=text\\/bar`
But will __not__ match any of the following:
- `data.text.mimeType=txt`
- `data.text.mimeType=pdf`
- `data.text.mime=text\\/plain`
- `data.foo.bar=text\\/plain`
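
The two lists above can be reproduced with a small sketch. It assumes a trace property is available as a plain key/value pair such as `data.text.mimeType` and `text/plain` (unescaped), and it uses Python's `fnmatch` module as a stand-in for HQL-Lite wildcards. The `data_matcher` helper is hypothetical and not part of any Hansken API.

```python
import re
from fnmatch import fnmatchcase

# data.<datastream type>.<property name>
DATA_KEY = re.compile(r"data\.([^.]+)\.(.+)")


def data_matcher(property_name, value_pattern):
    """Build a check that mimics `$data.<property_name>=<value_pattern>`."""
    def matches(key, value):
        parsed = DATA_KEY.fullmatch(key)
        if parsed is None or parsed.group(2) != property_name:
            return False  # not the requested property of a datastream
        return fnmatchcase(value, value_pattern)  # case-sensitive wildcard match
    return matches


is_text = data_matcher("mimeType", "text/*")

print(is_text("data.text.mimeType", "text/plain"))
print(is_text("data.pdf.mimeType", "text/encoded"))
print(is_text("data.text.mimeType", "txt"))
print(is_text("data.foo.bar", "text/plain"))
```

The datastream type (`text`, `pdf`, `foo`) is ignored on purpose: only the property name and its value decide whether the check passes.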
## How to write a matcher?
The functional requirements for writing a matcher can be summarized in the following questions:
1. What does my plugin expect as input?
2. How can I describe that input with the information Hansken provides?
### PdfPlugin example
Let's say we just finished writing a `PdfPlugin`. This is a simple plugin that checks if pdf files contain the
word `the`.
So let's go over our checklist:
#### _What does my plugin expect as input?_
PDF files.
#### _How can I describe that input with the information Hansken provides?_
Hansken consumes and produces :ref:`Traces <traces>`. To that effect, we can only match on trace properties that are
available in Hansken.
##### Match on extension
The easiest way would be to only allow traces with the `.pdf` extension. Looking at the :ref:`Hansken trace model` (or a
Hansken extraction), we can see that there's a property `file` which contains a property `extension`.
So what would that look like in HQL-Lite? Something like
```text
file.extension=pdf
```
.. warning:: This of course __only__ works if the file has the correct extension (note that matchers are case-sensitive).
So what do we do, if we also want to match pdf files that are (un)intentionally misnamed?
##### Match on mime-type
Looking at Wikipedia, we see that `pdf` has a couple of mime-types. In turn, looking at our extraction and the
trace model, we see that both can appear at `data.raw.mimeType`, with a further explanation in the
:ref:`Hansken trace model` that the `raw` portion of the property is the __data type__ of the datastream.
If we don't know which datastream has the `mimeType` property beforehand, we could use the broad-scoped `$data.` matcher
to look at every datastream.
So our matcher becomes:
```text
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
```
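
To see why the extra mime-type clause helps, the matcher above can be mimicked as a plain Python predicate. This sketch assumes trace properties are available as a flat dictionary, which is a simplification made for this example; `matches_pdf` is a hypothetical name.

```python
PDF_MIME_TYPES = {"application/pdf", "application/x-pdf"}


def matches_pdf(properties):
    """Mimic: file.extension=pdf OR ($data.mimeType is one of the pdf mime-types)."""
    has_pdf_extension = properties.get("file.extension") == "pdf"  # case-sensitive
    has_pdf_mime_type = any(
        key.startswith("data.") and key.endswith(".mimeType") and value in PDF_MIME_TYPES
        for key, value in properties.items()
    )
    return has_pdf_extension or has_pdf_mime_type


# A pdf that was renamed to .txt is still picked up through its mime-type.
misnamed = {"file.extension": "txt", "data.raw.mimeType": "application/pdf"}
picture = {"file.extension": "jpeg", "data.raw.mimeType": "image/jpeg"}
print(matches_pdf(misnamed))
print(matches_pdf(picture))
```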
##### Match on data size
Some pdf files can be huge, meaning that parsing them will need a lot of resources. Could we add a data size check to
the matcher? According to the :ref:`Hansken trace model` `data` has a property `size` (similar to `mimeType`) that we
could use for this.
.. note:: This is also a good way to check if a file is empty or not.
Let's say our cutoff limit is 1 MB, meaning our matcher becomes:
```text
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
```
##### Match if 'property is set'
It is not uncommon to have some overlap between tools/plugins. For example:
- PdfPlugin: a plugin that only supports pdf documents
- DocumentPlugin: this plugin supports a lot of document types, including `pdf`.
So how would we prevent our plugin from processing a trace that has already been processed by the `DocumentPlugin`?
The easiest solution would be to check if a certain property has already been set. Meaning, that if both plugins set
the `foo.bar` property, we check if said property has already been set.
So we __only__ process the trace if `foo.bar` is __empty__, meaning our matcher becomes:
```text
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
```
##### Match on excluding a certain path
It is also not uncommon to exclude certain paths from your plugin. These paths might contain invalid or encrypted files,
for example.
So let's say we want to exclude all files under the `/tmp/virus` path. How do we go about it?
Again, we check our extraction/:ref:`Hansken trace model`, and we see that `file.path` looks promising.
So excluding `/tmp/virus` would look something like:
```text
-file.path=\\/tmp\\/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
file.extension=pdf OR
(
$data.mimeType=application\\/pdf OR
$data.mimeType=application\\/x-pdf
)
)
```
##### Match on specific datastream type
When we process datastreams, it is usually a specific type. So writing a matcher that is too loose could yield the wrong
kind of datastream for us to process.
To give a concrete example of when this could go wrong, let's say we have an encrypted file. This encrypted file (trace)
would have a datastream of type `raw`, which holds the raw, still encrypted, bytes.
After going through a `SimpleDecryptPlugin`, this trace will now have another datastream of type `text`. So the contents
of the datastreams are __entirely different__, for the same trace.
In our case, we want our `PdfPlugin` to __ONLY__ process the `raw` datastream. So we could replace the `$data.` matcher(s)
with `data.raw.`.
So our matcher would look like:
```text
-file.path=\\/tmp\\/virus* AND
foo.bar!=* AND
0 < data.raw.size < 1000000 AND
(
file.extension=pdf OR
(
data.raw.mimeType=application\\/pdf OR
data.raw.mimeType=application\\/x-pdf
)
)
```
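
As a reading aid, the final matcher can be paraphrased as a Python predicate, one condition at a time. The flat `properties` dictionary and the function name are assumptions made for this sketch. It also assumes that a trace without a `file.path` is not excluded by the path clause. It only illustrates the intended logic, not how Hansken evaluates HQL-Lite.

```python
PDF_MIME_TYPES = {"application/pdf", "application/x-pdf"}
MAX_SIZE = 1_000_000  # the 1 MB cutoff used in the matcher


def pdf_plugin_matches(properties):
    """Paraphrase of the final PdfPlugin matcher on a flat property dictionary."""
    # -file.path=/tmp/virus* : skip everything under the excluded path
    if properties.get("file.path", "").startswith("/tmp/virus"):
        return False
    # foo.bar!=* : skip traces where another tool already set foo.bar
    if "foo.bar" in properties:
        return False
    # 0 < data.raw.size < 1000000 : raw stream must be non-empty and below the cutoff
    size = properties.get("data.raw.size")
    if size is None or not 0 < size < MAX_SIZE:
        return False
    # file.extension=pdf OR data.raw.mimeType is one of the pdf mime-types
    return (
        properties.get("file.extension") == "pdf"
        or properties.get("data.raw.mimeType") in PDF_MIME_TYPES
    )


report = {"file.path": "/home/user/report.pdf", "file.extension": "pdf", "data.raw.size": 2048}
print(pdf_plugin_matches(report))
print(pdf_plugin_matches({**report, "file.path": "/tmp/virus/report.pdf"}))
print(pdf_plugin_matches({**report, "foo.bar": "already processed"}))
```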
## How precise should a matcher be?
In practice, only you as the plugin dev can answer this question.
Know that from the point of view of Hansken, we only care that the plugin:
- __Should not crash__: If a matcher does not compile, then your plugin will not be available in Hansken. Tip: be sure
to test your plugin with the :ref:`test framework <test_framework>`.
- __Should not be slow__: Matching is designed to be extremely fast, but of course, if you make it too complex it can
  take longer than we want. In the example above, we calculated that 1 second extra for 1 million traces is about 11.6
  days of extra CPU time. Unlike processing, matching is done for __every trace__, in every extraction iteration, so be
  careful!
- __Should match on the bare minimum__: Don't go too far by matching 50 different criteria before allowing a trace to be
  processed. Note that a lot of (if not all) of these criteria depend on properties set by other tools, and you don't
  really have any control over how these tools work.