mirror of
https://github.com/NetherlandsForensicInstitute/hansken-extraction-plugin-sdk-documentation.git
synced 2026-02-14 14:09:49 +00:00
# HQL-Lite

## Overview

HQL-Lite is a query language derived from Hansken's full HQL human. HQL stands for Hansken Query Language and can be
used to search or match traces. Since not all elements of full HQL can be used in the context of an extraction,
extraction plugins use HQL-Lite, a lightweight version of HQL. This document describes the usage of HQL-Lite in the
context of extraction plugins.

## How does Hansken work?

- Let's say we have a Hansken image `hansken_image1` with 10 pdf files, and 5 jpegs.
- And our Hansken contains 2 tools:
    - PdfPlugin
    - JpegTool

.. note:: All plugins are Hansken tools, but not all Hansken tools are plugins. Some tools are included in Hansken core.

Let's look at a (simplified) pseudocode example of the inner workings of Hansken:

```python
# Simplified pseudocode: these names are illustrative, not an actual Hansken API.
for trace in new_traces:
    for datastream in trace:
        for tool in hansken_tools:
            # The matcher decides whether this tool gets to see the trace.
            if tool.can_this_tool_process_the_provided_trace(trace, datastream):
                tool.process_the_trace(trace, datastream)
```

So in this example we know the following:

- `new_traces` has:
    - 10 pdf files
    - 5 jpeg files
- `hansken_tools` contains:
    - PdfPlugin
    - JpegTool

So the question here is: how do we prevent traces from being processed by incompatible tools?

The answer is the `tool.can_this_tool_process_the_provided_trace()` part of the pseudocode.
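
As a minimal, runnable sketch of this idea (not Hansken's actual implementation), the example below models each tool as
an object with a `matcher` predicate. The `Tool` class, the dictionary-based traces and the single `file.extension`
property are assumptions made purely for illustration, and the datastream loop is left out for brevity:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """A toy stand-in for a Hansken tool; this is not the real plugin API."""
    name: str
    # Plays the role of can_this_tool_process_the_provided_trace().
    matcher: Callable[[dict], bool]
    processed: int = 0

    def process(self, trace: dict) -> None:
        # A real tool would do its work here; we only count the calls.
        self.processed += 1


# 10 pdf traces and 5 jpeg traces, each reduced to a single property.
new_traces = [{"file.extension": "pdf"}] * 10 + [{"file.extension": "jpeg"}] * 5

hansken_tools = [
    Tool("PdfPlugin", lambda trace: trace["file.extension"] == "pdf"),
    Tool("JpegTool", lambda trace: trace["file.extension"] == "jpeg"),
]

for trace in new_traces:
    for tool in hansken_tools:
        if tool.matcher(trace):
            tool.process(trace)

for tool in hansken_tools:
    print(tool.name, tool.processed)
```

Every tool is asked about every trace, but each tool only processes the traces its matcher accepts.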

### What does `can_this_tool_process_the_provided_trace()` do?

Hansken actually contains many more tools/plugins than these 2, and instead of 15 files/traces, we usually deal with
millions.

.. note:: If each trace has 1 extra second of overhead, 1 million traces would take roughly 11.6 days of extra CPU time.
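
The arithmetic behind this estimate is easy to reproduce. The helper name `extra_cpu_days` is only an illustrative
choice:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def extra_cpu_days(trace_count: int, overhead_seconds: float) -> float:
    """Total extra CPU time, in days, for a fixed overhead per trace."""
    return trace_count * overhead_seconds / SECONDS_PER_DAY


# 1 million traces with 1 second of overhead each.
print(f"{extra_cpu_days(1_000_000, 1.0):.1f} days")
```

The same function shows how quickly the cost grows: halving the overhead per trace halves the total, while doubling
the number of traces doubles it.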

#### Matchers to the rescue

To reduce the unnecessary overhead of processing all traces (even the ones the tool cannot actually process), Hansken
implements the concept of a `matcher` for each tool. This _matcher_ checks the _trace_ for _"matching conditions"_
that would allow the tool to process it.

Sometimes these _matching conditions_ can be as simple as a specific `filename` or `extension`, but they are often more
elaborate in the sense that they check multiple factors that require some intimate knowledge of Hansken.

## What is HQL-Lite?

HQL-Lite is a language based on HQL (Hansken Query Language) that allows plugin developers to write _matchers_ for
Hansken Extraction Plugins. It could be said that HQL-Lite contains a subset of HQL features, plus some features unique
to HQL-Lite that are only interesting for _matchers_.

.. note::
   Even though the HQL-Lite query is part of the plugin, it is compiled and stored in Hansken during
   startup to improve performance.

### Why not just use HQL for plugins?

HQL was designed to search for traces stored in the Elasticsearch database. As such, some of its features are tightly
coupled to the Elasticsearch implementation, making it difficult to re-implement them for plugins.

Also, even though HQL is more complex than the requirements for _matching_ in plugins, a couple of minor features that
are absolutely necessary for _matching_ are not implemented in HQL, as they don't make much sense from a search point of
view. This is because HQL was designed to be used with _finished extractions_ with all the traces stored in the
database, while HQL-Lite was designed for _active extractions_.

### HQL-Lite syntax

| Matcher  | Syntax                 | Remarks                                                                                            |
|----------|------------------------|----------------------------------------------------------------------------------------------------|
| All      | `""`                   | an empty string matches __all__ traces                                                             |
| And      | `foo:1 AND bar:2`      | the case-sensitive `AND` operator behaves like a logical AND of 2 conditions                       |
| Not      | `NOT foo` or `-foo`    | the case-sensitive `NOT` or `-` negates the expression that follows                                |
| Range    | `foo>1` or `1<=foo<10` | a numeric range check with a minimum and/or maximum bound                                          |
| Or       | `foo:1 OR bar:2`       | the case-sensitive `OR` operator behaves like a logical OR of 2 conditions                         |
| Data     | `$data.foo:1`          | see `$data` section below                                                                          |
| DataType | `$data.type:raw`       | this query matches against the type of the current datastream                                      |
| Types    | `type:email`           | this query checks if the trace contains a certain trace type as defined in the Hansken trace model |

There are also a couple of general guidelines that apply to all matchers:

- Equals/not equals:
    - `:` or `=`: The most basic "left equals right" statement. Both `:` and `=` are valid.
    - `!=`: The opposite of equals, not equals. Note that `!:` is __NOT__ supported.
- Wildcards:
    - `?`: Match against any single character. E.g. `foo:r?w` will match against `raw` and `row`, but not against
      `rowing`.
    - `*`: Match against any number of characters, including none. E.g. `foo:r*` will match against `r`, `ra`, `raw`
      and `raaaaaw`, but not against `aw`.
- Exact match: By surrounding a value with quotes, we tell the parser that it is a single value. This is especially
  helpful for values that might contain separators. E.g. `foo:'hello hql-lite'`.
- CSV: Currently only the `type` query supports multiple values to check against. E.g. `type:email,chatMessage` will only
  return `true` if both types exist for this trace.
- `()` grouping: You can group statements by putting brackets around them. E.g. `foo:1 AND (bar:2 OR bla:3)`, which
  translates to `foo:1` plus at least one of the statements in the brackets.
- Escaping `\"\\ .\t\r\n:=><!()~/,[]{}`: Some characters are used internally by HQL-Lite, and need to be escaped if they
  are used in the value side of the key-value pair. These characters can be escaped by prepending `\\` to the
  character(s). Example: `foo:/bar` should be `foo:\\/bar`, `foo:foo bar` should be `foo:foo\\ bar`, `foo.bar:foo.bar`
  should be `foo.bar:foo\\.bar`, etc.
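
The wildcard and escaping rules can be tried out with a small sketch. It only approximates HQL-Lite: Python's
`fnmatchcase` happens to treat `?` and `*` the same way, but it also gives `[...]` a special meaning that is not
described for HQL-Lite. The helper names `wildcard_match` and `escape_value` are hypothetical, and the sketch assumes
that the doubled `\\` in the examples is the string-literal spelling of a single backslash; pass a different `prefix`
if your context needs something else.

```python
from fnmatch import fnmatchcase

# The characters listed under "Escaping" (quote, backslash, space, dot, tab, CR, LF, ...).
SPECIAL_CHARS = set('"\\ .\t\r\n:=><!()~/,[]{}')


def wildcard_match(value: str, pattern: str) -> bool:
    """Approximate HQL-Lite `?` and `*` with case-sensitive shell-style globbing."""
    return fnmatchcase(value, pattern)


def escape_value(value: str, prefix: str = "\\") -> str:
    """Put `prefix` in front of every special character in a matcher value."""
    return "".join(prefix + ch if ch in SPECIAL_CHARS else ch for ch in value)


for candidate in ("raw", "row", "rowing"):
    print(candidate, wildcard_match(candidate, "r?w"))

for raw_value in ("/bar", "foo bar", "foo.bar"):
    print("foo:" + escape_value(raw_value))
```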

#### $data matchers

In Hansken, a trace can have multiple :ref:`datastreams <datastreams>`. The exact content of said datastreams is
discussed elsewhere, but the basic idea is that a trace can have multiple representations. For example, a trace might
have a `raw` datastream, but after we identify that the raw bytes contain a __text__ file, we might add a separate
datastream `text`.

.. note::
   The `process()` method of each plugin is called for each datastream of each trace. This is explained
   in [How does Hansken work?](#how-does-hansken-work). Consequently, you might have the same property for
   different datastreams. For example: you might have a `data.raw.size` and a `data.text.size` property. The reason you
   might have the same property multiple times is that it could have a different meaning per datastream.

For example:

- `data.raw.size`: the size of the raw stream in bytes
- `data.text.size`: the number of bytes in the text representation of the raw stream

If we want to check whether either of these properties is greater than zero by using a `$data` matcher, we write:

```text
$data.size>0
```
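
One way to picture a `$data` matcher is as a placeholder that is filled in with the type of the datastream currently
being matched. The sketch below is an assumption about that behaviour, based only on the description in this section;
the helpers `resolve_data_property` and `data_size_is_positive` and the dictionary-based trace are hypothetical and
not part of any Hansken API:

```python
def resolve_data_property(name: str, data_type: str) -> str:
    """Expand a `$data.` property name for the datastream currently being matched."""
    placeholder = "$data."
    if name.startswith(placeholder):
        return f"data.{data_type}." + name[len(placeholder):]
    return name


def data_size_is_positive(trace: dict, data_type: str) -> bool:
    """Evaluate `$data.size>0` for one datastream of a trace."""
    size = trace.get(resolve_data_property("$data.size", data_type))
    return size is not None and size > 0


# A toy trace with a non-empty raw stream and an empty text stream.
trace = {"data.raw.size": 2048, "data.text.size": 0}
for data_type in ("raw", "text"):
    print(data_type, data_size_is_positive(trace, data_type))
```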

##### When is it useful to use a $data matcher?

For example, there is a simple plugin called `LetterCountPlugin` that counts the letters in text-based datastreams.

So to match on these text-based datastreams, we have 2 choices:

- List all the possibilities
    - Which is too tedious, and not very flexible when new types are supported
- Match on a common property
    - More compact, but sometimes difficult to find a common property

In this case we might match on `mimeType`, which we know is `text/plain` or `text/x-log` for 2 of the types we want to
match:

```text
$data.mimeType=text\\/*
```

This will match the following:

- `data.text.mimeType=text\\/plain`
- `data.text.mimeType=text\\/not\\ plain`
- `data.pdf.mimeType=text\\/encoded`
- `data.foo.mimeType=text\\/bar`

But will __not__ match any of the following:

- `data.text.mimeType=txt`
- `data.text.mimeType=pdf`
- `data.text.mime=text\\/plain`
- `data.foo.bar=text\\/plain`
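
The two lists can be reproduced with a plain-Python approximation of the matcher. This is a sketch under the
assumption that `$data.` stands for `data.<any datastream type>.` and that `text\\/*` means "starts with `text/`"; the
function `matches_text_mime` is hypothetical and the values are written unescaped:

```python
import re

# `data.<datastream type>.mimeType`, where the type is a single name without dots.
MIME_TYPE_PROPERTY = re.compile(r"data\.[^.]+\.mimeType")


def matches_text_mime(name: str, value: str) -> bool:
    """Approximate `$data.mimeType=text\\/*` for one property name and value."""
    return MIME_TYPE_PROPERTY.fullmatch(name) is not None and value.startswith("text/")


examples = [
    ("data.text.mimeType", "text/plain"),
    ("data.text.mimeType", "text/not plain"),
    ("data.pdf.mimeType", "text/encoded"),
    ("data.foo.mimeType", "text/bar"),
    ("data.text.mimeType", "txt"),
    ("data.text.mimeType", "pdf"),
    ("data.text.mime", "text/plain"),
    ("data.foo.bar", "text/plain"),
]
for name, value in examples:
    print(f"{name}={value}: {matches_text_mime(name, value)}")
```

Both the property name and the value have to fit: the last two examples fail on the name even though the value starts
with `text/`.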

## How to write a matcher?

The functional requirements for writing a matcher can be summarized in the following:

1. What does my plugin expect as input?
2. How can I describe that input with the information Hansken provides?

### PdfPlugin example

Let's say we just finished writing a `PdfPlugin`. This is a simple plugin that checks if pdf files contain the
word `the`.

So let's go over our checklist:

#### _What does my plugin expect as input?_

PDF files.

#### _How can I describe that input with the information Hansken provides?_

Hansken consumes and produces :ref:`Traces <traces>`. Consequently, we can only match on trace properties that are
available in Hansken.

##### Match on extension

The easiest way would be to only allow traces with the `.pdf` extension. Looking at the :ref:`Hansken trace model` (or a
Hansken extraction), we can see that there's a property `file` which contains a property `extension`.

So what would that look like in HQL-Lite? Something like:

```text
file.extension=pdf
```

.. warning:: This of course __only__ works if the file has the correct extension (note that matchers are case-sensitive).

So what do we do if we also want to match pdf files that are (un)intentionally misnamed?

##### Match on mime-type

Looking at Wikipedia, we see that `pdf` has a couple of mime-types. In turn, looking at our extraction and the
trace model, we see in both that the mime-type is stored at `data.raw.mimeType`, with a further explanation in the
:ref:`Hansken trace model` that the `raw` portion of the property is the __data type__ of the datastream.

If we don't know which datastream has the `mimeType` property beforehand, we could use the broad-scoped `$data.` matcher
to look at every datastream.

So our matcher becomes:

```text
file.extension=pdf OR
(
    $data.mimeType=application\\/pdf OR
    $data.mimeType=application\\/x-pdf
)
```
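
To see why the extra mime-type clause helps, the matcher can be read as an ordinary boolean expression. The sketch
below is such a reading, not Hansken code: `matches_pdf`, the dictionary-based traces and the explicit `data_type`
argument (the datastream currently being matched) are assumptions for illustration.

```python
PDF_MIME_TYPES = {"application/pdf", "application/x-pdf"}


def matches_pdf(trace: dict, data_type: str) -> bool:
    """file.extension=pdf OR ($data.mimeType is one of the pdf mime-types)."""
    by_extension = trace.get("file.extension") == "pdf"  # case-sensitive comparison
    by_mime_type = trace.get(f"data.{data_type}.mimeType") in PDF_MIME_TYPES
    return by_extension or by_mime_type


misnamed = {"file.extension": "txt", "data.raw.mimeType": "application/pdf"}
upper_case = {"file.extension": "PDF", "data.raw.mimeType": "application/x-pdf"}
not_a_pdf = {"file.extension": "jpeg", "data.raw.mimeType": "image/jpeg"}

for label, trace in (("misnamed", misnamed), ("upper_case", upper_case), ("not_a_pdf", not_a_pdf)):
    print(label, matches_pdf(trace, "raw"))
```

The misnamed file and the file with an upper-case extension are both picked up through their mime-type, which the
extension check alone would have missed.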

##### Match on data size

Some pdf files can be huge, meaning that parsing them will need a lot of resources. Could we add a data size check to
the matcher? According to the :ref:`Hansken trace model`, `data` has a property `size` (similar to `mimeType`) that we
could use for this.

.. note:: This is also a good way to check if a file is empty or not.

Let's say our cutoff limit is 1 MB (1,000,000 bytes), meaning our matcher becomes:

```text
0 < $data.size < 1000000 AND
(
    file.extension=pdf OR
    (
        $data.mimeType=application\\/pdf OR
        $data.mimeType=application\\/x-pdf
    )
)
```

##### Match if 'property is set'

It is not uncommon to have some overlap between tools/plugins. For example:

- PdfPlugin: a plugin that only supports pdf documents
- DocumentPlugin: a plugin that supports a lot of document types, including `pdf`

So how would we prevent our plugin from processing a trace that has already been processed by the `DocumentPlugin`?

The easiest solution would be to check if a certain property has already been set. That is, if both plugins set
the `foo.bar` property, we check whether said property has already been set.

So we __only__ process the trace if `foo.bar` is __empty__, meaning our matcher becomes:

```text
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
    file.extension=pdf OR
    (
        $data.mimeType=application\\/pdf OR
        $data.mimeType=application\\/x-pdf
    )
)
```

##### Match on excluding a certain path

It is also not uncommon to exclude certain paths from your plugin. These paths might contain invalid or encrypted files,
for example.

So let's say we want to exclude all files under the `/tmp/virus` path. How do we go about it?

Again, we check our extraction/:ref:`Hansken trace model`, and we see that `file.path` looks promising.

Remember that `/` is one of the characters that must be escaped on the value side. So excluding `/tmp/virus` would
look something like:

```text
-file.path=\\/tmp\\/virus* AND
foo.bar!=* AND
0 < $data.size < 1000000 AND
(
    file.extension=pdf OR
    (
        $data.mimeType=application\\/pdf OR
        $data.mimeType=application\\/x-pdf
    )
)
```

##### Match on specific datastream type

When we process datastreams, we usually expect a specific type. So writing a matcher that is too loose could yield the
wrong kind of datastream for us to process.

To give a concrete example of when this could go wrong, let's say we have an encrypted file. This encrypted file (trace)
would have a datastream of type `raw`, which holds the raw, encrypted bytes.

After going through a `SimpleDecryptPlugin`, this trace will now have another datastream of type `text`. So the contents
of the datastreams are __entirely different__, for the same trace.

In our case, we want our `PdfPlugin` to __ONLY__ process `raw` datastreams. So we could replace the `$data.` matcher(s)
with `data.raw.`.

So our matcher would look like:

```text
-file.path=\\/tmp\\/virus* AND
foo.bar!=* AND
0 < data.raw.size < 1000000 AND
(
    file.extension=pdf OR
    (
        data.raw.mimeType=application\\/pdf OR
        data.raw.mimeType=application\\/x-pdf
    )
)
```
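
Read clause by clause, the final matcher corresponds to the following plain-Python predicate. This is only a sketch to
make the logic explicit: `pdf_plugin_matches` and the dictionary-based traces are hypothetical, and treating
`foo.bar!=*` as "the property is absent" is an assumption.

```python
PDF_MIME_TYPES = {"application/pdf", "application/x-pdf"}


def pdf_plugin_matches(trace: dict) -> bool:
    """Plain-Python reading of the final PdfPlugin matcher."""
    # Exclude everything whose path starts with /tmp/virus.
    if trace.get("file.path", "").startswith("/tmp/virus"):
        return False
    # foo.bar!=* : skip traces on which foo.bar has already been set.
    if trace.get("foo.bar") is not None:
        return False
    # 0 < data.raw.size < 1000000 : a non-empty raw stream below the cutoff.
    size = trace.get("data.raw.size")
    if size is None or not 0 < size < 1_000_000:
        return False
    # file.extension=pdf OR one of the pdf mime-types on the raw stream.
    return (trace.get("file.extension") == "pdf"
            or trace.get("data.raw.mimeType") in PDF_MIME_TYPES)


base = {"file.path": "/home/user/report.pdf", "file.extension": "pdf",
        "data.raw.size": 4096, "data.raw.mimeType": "application/pdf"}
cases = {
    "regular pdf": base,
    "quarantined": {**base, "file.path": "/tmp/virus/report.pdf"},
    "already processed": {**base, "foo.bar": "done"},
    "too large": {**base, "data.raw.size": 5_000_000},
    "empty": {**base, "data.raw.size": 0},
}
for label, trace in cases.items():
    print(label, pdf_plugin_matches(trace))
```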

## How precise should a matcher be?

In practice, only you as the plugin developer can answer this question.

Know that from the point of view of Hansken, we only care that the plugin:

- __Should not crash__: If a matcher does not compile, then your plugin will not be available in Hansken. Tip: be sure
  to test your plugin with the :ref:`test framework <test_framework>`.
- __Should not be slow__: Matching is designed to be extremely fast, but of course, if you make it too complex it can
  take longer than we want. Earlier in this document, we calculated that 1 extra second for each of 1 million traces
  adds up to roughly 11.6 days of extra CPU time. Unlike processing, matching is done for __every trace__, in every
  extraction iteration, so be careful!
- __Should match on the bare minimum__: Don't go too far by matching 50 different criteria before allowing a trace to be
  processed. Note that many (if not all) of these criteria depend on properties set by other tools, and you don't
  really have any control over how these tools work.
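
Since a matcher that does not compile makes the plugin unavailable, a cheap local sanity check can catch simple typos
early. The sketch below only verifies that parentheses are balanced in a shortened version of the PdfPlugin matcher. It
does not validate HQL-Lite syntax and is no substitute for the test framework; `parentheses_balanced` is a
hypothetical helper, not part of the SDK.

```python
MATCHER = r"""
foo.bar!=* AND
0 < data.raw.size < 1000000 AND
(
    file.extension=pdf OR
    (
        data.raw.mimeType=application\\/pdf OR
        data.raw.mimeType=application\\/x-pdf
    )
)
"""


def parentheses_balanced(matcher: str) -> bool:
    """Return True if every unescaped, unquoted '(' has a matching ')'."""
    depth = 0
    quote = None       # the quote character we are currently inside of, if any
    skip_next = False  # True right after a backslash: the next character is escaped
    for ch in matcher:
        if skip_next:
            skip_next = False
        elif ch == "\\":
            skip_next = True
        elif quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and quote is None


print(parentheses_balanced(MATCHER))
```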