mirror of
https://github.com/NetherlandsForensicInstitute/hansken-extraction-plugin-sdk-documentation.git
synced 2026-05-08 11:16:33 +00:00
287 lines
11 KiB
Plaintext
287 lines
11 KiB
Plaintext
# Python code snippets
|
||
|
||
## Adding properties to a trace
|
||
|
||
Use :py:meth:`update <hansken_extraction_plugin.api.extraction_trace.ExtractionTraceBuilder.update>`
|
||
to add trace types and their properties to an
|
||
:py:class:`ExtractionTrace <hansken_extraction_plugin.api.extraction_trace.ExtractionTrace>`.
|
||
Example:
|
||
|
||
```python
|
||
def process(self, trace, data_context):
|
||
# get the name of the file
|
||
file_name = trace.get('file.name')
|
||
# set the chat application property on the trace
|
||
trace.update('chatConversation.application', f'DemoApp {file_name}')
|
||
```
|
||
|
||
All types and properties that can be set are defined in the :ref:`Hansken trace model`.
|
||
|
||
### Date properties
|
||
|
||
When adding a property which holds a value of data-type Date, always define timezone as being UTC. Example:
|
||
|
||
```python
|
||
def process(self, trace, data_context):
|
||
trace.update('file.modifiedOn',
|
||
datetime.fromtimestamp(1630510809, tz=timezone.utc))
|
||
```
|
||
|
||
### Category for extra properties
|
||
|
||
If the information, which must be added as a property, does not match any of the existing properties of Hansken trace
|
||
model, use the category "misc" (miscellaneous). When part of the category "misc", any name can be given to a property.
|
||
The values of miscellaneous properties are expected to be of data-type string. Example:
|
||
|
||
```python
|
||
def process(self, trace, data_context):
|
||
trace.update({
|
||
'file.misc.notes': 'Some additional notes about the file trace.',
|
||
'file.misc.anyName': 'Even more notes.'
|
||
})
|
||
```
|
||
|
||
.. _tracelets python:
|
||
|
||
### Adding tracelets
|
||
|
||
In the following Python example, a "prediction" :ref:`tracelet<tracelets>` is added to a trace. The tracelet consists
|
||
of a list of four properties, namely "class", "confidence", "modelName" and "modelVersion".
|
||
|
||
```python
|
||
trace.add_tracelet(Tracelet('prediction', {'class': 'telephone',
|
||
'confidence': 0.8,
|
||
'modelName': 'yolo',
|
||
'modelVersion': '2.0'}))
|
||
```
|
||
|
||
## Adding child traces to a trace
|
||
|
||
Adding child traces to the trace can be done by creating a builder with
|
||
:py:meth:`child_builder <hansken_extraction_plugin.api.extraction_trace.ExtractionTraceBuilder.child_builder>`.
|
||
Example:
|
||
|
||
```python
|
||
def process(self, trace, data_context):
|
||
child_builder = trace.child_builder('childTrace-1')
|
||
child_builder.update({
|
||
'chatMessage.application': 'DemoApp',
|
||
'chatMessage.from': 'Ann',
|
||
'chatMessage.to': ['Mark'],
|
||
# list, because there can be multiple receivers
|
||
'chatMessage.message': 'Hello, are you there?',
|
||
}).build()
|
||
grandchild_builder = child_builder.child_builder('grandchild')
|
||
grandchild_builder.update(data={'byte': b'some bytes'})
|
||
grandchild_builder.build()
|
||
```
|
||
|
||
This adds a single child trace with name `childTrace-1` with four properties and a grandchild trace with name
|
||
`grandchild` and a byte data stream.
|
||
|
||
.. _datastreams python:
|
||
|
||
## Adding data to a trace
|
||
|
||
Traces can have data attached to them. See :ref:`datastreams` for more information.
|
||
The following two snippets demonstrate how to add data to a trace.
|
||
|
||
It is currently not possible to verify that a specific data stream is already set or not.
|
||
|
||
### Data Transformations
|
||
|
||
The most efficient way to add data to a trace is using data transformations.
|
||
See :doc:`../concepts/data_transformations` for more details.
|
||
|
||
The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:
|
||
|
||
```python
|
||
trace.add_transformation('html', RangedTransformation(Range(offset, length)))
|
||
```
|
||
|
||
The following example creates a child trace and sets a new datastream with dataType `raw` on it, by setting a ranged
|
||
data transformation with two ranges:
|
||
|
||
```python
|
||
child = trace.child_builder('new trace')
|
||
child.add_transformation('raw', RangedTransformation.builder()
|
||
.add_range(10, 20)
|
||
.add_range(50, 30)
|
||
.build())
|
||
});
|
||
```
|
||
|
||
### Blobs
|
||
|
||
It is not always possible to create a transformation for the data that has to be
|
||
added to a trace. For example, if the data is a result of a computation, and not
|
||
a direct subset of another data stream..
|
||
|
||
The following snippet shows how to create a new data stream of dataType `raw` on a trace from a blob stored in `bytes`:
|
||
|
||
```python
|
||
data = {'raw': b'...'}
|
||
trace.update(data=data);
|
||
```
|
||
|
||
.. _python_snippets_data_streaming:
|
||
|
||
#### Streaming data
|
||
|
||
.. warning:: Streaming data does not work with the Hansken.py runner because Hansken.py does not support it. It does
|
||
work when running your plugin in Hansken and in the test framework.
|
||
|
||
When dealing with large quantities of data, it is possible to keep the memory usage
|
||
of the plugin within manageable limits by streaming the data from the plugin to Hansken in smaller chunks.
|
||
To do this, use the `with trace.open(data_type=..., mode='wb')` syntax. Here are some examples:
|
||
|
||
Stream strings to `raw` (default) datastream:
|
||
|
||
```python
|
||
with trace.open(mode='wb') as writer:
|
||
writer.write(b'a string')
|
||
writer.write(bytes(another_string, 'utf-8'))
|
||
```
|
||
|
||
Stream a BufferedReader object to a `text` datastream:
|
||
|
||
```python
|
||
with trace.open(data_type='text', mode='wb') as output, open('input.text', 'rb') as in_file:
|
||
output.write(in_file)
|
||
```
|
||
|
||
#### Streaming text
|
||
|
||
To write `str` values directly, use mode `w` (or `wt`).
|
||
By default, it is assumed that the written text is 'utf-8' encoded. The default encoding can be overwritten by using the `'encoding='` argument.
|
||
|
||
(In a future Hansken update) Hansken will set the correct data-stream properties for your text stream (`mimeType`, `mimeClass`, and `fileType`).
|
||
|
||
```python
|
||
with trace.open(data_type='raw', mode='w', encoding='utf-8') as text_writer:
|
||
text_writer.write('hello.world') # write strings directly to the writer
|
||
json.dump({'hello': 'world'}, text_writer) # or pass the writer to json.dump
|
||
```
|
||
|
||
It is recommended to pass `utf-8` explictly as encoding.
|
||
|
||
## Specifying system resources
|
||
|
||
It is possible to specify system resources hints in the `PluginInfo`. To run a plugin with at least 0.5 cpu (= 0.5
|
||
vCPU/Core/hyperthread), 1 gb memory and 10 (concurrent) cpu workers (threads), for example, the following configuration can be added to `PluginInfo`:
|
||
|
||
```python
|
||
plugin_info = PluginInfo(...,
|
||
resources=PluginResources(maximum_cpu=0.5, maximum_memory=1000, maximum_workers=10))
|
||
```
|
||
|
||
.. _python_snippets_deferred:
|
||
|
||
## Deferred Plugins
|
||
|
||
Implementing a deferred extraction plugin requires inheriting the
|
||
:py:class:`DeferredExtractionPlugin <hansken_extraction_plugin.api.extraction_plugin.DeferredExtractionPlugin>`
|
||
base class.
|
||
|
||
```python
|
||
class DeferredPlugin(DeferredExtractionPlugin):
|
||
def process(self, trace, context, searcher):
|
||
```
|
||
|
||
This allows accessing a third :py:class:`TraceSearcher <hansken_extraction_plugin.api.trace_searcher.TraceSearcher>`
|
||
parameter in the process function. This can be used to search for traces:
|
||
|
||
```python
|
||
with searcher.search('file.extension:html', 10, scope='image') as searchresult:
|
||
for trace in searchresult:
|
||
log.debug(f'extension {trace.get("file.extension")}')
|
||
```
|
||
|
||
The ``search`` method accepts three arguments;
|
||
1. a HQL query (note: this is the traditional HQL query, and not the matchers HQL-lite variant),
|
||
2. (optional) the maximum number of traces to return (currently hard-limited to a maximum of 50 traces),
|
||
3. (optional) a scope, which can be either `image`, or `project`. When set to `image`, the searcher will only search for traces
|
||
within the same image as the trace that is being processed.
|
||
|
||
The returned :py:class:`SearchResult <hansken_extraction_plugin.api.search_result.SearchResult>`
|
||
should be closed, for example by using `with`. The resulting search result is an iterable, which will be exhausted when
|
||
no more traces are available. The search result allows taking one or more traces by calling :py:
|
||
meth:`take <hansken_extraction_plugin.api.search_result.SearchResult.take>` or
|
||
:py:meth:`takeone <hansken_extraction_plugin.api.search_result.SearchResult.takeone>`.
|
||
|
||
.. note:: The command `trace.open(datastream_type)` will fail on search result traces that do not originate from the
|
||
same image (evidence item) as the trace that is being processed.
|
||
|
||
## Deferred Meta Extraction Plugins
|
||
|
||
Implementing a deferred meta extraction plugin requires inheriting the
|
||
:py:class:`DeferredMetaExtractionPlugin <hansken_extraction_plugin.api.extraction_plugin.DeferredMetaExtractionPlugin>`
|
||
base class. This plugin is not able to call the trace.open() method since the actual trace data is not available to this plugin.
|
||
Also matching on data type will not work for this plugin since this plugin only works for meta traces
|
||
|
||
```python
|
||
class DeferredMetaPlugin(DeferredMetaExtractionPlugin):
|
||
def plugin_info(self):
|
||
...
|
||
|
||
def process(self, trace, searcher):
|
||
...
|
||
```
|
||
|
||
## Bulk Mode
|
||
|
||
The `PluginInfo` contains a parameter `bulk_mode`. This can be used for lightweight plugins which have to process a lot
|
||
of data (either a lot of traces with data or a small number of traces with large data streams). For streaming
|
||
extractions, these plugins will run inside the worker pod, and will therefore be able to process data more efficiently.
|
||
|
||
**WARNING**: The plugin should be lightweight. This means that it should not use a lot of resources like CPU or memory,
|
||
because this will limit the resources of the worker pod, and therefore Hansken will not be able to start enough workers
|
||
to do extractions.
|
||
|
||
Creating a plugin with bulk mode enabled can be done by setting the parameter to `True` in the `PluginInfo` as follows:
|
||
|
||
```python
|
||
plugin_info = PluginInfo(...,
|
||
bulk_mode=True)
|
||
```
|
||
|
||
## Logging
|
||
|
||
We use Logbook to log messages in Python. Logbook is a logging system for Python that replaces the standard library’s
|
||
logging module.
|
||
|
||
To enable logging in your plugin, add the following to the top of your plugin code:
|
||
|
||
```python
|
||
from logbook import Logger
|
||
|
||
log = Logger(__name__)
|
||
```
|
||
|
||
From there on the logging is pretty straight forward:
|
||
|
||
```python
|
||
log.info(f'Logging a variable: {my_variable}')
|
||
```
|
||
|
||
The default log level is `WARNING`. There are two ways to set the logging level. You can use the `-v` (or `-vv` or `-vvv`) option of `serve_plugin.py` to increase the log level. This is typically done in the plugin `Dockerfile`. Another option is to use an environment variable, `LOG_LEVEL`. Available levels are `WARNING`, `NOTICE`, `INFO` and `DEBUG`. The environment variable overrides the option.
|
||
|
||
.. warning:: Be careful with logging sensitive information.
|
||
|
||
.. note:: Contact your Hansken administrator for more information on where to find logs for your Hansken environment.
|
||
|
||
## [EXPERIMENTAL FEATURE] Adding previews to a trace
|
||
|
||
.. warning:: This is an experimental feature, which might change or get removed in future releases.
|
||
|
||
Use :py:meth:`update <hansken_extraction_plugin.api.extraction_trace.ExtractionTraceBuilder.update>`
|
||
to add previews to an
|
||
:py:class:`ExtractionTrace <hansken_extraction_plugin.api.extraction_trace.ExtractionTrace>`.
|
||
Example:
|
||
|
||
```python
|
||
def process(self, trace, data_context):
|
||
# set the preview data for the image/png MIME-type
|
||
trace.update('preview.image/png', b'\x00\xff')
|
||
```
|