# Python API Changelog

This document summarizes all important API changes in the Extraction Plugin API. It only lists changes that
are important to plugin developers. For a full list of changes per version, please refer to the general
:ref:`changelog <changelog>`.

.. If present, remove `..` before `## |version|` if you create a new entry after a previous release.

## |version|

* This version introduces a new Docker image build utility, `label_plugin`.
  This utility will eventually replace `build_plugin`, which is now deprecated.

  `label_plugin` is a utility that adds labels to an extraction plugin image. Labeling a plugin is required for
  Hansken to detect extraction plugins in a plugin image registry.

  To label a plugin, first build the plugin image with [docker build](https://docs.docker.com/reference/cli/docker/image/build/),
  for example by using one of the following commands:

  ```shell
  docker build . -t my_plugin
  docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
  ```

  Next, run the `label_plugin` utility to label the built plugin image:
  ```shell
  label_plugin my_plugin
  ```

  The result of `label_plugin` is a plugin image that can be :ref:`uploaded to Hansken<upload_plugin>`.

  `label_plugin` is preferred over `build_plugin`, as it does not require a full (virtual) environment
  with all plugin dependencies and resources. This is especially helpful when the plugin uses (big)
  data models or (external) dependencies.

  For usage read further in [packaging](packaging.md).

## 0.7.0

* Escaping the `/` character in matchers is now optional.
  This simplifies matchers and aims for better HQL and HQL-Lite compatibility.
  For more information and examples, see the :ref:`HQL-Lite syntax documentation<hqllite syntax>`.

  Examples:

  * Old: `file.path:\/Users\/*\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `file.path:\\/Users\\/*\\/AppData` -> new: `file.path:/Users/*/AppData`
  * Old: `registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p` -> new: `registryEntry.key:/Software/Dropbox/ks*/Client-p`

* Hansken now returns `file.path` properties (outside the scope of matchers) as a `String` property
  instead of a list of strings.
  Example: `trace.get('file.path')` now returns `'/dev/null'`; previously it returned `['dev', 'null']`.

* Improved plugin loading when using `serve_plugin` and `build_plugin`:
  `import` statements now work for modules (Python files) that are located in the same directory structure as the plugin.

* A plugin can now stream data to a trace using `trace.open(mode='wb')`.
  This removes the limit on the size of data that can be written.
  See also :ref:`the python code snippet<python_snippets_data_streaming>`.

  Example:

  ```python
  with trace.open(mode='wb') as writer:
      writer.write(b'a string')
      writer.write(bytes(another_string, 'utf-8'))
  ```

  _note_: this does not work when using `run_with_hanskenpy`.

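When the data to write is itself large, the writer from the streaming example above can be fed in chunks instead of from a single in-memory value. The following is a minimal sketch, not part of the SDK: `copy_in_chunks` and `chunk_size` are hypothetical names, and `io.BytesIO` stands in for the trace writer, on the assumption that the writer accepts `write(bytes)` as in the example above.

```python
import io
from typing import BinaryIO


def copy_in_chunks(source: BinaryIO, writer: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Copy source to writer in fixed-size chunks; return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals end of input
            break
        writer.write(chunk)
        total += len(chunk)
    return total


# io.BytesIO stands in for the writer of `trace.open(mode='wb')`,
# so the sketch runs without Hansken (assumption, see the text above)
source = io.BytesIO(b'x' * 2500)
target = io.BytesIO()
written = copy_in_chunks(source, target, chunk_size=1000)
```

In a plugin, `target` would be replaced by the `writer` obtained from `with trace.open(mode='wb') as writer:`.
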
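The matcher and `file.path` changes in `0.7.0` can be handled mechanically when migrating older plugins. The sketch below is illustrative only: `unescape_slashes` and `path_as_string` are hypothetical helpers, not SDK functions, and `path_as_string` assumes the old list form held the path segments without separators, as in the `['dev', 'null']` example.

```python
import re
from typing import List, Union


def unescape_slashes(matcher: str) -> str:
    """Drop the backslashes that escape '/' in an old-style matcher."""
    # one or more backslashes directly followed by a slash collapse to the slash,
    # which covers both escaped forms shown in the 0.7.0 examples
    return re.sub(r'\\+/', '/', matcher)


def path_as_string(value: Union[str, List[str]]) -> str:
    """Return a file.path value as a string, for both the old list form and the new string form."""
    if isinstance(value, str):
        return value  # new behaviour: already a string
    # assumption: the old list form holds the path segments in order
    return '/' + '/'.join(value)


single = unescape_slashes(r'file.path:\/Users\/*\/AppData')
double = unescape_slashes(r'file.path:\\/Users\\/*\\/AppData')
old_style = path_as_string(['dev', 'null'])
new_style = path_as_string('/dev/null')
```

Both matcher forms resolve to the same unescaped matcher, and both `file.path` forms resolve to the same string, so code written this way works before and after the upgrade.
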
## 0.6.1

* The docker image build script `build_plugin` has been updated to allow for extension of the docker command.
  This can be especially handy for specifying a proxy. You should build your plugin container image with the following
  command:

  ```bash
  build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
  ```

  .. warning:: Note that the `DOCKER_IMAGE_NAME` argument no longer requires a `-n` parameter to be specified.

  For usage read further in [packaging](packaging.md).

## 0.6.0

.. warning:: This is an API breaking change.
   Upgrading your plugin to this version will require code changes.
   Plugins built with earlier SDK versions, from `0.3.0` onwards, will still work with Hansken.

.. warning:: It is strongly recommended to upgrade your plugins to this new version because it significantly improves
   the start-up time of Hansken. See the migration steps below.

This release contains both build pipeline changes and API changes.
Please read all changes carefully.

### Build pipeline change

* Extraction plugin container images are now labeled with PluginInfo. This
  allows Hansken to efficiently load extraction plugins.
  Migration steps from earlier versions:

  1. Update the SDK version in your `setup.py` / `requirements.txt`
  2. If you come from a version prior to `0.4.0`, or if you use a plugin name
     instead of a plugin id in your `plugin_info()`, switch to the plugin id style
     (read instructions for version `0.4.0`)
  3. Update your build scripts to build your plugin (Docker) container image.
     Be sure to [have the Extraction Plugins SDK installed](getting_started.md#Installation).
     Then, you should build your plugin container image with the following command:

     ```bash
     build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
     ```

     For example:
     ```bash
     build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
     ```

     This will generate a plugin image:

     * The extraction plugin is added to your local image registry (`docker images`),
     * Note that DOCKER\_IMAGE\_NAME is optional and will default to `extraction-plugin/PLUGINID`, e.g.
       `extraction-plugin/nfi.nl/extract/chat/whatsapp`,
     * The image is tagged with two tags: `latest`, and your plugin version.

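The naming and tagging rules of the `0.6.0` build step can be summarized in a few lines of Python. This is a sketch of the behaviour described above, not the `build_plugin` implementation: `image_references` is a hypothetical helper, the plugin id is assumed to be available in its string form, and the version `1.0.0` is an example value.

```python
from typing import List, Optional


def image_references(plugin_id: str, version: str, image_name: Optional[str] = None) -> List[str]:
    """List the image name tagged with `latest` and with the plugin version."""
    # DOCKER_IMAGE_NAME is optional and defaults to extraction-plugin/PLUGINID
    name = image_name or f'extraction-plugin/{plugin_id}'
    return [f'{name}:latest', f'{name}:{version}']


# hypothetical example values, using the plugin id from the text above
refs = image_references('nfi.nl/extract/chat/whatsapp', '1.0.0')
named = image_references('nfi.nl/extract/chat/whatsapp', '1.0.0', 'extraction-plugins/chatplugin')
```
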
### API changes

* The field `plugin` has been removed from `PluginInfo`.
* The field `pluginId` should now be the first argument of `PluginInfo` (when using unnamed arguments).

  Old (unnamed arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(self, '1.0.0', 'description', author,
                        MaturityLevel.PROOF_OF_CONCEPT, '*', 'https://hansken.org',
                        PluginId(...), 'Apache License 2.0')
  ```

  New (removed `self`, and moved `PluginId(...)` to first argument position):

  ```python
  def plugin_info(self):
      return PluginInfo(PluginId(...), '1.0.0', 'description',
                        author, MaturityLevel.PROOF_OF_CONCEPT,
                        '*', 'https://hansken.org', 'Apache License 2.0')
  ```

  Old (named arguments):

  ```python
  def plugin_info(self):
      return PluginInfo(plugin=self,
                        version='1.0.0',
                        ...)
  ```

  New (removed `plugin=self`):

  ```python
  def plugin_info(self):
      return PluginInfo(version='1.0.0',
                        ...)
  ```

* Plugin `data_context.data_size` is now a variable instead of a method:

  Old:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size()
  ```

  New:

  ```python
  def process(self, trace: ExtractionTrace, data_context: DataContext):
      size = data_context.data_size
  ```

* Simplify declaring required runtime resources in a plugin's info.

  Extraction plugin resources don't use the builder pattern anymore.

  Old:

  ```python
  return PluginInfo(
      ...,
      resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build()
  )
  ```

  New:

  ```python
  # no need for a builder, declare resources by direct instantiation
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048)
  )
  # or, as before, specify just one resource
  return PluginInfo(
      ...,
      resources=PluginResources(maximum_memory=4096)
  )
  ```

## 0.5.1

* Simplify tracelet properties by making the tracelet type prefix optional.

  ```python
  # using a Tracelet object
  trace.add_tracelet(Tracelet("prediction", {
      "type": "example",
      "confidence": 0.8
  }))
  # or without a Tracelet object
  trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
  ```

* Enabled _manual_ plugin testing, as described on :ref:`advanced use of the test framework in Python<python testing>`.

## 0.5.0

* Support vector data type in trace properties.

  ```python
  embedding = Vector.from_sequence((width, height))
  tracelet = Tracelet("prediction", {
      "prediction.type": "example-vector",
      "prediction.embedding": embedding
  })
  trace.add_tracelet(tracelet)
  ```

## 0.4.13

* When writing input search traces for tests, it is no longer required to explicitly set an `id` property.
  These are automatically generated when executing tests.

## 0.4.7

* More `$data` matchers are supported in the Hansken.py plugin runner. Before this improvement it was only possible to match
  on `$data.type`. Now it is also possible to match on, for example, `$data.mimeType` and `$data.mimeClass`. The `$data`
  matcher should still be at the end of the query, as before.

## 0.4.6

* It is now possible to specify maximum system resources in the `PluginInfo`. To run a plugin with 0.5 CPU (= 0.5
  vCPU/Core/hyperthread) and 1 GB of memory, for example, the following configuration can be added to `PluginInfo`:

  ```python
  plugin_info = PluginInfo(...,
                           resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
  ```

## 0.4.0

* Extraction Plugins are now identified with a `PluginInfo.PluginId` containing a domain, category and name. The
  method `PluginInfo.name(pluginName)` has been replaced by `PluginInfo.id(new PluginId(domain, category, name))`. More
  details on the plugin naming conventions can be found in the :doc:`../concepts/plugin_naming_convention` section.

* `PluginInfo.name()` is now deprecated (but will still work for backwards compatibility).

* A new license field `PluginInfo.license` has also been added in this release.

* The following example creates a PluginInfo for a plugin with the name `TestPlugin`, licensed under
  the `Apache License 2.0` license:

  ```python
  class TestPlugin(ExtractionPlugin):
      def plugin_info(self) -> PluginInfo:
          return PluginInfo(self,
                            version='1.0.0',
                            description='A plugin for testing.',
                            author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
                            maturity=MaturityLevel.PROOF_OF_CONCEPT,
                            webpage_url='https://hansken.org',
                            matcher='file.extension=txt',
                            id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
                            license='Apache License 2.0'
                            )
  ```

## 0.3.0

* Extraction Plugins can now create new datastreams on a Trace through data transformations. Data transformations
  describe how data can be obtained from a source.

  An example case is an extraction plugin that processes an archive file. The plugin creates a child trace per entry in
  the archive file. Each child trace will have a datastream that is a transformation that marks the start and length of
  the entry in the original archive data. By just describing the data instead of specifying the actual data, a lot of
  space is saved.

  Although Hansken supports various transformations, the Extraction Plugins SDK for now only supports ranged data
  transformations. Ranged data transformations define data as a list of ranges, each range with an offset and length in
  a byte array.

  The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:

  ```python
  trace.add_transformation('html', RangedTransformation(Range(offset, length)))
  ```

  The following example creates a child trace and sets a new datastream with dataType `raw` on it, by setting a ranged
  data transformation with two ranges:

  ```python
  child = trace.child_builder('new trace')
  child.add_transformation('raw', RangedTransformation.builder()
                           .add_range(10, 20)
                           .add_range(50, 30)
                           .build())
  ```

  More detailed documentation will follow in an upcoming SDK release.

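To make the idea of a ranged transformation concrete, the sketch below resolves a list of `(offset, length)` ranges against the source bytes in plain Python. It does not use the SDK: `resolve_ranges` is a hypothetical helper, and it assumes the ranges are read in order and concatenated, which the `0.3.0` description implies but does not state.

```python
from typing import Iterable, Tuple


def resolve_ranges(data: bytes, ranges: Iterable[Tuple[int, int]]) -> bytes:
    """Return the bytes described by (offset, length) ranges over data."""
    # assumption: ranges are applied in order and their bytes are concatenated
    return b''.join(data[offset:offset + length] for offset, length in ranges)


source = bytes(range(100))
# the same two ranges as in the child trace example: (10, 20) and (50, 30)
resolved = resolve_ranges(source, [(10, 20), (50, 30)])
```

Only the two `(offset, length)` pairs need to be stored to describe the 50 resulting bytes, which is where the space saving mentioned in the `0.3.0` notes comes from.
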
## 0.2.0

.. warning:: This is an API breaking change.
   Plugins created with an earlier version of the extraction plugin
   SDK are not compatible with a Hansken version that uses `0.2.0` or later.

* Introduced a new extraction plugin type `api.extraction_plugin.DeferredExtractionPlugin`.
  Deferred Extraction plugins can be run at a different extraction stage.
  This type of plugin also allows accessing other traces using the searcher.

* The class `api.extraction_context.ExtractionContext` has been renamed to `api.data_context.DataContext`.
  The new name `DataContext` represents the class contents better.
  Plugins have to update matching import statements accordingly.
  Plugins should also rename the named argument `context` of the plugin `process()` method to `data_context`.
  This change has no functional impact.

  Old:

  ```python
  from hansken_extraction_plugin.api.extraction_context import ExtractionContext

  def process(self, trace, context):
      pass
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.data_context import DataContext

  def process(self, trace, data_context):
      pass
  ```

* Moved `api.author.Author` to `api.plugin_info.Author`, and moved `api.maturity_level.MaturityLevel`
  to `api.plugin_info.MaturityLevel`.
  This is a more *pythonic* way of grouping classes into modules. This change has no functional side effects.

  Plugins have to update matching import statements accordingly.

  Old:

  ```python
  from hansken_extraction_plugin.api.author import Author
  from hansken_extraction_plugin.api.maturity_level import MaturityLevel
  from hansken_extraction_plugin.api.plugin_info import PluginInfo
  ```

  New:

  ```python
  from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
  ```

* Removed `DataContext.get_first_bytes()` from the public API.

* Removed `api.extraction_trace.validate_update_arguments(..)` from the public API. This method is still invoked
  implicitly when setting trace properties.