Files
Roel van Dijk 93b020aef4 Update documentation to 0.9.16 (#10)
Co-authored-by: Roel van Dijk <rdvdijk@users.noreply.github.com>
2026-03-06 09:59:38 +01:00

537 lines
21 KiB
Plaintext

# Python API Changelog
This document summarizes all important API changes in the Extraction Plugin API. This document only shows changes that
are important to plugin developers. For a full list of changes per version, please refer to the general
:ref:`changelog <changelog>`.
.. If present, remove `..` before `## |version|` if you create a new entry after a previous release.
.. ## |version|
## 0.9.13
* This release introduces a new parameter `bulk_mode` to the `PluginInfo`. This can be used for
[lightweight plugins](snippets.md#bulk-mode) which have to process a lot of data (either a lot of traces or a small
number of traces with large data streams). These plugins will run inside the worker pod for streaming extractions,
and will therefore be able to process data more efficiently.
## 0.9.10
* This release introduces new parameters for the `trace_searcher`, `start` and `sort`.
This allows the searcher can get a certain range of traces in a specific order.
_Note:_ Hansken will support these types of plugins from v47.34.0.
* This release allows the user to search for more than 50 traces using the `GrpcTraceSearcher`. By specifying a count
greater than 50, results will be retrieved in batches of 50 (or less) until the desired count is achieved.
Setting the count to `None` (or omitting it) allows the `GrpcTraceSearcher` to retrieve all available traces.
This functionality is implemented in a buffered manner and is defined within `BatchedSearchResult`,
which replaces the now-removed `GrpcTraceResult`. _Note_: the search results are still limited by Elasticsearch
so no more than 100.000 results can be obtained.
## 0.9.9
* This release introduces the deferred meta extraction plugin. This plugin type can *defer* their execution and
processes a trace only with its metadata, without processing its data and accesses traces using the searcher.
This makes it possible to use deferred plugins in combination with traces without data.
_Note:_ Hansken will support these types of plugins from v47.34.0.
## 0.9.5
* The internal (de)serialization of some types has changed. Please update the extraction plugin sdk to match the one used in Hansken.
## 0.9.4
* This release supports providing all types of transformer arguments when using the `execute_transformer` script.
## 0.9.2
* This release introduces a simple flow-control mechanism that fixes `connection RST` errors that can occur when a
plugin produces traces too fast. Plugins built with this version are **not backwards compatible**.
* Python plugins now support transformers, remote methods which can be executed using the Hansken REST API. More information can be found in [the docs](transformers.md).
* The `build_plugin` and `label_plugin` utilities prematurely shut down containers if the building and labeling process takes too long causing the process for slow containers to fail. If your plugin takes a long time to start, you may want to increase the timeout before the script stops trying to connect and aborts the process of building the plugin. This can be done using the new optional `--timeout` argument. The default is set to 30 seconds.
* The optional image name argument of `build_plugin` is changed to a flag. Build scripts can be updated using `--target-name DOCKER_IMAGE_NAME`.
## 0.8.3
* This release addresses important load balancing issues. Please use release `0.8.3` as a drop-in-replacement for releases `0.8.2` and `0.8.1`.
## 0.8.2
* ⚠️ This release is deprecated, please upgrade to `0.8.3`
* The `build_plugin` utility has been updated and the deprecation status has been removed.
As with `label_plugin`, `build_plugin` now no longer requires a full (virtual) environment
with all plugin dependencies and resources. This will greatly reduce build times for plugins with
big dependencies and/or large models.
The first argument of the command (a pointer to your `plugin.py` file) has been removed.
Please do not forget to remove the first argument of `build_plugin` in your `tox.ini` or other build tooling.
For usage read further in [packaging](packaging.md).
* The default read-buffer of `trace.open('rb')` as been changed from 1 Megabyte to 6 Megabyte to reduce overhead while data reading.
* The data stream writer of `trace.open('wb')` is now buffered as well. This means that multiple small writes will be flushed after every 6 Megabytes of data has been written (or when the writer is closed).
* The read-buffer or write-buffer size can be overridden by the user, by passing the `buffer_size=` argument to `trace.open()`:
```python
with trace.open('rb', buffer_size=1024*1024): # set a 1 Megabyte buffer size
pass
with trace.open('wb', buffer_size=1024*1024*12): # set a 12 Megabyte buffer size
pass
with trace.open('wb', buffer_size=1): # a buffer_size of 1 effectively disables the buffer:
pass # each write will be flushed to Hansken directly
```
* It is now possible to write `str` values to `trace.open(..)`. To do so, pass `mode='w'` as additional argument.
By default, it is assumed that the written text is 'utf-8' encoded. The default can be overwritten by using the `'encoding='` argument.
In a future Hansken update, Hansken will set the correct data-stream properties for your text stream (`mimeType`, `mimeClass`, and `fileType`).
Example use cases are:
* write picture-to-text (OCR) data to a trace
* write translations to a trace
* write audio-to-text (audio transcriptions) to a trace
* write the results of a JSON dump, e.g.: `json.dump(['your', 'data'], text_writer)`
Examples in code:
```python
with trace.open(data_type='raw', mode='w', encoding='utf-8') as text_writer:
text_writer.write('hello.world') # write strings directly to it
json.dump({'hello': 'world'}, text_writer) # or pass the writer to json.dump
```
See also :ref:`the python code snippet<python_snippets_data_streaming>`.
## 0.8.1
* ⚠️ This release is deprecated, please upgrade to `0.8.3`
## 0.8.0
* The trace property `imageId` is renamed to `image`. This is to be in line with the Hansken REST API and Python API.
When updating your plugin, please update your calls `trace.get('imageId')` to `trace.get('image')`.
* [#774](https://git.eminjenv.nl/hansken/hbacklog/-/issues/774)
By default, deferred extraction plugin searches are now scoped to the image
of the trace that is currently being processed. Optionally, a project-wide
search can be done by passing an optional scope argument.
```python
def process(trace, data_context, searcher):
# only search for traces inside the same image as the trace that is being processed
searcher.search('*')
searcher.search('*', scope='image') # explicit alternative, using a str
searcher.search('*', scope=SearchScope.image) # explicit alternative, using the SearchScope enum
# only search for traces inside the same image as the trace that is being processed
searcher.search('*', scope='project')
searcher.search('*', scope=SearchScope.project)
```
* Support trace properties of type `list[float]`. This enables you to write
multiple offsets and confidence scores in tracelets of type prediction.
For example:
```python
trace.add_tracelet('prediction', {
'modelName': 'my_cat_detector',
'modelVersion': '0.0.BETA',
'type': 'classification',
'label': 'cat',
# the best score
'offset': 3.0,
'confidence': 0.4,
# all scores
'offsets': [0.0, 3.0, 6.0, 9.0],
'confidences': [0.1, 0.4, 0.03, 0.09],
})
```
## 0.7.3
* This version introduces a new docker image build utility `label_plugin`.
This utility will eventually replace `build_plugin`. `build_plugin` is now deprecated.
`label_plugin` is a utility to add labels to an extraction plugin image. Labeling a plugin is required for
Hansken to detect extraction plugins in a plugin image registry.
To label a plugin, first build the plugin image with [docker build](https://docs.docker.com/reference/cli/docker/image/build/);
for example by using one of the following commands:
```shell
docker build . -t my_plugin
docker build . -t my_plugin --build-arg https_proxy=http://your_proxy:8080
```
Next, run the `label_plugin` utility to label the build plugin container:
```shell
label_plugin my_plugin
```
The result of `label_plugin` is a plugin image that can be :ref:`uploaded to Hansken<upload_plugin>`.
`label_plugin` is preferred over `build_plugin`, as it does not require a full (virtual) environment
with all plugin dependencies and resources. This is especially preferred when the plugin uses (big)
data models or (external) dependencies.
For usage read further in [packaging](packaging.md).
## 0.7.0
* Escaping the `/` character in matchers is optional.
This simplifies and aims for better HQL and HQL-Lite compatability.
See for more information and examples the :ref:`HQL-Lite syntax documentation<hqllite syntax>`.
Examples:
* Old: `file.path:\/Users\/*\/AppData` -> new: `file.path:/Users/*/AppData`
* Old: `file.path:\\/Users\\/*\\/AppData` -> new: `file.path:/Users/*/AppData`
* Old: `registryEntry.key:\/Software\/Dropbox\/ks*\/Client-p` -> new: `registryEntry.key:/Software/Dropbox/ks*/Client-p`
* Hansken returns `file.path` properties (outside the scope of matchers) as a `String` property,
instead of a list of strings.
Example: `trace.get('file.path')` now returns `'/dev/null'`, this was `['dev', 'null']`.
* Improved plugin loading when using `serve_plugin` and `build_plugin`:
`import` statements now work for modules (python files) that are located the same directory structure of a plugin.
* A plugin can now stream data to a trace using `trace.open(mode='wb')`.
This removes the limit on the size of data that could be written.
See also :ref:`the python code snippet<python_snippets_data_streaming>`.
Example:
```python
with trace.open(mode='wb') as writer:
writer.write(b'a string')
writer.write(bytes(another_string, 'utf-8'))
```
_note_: this does not work when using `run_with_hanskenpy`.
## 0.6.1
* The docker image build script `build_plugin` has been updated to allow for extension of the docker command.
This can be especially handy for specifying a proxy. You should build your plugin container image with the following
command:
```bash
build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY [DOCKER_IMAGE_NAME] [DOCKER_ARGS]
```
.. warning:: Note that the `DOCKER_IMAGE_NAME` argument no longer requires a `-n` parameter to be specified.
For usage read further in [packaging](packaging.md).
## 0.6.0
.. warning:: This is an API breaking change.
Upgrading your plugin to this version will require code changes.
Plugins built with previous versions of the SDK from `0.3.0` will still work with Hansken.
.. warning:: It is strongly recommended to upgrade your plugins to this new version because it significantly improves
the start-up time of Hansken. See the migration steps below.
This release contains both build pipeline changes and API changes.
Please read all changes carefully.
### Build pipeline change
* Extraction plugin container images are now labeled with PluginInfo. This
allows Hansken to efficiently load extraction plugins.
Migration steps from earlier versions:
1. Update the SDK version in your `setup.py` / `requirements.txt`
2. If you come from a version prior to `0.4.0`, or if you use a plugin name
instead of a plugin id in your `pluginInfo()`, switch to the plugin id style
(read instructions for version `0.4.0`)
3. Update your build scripts to build your plugin (Docker) container image.
Be sure to [have the Extraction Plugins SDK installed](getting_started.md#Installation).
Then, you should build your plugin container image with the following command:
```bash
build_plugin PLUGIN_FILE DOCKER_FILE_DIRECTORY -n [DOCKER_IMAGE_NAME]
```
For example:
```bash
build_plugin plugin/chatplugin.py . -n extraction-plugins/chatplugin
```
This will generate a plugin image:
* The extraction plugin is added to your local image registry (`docker images`),
* Note that DOCKER\_IMAGE\_NAME is optional and will default to `extraction-plugin/PLUGINID`, e.g.
`extraction-plugin/nfi.nl/extract/chat/whatsapp`,
* The image is tagged with two tags: `latest`, and your plugin version.
### API changes
* The field `plugin` has been removed from `PluginInfo`.
* The field `pluginId` should now be the first argument of PluginInfo (when using unnamed arguments).
Old (unnamed arguments):
```python
def plugin_info(self):
return PluginInfo(self, '1.0.0', 'description', author,
MaturityLevel.PROOF_OF_CONCEPT, '*, 'https://hansken.org',
PluginId(...), 'Apache License 2.0')
```
New (removed `self`, and moved `PluginId(...)` to first argument position):
```python
def plugin_info(self):
return PluginInfo(PluginId(...), '1.0.0', 'description',
author, MaturityLevel.PROOF_OF_CONCEPT,
'*', 'https://hansken.org', 'Apache License 2.0')
```
Old (named arguments):
```python
def plugin_info(self):
return PluginInfo(plugin=self,
version='1.0.0',
...)
```
New (removed `plugin=self`):
```python
def plugin_info(self):
return PluginInfo(version='1.0.0',
...)
```
* Plugin `data_context.data_size` is now a variable instead of a method:
Old:
```python
def process(self, trace: ExtractionTrace, data_context: DataContext):
size = data_context.data_size()
```
New:
```python
def process(self, trace: ExtractionTrace, data_context: DataContext):
size = data_context.data_size
```
* Simplify declaring required runtime resources in a plugin's info.
Extraction plugin resources don't use the builder pattern anymore.
Old:
```python
return PluginInfo(
...,
resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
)
```
New:
```python
# no need for a builder, declare resources by direct instantiation
return PluginInfo(
...,
resources=PluginResources(maximum_cpu=2.0, maximum_memory=2048)
)
# or, as before, specify just on resource
return PluginInfo(
...,
resources=PluginResources(maximum_memory=4096)
)
```
## 0.5.1
* Simplify tracelet properties by making the tracelet type prefix optional.
```python
# using a Tracelet object
trace.add_tracelet(Tracelet("prediction", {
"type": "example",
"confidence": 0.8
}))
# or without a Tracelet object
trace.add_tracelet("identity", {"name": "John Doe", "status": "online"})
```
* Enabled _manual_ plugin testing, as described on :ref:`advanced use of the test framework in Python<python testing>`.
## 0.5.0
* Support vector data type in trace properties.
```python
embedding = Vector.from_sequence((width, height))
tracelet = Tracelet("prediction", {
"prediction.type": "example-vector",
"prediction.embedding": embedding
})
trace.add_tracelet(tracelet)
```
## 0.4.13
* When writing input search traces for tests, it is no longer required to explicitly set an `id` property.
These are automatically generated when executing tests.
## 0.4.7
* More `$data` matchers are supported in Hansken.py plugin runner. Before this improvement it was only possible to match
on `$data.type`. Now it is also possible to match for example on `$data.mimeType` and `$data.mimeClass`. The `$data`
matcher should still be at the end of the query as before.
## 0.4.6
* It is now possible to specify maximum system resources in the `PluginInfo`. To run a plugin with 0.5 cpu (= 0.5
vCPU/Core/hyperthread) and 1 gb memory, for example, the following configuration can be added to `PluginInfo`:
```python
plugin_info = PluginInfo(...,
resources=PluginResources.builder().maximum_cpu(0.5).maximum_memory(1000).build())
```
## 0.4.0
* Extraction Plugins are now identified with a `PluginInfo.PluginId` containing a domain, category and name. The
method `PluginInfo.name(pluginName)` has been replaced by `PluginInfo.id(new PluginId(domain, category, name)`. More
details on the plugin naming conventions can be found at the :doc:`../concepts/plugin_naming_convention` section.
* `PluginInfo.name()` is now deprecated (but will still work for backwards compatibility).
* A new license field `PluginInfo.license` has also been added in this release.
* The following example creates a PluginInfo for a plugin with the name `TestPlugin`, licensed under
the `Apache License 2.0` license:
```python
class TestPlugin(ExtractionPlugin):
def plugin_info(self) -> PluginInfo:
return PluginInfo(self,
version='1.0.0',
description='A plugin for testing.',
author=Author('The Externals', 'tester@holmes.nl', 'NFI'),
maturity=MaturityLevel.PROOF_OF_CONCEPT,
webpage_url='https://hansken.org',
matcher='file.extension=txt',
id=PluginId(domain='nfi.nl', category='test', name='TestPlugin'),
license='Apache License 2.0'
)
```
## 0.3.0
* Extraction Plugins can now create new datastreams on a Trace through data transformations. Data transformations
describe how data can be obtained from a source.
An example case is an extraction plugin that processes an archive file. The plugin creates a child trace per entry in
the archive file. Each child trace will have a datastream that is a transformation that marks the start and length of
the entry in the original archive data. By just describing the data instead of specifying the actual data, a lot of
space is saved.
Although Hansken supports various transformations, the Extraction Plugins SDK for now only supports ranged data
transformations. Ranged data transformations define data as a list of ranges, each range with an offset and length in
a bytearray.
The following example sets a new datastream with dataType `html` on a trace, by setting a ranged data transformation:
```python
trace.add_transformation('html', RangedTransformation(Range(offset, length)))
```
The following example creates a child trace and sets a new datastream with dataType `raw` on it, by setting a ranged
data transformation with two ranges:
```python
child = trace.child_builder('new trace')
child.add_transformation('raw', RangedTransformation.builder()
.add_range(10, 20)
.add_range(50, 30)
.build())
});
```
More detailed documentation will follow in an upcoming SDK release.
## 0.2.0
.. warning:: This is an API breaking change.
Plugins created with an earlier version of the extraction plugin
SDK are not compatible with Hansken that uses `0.2.0` or later.
* Introduced a new extraction plugin type `api.extraction_plugin.DeferredExtractioPlugin`.
Deferred Extraction plugins can be run at a different extraction stage.
This type of plugin also allows accessing other traces using the searcher.
* The class `api.extraction_context.ExtractionContext` has been renamed to `api.data_context.DataContext`.
The new name `DataContext` represents the class contents better.
Plugins have to update matching import statements accordingly.
Plugins should also update the named argument `context` to `data_context` of the plugin `process()` method.
This change has no functional changes.
Old:
```python
from hansken_extraction_plugin.api.extraction_context import ExtractionContext
def process(self, trace, context):
pass
```
New:
```python
from hansken_extraction_plugin.api.data_context import DataContext
def process(self, trace, data_context):
pass
```
* Moved `api.author.Author` to `api.plugin_info.Author`, and moved `api.maturity_level.MaturityLevel`
to `api.plugin_info.MaturityLevel`
This is a more *pythonic* way of grouping of classes into modules. This change has no functional side effects.
Plugins have to update matching import statements accordingly.
Old:
```python
from hansken_extraction_plugin.api.author import Author
from hansken_extraction_plugin.api.maturity_level import MaturityLevel
from hansken_extraction_plugin.api.plugin_info import PluginInfo
```
New:
```python
from hansken_extraction_plugin.api.plugin_info import Author, MaturityLevel, PluginInfo
```
* Removed `DataContext.get_first_bytes()` from the public API.
* Removed `api.extraction_trace.validate_update_arguments(..)` from the public API. This method is still invoked
implicitly when setting trace properties.