Duplication filter

This submodule is used to skip already parsed data.

Each publication passed to filter_publication() is cached. If the function is called again with the same publication, None is returned.

Note

The cache uses simple JSON serialization, so it persists on disk between runs in a basic form. For the path to the serialized data, see DUP_FILTER_FILE.

harvester.filters.dup_filter.save_cache(cache)

Save the cache to disk.

Parameters: cache (set) – Set with cached data.
harvester.filters.dup_filter.load_cache()

Load the cache from disk.

Returns: Deserialized data from disk.
Return type: set
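
The save/load pair can be illustrated with a minimal sketch of a set persisted as JSON. This is not the module's actual code: the function names, the temporary file path, and the assumption that cached items are strings are all hypothetical. Since JSON has no set type, the sketch stores a list and converts it back on load.

```python
import json
import os
import tempfile

# Hypothetical path; the real module takes its location from DUP_FILTER_FILE.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "dup_filter_sketch.json")


def save_cache_sketch(cache, path=CACHE_PATH):
    """Write the set to disk as a JSON list (JSON has no set type)."""
    with open(path, "w") as f:
        json.dump(sorted(cache), f)


def load_cache_sketch(path=CACHE_PATH):
    """Read the JSON list back into a set; a missing file yields an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


# Round trip: the restored value is a set again, not a list.
save_cache_sketch({"isbn:1", "isbn:2"})
restored = load_cache_sketch()
```

Sorting before dumping is only there to keep the file contents stable between runs; it assumes the cached items are mutually comparable, as strings are.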
harvester.filters.dup_filter.filter_publication(publication, cache=None)

Deduplication function that compares the publication with the samples stored in the cache. If no match is found, the publication is returned; otherwise None is returned.

Parameters:
  • publication (obj) – Publication instance.
  • cache (obj) – Cache which is used for lookups.
Returns:

The publication if it is not found in the cache, otherwise None.

Return type:

obj/None
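
The return behavior described above can be sketched as follows. The names here are hypothetical: Publication is a stand-in for the harvester's publication class, and using repr() as the cache key is an assumption, since the real key is not documented in this section. Unlike the real function, the sketch requires an explicit cache rather than defaulting to None.

```python
from collections import namedtuple

# Hypothetical stand-in for the harvester's publication class.
Publication = namedtuple("Publication", ["title", "authors", "isbn"])


def filter_publication_sketch(publication, cache):
    """Return the publication on first sight, None on every repeat."""
    key = repr(publication)  # assumed cache key, not taken from the module
    if key in cache:
        return None
    cache.add(key)
    return publication


cache = set()
book = Publication("Example title", ("A. Author",), "isbn:1")
first = filter_publication_sketch(book, cache)   # not seen before: returned
second = filter_publication_sketch(book, cache)  # already cached: None
```

A caller would typically drop the None results, for example by keeping only the publications for which the filter returned a value.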