edeposit_autoparser.py

This script eases the creation of new parsers.

Configuration file

The script expects a configuration file with patterns, passed with the -c parameter. Pattern files use YAML as their serialization format.

A pattern file should contain multiple pattern definitions. Here is an example of a test pattern file:

html: simple_xml.xml
first:
    data: i wan't this
    required: true
    notfoundmsg: Can't find variable '$name'.
second:
    data: and this
---
html: simple_xml2.xml
first:
    data: something wanted
    required: true
    notfoundmsg: Can't find variable '$name'.
second:
    data: another wanted thing

As you can see, this file contains two examples separated by ---. Each section of the file has to contain an html key pointing to either a file or a URL resource.

After the html key, there may be an unlimited number of variables. Each variable has to contain a data key, which defines the content to match in the file the html key points to.
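The structural rules above can be sketched as a small check. This is only an illustration: validate_examples is a hypothetical helper, not part of the script, and it assumes the pattern file has already been loaded into a list of dicts, one per section (for example with PyYAML's yaml.safe_load_all).

```python
def validate_examples(examples):
    """Check already-parsed examples against the rules described above.

    Hypothetical helper: `examples` is assumed to be a list of dicts,
    one per `---` separated section of the pattern file.
    """
    for number, example in enumerate(examples, 1):
        # every section has to point to a file or URL
        if "html" not in example:
            raise ValueError("example %d: missing 'html' key" % number)

        # every other key is a variable and has to define `data`
        for name, variable in example.items():
            if name == "html":
                continue
            if not isinstance(variable, dict) or "data" not in variable:
                raise ValueError(
                    "example %d: variable '%s' has no 'data' key" % (number, name)
                )

    return examples


# The test pattern file from above, written as Python data.
examples = [
    {
        "html": "simple_xml.xml",
        "first": {
            "data": "i wan't this",
            "required": True,
            "notfoundmsg": "Can't find variable '$name'.",
        },
        "second": {"data": "and this"},
    },
    {
        "html": "simple_xml2.xml",
        "first": {
            "data": "something wanted",
            "required": True,
            "notfoundmsg": "Can't find variable '$name'.",
        },
        "second": {"data": "another wanted thing"},
    },
]
validate_examples(examples)
```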

Optionally, you can also specify required and notfoundmsg. If a variable is marked as required and the generated parser encounters data in which the variable cannot be found, a UserWarning exception is raised with notfoundmsg as its message. As the example shows, notfoundmsg may contain $name, which is replaced by the name of the variable (first, for example).
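The behaviour of required and notfoundmsg can be illustrated with the standard library's string.Template. This is a sketch under the assumption that $name is substituted in this way; the script itself may expand the placeholder differently, and both function names are hypothetical.

```python
from string import Template


def render_notfoundmsg(notfoundmsg, name):
    """Replace `$name` in the message with the name of the variable."""
    return Template(notfoundmsg).safe_substitute(name=name)


def check_required(name, value, required=False,
                   notfoundmsg="Can't find variable '$name'."):
    """Raise UserWarning when a required variable was not found."""
    if required and value is None:
        raise UserWarning(render_notfoundmsg(notfoundmsg, name))

    return value


message = render_notfoundmsg("Can't find variable '$name'.", "first")
```

With the pattern file shown earlier, message matches the text that the generated get_first() parser passes to UserWarning.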

There is also a special keyword, tagname, which can be used to further specify the correct element when more than one element matches.

How it works

Autoparser first reads all examples and locates the elements whose content matches the pattern defined in the data key. Whitespace at the beginning and end of both the pattern and the element’s content is ignored.
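The matching step can be sketched with the standard library. Note the assumptions: xml.etree.ElementTree is used here only as a stand-in for the DOM the script really works with, and locate_elements is a hypothetical name, not the script's own function.

```python
import xml.etree.ElementTree as ET


def locate_elements(root, pattern):
    """Return all elements whose stripped text equals the stripped pattern."""
    wanted = pattern.strip()

    return [
        el for el in root.iter()
        if (el.text or "").strip() == wanted
    ]


# Small stand-in document; surrounding whitespace is ignored on both sides.
doc = ET.fromstring(
    "<root><xax><container>  i wan't this </container>"
    "<container id='mycontent'>and this</container></xax></root>"
)
matches = locate_elements(doc, " i wan't this")
```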

Once the autoparser has collected all matching elements, it generates DOM paths to each of them.
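To make the idea of a DOM path concrete, here is a minimal sketch that computes two kinds of path for one element: a forward path of child indexes from the root, and a reverse path that uses negative indexes. It is written against xml.etree.ElementTree as a stand-in; index_paths and follow are hypothetical helpers, and the real script also produces other kinds of paths.

```python
import xml.etree.ElementTree as ET


def index_paths(root, target):
    """Return (forward, reverse) child-index paths from `root` to `target`."""
    # ElementTree has no parent links, so build a child -> parent map first
    parents = {child: parent for parent in root.iter() for child in parent}

    forward, reverse = [], []
    node = target
    while node is not root:
        parent = parents[node]
        siblings = list(parent)
        position = siblings.index(node)

        forward.append(position)                   # counted from the start
        reverse.append(position - len(siblings))   # counted from the end
        node = parent

    return forward[::-1], reverse[::-1]


def follow(root, path):
    """Walk the child indexes in `path` and return the element reached."""
    node = root
    for index in path:
        node = list(node)[index]

    return node


doc = ET.fromstring(
    "<root><a/><xax><container>x</container><container>y</container></xax></root>"
)
target = doc[1][1]
forward, reverse = index_paths(doc, target)
```

Both paths lead back to the same element, so follow(doc, forward) and follow(doc, reverse) return target.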

After that, the elimination process begins. In this step, the autoparser throws away all paths that don’t work for all corresponding variables in all examples.
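The elimination step can be illustrated as follows. This is a simplified sketch, not the script's implementation: paths are modelled as plain tuples of child indexes, xml.etree.ElementTree stands in for the real DOM, and resolve and surviving_paths are hypothetical names.

```python
import xml.etree.ElementTree as ET


def resolve(root, path):
    """Follow child indexes from `root`; return None if the path breaks."""
    node = root
    for index in path:
        children = list(node)
        if not -len(children) <= index < len(children):
            return None
        node = children[index]

    return node


def surviving_paths(candidates, examples):
    """Keep only paths that reach the wanted element in every example.

    `examples` is a list of (root, wanted_element) pairs.
    """
    return [
        path for path in candidates
        if all(resolve(root, path) is wanted for root, wanted in examples)
    ]


first = ET.fromstring("<root><x><c>a</c><c>b</c></x></root>")
second = ET.fromstring("<root><x><c>0</c><c>a</c><c>b</c></x></root>")

# In both examples the wanted element is the last <c>.
examples = [(first, first[0][1]), (second, second[0][2])]
candidates = [(0, 1), (0, -1)]
kept = surviving_paths(candidates, examples)
```

Here the forward path (0, 1) works only for the first document, while the reverse path (0, -1) works for both, so it is the only one kept.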

When this is done, the paths with the best priority are selected and generate_parsers() is called.

The result of this call is a string, which is printed to the output. This string contains all the necessary parsers for each variable and also a unit test.

You can then build the parser you need much more easily, because you now have working pickers from the DOM and all you need to do is clean the data.

Live example:

$ ./edeposit_autoparser.py -c autoparser/autoparser_data/example_data.yaml
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# Interpreter version: python 2.7
#
# HTML parser generated by Autoparser
# (https://github.com/edeposit/edeposit.amqp.harvester)
#
import os
import os.path

import httpkie
import dhtmlparser


# Utilities
def _get_source(link):
    """
    Return source of the `link` whether it is filename or url.

    Args:
        link (str): Filename or URL.

    Returns:
        str: Content.

    Raises:
        UserWarning: When the `link` couldn't be resolved.
    """
    if link.startswith("http://") or link.startswith("https://"):
        down = httpkie.Downloader()
        return down.download(link)

    if os.path.exists(link):
        with open(link) as f:
            return f.read()

    raise UserWarning("html: '%s' is neither URL or data!" % link)


def _get_encoding(dom, default="utf-8"):
    """
    Try to look for meta tag in given `dom`.

    Args:
        dom (obj): pyDHTMLParser dom of HTML elements.
        default (default "utf-8"): What to use if encoding is not found in
                                   `dom`.

    Returns:
        str/default: Given encoding or `default` parameter if not found.
    """
    encoding = dom.find("meta", {"http-equiv": "Content-Type"})

    if not encoding:
        return default

    encoding = encoding[0].params.get("content", None)

    if not encoding:
        return default

    return encoding.lower().split("=")[-1]


def handle_encodnig(html):
    """
    Look for encoding in given `html`. Try to convert `html` to utf-8.

    Args:
        html (str): HTML code as string.

    Returns:
        str: HTML code encoded in UTF.
    """
    encoding = _get_encoding(
        dhtmlparser.parseString(
            html.split("</head>")[0]
        )
    )

    if encoding == "utf-8":
        return html

    return html.decode(encoding).encode("utf-8")


def is_equal_tag(element, tag_name, params, content):
    """
    Check whether `element` object matches rest of the parameters.

    All checks are performed only if proper attribute is set in the HTMLElement.

    Args:
        element (obj): HTMLElement instance.
        tag_name (str): Tag name.
        params (dict): Parameters of the tag.
        content (str): Content of the tag.

    Returns:
        bool: True if everything matches, False otherwise.
    """
    if tag_name and tag_name != element.getTagName():
        return False

    if params and not element.containsParamSubset(params):
        return False

    if content is not None and content.strip() != element.getContent().strip():
        return False

    return True


def has_neigh(tag_name, params=None, content=None, left=True):
    """
    This function generates functions, which matches all tags with neighbours
    defined by parameters.

    Args:
        tag_name (str): Tag has to have neighbour with this tagname.
        params (dict): Tag has to have neighbour with these parameters.
        content (str): Tag has to have neighbour with this content.
        left (bool, default True): Tag has to have neighbour on the left, or
                                   right (set to ``False``).

    Returns:
        fn: Function returning True for every matching tag.

    Note:
        This function can be used as parameter for ``.find()`` method in
        HTMLElement.
    """
    def has_neigh_closure(element):
        if not element.parent \
           or not (element.isTag() and not element.isEndTag()):
            return False

        # filter only visible tags/neighbours
        childs = element.parent.childs
        childs = filter(
            lambda x: (x.isTag() and not x.isEndTag()) \
                      or x.getContent().strip() or x is element,
            childs
        )
        if len(childs) <= 1:
            return False

        ioe = childs.index(element)
        if left and ioe > 0:
            return is_equal_tag(childs[ioe - 1], tag_name, params, content)

        if not left and ioe + 1 < len(childs):
            return is_equal_tag(childs[ioe + 1], tag_name, params, content)

        return False

    return has_neigh_closure


# Generated parsers
def get_second(dom):
    el = dom.find(
        'container',
        {'id': 'mycontent'},
        fn=has_neigh(None, None, 'something something', left=False)
    )

    # pick element from list
    el = el[0] if el else None

    return el


def get_first(dom):
    el = dom.wfind('root').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: root\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    el = el.wfind('xax').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: xax\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    el = el.wfind('container').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: container\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    return el


# Unittest
def test_parsers():
    # Test parsers against autoparser/autoparser_data/simple_xml.xml
    html = handle_encodnig(
        _get_source('autoparser/autoparser_data/simple_xml.xml')
    )
    dom = dhtmlparser.parseString(html)
    dhtmlparser.makeDoubleLinked(dom)

    second = get_second(dom)
    assert second.getContent().strip() == 'and this'

    first = get_first(dom)
    assert first.getContent().strip() == "i wan't this"

    # Test parsers against autoparser/autoparser_data/simple_xml2.xml
    html = handle_encodnig(
        _get_source('autoparser/autoparser_data/simple_xml2.xml')
    )
    dom = dhtmlparser.parseString(html)
    dhtmlparser.makeDoubleLinked(dom)

    second = get_second(dom)
    assert second.getContent().strip() == 'another wanted thing'

    first = get_first(dom)
    assert first.getContent().strip() == 'something wanted'


# Run tests of the parser
if __name__ == '__main__':
    test_parsers()

API

harvester.edeposit_autoparser._create_dom(data)

Creates a double-linked DOM from data.

Parameters: data (str/HTMLElement) – Either a string or an HTML element.
Returns: HTMLElement containing the double-linked DOM.
Return type: obj

harvester.edeposit_autoparser._locate_element(dom, el_content, transformer=None)

Find elements containing el_content in dom. The transformer function is applied to the content of every element in dom before it is compared with el_content.

Parameters:
  • dom (obj) – HTMLElement tree.
  • el_content (str) – Content of the element that will be picked from dom.
  • transformer (fn, default None) – Transforming function.

Note

The transformer parameter can be, for example, a simple lambda:

lambda x: x.strip()

Returns: Matching HTMLElements.
Return type: list

harvester.edeposit_autoparser._match_elements(dom, matches)

Find the location of elements matching the patterns specified in matches.

Parameters:
  • dom (obj) – HTMLElement DOM tree.
  • matches (dict) – Structure: {"var": {"data": "match", ..}, ..}.
Returns: Structure: {"var": {"data": HTMLElement_obj, ..}, ..}
Return type: dict

harvester.edeposit_autoparser._collect_paths(element)

Collect all possible paths that lead to element.

The function returns the standard path from the root element to this one, the reverse path, which uses negative indexes, and also some pattern matches, such as “this is the element which has a neighbour with id 7”.

Parameters: element (obj) – HTMLElement instance.
Returns: List of PathCall and Chained objects.
Return type: list

harvester.edeposit_autoparser._is_working_path(dom, path, element)

Check whether the path is working or not.

Apply the proper search function interpreting path to dom and check whether the returned object is element. If so, return True, otherwise False.

Parameters:
  • dom (obj) – HTMLElement DOM.
  • path (obj) – PathCall instance containing information about the path and which function it requires to obtain the element the path is pointing to.
  • element (obj) – HTMLElement instance used to decide whether the path points to the correct element or not.
Returns: True if path correctly points to the proper element.
Return type: bool

harvester.edeposit_autoparser.select_best_paths(examples)

Process examples and select only the paths that work for every example. From those, select the paths with the highest priority.

Parameters: examples (dict) – Output from read_config().
Returns: List of PathCall and Chained objects.
Return type: list