utils submodule

This module contains a number of functions which are used in the rest of the scrappers submodule.

harvester.scrappers.utils._get_encoding(dom, default='utf-8')

Try to find the encoding declared in a meta tag in the given dom.

Parameters:
  • dom (obj) – pyDHTMLParser dom of HTML elements.
  • default (default “utf-8”) – What to use if the encoding is not found in dom.
Returns:

The encoding found in dom, or the default parameter if none is found.

Return type:

str/default
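
The lookup described above can be illustrated with a minimal sketch built on the Python standard library rather than pyDHTMLParser. It takes the HTML as a string instead of a dom, and the names find_meta_encoding and _MetaCharsetFinder are hypothetical; this is not the module's implementation.

```python
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Short form: <meta charset="...">
            self.charset = attrs["charset"].strip()
            return
        content = attrs.get("content") or ""
        lowered = content.lower()
        if "charset=" in lowered:
            # Long form: <meta http-equiv="Content-Type" content="text/html; charset=...">
            start = lowered.index("charset=") + len("charset=")
            self.charset = content[start:].split(";")[0].strip()


def find_meta_encoding(html, default="utf-8"):
    """Return the charset declared in a meta tag, or `default` if there is none."""
    finder = _MetaCharsetFinder()
    finder.feed(html)
    return finder.charset or default
```

Falling back to default keeps the caller free of None checks, which matches the "str/default" return type documented above.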

harvester.scrappers.utils.handle_encodnig(html)

Look for encoding in given html. Try to convert html to utf-8.

Parameters:
  • html (str) – HTML code as a string.
Returns:

HTML code encoded in UTF-8.

Return type:

str

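The conversion step can be sketched as follows, assuming Python 3, where the raw HTML arrives as bytes. The name to_utf8, the regular expression and the fallback behaviour are assumptions made for illustration, not the module's code.

```python
import re

# Hypothetical pattern: finds charset=... in either meta tag form.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def to_utf8(raw_html, default="utf-8"):
    """Decode raw HTML bytes using the declared charset and re-encode as UTF-8."""
    match = _CHARSET_RE.search(raw_html)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        text = raw_html.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back to the default, replacing bad bytes.
        text = raw_html.decode(default, errors="replace")
    return text.encode("utf-8")
```

Catching LookupError covers pages that declare an encoding name Python does not know, a case worth handling when scraping arbitrary sites.
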
harvester.scrappers.utils.get_first_content(el_list, alt=None, strip=True)

Return content of the first element in el_list or alt. Also return alt if the content string of first element is blank.

Parameters:
  • el_list (list) – List of HTMLElement objects.
  • alt (default None) – Value returned when the list or the content is blank.
  • strip (bool, default True) – Call .strip() on the content.
Returns:

String representation of the content of the first element or alt if not found.

Return type:

str or alt
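
The documented behaviour can be sketched as below. FakeElement is a hypothetical stand-in for HTMLElement, and the sketch assumes elements expose their text through a getContent() method; first_content is likewise an illustrative name.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with text content."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def first_content(el_list, alt=None, strip=True):
    """Return the content of the first element, or `alt` if there is nothing usable."""
    if not el_list:
        return alt
    content = el_list[0].getContent()
    if strip:
        content = content.strip()
    # An empty string counts as blank, so the alternative is returned instead.
    return content if content else alt
```

For example, first_content([FakeElement("  Title ")]) yields "Title", while an empty list or a whitespace-only first element yields alt.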

harvester.scrappers.utils.is_absolute_url(url, protocol='http')

Test whether url is absolute url (http://domain.tld/something) or relative (../something).

Parameters:
  • url (str) – Tested string.
  • protocol (str, default “http”) – Protocol which will be sought at the beginning of the url.
Returns:

True if url is absolute, False if not.

Return type:

bool
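
One way such a test could work is sketched here. The name looks_absolute and the exact rule (a "://" separator preceded by a scheme starting with protocol) are assumptions; the real function may apply a simpler prefix check.

```python
def looks_absolute(url, protocol="http"):
    """Return True if `url` starts with a scheme that begins with `protocol`."""
    scheme, separator, _ = url.partition("://")
    # No "://" at all means the URL is relative (e.g. "../something").
    return bool(separator) and scheme.startswith(protocol)
```

Under this rule the default protocol "http" also accepts "https" URLs, because only the start of the scheme is compared.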

harvester.scrappers.utils.normalize_url(base_url, rel_url)

Normalize the URL: create an absolute URL from a relative one.

Parameters:
  • base_url (str) – Domain string including the protocol:// prefix.
  • rel_url (str) – Relative or absolute url.
Returns:

Normalized URL or None if url is blank.

Return type:

str/None
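
A comparable result can be obtained with urljoin from the standard library, as in this minimal sketch. The name to_absolute is hypothetical, and treating whitespace-only input as blank is an assumption.

```python
from urllib.parse import urljoin


def to_absolute(base_url, rel_url):
    """Join `rel_url` onto `base_url`; return None when `rel_url` is blank."""
    if not rel_url or not rel_url.strip():
        return None
    # urljoin leaves already absolute URLs untouched and resolves "../" segments.
    return urljoin(base_url, rel_url.strip())
```

Because urljoin returns absolute URLs unchanged, the same call handles both kinds of input listed for rel_url.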

harvester.scrappers.utils.has_param(param)

Generate a function which checks whether param is present in an HTML element.

This function can be used as parameter for .find() method in HTMLElement.
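
The "generate a function" pattern is a closure: the outer call captures param and returns a predicate that takes one element. The sketch below assumes elements keep their attributes in a dict named params; FakeElement and make_param_check are hypothetical names.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with an attribute dict."""

    def __init__(self, params):
        self.params = params


def make_param_check(param):
    """Return a predicate that is true for elements carrying `param`."""

    def check(element):
        return param in element.params

    return check


# The predicate can filter any collection of elements.
elements = [FakeElement({"href": "/a"}), FakeElement({"class": "x"})]
with_href = [el for el in elements if make_param_check("href")(el)]
```

A predicate of this shape is what a search method can call on each candidate element to decide whether to keep it.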

harvester.scrappers.utils.must_contain(tag_name, tag_content, container_tag_name)

Generate a function which checks whether the given element contains tag_name with string content tag_content and also another tag named container_tag_name.

This function can be used as parameter for .find() method in HTMLElement.
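
A possible reading of this check is sketched below on a tiny stand-in tree. FakeNode, its find() and getContent() methods, and make_must_contain are all hypothetical, and stripping the content before comparison is an assumption.

```python
class FakeNode:
    """Hypothetical stand-in for an HTMLElement with child elements."""

    def __init__(self, name, content="", children=()):
        self.name = name
        self._content = content
        self.children = list(children)

    def getContent(self):
        return self._content

    def find(self, tag_name):
        """Return all descendants named `tag_name`."""
        found = []
        for child in self.children:
            if child.name == tag_name:
                found.append(child)
            found.extend(child.find(tag_name))
        return found


def make_must_contain(tag_name, tag_content, container_tag_name):
    """Return a predicate requiring a labelled tag plus a container tag."""

    def check(element):
        labels = [
            el for el in element.find(tag_name)
            if el.getContent().strip() == tag_content
        ]
        # Both conditions must hold: the label is there and so is the container.
        return bool(labels) and bool(element.find(container_tag_name))

    return check
```

A typical target would be a table row with a header cell holding a known label and a data cell holding the value to extract.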

harvester.scrappers.utils.content_matchs(tag_content, content_transformer=None)

Generate a function which checks whether the content of the tag matches tag_content.

Parameters:
  • tag_content (str) – Content of the tag, which will be matched through the whole DOM.
  • content_transformer (fn, default None) – Function used to transform the content of each tag before matching.

This function can be used as parameter for .find() method in HTMLElement.
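
The role of content_transformer can be shown with a short sketch. It assumes the transformer receives the content string of each element; FakeElement, getContent() and make_content_check are hypothetical names.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with text content."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def make_content_check(tag_content, content_transformer=None):
    """Return a predicate comparing (optionally transformed) content to `tag_content`."""

    def check(element):
        content = element.getContent()
        if content_transformer is not None:
            # Normalize the content first, e.g. strip whitespace or lowercase it.
            content = content_transformer(content)
        return content == tag_content

    return check
```

Passing a transformer such as lambda text: text.strip().lower() lets one expected value match content that differs only in case or surrounding whitespace.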

harvester.scrappers.utils.self_test_idiom(fn)

Perform a basic self-test.

Returns:

True when everything is ok.

Return type:

bool

Raises:

AssertionError – When there is some problem.