utils submodule

This module contains a number of functions that are used in the rest of the scrappers submodule.

harvester.scrappers.utils._get_encoding(dom, default='utf-8')

Look for a meta tag that declares the encoding in the given dom.

Parameters:	dom (obj) – pyDHTMLParser DOM of HTML elements.
	default (default “utf-8”) – Value to use if the encoding is not found in dom.
Returns:	Encoding found in dom, or the default parameter if none is found.
Return type:	str/default
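
The lookup described above can be illustrated with the standard library alone. This is only a minimal sketch, not the code of _get_encoding: it takes an HTML string instead of a pyDHTMLParser DOM, and the names _MetaCharsetFinder and find_encoding are hypothetical.

```python
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # HTML5 style: <meta charset="...">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip()
            return
        # HTML4 style: <meta http-equiv="Content-Type" content="text/html; charset=...">
        content = attrs.get("content") or ""
        lowered = content.lower()
        if "charset=" in lowered:
            start = lowered.index("charset=") + len("charset=")
            self.charset = content[start:].split(";")[0].strip()


def find_encoding(html, default="utf-8"):
    """Return the declared encoding, or default when no declaration is found."""
    finder = _MetaCharsetFinder()
    finder.feed(html)
    return finder.charset or default
```

Calling find_encoding on a page without any meta declaration simply hands back the default value, which mirrors the documented fallback.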

harvester.scrappers.utils.handle_encodnig(html)

Look for the encoding declared in the given html and try to convert html to UTF-8.
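
A conversion of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the body of handle_encodnig: it expects raw bytes, finds the charset with a simple regular expression, and the name to_utf8 and its default argument are hypothetical.

```python
import re

# Matches charset=... in both <meta charset="..."> and Content-Type declarations.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def to_utf8(raw_html, default="utf-8"):
    """Decode raw_html (bytes) with its declared charset and return UTF-8 bytes."""
    match = _CHARSET_RE.search(raw_html)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        text = raw_html.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back and replace undecodable bytes.
        text = raw_html.decode(default, errors="replace")
    return text.encode("utf-8")
```

The sketch leaves the charset declaration inside the markup untouched; only the byte representation changes.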

harvester.scrappers.utils.get_first_content(el_list, alt=None, strip=True)

Return the content of the first element in el_list, or alt. alt is also returned if the content string of the first element is blank.

Parameters:	el_list (list) – List of HTMLElement objects.
	alt (default None) – Value returned when the list or the content is blank.
	strip (bool, default True) – Call .strip() on the content.
Returns:	String representation of the content of the first element or alt if not found.
Return type:	str or alt
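
The documented behaviour can be sketched as follows. The class FakeElement is a hypothetical stand-in for HTMLElement, assumed here to expose a getContent() method, and first_content is an illustrative re-implementation rather than the library function.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with a getContent() method."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def first_content(el_list, alt=None, strip=True):
    """Return the content of the first element, or alt if list or content is blank."""
    if not el_list:
        return alt
    content = el_list[0].getContent()
    if strip:
        content = content.strip()
    # A blank content string is treated the same way as an empty list.
    if not content:
        return alt
    return content
```

With this shape, callers can write first_content(found, alt="n/a") and skip the usual length check on the list of matches.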

harvester.scrappers.utils.is_absolute_url(url, protocol='http')

Test whether url is an absolute URL (http://domain.tld/something) or a relative one (../something).

Parameters:	url (str) – Tested string.
	protocol (str, default “http”) – Protocol which will be sought at the beginning of the url.
Returns:	True if url is absolute, False if not.
Return type:	bool
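
One way to express such a test with the standard library is shown below. This is a sketch under assumptions, not the library code: looks_absolute is a hypothetical name, and treating protocol as a prefix of the scheme (so that “http” also accepts https) is a design choice of this example.

```python
from urllib.parse import urlsplit


def looks_absolute(url, protocol="http"):
    """Return True if url has a network location and a scheme starting with protocol."""
    parts = urlsplit(url.strip())
    # urlsplit() lowercases the scheme; relative URLs have an empty scheme and netloc.
    return bool(parts.netloc) and parts.scheme.startswith(protocol.lower())
```

For the two URLs quoted above, the sketch reports the first as absolute and the second as relative.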

harvester.scrappers.utils.normalize_url(base_url, rel_url)

Normalize the URL: create an absolute URL from a relative one.

Parameters:	base_url (str) – Domain with `protocol://` string.
	rel_url (str) – Relative or absolute url.
Returns:	Normalized URL or None if url is blank.
Return type:	str/None
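
The same idea can be sketched with urllib.parse.urljoin from the standard library. The function to_absolute is a hypothetical illustration and may differ from normalize_url in edge cases.

```python
from urllib.parse import urljoin


def to_absolute(base_url, rel_url):
    """Return an absolute URL, or None when rel_url is blank."""
    if not rel_url or not rel_url.strip():
        return None
    # urljoin() returns rel_url unchanged when it is already absolute.
    return urljoin(base_url, rel_url.strip())


print(to_absolute("http://domain.tld/a/b.html", "../c"))  # http://domain.tld/c
```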

harvester.scrappers.utils.has_param(param)

Generate a function that checks whether param is present in an HTML element.

The generated function can be passed to the .find() method of HTMLElement.
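
The function-generating pattern is an ordinary closure. The sketch below assumes that an element keeps its attributes in a params mapping; FakeElement and has_param_sketch are hypothetical names used only for illustration.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with a params mapping."""

    def __init__(self, tag, params=None):
        self.tag = tag
        self.params = params or {}


def has_param_sketch(param):
    """Return a predicate that is True for elements carrying the given parameter."""
    def checker(element):
        return param in element.params
    return checker


elements = [
    FakeElement("a", {"href": "/x"}),
    FakeElement("a"),
    FakeElement("img", {"src": "y.png"}),
]
# The predicate is built once and then applied to every candidate element.
with_href = [el for el in elements if has_param_sketch("href")(el)]
```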

harvester.scrappers.utils.must_contain(tag_name, tag_content, container_tag_name)

Generate a function that checks whether the given element contains a tag named tag_name with the string content tag_content, and also another tag named container_tag_name.

The generated function can be passed to the .find() method of HTMLElement.
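
A typical use would be picking the table row whose label cell reads a given string and which also has a value cell. The sketch below is one possible reading of that description, built on a hypothetical Node class; it is not the library implementation.

```python
class Node:
    """Hypothetical stand-in for an HTMLElement: tag name, text content, children."""

    def __init__(self, name, content="", children=()):
        self.name = name
        self.content = content
        self.children = list(children)

    def find(self, name):
        """Return all descendants with the given tag name."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.find(name))
        return found


def must_contain_sketch(tag_name, tag_content, container_tag_name):
    """Return a predicate that requires a matching label tag and a container tag."""
    def checker(element):
        labels = element.find(tag_name)
        if not any(el.content.strip() == tag_content for el in labels):
            return False
        return bool(element.find(container_tag_name))
    return checker


row = Node("tr", children=[Node("th", "ISBN"), Node("td", "123")])
other = Node("tr", children=[Node("th", "Author"), Node("td", "X")])
is_isbn_row = must_contain_sketch("th", "ISBN", "td")
```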

harvester.scrappers.utils.content_matchs(tag_content, content_transformer=None)

Generate a function that checks whether the content of a tag matches tag_content.

Parameters:	tag_content (str) – Content of the tag which will be matched through the whole DOM.
	content_transformer (fn, default None) – Function used to transform all tags before matching.

The generated function can be passed to the .find() method of HTMLElement.
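
The optional transformer is useful for normalizing whitespace or case before the comparison. The sketch below assumes the transformer receives the content string of the element; that detail, the FakeElement class and the name content_matches_sketch are assumptions made for this example.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with a getContent() method."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def content_matches_sketch(tag_content, content_transformer=None):
    """Return a predicate comparing (optionally transformed) content to tag_content."""
    def checker(element):
        content = element.getContent()
        if content_transformer is not None:
            content = content_transformer(content)
        return content == tag_content
    return checker


# Normalize whitespace and case before comparing.
is_isbn_label = content_matches_sketch("isbn", lambda s: s.strip().lower())
```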

harvester.scrappers.utils.self_test_idiom(fn)

Perform a basic self-test.

Returns:	True when everything is OK.
Return type:	bool
Raises:	`AssertionError` – When there is some problem.