utils submodule
This module contains a number of functions that are used in the rest of the scrappers submodule.
- harvester.scrappers.utils._get_encoding(dom, default='utf-8')
Try to find the encoding declared in a meta tag in the given dom.
Parameters: - dom (obj) – pyDHTMLParser dom of HTML elements.
- default (default “utf-8”) – Value to use if the encoding is not found in dom.
Returns: Given encoding or default parameter if not found.
Return type: str/default
- harvester.scrappers.utils.handle_encodnig(html)
Look for the encoding in the given html and try to convert the html to UTF-8.
Parameters: html (str) – HTML code as string.
Returns: HTML code encoded in UTF-8.
Return type: str
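The detect-then-convert step described above can be sketched as follows. This is only an illustration, not the module's code: it assumes Python 3 (where raw HTML arrives as bytes), assumes the encoding is declared with a charset= attribute inside a meta tag, and uses the hypothetical name handle_encoding_sketch.

```python
import re


def handle_encoding_sketch(html_bytes, default="utf-8"):
    """Hypothetical sketch: find a declared charset and re-encode as UTF-8."""
    # Assumption: the encoding is declared as charset=... somewhere in the markup.
    match = re.search(rb'charset=["\']?([\w-]+)', html_bytes, re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else default

    # Decode with the detected encoding, then encode the text as UTF-8.
    return html_bytes.decode(encoding, errors="replace").encode("utf-8")


page = '<meta charset="iso-8859-2"><p>\u017e</p>'
raw = page.encode("iso-8859-2")
print(handle_encoding_sketch(raw).decode("utf-8") == page)  # True
```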
- harvester.scrappers.utils.get_first_content(el_list, alt=None, strip=True)
Return the content of the first element in el_list, or alt. alt is also returned if the content string of the first element is blank.
Parameters: - el_list (list) – List of HTMLElement objects.
- alt (default None) – Value returned when the list or the content is blank.
- strip (bool, default True) – Call .strip() on the content.
Returns: String representation of the content of the first element or alt if not found.
Return type: str or alt
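The fallback behavior described above can be illustrated with a small sketch. FakeElement is a hypothetical stand-in that mimics only a getContent() method, and get_first_content_sketch is an assumed re-implementation written from the description, not the module's actual code.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement; only getContent() is mimicked."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def get_first_content_sketch(el_list, alt=None, strip=True):
    """Assumed behavior: content of the first element, or alt when missing/blank."""
    if not el_list:
        return alt

    content = el_list[0].getContent()
    if strip:
        content = content.strip()

    # A blank content string falls back to alt as well.
    return content if content else alt


print(get_first_content_sketch([FakeElement("  Title \n")]))      # Title
print(get_first_content_sketch([], alt="n/a"))                    # n/a
print(get_first_content_sketch([FakeElement("   ")], alt="n/a"))  # n/a
```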
- harvester.scrappers.utils.is_absolute_url(url, protocol='http')
Test whether url is an absolute URL (http://domain.tld/something) or a relative one (../something).
Parameters: - url (str) – Tested string.
- protocol (str, default “http”) – Protocol which will be looked for at the beginning of the url.
Returns: True if url is absolute, False if not.
Return type: bool
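One way such a check could work is sketched below. It assumes that "absolute" means the part before the first colon starts with the protocol name (so https also passes for the default "http") and that the rest starts with "//". The name is_absolute_url_sketch is hypothetical and the code is not taken from the module.

```python
def is_absolute_url_sketch(url, protocol="http"):
    """Hypothetical re-implementation written from the description above."""
    if ":" not in url:
        return False

    scheme, rest = url.split(":", 1)

    # Assumption: "https" counts as matching the "http" protocol prefix.
    return scheme.lower().startswith(protocol) and rest.startswith("//")


print(is_absolute_url_sketch("http://domain.tld/something"))  # True
print(is_absolute_url_sketch("../something"))                 # False
```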
- harvester.scrappers.utils.normalize_url(base_url, rel_url)
Normalize the URL: create an absolute URL from a relative one.
Returns: Normalized URL or None if url is blank.
Return type: str/None
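A comparable result can be obtained with the standard library, as in the sketch below. It assumes that joining base_url and rel_url with urllib.parse.urljoin is an acceptable approximation of the normalization described above. normalize_url_sketch is a hypothetical name, and the module itself may resolve URLs differently.

```python
from urllib.parse import urljoin


def normalize_url_sketch(base_url, rel_url):
    """Hypothetical stand-in built on the standard library's urljoin."""
    if not rel_url:
        return None  # blank input yields None, as the Returns line describes

    # urljoin leaves already-absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, rel_url)


print(normalize_url_sketch("http://domain.tld/a/b.html", "../something"))
# http://domain.tld/something
```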
- harvester.scrappers.utils.has_param(param)
Generate a function that checks whether param is present in an HTML element.
The generated function can be used as a parameter for the .find() method of HTMLElement.
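The "generate a function" pattern used here is a closure: the outer call fixes the parameter name and the returned function tests one element at a time. The sketch below assumes that elements expose their attributes through a params mapping. FakeElement and has_param_sketch are hypothetical names used only for illustration.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement with a params mapping."""

    def __init__(self, params):
        self.params = params


def has_param_sketch(param):
    """Return a checker that tells whether an element carries `param`."""
    def checker(element):
        # Assumption: element attributes live in a dict-like .params.
        return param in element.params

    return checker


has_href = has_param_sketch("href")
elements = [FakeElement({"href": "/x"}), FakeElement({"class": "y"})]
print(len([el for el in elements if has_href(el)]))  # 1
```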
- harvester.scrappers.utils.must_contain(tag_name, tag_content, container_tag_name)
Generate a function that checks whether the given element contains a tag_name tag with the string content tag_content and also another tag named container_tag_name.
The generated function can be used as a parameter for the .find() method of HTMLElement.
- harvester.scrappers.utils.content_matchs(tag_content, content_transformer=None)
Generate a function that checks whether the content of a tag matches tag_content.
Parameters: - tag_content (str) – Content of the tag which will be matched through the whole DOM.
- content_transformer (fn, default None) – Function used to transform all tags before matching.
The generated function can be used as a parameter for the .find() method of HTMLElement.
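A minimal sketch of such a matcher follows. It assumes that the transformer is applied to each element's content string before comparison and that elements offer a getContent() method. FakeElement and content_matches_sketch are hypothetical names, not part of the module.

```python
class FakeElement:
    """Hypothetical stand-in for an HTMLElement; only getContent() is mimicked."""

    def __init__(self, content):
        self._content = content

    def getContent(self):
        return self._content


def content_matches_sketch(tag_content, content_transformer=None):
    """Return a checker comparing (optionally transformed) content to tag_content."""
    def checker(element):
        content = element.getContent()
        if content_transformer is not None:
            # Assumption: the transformer receives the content string.
            content = content_transformer(content)
        return content == tag_content

    return checker


is_title = content_matches_sketch("Title", content_transformer=str.strip)
print(is_title(FakeElement("  Title \n")))  # True
print(is_title(FakeElement("Other")))       # False
```

Passing a transformer such as str.strip keeps the comparison itself exact while tolerating whitespace differences in the scraped markup.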