Creak: A Swift HTML Parsing Library

2016.06

Creak is a Swift library that parses an HTML document into a tree of nodes representing the document’s elements and text. This article walks through how that parsing works, step by step.

Parsing Process Overview

Initialization: The HTML string is loaded and cleaned.
Tokenization: The HTML string is broken down into tokens representing different parts of the HTML, such as tags and text.
Tree Construction: The tokens are used to construct a tree structure of nodes, representing the HTML document’s elements and text.

Key Components

Dom Class: Manages the overall parsing process and stores the root of the parsed HTML tree.
Content Class: Wraps the cleaned HTML string together with a read position, and provides the utility methods used to tokenize it.
HtmlNode and TextNode Classes: Represent the elements and text nodes in the HTML document.
Tag Class: Represents HTML tags and their attributes.
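
To make the relationship between these components concrete, here is a minimal sketch in current Swift syntax. The Sketch-prefixed types and their members are hypothetical stand-ins chosen for illustration. They are not Creak's actual declarations, which may differ in names and structure.

```swift
// Hypothetical stand-ins for the tag and node types described above.
final class SketchTag {
    let name: String
    var selfClosing = false
    var attributes: [String: String] = [:]
    init(name: String) { self.name = name }
}

// Base class: every node knows its parent element (weak, to avoid retain cycles).
class SketchNode {
    weak var parent: SketchElement?
}

// An element owns a tag and an ordered list of child nodes.
final class SketchElement: SketchNode {
    let tag: SketchTag
    private(set) var children: [SketchNode] = []

    init(tag: String) {
        self.tag = SketchTag(name: tag)
        super.init()
    }

    func addChild(_ child: SketchNode) {
        child.parent = self
        children.append(child)
    }
}

// A text node is a leaf that only carries its text.
final class SketchText: SketchNode {
    let text: String

    init(text: String) {
        self.text = text
        super.init()
    }
}

// Build the tree for <p class="intro">Hello</p> by hand.
let root = SketchElement(tag: "root")
let p = SketchElement(tag: "p")
p.tag.attributes["class"] = "intro"
p.addChild(SketchText(text: "Hello"))
root.addChild(p)
```

The parser described in the following sections produces this kind of tree from the HTML string instead of building it by hand.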

Detailed Parsing Steps

1. Initialization

The Dom class starts the parsing process. Its loadStr method stores the raw HTML string, cleans it, wraps the cleaned string in a Content object, calls parse, and returns the Dom itself so that calls can be chained.

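public func loadStr(str: String) -> Dom {
    raw = str                         // keep the original input
    let html = clean(str)             // clean the markup before parsing
    content = Content(content: html)  // wrap it for tokenizing
    parse()                           // build the node tree under root
    return self                       // allow call chaining
}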

2. Tokenization

The Content class provides utility functions to tokenize the HTML string. It includes methods to copy sections of the string, skip characters, and handle tokens such as tags and attributes.

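copyUntil: Copies characters from the current position up to, but not including, a specified delimiter, and leaves the read position on that delimiter.
skipByToken: Skips the run of characters that match a token class such as Content.Token.Blank, optionally returning the skipped characters (copy: true).

Together with copyByToken, fastForward, rewind, and char, which appear in the parsing code later in this article, these methods are used to identify and extract the different parts of the HTML, such as tags, attributes, and text content.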
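
The cursor-style tokenizing described above can be sketched as follows. This is a minimal illustration in current Swift syntax, not Creak's Content class: SketchCursor, its current property, and the closure-based skip(while:) are hypothetical names, and the real methods take different parameters (for example token constants rather than closures).

```swift
// Hypothetical cursor over an HTML string; a sketch, not Creak's Content class.
struct SketchCursor {
    private let chars: [Character]
    private(set) var pos = 0

    init(_ text: String) { chars = Array(text) }

    // Character at the current position, or nil at the end of input.
    var current: Character? { pos < chars.count ? chars[pos] : nil }

    // Copy characters up to, but not including, `stop`; the cursor ends on `stop`.
    mutating func copyUntil(_ stop: Character) -> String {
        let start = pos
        while let c = current, c != stop { pos += 1 }
        return String(chars[start..<pos])
    }

    // Skip a run of characters accepted by `matches` and return what was skipped.
    @discardableResult
    mutating func skip(while matches: (Character) -> Bool) -> String {
        let start = pos
        while let c = current, matches(c) { pos += 1 }
        return String(chars[start..<pos])
    }

    // Move forward by `count` characters without running past the end.
    mutating func fastForward(_ count: Int) { pos = min(pos + count, chars.count) }
}

var cursor = SketchCursor("Hi <b  class>")
let text = cursor.copyUntil("<")          // text before the tag
cursor.fastForward(1)                     // step over "<"
let name = cursor.copyUntil(" ")          // tag name
let blanks = cursor.skip { $0 == " " }    // run of blanks after the name
```

Because copyUntil leaves the cursor on the delimiter, the caller can inspect that character to decide what to parse next, which is the pattern the parsing code in this article relies on.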

3. Tree Construction

The parse method in the Dom class iterates through the HTML string, identifying tags and text, and building a tree structure of nodes (HtmlNode and TextNode).

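private func parse() {
    // Every parsed node hangs off a synthetic root element.
    root = HtmlNode(tag: "root")
    var activeNode: InnerNode? = root
    while activeNode != nil {
        // Copy everything up to the next "<"; an empty result means the cursor is on a tag.
        let str = content.copyUntil("<")
        if (str == "") {
            let info = parseTag()
            if !info.status {
                // No tag could be read (for example at the end of input), so stop.
                activeNode = nil
                continue
            }

            if info.closing {
                // Walk up to the nearest open element with the same name.
                let originalNode = activeNode
                while activeNode?.tag.name != info.tag {
                    activeNode = activeNode?.parent
                    if activeNode == nil {
                        // No matching opening tag: fall back to where we started.
                        activeNode = originalNode
                        break
                    }
                }
                // Close the element by making its parent the active node.
                if activeNode != nil {
                    activeNode = activeNode?.parent
                }
                continue
            }

            // A closing tag for a self-closing element yields no node.
            if info.node == nil {
                continue
            }

            let node = info.node!
            activeNode!.addChild(node)
            // Only elements that can contain children become the active node.
            if !node.tag.selfClosing {
                activeNode = node
            }
        } else if (trim(str) != "") {
            // Whitespace-only text is dropped; other text becomes a TextNode.
            let textNode = TextNode(text: str)
            activeNode?.addChild(textNode)
        }
    }
}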

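Root Node: Parsing starts with a synthetic root node (an HtmlNode with the tag “root”) that becomes the parent of all top-level nodes.
Active Node: The activeNode variable tracks the element that is currently open; new children are added to it.
Text Content: If copyUntil returns text that is not only whitespace, a TextNode is created and added to the active node.
Tag Parsing: If copyUntil returns an empty string, the cursor is on a tag and parseTag is called. An opening tag that is not self-closing becomes the new active node. A closing tag makes the parent of the nearest open element with the same name the active node.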
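
The active-node technique described above can be tried in isolation with a small, self-contained sketch in current Swift syntax. It is not Creak's implementation: MiniNode, buildTree, and the voidTags parameter are hypothetical names, tags are assumed to carry no attributes, and an unmatched closing tag is simply ignored here, which differs from the parse method shown above.

```swift
// Hypothetical node type; a sketch of the active-node technique, not Creak's code.
final class MiniNode {
    let name: String            // tag name, or "#text" for text nodes
    let text: String
    weak var parent: MiniNode?
    var children: [MiniNode] = []

    init(name: String, text: String = "") {
        self.name = name
        self.text = text
    }

    func addChild(_ child: MiniNode) {
        child.parent = self
        children.append(child)
    }
}

// Builds a tree from markup that contains only attribute-free tags and text.
func buildTree(_ html: String, voidTags: Set<String> = ["br", "img"]) -> MiniNode {
    let root = MiniNode(name: "root")
    var active = root
    var rest = Substring(html)
    while !rest.isEmpty {
        if rest.first == "<", let end = rest.firstIndex(of: ">") {
            var tag = rest[rest.index(after: rest.startIndex)..<end].lowercased()
            rest = rest[rest.index(after: end)...]
            if tag.hasPrefix("/") {
                tag.removeFirst()
                // Walk up to the nearest open element with this name,
                // then make its parent the active node.
                var candidate: MiniNode? = active
                while let node = candidate, node.name != tag { candidate = node.parent }
                if let match = candidate, let parent = match.parent { active = parent }
                continue
            }
            let selfClosing = tag.hasSuffix("/")
            if selfClosing { tag.removeLast() }
            let node = MiniNode(name: tag)
            active.addChild(node)
            // Only elements that can hold children become the active node.
            if !selfClosing && !voidTags.contains(tag) { active = node }
        } else {
            // Text runs up to the next "<" (or to the end of input).
            let end = rest.dropFirst().firstIndex(of: "<") ?? rest.endIndex
            let text = String(rest[..<end])
            rest = rest[end...]
            if !text.allSatisfy({ $0.isWhitespace }) {
                active.addChild(MiniNode(name: "#text", text: text))
            }
        }
    }
    return root
}

let tree = buildTree("<div><p>Hello<br>world</p></div><span>!</span>")
let para = tree.children[0].children[0]   // the <p> element inside <div>
```

In this example the br element is added as a child of p without becoming the active node, so the text that follows it is still attached to p.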

Parsing Tags

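The parseTag method reads one tag at the current position and reports the result in a ParseInfo value: status is false when no tag could be read, closing and tag describe a closing tag, and node holds the HtmlNode built for an opening tag.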

private func parseTag() -> ParseInfo {
    var result = ParseInfo()
    if content.char() != ("<" as Character) {
        return result
    }
    
    if content.fastForward(1).char() == "/" {
        // Closing tag: read the name, then skip to just past ">".
        var tag = content.fastForward(1).copyByToken(Content.Token.Slash, char: true)
        content.copyUntil(">")
        content.fastForward(1)

        tag = tag.lowercaseString
        if selfClosing.contains(tag) {
            // A closing tag for a self-closing element is consumed and ignored.
            result.status = true
            return result
        } else {
            result.status = true
            result.closing = true
            result.tag = tag
            return result
        }
    }
    
    let tag = content.copyByToken(Content.Token.Slash, char: true).lowercaseString
    let node = HtmlNode(tag: tag)
    
    while content.char() != ">" &&
       content.char() != "/" {
        let space = content.skipByToken(Content.Token.Blank, copy: true)
        if space?.characters.count == 0 {
            content.fastForward(1)
            continue
        }
        
        let name = content.copyByToken(Content.Token.Equal, char: true)
        if name == "/" {
            break
        }
        
        if name == "" {
            content.fastForward(1)
            continue
        }
        
        content.skipByToken(Content.Token.Blank)
        if content.char() == "=" {
            content.fastForward(1).skipByToken(Content.Token.Blank)
            var attr = AttrValue()
            let quote: Character? = content.char()
            if quote != nil {
                if quote == "\"" {
                    attr.doubleQuote = true
                } else {
                    attr.doubleQuote = false
                }
                content.fastForward(1)
                var string = content.copyUntil(String(quote!), char: true, escape: true)
                var moreString = ""
                repeat {
                    moreString = content.copyUntilUnless(String(quote!), unless: "=>")
                    string += moreString
                } while moreString != ""
                attr.value = string
                content.fastForward(1)
                node.setAttribute(name, attrValue: attr)
            } else {
                attr.doubleQuote = true
                attr.value = content.copyByToken(Content.Token.Attr, char: true)
                node.setAttribute(name, attrValue: attr)
            }
        } else {
            node.tag.setAttribute(name, attrValue: AttrValue(nil, doubleQuote: true))
            if content.char() != ">" {
                content.rewind(1)
            }
        }
    }
    
    content.skipByToken(Content.Token.Blank)
    if content.char() == "/" {
        // Explicit "/>" ending.
        node.tag.selfClosing = true
        content.fastForward(1)
    } else if selfClosing.contains(tag) {
        // Known self-closing element written without a trailing slash.
        node.tag.selfClosing = true
    }
    
    content.fastForward(1)
    
    result.status = true
    result.node = node
    
    return result
}

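Tag Identification: The method checks the character after “<”. A “/” marks a closing tag, for which only the lowercased name is returned; anything else is treated as an opening tag.
Attributes: For an opening tag, it reads name and value pairs until it reaches “>” or “/”, records whether each value was double-quoted, stores attributes without a value as nil, and adds them to the HtmlNode.
Self-closing Tags: A tag is marked self-closing if it ends with “/” or if its name is in the selfClosing list. A closing tag for a name in that list is consumed and otherwise ignored.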
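
The attribute handling summarized above can be illustrated with a reduced, self-contained sketch in current Swift syntax. It is not Creak's parseTag: SketchAttribute and parseOpeningTag are hypothetical names, the input is assumed to be the text between “<” and “>”, and escaped quotes inside values are not handled.

```swift
// Hypothetical attribute record; doubleQuote mirrors the flag set in the code above.
struct SketchAttribute: Equatable {
    let name: String
    let value: String?      // nil for a bare attribute such as "disabled"
    let doubleQuote: Bool
}

// Splits the text between "<" and ">" into a tag name, attributes, and a self-closing flag.
func parseOpeningTag(_ inner: String) -> (name: String, attributes: [SketchAttribute], selfClosing: Bool) {
    var chars = Array(inner)
    var selfClosing = false
    // Detect and remove a trailing "/" (as in "<br />").
    while let last = chars.last, last == " " { chars.removeLast() }
    if chars.last == "/" {
        selfClosing = true
        chars.removeLast()
    }

    var pos = 0
    func skipBlanks() {
        while pos < chars.count, chars[pos] == " " { pos += 1 }
    }
    // Copy characters until one of `stops` is reached; the cursor stays on it.
    func copy(until stops: Set<Character>) -> String {
        let start = pos
        while pos < chars.count, !stops.contains(chars[pos]) { pos += 1 }
        return String(chars[start..<pos])
    }

    let name = copy(until: [" "]).lowercased()
    var attributes: [SketchAttribute] = []
    while true {
        skipBlanks()
        if pos >= chars.count { break }
        let attrName = copy(until: [" ", "="])
        if attrName.isEmpty {
            pos += 1                      // stray "=": skip it and keep going
            continue
        }
        skipBlanks()
        if pos < chars.count, chars[pos] == "=" {
            pos += 1
            skipBlanks()
            if pos < chars.count, chars[pos] == "\"" || chars[pos] == "'" {
                let quote = chars[pos]
                pos += 1                  // step over the opening quote
                let value = copy(until: [quote])
                pos += 1                  // step over the closing quote
                attributes.append(SketchAttribute(name: attrName, value: value, doubleQuote: quote == "\""))
            } else {
                // Unquoted value: runs to the next blank.
                attributes.append(SketchAttribute(name: attrName, value: copy(until: [" "]), doubleQuote: true))
            }
        } else {
            // Attribute without "=value".
            attributes.append(SketchAttribute(name: attrName, value: nil, doubleQuote: true))
        }
    }
    return (name, attributes, selfClosing)
}

let parsed = parseOpeningTag("A href=\"/home\" class='nav' disabled /")
```

Here parsed.name is the lowercased tag name, parsed.attributes lists href, class, and disabled in source order, and parsed.selfClosing reflects the trailing slash.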

Conclusion

Creak parses HTML in three steps: it cleans the input string, tokenizes it, and builds a tree of nodes. The Dom class drives the process, the Content class provides the tokenizing utilities, the HtmlNode and TextNode classes represent elements and text, and the Tag class holds each tag’s name and attributes. Keeping these responsibilities separate makes the parser straightforward to follow.
