Dabbling Badger

Parsing Character Entities from HTML/XML Content In Swift

September 23, 2021 Jonathan Badger
Photo by Valery Sysoev on Unsplash

The Web today is a wonderfully addictive mashup of culture, commerce, and technologies, both old and new. As an iOS developer, interacting with the Web is usually trivial. Make a request to a REST API endpoint, get back data, decode data. Boom. Done. At least, that’s what I thought, until I ran into odd substrings like ‘&quot;’ and ‘&amp;’ in an XML-formatted response I was parsing. These curious bits of text are known as character entity references (CERs) and, in this case, stand for the quote (") and ampersand (&) characters respectively. In this article I will provide a bit of background for us non-web developers about why CERs exist and give two practical methods for decoding them in Swift.

What are Character Entity References and why do they exist?

As I mentioned at the start of the article, CERs are basically character codes within a string that are sandwiched between an ampersand and a semicolon. Your web browser recognizes these codes and automatically replaces them with the appropriate character for rendering on your screen. So, HTML with ‘5 &lt; 6’ renders to ‘5 < 6’. The World Wide Web Consortium (W3C) has a spiffy interactive chart you can have a look at if you are interested in seeing more. For those who want to dig deep, you can read the wiki entry or have a close look at the official HTML spec.
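To make the substitution concrete, here is a minimal sketch of what a decoder has to do for the five predefined XML entities. The name decodeBasicEntities is made up for this illustration, and chaining replacingOccurrences calls is deliberately naive. One detail matters even at this size: ‘&amp;’ is replaced last, so that a string like ‘&amp;lt;’ decodes once to ‘&lt;’ rather than twice to ‘<’.

```swift
import Foundation

// Hypothetical helper for illustration only: decodes the five predefined XML entities.
func decodeBasicEntities(_ text: String) -> String {
    // Order matters: "&amp;" goes last so "&amp;lt;" is decoded once, not twice.
    let replacements: [(entity: String, character: String)] = [
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&quot;", "\""),
        ("&apos;", "'"),
        ("&amp;", "&"),
    ]
    var result = text
    for (entity, character) in replacements {
        result = result.replacingOccurrences(of: entity, with: character)
    }
    return result
}

print(decodeBasicEntities("5 &lt; 6"))   // 5 < 6
print(decodeBasicEntities("&amp;lt;"))   // &lt;
```

This is fine for a handful of entities, but it rescans the string once per entity and knows nothing about numeric references, which is why the two options later in the article take a different route.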

Character Entity References exist for a number of reasons:

  1. To allow for inclusion of reserved characters in HTML. Just like any programming language, HTML has characters that are reserved for the language itself. Most of us have seen at least a little HTML: each tag begins with < and ends with >. Is it any surprise these characters are reserved?

  2. To allow for characters not included in the encoding format of the document (90% of web pages are UTF-8 these days, so I think this is mostly for edge cases).

  3. As a convenience to the document writers (web devs) for characters that aren’t included on a standard keyboard. Writing ‘&copy;’ is a lot faster and more efficient than going fishing for the copyright symbol in a special characters library.
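Reason 1 is easiest to see from the other direction. The sketch below shows what an author (or a server) has to do to get literal reserved characters into markup. The function escapingReservedCharacters is a name invented here, not part of any library, and it only covers four characters.

```swift
// Hypothetical helper: escapes the characters that HTML/XML reserve for markup.
func escapingReservedCharacters(_ text: String) -> String {
    var escaped = ""
    // A single pass over the input means an "&" we emit is never escaped again.
    for character in text {
        switch character {
        case "&": escaped += "&amp;"
        case "<": escaped += "&lt;"
        case ">": escaped += "&gt;"
        case "\"": escaped += "&quot;"
        default: escaped.append(character)
        }
    }
    return escaped
}

print(escapingReservedCharacters("if a < b && b > c"))
// if a &lt; b &amp;&amp; b &gt; c
```

Decoding CERs, the subject of the rest of this article, is simply the reverse of this step.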

Dealing with CERs in Swift

Now that we have the background out of the way, I will show you two methods you can use to ‘find and replace’ CERs in Swift.

Option 1: NSAttributedString

If we dip into Objective-C territory (not very Swift-like, I know), NSAttributedString already has a lot of functionality built around parsing HTML. Here is an alternate initializer for String that handles CERs:

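// The HTML document type comes from UIKit on iOS and from AppKit on macOS.
#if canImport(UIKit)
import UIKit
#elseif canImport(AppKit)
import AppKit
#endif

extension String {
    /// Creates a string by parsing `htmlEncodedString` as HTML and keeping only its plain text.
    init?(htmlEncodedString: String) {
        guard let data = htmlEncodedString.data(using: .utf8) else {
            return nil
        }
        // Tell the importer to treat the data as UTF-8 encoded HTML.
        let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
        guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
            return nil
        }
        // Discard the attributes and keep the decoded characters.
        self.init(attributedString.string)
    }
}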

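Here we initialize an attributed string, specifying the DocumentType as .html. Character entity references are automatically replaced with the appropriate characters on initialization, so all we have to do is read the .string property and we are done! One caveat from Apple’s documentation: the HTML importer should not be called from a background thread, so use this initializer on the main thread. Because the initializer is failable, the result is an optional that needs unwrapping. It can be used like: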

let htmlString = "Easy peasy lemon squeezy. &#127819;"
if let fixedString = String(htmlEncodedString: htmlString) {
    print(fixedString)
}

Easy peasy lemon squeezy. 🍋

Option 2: Regular Expression Matching

For the second technique we will write our own function that uses a dictionary of CER -> Character mappings and regular expressions to perform character substitution manually.

Our dictionary will look like this:

let characterEntities : [String: Character] = [

    // XML predefined entities:
    "&quot;"     : "\"",
    "&amp;"      : "&",
    "&apos;"     : "'",
    "&lt;"       : "<",
    "&gt;"       : ">",

    // HTML character entity references:
    "&nbsp;"     : "\u{00A0}",
    "&iexcl;"    : "\u{00A1}", ...]

I’ve left out the full list of CER : character mappings, but you get the idea. As for the rest of the implementation, let’s write a new function as an extension on String so character substitution is available whenever we need it. Here is the full code:

import Foundation

extension String {
    /// Returns a copy of the string with known character entities and decimal
    /// numeric references (e.g. "&#127819;") replaced by their characters.
    func replacingCharacterEntities() -> String {
        // Converts a decimal reference such as "&#127819;" into a Unicode scalar.
        func unicodeScalar(for numericCharacterEntity: String) -> Unicode.Scalar? {
            let digits = numericCharacterEntity.filter { "0123456789".contains($0) }
            guard let scalarValue = UInt32(digits) else { return nil }
            return Unicode.Scalar(scalarValue)
        }

        var result = ""
        var position = self.startIndex

        let range = NSRange(self.startIndex..<self.endIndex, in: self)
        let pattern = #"(&\S*?;)"#
        // Anchored so that only a complete decimal reference qualifies.
        let unicodeScalarPattern = #"^&#\d+;$"#

        guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return self }
        regex.enumerateMatches(in: self, options: [], range: range) { match, _, _ in
            guard let match = match,
                  let matchRange = Range(match.range(at: 0), in: self) else { return }
            // Copy the text between the previous match and this one.
            result.append(contentsOf: self[position..<matchRange.lowerBound])
            let characterEntity = String(self[matchRange])
            if let replacement = characterEntities[characterEntity] {
                result.append(replacement)
            } else if characterEntity.range(of: unicodeScalarPattern, options: .regularExpression) != nil,
                      let scalar = unicodeScalar(for: characterEntity) {
                result.append(String(scalar))
            } else {
                // Unknown entity: keep the original text instead of silently dropping it.
                result.append(characterEntity)
            }
            position = matchRange.upperBound
        }
        // Append whatever follows the last match.
        result.append(contentsOf: self[position..<self.endIndex])
        return result
    }
}

So what is this function doing? In essence, we take our original string, look for substrings that match our pattern, iterate over the matches, and build up the result string by using the range of each match to replace any CERs. For those unfamiliar with NSRegularExpression, there is an excellent article written by Matt on NSHipster that offers background, examples, and explanations. And while I’m directing you away from this article, I should also recommend regex101.com, an interactive website I use all the time for prototyping regex patterns.
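The function above only recognizes decimal numeric references such as ‘&#127819;’. HTML and XML also allow a hexadecimal form, for example ‘&#x1F34B;’ for the same lemon. Below is a sketch of how one might decode both forms. The helper scalar(fromNumericReference:) is hypothetical and not part of the code above, and it assumes it is handed exactly one complete reference.

```swift
import Foundation

// Hypothetical helper: converts "&#127819;" or "&#x1F34B;" into a Unicode scalar.
// Assumes `reference` is exactly one numeric reference, including "&#" and ";".
func scalar(fromNumericReference reference: String) -> Unicode.Scalar? {
    guard reference.hasPrefix("&#"), reference.hasSuffix(";") else { return nil }
    // Strip the leading "&#" and the trailing ";".
    var body = reference.dropFirst(2).dropLast()
    var radix = 10
    if body.hasPrefix("x") || body.hasPrefix("X") {
        radix = 16
        body = body.dropFirst()
    }
    // UInt32(_:radix:) returns nil when there are no digits or a digit is invalid.
    guard let value = UInt32(body, radix: radix) else { return nil }
    // Unicode.Scalar(_:) returns nil for surrogates and out-of-range values.
    return Unicode.Scalar(value)
}

if let lemon = scalar(fromNumericReference: "&#x1F34B;") {
    print(String(lemon))   // 🍋
}
```

To use something like this inside replacingCharacterEntities(), the numeric pattern would also need to accept the optional x prefix and hexadecimal digits.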

This new function can be called on any string as in:

let htmlString = "Easy peasy lemon squeezy. &#127819;"
print(htmlString.replacingCharacterEntities())

Easy peasy lemon squeezy. 🍋

Conclusion

Thanks for reading.

References

  • https://nshipster.com/swift-regular-expressions/

  • https://www.w3.org/TR/html4/cover.html#minitoc

  • https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555

  • https://www.swiftbysundell.com/articles/string-literals-in-swift/

In Programming Tags Swift, Web

E-mail: dabblingbadger@gmail.com

Copyright © 2023 Dabbling Badger LLC
