Generic HTML Sanitizer Bypass Investigation
I stumbled over a weird HTML behavior on Twitter and started to investigate it. Did I just stumble over a generic HTML Sanitizer bypass?
The Tweet: / 1662701541680136195
Google XSS: • XSS on Google Search -...
HTML Spec: html.spec.whatwg.org/multipag...
Chapters:
00:00 - Intro
01:09 - Sanitizing vs. Encoding
02:32 - Developing HTML Sanitizer Bypass
05:03 - Attacking DOMPurify
07:08 - Attacking Server-side Sanitizer
08:31 - HTML Parse Error Specification
10:08 - Potential Impact
11:55 - hextree.io
I'm generally someone who likes to implement stuff themselves, instead of using an external dependency, but stuff like this is why I normally don't touch security-related things (like HTML sanitization) myself and go for an existing solution instead.
I once did it as a personal challenge, and you end up with a list of common bugs that might lead to other problems. You can work through them one by one, but then you get another list and another kick. It is fun to try, though.
I just remember log4j
Same, I prefer to implement myself most things but Date/Timezone management and security related stuff are 2 topics I don't want to touch if possible
@@rosco3 Oh god yes, f*ck Time management of any kind. Especially in JS.
Fun fact, HTML sanitization and this specific weirdness is what prevented a hacker's malware from working on my phone(s). 😅 They shit in their own hand on that one
I'd read before that valid HTML tags can't start with a number so I wasn't surprised that was the root issue, I wasn't aware of how the specification detailed parsing the situation however and now I'm wondering the logic behind _why_ they specify it should be handled like _that_ of all things 🤔 My only guess is to make sure it mangles the output sufficiently as to hopefully make the developer notice something's wrong and fix it, but surely that could be done more gracefully... or maybe it's just grandfathered in from the quirky behaviour of some early parser?
The HTML 5 spec is nearly entirely trying to nail down the least broken interpretation of existing content written against the wacky browsers of the time.
, etc. are a rare way to count items and basically ignoring , ... by parsing them as comments might be a concession to broken markup cleaners that tried to close those non-tags.
About the Chomsky hierarchy: what we call regex is not actually🤓 type-3/regular, it often has operations, such as repeating the match group with \1, that are only present in type-1 grammars. There is a somewhat well-known regex for prime numbers, and it is impossible to construct a corresponding state machine for it.
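A minimal sketch of the prime-number regex mentioned above, in Python. The pattern works on unary strings (n written as n copies of "1") and relies on the backreference \1, which a plain finite state machine cannot emulate. The helper name is_prime is made up here for illustration.

```python
import re

# Matches unary strings whose length is NOT prime:
#   ^1?$         -> length 0 or 1
#   ^(11+?)\1+$  -> a group of two or more ones, repeated at least twice (composite)
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n: int) -> bool:
    """Hypothetical helper: test primality by matching n written in unary."""
    return NOT_PRIME.match("1" * n) is None


if __name__ == "__main__":
    print([n for n in range(20) if is_prime(n)])
```

The backreference forces the engine to remember an arbitrarily long match and compare against it later, which is exactly what a regular (type-3) language cannot express.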
This video is great. I've been a full stack web developer now for about 15 years and I learned quite a bit about HTML parsing. The onerror attribute isn't something I would think of at all because quite frankly I write JavaScript Event and Error handlers the recommended/modern way.
There is also a new JS method on all HTML elements: setHTML(input, options). It's basically innerHTML, but sanitizes the input. So I think it's just like DOMPurify, but natively in the browser.
as it's really new... it might contain issues....
@@vaisakhkm783 Hey if you find any I bet there is a reward from google...
Cool, but it's not supported in FF or Safari.
@@vaisakhkm783 IDK, I'd trust browser authors more with sanitizing HTML, since they wrote their own parser. It wouldn't surprise me if it used some of the browser's own HTML parsing logic
@@spicybaguette7706 until people figure out some no-click exploit
When I saw your thumbnail, I instantly tried it out with different numbers and I also tried random tags, starting with numbers. I thought okay, variables can't start with numbers, so I guess that also applies to HTML tags. I would have never thought about any security bypasses on my own but then I became curious and watched your video.
The popular Java library for this, Jsoup, also looks to handle this correctly. The basic input turns into <22> though it strips the comment for the closing tag.
Oh that is nice, I love jsoup
The UI for your courses looks really nice!
I was nervous when you tried dompurify, cause we heavily rely on it in some of our projects.
If you wasted your time researching this, then what have I done by watching!? Haha, great vid, interesting results.
8:00 funny you mentioned the syntax highlighting, because it was the FIRST thing my brain said when you first wrote it in your editor at the start of the video! 😂
Good video. You are never defeated if you end up learning something 💪
We should only allow standard tags (and whitelisted/predefined custom tags) and explicitly close them, and just refuse to parse and throw an error if the document is invalid. It's just stupid to allow arbitrary syntax and try to parse it.
Markup languages seldom throw errors. It'd also be annoying to have an entire document not render for you just because they used something your renderer doesn't allow. But you can enforce well-formed HTML as a style guide for your project, which many people already do.
@@seanthesheep Or block execution in case of malformed syntax?
Hey LiveOverflow, how about CTF challenges as hextree courses? I think those would nicely build onto your existing YouTube videos.
I learned something new, very interesting thanks.
In Python, both the original htmllib.HTMLParser, which was built on top of the SGML parser and no longer exists in Python 3, and the current html.parser.HTMLParser handle this according to the specification.
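A small sketch to try this with the standard library's html.parser.HTMLParser. The EventLogger class and the tokenize helper are names made up for this example; only the handle_* callbacks belong to the real API.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records which parser callbacks fire for a piece of markup."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.comments = []
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def tokenize(markup):
    """Hypothetical helper: return (joined text, comments, start tags)."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.comments, parser.start_tags


if __name__ == "__main__":
    print(tokenize("<22>foo</22>"))
```

With the input `<22>foo</22>`, the opening part should come back as plain text, the closing part as a comment containing `22`, and no start tag should be reported at all.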
I instantly remembered that Astro lets you define a variable to alias a component or html tag, so I instantly went and tried Wat=22. Got the same behaviour described in the spec, but Astro leaked the classname into the HTML content, so I guess I got a bug report to make... EDIT: I tried around and there is just enough sanitization to break any XSS I could think of so far. Shoutouts to the Astro team I guess 😹
I am just learning html, and I was so confused when you were calling an html tag. Only for the conclusion to the video be as simple as “it’s not one”. Like wow, what a shocker
I wish hextree was open... I am so excited!
You learn something everyday.
Say thanks to the big brains who didn't want to go with xhtml which was waaaaay more restrictive than html5
Great content for junior 👍🏽
I remember back in the day many people sanitized for javascript links by checking if the URL starts with javascript: which makes sense I guess... but then IE7 allowed for a tab (or was it some other char? can't remember) character in front of the javascript: part.
Another totally unrelated thing I found to be common a decade or so ago was not escaping float arguments in SQL. Right around when the Gmap API was the next thing everyone wanted, I found so many sites where you could SQL-inject the lat/lng arguments on the endpoint for map data. This included the largest private buy-and-sell site in my country at the time, but their parent company scolded me at a job interview so I never told them about it >:) Nowadays everybody thankfully uses data binding instead of concatenation when building SQL in their applications. (And yes, I was able to get the user table by reading the database structure from INFORMATION_SCHEMA in MySQL and absolutely pwn the shit out of them, but I'm a nice guy who does this shit just for bragging rights.)
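To illustrate the "data binding instead of concatenation" point, here is a minimal sketch using Python's built-in sqlite3 module. The places table, its columns, and the find_places helper are invented for the example and do not describe any real site.

```python
import sqlite3

# Hypothetical in-memory table standing in for a map-data backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, lat REAL, lng REAL)")
conn.execute("INSERT INTO places VALUES ('Berlin', 52.52, 13.40)")


def find_places(db, lat, lng, radius=0.5):
    """Hypothetical helper: look up places near a lat/lng pair."""
    # Validate first: float() raises ValueError for anything that is not a number.
    lat_f, lng_f = float(lat), float(lng)
    # Bind the values as parameters; they are never spliced into the SQL text.
    rows = db.execute(
        "SELECT name FROM places WHERE ABS(lat - ?) <= ? AND ABS(lng - ?) <= ?",
        (lat_f, radius, lng_f, radius),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(find_places(conn, "52.5", "13.4"))
```

A request such as lat="52.5 OR 1=1" fails at the float() conversion, and even without that check the bound parameter would only ever be treated as a value, not as SQL.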
@@blenderpanzi I really wanted to reply but youtube keeps deleting it lol.
@@ET_AYY_LMAO You can't include any URLs in YouTube comments. They get auto-deleted.
@@blenderpanzi URLs cover more than http, there are other protocols and pseudo protocols like mailto that could be 100% legit use cases as well as relative URLs. But yes, always whitelist!
@@ET_AYY_LMAO Yes, as I said, you might over-block, but that is not as bad as having an injection. Add mailto: to the list of allowed protocols if you want to allow that. :D
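A rough sketch of the allow-list idea from this thread, in Python. The scheme set and the is_allowed_url helper are assumptions made for illustration. It deliberately over-blocks: relative URLs and anything without an explicitly allowed scheme are rejected.

```python
import re

# Assumption for this sketch: only these schemes are considered safe.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# ASCII control characters and spaces, which some browsers have ignored
# inside or in front of a scheme (the tab trick mentioned above).
_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]")
_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.-]*):")


def is_allowed_url(url: str) -> bool:
    """Hypothetical helper: accept only URLs with an allow-listed scheme."""
    cleaned = _CONTROL_OR_SPACE.sub("", url)
    match = _SCHEME.match(cleaned)
    if match is None:
        return False  # no scheme at all: rejected (over-blocking on purpose)
    return match.group(1).lower() in ALLOWED_SCHEMES


if __name__ == "__main__":
    for candidate in ["https://example.com", "\tjavascript:alert(1)", "mailto:a@example.com"]:
        print(repr(candidate), is_allowed_url(candidate))
```

Checking what is allowed instead of what is forbidden means a new prefix trick leads to a rejected link at worst, not to an injection.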
"one day i was scrolling through twitter", twitter does not exist anymore
Elon changed it to Tweslla
Twitter in 2006: "It's like text-messages, but for companies to do PSAs and engage with their audience like 'Our website is down, sorry!' or '25% discount at XYZ on Friday'" Twitter in 2023: "yOU mUsT sIgN uP tO sEe tHiS tWeEt"
@@ET_AYY_LMAO Twitter in 2006: nobody cared Twitter in 2023: nobody cares
@@jaydeep-pok
I can't wait for the day that someone responds to this with something along the lines of, "wow, I can't believe you predicted the future!!!". Twitter will fall and after a couple awkward years (shit, months?) the Internet collective will find a new hate machine where people say funny little things and receive death threats in response. What a time to be alive 😔
@LiveOverflow Do you know of any good resources for help with finding bugs? What method do you use to find a bug that looks like a potential vulnerability?
I can't remember any instance of variables starting with a number being valid. Also, we did do a simple html parser in 3rd year cs and the alpha as a first letter is the first thing we put in 😂
But if your buggy non-standard HTML parser then spits out normalized HTML with any < > & properly encoded and any tag/attribute/attribute value that is not explicitly allow-listed removed, no injections should be possible either, right?
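A sketch of what such an "only emit what is allow-listed" serializer could look like, built on Python's html.parser for convenience. The tag list, the AllowListSanitizer class, and the sanitize helper are assumptions for this example. It is not a vetted sanitizer and does not balance unclosed tags.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: a tiny set of formatting tags, no attributes at all.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class AllowListSanitizer(HTMLParser):
    """Re-serializes input, emitting only allow-listed tags and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # every attribute is dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))  # encodes &, <, > and quotes

    # Comments, declarations and processing instructions are never emitted,
    # because the default handlers for them do nothing.


def sanitize(markup: str) -> str:
    """Hypothetical helper: return normalized, allow-listed HTML."""
    parser = AllowListSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize("<22><img src=x onerror=alert(1)></22><b>hi</b>"))
```

The point of the design is that the output is built from scratch out of known-good pieces, so a parsing disagreement with the browser can change what text is shown but should not let a new tag or attribute through.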
Is it possible to tell the browser how it should handle errors and dubious code? Is it possible to let the browser check the syntax first, and only allow it to render when that check succeeds?
thank you thank you thank you for showing a "failure".
Is the invite code to hextree meant to be a challenge to be brute-forced?
This is so cool!
Hextree signup is disabled. Is this part of the test? 😅
Push!
good to know thanks
12:14 is this an advertisement for hextree and club mate at the same time? :D
Yeah, it's the same for any other programming language. No variable name can start with a number. HTML tags are no different
I think I’ve seen this happen in a bug converting markdown to html before too
You're awesome
I just wrote my own HTML minimizer. When the video started I instantly knew what the issue was. The HTML spec is like C++. They are insane. IMHO, the HTML grammar is simply * . There's no way to get regex (or even EBNF) to HTML without edge cases of syntax error.
There's no way to get a regex for HTML, period. It is provably impossible, and applies to any language where the syntax requires you to match opening and closing brackets or any equivalent thing such as tags.
so interesting and fun to watch, thx, kiss lol
LiveOverFlow!!!! !!!!
11:01 I think if you add [\s]*[a-z][a-zA-Z0-9]* to the regex right after the < it should make it more spec compliant
You can have hyphens in tag names. Custom elements need it as in
Will you continue the Minecraft Hacked series?
Will you open hextree to external content creators?
Love how this wouldn't even be an issue if you just turned ALL the < into &lt;
You seem to have missed the bit at 1:15.
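For reference, encoding everything is a one-liner in most languages. A minimal Python sketch with the standard library's html.escape (the payload string is just an example input):

```python
from html import escape

# Example input only; any untrusted string works the same way.
payload = "<22><img src=x onerror=alert(1)></22>"

# escape() replaces &, < and > (and quotes by default) with character references.
encoded = escape(payload)

if __name__ == "__main__":
    print(encoded)
```

After encoding, the browser shows the markup as literal text. The catch, and the reason sanitizers exist at all, is that this also removes every bit of formatting the user was supposed to be allowed to use.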
My first thought was: HTML tags can't start with digits, so it interprets it as a text literal. the becoming a comment is a surprise though
How did I just find out that we share first names :D
When is hextree going to open up?
Registration for hextree is not open? I really want to try out hardware
Server side sanitizers are a thing of the past. Client side sanitization bypasses are more interesting.
Club-Mate bro :)
thanks for sharing "anyway" 🙂
CLUB MATE!
Ad popped one minute in the video, great ...
weirdly, I am more aware of Chomsky hierarchy through language and culture studies, and not through computer science. And my day job title is software engineer.
This is expected behavior
Seems odd to me that the opening and closing tags are treated differently. It would make sense if the closing tag was also treated as text. I suppose what is happening here is that the default behavior when the parser encounters a closing tag that is missing a corresponding and valid opening tag, is to turn it into a comment. But this check happens before the check for whether the tag name is valid. So that check is never made on the closing tag, because it's already been turned into a comment by the previous check. That makes sense, but why do this intentionally and make it part of the spec? Intuitively, it would make much more sense to require that the opening and closing tags of an invalid tag name be treated the same, and have this check happen before the other check. Then the output would have made more sense, you probably wouldn't have questioned the behavior, and you wouldn't have spent an hour on figuring out why.
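For what it's worth, my reading of the spec's tokenizer is that the two cases are decided by two separate states ("tag open" and "end tag open"), purely on the character that follows `<` or `</`, not on whether a matching opening tag exists. A rough Python sketch under that reading; classify and its return labels are names made up here, not anything from the spec or a library.

```python
import string

ASCII_ALPHA = set(string.ascii_letters)


def classify(markup: str) -> str:
    """Hypothetical helper: what happens at a leading '<', per the tokenizer states."""
    if not markup.startswith("<"):
        return "text"
    nxt = markup[1:2]
    if nxt == "/":
        # "End tag open" state
        after = markup[2:3]
        if after in ASCII_ALPHA:
            return "end tag"
        if after == ">":
            return "ignored"        # "</>" is dropped (missing-end-tag-name)
        if after == "":
            return "text"           # end of input: "</" is emitted as characters
        return "bogus comment"      # invalid-first-character-of-tag-name
    # "Tag open" state
    if nxt in ASCII_ALPHA:
        return "start tag"
    if nxt == "!":
        return "markup declaration"
    if nxt == "?":
        return "bogus comment"
    return "text"                   # "<" is emitted as a plain character


if __name__ == "__main__":
    for sample in ["<22>", "</22>", "<b>", "</b>", "< b>"]:
        print(sample, "->", classify(sample))
```

So the asymmetry is baked into the two states themselves: an invalid character after `<` falls back to text, while an invalid character after `</` falls into the bogus comment state.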
Last smile 😂
Starting tags with numbers is invalid HTML / XML. Why would you ever need to do that?
(Developing a TCP Network Proxy - Pwn Adventure 3) I have a problem
I respect your snipe
What happened to video "DONT USE ALERT(1) FOR XSS"?
Nothing happened?
Heh. I get it now. Haha
Is your minecraft server still up
pedantry corner here: it's not a letter, but an ASCII letter. öäüß are letters.
Club Mate spotted 😅
that regex html verifier seems like something people would use in amp pages because of their shitty js support
neat! :)
You’re best off using bbcode, markdown, or writing your own parser.
Ya, I have found some strange XSS like this, you never know how the backend will process the input. for example '-alert(1)-' and '+alert(1)+' did not work.... but '-alert(1)+' did!?!, can't explain it.
test
😂 trying to defeat yt??
bravo, you win the internet :D
Guys, I actually found something, I know the comment looks ok, but when some of you liked it, my android notification showed "someone liked your comment ''". So, it proves that these problems are everywhere.
@@intron9 prob character limit?
Question from someone who knows very little HTML: Why does the get parsed as a comment?
foo 22
holy shit
Michael Cera,that you?
test
hi
too much sunlight
I saw what you hid. The behavior is different when stored
Chomsky != Komsky
These kinds of things are why I designed my html/bbcode parser in the way I did. It can read whatever it's given into a dom tree, but when outputting that to html/bbcode it only returns what I've specifically allowed it to.
Why does the browser do that with the invalid 22 'tag' instead of just discarding it? Is it related to custom elements and shadow DOM standards? So weird!
because the HTML specification says the browser has to do that ;)
Test 22
till 3:10 i think yeah nothing interesting or new. At 3:20 holup!
sanitize deez
As a programmer who worked as a web developer, I knew from the start that anything that starts with a number is not a valid HTML tag
Markup shenanigans: ✔ Found out about debuggex: 😲 Club Mate in the promo: 🤯 Video rank: 🅰 ➕
I am Disliking all videos on multiple accounts until minecraft hacked comes back!!!
Why does it turn into a comment? This was not explained
It was explained at 8:53
@@DefineSyntax994 ahh I see, it treats it differently when it's a closing tag, thanks!
Nice try!
yurrrrrrrrrrr first
I tried this in PHP and the PHP DOMDocument class doesn't actually handle it correctly. It just... kind of eats the entire 22 tag when parsing, bizarrely outputting the text ">22> (with only one quote, by the way!) as the result. PHP never ceases to disappoint me. EDIT: It's worse than I thought, it fails to handle basically every parse error correctly. Invalid CDATA doesn't become a comment. A character reference that's out of unicode range doesn't become U+FFFD. It even turns attributes in closing tags into text. And that's just a few examples. Something like is actually a vulnerability in the most recent PHP.
11:24 I see this breaking if the parsing/segmentation of the document is not done "as the entirety". Example: if your parser has a size limit on how large the page can be, and your tag is theoretically longer than that, it would need to correctly correlate the opening and closing tag segments from two different segments of computed data, a much easier situation in which to engineer a flaw. So say you create an HTML tag that is longer than a parser can handle as a single 'bite': "< 'x'*21474834648 >" (extremely long data). I believe the best solution to this problem would be the logical equivalent of 'just don't talk with your mouth full', as any legitimate code would have no business surpassing a theoretical limit on a parser's maximum length.
Re your Hextree chat at the end, don't go down the rabbit hole of trying to write a Python tutorial just because some of your users won't know Python. There are a bazillion existing Python tutorials. If you go down that route, shouldn't you also write an English tutorial for your users who don't speak English? And cooking tutorials so your users aren't hungry and can concentrate better? Focus on the actual purpose of the site.
HTML sanitizers should be whitelist based and reject obscure tags like
not first
There's a specific 4 char string you can use for your wifi network SSID that will cause it to display "unknown" SSID in a lot of apps. Not really that useful for me but someone could find it useful. So I keep it a secret 😂
This is why it's important to actually know the specs to avoid running into brick walls. However, I still say that HTML/CSS/JS should be replaced with a singular language that incorporates most of their collective feature-set, and that it should *not* be a tagged markup language but an actual programming language with built-in support for styling and document structuring, something that could be used in both a relative and absolute manner to replace as many document formats as possible at the same time, such as PDF's and Office documents. Yeah, I know, it'll never happen and if anyone reads this they'll have an invalid criticism because no one wants to do the hard work to replace things that are wrong.
Does it count as a valid criticism that such a solution would probably be as fragile as its most fragile facet (the JavaScript successor), and, despite the people behind it, XHTML failed in the market because it had XML-style parse errors rather than HTML-style parsing recovery? That's a big reason I never use anything client-side templated like React/Angular/Vue/etc. in my own projects. If a transient network fault causes a CSS subresource to fail to load, the page will be ugly, but probably still usable. If an IT department's application firewall hasn't added the CORS header needed to serve up my font to their HTML header whitelist, the page will be ugly but should still function. If a templating or sanitizing mistake results in malformed HTML, there's still an opportunity for things to work, and for another layer like CSP to prevent any sanitizing mistake from being exploitable. I can use uMatrix and uBlock Origin to block ads and potential exploit vectors without the site doing a web equivalent to segfaulting. etc. etc. etc. Hell, Postscript *is* an actual turing-complete programming language in the vein of what you seem to be asking for and, when it was adapted into PDF, they took it further *away* from what you're asking for. Various exploits have occurred because of the need to run turing-complete code to render a PDF document, even with the bolted-on JavaScript support turned off. Various people prefer SVG over Postscript specifically because it's a more HTML-like declarative solution for describing a document. etc. etc. etc. Likewise, TeX is a programming language for document creation... people have been migrating away from writing it directly to writing declarative things like Markdown and reStructuredText and then programmatically translating to TeX code when they need to access its ecosystem of document typesetting extensions. 
...not to mention that you're going to need an HTML-like DOM either way, because that's how screen readers and other accessibility tools see the world (GTK applications on Linux actually have an equivalent to a DOM explorer built in for their widget tree and ready to be turned on by an environment variable), and HTML has been drifting toward closer alignment with the accessibility DOM it generates over the last decade.
@@ssokolow Fragility would depend on the design. So if a committee designs it, it'll take a decade and start out okay, but then turn fragile as more crap is bolted on. If a group of OSS nerds develop it, then it'll be fragile from the start and over time converge on being hardy, but become more and more bloated than even a committee could make it and it would still suck. It would require that a single person with a good vision for it to stringently design it and direct its implementations. This is something that will never happen because corporations would never adopt such a system even if it was the best system. So to answer your first question, yes *and* no, with the dependence on which answer is correct based on who gets to design it.
@@anon_y_mousse I disagree. I have yet to be convinced that it's not a technical impossibility to make anything practical which incorporates turing-complete code (i.e. the part which does JavaScript-y things) as gracefully degrading as the HTML+CSS side of what we already have, regardless of who designs it. It's similar to how the power of a static type system or of Rust's "static, compile-time garbage collection" is in *restricting* what the system is capable of, to bring it to a point the computer can better understand your intentions.
@@ssokolow It's not a technical impossibility, it's a human adoption problem. But if you're a Rustacean then you won't understand, so there's no point in discussing it with you.
@@anon_y_mousse Wow. That's a big assumption out of nowhere. I could have just as easily pointed to C and C++ forbidding unconstrained GOTO instead. (It used to be commonplace for high-level languages to give you assembly language's freedom to jump into the middle of a function, bypassing its beginning.) I just wanted a second example to complement what static typing brings.
HTML tag names can't start with numbers
Sorry but can't you just replace all html special characters into html encoding? Lmao