Generic HTML Sanitizer Bypass Investigation

24 May 2024

140,032 views

I stumbled over a weird HTML behavior on Twitter and started to investigate it. Did I just stumble over a generic HTML Sanitizer bypass?
Get my handwritten font: shop.liveoverflow.com (advertisement)
Check out our courses on hextree.io (advertisement)
The Tweet: / 1662701541680136195
Google XSS: • XSS on Google Search -...
HTML Spec: html.spec.whatwg.org/multipag...
Chapters:
00:00 - Intro
01:09 - Sanitizing vs. Encoding
02:32 - Developing HTML Sanitizer Bypass
05:03 - Attacking DOMPurify
07:08 - Attacking Server-side Sanitizer
08:31 - HTML Parse Error Specification
10:08 - Potential Impact
11:55 - hextree.io
=[ ❤️ Support ]=
→ per Video: / liveoverflow
→ per Month: / @liveoverflow
2nd Channel: / liveunderflow
=[ 🐕 Social ]=
→ Twitter: / liveoverflow
→ Streaming: twitch.tv/LiveOverflow/
→ TikTok: / liveoverflow_
→ Instagram: / liveoverflow
→ Blog: liveoverflow.com/
→ Subreddit: / liveoverflow
→ Facebook: / liveoverflow

Comments

I'm generally someone who likes to implement stuff themselves, instead of using an external dependency, but stuff like this is why I normally don't touch security-related things (like HTML sanitization) myself and go for an existing solution instead.
@Fasguy 10 months ago
- Once I did as a personal challenge and then you can get a list of common bugs that might lead to other problems. You can work one by one but then you get another list and you get another kick. But it is fun to try
  @motbus3 10 months ago
- I just remember log4j
  @arnevaneycken2878 10 months ago
- Same, I prefer to implement myself most things but Date/Timezone management and security related stuff are 2 topics I don't want to touch if possible
  @rosco3 10 months ago
- @rosco3 Oh god yes, f*ck Time management of any kind. Especially in JS.
  @Fasguy 10 months ago
- Fun fact, HTML sanitization and this specific weirdness is what prevented a hacker's malware from working on my phone(s). 😅 They shit in their own hand on that one
  @apIthletIcc 10 months ago
I'd read before that valid HTML tags can't start with a number so I wasn't surprised that was the root issue, I wasn't aware of how the specification detailed parsing the situation however and now I'm wondering the logic behind _why_ they specify it should be handled like _that_ of all things 🤔 My only guess is to make sure it mangles the output sufficiently as to hopefully make the developer notice something's wrong and fix it, but surely that could be done more gracefully... or maybe it's just grandfathered in from the quirky behaviour of some early parser?
@PixelOverload 10 months ago
- The HTML 5 spec is nearly entirely trying to nail down the least broken interpretation of existing content written against the wacky browsers of the time.
  @SimonBuchanNz 10 months ago
- , etc. are a rare way to count items and basically ignoring , ... by parsing them as comments might be a concession to broken markup cleaners that tried to close those non-tags.
  @D0Samp 10 months ago
About the Chomsky hierarchy: what we call regex is not actually🤓 type-3/regular, it often has operations, such as repeating the match group with \1, that are only present in type-1 grammars. There is a somewhat well-known regex for prime numbers, and it is impossible to construct a corresponding state machine for it.
@0marble8 10 months ago
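The backreference point in the comment above can be tried directly. This is a minimal sketch in Python, assuming the well-known unary-encoding pattern is the prime regex the comment has in mind; the helper name `is_prime` is only for illustration. A number n is written as n copies of `1`, and the pattern matches exactly when n is 0, 1, or composite.

```python
import re

# Matches unary strings whose length is NOT prime:
#   ^1?$         -> length 0 or 1
#   ^(11+?)\1+$  -> a group of two or more ones, repeated at least once more
#                   (so the total length is composite)
# The backreference \1 is what takes this beyond a regular (type-3) language.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n: int) -> bool:
    """Return True if n is prime, using only the regex above."""
    return NOT_PRIME.match("1" * n) is None


print([n for n in range(30) if is_prime(n)])
```

The match has to remember an arbitrarily long group and compare it against the rest of the input, which a finite state machine cannot do.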
This video is great. I've been a full stack web developer now for about 15 years and I learned quite a bit about HTML parsing. The onerror attribute isn't something I would think of at all because quite frankly I write JavaScript Event and Error handlers the recommended/modern way.
@dave7244 10 months ago
There is also a new JS method on all HTML elements: setHTML(input, options). It's basically innerHTML, but sanitizes the input. So I think it's just like DOMPurify, but natively in the browser.
@_nikeee 10 months ago
- as it's really new... it might contain issues....
  @vaisakhkm783 10 months ago
- @vaisakhkm783 Hey if you find any I bet there is a reward from Google...
  @ET_AYY_LMAO 10 months ago
- Cool, but it's not supported in FF or Safari.
  @ET_AYY_LMAO 10 months ago
- @vaisakhkm783 IDK, I'd trust browser authors more with sanitizing HTML, since they wrote their own parser. It wouldn't surprise me if it used some of the browser's own HTML parsing logic
  @spicybaguette7706 10 months ago
- @spicybaguette7706 until people figure out some no click exploit
  @whannabi 10 months ago
When I saw your thumbnail, I instantly tried it out with different numbers and I also tried random tags, starting with numbers. I thought okay, variables can't start with numbers, so I guess that also applies to HTML tags. I would have never thought about any security bypasses by my own but then I became curious and watched your video.
@SchonKonnie 10 months ago
The popular Java library for this, Jsoup, also looks to handle this correctly. The basic input turns into <22> though it strips the comment for the closing tag.
@MLeoDaalder 10 months ago
- Oh that is nice, I love jsoup
  @JordanPlayz158 10 months ago
The UI for your courses looks really nice!
@IanZamojc 10 months ago
I was nervous when you tried dompurify, cause we heavily rely on it in some of our projects.
@wartab 10 months ago
If you wasted your time researching this, then what have I done by watching!? Haha, great vid, interesting results.
@jimdiroffii 10 months ago
8:00 funny you mentioned the syntax highlighting, because it was the FIRST thing my brain said when you first wrote it in your editor at the start of the video! 😂
@json_bourne3812 10 months ago
Good video. You are never defeated if you end up learning something 💪
@kiyov09 10 months ago
We should only allow standard ( and whitelisted/predefined custom tags ) and explicitly close them. And just refuse to parse and throw an error if document is invalid. It's just stupid to allow arbitrary syntax and try to parse it.
@eero8879 10 months ago
- Markup languages seldom throw errors. It'd also be annoying to have an entire document not render for you just because they used something your renderer doesn't allow But you can enforce well formed HTML as a style guide for your project, which many people already do
  @seanthesheep 10 months ago
- @seanthesheep Or block execution in case of malformed syntax?
  @namibjDerEchte 9 months ago
Hey LiveOverflow, how about CTF challenges as hextree courses? I think those would nicely build onto your existing youtube video's.
@boomknuffelaar 10 months ago
I learned something new, very interesting thanks.
@IllIl 10 months ago
In Python, both the original htmllib.HTMLParser, which was built on top of the SGML parser and no longer exists in Python 3, and the current html.parser.HTMLParser handle this according to the specification.
@D0Samp 10 months ago
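The claim about Python's current parser can be checked with a few lines. This is a small sketch, not a statement about every Python release: the class name `EventLogger` and the helper `parse_events` are made up for illustration, and only the standard `html.parser.HTMLParser` callbacks are used. It records which events the parser emits for the input discussed in the video.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parser events so the handling of <22> and </22> is visible."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []      # chunks passed to handle_data
        self.comments = []  # payloads passed to handle_comment
        self.tags = []      # start and end tags the parser recognised

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        self.text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def parse_events(markup):
    """Return (joined text, comments, tags) seen while parsing markup."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.comments, parser.tags


text, comments, tags = parse_events("<22>foo</22>")
print(repr(text), comments, tags)
```

On the CPython versions I am aware of, no tag events are reported at all: the opening `<22>` comes back as plain text and the closing `</22>` as a comment containing `22`, which matches the behaviour the video attributes to the specification.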
I instantly remembered that Astro lets you define a variable to alias a component or html tag, so I instantly went and tried Wat=22. Got the same behaviour described in the spec, but Astro leaked the classname into the HTML content, so I guess I got a bug report to make... EDIT: I tried around and there is just enough sanitization to break any XSS I could think of so far. Shoutouts to the Astro team I guess 😹
@Mitsunee_ 10 months ago
I am just learning HTML, and I was so confused when you were calling <22> an HTML tag. Only for the conclusion to the video to be as simple as “it’s not one”. Like wow, what a shocker
@maker0824 10 months ago
I wish hextree was open... I am so excited!
@xorlop 10 months ago
You learn something everyday.
@ThePowerRanger 10 months ago
Say thanks to the big brains who didn't want to go with xhtml which was waaaaay more restrictive than html5
@mjerez6029 10 months ago
Great content for junior 👍🏽
@farismazlan5157 10 months ago
I remember back in the days many people sanitized for javascript links by checking if the url starts with javascript: that makes sense I guess... but then IE7 allowed for a tab (Or was is some other char? cant remember) characters infront of the javascript: part.
@ET_AYY_LMAO 10 months ago
- Another totally unrelated discovery I found common a decade ago or so is to not escape float arguments in SQL, right around where Gmap API was the next thing everyone wanted, I found so many sites where you could SQL inject the lat lng arguments on the endpoint for map data. This included the largest private buy and sell site in my country at the time, but their parent company scolded me at a job interview so I never told them about it >:) Nowadays everybody thankfully uses data binding instead of concatenation when building SQL in their applications.. (And yes I was able to get the user table by reading the database structure from INFORMATION_SCHEMA table in mysql and absolutely pwn the shit out of them, but I'm a nice guy that does this shit just for bragging rights)
  @ET_AYY_LMAO 10 months ago
- @blenderpanzi I really wanted to reply but YouTube keeps deleting it lol.
  @ET_AYY_LMAO 10 months ago
- @ET_AYY_LMAO You can't include any URLs in YouTube comments. They get auto-deleted.
  @blenderpanzi 10 months ago
- @blenderpanzi URLs cover more than http, there are other protocols and pseudo-protocols like mailto that could be 100% legit use cases, as well as relative URLs. But yes, always whitelist!
  @ET_AYY_LMAO 10 months ago
- @ET_AYY_LMAO Yes, as I said, you might over-block, but that is not as bad as having an injection. Add mailto: to the list of allowed protocols if you want to allow that. :D
  @blenderpanzi 10 months ago
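The allow-list idea from this thread could look roughly like the following. It is a sketch under stated assumptions: the set `ALLOWED_SCHEMES` and the function `is_allowed_url` are hypothetical names, and the normalisation step (strip leading and trailing control characters and spaces, drop tabs and newlines) is my reading of how browsers pre-process URLs, not something taken from the video.

```python
import re

# Hypothetical allow-list; extend it (for example with "tel") as needed.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# Characters U+0000..U+0020, stripped from both ends before parsing.
_C0_AND_SPACE = "".join(chr(c) for c in range(0x21))
_DROP_TAB_NEWLINE = str.maketrans("", "", "\t\n\r")
_SCHEME_RE = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_allowed_url(url: str) -> bool:
    """Accept relative URLs and URLs whose scheme is on the allow-list."""
    cleaned = url.strip(_C0_AND_SPACE).translate(_DROP_TAB_NEWLINE)
    match = _SCHEME_RE.match(cleaned)
    if match:
        return match.group(1).lower() in ALLOWED_SCHEMES
    # No recognisable scheme: treat as relative, but stay conservative and
    # reject anything with a colon before the first "/", "?" or "#".
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    return ":" not in head


for candidate in ["https://example.com", "\tjavascript:alert(1)",
                  "java\tscript:alert(1)", "mailto:a@example.com", "/about"]:
    print(repr(candidate), is_allowed_url(candidate))
```

Checking what is allowed rather than what is forbidden means a leading tab or mixed case cannot slip a `javascript:` URL through; the cost, as the thread notes, is that some legitimate schemes get blocked until they are added.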
"one day i was scrolling through twitter", twitter does not exist anymore
@chocolateimage 10 months ago
- Elon change to Tweslla
  @Rhidayah 10 months ago
- Twitter in 2006: "Its like text-messages, but for companies to do PSAs and engage with their audience like 'Our website is down, sorry!' or '25% discount at XYZ on friday'" Twitter in 2023: "yOU mUsT sIgN uP tO sEe tHiS tWeEt"
  @ET_AYY_LMAO 10 months ago
- @ET_AYY_LMAO Twitter in 2006: nobody cared Twitter in 2023: nobody cares
  @jaydeep-p 10 months ago
- @jaydeep-p ok
  @sylv256 10 months ago
- I can't wait for the day that someone responds to this with something along the lines of, "wow, I can't believe you predicted the future!!!". Twitter will fall and after a couple awkward years (shit, months?) the Internet collective will find a new hate machine where people say funny little things and receive death threats in response. What a time to be alive 😔
  @partlyblue 10 months ago
@LiveOverflow Do you know of any good resources for help with finding bugs? What method do you use to find a bug that looks like a potential vulnerability?
@typedeaf 8 months ago
I can't remember any instance of variables starting with a number being valid. Also, we did do a simple html parser in 3rd year cs and the alpha as a first letter is the first thing we put in 😂
@shigekax 10 months ago
But if your buggy non-standard HTML parser then spits out normalized HTML with any < > & properly encoded and any tag/attribute/attribute value that is not explicitly allow-listed removed, no injections should be possible either, right?
@blenderpanzi 10 months ago
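The approach the question above describes (parse leniently, then serialise only what is explicitly allowed and encode everything else) can be sketched with Python's standard library. The names `ALLOWED`, `AllowListSerializer` and `sanitize` are hypothetical, the allow-list is deliberately tiny, and the sketch does not balance stray closing tags or vet attribute values such as URLs, so treat it as an illustration of the idea rather than a usable sanitizer.

```python
import html
from html.parser import HTMLParser

# Hypothetical allow-list for this sketch: tag name -> attributes to keep.
ALLOWED = {"b": set(), "i": set(), "p": set(), "abbr": {"title"}}


class AllowListSerializer(HTMLParser):
    """Re-emit only allow-listed tags; everything else becomes encoded text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown tag is dropped; its text content is still kept
        kept = []
        for name, value in attrs:
            if name in ALLOWED[tag]:
                escaped = html.escape(value or "", quote=True)
                kept.append(f' {name}="{escaped}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-encoded, so a literal "<" can never open a tag.
        self.out.append(html.escape(data, quote=False))

    # Comments and declarations are not overridden, so the base class
    # discards them silently.


def sanitize(markup: str) -> str:
    serializer = AllowListSerializer()
    serializer.feed(markup)
    serializer.close()
    return "".join(serializer.out)


print(sanitize('<b onclick="x">hi</b><22>foo</22><img src=x onerror=alert(1)>'))
```

Because the output is rebuilt from parser events instead of being a filtered copy of the input, a disagreement between this parser and the browser's parser only matters if it survives re-encoding, which is the property the comment is asking about.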
Is it possible to tell the browser how it should handle errors and dubious code? Is it possible to let the browser check the syntax first, and only when it succeeds, it is allowed to render it.
@Hofer2304 10 months ago
thank you thank you thank you for showing a "failure".
@270jonp 10 months ago
is the invite code to hextree meant to be a challenge to be bruteforced ?
@serialkiller8783 10 months ago
This is so cool!
@lancemarchetti8673 10 months ago
Hextree signup is disabled. Is this part of the test? 😅
@motbus3 10 months ago
Push!
@tg7943 10 months ago
good to know thanks
@untitled8027 10 months ago
12:14 is this an advertisement for hextree and club mate at the same time? :D
@Zadagu 10 months ago
Yeah it's the same for any other programming language. Every variable name cannot start with a number. Html's tags are no different
@paulcasanova1909 10 months ago
I think I’ve seen this happen in a bug converting markdown to html before too
@Cdaprod 10 months ago
You're awesome
@MrGuppiSocks 10 months ago
I just wrote my own HTML minimizer. When the video started I instantly knew what the issue was. The HTML spec is like C++. They are insane. IMHO, the HTML grammar is simply * . There's no way to get regex (or even EBNF) to HTML without edge cases of syntax error.
@clehaxze 10 months ago
- There's no way to get a regex for HTML, period. It is provably impossible, and applies to any language where the syntax requires you to match opening and closing brackets or any equivalent thing such as tags.
  @beeble2003 10 months ago
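The reply's point about matching opening and closing tags can be illustrated with a short sketch. The regex below only finds individual tag tokens; the pairing needs a stack, which is memory a plain regular expression does not have. `TAG_RE` and `tags_balanced` are illustrative names, and the token pattern is intentionally simplistic (no attributes, no void elements).

```python
import re

# Tokeniser only: finds <name> and </name>, nothing more.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)>")


def tags_balanced(markup: str) -> bool:
    """Check that every opening tag is closed in the right order."""
    stack = []
    for match in TAG_RE.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)        # remember what must be closed later
        elif not stack or stack.pop() != name:
            return False              # closing tag without a matching opener
    return not stack                  # anything left open is unbalanced


print(tags_balanced("<a><b></b></a>"))
print(tags_balanced("<a><b></a></b>"))
```

The stack can grow without bound as nesting deepens, which is exactly what a finite state machine cannot model.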
so interesting and fun to watch, thx, kiss lol
@amorcomorco 7 months ago
LiveOverFlow!!!! !!!!
@williamm200 10 months ago
11:01 I think if you add [\s]*[a-z][a-zA-Z0-9]* to the regex right after the < it should make it more spec compliant
@luketurner314 10 months ago
- You can have hyphens in tag names. Custom elements need it as in
  @Victor_Marius 10 months ago
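The rule this thread circles around, as I understand the HTML tokenizer, is that a `<` (or `</`) only opens a tag when the very next character is an ASCII letter; what follows in the name is much more permissive, which is why hyphenated custom element names work. A minimal sketch, with `opens_a_tag` as a made-up helper name:

```python
import re

# "<" or "</" immediately followed by an ASCII letter. The explicit A-Za-z
# range matters: str.isalpha() would also accept letters such as "ö".
_TAG_OPEN_RE = re.compile(r"</?[A-Za-z]")


def opens_a_tag(fragment: str) -> bool:
    """Return True if the fragment begins like a start or end tag."""
    return _TAG_OPEN_RE.match(fragment) is not None


for sample in ["<b>", "<my-element>", "</p>", "<22>", "< b>", "<öl>"]:
    print(sample, opens_a_tag(sample))
```

A sanitizer regex that accepts whitespace or digits right after the `<` is therefore describing something the browser would not treat as a tag in the first place.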
Will you continue the Minecraft Hacked series?
@nodnarb 10 months ago
Will you open hextree to external content creators?
@takeshikovacs667 10 months ago
Love how this wouldn't even be an issue if you just turned ALL the < into &lt;
@matthewrease2376 10 months ago
- You seem to have missed the bit at 1:15.
  @nixel1324 10 months ago
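For reference, the "just encode everything" option from the comment above is a one-liner in most languages. A sketch in Python using the standard `html.escape`; the wrapper name `encode_for_html_text` is only illustrative. It fits when no markup at all should survive, whereas the chapter list's "Sanitizing vs. Encoding" distinction is about cases where some HTML must be kept.

```python
import html


def encode_for_html_text(untrusted: str) -> str:
    """Encode &, <, > and quotes so the input is rendered as literal text."""
    return html.escape(untrusted, quote=True)


print(encode_for_html_text("<22>foo</22>"))
print(encode_for_html_text("<img src=x onerror=alert(1)>"))
```

With encoding, the odd parsing of `<22>` never comes into play, because the browser never sees a raw `<` from user input.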
My first thought was - HTML tags can't start with digits, so it interprets it as a text literal. The </22> becoming a comment is a surprise though
@Gastell0 10 months ago
How did I just find out that we share first names :D
@zdazeeeh 10 months ago
When is hextree going to open up?
@bdot02 10 months ago
Registration for hextree is not open? I really want to try out hardware
@kn19ht_s3c 10 months ago
Server side sanitizers are a thing of the past. Client side sanitization bypasses are more interesting.
@TheNullBox 10 months ago
Club-Mate bro :)
@drkwrk5229 10 months ago
thanks for sharing "anyway" 🙂
@leyasep5919 10 months ago
CLUB MATE!
@prescientdove 10 months ago
Ad popped one minute in the video, great ...
@bynariizminecraftenplusfun4181 8 months ago
weirdly, I am more aware of Chomsky hierarchy through language and culture studies, and not through computer science. And my day job title is software engineer.
@tsalVlog 9 months ago
This is expected behavior
@JOHN-um2 9 months ago
Seems odd to me that the opening and closing tags are treated differently. It would make sense if the closing tag was also treated as text. I suppose what is happening here is that the default behavior when the parser encounters a closing tag that is missing a corresponding and valid opening tag, is to turn it into a comment. But this check happens before the check for whether the tag name is valid. So that check is never made on the closing tag, because it's already been turned into a comment by the previous check. That makes sense, but why do this intentionally and make it part of the spec? Intuitively, it would make much more sense to require that the opening and closing tags of an invalid tag name be treated the same, and have this check happen before the other check. Then the output would have made more sense, you probably wouldn't have questioned the behavior, and you wouldn't have spent an hour on figuring out why.
@Jdbye 9 months ago
Last smile 😂
@kn19ht_s3c 10 months ago
Starting tags with numbers is invalid HTML / XML. Why would you ever need to do that?
@nickp82 10 months ago
(Developing a TCP Network Proxy - Pwn Adventure 3) I have problem
@mokhtardz9889 10 months ago
I respect your snipe
@Vampirat3 10 months ago
What happened to video "DONT USE ALERT(1) FOR XSS"?
@virinom 10 months ago
- Nothing happened?
  @LiveOverflow 10 months ago
- Heh. I get it now. Haha
  @LiveOverflow 10 months ago
Is your minecraft server still up
@rokutv-2023 10 months ago
pedantry corner here: it's not a letter, but an ASCII letter. öäüß are letters.
@TheDiveO 10 months ago
Club Mate spotted 😅
@Epinardscaramel 10 months ago
that regex html verifier seems like something people would use in amp pages because of their shitty js support
@felipemartins6433 10 months ago
neat! :)
@outseeker 10 months ago
You’re best off using bbcode, markdown, or writing your own parser.
@frosty1433 10 months ago
Ya, I have found some strange XSS like this, you never know how the backend will process the input. for example '-alert(1)-' and '+alert(1)+' did not work.... but '-alert(1)+' did!?!, can't explain it.
@gprime3113 10 months ago
test
@bugbountyhunter-eh8rq 10 months ago
@intron9 10 months ago
- 😂 trying to defeat yt??
  @vaisakhkm783 10 months ago
- bravo, you win the internet :D
  @_erayerdin 10 months ago
- Guys, I actually found something, I know the comment looks ok, but when some of you liked it, my android notification showed "someone liked your comment ''". So, it proves that these problems are everywhere.
  @intron9 10 months ago
- @intron9 prob character limit?
  @crimsonmegumin 10 months ago
Question from someone who knows very little HTML: Why does the </22> get parsed as a comment?
@rtg_onefourtwoeightfiveseven 10 months ago
foo 22
@zuctivazenci 10 months ago
holy shit
@hangingwithvoid360 10 months ago
Michael Cera, that you?
@thecamlayton 10 months ago
test
@jamesflames6987 10 months ago
hi
@emireri2387 10 months ago
too much sunlight
@zoenagy9458 10 months ago
I saw what you hid. The behavior is different when stored
@jay25inteserve 10 months ago
Chomsky != Komsky
@andrewdunbar828 10 months ago
These kinds of things are why I designed my html/bbcode parser in the way I did. It can read whatever it's given into a dom tree, but when outputting that to html/bbcode it only returns what I've specifically allowed it to.
@Sollace 10 months ago
Why does the browser do that with the invalid 22 'tag' instead of just discarding it? Is it related to custom elements and shadow DOM standards? So weird!
@jonopens 10 months ago
- because the HTML specification says the browser has to do that ;)
  @LiveOverflow 10 months ago
Test 22
@TwoThreeFour 9 months ago
till 3:10 i think yeah nothing interesting or new. At 3:20 holup!
@oxymonster1337 10 months ago
sanitize deez
@velox__ 10 months ago
as a programmer that worked as web developer, I knew from the start, anything that start with a number is not a valid html tag
@MenkoDany 10 months ago
Markup shenanigans: ✔ Found out about debuggex: 😲 Club Mate in the promo: 🤯 Video rank: 🅰 ➕
@Grstearns 10 months ago
I am Disliking all videos on multiple accounts until minecraft hacked comes back!!!
@va1iduser682 9 months ago
why does it turn into a comment? This was not explained
@crimsonmegumin 10 months ago
- It was explained at 8:53
  @DefineSyntax994 10 months ago
- @DefineSyntax994 ahh I see, it treats it differently when it's a closing tag, thanks!
  @crimsonmegumin 10 months ago
Nice try!
@bugkilla84 10 months ago
yurrrrrrrrrrr first
@MuscleTeamOfficial 10 months ago
I tried this in PHP and the PHP DOMDocument class doesn't actually handle it correctly. It just... kind of eats the entire 22 tag when parsing, bizarrely outputting the text ">22> (with only one quote, by the way!) as the result. PHP never ceases to disappoint me. EDIT: It's worse than I thought, it fails to handle basically every parse error correctly. Invalid CDATA doesn't become a comment. A character reference that's out of unicode range doesn't become U+FFFD. It even turns attributes in closing tags into text. And that's just a few examples. Something like is actually a vulnerability in the most recent PHP.
@Gamesaucer 10 months ago
11:24 I see this breaking if the parsing /segmentation of the document is not done "as the entirety" Example: If your parser has a size limit on how large the page can be, and your tag is theoretically longer than that, it would need to correctly correlate the opening and closing tag segments from two different segments of computed data, a much easier situation to engineer a flaw. So say you create an html tag that is longer than a parser can handle as a single 'bite'. "< 'x'*21474834648 >" - extremely long data I believe the best solution to this problem would be a logical equivalent of 'just don't talk with your mouth full' as any legitimate code would have no business surpassing a theoretical limit on a parsers maximum length.
@notavoicechanger1808 4 months ago
Re your Hextree chat at the end, don't go down the rabbit hole of trying to write a Python tutorial just because some of your users won't know Python. There are a bazillion existing Python tutorials. If you go down that route, shouldn't you also write an English tutorial for your users who don't speak English? And cooking tutorials so your users aren't hungry and can concentrate better? Focus on the actual purpose of the site.
@beeble2003 10 months ago
HTML sanitizers should be whitelist based and reject obscure tags like
@maxdemian6312 10 months ago
not first
@bryld_ 10 months ago
There's a specific 4 char string you can use for your wifi network ssid that will cause it to display "unknown" ssid in alot of apps. Not really that useful for me but someone could find it useful. So I keep it a secret 😂
@apIthletIcc 10 months ago
This is why it's important to actually know the specs to avoid running into brick walls. However, I still say that HTML/CSS/JS should be replaced with a singular language that incorporates most of their collective feature-set, and that it should *not* be a tagged markup language but an actual programming language with built-in support for styling and document structuring, something that could be used in both a relative and absolute manner to replace as many document formats as possible at the same time, such as PDF's and Office documents. Yeah, I know, it'll never happen and if anyone reads this they'll have an invalid criticism because no one wants to do the hard work to replace things that are wrong.
@anon_y_mousse 10 months ago
- Does it count as a valid criticism that such a solution would probably be as fragile as its most fragile facet (the JavaScript successor), and, despite the people behind it, XHTML failed in the market because it had XML-style parse errors rather than HTML-style parsing recovery? That's a big reason I never use anything client-side templated like React/Angular/Vue/etc. in my own projects. If a transient network fault causes a CSS subresource to fail to load, the page will be ugly, but probably still usable. If an IT department's application firewall hasn't added the CORS header needed to serve up my font to their HTML header whitelist, the page will be ugly but should still function. If a templating or sanitizing mistake results in malformed HTML, there's still an opportunity for things to work, and for another layer like CSP to prevent any sanitizing mistake from being exploitable. I can use uMatrix and uBlock Origin to block ads and potential exploit vectors without the site doing a web equivalent to segfaulting. etc. etc. etc. Hell, Postscript *is* an actual turing-complete programming language in the vein of what you seem to be asking for and, when it was adapted into PDF, they took it further *away* from what you're asking for. Various exploits have occurred because of the need to run turing-complete code to render a PDF document, even with the bolted-on JavaScript support turned off. Various people prefer SVG over Postscript specifically because it's a more HTML-like declarative solution for describing a document. etc. etc. etc. Likewise, TeX is a programming language for document creation... people have been migrating away from writing it directly to writing declarative things like Markdown and reStructuredText and then programmatically translating to TeX code when they need to access its ecosystem of document typesetting extensions. 
...not to mention that you're going to need an HTML-like DOM either way, because that's how screen readers and other accessibility tools see the world (GTK applications on Linux actually have an equivalent to a DOM explorer built in for their widget tree and ready to be turned on by an environment variable), and HTML has been drifting toward closer alignment with the accessibility DOM it generates over the last decade.
  @ssokolow 10 months ago
- @@ssokolow Fragility would depend on the design. So if a committee designs it, it'll take a decade and start out okay, but then turn fragile as more crap is bolted on. If a group of OSS nerds develop it, then it'll be fragile from the start and over time converge on being hardy, but become more and more bloated than even a committee could make it and it would still suck. It would require that a single person with a good vision for it to stringently design it and direct its implementations. This is something that will never happen because corporations would never adopt such a system even if it was the best system. So to answer your first question, yes *and* no, with the dependence on which answer is correct based on who gets to design it.
  @anon_y_mousse 10 months ago
- @@anon_y_mousse I disagree. I have yet to be convinced that it's not a technical impossibility to make anything practical which incorporates turing-complete code (i.e. the part which does JavaScript-y things) as gracefully degrading as the HTML+CSS side of what we already have, regardless of who designs it. It's similar to how the power of a static type system or of Rust's "static, compile-time garbage collection" is in *restricting* what the system is capable of, to bring it to a point the computer can better understand your intentions.
  @ssokolow 10 months ago
- @@ssokolow It's not a technical impossibility, it's a human adoption problem. But if you're a Rustacean then you won't understand, so there's no point in discussing it with you.
  @anon_y_mousse 10 months ago
- @@anon_y_mousse Wow. That's a big assumption out of nowhere. I could have just as easily pointed to C and C++ forbidding unconstrained GOTO instead. (It used to be commonplace for high-level languages to give you assembly language's freedom to jump into the middle of a function, bypassing its beginning.) I just wanted a second example to complement what static typing brings.
  @ssokolow 10 months ago
HTML tag names can't start with numbers
@djthdinsessions 10 months ago
Sorry but can't you just replace all html special characters into html encoding? Lmao
@IceHax 10 months ago