Safely cleaning HTML with strip_tags in C#

One of my favorites in the PHP libraries is the strip_tags function. Not only does it neatly remove HTML from an input, it also allows you to specify which tags should stay. This is great if you are allowing your visitors to apply some basic HTML tags to their comments. This post explores two issues: using C# to remove unwanted tags, and cleaning up unwanted attributes that might be hidden in the allowed tags.

I wanted to clean unwanted HTML tags out of comments posted to a website. Users are allowed to use a few basic tags in their posts, such as bold, italics, and even links, but anything else must be stripped before it is posted to the site. I found several regular expressions for C# that strip HTML, but these wipe out all the HTML indiscriminately and leave no tags behind.

Below is the end result of some hacking and, of course, much love-hate with the regular expression library.

string StripTags(string Input, string[] AllowedTags)

The StripTags method takes an input string and an array of allowed tags. It returns the input as a string, minus all unwanted tags.

string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
string test2 = StripTags("<p>George <b>W</b> <i>Bush</i></p>", new string[]{"p"});
string test3 = StripTags("<a href='http://example.com/'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

Using the above example code returns the following:

George<b>W</b><i>Bush</i>
<p>George W Bush</p>
<a href='http://example.com/'>Martijn Dijksterhuis</a>
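
For comparison, here is a minimal sketch of the same idea done in a single pass with a MatchEvaluator, so each tag is kept or dropped right where it is found. This is not the code from this post: the class name TagFilter, the method name StripTagsSinglePass and the tag pattern are all made up for this illustration, and the pattern assumes the tag name directly follows the opening "<" or "</".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagFilter
{
    // Hypothetical helper: group 1 captures the tag name of an opening or closing tag.
    private static readonly Regex AnyTag = new Regex(@"</?([^\s>/]+)[^>]*>");

    public static string StripTagsSinglePass(string input, params string[] allowedTags)
    {
        // Compare tag names case-insensitively, so <B> counts as "b".
        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);

        // Keep the tag text when its name is allowed, otherwise replace it with nothing.
        return AnyTag.Replace(input, m => allowed.Contains(m.Groups[1].Value) ? m.Value : "");
    }

    public static void Main()
    {
        // Prints: George<b>W</b><i>Bush</i>
        Console.WriteLine(StripTagsSinglePass("<p>George</p><b>W</b><i>Bush</i>", "i", "b"));
    }
}
```

Because the name is captured and compared as a whole, an allowed "a" does not accidentally let something like an abbr tag through. Whether that behavior matches what you need is something to check against your own inputs.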

string StripTagsAndAttributes(string Input, string[] AllowedTags)

The StripTags function above shares a weakness with the original PHP strip_tags function: it is still possible for a malicious user to insert attributes into each of the allowed tags. Think “style=” and “id=”. We would be somewhat safer if we cleaned these as well. The StripTagsAndAttributes method does just that.

It first runs the input through StripTags, and for the remaining tags it strips out all but a restricted set of attributes (“href” and “class”).

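string test4 = "<a class='author' OnClick='alert(1)' href='http://example.com/'>Martijn <b>Dijksterhuis</b></a>";
Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));

That “OnClick” attribute looks mighty unsafe. Running the above string through StripTagsAndAttributes, as in the example above, returns:

<a class='author' href='http://example.com/'>Martijn Dijksterhuis</a>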

This function probably needs some tuning if you want to allow more attributes, or restrict things even further.
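
One way such tuning could look is an explicit allow-list of attribute names. The sketch below is only an illustration under a few assumptions: the names AttributeFilter and KeepAttributes are hypothetical, only quoted attribute values are recognized (unquoted ones are simply dropped), and nothing checks what is inside an allowed value, so a "javascript:" URL in an href would still get through.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeFilter
{
    // Opening tag: group 1 is the tag name, group 2 the raw attribute text.
    private static readonly Regex OpenTag =
        new Regex(@"<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>");

    // One quoted attribute: name="value" or name='value'.
    private static readonly Regex Attribute =
        new Regex(@"([a-zA-Z][\w-]*)\s*=\s*(""[^""]*""|'[^']*')");

    // Hypothetical helper: rebuilds every opening tag with only the allowed attributes.
    public static string KeepAttributes(string input, ISet<string> allowed)
    {
        return OpenTag.Replace(input, tag =>
        {
            var kept = new List<string>();
            foreach (Match attr in Attribute.Matches(tag.Groups[2].Value))
            {
                if (allowed.Contains(attr.Groups[1].Value))
                    kept.Add(attr.Groups[1].Value + "=" + attr.Groups[2].Value);
            }
            // Anything not kept (including a trailing "/") is discarded.
            string attrs = kept.Count > 0 ? " " + string.Join(" ", kept) : "";
            return "<" + tag.Groups[1].Value + attrs + ">";
        });
    }

    public static void Main()
    {
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "class" };
        string html = "<a class='author' OnClick='alert(1)' href='http://example.com/'>Martijn</a>";
        // Prints: <a class='author' href='http://example.com/'>Martijn</a>
        Console.WriteLine(KeepAttributes(html, allowed));
    }
}
```

Allowing another attribute is then a matter of adding its name to the set, and restricting further means removing one.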

A word of caution

Regular expressions are voodoo, very cool, but still voodoo. The above functions work for the tests I have applied to them, but your mileage may vary! If you have a special situation that doesn’t work, leave a note below and maybe we can work out the problems.
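
To make the caution concrete, here is a small sketch of one input that trips up the tag pattern used by StripTags: a ">" inside a quoted attribute value. The class name TagRegexPitfall and the method FindTags are made up for this demonstration; only the pattern itself comes from the source code in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexPitfall
{
    // Hypothetical helper: lists what the StripTags tag pattern considers a tag.
    public static string[] FindTags(string input)
    {
        MatchCollection matches = new Regex(@"(<\/?[^>]+>)").Matches(input);
        string[] tags = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++) tags[i] = matches[i].Value;
        return tags;
    }

    public static void Main()
    {
        // The pattern stops at the first ">", even though it sits inside the title value.
        // Prints two lines: <a title="a>  and  </a>
        foreach (string tag in FindTags("<a title=\"a>b\">link</a>"))
            Console.WriteLine(tag);
    }
}
```

The leftover b"> is treated as ordinary text, so it would survive stripping. A regular expression does not parse HTML; it only approximates it.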

Credits

The StripTags function is of course inspired by the PHP strip_tags function, and by a JavaScript implementation thereof by Kevin van Zonneveld. The attribute stripping routine is based on the regular expressions by mdw252 in one of the strip_tags manual page comments.

Source code

The complete source code for the StripTags and StripTagsAndAttributes functions, with my test code, can be found below:

using System;
using System.Text.RegularExpressions;

namespace StripHTML
{
    class MainClass
    {
        // Replace only the first occurrence of needle in haystack
        private static string ReplaceFirst(string haystack, string needle, string replacement)
        {
            int pos = haystack.IndexOf(needle);
            if (pos < 0) return haystack;
            return haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
        }

        // Replace every occurrence of needle in haystack
        private static string ReplaceAll(string haystack, string needle, string replacement)
        {
            int pos;
            // Avoid a possible infinite loop
            if (needle == replacement) return haystack;
            // IndexOf returns 0 for a match at the very start, so test for >= 0
            while((pos = haystack.IndexOf(needle))>=0)
                haystack = haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
            return haystack;
        }

        public static string StripTags(string Input, string[] AllowedTags)
        {
            // Matches anything that looks like an opening or closing tag
            Regex StripHTMLExp = new Regex(@"(<\/?[^>]+>)");
            string Output = Input;

            foreach(Match Tag in StripHTMLExp.Matches(Input))
            {
                string HTMLTag = Tag.Value.ToLower();
                bool IsAllowed = false;

                foreach(string AllowedTag in AllowedTags)
                {
                    int offset = -1;

                    // Determine if it is an allowed tag:
                    // "<tag>", "<tag " or "</tag>"
                    if (offset!=0) offset = HTMLTag.IndexOf("<"+AllowedTag+">");
                    if (offset!=0) offset = HTMLTag.IndexOf("<"+AllowedTag+" ");
                    if (offset!=0) offset = HTMLTag.IndexOf("</"+AllowedTag+">");

                    // If it matched any of the above the tag is allowed
                    if (offset==0)
                    {
                        IsAllowed = true;
                        break;
                    }
                }

                // Remove tags that are not allowed
                if (!IsAllowed) Output = ReplaceFirst(Output,Tag.Value,"");
            }

            return Output;
        }

        public static string StripTagsAndAttributes(string Input, string[] AllowedTags)
        {
            /* Remove all unwanted tags first */
            string Output = StripTags(Input,AllowedTags);

            /* Lambda functions: swap the "=" of an allowed attribute for a placeholder */
            MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;
            MatchEvaluator ClassMatch = m => m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value;
            MatchEvaluator UnsafeMatch = m => m.Groups[1].Value + m.Groups[4].Value;

            /* Allow the "href" attribute */
            Output = new Regex("(<a.*)href=(.*>)").Replace(Output,HrefMatch);

            /* Allow the "class" attribute */
            Output = new Regex("(<.*)class=(.*>)").Replace(Output,ClassMatch);

            /* Remove unsafe attributes in any of the remaining tags */
            Output = new Regex(@"(<.*) .*=(\'|\""|\w)[\w|.|(|)]*(\'|\""|\w)(.*>)").Replace(Output,UnsafeMatch);

            /* Return the allowed tags to their proper form */
            Output = ReplaceAll(Output,"..;,;..", "=");

            return Output;
        }

        public static void Main(string[] args)
        {
            string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
            string test2 = StripTags("<p>George <b>W</b> <i>Bush</i></p>", new string[]{"p"});
            string test3 = StripTags("<a href='http://example.com/'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

            Console.WriteLine(test1);
            Console.WriteLine(test2);
            Console.WriteLine(test3);

            string test4 = "<a class='author' OnClick='alert(1)' href='http://example.com/'>Martijn <b>Dijksterhuis</b></a>";
            Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));
        }
    }
}