Regex search problem

Posted: Mon Jan 04, 2016 9:02 pm
by AnoPem
Hello, I'm having problems searching my string using regex. It works fine when the source is a static string, but as soon as I download it via a WebClient it doesn't work. Any suggestions?
Code:
Dim tReturn As New ArrayList
Dim strRegex As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s\n.*?)<\/a>"
Dim myRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
For Each myMatch As Match In myRegex.Matches(source)
    If myMatch.Success Then
        TextBox1.Text = TextBox1.Text & vbNewLine & myMatch.Groups(1).Value.Trim
    End If
Next
What I'm trying to do is get the text between <a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt"> and its closing tag </a>.
The regex search works fine on a static string, but it does not work when the string is downloaded with the WebClient.

Here's part of the page it downloads. I cannot post the entire HTML file or the URL because the site is an adult site.
Code:
<ul class="genreTagList">
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt">
		  Test1                    </a>
	  </div>
	</div>
  </li>
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test2/" class="genreTag__txt">
		  Test2                    </a>
	  </div>
	</div>
  </li>
</ul>
And this is my WebClient:
Code:
        Dim WClient As New Net.WebClient
        WClient.Encoding = System.Text.Encoding.UTF8
        Dim source As String = WClient.DownloadString(url)

Re: Regex search problem

Posted: Tue Jan 05, 2016 4:44 pm
by SumCode
It would help if you posted what your source looks like, or the link to what your WebClient is downloading, and also the value you're trying to get.

So for example say the string you want to parse is "the man ate 3 burgers" and the value you want is '3', you would post what is bolded.

I would also like to state that regex isn't a good way to parse HTML. I would look at using HtmlAgilityPack or even just the built-in WebBrowser class.

Re: Regex search problem

Posted: Tue Jan 05, 2016 5:11 pm
by AnoPem
Thanks for the reply. I have updated my main post with some of the information you asked for.

Re: Regex search problem

Posted: Tue Jan 05, 2016 6:01 pm
by SumCode
Using Regex
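Well, the reason it doesn't work is that you put '\s' where there is no extra whitespace: '\s\n' needs a whitespace character immediately before the newline, and in the downloaded HTML the opening tag is followed directly by the line break. (A string with Windows-style \r\n line endings would still match, because '\s' matches the \r, which may be why your static test string worked.) So you can change it to '\s?' so that it still matches when there is no whitespace. You are also able to shorten your regex to just '<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">\n(.*?)<\/a>'. You can even shorten your pattern further (although this depends on the HTML you get from the website) to 'genreTag__txt.+\n(.+)<\/a>'.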
My final code with regex:
Code:
' _code holds the downloaded HTML
Dim r = Regex.Matches(_code, "genreTag__txt.+\n(.+)<\/a>", RegexOptions.IgnoreCase)
If r.Count > 0 Then MsgBox(r(0).Groups(1).Value.Trim())
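To see the line-ending effect in isolation, here is a minimal console sketch. It assumes the downloaded page uses bare \n line breaks and that a static test string might use \r\n; the module name and the sample strings are made up for illustration and only mimic the HTML posted above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LineEndingDemo
    Sub Main()
        ' Pattern from the first post: "\s\n" needs a whitespace character before the newline.
        Dim original As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s\n.*?)<\/a>"
        ' Same pattern with the whitespace made optional ("\s?").
        Dim relaxed As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s?\n.*?)<\/a>"

        ' Hypothetical sample input, split so the line ending can be swapped.
        Dim tagOpen As String = "<a href=""/dc/doujin/-/list/=/article=keyword/id=Test1/"" class=""genreTag__txt"">"
        Dim tagRest As String = vbTab & "  Test1     </a>"

        Dim withCrLf As String = tagOpen & vbCrLf & tagRest ' Windows-style line ending
        Dim withLf As String = tagOpen & vbLf & tagRest     ' bare newline

        Console.WriteLine(Regex.IsMatch(withCrLf, original)) ' True: "\s" consumes the \r
        Console.WriteLine(Regex.IsMatch(withLf, original))   ' False: nothing for "\s" before the \n
        Console.WriteLine(Regex.IsMatch(withLf, relaxed))    ' True
        Console.WriteLine(Regex.IsMatch(withCrLf, relaxed))  ' True
    End Sub
End Module
```

If the second line prints False for your real download as well, the line endings are the likely culprit; checking source.Contains(vbCrLf) on the downloaded string is a quick way to confirm.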
Using HtmlAgilityPack
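Code:
' Requires the HtmlAgilityPack package (Imports HtmlAgilityPack)
Dim doc = New HtmlDocument
doc.LoadHtml(_code)
Dim results = doc.DocumentNode.SelectNodes("//a[@class=""genreTag__txt""]")
' SelectNodes returns Nothing when no node matches
If results IsNot Nothing Then
    MsgBox(results(0).InnerText.Trim())
End If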
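Since the original code loops over every match rather than just the first, the same idea can be extended to collect all the tag texts. This is only a sketch: it assumes the HtmlAgilityPack package is referenced, and ExtractTags, TagListDemo and the sample HTML are names and data invented for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module TagListDemo
    ' Returns the trimmed text of every <a class="genreTag__txt"> element in the given HTML.
    ' (ExtractTags is a hypothetical helper name, not part of HtmlAgilityPack.)
    Function ExtractTags(html As String) As List(Of String)
        Dim tags As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@class='genreTag__txt']")
        ' SelectNodes returns Nothing when nothing matches, so guard before looping.
        If nodes IsNot Nothing Then
            For Each node As HtmlNode In nodes
                tags.Add(node.InnerText.Trim())
            Next
        End If
        Return tags
    End Function

    Sub Main()
        ' Made-up sample that mimics the structure of the posted HTML.
        Dim sample As String =
            "<ul><li><a href=""/dc/doujin/-/list/=/article=keyword/id=Test1/"" class=""genreTag__txt"">" & vbLf &
            "  Test1   </a></li>" &
            "<li><a href=""/dc/doujin/-/list/=/article=keyword/id=Test2/"" class=""genreTag__txt"">" & vbLf &
            "  Test2   </a></li></ul>"

        Console.WriteLine(String.Join(vbNewLine, ExtractTags(sample)))
    End Sub
End Module
```

In the form from the first post, the result could be assigned with TextBox1.Text = String.Join(vbNewLine, ExtractTags(source)) instead of appending inside the loop.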

Re: Regex search problem

Posted: Tue Jan 05, 2016 6:14 pm
by AnoPem
That seems to work, thank you very much.