Regex search problem

AnoPem
VIP - Donator
Posts: 441
Joined: Sat Jul 24, 2010 10:55 pm

Regex search problem
AnoPem
Hello, I'm having problems searching my string with a regex. It works fine when the source is a static string, but as soon as I download it via a WebClient it doesn't work. Any suggestions?
Code: Select all
Dim tReturn As New ArrayList
Dim strRegex As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s\n.*?)<\/a>"
Dim myRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
For Each myMatch As Match In myRegex.Matches(source)
    If myMatch.Success Then
        TextBox1.Text = TextBox1.Text & vbNewLine & myMatch.Groups(1).Value.Trim
    End If
Next
What I'm trying to do is get the text between <a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt"> and its closing tag </a>.
The regex works fine on a static string, but it does not match when the string is downloaded with the WebClient.

Here's part of the page it downloads. I cannot post the entire HTML file or the URL because the site is an adult site ...
Code: Select all
<ul class="genreTagList">
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt">
		  Test1                    </a>
	  </div>
	</div>
  </li>
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test2/" class="genreTag__txt">
		  Test2                    </a>
	  </div>
	</div>
  </li>
</ul>
And this is my WebClient:
Code: Select all
        Dim WClient As New Net.WebClient
        WClient.Encoding = System.Text.Encoding.UTF8
        Dim source As String = WClient.DownloadString(url)
Last edited by AnoPem on Tue Jan 05, 2016 5:11 pm, edited 1 time in total.
SumCode
Dedicated Member
Posts: 57
Joined: Fri Aug 03, 2012 2:34 am

Re: Regex search problem
SumCode
It would help if you posted what your source looks like, or a link to what your WebClient is downloading, and also the value you're trying to get.

So, for example, if the string you want to parse is "the man ate 3 burgers" and the value you want is '3', you would post both the string and the value.

I would also like to point out that regex isn't a good way to parse HTML. I would look at using HtmlAgilityPack or even just the built-in WebBrowser class.

Re: Regex search problem
AnoPem
Thanks for the reply. I have updated my main post with some of the information you asked for.

Re: Regex search problem
SumCode
Using Regex
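Well, the reason it doesn't work is that you put '\s' where there is no extra whitespace. '\s\n' needs one whitespace character directly before the line feed, and in the HTML you posted the line break comes straight after the '>'. My guess is that your static string has CRLF line endings, where the '\r' satisfies the '\s', while the downloaded page uses a bare '\n'. So you can change it to '\s?' so that it still matches when there is no whitespace. You can also shorten your regex to just '<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">\n(.*?)<\/a>'. You can even shorten the pattern further (although this depends on the HTML you get from the website) to 'genreTag__txt.+\n(.+)<\/a>'.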
My final code with regex:
Code: Select all
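' _code holds the downloaded HTML (needs Imports System.Text.RegularExpressions)
Dim r = Regex.Matches(_code, "genreTag__txt.+\n(.+)<\/a>", RegexOptions.IgnoreCase)
' Check Count first: r(0) throws when nothing matched
If r.Count > 0 Then MsgBox(r(0).Groups(1).Value.Trim())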
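If you want to see the line-ending effect for yourself, here is a small console sketch. The sample tag, the shortened href and the module name are made up for the demo, and the "relaxed" pattern with '\s*' is just one possible alternative, not something taken from your page. It runs the original capture group against the same tag once with LF and once with CRLF line endings.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LineEndingDemo
    Sub Main()
        ' Hypothetical sample: the same anchor tag with LF and with CRLF line endings.
        Dim tagLf As String = "<a href=""/x/id=Test1/"" class=""genreTag__txt"">" & vbLf & vbTab & "Test1   </a>"
        Dim tagCrLf As String = tagLf.Replace(vbLf, vbCrLf)

        ' Original capture group: needs one whitespace character before the line feed.
        Dim strict As New Regex("genreTag__txt"">(\s\n.*?)<\/a>")
        ' Relaxed variant (an assumption, not from the thread): \s* accepts LF, CRLF or no line break.
        Dim relaxed As New Regex("genreTag__txt"">\s*(.*?)\s*<\/a>")

        Console.WriteLine(strict.IsMatch(tagLf))                   ' False
        Console.WriteLine(strict.IsMatch(tagCrLf))                 ' True
        Console.WriteLine(relaxed.Match(tagLf).Groups(1).Value)    ' Test1
        Console.WriteLine(relaxed.Match(tagCrLf).Groups(1).Value)  ' Test1
    End Sub
End Module
```

With '\s*' on both sides of the group you also don't need the Trim call any more, since the padding around the text is consumed outside the capture.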
Using HtmlAgilityPack
Code: Select all
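' Requires the HtmlAgilityPack package (Imports HtmlAgilityPack)
Dim doc = New HtmlDocument
doc.LoadHtml(_code)
' SelectNodes returns Nothing when no node matches
Dim results = doc.DocumentNode.SelectNodes("//a[@class=""genreTag__txt""]")
If results IsNot Nothing Then
    Dim r1 = results(0).InnerText.Trim()
    MsgBox(r1)
End If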
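Since your original loop collected every tag rather than just the first one, the same idea can be extended to all matching nodes. This is only a sketch: the ExtractTags helper, the module name and the sample HTML are hypothetical, and it assumes the HtmlAgilityPack package is referenced in the project.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack
Imports Microsoft.VisualBasic

Module TagListDemo
    ' Hypothetical helper: returns the trimmed text of every <a class="genreTag__txt"> element.
    Function ExtractTags(html As String) As List(Of String)
        Dim tags As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim nodes = doc.DocumentNode.SelectNodes("//a[@class='genreTag__txt']")
        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        If nodes Is Nothing Then Return tags

        For Each node As HtmlNode In nodes
            tags.Add(node.InnerText.Trim())
        Next
        Return tags
    End Function

    Sub Main()
        ' Made-up sample in the same shape as the posted HTML.
        Dim sample As String =
            "<ul><li><a href=""/id=Test1/"" class=""genreTag__txt"">" & vbLf & "  Test1   </a></li>" &
            "<li><a href=""/id=Test2/"" class=""genreTag__txt"">" & vbLf & "  Test2   </a></li></ul>"

        Console.WriteLine(String.Join(", ", ExtractTags(sample)))
    End Sub
End Module
```

Because the parser works on the document tree, the line endings in the downloaded string no longer matter here; only the Trim call is needed to drop the padding around the text.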

Re: Regex search problem
AnoPem
That seems to work, thank you very much