Codenstuff.com: Regex search problem

AnoPem
VIP - Donator

Posts: 441

Joined: Sat Jul 24, 2010 10:55 pm

Regex search problem
AnoPem
Mon Jan 04, 2016 9:02 pm

Hello, I'm having problems searching my string using regex. It works fine when the source is a static string, but as soon as I download it via a WebClient it doesn't work. Any suggestions?

Code:

Dim tReturn As New ArrayList
Dim strRegex As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s\n.*?)<\/a>"
Dim myRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
For Each myMatch As Match In myRegex.Matches(source)
    If myMatch.Success Then
        TextBox1.Text = TextBox1.Text & vbNewLine & myMatch.Groups(1).Value.Trim
    End If
Next

What I'm trying to do is get the text between <a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt"> and its closing tag </a>.
The regex search works fine on a static string, but it does not work when the string is downloaded with the WebClient.

Here's part of the page it downloads. I cannot post the entire HTML file or the URL, because the site is an adult site ...

Code:

<ul class="genreTagList">
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt">
		  Test1                    </a>
	  </div>
	</div>
  </li>
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test2/" class="genreTag__txt">
		  Test2                    </a>
	  </div>
	</div>
  </li>
</ul>

And this is my WebClient:

Code:

        Dim WClient As New Net.WebClient
        WClient.Encoding = System.Text.Encoding.UTF8
        Dim source As String = WClient.DownloadString(url)

Last edited by AnoPem on Tue Jan 05, 2016 5:11 pm, edited 1 time in total.

https://t.me/pump_upp

SumCode
Dedicated Member

Posts: 57

Joined: Fri Aug 03, 2012 2:34 am

Re: Regex search problem
SumCode
Tue Jan 05, 2016 4:44 pm

It would help if you posted what your source looks like, or a link to what your WebClient is downloading, and also the value you're trying to get.

For example, if the string you want to parse is "the man ate 3 burgers" and the value you want is '3', post the string and mark the part you want.

I would also like to point out that regex isn't a good way to parse HTML. I would look at using HtmlAgilityPack or even just the built-in WebBrowser class.

Re: Regex search problem
AnoPem
Tue Jan 05, 2016 5:11 pm

Thanks for the reply. I have updated my main post with some of the information you asked for.

Re: Regex search problem
SumCode
Tue Jan 05, 2016 6:01 pm

Using Regex
Well, the reason it doesn't work is that your pattern has '\s' directly before '\n', so it requires one extra whitespace character between the '>' of the opening tag and the line break. In the downloaded HTML there is no extra whitespace there, so nothing matches. You can change it to '\s?' so that it still matches when there is no whitespace. You can also shorten your regex to just '<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">\n(.*?)<\/a>'. You can even shorten the pattern further (although this depends on the HTML you get from the website) to 'genreTag__txt.+\n(.+)<\/a>'.
My final code with regex:

Code:

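' _code holds the downloaded HTML; needs Imports System.Text.RegularExpressions
Dim r = Regex.Matches(_code, "genreTag__txt.+\n(.+)<\/a>", RegexOptions.IgnoreCase)
' Guard against an empty result and trim the padding around the tag text
If r.Count > 0 Then MsgBox(r(0).Groups(1).Value.Trim())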
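
If you want every tag rather than just the first one, something along these lines might be a bit more forgiving. It is only a sketch: the module name, the ExtractTags helper and the shortened sample links are made up for illustration. It assumes one possible reason the static string behaved differently is the line endings (\r\n versus a bare \n), so it uses '\s*' to swallow whatever whitespace surrounds the text instead of expecting an exact '\n'.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module GenreTagRegex
    ' Hypothetical helper: returns the trimmed text of every genreTag__txt link.
    Function ExtractTags(html As String) As List(Of String)
        Dim tags As New List(Of String)
        ' [^>]* skips the other attributes without leaving the opening tag;
        ' \s* absorbs any mix of \r, \n, tabs and spaces around the text.
        Dim pattern As String = "<a\s[^>]*class=""genreTag__txt""[^>]*>\s*(.*?)\s*</a>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        For Each m As Match In Regex.Matches(html, pattern, options)
            tags.Add(m.Groups(1).Value)
        Next
        Return tags
    End Function

    Sub Main()
        ' Made-up sample: the first link uses a bare \n, the second uses \r\n.
        Dim sample As String =
            "<a href=""/x/id=Test1/"" class=""genreTag__txt"">" & vbLf & "  Test1   </a>" & vbCrLf &
            "<a href=""/x/id=Test2/"" class=""genreTag__txt"">" & vbCrLf & "  Test2   </a>"
        For Each tagText As String In ExtractTags(sample)
            Console.WriteLine(tagText)
        Next
    End Sub
End Module
```

Singleline is used here so that '.' can cross line breaks inside the link text; Multiline only changes how '^' and '$' behave, so it would not help with this pattern.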

Using HtmlAgilityPack

Code:

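' _code holds the downloaded HTML; needs Imports HtmlAgilityPack
Dim doc = New HtmlDocument
doc.LoadHtml(_code)
Dim results = doc.DocumentNode.SelectNodes("//a[@class=""genreTag__txt""]")
' SelectNodes returns Nothing when no node matches, so check before indexing
If results IsNot Nothing Then MsgBox(results(0).InnerText.Trim())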
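
The same idea extended to all matching links could look roughly like this. Again just a sketch: the module and the ExtractTags helper are hypothetical names, and it assumes the HtmlAgilityPack package is referenced in your project.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module GenreTagParser
    ' Hypothetical helper: collects the text of every <a class="genreTag__txt"> node.
    Function ExtractTags(html As String) As List(Of String)
        Dim tags As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        Dim nodes As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//a[@class='genreTag__txt']")
        ' SelectNodes returns Nothing when nothing matches, so return the empty list.
        If nodes Is Nothing Then Return tags
        For Each node As HtmlNode In nodes
            ' InnerText keeps the surrounding line breaks and spaces, so trim them.
            tags.Add(node.InnerText.Trim())
        Next
        Return tags
    End Function
End Module
```

Because the parser works on the element tree, line endings and indentation between the tags stop mattering, which is the main reason I would prefer it over the regex.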

Re: Regex search problem
AnoPem
Tue Jan 05, 2016 6:14 pm

That seems to work, thank you very much.
