Regex search problem

AnoPem
VIP - Donator
Posts: 441

Regex search problem

Mon Jan 04, 2016 10:02 pm

Hello, I'm having problems searching my string using regex. It works fine when the source is a static string, but as soon as I download it via a WebClient it doesn't work. Any suggestions?

Code: Select all

Dim tReturn As New ArrayList
Dim strRegex As String = "<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">(\s\n.*?)<\/a>"
Dim myRegex As New Regex(strRegex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
For Each myMatch As Match In myRegex.Matches(source)
    If myMatch.Success Then
        TextBox1.Text = TextBox1.Text & vbNewLine & myMatch.Groups(1).Value.Trim
    End If
Next
What I'm trying to do is get the text between <a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt"> and its closing tag </a>.
The regex search works fine on a static string, but it does not work when the string is downloaded with the WebClient.

Here's part of the page it downloads. I cannot post the entire HTML file or the URL because the site is an adult site.

Code: Select all

<ul class="genreTagList">
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test1/" class="genreTag__txt">
		  Test1                    </a>
	  </div>
	</div>
  </li>
  <li class="genreTagList__item">
	<div class="m-genreTag">
	  <div class="genreTag__item">
		<a href="/dc/doujin/-/list/=/article=keyword/id=Test2/" class="genreTag__txt">
		  Test2                    </a>
	  </div>
	</div>
  </li>
</ul>
And this is my WebClient code:

Code: Select all

        Dim WClient As New Net.WebClient
        WClient.Encoding = System.Text.Encoding.UTF8
        Dim source As String = WClient.DownloadString(url)
Last edited by AnoPem on Tue Jan 05, 2016 6:11 pm, edited 1 time in total.

SumCode
Dedicated Member
Posts: 57

Re: Regex search problem

Tue Jan 05, 2016 5:44 pm

It would help if you posted what your source looks like, or the link to what your WebClient is downloading, and also the value you're trying to get.

So, for example, if the string you want to parse is "the man ate 3 burgers" and the value you want is '3', you would post both the string and the value.

I would also like to point out that regex isn't a good way to parse HTML. I would look at using HtmlAgilityPack or even just the built-in WebBrowser class.

AnoPem
VIP - Donator
Posts: 441

Re: Regex search problem

Tue Jan 05, 2016 6:11 pm

Thanks for the reply. I have updated my main post with some of the information you asked for.

SumCode
Dedicated Member
Posts: 57

Re: Regex search problem

Tue Jan 05, 2016 7:01 pm

Using Regex
Well, the reason it doesn't work is that you put '\s' where there is no extra whitespace. You can change it to '\s?' so that it still matches when there is no whitespace. You can also shorten your regex to just '<a href=""\/dc\/doujin\/-\/list\/=\/article=keyword\/id=.*\/"" class=""genreTag__txt"">\n(.*?)<\/a>'. You can even shorten your pattern further (although this depends on the HTML you get from the website) to 'genreTag__txt.+\n(.+)<\/a>'.
My final code with regex:

Code: Select all

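' Match the class name and the rest of that line, then capture the next line up to </a>.
Dim r = Regex.Matches(_code, "genreTag__txt.+\n(.+)<\/a>", RegexOptions.IgnoreCase)
' Guard against an empty result; r(0) throws when nothing matched.
If r.Count > 0 Then MsgBox(r(0).Groups(1).Value.Trim())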
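If the downloaded page uses different line endings or indentation than the static test string, a pattern that hard-codes '\n' can stop matching. Here is a minimal sketch of a more tolerant variant that also loops over every match. The sample HTML, the module name, and the '\s*' pattern are my own assumptions for illustration, not something taken from the real site:

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module GenreTagRegexDemo
    Sub Main()
        ' Hypothetical sample standing in for the downloaded page.
        ' The first link uses CRLF and the second uses LF, to mimic mixed line endings.
        Dim html As String =
            "<a href=""/dc/doujin/-/list/=/article=keyword/id=Test1/"" class=""genreTag__txt"">" & vbCrLf &
            "          Test1                    </a>" & vbCrLf &
            "<a href=""/dc/doujin/-/list/=/article=keyword/id=Test2/"" class=""genreTag__txt"">" & vbLf &
            "          Test2                    </a>"

        ' \s* absorbs any mix of spaces, tabs, CR and LF around the link text,
        ' so the captured group holds only the text itself.
        Dim pattern As String = "class=""genreTag__txt"">\s*(.*?)\s*</a>"

        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

With this sample it should print Test1 and Test2 on separate lines. It still assumes the link text sits on a single line, because '.' does not match a newline unless RegexOptions.Singleline is set.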
Using HtmlAgilityPack

Code: Select all

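' Requires the HtmlAgilityPack package and "Imports HtmlAgilityPack" at the top of the file.
Dim doc = New HtmlDocument
doc.LoadHtml(_code)
' SelectNodes returns Nothing when no node matches the XPath.
Dim results = doc.DocumentNode.SelectNodes("//a[@class=""genreTag__txt""]")
If results IsNot Nothing Then MsgBox(results(0).InnerText.Trim())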
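The snippet above only shows the first tag. To collect every tag, as the original loop tries to do, something along these lines might work. This is a rough sketch that assumes the HtmlAgilityPack package is referenced; ExtractGenreTags and GenreTagParser are hypothetical names of my own:

```vb
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module GenreTagParser
    ' Hypothetical helper: returns the trimmed text of every genreTag__txt link.
    Function ExtractGenreTags(html As String) As List(Of String)
        Dim tags As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes = doc.DocumentNode.SelectNodes("//a[@class='genreTag__txt']")
        If nodes Is Nothing Then Return tags

        For Each node As HtmlNode In nodes
            ' InnerText includes the surrounding line break and indentation, so trim it.
            tags.Add(node.InnerText.Trim())
        Next
        Return tags
    End Function
End Module
```

In the form from the first post it could then be used as TextBox1.Text = String.Join(vbNewLine, ExtractGenreTags(source)). Note that the XPath test is an exact match on the class attribute, so it would miss links that carry additional classes.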

AnoPem
VIP - Donator
Posts: 441

Re: Regex search problem

Tue Jan 05, 2016 7:14 pm

That seems to work, thank you very much