WebScrape question regarding forward slash encoding

Posted: Wed Aug 02, 2006 3:06 am
by sag
I created a .ini file with which WebScrape works correctly on normal links, but which fails on unusually encoded links.

Example of a working link:
HREF="http://www.sfgate.com/cgi-bin/article.cgi?f=/n/a/2006/08/01/sports/s132535D95.DTL&type=golf"

Example of a problem link:
HREF="http://www.golfweek.com/312958777640113.php"

The error message is: "http://www.&.com cannot be found"

I've tried to post my .ini, but not all the characters are coming through after hitting preview. I'll at least post the URL I'm trying to scrape. Thank you.

golfobserver.com/preview/golfnotebook/golfnotebook_080106.html
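
One possible cause, offered only as a guess because the problem link's original encoding did not survive posting: if a page writes the slashes in an HREF as HTML character references (for example `&#47;`), a scraper that does not decode them may trip over the `&`. A minimal Python sketch of decoding such a link before using it follows; the function name `decode_href` and the sample link are hypothetical and not taken from the page.

```python
import html


def decode_href(raw_href: str) -> str:
    """Decode HTML character references (such as &#47; or &amp;) in an href value."""
    return html.unescape(raw_href)


# Hypothetical entity-encoded link, made up for illustration only.
encoded = "http:&#47;&#47;www.example.com&#47;story.php?a=1&amp;b=2"
print(decode_href(encoded))
```

A link that contains no character references passes through unchanged, so the same step can be applied to every extracted link.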

Re: WebScrape question regarding forward slash encoding

Posted: Wed Aug 02, 2006 3:27 am
by support
sag wrote: I've tried to post my .ini, but not all the characters are coming through after hitting preview.

If you email your INI file to me, I'll insert it into your post.

Re: WebScrape question regarding forward slash encoding

Posted: Thu Aug 03, 2006 1:11 am
by support
This is the INI file:

[ChannelParameters]
URL=http://www.golfobserver.com/preview/golfnotebook/golfnotebook_080106.html
Title=Golf Notebook
Description=LPGA section
BaseUrl=
MaxItems=60
Shorthand=
SectionPattern=<a name="Weetabix"></a>(.*?)<!--spacer rule-->
ItemPattern-1=<A CLASS="Link" HREF="(?P<L>.*?)"(.*?)>(?P<T>.*?)<BR><FONT CLASS="NewsSource"><B><I>(?P<D>.*?)</I></B></FONT></p>
ItemPattern-2=
ItemPattern-3=

I had trouble getting it in as well :-) I eventually did it by checking Disable HTML in this post.
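
For anyone who wants to check ItemPattern-1 outside WebScrape: the `(?P<name>...)` named-group syntax it uses is also the form accepted by Python's `re` module, so the pattern can be tried there. The sketch below makes several assumptions: that L, T and D stand for link, title and description (the thread does not say), that the pattern is meant to be a single expression, and that the sample HTML, which is made up to fit the pattern, resembles the real page.

```python
import re

# ItemPattern-1 from the INI file, joined into a single expression.
ITEM_PATTERN = re.compile(
    r'<A CLASS="Link" HREF="(?P<L>.*?)"(.*?)>(?P<T>.*?)<BR><FONT '
    r'CLASS="NewsSource"><B><I>(?P<D>.*?)</I></B></FONT></p>',
    re.DOTALL,  # let ".*?" span line breaks inside an item
)

# Hypothetical markup written to fit the pattern; not copied from the site.
SAMPLE_HTML = (
    '<A CLASS="Link" HREF="http://www.example.com/story.php" TARGET="_blank">'
    'Sample headline<BR><FONT CLASS="NewsSource"><B><I>Example Source</I></B>'
    '</FONT></p>'
)


def extract_items(page: str) -> list[dict]:
    """Return one dict of the named groups (L, T, D) per matched item."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(page)]


for item in extract_items(SAMPLE_HTML):
    print(item["L"], "|", item["T"], "|", item["D"])
```

Whatever ends up in the L group is used as captured, so any unusual encoding in the HREF would be carried into the link unless it is decoded separately.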

Posted: Fri Aug 04, 2006 7:55 pm
by sag
The web site in question has now undergone a substantial redesign, and so far all of the links contain normal forward slashes in their URLs. I don't need any further help with golfobserver.com at this time. Thanks.