creeper
Posts: 24
Joined: Wed Sep 21, 2005 1:28 pm
Location: Paris, France

It's working with WebScrapeSettings but the page is blank

Post by creeper » Thu Oct 27, 2005 2:56 pm

Hi,

I'm trying to scrape a web page. The following code works perfectly in WebScrapeSettings, but when I load the .ini file, the page is blank in Awasu.
Any idea why?

Thanks

Code:

[ChannelParameters]
URL=http://www.smartbrief.com/allAccess/industryNews.jsp
Title=GMA Smartbrief
Description=
BaseUrl=http://www.smartbrief.com
MaxItems=15
Shorthand=
SectionPattern=
ItemPattern-1=<h3 class="abstractHeadlineDetail"><a.*?href="(?P<L>.*?)"
ItemPattern-2=.*?target="_blank">(?P<T>.*?)</a></h3>
ItemPattern-3=.*?<p class="abstractCopyDetail">(?P<D>.*?)</p>
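
One way to check the three ItemPattern fragments outside Awasu is to join them and run them over a saved copy of the page. The sketch below assumes the fragments are simply concatenated into one regular expression and matched with "dot matches newline" semantics; that is a guess about how WebScrape combines them, and the sample HTML is made up for illustration rather than taken from the site.

```python
import re

# The three ItemPattern fragments from the .ini file above.
# Assumption: they are concatenated into a single expression.
ITEM_PATTERN = (
    r'<h3 class="abstractHeadlineDetail"><a.*?href="(?P<L>.*?)"'
    r'.*?target="_blank">(?P<T>.*?)</a></h3>'
    r'.*?<p class="abstractCopyDetail">(?P<D>.*?)</p>'
)

# Hypothetical sample markup shaped like what the patterns expect.
SAMPLE_HTML = """
<h3 class="abstractHeadlineDetail"><a href="/news/1.jsp"
   target="_blank">First headline</a></h3>
<p class="abstractCopyDetail">First summary.</p>
<h3 class="abstractHeadlineDetail"><a href="/news/2.jsp"
   target="_blank">Second headline</a></h3>
<p class="abstractCopyDetail">Second summary.</p>
"""


def scrape_items(html, max_items=15):
    """Return up to max_items dicts with link, title and description."""
    regex = re.compile(ITEM_PATTERN, re.DOTALL)
    items = []
    for match in regex.finditer(html):
        items.append({
            "link": match.group("L"),
            "title": match.group("T"),
            "description": match.group("D"),
        })
        if len(items) >= max_items:
            break
    return items


items = scrape_items(SAMPLE_HTML)
for item in items:
    print(item["link"], "|", item["title"], "|", item["description"])
```

If the patterns match a copy of the page saved from the browser but the channel still comes up empty, it is worth checking whether the plugin is being served the same HTML as the browser.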

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Thu Oct 27, 2005 6:44 pm

When I tried the URL in my browser, the page that came up wanted an e-mail address to confirm I am one of their subscribers (I am not). Evidently they depend on a cookie, which WebScrape knows nothing about.

Taka and I exchanged some forum messages about DownloadUrl some time back. Taka, will cookies indeed be handled correctly if I update WebScrape to use DownloadUrl?

But...before I can use DownloadUrl, please confirm the plugin's .ini file will contain DownloadUrlEncoding (or equivalent) as we once discussed; I didn't see anything in the documentation about it.

Thanks

Allan

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Thu Oct 27, 2005 6:53 pm

I just did a test.

"HttpHeader_Content-Type=..." is returned in [DownloadUrl Response].

Looks like this requirement is met. :)

(Perhaps a note should be added to the docs.)

Thanks

Allan

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Thu Oct 27, 2005 10:39 pm

I had a look at the WebScrape code. Apparently I never switched to use DownloadUrl because of the UTCoffset-URL feature that was added to WebScrape some time ago.

Taka, what is the right way to use DownloadUrl and support the UTCoffset feature?

Thanks

Allan

creeper
Posts: 24
Joined: Wed Sep 21, 2005 1:28 pm
Location: Paris, France

Post by creeper » Fri Oct 28, 2005 7:39 am

Thanks a lot for looking.

So if I understand correctly, WebScrape can't handle the authentication process (or the cookie) yet. Note that the subscription is free and the site is not a secure server.

Thanks and please let me know if there is a way to go around this.
Cheers

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Post by support » Fri Oct 28, 2005 8:47 am

abwilson wrote:Taka, will cookies indeed be handled correctly if I update WebScrape to use DownloadUrl?


Awasu uses WinInet to do the downloads so cookies *should* be handled.

abwilson wrote:But...before I can use DownloadUrl, please confirm the plugin's .ini file will contain DownloadUrlEncoding (or equivalent) as we once discussed; I didn't see anything in the documentation about it.


I don't remember talking about this. Content-Type, yes, and that's still in. All the headers received will be saved to the [DownloadUrl Response] section with key names of the form HttpHeader_XXX, where XXX is the header name. So it may well be in there.
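
A plugin could pick the headers out of that section roughly as follows. This is only a sketch: the section name and the HttpHeader_ prefix come from the description above, while the sample file contents, the function names and the charset parsing are assumptions made for illustration.

```python
import configparser

# Hypothetical contents of the .ini file handed to a plugin; only the
# section name and the HttpHeader_ prefix come from the discussion above.
SAMPLE_INI = """
[DownloadUrl Response]
HttpHeader_Content-Type=text/html; charset=ISO-8859-1
HttpHeader_Server=ExampleServer
"""

PREFIX = "HttpHeader_"


def read_response_headers(ini_text):
    """Return {header name: value} from the [DownloadUrl Response] section."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(ini_text)
    headers = {}
    if parser.has_section("DownloadUrl Response"):
        for key, value in parser.items("DownloadUrl Response"):
            if key.startswith(PREFIX):
                headers[key[len(PREFIX):]] = value
    return headers


def charset_from_content_type(content_type, default=None):
    """Extract the charset parameter from a Content-Type value, if present."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"')
    return default


headers = read_response_headers(SAMPLE_INI)
encoding = charset_from_content_type(headers.get("Content-Type", ""))
print(headers)
print(encoding)
```

Deriving the encoding from Content-Type this way would stand in for a separate DownloadUrlEncoding key, provided the server actually sends a charset.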

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Post by support » Fri Oct 28, 2005 8:50 am

abwilson wrote:Taka, what is the right way to use DownloadUrl and support the UTCoffset feature?


Just have a parameter called DownloadUrl and Awasu will download that file before invoking the plugin. You will be passed the location of the downloaded file via the DownloadUrlFile key in the [System] section (also in the [DownloadUrl Response] section).
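
On the plugin side, finding the downloaded page might look like the sketch below. It assumes the plugin already knows the path of the .ini file prepared for it; the function name and the temporary files in the demonstration are hypothetical, and only the DownloadUrlFile key and the two section names come from the description above.

```python
import configparser
import os
import tempfile


def downloaded_file_path(ini_path):
    """Return the DownloadUrlFile value from the plugin's .ini file, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read(ini_path, encoding="utf-8")
    # The key is described as present in both sections; try each in turn.
    for section in ("System", "DownloadUrl Response"):
        if parser.has_option(section, "DownloadUrlFile"):
            return parser.get(section, "DownloadUrlFile")
    return None


# Demonstration with made-up files standing in for what Awasu would supply.
with tempfile.TemporaryDirectory() as tmp:
    page_path = os.path.join(tmp, "page.html")
    with open(page_path, "w", encoding="utf-8") as f:
        f.write("<html>downloaded page</html>")

    ini_path = os.path.join(tmp, "plugin.ini")
    with open(ini_path, "w", encoding="utf-8") as f:
        f.write("[System]\nDownloadUrlFile=" + page_path + "\n")

    found = downloaded_file_path(ini_path)
    with open(found, encoding="utf-8") as f:
        page = f.read()

print(page)
```

The plugin then scrapes the contents of that file instead of fetching the URL itself, which is what would let the download go through WinInet.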

What's UTCoffset and why is it causing a problem?

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Fri Oct 28, 2005 3:28 pm

I'm comfortable using "normal" DownloadUrl, thanks.

UTCoffset is related to the feature in WebScrape and WebScrapeSettings that allows the user to specify a URL that contains elements of date and time -- for example, a cartoon-of-the-day site. I don't know the proper way to pass such a "dynamic" URL to DownloadUrl.

Thanks

Allan

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Fri Oct 28, 2005 3:33 pm

creeper wrote:Thanks and please let me know if there is a way to go around this.


Sorry, but right now I don't have a workaround. Until I can implement the UTCoffset feature with it I cannot use DownloadUrl -- which should solve the cookie problem and allow your scrape to work.

Allan

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Fri Oct 28, 2005 3:41 pm

To be specific about passing a dynamic URL to DownloadUrl, will DownloadUrl accept and "expand" this example from the WebScrape Release Notes?

http://www.newssite.com/folder/%Y-%m-%d/news.html

If so, how is the UTCoffset parameter communicated?

Thanks

Allan

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Post by support » Fri Oct 28, 2005 4:03 pm

abwilson wrote:UTCoffset is related to the feature in WebScrape and WebScrapeSettings that allows the user to specify a URL that contains elements of date and time -- for example, a cartoon-of-the-day site.


Ah, now I remember :-)

<snip>long explanation of possible workarounds deleted</snip>

About a year ago, the thought crossed my mind that Awasu had grown so big that there were parts of the program where I wouldn't be able to explain to someone, off the top of my head, how they worked. I would have to check the code :oops:. Well, it seems that Awasu does strftime() processing on DownloadUrl, and has done since July '04 (around the time of this post) :-)

The format is slightly different: you have to use Awasu-style template parameters, but the actual parameter names are exactly the same as for strftime(), e.g. use {%m%} for the month number, {%M%} for minutes, etc.
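
To illustrate the syntax, here is a rough sketch of how such an expansion could behave. It is not Awasu's code: the regular expression, the function name and the restriction to single-letter parameters are assumptions, and the URL is the release-notes example rewritten with {%...%} parameters.

```python
import re
from datetime import datetime

# Assumption: a template parameter is one strftime letter wrapped in {% and %}.
TEMPLATE_PARAM = re.compile(r"\{%([A-Za-z])%\}")


def expand_download_url(url_template, when):
    """Replace each {%X%} with the result of strftime('%X') for the given time."""
    return TEMPLATE_PARAM.sub(
        lambda match: when.strftime("%" + match.group(1)),
        url_template,
    )


template = "http://www.newssite.com/folder/{%Y%}-{%m%}-{%d%}/news.html"
url = expand_download_url(template, datetime(2005, 10, 28, 16, 3))
print(url)  # http://www.newssite.com/folder/2005-10-28/news.html
```

Letters that strftime() does not define behave differently from platform to platform, so a real implementation would probably want to validate them first.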

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Post by support » Fri Oct 28, 2005 4:05 pm

abwilson wrote:If so, how is the UTCoffset parameter communicated?


What exactly is UTCoffset? Is it for time-zone/daylight saving?

If so, we can just add another special parameter called DownloadUrl_UtfOffset that Awasu can use when doing strftime() processing on the download URL.

Is it even that important to have?

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Fri Oct 28, 2005 5:20 pm

UTCoffset (a value in minutes -- hours don't work because some timezones use 0.5 hour increments) is the way we decided to adjust for the difference in a WebScrape user's date/time vs. a site's date/time. Basically we need to work with a site's date/time to construct the right URLs.

I would be happy with "DownloadUrl_UtcOffset" (note respelling).

Should I go ahead and change WebScrape appropriately?

Thanks

Allan

support
Site Admin
Posts: 3065
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Post by support » Fri Oct 28, 2005 6:08 pm

abwilson wrote:UTCoffset (a value in minutes -- hours don't work because some timezones use 0.5 hour increments) is the way we decided to adjust for the difference in a WebScrape user's date/time vs. a site's date/time.


That's what I figured. It's probably going to be a while until the next cut of Awasu comes out - we're flat out getting 2.2 ready and we have a lot of other customization work piling up :-( (actually, that should be :-)).

If the timezone is not correct, things'll still work; it just means that stuff will maybe arrive a bit later than it should. If you change WebScrape to use DownloadUrl and also look for DownloadUrl_UtcOffset, I'll update Awasu when I can. If creeper sends me an email, I'll get an interim version out to him when it's ready.

Just let me know what units DownloadUrl_UtcOffset will be in (minutes would probably be best) and which direction the time offset should be going.

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Post by abwilson » Fri Oct 28, 2005 8:50 pm

Sounds good. I'll proceed to make the changes.

DownloadUrl_UtcOffset should indeed be in minutes and the sign is the same as the timezone offset from UTC (GMT); so -(8 * 60) = -480 would be the value for DownloadUrl_UtcOffset for San Francisco.
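
As a sanity check on the sign convention, the arithmetic can be sketched like this. The function name is hypothetical, and the +330 value (a zone five and a half hours ahead of UTC) is added only to show why minutes are used instead of hours.

```python
from datetime import datetime, timedelta, timezone


def site_time(utc_now, utc_offset_minutes):
    """Convert an aware UTC datetime to the site's local time.

    utc_offset_minutes has the same sign as the zone's offset from UTC,
    so a zone eight hours behind UTC is -480.
    """
    site_zone = timezone(timedelta(minutes=utc_offset_minutes))
    return utc_now.astimezone(site_zone)


utc_now = datetime(2005, 10, 29, 4, 50, tzinfo=timezone.utc)

san_francisco = site_time(utc_now, -(8 * 60))  # -480
half_hour_zone = site_time(utc_now, 330)       # illustrative UTC+5:30 zone

print(san_francisco.strftime("%Y-%m-%d %H:%M"))   # 2005-10-28 20:50
print(half_hour_zone.strftime("%Y-%m-%d %H:%M"))  # 2005-10-29 10:20
```

Note that the two results fall on different calendar dates, which is exactly the case where a date-based URL would otherwise point at the wrong page.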

I'll send you the new version and creeper can get it from you.

Thanks!

Allan
