Advice for scraping a website.

Post by Nemo » Thu Oct 05, 2017 2:55 pm

An academic website I am fond of may go offline. I would hate for some of the things on it to be lost. I would like to scrape the content to store it offline. Perhaps even print a copy. I have the author's permission. Can anyone point me in the right direction?

Post by DNS » Thu Oct 05, 2017 3:56 pm

1. Copy and paste the plain text. Right click and save images.

2. A more permanent, better copy would be to get the HTML code. You can do that by right-clicking a page, choosing View Page Source, and then copying and pasting that code. You could then use that code to restart the website, if you choose.
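
For more than a handful of pages, the "view source, copy, paste" step can be scripted. Here is a minimal Python sketch of that idea, not a full scraper. The function names, the `site_copy` folder, and the example.org address are all made up for illustration, and it only saves the raw HTML of one page (no images or stylesheets).

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url: str) -> str:
    """Turn a page URL into a flat local filename ending in .html."""
    parsed = urlparse(url)
    # Use the URL path as the name, or "index" for the site root.
    name = parsed.path.strip("/").replace("/", "_") or "index"
    if not name.endswith((".html", ".htm")):
        name += ".html"
    return name


def save_page_source(url: str, out_dir: str = "site_copy") -> Path:
    """Download one page and write its raw HTML bytes into out_dir."""
    target = Path(out_dir) / filename_for(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Calling `save_page_source("http://example.org/papers/intro")` (a placeholder address) would write the page to `site_copy/papers_intro.html`. Pages that are assembled on the fly by scripts may come out incomplete this way.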

Post by boda » Thu Oct 05, 2017 6:35 pm

Google something like: website scraping app.

Post by Queequeg » Thu Oct 05, 2017 6:41 pm

Is the site on archive.org, aka the Wayback Machine? If not, I think you can ask the admins to put it on their radar.
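
One way to check whether the site is already archived is the Wayback Machine's availability endpoint, which returns a small JSON document. The sketch below assumes that endpoint and its `archived_snapshots` / `closest` response layout. The function names are made up for illustration.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the lookup URL for a given site address."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': site})}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the archived copy's URL from a response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_site(site: str) -> Optional[str]:
    """Ask the Wayback Machine whether it holds a snapshot of the site."""
    with urlopen(availability_url(site), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

If `check_site(...)` returns `None`, the site has probably not been captured yet.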

Post by Mantrik » Thu Oct 05, 2017 6:47 pm

If it runs on a platform such as WordPress, you could ask the site owner to download an XML export file, which essentially backs up the site's content and can be restored on the same type of platform.
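
If the owner does send such an export, it can also be read directly, without restoring it anywhere. The sketch below assumes the file follows the usual WordPress layout: an RSS document with one `<item>` per post and the body in a `content:encoded` element. The `SAMPLE` text and the `list_posts` name are made up for illustration, so check them against a real export.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace assumed for the <content:encoded> element that holds post bodies.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def list_posts(export_text: str) -> List[Tuple[str, str]]:
    """Return (title, body HTML) pairs for every <item> in the export."""
    root = ET.fromstring(export_text)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext(f"{CONTENT_NS}encoded", default="")
        posts.append((title, body))
    return posts


# Hypothetical, heavily trimmed export used only to show the shape.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Hello</title>
      <content:encoded><![CDATA[<p>First post</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

for title, body in list_posts(SAMPLE):
    print(title, len(body))
```

Images and other uploads are normally referenced by URL in such a file rather than embedded, so they would still need to be saved separately.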

Post by Norwegian » Thu Oct 05, 2017 6:58 pm

Install wget. There may be a command line interface version as well as a GUI one. Pick what you're comfortable with. Read up on its command-line options via the help command (wget --help) and other guides online. It's excellent.

Post by Thomas Amundsen » Thu Oct 05, 2017 7:27 pm

https://scrapy.org/ if you have any coding skills

Post by dzogchungpa » Thu Oct 05, 2017 8:00 pm

There's something called HTTrack that seems to be well-reviewed, but I've never tried it.

Post by Kim O'Hara » Thu Oct 05, 2017 11:33 pm

DNS wrote:
Thu Oct 05, 2017 3:56 pm
1. Copy and paste the plain text. Right click and save images.

2. A more permanent, better copy would be to get the HTML code. You can do that by right-clicking a page, choosing View Page Source, and then copying and pasting that code. You could then use that code to restart the website, if you choose.
"Right-clicking a page" to get the page source code doesn't work in all browsers; you may need to go to the "View" or "Tools" menu. And these days a lot of sites assemble their pages on the fly by calling individual files as needed, so try both methods on one page before deciding which to use for the whole site.
Some time ago I tried to use SiteSucker (http://ricks-apps.com) to grab a WordPress site, and I ended up with a disorganised mess of unconnected files. :toilet:
Copying and pasting the page contents was tedious but far more satisfactory.

+1 on waybackmachine, by the way. I've often found it useful.

:namaste:
Kim

Post by Wayfarer » Fri Oct 06, 2017 12:31 am

There used to be a tool called Web Whacker for doing exactly that. I used it 10 years ago and it worked then.

Post by narhwal90 » Fri Oct 06, 2017 1:47 am

wget is a good tool for sites not too far gone down the web 2.0/mobile rathole. If you're running Linux or OS X it's probably already installed; if Windows, I imagine it's been ported. Invoke it to perform a recursive get and enable retries. Depending on how the site's HTML links are coded, the downloaded copy may or may not be usable on machines other than its original web host. OTOH wget is free and easy to use, so it's worth a try regardless: all you need is a computer to run it on and an internet link.

On a web 2.0 site, scraping is probably not the best approach; an export or backup from the publishing tool or system will probably work better, or at least stand a better chance of being hosted elsewhere.
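
The "recursive get with retries" call could look roughly like the Python wrapper below. It is a sketch: it assumes wget is installed and on the PATH, the helper names and the example.org address are made up, and the retry count and delay are arbitrary starting values. The `--convert-links` option is there because of the link problem mentioned above; it asks wget to rewrite links so the copy can be browsed locally.

```python
import subprocess
from typing import List


def build_wget_command(url: str, tries: int = 5, wait_seconds: int = 1) -> List[str]:
    """Assemble a wget call for a recursive, retrying download of one site."""
    return [
        "wget",
        "--recursive",         # follow links within the site
        "--no-parent",         # do not climb above the starting directory
        "--page-requisites",   # also fetch images and stylesheets the pages need
        "--convert-links",     # rewrite links so the copy works offline
        f"--tries={tries}",    # retry failed downloads
        f"--wait={wait_seconds}",  # pause between requests to go easy on the server
        url,
    ]


def mirror(url: str) -> None:
    """Run wget; raises CalledProcessError if wget reports a failure."""
    subprocess.run(build_wget_command(url), check=True)
```

For example, `mirror("http://example.org/")` would download into a folder named after the host in the current directory.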
