Advice for scraping a website.

Nemo
Posts: 806
Joined: Thu Jan 21, 2010 3:23 am
Location: Canada

Post by Nemo » Thu Oct 05, 2017 2:55 pm

An academic website I am fond of may go offline. I would hate for some of the things on it to be lost. I would like to scrape the content to store it offline. Perhaps even print a copy. I have the author's permission. Can anyone point me in the right direction?

DNS
Site Admin
Posts: 2900
Joined: Sun Apr 05, 2009 4:23 pm
Location: Las Vegas, Nevada

Re: Advice for scraping a website.

Post by DNS » Thu Oct 05, 2017 3:56 pm

1. Copy and paste the plain text. Right click and save images.

2. A more permanent, better copy would be to get the HTML code. You can do that by right-clicking a page, choosing View Page Source, and then copying and pasting that code. You could then use that code to restart the website, if you choose.
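
For anyone who would rather script option 2 than copy and paste by hand, here is a minimal sketch using only Python's standard library. The function name `save_page_source` and the example address in the comment are made up for illustration. It saves exactly what the server sends, so images, stylesheets and scripts referenced by the page are not included.

```python
import pathlib
import urllib.request


def save_page_source(url: str, dest: str) -> int:
    """Fetch the raw bytes served at `url` and write them unchanged to `dest`.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    pathlib.Path(dest).write_bytes(data)
    return len(data)


# Example call (hypothetical address; use a page you have permission to copy):
#     save_page_source("https://example.org/", "index.html")
```

Writing the bytes unchanged, rather than decoding them to text first, avoids guessing the page's character encoding.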

boda
Posts: 1667
Joined: Thu Jul 03, 2014 8:40 pm

Re: Advice for scraping a website.

Post by boda » Thu Oct 05, 2017 6:35 pm

Google something like: website scraping app.

Queequeg
Global Moderator
Posts: 4346
Joined: Tue Jul 03, 2012 3:24 pm

Re: Advice for scraping a website.

Post by Queequeg » Thu Oct 05, 2017 6:41 pm

Nemo wrote:
Thu Oct 05, 2017 2:55 pm
An academic website I am fond of may go offline. I would hate for some of the things on it to be lost. I would like to scrape the content to store it offline. Perhaps even print a copy. I have the author's permission. Can anyone point me in the right direction?
Is the site on archive.org, AKA the Wayback Machine? If not, I think you can ask the admins to put it on their radar.
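
A quick way to check is the Wayback Machine's availability endpoint. The sketch below is only that, a sketch: the endpoint address and the response layout (`archived_snapshots` → `closest` → `url`) are written from my understanding of the public API and should be checked against the Internet Archive's own documentation. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str) -> str:
    """Build the query URL that asks whether `site` has an archived copy."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": site})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the snapshot URL from a decoded response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(site: str) -> Optional[str]:
    """Query the endpoint over the network and return the closest snapshot URL."""
    with urllib.request.urlopen(availability_url(site)) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("example.org")` needs an internet connection; a `None` result would mean no snapshot was reported for that address.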
“Once you have given up the ghost, everything follows with dead certainty, even in the midst of chaos.”
-Henry Miller

"Language is the liquid that we're all dissolved in.
Great for solving problems, after it creates the problems."
-Modest Mouse

"Wake up to find out that you are the eyes of the world!"
-The Grateful Dead

Mantrik
Posts: 569
Joined: Sun Apr 09, 2017 8:55 pm

Re: Advice for scraping a website.

Post by Mantrik » Thu Oct 05, 2017 6:47 pm

Nemo wrote:
Thu Oct 05, 2017 2:55 pm
An academic website I am fond of may go offline. I would hate for some of the things on it to be lost. I would like to scrape the content to store it offline. Perhaps even print a copy. I have the author's permission. Can anyone point me in the right direction?
If it uses a host such as WordPress, it may be possible to ask the site owner to download an .XML file, which essentially backs up the site and can be restored using the same type of host.
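
If the owner does send such an export, the posts can be read straight out of it. This is a rough sketch that assumes the file follows the usual WordPress export layout: an RSS document with one `item` per post and the body in a `content:encoded` element. The function name `list_posts` is hypothetical, and other hosts may use a different layout.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the <content:encoded> element of each post.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def list_posts(export_xml):
    """Return (title, body_html) pairs for every <item> in the export.

    `export_xml` may be the file's bytes or its text.
    """
    root = ET.fromstring(export_xml)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext(CONTENT_NS + "encoded", default="")
        posts.append((title, body))
    return posts


# Example call (hypothetical file name):
#     import pathlib
#     for title, body in list_posts(pathlib.Path("export.xml").read_bytes()):
#         print(title, len(body))
```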
http://www.khyung.com

Micchāmi Dukkaḍaṃ (मिच्छामि दुक्कडम्)

Norwegian
Posts: 1043
Joined: Thu Dec 01, 2011 7:36 pm

Re: Advice for scraping a website.

Post by Norwegian » Thu Oct 05, 2017 6:58 pm

Install wget. There may be a GUI version as well as the command-line one. Pick what you're comfortable with. Read up on the command-line options via the help command (wget --help) and other guides online. It's excellent.

Thomas Amundsen
Posts: 1688
Joined: Thu Feb 03, 2011 2:50 am
Location: Los Angeles, CA

Re: Advice for scraping a website.

Post by Thomas Amundsen » Thu Oct 05, 2017 7:27 pm

https://scrapy.org/ if you have any coding skills

dzogchungpa
Posts: 5629
Joined: Sat May 28, 2011 10:50 pm

Re: Advice for scraping a website.

Post by dzogchungpa » Thu Oct 05, 2017 8:00 pm

There's something called HTTrack that seems to be well-reviewed, but I've never tried it.
If you focus on an object, you are not meditating. - Dudjom Rinpoche

Kim O'Hara
Former staff member
Posts: 3521
Joined: Fri Nov 16, 2012 1:09 am
Location: North Queensland, Australia

Re: Advice for scraping a website.

Post by Kim O'Hara » Thu Oct 05, 2017 11:33 pm

DNS wrote:
Thu Oct 05, 2017 3:56 pm
1. Copy and paste the plain text. Right click and save images.

2. A more permanent, better copy would be to get the HTML code. You can do that by right-clicking a page, choosing View Page Source, and then copying and pasting that code. You could then use that code to restart the website, if you choose.
"Right-clicking a page" to get the page source code doesn't work in all browsers - you may need to go to the "View" menu or "Tools". And these days a lot of sites assemble their pages on the fly by calling individual files as needed, so try both methods on one page before deciding which to use for the whole site.
I tried to use http://ricks-apps.com Site Sucker to grab a WordPress site some time ago and I ended up with a disorganised mess of unconnected files. :toilet:
Copying and pasting pages - contents - was tedious but far more satisfactory.
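
To see how much of a page lives in those individual files, a small sketch like the following lists what a saved page source points at. It uses only Python's standard library; the class and function names are made up for illustration, and it only looks at a few common tags (`img`, `script`, `link`, `iframe`), so anything loaded later by scripts will not show up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetLister(HTMLParser):
    """Collect the addresses of files a page asks the browser to fetch."""

    # Tag name -> attribute that points at a separately fetched file.
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the page's own address.
                self.assets.append(urljoin(self.base_url, value))


def list_assets(html_text: str, base_url: str) -> list:
    """Return absolute URLs of the assets referenced in `html_text`."""
    parser = AssetLister(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.assets
```

A long list here suggests that pasting the page source alone will leave a lot behind.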

+1 on waybackmachine, by the way. I've often found it useful.

:namaste:
Kim

Wayfarer
Posts: 3518
Joined: Sun May 27, 2012 8:31 am
Location: Sydney AU

Re: Advice for scraping a website.

Post by Wayfarer » Fri Oct 06, 2017 12:31 am

There used to be a tool called Web Whacker for doing exactly that. I used it 10 years ago and it worked then.
In the beginner's mind there are many possibilities; in the expert's mind there are few ~ Suzuki-roshi

narhwal90
Posts: 511
Joined: Mon Jan 25, 2016 3:10 am

Re: Advice for scraping a website.

Post by narhwal90 » Fri Oct 06, 2017 1:47 am

wget is a good tool for sites not too far gone down the web2.0/mobile rathole. If you're running Linux it's probably already installed; on OS X you may need to install it yourself, and on Windows I imagine it's been ported. Invoke it to perform a recursive get and enable retries. Depending on how the site's HTML links are coded, the downloaded copy may or may not be usable on machines other than its original webhost. OTOH wget is free and easy to use, so it's worth a try regardless - all you need is a computer to run it on and an internet link.
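
As a concrete starting point, here is one way the recursive get with retries might be put together from a small Python wrapper. The flags are standard wget options, but the particular combination, the defaults, and the function names are my own assumptions, not something the site requires. `--convert-links` is included because it rewrites links in the saved pages to point at the local copies, which helps with the portability problem mentioned above.

```python
import subprocess


def build_wget_command(url: str, dest_dir: str = "mirror",
                       tries: int = 5, wait_seconds: int = 1) -> list:
    """Assemble a wget argument list for a polite recursive download."""
    return [
        "wget",
        "--recursive",              # follow links within the site
        "--level=inf",              # no limit on recursion depth
        "--page-requisites",        # also fetch images, CSS and scripts
        "--convert-links",          # rewrite links to work from the local copy
        "--adjust-extension",       # save HTML pages with an .html suffix
        "--no-parent",              # never climb above the starting directory
        f"--tries={tries}",         # retry failed downloads
        f"--wait={wait_seconds}",   # pause between requests
        f"--directory-prefix={dest_dir}",
        url,
    ]


def run_mirror(url: str, dest_dir: str = "mirror") -> None:
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url, dest_dir), check=True)


# Example call (hypothetical address; wget must be installed and on the PATH):
#     run_mirror("https://example.org/", "backup")
```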

On a web2.0 site, scraping is probably not the best approach; an export or backup from the publishing tool or system will probably work better, or at least stand a better chance of being hosted elsewhere.
