Advice for scraping a website.
An academic website I am fond of may go offline. I would hate for some of the things on it to be lost. I would like to scrape the content to store it offline. Perhaps even print a copy. I have the author's permission. Can anyone point me in the right direction?
- DNS
- Site Admin
- Posts: 5251
- Joined: Sun Apr 05, 2009 4:23 pm
- Location: Las Vegas, Nevada, Estados Unidos de América
- Contact:
Re: Advice for scraping a website.
1. Copy and paste the plain text. Right click and save images.
2. A more permanent, better copy would be to get the HTML code. That can be done by right-clicking a page, choosing View Page Source, then copying and pasting that code. You could then use that code to restore the website, if you choose.
Re: Advice for scraping a website.
Google something like: website scraping app.
Re: Advice for scraping a website.
Is the site on archive.org AKA waybackmachine? If not, I think you can ask the admins to put it on their radar.
There is no suffering to be severed. Ignorance and klesas are indivisible from bodhi. There is no cause of suffering to be abandoned. Since extremes and the false are the Middle and genuine, there is no path to be practiced. Samsara is nirvana. No severance achieved. No suffering nor its cause. No path, no end. There is no transcendent realm; there is only the one true aspect. There is nothing separate from the true aspect.
-Guanding, Perfect and Sudden Contemplation,
Re: Advice for scraping a website.
If it uses a host such as WordPress, it may be possible to ask the site owner to download an XML file, which essentially backs up the site and can be restored on the same type of host.
http://www.khyung.com ཁྲོཾ
Om Thathpurushaya Vidhmahe
Suvarna Pakshaya Dheemahe
Thanno Garuda Prachodayath
Micchāmi Dukkaḍaṃ (मिच्छामि दुक्कडम्)
Re: Advice for scraping a website.
Install wget. It's a command-line tool, though GUI front-ends exist; pick what you're comfortable with. Read up on usage via its help output and other guides online. It's excellent.
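For what it's worth, a basic recursive wget invocation might look like the following. The command is assembled as a string here rather than executed, and the URL is a placeholder; substitute the real site before running it.

```shell
# A basic recursive copy (placeholder URL, shown as a string rather than run):
#   --recursive  follow links within the site
#   --no-parent  don't climb above the starting directory
cmd='wget --recursive --no-parent https://example.org/'
echo "$cmd"
```

Run `wget --help` first; exact option availability can vary between versions.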
- Thomas Amundsen
- Posts: 2034
- Joined: Thu Feb 03, 2011 2:50 am
- Location: Helena, MT
- Contact:
Re: Advice for scraping a website.
https://scrapy.org/ if you have any coding skills
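If you do go the Scrapy route, the command-line workflow is roughly as below. The project and spider names are made up for illustration, and the `-O` output flag assumes a reasonably recent Scrapy version.

```shell
# Scrapy's CLI workflow, shown as strings (hypothetical project/spider names):
setup='pip install scrapy && scrapy startproject sitecopy'
# After writing a spider inside the project, export crawled pages to JSON:
crawl='scrapy crawl mysite -O pages.json'
echo "$setup"
echo "$crawl"
```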
- dzogchungpa
- Posts: 6333
- Joined: Sat May 28, 2011 10:50 pm
Re: Advice for scraping a website.
There's something called HTTrack that seems to be well reviewed, but I've never tried it.
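For reference, HTTrack can be driven from the command line as well as through its GUI. A typical invocation might look like this; the URL and output folder are placeholders, shown as a string rather than executed.

```shell
# Mirror a site into ./mirror, restricted to the site's own domain:
#   -O               output directory
#   "+example.org/*" filter: only follow links within this domain
cmd='httrack "https://example.org/" -O ./mirror "+example.org/*"'
echo "$cmd"
```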
There is not only nothingness because there is always, and always can manifest. - Thinley Norbu Rinpoche
- Kim O'Hara
- Former staff member
- Posts: 7047
- Joined: Fri Nov 16, 2012 1:09 am
- Location: North Queensland, Australia
Re: Advice for scraping a website.
DNS wrote: ↑Thu Oct 05, 2017 3:56 pm
1. Copy and paste the plain text. Right click and save images.
2. A more permanent, better copy would be to get the HTML code. That can be done by right-clicking a page, choosing View Page Source, then copying and pasting that code. You could then use that code to restore the website, if you choose.
"Right-clicking a page" to get the page source code doesn't work in all browsers - you may need to go to the "View" menu or "Tools". And these days a lot of sites assemble their pages on the fly by calling individual files as needed, so try both methods on one page before deciding which to use for the whole site.
I tried to use SiteSucker (http://ricks-apps.com) to grab a WordPress site some time ago and I ended up with a disorganised mess of unconnected files.
Copying and pasting the pages' contents was tedious but far more satisfactory.
+1 on waybackmachine, by the way. I've often found it useful.
Kim
Re: Advice for scraping a website.
There used to be a tool called Web Whacker for doing exactly that. I used it 10 years ago, and it worked then.
'Only practice with no gaining idea' ~ Suzuki Roshi
Re: Advice for scraping a website.
wget is a good tool for sites not too far gone down the web2.0/mobile rathole. If you're running Linux/OSX it's probably already installed; if Windows, I imagine it's been ported. Invoke it to perform a recursive get and enable retries. Depending on how the site's HTML links are coded, the result may or may not be usable on machines other than its original webhost. OTOH wget is free and easy to use, so it's worth a try regardless - all you need is a computer to run it on and an internet link.
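A fuller invocation along those lines, aiming at an offline-browsable mirror, might be the following. The URL is a placeholder and the command is assembled as a string rather than run; check `wget --help` on your system, as options can vary by version.

```shell
# Recursive mirror with retries and link rewriting (placeholder URL):
#   --mirror            recursion plus timestamping
#   --convert-links     rewrite links so the saved copy works offline
#   --page-requisites   also fetch the images/CSS each page needs
#   --adjust-extension  save pages with .html extensions
#   --tries=5           retry failed downloads
cmd='wget --mirror --convert-links --page-requisites --adjust-extension --tries=5 https://example.org/'
echo "$cmd"
```

The `--convert-links` flag addresses the portability issue above: it rewrites absolute links into relative ones so the copy browses correctly away from the original webhost.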
on a web2.0 site scraping is probably not the best approach- an export or backup from the publishing tool or system will probably work better or at least stand a better chance of being hosted elsewhere.