This article discusses two very easy fixes for a problem faced by almost all Jupyter notebook users while doing data science projects: changing the default working folder of Jupyter notebook, the most preferred IDE of data scientists. I have faced this issue myself.
The Jupyter Notebook is an open source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained by the people at Project Jupyter. Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython Notebook project itself.
Although it did not seem like a big problem at first, once you start using Jupyter on a daily basis you want it to start from your directory of choice. It helps you stay organized, with all your data science files in one place.
While I searched the internet thoroughly and got many suggestions, very few of them were really helpful, and it took quite a lot of my time to figure out a process that actually works. I thought I would write it down as a blog post so that in the future neither I nor my readers have to waste time fixing this issue again.
So, without any further ado, let's jump to the solutions…
The easiest way: using Anaconda PowerShell
The first and quickest solution is to run your Jupyter notebook right from the Anaconda PowerShell. You just need to change the directory to the desired one there and run Jupyter notebook. It is that simple. See the image below.
Here you can see that the default working folder of Jupyter notebook was C:\Users\Dibyendu, as shown in the PowerShell. I have changed the directory to E: and simply run the command jupyter notebook. Consequently, PowerShell has launched Jupyter notebook with that folder as the start folder.
This is very effective and changes the start folder for Jupyter notebook very easily. The problem, however, is that the change is temporary, and you have to go through this process every time you open the notebook.
To fix this, one solution is to create a batch file with these commands and simply run that batch file whenever you need to work in Jupyter notebook, as sketched below.
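A minimal sketch of such a batch file follows; the file name and the E: drive are placeholders, and it assumes jupyter is on your PATH (e.g. when launched from an Anaconda prompt):

```bat
:: start-jupyter.bat -- hypothetical name; adjust the drive/folder to taste
:: the /d switch changes the drive as well as the directory
cd /d E:\
jupyter notebook
```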

Creating a shortcut with the target set to the working folder of Jupyter notebook
This solution is my favourite and I personally follow this procedure. Here the steps are explained with screenshots from my system.
You first need to locate the Jupyter notebook app on your computer by right-clicking the application in your menu, as shown in the image below.

Now navigate to the file location and select the application file, as in the image below. Copy the file to your desktop or to any other location where you want a shortcut to the application.
Now right-click the copied shortcut and go to the Shortcut tab. The Target field you can see here contains “%USERPROFILE%”, which expands to your user profile folder. That’s why it is the default start folder for the notebook.
Now you need to replace the “%USERPROFILE%” part with the exact location of your desired directory.
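For example, the Target field might change roughly like this (the leading program path, elided here as “...”, varies by install, and E:\Data Science is just a placeholder for your own folder):

```text
Before: ...\jupyter-notebook.exe "%USERPROFILE%/"
After:  ...\jupyter-notebook.exe "E:\Data Science"
```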
In the above image you can see that I have replaced “%USERPROFILE%” with the data science folder which contains all of my data science projects. Now just click Apply and then OK. To open Jupyter notebook, click the shortcut, and Jupyter will open with your chosen directory as the start folder, as in the image below.

So, the problem is solved. You can use this trick to create multiple shortcuts with different folders as the start folder of Jupyter notebook.
Hey data hackers! Looking for a rapid way to pull down unstructured data from the Web? Here’s a 5-minute analytics workout across two simple approaches to scraping the same set of real-world web data, using either Excel or Python. All of this is done with 13 lines of Python code, or one filter and 5 formulas in Excel.
All of the code and data for this post are available at GitHub here. Never scraped web data in Python before? No worries! The Jupyter notebook is written in an interactive, learning-by-doing style that walks anyone without prior knowledge of web scraping in Python through the process of understanding web data and writing the related code step by step. Stay tuned for a streaming video walkthrough of both approaches.
Huh? Why scrape web data?
If you’ve not found the need to scrape web data yet, it won’t be long… Much of the data you interact with daily on the web is not structured in a way that you could easily pull it down for analysis. Reading your morning news feed and catching a great data table in an article? Unless the journalist links to machine-readable data, you’ll have to scrape it straight from the article itself! Looking to find the best deal across multiple shopping websites? Best believe that they’re not offering easy ways to download those data and compare! In this small example, we’ll explore augmenting one’s Kindle library with Audible audiobooks.
Audiobooks whilst traveling …
When in transit during travel (ahh, travel … I remember that …) I listen to podcasts and audiobooks. I was, the other day, wondering what the total cost would be to add Audible audiobook versions of every Kindle book that I own, where they are available. The clever people at Amazon have anticipated just such a query, and offer the Audible Matchmaker tool, which scans your Kindle library and offers Audible versions of Kindle books you own.
Sadly, there is not an option to “buy all” or even the convenience function of a “total cost” calculation anywhere. “No worries,” I thought, “I’ll just knock this into a quick Excel spready and add it up myself.”
Part I: Web Scraping in Excel
Excel has become super friendly to non-spreadsheet data in recent years. To wit: I copied the entire page (after clicking the “more” paging button repeatedly until all available titles were shown on one page) and simply pasted it into a tab in the spready.
Removing all of the images left us with a single column containing a mishmash of text, only some of which is useful to our objective of calculating the total purchase price of all unowned Audible books. Filtering on the repeating “Audible Upgrade Price” text reduces the column down to the values we are after.
Perfect! All that remains is a bit of text processing to extract the prices as numbers we can sum. All of these steps are detailed in the spreadsheet and data package accompanying this post.
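For a flavor of that text processing, formulas along these lines would do the job; the cell references and price format are assumptions, so adapt them to wherever the filtered text lands in your sheet (and note that trailing text after the price would need an extra trim):

```text
B2:  =VALUE(MID(A2, FIND("$", A2) + 1, LEN(A2)))   -- pull the number after the "$"
B62: =SUM(B2:B61)                                  -- total across the 60 filtered rows
```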

$465.07… not a horrible price for 60 audiobooks… just about $8 each. (As an Audible subscriber, each month’s book credit costs $15, so this is roughly half of that cost, and not a bad deal…) NB: we could have used Excel’s cool Web Query capabilities to import the data from the website, but we’ll cover that separately when we look at scraping Web data that requires logins.
I’m sure my Kindle library is like yours, in that it significantly expands with each passing week, so it occurred to me that this won’t be the last time I have this question. Let’s make this repeatable by coding a solution in Python.
Part II: Web Scraping in Python
The approach in Python is quite similar, conceptually, to the Excel-based approach.
- Pull the data from the Audible Matchmaker page
- Parse it into something mathematically useful & sum audiobook costs
Copy the data from the Audible Matchmaker page
The BeautifulSoup library in Python provides an easy interface to scraping Web data. (It’s actually quite a bit more useful than that, but let’s discuss that another time.) We simply import the BeautifulSoup class from the bs4 module and use it to parse the response returned by calling the get() method of the requests module.
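Here is a minimal sketch of that step. The URL is a placeholder: the real Matchmaker page sits behind your Amazon login, so in practice you would reuse an authenticated session or save the page to disk and parse the local HTML file instead.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- the actual Matchmaker page requires a signed-in Amazon session.
url = "https://www.amazon.com/your-audible-matchmaker-page"
response = requests.get(url)

# Parse the raw HTML of the response into a navigable soup object.
soup = BeautifulSoup(response.text, "html.parser")
```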
Parse the data into something mathematically useful
Similar to our approach in Excel, we’ll use the BeautifulSoup module to filter the page elements down to the price values we’re after. We do this in three steps, sketched in the code after this list:
- Find all of the span elements which contain the price of each Audible book
- Convert the data in the span elements to numbers
- Sum them up!
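Assuming each price sits in a span with a distinctive class (the class name below is hypothetical; use your browser’s inspector to find the right selector on the actual page), the three steps look roughly like this:

```python
# Step 1: find all of the span elements that contain a price.
# "audible-price" is a made-up class name -- replace it with the real one.
price_spans = soup.find_all("span", class_="audible-price")

# Step 2: convert the "$7.49"-style text of each span to a number.
prices = [float(span.get_text(strip=True).lstrip("$")) for span in price_spans]

# Step 3: sum them up!
total = sum(prices)
print(f"{len(prices)} audiobooks would cost ${total:.2f} in total")
```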
Donezo! See the accompanying Jupyter Notebook for a detailed walkthrough of this code.
There it is – scraping web data in 5 minutes using both Excel and Python! Stay tuned for a streamed video walkthrough of both approaches.
