Web Scraping News Articles In Python

Newspaper3k: article scraping and curation. Inspired by requests for its simplicity and powered by lxml for its speed. "Newspaper is an amazing python library for extracting & curating articles." (tweeted by Kenneth Reitz, author of requests). "Newspaper delivers Instapaper style article extraction." (The Changelog). Newspaper is a Python 3 library.

We have already written a few articles about web scraping using beautifulsoup and requests in Python. This is another article in that series, in which we will scrape news headlines from a news website.

For this article we have chosen the website inshorts.com. Let's start by reading the news from their homepage, https://inshorts.com/en/read/. To scrape headlines, we need to inspect the headline HTML element.

As we can see, all the headlines are inside a span HTML tag whose itemprop attribute has the value headline. In beautifulsoup, we can find all elements with a given attribute value using the method find_all(attrs={'attribute_name': 'attribute_value'}).
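As a minimal sketch of that lookup, the snippet below runs find_all on a small hand-written HTML fragment. The fragment and the helper name extract_headlines are made up for illustration; only the itemprop="headline" attribute comes from the page structure described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the structure described above.
SAMPLE_HTML = """
<div><span itemprop="headline">First headline</span></div>
<div><span itemprop="headline">Second headline</span></div>
<div><span itemprop="author">Someone</span></div>
"""


def extract_headlines(html):
    """Return the text of every element whose itemprop attribute is 'headline'."""
    soup = BeautifulSoup(html, "html.parser")
    elements = soup.find_all(attrs={"itemprop": "headline"})
    return [element.get_text(strip=True) for element in elements]


print(extract_headlines(SAMPLE_HTML))
```

The span with itemprop="author" is skipped because find_all only matches elements whose attribute value equals the one requested.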

Before starting, we strongly recommend creating a virtual environment and installing the dependencies in it: requests and beautifulsoup4.

Scrape news website's homepage:

Let's start by getting the response from the homepage URL.

Create a separate function to print headlines from the response text. This will be helpful later on as well (remember the DRY principle).

Call print_headlines function and pass response.text to it.

Code so far would be
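A sketch of what the script might look like at this point, assuming requests and beautifulsoup4 are installed. The function main, the constant HOMEPAGE_URL and the 10-second timeout are illustrative choices, not requirements of the site.

```python
import requests
from bs4 import BeautifulSoup

HOMEPAGE_URL = "https://inshorts.com/en/read/"


def print_headlines(response_text):
    """Print the text of each headline element found in the given HTML."""
    soup = BeautifulSoup(response_text, "html.parser")
    for headline in soup.find_all(attrs={"itemprop": "headline"}):
        print(headline.get_text(strip=True))


def main():
    # The timeout is an arbitrary choice so a stalled request does not hang.
    response = requests.get(HOMEPAGE_URL, timeout=10)
    response.raise_for_status()
    print_headlines(response.text)
```

To run it as a script, add a call to main() as the last line of the file.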

Save this code in a file named, say, news_headlines.py. Activate the virtual environment and run the script with the command python news_headlines.py. The script will print the headlines shown on the first page to the terminal.

The code written so far prints only the headlines shown on the first page. What if we want to fetch more headlines than that?

Fetching more headlines:

On the news website's homepage, you will see a Load More button at the bottom. Open the developer tools in Chrome by pressing F12 and click on the Network tab. Here you can see all requests and responses.

When you click the Load More button, a request is sent to the server with two key-value pairs in its form data, one of which is news_offset. You can inspect them in the Network tab.

The value of the news_offset variable can be found in the source code of the homepage. Open the page source and search for the text min_news_id. Use the value of this variable as news_offset.
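Instead of copying the value by hand, it could be pulled out of the page source with a regular expression. This is only a sketch: it assumes the value appears as a quoted string assigned to min_news_id, which should be checked against the real page source, and the helper name find_min_news_id is made up.

```python
import re

# Assumed form of the assignment in the page source, e.g.
#   var min_news_id = "abc123-1";
MIN_NEWS_ID_PATTERN = re.compile(r"min_news_id\s*=\s*[\"']([^\"']+)[\"']")


def find_min_news_id(page_source):
    """Return the value assigned to min_news_id, or None if it is not found."""
    match = MIN_NEWS_ID_PATTERN.search(page_source)
    return match.group(1) if match else None


# Hypothetical page fragment used only to exercise the helper.
sample_source = '<script>var min_news_id = "abc123-1";</script>'
print(find_min_news_id(sample_source))
```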

Post request with form data:

The URL used to load more news headlines is https://inshorts.com/en/ajax/more_news. Let's send a POST request to this URL with the required form data to fetch more headlines. We will send POST requests inside a while loop for as long as we keep getting a 200 OK status.

Since the response returned is a JSON string with two keys, min_news_id and html, we will parse the response into a JSON object and get the values of these two keys. min_news_id will be used to send the next POST request, and the html text will be passed to the print_headlines function we defined earlier to print the headlines.
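The loop described above could be sketched as follows. A few things here are assumptions rather than facts from the site: the second form field is assumed to be called category, the max_pages bound is added so the loop cannot run forever, and the HTTP call is passed in as the post argument (requests.post fits this shape) so that the function can be tried without network access.

```python
import json

MORE_NEWS_URL = "https://inshorts.com/en/ajax/more_news"


def fetch_more_headlines(post, news_offset, handle_html, max_pages=5):
    """Request further pages until a non-200 status or max_pages is reached.

    post        -- callable(url, data=...) returning an object with
                   .status_code and .text, for example requests.post
    news_offset -- starting offset (the min_news_id value from the homepage)
    handle_html -- callable that receives each returned HTML fragment
    Returns the last offset seen, so a later run can resume from it.
    """
    for _ in range(max_pages):
        # "category" is an assumed name for the second form field.
        form_data = {"category": "", "news_offset": news_offset}
        response = post(MORE_NEWS_URL, data=form_data)
        if response.status_code != 200:
            break
        payload = json.loads(response.text)
        handle_html(payload["html"])
        news_offset = payload["min_news_id"]
    return news_offset
```

With the pieces from earlier in the article, a call might look like fetch_more_headlines(requests.post, news_offset, print_headlines).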

Complete Code:

The complete Python code to get news headlines is also available on GitHub.

This tutorial demonstrates how to use the New York Times Articles Search API using Python. From the API's documentation:

With the Article Search API, you can search New York Times articles from Sept. 18, 1851 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.

The API will not return the full text of articles. But it will return a number of helpful metadata fields such as subject terms, abstract, and date, as well as URLs, which one could conceivably use to scrape the full text of articles.

To begin, you first need to obtain an API key from the New York Times, which is fast and easy to do. See here for more information.

You also need to install the nytimesarticle package, which is a Python wrapper for the New York Times Article Search API. This allows you to query the API through Python.

To get started, let's fire up our favorite Python environment (I'm a big fan of ipython notebook):

Now we can use the search function with our desired search parameters/values:

The q (for query) parameter searches the article's body, headline and byline for a particular term. In this case, we are looking for the search term 'Obama'. The fq (for filter query) parameter filters search results by various dimensions. For instance, 'headline':'Obama' will filter search results to those with 'Obama' in the headline, and 'source':['Reuters','The New York Times'] will filter by source (Reuters, The New York Times, and AP are available through the API). The begin_date parameter (in YYYYMMDD format) limits the date range of the search.
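A sketch of such a search call is shown below. It assumes an api object with a search method that accepts these keyword arguments, which is how the nytimesarticle wrapper is understood to work (created with something like articleAPI("YOUR_API_KEY") after from nytimesarticle import articleAPI); treat that setup, the function name, and the example date as assumptions to verify against the package you install.

```python
def search_obama_headlines(api):
    """Run the search described above against an articleAPI-style object."""
    return api.search(
        q="Obama",
        fq={
            "headline": "Obama",
            "source": ["Reuters", "AP", "The New York Times"],
        },
        begin_date="20111231",  # YYYYMMDD; this particular date is only an example
    )
```

Passing the api object in as an argument keeps the key handling in one place and makes the search easy to try with a stand-in object.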

As you can see, we can specify multiple filters by using a Python dictionary and multiple values by using a list: fq = {'headline':'Obama', 'source':['Reuters','AP', 'The New York Times']}

There are many other parameters and filters we can use to specify our search. The API documentation has the full list.

The search function returns a dictionary of the first 10 results. To get the next 10, we have to use the page parameter. Page numbering starts at zero in the Article Search API, so page = 0 returns the first 10 results, page = 1 the second 10, page = 2 the third 10, and so on.

If you run the code, you'll see that the returned dictionary is pretty messy. What we'd really like to have is a list of dictionaries, with each dictionary representing an article and each key representing a field of metadata from that article (e.g. headline, date, etc.). We can do this with a custom function:
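One possible version of such a function is sketched here. It assumes the response follows the Article Search layout, with the articles under response and docs and with fields such as headline.main, pub_date, web_url and a keywords list of name/value pairs; missing fields are tolerated rather than raising an error.

```python
def parse_articles(articles):
    """Flatten a raw search response into a list with one dictionary per article."""
    news = []
    for doc in articles.get("response", {}).get("docs", []):
        keywords = doc.get("keywords") or []
        news.append({
            "id": doc.get("_id"),
            "headline": (doc.get("headline") or {}).get("main"),
            "date": (doc.get("pub_date") or "")[:10],  # drop the time of day
            "abstract": doc.get("abstract"),
            "source": doc.get("source"),
            "url": doc.get("web_url"),
            "word_count": doc.get("word_count"),
            # Keywords are tagged by kind; keep locations and subjects apart.
            "locations": [k.get("value") for k in keywords
                          if k.get("name") == "glocations"],
            "subjects": [k.get("value") for k in keywords
                         if k.get("name") == "subject"],
        })
    return news
```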

I’ve only included the fields that I find most relevant, but you can easily add any field that I missed.

Now that we have a function to parse results into a clean list, we can easily write another function that collects all articles for a search query in a given year. In this example, I want to find all the articles in Reuters, AP, and The New York Times with the search query ‘Amnesty International’:
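A sketch of such a collection function follows. The search and parse arguments stand for the wrapper's search function and the parsing function from the previous step; they are passed in so the sketch is self-contained. The cap of 100 pages and the early stop on an empty page are assumptions about sensible limits, not documented behaviour of the wrapper.

```python
def get_articles(search, parse, year, query, max_pages=100):
    """Collect all parsed articles matching `query` that were published in `year`.

    search -- callable accepting q, fq, begin_date, end_date and page keywords
    parse  -- callable turning one raw response into a list of dictionaries
    """
    all_articles = []
    for page in range(max_pages):  # pages are numbered from zero
        result = search(
            q=query,
            fq={"source": ["Reuters", "AP", "The New York Times"]},
            begin_date=f"{year}0101",
            end_date=f"{year}1231",
            page=page,
        )
        articles = parse(result)
        if not articles:
            break  # an empty page is taken to mean there are no more results
        all_articles.extend(articles)
    return all_articles
```

Real use would probably also need a pause between requests to stay within the API's rate limits.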

This function takes a year and a search query, and returns all articles that fit those parameters as a list of dictionaries. With this, we can scale up and loop over as many years as we want:
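The loop over years could be sketched like this. The per-year collection function is passed in as get_articles_for_year, a callable taking a year string and a query; the helper name collect_years is made up, and any year range used with it is the caller's choice.

```python
def collect_years(get_articles_for_year, first_year, last_year, query):
    """Concatenate the per-year article lists for first_year..last_year inclusive."""
    all_articles = []
    for year in range(first_year, last_year + 1):
        print(f"Processing {year}...")
        all_articles.extend(get_articles_for_year(str(year), query))
    return all_articles
```

The combined list for the running example would then come from a call such as collect_years(get_articles_for_year, start, end, "Amnesty International"), with start and end set to the years of interest.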

Now we have an object called Amnesty_all that lists a dictionary for each article, each containing fields like Headline, Date, Locations, Subjects, Abstract, Word Count, URL, etc.

Pretty neat! We can then export the dataset into a CSV (with each row as an article, and columns for metadata) and analyze it to explore interesting questions.

To export into a csv, I like to use the csv module:
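A minimal sketch of the export using csv.DictWriter. The function name, the choice of columns, and the "; " separator for list-valued fields such as locations are illustrative assumptions.

```python
import csv


def write_articles_csv(articles, path, fieldnames):
    """Write one row per article, keeping only the columns named in fieldnames."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        for article in articles:
            # Join list-valued fields so each one fits in a single cell.
            row = {
                key: "; ".join(map(str, value)) if isinstance(value, list) else value
                for key, value in article.items()
            }
            writer.writerow(row)
```

Fields that an article lacks are written as empty cells, and keys not listed in fieldnames are skipped.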

And there you have it! You just learned how to collect years' worth of articles from the New York Times, parse them, and download the resulting database as a CSV.

Rochelle Terman

Rochelle Terman received her Ph.D. in Political Science at UC Berkeley in 2016, and is now a post-doctoral fellow at Stanford University. She studies international norms, gender, and identity using computational and data intensive methods. At the D-Lab, she gives training on Python, R, Git, webscraping, computational text analysis, web development and basic programming skills.