Web Page Scraping Using Newspaper3k: Library in Python

What is web scraping?


Web scraping is the process of using automated tools to fetch web pages and extract information from them.
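To make the "extract information" step concrete, here is a minimal sketch that pulls the page title out of an HTML string using only the Python standard library. The names `TitleParser` and `extract_title` are made up for this illustration, and the HTML is a hard-coded sample rather than a fetched page. Newspaper3k does far more than this; the sketch only shows the general idea.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no title has been captured yet.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the stripped page title, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title else None


sample = "<html><head><title> Web mining </title></head><body><p>Hello</p></body></html>"
print(extract_title(sample))
```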


What is newspaper3k about?


Newspaper3k is an article scraping and curation library. With it you can easily fetch, parse, and scrape a web page. Newspaper3k is a Python 3 library; for Python 2, install the original newspaper package with pip instead.


Development Environment:


Operating System: Ubuntu 20.04
Python Version: 3.8.10

Python Virtual Environment:


Create a virtual environment before proceeding.


A virtual environment is useful because it keeps this project's packages isolated from the system-wide Python installation.


> python3 -m venv venv

Activate the created virtual environment:


> source venv/bin/activate
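If you want to confirm from inside Python that the virtual environment is active, one common check is to compare `sys.prefix` with `sys.base_prefix`. This is a small sketch; the function name `in_virtualenv` is chosen here for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Virtual environment active:", in_virtualenv())
```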

Installation:


pip install newspaper3k
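Note that the package is installed as newspaper3k but imported as newspaper. A quick way to check that the install worked, without importing the whole library, is sketched below. The helper name `is_installed` is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The import name is "newspaper", not "newspaper3k".
print("newspaper importable:", is_installed("newspaper"))
```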

Import newspaper in your Python script:




# Import newspaper
from newspaper import Article

# Create an article instance and provide it with desired url.
article = Article("https://en.wikipedia.org/wiki/Web_mining")

# Necessary step, fetch the article.
article.download()

# Web page content
print("Page Content: ", article.html)

# Parse the article
article.parse()


# You are ready to scrape the article.



# Scrape page title
if article.title:
    print("Title: ", article.title)
else:
    print("Title not found")

# Scrape the authors
if article.authors:
    print("Authors: ", article.authors)
else:
    print("No authors found")

# Scrape the publish date
if article.publish_date:
    print("Publish Date: ", article.publish_date)
else:
    print("No publish date found")

# Scrape the article image
if article.top_image:
    print("Image: ", article.top_image)
else:
    print("No image found")

# Scrape the article summary
# The summary is only filled in after article.nlp() has run.
# nlp() needs the NLTK tokenizer data, e.g. nltk.download("punkt").
article.nlp()
if article.summary:
    print("Summary: ", article.summary)
else:
    print("No summary found")

# Scrape the article text
if article.text:
    print("Text: ", article.text)
else:
    print("No text found")
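The repeated if/else checks above can also be folded into one helper that copies the scraped fields into a dictionary and substitutes a default for anything empty. This is only a sketch: `article_to_dict` and `FIELDS` are names invented here, and the example uses a stand-in object with the same attribute names so it runs without network access. A parsed Article could be passed in the same way.

```python
from types import SimpleNamespace

# Attribute names used in the script above.
FIELDS = ("title", "authors", "publish_date", "top_image", "summary", "text")


def article_to_dict(article, default=None):
    """Copy the scraped fields into a dict, replacing empty values with default."""
    return {name: (getattr(article, name, None) or default) for name in FIELDS}


# Stand-in object for illustration only; not a real newspaper Article.
fake_article = SimpleNamespace(
    title="Web mining",
    authors=[],
    publish_date=None,
    top_image="",
    summary="",
    text="Some article text",
)

record = article_to_dict(fake_article, default="not found")
for name, value in record.items():
    print(f"{name}: {value}")
```

Collecting the fields in a dictionary makes it straightforward to write the result to JSON or a CSV row later.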
