Project URL: GitHub Repository

A simple web scraper that uses the BeautifulSoup library to scrape the web. This project is still early in development and initially for personal use.

Features

  • Scrape a website content
  • Save to a file in Markdown or HTML format
  • Image download support
  • Get all images links in the website content
  • Get all links in the website content with or without the relative links

Requirements

  • Python 3.10+

Installation

To install the package, run the following command:

1
pip install git+https://github.com/fadhilyori/pywebscraper.git

Usage

Initialize the PyWebScraper class

1
2
3
4
5
from pywebscraper import PyWebScraper

url = 'https://www.example.com'

scraper = PyWebScraper(url)

Note:
The default output directory is in the output directory.

Scrape a website content and save to a file in Markdown format

1
2
output_file = 'output.md'
scraper.save_markdown(filename=output_file)

Scrape a website content and save to a file in HTML format

1
2
output_file = 'output.html'
scraper.save_content_html(filename='content.html')

Scrape a website content and download images

1
2
output_file = 'output.md'
scraper.save_markdown(filename=output_file, download_images=True)

Get the website Markdown content in a string

1
2
content = scraper.get_content_markdown()
print(content)

Example output:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)

Get all images in the website content

1
2
images = scraper.extract_images()
print(images)

Example output:

[
    ('alt_text1', 'https://www.example.com/image1.jpg'),
    ('alt_text2', 'https://www.example.com/image2.jpg'),
]
1
2
links = scraper.extract_links()
print(links)

Example output:

[
    # External links
    'https://www.example.org/about',

    # Relative links
    'https://www.example.com/page3',    # original: /page3
    'https://www.example.com/#section', # original: # #section
    'https://www.example.com/?search=python', # original: # ?search=python
]
1
2
links = scraper.extract_links(include_relative=False)
print(links)

Example output:

[
    'https://www.example.org/about',
]

License

This project is licensed under the MIT License - see the LICENSE file for details.