PyWebScraper

Project URL: GitHub Repository (https://github.com/fadhilyori/pywebscraper)

A simple web scraper built on the BeautifulSoup library. The project is at an early stage of development and was initially written for personal use.

Features

Scrape a website's content
Save the content to a file in Markdown or HTML format
Download the images found in the content
Get all image links in the website content
Get all links in the website content, with or without relative links

Requirements

Python 3.10+

Installation

To install the package, run the following command:

`pip install git+https://github.com/fadhilyori/pywebscraper.git`

Usage

Initialize the PyWebScraper class

from pywebscraper import PyWebScraper

url = 'https://www.example.com'

scraper = PyWebScraper(url)

Note:
By default, output files are written to the output directory.

Scrape website content and save it to a file in Markdown format

output_file = 'output.md'
scraper.save_markdown(filename=output_file)

Scrape website content and save it to a file in HTML format

output_file = 'output.html'
scraper.save_content_html(filename=output_file)

Scrape website content and download its images

output_file = 'output.md'
scraper.save_markdown(filename=output_file, download_images=True)

Get the website content as a Markdown string

content = scraper.get_content_markdown()
print(content)

Example output:

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)
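
Because the content comes back as a plain string, it can be post-processed with ordinary string tools. The sketch below pulls the headings out of a Markdown string like the example output above. The `markdown_headings` helper is hypothetical and not part of PyWebScraper, and the sample text stands in for the result of a real scrape.

```python
import re

# Sample text standing in for the string returned by get_content_markdown().
sample = """# Example Domain

This domain is for use in illustrative examples in documents.

[More information...](https://www.iana.org/domains/example)
"""


def markdown_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for ATX-style headings such as '# Title'.

    Hypothetical helper; it does not skip fenced code blocks.
    """
    headings = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


headings = markdown_headings(sample)
print(headings)
```

The same approach works for any line-oriented extraction, since the returned Markdown is ordinary text.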

Get all images in the website content

images = scraper.extract_images()
print(images)

Example output:

[
    ('alt_text1', 'https://www.example.com/image1.jpg'),
    ('alt_text2', 'https://www.example.com/image2.jpg'),
]
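
The returned list can be filtered like any other Python list. This is a minimal sketch, assuming the `(alt_text, url)` tuple layout shown in the example output above. The `filter_images_by_extension` helper is hypothetical, not part of PyWebScraper, and the sample URLs are placeholders.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Placeholder data in the same shape as the example output.
images = [
    ("alt_text1", "https://www.example.com/image1.jpg"),
    ("alt_text2", "https://www.example.com/image2.png?size=large"),
]


def filter_images_by_extension(images, extensions):
    """Keep (alt_text, url) pairs whose URL path ends in one of the extensions."""
    wanted = {ext.lower() for ext in extensions}
    kept = []
    for alt_text, url in images:
        # Look only at the URL path so a query string does not hide the extension.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in wanted:
            kept.append((alt_text, url))
    return kept


jpgs = filter_images_by_extension(images, [".jpg", ".jpeg"])
print(jpgs)
```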

Get all the links in the content (including the relative links)

links = scraper.extract_links()
print(links)

Example output:

[
    # External links
    'https://www.example.org/about',

    # Relative links
    'https://www.example.com/page3',          # original: /page3
    'https://www.example.com/#section',       # original: #section
    'https://www.example.com/?search=python', # original: ?search=python
]
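
The resolved URLs in the example output match standard URL resolution against the page address. The sketch below reproduces them with `urljoin` from the Python standard library. Whether PyWebScraper uses `urljoin` internally is an assumption, and the base URL is written with a trailing slash so that fragment and query references resolve as shown.

```python
from urllib.parse import urljoin

# Assumed page address; the trailing slash gives the base an explicit path.
base_url = "https://www.example.com/"

# Raw href values as they might appear in the page source.
hrefs = ["https://www.example.org/about", "/page3", "#section", "?search=python"]

# Absolute URLs pass through unchanged; relative ones are resolved against the base.
resolved = [urljoin(base_url, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(f"{href!r} -> {url}")
```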

Get all the links in the content (excluding the relative links)

links = scraper.extract_links(include_relative=False)
print(links)

Example output:

[
    'https://www.example.org/about',
]
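
One common way to tell the two kinds of link apart in raw `href` values is to check for a scheme or host. The sketch below is an illustration only: `is_relative` is a hypothetical helper, and the exact criterion PyWebScraper applies for `include_relative=False` is not stated here, so treat the rule as an assumption.

```python
from urllib.parse import urlparse


def is_relative(href: str) -> bool:
    """Treat an href as relative when it carries neither a scheme nor a host.

    Under this rule, scheme-relative references such as '//cdn.example.com/x.js'
    count as non-relative because they name a host.
    """
    parts = urlparse(href)
    return not parts.scheme and not parts.netloc


# Raw href values as they might appear in the page source.
hrefs = ["https://www.example.org/about", "/page3", "#section", "?search=python"]

absolute_only = [href for href in hrefs if not is_relative(href)]
print(absolute_only)
```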

License

This project is licensed under the MIT License - see the LICENSE file for details.
