Project URL: GitHub Repository
A simple web scraper that uses the BeautifulSoup library to scrape web pages. This project is still in early development and was initially built for personal use.
Features
- Scrape a website's content
- Save to a file in Markdown or HTML format
- Image download support
- Get all image links in the website content
- Get all links in the website content, with or without relative links
Requirements
- Python 3.10+
Installation
To install the package, run the following command:
Usage
Initialize the PyWebScraper class
Note:
By default, files are saved to the output directory.
Scrape a website's content and save it to a file in Markdown format
Scrape a website's content and save it to a file in HTML format
Scrape a website's content and download its images
Get the website content as a Markdown string
Example output:
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
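To illustrate how HTML can end up as Markdown like the output above, here is a minimal sketch. It is not the PyWebScraper implementation: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, it handles only headings, paragraphs, and links, and the names MarkdownSketch and html_to_markdown are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a tiny subset of HTML (h1-h3, p, a) to Markdown."""

    BLOCK_TAGS = ("h1", "h2", "h3", "p")

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Markdown blocks
        self.current = []     # text pieces of the block being built
        self.prefix = ""      # "# "-style prefix for headings
        self.href = None      # target of the <a> that is currently open
        self.link_text = []   # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            # Start a fresh block; text outside block tags is dropped.
            self.current = []
            self.prefix = "#" * int(tag[1]) + " " if tag != "p" else ""
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            label = "".join(self.link_text)
            self.current.append(f"[{label}]({self.href})")
            self.href = None
        elif tag in self.BLOCK_TAGS:
            # Collapse runs of whitespace, then store the finished block.
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.current = []
            self.prefix = ""


def html_to_markdown(html):
    """Return the Markdown for an HTML string, one blank line between blocks."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

A real converter would also need to cover lists, images, emphasis, and code blocks; this sketch only shows the general shape of the transformation.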
Get all images in the website content
Example output:
[
('alt_text1', 'https://www.example.com/image1.jpg'),
('alt_text2', 'https://www.example.com/image2.jpg'),
]
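A list of (alt text, URL) tuples like the one above could be produced roughly as follows. This is a sketch under assumptions, not the library's own code: it relies on the standard library's html.parser and urljoin rather than BeautifulSoup, and ImageCollector and collect_images are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect (alt_text, absolute_url) pairs from <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:  # skip <img> tags that have no usable source
            alt = attributes.get("alt") or ""
            # Resolve relative sources against the page URL.
            self.images.append((alt, urljoin(self.base_url, src)))


def collect_images(html, base_url):
    """Return all images in the HTML as (alt_text, absolute_url) tuples."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.images
```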
Get all links in the content (including relative links)
Example output:
[
# External links
'https://www.example.org/about',
# Relative links
'https://www.example.com/page3', # original: /page3
'https://www.example.com/#section', # original: #section
'https://www.example.com/?search=python', # original: ?search=python
]
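The relative-to-absolute conversion shown in the comments above matches what urllib.parse.urljoin does in the standard library. The sketch below assumes the page URL is https://www.example.com/ and that the href values have already been extracted; resolve_links is a hypothetical helper, not part of the package.

```python
from urllib.parse import urljoin


def resolve_links(hrefs, base_url):
    """Return absolute URLs for every href, keeping order and dropping duplicates."""
    resolved = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones
        # (paths, fragments, query strings) against base_url.
        absolute = urljoin(base_url, href)
        if absolute not in resolved:
            resolved.append(absolute)
    return resolved


links = resolve_links(
    ["https://www.example.org/about", "/page3", "#section", "?search=python"],
    "https://www.example.com/",
)
```

Note that the trailing slash on the base URL matters: without it, a fragment such as #section resolves to https://www.example.com#section.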
Get all links in the content (excluding relative links)
Example output:
[
'https://www.example.org/about',
]
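One plausible way to exclude relative links is to keep only href values that carry both a scheme and a host. The sketch below uses urllib.parse.urlparse for that check; is_absolute and absolute_links_only are hypothetical names, and the package itself may apply a different rule.

```python
from urllib.parse import urlparse


def is_absolute(href):
    """True when the href has both a scheme and a network location."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


def absolute_links_only(hrefs):
    """Keep only the links that were already absolute in the page source."""
    return [href for href in hrefs if is_absolute(href)]


links = absolute_links_only(
    ["https://www.example.org/about", "/page3", "#section", "?search=python"]
)
```

Under this rule, protocol-relative links such as //cdn.example.com/x and scheme-only links such as mailto: addresses are also filtered out, which may or may not be the desired behavior.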
License
This project is licensed under the MIT License - see the LICENSE file for details.