Latest In

Breaking News

Not To Be Scrapped- Web Scraping Here To Stay

Information is gold. Mining data online paves way to mining real money in the real world. Companies exhaust all possible legal means to obtain data. Web scraping has been doing that for them and will likely remain doing so in the years to come.

Author:Hajra Shannon
Reviewer:Dexter Cooke
Jan 17, 2022
74.7K Shares
1.6M Views
People may typically resort to plain copy-and-paste procedure when collecting information online, but when dealing with voluminous data, web scraping is the way to do it.
Consider a probable scenario mentioned by computer science website GeeksforGeeks.
If you’re looking for information about a former American president, for example, you may go to Wikipedia to obtain it. As you go over with the content, you can simply copy and paste the text you need and that’s it.
In the case of “large amounts of information,” however, GeeksforGeeks strongly stated that such an approach “will not work!”
Likewise, what if the copy-and-paste task entails that you do it “1,000 or more times,” as asked by Shubham Prasad in Quora.
What GeeksforGeeks and Prasad, an Indian digital marketing professional, with experience in web scraping, are pointing out is that time is one important factor to be considered in data collection.
To collect tons of information “as quickly as possible,” according to GeeksforGeeks, web scraping is the method deemed fit to accomplish such an enormous task.

Why EVERYBODY should learn web scraping (4 reasons)

Web Scraping Meaning

Series of codes running as data gets extracted using web scraping with Python
Series of codes running as data gets extracted using web scraping with Python
Data scraping, web data extraction, and web harvesting are alternative terms to web scraping.
Generally speaking, Techopedia says that web scraping refers to a method “used to collect information from across the Internet.”
Specifically, this method of web scraping is an automated process of extracting “structured web data from any public website,” according to Zyte (formerly Scrapinghub).
This “extraction of data from a website” or web scraping, said ParseHub, can also be done manually. However, as emphasized earlier, when it comes to data collection, time is of essence because in business, time is money.
So, doing it automatically is more preferred than manually.
ParseHub gives an idea on how web scraping is done. As an example, it chose to collect information regarding couches sold by international furniture and home accessories company IKEA.
If done manually, the collected data appears like this in Excel as seen in the screenshot below:
Screenshot of sample data on IKEA sofas collected through manual web scraping
Screenshot of sample data on IKEA sofas collected through manual web scraping
The screenshot below shows how it appears when done automatically:
Screenshot of sample data on IKEA sofas collected through automated web scraping
Screenshot of sample data on IKEA sofas collected through automated web scraping
According to Sequentum, an award-winning New York-based software development company, web scraping can be both easy and difficult.
It’s easy if the data will be extracted from:
(a) static websites (websites that have HTML-coded content that are fixed; they don't change or remain “‘static’” for all website users/viewers, according to Canadian marketing agency H&C)
(b) websites that use AJAX (Asynchronous JavaScript and XML) or JavaScript
Now the level of difficulty increases as the amount of data to be collected also increases as well as when you extract the data from:
(a) dynamic websites (websites where contents and information change every time the database gets updated, according to H&C)
(b) “complex websites” (those where visitor and website interaction can happen “beyond a simple information request form,” according to Arizona-based business solutions company Lodestone Systems)
Examples of complex websites include e-commerce sites; content-heavy, information-based, magazine-style, and service-oriented websites; and websites about celebrities, fashion, gaming, news, and videos)
(c) “non-HTML content”
(d) “websites that use deterrents”
One good example of a deterrent is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). It identifies real users (actual human beings) from automated users (e.g., bots or robots).

Web Scraping Tools

Book titled ‘Web Scraping com Python,’ with an illustrated armadillo on the cover
Book titled ‘Web Scraping com Python,’ with an illustrated armadillo on the cover
Per Christensson, creator of Techterms.com, said that bots are used in web scraping (which then explains why there are websites that use CAPTCHA).
Christensson also mentioned that manual web scraping can be carried out through the “File – Save As” command and copy-paste.
Automated web scraping uses software tools such as the following (in alphabetical order):
(1) Bright Data (formerly Luminati)
(2) Diffbot
(3) Import.io
(4) ParseHub
(5) Scrape.do
(6) Scraper API
(7) ScrapingBee
(8) Scraping-Bot
(9) Scrapingdog
(10) Sequentum

Web Scraping 2022

Graphic representation of an excavator mining online data as represented by a random series of 0 and 1
Graphic representation of an excavator mining online data as represented by a random series of 0 and 1
In his Wired article “Beyond the Information Age” (2014), Professor Julian Birkinshaw mentioned how, for decades, companies have participated in “harnessing information and knowledge.”
According to Prof. Birkinshaw, who teaches Strategy and Entrepreneurship at the London Business School, companies do it to cut operation cost, to improve their products, to stay afloat, and to remain competitive.
Web scraping is generally used to collect prices (monitoring and comparison), website contents, and statistics that will be analyzed for marketingand research and development purposes.
Indian software company Juppiter AI Labs enumerated the following activities/businesses/fields where web scraping is applied:
(a) academic research
(b) e-commerce
(c) international news
(d) real estate
(e) sports betting
Techopedia mentioned auction details and weather reports as two examples of information that some companies find useful for their business operation.
Online education platform Edureka added that web scraping is used to collect email addresses and job listings. Social media sites are targets, too, to know trending topics.
Investment companies, according to Zyte, go for news scraping for their investment plans and strategies.
So, if web scraping is so useful, then why, in 2021, it “became uncool,” according to ScrapeOps, a web scraping DevOps tool.
By uncool, it means that web scraping loses its popularity.
ScrapeOps cited two main reasons: the increasing number of companies that provide data (so no need for other firms to scrape data) and legal issues.
In the U.S., “no law directly applies to web scraping,” according to Atty. Kieran McCarthy, founder of Colorado-based McCarthy Garber Law.
That explains why, according to him, lawyersturn to “using judicial frameworks designed for other purposes” when dealing with cases on web scraping.
Worse, Atty. McCarthy doubts that Congress will draft some web scraping laws anytime soon.
ScrapeOps thinks legalities concerning web scraping became “a little less gray” because of the outcome of the 2021 case (“Van Buren vs. United States”). For the U.S. Supreme Court, web scraping didn’t violate the U.S. Computer Fraud and Abuse Act (CFAA).
However, in his Hacker News comment, Atty. McCarthy (user “KieranMac”) disagreed with ScrapeOps, saying that it’s actually “a darker shade of gray.”
In his opinion, the path towards a lawsuit-free web scraping could be navigable; nonetheless, Atty. McCarthy cautioned that it could still get “tricky.”

Conclusion

We could presume that companies will continue to collect and compile information through web scraping for their day-to-day operations.
In 2021, per FinancesOnline, the volume of data created daily reached 1.134 trillion MB.
ScrapeOps predicted that web scraping will overcome usual challenges like it did before. Thus, it concluded that the “future is looking bright” for it.
Still, remember Atty. McCarthy’s words; so, companies and individuals should remain vigilant in their web scraping practices.
Jump to
Hajra Shannon

Hajra Shannon

Author
Hajra Shannona is a highly experienced journalist with over 9 years of expertise in news writing, investigative reporting, and political analysis. She holds a Bachelor's degree in Journalism from Columbia University and has contributed to reputable publications focusing on global affairs, human rights, and environmental sustainability. Hajra's authoritative voice and trustworthy reporting reflect her commitment to delivering insightful news content. Beyond journalism, she enjoys exploring new cultures through travel and pursuing outdoor photography
Dexter Cooke

Dexter Cooke

Reviewer
Dexter Cooke is an economist, marketing strategist, and orthopedic surgeon with over 20 years of experience crafting compelling narratives that resonate worldwide. He holds a Journalism degree from Columbia University, an Economics background from Yale University, and a medical degree with a postdoctoral fellowship in orthopedic medicine from the Medical University of South Carolina. Dexter’s insights into media, economics, and marketing shine through his prolific contributions to respected publications and advisory roles for influential organizations. As an orthopedic surgeon specializing in minimally invasive knee replacement surgery and laparoscopic procedures, Dexter prioritizes patient care above all. Outside his professional pursuits, Dexter enjoys collecting vintage watches, studying ancient civilizations, learning about astronomy, and participating in charity runs.
Latest Articles
Popular Articles