Implementation of Anti-Crawler System Based on Spark

Authors : Yisong Wang; Dongmei Zhang

Volume/Issue : Volume 6 - 2021, Issue 11 - November

With the advent of the data age, extracting and using data has become a major challenge. Crawlers are designed to collect website information in batches. However, malicious crawlers, such as those used for ticket grabbing, interfere with the normal business and operation of websites. Anti-crawler techniques have therefore been proposed as a new research topic. Beyond the initial front-end anti-crawler measures, anti-crawler systems based on big data have emerged and greatly improved anti-crawler efficiency. The purpose of this work is to develop such an anti-crawler system. After research into anti-crawler strategies and technologies, the system functions were determined to be data classification, data landing, data processing, data access, and IP sensitive representation. The goal is to meet the anti-crawler needs of ticketing websites, ensure normal business operations, and improve user satisfaction. The system uses Spark, Redis, Kafka, and Nginx + Lua, with IDEA as the development tool. After development was completed, the system underwent functional and performance testing. It is simple and convenient to use, with good accuracy and scalability, and meets the development requirements.

Keywords : anti-crawler; Hadoop; Spark; Redis; Kafka; Nginx
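
The abstract does not state the detection rule the system applies, so the following is only an illustrative sketch of one common anti-crawler heuristic: flagging an IP address that sends more than a threshold number of requests within a sliding time window. The function name `flag_sensitive_ips`, the record layout, and the default window and threshold are assumptions for illustration, not details from the paper. The sketch uses plain Python rather than Spark so that it stays self-contained.

```python
from collections import defaultdict, deque


def flag_sensitive_ips(records, window_seconds=60, max_requests=100):
    """Return the set of IPs that exceed max_requests within any sliding window.

    records: iterable of (timestamp_seconds, ip) tuples, sorted by timestamp.
    window_seconds and max_requests are hypothetical defaults, not values
    taken from the paper.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in records:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the window ending at ts.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged
```

In a pipeline like the one the abstract describes, the records would presumably come from access logs collected at the web server, and the flagged set would be stored where the front end can look it up; how the paper actually wires these stages together is not specified here.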

