
Ajax data crawling

Written by

Nauxniqnah

What is Ajax?

Ajax (Asynchronous JavaScript and XML) is a technique in which JavaScript exchanges data with the server in the background and updates part of a web page, without reloading the page or changing its URL.

Basic principles:

Send the request

Parse the response content

Render the web page
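
The three steps can be sketched in a few lines. In a browser this work is done by JavaScript; the Python version below only mirrors the flow. The response body and its field names (data, rank, team) are made up for illustration, and the sending step is replaced by a hard-coded string so the sketch runs without network access.

```python
import json


def parse_content(body: str) -> list:
    """Step 2: parse the JSON text returned by the server."""
    return json.loads(body)["data"]


def render(rows: list) -> str:
    """Step 3: turn the parsed rows into the text that would be shown on the page."""
    return "\n".join(f"{row['rank']}. {row['team']}" for row in rows)


# Step 1 (sending the request) is stood in for by a fixed response body.
fake_body = '{"data": [{"rank": 1, "team": "A"}, {"rank": 2, "team": "B"}]}'
page_fragment = render(parse_content(fake_body))
print(page_fragment)
```

A crawler only needs the first two steps: once the response is parsed, the data is already in a usable form and nothing has to be rendered.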

Ajax analysis method

Viewing requests

In the browser's developer tools, Ajax requests are listed under their own request type, XHR. In this example the request headers also contain 'x-requested-with: XMLHttpRequest', which marks the request as an Ajax request.

The actual data can be found under the Response tab.

Filtering requests

Select XHR to view only Ajax requests.
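
The same filtering can be approximated in code when working with a list of captured requests. The sketch below is only an illustration: the structure of the captured list is an assumption, and the check relies on the x-requested-with header, which is a convention added by libraries such as jQuery rather than something every Ajax request carries.

```python
def is_ajax_request(headers: dict) -> bool:
    """Return True if the headers carry 'x-requested-with: XMLHttpRequest'."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-requested-with":
            return value.strip().lower() == "xmlhttprequest"
    return False


def filter_xhr(captured: list) -> list:
    """Keep only the captured requests that look like Ajax requests."""
    return [req for req in captured if is_ajax_request(req["headers"])]


# Hypothetical capture: one page load and one Ajax call.
captured = [
    {"url": "https://m.wanplus.com/lol/teamstats",
     "headers": {"accept": "text/html"}},
    {"url": "https://m.wanplus.com/ajax/stats/list",
     "headers": {"X-Requested-With": "XMLHttpRequest"}},
]
xhr_only = filter_xhr(captured)
print([req["url"] for req in xhr_only])
```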

Ajax data extraction

Analyzing the request

Refresh the page and several XHR requests will appear. Select one of them.

The details of the request are shown under the Headers tab.

This is a POST request.

The requested URL is 'https://m.wanplus.com/ajax/stats/list'.

If we open this URL directly, the page returns "unidentified request".

Construct request header

Because direct access is rejected, we need to construct the request headers ourselves to get the Ajax data.

headers = {
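    # These four entries mirror the HTTP/2 pseudo-headers (:authority, :method, :path,
    # :scheme) shown in the browser; requests derives the real ones from the URL and method.
    'authority': 'm.wanplus.com',
    'method': 'POST',
    'path': '/ajax/stats/list',
    'scheme': 'https',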
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'content-length': '4718',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'cookie': 'wanplus_token=ca5c9886e44b030e6a72b616db11c999; wanplus_storage=lf4m67eka3o; wanplus_sid=b91cbc70de5d5852aac738d23d238b6f; wanplus_csrf=_csrf_tk_717215530; gameType=2; UM_distinctid=175fd2cc56933-021e75472229c2-c791e37-13c680-175fd2cc56a23c; wp_pvid=4036787096; wp_info=ssid=s3914832348; Hm_lvt_f69cb5ec253c6012b2aa449fb925c1c2=1606270372; Hm_lpvt_f69cb5ec253c6012b2aa449fb925c1c2=1606270585; CNZZDATA1275078652=1900053009-1606265507-%7C1606276334; wanplus_token=6d828b7bbfbd6254bc3a4b37acd51388; wanplus_storage=lf4m67eka3o; wanplus_sid=14951c1e109604192bfb362d68e84f28',
    'origin': 'https://m.wanplus.com',
    'referer': 'https://m.wanplus.com/lol/teamstats',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'User-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Mobile Safari/537.36',
    'x-csrf-token': '784324394',
    'x-requested-with': 'XMLHttpRequest'
}
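
Typing such a dictionary by hand is error-prone, so it can help to build it from the header text copied out of the developer tools. The helper below is a minimal sketch, not part of the original project: the function name is hypothetical, and it assumes one "name: value" pair per line. It drops the pseudo-headers that start with a colon and content-length, since an HTTP client normally derives those from the URL and the request body.

```python
def headers_from_devtools(raw: str) -> dict:
    """Turn copied 'name: value' lines into a headers dict."""
    skip = {"content-length"}  # computed by the HTTP client from the body
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in skip:
            continue
        headers[name] = value.strip()
    return headers


raw = """
:authority: m.wanplus.com
:method: POST
content-length: 4718
referer: https://m.wanplus.com/lol/teamstats
x-requested-with: XMLHttpRequest
"""
parsed = headers_from_devtools(raw)
print(parsed)
```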

We also need to supply the form data that the browser sends in the body of the POST request.

payload = {
    "_gtk": "784324394",
    "draw": 12,
    "columns[0][data]": "order",
    "columns[0][name]": "",
    "columns[0][searchable]": True,
    "columns[0][orderable]": False,
    "columns[0][search][value]": "",
    "columns[0][search][regex]": False,
    # ... (remaining columns[i][...] entries omitted)
    "start": 0,
    "length": 20,
    "search[value]": "",
    "search[regex]": False,
    "area": "",
    # "eid": '{}',
    "type": "team",
    "gametype": 2,
    "filter": "{\"team\":{},\"player\":{},\"meta\":{}}"
}
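
The repeated columns[i][...] entries follow a regular pattern, so they can be generated instead of typed out. The sketch below is an illustration under assumptions: the helper name and the second column name ("teamname") are hypothetical, and the real list of columns has to be read from the request in the developer tools. It also uses the lowercase strings "true" and "false", because Python's True would be form-encoded as "True"; whether the server accepts both spellings is not known.

```python
from urllib.parse import urlencode


def datatables_columns(names: list) -> dict:
    """Build the columns[i][...] form fields for the given column names."""
    fields = {}
    for i, name in enumerate(names):
        prefix = f"columns[{i}]"
        fields[f"{prefix}[data]"] = name
        fields[f"{prefix}[name]"] = ""
        fields[f"{prefix}[searchable]"] = "true"
        fields[f"{prefix}[orderable]"] = "false"
        fields[f"{prefix}[search][value]"] = ""
        fields[f"{prefix}[search][regex]"] = "false"
    return fields


form = {"draw": 1}
form.update(datatables_columns(["order", "teamname"]))  # "teamname" is hypothetical
form.update({"start": 0, "length": 20})

# urlencode shows the application/x-www-form-urlencoded body that would be sent.
body = urlencode(form)
print(body)
```

Passing a dictionary like form as the data argument of a requests call produces the same encoding, so the explicit urlencode call is only needed for inspection.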

Request via Python

The eid field in the payload determines which data the request returns, so it is set to each ID in turn before sending.

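import time

import requests

url = 'https://m.wanplus.com/ajax/stats/list'

# temp_list (pairs of index and ID) and list_match_name (output file names)
# are defined elsewhere in the project.
for index, eid in temp_list:
    time.sleep(2)  # pause between requests to avoid overloading the server
    payload['eid'] = eid
    response = requests.post(url, headers=headers, data=payload)
    text = response.text  # response body decoded to str
    print(text)
    # The file is opened in text mode, so write the decoded str, not response.content (bytes).
    with open(list_match_name[index] + '.txt', 'w', encoding='utf-8') as f:
        f.write(text)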
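
If the saved text contains escape sequences such as \u6218 instead of readable characters, parsing the body as JSON is usually more reliable than decoding it with 'unicode_escape'. The sketch below assumes the endpoint returns JSON, which the accept header suggests but the original text does not confirm; the sample body and the field name teamname are made up.

```python
import json


def decode_response(raw: bytes) -> dict:
    """Parse a JSON response body; \\uXXXX escapes become real characters."""
    return json.loads(raw.decode("utf-8"))


def to_pretty_text(data: dict) -> str:
    """Serialize for saving, keeping non-ASCII characters readable."""
    return json.dumps(data, ensure_ascii=False, indent=2)


# Made-up body in the escaped form a server might return.
raw = b'{"data": [{"teamname": "\\u6218\\u961f"}]}'
decoded = decode_response(raw)
pretty = to_pretty_text(decoded)
print(pretty)
```

With requests, response.json() performs the same parsing step as decode_response on the body of the response.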

The complete project code is available on GitHub, and the package can be installed from PyPI with pip.

HanQinXuan/lol_data_api (github.com)

pip install loleventdata
