Web Page Ranking Using Web Mining Techniques: A Comprehensive Survey

Due to the exponential growth of Internet users and traffic, information seekers depend heavily on search engines to extract relevant information. With large amounts of textual, audio, and video content now accessible, the responsibility of search engines has increased. A search engine provides Internet users with information relevant to their query, based on content, link structure, etc.; however, it does not guarantee the correctness of that information. The performance of a search engine depends heavily on its ranking module, which in turn depends on the link structure of web pages, analyzed through Web structure mining (WSM), and on their content, analyzed through Web content mining (WCM). Web mining thus plays a vital role in computing the rank of web pages. This article presents web mining types, techniques, tools, algorithms, and their challenges. Further, it provides a critical, comprehensive survey for researchers by presenting the features of web pages that are essential for checking their quality. The authors present the approaches/techniques, algorithms, and evaluation methods used in previous research and identify some critical issues in page ranking and web mining, which provide future directions for researchers working in the area.


Introduction
The number of web documents on the World Wide Web (WWW) has increased exponentially due to users' growing dependence on the Internet. An automatic system is required to fetch reliable information from such a huge collection of web documents because the task is too challenging to perform manually. A search engine [1][2][3], such as Google, Yahoo, or Bing, is an information retrieval tool for the Web. A summary of various search engines is shown in Table 1. These search systems cannot always guarantee reliable and accurate information, but they still provide better results than manual analysis by experts. These tools often do not provide precise information because the IR system [6] returns information to Internet users based on specific retrieval criteria; for instance, it fetches web documents based on the given subject/title. Fetching a huge number of web documents related to a specific domain is easy and common. Therefore, search engines provide a ranking system to find reliable web documents for user/client queries. Generally, a ranking mechanism ranks web pages based on either keywords/reliability or links/popularity. The hyperlinked structure [7] of the Web was developed in 1989 to share information among researchers in Switzerland. Later, it became the platform of WWW development, guided by the WWW consortium at MIT (Massachusetts Institute of Technology) in Cambridge. The recent growth of the WWW has changed computer science and engineering, people's lifestyles, and the economies of various countries.
Since its onset, the WWW has grown exponentially, as shown in Figure 1(a). Monthly traffic increased from roughly 10 to 10^6 terabytes between 1995 and 2000, and total web traffic between 2005 and 2010 increased from 1 to 7 exabytes. By 2020, Internet traffic was increasing by approximately 5.3 exabytes per day. According to Cisco, video will account for 82% of all Internet traffic by 2021; in 2016, video accounted for 73% of all Internet traffic [8], as shown in Figure 1(b). People view large amounts of video, and they also use high bandwidth to view good-quality videos.
All types of web content (like video, Netflix, and webcams) generate demand, and live video is now an integral part of the Internet. Live video offerings from various sources, such as Facebook Live, Twitter broadcasts, YouTube Live, and live sports, are expected to grow to approximately 13% of total video web traffic by 2021 [8], as shown in Figure 1(c). The WWW is an essential and widely used tool for providing reliable information to Internet users. It offers an easy mechanism for delivering information such as static text and images as well as dynamic and interactive services such as audio/video conferencing. It also provides the facility to view various types of information, including magazines, library resources in different sectors, and current and business news. The web is now an essential source of all kinds of information.
Information retrieval (IR) systems [9] were developed to store and search web pages efficiently as the size of the WWW increased exponentially. Generally, text documents are stored in text databases, and the IR system provides a framework that enables searching. The IR system generates a list of documents in response to a query, usually in descending order of estimated relevancy. Because most users only glance at the first 10-50 items, the algorithms try to put the most relevant documents at the top.
However, searching for information on the web is difficult for an information seeker. Web-based information retrieval systems called search engines [10] have made things easier for information seekers, but they do not guarantee the correctness of the information, which is often imprecise. A search engine is a program that searches documents for specified queries and returns a list of the documents in which the query keywords were found.
It is important to understand that the term "popularity" is normally the result of link analysis, not user feedback. A web search engine, as shown in Figure 2, typically consists of a ranking system that measures the importance of web pages [11,12]. Using a hybrid approach, one can fetch content-based information from web documents [13]. Search engine traffic is affected [14] by the following factors: size of the web, loading speed [15], web security condition, SEO crawling factors (title, headings, meta description, content, URL), and user behavior [11,16]. A query-dependent web page ranking mechanism is presented in [17]; this approach was more effective but took more time to rank. In [18], the authors present a ranking mechanism based on link attributes, but it could not check the content quality of a web page. Some content-based ranking approaches are presented in [19][20][21]. The main issue in content mining is increased perceived latency, which was addressed in [22] by an additional component, namely a proxy server. Search engines follow these steps to process user queries: (a) take the user query and, based on its keywords, form a precise query to process.
Crawler_Loop (URL_List)
{
  (1) WebPageRepository ← Crawler (URL_List); /* Crawl web pages (respecting robots.txt), store them in the web page repository, and add newly found URLs to URL_List so they are crawled as well. */
  (2) Indexed_Web_Repository ← Indexer (WebPageRepository); /* The indexer analyzes all extracted documents, extracting relevant terms to create an index for searching documents against user queries. */
  (3) New_List_of_URLs ← ContentAnalysis (WebPageRepository); /* Content analysis computes the relevance of a web page on the basis of its contents with respect to the user query. */
  (4) Meta_data ← ContentAnalysis (WebPageRepository);
  (5) Update_URL_List (URL_List, New_List_of_URLs);
}
QueryProcessor (UserQuery, Indexed_Web_Repository, Meta_data) /* Processes the user query against the index and metadata and returns ranked results. */
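The crawl/index/query loop above can be sketched as follows. This is a minimal, illustrative sketch: the in-memory `WEB` dictionary is a hypothetical stand-in for real HTTP fetching, and there is no robots.txt handling or ranking.

```python
# Minimal sketch of the crawl -> index -> query loop described above.
# WEB, the page contents, and the toy corpus are hypothetical.

WEB = {  # url -> (text, out_links): assumed toy corpus
    "a.html": ("web mining survey", ["b.html"]),
    "b.html": ("page rank algorithms", ["a.html", "c.html"]),
    "c.html": ("search engine design", []),
}

def crawler(url_list):
    """Fetch pages into a repository and collect newly seen URLs."""
    repo, new_urls = {}, []
    for url in url_list:
        text, links = WEB[url]
        repo[url] = text
        new_urls.extend(l for l in links if l not in url_list)
    return repo, new_urls

def indexer(repo):
    """Build an inverted index: term -> set of URLs containing it."""
    index = {}
    for url, text in repo.items():
        for term in text.split():
            index.setdefault(term, set()).add(url)
    return index

def query_processor(query, index):
    """Return URLs matching every query keyword."""
    sets = [index.get(t, set()) for t in query.split()]
    return set.intersection(*sets) if sets else set()

repo, new_urls = crawler(["a.html", "b.html"])
index = indexer(repo)
print(query_processor("page rank", index))  # -> {'b.html'}
```

A real system would loop, feeding `new_urls` back into the crawl frontier, as step (5) of the pseudocode indicates.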

Web Mining
Data mining is used to find relevant patterns or knowledge in repositories (such as databases, texts, and images); such knowledge should be valid, valuable, and understandable. Text mining has become popular and reliable with the increasing popularity of text documents. Web mining [23][24][25] is used to fetch useful/relevant information and to use this information to generate knowledge, personalize information, and learn about users. The hyperlink structure of web pages and the content of web pages are used to collect the relevant information. In web mining, data mining techniques, as shown in Figure 3 [25][26][27][28], are used to fetch and discover relevant information automatically from web pages and web services. Data mining services for extracting something useful from the Web are discussed in [29]. The following steps are performed for this purpose:  (iv) Generalization: fetch patterns from a website or across various websites by applying machine learning (ML) and other data mining techniques. (v) Analysis: this phase analyzes the mined patterns through validation and interpretation. Pattern mining plays an important role in this phase, and human beings play an important role in knowledge generation on the web.
Web mining draws on three basic types of information, namely previous usage patterns, the degree of shared content, and link structure, as discussed below:

Web Usage Mining (WUM).
Web and application servers are the main sources for collecting web log data. Log files are generated on the web whenever an Internet user interacts with it through a search engine (shown in Figure 4). The following techniques [3] are used in web usage mining: 2.1.1. Association Rules. Using association rule creation in the Web domain, pages that are frequently referenced together can be combined into a single server session. Association rule mining techniques discover unordered correlations between objects observed in a repository of activities. In web usage mining, association rules apply to groups of pages that are accessed together with a support value greater than a certain threshold, where the support value is the percentage of activities containing a specific pattern. The presence or absence of association rules can help web designers rebuild their pages more effectively. Association rules can also be used as triggers for pre-fetching documents while loading a page from a distant site, to reduce user-perceived latency. In WUM, association rules capture the relationships between web pages that frequently appear next to one another in user sessions [6,7].
An association rule is written as A ⇒ B, where A and B are sets of items in a series of transactions. For example, the association rule Page A, Page B ⇒ Page C indicates that if the user/client views pages A and B, then page C is likely to be viewed in the same session.
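The support and confidence behind such a rule can be computed as in this small sketch, where the session data are hypothetical:

```python
# Sketch of association-rule measures over user sessions: support for the
# itemset {A, B} and confidence of the rule {A, B} -> {C}. Sessions are
# hypothetical sets of pages viewed together.

sessions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
]

def support(itemset, sessions):
    """Fraction of sessions containing every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(antecedent, consequent, sessions):
    """P(consequent | antecedent) estimated from the sessions."""
    return support(antecedent | consequent, sessions) / support(antecedent, sessions)

print(support({"A", "B"}, sessions))            # -> 0.75
print(confidence({"A", "B"}, {"C"}, sessions))  # -> 0.666...
```

A rule is reported only when its support and confidence both exceed chosen thresholds, which is exactly the filtering the threshold mentioned above performs.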

Classifications. Classification is used to map a data item into one of several predefined classes.
In the Web domain, it is necessary to extract and select the attributes that best characterize the properties of a specific class or category in order to create a profile of users belonging to that class or category. The web usage mining process learns from existing data and the behavior of new instances and identifies the particular class/category of a user. Classification techniques use machine learning (ML), neural networks (NN), and statistical methods. Decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, support vector machines, and other supervised inductive learning techniques can be used for classification, as shown in Figure 5.
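As a minimal illustration of one of the supervised learners listed above, the following sketch implements a k-nearest-neighbour classifier; the per-user feature vectors and labels are hypothetical:

```python
# A minimal k-nearest-neighbour classifier, one of the supervised learners
# listed above. Feature vectors (e.g., per-session counts) are hypothetical.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); returns the majority label of
    the k training points closest to `query` (squared Euclidean distance)."""
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    nearest = sorted(train, key=lambda t: dist(t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "casual"), ((1, 2), "casual"), ((8, 9), "power"),
         ((9, 8), "power"), ((2, 1), "casual")]
print(knn_predict(train, (8, 8)))  # -> power
```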

Clustering.
Clustering is one of the most challenging unsupervised learning problems. Objects are sorted into groups of related members during the clustering process; as a result, a cluster is a group of things that are related to one another but not to the objects of other clusters. Clustering analysis is a method of grouping individuals or data objects (pages) with similar characteristics. The formulation and execution of future marketing plans can be aided by grouping user information or pages, and user clustering helps discover groups of users who have similar navigation patterns. Clustering techniques build sets of similar items from a large volume of data by using distance functions that compute the similarity between items [30]. The contrast between individual users and groups is an essential factor in this type of search. There are two types of clustering in this area: (i) user clustering and (ii) page clustering. User clustering finds users who have the same browsing patterns, and page clustering finds web pages with similar content.
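The K-Means idea behind user/page clustering can be sketched as follows (Lloyd's algorithm over hypothetical one-dimensional feature values, kept one-dimensional for brevity):

```python
# Minimal K-Means sketch (Lloyd's algorithm) for page/user clustering.
# 1-D "feature" values keep the example short; real features are vectors.
def kmeans_1d(points, k=2, iters=10):
    centers = points[:k]                      # naive initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # update step
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(sorted(round(c, 2) for c in centers))   # -> [1.0, 9.0]
```

The alternating assignment/update steps minimize the within-cluster distance, i.e., the cluster performance index mentioned for K-Means in Table 2.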

Sequential Analysis. Sequential analysis finds patterns in which one or more sets of pages are accessed one after another in a time sequence. It is used to predict future visits, for example, to target advertising at user groups. Several techniques are utilized for sequential analysis [31], as shown in Figure 5. A detailed description of various algorithms for the WUM techniques is given in Table 2.
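A simple sequential-analysis step, counting how often one page immediately follows another across ordered sessions, can be sketched as follows, with hypothetical session data:

```python
# Sketch of a basic sequential-analysis step: counting how often page X is
# immediately followed by page Y across ordered sessions (hypothetical data).
from collections import Counter

sessions = [["home", "news", "sport"],
            ["home", "news", "mail"],
            ["home", "sport"]]

pairs = Counter((s[i], s[i + 1]) for s in sessions for i in range(len(s) - 1))
print(pairs.most_common(1))  # -> [(('home', 'news'), 2)]
```

Frequent pairs such as ('home', 'news') are the raw material for predicting a visitor's next page; full algorithms such as PrefixSpan generalize this to longer, non-contiguous sequences.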

Web Content Mining (WCM). Web content mining (WCM) [13,33,34], as shown in Figure 6, is used to fetch relevant and reliable information from web pages, which may contain text documents, hyperlinks, structured data, audio, and video. Nowadays, web pages are increasing exponentially over the WWW.
Fetching data relevant to user queries from an extensive collection of web pages is very difficult and time-consuming. Web content mining offers approaches [33] to extract user-relevant information from different types of data, namely unstructured, structured, and semistructured data. The various content mining algorithms [35] used by these techniques are shown in Table 3.

Web Structure Mining (WSM).
Web structure mining detects the structural summary of a web page and its linked web pages, with the architecture shown in Figure 7. It finds the link (forward/backward) structure inside a web page through structure mining [33,36]. It is used to classify and compare web documents and to integrate a number of different web documents. Some popular web structure mining algorithms are summarized in Table 4.
Web structure mining (WSM), as shown in Figure 7, follows these steps: (i) Apply link analysis to a web page repository to extract a link (forward/backward) summary of web pages. (ii) Apply link mining techniques to the summary to find the weight or quality of the web pages.
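Step (i) can be sketched as extracting forward (out-link) and backward (in-link) counts from a link graph; the graph below is a hypothetical example:

```python
# Sketch of WSM step (i): summarising forward (out) and backward (in) links
# from a hypothetical link graph extracted from a web-page repository.
links = {                 # page -> pages it links to (assumed example)
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p1"],
}

out_degree = {p: len(t) for p, t in links.items()}
in_degree = {p: 0 for p in links}
for targets in links.values():
    for t in targets:
        in_degree[t] += 1

print(out_degree)  # -> {'p1': 2, 'p2': 1, 'p3': 1}
print(in_degree)   # -> {'p1': 1, 'p2': 1, 'p3': 2}
```

These degree summaries are exactly the inputs that step (ii)'s link mining algorithms (e.g., PageRank, HITS) weight and propagate.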

Challenges in Web Mining. Web mining faces both technical and nontechnical issues. Nontechnical issues stem from management, funding, and resources (such as skilled professionals). Some technical issues are discussed below: (i) Inappropriate data: collected data should be reliable and in a proper format for successful mining, but data are often incomplete or unavailable, and the accuracy of such data is very difficult to assure. (ii) Complexity of web pages: the structure of a web page is not predefined; it is stored in a digital library (with no defined order of data) in its original format, so mining the data is very complex. (iii) Dynamic web: in the dynamic web, data change frequently due to new updates (for example, sports data), which increases the complexity of mining. (iv) Shortage of mining tools: new mining tools need to be developed because very few appropriate and complete mining tools are available.

Features of Web Pages and the Importance of These Features in a Ranking System. In this section, we identify the features of web pages and the importance of these features in the ranking systems [29][30][31][37][38][39] of search engines (shown in Table 5).
For each web page, there are fifteen features, as given in the table; these features are further divided into seven groups. The seven groups are finally categorized into three parts based on the web mining types (WCM, WUM, and WSM).
Page: it has two characteristics, the page rank (PR) score and the age (AGE) of the web page in the search engine's index. Links: this group is associated with the links/URLs (forward/backward links) on the web page. Query and Text Similarity: this indicates the similarity ratio between query keywords and the contents of a web page [40].
It has three main features: (i) frequency of query keywords inside the title; (ii) frequency of query keywords inside each heading tag (H1, H2, . . ., H6) separately; (iii) frequency of query keywords inside paragraphs.
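These three features can be computed as in the following sketch, where the parsed page structure (`title`, `headings`, `paragraphs` as plain strings) is a hypothetical simplification of a real HTML parse:

```python
# Sketch of the three query/text-similarity features above, computed from a
# hypothetical parsed page (title, headings, paragraphs as plain strings).
def keyword_freq(text, keywords):
    words = text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)

page = {
    "title": "Web mining survey",
    "headings": ["Web usage mining", "Web content mining"],
    "paragraphs": ["Mining the web helps ranking systems."],
}
query = ["web", "mining"]

features = {
    "title_freq": keyword_freq(page["title"], query),
    "heading_freq": sum(keyword_freq(h, query) for h in page["headings"]),
    "paragraph_freq": sum(keyword_freq(p, query) for p in page["paragraphs"]),
}
print(features)  # -> {'title_freq': 2, 'heading_freq': 4, 'paragraph_freq': 2}
```

A ranking module would then combine such per-field frequencies, typically giving the title and headings more weight than body paragraphs.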
Head Tag: the head tag contributes two features, the title and the metadata; both are used based on the keywords inside the title and the meta description. Body: this is associated with the density of keywords inside the body of a web page. Content: this is associated with features that are part of content analysis, such as headings and links/URLs. Session Specific: this counts the total number of clicks, the number of unique clicks, and the time duration of a session. The above web parameters are used in mining by search engines to find quality, relevant web pages for Internet users' queries.
Table 2 summarizes the WUM algorithms as follows. Clustering:
- K-Means [9]: an unsupervised algorithm used for data mining and pattern recognition; its aim is to minimize the cluster performance index.
- Greedy clustering using belief functions [10,11]: used to model evidence from expert opinions or statistical information.
- Improved fuzzy C-means [12]: a basic approach used for image segmentation, in which the space is divided into several clusters based on the pixel values of an image.
- CLIQUE (clustering in quest) [13]: a subspace clustering algorithm that follows a bottom-up approach to create static grids; it reduces the search space by using the apriori approach.
- Cluster optimization using fuzzy cluster chase [14]: used to personalize web page clusters for end users.
- K-means with genetic algorithm (GKA) [15]: minimizes an objective function; the GKA is preferable to other evolutionary algorithms for clustering.
- Hierarchical agglomerative clustering [16]: a data exploratory analysis technique used in hierarchical clustering.
- Cluster optimization using an anti-estimate approach [17]: used to remove redundant data that may occur during clustering.
- EB-DBSCAN (entropy-based DBSCAN) [18]: used to identify high-density regions/areas.
- DBSCAN [19]: used to make clusters of arbitrary shapes.
Classification:
- Naïve Bayes [20]: based on Bayes' theorem; finds the class with the highest probability for a predefined dataset by counting combinations of values.
- CART [21]: a classification technique used to construct decision trees from historical data.
- C4.5 [22]: a quick, high-precision classification algorithm that is used frequently.
- SVM [32]: a classification algorithm that can be applied to linear and nonlinear datasets.
- Backpropagation [23]: uses gradient descent to minimize the error function in weight space.

Sequential analysis:
- Hashing- and pruning-based algorithm [24]: a well-known association rule mining technique that increases the performance of the traditional apriori algorithm.
- WAP-tree association rule algorithm [25]: the WAP tree stores patterns in an effective manner so that they are easily searchable.
- High-utility sequential patterns [26]: a data mining task over sets of values having importance in a quantitative transaction database.
- PrefixSpan algorithm [27]: fetches sequential patterns using the pattern-growth method; it works well for small datasets.
- Transaction matrix comparison algorithm [28]: uses a Boolean vector to discover frequent itemsets; it requires less memory because itemsets are stored as bits.

(Figure: web pages from the web page repository are converted/filtered into XML documents, and an index is created based on the content of those documents.)
All the parameters are categorized according to the mining technique. Web mining tools are discussed in Table 6.

Web Page Ranking System
Every day, millions of people access search engines to retrieve information according to their needs; hence, search engines have become a common knowledge retrieval platform. The weighting of rankings in expert search for web documents is explained in [41]. Search engines drive Internet users toward highly ranked web pages by using various web mining techniques [42]. The main objective of a website is to attract Internet users or clients so that it can maintain its ranking on renowned search engines. Reinforcement learning for web page ranking (WPR) algorithms is explained in [43].
There are several ways to improve the ranking of a web page on search engines; spam farms, for example, are a well-known method of artificially enhancing a website's ranking. The cognitive spammer framework (CSF) deletes all spam web documents during the rank calculation of web pages [44]. The Preference-based Universal Ranking Integration (PURI) framework [45] is designed by combining various ranking mechanisms. At the same time, almost all web pages contain much noise, such as advertisements, various banners, and unreliable links, which affects the performance of content- and structure-based search engines, question-answering systems, and web summarization [13]. The g-index-based expert-ranking system, in which mainly Rep-FS, Exp-PC, and weighted Exp-PC techniques are used, is explained in [46]. Ranking systems utilize various web page ranking algorithms (shown in Figure 8), such as page rank [18,47], weighted page rank [48], EigenRumor [49], HITS [50], Weighted Links Rank [21], distance rank [51], tag rank [52], and query-dependent ranking [17], to compute the rank of a web page, and they return web pages ordered by their rank.
Weighted page rank (WPR) computes the weight of a page based on its link structure, assigns this weight to the page, and finally generates a rank from that weight. Page rank is frequently used to calculate the rank of a web page on the basis of its in-links and out-links; the formula is shown in equation (2):
PR(u) = (1 − d) + d Σ_{v∈B(u)} PR(v)/L(v), (2)
where B(u) is the set of pages linking to u, L(v) is the number of out-links of page v, and d is a damping factor. EigenRumor is proposed to resolve the limitations of page rank and other web page ranking algorithms for blogs; it assigns a rank value to each blog on the basis of the hub and authority weights of the blogger. The query-dependent algorithm uses users' queries to increase the performance of the page-ranking algorithm: a dedicated component incorporated into the page-ranking algorithm calculates the similarities between user queries, and this information is used to decide the final results returned to the user for a query.
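The page rank formula PR(u) = (1 − d) + d Σ PR(v)/L(v) can be evaluated by simple power iteration, as in this minimal sketch over a hypothetical three-page link graph (using the classic non-normalized form):

```python
# Minimal power-iteration sketch of PageRank, using the non-normalised form
# PR(u) = (1 - d) + d * sum over in-links v of PR(v)/L(v).
# The three-page link graph is hypothetical.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for u in pages:
            incoming = [v for v in pages if u in links[v]]
            new[u] = (1 - d) + d * sum(pr[v] / len(links[v]) for v in incoming)
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
print(max(pr, key=pr.get))  # -> C (it collects links from both A and B)
```

Production implementations additionally handle dangling pages (no out-links) and run until the scores change by less than a tolerance rather than for a fixed number of iterations.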
A new approach (SimRank) using the vector space model was proposed; it uses similarity from the vector-space-based model to find the rank of a web page. The SimRank [17] algorithm assigns ranks to the pages retrieved from the search engine in an effective way. Most traditional page rank algorithms use the link structure of web pages to find the page rank, and some of them ignore the content of the web pages entirely; the SimRank algorithm, by contrast, also incorporates the content of web pages to find the final rank score of a web page.
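The vector-space idea, scoring pages by the similarity between the query's and each document's term vectors, can be sketched with cosine similarity. The documents are hypothetical and this is an illustrative simplification, not the exact SimRank [17] procedure:

```python
# Sketch of vector-space content scoring: rank pages by the cosine
# similarity between query and page term-frequency vectors (toy documents).
import math
from collections import Counter

def cosine(q, d):
    qv, dv = Counter(q.lower().split()), Counter(d.lower().split())
    dot = sum(qv[t] * dv[t] for t in qv)
    norm = (math.sqrt(sum(v * v for v in qv.values()))
            * math.sqrt(sum(v * v for v in dv.values())))
    return dot / norm if norm else 0.0

docs = {
    "p1": "web mining and page ranking techniques",
    "p2": "cooking recipes for pasta",
}
query = "page ranking"
ranked = sorted(docs, key=lambda p: cosine(query, docs[p]), reverse=True)
print(ranked[0])  # -> p1
```

A hybrid ranker would combine such a content score with a link-based score (e.g., page rank) to produce the final ordering.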
The HITS algorithm computes the rank of a web page from its popularity, which it determines by counting the page's in-links (authority) and out-links (hub).
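The hub/authority computation of HITS can be sketched as the following iterative update over a hypothetical link graph:

```python
# Minimal HITS sketch: iteratively update authority (in-link) and hub
# (out-link) scores with normalisation; the link graph is hypothetical.
import math

def hits(links, iters=20):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        for scores in (auth, hub):            # normalise to unit length
            n = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= n
    return auth, hub

links = {"A": ["B", "C"], "B": ["C"], "C": []}
auth, hub = hits(links)
print(max(auth, key=auth.get))  # -> C (pointed to by A and B)
print(max(hub, key=hub.get))    # -> A (points to both B and C)
```

The mutual reinforcement is visible in the two update lines: good hubs point to good authorities, and good authorities are pointed to by good hubs.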
R. Baeza and E. Davis developed the Weighted Links Rank (WLR) algorithm on the basis of the standard PR algorithm. This algorithm generates the weight of a link from three different attributes, namely the anchor text length, the tag in which the link occurs, and the link's relative position in the web page.
ZarehBidoki and Yazdani [14] proposed a reliable and intelligent web page ranking mechanism called the distance rank algorithm, which works on the basis of a reinforcement learning algorithm. The distance between pages is calculated using the shortest logarithmic distance between two pages, and ranks are assigned accordingly. This algorithm returns high-quality web pages very quickly by using a distance-based solution; however, the crawler takes more time to compute the distance vector for newly crawled web pages. Table 7 summarizes the web mining techniques and the ranking algorithms for each mining technique.

State-of-the-Art Review
As the information available on the WWW increases, so does the responsibility of the Internet. It is straightforward to collect information from the WWW using search engines, but search engines return a large number of web pages for a user query, and it is challenging for users to select reliable information among them. Therefore, in this section, we discuss research papers in which the authors try to improve search engine techniques that help users select reliable information.
In [54], the authors give an approach that uses text mining to fetch experts' attributes from the web, that is, a recommendation model that returns precise records; this research showed the effectiveness of the proposed approach in box-office revenue prediction. In [55], the authors proposed predicting movie revenue from YouTube trailer reviews, which is mainly utilized in business intelligence and decision-making. In [56], the authors developed a Geographic Information Mining (GIM) framework. Microsoft discussion (MSD) forums used expert rank [57], a technique to find experts that uses document-based relevance as well as authority; however, it does not consider MSD features (such as user ratings, which are a more reliable feature for mining expert users). In [58], the authors identified user activities in the Stack Overflow forum and compared them with the users' GitHub repositories and the feasible features of users active on both platforms. In [59], the authors proposed user activity models for Stack Overflow, Wenwo forums, and Sina Weibo to classify real experts. In [60], the model uses some basic features, notably the question-answer ratio, to compute a user weight, but it ignores the consistency of the user. Besides this, the quality of tags was not considered, although considering it may lead to more reliable and accurate recommendation systems. Link-based expert-finding techniques mainly use the structure of links instead of their contents; link analysis has used question-answer relationships [61], citation networks [62], and e-mail communications [63] to find experts. For online users, [64] presents an automatic expert-finding model in which a user's expertise profile is evaluated based on a social network score and post conditions; algorithms such as Z-Score, PageRank, In-degree, and HITS were used to compute social network authority scores.
A search engine for fetching biomedical information [65] returns all the documents from MEDLINE corresponding to a user query on the basis of word/concept indexes. Several researchers have investigated ranking approaches using different methodologies that increase the efficiency of search engines in providing highly relevant web pages for a particular user query.
In [66], recent research on CARSs is described as mainly directed toward developing novel techniques, or adapting and combining existing ones, that can efficiently deal with the growing complexity and dynamicity of social networks.
The main contributions of [67] are that (1) the ontologies of a corpus can be organized effectively, (2) no effort and time are required to select an appropriate ontology, (3) the computational complexity is limited only by the use of unsupervised learning techniques, and (4) because no context awareness is required, the proposed framework can be broadly effective.

Observations from the State-of-the-Art Reviews.
The following observations are made after a critical review of the state of the art: Observation 1: Most search engines return relevant web pages to users for their queries. The relevancy of a web page depends upon its in-links/out-links (i.e., web structure mining) and its popularity. Many times, the most relevant web pages may be less important for user queries, and important web pages for a user query may be missing from the result. New techniques therefore need to be developed that consider user queries as an additional parameter for finding relevant web pages. Observation 2: Due to the increasing size of the web, search engines are delayed in returning a list of web pages to users. The delay between the submission of a user query and the output is called perceived latency; therefore, a pre-fetching mechanism needs to be developed to reduce the response time. Observation 3: Even with the introduction of a pre-fetching mechanism that aims to reduce user-perceived latency, unsuccessful predictions made to prefetch pages may result in information overkill.
Thus, a mechanism is required that makes credible predictions for only those pages that are more relevant, that is, makes correct predictions to minimize the problem of information overkill. Observation 4: With the growth of the WWW and the number of Internet users, it is very difficult to fetch the information sought by a specific group of users; for example, all employees in an organization may request the same type of information. Approaches are therefore required that personalize the content of web pages with respect to a user group.
A critical look at the available state-of-the-art reviews reveals the following major gaps: (1) an important but less popular page may exist that is not well linked; (2) there is a delay in response as perceived by the user; (3) there is a need to search for information within a similar-interest group in an organization.

Conclusion and Future Scope
Three categories of ranking algorithms are mainly discussed. The first category, based on the content of web pages, is known as content-based page ranking. The second category, which uses the link structure of the World Wide Web, is known as web-structure-based page ranking, and the third category uses a hybrid of the first two. Ranking systems rely heavily upon web mining techniques, but some issues in web mining still need to be addressed, due to improper data, the shortage of mining tools, and other challenges in classification and clustering techniques. The existing ranking systems have several limitations, which define challenges and new research paths for researchers. The observations about existing research work will help researchers select the specific areas where further research may be initiated.
Some challenges related to web page ranking are as follows: (i) Web-structure-based page ranking algorithms may ignore web pages with a low page rank score but good content for a user query, while content-based page ranking algorithms take more time to find the page rank because content is mined at query time. (ii) The size of the WWW is huge, so content mining is a very time-consuming process for checking the quality of web pages, and the time taken by search engines to return results needs to be reduced. (iii) To improve search results for user queries, information needs to be searched within similar-interest groups in an organization [107][108][109][110][111][112][113].

Conflicts of Interest
There are no conflicts of interest.