.

IEEE International Conference on Data Mining identified 10 algorithms in 2006 using surveys from past winners and voting. This is a list of those algorithms a short description and related python resources. The detailed paper is given here.

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ...,x_{p,i}) , where the x_j represent attributes or features of the sample, as well as the class in which s_i falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.

The C4.5 algorithm is available under SciKit's Decision Trees module.

k-means

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster

python resources:

k-means clustering is available in scipy, scikit-learn there is also a python wrapper for a basic c implementation.

support vector machines

Support vector machines(SVMs) are supervised learning models with learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of marked training examples, an SVM training algorithm builds a model that assigns new examples into one of marked categories.

*Python resources:*

SVMs are available in scikit-learn, pyml

Apriori

Apriori algorithm is used for discovering interesting relations between variables in large databases.

python resources:

A python implementation is available

Expectation Maximization

An expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models. It is used in cases where the equations cannot be solved directly.

*Python resources:*

PageRank

PageRank is perhaps the most popular one in this list. Its the best known algorithm used by google to rank websites in their search engine results.

It is a link analysis algorithm and it assigns a numerical weighting called page rank to each element of a hyperlinked set of documents, with the purpose of "measuring" its relative importance within the set.

*Python Resources:*

An implementation in python is available

AdaBoost:

Adaboost is used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms called the weak learners is combined into a weighted sum that gives final output of the boosted classifier.

*Python resources:*

available in scikit-learn

k-Nearest Neighbors

k nearest neighbours algorithm for k closest training examples in the feature space outputs their class membership classified by a majority vote of its k neighbours if used for classification. If used for regression it outputs the average of the values of its k nearest neighbours.

*Python resources:*

available in scikit-learn

Naïve Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.

*Python resources:*

CART

Classification and regression trees

*Python resources:*

A continuously updated list of Python resources for these algorithms is available on Pansop.

Posted 9 November 2021

© 2021 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central