K Nearest Neighbor
Contact: Stefan Kramer
Input: Instances, feature vectors, class values
Output: Classification model (in fact, the training instances are simply stored; lazy learning method)
Input format: Weka's ARFF format
Output format: Plain text, model binary
User-specified parameters: k (the number of neighbors to use); whether hold-one-out cross-validation is used to select the best k value; whether to use distance weighting; whether the mean squared error (rather than the mean absolute error) is used when doing cross-validation for regression problems; the distance function.
Reporting information: Performance measures (confusion matrix, precision, recall, AUC, F-measure, true/false positive rate, prediction accuracy)
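Assuming the WEKA back end listed below, the user-specified parameters above can be set programmatically on the IBk classifier (WEKA's kNN implementation). The following is a minimal configuration sketch; the concrete values (k = 5, inverse-distance weighting, Manhattan distance) are illustrative assumptions, not defaults.

import weka.classifiers.lazy.IBk;
import weka.core.ManhattanDistance;
import weka.core.SelectedTag;
import weka.core.neighboursearch.LinearNNSearch;

public class KnnConfiguration {
    public static void main(String[] args) throws Exception {
        IBk knn = new IBk();
        knn.setKNN(5);                  // k, the number of neighbors to use
        knn.setCrossValidate(true);     // hold-one-out CV to select the best k <= 5
        knn.setMeanSquared(true);       // use MSE rather than MAE for regression CV
        knn.setDistanceWeighting(       // weight neighbors by 1/distance
            new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        // distance function: Euclidean by default; e.g. Manhattan instead
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(new ManhattanDistance());
        knn.setNearestNeighbourSearchAlgorithm(search);
        System.out.println(knn);        // no model built yet; training data is only stored later
    }
}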
The k-nearest neighbors algorithm (kNN) is a method for classifying objects based on closest training
examples in the feature space. It is a type of instance-based learning, or lazy learning, where the function is
only approximated locally and all computation is delayed until classification. A majority vote of an object's
neighbors is used for classification, with the object being assigned to the class most common amongst its k
(positive integer, typically small) nearest neighbors. If k is set to 1, then the object is simply assigned to the
class of its nearest neighbor. The kNN algorithm can also be applied to regression in the same way, by simply
assigning to the object the average of the property values of its k nearest neighbors. It can be
useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the
average than the more distant ones. No explicit training step is required since training consists of just storing
training instance feature vectors and corresponding class labels. In order to identify neighbors, the objects are
represented by position vectors in a multidimensional feature space. The Euclidean distance is commonly used,
but other distance measures, such as the Manhattan distance, can be used instead. In the
classification/testing phase, the test sample is represented as a vector in the feature space. Distances from this
vector to all stored vectors are computed and the k closest samples are selected to determine the class/real-
value of the test instance.
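To make the classification step concrete, the following self-contained Java sketch computes Euclidean distances from a query vector to all stored training vectors and takes a majority vote among the k closest; the training data and labels are purely illustrative and not part of the WEKA-based component described here.

import java.util.*;

public class KnnSketch {
    // Euclidean distance between two feature vectors
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static String classify(double[][] train, String[] labels, double[] query, int k) {
        // rank all stored training instances by distance to the query vector
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(train[i], query)));
        // majority vote among the k closest training instances
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{1.0, 1.0}, {1.2, 0.8}, {5.0, 5.0}, {5.2, 4.8}};
        String[] labels = {"A", "A", "B", "B"};
        System.out.println(classify(train, labels, new double[]{1.1, 0.9}, 3)); // -> A
    }
}

For regression, the majority vote would be replaced by the average (possibly distance-weighted) of the k neighbors' target values.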
The k-nearest neighbor algorithm is sensitive to the local structure of the data. The best choice of k depends
upon the data; generally, larger values of k reduce the effect of noise on the classification, but make
boundaries between classes less distinct. A good k can be selected by various heuristic techniques like cross-
validation. The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant
features, or if the feature scales are not consistent with their importance.
For further information we refer the reader to the literature [AHA91][MIT97].
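As a sketch of how the cross-validation-based selection of k and the reported performance measures could be obtained with the WEKA back end, consider the following example; the ARFF file name and the upper bound on k are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff");   // ARFF input, as above
        data.setClassIndex(data.numAttributes() - 1);        // last attribute is the class

        IBk knn = new IBk(10);            // upper bound on k
        knn.setCrossValidate(true);       // pick the best k in 1..10 by hold-one-out CV

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));  // 10-fold CV estimate
        System.out.println(eval.toSummaryString());       // prediction accuracy, error rates
        System.out.println(eval.toMatrixString());        // confusion matrix
        System.out.println(eval.toClassDetailsString());  // precision, recall, F-measure, AUC
    }
}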
Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Very popular method in the machine learning community. Simple approach that often
yields high predictive power. Can be used for classification and regression.
Bias (instance-selection bias, feature-selection bias, combined instance-selection/feature-selection bias, independence assumptions?, ...)
Lazy learning/eager learning: lazy learning (no explicit training step; all computation is delayed until classification)
Interpretability of models (black box model?, ...)
Type of Descriptor:
External components: WEKA
Programming language(s): Java
Operating system(s): Linux, Win, Mac OS
Input format: Weka's ARFF format
Output format: Plain text, model binary
[AHA91] D. Aha and D. Kibler (1991). Instance-based learning algorithms. Machine Learning, 6:37-66.
[MIT97] T. Mitchell (1997). Machine Learning. McGraw Hill.