About Machine Learning

Accurate predictions from machine learning models require a large amount of reliable data for training.

During training, a machine learning model identifies connections within the data. The model then uses those connections to predict the output value based on the input that it receives. As a result, the model can only be as good as the information used to train it.

Improving Accuracy

To get accurate predictions, you must verify that the data used to train the model is correct. The examples you use to train a model teach the system what correct relationships between the input field and the field being predicted look like.

The amount of verified data is just as important as its quality when training a model. To guarantee that an appropriate quantity of data continues to be available, base your model on fields that are used often and are consistently accurate.

Prediction models require a diverse data set. The data set used for training should include examples of each classification available. The training data should be verified as correct and representative of common uses for the fields. A model is trained using 80 percent of the training data set. The remaining 20 percent of the data is used for testing to verify the model. If all of the data used to train the model is identical, you get an error and the model is not trained.
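The 80/20 split and the check for identical data can be sketched in a few lines of Python. This is an illustration only, not the CSM implementation: the function name split_training_data, the random shuffle, the fixed seed, and the sample records are all assumptions made for the example.

```python
import random

def split_training_data(examples, train_fraction=0.8, seed=0):
    """Split (input, classification) pairs into training and testing parts.

    Hypothetical helper for illustration; the product's own split logic
    is not documented here.
    """
    # Refuse to train when every example is identical.
    if len(set(examples)) <= 1:
        raise ValueError("all training examples are identical; model not trained")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up records covering two classifications.
examples = [(f"ticket {i}", "Hardware" if i % 2 else "Software") for i in range(10)]
train, test = split_training_data(examples)
print(len(train), len(test))
```

With ten records, eight go to training and two are held back for testing. Passing a list of identical records raises the error instead of returning a split.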

Scoring Confidence

The CSM machine learning model calculates a confidence score for each prediction it makes. This score indicates the likelihood that the classification it predicts is correct. When a prediction is made, each possible classification is given a score. These scores are decimals whose sum is 1.0. The machine learning model provides the classification with the highest confidence score as the predicted value, and the decimal is displayed as the confidence score.
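The selection step can be sketched as follows. The raw scores, the classification names, and the predict function are hypothetical, and dividing by the total is only one simple way to make scores sum to 1.0; the algorithm CSM actually uses is not described here.

```python
def predict(raw_scores):
    """Return the top classification and its confidence score.

    raw_scores maps each classification to a non-negative number.
    The values are scaled so that they sum to 1.0 (an assumed
    normalization), then the highest one is selected.
    """
    total = sum(raw_scores.values())
    confidences = {label: value / total for label, value in raw_scores.items()}
    best = max(confidences, key=confidences.get)
    return best, confidences[best]

# Made-up scores for three possible classifications.
raw_scores = {"Hardware": 6.0, "Software": 3.0, "Network": 1.0}
label, confidence = predict(raw_scores)
print(label, round(confidence, 2))
```

Here the three classifications receive 0.6, 0.3, and 0.1, so "Hardware" is returned as the predicted value with a confidence score of 0.6.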

Some algorithms can produce a confidence score higher than 1.0 because of the way numbers are rounded during calculations. This happens only with especially small or repetitive data sets. For better results, use large amounts of varied data.