Last Updated on July 3, 2022

Deep learning models can take hours, days or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library.

Let’s get started.

**Jun/2016**: First published

**Update Mar/2017**: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.

**Update Mar/2018**: Added alternate link to download the dataset.

**Update Sep/2019**: Updated for Keras 2.2.5 API.

**Update Oct/2019**: Updated for Keras 2.3.0 API.

**Update Jul/2022**: Updated for TensorFlow 2.x API and added a mention of EarlyStopping.

## Checkpointing Neural Network Models

Application checkpointing is a fault tolerance technique for long-running processes.

It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.
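The idea is independent of deep learning. As a rough illustration only, the sketch below uses a toy job (summing integers) and a hypothetical `state.json` snapshot file, neither of which comes from this post. The first run crashes partway through, and the second run resumes from the last snapshot instead of starting over.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, fail_at=None):
    """Sum the integers 1..total_steps, saving a snapshot after every step."""
    # Resume from the last snapshot if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            # Simulate an unexpected failure in the middle of the run.
            raise RuntimeError("simulated crash at step %d" % step)
        state = {"step": step, "total": state["total"] + step}
        # Write the snapshot so a later run can pick up from here.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state


# Hypothetical snapshot location used only for this illustration.
checkpoint_file = os.path.join(tempfile.mkdtemp(), "state.json")

try:
    run_with_checkpoints(10, checkpoint_file, fail_at=6)
except RuntimeError as err:
    print(err)

# The second run continues from step 6 rather than from step 1.
final_state = run_with_checkpoints(10, checkpoint_file)
print(final_state)
```

In a deep learning run, the snapshot would hold the model weights rather than a running total.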

When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

The Keras library provides a checkpointing capability through a callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named, and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename that you use to store the weights can include variables like the epoch number or metric.
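To get a feel for how variables in the filename work, you can expand a template yourself with Python's `str.format`. This is a sketch of the idea rather than the callback's actual code. The `logs` dictionary and its values are hypothetical stand-ins for the metrics Keras reports at the end of an epoch.

```python
# Filename template with placeholders for the epoch number and a metric.
template = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Hypothetical end-of-epoch metrics (made up for this illustration).
logs = {"loss": 0.48, "val_accuracy": 0.83858}

# The epoch is zero-padded to at least two digits; the metric is rounded to two decimals.
filename = template.format(epoch=140, **logs)
print(filename)

# An earlier epoch shows the zero padding.
early = template.format(epoch=7, val_accuracy=0.7559)
print(early)
```

Because the metric is rounded in the filename, two checkpoints with slightly different scores can share the same score suffix and differ only in the epoch number.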

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.

Note, you may need to install the h5py library to output network weights in HDF5 format.

## Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the *pima-indians-diabetes.csv* file is in your working directory.

You can download the dataset from here:

The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_accuracy' and mode='max'). The weights are stored in a file that includes the epoch number and the score in the filename (weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5).

# Checkpoint the weights when validation accuracy improves

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.callbacks import ModelCheckpoint

import matplotlib.pyplot as plt

import numpy as np

import tensorflow as tf

seed = 42

tf.random.set_seed(seed)

# load pima indians dataset

dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables

X = dataset[:,0:8]

Y = dataset[:,8]

# create model

model = Sequential()

model.add(Dense(12, input_shape=(8,), activation='relu'))

model.add(Dense(8, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

# Compile model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint

filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

callbacks_list = [checkpoint]

# Fit the model

model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output (truncated for brevity).

…

Epoch 00134: val_accuracy did not improve

Epoch 00135: val_accuracy did not improve

Epoch 00136: val_accuracy did not improve

Epoch 00137: val_accuracy did not improve

Epoch 00138: val_accuracy did not improve

Epoch 00139: val_accuracy did not improve

Epoch 00140: val_accuracy improved from 0.83465 to 0.83858, saving model to weights-improvement-140-0.84.hdf5

Epoch 00141: val_accuracy did not improve

Epoch 00142: val_accuracy did not improve

Epoch 00143: val_accuracy did not improve

Epoch 00144: val_accuracy did not improve

Epoch 00145: val_accuracy did not improve

Epoch 00146: val_accuracy improved from 0.83858 to 0.84252, saving model to weights-improvement-146-0.84.hdf5

Epoch 00147: val_accuracy did not improve

Epoch 00148: val_accuracy improved from 0.84252 to 0.84252, saving model to weights-improvement-148-0.84.hdf5

Epoch 00149: val_accuracy did not improve

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:

…

weights-improvement-53-0.76.hdf5

weights-improvement-71-0.76.hdf5

weights-improvement-77-0.78.hdf5

weights-improvement-99-0.78.hdf5

This is a very simple checkpointing strategy.

It may create a lot of unnecessary check-point files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure that you have a snapshot of the best model discovered during your run.

## Checkpoint Best Neural Network Model Only

A simpler check-point strategy is to save the model weights to the same file, if and only if the validation accuracy improves.

This can be done easily using the same code from above and changing the output filename to be fixed (not including score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.

# Checkpoint the weights for best model on validation accuracy

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.callbacks import ModelCheckpoint

import matplotlib.pyplot as plt

import numpy as np

# load pima indians dataset

dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables

X = dataset[:,0:8]

Y = dataset[:,8]

# create model

model = Sequential()

model.add(Dense(12, input_shape=(8,), activation='relu'))

model.add(Dense(8, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

# Compile model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint

filepath = "weights.best.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

callbacks_list = [checkpoint]

# Fit the model

model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output (truncated for brevity).

…

Epoch 00139: val_accuracy improved from 0.79134 to 0.79134, saving model to weights.best.hdf5

Epoch 00140: val_accuracy did not improve

Epoch 00141: val_accuracy did not improve

Epoch 00142: val_accuracy did not improve

Epoch 00143: val_accuracy did not improve

Epoch 00144: val_accuracy improved from 0.79134 to 0.79528, saving model to weights.best.hdf5

Epoch 00145: val_accuracy improved from 0.79528 to 0.79528, saving model to weights.best.hdf5

Epoch 00146: val_accuracy did not improve

Epoch 00147: val_accuracy did not improve

Epoch 00148: val_accuracy did not improve

Epoch 00149: val_accuracy did not improve

You should see the weight file in your local directory.

weights.best.hdf5

This is a handy checkpoint strategy to always use during your experiments.

It ensures that the best model from the run is saved for later use, and it saves you from writing code to manually track and serialize the best model during training.
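For intuition, the bookkeeping that a best-only checkpoint takes off your hands looks roughly like the sketch below. It is a simplified stand-in, not the Keras implementation. The function name and the list of validation accuracies are hypothetical.

```python
def epochs_that_save(val_accuracies):
    """Return the (epoch, score) pairs at which a best-only checkpoint would be written."""
    best = float("-inf")  # nothing seen yet, so any first score is an improvement
    saved = []
    for epoch, score in enumerate(val_accuracies, start=1):
        # Equivalent of mode='max': only a strictly higher score counts as better.
        if score > best:
            best = score
            saved.append((epoch, score))
    return saved


# Hypothetical validation accuracies, one per epoch.
history = [0.70, 0.68, 0.74, 0.74, 0.73, 0.79]
saves = epochs_that_save(history)
print(saves)
```

With a fixed filename, each of those saves overwrites the previous one, so only the weights from the last improving epoch remain on disk.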

## Use EarlyStopping together with Checkpoint

In the examples above, we fit the model for 150 epochs. In practice, it is hard to know in advance how many epochs a model needs. One way to address this is to overestimate the number of epochs, but that can take a significant amount of time. And if we checkpoint only the best model, we may find that across several thousand epochs the best model was already reached within the first hundred, with no further checkpoints made afterwards.

It is therefore quite common to use the ModelCheckpoint callback together with EarlyStopping, which stops training once the monitored metric has not improved for several epochs. The example below adds the callback es to stop training early once the validation accuracy has not improved for 5 consecutive epochs:

# Checkpoint the weights for best model on validation accuracy

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

import matplotlib.pyplot as plt

import numpy as np

# load pima indians dataset

dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables

X = dataset[:,0:8]

Y = dataset[:,8]

# create model

model = Sequential()

model.add(Dense(12, input_shape=(8,), activation='relu'))

model.add(Dense(8, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

# Compile model

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint

filepath = "weights.best.hdf5"

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

es = EarlyStopping(monitor='val_accuracy', patience=5)

callbacks_list = [checkpoint, es]

# Fit the model

model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output:

Epoch 1: val_accuracy improved from -inf to 0.51969, saving model to weights.best.hdf5

Epoch 2: val_accuracy did not improve from 0.51969

Epoch 3: val_accuracy improved from 0.51969 to 0.54724, saving model to weights.best.hdf5

Epoch 4: val_accuracy improved from 0.54724 to 0.61417, saving model to weights.best.hdf5

Epoch 5: val_accuracy did not improve from 0.61417

Epoch 6: val_accuracy did not improve from 0.61417

Epoch 7: val_accuracy improved from 0.61417 to 0.66142, saving model to weights.best.hdf5

Epoch 8: val_accuracy did not improve from 0.66142

Epoch 9: val_accuracy did not improve from 0.66142

Epoch 10: val_accuracy improved from 0.66142 to 0.68504, saving model to weights.best.hdf5

Epoch 11: val_accuracy did not improve from 0.68504

Epoch 12: val_accuracy did not improve from 0.68504

Epoch 13: val_accuracy did not improve from 0.68504

Epoch 14: val_accuracy did not improve from 0.68504

Epoch 15: val_accuracy improved from 0.68504 to 0.69685, saving model to weights.best.hdf5

Epoch 16: val_accuracy improved from 0.69685 to 0.71260, saving model to weights.best.hdf5

Epoch 17: val_accuracy improved from 0.71260 to 0.72047, saving model to weights.best.hdf5

Epoch 18: val_accuracy did not improve from 0.72047

Epoch 19: val_accuracy did not improve from 0.72047

Epoch 20: val_accuracy did not improve from 0.72047

Epoch 21: val_accuracy did not improve from 0.72047

Epoch 22: val_accuracy did not improve from 0.72047

Training stopped after epoch 22 because no better validation accuracy was achieved in the last 5 epochs.
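The patience rule can be replayed in plain Python to see why epoch 22 is the stopping point. This is a simplified sketch of the rule, not the EarlyStopping source code. The improving scores are taken from the log above. The log does not show the scores of the non-improving epochs, so a placeholder value of 0.5 is assumed for them.

```python
def stopping_epoch(val_accuracies, patience):
    """Return the epoch at which training stops, or None if it runs to the end."""
    best = float("-inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_accuracies, start=1):
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Scores at the improving epochs, copied from the training log above.
improvements = {1: 0.51969, 3: 0.54724, 4: 0.61417, 7: 0.66142,
                10: 0.68504, 15: 0.69685, 16: 0.71260, 17: 0.72047}

# Assumption: non-improving epochs get a placeholder score below every best value.
scores = [improvements.get(epoch, 0.5) for epoch in range(1, 31)]

stopped_at = stopping_epoch(scores, patience=5)
print(stopped_at)
```

Epochs 11 to 14 are only four misses in a row, so training continues. Epochs 18 to 22 are five in a row, which reaches the patience limit.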

## Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.

The checkpoint only includes the model weights. It assumes you know the network structure. This too can be serialized to a file in JSON format (older Keras releases also supported YAML, which recent TensorFlow 2.x releases no longer do).

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then evaluated on the entire dataset.

# How to load and use weights from a checkpoint

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense

from tensorflow.keras.callbacks import ModelCheckpoint

import matplotlib.pyplot as plt

import numpy as np

# create model

model = Sequential()

model.add(Dense(12, input_shape=(8,), activation='relu'))

model.add(Dense(8, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

# load weights

model.load_weights("weights.best.hdf5")

# Compile model (required to evaluate the model)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print("Created model and loaded weights from file")

# load pima indians dataset

dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

# split into input (X) and output (Y) variables

X = dataset[:,0:8]

Y = dataset[:,8]

# estimate accuracy on whole dataset using loaded weights

scores = model.evaluate(X, Y, verbose=0)

print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output.

Created model and loaded weights from file

acc: 77.73%
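If you do not want to repeat the model definition in every script, the structure can be stored alongside the weights. The sketch below assumes the same three-layer network used in this post and a hypothetical file name, `model.json`. It saves the architecture as JSON and rebuilds an untrained copy from it.

```python
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import Dense

# Same network structure as the examples in this post.
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Serialize the architecture only (no weights) to a JSON file.
# "model.json" is a hypothetical file name chosen for this sketch.
with open("model.json", "w") as f:
    f.write(model.to_json())

# Later, rebuild the structure from the file.
with open("model.json") as f:
    restored = model_from_json(f.read())

# The rebuilt model has the same layers but freshly initialized weights.
print(len(restored.layers), restored.count_params())
```

The rebuilt model would still need `load_weights()` and `compile()` before it could be evaluated, as in the example above.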

## Summary

In this post you have discovered the importance of checkpointing deep learning models for long training runs.

You learned two checkpointing strategies that you can use on your next deep learning project:

Checkpoint Model Improvements.

Checkpoint Best Model Only.

You also learned how to load a checkpointed model and make predictions.

Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments and I will do my best to answer.

The post How to Check-Point Deep Learning Models in Keras appeared first on Machine Learning Mastery.
