Static Analyzers in Python

By mullaned2002

April 1, 2022

Last Updated on March 30, 2022

Static analyzers are tools that check your code without actually running it. The most basic static analyzers are the syntax highlighters in your favorite editors. If your code needs to be compiled, as with C++, your compiler (for example, one built on LLVM) may also provide static analysis to warn you about potential issues (e.g., mistakenly using assignment “=” where equality “==” was intended). In Python, we have several tools that identify potential errors or point out violations of coding standards.

After finishing this tutorial, you will know some of these tools. Specifically,

What can the tools Pylint, Flake8, and mypy do?
What are coding style violations?
How can we use type hints to help analyzers to identify potential bugs?

Let’s get started.

Static analyzers in Python
Photo by Skylar Kang. Some rights reserved

Overview

This tutorial is in three parts:

Introduction to Pylint
Introduction to Flake8
Introduction to mypy

Pylint

Lint was the name of a static analyzer for C created a long time ago. Pylint borrowed its name and is one of the most widely used static analyzers for Python. It is available as a Python package, and we can install it with pip:

$ pip install pylint

After that, the pylint command is available on our system.

Pylint can check a single script or an entire directory. For example, suppose we have the following script saved as lenet5-notworking.py:

import numpy as np
import h5py
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Load MNIST digits
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Reshape data to (n_samples, height, width, n_channel)
X_train = np.expand_dims(X_train, axis=3).astype("float32")
X_test = np.expand_dims(X_test, axis=3).astype("float32")

# One-hot encode the output
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# LeNet5 model
def createmodel(activation):
    model = Sequential([
        Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation=activation),
        AveragePooling2D((2,2), strides=2),
        Conv2D(16, (5,5), activation=activation),
        AveragePooling2D((2,2), strides=2),
        Conv2D(120, (5,5), activation=activation),
        Flatten(),
        Dense(84, activation=activation),
        Dense(10, activation="softmax")
    ])
    return model

# Train the model
model = createmodel(tanh)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=[earlystopping])

# Evaluate the model
print(model.evaluate(X_test, y_test, verbose=0))
model.save("lenet5.h5")

We can ask Pylint to tell us how good our code is before even running it:

$ pylint lenet5-notworking.py

The output is as follows:

************* Module lenet5-notworking
lenet5-notworking.py:39:0: C0301: Line too long (115/100) (line-too-long)
lenet5-notworking.py:1:0: C0103: Module name "lenet5-notworking" doesn't conform to snake_case naming style (invalid-name)
lenet5-notworking.py:1:0: C0114: Missing module docstring (missing-module-docstring)
lenet5-notworking.py:4:0: E0611: No name 'datasets' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:5:0: E0611: No name 'models' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:6:0: E0611: No name 'layers' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:7:0: E0611: No name 'utils' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:8:0: E0611: No name 'callbacks' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:18:25: E0601: Using variable 'y_train' before assignment (used-before-assignment)
lenet5-notworking.py:19:24: E0601: Using variable 'y_test' before assignment (used-before-assignment)
lenet5-notworking.py:23:4: W0621: Redefining name 'model' from outer scope (line 36) (redefined-outer-name)
lenet5-notworking.py:22:0: C0116: Missing function or method docstring (missing-function-docstring)
lenet5-notworking.py:36:20: E0602: Undefined variable 'tanh' (undefined-variable)
lenet5-notworking.py:2:0: W0611: Unused import h5py (unused-import)
lenet5-notworking.py:3:0: W0611: Unused tensorflow imported as tf (unused-import)
lenet5-notworking.py:6:0: W0611: Unused Dropout imported from tensorflow.keras.layers (unused-import)

-------------------------------------
Your code has been rated at -11.82/10

If you provide the root directory of a module to Pylint, it will check all components of the module. In that case, you will see the paths of the different files at the beginning of each line.

There are several things to note here. First, the complaints from Pylint fall into different categories. Most commonly we see issues of convention (i.e., matters of style), warnings (i.e., the code may run in a way inconsistent with what you intended), and errors (i.e., the code may fail to run and throw exceptions). Each is identified by a code such as E0601, in which the first letter is the category.
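As a rough illustration of that naming scheme, the following sketch maps the first letter of a message code to its category. The function message_category and the dictionary are hypothetical helpers, not part of Pylint. The letters R, F, and I (refactor, fatal, informational) are categories Pylint also uses, although they do not appear in the output above.

```python
# Hypothetical helper, not part of Pylint: decode the category letter of a code.
PYLINT_CATEGORIES = {
    "C": "convention",
    "R": "refactor",
    "W": "warning",
    "E": "error",
    "F": "fatal",
    "I": "informational",
}


def message_category(code: str) -> str:
    """Return the category name for a Pylint message code such as 'E0601'."""
    return PYLINT_CATEGORIES.get(code[:1].upper(), "unknown")


# Codes taken from the Pylint output above
for code in ("C0301", "W0621", "E0601"):
    print(code, "->", message_category(code))
```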

Pylint may give false positives. In the example above, Pylint flagged the import from tensorflow.keras.datasets as an error. This is caused by an optimization in the TensorFlow package: not everything is scanned and loaded by Python when we import TensorFlow. Instead, a LazyLoader is created to load only the necessary parts of the large package. This saves significant time when starting the program, but it also confuses Pylint into thinking we are importing something that does not exist.

Furthermore, one of the key features of Pylint is helping us align our code with the PEP 8 coding style. When we define a function without a docstring, for instance, Pylint will complain that we didn’t follow the coding convention even though the code is not doing anything wrong.

But the most important use of Pylint is to help us identify potential issues. We misspelled y_train as Y_train with an uppercase Y, and Pylint tells us that we are using a variable before assigning any value to it. It does not tell us directly what went wrong, but it definitely points us to the right spot to proofread our code. Similarly, when we defined the variable model on line 23, Pylint told us that there is a variable of the same name in the outer scope, so a later reference to model may not be what we were thinking of. Likewise, an unused import may simply mean that we misspelled the name of a module.

All these are hints provided by Pylint. We still have to use our judgement to correct our code (or ignore Pylint’s complaints).
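To get a feel for how a tool can flag a name like tanh without running the script, here is a toy sketch built on Python’s standard ast module. It is not how Pylint is implemented. The function undefined_names and the sample source are assumptions made for illustration, and the check ignores scopes and statement order, so it would miss the y_train mistake that Pylint caught.

```python
import ast
import builtins


def undefined_names(source: str) -> set:
    """Return names that are read somewhere in `source` but never bound.

    Toy check only: it parses the code without executing it and
    ignores scoping rules and the order of statements.
    """
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds "a"; "import a as x" binds "x"
                defined.add((alias.asname or alias.name).split(".")[0])
    return used - defined


# Hypothetical sample, shortened from the script above; it is never executed
SAMPLE = '''
import numpy as np

def createmodel(activation):
    model = [activation]
    return model

model = createmodel(tanh)
print(np.shape(model))
'''

print(undefined_names(SAMPLE))
```

Real analyzers track scopes, control flow, and imports across files, which is why their reports are far more precise than this sketch.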

But if you know what Pylint should stop complaining about, you can ask it to ignore those issues. For example, since we know the import statements are fine, we can invoke Pylint with:

$ pylint -d E0611 lenet5-notworking.py

which tells Pylint to ignore all errors with code E0611. You can disable multiple codes with a comma-separated list, e.g.,

$ pylint -d E0611,C0301 lenet5-notworking.py

If you want to disable some issues on only a specific line or a specific part of the code, you can add special comments to your code, as follows:

…
from tensorflow.keras.datasets import mnist  # pylint: disable=no-name-in-module
from tensorflow.keras.models import Sequential  # pylint: disable=E0611
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical

The magic keyword pylint: introduces Pylint-specific instructions. The code E0611 and the name no-name-in-module refer to the same check. In the example above, Pylint will complain about the last two import statements but not the first two, because of those special comments.
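The same comments can also cover a region rather than a single line: a disable comment on a line of its own stays in effect until a matching enable comment or the end of the enclosing block. A minimal sketch follows. The function Scale is hypothetical and exists only to carry the comments, which Python itself ignores at run time.

```python
import json  # pylint: disable=unused-import


# pylint: disable=invalid-name
def Scale(value, factor=2):
    """Multiply value by factor (the CamelCase name would normally trigger C0103)."""
    return value * factor
# pylint: enable=invalid-name


print(Scale(21))  # prints 42
```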

Flake8

Flake8 is in fact a wrapper over PyFlakes, McCabe, and pycodestyle. When you install flake8 with

$ pip install flake8

you will install all these dependencies.

Similar to Pylint, we have the command flake8 after installing this package, and we can pass in a script or a directory for analysis. But Flake8 leans more toward coding style. Hence we see the following output for the same code as above:

$ flake8 lenet5-notworking.py
lenet5-notworking.py:2:1: F401 'h5py' imported but unused
lenet5-notworking.py:3:1: F401 'tensorflow as tf' imported but unused
lenet5-notworking.py:6:1: F401 'tensorflow.keras.layers.Dropout' imported but unused
lenet5-notworking.py:6:80: E501 line too long (85 > 79 characters)
lenet5-notworking.py:18:26: F821 undefined name 'y_train'
lenet5-notworking.py:19:25: F821 undefined name 'y_test'
lenet5-notworking.py:22:1: E302 expected 2 blank lines, found 1
lenet5-notworking.py:24:21: E231 missing whitespace after ','
lenet5-notworking.py:24:41: E231 missing whitespace after ','
lenet5-notworking.py:24:44: E231 missing whitespace after ','
lenet5-notworking.py:24:80: E501 line too long (87 > 79 characters)
lenet5-notworking.py:25:28: E231 missing whitespace after ','
lenet5-notworking.py:26:22: E231 missing whitespace after ','
lenet5-notworking.py:27:28: E231 missing whitespace after ','
lenet5-notworking.py:28:23: E231 missing whitespace after ','
lenet5-notworking.py:36:1: E305 expected 2 blank lines after class or function definition, found 1
lenet5-notworking.py:36:21: F821 undefined name 'tanh'
lenet5-notworking.py:37:80: E501 line too long (86 > 79 characters)
lenet5-notworking.py:38:80: E501 line too long (88 > 79 characters)
lenet5-notworking.py:39:80: E501 line too long (115 > 79 characters)

The error codes beginning with the letter E are from pycodestyle, and those beginning with the letter F are from PyFlakes. We can see that it complains about coding style issues, such as (5,5) not having a space after the comma. It can also identify the use of variables before assignment. But it does not catch some code smells, such as the function createmodel() reusing the variable name model that is already defined in the outer scope.
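To see how the complaints split between the bundled tools, we can tally a report by code prefix. This is only a sketch: the function summarize, the regular expression, and the prefix table are assumptions made for illustration. The W prefix (pycodestyle warnings) and the C prefix (McCabe’s complexity check, C901) do not appear in the output above.

```python
import re
from collections import Counter

# Which bundled tool owns which code prefix (table written for this sketch)
SOURCES = {"E": "pycodestyle", "W": "pycodestyle", "F": "PyFlakes", "C": "McCabe"}

# Matches lines of the form path:line:col: CODE message
LINE_PATTERN = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]\d+) (?P<text>.*)$"
)


def summarize(report: str) -> Counter:
    """Count the complaints in a Flake8 report per originating tool."""
    counts = Counter()
    for line in report.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            counts[SOURCES.get(match["code"][0], "other")] += 1
    return counts


# A few lines copied from the Flake8 output above
SAMPLE = """\
lenet5-notworking.py:2:1: F401 'h5py' imported but unused
lenet5-notworking.py:6:80: E501 line too long (85 > 79 characters)
lenet5-notworking.py:18:26: F821 undefined name 'y_train'
lenet5-notworking.py:24:21: E231 missing whitespace after ','
"""

print(summarize(SAMPLE))
```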

Similar to Pylint, we can also ask Flake8 to ignore some complaints. For example,

$ flake8 --ignore E501,E231 lenet5-notworking.py

then those lines will not be printed in the output:

lenet5-notworking.py:2:1: F401 'h5py' imported but unused
lenet5-notworking.py:3:1: F401 'tensorflow as tf' imported but unused
lenet5-notworking.py:6:1: F401 'tensorflow.keras.layers.Dropout' imported but unused
lenet5-notworking.py:18:26: F821 undefined name 'y_train'
lenet5-notworking.py:19:25: F821 undefined name 'y_test'
lenet5-notworking.py:22:1: E302 expected 2 blank lines, found 1
lenet5-notworking.py:36:1: E305 expected 2 blank lines after class or function definition, found 1
lenet5-notworking.py:36:21: F821 undefined name 'tanh'

We can also use magic comments to disable some complaints, e.g.,

…
import tensorflow as tf  # noqa: F401
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential

Flake8 looks for the comment # noqa: to skip some complaints on those particular lines.

Mypy

Python is not a statically typed language, so, unlike in C or Java, you do not need to declare the types of functions or variables before use. But Python has introduced type hint notation, so we can specify what type a function or variable is intended to be without enforcing compliance as a statically typed language would.

One of the biggest benefits of using type hints in Python is to provide additional information for static analyzers to check. Mypy is a tool that can understand type hints. Even without type hints, Mypy can still raise complaints similar to those of Pylint and Flake8.
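A minimal sketch of what “without enforcing” means in practice: the hypothetical function below is annotated to take and return int, yet Python runs it with a string argument without complaint. The hints are merely stored on the function for tools to read, and it is a static type checker that would report the mismatch.

```python
def double(x: int) -> int:
    """Return twice the argument; the hints say int, but nothing enforces it."""
    return x * 2


print(double(4))               # 8
print(double("ab"))            # abab: runs fine, although the hint says int
print(double.__annotations__)  # the hints are kept as plain metadata
```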

We can install Mypy from PyPI:

$ pip install mypy

Then the example above can be provided to the mypy command:

$ mypy lenet5-notworking.py
lenet5-notworking.py:2: error: Skipping analyzing "h5py": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:2: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
lenet5-notworking.py:3: error: Skipping analyzing "tensorflow": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:4: error: Skipping analyzing "tensorflow.keras.datasets": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:5: error: Skipping analyzing "tensorflow.keras.models": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:6: error: Skipping analyzing "tensorflow.keras.layers": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:7: error: Skipping analyzing "tensorflow.keras.utils": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:8: error: Skipping analyzing "tensorflow.keras.callbacks": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:18: error: Cannot determine type of "y_train"
lenet5-notworking.py:19: error: Cannot determine type of "y_test"
lenet5-notworking.py:36: error: Name "tanh" is not defined
Found 10 errors in 1 file (checked 1 source file)

We see errors similar to those from Pylint above, although sometimes not as precise (e.g., the issue with the variable y_train). However, we see one characteristic of mypy above: it expects all libraries we use to come with a stub so that type checking can be done. This is because type hints are optional. If the code from a library does not provide type hints, the code can still work, but mypy cannot verify it. Some libraries have typing stubs available that enable mypy to check them better.

Let’s consider another example:

import h5py

def dumphdf5(filename: str) -> int:
    """Open a HDF5 file and print all the dataset and attributes stored

    Args:
        filename: The HDF5 filename

    Returns:
        Number of dataset found in the HDF5 file
    """
    count: int = 0

    def recur_dump(obj) -> None:
        print(f"{obj.name} ({type(obj).__name__})")
        if obj.attrs.keys():
            print("\tAttribs:")
            for key in obj.attrs.keys():
                print(f"\t\t{key}: {obj.attrs[key]}")
        if isinstance(obj, h5py.Group):
            # Group has key-value pairs
            for key, value in obj.items():
                recur_dump(value)
        elif isinstance(obj, h5py.Dataset):
            count += 1
            print(obj[()])

    with h5py.File(filename) as obj:
        recur_dump(obj)
        print(f"{count} dataset found")

with open("my_model.h5") as fp:
    dumphdf5(fp)

This program is supposed to load an HDF5 file (such as a Keras model) and print every attribute and dataset stored in it. We used the h5py module (which does not have a typing stub, so mypy cannot identify the types it uses), but we added type hints to the function we defined, dumphdf5(). This function expects the filename of an HDF5 file and prints everything stored inside. At the end, the number of datasets stored is returned.

If we save this script as dumphdf5.py and pass it to mypy, we will see the following:

$ mypy dumphdf5.py
dumphdf5.py:1: error: Skipping analyzing "h5py": module is installed, but missing library stubs or py.typed marker
dumphdf5.py:1: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
dumphdf5.py:3: error: Missing return statement
dumphdf5.py:33: error: Argument 1 to "dumphdf5" has incompatible type "TextIO"; expected "str"
Found 3 errors in 1 file (checked 1 source file)

We misused our function: an opened file object is passed into dumphdf5() instead of just the filename (as a string), and mypy can identify this error. We also declared that the function should return an integer, but we didn’t have a return statement in the function.

However, there is one more error in this code that mypy didn’t identify: the variable count used in the inner function recur_dump() should be declared nonlocal, because it is assigned there but defined in the enclosing function’s scope. This error can be caught by Pylint and Flake8, but mypy missed it.
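The effect of the missing nonlocal can be reproduced in a few lines without h5py. In the sketch below, the two functions are hypothetical stand-ins for dumphdf5(): the first fails with UnboundLocalError as soon as the inner function runs, because the assignment makes count local to the inner function, while the second rebinds the variable of the enclosing function.

```python
def count_broken(items):
    """Count items with a nested helper that forgets `nonlocal`."""
    count = 0

    def visit(item):
        count += 1  # assignment makes `count` local here, so reading it fails

    for item in items:
        visit(item)
    return count


def count_fixed(items):
    """Same logic, but the helper rebinds the enclosing `count`."""
    count = 0

    def visit(item):
        nonlocal count
        count += 1

    for item in items:
        visit(item)
    return count


try:
    count_broken(["a", "b"])
except UnboundLocalError as exc:
    print("broken:", type(exc).__name__)

print("fixed:", count_fixed(["a", "b"]))
```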

The following is the complete, corrected code with no more errors. Note that we added the magic comment “# type: ignore” on the first line to mute the typing stubs warning from mypy:

import h5py  # type: ignore

def dumphdf5(filename: str) -> int:
    """Open a HDF5 file and print all the dataset and attributes stored

    Args:
        filename: The HDF5 filename

    Returns:
        Number of dataset found in the HDF5 file
    """
    count: int = 0

    def recur_dump(obj) -> None:
        nonlocal count
        print(f"{obj.name} ({type(obj).__name__})")
        if obj.attrs.keys():
            print("\tAttribs:")
            for key in obj.attrs.keys():
                print(f"\t\t{key}: {obj.attrs[key]}")
        if isinstance(obj, h5py.Group):
            # Group has key-value pairs
            for key, value in obj.items():
                recur_dump(value)
        elif isinstance(obj, h5py.Dataset):
            count += 1
            print(obj[()])

    with h5py.File(filename) as obj:
        recur_dump(obj)
        print(f"{count} dataset found")
        return count

dumphdf5("my_model.h5")

In conclusion, the three tools introduced above can complement each other. You may consider running all of them to look for possible bugs in your code or to improve its coding style. Each tool allows some configuration, either from the command line or from a config file, to customize it for your needs (e.g., how long must a line be to deserve a warning?). Using a static analyzer is also a way to help yourself develop better programming skills.

Summary

In this tutorial, you’ve seen how some common static analyzers can help you write better Python code. Specifically, you learned:

The strengths and weaknesses of three tools, Pylint, Flake8, and mypy
How to customize the behavior of these tools
How to understand the complaints made by these analyzers

The post Static Analyzers in Python appeared first on Machine Learning Mastery.
