Thursday, April 18, 2024
No menu items!
HomeArtificial Intelligence and Machine LearningA Guide to Obtaining Time Series Datasets in Python

A Guide to Obtaining Time Series Datasets in Python



Last Updated on March 29, 2022

Datasets from real-world scenarios are important for building and testing machine learning models. You may just want to have some data to experiment with an algorithm. You may also want to evaluate your model by setting up a benchmark or determining its weaknesses using different sets of data. Sometimes, you may also want to create synthetic datasets, where you can test your algorithms under controlled conditions by adding noise, correlations, or redundant information to the data.

In this post, we’ll illustrate how you can use Python to fetch some real-world time-series data from different sources. We’ll also create synthetic time-series data using Python’s libraries.

After completing this tutorial, you will know:

How to use the pandas_datareader
How to call a web data server’s APIs using the requests library
How to generate synthetic time-series data

Let’s get started.

A Guide to Working With Datasets in Python
Photo by Mehreen Saeed, some rights reserved

Tutorial Overview

This tutorial is divided into three parts; they are:

Using pandas_datareader
Using the requests library to fetch data using the remote server’s APIs
Generate synthetic time-series data

Loading Data Using pandas-datareader

This post will depend on a few libraries. If you haven’t installed them in your system, you may install them using pip:

pip install pandas_datareader requests

The pandas_datareader library allows you to fetch data from different sources, including Yahoo Finance for financial market data, World Bank for global development data, and St. Louis Fed for economic data. In this section, we’ll show how you can load data from different sources.

Behind the scene, pandas_datareader pulls the data you want from the web in real time and assembles it into a pandas DataFrame. Because of the vastly different structure of web pages, each data source needs a different reader. Hence, pandas_datareader only supports reading from a limited number of sources, mostly related to financial and economic time series.

Fetching data is simple. For example, we know that the stock ticker for Apple is AAPL, so we can get the daily historical prices of Apple stock from Yahoo Finance as follows:

import pandas_datareader as pdr

# Reading Apple shares from yahoo finance server
shares_df = pdr.DataReader(‘AAPL’, ‘yahoo’, start=’2021-01-01′, end=’2021-12-31′)
# Look at the data read
print(shares_df)

The call to DataReader() requires the first argument to specify the ticker and the second argument the data source. The above code prints the DataFrame:

High Low Open Close Volume Adj Close
Date
2021-01-04 133.610001 126.760002 133.520004 129.410004 143301900.0 128.453461
2021-01-05 131.740005 128.429993 128.889999 131.009995 97664900.0 130.041611
2021-01-06 131.050003 126.379997 127.720001 126.599998 155088000.0 125.664215
2021-01-07 131.630005 127.860001 128.360001 130.919998 109578200.0 129.952271
2021-01-08 132.630005 130.229996 132.429993 132.050003 105158200.0 131.073914
… … … … … … …
2021-12-27 180.419998 177.070007 177.089996 180.330002 74919600.0 180.100540
2021-12-28 181.330002 178.529999 180.160004 179.289993 79144300.0 179.061859
2021-12-29 180.630005 178.139999 179.330002 179.380005 62348900.0 179.151749
2021-12-30 180.570007 178.089996 179.470001 178.199997 59773000.0 177.973251
2021-12-31 179.229996 177.259995 178.089996 177.570007 64062300.0 177.344055

[252 rows x 6 columns]

We may also fetch the stock price history from multiple companies with the tickers in a list:

companies = [‘AAPL’, ‘MSFT’, ‘GE’]
shares_multiple_df = pdr.DataReader(companies, ‘yahoo’, start=’2021-01-01′, end=’2021-12-31′)
print(shares_multiple_df.head())

and the result would be a DataFrame with multi-level columns:

Attributes Adj Close Close
Symbols AAPL MSFT GE AAPL MSFT
Date
2021-01-04 128.453461 215.434982 83.421600 129.410004 217.690002
2021-01-05 130.041611 215.642776 85.811905 131.009995 217.899994
2021-01-06 125.664223 210.051315 90.512833 126.599998 212.250000
2021-01-07 129.952286 216.028732 89.795753 130.919998 218.289993
2021-01-08 131.073944 217.344986 90.353485 132.050003 219.619995

Attributes Volume
Symbols AAPL MSFT GE
Date
2021-01-04 143301900.0 37130100.0 9993688.0
2021-01-05 97664900.0 23823000.0 10462538.0
2021-01-06 155088000.0 35930700.0 16448075.0
2021-01-07 109578200.0 27694500.0 9411225.0
2021-01-08 105158200.0 22956200.0 9089963.0

Because of the structure of DataFrames, it is convenient to extract part of the data. For example, we can plot only the daily close price on some dates using the following:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# General routine for plotting time series data
def plot_timeseries_df(df, attrib, ticker_loc=1, title=’Timeseries’,
legend=”):
fig = plt.figure(figsize=(15,7))
plt.plot(df[attrib], ‘o-‘)
_ = plt.xticks(rotation=90)
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(ticker_loc))
plt.title(title)
plt.gca().legend(legend)
plt.show()

plot_timeseries_df(shares_multiple_df.loc[“2021-04-01″:”2021-06-30”], “Close”,
ticker_loc=3, title=”Close price”, legend=companies)

Multiple shares fetched from Yahoo Finance

The complete code is as follows:

import pandas_datareader as pdr
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

companies = [‘AAPL’, ‘MSFT’, ‘GE’]
shares_multiple_df = pdr.DataReader(companies, ‘yahoo’, start=’2021-01-01′, end=’2021-12-31′)
print(shares_multiple_df)

def plot_timeseries_df(df, attrib, ticker_loc=1, title=’Timeseries’, legend=”):
“General routine for plotting time series data”
fig = plt.figure(figsize=(15,7))
plt.plot(df[attrib], ‘o-‘)
_ = plt.xticks(rotation=90)
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(ticker_loc))
plt.title(title)
plt.gca().legend(legend)
plt.show()

plot_timeseries_df(shares_multiple_df.loc[“2021-04-01″:”2021-06-30”], “Close”,
ticker_loc=3, title=”Close price”, legend=companies)

The syntax for reading from another data source using pandas-datareader is similar. For example, we can read an economic time series from the Federal Reserve Economic Data (FRED). Every time series in FRED is identified by a symbol. For example, the consumer price index for all urban consumers is CPIAUCSL, the consumer price index for all items less food and energy is CPILFESL, and personal consumption expenditure is PCE. You can search and look up the symbols from FRED’s webpage.

Below is how we can obtain two consumer price indices, CPIAUCSL and CPILFESL, and show them in a plot:

import pandas_datareader as pdr
import matplotlib.pyplot as plt

# Read data from FRED and print
fred_df = pdr.DataReader([‘CPIAUCSL’,’CPILFESL’], ‘fred’, “2010-01-01”, “2021-12-31”)
print(fred_df)

# Show in plot the data of 2019-2021
fig = plt.figure(figsize=(15,7))
plt.plot(fred_df.loc[“2019”:], ‘o-‘)
plt.xticks(rotation=90)
plt.legend(fred_df.columns)
plt.title(“Consumer Price Index”)
plt.show()

Plot of Consumer Price Index

Obtaining data from World Bank is also similar, but we have to understand that the data from World Bank is more complicated. Usually, a data series, such as population, is presented as a time series and also has the countries dimension. Therefore, we need to specify more parameters to obtain the data.

Using pandas_datareader, we have a specific set of APIs for the World Bank. The symbol for an indicator can be looked up from World Bank Open Data or searched using the following:

from pandas_datareader import wb

matches = wb.search(‘total.*population’)
print(matches[[“id”,”name”]])

The search() function accepts a regular expression string (e.g., .* above means string of any length). This will print:

id name
24 1.1_ACCESS.ELECTRICITY.TOT Access to electricity (% of total population)
164 2.1_ACCESS.CFT.TOT Access to Clean Fuels and Technologies for coo…
1999 CC.AVPB.PTPI.AI Additional people below $1.90 as % of total po…
2000 CC.AVPB.PTPI.AR Additional people below $1.90 as % of total po…
2001 CC.AVPB.PTPI.DI Additional people below $1.90 as % of total po…
… … …
13908 SP.POP.TOTL.FE.ZS Population, female (% of total population)
13912 SP.POP.TOTL.MA.ZS Population, male (% of total population)
13938 SP.RUR.TOTL.ZS Rural population (% of total population)
13958 SP.URB.TOTL.IN.ZS Urban population (% of total population)
13960 SP.URB.TOTL.ZS Percentage of Population in Urban Areas (in % …

[137 rows x 2 columns]

where the id column is the symbol for the time series.

We can read data for specific countries by specifying the ISO-3166-1 country code. But World Bank also contains non-country aggregates (e.g., South Asia), so while pandas_datareader allows us to use the string “all” for all countries, usually we do not want to use it. Below is how we can get a list of all countries and aggregates from the World Bank:

import pandas_datareader.wb as wb

countries = wb.get_countries()
print(countries)

iso3c iso2c name region adminregion incomeLevel lendingType capitalCity longitude latitude
0 ABW AW Aruba Latin America & … High income Not classified Oranjestad -70.0167 12.5167
1 AFE ZH Africa Eastern a… Aggregates Aggregates Aggregates NaN NaN
2 AFG AF Afghanistan South Asia South Asia Low income IDA Kabul 69.1761 34.5228
3 AFR A9 Africa Aggregates Aggregates Aggregates NaN NaN
4 AFW ZI Africa Western a… Aggregates Aggregates Aggregates NaN NaN
.. … … … … … … … … … …
294 XZN A5 Sub-Saharan Afri… Aggregates Aggregates Aggregates NaN NaN
295 YEM YE Yemen, Rep. Middle East & No… Middle East & No… Low income IDA Sana’a 44.2075 15.3520
296 ZAF ZA South Africa Sub-Saharan Africa Sub-Saharan Afri… Upper middle income IBRD Pretoria 28.1871 -25.7460
297 ZMB ZM Zambia Sub-Saharan Africa Sub-Saharan Afri… Lower middle income IDA Lusaka 28.2937 -15.3982
298 ZWE ZW Zimbabwe Sub-Saharan Africa Sub-Saharan Afri… Lower middle income Blend Harare 31.0672 -17.8312

Below is how we can get the population of all countries in 2020 and show the top 25 countries in a bar chart. Certainly, we can also get the population data across years by specifying a different start and end year:

import pandas_datareader.wb as wb
import pandas as pd
import matplotlib.pyplot as plt

# Get a list of 2-letter country code excluding aggregates
countries = wb.get_countries()
countries = list(countries[countries.region != “Aggregates”][“iso2c”])

# Read countries’ total population data (SP.POP.TOTL) in year 2020
population_df = wb.download(indicator=”SP.POP.TOTL”, country=countries, start=2020, end=2020)

# Sort by population, then take top 25 countries, and make the index (i.e., countries) as a column
population_df = (population_df.dropna()
.sort_values(“SP.POP.TOTL”)
.iloc[-25:]
.reset_index())

# Plot the population, in millions
fig = plt.figure(figsize=(15,7))
plt.bar(population_df[“country”], population_df[“SP.POP.TOTL”]/1e6)
plt.xticks(rotation=90)
plt.ylabel(“Million Population”)
plt.title(“Population”)
plt.show()

Bar chart of total population of different countries

Fetching Data Using Web APIs

Instead of using the pandas_datareader library, sometimes you have the option to fetch data directly from a web data server by calling its web APIs without any authentication needed. It can be done in Python using the standard library urllib.requests, or you may also use the requests library for an easier interface.

World Bank is an example where web APIs are freely available, so we can easily read data in different formats, such as JSON, XML, or plain text. The page on the World Bank data repository’s API describes various APIs and their respective parameters. To repeat what we did in the previous example without using pandas_datareader, we first construct a URL to read a list of all countries so we can find the country code that is not an aggregate. Then, we can construct a query URL with the following arguments:

country argument with value = all
indicator argument with value = SP.POP.TOTL
date argument with value = 2020
format argument with value = json

Of course, you can experiment with different indicators. By default, the World Bank returns 50 items on a page, and we need to query for one page after another to exhaust the data. We can enlarge the page size to get all data in one shot. Below is how we get the list of countries in JSON format and collect the country codes:

import requests

# Create query URL for list of countries, by default only 50 entries returned per page
url = “http://api.worldbank.org/v2/country/all?format=json&per_page=500”
response = requests.get(url)
# Expects HTTP status code 200 for correct query
print(response.status_code)
# Get the response in JSON
header, data = response.json()
print(header)
# Collect a list of 3-letter country code excluding aggregates
countries = [item[“id”]
for item in data
if item[“region”][“value”] != “Aggregates”]
print(countries)

It will print the HTTP status code, the header, and the list of country codes as follows:

200
{‘page’: 1, ‘pages’: 1, ‘per_page’: ‘500’, ‘total’: 299}
[‘ABW’, ‘AFG’, ‘AGO’, ‘ALB’, …, ‘YEM’, ‘ZAF’, ‘ZMB’, ‘ZWE’]

From the header, we can verify that we exhausted the data (page 1 out of 1). Then we can get all population data as follows:

# Create query URL for total population from all countries in 2020
arguments = {
“country”: “all”,
“indicator”: “SP.POP.TOTL”,
“date”: “2020:2020”,
“format”: “json”
}
url = “http://api.worldbank.org/v2/country/{country}/”
“indicator/{indicator}?date={date}&format={format}&per_page=500”
query_population = url.format(**arguments)
response = requests.get(query_population)
# Get the response in JSON
header, population_data = response.json()

You should check the World Bank API documentation for details on how to construct the URL. For example, the date syntax of 2020:2021 would mean the start and end years, and the extra parameter page=3 will give you the third page in a multi-page result. With the data fetched, we can filter for only those non-aggregate countries, make it into a pandas DataFrame for sorting, and then plot the bar chart:

# Filter for countries, not aggregates
population = []
for item in population_data:
if item[“countryiso3code”] in countries:
name = item[“country”][“value”]
population.append({“country”:name, “population”: item[“value”]})
# Create DataFrame for sorting and filtering
population = pd.DataFrame.from_dict(population)
population = population.dropna().sort_values(“population”).iloc[-25:]
# Plot bar chart
fig = plt.figure(figsize=(15,7))
plt.bar(population[“country”], population[“population”]/1e6)
plt.xticks(rotation=90)
plt.ylabel(“Million Population”)
plt.title(“Population”)
plt.show()

The figure should be precisely the same as before. But as you can see, using pandas_datareader helps make the code more concise by hiding the low-level operations.

Putting everything together, the following is the complete code:

import pandas as pd
import matplotlib.pyplot as plt
import requests

# Create query URL for list of countries, by default only 50 entries returned per page
url = “http://api.worldbank.org/v2/country/all?format=json&per_page=500”
response = requests.get(url)
# Expects HTTP status code 200 for correct query
print(response.status_code)
# Get the response in JSON
header, data = response.json()
print(header)
# Collect a list of 3-letter country code excluding aggregates
countries = [item[“id”]
for item in data
if item[“region”][“value”] != “Aggregates”]
print(countries)

# Create query URL for total population from all countries in 2020
arguments = {
“country”: “all”,
“indicator”: “SP.POP.TOTL”,
“date”: 2020,
“format”: “json”
}
url = “http://api.worldbank.org/v2/country/{country}/”
“indicator/{indicator}?date={date}&format={format}&per_page=500”
query_population = url.format(**arguments)
response = requests.get(query_population)
print(response.status_code)
# Get the response in JSON
header, population_data = response.json()
print(header)

# Filter for countries, not aggregates
population = []
for item in population_data:
if item[“countryiso3code”] in countries:
name = item[“country”][“value”]
population.append({“country”:name, “population”: item[“value”]})
# Create DataFrame for sorting and filtering
population = pd.DataFrame.from_dict(population)
population = population.dropna().sort_values(“population”).iloc[-25:]
# Plot bar chart
fig = plt.figure(figsize=(15,7))
plt.bar(population[“country”], population[“population”]/1e6)
plt.xticks(rotation=90)
plt.ylabel(“Million Population”)
plt.title(“Population”)
plt.show()

Creating Synthetic Data Using NumPy

Sometimes, we may not want to use real-world data for our project because we need something specific that may not happen in reality. One particular example is to test out a model with ideal time-series data. In this section, we will see how we can create synthetic autoregressive (AR) time-series data.

The numpy.random library can be used to create random samples from different distributions. The randn() method generates data from a standard normal distribution with zero mean and unit variance.

In the AR($n$) model of order $n$, the value $x_t$ at time step $t$ depends upon the values at the previous $n$ time steps. That is,

$$
x_t = b_1 x_{t-1} + b_2 x_{t-2} + … + b_n x_{t-n} + e_t
$$

with model parameters $b_i$ as coefficients to different lags of $x_t$, and the error term $e_t$ is expected to follow normal distribution.

Understanding the formula, we can generate an AR(3) time series in the example below. We first use randn() to generate the first 3 values of the series and then iteratively apply the above formula to generate the next data point. Then, an error term is added using the randn() function again, subject to the predefined noise_level:

import numpy as np

# Predefined paramters
ar_n = 3 # Order of the AR(n) data
ar_coeff = [0.7, -0.3, -0.1] # Coefficients b_3, b_2, b_1
noise_level = 0.1 # Noise added to the AR(n) data
length = 200 # Number of data points to generate

# Random initial values
ar_data = list(np.random.randn(ar_n))

# Generate the rest of the values
for i in range(length – ar_n):
next_val = (np.array(ar_coeff) @ np.array(ar_data[-3:])) + np.random.randn() * noise_level
ar_data.append(next_val)

# Plot the time series
fig = plt.figure(figsize=(12,5))
plt.plot(ar_data)
plt.show()

The code above will create the following plot:

But we can further add the time axis by first converting the data into a pandas DataFrame and then adding the time as an index:

# Convert the data into a pandas DataFrame
synthetic = pd.DataFrame({“AR(3)”: ar_data})
synthetic.index = pd.date_range(start=”2021-07-01″, periods=len(ar_data), freq=”D”)

# Plot the time series
fig = plt.figure(figsize=(12,5))
plt.plot(synthetic.index, synthetic)
plt.xticks(rotation=90)
plt.title(“AR(3) time series”)
plt.show()

after which we will have the following plot instead:

Plot of synthetic time series

Using similar techniques, we can generate pure random noise (i.e., AR(0) series), ARIMA time series (i.e., with coefficients to error terms), or Brownian motion time series (i.e., running sum of random noise) as well.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Libraries

pandas-datareader
Python requests

Data source

Yahoo! Finance
St. Louis Fed Federal Research Economic Data
World Bank Open Data
World Bank Data API documentation

Books

Think Python: How to Think Like a Computer Scientist by Allen B. Downey
Programming in Python 3: A Complete Introduction to the Python Language by Mark Summerfield
Python for Data Analysis, 2nd edition, by Wes McKinney

Summary

In this tutorial, you discovered various options for fetching data or generating synthetic time-series data in Python.

Specifically, you learned:

How to use pandas_datareader and fetch financial data from different data sources
How to call APIs to fetch data from different web servers using the requests library
How to generate synthetic time-series data using NumPy’s random number generator

Do you have any questions about the topics discussed in this post? Ask your questions in the comments below, and I will do my best to answer.



The post A Guide to Obtaining Time Series Datasets in Python appeared first on Machine Learning Mastery.

Read MoreMachine Learning Mastery

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments