How I get options data for free

by Harry Sauers

An introduction to web scraping for finance

Ever wished you could access historical options data, but got blocked by a paywall? What if you just want it for research, fun, or to develop a personal trading strategy?

In this tutorial, you’ll learn how to use Python and BeautifulSoup to scrape financial data from the Web and build your own dataset.

Getting Started

You should have at least a working knowledge of Python and Web technologies before beginning this tutorial. To build these up, I highly recommend checking out a site like codecademy to learn new skills or brush up on old ones.

First, let's spin up your favorite IDE. Normally I use PyCharm, but for a quick script like this, Repl.it will do the job too. Add a quick print("Hello world") to ensure your environment is set up correctly.

Now we need to figure out a data source.

Unfortunately, Cboe’s awesome options chain data is pretty locked down, even for current delayed quotes. Luckily, Yahoo Finance has solid enough options data here. We’ll use it for this tutorial, as web scrapers often need some content awareness, but it is easily adaptable for any data source you want.

Dependencies

We don't need many external dependencies, just the Requests and BeautifulSoup modules in Python. If you don't have them yet, pip install requests beautifulsoup4 will install both. Add these imports at the top of your program:

from bs4 import BeautifulSoup
import requests

Create a main method:

def main():
    print("Hello World!")

if __name__ == "__main__":
    main()

Scraping HTML

Now you’re ready to start scraping! Inside main(), add these lines to fetch the page’s full HTML:

data_url = "https://finance.yahoo.com/quote/SPY/options"
data_html = requests.get(data_url).content
print(data_html)

This fetches the page’s full HTML content, so we can find the data we want in it. Feel free to give it a run and observe the output.
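
One practical note from experience, not part of the original tutorial: Yahoo sometimes serves an error page or empty markup to the default Requests user agent. If the output looks wrong, spoofing a browser User-Agent often helps; the header value below is just an illustrative example:

# optional: pretend to be a browser if Yahoo rejects the default user agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
data_html = requests.get(data_url, headers=headers).content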

Feel free to comment out print statements as you go — these are just there to help you understand what the program is doing at any given step.

BeautifulSoup is the perfect tool for working with HTML data in Python. Let’s narrow down the HTML to just the options pricing tables so we can better understand it:

content = BeautifulSoup(data_html, "html.parser")
# print(content)

options_tables = content.find_all("table")
print(options_tables)

That’s still quite a bit of HTML — we can’t get much out of that, and Yahoo’s code isn’t the most friendly to web scrapers. Let’s break it down into two tables, for calls and puts:

options_tables = []
tables = content.find_all("table")
for i in range(0, len(tables)):
    options_tables.append(tables[i])

print(options_tables)

Yahoo’s data contains options that are pretty deep in- and out-of-the-money, which might be great for certain purposes. I’m only interested in near-the-money options, namely the two calls and two puts closest to the current price.

Let’s find these, using BeautifulSoup and Yahoo’s differential table entries for in-the-money and out-of-the-money options:

calls = options_tables[0].find_all("tr")[1:]  # first row is header

itm_calls = []
otm_calls = []

for call_option in calls:
    if "in-the-money" in str(call_option):
        itm_calls.append(call_option)
    else:
        otm_calls.append(call_option)

itm_call = itm_calls[-1]
otm_call = otm_calls[0]

print(str(itm_call) + " \n\n " + str(otm_call))

Now we have the HTML table entries for the two call options nearest the money. Let's scrape the pricing data, volume, and implied volatility from the first call option:

itm_call_data = []
for td in BeautifulSoup(str(itm_call), "html.parser").find_all("td"):
    itm_call_data.append(td.text)

print(itm_call_data)

itm_call_info = {'contract': itm_call_data[0], 'strike': itm_call_data[2], 'last': itm_call_data[3],
                 'bid': itm_call_data[4], 'ask': itm_call_data[5], 'volume': itm_call_data[8],
                 'iv': itm_call_data[10]}

print(itm_call_info)
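
One caveat on the hard-coded indices above: they assume Yahoo keeps its column order fixed. A slightly more defensive sketch (my own, not from the original article, and the column names are assumptions about Yahoo's markup) maps cells by the table's header row instead:

# sketch: map cell text by column header instead of fixed positions
header_names = [th.text for th in options_tables[0].find_all("th")]
row_by_header = dict(zip(header_names, itm_call_data))
# then e.g. row_by_header.get("Strike") or row_by_header.get("Bid"),
# assuming Yahoo labels its columns "Strike", "Bid", and so on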

Adapt this code for the next call option:

# otm call
otm_call_data = []
for td in BeautifulSoup(str(otm_call), "html.parser").find_all("td"):
    otm_call_data.append(td.text)

# print(otm_call_data)

otm_call_info = {'contract': otm_call_data[0], 'strike': otm_call_data[2], 'last': otm_call_data[3],
                 'bid': otm_call_data[4], 'ask': otm_call_data[5], 'volume': otm_call_data[8],
                 'iv': otm_call_data[10]}

print(otm_call_info)

Give your program a run!

You now have dictionaries of the two near-the-money call options. Now it's just a matter of scraping the same data from the put options table:

puts = options_tables[1].find_all("tr")[1:]  # first row is header

itm_puts = []
otm_puts = []

for put_option in puts:
    if "in-the-money" in str(put_option):
        itm_puts.append(put_option)
    else:
        otm_puts.append(put_option)

itm_put = itm_puts[0]
otm_put = otm_puts[-1]

# print(str(itm_put) + " \n\n " + str(otm_put) + "\n\n")

itm_put_data = []
for td in BeautifulSoup(str(itm_put), "html.parser").find_all("td"):
    itm_put_data.append(td.text)

# print(itm_put_data)

itm_put_info = {'contract': itm_put_data[0], 'last_trade': itm_put_data[1][:10],
                'strike': itm_put_data[2], 'last': itm_put_data[3],
                'bid': itm_put_data[4], 'ask': itm_put_data[5], 'volume': itm_put_data[8],
                'iv': itm_put_data[10]}

# print(itm_put_info)

# otm put
otm_put_data = []
for td in BeautifulSoup(str(otm_put), "html.parser").find_all("td"):
    otm_put_data.append(td.text)

# print(otm_put_data)

otm_put_info = {'contract': otm_put_data[0], 'last_trade': otm_put_data[1][:10],
                'strike': otm_put_data[2], 'last': otm_put_data[3],
                'bid': otm_put_data[4], 'ask': otm_put_data[5], 'volume': otm_put_data[8],
                'iv': otm_put_data[10]}

Congratulations! You just scraped data for all near-the-money options of the S&P 500 ETF, and can view them like this:

print("\n\n") print(itm_call_info) print(otm_call_info) print(itm_put_info) print(otm_put_info)

Give your program a run — you should get data like this printed to the console:

{'contract': 'SPY190417C00289000', 'last_trade': '2019-04-15', 'strike': '289.00', 'last': '1.46', 'bid': '1.48', 'ask': '1.50', 'volume': '4,646', 'iv': '8.94%'}
{'contract': 'SPY190417C00290000', 'last_trade': '2019-04-15', 'strike': '290.00', 'last': '0.80', 'bid': '0.82', 'ask': '0.83', 'volume': '38,491', 'iv': '8.06%'}
{'contract': 'SPY190417P00290000', 'last_trade': '2019-04-15', 'strike': '290.00', 'last': '0.77', 'bid': '0.75', 'ask': '0.78', 'volume': '11,310', 'iv': '7.30%'}
{'contract': 'SPY190417P00289000', 'last_trade': '2019-04-15', 'strike': '289.00', 'last': '0.41', 'bid': '0.40', 'ask': '0.42', 'volume': '44,319', 'iv': '7.79%'}

Setting up recurring data collection

Yahoo, by default, only returns the options for the date you specify. It’s this part of the URL: https://finance.yahoo.com/quote/SPY/options?date=1555459200

This is a Unix timestamp, so we’ll need to generate or scrape one, rather than hardcoding it in our program.
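
To make that concrete: Unix time counts seconds since January 1, 1970 (UTC), and Python's standard library can produce one for any date. A quick sanity check, with the caveat that time.mktime uses your machine's local timezone, so the exact value can differ from Yahoo's:

import datetime, time

# Unix timestamp for midnight on 2019-04-17 (local time)
print(int(time.mktime(datetime.date(2019, 4, 17).timetuple())))
# prints 1555459200 if your machine is set to UTC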

Add some dependencies:

import datetime, time

Let’s write a quick script to generate and verify a Unix timestamp for our next set of options:

def get_datestamp():
    options_url = "https://finance.yahoo.com/quote/SPY/options?date="
    today = int(time.time())
    # print(today)
    date = datetime.datetime.fromtimestamp(today)
    yy = date.year
    mm = date.month
    dd = date.day

The above code holds the base URL of the page we are scraping and breaks today's date into year, month, and day values for us to use next.

Let’s increment this date by one day, so we don’t get options that have already expired.

dd += 1

Now, we need to convert it back into a Unix timestamp and make sure it’s a valid date for options contracts:

options_day = datetime.date(yy, mm, dd)
datestamp = int(time.mktime(options_day.timetuple()))
# print(datestamp)
# print(datetime.datetime.fromtimestamp(datestamp))

# vet timestamp, then return if valid
for i in range(0, 7):
    test_req = requests.get(options_url + str(datestamp)).content
    content = BeautifulSoup(test_req, "html.parser")
    # print(content)
    tables = content.find_all("table")

    if tables != []:
        # print(datestamp)
        return str(datestamp)
    else:
        # print("Bad datestamp!")
        dd += 1
        options_day = datetime.date(yy, mm, dd)
        datestamp = int(time.mktime(options_day.timetuple()))

return str(-1)
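
One caveat worth flagging, which the original tutorial doesn't address: incrementing dd directly breaks at month boundaries (dd can become 32, and datetime.date will raise a ValueError). If you want a month-safe variant, a small sketch using datetime.timedelta:

# month-safe alternative to dd += 1
options_day = datetime.date.today() + datetime.timedelta(days=1)
datestamp = int(time.mktime(options_day.timetuple()))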

Let's adapt our main method (soon to be renamed fetch_options) to use a dynamic timestamp to fetch options data, rather than whatever Yahoo wants to give us as the default.

Change this line:

data_url = "https://finance.yahoo.com/quote/SPY/options"

To this:

datestamp = get_datestamp()
data_url = "https://finance.yahoo.com/quote/SPY/options?date=" + datestamp

Congratulations! You just scraped real-world options data from the web.

Now we need to do some simple file I/O and set up a timer to record this data each day after market close.

Improving the program

Rename main() to fetch_options() and add these lines to the bottom:

options_list = {'calls': {'itm': itm_call_info, 'otm': otm_call_info},
                'puts': {'itm': itm_put_info, 'otm': otm_put_info},
                'date': datetime.date.fromtimestamp(time.time()).strftime("%Y-%m-%d")}

return options_list

Create a new method called schedule(). We’ll use this to control when we scrape for options, every twenty-four hours after market close. Add this code to schedule our first job at the next market close:

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()

def schedule():
    scheduler.add_job(func=run, trigger="date", run_date=datetime.datetime.now())
    scheduler.start()

In your if __name__ == “__main__”: statement, delete main() and add a call to schedule() to set up your first scheduled job.
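
One caveat from me, not the original article: BackgroundScheduler runs jobs on daemon threads, so if the main thread exits right after schedule() returns, the job may never fire. A minimal way to keep the process alive:

if __name__ == "__main__":
    schedule()
    # keep the main thread alive so the background scheduler can fire
    while True:
        time.sleep(60)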

if __name__ == “__main__”:语句中,删除main()并添加对schedule()的调用以设置您的第一个计划作业。

Create another method called run(). This is where we’ll handle the bulk of our operations, including actually saving the market data. Add this to the body of run():

today = int(time.time())
date = datetime.datetime.fromtimestamp(today)
yy = date.year
mm = date.month
dd = date.day

# must use 12:30 for Unix time instead of 4:30 NY time
next_close = datetime.datetime(yy, mm, dd, 12, 30)

# do operations here
""" This is where we'll write our last bit of code. """

# schedule next job
scheduler.add_job(func=run, trigger="date", run_date=next_close)

print("Job scheduled! | " + str(next_close))

This lets our code call itself in the future, so we can just put it on a server and build up our options data each day. Add this code to actually fetch data under """ This is where we'll write our last bit of code. """:

options = {}

# ensures option data doesn't break the program if internet is out
try:
    if next_close > datetime.datetime.now():
        print("Market is still open! Waiting until after close...")
    else:
        # ensures program was run after market hours
        if next_close < datetime.datetime.now():
            dd += 1
            next_close = datetime.datetime(yy, mm, dd, 12, 30)
            options = fetch_options()
            print(options)
            # write to file
            write_to_csv(options)
except:
    print("Check your connection and try again.")

Saving data

You may have noticed that write_to_csv isn’t implemented yet. No worries — let’s take care of that here:

def write_to_csv(options_data):
    import csv
    with open('options.csv', 'a', newline='\n') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',')
        spamwriter.writerow([str(options_data)])

Cleaning up

As options contracts are time-sensitive, we might want to add a field for their expiration date. This information is not included in the raw HTML we scraped.

Add this line of code near the top of fetch_options() to capture and format the expiration date:

expiration = datetime.datetime.fromtimestamp(int(get_datestamp())).strftime("%Y-%m-%d")

Add 'expiration': expiration to the end of each option_info dictionary, like so:

itm_call_info = {'contract': itm_call_data[0], 'strike': itm_call_data[2], 'last': itm_call_data[3],
                 'bid': itm_call_data[4], 'ask': itm_call_data[5], 'volume': itm_call_data[8],
                 'iv': itm_call_data[10], 'expiration': expiration}

Give your new program a run — it’ll scrape the latest options data and write it to a .csv file as a string representation of a dictionary. The .csv file will be ready to be parsed by a backtesting program or served to users through a webapp. Congratulations!
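
Since each row is the string representation of a Python dictionary, ast.literal_eval offers a simple way to parse it back later. A quick sketch, assuming the options.csv produced above:

import ast, csv

with open('options.csv', newline='\n') as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
        options = ast.literal_eval(row[0])  # string back to a dict of dicts
        print(options['date'], options['calls']['itm']['strike'])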

Originally published at: https://www.freecodecamp.org/news/how-i-get-options-data-for-free-fba22d395cc8/
