Does Wikipedia reflect the market?
When I was a student, not a single day went by that I did not lean on Wikipedia for help. The free site helped me more than any expensive textbook. I have always admired its more than one hundred thousand editors and been curious about what motivates their work.
Do users edit only topics they are passionate about? Do current events spark sudden revisions? This led me to the idea that the edit activity on a company's page may have some connection to that company's stock price.
The conclusion was immediately apparent: there is no relationship. But that didn't matter! Exploring this idea was an exciting lesson in web scraping and a fun tour of several linear regression models. If you come across any unique data that you think may be correlated with a stock's price, perhaps my failed models can offer you some inspiration.
Data and Assumptions
English Wikipedia alone contains over five million articles. There are over 100,000 active editors across all languages collectively working at a rate of about 10 edits per second, and each one of these edits is recorded.
Paired with every Wikipedia article is a revision history page. This page lists every change made to an article since it was first created. Each record contains many pieces of information, including:
- Date and time of edit
- The editor's username
- Size of the page after the edit
- Change in page size due to the edit
- Whether or not the edit is classified as a minor edit
Of these entries, I pulled the change in page size, number of minor edits, and number of non-minor edits for my model.
To get a list of company pages to scrape I needed a list of publicly traded companies. Conveniently, the NASDAQ offers clean and comprehensive CSV files for every company listed on the NASDAQ and NYSE. I used Selenium and Beautiful Soup to search for each company's page and scrape its revision history. I found that searching a company's ticker symbol followed by (NASDAQ) generally directed Selenium to the right page, but not always. Wikipedia is an excellent site to choose for a first web scraping project; it's stupendously formulaic. On my next go-around, I want to spend more time parsing search-result and disambiguation pages.
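As a sketch of the parsing step, here is roughly how Beautiful Soup can pull fields out of revision-history entries. The HTML below is a simplified stand-in for Wikipedia's real markup (the actual class names and layout differ), so treat this as a pattern rather than a drop-in scraper:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two entries of a revision-history page.
html = """
<ul id="pagehistory">
  <li><a class="mw-changeslist-date">12:01, 5 March 2017</a>
      <span class="history-size">(52,416 bytes)</span>
      <span class="minoredit">m</span></li>
  <li><a class="mw-changeslist-date">09:30, 1 March 2017</a>
      <span class="history-size">(52,380 bytes)</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
revisions = []
for li in soup.select("#pagehistory li"):
    size_text = li.select_one(".history-size").text  # e.g. "(52,416 bytes)"
    revisions.append({
        "date": li.select_one(".mw-changeslist-date").text,
        "size": int(size_text.strip("()").replace(",", "").split()[0]),
        "minor": li.select_one(".minoredit") is not None,  # flag present only on minor edits
    })
```

The same loop, pointed at the real page history, yields the date, page size, and minor-edit flag for every revision.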
My idea was that Wikipedia page edits might serve as a proxy for general public interest in a company. I wanted to reinforce this data with a more obvious indicator of public interest. For that, I turned to Google search history. In a fantastic episode of Freakonomics, I learned how Google searches are an accurate measure of people's interests and behaviors. People are much more honest with Google than they are with surveys, other people, or even themselves.
Google Trends data is also easy to access and download. For any search term, Google assigns a search interest score, with 100 representing peak popularity over the chosen time period. I used Selenium to search for the name of each company and download the data.
I took care to remove words like Inc. and Limited from company names, since people interested in Apple probably don't google "Apple, Inc." But this method could use some improvement. I used the list of company names provided by the NASDAQ, and I doubt too many people search for "International Business Machines" rather than IBM.
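A minimal sketch of that cleanup, assuming a short hand-picked suffix list (the real list would need more care, and `clean_name` is my own illustrative helper, not part of any library):

```python
import re

# Hypothetical suffix list; a production version would need many more entries.
SUFFIXES = r"\b(Inc|Incorporated|Corp|Corporation|Ltd|Limited)\b\.?"

def clean_name(name: str) -> str:
    """Strip corporate suffixes so names match what people type into Google."""
    name = re.sub(SUFFIXES, "", name)
    # Drop any trailing comma/whitespace left behind by the removal.
    return re.sub(r"[\s,]+$", "", name).strip()

print(clean_name("Apple Inc."))
print(clean_name("Costco Wholesale Corporation"))
```

This turns "Apple Inc." into "Apple" and "Costco Wholesale Corporation" into "Costco Wholesale", which are much closer to real search terms.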
Quandl offers a free API for accessing troves of core financial data. I pulled historical stock prices for each company from their WIKI Prices table. A friend recommended Quandl to me, and ten minutes later I was pulling data.
Features and Target
Google Trends data is sampled by week, so I grouped the rest of the data to match. From my scraping, I had collected four features to use as inputs:
- weekly change in page size
- weekly total of minor edits
- weekly total of non-minor edits
- weekly search interest
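The grouping step can be sketched with pandas. The numbers below are toy edits, and the column names are my own stand-ins, not anything Wikipedia provides:

```python
import pandas as pd

# Toy revision log; the real frame came from the revision-history scrape.
edits = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2017-03-01", "2017-03-02", "2017-03-08", "2017-03-09"]),
    "size_change": [120, -30, 45, 10],
    "minor": [True, False, False, True],
})

# Roll per-edit rows up into one row per week.
weekly = (edits
          .groupby(pd.Grouper(key="timestamp", freq="W"))
          .agg(size_change=("size_change", "sum"),
               minor_edits=("minor", "sum"),
               major_edits=("minor", lambda s: int((~s).sum()))))
```

Each row of `weekly` then lines up with one row of the weekly Google Trends data.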
I explored several models all trying to predict the same thing:
- change in price over a week.
Modeling stock prices is inherently a time series problem. A simple and common approach when building linear regression models on time-dependent data is to backshift the features. That is, when predicting the change in price over a certain week, I could have included as a feature the change in price over the previous week. I purposely left this feature out for a few reasons.
First, I assumed its predictive power would overshadow the other features (it was clear early on that the features I had collected weren't terribly predictive). Second, I wasn't interested in the time-dependent behavior of stock prices. I wanted to see if actions on the internet could add any predictive power to traditional methods.
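For concreteness, here is how the target and the omitted backshifted feature look in pandas, with made-up prices standing in for the Quandl pull:

```python
import pandas as pd

# Made-up weekly closing prices; the real series came from Quandl.
close = pd.Series([55.0, 56.5, 54.0, 57.0],
                  index=pd.date_range("2017-03-05", periods=4, freq="W"))

# Target: change in price over each week.
price_change = close.diff()

# The backshifted feature I deliberately left out: last week's change.
lagged_change = price_change.shift(1)
```

Including `lagged_change` would almost certainly have dominated the regression, which is exactly why it stayed out.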
Model One: Looking at a Specific Company
First, I decided to look at companies one at a time, building a model on only data associated with that company. All the companies I tried produced pretty similar models. Below are my results for Starbucks, because that is what I am drinking while I type this.
I started with a simple linear regression applied on all features. The predictions were all clustered around the mean. Below is a graph of the truth data vs. the basic model's predictions. The blue line represents what would be a perfect model.
The second plot above shows the coefficients calculated for each feature. It appears at first like the model prefers one feature over the rest, but just look at the scale of the y-axis.
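The shape of that result is easy to reproduce on synthetic data. The block below is a sketch, not my actual pipeline: four noise features with a tiny bit of signal stand in for my scraped features, and the fit behaves the same way, with small coefficients and predictions hugging the mean:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: four weak features and a noisy weekly price change.
X = rng.normal(size=(100, 4))
y = 0.01 * X[:, 0] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
preds = model.predict(X)

# With nearly useless features, the fit explains little variance, so the
# predictions cluster tightly around the mean of y.
print("coefficients:", model.coef_)
print("spread of predictions vs. target:", preds.std(), y.std())
```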
When applying regularization, a mathematical way of filtering features down to only those with the most predictive power, it's clear just how useless each of my features is. Of the models below, the one that performed the best was the last one. Each coefficient has been driven to zero.
I played with three different L1 ratios, the ratio of Lasso (L1) to Ridge (L2) regularization. Lasso will tend to favor certain coefficients over others, while Ridge will act on all coefficients a little more evenly.
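A sketch of that sweep with scikit-learn's ElasticNet, again on made-up noise features rather than my real data (the alpha and ratio values here are illustrative, not the ones I tuned):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))   # four uninformative features
y = rng.normal(size=100)        # target unrelated to the features

# Sweep the L1 ratio: closer to 1 behaves like Lasso, closer to 0 like Ridge.
for l1_ratio in (0.2, 0.5, 0.9):
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X, y)
    print(l1_ratio, enet.coef_)
```

With this much regularization on pure noise, the Lasso-leaning fit zeroes out every coefficient, which is exactly the behavior of my best-performing model.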
The best performing model, where all the coefficients are zero, can do no better than guess the mean every time. In fact, the set of predictions contains only one unique value.
Model Two: Comparing Companies Against the Market
Next, I explored whether the behavior of any one company could be predicted by the market as a whole. I pulled ten companies out of all my data and set them aside. I trained a model on the rest of the market, and then tested that model on each company in my basket.
From the companies that I had collected the most data on, I tried to pick ten that would represent as many different sectors as possible. I didn't have any manufacturing companies at my disposal, so I substituted with Autodesk, a company that makes mechanical design and manufacturing software.
- AMZN: Amazon.com, Inc.
- SBUX: Starbucks Corporation
- TSLA: Tesla, Inc.
- COST: Costco Wholesale Corporation
- HAS: Hasbro, Inc.
- AAL: American Airlines Group, Inc.
- FOX: Twenty-First Century Fox, Inc.
- MAR: Marriott International
- ADSK: Autodesk, Inc.
- ALXN: Alexion Pharmaceuticals, Inc.
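The train/test split behind this setup can be sketched like so. The frame below is a toy panel: AMZN and COST are from my real basket, while MSFT and IBM here are just hypothetical fillers for "the rest of the market":

```python
import pandas as pd

# Toy panel of weekly observations per ticker.
df = pd.DataFrame({
    "ticker": ["AMZN", "AMZN", "COST", "COST", "MSFT", "MSFT", "IBM", "IBM"],
    "search_interest": [80, 85, 40, 42, 60, 61, 30, 33],
    "price_change": [1.2, -0.5, 0.3, 0.1, 0.8, -0.2, 0.0, 0.4],
})

holdout = {"AMZN", "COST"}                     # basket set aside for testing
train_df = df[~df["ticker"].isin(holdout)]     # "the rest of the market"
test_df = df[df["ticker"].isin(holdout)]
```

Splitting by ticker rather than by row keeps every week of a held-out company out of training, so the model has never seen that company at all.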
Again, I "tuned" my models with regularization and selected the "best" model via cross-validation. There were no winners, but I'll feature Costco, because it's one of the greatest companies of all time.
None of my models proved to be of any worth, but that wasn't the point of this exercise. I so often have ideas that spark, and with data science I am now able to explore those ideas. The spark doesn't fizzle just because my hypothesis isn't validated.
If you can also attribute much of your academic success to Wikipedia, I would encourage you to contribute to its mission. Wikipedia makes accessing knowledge so easy that it can be taken for granted. Every time I referenced Wikipedia instead of an expensive textbook, I promised myself that I would give back once I had a salary. I'm proud to say that I've kept that promise.