Factor Analysis For Companies Based On Financial Metrics#

前言#

最近对因子分析(Factor Analysis)比较咁兴趣,其实最早接触因子分析(Factor Analysis)的知识点是在考CFA三级的时候,但没有过多深入。 最近因为一直在研究BackTrader这个开源量化平台,于是乎就着手回顾一下相应的理论知识模型。 关于因子分析的文章都好看了很多,但是发现基本没有人将其应用于股票市场当中,秉着“前人种树后人乘凉”的理念,终于在Medium上发现有大佬写了基于美股SP500的因子分析教程。出于学习的目的,这里就边学习边分析一下。

这里将跳过主因子分析(PCA)部分直接进入因子分析(Factor Analysis)。

import sys
# !conda install seaborn scikit-learn statsmodels numba pytables
# !conda install -c conda-forge python-igraph leidenalg
# !{sys.executable} -m pip install quanp
# !{sys.executable} -m pip install factor_analyzer 

引入库#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import quanp as qp

from IPython.display import display
from matplotlib import rcParams
%matplotlib inline

# setting visualization/logging parameters
pd.set_option('display.max_columns', None)
qp.set_figure_params(dpi=100, color_map = 'viridis_r')
qp.settings.verbosity = 1
qp.logging.print_versions()
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 4
      2 import pandas as pd
      3 import matplotlib.pyplot as plt
----> 4 import quanp as qp
      6 from IPython.display import display
      7 from matplotlib import rcParams

ModuleNotFoundError: No module named 'quanp'

下载数据#

# S&P 500 股票基本信息
#https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
# df_metadata = qp.datasets.get_wiki_sp500_metadata()

# S&P 500 基本面数据(fundamentals)
# df_fundamental = qp.datasets.download_tickers_fundamental()
The metadata file has been initialized and saved as D:\OneDrive - The City University of New York\Carnets\cnvar.cn\Factor Analysis\data\metadata/sp500_metadata.csv
Total tickers to download: 504
Downloading fundamentals...:  20%|█████████▎                                     | 100/504 [01:40<07:40,  1.14s/ticker]C:\Users\bing\AppData\Roaming\Python\Python37\site-packages\quanp\datasets\_datasets.py:420: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df_fundamental[ticker] = df_tmp_fundamental[ticker]
Downloading fundamentals...: 100%|███████████████████████████████████████████████| 504/504 [08:43<00:00,  1.04s/ticker]
Fundamentals appended sp500_metadata file has been save as D:\OneDrive - The City University of New York\Carnets\cnvar.cn\Factor Analysis\data\metadata/sp500_metadata_fundamentalAdded.csv

说明:数据会自动下载到data/metadata/ 这个文件夹中,下面直接加载即可

加载数据#

# 加载下载的csv数据
df_fundamental = pd.read_csv('data/metadata/sp500_metadata_fundamentalAdded.csv', index_col=0)

#剔除含零的列
df_fundamental = df_fundamental.loc[:, (df_fundamental != 0).any(axis=0)]

df_fundamental.columns
ls_fundamental_target = ['beta','bookValuePerShare','currentRatio', 'dividendYield',
                         'epsChangePercentTTM','epsChangeYear', 'epsTTM', 'grossMarginMRQ', 
                         'grossMarginTTM', 'interestCoverage', 'ltDebtToEquity', 
                         'marketCap', 'marketCapFloat', 'netProfitMarginMRQ', 
                         'netProfitMarginTTM', 'operatingMarginMRQ', 'operatingMarginTTM', 
                         'pbRatio', 'pcfRatio','peRatio', 'pegRatio', 'prRatio', 
                         'quickRatio', 'returnOnAssets','returnOnEquity', 'returnOnInvestment', 
                         'revChangeIn', 'revChangeTTM', 'sharesOutstanding', 
                         'totalDebtToCapital', 'totalDebtToEquity', 'vol10DayAvg', 
                         'vol1DayAvg', 'vol3MonthAvg']

df_fundamental_target = df_fundamental[ls_fundamental_target]
df_fundamental_target.head()
beta bookValuePerShare currentRatio dividendYield epsChangePercentTTM epsChangeYear epsTTM grossMarginMRQ grossMarginTTM interestCoverage ltDebtToEquity marketCap marketCapFloat netProfitMarginMRQ netProfitMarginTTM operatingMarginMRQ operatingMarginTTM pbRatio pcfRatio peRatio pegRatio prRatio quickRatio returnOnAssets returnOnEquity returnOnInvestment revChangeIn revChangeTTM sharesOutstanding totalDebtToCapital totalDebtToEquity vol10DayAvg vol1DayAvg vol3MonthAvg
MMM 0.90883 8.80101 1.57911 4.05 0.00000 0.00000 9.60319 45.33922 46.00515 15.80000 99.13597 83674.410 568.60530 14.73553 15.83222 18.58648 19.85679 5.60445 11.14470 15.31158 0.000000 2.36817 1.00109 12.02558 39.01011 14.84253 2.51974 7.19964 5.690588e+08 52.64188 111.70800 2695393.0 2695390.0 67823610.0
AOS 1.13934 5.87302 1.71779 1.88 32.22964 26.08051 3.17847 34.93914 36.39396 105.40000 15.95445 9302.692 129.10130 12.25325 13.58736 16.01718 17.50720 5.19698 15.82360 18.76060 0.582092 2.48231 1.23012 15.56782 27.84111 21.65196 0.00000 23.78939 1.300359e+08 14.03792 16.33037 864387.0 864390.0 23492210.0
ABT 0.71987 7.79507 1.85253 1.64 34.92305 37.77927 4.33188 58.07482 57.14382 25.54386 48.26690 201130.700 1738.05700 20.57167 17.35409 24.48087 20.73056 5.68182 18.03540 26.51736 0.759308 4.51837 1.40255 10.52510 22.40397 12.74469 3.72340 19.21903 1.750942e+09 32.41716 48.27820 5811787.0 5811790.0 116640030.0
ABBV 0.80438 24.51570 0.81513 3.75 145.95560 26.21560 6.97193 70.29842 69.77876 8.91667 390.11240 266073.800 1764.87800 33.18806 22.01322 34.84267 32.68048 16.33839 12.73568 21.59660 0.147967 4.69059 0.70803 8.50289 82.67262 10.89683 0.00000 13.00926 1.767110e+09 81.83053 451.23140 6319170.0 6319170.0 157797210.0
ABMD 1.36084 26.10269 7.05254 0.00 0.00000 5.56105 2.96416 80.93126 81.76327 0.00000 0.00000 11890.820 44.60342 22.35501 13.23040 24.08857 13.63892 7.90639 72.24334 88.04181 0.000000 11.52487 6.37376 8.61841 9.63678 9.41308 3.32113 21.73761 4.556394e+07 0.00000 0.00000 390682.0 390680.0 6340010.0

充分性测试(Adequacy Test)#

在做因子分析之前, 我们需要先做充分性检测, 判断数据集中是否存在因子(factor):

  • Bartlett's Test: 是用来检测已知(可观察)变量的相关矩阵是否单位矩阵,如果不是,那么变量之间存在相关性, 可采用因子分析

  • Kaiser-Meyer-Olkin(KMO) Test: 检测数据对因子分析的适用性。KMO 值的范围介于 0 和 1 之间。 KMO数值越大,变量间的相关性越强。KMO 值需要大于 0.6 才视为适用

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
print('Bartlett-sphericity Chi-square: {}'.format(calculate_bartlett_sphericity(df_fundamental_target)[0]))
print('Bartlett-sphericity P-value: {}'.format(calculate_bartlett_sphericity(df_fundamental_target)[1]))
Bartlett-sphericity Chi-square: 18834.170518792038
Bartlett-sphericity P-value: 0.0

说明: p_value=0, 表明数据变量间的相关矩阵不是单位矩阵,具有显著的关联性,可采用因子分析. p_value需要少于0.5

from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df_fundamental_target[df_fundamental_target.columns]);
print('KMO score: {}'.format(kmo_model));
KMO score: 0.6726149090611923

说明: KMO 值大于 0.6 ,符合因子分析的适用性

估算因子数和剔除少于0.2公因子方差的变量(Estimating number of factors and filtering for variables with communalities > 0.2)#

Here, as we are only interested to know how many significant factors we should use for the following work, we used a simple factor analysis without any rotation. In fact, we can just use the estimation based on the PCA variance ratio (eigenvalue) above.

在这里,由于我们只想知道我们应该在接下来的工作中使用多少重要因子,因此我们只进行简单的因子分析,不使用任何矩阵旋转方式(之前的example中使用了varimax作为rotation)。 实际上,我们可以只使用上面基于 PCA 方差比(特征值)的估计(这里删掉了PCA部分)。

from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(rotation=None)
fa.fit(df_fundamental_anndata)
ev, v = fa.get_eigenvalues()
# create scree plot using matplotlib
plt.scatter(range(1, 25), ev[:24])
plt.plot(range(1,25), ev[:24])
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue/Variance ratio')
plt.grid()
plt.show()
../_images/FactorAnalysisForCompaniesBasedOnFinancialMetricsDuringCovid19_18_0.png

Again, we see that the elbow point on the screen plot fall at about F5. Next, the communalities of the variables/features are inspected to see if the variables are well defined by the solution. Communalities indicate the percent of variance in a variable that overlaps variance in the factors. Ideally, we should drop variables with low communalities, for example, exclude those variables with <0.2 communalities. Here, found only 22 of 34 variables had commualities >0.2. However, for the exploratory purpose of this dataset, we don’t consider removing those variables.

我们看到图上的曲线肘点(elbow point)落在 F5 左右。 接下来,检查变量(特征)的公因子方差(communalities)以查看变量是否很好地被解释。公因子方差表示变量中与因子方差重叠的方差百分比。 理想情况下,我们应该删除具有低公因子方差的变量,例如,排除那些具有 <0.2 的公因子方差的变量。 在这里,发现 34 个变量中只有 22 个的公因子方差 > 0.2。 但是,出于对这个数据集的探索性目的,我们不考虑删除这些变量。

communalities = pd.DataFrame(fa.get_communalities(), index=list(df_fundamental_anndata.columns))
features_comm = list(communalities[communalities[0] > 0.2].index)
print('Total variables/features with communalities >0.2: {}'.format(len(features_comm)))
    
Total variables/features with communalities >0.2: 18

Factor analysis with Varimax (orthogonal) rotation and Maximum Likelihood factor extraction method#

fa = FactorAnalyzer(rotation='varimax', n_factors=5, method='ml')
fa.fit(df_fundamental_anndata);

# # check eigenvalues
# ev, v = fa.get_eigenvalues()
fa_loading_matrix = pd.DataFrame(fa.loadings_, columns=['FA{}'.format(i) for i in range(1, 5+1)], 
                              index=df_fundamental_anndata.columns)
fa_loading_matrix['Highest_loading'] = fa_loading_matrix.idxmax(axis=1)
fa_loading_matrix = fa_loading_matrix.sort_values('Highest_loading')
fa_loading_matrix
FA1 FA2 FA3 FA4 FA5 Highest_loading
vol3MonthAvg 0.941290 -0.000984 -0.010029 0.018500 0.275434 FA1
vol10DayAvg 0.946649 -0.005869 -0.014880 0.022796 0.316813 FA1
sharesOutstanding 0.903524 0.078796 -0.016071 -0.126117 -0.395341 FA1
dividendYield 0.085416 0.048170 0.028086 -0.227944 -0.145986 FA1
returnOnEquity 0.012507 -0.011500 -0.010655 -0.005573 -0.029119 FA1
... ... ... ... ... ... ...
currentRatio 0.012777 0.030565 -0.042528 0.300665 0.082227 FA4
epsChangeYear -0.010450 -0.030207 -0.012540 0.002032 0.023429 FA5
revChangeIn -0.021071 0.070437 -0.042578 -0.004015 0.096852 FA5
revChangeTTM 0.055924 -0.070893 0.094255 -0.020104 0.162444 FA5
beta 0.061148 0.016309 -0.038862 0.028620 0.272750 FA5

We can revisit the correlation matrix plot of the features/variables of this dataset - shown below. The underlying variables in respective factors are usually highly correlated but least correlated with the underlying variables in other factors.

我们可以重新访问该数据集的特征/变量的相关矩阵图 - 如下所示。 各个因子中的基础变量通常与其他因子中的基础变量高度相关,但相关性最低。

import seaborn as sns

plt.figure(figsize=(25,5))

# plot the heatmap for correlation matrix
ax = sns.heatmap(fa_loading_matrix.drop('Highest_loading', axis=1).T, 
                vmin=-1, vmax=1, center=0,
                cmap=sns.diverging_palette(220, 20, n=200),
                square=True, annot=True, fmt='.2f')

ax.set_yticklabels(
    ax.get_yticklabels(),
    rotation=0);
../_images/FactorAnalysisForCompaniesBasedOnFinancialMetricsDuringCovid19_25_0.png
import seaborn as sns

plt.figure(figsize=(25,25))

# plot the heatmap for correlation matrix
corr = df_fundamental_anndata.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, 
                 vmin=-1, vmax=1, center=0,
                 cmap=sns.diverging_palette(220, 20, n=200),
                 mask=mask, square=True, 
                 annot=True, fmt='.2f')
../_images/FactorAnalysisForCompaniesBasedOnFinancialMetricsDuringCovid19_26_0.png

As a rule of thumb, only variables with loading of 0.32 and above are interpreted. The greater the loading, the more the variable is a pure measure of the factor. Comrey and Lee (1992) suggest that loadings of:-
>71% (50% overlapping variance) are considered excellent;
>63% (40% overlapping variance) very good;
>55% (30% overlapping variance) good;
>45% (20% overlapping variance) fair;
<32% (10% overlapping variance) poor;

根据经验,只有负载为 0.32 及以上的变量才会被解释。 成分矩阵(loading)值越大,变量越可能作为因子去度量。 Comrey and Lee (1992) 建议加载:-

  • >71%(50%的重叠方差)被认为是优秀的;

  • >63%(40% 重叠方差)非常好;

  • >55%(30% 重叠方差)好;

  • >45%(20% 重叠方差)一般;

  • <32%(10% 重叠方差)差;


In summary, based on the heatmap above:

  • the Factor 1 (FA1) seems to suggest operational/return performance or valuations of a company;

  • FA2 is more correlated to volalities of company share values/market capitals;

  • FA3 seems to suggest long term debt obligations;

  • FA4 seems to suggest short-term debt obligation (i.e. quick/current ratio); and lastly

  • FA5 seems to be correlated with gross margin performance of a company.

综上所述,根据上面的热图:

  • 因子1 (FA1) 似乎暗示了公司的运营/回报业绩或估值;

  • FA2 与公司股票价值/市值的波动性更相关;

  • FA3 似乎暗示长期债务义务;

  • FA4 似乎暗示短期债务义务(即速动/流动比率);

  • FA5 似乎与公司的毛利率表现相关。


Final words#

Factor analysis (FA) dimensional reduction seems to be a better approach than PCA (which is a more empirical summary approach compared to FA) if we are interested in a theoretical solution uncontaminated by unique and error variability and have designed our study on the basis of underlying constructs that are expected to produce scores on the observed variables. In this FA study, we were using Varimax rotation (an orthogonal rotation so that all the factors are uncorrelated with each other) - this could in turn be more useful latent variables for factor model in both multiple linear or logistic-based regressions, eg. to predict positive or negative return of a stock price performance.

References#

  1. Loadings vs eigenvectors in PCA: when to use one or another? https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

  2. Making sense of principal component analysis, eigenvectors & eigenvalues https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/35653#35653

  3. Tabachnick & Fidell. Using Multivariate Statistics, Sixth Edition. PEARSON 2013; ISBN-13:9780205956227.

new_df = pd.DataFrame(fa.transform(df_fundamental_anndata),index=df_fundamental_target.index, columns=['Factor_{}'.format(i) for i in range(1,6)])
new_df.sort_values(by="Factor_1", ascending=False)
Factor_1 Factor_2 Factor_3 Factor_4 Factor_5
AAPL 12.588554 0.482123 0.224747 -1.837693 -5.377854
AMD 7.840623 -0.614083 -0.316257 2.996816 14.473996
BAC 5.554225 0.784526 -0.229512 -3.032899 -2.793153
T 4.957927 -0.480587 0.051169 -0.084891 -3.376689
F 4.841125 -1.490035 0.170968 -0.218702 3.313468
... ... ... ... ... ...
AWK -0.554037 1.060867 -0.198427 -2.224445 0.603216
RE -0.562844 -0.577679 -0.513558 -1.369796 0.300551
NVR -0.570999 0.124867 -0.312521 -0.485159 0.220999
FLT -0.625292 1.123866 -0.138445 -2.245078 0.698421
SIVB -0.646572 1.373542 -0.612413 -2.274774 0.773403