Polynomial Regression

Predict RSV in USA using CDC data

Arun & Zhongyi

Introduction

  • Respiratory Syncytial Virus (RSV) was discovered in the year 1956 and has been recognized as one of the most common causes of childhood illness.

  • RSV symptoms usually resemble a common cold, but the infection can become serious and lead to bronchiolitis (inflammation of the small airways in the lung) and pneumonia, especially in infants and older adults.

  • According to the CDC (Centers for Disease Control and Prevention), RSV results in around 58,000 hospitalizations and 100 to 300 deaths annually among children under 5.

Trend in USA

  • In most regions of the United States, RSV circulation starts in the fall and peaks in the winter.

  • With mask-wearing and physical distancing for COVID-19, there were fewer cases of RSV in 2020.

  • RSV cases began to increase in spring 2021 when safety measures relaxed with the arrival of COVID-19 vaccines.

  • This year, RSV activity in multiple U.S. regions is nearing seasonal peak levels.

Research using RSV Data

The trend in respiratory syncytial virus (RSV) infection has drawn the attention of researchers globally. Researchers are using different modeling approaches to predict the RSV trend.

  • Thongpan, Ilada: applied multivariate time-series analysis to show the possible prediction of RSV activity based on the climate in Thailand.

  • Manuel, Britta: applied logistic regression to develop a prediction model and developed a web-based application to predict the individual probability of RSV infection.

  • Reis, Julia: tried to build a real-time RSV prediction system using a susceptible-infectious-recovered (SIR) model in conjunction with an ensemble adjustment Kalman filter (EAKF) and 10 years of CDC data [6].

  • Corberán-Vallet: presented a Bayesian stochastic susceptible‐infected‐recovered‐susceptible (SIRS) model to understand RSV dynamics in the region of Valencia, Spain.

  • Leecaster, Molly: used simple linear regression to explore the relationship between three epidemic characteristics (final epidemic size, days to peak, and epidemic length).

About Data

The data set for this research comes from the RSV Hospitalization Surveillance Network (RSV-NET), one of the CDC's research and surveillance platforms.

  • RSV-NET has been collecting data on RSV-associated hospitalizations in adults and children since the 2018-2019 season from 58 counties in 12 states: California, Colorado, Connecticut, Georgia, Maryland, Michigan, Minnesota, New Mexico, New York, Oregon, Tennessee, and Utah.

  • They conduct population-based surveillance for laboratory-confirmed COVID-19, RSV, and influenza-associated hospitalizations in the US among children younger than 18 years of age and adults.

  • A case is defined as laboratory-confirmed RSV in a person who lives in a defined RSV-NET surveillance area and tests positive for RSV within 14 days before or during hospitalization.

  • Time frame: For the 2018-2019 and 2019-2020 seasons, data collection ran from October 1 to April 30. For the 2020-2021, 2021-2022, and 2022-2023 seasons, data collection runs from October 1 to October 1 of the following year.

  • The data were last updated on 17 November 2022.
  • One-year data: rate per week for 52 weeks (YTD)
  • Two-year data: rate per week for 104 weeks (YTD)
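
As a minimal sketch of how the two windows above could be cut from a weekly series, assume the rates are held in a plain list ordered from oldest to newest. The `weekly_rates` values below are placeholders, not RSV-NET data, and `last_weeks` is a hypothetical helper.

```python
def last_weeks(rates, n_weeks):
    """Return the most recent n_weeks entries of a weekly series."""
    if n_weeks <= 0:
        raise ValueError("n_weeks must be positive")
    # Slicing from the end keeps the original chronological order.
    return rates[-n_weeks:]


# Placeholder series: 120 weekly values, oldest first (not real data).
weekly_rates = [round(0.1 * (i % 10), 1) for i in range(120)]

one_year = last_weeks(weekly_rates, 52)    # 52-week window
two_year = last_weeks(weekly_rates, 104)   # 104-week window

print(len(one_year), len(two_year))
```

Both windows end at the same most recent week, so the one-year window is simply the second half of the two-year window.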

Why Polynomial Regression

The simple linear regression algorithm only works when the relationship in the data is linear. If the data are non-linear, linear regression cannot draw a best-fit line that follows them, and it fails in such conditions.

Consider the diagram below, which shows a non-linear relationship. The linear regression fitted to it performs poorly and does not come close to the actual data.

When the relationship between the dependent and independent variables is non-linear, we add polynomial terms to the linear regression, which converts it into polynomial regression.

In statistics, polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.
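
Written out, the nth-degree model has the following form. The symbols for the coefficients and the error term are notation assumed here, not taken from the slides.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \varepsilon
```

The model is still linear in the coefficients, so it can be fitted by ordinary least squares using x, x^2, ..., x^n as predictor columns.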

Data Analysis and Results

Data Distribution

  • To build a model that predicts the RSV hospitalization rate, which range of data should be included: one year or two years?
  • Data distribution of RSV hospitalization rates from 2018 to date.

Legend:
  orange: 2018-2019 
   green: 2019-2020
    pink: 2020-2021 
  purple: 2021-2022
   brown: 2022-2023
  • The data were examined before modeling, including checking for missing values and removing outliers.
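
The slides do not say how missing values and outliers were handled. One common approach, shown here only as an assumed sketch, is to drop missing entries and then filter with the 1.5 x IQR rule. The `clean_rates` helper and the sample values are hypothetical.

```python
import statistics


def clean_rates(rates, k=1.5):
    """Drop missing values, then drop points outside the k * IQR fences."""
    present = [r for r in rates if r is not None]
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in present if low <= r <= high]


# Hypothetical weekly rates with one missing value and one extreme value.
sample = [0.2, 0.3, None, 0.4, 0.5, 0.6, 0.7, 25.0, 0.8]
cleaned = clean_rates(sample)
print(cleaned)
```

For strongly seasonal data such as weekly rates, a rule like this can also flag genuine peak weeks, so flagged points are worth inspecting before they are removed.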

  • One-year-to-date data distribution with a fitted line.
  • We can see that the straight line is unable to capture the patterns in the data.
  • The straight-line model under-fits the data.
  • Polynomial regression is needed to increase the complexity of the model.

The two-year-to-date data distribution with a fitted line is shown below.

  • The straight-line model under-fits the data.
  • Polynomial regression is needed to increase the complexity of the model.

Polynomial Regression

One year (2021-2022)

  • The best model is the one with a high multiple R-squared and a low RMSE, so we selected the degree-6 model (multiple R-squared = 0.9606, RMSE = 0.16).

Two Year (2020-2022)

  • For the most recent two years of data, the best model is of degree 5, with a multiple R-squared of 0.92 and an error of 0.24.
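
The degree comparison described above can be sketched as follows. This is an illustration on synthetic data, not the authors' code: the `weeks` and `rate` arrays are made up, and NumPy's `Polynomial.fit` stands in for whatever fitting routine was actually used. R-squared and RMSE follow their usual definitions, with RMSE divided by the number of observations, which may differ from the convention behind the reported values.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for 52 weekly rates (not RSV-NET data).
rng = np.random.default_rng(0)
weeks = np.arange(1, 53)
rate = 1.0 + 3.0 * np.exp(-((weeks - 40) / 6.0) ** 2) + rng.normal(0, 0.1, weeks.size)


def fit_metrics(x, y, degree):
    """Fit a polynomial of the given degree; return (R-squared, RMSE)."""
    model = Polynomial.fit(x, y, degree)   # least-squares fit on a scaled domain
    residuals = y - model(x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(ss_res / len(y)))
    return r_squared, rmse


# Compare degrees 1 through 8 on the same data.
results = {d: fit_metrics(weeks, rate, d) for d in range(1, 9)}
for d, (r2, rmse) in results.items():
    print(f"degree {d}: R^2 = {r2:.4f}, RMSE = {rmse:.4f}")
```

In-sample R-squared never decreases as the degree grows, so these numbers are best read alongside a plot of the fitted curve rather than on their own.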

Polynomial Regression

Comparing the two-year-to-date data with the one-year-to-date data shows that the model built on the most recent one year of RSV hospitalization rates gives the better prediction model.

Model Performance

Model for the RSV hospitalization rate from November 2021 to November 2022:

Comparing the actual hospitalization rates with the values predicted by our model gives the following numbers.
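
A comparison of this kind can be laid out with a few lines of code. The sketch below uses made-up `actual` and `predicted` values purely to show the layout. They are not the study's numbers, and `compare` is a hypothetical helper.

```python
import math


def compare(actual, predicted):
    """Return per-week rows (week, actual, predicted, residual) and the RMSE."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    rows = [
        (week, a, p, a - p)
        for week, (a, p) in enumerate(zip(actual, predicted), start=1)
    ]
    rmse = math.sqrt(sum(r[3] ** 2 for r in rows) / len(rows))
    return rows, rmse


# Placeholder values, not taken from RSV-NET or from the fitted model.
actual = [1.0, 1.5, 2.0, 2.5]
predicted = [1.1, 1.4, 2.2, 2.3]

rows, rmse = compare(actual, predicted)
print(f"{'week':>4} {'actual':>7} {'predicted':>9} {'residual':>9}")
for week, a, p, res in rows:
    print(f"{week:>4} {a:>7.2f} {p:>9.2f} {res:>9.2f}")
print(f"RMSE = {rmse:.3f}")
```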

Final Model

We can conclude that the model created is a good fit. It is shown as a graph below.

Prediction

Using the most recent one year of data, we obtained the following model equation for the RSV hospitalization rate:

RSV hospitalization rate (Y):

Y = 0.917 + 0.312 Week - 0.074 Week^2 + 0.0054 Week^3 - 0.00018 Week^4 + 0.0000027 Week^5 - 0.000000015 Week^6
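
To show how the equation is evaluated, here is a minimal sketch using Horner's method with the coefficients exactly as printed. `predict_rate` is a hypothetical helper. Because the printed coefficients are rounded and the high-order terms are very sensitive at large week numbers, this sketch illustrates the mechanics only and should not be expected to reproduce the reported predictions.

```python
# Coefficients as printed in the slides, lowest power first (Week^0 .. Week^6).
COEFFS = [0.917, 0.312, -0.074, 0.0054, -0.00018, 0.0000027, -0.000000015]


def predict_rate(week, coeffs=COEFFS):
    """Evaluate the polynomial at `week` using Horner's method."""
    result = 0.0
    # Work from the highest power down: result = result * week + next coefficient.
    for c in reversed(coeffs):
        result = result * week + c
    return result


print(predict_rate(0))   # intercept only
print(predict_rate(1))   # sum of all coefficients
```

Horner's method avoids computing each power of the week number separately, which keeps the evaluation short and limits rounding error.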

  • RSV hospitalization rates for the next three months are listed in the table below to show the trend.
  • Following the trend in our model, rates keep going up and a rate of 9 could be reached at the beginning of next year.

Conclusion

  • We have built a model that fits well (multiple R-squared = 0.9606, RMSE = 0.16) and used it to calculate RSV hospitalization rates for the next three months (11/14/2022-2/5/2023).
  • Comparing models based on a one-year span and a two-year span of data, we found that one-year-to-date data might be the better choice for modeling RSV hospitalization rates.
  • Our data set covers 58 counties in 12 states from 2018 to the present. Data from additional states would give a more precise prediction of how respiratory syncytial virus evolves and spreads in the USA.
  • Our polynomial regression model may fit well within the range of the data we selected, but predictions outside that range might not be accurate.
  • To obtain more accurate predictions, new models need to be generated repeatedly as new data arrive.
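
The caution about predicting outside the fitted range can be illustrated with a small experiment. Everything here is assumed for illustration: the seasonal curve is synthetic, and NumPy's `Polynomial.fit` is used as a generic least-squares fitter rather than as the method behind the reported model.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_rate(week):
    """Synthetic seasonal curve with a 52-week period (not RSV-NET data)."""
    return 1.0 + np.cos(2.0 * np.pi * week / 52.0)


# Fit a degree-6 polynomial on one year of weekly points.
train_weeks = np.arange(0, 52)
model = Polynomial.fit(train_weeks, true_rate(train_weeks), 6)

# Largest error inside the fitted range.
in_range_error = float(np.max(np.abs(model(train_weeks) - true_rate(train_weeks))))

# Error well outside the fitted range (half a year past the last training week).
future_week = 78
out_of_range_error = float(abs(model(future_week) - true_rate(future_week)))

print(f"max error within weeks 0-51: {in_range_error:.4f}")
print(f"error at week {future_week}: {out_of_range_error:.4f}")
```

The polynomial tracks the curve closely where it was fitted and drifts far from it beyond that range, which is why refitting on fresh data matters more than extending an old fit.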