The abstract and introduction for the project are available here.
The National Health Interview Survey (NHIS) is one of the major data collection programs of the National Center for Health Statistics (NCHS), which is part of the Centers for Disease Control and Prevention (CDC). While the NHIS has been conducted continuously since 1957, the content of the survey has been updated about every 15-20 years to incorporate advances in survey methodology and coverage of health topics. In January 2019, the NHIS launched a redesigned questionnaire whose content and structure differ from the previous design. Persons excluded from the sample are those with no fixed household address, military personnel, persons in long-term care institutions, persons in correctional facilities, and U.S. nationals living in foreign countries. NHIS data collection is continuous, running from January to December each year.
Access data set here
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984, BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world. Its basic philosophy was to collect data on actual behaviors, rather than on attitudes or knowledge, that would be especially useful for planning, initiating, supporting, and evaluating health promotion and disease prevention programs. BRFSS weights its data by raking, which allows more demographic variables to be included in the weighting in addition to age, gender, and race/ethnicity, such as educational attainment, marital status, tenure (property ownership), and telephone ownership.
Access data set here
Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. In New York City, access to open data is the law. In 2012, the “Open Data Law” amended the New York City administrative code to mandate that all public data be made available on a single web portal by the end of 2018. Since then, several amendments to the Open Data Law have been approved. These laws, which include stronger requirements on data dictionaries and data retention, response timelines for public requests, and an extension of the Open Data mandate into perpetuity, make it easier for New Yorkers to access City data online and anchor the city’s transparency initiatives around Open Data.
New York City Locations Providing Seasonal Flu Vaccinations
Data provide the location and facility information for places in New York City providing seasonal flu vaccinations. Data were provided by the Department of Health and Mental Hygiene (DOHMH).
Access data set here
New York City influenza-like and pneumonia hospital admissions
Data provide the total emergency department visits, as well as visits and admissions for influenza-like and/or pneumonia illness, by modified ZIP code tabulation area of patient residence. Data were provided by the Department of Health and Mental Hygiene (DOHMH).
Access data set here
Using rtweet, data were scraped from the Twitter website.
“What are the attitudes of adults regarding the seasonal influenza vaccine?”
“What trends emerge regarding adults’ seasonal influenza vaccination?”
“What are the effects of adults’ engagement or disengagement with the seasonal influenza vaccine?”
Affective component: this involves a person’s feelings / emotions about the attitude object. For example: “I hate vaccines”.
Behavioral (or conative) component: this involves the way an attitude influences how we act or behave. For example: “I will obtain an influenza vaccine once it becomes available”.
Cognitive component: this involves a person’s belief / knowledge about an attitude object. For example: “I believe vaccines are safe”.
Why? This will help us understand how adults might feel about available or theoretical vaccines.
Why? This will help us understand what actions should be encouraged or discontinued with regard to available or theoretical vaccines.
Why? This will help us understand gaps in education, misinformation that might be occurring, and where information should be directed.
This project highlights trends in flu vaccine data to support research and program implementation, so that everyone who is eligible for a flu vaccine gets one. We also hope to encourage immunization against other communicable diseases, especially during the ongoing COVID-19 pandemic.
Our study uses a random sample of U.S. inhabitants, as the BRFSS and NHIS data consist of respondents from all U.S. geographical and social communities. For part of this study, however, the geographic focus is narrowed to the NYC metro area. We find this selection fitting because our research question aims to understand the attitudes, associations, and influences surrounding flu vaccination engagement. This geographic focus also reflects the current situation in the United States, which has had one of the poorest responses in the world to the COVID-19 pandemic. Consequently, we found it appropriate to explore any trends that might arise from this area. We are all students at Columbia University’s Mailman School of Public Health. Attending a graduate institution located in NYC and being residents of the city allows us to be particularly engaged with the study because of its apparent relevance. Dense metropolitan areas tend to be the most affected during health crises, lending support to our exploration of data on influenza-like symptoms.
The process of tidying data began with the import of the collected data sets. Given the size of the data gathered, we then selected the variables that were relevant and appropriate to our research question. Using the respective codebooks, we renamed variables and recoded their values so that they were readable and gave the actual description. For example, values were coded as “female” or “male”, rather than “0” or “1”. To reflect the discrete nature of the data, we then changed the variable classes into a format that was appropriate for analysis. Additionally, factors were releveled where appropriate so that categories follow their logical progression. The tidy data were then saved and exported to our project repository. The intention of these steps was to ensure that the code was readable and reproducible.
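The recode, reclass, and relevel steps described above can be sketched in a few lines of R. This is a minimal illustration, not the project's actual code: the data frame, the column names, and the code-to-label mappings (including 0 = female, 1 = male) are assumptions made for the example, and the real mappings come from each survey's codebook. Base R is used so the sketch runs without extra packages.

```r
# Hypothetical raw extract: numeric codes as they might appear in a survey file.
raw <- data.frame(
  sex       = c(0, 1, 1, 0),
  education = c(3, 1, 2, 1),
  flu_shot  = c(1, 0, 1, 1)
)

tidy <- within(raw, {
  # Recode numeric codes into readable labels (mapping assumed for illustration).
  sex <- factor(sex, levels = c(0, 1), labels = c("female", "male"))
  # Ordered factor so the levels follow their logical progression.
  education <- factor(
    education,
    levels  = 1:3,
    labels  = c("high school or less", "some college", "college graduate"),
    ordered = TRUE
  )
  flu_shot <- factor(flu_shot, levels = c(0, 1), labels = c("no", "yes"))
})

# Save the tidy data; a temporary file stands in for the project repository.
out_file <- file.path(tempdir(), "tidy_example.csv")
write.csv(tidy, out_file, row.names = FALSE)

str(tidy)
```

The same steps map onto `dplyr::mutate()` for readers who prefer tidyverse verbs.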
Twitter data were scraped from the web using the rtweet package and the command search_tweets, where search terms can be specified. rtweet connects to Twitter’s API through a web-based app, and requests made with an API access token are limited to once per hour. As a result, two separate .csv files were generated using the write_as_csv function within the package to limit the number of “calls” being made to request Twitter data. Further information and code examples can be found here.
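The pull-once-then-reuse pattern described above can be sketched as follows. This is a hedged illustration rather than the project's code: the helper name `fetch_flu_tweets`, the search term, the file name, and the value of `n` are all hypothetical. It assumes an rtweet version that still provides `search_tweets()` and `write_as_csv()`, plus working Twitter API credentials.

```r
# Hypothetical helper: pull tweets once, cache them as a .csv file, and reuse
# the cached file afterwards so the API is not called again.
fetch_flu_tweets <- function(query, file_name, n = 1000) {
  if (file.exists(file_name)) {
    # A cached copy exists: read it instead of spending another API call.
    return(utils::read.csv(file_name, stringsAsFactors = FALSE))
  }
  # No cache yet: query the API once and write the result to disk.
  tweets <- rtweet::search_tweets(q = query, n = n, include_rts = FALSE)
  rtweet::write_as_csv(tweets, file_name)
  tweets
}

# Example call (needs Twitter API credentials, so it is left commented out):
# flu_tweets <- fetch_flu_tweets("flu shot", "flu_shot_tweets.csv")
```

Because later analysis reads from the saved file, re-knitting a report does not trigger a new request.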
Analytical methods for the project can be obtained by visiting the page below.
Statistical information from National Health Interview Survey
Because we are interested in the binary outcome of whether an individual received an influenza vaccine or not, we will be primarily analyzing the data using logistic regression. We hypothesize that an individual’s health insurance status is an important predictor of their decision to obtain a flu shot, so we will consider this our main effect variable. We will also include other important covariates, and attempt to build a parsimonious model that explains this relationship in a clear way. We also will look at the distribution of demographic and health variables in our data set.
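As a rough illustration of this modeling approach, the sketch below fits a logistic regression with `glm()` on simulated data. Nothing here comes from the NHIS: the variable names (`insured`, `age`, `flu_shot`), the sample size, and the coefficients used to generate the toy outcome are assumptions chosen only to make the example run. It also ignores the survey's sampling weights, which a full analysis of NHIS data would need to consider.

```r
set.seed(1)
n <- 500

# Simulated, hypothetical data (not NHIS): insurance status and age.
sim <- data.frame(
  insured = rbinom(n, 1, 0.8),
  age     = round(runif(n, 18, 85))
)

# Assumed relationship, used only to generate a toy binary outcome.
log_odds     <- -2 + 1.0 * sim$insured + 0.02 * sim$age
sim$flu_shot <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: insurance status as the main effect, age as a covariate.
fit <- glm(flu_shot ~ insured + age, data = sim, family = binomial)

# Exponentiated coefficients are odds ratios.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 for `insured` would indicate higher odds of vaccination among insured respondents, holding age fixed. Comparing nested models, for example with `anova(fit, test = "Chisq")`, is one way to work toward a parsimonious model.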
IRB approval was not necessary to access the data. Data from the NHIS, BRFSS, and NYC Open Data were de-identified. Additionally, data scraped from Twitter have been de-identified to protect all human subjects.
Results for the project can be obtained by visiting the pages below.
National Health Interview Survey Results
Behavioral Risk Factor Surveillance System Results
NYC influenza vaccination locations
Flu-like illness and pneumonia ED visits in NYC, March to December 2020
We are all aware of the fickleness of R. Although such problems were not rampant throughout the project, we did encounter some hiccups with code, an inability to knit, and gaps in our knowledge of R syntax and grammar. Given the size of some of our data, we had difficulty committing and pushing data files within GitHub’s file size limits. After troubleshooting and learning about the benefits of the .gitignore file, we avoided further instances of the problem.
Although we were quite satisfied with the quality of the data, we would have greatly appreciated more continuous, individual-level variables. Most of the variables were discrete, which limited some of the visualizations that we were interested in exploring. Additionally, this project showcased a dearth of open data concerning vaccines within the Americas. Initially we wanted to hinge our research question on adult immunization schedules, concerning tetanus, HPV, etc. However, this information was limited. As a result, we shifted direction during the project to focus on flu vaccines, given the universal nature of the flu’s effect.
To advance the project, it would be necessary to obtain more data, from more diverse sources, in order to deepen our exploratory and statistical analysis. In particular, we would like these data to cover communities that traditionally disengage from vaccinations or that have low resources. Additionally, to touch more on the cognitive component of attitudes, it would be fruitful to collect data from surveys that assess respondents’ knowledge of vaccines and immunization schedules. To dive deeper into the affective component of attitudes, it would also be fruitful to gather information from more nuanced qualitative sources aside from Twitter, such as Reddit. Reddit hosts separate “communities” from around the world, in which polarized thoughts and opinions can be analyzed.
We recognize that the data did not gather information from those with no fixed household address, military personnel, persons in long-term care institutions, persons in correctional facilities, and U.S. nationals living in foreign countries. This is a problem. Ethical research efforts and program implementation should be directed towards these vulnerable populations, especially persons experiencing homelessness, living in long-term institutions, and in “correctional” facilities. We also must recognize that those who are a part of Black and brown communities are significantly less likely to engage with the medical community, health professionals, and resources such as vaccines, due to the trauma that has occurred in the past. In order to properly engage with data science concerning vaccine engagement, this barrier must be addressed and trust must be rebuilt to gather appropriate data.