The UTSA Journal of Undergraduate Research & Scholarly Work
Permanent URI for this community: https://hdl.handle.net/20.500.12588/5
The University of Texas at San Antonio Journal of Undergraduate Research and Scholarly Works (JURSW) is a peer-reviewed academic journal published by the Office of the Vice President for Research. The JURSW publishes scholarly inquiry from a wide variety of disciplines and from interdisciplinary and multidisciplinary frameworks.
Browsing The UTSA Journal of Undergraduate Research & Scholarly Work by Department "Management Science and Statistics"
Now showing 1 - 15 of 15
Item: Application of the Cox Proportional Hazards Model for the Quantitative Analysis of LC-MS Proteomics Data (Office of the Vice President for Research, 2019). Arreola, Ivan; Han, David. Along with quantitative, analytical genomics, proteomics continues to be a growing field for determining gene and cellular functions at the protein level. As liquid chromatography mass spectrometry (LC-MS) experiments produce protein peak intensity data, statistical and computational techniques are required to conduct quantitative analytical proteomics. LC-MS proteomics data often have large quantities of missing peak intensities due to censoring of low-abundance spectral features. Because of this, the observed peak intensities from the LC-MS method are all positive, skewed, and often left-censored. Classical survival analysis methods are well suited to detecting differentially expressed proteins among different groups. These methods include non-parametric rank sum (RS) tests such as the Kolmogorov-Smirnov (KS) and Wilcoxon-Mann-Whitney (WMW) tests, and parametric survival models such as the accelerated failure time (AFT) model with popular lifetime distributions: log-normal (LN), log-logistic (LL), and Weibull (W). As an alternative approach, here we propose the Cox proportional hazards (PH) method, a popular semi-parametric model for survival data. The proposed regression-based method allows for flexibility in the hazard function by alleviating the requirement of a distribution-specific hazard function.
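As a rough illustration of the approach, the following is a minimal sketch of fitting a Cox PH model with a single group covariate by Newton-Raphson on the partial likelihood. The intensities are simulated (not the authors' data), every value is treated as observed, and the Breslow handling of ties is omitted since the simulated values are distinct:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = np.repeat([0.0, 1.0], n)                     # group indicator (control vs. treatment)
t = np.concatenate([rng.lognormal(1.0, 0.5, n),  # simulated peak intensities, group 0
                    rng.lognormal(1.4, 0.5, n)]) # group 1 shifted upward

beta = 0.0
for _ in range(25):                              # Newton-Raphson on the log partial likelihood
    score, info = 0.0, 0.0
    for i in range(len(t)):                      # every observation treated as an "event"
        risk = t >= t[i]                         # risk set: intensities at least t[i]
        w = np.exp(x[risk] * beta)
        s0, s1 = w.sum(), (x[risk] * w).sum()
        s2 = (x[risk] ** 2 * w).sum()
        score += x[i] - s1 / s0                  # first derivative (score)
        info += s2 / s0 - (s1 / s0) ** 2         # observed information
    beta += score / info

print(f"estimated log hazard ratio: {beta:.3f}")
```

Since the treatment group has systematically higher intensities (longer "survival"), the estimated log hazard ratio comes out negative.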
With the hope of gaining more insightful biological information on cellular functions at the protein level, the statistical properties of each method are investigated through a simulation study and an application to a Type I diabetes dataset.

Item: Can We Predict Big 5 Personality Traits from Demographic Characteristics? (UTSA Office of Undergraduate Research, 2022-12). Woods, Ethan; Han, David. Here we aim to predict the Big Five personality traits from demographic information using a generalized linear model. Data were obtained from openpsychometrics.org, pre-processed in MS Excel, and imported into R for statistical analysis. First, we attempted to predict each individual response item using an ordinal regression model. This proved not viable, even after various weightings were applied to the demographic data. The response variables were then aggregated to form five categories, one for each personality trait: conscientiousness, agreeableness, neuroticism, openness to experience, and extraversion. We then applied a dimension reduction technique to the country and race variables in order to achieve an adequate model fit. It was determined that although the demographic information can be useful, precise prediction of the Big Five traits requires other information not captured in the dataset.

Item: Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes (Office of the Vice President for Research, 2018). Arreola, Ivan; Han, David. Microarray analysis can help identify changes in gene expression which are characteristic of human diseases. Although genome-wide RNA expression analysis has become a common tool in biomedical research, it remains a major challenge to gain biological insight from such information.
Gene Set Analysis (GSA) is an analytical method for understanding gene expression data and extracting biological insight by focusing on sets of genes that share biological function, chromosomal regulation, or location. The systematic mining of different gene-set collections could be useful for discovering potentially interesting gene sets for further investigation. Here, we seek to improve previously proposed GSA methods for detecting statistically significant gene sets via various score transformations.

Item: Comparison of Regression Methods to Identify Differential Expression in RNA-Sequencing Count Data from the Serial Analysis of Gene Expression (Office of the Vice President for Research, 2019). Arreola, Ivan; Han, David. Comparative RNA-sequencing analysis for the Serial Analysis of Gene Expression (SAGE) can help identify changes in gene expression which are characteristic of human diseases. Since the RNA-sequencing experiment measures gene expression in the form of counts, usually with a large degree of skewness, analysis methods based on continuous probability distributions are generally inappropriate for modeling this type of data. Currently, the parametric regression techniques for this problem are based on well-known discrete probability distributions such as the Poisson and negative binomial. To overcome this modeling challenge with greater flexibility to account for a wide range of dispersion levels, here we introduce an alternative Generalized Linear Model (GLM) based on the Conway-Maxwell-Poisson distribution, also known as the COM-Poisson or CMP distribution. The CMP regression model generalizes the standard Poisson and negative binomial regressions, and it is suitable for fitting count data with varying degrees of over- and under-dispersion.
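The dispersion flexibility that motivates the CMP choice can be seen directly from its probability mass function, P(Y = y) proportional to lambda^y / (y!)^nu. A small sketch with a truncated normalizing constant and illustrative parameter values (not fitted to any SAGE data):

```python
import math

def cmp_pmf(lam, nu, ymax=80):
    """Truncated CMP pmf: weights proportional to lam**y / (y!)**nu, computed in logs."""
    logw = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(ymax)]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]
    z = sum(w)
    return [p / z for p in w]

def mean_var(pmf):
    m = sum(y * p for y, p in enumerate(pmf))
    v = sum((y - m) ** 2 * p for y, p in enumerate(pmf))
    return m, v

# nu = 1 recovers the Poisson (variance equals mean);
# nu < 1 gives over-dispersion, nu > 1 gives under-dispersion
for nu in (0.5, 1.0, 2.0):
    m, v = mean_var(cmp_pmf(4.0, nu))
    print(f"nu={nu}: mean={m:.2f}, variance={v:.2f}, ratio={v/m:.2f}")
```

The variance-to-mean ratio moves above or below one as nu crosses one, which is exactly the range of dispersion behavior the CMP regression exploits.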
Using simulated and real SAGE datasets, the performance of the proposed method is assessed in comparison to the Poisson- and negative binomial-based regression models.

Item: An expectation-maximization algorithm for estimating the parameters of the correlated binomial distribution (UTSA Office of Undergraduate Research, 2022-12). Bennett, Andrea; Wang, Min. The correlated binomial (CB) distribution was proposed by Luceño (Computational Statistics & Data Analysis 20, 1995, 511–520) as an alternative to the binomial distribution for the analysis of data in the presence of correlations among events. Due to the complexity of the mixture likelihood of the model, it may be impossible to derive analytical expressions for the maximum likelihood estimators (MLEs) of the unknown parameters. To overcome this difficulty, we develop an expectation-maximization (EM) algorithm for computing the MLEs of the CB parameters. Numerical results from simulation studies and a real-data application show that the proposed method is very effective, consistently reaching a global maximum. Our results should be of interest to senior undergraduate and first-year graduate students, and to lecturers interested in applications of the EM algorithm for finding the MLEs of parameters in discrete mixture models.

Item: Meta-analysis of Odds Ratios from Heterogeneous Clinical Studies (UTSA Office of Undergraduate Research, 2022-12). Song, Mina; Belle, Macy; Han, David. Many systematic reviews of randomized clinical trials require meta-analyses of odds ratios. A conventional method estimates the overall odds ratio via a weighted average of the logarithms of the individual odds ratios. However, this approach has several deficiencies due to its underlying assumptions and approximations. The goal of this study is to understand and quantify the methodological pitfalls in conducting a meta-analysis of odds ratios.
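The conventional weighted-average method can be sketched as inverse-variance (fixed-effect) pooling of log odds ratios from 2x2 tables. The study counts below are hypothetical, purely for illustration:

```python
import math

# hypothetical 2x2 counts (a, b, c, d) = (trt events, trt non-events,
# ctrl events, ctrl non-events) for three studies
studies = [(12, 88, 20, 80), (30, 170, 45, 155), (8, 42, 15, 35)]

log_ors, weights = [], []
for a, b, c, d in studies:
    lor = math.log((a * d) / (b * c))   # log odds ratio of one study
    var = 1/a + 1/b + 1/c + 1/d         # Woolf's large-sample variance approximation
    log_ors.append(lor)
    weights.append(1 / var)             # inverse-variance weight

pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
se = math.sqrt(1 / sum(weights))
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled log OR = {pooled:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The normal approximation to each study's log odds ratio and the zero-cell fragility of Woolf's variance are among the deficiencies the abstract alludes to.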
The fixed-effect and random-effect models of pooled odds ratios are compared by applying them to a meta-analysis of SNP studies. The popular statistical software R is used for the analysis, along with SPSS and SAS. It is found that the point estimates and confidence intervals for the overall log odds ratio can differ substantially between the traditional and alternative methods, which would affect the resulting statistical inferences. To produce reliable results, the traditional methods for meta-analysis of odds ratios are therefore discouraged.

Item: Optimal Dynamic Treatment Regime by Reinforcement Learning in Clinical Medicine (UTSA Office of Undergraduate Research, 2020-12). Song, Mina; Han, David. Precision medicine allows a personalized treatment regime for patients with distinct clinical histories and characteristics. A dynamic treatment regime implements a reinforcement learning algorithm to produce the optimal personalized treatment regime in clinical medicine. Reinforcement learning methods are applicable when an agent takes actions in response to a changing environment over time. Q-learning is one of the popular methods for developing an optimal dynamic treatment regime by fitting linear outcome models in a recursive fashion. Despite its ease of implementation and interpretation for domain experts, Q-learning has a certain limitation due to the risk of misspecification of the linear outcome model. Recently, algorithms more robust to model misspecification have been developed. For example, the inverse probability weighted estimator overcomes this problem by using a nonparametric model with different weights assigned to the observed outcomes for estimating the mean outcome. The augmented inverse probability weighted estimator, on the other hand, combines information from both the propensity model and the mean outcome model.
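A single-stage version of the linear-model Q-learning described above can be sketched as follows. The data are simulated with a hypothetical treatment-covariate interaction; the fitted regime treats exactly the patients whose estimated Q-value is higher under treatment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                        # patient covariate
a = rng.integers(0, 2, size=n).astype(float)  # binary treatment, randomized
# simulated outcome: treatment benefit is 2*x, so treating helps only when x > 0
y = 1.0 + 0.5 * x + a * (2.0 * x) + rng.normal(scale=0.5, size=n)

# fit the linear Q-function Q(x, a) = b0 + b1*x + b2*a + b3*a*x by least squares
X = np.column_stack([np.ones(n), x, a, a * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def q(xv, av):
    return beta @ np.array([1.0, xv, av, av * xv])

def rule(xv):
    """Estimated optimal regime: treat iff the fitted Q-value is higher under treatment."""
    return int(q(xv, 1.0) > q(xv, 0.0))

print(rule(1.5), rule(-1.5))
```

The multi-stage version fits such models backward in time, plugging each stage's maximized Q-value in as the pseudo-outcome for the previous stage; if the linear form is misspecified, the learned rule can be wrong, which motivates the weighted estimators mentioned above.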
The current statistical methods for producing the optimal dynamic treatment regime, however, allow only a binary action space. In clinical practice, some combinations of treatments are required, giving rise to a multi-dimensional action space. This study develops and demonstrates a practical way to accommodate a multi-level action space, utilizing currently available computational methods for the practice of precision medicine.

Item: Performance of Machine Learning Algorithms for Heart Disease Prediction: Logistic Regressions Regularized by Elastic Net, SVM, Random Forests, and Neural Networks (UTSA Office of Undergraduate Research, 2022-12). Ikpea, Obehi Winnifred; Han, David. Heart disease, a medical condition caused by plaque buildup in the walls of the arteries, is the leading cause of death in the U.S. and worldwide. About 697,000 people suffer from this condition in the U.S. alone. This research project aims to assess and compare the performance of several classification algorithms for predicting heart disease so that the methods can serve as a clinical indicator of cardiovascular health. These methods include multiple logistic regression regularized with and without elastic nets, support vector machines, random forests, and artificial neural networks. The low prevalence of the disease is reflected in the data imbalance, and an oversampling technique is suggested to deal with the computational challenges posed by this imbalance.

Item: Policy-Guided Susceptible-Infected-Recovered Modeling of the COVID-19 Spread in Texas (UTSA Office of Undergraduate Research, 2022-12). Woods, Ethan; Han, David. The goal of this research was to create an SIR model for Texas COVID-19 cases based on state data from March 2020 through October 2020, and to investigate the impact of public policies on the transmission of COVID-19. The data were pre-processed in Excel; some basic time series graphs were produced in Excel as well.
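The deterministic skeleton of an SIR model can be sketched with one-day Euler steps. The population size and rates below are assumed round numbers for illustration, not the fitted Texas estimates:

```python
N = 29_000_000.0           # population of roughly Texas size (assumed)
beta, gamma = 0.25, 0.10   # assumed transmission and recovery rates; R0 = beta/gamma
S, I, R = N - 100.0, 100.0, 0.0

for day in range(180):     # one-day Euler steps of dS/dt, dI/dt, dR/dt
    new_inf = beta * S * I / N   # new infections this day
    new_rec = gamma * I          # new recoveries this day
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec

print(f"R0 = {beta/gamma:.1f}, recovered fraction after 180 days = {R/N:.2f}")
```

Policy effects enter such a model by letting beta change at the policy implementation dates, which is why closely spaced policies make the per-interval parameters hard to estimate.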
All other data analysis, including the production of all graphs relating to the SIR model, was performed in R. Difficulty in estimating the model parameters by the maximum likelihood method was encountered due to the short durations between the implementation dates of the various policies designed to curb the spread of COVID-19. Examining the estimated trends of beta, gamma, and R0, a stabilizing pattern for R0 was observed over time, which would require further investigation to understand the epidemiology of COVID-19 in Texas.

Item: Predicting the Expected Waiting Time of Popular Attractions in Walt Disney World (Office of the Vice President for Research, 2019). Mendoza, Dayanira; Wu, Wenbo; Leung, Mark T. Waiting lines are an inevitable consequence of imbalance in service operations at modern theme parks. Because of this, parks have introduced different approaches to reduce the standard waiting time, some at no extra cost to guests and others at a price premium. These approaches usually feature a variety of schemes by which guests can bypass the standard waiting line or enter an express lane with a minimal wait. Our current study develops statistical learning models to analyze empirical data gathered from "touringplans.com," which encompasses some of Walt Disney World's (WDW) popular attractions located in Orlando, Florida. Results from data analysis and visualization indicate that each of the four parks had similar patterns throughout the years 2012 through 2018. The study also examines the temporal effect and finds that which rides are more popular depends on the season (period) of the year. Empirical analytics are then conducted on each of the four parks using regression modeling (statistical learning) to predict the waiting time for a particular ride during a specific season.
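A bare-bones version of such a seasonal regression can be sketched with one-hot season indicators and least squares. The waits below are simulated with assumed seasonal means; the study's actual models and features are richer:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
season = rng.integers(0, 4, size=n)             # four hypothetical seasons
true_means = np.array([20.0, 45.0, 30.0, 60.0]) # assumed mean waits in minutes
wait = true_means[season] + rng.normal(scale=5.0, size=n)

X = np.eye(4)[season]                           # one-hot design, one column per season
coef, *_ = np.linalg.lstsq(X, wait, rcond=None)

new_season = 3
print(f"predicted wait for season {new_season}: {coef[new_season]:.1f} minutes")
```

With no intercept and one column per season, each fitted coefficient is simply the sample mean wait for that season; richer models add ride, park, and interaction terms on top of this baseline.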
Overall, a sample of 13 rides (attractions) over 17 seasons is used to model the waiting times at each theme park, yielding a total of 13 x 17 x 4 = 884 possible combinations.

Item: Predicting the Next Big Impact: Modelling the Rate of Massive Meteorite Strikes (UTSA Office of Undergraduate Research, 2020-12). Woods, Ethan; Han, David. Meteorites are solid pieces of debris from astronomical objects such as comets, asteroids, or meteoroids that originate in outer space and survive passage through the atmosphere to reach the surface of a planet. Although rare, a collision between massive astronomical objects, known as an impact event, can have measurable physical and biospheric consequences. In this work, we investigate the distributional trend of heavy meteorites that strike the earth and determine whether any probability distributions can serve as effective predictive models. NASA meteorite data from 1980 to 2012 were imported into R after pre-processing, which involved removing missing data and features irrelevant to meteorite mass or the year of impact. Statistical analysis was then restricted to meteorites at or above the 98th percentile of mass. It was found that while the distribution of mass for all meteorites is lognormal, the distribution for the top 2% is severely right-skewed, indicating that an extreme-value distribution could be used to model them. Furthermore, the rate of impact for these massive meteorites can be modelled with a zero-inflated negative binomial distribution.

Item: Quantum Computation, Quantum Algorithms and Implications on Data Science (UTSA Office of Undergraduate Research, 2020-12). Kim, Nathan; Garcia, Jeremy; Han, David. Quantum computing is a revolutionary computing paradigm, first theorized in 1981. It is based on quantum physics and quantum mechanics, which are fundamentally stochastic in nature, with inherent randomness and uncertainty.
The power of quantum computing relies on three properties of a quantum bit: superposition, entanglement, and interference. Quantum algorithms are described by quantum circuits, and they are expected to solve decision problems, functional problems, oracular problems, sampling tasks, and optimization problems much faster than classical silicon-based computers. They are expected to have a tremendous impact on current Big Data technology, machine learning, and artificial intelligence. Despite the theoretical and physical advancements, there are still several technological barriers to successful applications of quantum computation. In this work, we review the current state of quantum computation and quantum algorithms, and discuss their implications for the practice of Data Science in the near future. There is no doubt that quantum computing will accelerate the process of scientific discovery and industrial advancement, having a transformative impact on our society.

Item: Statistical Perspectives in Teaching Deep Learning from Fundamentals to Applications (UTSA Office of Undergraduate Research, 2020-12). Kim, Nathan; Han, David. The use of artificial intelligence, machine learning, and deep learning has gained a great deal of attention and become increasingly popular in many areas of application. Historically, machine learning theory had strong connections to statistics; however, the current deep learning context is framed mostly from computer science perspectives and lacks statistical perspectives. In this work, we address this research gap and discuss how to teach deep learning to the next generation of statisticians. We first describe some background and how to get motivated. We discuss the different terminologies used in computer science and statistics, and how deep learning procedures work, without getting into the mathematics.
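From the statistician's view, a one-hidden-layer network is a flexible nonlinear regression model fit by gradient descent. A minimal from-scratch sketch on simulated data (tanh hidden units, squared-error loss; all choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# regress y on x where the true regression function is nonlinear: y = sin(x) + noise
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)

H = 16                                       # number of hidden units
W1, b1 = rng.normal(scale=0.5, size=(1, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.5, size=(H, 1)), np.zeros(1)

def mse():
    return float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))

loss0 = mse()
lr = 0.01
for _ in range(1000):                        # plain gradient descent
    h = np.tanh(x @ W1 + b1)                 # hidden layer activations
    pred = h @ W2 + b2
    g = 2 * (pred - y) / len(x)              # dLoss/dpred
    gW2, gb2 = h.T @ g, g.sum(0)
    gh = g @ W2.T * (1 - h ** 2)             # backprop through tanh
    gW1, gb1 = x.T @ gh, gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(f"MSE: {loss0:.3f} -> {mse():.3f}")
```

Framed this way, the network is just nonlinear least squares with many parameters, which gives students a direct bridge from regression fundamentals to deep learning practice.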
In response to the question of what to teach, we address organizing deep learning content with a focus on the statistician's view: from basic statistical understandings of neural networks to the latest topics on uncertainty quantification for deep learning prediction, which has been studied in Bayesian frameworks. Further, we discuss how to choose computational environments and help students develop programming skills. We also discuss how to develop homework incorporating ideas from experimental design. Finally, we discuss how to expose students to domain knowledge and help build multi-disciplinary collaborations.

Item: Stochastic SIR-based Examination of the Policy Effects on the COVID-19 Spread in the U.S. States (UTSA Office of Undergraduate Research, 2020-12). Song, Mina; Belle, Macy K.; Medlovitz, Aaron; Han, David. Since the global outbreak of the novel COVID-19, many research groups have studied the epidemiology of the virus for short-term forecasting and to formulate effective disease containment and mitigation strategies. The major challenge lies in the proper assessment of the epidemiological parameters over time and of how they are modulated by the effect of any publicly announced interventions. Here we attempt to examine and quantify the effects of the various (legal) policies and orders put in place to mandate social distancing and flatten the curve in each of the U.S. states. Through Bayesian inference on stochastic SIR models of the virus spread, the effectiveness of each policy in reducing the magnitude of the growth rate of new infections is investigated statistically. This will inform the public and policymakers, and help them understand the most effective actions to fight against current and future pandemics.
It will aid policymakers in responding more rapidly (selecting, tightening, and/or loosening appropriate measures) to stop or mitigate a pandemic early on.

Item: Strategic Analysis and Evaluation of Cheesecake Factory's Supply Chain: Uncertainties, Challenges, and Remedies (Office of the Vice President for Research, 2019). Farley, Brittany; Kidd, Michele; Morgan, Scot; Leung, Mark T. In the business world, it is important to maintain a profitable balance between efficiency (cost) and responsiveness (to changes in the market, customer demand, etc.). We took the fundamentals of supply chain theory and used them to analyze the real-world case of The Cheesecake Factory's retail cheesecake supply chain. After an examination of the background of its supply chain structure, The Cheesecake Factory's supply and demand uncertainties were first identified and assessed. We reviewed how supply uncertainties are influenced by disruptions to material flow on the supplier side, and how implied demand uncertainties are influenced by changes in customers' behavior and preferences. These different forms of uncertainty led to many supply chain challenges for The Cheesecake Factory, and we made remedial recommendations to address them, including adding and continuously improving the flow of information with advances in technology, and partnering with eco-friendly farms. Finally, we reviewed the ability, and thus the sustainability, of The Cheesecake Factory to maintain a strategic balance between cost and responsiveness for its high-end cheesecake products given the ongoing challenges. Understanding supply chain variables is key to remaining profitable in business. The Cheesecake Factory's cheesecake supply chain displays operational and consumption characteristics similar to those of many other food processing supply chains.
Our strategic analysis and evaluation can offer valuable insight into managing these supply chains and improving their profitability.