Optimizing Instagram Influencer Content for Brands

Simran Sawhney
8 min read · May 3, 2021
Source: Venture Harbour

In a world saturated with digital noise, businesses are urgently looking for innovative ways to engage their existing customers and attract new ones. Because self-promotion has lost credibility over time, brands have turned to influencer marketing as a more reliable strategy: collaborating with niche social media experts who have loyal followings. This approach bypasses celebrity endorsements and goes directly to individuals who are seen as “ordinary” and “regular,” but extremely trustworthy and qualified to recommend products and services within their respective communities.

In this post, I aim to build a strong predictive model that brands can use when deciding which influencers to invest in. I will also analyze key Instagram trends that brands should take into account when making strategic marketing recommendations.

Data Collection

Sources:

Process:

High Level Schema of Data Collection Process

Step 1: To begin the data collection process, I scraped the top 50 Instagram influencer accounts on HypeAuditor for each of 8 content categories: Art/Artists, Beauty, Fashion, Fitness, Food & Cooking, Travel, Health, and DIY/Design. Because the web page was dynamically loaded using JavaScript, I scraped it with the RSelenium package and a Firefox driver. Additionally, to limit skew in the data, I filtered out influencers with more than 3 million followers. The cleaned data was then exported from R and imported into Python for the next step.
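The follower cap from Step 1 can be sketched as follows. The original filtering was done in R, so this Python version is only an illustration; the usernames and follower counts are made-up placeholders rather than scraped data, and treating exactly 3 million followers as "kept" is an assumption.

```python
# Hypothetical records standing in for scraped HypeAuditor rows (not real data).
accounts = [
    {"username": "example_artist", "category": "Art/Artists", "followers": 850_000},
    {"username": "example_chef", "category": "Food & Cooking", "followers": 4_200_000},
    {"username": "example_trainer", "category": "Fitness", "followers": 3_000_000},
]

FOLLOWER_CAP = 3_000_000  # threshold stated in the post


def filter_by_followers(rows, cap=FOLLOWER_CAP):
    """Keep accounts with at most `cap` followers; drop anything above it."""
    return [row for row in rows if row["followers"] <= cap]


kept = filter_by_followers(accounts)
print([row["username"] for row in kept])
```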

Step 2: I cloned a scraper repository from GitHub using git clone and opened the files within a Python environment. From there, I sifted through its built-in functions to create my own test file with a customized data frame that output the information I found relevant:

Step 3: The parsed data was then exported from Python and re-imported in R for analysis.

Data Analysis

Understanding the Importance of Influencer Marketing: A Case Study

A widely acknowledged success story is that of Daniel Wellington, a Swedish company that emphasizes elegant and minimalistic design in crafting its watch and jewelry collections. Founded in 2011 (with only $15,000 in investments), the brand was able to scale to $220 million in revenue by 2015, relying almost exclusively on influencer marketing strategies. By sending watches as gifts to influencers and generating a network of #DanielWellington social media posts, embedded with discount codes and photo contests, Daniel Wellington rapidly grew engagement with its offerings.

Example of Daniel Wellington influencer post (Source: @elorabee on Instagram)

The wordcloud below summarizes a blog review on the brand and its signature marketing style.

Words like “success,” “love,” “outstanding,” and “creative” are indicative of the positive sentiment associated with Daniel Wellington’s influencer campaigns. Words like “instagram,” “hashtag,” “commenting,” “followers,” and “content” provide us with a glimpse of which features may be important when evaluating influencer posts.

Predicting Post Popularity

A correlation matrix provides a high-level overview of the data and of the trends to look out for as we delve into deeper analysis. From the graphic below, we can see that the number of likes a post receives is positively correlated with (from highest to lowest magnitude): follower count, number of comments, whether the post is uploaded on the weekend, whether the post is in sidecar format (multiple images, videos, or some combination), and whether the post is created in the evening. While these correlations don’t necessarily have predictive power, they are indicators that we should keep an eye on as we develop our statistical models.
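As a rough illustration of the ranking described above, the sketch below computes Pearson correlations between each feature and the number of likes, then sorts them by magnitude. The feature names and values are hypothetical toy data, not the scraped dataset, and the original analysis was done in R.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy posts (one list entry per post).
posts = {
    "likes": [1200, 3400, 560, 8900, 4300, 2100],
    "followers": [10_000, 45_000, 8_000, 120_000, 60_000, 30_000],
    "comments": [30, 80, 12, 150, 95, 70],
    "is_weekend": [0, 1, 0, 1, 0, 1],
}


def rank_by_correlation(data, target="likes"):
    """Return (feature, correlation with target) pairs, largest magnitude first."""
    pairs = [
        (name, pearson(values, data[target]))
        for name, values in data.items()
        if name != target
    ]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


for name, r in rank_by_correlation(posts):
    print(f"{name}: {r:+.2f}")
```

Sorting by absolute value mirrors the "highest to lowest magnitude" ordering in the text; it says nothing about causation or predictive power on its own.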

Iterating upon a basic predictive foundation is one of the most effective ways to develop a strong statistical model. A multiple linear regression incorporating all features in the data shows that only follower count, the number of comments, video status, and caption length are statistically significant predictors. When this linear model is applied to the test sample, it yields an RMSE (root mean square error) of 89,088.12. As plotted below, there are wide discrepancies between the actual and predicted number of likes. The model therefore fits the held-out data poorly; overfitting may be a factor, and I posit that multicollinearity among the predictors may be swaying results.
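For readers unfamiliar with the metric, here is a minimal sketch of how a test RMSE is obtained: fit on training data, predict on held-out data, and take the square root of the mean squared error. It uses a single predictor and made-up numbers for brevity, whereas the model in this post used all features and was fit in R.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (followers, likes) pairs; not taken from the scraped dataset.
train = [(10_000, 900), (20_000, 2_100), (30_000, 2_900), (40_000, 4_100)]
test = [(25_000, 2_400), (50_000, 5_600)]

intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [intercept + slope * x for x, _ in test]
test_rmse = rmse([y for _, y in test], predictions)
print(f"test RMSE: {test_rmse:.2f}")
```

Because RMSE is expressed in the same units as the target, the figures reported in this post can be read directly as "likes".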

To account for potential multicollinearity, I next fit a PCA (Principal Component Analysis) model. PCA is a dimensionality-reduction method that transforms a large set of possibly correlated variables into a smaller set of uncorrelated components while retaining most of the variance in the data. After training the model, I find that 10 is the optimal number of components to include. This configuration, when applied to the test data, yields an RMSE of 88,842.77. While slightly lower than that of the multiple regression model, this does little to improve predictive power. In fact, the plot of the actual versus predicted number of likes below closely resembles the one created for multiple regression.
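To make the idea of a principal component concrete, the sketch below standardizes three hypothetical features, builds their covariance matrix, and extracts the first component with power iteration. It is a simplified illustration only: the model in this post kept 10 components and was fit in R, while this example extracts a single component from made-up values, and the iteration count and start vector are assumptions.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / std for v in column]


def covariance_matrix(columns):
    """Sample covariance matrix of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(columns, means)
        ]
        for ca, ma in zip(columns, means)
    ]


def first_component(matrix, iterations=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    size = len(matrix)
    # Uneven start vector; assumed not orthogonal to the leading eigenvector.
    vector = [1.0 / (i + 1) for i in range(size)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    projected = [sum(matrix[i][j] * vector[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(v * p for v, p in zip(vector, projected))
    return vector, eigenvalue


# Hypothetical feature columns (followers in thousands); not the scraped data.
followers = [10, 45, 8, 120, 60, 30]
comments = [30, 80, 12, 150, 95, 70]
caption_length = [120, 40, 200, 60, 90, 150]

columns = [standardize(c) for c in (followers, comments, caption_length)]
matrix = covariance_matrix(columns)
weights, eigenvalue = first_component(matrix)
explained = eigenvalue / sum(matrix[i][i] for i in range(len(matrix)))
# Component scores: each post projected onto the first component.
scores = [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(len(followers))]
print(f"first component explains {explained:.0%} of the variance")
```

When predictors are strongly correlated, one component absorbs most of their shared variance, which is why PCA is a common response to suspected multicollinearity. The scores would then replace the original features as regression inputs.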

The last model we look at is XGBoost, a gradient-boosted decision tree algorithm. Gradient boosting techniques have become widely used in statistics and machine learning; they combine many weak learners (shallow trees) into a strong one by adding trees sequentially, each fit to reduce the error left by the trees before it. For my XGBoost model, I set a maximum tree depth of 3 and specified 50 boosting rounds. In evaluating feature importance, I examined the “Gain” column, which represents the fractional contribution of each feature to the model and can be used as a proxy for importance. The two most important features are number of comments and follower count, with caption length, video status, and number of hashtags trailing. To visualize the process of gradient boosting, I embedded the 24th tree (out of 50) below.
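The core boosting loop can be sketched in a few lines. This is not XGBoost itself: it omits regularization and second-order gradient information, and it uses depth-1 stumps rather than the depth-3 trees described above. The 50 rounds match the post; the learning rate of 0.3 and the data are assumptions for illustration.

```python
import math

LEARNING_RATE = 0.3  # assumed value for this sketch
ROUNDS = 50          # number of boosting rounds used in the post


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_stump(rows, residuals):
    """Find the one-feature split that minimizes squared error on the residuals."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((r - left_mean) ** 2 for r in left)
            error += sum((r - right_mean) ** 2 for r in right)
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def boost(rows, targets, rounds=ROUNDS, learning_rate=LEARNING_RATE):
    """Fit stumps sequentially, each one to the residuals of the current ensemble."""
    base = sum(targets) / len(targets)
    predictions = [base] * len(targets)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, row)
            for p, row in zip(predictions, rows)
        ]
        history.append(rmse(targets, predictions))  # training error after this round
    return base, stumps, history


def predict(base, stumps, row, learning_rate=LEARNING_RATE):
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Hypothetical rows: (followers, comments, caption_length); targets are likes.
rows = [
    (10_000, 30, 120), (45_000, 80, 40), (8_000, 12, 200), (120_000, 150, 60),
    (60_000, 95, 90), (30_000, 70, 150), (75_000, 110, 30), (15_000, 25, 180),
]
likes = [900, 3400, 560, 8900, 4300, 2100, 6100, 1000]

base, stumps, history = boost(rows, likes)
print(f"training RMSE after round 1: {history[0]:.1f}, after round {ROUNDS}: {history[-1]:.1f}")
```

The `history` list makes the "tracking the reduction in error" step visible: with squared-error loss and a learning rate below 1, training error cannot increase from one round to the next. Training error is not a substitute for the test RMSE reported in the post.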

To evaluate performance, I once again plotted the actual number of likes against the predicted number. The two plot lines are now much more closely aligned, apart from a few outliers. Test RMSE, using XGBoost, drops substantially to 64,829.19.
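To put the three reported test RMSE values on a common footing, the short sketch below expresses each as a relative reduction against the linear regression baseline. The RMSE figures are the ones reported in this post; the helper name is just for illustration.

```python
# Test RMSE values reported in the post.
reported_rmse = {
    "linear regression": 89_088.12,
    "PCA": 88_842.77,
    "XGBoost": 64_829.19,
}

baseline = reported_rmse["linear regression"]


def relative_reduction(value, reference=baseline):
    """Fractional drop in RMSE relative to the reference model."""
    return (reference - value) / reference


for name, value in reported_rmse.items():
    print(f"{name}: {relative_reduction(value):.1%} lower than the linear baseline")
```

On these figures, PCA improves on the baseline by well under 1%, while XGBoost cuts test RMSE by roughly 27%.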

Key Takeaway: Neither ordinary linear regression nor PCA does the data justice. Instead, a tree-based ML algorithm is needed to improve predictive power. The most important features in predicting the number of likes a post will receive are: comment count (positive effect), follower count (positive effect), caption length (negative effect), video status (negative effect), and hashtag count (negative effect).

Examining Trends

While some features may not be significant when it comes to prediction, it is still interesting to parse out trends. I wanted to first segment the data by content type and day of the week to see if any underlying tendencies existed. I find that:

  • Posts involving fitness, as well as food and cooking, do the best on weekends. Because these posts oftentimes promote offline action (e.g., exercising and meal prepping), engagement is likely to be higher when Instagram users have more time (i.e. on the weekend).
  • Interestingly enough, for a large number of categories (Art/Artists, Fashion, Health, and Travel) there are mid-week spikes in post popularity.

Looking at the data more holistically, posts actually do the best on Wednesdays, followed by the weekends. Posts generally do poorly on Mondays and Tuesdays, when overall social media usage is likely low due to users getting back into school/work environments.
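The segmentation behind these observations amounts to grouping posts by a key and averaging likes within each group. Here is a minimal sketch of that aggregation; the records are hypothetical and deliberately constructed, so the output illustrates the mechanics rather than reproducing the findings.

```python
from collections import defaultdict

# Hypothetical (day, category, likes) records; not the scraped dataset.
posts = [
    ("Mon", "Fitness", 1_000),
    ("Mon", "Travel", 1_200),
    ("Wed", "Travel", 3_000),
    ("Wed", "Fashion", 2_600),
    ("Sat", "Fitness", 2_500),
    ("Sat", "Food & Cooking", 2_700),
]


def mean_likes_by(records, key_index):
    """Average likes grouped by the field at `key_index` (0 = day, 1 = category)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of likes, post count]
    for record in records:
        bucket = totals[record[key_index]]
        bucket[0] += record[2]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


by_day = mean_likes_by(posts, 0)
by_category = mean_likes_by(posts, 1)
best_day = max(by_day, key=by_day.get)
print(by_day, best_day)
```

Grouping on both day and category at once (for example, keying on the pair) would give the per-category breakdown discussed in the bullet points.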

Posts do the best in the evening on weekdays (after school/work), and on weekends, they do the worst in the afternoon (when users are likely disconnected from social media and spending time with family/friends).

In terms of media types, sidecars and images are the most popular formats across most content categories. The two exceptions are health and food/cooking: images receive the most engagement for health, while videos do the best for food/cooking.

Using a logarithmic trend line, we see that as captions become longer, engagement and popularity decline. Users likely get overwhelmed or bored by large blocks of text and abandon the post.
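A logarithmic trend line has the form likes = a + b · ln(caption length), which can be fit with ordinary least squares after taking the log of the caption length. The sketch below shows that fit on hypothetical values; the numbers are placeholders, and the original trend line was produced in R.

```python
import math


def fit_log_trend(lengths, likes):
    """Least-squares fit of likes = intercept + slope * ln(length); lengths must be > 0."""
    logs = [math.log(length) for length in lengths]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(likes) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(logs, likes)) / sum(
        (x - mean_x) ** 2 for x in logs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_likes(intercept, slope, length):
    return intercept + slope * math.log(length)


# Hypothetical caption lengths (characters) and likes; not the scraped data.
caption_lengths = [20, 60, 150, 400, 900]
likes = [5200, 4700, 4100, 3900, 3300]

intercept, slope = fit_log_trend(caption_lengths, likes)
print(f"slope per unit of ln(length): {slope:.1f}")
```

A negative slope means each multiplicative increase in caption length (say, doubling it) costs a roughly constant number of likes, so the decline is steepest for short captions and flattens out for long ones.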

Finally, I scored each caption within the dataset in terms of sentiment (with scores ranging from -2.5 to 20). While I hypothesized that posts with higher sentiment scores would be more popular than those with low or negative scores, the data shows an inverse trend. This may be because of something called “negativity bias”: we have evolved to react to negative content, like threats and hyperbole, because it stands out to us. As Stuart Soroka, a professor at the University of Michigan, puts it: “humans may [be] neurologically or physiologically predisposed towards focusing on negative information because the potential costs of negative information far outweigh the potential benefits of positive information.” That being said, in the context of influencer marketing, an influencer should obviously not tailor content to portray a sponsoring brand in a negative light.

Key Takeaways: Brands that are scouting out influencers to advertise their products/services should consider the following:

(1) Follower count and average number of comments per post are good indicators of reach and audience engagement.

(2) Brands should encourage influencers to minimize caption length and the number of hashtags used in marketing posts, while still retaining key content.

(3) Posts should generally be uploaded on weekends or, depending on the content category, midweek. If uploaded midweek, they should go up in the evening; if uploaded on the weekend, in either the morning or the evening.

(4) Posts should be in sidecar or image format; videos should only be used for food/cooking related content.
