There are plenty of reasons why one would want to find solitude in the wilderness, from the therapeutic effects of being immersed in nature, to not wanting to contribute to trail degradation and soil erosion on busier trails.
Now more than ever, the reprieve of the outdoors is greatly needed. But in a post-COVID-19 world, where it can be practically impossible to maintain proper social distancing when passing hikers on a narrow trail, it is especially important to find less-frequented trails to hike.
I set out on a mission to use data science and machine learning to find the best little-known trails in America. You can check out the code on my GitHub if you want to jump into the nitty-gritty, or read on for the analysis and a list of the hidden gems in your state!
The Approach
If you’re anything like me, before you go anywhere or buy anything, you’re going to read all the reviews. When looking for trails to hike, a popular medium for discovering where to go is AllTrails.com.
When I first approached this project, I wanted to answer the question, “What makes a trail good?” That is, what combination of features and statistics about a trail would lead to it having a high overall rating?
What I pretty quickly found out, though, is that across the 35,000 trails I scraped and analyzed, basically all of them were rated “pretty good”. With an average user rating of 4.2 out of 5 stars and a standard deviation of less than 0.6, it was really hard to distinguish which trails were excellent and which were just okay from their star rating alone.
What did vary hugely across the trails, though, was their popularity, as represented by the total number of reviews each trail had. While the vast majority of trails had only 100 or so reviews, a select few had several thousand! What was making these trails so popular?
I thus pivoted. Rather than predict a trail’s rating, I would use a data-driven model to determine the relationship between the various features of a given trail and its popularity. Having found those commonalities, I could then apply the model to unpopular trails to find which ones check all the same boxes and are likely to be great, even though they haven’t been discovered yet.
Methodology
- With Selenium and Beautiful Soup, scrape AllTrails.com to obtain data on 35,000 trails in the United States. This included each hike’s length, elevation gain, and location, plus a list of all of the features (such as waterfalls, wildflowers, or paving) the trail had.
- Clean this data and load it into a Pandas DataFrame. This included one-hot encoding dummy variables for all of the categorical feature columns.
- Use the VADER Sentiment Analysis module to analyze each trail’s text reviews via simple natural language processing and determine a mean composite score.
- Use linear regression modeling, including Statsmodels OLS, to determine the relationship between a trail’s features and its popularity.
- Perform feature engineering and regularization via LassoCV to reduce multicollinearity amongst those features and optimize the model.
- Apply that model to trails described as “lightly trafficked” to find trails that would be expected to be popular based on their combination of features but just haven’t been discovered yet.
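The encoding and regularization steps in this list can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names (`length_miles`, `route_type`, `has_waterfall`, and so on), the toy target, and the settings passed to `LassoCV` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; these columns are illustrative, not the scraped schema.
rng = np.random.default_rng(0)
n = 200
trails = pd.DataFrame({
    "length_miles": rng.uniform(1, 15, n),
    "elevation_gain_ft": rng.uniform(100, 4000, n),
    "route_type": rng.choice(["loop", "out_and_back", "point_to_point"], n),
    "has_waterfall": rng.integers(0, 2, n),
})
# Synthetic popularity target driven by elevation gain and waterfalls only.
trails["num_reviews"] = (
    50
    + 0.05 * trails["elevation_gain_ft"]
    + 80 * trails["has_waterfall"]
    + rng.normal(0, 20, n)
)

# One-hot encode the categorical column. drop_first=True drops one dummy
# per category so the dummies are not perfectly collinear with each other.
X = pd.get_dummies(
    trails.drop(columns="num_reviews"),
    columns=["route_type"],
    drop_first=True,
    dtype=float,
)
y = trails["num_reviews"]

# Lasso penalizes coefficient size, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Features whose coefficient was shrunk to exactly zero are effectively dropped.
coefs = pd.Series(lasso.coef_, index=X.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

Because the L1 penalty can shrink coefficients all the way to zero, redundant or uninformative columns tend to fall out of the model, which is one common way to tame collinearity among many dummy variables.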
Findings
A linear regression model was fit to the trails’ stats, with the number of reviews (and hence, popularity) serving as the target variable. The model yielded a list of the features with the most influence on a trail’s popularity. These included charging a fee, having a high sentiment analysis score, being rocky, and having a scramble and no shade, amongst others.
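As a rough sketch of how such a ranking of influential features can be produced, the example below fits an ordinary least squares model on standardized synthetic features and sorts the coefficients by magnitude. The article used Statsmodels OLS; this sketch swaps in NumPy's least-squares solver, which gives the same coefficient estimates, and every feature name and effect size here is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
feature_names = ["fee", "sentiment", "rocky", "scramble", "no_shade", "paved"]

# Synthetic 0/1 flags plus a sentiment score in [-1, 1]; all values are invented.
X = rng.integers(0, 2, size=(n, len(feature_names))).astype(float)
X[:, 1] = rng.uniform(-1, 1, n)
true_effects = np.array([150.0, 90.0, 50.0, 40.0, 30.0, 0.0])
y = 100 + X @ true_effects + rng.normal(0, 30, n)

# Standardize columns so coefficient sizes are comparable across features.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
A = np.column_stack([np.ones(n), Xz])  # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Rank features by absolute standardized coefficient, largest first.
ranked = sorted(zip(feature_names, coef[1:]), key=lambda p: abs(p[1]), reverse=True)
for name, c in ranked:
    print(f"{name:>10s}  {c:+.1f}")
```

Standardizing first matters: without it, a feature measured in feet would get a tiny coefficient next to a 0/1 flag even if it explained more of the variation.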
I interpret those important features like this:
A fee: If the most popular trails have a fee to use, this indicates they are likely located inside National Parks. As many National Parks are closed due to COVID, or may be very busy, it is even more important to find alternatives.
Sentiment analysis score: Since all trails have roughly the same score out of 5 stars, it’s hard to gather much reliable information about their quality from this rating alone. By using natural language processing to analyze the written reviews themselves, I was able to gain an actually useful metric of how people feel about the trail. The higher the score (on a scale of -1 = very negative to +1 = very positive), the more positively people felt toward the trail, which was super useful in finding hidden gems.
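A minimal sketch of the "mean composite score per trail" idea follows. To keep it runnable without extra packages, it uses a tiny stand-in word-list scorer (`toy_score`, entirely hypothetical) rather than VADER itself; the comment shows how VADER's compound score could be plugged in instead. The trail names and reviews are invented.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_trail(reviews, score_fn):
    """Average a per-review sentiment score for each trail.

    reviews: iterable of (trail_name, review_text) pairs.
    score_fn: function mapping review text to a score in [-1, 1].
    """
    by_trail = defaultdict(list)
    for trail, text in reviews:
        by_trail[trail].append(score_fn(text))
    return {trail: mean(scores) for trail, scores in by_trail.items()}


# Stand-in scorer (an assumption for this sketch, NOT VADER's algorithm).
# With the vaderSentiment package installed, the real scorer could be:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   analyzer = SentimentIntensityAnalyzer()
#   score_fn = lambda text: analyzer.polarity_scores(text)["compound"]
POSITIVE = {"great", "beautiful", "love", "amazing"}
NEGATIVE = {"crowded", "boring", "muddy", "terrible"}


def toy_score(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [1 if w in POSITIVE else -1 for w in words if w in POSITIVE | NEGATIVE]
    return mean(hits) if hits else 0.0


reviews = [
    ("Hidden Falls", "Beautiful views, love it!"),
    ("Hidden Falls", "Great hike but muddy"),
    ("Busy Peak", "Crowded and boring"),
]
scores = mean_score_by_trail(reviews, toy_score)
print(scores)
```

Keeping the scorer as a parameter makes it easy to swap the toy word list for a real sentiment model without touching the aggregation step.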
Rocky/scramble/no shade: What this says to me is that the very popular trails run above the tree line! It’s on those more difficult hikes with higher elevation gain that you encounter these features. And with higher elevation, you’ll likely get better views! As it turns out, people love these tougher trails.
The R² of this model was optimized to 0.19. This isn’t a very high score, and the reason is that the relationship between trail features and popularity simply isn’t linear. The residuals plot, which shows the difference between the predicted and actual popularity values, demonstrates this pretty clearly: if the relationship were linear, the residuals would all fall in a fairly horizontal band around 0. So what’s actually determining a trail’s popularity, if not having all the right features of a popular trail?
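To make the residuals argument concrete, here is a small sketch on synthetic data in which the true relationship is deliberately nonlinear. It fits a straight line, computes R², and checks whether the residuals stay centered on zero across the range of predictions. The single feature and the exponential target are assumptions chosen only to show the pattern, not a model of the trail data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 400)                   # hypothetical single trail feature
y = np.exp(0.5 * x) + rng.normal(0, 5, 400)   # popularity grows nonlinearly

# Fit a straight line and compute residuals (actual minus predicted).
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept
residuals = y - pred

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# For a well-specified linear model, the mean residual would be near zero in
# every slice of the predictions. Here it swings from positive to negative
# and back, which is the curved pattern a residuals plot would reveal.
order = np.argsort(pred)
low, mid, high = np.array_split(residuals[order], 3)
print(f"R^2 = {r2:.2f}")
print(f"mean residual by tercile: {low.mean():+.1f}, {mid.mean():+.1f}, {high.mean():+.1f}")
```

The same check applies to a multi-feature model: group the residuals by predicted value and look for systematic drift away from zero.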
My key finding was that AllTrails’ algorithm shows the trails with the most reviews first and foremost, which leads to a form of recursive confirmation bias. If all trails have roughly the same rating, users turn to the reviews to determine whether a trail is good and choose one with a lot of reviews, feeding the loop that makes the very few busiest trails even busier. Meanwhile, other similar trails may have plenty to offer but go neglected.
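The feedback loop described here can be illustrated with a toy simulation. This is not AllTrails' actual ranking algorithm, which is not known here; it simply assumes that every trail is equally good and that each hiker picks a trail with probability proportional to its current review count (plus one), then leaves a review. The function name and parameters are hypothetical.

```python
import random
from statistics import median


def simulate(n_trails=100, n_hikers=5000, by_popularity=True, seed=0):
    """Return the review count per trail after n_hikers each pick one trail."""
    rng = random.Random(seed)
    reviews = [0] * n_trails
    for _ in range(n_hikers):
        # Popularity-driven choice: weight each trail by reviews so far (+1 so
        # unreviewed trails can still be picked). Otherwise choose uniformly.
        weights = [r + 1 for r in reviews] if by_popularity else None
        trail = rng.choices(range(n_trails), weights=weights)[0]
        reviews[trail] += 1
    return reviews


popular = simulate(by_popularity=True)
uniform = simulate(by_popularity=False)

print("popularity-driven: max", max(popular), "median", median(popular))
print("uniform choice:    max", max(uniform), "median", median(uniform))
```

Even though no trail is better than any other in this setup, the popularity-driven run ends with a few trails far ahead of the typical one, while the uniform run stays tightly clustered.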
So What Makes a Trail Popular?
There are tens of thousands of hikes listed on AllTrails.com, but its search algorithm always offers viewers the most popular hikes first. Trails with the most reviews get the most hikers, and hence even more reviews. Lesser-known trails may be just as good, but they are harder to find on the website, and with so few ratings it is hard to know for sure whether they’ll be good.
So what makes a trail popular? Ultimately, AllTrails does.
It’s time we break out of that feedback loop and find some amazing alternative hikes where we can avoid the crowds. But how will you know if a trail is going to be worth your time? Well, I used machine learning to do that work for you.
I fit the best model on a subset of trails designated as “lightly trafficked”, and the R² for these trails was 0.08. This was actually encouraging: these are specifically trails that aren’t popular but, according to the model and given their features, should be.
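One plausible way to turn such a model into a shortlist is to predict popularity for each lightly trafficked trail and rank by how far the prediction exceeds the actual review count. The sketch below does this on six invented trails. The ranking-by-gap rule, the trail names, the two features, and fitting on all trails before predicting for the light ones are all assumptions for illustration, not the article's exact procedure.

```python
import numpy as np
import pandas as pd

# Made-up trails; names, columns, and numbers are purely illustrative.
trails = pd.DataFrame({
    "name": ["Trail A", "Trail B", "Trail C", "Trail D", "Trail E", "Trail F"],
    "sentiment": [0.9, 0.2, 0.8, 0.4, 0.85, 0.1],
    "scramble": [1, 0, 1, 0, 1, 0],
    "traffic": ["heavy", "heavy", "light", "light", "light", "moderate"],
    "num_reviews": [2400, 300, 40, 60, 900, 150],
})
features = ["sentiment", "scramble"]


def design(df):
    """Build the regression design matrix: intercept column plus features."""
    return np.column_stack([np.ones(len(df)), df[features].to_numpy(dtype=float)])


# Fit ordinary least squares on every trail.
y = trails["num_reviews"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(design(trails), y, rcond=None)

# Predict only for lightly trafficked trails and rank by predicted minus actual.
light = trails[trails["traffic"] == "light"].copy()
light["predicted_reviews"] = design(light) @ coef
light["gap"] = light["predicted_reviews"] - light["num_reviews"]
hidden_gems = light.sort_values("gap", ascending=False)

print(hidden_gems[["name", "num_reviews", "predicted_reviews", "gap"]])
```

A large positive gap means the trail has the features of a popular trail but few reviews so far, which is the profile of a hidden gem.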
A potential area of future work for this project is fitting a polynomial features model instead of a linear one. Early exploration of this method yielded a promising R² improvement to 0.26, but it also induced some feature collinearity by duplicating features, which would need to be engineered out. I’m looking forward to continuing this work once I have more machine learning tools at my disposal! But I’m absolutely thrilled to present you with this list of the best lesser-known trails in America as my very first end-to-end data science project.
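The trade-off mentioned here, a better fit at the cost of collinear features, can be sketched with scikit-learn's `PolynomialFeatures`. The data below is synthetic, with a target that is quadratic by construction, so the R² values say nothing about the trail dataset; they only show the mechanics under that assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 2))  # two hypothetical trail features
y = 3 * X[:, 0] ** 2 + 5 * X[:, 0] * X[:, 1] + rng.normal(0, 20, 300)

# Baseline: plain linear model on the raw features.
linear_r2 = LinearRegression().fit(X, y).score(X, y)

# Degree-2 expansion; output columns are x0, x1, x0^2, x0*x1, x1^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
poly_r2 = LinearRegression().fit(X_poly, y).score(X_poly, y)

# The expanded columns are strongly correlated with the originals,
# e.g. x0 and x0^2, which is one source of the collinearity.
corr_x0_x0sq = np.corrcoef(X_poly[:, 0], X_poly[:, 2])[0, 1]

print(f"linear R^2 = {linear_r2:.2f}, polynomial R^2 = {poly_r2:.2f}")
print(f"corr(x0, x0^2) = {corr_x0_x0sq:.2f}")
```

Centering features before expanding them, or applying a regularized fit such as Lasso to the expanded columns, are common ways to reduce this kind of induced correlation.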
Hike The Trails
Check out the Hidden Gems in your State below!
Translated from: https://towardsdatascience.com/hidden-gems-finding-the-best-secret-trails-in-america-d9203e8ad073