RoboBrowser: Automating Online Forms


    Background

    RoboBrowser is a Python 3.x package for crawling the web and submitting online forms. It works similarly to the older Python 2.x package mechanize. This post gives a simple introduction to using RoboBrowser to submit a form on Wunderground and scrape historical weather data.

    Initial setup

    RoboBrowser can be installed via pip:

    pip install robobrowser

    Let’s do the initial setup of the script by loading the RoboBrowser package. We’ll also load pandas, as we’ll be using that a little bit later.

    from robobrowser import RoboBrowser
    import pandas as pd

    Create RoboBrowser Object

    Next, we create a RoboBrowser object. This object functions similarly to an actual web browser. It allows you to navigate to different websites, fill in forms, and get HTML from webpages. All of a RoboBrowser object’s actions are completely invisible, so you won’t actually see a physical browser while any of this is happening.

    Using our object, browser, we navigate to Wunderground’s historical weather main site (see https://www.wunderground.com/history). This is done using the open method, as below.

    # Create RoboBrowser object
    # This will act similarly to a typical web browser
    browser = RoboBrowser(history=True)

    # Navigate browser to Wunderground's historical weather page
    browser.open('https://www.wunderground.com/history')

    Filling out the form

    Next, let’s get the list of forms on the webpage, https://www.wunderground.com/history.

    forms = browser.get_forms()

    Inspecting forms (for example, with print(forms)) shows that there are two forms on the webpage. The last of these is the one we need, as this is what we'll use to submit a zip code with a historical date to get weather information. You can tell this by looking at the input names in the printed forms, i.e. code, month, day, etc.

    form = forms[1]
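Selecting the form by position breaks if the page adds or reorders forms. As a minimal sketch, the form can instead be chosen by the field names it contains. This assumes each form object exposes a `keys()` method that lists its field names, which RoboBrowser's form objects appear to do. The helper name `find_form_with_fields` and the stand-in `FakeForm` class are hypothetical and exist only so the example runs offline.

```python
def find_form_with_fields(forms, required):
    """Return the first form whose field names include every name in `required`.

    Assumes each form has a keys() method that yields its field names.
    """
    required = set(required)
    for form in forms:
        if required.issubset(set(form.keys())):
            return form
    raise LookupError("no form contains fields: %s" % sorted(required))


# Stand-in objects (plain dict subclasses) so the sketch runs without a network.
class FakeForm(dict):
    pass


search_form = FakeForm(query="")
weather_form = FakeForm(code="", month="", day="", year="")

picked = find_form_with_fields(
    [search_form, weather_form],
    ["code", "month", "day", "year"],
)
print(picked is weather_form)
```

With a live page, the call would take `browser.get_forms()` as its first argument in place of the stand-in list.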

    Inputting data into our form is similar to adding values to a dictionary in Python. We do it like this:

    # Enter inputs into form
    form['code'] = '12345'
    form['month'] = 'September'
    form['day'] = '18'
    form['year'] = '2017'

    As you can see, the form requires a zip code (just called 'code' in the HTML), along with values for month, day, and year.

    Now, we can submit the web form in one line:

    # Submit form
    browser.submit_form(form)

    You can get the HTML of the page following submission from the parsed attribute of browser.

    With the pandas read_html function, we can get the HTML tables on the page as a list of DataFrames. In our example, we pull just the first table on the page (index 0).

    # Scrape HTML of page following form submission
    html = str(browser.parsed)

    # Use pandas to get primary table on page
    data = pd.read_html(html)[0]
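To see what read_html returns without hitting the network, here is a small sketch run against a hand-written HTML string. The table contents are made-up placeholder values, not real weather data, and the string merely stands in for `str(browser.parsed)`. The HTML is wrapped in `StringIO` because recent pandas versions warn when given a literal HTML string. Note that read_html needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for str(browser.parsed); values are invented.
html = """
<html><body>
<table>
  <tr><th>Metric</th><th>Actual</th></tr>
  <tr><td>Mean Temperature</td><td>60</td></tr>
  <tr><td>Max Temperature</td><td>70</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
data = tables[0]

print(len(tables))
print(list(data.columns))
print(data.shape)
```

Because the result is a list, indexing with `[0]` only makes sense when the page actually contains at least one table; checking `len(tables)` first gives a clearer failure than an IndexError.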

    Putting it all together…

    Let’s wrap what we’ve done into a function, like this:

    def get_historical_data(code, month, day, year):
        """Return the first table on Wunderground's history page for a zip code and date."""

        # Create RoboBrowser object
        # This will act similarly to a typical web browser
        browser = RoboBrowser(history=True)

        # Navigate browser to Wunderground's historical weather page
        browser.open('https://www.wunderground.com/history')

        # Find the form to input zip code and date
        forms = browser.get_forms()

        form = forms[1]

        # Enter inputs into form
        form['code'] = code
        form['month'] = month
        form['day'] = day
        form['year'] = year

        # Submit form
        browser.submit_form(form)

        # Scrape HTML of page following form submission
        html = str(browser.parsed)

        # Use pandas to get primary table on page
        data = pd.read_html(html)[0]

        return data

    We now have a function we can call by passing a zip code with a particular month, day, and year.

    get_historical_data('23456','September','18','2000')    
     
    get_historical_data('23456','March','18','2000')

    We can simplify the parameters we’re passing so that we just need to input a zip code with a date, rather than three separate arguments for month, day, and year, respectively. We’ll do this by converting a date input from the user to a pandas timestamp, which allows us to parse out these needed attributes about the date.

    def get_historical_data(code, date):
        """Return the first table on Wunderground's history page for a zip code and date."""

        # Create RoboBrowser object
        # This will act similarly to a typical web browser
        browser = RoboBrowser(history=True)

        # Navigate browser to Wunderground's historical weather page
        browser.open('https://www.wunderground.com/history')

        # Find the form to input zip code and date
        forms = browser.get_forms()

        form = forms[1]

        # Convert date to pandas timestamp
        date = pd.Timestamp(date)

        # Enter inputs into form, using the same formats as the earlier
        # version of this function (month name, day and year as strings)
        form['code'] = code
        form['month'] = date.month_name()
        form['day'] = str(date.day)
        form['year'] = str(date.year)

        # Submit form
        browser.submit_form(form)

        # Scrape HTML of page following form submission
        html = str(browser.parsed)

        # Use pandas to get primary table on page
        data = pd.read_html(html)[0]

        return data

    So calling our function now looks like this:

    get_historical_data("23456", "9/18/2000")
     
    get_historical_data("23456", "3/18/2000")
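The date handling can be tried on its own, without submitting anything. The sketch below assumes the form expects the month as a name and the day and year as strings, as in the first example in this post. `date_to_form_inputs` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd


def date_to_form_inputs(date):
    """Split a date-like value into month, day, and year strings.

    Hypothetical helper: the output format mirrors the string inputs
    used in the first form example ('September', '18', '2017').
    """
    ts = pd.Timestamp(date)
    return {
        "month": ts.month_name(),
        "day": str(ts.day),
        "year": str(ts.year),
    }


inputs = date_to_form_inputs("9/18/2000")
print(inputs)
```

Strings such as "9/18/2000" are read month-first, so "3/4/2000" is taken as March 4. Passing an ISO-style string like "2000-03-04" avoids that ambiguity.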

    Originally posted on TheAutomatic.net blog.

