You need to get RID of the pulldown that has been applied to the one that runs at 29.97 fps.
Film is shot at 24 frames per second. NTSC video (US standard definition - in other words, this is what's on DVDs) is 59.94 fields per second, or 29.97 frames per second.
In order to get 24fps film to match 29.97fps video, it's first slowed down ever so slightly to 23.976 frames per second. The difference is pretty much imperceptible and no one ever notices it, though some Blu-ray players are now capable of displaying true 24fps content. Anyway, that part is usually not an issue at all, because when 24fps material is digitized, it's almost always digitized at 23.976 from the get-go.
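If it helps to see the numbers, here is a small Python sketch of the arithmetic. It assumes the usual NTSC slowdown factor of 1000/1001, which isn't spelled out above; the variable names are just for illustration.

```python
from fractions import Fraction

# Assumption: NTSC rates are the "round" rates scaled by 1000/1001.
NTSC_FACTOR = Fraction(1000, 1001)

slowed_film = Fraction(24) * NTSC_FACTOR   # film slowed for NTSC transfer
ntsc_frames = Fraction(30) * NTSC_FACTOR   # NTSC frames per second
ntsc_fields = 2 * ntsc_frames              # two fields per interlaced frame

print(round(float(slowed_film), 3))   # 23.976
print(round(float(ntsc_frames), 2))   # 29.97
print(round(float(ntsc_fields), 2))   # 59.94

# Video frames per film frame: exactly 5/4, i.e. 5 video frames per 4 film frames.
print(ntsc_frames / slowed_film)      # 5/4
```

The exact 5:4 ratio is why the slowdown is done first: once the film runs at 23.976, every four film frames have to fill exactly five video frames.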
To get from 23.976 to 29.97, you use a process called "3:2 pulldown" (or 2:3 pulldown; my film school profs have always used the two terms interchangeably). Basically, this takes each full frame, splits it into individual fields, and distributes those fields across the frames of video.
In order to edit with the 29.97 material, you need to undo this process so you have 23.976 full frames per second to work with.
--COMPLICATED TECHNICAL STUFF BELOW; SORRY IF THIS IS TOO COMPLICATED, IT'S HARD TO EXPLAIN IN TEXT FORM---
This is an issue in editing because of what happens when 2:3 pulldown is applied:
FILM FRAMES: A B C D
Each of these four frames will be split into two fields - that means that every other line of information is on a different field (so line 1 is on field 1, line 2 is on field 2, line 3 is on field 1, line 4 is on field 2, and so on - this is interlaced, whereas a frame that keeps all of its lines together is progressive). So now you have eight fields from four frames - A1, A2, B1, B2, C1, C2, D1, D2.
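Here's a rough Python sketch of that split, treating a frame as nothing more than a list of scan lines. The function names split_fields and weave are made up for this example and don't come from any editing package.

```python
def split_fields(frame_lines):
    """Split a progressive frame (a list of scan lines) into two fields."""
    field1 = frame_lines[0::2]  # lines 1, 3, 5, ... (counting from 1)
    field2 = frame_lines[1::2]  # lines 2, 4, 6, ...
    return field1, field2


def weave(field1, field2):
    """Interleave two fields back into one full frame."""
    frame = []
    for odd_line, even_line in zip(field1, field2):
        frame.append(odd_line)
        frame.append(even_line)
    return frame


frame_a = ["line 1", "line 2", "line 3", "line 4"]
a1, a2 = split_fields(frame_a)
print(a1)  # ['line 1', 'line 3']
print(a2)  # ['line 2', 'line 4']
print(weave(a1, a2) == frame_a)  # True
```

Weaving two fields from the same film frame gives back the original progressive frame; weaving fields from two different film frames is what produces the mixed video frames.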
To get this into a format that can be played at interlaced 29.97fps, the fields are paired back up into video frames, with some fields used twice. Nothing is actually blended; each video frame is simply one field 1 plus one field 2. This is where the process (2:3) gets its name, as this is what you end up with:
VIDEO FRAMES: A1/A2 B1/B2 B1/C2 C1/D2 D1/D2
So that turns 4 frames into 5 frames. Video frame 1 is two fields of film frame A; video frame 2 is two fields of film frame B; video frame 3 is one field of film frame B and one field of film frame C; VF 4 is one field of FF C and one of FF D; and VF 5 is two fields of FF D.
(Note that a video frame always pairs a field 1 with a field 2, which is why video frame 3 is B1/C2 rather than B1/C1, and video frame 5 is D1/D2.)
As you can see, counting how many fields each film frame contributes gives a cadence of 2 fields (A), then 3 fields (B), then 2 fields (C), then 3 fields (D) ... 2:3:2:3.
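The cadence can be written as a short Python sketch. It assumes the common field assignment in which field 1 and field 2 strictly alternate, so every video frame gets one of each. The function name pulldown_2_3 and the letter labels are only for illustration.

```python
def pulldown_2_3(film_frames):
    """Turn film frame labels into (field 1, field 2) video frame pairs."""
    if len(film_frames) % 4 != 0:
        raise ValueError("this sketch expects film frames in groups of four")

    fields = []
    parity = 1  # output fields strictly alternate: 1, 2, 1, 2, ...
    for index, frame in enumerate(film_frames):
        count = 2 if index % 2 == 0 else 3  # the 2:3:2:3 cadence
        for _ in range(count):
            fields.append(f"{frame}{parity}")
            parity = 2 if parity == 1 else 1

    # Every two consecutive fields make up one interlaced video frame.
    return [(fields[i], fields[i + 1]) for i in range(0, len(fields), 2)]


video = pulldown_2_3(["A", "B", "C", "D"])
print(" ".join(f"{f1}/{f2}" for f1, f2 in video))
# A1/A2 B1/B2 B1/C2 C1/D2 D1/D2
```

Four film frames go in, ten fields come out, and those ten fields make five video frames.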
The reason this has to be reversed is film frame C. It's split across two video frames, and needs to be recovered. If you cut after video frames 3 or 4, what's the end of your shot? Frame B? Frame C? Frame D? These frames need to be recovered from the 2:3 pulldown cadence so you can work with them and not worry about splitting fields when you edit.
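To show what "recovering" the frames means, here is a toy Python sketch that undoes the cadence. It cheats by reading the field labels directly; real footage has no labels, so actual tools have to detect the cadence by comparing the fields themselves. The name reverse_pulldown and the label format are assumptions for this example.

```python
def reverse_pulldown(video_frames):
    """Recover film frame labels from (field, field) pairs such as ("B1", "C2")."""
    fields_seen = {}  # film frame label -> set of field numbers found
    order = []        # film frames in order of first appearance
    for pair in video_frames:
        for field in pair:
            frame, number = field[:-1], field[-1]
            if frame not in fields_seen:
                fields_seen[frame] = set()
                order.append(frame)
            fields_seen[frame].add(number)
    # A film frame is recoverable once both of its fields have turned up.
    return [frame for frame in order if fields_seen[frame] == {"1", "2"}]


telecined = [("A1", "A2"), ("B1", "B2"), ("B1", "C2"), ("C1", "D2"), ("D1", "D2")]
print(reverse_pulldown(telecined))  # ['A', 'B', 'C', 'D']

# Video frames that mix two different film frames:
mixed = [n for n, (f1, f2) in enumerate(telecined, start=1) if f1[:-1] != f2[:-1]]
print(mixed)  # [3, 4]
```

Frame C only ever appears inside the two mixed video frames, so it has to be rebuilt from C1 in video frame 4 and C2 in video frame 3.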
There are many different ways to do this, but since I've never used Vegas, I don't know how to help you out in this situation. I'm sure someone else will. I just thought you'd like to know why the framerates are different, and which one is the one you should go with.